2025-12-16
18/395,257
2023-12-22
US 12,499,139 B1
2025-12-16
Dangelino N Gortayo
Cooley LLP
2043-12-22
Smart Summary: A method is designed to organize information from a large database that changes over time. It starts by taking a main set of data and a smaller set that is part of it. The smaller set is broken down based on specific time or logical categories. By counting how often individual words and their combinations appear in both datasets, the method creates a new set that highlights important content. Finally, this new set helps group documents with similar themes, and any new information can be added to keep these groups updated.
A computer implemented method includes receiving a baseline dataset divisible by temporal or logical criteria. A target dataset representing a small fraction of the baseline dataset is received. The target dataset is segmented by the temporal or logical criteria. Numbers of documents containing individual words within the baseline dataset are identified to form baseline singles. Numbers of documents containing common combinations of individual words within the target dataset are identified to form target tuples. Combinations of the target tuples are coalesced based upon baseline singles criteria to form a coalesced dataset representing significant content in the target dataset. The coalesced dataset is used to cluster documents from the target dataset into threads of related content. Updated content is received. The updated content is integrated with the coalesced dataset to form updated threads of related content.
G06F16/3326 » CPC main
Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query formulation; Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
G06F16/3334 » CPC further
Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query processing; Query translation Selection or weighting of terms from queries, including natural language queries
G06F16/383 » CPC further
Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
G06F16/332 IPC
Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying Query formulation
G06F16/3332 IPC
Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query processing Query translation
This application claims priority to U.S. Provisional Patent Application No. 63/477,157, filed Dec. 23, 2022, the contents of which are incorporated herein by reference. This application is related to commonly owned U.S. patent application Ser. No. 18/395,243, filed Dec. 22, 2023.
This invention relates generally to finding significant content in large corpuses of data. More particularly, this invention is directed to finding significant content in large corpuses of data by analyzing individual terms in the data to develop meaningful thematic conceptual clusters of the data sources.
Relational databases have a well-known relational structure that is conducive to Structured Query Language (SQL) queries. In contrast, an unstructured database, also known as a document-oriented database or a No-SQL database, is a comparatively unruly collection of materials that defy efficient query processing.
Accordingly, there is a need for improved processing within unstructured databases.
A computer implemented method includes receiving a baseline dataset divisible by temporal or logical criteria. A target dataset representing a small fraction of the baseline dataset is received. The target dataset is segmented by the temporal or logical criteria. Numbers of documents containing individual words within the baseline dataset are identified to form baseline singles. Numbers of documents containing common combinations of individual words within the target dataset are identified to form target tuples. Combinations of the target tuples are coalesced based upon baseline singles criteria to form a coalesced dataset representing significant content in the target dataset. The coalesced dataset is used to cluster documents from the target dataset into threads of related content. Updated content is received. The updated content is integrated with the coalesced dataset to form updated threads of related content.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a system configured in accordance with an embodiment of the invention.
FIG. 2 illustrates processing operations performed in accordance with an embodiment of the invention.
FIG. 3 illustrates thread processing performed in accordance with an embodiment of the invention.
FIG. 4 illustrates story processing performed in accordance with an embodiment of the invention.
FIG. 5 illustrates a hierarchical relationship between coalesced tuples, threads, snapshots, and stories associated with the invention.
FIG. 6 illustrates snapshot merging performed in accordance with an embodiment of the invention.
FIG. 7 conceptually illustrates the disclosed invention's organization of numerous disaggregated documents into aggregated stories of coalesced tuples.
FIG. 8 illustrates the disclosed system integrated with Large Language Model (LLM) machines in accordance with an embodiment of the invention.
FIG. 9 illustrates story ID updates in accordance with an embodiment of the invention.
FIG. 10 illustrates aggregate coalesced tuple terms summed to produce aggregate terms.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
FIG. 1 illustrates a system 100 configured in accordance with an embodiment of the invention. The system 100 includes client machines 102_1 through 102_N connected to server machines 104_1 through 104_N via a network 106, which may be any combination of wired and wireless networks. Content source machines 150_1 through 150_N are also connected to network 106.
The server machines 104_1 through 104_N implement operations disclosed herein. By way of overview, the server machines 104_1 through 104_N collect content from content source machines 150_1 through 150_N to form an unstructured database 142 of content. A clustering module 144 subsequently organizes the content, as discussed in detail below. Thereafter, the client machines 102_1 through 102_N can query the unstructured database 142 and perform other operations on the organized content. The memory 140 also stores a thread processor 146 that is used to construct a thread database 148 and snapshot database 150, as detailed below. The memory 140 also has a story processor 152 used to construct a story database 154, as detailed below.
Each client machine 102_1 through 102_N includes a processor 110 and input/output devices 112 connected via a bus 114. The input/output devices 112 may include a keyboard, mouse, touch display and the like. A network interface circuit 116 is also connected to the bus 114 to provide connectivity to network 106. A memory 120 is also connected to the bus 114. The memory 120 stores instructions executed by processor 110. In particular, the memory stores a client module 122 that includes instructions to use network 106 to communicate with servers 104_1 through 104_N.
Each server (e.g., server 104_1) includes a processor 130, input/output devices 132, a bus 134 and a network interface circuit 136 to provide connectivity to network 106. A memory 140 is also connected to the bus 134. The memory stores instructions executed by processor 130 to implement operations disclosed herein. The memory stores an unstructured database 142 and a clustering module 144 to organize information in the unstructured database 142. The organized information is processed by a thread processor 146, which creates a thread database 148 and a snapshot database 150 based upon the processing, as detailed below.
Each content server (e.g., 150_1) includes a processor 151, input/output devices 152, a bus 154 and a network interface circuit 156 to provide connectivity to network 106. A memory 160 is also connected to the bus 154. The memory 160 stores a content source module 162 with instructions executed by processor 151 to deliver content to one or more servers 104_1 through 104_N.
FIG. 2 illustrates processing operations 200 performed by clustering module 144. A baseline dataset is received 201. More particularly, a baseline dataset divisible by a temporal criterion (e.g., a 24-hour period) or a logical criterion (e.g., chapters in a textbook) is received.
A target dataset is then received 202. The target dataset represents a small fraction of the baseline dataset. The target dataset is segmented by the same temporal criterion or logical criterion as the baseline dataset. Subsequent operations analyze the target dataset against the baseline dataset to identify significant content in the target dataset.
Baseline singles are then formed 206. That is, numbers of documents containing individual terms within the baseline dataset are identified to form baseline singles (e.g., single terms and an associated number of documents).
Target tuples are then formed 206. That is, numbers of documents containing common combinations of individual terms within the target dataset are identified to form target tuples (e.g., collections of two or more terms, where each collection is a tuple).
A coalesced dataset is then formed 208. Combinations of target tuples are coalesced based upon baseline singles criteria to form a coalesced dataset, as demonstrated below.
In one embodiment, a supplementary dataset of baseline values is then formed. The supplementary dataset baseline values include terms commonly deemed significant.
In this embodiment, a second coalesced dataset is formed. That is, the original coalesced data set and the supplementary dataset baseline values are used to form a second coalesced dataset representing significant content in the target dataset.
The foregoing operations are more fully appreciated in connection with the following detailed example. The example is for illustrative purposes. Many variations on the example are still within the scope of the invention. For example, the following example focuses on a baseline dataset of news articles divided into temporal periods of twenty-four hours. The example also focuses on an embodiment that uses tuples with four values (“quads”), but other tuple configurations may be used in accordance with embodiments of the invention.
The source dataset is defined as the general input data on which the clustering module 144 runs. The clustering module 144 is also referred to herein as CT-X. For example, the source dataset may be news articles spanning the past 365 days. A key attribute of the source dataset is that it contains properties that allow the dataset to be divided into two sub-datasets (target and baseline). For example, a source dataset that will be divided by its temporal property requires that a publishing date be present for every document in the source dataset.
Lookup tables from external data sources can also be brought in to provide evidence of terms of importance (“important terms”) within the source dataset. This step is optional but extremely beneficial for the quality of results. Depending on the dataset being studied, the type of lookup table may vary. Examples of lookup tables relevant to a source dataset of news articles might include:
For best results, source data should be cleaned and pre-processed by removing punctuation, performing stemming/lemmatization, identifying important terms via lookup tables, and connecting terms that belong together (e.g., first name, last name) with an underscore to make them compound terms. This step is optional but extremely beneficial for the quality of results. By standardizing the format of each term, indexing and term matching performance throughout the system is dramatically increased.
Consider the following news article:
The baseline dataset is defined as the source dataset without the target dataset. For example, the baseline dataset may be days 1-364 of a 365-day source dataset. For best results, the target and baseline datasets should not overlap.
The target dataset is the range in which to look for thematically related/trending content. For example, day 365, in contrast to the baseline of days 1-364.
A bag of words is used to hold a list of terms or tuples along with their calculated frequency. Consider the following examples.
| Term | Frequency |
| apple | 876 |
| peach | 431 |
| pear | 339 |

| Tuple | Frequency |
| [apple, pear] | 221 |
| [peach, pear] | 109 |
The following bags are used within an embodiment of the clustering module 144 (CT-X):
Term frequency is calculated in rounds, with each round taking the terms with frequencies above a certain threshold to limit the computation required. This is referred to herein as skim mining. Skim mining collects significant terms in the target and omits insignificant terms in the target to effectively compress target data into its most significant content. Skim mining is also performed on large datasets to limit the number of tuples to a reasonable level (e.g., no more than 50-100k). When limiting by the number of tuples, the dataset is first ordered by most frequent tuples to least frequent, then the top 50-100k tuples are used.
It is also important that tuples with the same frequency are used as the floor. For example:
If the floor is calculated using the top 5 records and the tuples at positions 5, 6 and 7 share the same frequency, then tuples 1-7 must all be used. For example, if the score at position 5 is 43, the floor is >=43, and every tuple scoring 43 or more (tuples 1-7) is kept.
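By way of illustration only, the tie-inclusive floor described above can be sketched in Python as follows. The function name `skim` and the example frequencies are hypothetical and do not appear in the original disclosure; the sketch assumes a bag is held as a mapping from tuple to frequency.

```python
def skim(bag, top_n):
    """Keep the top_n most frequent tuples, plus any tuples tied with the floor."""
    # Order from most frequent to least frequent.
    ranked = sorted(bag.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= top_n:
        return ranked
    # The floor is the frequency found at position top_n (1-based).
    floor = ranked[top_n - 1][1]
    # Using >= keeps every tuple that shares the floor frequency.
    return [(tup, freq) for tup, freq in ranked if freq >= floor]


# Hypothetical bag: positions 5, 6 and 7 all have a frequency of 43.
bag = {
    ("a", "b"): 90, ("a", "c"): 75, ("b", "c"): 60, ("a", "d"): 51,
    ("b", "d"): 43, ("c", "d"): 43, ("a", "e"): 43, ("b", "e"): 12,
}
kept = skim(bag, top_n=5)
print(len(kept))  # 7
```

Requesting the top 5 records therefore returns 7 records, because the floor of 43 is shared by positions 5 through 7.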
Skim mining is performed for 2 reasons:
The minimum frequency should optimally be the quad threshold value (2×LVK threshold). Consider the following abstract dataset and thresholds:
Target bag of singles:
“e” occurs in less than 7 documents; therefore, it can be removed from the next round.
Target bag of doubles:
All doubles are above the threshold of 7 and will be evaluated for the next round.
Target bag of triples:
Target bag of quads:
Some key takeaways:
The values shown above are calculated for each term detected in the documents (corpus), except for stop terms and other less useful terms (e.g., “the”, “a”).
CT-X uses the following values:
The baseline frequency calculation forms what is expected for a certain term; the number of documents the term would be expected to appear in. The target frequency calculation is the number of documents the term appears in within the target dataset. Consider the following examples.
| Single/Term | Baseline Score | Target Score | Multiple | Conclusion |
| australia | 5.7 | 12 | 2.11 | Trending |
| bolivia | 1.8 | 10 | 5.56 | Trending |
| france | 7.6 | 22 | 2.89 | Trending |
| germany | 4.2 | 3 | 0.71 | Not trending |
| global_warming | 29.8 | 20 | 0.67 | Not trending |
| iceland | 2.1 | 18 | 8.57 | Trending |
| usa | 8.5 | 30 | 3.53 | Trending |
| world_cup | 1.3 | 15 | 11.54 | Trending |
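By way of illustration only, the Multiple column can be reproduced in Python by dividing the target score by the baseline score. The function names are hypothetical, and the cutoff of 1.0 used to label a term as trending is an assumption made for this sketch; the disclosure does not state the cutoff.

```python
# Baseline and target scores for three of the terms in the table above.
baseline = {"australia": 5.7, "germany": 4.2, "world_cup": 1.3}
target = {"australia": 12, "germany": 3, "world_cup": 15}


def multiple(term):
    """Ratio of observed (target) to expected (baseline) document frequency."""
    return target[term] / baseline[term]


def is_trending(term, cutoff=1.0):
    # ASSUMPTION: a term is treated as trending when its multiple exceeds cutoff.
    return multiple(term) > cutoff


for term in baseline:
    label = "Trending" if is_trending(term) else "Not trending"
    print(term, round(multiple(term), 2), label)
```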
Single terms are not particularly reliable for querying content, as they lack context around the reason for that term to be trending. CT-X expands the document frequency calculation to tuples. It looks for a combination of terms to be used within multiple documents. Consider the following example.
| Double/Tuple | Target Score | Conclusion |
| australia, world_cup | 11 | 11 docs contained australia and world_cup |
| australia, bolivia | 10 | 10 docs contained australia and bolivia |
| australia, france | 12 | 12 docs contained australia and france |
| australia, usa | 9 | 9 docs contained australia and usa |
| iceland, usa | 20 | 20 docs contained iceland and usa |
| global_warming, usa | 14 | 14 docs contained global_warming and usa |
Some assumptions can already be made. A tuple containing australia cannot appear in more than 12 documents, as that was the target score for australia. Australia and world_cup appeared together in 11 of the 12 documents containing australia, and in 11 of the 15 documents containing world_cup. We can assume that a tuple's frequency never exceeds the target frequency of any of its individual terms. In one embodiment of the invention, quads (tuples of size 4) are used.
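By way of illustration only, document frequencies for tuples of any size can be counted in Python as sketched below. The function name and the three example documents are hypothetical; each document is assumed to have already been reduced to a list of cleaned terms.

```python
from collections import Counter
from itertools import combinations


def tuple_document_frequency(documents, size):
    """Count, for each combination of `size` terms, the documents containing all of them."""
    counts = Counter()
    for terms in documents:
        # set() ensures a document is counted once per tuple; sorted() gives a stable key.
        for combo in combinations(sorted(set(terms)), size):
            counts[combo] += 1
    return counts


# Hypothetical target documents, already cleaned into terms.
docs = [
    ["australia", "world_cup", "france"],
    ["australia", "world_cup"],
    ["australia", "usa"],
]
singles = tuple_document_frequency(docs, 1)
doubles = tuple_document_frequency(docs, 2)
print(doubles[("australia", "world_cup")])  # 2

# A tuple can never appear in more documents than its rarest term.
bounded = all(
    freq <= min(singles[(term,)] for term in combo)
    for combo, freq in doubles.items()
)
```

Calling the same function with `size=4` yields the quads used in the embodiment described next.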
| Quad/Tuple | Target Score | Conclusion |
| australia, bolivia, france, world_cup | 10 | 10 docs contained all terms |
| australia, france, usa, world_cup | 12 | 12 docs contained all terms |
| bolivia, france, usa, world_cup | 10 | 10 docs contained all terms |
| iceland, france, usa, global_warming | 12 | 12 docs contained all terms |
| germany, france, global_warming, usa | 1 | 1 doc contained all terms |
Quads contain far more information than a single term, and we now have context surrounding a single term. We now also have potentially hundreds of tuples that contain a single term, so a way to associate the tuples should be used. The association between quads is defined by a common term where the following conditions are met:
To identify baseline terms that can be used for association (merging), the LVK threshold is used:
LVK threshold=Average frequency of baseline term+(0.75*standard deviation of baseline terms)
Any tuples that share a term that is also an LVK (baseline frequency below LVK threshold) can be merged to form a “coalesced tuple”. Assume that the LVK threshold is 3.5; the table below identifies the LVK terms that can be used for merging. The quad threshold should be at least 2× the LVK threshold (in this case, >7.0).
| Quad/Tuple | LVKs | Tuple Frequency | Notes |
| australia, bolivia, france, world_cup | bolivia, world_cup | 10 | Tuple freq good, contains merge terms |
| australia, france, usa, world_cup | world_cup | 12 | Tuple freq good, contains merge term |
| bolivia, france, usa, world_cup | bolivia, world_cup | 10 | Tuple freq good, contains merge terms |
| iceland, france, usa, global_warming | iceland | 12 | Tuple freq good, contains merge term |
| germany, france, global_warming, usa | global_warming | 1 | Tuple freq too low, ignored |
The resulting coalesced tuples are:
| Coalesced Tuple Terms | Merged Quads |
| australia, bolivia, france, usa, world_cup | [australia, bolivia, france, world_cup], [australia, france, usa, world_cup], [bolivia, france, usa, world_cup] |
| france, global_warming, iceland, usa | [iceland, france, usa, global_warming] |
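By way of illustration only, the merging of quads that share an LVK term can be sketched in Python as follows. The function name `coalesce` is hypothetical. The sketch assumes an LVK threshold of 3.5 and a quad threshold of 2× that value, as in the example above, and further assumes that quads containing no LVK term are simply left out of the result; the disclosure may treat such quads differently.

```python
baseline = {
    "australia": 5.7, "bolivia": 1.8, "france": 7.6, "germany": 4.2,
    "global_warming": 29.8, "iceland": 2.1, "usa": 8.5, "world_cup": 1.3,
}
quads = {
    ("australia", "bolivia", "france", "world_cup"): 10,
    ("australia", "france", "usa", "world_cup"): 12,
    ("bolivia", "france", "usa", "world_cup"): 10,
    ("iceland", "france", "usa", "global_warming"): 12,
    ("germany", "france", "global_warming", "usa"): 1,
}
LVK_THRESHOLD = 3.5
QUAD_THRESHOLD = 2 * LVK_THRESHOLD  # 7.0


def coalesce(quads, baseline, lvk_threshold, quad_threshold):
    """Merge quads that share at least one LVK term into coalesced tuples."""
    lvks = {term for term, freq in baseline.items() if freq < lvk_threshold}
    groups = []  # each entry: (set of LVK merge terms, list of quads)
    for quad, freq in quads.items():
        merge_terms = lvks & set(quad)
        # ASSUMPTION: low-frequency quads and quads without an LVK term are skipped.
        if freq <= quad_threshold or not merge_terms:
            continue
        linked = [g for g in groups if g[0] & merge_terms]
        groups = [g for g in groups if not (g[0] & merge_terms)]
        new_terms, new_quads = set(merge_terms), []
        for g_terms, g_quads in linked:
            new_terms |= g_terms
            new_quads += g_quads
        new_quads.append(quad)
        groups.append((new_terms, new_quads))
    # Report each group as (sorted aggregate terms, original quads).
    return [(sorted({t for q in qs for t in q}), qs) for _, qs in groups]


coalesced = coalesce(quads, baseline, LVK_THRESHOLD, QUAD_THRESHOLD)
for terms, members in coalesced:
    print(terms, len(members))
```

With these inputs the sketch yields two coalesced tuples, one built from the three world_cup quads and one built from the single iceland quad, while the low-frequency germany quad is ignored.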
Even though the coalesced tuples contain some of the same terms, they describe unique stories and provide context beyond what a single term can provide. When the originating dataset is queried, tuples within each coalesced tuple should be used to ensure the accuracy of matching documents.
CT-X output is an array of coalesced tuples. Scoring and original tuples are kept intact to allow detailed document queries of the source datasets. An example excerpt of CT-X output is below.
The cluster processing 200 of FIG. 2 is the first operation in a closed loop process 300 of FIG. 3. FIG. 3 illustrates operations performed by thread processor 146. The thread processor 146 is used to construct a thread database 148 and a snapshot database 150.
Each thread is a container for combining new content with cluster processing output. The new content is subject to cluster processing 200 performed by clustering module 144. Thread creation is then initiated 302. That is, new coalesced data is integrated with existing coalesced data. Consequently, threads keep track of content from multiple output rounds of cluster processing 200.
In the example of reporting of events in the news cycle, threads need to be dynamic and malleable. Events can split and multiple events can merge into a single event depending on the dynamics of the evolving event and how it is being reported.
Threads attempt to provide the flexibility to achieve this, along with the ability to self-correct as more data is collected and a more accurate statistical analysis can be performed.
The initial goal of threads is to create a forward-only path for each event. The events represented by coalesced tuples are either merged/evolved with existing threads to keep them current, or if no existing thread can be found, a new one is created.
When no active (non-archived) thread exists, a new thread is created from each input coalesced tuple and the following data is stored:
When existing active (non-archived) threads exist, a comparison to all existing threads is performed for each new input tuple. To evolve an existing active thread, criteria such as the following may be used:
If none of the criteria to evolve an existing active thread are met, it is assumed that there is no existing related thread, and a new active thread is created. Consider the following example:
Therefore, the existing thread is evolved and the coalesced tuple linked to the thread is replaced.
If a thread has not evolved within a certain time (e.g., 72 hours), the thread is archived. Once a thread is archived, it can no longer be evolved and is excluded from all thread processing.
The state of the thread is retained after archiving.
As the coalesced tuples were generated from the source dataset, the tuples contained within the coalesced tuples (threads) can be used to query the source dataset for documents. Content population and activation is performed every time the thread loop is completed.
Tuples/quads are updated after every thread evolution for each thread. These tuples are used to match content in the source dataset. A simple match is where all terms in a quad are found within a document. Consider the examples below:
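By way of illustration only, the simple match rule can be sketched in Python as follows. The function name and the sample documents are hypothetical; a document is treated as a match when every term of at least one quad in the thread is found within it.

```python
def matches_thread(document_terms, thread_quads):
    """Return True when all terms of at least one quad occur in the document."""
    terms = set(document_terms)
    return any(set(quad) <= terms for quad in thread_quads)


thread_quads = [
    ("australia", "bolivia", "france", "world_cup"),
    ("australia", "france", "usa", "world_cup"),
]

# Hypothetical documents, already cleaned into terms.
doc_a = ["usa", "france", "australia", "world_cup", "final"]
doc_b = ["australia", "france", "tourism"]

print(matches_thread(doc_a, thread_quads))  # True
print(matches_thread(doc_b, thread_quads))  # False
```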
Only 1 complete tuple is generally required for a match. Singles, doubles, and triples are by-products of the quad generation process. Content that is not a direct match with quads within the thread can be discovered via singles, doubles and triples that are:
From these aggregate tuple terms, the following trending tuples were found:
The baseline vs target values for these terms are:
| Term | Baseline Frequency | Target Frequency | Trending Score |
| curb_your_enthusiasm | 0.6 | 30 | 112.91 |
| Larry_David | 0.7 | 24 | 83.56 |
| hbo | 25.7 | 26 | 0 |
The “safe” component to these trending singles, doubles and triples is important. The safe definitions are:
VLB=Very Low Baseline and is defined as terms with a baseline below:
VLB threshold=Average frequency of baseline term+(0.1*standard deviation of baseline terms)
LB=Low Baseline and is defined as terms with a baseline below:
LB threshold=Average frequency of baseline term+(0.5*standard deviation of baseline terms)
HB=High Baseline and is defined as terms with a baseline above:
HB threshold=Average frequency of baseline term+(10*standard deviation of baseline terms)
For reference, the LVK threshold is defined for terms with a baseline below:
LVK threshold=Average frequency of baseline term+(0.75*standard deviation of baseline terms)
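By way of illustration only, the four baseline thresholds can be computed in Python as sketched below. The function name, the dictionary of multipliers, and the sample frequencies are hypothetical. The sketch assumes the population standard deviation is intended; the disclosure does not say whether the population or sample form is used.

```python
from statistics import mean, pstdev

# Standard deviation multipliers taken from the threshold definitions above.
MULTIPLIERS = {"VLB": 0.1, "LB": 0.5, "LVK": 0.75, "HB": 10}


def baseline_thresholds(baseline_freqs):
    """Return each threshold as average + (multiplier * standard deviation)."""
    avg = mean(baseline_freqs)
    # ASSUMPTION: population standard deviation; the sample form would use stdev().
    sd = pstdev(baseline_freqs)
    return {name: avg + k * sd for name, k in MULTIPLIERS.items()}


# Hypothetical baseline frequencies for five terms.
thresholds = baseline_thresholds([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in thresholds.items():
    print(name, round(value, 3))
```

A term is then classed as VLB, LB or LVK when its baseline frequency falls below the corresponding threshold, and as HB when it lies above the HB threshold.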
The trending score for a single is calculated using the baseline frequency and target frequency for an individual term [n].
TrendingScoreN=(TargetFreqN−BaselineFreqN)*ln(1−(TargetFreqN/BaselineFreqN))
For doubles and triples, the trending score of the tuple is the RMS (root-mean-square) of all the single's trending scores that it contains.
For example:
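By way of illustration only, the root-mean-square rule for a double can be sketched in Python using the single trending scores from the table above. The function name is hypothetical, and the single scores are taken as given rather than recomputed.

```python
from math import sqrt


def tuple_trending_score(single_scores):
    """Root-mean-square of the trending scores of the singles in a tuple."""
    return sqrt(sum(s * s for s in single_scores) / len(single_scores))


# Single trending scores as listed in the table above.
single_scores = {"curb_your_enthusiasm": 112.91, "Larry_David": 83.56, "hbo": 0}

double = ("curb_your_enthusiasm", "Larry_David")
double_score = tuple_trending_score([single_scores[t] for t in double])
print(round(double_score, 2))
```

Because the scores are squared before averaging, a tuple containing one strongly trending single keeps a high score even when paired with a weaker single.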
The next operation of FIG. 3 is thread scoring 304. Thread scores are calculated from the active content matching the thread. Two types of scores are calculated to evaluate the size of the pattern and its relevance in the near term.
The content volume is simply the number of documents that match the coalesced tuple. For example, 100 documents=100 points. The content velocity is also a sum over all matching documents, but each document is weighted according to its age. A document of age 0 is given 100% of its score. A document aged 24 hours (or the maximum score age) is given 0% of its score.
Various content velocity curve filters may be used. A parabolic filter ages quickly, then slowly (25% age=50% score, 75% age=25% score). A linear filter ages linearly (50% age=50% score). A tan filter ages quickly, then slowly, then quickly again. A constant filter does not age (always 100% score).
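By way of illustration only, content volume and content velocity can be sketched in Python as follows. All names are hypothetical. Only the linear and constant filters are shown, because the exact parabolic and tan curves are not specified; the sketch also assumes that documents older than the maximum score age contribute nothing.

```python
def linear_filter(age_fraction):
    """Score falls linearly from 100% at age 0 to 0% at the maximum age."""
    return max(0.0, 1.0 - age_fraction)


def constant_filter(age_fraction):
    """Score does not age."""
    return 1.0


def content_volume(doc_ages_hours):
    """One point per matching document."""
    return len(doc_ages_hours)


def content_velocity(doc_ages_hours, max_age_hours=24.0, age_filter=linear_filter):
    """Sum of per-document scores after weighting each document by its age."""
    total = 0.0
    for age in doc_ages_hours:
        # ASSUMPTION: content beyond the age limit is excluded from the sum.
        if age > max_age_hours:
            continue
        total += age_filter(age / max_age_hours)
    return total


ages = [0, 6, 12, 24]  # hypothetical document ages in hours
print(content_volume(ages))                                # 4
print(content_velocity(ages))                              # 2.25
print(content_velocity(ages, age_filter=constant_filter))  # 4.0
```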
The thread score at any given moment is the sum of the thread score curve up to the age limit for content. For example, a thread score could be defined as:
When threads are output, they are sorted in order of thread score (highest to lowest). Consider the following example.
| Thread Id | Title | Aggregate Terms | Score (0-10) | Documents |
| 2382437 | ukraine | achieved, achieves, americans, annual, asked, . . . | 6.900016616 | 1189 |
| 2346502 | gaza_city | ally, ambush, attacks, battles, calls, ceasefire, . . . | 6.195291328 | 1907 |
| 2365087 | Andre_Braugher | actor, age, Andre_Braugher, brief, brooklyn_ninenine, . . . | 6.050925059 | 386 |
| 2383500 | Donald_Trump | according, adults, americans, apnorc, bidentrump, . . . | 5.959255938 | 810 |
| 2382436 | defense bill | $886, act, annual, authorization, authorizes, big, . . . | 5.853076967 | 103 |
| 2385985 | curb_your_enthusiasm | curb_your_enthusiasm, end, ending, final, hbo, . . . | 5.741705452 | 35 |
| 2385522 | six years | ago, alex, alive, batty, boy, british, france, missing, . . . | 5.472362973 | 36 |
| 2361598 | Rudy_Giuliani | attorney, begin, case, damages, decide, deciding, . . . | 5.237168296 | 195 |
| 2384832 | new_jersey | bull, commuters, delays, loose, morning, new_jersey, . . . | 5.134232245 | 28 |
| 2383605 | Taylor_Swift | 34th, birthday, Blake_Lively, new_york_city, night, . . . | 5.094833372 | 61 |
| 2384018 | guyana | guyana, leaders, longstanding, meet, region, . . . | 5.071720406 | 36 |
The next processing operation of FIG. 3 is thread merging 306. As the output can vary considerably from cycle to cycle, some additional threads may be created as the ebbs and flows of the news cycle (or equivalent data source) play out. Each thread is referenced by a unique numeric identifier. Threads are generally independent of each other; however, it is not uncommon for them to be associated in a parent/child relationship. This is done when 2 or more threads are found to be sharing a significant amount of content. A child thread is defined by it having a parent thread associated with it. A parent thread is simply a thread which is not a child thread.
A test is performed on all active threads using the content that has been associated with them. If greater than 50% of a thread's content is contained within another thread, it will be merged with it (and will become a child thread of the parent it merged with).
This test is completed after threads have been populated with content.
Every thread cycle, existing parent/child relationships from previous cycles are ignored, and all are re-evaluated. It is common for the same parent/child relationships to be reestablished each cycle. Some variance between cycles does occur, however.
Threads that have been marked to be merged are arranged in order of size (from most content to least). The largest thread is chosen as the parent, and all other matching threads are chosen as children of the parent. The parent is the only thread shown in queries until the next round where all threads are re-evaluated, and the parent may change.
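By way of illustration only, the parent/child assignment can be sketched in Python as follows. The function name and the sample data are hypothetical; each thread is assumed to be represented by the set of document identifiers matched to it. Because threads are visited from largest to smallest, only the smaller thread of each pair needs to be tested against the larger one.

```python
def merge_threads(thread_docs, overlap=0.5):
    """Map each child thread id to its parent thread id.

    thread_docs maps a thread id to the set of document ids matched to it.
    """
    # Largest threads first, so the largest thread of a group becomes the parent.
    ordered = sorted(thread_docs, key=lambda t: len(thread_docs[t]), reverse=True)
    parent_of = {}
    for i, parent in enumerate(ordered):
        if parent in parent_of:
            continue  # a thread that is already a child cannot be a parent
        for child in ordered[i + 1:]:
            if child in parent_of or not thread_docs[child]:
                continue
            shared = len(thread_docs[child] & thread_docs[parent])
            # Merge when more than 50% of the child's content is in the parent.
            if shared / len(thread_docs[child]) > overlap:
                parent_of[child] = parent
    return parent_of


# Hypothetical threads and their matched document ids.
thread_docs = {
    101: set(range(1, 11)),  # 10 documents
    102: {1, 2, 3, 11},      # 3 of 4 documents shared with thread 101
    103: {20, 21},           # no shared content
}
relationships = merge_threads(thread_docs)
print(relationships)  # {102: 101}
```

Since existing relationships are ignored every cycle, the sketch starts from an empty mapping each time it is called.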
The merged threads are stored in a thread database 308. Snapshots of the thread database are then generated 310. The snapshots are then stored in a snapshot database 312.
Threads are always kept up-to-date based on the latest cluster processing outputs. Historical thread state is retained using snapshots. A snapshot is a copy of all active threads at the end of a thread cycle. Slices are used as a container to store the state of all active parent threads at that time. Snapshots are date/time stamped.
A slice is a container that can either specify a source dataset or store a snapshot of another slice. Slices based on a dataset are updated every thread cycle. Slices used for storing a snapshot are immutable once created. As only active parent threads are copied, all child thread IDs (together with the parent thread ID) are also copied, so that snapshot threads retain a reference to the original threads and can be linked to threads in a neighboring snapshot.
A snapshot may contain the following components:
The client module 122 may be used to query the snapshot DB 150. Snapshot querying is often used to obtain the latest complete output of thread processing, content, and scoring. Threads contained in the output of a snapshot query are output in order of highest to lowest thread score. The thread score indicates both the strength of the pattern and relevance of the thread calculated from recency and volume of content when the snapshot was taken. The thread score may be normalized to a range between 0 and 10. Consider the following example.
| SSThread ID | Thread Title | Terms | Score (0-10) |
| 2388836 | ukraine | 1st, achieved, achieves, americans, annual, asked, . . . | 7.03 |
| 2388843 | gaza_city | ally, ambush, battles, calls, ceasefire, crushes, . . . | 6.36 |
| 2388854 | Andre_Braugher | actor, age, Andre_Braugher, brief, brooklyn_ninenine, . . . | 6.09 |
| 2388861 | defense bill | $886, act, annual, authorization, authorizes, big, . . . | 6.03 |
| 2388865 | Donald_Trump | according, adults, americans, apnorc, bidentrump, . . . | 5.97 |
| 2388871 | curb_your_enthusiasm | curb_your_enthusiasm, end, ending, final, hbo, . . . | 5.92 |
| 2388875 | six years | ago, alex, alive, batty, boy, british, france, missing, . . . | 5.65 |
| 2388879 | Draymond_Green | center, Draymond_Green, face, forward, . . . | 5.65 |
| 2388883 | new_jersey | bull, commuters, delays, loose, morning, new_jersey, . . . | 5.32 |
| 2388887 | Sidney_Powell | apology, case, chesebro, deals, decide, election, . . . | 5.28 |
Snapshots can be queried by:
Results returned for specific snapshot queries:
Results returned for list of available snapshots:
Returning to FIG. 1, a story processor 152 combines snapshots in the snapshot database 150 to form a story that is stored in a story database 154. FIG. 4 illustrates story processing 400 performed by the story processor 152. The thread processing 300 of FIG. 3 is a first operation of a closed loop process in which snapshots are combined into a story 402 and then the story is recorded in the story database 404.
Continuities connect all thread snapshots into a continuous story from start to finish. As original/source threads are forward-only, previous states are not accessible. Snapshots provide a solution to access previous states. Connecting thread state via snapshots is not always a straightforward calculation, as one thread is often made up of a parent and child threads, where the parent thread can easily change to a child thread (and vice-versa) between cycles.
Stories bind conversations between snapshots/threads, which allows for a much grander perspective over time. Stories also have the benefit of not being a real-time/one-off calculation. Stories can benefit from hindsight, and fix anomalies that occurred in the past automatically, using data that occurred in the future with respect to the original thread.
The story processor 152 detects “continuities” by loading all snapshots and looking for the next thread in the sequence from one snapshot to the next. When continuities have been calculated in their entirety, they are saved as stories (continuities are incomplete or non-final stories).
FIG. 5 illustrates different coalesced tuples 500 forming a set of parent threads 502, which then form different snapshots 504_1 through 504_N. As shown in FIG. 5, when snapshots are generated, the original thread IDs (Source Thread IDs) are recorded; this includes the parent thread ID and the child thread IDs associated with the parent. FIG. 5 also shows how snapshots 506, 508, 510 and 512 are combined into a single story.
The snapshot copy retains all the original parent/child thread IDs (Source Thread IDs) at that moment, so even if the parent and child get swapped in a subsequent snapshot, the relationship can be re-formed.
By keeping track of the constantly evolving parent/child thread IDs, the complex relationship between similar threads can be distilled down to a single story by evaluating Source Thread IDs.
The continuity is calculated from snapshot to snapshot by finding the best-matching thread moving forward. This is done by selecting the thread in the subsequent snapshot that matches the most Source Thread IDs. This is shown conceptually in FIG. 6.
Consider a list of threads in Snapshot A. Each thread is compared against all threads in the next snapshot (Snapshot B). The thread that contains the highest percentage of overlapping Source Thread IDs is chosen as the winner, which is the top row of FIG. 6. If no match can be made from a thread in Snapshot A to Snapshot B, the continuity ends. If a thread in Snapshot B was not matched with a thread from Snapshot A, a continuity is created.
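The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the dictionary layout, and the definition of overlap as the fraction of a thread's Source Thread IDs found in the candidate thread are all hypothetical.

```python
def overlap(a: set, b: set) -> float:
    """Fraction of a's Source Thread IDs also present in b (assumed metric)."""
    return len(a & b) / len(a) if a else 0.0


def match_continuities(snapshot_a: dict, snapshot_b: dict):
    """Link each thread in Snapshot A to its best-overlapping thread in Snapshot B.

    Both arguments map a snapshot thread ID to its set of Source Thread IDs.
    Returns (links, ended, started).
    """
    links, ended = {}, []
    for a_id, a_sources in snapshot_a.items():
        best_id, best_score = None, 0.0
        for b_id, b_sources in snapshot_b.items():
            score = overlap(a_sources, b_sources)
            if score > best_score:
                best_id, best_score = b_id, score
        if best_id is None:
            ended.append(a_id)  # no match in Snapshot B: the continuity ends
        else:
            links[a_id] = best_id
    matched = set(links.values())
    # Threads in Snapshot B that nothing linked to begin new continuities.
    started = [b_id for b_id in snapshot_b if b_id not in matched]
    return links, ended, started


# Hypothetical snapshots: thread ID -> Source Thread IDs.
snapshot_a = {"A1": {10, 11, 12}, "A2": {20, 21}}
snapshot_b = {"B1": {11, 12, 13}, "B2": {30}}
links, ended, started = match_continuities(snapshot_a, snapshot_b)
```

Here thread A1 continues as B1, the continuity through A2 ends, and B2 begins a new continuity.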
More complex parent/child thread relationships can exist in source threads. An example of this is below, where 3 parent threads each have many child threads that overlap over the 3 snapshots. The result is 2 distinct continuities.
Continuity compression merges multiple continuities into a single continuity under the following circumstances:
Two merge parameters can be set:
Where possible, existing story IDs are used to create a persistent reference point for stories. When story IDs are archived, the new story ID is referenced to provide a redirect path (where 2 or more stories became a single story). The first thread in the continuity is used to extract a story ID. If that thread is not linked to an existing story ID, a new story ID is created. All snapshot threads within that continuity branch are updated with the same story ID taken from the first thread or from the newly created story. This is depicted in FIG. 9.
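The story ID assignment described above might look like the following sketch. The function name, the `thread_to_story` mapping, and the `new_id` callback are assumptions introduced for illustration; the archival and redirect handling is not modeled.

```python
import itertools


def assign_story_id(continuity: list, thread_to_story: dict, new_id) -> int:
    """Propagate one story ID to every snapshot thread in a continuity.

    continuity: snapshot thread IDs in order, first thread first.
    thread_to_story: existing links from snapshot thread ID to story ID.
    new_id: zero-argument callable that returns a fresh story ID.
    """
    story_id = thread_to_story.get(continuity[0])
    if story_id is None:
        story_id = new_id()  # first thread has no story yet: create one
    for thread_id in continuity:
        thread_to_story[thread_id] = story_id
    return story_id


# Hypothetical data: thread 1000 already belongs to story 243556.
counter = itertools.count(500000)
thread_to_story = {1000: 243556}
reused = assign_story_id([1000, 1001, 1002], thread_to_story, lambda: next(counter))
created = assign_story_id([2000, 2001], thread_to_story, lambda: next(counter))
```

Reusing the ID taken from the first thread keeps the story reference stable across story loops, which is the persistence property the paragraph describes.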
Stories are defined by the snapshot threads associated with them. Therefore, calculated attributes may change after each story loop if the snapshot threads associated with it were altered.
In one embodiment, basic story attributes include:
Calculated story attributes derived from the associated snapshot thread IDs may include:
Example: Document list returned for associated snapshot thread IDs for story.
| Snapshot Thread ID | Document ID |
| 1000 | 15FB5B74-335A-47FF-896D-E09B1A3222FE |
| 1000 | 33819D7F-AE8D-4806-912F-7BD80F3EF79A |
| 1000 | 98946136-F234-4A10-97AF-1F87BC2E134D |
| 1001 | 15FB5B74-335A-47FF-896D-E09B1A3222FE |
| 1001 | 33819D7F-AE8D-4806-912F-7BD80F3EF79A |
| 1001 | 98946136-F234-4A10-97AF-1F87BC2E134D |
| 1001 | B140C5C9-8AE8-4D09-B332-7C5175DF3C7A |
| 1002 | 15FB5B74-335A-47FF-896D-E09B1A3222FE |
| 1002 | 98946136-F234-4A10-97AF-1F87BC2E134D |
| 1002 | B140C5C9-8AE8-4D09-B332-7C5175DF3C7A |
| Document ID | Score |
| 15FB5B74-335A-47FF-896D-E09B1A3222FE | 3 |
| 33819D7F-AE8D-4806-912F-7BD80F3EF79A | 2 |
| 98946136-F234-4A10-97AF-1F87BC2E134D | 3 |
| B140C5C9-8AE8-4D09-B332-7C5175DF3C7A | 2 |
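The scores in the second table are consistent with counting the number of snapshot threads in which each document appears. Assuming that is the scoring rule (the text does not state it explicitly), the aggregation can be sketched as follows, using the rows of the first table as input.

```python
from collections import Counter

# (snapshot thread ID, document ID) rows from the document list above.
rows = [
    (1000, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1000, "33819D7F-AE8D-4806-912F-7BD80F3EF79A"),
    (1000, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1001, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1001, "33819D7F-AE8D-4806-912F-7BD80F3EF79A"),
    (1001, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1001, "B140C5C9-8AE8-4D09-B332-7C5175DF3C7A"),
    (1002, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1002, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1002, "B140C5C9-8AE8-4D09-B332-7C5175DF3C7A"),
]

# Deduplicate so a document counts at most once per snapshot thread,
# then count the snapshot threads that contain each document.
document_scores = Counter(doc_id for _, doc_id in set(rows))
```

The resulting counts are 3, 2, 3, and 2, matching the Score column.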
FIG. 10 illustrates how aggregate coalesced tuple terms for Thread Snapshots A and B are summed to produce AGGScore and AGGTerm.
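As a rough illustration of this summation, the sketch below adds per-term scores from two thread snapshots. The term scores are invented for the example, and treating the aggregate as a plain per-term sum is an assumption based on the description of FIG. 10.

```python
from collections import Counter

# Hypothetical coalesced tuple term scores for two thread snapshots of one story.
snapshot_a = {"volcano": 5, "iceland": 4, "erupts": 2}
snapshot_b = {"volcano": 6, "iceland": 3, "lava": 1}

# Per-term sum across snapshots; terms missing from a snapshot contribute 0.
aggregate = Counter(snapshot_a) + Counter(snapshot_b)

# Terms ordered from highest to lowest aggregate score.
ranked = aggregate.most_common()
```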
In one embodiment, a single story contains a packet of the following data:
Story output combines the documents that match the story, along with the identity of the story defined by the coalesced tuples found in the associated thread snapshots. A story can represent an active (present) or historical view of the source dataset. The active stories are identified by the “active” attribute. The most recent score and number of documents for a given story is found by looking at the Scores and NumDocuments attributes and finding the most recent Timestamp and the score/value associated with it.
| Story ID | 243556 |
| AggregateCoalescedTupleTermScores | [[volcano, 2090851], [iceland, 2090280], [erupts, 1999341], [reykjanes, 1960981], [peninsula, 1931228], [weeks, 1885293], [sky, 1780927], [volcanic_eruption, 1760769], [orange, 1746350], [evacuated, 1743913], [monday, 1731789], [alert, 1721882], [high, 1719710], [thousands, 1714571], [night, 1708499], [turning, 1682191], [prompting, 1615217], [started, 1586429], [civil_defense, 1446110], [town, 990795], [country, 440009], [grindavik, 104255], [eruption, 67571], [activity, 32084], [lava, 12861], [seismic, 12520], [meteorological, 4872], [office, 4830], [erupted, 4552], [spewing, 4366], [earthquakes, 3097], [southwest, 2070], [intense, 1590], [flights, 924], [scientists, 726], [nearby, 465], [following, 426], [southwestern, 408], [affect, 234], [near, 186], [rock, 144], [icelandic, 108], [magma, 72], [earthquake, 72]] |
| AggregateDocumentScores | Dictionary of <Document, Score>, where score is a number (see Document object), e.g. {[15FB5B74-335A-47FF-896D-E09B1A3222FE, 3], . . . } |
| NumDocuments | Dictionary of <Timestamp, Count>, e.g. {[202312190330, 74], [202312190345, 81], . . . } |
| Scores | Dictionary of <Timestamp, Score>, e.g. {[202312190330, 6.3], [202312190345, 6.5], . . . } |
| CreatedTimestamp | 202312190330 |
| LastTimestamp | 202312190345 |
| Active | True |
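Finding the most recent score and document count for a story, as described above, can be sketched as follows. The dictionary representation and the `latest` helper are assumptions for illustration. The timestamps are stored here as fixed-width strings on the assumption that the values follow a year-month-day-hour-minute layout, in which case they sort chronologically.

```python
# Story packet reduced to the attributes used here, with values from the table above.
story = {
    "StoryID": 243556,
    "NumDocuments": {"202312190330": 74, "202312190345": 81},
    "Scores": {"202312190330": 6.3, "202312190345": 6.5},
    "Active": True,
}


def latest(series: dict):
    """Return (timestamp, value) for the most recent entry of a timestamped series."""
    timestamp = max(series)  # fixed-width timestamps compare chronologically
    return timestamp, series[timestamp]


latest_score = latest(story["Scores"])
latest_count = latest(story["NumDocuments"])
```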
A document object contained within the <Document, Score> dictionary has the following fields:
| DocumentID | 2d214187-62b3-4f3e-93c0-5a6a0ef2b9f4 |
| URL | https://www.latimes.com/world-nation/story/2023-12-18/iceland-volcano-erupts-weeks-after-thousands-were-evacuated-from-a-town-on-reykjanes-peninsula |
| Title | Iceland volcano erupts weeks after thousands evacuated |
| Body | A volcanic eruption began Monday night on Iceland's Reykjanes Peninsula, turning the sky orange and prompting the country's civil defense to be on high alert |
| Timestamp | 202312190132 |
Thus, the story DB 154 has stories that are the culmination of the disclosed production line. The disaggregated documents found in the source dataset are automatically organized into a digestible collection of distinct stories that are continually updated as the source dataset is updated. These distinct stories are made up of tuples that merged into groups (coalesced tuples) due to one or more of their components trending over the baseline of the dataset. As these components were trending over the target range of the dataset compared to the baseline, they are considered statistically significant and are used to merge tuples together. This allows a safe meshing of common and uncommon terms into a single story, even though stories will often have common terms.
As the tuples were derived from the source dataset, the tuples within each coalesced tuple will match with the documents that they were originally derived from. After the source dataset has been processed, what is output is an automatically organized dataset of stories, which in turn reference documents from the original source dataset. When compared to the original dataset, each story is far easier to digest by machine. Documents that do not follow a pattern identified by a story can be ignored, and as each document is scored for relevance, quality filters can be used to further optimize ingestion of documents without having to ingest the entire story.
FIG. 7 conceptually depicts the processing disclosed herein. On the left are disaggregated documents of the unstructured DB 142. This collection of documents is large and unorganized. On the right side of the figure is a conceptual depiction of the story DB 154, where each set of linked dots represents a story. The story is constructed from individual documents, each represented by a dot. Thus, FIG. 7 shows how the disclosed invention effectively filters an original document set by removing non-trending information. The figure also shows how the original document set is organized into related information.
FIG. 8 illustrates a system 800 configured in accordance with an embodiment of the invention. The system 800 generally corresponds to system 100 of FIG. 1, but the memory is augmented to include an LLM trainer 156 and a query processing module 158. Also, content machines 150_1 through 150_N are not depicted as being connected to network 106. Rather, LLM machines 170_1 through 170_N are connected to network 106. LLM machine 170_1 and each of the other LLM machines are nodes in an LLM network. LLM machine 170_1 includes a processor 171, input/output devices 172, a bus 174, and a network interface circuit 176. A memory 180 is connected to bus 174. The memory 180 stores an LLM module 182 with instructions executed by processor 171 to support training of an LLM and subsequent use of a trained LLM to prepare responses to queries submitted by the query module 158.
The LLM trainer 156 accesses the story DB 154 to collect pre-organized data to train an LLM. Typically, when it comes to processing news and current event data beyond the temporal limits of an LLM's knowledge base, the LLM's only option is to utilize what is effectively the disaggregated data on the left side of FIG. 7. With the disclosed technology, the LLM trainer 156 supplies the LLM with automatically aggregated stories of coalesced tuples. Thus, it becomes possible to train the model on story data, and the training process is far more efficient since there is less data and the submitted data is already thematically organized.
The query module 158 supports queries applied to the trained LLM model and the story DB 154. Observe that story results are prepared in advance, so queries are quick and efficient. The disclosed system has already identified the patterns (stories) most likely to yield results within the source dataset.
Even though the source dataset has been broken down and grouped into discrete objects (stories) orders of magnitude fewer in number, it retains the statistically significant documents within each story. Documents within each story can be further filtered by setting an intelligent cap via a relevance score.
In one embodiment, the query module supplies two types of query results.
As stories are updated in an automatic, ongoing fashion, efficient and automatic machine-to-machine querying and ingesting of stories provides accurate near real-time results. A full story query is intended to copy the entire story DB 154 into another system.
Incremental story query types will only return stories that have changed since the UTC timestamp provided. Usually, incremental queries are used after a complete full query has been performed at least once.
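An incremental query of this kind can be sketched as a filter over the stories' timestamps. Using the LastTimestamp attribute as the change marker, and the function name itself, are assumptions for illustration; the story records below are hypothetical apart from story 243556.

```python
def incremental_query(stories: list, since: str) -> list:
    """Return only the stories whose LastTimestamp is later than `since`."""
    return [story for story in stories if story["LastTimestamp"] > since]


# Fixed-width timestamp strings, so string comparison is chronological.
stories = [
    {"StoryID": 243556, "LastTimestamp": "202312190345"},
    {"StoryID": 243557, "LastTimestamp": "202312190315"},
]
changed = incremental_query(stories, since="202312190330")
```

A consuming system would typically record the time of its last successful query and pass it as `since` on the next call.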
Before being ingested by an LLM or other systems, a further limiting of each story's documents can be achieved by filtering by either x % or x number of documents based on the relevance score provided for each document. The highly relevant documents contain more of the tuples used to represent the story than the documents with a lower relevance score.
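Limiting a story's documents by relevance score can be sketched as below. The function name, its parameters, and the choice to round a percentage up to a whole document are assumptions; the document IDs and scores are placeholders.

```python
import math


def top_documents(doc_scores: dict, count=None, percent=None) -> list:
    """Keep the highest-scoring documents, by absolute count or by percentage.

    doc_scores maps a document ID to its relevance score. If neither limit
    is given, all documents are returned in descending score order.
    """
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    if percent is not None:
        count = math.ceil(len(ranked) * percent / 100)  # assumed rounding rule
    return ranked[:count]


# Placeholder document IDs with relevance scores.
doc_scores = {"doc-a": 3, "doc-b": 2, "doc-c": 3, "doc-d": 2}
top_two = top_documents(doc_scores, count=2)
top_half = top_documents(doc_scores, percent=50)
```

Because Python's sort is stable, documents with equal scores keep their original order, so both calls keep doc-a and doc-c.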
Story relevance is calculated using the aggregate term scores, which are based on the coalesced tuples contained within all thread snapshots of the story.
The formula to implement the document limit specified is as follows:
| Story | Terms | Sum (Terms) | Sum (Matching) | Relevance | Documents |
| A | [Taylor_Swift, 80], [NFL, 30], [a, 20], [b, 10] | 140 | 110 | 0.785714286 | 80 |
| B | [a, 200], [b, 100], [Taylor_Swift, 10], [NFL, 5] | 315 | 15 | 0.047619048 | 4 |
| C | [a, 50], [Taylor_Swift, 10], [c, 5], [d, 4] | 69 | 10 | 0.144927536 | 14 |
| | | | | 0.97826087 | 100 (limit) |
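The values in the table are consistent with the following calculation, offered here as an inference from the numbers rather than as the stated formula: relevance is Sum (Matching) divided by Sum (Terms), and each story receives a share of the document limit proportional to its relevance, rounded down. The query terms Taylor_Swift and NFL are likewise inferred from the Sum (Matching) column, and the function names are hypothetical.

```python
import math


def relevance(terms: dict, query_terms: set) -> float:
    """Share of a story's aggregate term score carried by the query terms."""
    total = sum(terms.values())
    matching = sum(score for term, score in terms.items() if term in query_terms)
    return matching / total if total else 0.0


def allocate_documents(stories: dict, query_terms: set, limit: int) -> dict:
    """Split a document limit across stories in proportion to relevance (rounded down)."""
    scores = {name: relevance(terms, query_terms) for name, terms in stories.items()}
    total = sum(scores.values())
    if total == 0:
        return {name: 0 for name in stories}
    return {name: math.floor(limit * score / total) for name, score in scores.items()}


# Aggregate term scores from the table above.
stories = {
    "A": {"Taylor_Swift": 80, "NFL": 30, "a": 20, "b": 10},
    "B": {"a": 200, "b": 100, "Taylor_Swift": 10, "NFL": 5},
    "C": {"a": 50, "Taylor_Swift": 10, "c": 5, "d": 4},
}
allocation = allocate_documents(stories, {"Taylor_Swift", "NFL"}, limit=100)
# allocation == {"A": 80, "B": 4, "C": 14}
```

Rounding down means the allocated total (98 here) can fall slightly below the limit, which agrees with the Documents column.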
An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include but are not limited to: magnetic media, optical media, magneto-optical media, and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using an object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described to best explain the principles of the invention and its practical applications; they thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
1. A computer implemented method, comprising:
receiving a baseline dataset divisible by temporal or logical criteria;
receiving a target dataset representing a small fraction of the baseline dataset, the target dataset being segmented by the temporal or logical criteria;
identifying, without user direction, numbers of documents containing individual words within the baseline dataset to form baseline singles;
identifying, without user direction, numbers of documents containing common combinations of individual words within the target dataset to form target tuples where each common combination of individual words provides context for each word in the common combination;
coalescing, without user direction, combinations of the target tuples based upon baseline singles criteria to form a coalesced dataset representing significant content in the target dataset;
using the coalesced dataset to cluster documents from the target dataset into threads of related content;
receiving updated content; and
integrating the updated content with the coalesced dataset to form updated threads of related content.
2. The computer implemented method of claim 1 wherein integrating is based upon common term criteria in the updated content and the coalesced dataset.
3. The computer implemented method of claim 1 further comprising scoring threads based upon content volume.
4. The computer implemented method of claim 1 further comprising scoring threads based upon content velocity.
5. The computer implemented method of claim 4 wherein scoring threads utilizes a content velocity curve filter.
6. The computer implemented method of claim 1 further comprising merging threads to form merged threads.
7. The computer implemented method of claim 6 further comprising:
securing snapshots of merged threads; and
storing the snapshots in a database.
8. The computer implemented method of claim 7 further comprising searching the database for a designated snapshot.
9. The computer implemented method of claim 8 wherein searching the database for a designated snapshot is based upon a snapshot identifier.
10. The computer implemented method of claim 8 wherein searching the database for a designated snapshot is based upon a snapshot temporal parameter.
11. The computer implemented method of claim 7 further comprising combining snapshots into a story.
12. The computer implemented method of claim 11 wherein the story has a story identification, a story title, story timestamps, story keywords, story scores, and a document object with a uniform resource locator specifying the network location of the original source material.
13. The computer implemented method of claim 11 further comprising applying selected stories to a language model to form a trained language model.
14. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model in accordance with a temporal parameter.
15. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model in accordance with a story identification.
16. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model by returning a relevance score for a response supplied to a query.