Patent application title:

Apparatus and method for clustering related tuples derived from content in a dynamic unstructured database

Publication number:

Publication date:

2025-12-16

Application number:

18/395,257

Filed date:

2023-12-22

Patent granted

Patent number:

US 12,499,139 B1

Grant date:

2025-12-16

PCT filing:

PCT publication:

Examiner:

Dangelino N Gortayo

Agent:

Cooley LLP

Adjusted expiration:

2043-12-22

Smart Summary: A method is designed to organize information from a large database that changes over time. It starts by taking a main set of data and a smaller set that is part of it. The smaller set is broken down based on specific time or logical categories. By counting how often individual words and their combinations appear in both datasets, the method creates a new set that highlights important content. Finally, this new set helps group documents with similar themes, and any new information can be added to keep these groups updated.

Abstract:

A computer implemented method includes receiving a baseline dataset divisible by temporal or logical criteria. A target dataset representing a small fraction of the baseline dataset is received. The target dataset is segmented by the temporal or logical criteria. Numbers of documents containing individual words within the baseline dataset are identified to form baseline singles. Numbers of documents containing common combinations of individual words within the target dataset are identified to form target tuples. Combinations of the target tuples are coalesced based upon baseline singles criteria to form a coalesced dataset representing significant content in the target dataset. The coalesced dataset is used to cluster documents from the target dataset into threads of related content. Updated content is received. The updated content is integrated with the coalesced dataset to form updated threads of related content.

Inventors:

Cameron Carrett, Kellyville, Australia
Jacob Guernsey, San Francisco, CA, United States

Assignee:

Break the Web Technology Co., San Francisco, CA, United States

Applicant:

Break the Web Technology Co., San Francisco, CA, United States

Classification:

G06F16/3326 » CPC main

Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query formulation; Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

G06F16/3334 » CPC further

Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query processing; Query translation Selection or weighting of terms from queries, including natural language queries

G06F16/383 » CPC further

Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

G06F16/332 IPC

Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying Query formulation

G06F16/3332 IPC

Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data; Querying; Query processing Query translation

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Patent Application No. 63/477,157, filed Dec. 23, 2022, the contents of which are incorporated herein by reference. This application is related to commonly owned U.S. patent application Ser. No. 18/395,243, filed Dec. 22, 2023.

FIELD OF THE INVENTION

This invention relates generally to finding significant content in large corpuses of data. More particularly, this invention is directed to finding significant content in large corpuses of data by analyzing individual terms in the data to develop meaningful thematic conceptual clusters of the data sources.

BACKGROUND OF THE INVENTION

Relational databases have a well-known relational structure that is conducive to Structured Query Language (SQL) queries. In contrast, an unstructured database, also known as a document-oriented database or a No-SQL database, is a comparatively unruly collection of materials that defies efficient query processing.

Accordingly, there is a need for improved processing within unstructured databases.

SUMMARY OF THE INVENTION

BRIEF DESCRIPTION OF THE FIGURES

The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a system configured in accordance with an embodiment of the invention.

FIG. 2 illustrates processing operations performed in accordance with an embodiment of the invention.

FIG. 3 illustrates thread processing performed in accordance with an embodiment of the invention.

FIG. 4 illustrates story processing performed in accordance with an embodiment of the invention.

FIG. 5 illustrates a hierarchical relationship between coalesced tuples, threads, snapshots, and stories associated with the invention.

FIG. 6 illustrates snapshot merging performed in accordance with an embodiment of the invention.

FIG. 7 conceptually illustrates the disclosed invention's organization of numerous disaggregated documents into aggregated stories of coalesced tuples.

FIG. 8 illustrates the disclosed system integrated with Large Language Model (LLM) machines in accordance with an embodiment of the invention.

FIG. 9 illustrates story ID updates in accordance with an embodiment of the invention.

FIG. 10 illustrates aggregate coalesced tuple terms summed to produce aggregate terms.

Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates a system 100 configured in accordance with an embodiment of the invention. The system 100 includes client machines 102_1 through 102_N connected to server machines 104_1 through 104_N via a network 106, which may be any combination of wired and wireless networks. Content source machines 150_1 through 150_N are also connected to network 106.

The server machines 104_1 through 104_N implement operations disclosed herein. By way of overview, the server machines 104_1 through 104_N collect content from content source machines 150_1 through 150_N to form an unstructured database 142 of content. A clustering module 144 subsequently organizes the content, as discussed in detail below. Thereafter, the client machines 102_1 through 102_N can query the unstructured database 142 and perform other operations on the organized content. The memory 140 also stores a thread processor 146 that is used to construct a thread database 148 and snapshot database 150, as detailed below. The memory 140 also has a story processor 152 used to construct a story database 154, as detailed below.

Each client machine 102_1 through 102_N includes a processor 110 and input/output devices 112 connected via a bus 114. The input/output devices 112 may include a keyboard, mouse, touch display and the like. A network interface circuit 116 is also connected to the bus 114 to provide connectivity to network 106. A memory 120 is also connected to the bus 114. The memory 120 stores instructions executed by processor 110. In particular, the memory stores a client module 122 that includes instructions to use network 106 to communicate with servers 104_1 through 104_N.

Each server (e.g., server 104_1) includes a processor 130, input/output devices 132, a bus 134 and a network interface circuit 136 to provide connectivity to network 106. A memory 140 is also connected to the bus 134. The memory stores instructions executed by processor 130 to implement operations disclosed herein. The memory stores an unstructured database 142 and a clustering module 144 to organize information in the unstructured database 142. The organized information is processed by a thread processor 146, which creates a thread database 148 and a snapshot database 150 based upon the processing, as detailed below.

Each content server (e.g., 150_1) includes a processor 151, input/output devices 152, a bus 154 and a network interface circuit 156 to provide connectivity to network 106. A memory 160 is also connected to the bus 154. The memory 160 stores a content source module 162 with instructions executed by processor 151 to deliver content to one or more servers 104_1 through 104_N.

FIG. 2 illustrates processing operations 200 performed by clustering module 144. A baseline dataset is received 201. More particularly, a baseline dataset divisible by a temporal criterion (e.g., a 24-hour period) or a logical criterion (e.g., chapters in a textbook) is received.

A target dataset is then received 202. The target dataset represents a small fraction of the baseline dataset. The target dataset is segmented by the same temporal criterion or logical criterion as the baseline dataset. Subsequent operations analyze the target dataset against the baseline dataset to identify significant content in the target dataset.

Baseline singles are then formed 206. That is, numbers of documents containing individual terms within the baseline dataset are identified to form baseline singles (e.g., single terms and an associated number of documents).

Target tuples are then formed 206. That is, numbers of documents containing common combinations of individual terms within the target dataset are identified to form target tuples (e.g., collections of two or more terms, where each collection is a tuple).

A coalesced dataset is then formed 208. Combinations of target tuples are coalesced based upon baseline singles criteria to form a coalesced dataset, as demonstrated below.

In one embodiment, a supplementary dataset of baseline values is then formed. The supplementary dataset baseline values include terms commonly deemed significant.

In this embodiment, a second coalesced dataset is formed. That is, the original coalesced data set and the supplementary dataset baseline values are used to form a second coalesced dataset representing significant content in the target dataset.

The foregoing operations are more fully appreciated in connection with the following detailed example. The example is for illustrative purposes. Many variations on the example are still within the scope of the invention. For example, the following example focuses on a baseline dataset of news articles divided into temporal periods of twenty-four hours. The example also focuses on an embodiment that uses tuples with four values (“quads”), but other tuple configurations may be used in accordance with embodiments of the invention.

The source dataset is defined as the general input data on which the clustering module 144 runs. The clustering module 144 is also referred to herein as CT-X. For example, the source dataset may be news articles spanning the past 365 days. A key attribute of the source dataset is that it contains properties that allow the dataset to be divided into two sub-datasets (target and baseline). For example, a source dataset that will be divided by its temporal property requires that a publishing date be present for every document in the source dataset.

Lookup tables from external data sources can also be brought in to provide evidence of terms of importance (“important terms”) within the source dataset. This step is optional but extremely beneficial for the quality of results. Depending on the dataset being studied, the type of lookup table may vary. Examples of lookup tables relevant to a source dataset of news articles might include:

- A list of known people/celebrities
- A list of countries, states, regions, and city names
- A list of company/organization names
- A list of special events (Christmas, etc.)
- A list of movies, TV shows, books, plays etc.
- Wikipedia article entries

For best results, source data should be cleaned and pre-processed by removing punctuation, performing stemming/lemmatization, identifying important terms via lookup tables, and connecting terms that belong together (e.g., first name, last name) with an underscore to make them compound terms. This step is optional but extremely beneficial for the quality of results. By standardizing the format of each term, indexing and term matching performance throughout the system is dramatically increased.

Consider the following news article:

- IDF Spokesperson to Newsmax: Hamas Requested Certain Prisoners for Swaps Maj. Doron Spielman, the international spokesperson for the Israel Defense Forces, told Newsmax on Wednesday that Hamas requested certain individuals be released as part of the hostage-prisoner swaps that have been taking place between Israel and the Palestinian militant
  - after pre-processing the article becomes . . .
- idf spokesperson to newsmax hamas requested certain prisoners for swaps—maj doron spielman the international spokesperson for the israel_defense_forces told newsmax on wednesday that hamas requested certain individuals be released as part of the hostage prisoner swaps that have been taking place between israel and the palestine militant
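The pre-processing described above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the `COMPOUND_TERMS` lookup table and the `preprocess` function are hypothetical names, stemming/lemmatization is omitted, and the lookup table is assumed to hold lowercase multi-word phrases.

```python
import re

# Hypothetical lookup table of multi-word important terms (assumption for illustration).
COMPOUND_TERMS = ["israel defense forces"]


def preprocess(text, compound_terms=COMPOUND_TERMS):
    """Lowercase, strip punctuation, and join known compound terms with underscores."""
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Collapse runs of whitespace left behind by the removed punctuation.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Longest phrases first so a longer compound wins over a shorter one it contains.
    for phrase in sorted(compound_terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, phrase.replace(" ", "_"), cleaned)
    return cleaned
```

Standardizing every term this way means later matching can be done with plain whitespace tokenization.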

The baseline dataset is defined as the source dataset without the target dataset (for example, days 1-364). For best results, the target and baseline datasets should not overlap.

The target dataset is the range in which to look for thematically related/trending content. For example, day 365, in contrast to the baseline of days 1-364.

A bag of words is used to hold a list of terms or tuples along with their calculated frequency. Consider the following examples.


	Term	Frequency

	apple	876
	peach	431
	pear	339

	Tuple	Frequency

	[apple, pear]	221
	[peach, pear]	109

The following bags are used within an embodiment of the clustering module 144 (CT-X):

- Baseline bag of singles
  - Single terms
  - Average number of documents each term is found in per day within the baseline dataset
- Target bag of singles
  - Single terms
  - Average number of documents each term is found in per day within the target dataset
    - If the target date range is a single day, no average calculation is necessary; if the target date range spans more than 1 day, the average across multiple days should be calculated
- Target bag of doubles
  - Doubles; tuple size of 2 [a, b]
  - Average number of documents each tuple/combination of terms are all found in per day within the target dataset.
- Target bag of triples
  - Triples; tuple size of 3 [a, b, c]
  - Average number of documents each tuple/combination of terms are all found in per day within the target dataset.
- Target bag of quads
  - Quads; tuple size of 4 [a, b, c, d]
  - Average number of documents each tuple/combination of terms are all found in per day within the target dataset.
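One way to build such bags is sketched below, assuming documents are already pre-processed strings of whitespace-separated terms. The function names `document_frequency_bag` and `average_per_day` are illustrative and do not come from the disclosure.

```python
from collections import Counter
from itertools import combinations


def document_frequency_bag(documents, size):
    """Count, for each tuple of `size` distinct terms, the documents containing all of them."""
    bag = Counter()
    for text in documents:
        # A term counts once per document; sorting gives each tuple a canonical order.
        terms = sorted(set(text.split()))
        for combo in combinations(terms, size):
            bag[combo] += 1
    return bag


def average_per_day(bag, days):
    """Convert raw document counts into an average number of documents per day."""
    return {combo: count / days for combo, count in bag.items()}
```

Calling `document_frequency_bag(docs, 1)` yields a bag of singles, and sizes 2, 3 and 4 yield doubles, triples and quads. Enumerating every combination is expensive, which is the motivation for the thresholded rounds described next.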

Term frequency is calculated in rounds, with each round taking the terms with frequencies above a certain threshold to limit the computation required. This is referred to herein as skim mining. Skim mining collects significant terms in the target and omits insignificant terms in the target to effectively compress target data into its most significant content. Skim mining is also performed on large datasets to limit the number of tuples to a reasonable level (e.g., no more than 50-100k). When limiting by the number of tuples, the dataset is first ordered by most frequent tuples to least frequent, then the top 50-100k tuples are used.

It is also important that tuples with the same frequency are used as the floor. For example:

- 100
- 43
- 43
- 43
- 42
- 42
- 41

If the floor is calculated using the top 5 records, records (tuples) 1-7 must be used, because positions 6 and 7 have the same frequency as position 5. That is, the score at position 5 is 43, so the floor is >=43 and tuples 1-7 must be used.
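A tie-inclusive floor of this kind might be computed as in the sketch below. The function name `skim_top` is hypothetical, and the sample frequencies in the usage note are made up so that positions 5 through 7 tie at 43.

```python
def skim_top(frequencies, top_n):
    """Keep the top_n frequencies plus any further entries tied with the last one kept."""
    ordered = sorted(frequencies, reverse=True)
    if len(ordered) <= top_n:
        return ordered
    # The frequency at position top_n becomes the floor; ties at the floor are retained.
    floor = ordered[top_n - 1]
    return [freq for freq in ordered if freq >= floor]
```

For the hypothetical input `[100, 80, 60, 50, 43, 43, 43, 42, 42, 41]` with `top_n=5`, the floor is 43 and seven records are kept rather than five.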

Skim mining is performed for 2 reasons:

- To vastly reduce the computation required
- To increase the Low Value Keyword (LVK) threshold to tuple threshold spread. The greater the tuple threshold, the greater each term within a tuple must be trending.

The minimum frequency should optimally be the quad threshold value (2×LVK threshold). Consider the following abstract dataset and thresholds:

- LVK threshold=3.5
- Quad/tuple threshold=7.0 (2×LVK threshold)

Target bag of singles:

- a=11; a occurs in 11 documents
- b=40; b occurs in 40 documents
- c=97; c occurs in 97 documents
- d=21; d occurs in 21 documents
- e=3; e occurs in 3 documents

“e” occurs in fewer than 7 documents; therefore, it can be removed from the next round.

Target bag of doubles:

- [a, b]=11; a, b occurs in 11 documents; b is contained in all documents containing a
- [a, c]=8; a, c occurs in 8 documents
- [a, d]=10; a, d occurs in 10 documents
- [b, c]=29; b, c occurs in 29 documents
- [b, d]=21; b, d occurs in 21 documents
- [c, d]=11; c, d occurs in 11 documents

All doubles are above the threshold of 7 and will be evaluated for the next round.

Target bag of triples:

- [a, b, c]=8; a, b, c occurs in 8 documents
- [a, b, d]=10; a, b, d occurs in 10 documents
- [b, c, d]=11; b, c, d occurs in 11 documents

Target bag of quads:

- [a, b, c, d]=8; a, b, c, d occurs in 8 documents

Some key takeaways:

- The frequency of tuple [a, b, c, d] can never exceed the frequency of any combination it contains:
  - [a, b], [a, c], [a, d], [b, c], [b, d], [c, d] will always have a frequency >=[a, b, c, d]
  - a, b, c, d will always have a frequency >=their combinations in double, triple, quad form
- If the target tuple frequency of [a, b, c, d] is 8, it can be assumed that the target frequency for single terms a, b, c, d individually is >=8.
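The rounds above can be sketched as follows. This is one possible reading of skim mining, under the assumption that a term survives to the next round only if it appears in at least one tuple that met the threshold in the current round. The name `skim_mine` and the use of a single threshold for every round are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def skim_mine(documents, threshold, max_size=4):
    """Return {size: Counter} keeping only tuples whose document frequency >= threshold."""
    doc_terms = [set(doc) for doc in documents]
    bags = {}
    survivors = None  # terms still eligible; None means every term is eligible
    for size in range(1, max_size + 1):
        counts = Counter()
        for terms in doc_terms:
            eligible = terms if survivors is None else terms & survivors
            for combo in combinations(sorted(eligible), size):
                counts[combo] += 1
        kept = Counter({combo: n for combo, n in counts.items() if n >= threshold})
        bags[size] = kept
        # A tuple can only meet the threshold if all of its terms met it in smaller tuples.
        survivors = {term for combo in kept for term in combo}
        if not survivors:
            break
    return bags
```

Because a quad can never be more frequent than the singles, doubles and triples it contains, discarding low-frequency terms early does not lose any quad that would have met the threshold.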

The values shown above are calculated for each term detected in the documents (corpus), except for stop terms and other less useful terms (e.g., “the”, “a”).

CT-X uses the following values:

- Document frequency in baseline dataset for each term
- Document frequency in target dataset for each term
- Document frequency in target dataset for term tuples (combinations of terms)

The baseline frequency calculation forms what is expected for a certain term; the number of documents the term would be expected to appear in. The target frequency calculation is the number of documents the term appears in within the target dataset. Consider the following examples.


Single/Term	Baseline Score	Target Score	Multiple	Conclusion

australia	5.7	12	2.11	Trending
bolivia	1.8	10	5.56	Trending
france	7.6	22	2.89	Trending
germany	4.2	3	0.71	Not trending
global_warming	29.8	20	0.67	Not trending
iceland	2.1	18	8.57	Trending
usa	8.5	30	3.53	Trending
world_cup	1.3	15	11.54	Trending
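The Multiple and Conclusion columns can be reproduced with a short sketch. It assumes the multiple is the target score divided by the baseline score, that a term is trending when its target score exceeds its baseline score, and that the baseline score is greater than zero. The function names are illustrative.

```python
def trending_multiple(baseline_score, target_score):
    """Ratio of observed (target) to expected (baseline) document frequency."""
    return target_score / baseline_score


def conclusion(baseline_score, target_score):
    """Label a term by comparing its target frequency with its baseline frequency."""
    return "Trending" if target_score > baseline_score else "Not trending"
```

For example, `round(trending_multiple(5.7, 12), 2)` gives 2.11 for australia, and `conclusion(4.2, 3)` gives "Not trending" for germany.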

Single terms are not particularly reliable for querying content, as they lack context around the reason for that term to be trending. CT-X expands the document frequency calculation to tuples. It looks for a combination of terms to be used within multiple documents. Consider the following example.


Double/Tuple	Target Score	Conclusion

australia, world_cup	11	11 docs contained australia and world_cup
australia, bolivia	10	10 docs contained australia and bolivia
australia, france	12	12 docs contained australia and france
australia, usa	9	9 docs contained australia and usa
iceland, usa	20	20 docs contained iceland and usa
global_warming, usa	14	14 docs contained global_warming and usa

Some assumptions can already be made. A tuple containing australia cannot appear in more than 12 documents, because 12 is the target score for australia. The tuple [australia, world_cup] was found in 11 of the 12 documents containing australia and in 11 of the 15 documents containing world_cup. In general, a tuple's frequency can never exceed the target frequency of any of its terms. In one embodiment of the invention, quads (tuples of size 4) are used.


Quad/Tuple	Target Score	Conclusion

australia, bolivia, france, world_cup	10	10 docs contained all terms
australia, france, usa, world_cup	12	12 docs contained all terms
bolivia, france, usa, world_cup	10	10 docs contained all terms
iceland, france, usa, global_warming	12	12 docs contained all terms
germany, france, global_warming, usa	1	1 doc contained all terms

Quads contain far more information than a single term, and we now have context surrounding a single term. We now also have potentially hundreds of tuples that contain a single term, so a way to associate the tuples should be used. The association between quads is defined by a common term where the following conditions are met:

- Tuple/quad frequency is sufficiently high (e.g., 10+)
- A term common to the 2 tuples has a sufficiently low baseline frequency.

To identify baseline terms that can be used for association (merging), the LVK threshold is used:
LVK threshold=Average frequency of baseline term+(0.75*standard deviation of baseline terms)

Any tuples that share a term that is also an LVK (baseline frequency below LVK threshold) can be merged to form a “coalesced tuple”. Assume that the LVK threshold is 3.5; the table below identifies the LVK terms that can be used for merging. The quad threshold should be at least 2× the LVK threshold (in this case, >=7.0).


Quad/Tuple	LVKs	Tuple Frequency	Notes

australia, bolivia, france, world_cup	bolivia, world_cup	10	Tuple freq good, contains merge terms
australia, france, usa, world_cup	world_cup	12	Tuple freq good, contains merge term
bolivia, france, usa, world_cup	bolivia, world_cup	10	Tuple freq good, contains merge terms
iceland, france, usa, global_warming	iceland	12	Tuple freq good, contains merge term
germany, france, global_warming, usa	global_warming	1	Tuple freq too low, ignored

The resulting coalesced tuples are:


Coalesced Tuple	Tuples

australia, bolivia, france, usa, world_cup	[australia, bolivia, france, world_cup], [australia, france, usa, world_cup], [bolivia, france, usa, world_cup]
france, global_warming, iceland, usa	[iceland, france, usa, global_warming]
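The merging step can be sketched as below. This is an illustrative reading rather than the disclosed implementation: `coalesce` is a hypothetical name, LVKs are taken to be terms whose baseline score is below the LVK threshold, and a qualifying tuple that contains no LVK is assumed to remain in a group of its own.

```python
def coalesce(tuples, baseline, lvk_threshold, tuple_threshold):
    """Merge tuples that share a low-baseline term (LVK) into coalesced tuples."""
    lvks = {term for term, freq in baseline.items() if freq < lvk_threshold}
    groups = []  # each group: {"terms": set, "lvks": set, "tuples": list}
    for terms, frequency in tuples:
        if frequency < tuple_threshold:
            continue  # tuple frequency too low, ignored
        new = {"terms": set(terms), "lvks": set(terms) & lvks, "tuples": [terms]}
        remaining = []
        for group in groups:
            if group["lvks"] & new["lvks"]:
                # Shared LVK: absorb the existing group into the new one.
                new["terms"] |= group["terms"]
                new["lvks"] |= group["lvks"]
                new["tuples"] = group["tuples"] + new["tuples"]
            else:
                remaining.append(group)
        groups = remaining + [new]
    return groups
```

With the baseline scores from the earlier table, an LVK threshold of 3.5 and a tuple threshold of 7.0, the five example quads produce two groups with the aggregate terms shown above. The original tuples are kept alongside each group so they can still be used to query documents.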

Even though the coalesced tuples contain some of the same terms, they describe unique stories and provide context beyond what a single term can provide. When the originating dataset is queried, tuples within each coalesced tuple should be used to ensure the accuracy of matching documents.

CT-X output is an array of coalesced tuples. Scoring and original tuples are kept intact to allow detailed document queries of the source datasets. An example excerpt of CT-X output is below.

- 510 coalesced tuples formed from 39045 tuples. 60 coalesced tuples after filtering.
- 10,651,358 documents in source dataset
- 19,485 documents in target dataset (days=1)
- 10,631,873 documents in baseline dataset (days=364)
- Effective Tuple Threshold=10.0
- LVK Threshold=4.46
- Raw Tuple Count=3,182,390
- Used Tuple Count=39,045
- Coalesced Tuple Count=510
- Coalesced Tuple Count after filtering=60
  Coalesced Tuple 1
- [abortion¹⁴⁸⁵⁶, texas¹⁴⁷²⁰, woman¹⁴⁵⁹⁷, state¹⁴²⁶², leaves¹³⁸⁴⁴, sought¹³⁵⁵², procedure¹³⁴⁵², attorneys¹³⁴²⁵, court¹³³¹⁷, permission¹³²⁵⁷, pregnant¹¹⁸⁵³, challenge¹¹⁴⁸⁸, bans¹¹³⁸², unprecedented¹¹¹⁷⁹, usa¹¹⁰⁶⁷, restrictive¹⁰⁹⁶⁶, left⁹⁴⁹⁰, obtain⁹³⁰⁴, monday⁶⁷⁰⁵, cox³⁰³⁹, kate²⁹⁷⁶, ruling⁴⁰¹, texas_supreme_court¹¹⁸, emergency⁵⁴, ban⁴², leave¹⁰]⁵⁹⁸³⁹
- [abortion, state, texas, woman]⁵⁶, [abortion, cox, kate, texas]⁴⁶, [abortion, cox, texas, woman]³⁸+4203 more tuples
  Coalesced Tuple 2
- [Donald_Trump⁷⁹⁰³, supreme_court⁷⁸²⁸, special_counsel⁷⁵⁰⁶, Jack_Smith⁶⁸¹¹, asks⁵⁹²¹, rule⁵⁷⁶¹, prosecuted⁵³⁷³, election⁵²³⁴, charges⁵¹⁵⁶, overturn⁵¹²⁴, quickly⁴⁸⁰⁶, plotted⁴⁷¹⁹, asked⁴⁵⁶², results³⁸⁰⁰, monday²⁹⁶², immunity¹¹¹⁸, decide⁵⁷⁷, case⁵¹⁶, prosecution⁴²³, asking³⁵⁹, criminal¹⁰⁹, immune⁷⁷, presidential⁵⁶, federal⁵³, claim⁴¹, crimes⁴¹, scotus³⁶, usa³³, request³⁰, interference²², trial¹⁷, court¹⁰]²¹⁷⁴⁶
- [Donald_Trump, Jack_Smith, special_counsel, supreme_court]⁷³, [asks, Donald_Trump, special_counsel, supreme_court]⁵⁵, [asks, Donald_Trump, Jack_Smith, supreme_court]⁴⁶+1581 more tuples
  Coalesced Tuple 3
- [faculty¹⁰⁰¹, antisemitism⁹⁹⁶, harvard⁹⁵¹, Claudine_Gay⁹³⁵, president⁷⁷⁸, university_president⁷¹⁵, remarks⁴¹⁶, rallies³⁸³, members³⁸², criticized³⁷², aid³⁶³, hundreds²⁰⁵, university¹⁸⁵, calls¹⁷⁴, urging¹⁴¹, congressional¹⁰⁸, hearing¹⁰⁸, comments⁵⁰, signed⁴², lawmakers⁴⁰, amid²¹, faces²¹, testimony¹¹, donors¹⁰, resign¹⁰, school¹⁰]²¹⁰⁷
- [antisemitism, Claudine_Gay, harvard, president]²², [calls, Claudine_Gay, harvard, president]²⁰, [Claudine_Gay, faculty, harvard, president]¹⁷+188 more tuples
- +57 more coalesced tuples

The cluster processing 200 of FIG. 2 is the first operation in a closed loop process 300 of FIG. 3. FIG. 3 illustrates operations performed by thread processor 146. The thread processor 146 is used to construct a thread database 148 and a snapshot database 150.

Each thread is a container for combining new content with cluster processing output. The new content is subject to cluster processing 200 performed by clustering module 144. Thread creation is then initiated 302. That is, new coalesced data is integrated with existing coalesced data. Consequently, threads keep track of content from multiple output rounds of cluster processing 200.

In the example of reporting of events in the news cycle, threads need to be dynamic and malleable. Events can split and multiple events can merge into a single event depending on the dynamics of the evolving event and how it is being reported.

Threads attempt to provide the flexibility to achieve this, along with the ability to self-correct as more data is collected and a more accurate statistical analysis can be performed.

The initial goal of threads is to create a forward-only path for each event. The events represented by coalesced tuples are either merged/evolved with existing threads to keep them current, or if no existing thread can be found, a new one is created.

When no active (non-archived) thread exists, a new thread is created from each input coalesced tuple and the following data is stored:

- Coalesced tuple aggregate terms and their scores. E.g. [a²⁰⁹⁰, b²⁰⁹⁰, c⁹⁸⁸, d⁶⁶⁷, e⁴²⁰]
- Coalesced tuple aggregate score. E.g. 2090
- Tuples used to form the coalesced tuple and their individual scores. E.g. [a, b, c, d]⁴⁵, [b, c, d, e]³²
- Timestamp of thread creation

When active (non-archived) threads exist, each new input coalesced tuple is compared against all of them. To evolve an existing active thread, criteria such as the following may be used:

- Must have at least 1 common LVK
- Must have at least 50% common terms
- Must have at least 4 common terms OR
- Simply have at least 80% common terms (instant match).
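A sketch of such a comparison follows. The criteria can be combined in more than one way. This sketch assumes that the first three conditions must hold together, that the 80% condition alone is sufficient, and that the percentage of common terms is measured against the smaller of the two term sets. All of these are assumptions, as are the names `overlap` and `should_evolve`.

```python
def overlap(new_terms, thread_terms, lvks):
    """Measure how much a new coalesced tuple shares with an existing thread."""
    new_set, thread_set = set(new_terms), set(thread_terms)
    common = new_set & thread_set
    smaller = min(len(new_set), len(thread_set))
    return {
        "common_terms": len(common),
        "common_lvks": len(common & set(lvks)),
        # Assumption: the fraction is relative to the smaller term set.
        "common_fraction": len(common) / smaller if smaller else 0.0,
    }


def should_evolve(new_terms, thread_terms, lvks):
    """Decide whether the new coalesced tuple should evolve the existing thread."""
    stats = overlap(new_terms, thread_terms, lvks)
    if stats["common_fraction"] >= 0.8:
        return True  # instant match
    return (
        stats["common_lvks"] >= 1
        and stats["common_fraction"] >= 0.5
        and stats["common_terms"] >= 4
    )
```

Returning the raw overlap statistics separately makes it straightforward to try other combinations of the criteria.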

If none of the criteria to evolve an existing active thread are met, it is assumed that there is no existing related thread, and a new active thread is created. Consider the following example:

- [Donald_Trump¹⁷⁴³⁷, supreme_court¹⁷²⁶³, case¹⁷¹¹⁴, hear¹⁶⁹¹⁴, hundreds¹⁶⁵⁹¹, charge¹⁶⁴⁶⁰, capitol_riot¹⁶⁰²¹, including¹⁵⁸⁴⁴, undo¹⁵⁵⁵⁷, obstruction¹¹⁰⁵⁸, proceeding¹⁰⁶⁹³, official¹⁰⁴³⁸, defendants¹⁰²⁶⁶, accused¹⁰⁰⁹⁸, review⁹⁹⁷¹, justices⁹⁹²⁴, appellate⁹⁷⁷⁹, ruling⁹⁶⁷⁸, revived⁹⁶⁵⁶, wednesday⁶⁸⁰⁴, charges⁶³⁹⁵, upend⁶¹⁷⁷, stemming⁵⁸⁰⁶, appeal⁵³¹⁰, trial⁴⁶, agreed³⁷, election²⁶, usa²¹, president¹³, common¹², mifepristone¹², charged¹¹]⁷⁰³⁵⁸
- **Match found using 1 LVK+4 common terms**
- [Donald_Trump⁷⁹⁰³, supreme_court⁷⁸²⁸, special_counsel⁷⁵⁰⁶, Jack_Smith⁶⁸¹¹, capitol_riot⁶⁵⁹⁰, asks⁵⁹²¹, rule⁵⁷⁶¹, prosecuted⁵³⁷³, election⁵²³⁴, charges⁵¹⁵⁶, overturn⁵¹²⁴, quickly⁴⁸⁰⁶, plotted⁴⁷¹⁹, asked⁴⁵⁶², results³⁸⁰⁰, monday²⁹⁶², immunity¹¹¹⁸, decide⁵⁷⁷, case⁵¹⁶, prosecution⁴²³, asking³⁵⁹, criminal¹⁰⁹, immune⁷⁷, presidential⁵⁶, federal⁵³, claim⁴¹, crimes⁴¹, scotus³⁶, usa³³, request³⁰, interference²², trial¹⁷, court¹⁰]²¹⁷⁴⁶

Therefore, the existing thread is evolved and the coalesced tuple linked to the thread is replaced.

If a thread has not evolved within a certain time (e.g., 72 hours), the thread is archived. Once a thread is archived, it can no longer be evolved and is excluded from all thread processing.

The state of the thread is retained after archiving.

As the coalesced tuples were generated from the source dataset, the tuples contained within the coalesced tuples (threads) can be used to query the source dataset for documents. Content population and activation is performed every time the thread loop is completed.

Tuples/quads are updated after every thread evolution for each thread. These tuples are used to match content in the source dataset. A simple match is where all terms in a quad are found within a document. Consider the examples below:

- [curb_your_enthusiasm, end, Larry_David, season]¹³,
- [curb_your_enthusiasm, ending, Larry_David, season]¹³,
- [curb_your_enthusiasm, hbo, Larry_David, season]¹²
- **all 3 tuples are found in**
- https://ohnotheydidnt.livejournal.com/127425109.html
- curb_your_enthusiasm ending with season 12 as Larry_David says i bid you farewell
- curb_your_enthusiasm is ending with season 12 which premieres February 4 on hbo and max the 10 episode season will conclude with a series_finale on April 7 Larry_David says as curb comes to an end i_will now have the opportunity to finally shed this larry
- https://deadline.com/2023/12/curb-your-enthusiasm-officially-ending-season-12-hbo-1235668008/
- curb_your_enthusiasm officially ending with season 12 on hbo—this is pretty pretty pretty sad news curb_your_enthusiasm is officially coming to an end with season 12 on hbo every season creator and star Larry_David says that he ending the show but this
- https://www.dailystar.co.uk/showbiz/us-showbiz/breaking-curb-your-enthusiasm-axed-31677141
- curb_your_enthusiasm axed after 12 seasons as fans mourn end of an era-after 12 years on air hbo fan favourite sitcom curb_your_enthusiasm has been axed with writer and comedian of the show Larry_David announcing he is ending the much loved series
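A simple match of this kind can be sketched as follows, assuming documents are pre-processed, whitespace-separated strings and that term comparison is exact (including case). The function names are illustrative.

```python
def matches_tuple(document_text, tuple_terms):
    """True when every term of the tuple is present in the document."""
    tokens = set(document_text.split())
    return all(term in tokens for term in tuple_terms)


def matches_thread(document_text, thread_tuples):
    """True when at least one complete tuple of the thread is found in the document."""
    return any(matches_tuple(document_text, terms) for terms in thread_tuples)
```

Applied to the headline "curb_your_enthusiasm ending with season 12 as Larry_David says i bid you farewell", the quad [curb_your_enthusiasm, ending, Larry_David, season] matches, while [curb_your_enthusiasm, hbo, Larry_David, season] does not because hbo is absent from the headline alone.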

Only 1 complete tuple is generally required for a match. Singles, doubles, and triples are by-products of the quad generation process. Content that is not a direct match with quads within the thread can be discovered via singles, doubles and triples that are:

- Determined to be trending when the target frequency is greater than baseline frequency
- Where the “safe” singles, doubles, or triples all match terms in the aggregate terms of the thread (which is all the unique terms found in the coalesced tuples)
- [curb_your_enthusiasm⁷⁴, Larry_David⁵⁶, season⁵⁶, hbo⁴⁸, end³¹, ending³¹]⁷⁴

From these aggregate tuple terms, the following trending tuples were found:

- Safe Trending Singles→[curb_your_enthusiasm]_112.9, [Larry_David]_83.6
- Safe Trending Doubles→[curb_your_enthusiasm, Larry_David]_99.3, [curb_your_enthusiasm, hbo]_79.8, [hbo, Larry_David]_59.1
- Safe Trending Triples→[curb_your_enthusiasm, hbo, Larry_David]_81.1

The baseline vs target values for these terms are:


Term	Baseline Frequency	Target Frequency	Trending Score

curb_your_enthusiasm	0.6	30	112.91
Larry_David	0.7	24	83.56
hbo	25.7	26	0

The “safe” component to these trending singles, doubles and triples is important. The safe definitions are:

- Singles=VLB
- Doubles=1 LB and no HB
- Triples=1 LB and no HB

VLB=Very Low Baseline and is defined as terms with a baseline below:
VLB threshold=Average frequency of baseline term+(0.1*standard deviation of baseline terms)

LB=Low Baseline and is defined as terms with a baseline below:
LB threshold=Average frequency of baseline term+(0.5*standard deviation of baseline terms)

HB=High Baseline and is defined as terms with a baseline above:
HB threshold=Average frequency of baseline term+(10*standard deviation of baseline terms)

For reference, the LVK threshold is defined for terms with a baseline below:
LVK threshold=Average frequency of baseline term+(0.75*standard deviation of baseline terms)
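The threshold definitions above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and "1 LB" is read as "at least one LB term". The three baseline frequencies from the table above stand in for a full baseline population, which is a toy simplification.

```python
from statistics import mean, pstdev

def baseline_thresholds(baseline_freqs):
    """Derive the VLB, LB and HB thresholds from the baseline term frequencies."""
    avg = mean(baseline_freqs)
    sd = pstdev(baseline_freqs)  # assumption: population standard deviation
    return {"VLB": avg + 0.1 * sd, "LB": avg + 0.5 * sd, "HB": avg + 10 * sd}

def is_safe(tuple_baselines, thresholds):
    """Apply the safe rules to the baseline frequencies of a tuple's terms."""
    if len(tuple_baselines) == 1:
        # Singles must be Very Low Baseline.
        return tuple_baselines[0] < thresholds["VLB"]
    # Doubles and triples: at least one Low Baseline term (assumed reading
    # of "1 LB") and no High Baseline term.
    has_low = any(f < thresholds["LB"] for f in tuple_baselines)
    has_high = any(f > thresholds["HB"] for f in tuple_baselines)
    return has_low and not has_high

# Toy population: only the three baseline frequencies listed in the table.
thresholds = baseline_thresholds([0.6, 0.7, 25.7])
print(is_safe([0.6], thresholds))        # True  (curb_your_enthusiasm)
print(is_safe([25.7], thresholds))       # False (hbo)
print(is_safe([0.6, 25.7], thresholds))  # True  (curb_your_enthusiasm, hbo)
```

Under these assumptions the result agrees with the lists above: hbo does not qualify as a safe single, yet it can appear in a safe double or triple alongside a low-baseline term.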

The trending score for a single is calculated using the baseline frequency and target frequency for an individual term [n].
TrendingScore_N=(TargetFreq_N−BaselineFreq_N)*ln(1−(TargetFreq_N/BaselineFreq_N))

For doubles and triples, the trending score of the tuple is the RMS (root-mean-square) of the trending scores of all the singles that it contains.

For example:

- Trending Score Single [a]=TrendingScore_A
- Trending Score Double [a, b]=SQRT ((TrendingScore_A^2+TrendingScore_B^2)/2)
- Trending Score Triple [a, b, c]=SQRT ((TrendingScore_A^2+TrendingScore_B^2+TrendingScore_C^2)/3)
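As a quick check of the RMS rule, the following sketch recomputes the double and triple scores from the single trending scores listed in the table above. The function name `tuple_trending_score` is hypothetical and chosen only for illustration.

```python
import math

def tuple_trending_score(single_scores):
    """Root-mean-square of the trending scores of the singles in a tuple."""
    return math.sqrt(sum(s * s for s in single_scores) / len(single_scores))

# Single trending scores taken from the baseline vs target table.
curb, larry, hbo = 112.91, 83.56, 0.0

double = tuple_trending_score([curb, larry])
triple = tuple_trending_score([curb, hbo, larry])
print(round(double, 1), round(triple, 1))  # 99.3 81.1
```

The results match the values listed for [curb_your_enthusiasm, Larry_David] and [curb_your_enthusiasm, hbo, Larry_David]. A term with a trending score of 0, such as hbo, lowers the score of any tuple that contains it.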

The next operation of FIG. 3 is thread scoring 304. A thread score is calculated from the active content that matches the thread. Two types of scores are calculated to evaluate the size of the pattern and its relevance in the near term.

- Content volume
- Content velocity

The content volume is simply the number of documents that match the coalesced tuple. For example, 100 documents=100 points. The content velocity is also a sum over all matching documents, but each document's contribution is weighted by its age. A document of age 0 is given 100% of its score. A document aged 24 hours (or the maximum score age) is given 0% of its score.

Various content velocity curve filters may be used. A parabolic curve ages quickly, then slowly (25% age=50% score, 75% age=25% score). A linear filter ages linearly (50% age=50% score). A tan curve ages quickly, then slowly, then quickly again. A constant filter does not age (always 100% score).
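The age weighting can be sketched as follows for the two filters whose shape is fully specified above (linear and constant). The sketch assumes each document is worth one point before weighting, consistent with the content volume example. The function names and the sample ages are hypothetical.

```python
def linear_filter(age_fraction):
    """Linear ageing: 50% of the maximum age keeps 50% of the score."""
    return max(0.0, 1.0 - age_fraction)

def constant_filter(age_fraction):
    """No ageing: every document inside the window keeps 100% of its score."""
    return 1.0

def content_velocity(ages_hours, max_age_hours=24.0, curve=linear_filter):
    """Sum the age-weighted score of every document within the age limit."""
    total = 0.0
    for age in ages_hours:
        if 0.0 <= age <= max_age_hours:
            total += curve(age / max_age_hours)
    return total

ages = [0.0, 6.0, 12.0, 24.0, 30.0]  # hypothetical document ages in hours
print(content_velocity(ages))                         # 2.25
print(content_velocity(ages, curve=constant_filter))  # 4.0
```

The document aged 30 hours falls outside the 24 hour window and contributes nothing under either filter. Other curves, such as the parabolic or tan filters, could be passed in through the same `curve` parameter.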

The thread score at any given moment is the sum of the thread score curve up to the age limit for content. For example, a thread score could be defined as:

- Parabolic score curve.
- Max 48 hours.

When threads are output, they are sorted in order of thread score (highest to lowest). Consider the following example.


Thread Id	Title	Aggregate Terms	Score (0-10)	Documents

2382437	ukraine	achieved, achieves, americans, annual, asked, . . .	6.900016616	1189
2346502	gaza_city	ally, ambush, attacks, battles, calls, ceasefire, . . .	6.195291328	1907
2365087	Andre_Braugher	actor, age, Andre_Braugher, brief, brooklyn_ninenine, . . .	6.050925059	386
2383500	Donald_Trump	according, adults, americans, apnorc, bidentrump, . . .	5.959255938	810
2382436	defense bill	$886, act, annual, authorization, authorizes, big, . . .	5.853076967	103
2385985	curb_your_enthusiasm	curb_your_enthusiasm, end, ending, final, hbo, . . .	5.741705452	35
2385522	six years	ago, alex, alive, batty, boy, british, france, missing, . . .	5.472362973	36
2361598	Rudy_Giuliani	attorney, begin, case, damages, decide, deciding, . . .	5.237168296	195
2384832	new_jersey	bull, commuters, delays, loose, morning, new_jersey, . . .	5.134232245	28
2383605	Taylor_Swift	34th, birthday, Blake_Lively, new_york_city, night, . . .	5.094833372	61
2384018	guyana	guyana, leaders, longstanding, meet, region, . . .	5.071720406	36

The next processing operation of FIG. 3 is thread merging 306. As the output can vary considerably from cycle to cycle, some additional threads may be created as the ebbs and flows of the news cycle (or equivalent data source) play out. Each thread is referenced by a unique numeric identifier. Threads are generally independent of each other; however, it is not uncommon for them to be associated in a parent/child relationship. This is done when 2 or more threads are found to be sharing a significant amount of content. A child thread is defined by it having a parent thread associated with it. A parent thread is simply a thread which is not a child thread.

A test is performed on all active threads using the content that has been associated with them. If greater than 50% of a thread's content is contained within another thread, it will be merged with it (and will become a child thread of the parent it merged with).

This test is completed after threads have been populated with content.

Every thread cycle, existing parent/child relationships from previous cycles are ignored, and all are re-evaluated. It is common for the same parent/child relationships to be reestablished each cycle. Some variance between cycles does occur, however.

Threads that have been marked to be merged are arranged in order of size (from most content to least). The largest thread is chosen as the parent, and all other matching threads are chosen as children of the parent. The parent is the only thread shown in queries until the next round where all threads are re-evaluated, and the parent may change.
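The merge test can be sketched as follows. This is an illustrative simplification under stated assumptions: threads are visited from most content to least, and each thread is compared only with the parents selected so far, which is sufficient because a smaller thread always has at least as large a share of its content in the overlap as the larger one. The function name, thread IDs, and document IDs are hypothetical.

```python
def merge_threads(threads, threshold=0.5):
    """Map every thread ID to its parent ID (a parent maps to itself).

    threads: dict of thread ID -> set of document IDs matched by the thread.
    """
    # Largest threads first, so the largest thread of a group becomes the parent.
    ordered = sorted(threads, key=lambda tid: len(threads[tid]), reverse=True)
    parent_of = {}
    parents = []
    for tid in ordered:
        docs = threads[tid]
        match = None
        for pid in parents:
            # Merge when more than 50% of this thread's content is in the parent.
            if docs and len(docs & threads[pid]) / len(docs) > threshold:
                match = pid
                break
        if match is None:
            parents.append(tid)
            parent_of[tid] = tid
        else:
            parent_of[tid] = match
    return parent_of

threads = {
    101: {"d1", "d2", "d3", "d4", "d5"},
    102: {"d2", "d3", "d9"},
    103: {"d7", "d8"},
}
print(merge_threads(threads))  # {101: 101, 102: 101, 103: 103}
```

Thread 102 shares two of its three documents with thread 101, so it becomes a child of 101. Thread 103 shares nothing and remains a parent. Because the function takes only the current content as input, calling it every cycle re-evaluates all relationships from scratch, as described above.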

The merged threads are stored in a thread database 308. Snapshots of the thread database are then generated 310. The snapshots are then stored in a snapshot database 312.

Threads are always kept up-to-date based on the latest cluster processing outputs. Historical thread state is retained using snapshots. A snapshot is a copy of all active threads at the end of a thread cycle. Slices are used as a container to store the state of all active parent threads at that time. Snapshots are date/time stamped.

A slice is a container that can be used to specify a source dataset, or simply to store a snapshot of another slice. Slices based on a dataset are updated every thread cycle. Slices used for storing a snapshot are immutable once created. As only active parent threads are copied, a copy of all child thread IDs (including the parent thread ID) is also stored, retaining a reference to the original threads that can link snapshot threads to threads in a neighboring snapshot.

A snapshot may contain the following components:

- Thread title
- Thread scores
- Active content
- Snapshot time
- Original thread creation time
- List of all parent/child thread IDs

The client module 122 may be used to query the snapshot DB 150. Snapshot querying is often used to obtain the latest complete output of thread processing, content, and scoring. Threads contained in the output of a snapshot query are output in order of highest to lowest thread score. The thread score indicates both the strength of the pattern and relevance of the thread calculated from recency and volume of content when the snapshot was taken. The thread score may be normalized to a range between 0 and 10. Consider the following example.


SSThread ID	Thread Title	Terms	Score (0-10)

2388836	ukraine	1st, achieved, achieves, americans, annual, asked, . . .	7.03
2388843	gaza_city	ally, ambush, battles, calls, ceasefire, crushes, . . .	6.36
2388854	Andre_Braugher	actor, age, Andre_Braugher, brief, brooklyn_ninenine, . . .	6.09
2388861	defense bill	$886, act, annual, authorization, authorizes, big, . . .	6.03
2388865	Donald_Trump	according, adults, americans, apnorc, bidentrump, . . .	5.97
2388871	curb_your_enthusiasm	curb_your_enthusiasm, end, ending, final, hbo, . . .	5.92
2388875	six years	ago, alex, alive, batty, boy, british, france, missing, . . .	5.65
2388879	Draymond_Green	center, Draymond_Green, face, forward, . . .	5.65
2388883	new_jersey	bull, commuters, delays, loose, morning, new_jersey, . . .	5.32
2388887	Sidney_Powell	apology, case, chesebro, deals, decide, election, . . .	5.28

Snapshots can be queried by:

- Specific Snapshot ID (Snapshot Slice ID)
- Snapshot closest to timestamp
- List of available snapshots

Results returned for specific snapshot queries:

- Snapshot ID
- List of snapshot threads
  - Snapshot Thread ID
  - Title
  - Score
  - List of documents
    - URL
    - Title
    - Body

Results returned for list of available snapshots:

- List of snapshots
  - Snapshot Slice ID
  - Snapshot Thread Count
  - Timestamp

Returning to FIG. 1, a story processor 152 combines snapshots in the snapshot database 150 to form a story that is stored in a story database 154. FIG. 4 illustrates story processing 400 performed by the story processor 152. The thread processing 300 of FIG. 3 is a first operation of a closed loop process in which snapshots are combined into a story 402 and then the story is recorded in the story database 404.

Continuities connect all thread snapshots into a continuous story from start to finish. As original/source threads are forward-only, previous states are not accessible. Snapshots provide a solution to access previous states. Connecting thread state via snapshots is not always a straightforward calculation, as one thread is often made up of a parent and child threads, where the parent thread can easily change to a child thread (and vice-versa) between cycles.

Stories bind conversations between snapshots/threads, which allows for a much grander perspective over time. Stories also have the benefit of not being a real-time/one-off calculation. Stories can benefit from hindsight, and fix anomalies that occurred in the past automatically, using data that occurred in the future with respect to the original thread.

The story processor 152 detects “continuities” by loading all snapshots and looking for the next thread in the sequence from one snapshot to the next. When continuities have been calculated in their entirety, they are saved as stories (continuities are incomplete or non-final stories).

FIG. 5 illustrates different coalesced tuples 500 forming a set of parent threads 502, which then form different snapshots 504_1 through 504_N. As shown in FIG. 5, when snapshots are generated, the original thread IDs (Source Thread IDs) are recorded; this includes the parent thread ID and the child thread IDs associated with the parent. FIG. 5 also shows how snapshots 506, 508, 510 and 512 are combined into a single story.

The snapshot copy retains all the original parent/child thread IDs (Source Thread IDs) at that moment, so even if the parent and child get swapped in a subsequent snapshot, the relationship can be re-formed.

By keeping track of the constantly evolving parent/child thread IDs, the complex relationship between similar threads can be distilled down to a single story by evaluating Source Thread IDs.

The continuity is calculated from snapshot to snapshot by trying to match the best matching thread moving forward. This is done by selecting the thread in the subsequent snapshot that matches the most Source Thread IDs. This is shown conceptually in FIG. 6.

Consider a list of threads in Snapshot A. Each thread is compared against all threads in the next snapshot (Snapshot B). The thread that contains the highest percentage of overlapping Source Thread IDs is chosen as the winner, which is the top row of FIG. 6. If no match can be made from a thread in Snapshot A to Snapshot B, the continuity ends. If a thread in Snapshot B was not matched with a thread from Snapshot A, a continuity is created.

More complex parent/child thread relationships can exist in source threads. An example of this is below, where 3 parent threads each have many child threads that overlap over the 3 snapshots. The result is 2 distinct continuities.

- Snapshot A=[Thread_SA101: 1, 5, 6, 7, 8], [Thread_SA102: 2, 9, 10, 11], [Thread_SA103: 12, 13, 14, 15]
- Snapshot B=[Thread_SB201: 1, 2, 5, 6, 7, 8], [Thread_SB202: 9, 10, 11], [Thread_SB203: 12, 13, 14, 15]
- Snapshot C=[Thread_SC301: 1, 2, 5, 6, 7, 8, 9, 10, 11], [Thread_SC303: 12, 13, 14, 15]
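The forward matching step can be traced on this example with a short sketch. The share of a thread's own Source Thread IDs found in the candidate thread is used as the "percentage of overlapping Source Thread IDs"; that denominator is an assumption, as the text does not specify it. The function name `best_match` is hypothetical.

```python
def best_match(source_ids, next_snapshot):
    """Return the ID of the thread in next_snapshot that shares the largest
    fraction of source_ids, or None when no thread overlaps."""
    if not source_ids:
        return None
    best_id, best_share = None, 0.0
    for thread_id, ids in next_snapshot.items():
        share = len(source_ids & ids) / len(source_ids)
        if share > best_share:
            best_id, best_share = thread_id, share
    return best_id

snapshot_a = {"SA101": {1, 5, 6, 7, 8}, "SA102": {2, 9, 10, 11}, "SA103": {12, 13, 14, 15}}
snapshot_b = {"SB201": {1, 2, 5, 6, 7, 8}, "SB202": {9, 10, 11}, "SB203": {12, 13, 14, 15}}
snapshot_c = {"SC301": {1, 2, 5, 6, 7, 8, 9, 10, 11}, "SC303": {12, 13, 14, 15}}

a_to_b = {tid: best_match(ids, snapshot_b) for tid, ids in snapshot_a.items()}
b_to_c = {tid: best_match(ids, snapshot_c) for tid, ids in snapshot_b.items()}
print(a_to_b)  # {'SA101': 'SB201', 'SA102': 'SB202', 'SA103': 'SB203'}
print(b_to_c)  # {'SB201': 'SC301', 'SB202': 'SC301', 'SB203': 'SC303'}
print(len(set(b_to_c.values())))  # 2
```

Under this reading, the chains that start at SA101 and SA102 both arrive at SC301, while the chain from SA103 stays separate, which is consistent with the 2 distinct continuities stated above.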

Continuity compression merges multiple continuities into a single continuity under the following circumstances:

- When a continuity's pool of source threads (the list of all source threads from all snapshot threads in the continuity) is wholly contained within another story
- When a story is partially contained within another story; exact overlap percentage requirements can vary.

Two merge parameters can be set:

- 1. Minimum overlap merge percentage (default=100%)
  - |A intersect B|/Min (|A|, |B|)
    - Number of IDs common to A and B, divided by the smaller of the ID counts of A and B.
- 2. Minimum aggregate overlap merge percentage (default=50%)
  - |A intersect B|/|A union B|
    - Number of IDs common to A and B, divided by the total number of distinct IDs contained within A and B.
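Both ratios are straightforward set computations. The sketch below evaluates them for two pools of Source Thread IDs. The function names are hypothetical, and how the two parameters are combined into a final merge decision is not specified here, so the sketch only computes the ratios.

```python
def min_overlap(a, b):
    """Common IDs divided by the size of the smaller pool."""
    return len(a & b) / min(len(a), len(b))

def aggregate_overlap(a, b):
    """Common IDs divided by the number of distinct IDs across both pools."""
    return len(a & b) / len(a | b)

pool_a = {1, 2, 5, 6, 7, 8}
pool_b = {1, 2, 5, 6, 7, 8, 9, 10, 11}
print(min_overlap(pool_a, pool_b))                  # 1.0
print(round(aggregate_overlap(pool_a, pool_b), 3))  # 0.667
```

Here pool_a is wholly contained in pool_b, so the first ratio reaches the 100% default, and the second ratio (about 67%) exceeds the 50% default.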

Where possible, existing story IDs are used to create a persistent reference point for stories. When story IDs are archived, the new story ID is referenced to provide a redirect path (where 2 or more stories became a single story). The first thread in the continuity is used to extract a story ID. If that thread is not linked to an existing story ID, a new story ID is created. All snapshot threads within that continuity branch are updated with the same story ID taken from the first thread or from the newly created story. This is depicted in FIG. 9.

Stories are defined by the snapshot threads associated with them. Therefore, calculated attributes may change after each story loop if the snapshot threads associated with it were altered.

In one embodiment, basic story attributes include:

- Story ID
- Associated snapshot thread IDs

Calculated story attributes derived from the associated snapshot thread IDs may include:

- Aggregate document scores
  - Including a relevance calculation for each document based on the number of thread snapshots associated with the story that contain the document.

Example: Document list returned for associated snapshot thread IDs for story.


Snapshot Thread ID	Document ID

1000	15FB5B74-335A-47FF-896D-E09B1A3222FE
1000	33819D7F-AE8D-4806-912F-7BD80F3EF79A
1000	98946136-F234-4A10-97AF-1F87BC2E134D
1001	15FB5B74-335A-47FF-896D-E09B1A3222FE
1001	33819D7F-AE8D-4806-912F-7BD80F3EF79A
1001	98946136-F234-4A10-97AF-1F87BC2E134D
1001	B140C5C9-8AE8-4D09-B332-7C5175DF3C7A
1002	15FB5B74-335A-47FF-896D-E09B1A3222FE
1002	98946136-F234-4A10-97AF-1F87BC2E134D
1002	B140C5C9-8AE8-4D09-B332-7C5175DF3C7A

Aggregate Story Document Output:


Document ID	Score

15FB5B74-335A-47FF-896D-E09B1A3222FE	3
33819D7F-AE8D-4806-912F-7BD80F3EF79A	2
98946136-F234-4A10-97AF-1F87BC2E134D	3
B140C5C9-8AE8-4D09-B332-7C5175DF3C7A	2
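The aggregate output above can be reproduced by counting, for each document, the number of snapshot threads that contain it. The sketch below does this for the example rows. The variable names are hypothetical.

```python
from collections import Counter

# (snapshot thread ID, document ID) rows from the example document list.
rows = [
    (1000, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1000, "33819D7F-AE8D-4806-912F-7BD80F3EF79A"),
    (1000, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1001, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1001, "33819D7F-AE8D-4806-912F-7BD80F3EF79A"),
    (1001, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1001, "B140C5C9-8AE8-4D09-B332-7C5175DF3C7A"),
    (1002, "15FB5B74-335A-47FF-896D-E09B1A3222FE"),
    (1002, "98946136-F234-4A10-97AF-1F87BC2E134D"),
    (1002, "B140C5C9-8AE8-4D09-B332-7C5175DF3C7A"),
]

# Deduplicate the pairs so a document counts once per snapshot thread.
document_scores = Counter(doc_id for _, doc_id in set(rows))

for doc_id, score in sorted(document_scores.items()):
    print(doc_id, score)
```

Documents that persist across more thread snapshots of the story receive a higher score, which is what allows the score to serve as a relevance measure.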

- Aggregate coalesced tuple term scores
  - The aggregate of all the aggregate coalesced tuples contained within the snapshot threads.

FIG. 10 illustrates how the aggregate coalesced tuple terms for Thread Snapshots A and B are summed to produce _AGGScore and _AGGTerm.

- Number of documents by snapshot timestamp
  - The number of documents in the story over time
  - {[_SATimestamp, _SAThread.Documents.Count( )], [_SnTimestamp, _SnThread.Documents.Count( )] . . . }
- Thread scores by snapshot timestamp
  - The score of the story over time
  - {[_SATimestamp, _SAThread.Score], [_SnTimestamp, _SnThread.Score] . . . }
- Active
  - Active in most recent snapshot (an active story)
  - Story_n.Active=Story_n.ThreadIds.Contains(Snapshot_Latest.ThreadIds)

In one embodiment, a single story contains a packet of the following data:

- Unique identifier (Story ID)
- Coalesced Tuples (in aggregate form)
- Associated Documents
- Score (0-10)
- Active Flag

Story output combines the documents that match the story, along with the identity of the story defined by the coalesced tuples found in the associated thread snapshots. A story can represent an active (present) or historical view of the source dataset. The active stories are identified by the “active” attribute. The most recent score and number of documents for a given story is found by looking at the Scores and NumDocuments attributes and finding the most recent Timestamp and the score/value associated with it.


Story ID	243556
AggregateCoalescedTupleTermScores	[[volcano, 2090851], [iceland, 2090280], [erupts, 1999341], [reykjanes, 1960981], [peninsula, 1931228], [weeks, 1885293], [sky, 1780927], [volcanic_eruption, 1760769], [orange, 1746350], [evacuated, 1743913], [monday, 1731789], [alert, 1721882], [high, 1719710], [thousands, 1714571], [night, 1708499], [turning, 1682191], [prompting, 1615217], [started, 1586429], [civil_defense, 1446110], [town, 990795], [country, 440009], [grindavik, 104255], [eruption, 67571], [activity, 32084], [lava, 12861], [seismic, 12520], [meteorological, 4872], [office, 4830], [erupted, 4552], [spewing, 4366], [earthquakes, 3097], [southwest, 2070], [intense, 1590], [flights, 924], [scientists, 726], [nearby, 465], [following, 426], [southwestern, 408], [affect, 234], [near, 186], [rock, 144], [icelandic, 108], [magma, 72], [earthquake, 72]]
AggregateDocumentScores	Dictionary of <Document, Score>, where score is a number (see Document object), e.g. {[15FB5B74-335A-47FF-896D-E09B1A3222FE, 3], . . . }
NumDocuments	Dictionary of <Timestamp, Count>, e.g. {[202312190330, 74], [202312190345, 81], . . . }
Scores	Dictionary of <Timestamp, Score>, e.g. {[202312190330, 6.3], [202312190345, 6.5], . . . }
CreatedTimestamp	202312190330
LastTimestamp	202312190345
Active	True

A document object contained within the <Document, Score> dictionary has the following fields:


DocumentID	2d214187-62b3-4f3e-93c0-5a6a0ef2b9f4
URL	https://www.latimes.com/world-nation/story/2023-12-18/iceland-volcano-erupts-weeks-after-thousands-were-evacuated-from-a-town-on-reykjanes-peninsula
Title	Iceland volcano erupts weeks after thousands evacuated
Body	A volcanic eruption began Monday night on Iceland's Reykjanes Peninsula, turning the sky orange and prompting the country's civil defense to be on high alert
Timestamp	202312190132

Thus, the story DB 154 has stories that are the culmination of the disclosed production line. The disaggregated documents found in the source dataset are automatically organized into a digestible collection of distinct stories that are continually updated as the source dataset is updated. These distinct stories are made up of tuples that merged into groups (coalesced tuples) due to one or more of their components trending over the baseline of the dataset. As these components were trending over the target range of the dataset compared to the baseline, they are considered statistically significant and are used to merge tuples together. This allows a safe meshing of common and uncommon terms into a single story, even though stories will often have common terms.

As the tuples were derived from the source dataset, the tuples within each coalesced tuple will match with the documents that they were originally derived from. After the source dataset has been processed, what is output is an automatically organized dataset of stories, which in turn reference documents from the original source dataset. When compared to the original dataset, each story is far easier to digest by machine. Documents that do not follow a pattern identified by a story can be ignored, and as each document is scored for relevance, quality filters can be used to further optimize ingestion of documents without having to ingest the entire story.

FIG. 7 conceptually depicts the processing disclosed herein. On the left are disaggregated documents of the unstructured DB 142. This collection of documents is large and unorganized. On the right side of the figure is a conceptual depiction of the story DB 154, where each set of linked dots represents a story. The story is constructed from individual documents, each represented by a dot. Thus, FIG. 7 shows how the disclosed invention effectively filters an original document set by removing non-trending information. The figure also shows how the original document set is organized into related information.

FIG. 8 illustrates a system 800 configured in accordance with an embodiment of the invention. The system 800 generally corresponds to system 100 of FIG. 1, but the memory is augmented to include an LLM trainer 156 and a query processing module 158. Also, content machines 150_1 through 150_N are not depicted as being connected to network 106. Rather, LLM machines 170_1 through 170_N are connected to network 106. LLM machine 170_1 and each of the other LLM machines are nodes in an LLM network. LLM machine 170_1 includes a processor 171, input/output devices 172, a bus 174, and a network interface circuit 176. A memory 180 is connected to bus 174. The memory 180 stores an LLM module 182 with instructions executed by processor 171 to support training of an LLM and subsequent use of a trained LLM to prepare responses to queries submitted by the query module 158.

The LLM trainer 156 accesses the story DB 154 to collect pre-organized data to train an LLM. Typically, when it comes to processing news and current event data beyond the temporal limits of an LLM's knowledge base, the LLM's only option is to utilize what is effectively the disaggregated data on the left side of FIG. 7. With the disclosed technology, the LLM trainer 156 supplies the LLM with automatically aggregated stories of coalesced tuples. Thus, it becomes possible to train the model on story data, and the training process is far more efficient since there is less data and the submitted data is already thematically organized.

The query module 158 supports queries applied to the trained LLM model and the story DB 154. Observe that story results are prepared in advance, so queries are quick and efficient. The disclosed system has already identified the patterns (stories) most likely to yield results within the source dataset.

- For machines: Source dataset is broken down into digestible chunks, where ability to ingest can be scaled with controlled loss of fidelity, where each story's resources can be governed by filtering by relevance.
- For users: There is an immediate breakdown of important patterns in the source data set. Consequently, less time is required to sort through results. Stacked levels of relevance calculations return precise results first.

Even though the source dataset has been broken down and grouped into discrete objects (stories) orders of magnitude fewer in number, it retains the statistically significant documents within each story. Documents within each story can be further filtered by setting an intelligent cap via a relevance score.

In one embodiment, the query module supplies two types of query results.

- Non-interactive queries do not alter the source story in any way. Any additional filtering is the responsibility of the target system. Non-interactive queries can be automated to ensure the target system is as up to date as required. An example of a non-interactive query would be LLM fine-tuning, where the story DB 154 is used as supplemental training data.
- Interactive queries filter the results to immediately make them easier to ingest in the target system. Interactive queries are usually one-off requests. This is especially relevant for RAG (Retrieval Augmented Generation) with an LLM, although just as pertinent to any other system that requires the most targeted data retrieval with a hard limit to returned results (documents).

As stories are updated in an automatic, ongoing fashion, efficient and automatic machine to machine querying and ingesting of stories provides accurate near real-time results. Full story query types are used to copy the entire story DB 154 into another system.

Incremental story query types will only return stories that have changed since the UTC timestamp provided. Usually, incremental queries are used after a complete full query has been performed at least once.

Before being ingested by an LLM or other systems, a further limiting of each story's documents can be achieved by filtering by either x % or x number of documents based on the relevance score provided for each document. The highly relevant documents contain more of the tuples used to represent the story than the documents with a lower relevance score.
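A minimal sketch of this filtering step is shown below. It keeps either a fixed number of documents or a percentage of them, ranked by relevance score. The function name and the sample document IDs are hypothetical, and rounding the percentage up to a whole document is an assumption.

```python
import math

def top_documents(doc_scores, count=None, percent=None):
    """Return the document IDs with the highest relevance scores.

    Provide either count (x number of documents) or percent (x % of documents).
    """
    if count is None and percent is None:
        raise ValueError("provide count or percent")
    # Stable sort: documents with equal scores keep their original order.
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    if count is None:
        count = math.ceil(len(ranked) * percent / 100)  # assumption: round up
    return [doc_id for doc_id, _ in ranked[:count]]

doc_scores = {"doc-a": 3, "doc-b": 2, "doc-c": 3, "doc-d": 2}  # hypothetical
print(top_documents(doc_scores, count=2))     # ['doc-a', 'doc-c']
print(top_documents(doc_scores, percent=50))  # ['doc-a', 'doc-c']
```

Either form of the limit lets a target system cap ingestion per story while keeping the most relevant documents.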

Story relevance is calculated using the aggregate term scores, which is based on the coalesced tuples contained within all thread snapshots of the story.

- Story_n relevance=[sum of matching story_n term scores]/[sum of all story_n term scores]

The formula to implement the document limit specified is as follows:

- Document limit for story_n=Floor [([story_n relevance]/[sum of relevance scores for all stories])*[document limit]]


Story	Terms	Sum (Terms)	Sum (Matching)	Relevance	Documents

A	[Taylor_Swift, 80], [NFL, 30], [a, 20], [b, 10]	140	110	0.785714286	80
B	[a, 200], [b, 100], [Taylor_Swift, 10], [NFL, 5]	315	15	0.047619048	4
C	[a, 50], [Taylor_Swift, 10], [c, 5], [d, 4]	69	10	0.144927536	14
				0.97826087	100 (limit)
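The table can be reproduced with the two formulas above. The sketch below assumes the matching terms are Taylor_Swift and NFL, which is inferred from the Sum (Matching) column and not stated explicitly. The function names are hypothetical.

```python
import math

def story_relevance(term_scores, query_terms):
    """Sum of matching term scores divided by the sum of all term scores."""
    total = sum(term_scores.values())
    matching = sum(s for term, s in term_scores.items() if term in query_terms)
    return matching / total

def document_limits(stories, query_terms, document_limit):
    """Share the overall document limit between stories by relative relevance."""
    relevance = {sid: story_relevance(t, query_terms) for sid, t in stories.items()}
    total = sum(relevance.values())
    return {sid: math.floor(r / total * document_limit) for sid, r in relevance.items()}

stories = {
    "A": {"Taylor_Swift": 80, "NFL": 30, "a": 20, "b": 10},
    "B": {"a": 200, "b": 100, "Taylor_Swift": 10, "NFL": 5},
    "C": {"a": 50, "Taylor_Swift": 10, "c": 5, "d": 4},
}
query_terms = {"Taylor_Swift", "NFL"}  # assumption inferred from the table
print(document_limits(stories, query_terms, 100))  # {'A': 80, 'B': 4, 'C': 14}
```

Because each share is rounded down, the per-story limits sum to 98 in this example, slightly under the overall limit of 100.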

An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include but are not limited to: magnetic media, optical media, magneto-optical media, and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using an object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.

The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described to best explain the principles of the invention and its practical applications; they thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.

Claims

The invention claimed is:

1. A computer implemented method, comprising:

receiving a baseline dataset divisible by temporal or logical criteria;

receiving a target dataset representing a small fraction of the baseline dataset, the target dataset being segmented by the temporal or logical criteria;

identifying, without user direction, numbers of documents containing individual words within the baseline dataset to form baseline singles;

identifying, without user direction, numbers of documents containing common combinations of individual words within the target dataset to form target tuples where each common combination of individual words provides context for each word in the common combination;

coalescing, without user direction, combinations of the target tuples based upon baseline singles criteria to form a coalesced dataset representing significant content in the target dataset;

using the coalesced dataset to cluster documents from the target dataset into threads of related content;

receiving updated content; and

integrating the updated content with the coalesced dataset to form updated threads of related content.

2. The computer implemented method of claim 1 wherein integrating is based upon common term criteria in the updated content and the coalesced dataset.

3. The computer implemented method of claim 1 further comprising scoring threads based upon content volume.

4. The computer implemented method of claim 1 further comprising scoring threads based upon content velocity.

5. The computer implemented method of claim 4 wherein scoring threads utilizes a content velocity curve filter.

6. The computer implemented method of claim 1 further comprising merging threads to form merged threads.

7. The computer implemented method of claim 6 further comprising:

securing snapshots of merged threads; and

storing the snapshots in a database.

8. The computer implemented method of claim 7 further comprising searching the database for a designated snapshot.

9. The computer implemented method of claim 8 wherein searching the database for a designated snapshot is based upon a snapshot identifier.

10. The computer implemented method of claim 8 wherein searching the database for a designated snapshot is based upon a snapshot temporal parameter.

11. The computer implemented method of claim 7 further comprising combining snapshots into a story.

12. The computer implemented method of claim 11 wherein the story has a story identification, a story title, story timestamps, story keywords, story scores, and a document object with a uniform resource locator specifying the network location of the original source material.

13. The computer implemented method of claim 11 further comprising applying selected stories to a language model to form a trained language model.

14. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model in accordance with a temporal parameter.

15. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model in accordance with a story identification.

16. The computer implemented method of claim 13 further comprising supporting query processing of the trained language model by returning a relevance score for a response supplied to a query.

Resources