WO2024261293A1 - Methods and systems for search and information retrieval - Google Patents


Info

Publication number
WO2024261293A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
summaries
embedding
document
identified
Legal status
Pending
Application number
PCT/EP2024/067524
Other languages
French (fr)
Inventor
Alex NIM
James BATCH
Current Assignee
Adarga Ltd
Original Assignee
Adarga Ltd
Application filed by Adarga Ltd filed Critical Adarga Ltd
Publication of WO2024261293A1 publication Critical patent/WO2024261293A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing

Definitions

  • the invention relates to methods and systems for searching large numbers of text documents and other unstructured data.
  • the methods reduce the computational resources required during search processes and avoid loss of information typically associated with condensing data into shorter formats. Retrieving information from large bodies of text data is simplified and made more efficient.
  • the present invention matches search queries against text summaries taken from ingested documents.
  • the summaries are created by segmenting text within the documents based on the similarity of consecutive sentences and then condensing the resulting segments.
  • a set of summaries are returned (i.e. output) that have been identified as matching the query or for which the corresponding document segments have been identified as matching the query.
  • This approach provides an efficient search process whilst avoiding loss of significant information content in the search results.
  • the methods allow for the rapid and efficient retrieval of information from large collections of text documents in response to a search query.
  • a method performed by one or more processors in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments; receiving a search query; returning one or more summaries of respective document segments within the plurality of document segments, wherein the summaries and/or the corresponding document segments have been identified to correspond to the search query; wherein each returned summary is shorter than its respective document segment.
  • the one or more processors perform the further steps of either: in response to receiving the search query, identifying document segments that correspond to the search query, and generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment, and wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the summaries generated for the identified document segments; or, prior to receiving the search query, generating a summary for each document segment in the plurality of document segments and storing said summaries, wherein each summary is shorter than the corresponding document segment, and, in response to receiving the search query, identifying summaries that correspond to the search query, and, wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the identified summaries.
  • This approach also preserves high levels of the information content within the original text documents and can return (i.e. output) substantially all the information within the original documents in response to a query. If an entire document were summarised to provide a document-level summary, significant amounts of information would be lost. Important information that is contained in only a small portion of the document can be lost when summarising entire documents. As such, the methods discussed herein offer accurate search results whilst requiring fewer resources than traditional search methodologies.
  • the method involves the processors performing the further steps of arranging the summaries to be returned into a plurality of clusters based on the similarity of said summaries and returning said summaries arranged in their clusters. More preferably still, the method involves the processors performing the further steps of comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein, wherein the clusters are returned in accordance with the cluster hierarchy.
  • each text document is segmented based on the similarity of consecutive sentences in the document.
  • a set of segments can be created that each relate to a singular topic. This allows for substantially all information within the original document to be split out into segments before summarisation occurs. Therefore, relatively little information is lost during the segmenting and summarisation process and substantially all of the information in the originally ingested documents can be efficiently interrogated and searched.
  • the search results produced through this process - the identified summaries that are relevant to the search query - are short and easy to comprehend. Therefore, a user may easily identify information from the identified summaries returned in response to their query.
  • the identified summaries may be automatically grouped and returned in appropriate clusters which further aids understanding.
  • each text document is segmented based on the similarity of sentences to the at least one preceding sentence, and wherein preferably each text document is segmented based on the similarity of each sentence to the two or three preceding sentences.
  • the documents may be segmented based on the semantic similarity between two sentences, or each block of three or more consecutive sentences (e.g. 3 to 6 consecutive sentences).
  • each text document is segmented based on the similarity of each sentence to its two, three, four or five preceding sentences.
  • Preferably segmenting the one or more text documents into a plurality of document segments comprises: dividing each text document into a plurality of sentences; generating sentence embeddings for each sentence within the plurality of sentences, wherein the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings; and iteratively: (a) determining a cosine similarity between a sentence embedding of a given sentence and a sentence embedding for the sentence preceding the given sentence or between a sentence embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence; (b) comparing the cosine similarity to a segmentation threshold; and (c) in response to determining that said cosine similarity is lower than the segmentation threshold, determining that the given sentence is the start of a new segment; whereas when said cosine similarity is greater than the segmentation threshold, the given sentence is included in the same segment as the sentence that directly precedes the given sentence; wherein steps (a) to (c) are repeated for successive sentences of the text document.
  • segmentation may comprise determining the similarity between one or more given sentences and the single immediately preceding sentence.
  • the given sentence is sufficiently semantically similar to the preceding sentences, it is included in the existing segment which includes at least the immediately preceding sentence. Where the given sentence is semantically different from the preceding sentence a new segment is created which begins with this new sentence. Where new segments are created the iterative process may be adjusted such that subsequent sentences are only compared to sentences within the new segment, rather than semantically different sentences from a previous segment.
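The iterative segmentation loop described above can be sketched as follows. This is an illustrative Python sketch rather than the patent's implementation: the embedding values, the threshold of 0.3 and the two-sentence comparison window are all assumptions, and the comparison is restricted to sentences within the current segment, as noted above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def segment(embeddings, threshold=0.3, window=2):
    """Group consecutive sentence indices into segments.

    A sentence starts a new segment when the cosine similarity between
    its embedding and the average embedding of up to `window` preceding
    sentences in the current segment falls below `threshold`.
    """
    segments = [[0]]  # the first sentence always opens the first segment
    for i in range(1, len(embeddings)):
        current = segments[-1]
        # only compare against sentences already in the current segment
        preceding = [embeddings[j] for j in current[-window:]]
        sim = cosine_similarity(embeddings[i], mean_vector(preceding))
        if sim < threshold:
            segments.append([i])   # topic shift: open a new segment
        else:
            current.append(i)      # same topic: extend the current segment
    return segments
```

With four toy two-dimensional embeddings where the first pair point one way and the second pair another, `segment` yields two segments of two sentences each.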
  • the sentences within the text documents may be identified using a sentencizer.
  • Sentencizers are tools that divide unstructured text into separate sentences based on (for instance) punctuation, line breaks and the semantic meaning of text.
  • the sentencizers may be a model trained on labelled sentence data or may use writing heuristics to identify sentence boundaries. Suitable sentencizers include examples from NLTK, SpaCy, and PySBD.
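Purely as an illustration of the heuristic approach, a crude punctuation-based sentencizer can be written in a few lines of Python; the named tools (NLTK, SpaCy, PySBD) handle abbreviations, quotations and line breaks far more robustly.

```python
import re

def naive_sentencize(text):
    """Split text into sentences at terminal punctuation followed by
    whitespace and an uppercase letter. A deliberately simple heuristic
    sketch; production sentencizers are considerably more sophisticated."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]
```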
  • the sentence embeddings may be generated from the respective sentences by a sentence transformer, such as Sentence-BERT.
  • Other suitable models for generating sentence embeddings could alternatively be used including embedding APIs produced by OpenAI.
  • Using cosine similarity is a particularly computationally efficient way to identify similarities between the vectors of the sentence embeddings.
  • alternative metrics for similarity between vectors could be used, such as Euclidean distance or Manhattan distance.
  • the method may involve determining a Euclidean distance or Manhattan distance between the sentence embedding of the given sentence and the average of the embeddings for the at least two preceding sentences, comparing the Euclidean distance or Manhattan distance to a segmentation threshold and determining whether the given sentence should be included in the same segment as its preceding sentence or not based on this comparison.
  • the method comprises the step of determining a cosine similarity between a sentence embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence, preferably the average of sentence embeddings is calculated as a weighted average in which the sentence embeddings of the two or more preceding sentences are weighted based on the position of their respective sentence in the document relative to the given sentence. By the position of the sentences it is understood that their embeddings are weighted depending on how close they are to the given sentence. Sentences that are further from the given sentence are less heavily weighted.
  • the sentence embedding of the sentence which immediately precedes the given sentence may be weighted more highly (or, in some examples, less highly) than the sentence embedding of the sentence before it. Placing additional emphasis on the immediately preceding sentence offers increased accuracy when segmenting.
  • the similarity between a given sentence and the immediately preceding sentence is a particularly strong indicator whether said given sentence should be included as part of the same segment as the preceding sentence or whether a new segment should be created.
  • the weighting may be predefined by a user or a manufacturer of the system.
  • the sentence embedding of the immediately preceding sentence may be weighted at least 1.2 times, and preferably at least 2 times, more heavily than the sentence embedding of the (n-2)th sentence which precedes said immediately preceding sentence.
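The weighted averaging of preceding sentence embeddings might be sketched as follows. This is illustrative Python only: the 1:2 weighting reflects the "at least 2 times" preference stated above, and the oldest-to-newest ordering of the inputs is an assumption.

```python
def weighted_context_embedding(preceding, weights=(1.0, 2.0)):
    """Weighted average of the embeddings of the preceding sentences.

    `preceding` is ordered oldest-to-newest, as are `weights`, so the
    immediately preceding sentence (the last element) receives the
    largest weight - here twice that of the sentence before it.
    Intended for windows no longer than `weights`.
    """
    weights = weights[-len(preceding):]
    total = sum(weights)
    dim = len(preceding[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, preceding)) / total
        for i in range(dim)
    ]
```

With two preceding embeddings, the later one contributes two thirds of the average; with a single preceding embedding the result is simply that embedding.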
  • the segmentation threshold may be predefined or determined dynamically for different documents. These thresholds allow segments of appropriate length to be created.
  • the segmentation threshold is in the range of 0.1 to 0.5 and more preferably from 0.2 to 0.4.
  • the segmentation threshold may be predetermined or set by a user or a system designer as a value between 0.2 and 0.4.
  • a single fixed threshold value is often not suitable for all documents, since different documents tend to have different distributions of cosine similarities between adjacent sentence groups.
  • Using a fixed segmentation threshold can result in a particularly large number of very short summaries and/or a small number of very long summaries. Therefore, in some preferred examples the segmentation threshold is determined dynamically, such that a different segmentation threshold is determined for each text document.
  • Setting a segmentation threshold dynamically - such that the segmentation threshold is specific to each text document - may involve calculating the cosine similarities between the sentence embeddings of all sentences within said text document or the cosine similarities between the sentence embeddings of the sentences within all groups of two, three or more consecutive sentences within said document. The method may then involve calculating the mean and/or the standard deviation of these cosine similarities. The segmentation threshold may be set based on the mean and/or standard deviation of these cosine similarities.
  • the segmentation threshold may be set as the mean of the cosine similarity scores, the mean of the cosine similarity scores minus their standard deviation, or the mean of the cosine similarity score minus the standard deviation as multiplied by an empirically set factor (e.g. a factor in a range from 0.2 to 0.8, and more preferably in the range from 0.25 to 0.5).
  • Using the mean of the cosine similarity score minus the standard deviation as multiplied by an empirically set factor in the range from 0.25 to 0.5 is particularly preferred as it provides summaries of appropriate lengths for a large range of documents.
  • When using the mean and standard deviation, it may be assumed that the cosine similarity scores follow a Gaussian distribution.
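The dynamic threshold described above (the mean of the per-document cosine similarities minus an empirically set fraction of their standard deviation) can be sketched as follows; the default factor of 0.35 is an illustrative choice from the preferred 0.25 to 0.5 range.

```python
from statistics import mean, stdev

def dynamic_threshold(similarities, factor=0.35):
    """Per-document segmentation threshold: the mean of the
    adjacent-sentence cosine similarities for the document, minus
    `factor` times their (sample) standard deviation."""
    return mean(similarities) - factor * stdev(similarities)
```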
  • the processes described above may also be implemented using an alternative similarity metric - e.g. the Manhattan distance or Euclidean distance discussed above.
  • generating a summary for each document segment comprises abstractive summarisation.
  • Abstractive summarisation rephrases text in a shorter manner whilst retaining semantic meaning.
  • Suitable tools for performing the abstractive summarisation include Distill-BART.
  • whilst abstractive summarisation can generate new phrases and wording, preferably a relatively large amount of the original wording of each segment is retained to reduce the risk that information is lost.
  • Other suitable models for summarisation could alternatively be used, such as GPT-4 from OpenAI (RTM).
  • summaries of text could be created using extractive summarisation - a process in which existing sentences or phrases are isolated from the original text and concatenated or otherwise combined together - or any other form of summarisation.
  • segmenting the one or more text documents comprises dividing each text document based on a predetermined length for each segment.
  • each text document is repeatedly split into a new segment after a predetermined number of characters, words or sentences.
  • the document may be iteratively divided into segments after a predetermined number of characters in the range from 200 to 1000 characters, preferably from 250 to 800 characters, and more preferably still from 300 to 600 characters. Therefore, the text document may be divided into a set of text segments with fixed lengths corresponding to the predetermined length, with the exception of the final text segment which includes the remainder of the document once the earlier segments are formed.
  • each text document may be repeatedly split into a new segment at the end of the word or sentence containing a character corresponding to a predetermined number of characters for each segment, or at the end of a sentence containing a word corresponding to a predetermined number of words for each segment.
  • the text document is divided into a series of text segments which are approximately equal to (just greater than) the predetermined length for each segment.
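This fixed-length variant, which closes each segment at the end of the sentence in which the target character count is first reached, might be sketched as follows; the 400-character default is an illustrative value within the preferred ranges and not specified by the claims.

```python
def split_at_length(sentences, target_chars=400):
    """Greedily pack sentences into segments, closing a segment at the
    end of the sentence in which the running character count first
    reaches `target_chars`, so each segment is just over the target."""
    segments, current, count = [], [], 0
    for s in sentences:
        current.append(s)
        count += len(s)
        if count >= target_chars:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:
        segments.append(" ".join(current))  # remainder of the document
    return segments
```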
  • each summary comprises from 2 to 6 sentences and preferably from 3 to 4 sentences and/or each segment comprises from 4 to 20 sentences and preferably from 5 to 16 sentences.
  • each document segment may comprise from 200 to 1000 characters, preferably from 250 to 800 characters, more preferably still from 300 to 600 characters. These lengths provide a strong compromise between search efficiency, information loss and ease of information retrieval.
  • the thresholds and parameters used during segmentation and summarisation may be selected or adjusted to maintain the sizes of segments and summaries within these ranges.
  • a method performed by one or more processors in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and, returning the identified summaries that correspond to the search query.
  • this method is based on the aspect of the invention discussed above and relates to methods where the segmentation of text documents and summarisation of segments are performed before searching is performed.
  • This method requires reduced computational resources in comparison to traditional search methodologies as there is no need to access and search the full body of all text documents in response to each search query. Instead, only the summaries must be searched. Furthermore, the amount of data that is returned in response to a search query is reduced, since the summaries can be returned rather than excerpts of the full text. Therefore, searching can be quicker and less computationally expensive. Storage requirements can also be reduced as there may be no need to store the full text of all ingested text documents. As such, it will be seen that the invention provides an alternative data structure that supports improved searching in view of the technical issue of the amount of computational resources required to search large numbers of text documents and large amounts of data.
  • This approach also preserves high levels of the information content within the original text documents and can return (i.e. output) substantially all the information within the original documents in response to a query. If an entire document were summarised to provide a document-level summary, significant amounts of information would be lost. Important information that is contained in only a small portion of the document can be lost when summarising entire documents. As such, the methods discussed herein offer accurate search results whilst requiring fewer resources than traditional search methodologies. By segmenting each document based on the similarity of consecutive sentences, a set of segments can be created that each relate to a singular topic. This allows for substantially all information within the original document to be split out into segments before summarisation occurs. Therefore, relatively little information is lost during the segmenting and summarisation process and substantially all of the information in the originally ingested documents can be efficiently interrogated and searched.
  • the search results produced through this process - the identified summaries that are relevant to the search query - are short and easy to comprehend. Therefore, a user may easily identify information from the identified summaries returned in response to their query. As will be discussed further below, in preferred embodiments, the identified summaries may be automatically grouped and returned in appropriate clusters which further aids understanding.
  • the method steps according to this aspect can be divided into two processes: a preliminary process in which summaries are generated from text documents; and a search process in which search queries are received and relevant summaries are identified and returned.
  • the preliminary process may involve the steps discussed above of obtaining the one or more text documents, segmenting the one or more text documents into a plurality of document segments, generating summaries for each document segment, and storing the summaries.
  • the search process may involve the steps discussed above of receiving a search query (e.g. a query comprising one or more keywords, a natural language query, and/or a query comprising an embedding that represents suitable summaries numerically), identifying stored summaries that correspond to the search query, and returning the identified summaries that correspond to the search query.
  • the steps of the preliminary process and the search process may be performed by a single processor or by different processors. Where the preliminary process and search process are performed by different processors, the processors may be located either locally or remotely from one another. These different processors may be connected by any suitable means, including the internet or another network.
  • Performing segmentation and summarisation, and storing the summaries, during an initial step before receiving any search queries reduces search latency significantly and avoids the need to perform these computationally expensive steps each time a query is received. Indeed, the segmentation and summarisation steps tend to take significantly more time and resources than the search process and optional clustering process (discussed further below). Searching and clustering (if performed) take relatively few resources and can be performed in real time after receiving a search query and without a substantial delay.
  • Although the steps of the method may be conceptually divided into a preliminary process and a subsequent search process, in practice the steps may be performed in various different orders where appropriate. For example, steps of the preliminary process and search process may be performed consecutively in a single flow, sequentially, or in parallel.
  • the method starts with obtaining one or more text documents which are intended to be searched.
  • the method may be used with a wide range of data from a wide range of sources, both public and private.
  • the documents may take the form of substantially any written text including articles, books, scientific papers and other texts. However, this is not essential. For instance, obtaining one or more text documents may additionally include:
  • Text documents may be obtained from an external source.
  • An external source is a source that is remote from the system and processors performing the method. However, in some cases the source may be local.
  • the source is a cloud-based source such as a proprietary sharepoint site, a proprietary cloud storage container, or an S3 bucket (a cloud object storage service offered by Amazon Web Services (RTM)).
  • text documents are obtained via an API allowing access to an external data supplier.
  • the text documents may be publicly available or may be documents that are private and are supplied by an organisation that wishes to perform searching using the methods discussed herein.
  • the text documents are segmented based on the semantic similarity of consecutive sentences in the document.
  • the document is divided up based on the shared meaning and content of the consecutive sentences. Consequently, each segment tends to refer to a single topic or subject. As the text changes topic or subject a new segment will be created. Since each segment is directed to a single semantic topic, the subsequent summarisation process will retain a greater proportion of the original information than would be achieved otherwise.
  • the summaries of the text segments created through the previously discussed methods and any associated metadata may be stored in a database or any other suitable storage.
  • the storage may be located within, local to or remotely from the one or more processors.
  • the storage may be located in the cloud, such as when the storage is an S3 bucket, and/or connected to the processors by the internet, a LAN or another suitable network.
  • the stored summaries may subsequently be searched.
  • the search query may comprise one or more keywords, a natural language query or a sample embedding to be compared to the stored summaries.
  • other means for prompting models may also be used.
  • Summaries that contain content relevant to the search query can be identified and returned as results.
  • Any suitable search algorithm may be used, such as the BM25 algorithm or a retrieval model configured to respond to a natural language query.
  • the search query may comprise one or more keywords, and the method may comprise identifying one or more stored summaries that correspond to the keywords.
  • Alternatively, the search query may comprise a natural language query, and the method may comprise identifying one or more stored summaries that correspond to the natural language query.
  • the search query may comprise a sample embedding and the method may comprise identifying one or more stored summaries that have summary embeddings that are similar to or match the sample embedding (e.g. by calculating the cosine similarity between the sample embedding and the summary embeddings of the stored summaries).
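A minimal sketch of this embedding-based identification step, assuming the summary embeddings are already stored as plain vectors alongside the summary text, might look as follows (illustrative Python only, not the patent's implementation; the top-k cut-off is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_embedding, stored, top_k=3):
    """Rank stored summaries by cosine similarity between the query
    embedding and each summary embedding; return the best matches.

    `stored` is a list of (summary_text, summary_embedding) pairs.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]
```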
  • the search query may be received from a user and the identified summaries matching the query may be returned to the user.
  • this is not essential and the query may be automatically generated by a machine and/or the results may be returned to a machine for further processing.
  • the search query may include one or more filters defining which of the stored summaries should be searched and/or returned in the method.
  • the method may comprise searching a subset of the stored summaries based on the filter.
  • a search query may include (but is not limited to):
  • a date filter that restricts the search to summaries from documents within a defined date range, to summaries created within a defined date range, and/or to documents or summaries created either before or after a specific date;
  • a country of origin filter indicating whether to include or exclude from the search summaries based on documents from specific countries (e.g. summaries based on webpages with a specific country of origin of their URL);
  • an institution filter indicating whether to include or exclude from the search summaries based on documents from specific institutions or from specific types of institutions (e.g. private, university, government, etc.);
  • a publisher filter indicating whether to include or exclude one or more summaries from the search based on the publisher of the document(s) on which the summaries are based;
  • a file type filter indicating whether to include or exclude one or more summaries from the search based on the file type of the document(s) on which the summaries are based (e.g. pdf, webpage, presentation, video file, audio file, etc.).
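The filters listed above might be applied before searching along these lines. This is an illustrative sketch only: the metadata keys `date`, `publisher` and `file_type` are hypothetical names, and dates are assumed to be ISO-8601 strings so that lexicographic comparison matches chronological order.

```python
def apply_filters(summaries, date_range=None, exclude_publishers=(), file_types=None):
    """Restrict which stored summaries are searched, based on metadata.

    Each summary is a dict with hypothetical metadata keys such as
    'date' (ISO-8601 string), 'publisher' and 'file_type'.
    """
    result = []
    for s in summaries:
        # ISO date strings compare lexicographically in date order
        if date_range and not (date_range[0] <= s["date"] <= date_range[1]):
            continue
        if s["publisher"] in exclude_publishers:
            continue
        if file_types and s["file_type"] not in file_types:
            continue
        result.append(s)
    return result
```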
  • the identified summaries may be returned in the form of a list or more preferably a report.
  • a report containing the identified summaries may be in electronic or paper form.
  • the report may include in addition to the identified summaries, a contents page listing the identified summaries and/or any clusters of the identified summaries and copies of, or links to, the original documents.
  • the report may be a web or html report and available via a website or web portal.
  • the method comprises the further steps of arranging the identified summaries into a plurality of clusters based on the similarity of the identified summaries; and returning the identified summaries arranged in their clusters. Therefore, the identified summaries may be clustered based on their semantic similarity. As will be discussed in more detail below, the clusters may in turn be ordered based on their relevance to the search query, such that the identified summaries are clustered and ordered based on their similarity.
  • the summaries that have been identified as relevant to the search query may be arranged into groups or clusters that are similar - i.e. semantically similar - before they are returned. Therefore, the summaries presented in response to the query are returned in groups on the basis of their thematic or contextual similarity.
  • the clustering process creates a data structure for the identified summaries that would otherwise be unstructured and difficult to navigate. Clustering can guide users through the results of a search query, enabling users to more efficiently identify and retrieve relevant information from the ingested text documents.
  • the method further comprises comparing the clusters and creating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein; wherein the clusters of identified summaries are returned in accordance with the cluster hierarchy.
  • the clusters may be ordered based on their semantic similarities.
  • Arranging the identified summaries into a plurality of clusters may comprise: generating at least one metadata embedding for each of the stored summaries; wherein the metadata embedding is a numerical vector representing metadata associated with the respective stored summary or metadata associated with a part of the respective stored summary; and calculating a summary embedding for each stored summary based on the at least one metadata embedding; and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings.
  • the metadata from which metadata embeddings are generated may include one or more of key words, named entities identified through named entity recognition (NER), a title or date of the original document from which the respective stored summary is generated, and names, events or dates listed within the stored summary.
  • the metadata embeddings may represent these pieces of metadata as a numerical vector.
  • the metadata embeddings may include key word embeddings or named entity embeddings that represent key words or named entities within the respective stored summary.
  • the metadata embeddings may include embeddings of the summary or a portion of the summary itself.
  • the metadata embeddings may include one or more sentence embeddings that are numerical vectors representing individual sentences within the respective stored summaries. Generating the summary embeddings may include averaging or weighting different metadata embeddings together.
  • the method may comprise generating a sentence embedding for each of the sentences within the stored summaries, wherein the sentence embeddings are numerical vectors representing the respective sentences, and calculating a summary embedding for each stored summary based on the plurality of sentence embeddings (e.g. by averaging the sentence embeddings).
  • the method may comprise generating sentence embeddings for each stored summary, as well as one or more further metadata embedding(s) for each stored summary, and generating a summary embedding based on the sentence embeddings and the one or more further metadata embedding(s).
  • the further metadata embeddings may be one or more key word embeddings that are numerical vectors which each represent an individual word or phrase in the respective stored summary and/or one or more named entity embeddings that are numerical vectors which each represent an individual word or phrase in the respective stored summary identified as a named entity by an NER classifier.
  • the generation of a summary embedding may comprise averaging or weighting the sentence embeddings and the one or more further metadata embeddings.
  • arranging the identified summaries into a plurality of clusters comprises: dividing each of the stored summaries into one or more sentences; generating one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the stored summaries; wherein the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences and their semantic meanings, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences; calculating a summary embedding for each stored summary, wherein the summary embedding is a numerical vector representing a weighted average of the sentence embedding, the key word embeddings and/or the named entity embeddings for each sentence of said stored summary; and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings.
  • the steps of dividing stored summaries into sentences, generating metadata embeddings, sentence embeddings, key word embeddings and/or named entity embeddings, and calculating a summary embedding for the stored summaries discussed above are performed for each stored summary.
  • the processes for calculating summary embeddings are preferably performed as part of the preliminary process discussed above in which text documents are obtained, segmented and summarised and before any search query is received.
  • the summary embeddings can be seen as metadata associated with the respective stored summaries.
  • the summary embeddings may be stored in a database or other storage format, preferably with the corresponding stored summaries.
  • Calculating summary embeddings during a preliminary process and before any search queries are received is efficient since the calculated summary embeddings may be used to create clusters of summaries for all subsequent search queries. Creating the summary embeddings - which are a further example of metadata for the stored summaries - before search queries are received reduces search latency, since creating embeddings requires relatively large amounts of resources whilst arranging the summaries into clusters and ordering the clusters within a hierarchy (if performed) is relatively inexpensive. However, the calculation of summary embeddings during a preliminary process and before search is not essential.
  • the calculation of summary embeddings could be performed during the search process, after the receipt of a search query. For instance, the calculation of summary embeddings could be performed for a given summary the first time that this summary is to be returned in response to a search query - i.e. the first time the summary is considered relevant to a search query. In a further alternative, the calculation of summary embeddings may be performed each time a search query is received for all summaries which have been identified as relevant to the search query (i.e. for each of the identified summaries during every search). Similarly, the steps involved in calculating summary embeddings may be split across the preliminary process and search process discussed above. Nevertheless, in each case the steps for calculating summary embeddings may be as discussed above.
  • the sentences within the stored summaries may be identified and divided using a sentencizer.
  • Sentencizers are tools that divide unstructured text into separate sentences based on (for instance) punctuation, line breaks and the semantic meaning of text.
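A minimal illustrative sentencizer that splits only on punctuation can be sketched as follows. This is a simplification for illustration; real sentencizers (e.g. spaCy's sentencizer component) also consider line breaks and semantic cues as noted above, and the function name here is our own.

```python
import re

def split_sentences(text):
    # Naive sentencizer: split where '.', '!' or '?' is followed by
    # whitespace and an upper-case letter. Production sentencizers also
    # use line breaks and the semantic meaning of the text.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

summary = "The port reopened on Monday. Grain shipments resumed! Exports rose."
print(split_sentences(summary))
```

Each resulting sentence can then be passed to an embedding model to produce the per-sentence embeddings discussed above.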
  • the key word embeddings may be calculated using a keyword extractor model such as the sentence-transformer BERT model.
  • Sentence embeddings may be generated from the respective sentences by a tool such as the BERT sentence-transformer.
  • Key words may be obtained from a keyword extraction model, such as KeyBERT, a publicly available library that leverages sentence embeddings and n-gram word/phrases.
  • the method may include calculating the cosine similarity between the document embedding and n-gram word or phrase embeddings.
  • the key words with the highest cosine similarity score may be taken as the key words most representative of the summary itself.
  • any suitable model can be used here, not limited to KeyBERT, for example, a supervised model trained using document keyword pairs.
  • the key words may include single words and/or phrases formed of multiple words.
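The keyword-ranking step described above (as used by KeyBERT) can be sketched as follows. The toy two-dimensional vectors stand in for real sentence-transformer embeddings, and the function names are illustrative rather than KeyBERT's actual API:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_keywords(doc_emb, candidates):
    """Rank candidate n-gram embeddings by cosine similarity to the
    document embedding; the highest-scoring phrases are taken as most
    representative of the summary."""
    scored = [(phrase, cosine(doc_emb, emb)) for phrase, emb in candidates.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy embeddings standing in for real model outputs.
doc_emb = np.array([0.9, 0.1, 0.0])
candidates = {
    "grain exports": np.array([0.8, 0.2, 0.1]),
    "weather":       np.array([0.1, 0.9, 0.3]),
}
print(rank_keywords(doc_emb, candidates)[0][0])  # most representative phrase
```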
  • Named entities may be obtained through an NER classifier such as the RoBERTa model developed by Facebook AI (RTM). However, any suitable model for finding words that follow a given NER ontology may be used. These models identify words from a given text that match a given class from the model's ontology. These words are extracted from the unstructured text and classified into predefined classes from the ontology (e.g. person, date, street address, company name, business, monetary value, event, brand, etc.). It may be appreciated that the key words and named entities on which the key word embeddings and named entity embeddings are based may include some overlap. This is to be expected. The processes of identifying key words and identifying named entities offer different ways to interpret the word content of each summary.
  • Using both key word embeddings and named entity embeddings in addition to sentence embeddings allows the method to identify relatively large amounts of information regarding each summary. This enables summaries to be more easily linked and connected together automatically. Therefore, clusters can be made larger whilst maintaining conceptual similarity between summaries within the cluster. Larger clusters of conceptually linked summaries are desirable since smaller clusters (especially lots of very small clusters) do not offer a significant improvement to users in comparison to simply listing summaries individually. Alternatively, only one of key word embeddings and named entity embeddings may be used in the process.
  • key word embeddings and named entity embeddings are weighted greater in the summary embeddings than the sentence embedding.
  • this approach achieves relatively large clusters of conceptually linked summaries.
  • If the summary embeddings were based on the sentence embeddings only, or if sentence embeddings were relatively heavily weighted in comparison to key word embeddings or named entities identified using NER tags, the process would produce larger numbers of relatively small clusters. This is because the sentence embeddings relating to individual sentences within a summary tend to be highly specific and distinct, which reflects how different sentences in each summary can contain very different content following the summarisation process despite being in the same summary and discussing the same topic.
  • the key word embeddings from the summary are weighted at least twice as heavily as the sentence embeddings.
  • the key word embeddings may be weighted between two and three times more heavily than the sentence embeddings, and preferably 2.5 times more heavily.
  • the named entity embeddings may be weighted at least 1.2 times as heavily as the sentence embeddings.
  • the named entity embeddings may be weighted between 1.2 and 2 times more heavily than the sentence embeddings, and preferably 1.5 times more heavily.
  • the summary embedding may be calculated as 0.2 times the sentence embeddings plus 0.5 times the key word embeddings plus 0.3 times the named entity embeddings.
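The weighted combination described above can be sketched as follows. The 0.2/0.5/0.3 weights are those stated in the preceding bullet; the toy two-dimensional vectors and the function name are illustrative assumptions rather than part of the disclosed method:

```python
import numpy as np

def summary_embedding(sentence_embs, keyword_embs, entity_embs,
                      w_sent=0.2, w_kw=0.5, w_ner=0.3):
    """Average each family of embeddings, then combine them with the
    0.2/0.5/0.3 weighting: key words and named entities are weighted
    more heavily than the (highly specific) sentence embeddings."""
    sent = np.mean(sentence_embs, axis=0)
    kw = np.mean(keyword_embs, axis=0)
    ner = np.mean(entity_embs, axis=0)
    return w_sent * sent + w_kw * kw + w_ner * ner

emb = summary_embedding(
    sentence_embs=[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    keyword_embs=[np.array([1.0, 1.0])],
    entity_embs=[np.array([0.0, 2.0])],
)
print(emb)
```

In practice the vectors would be high-dimensional outputs of the embedding models discussed above, but the weighting step is the same.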
  • the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using an agglomerative clustering algorithm.
  • Agglomerative clustering methods are examples of hierarchical clustering algorithms, which start with clusters formed of individual objects (in this case identified summaries) and repeatedly merge similar clusters to form sets of combined clusters.
  • a predefined cluster distance is used to define a threshold at which the distance between the clusters of identified summaries is too great, and the clusters are not merged further.
  • the predefined cluster distance may be set by the designer of the system performing the method empirically so as to produce acceptable numbers and sizes of clusters.
  • the final set of clusters of identified summaries can then be returned in response to the query.
  • clusters may be generated by divisive clustering where all segments begin in an initial cluster which is repeatedly split to form smaller clusters, or any other hierarchical clustering method.
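A minimal, self-contained sketch of agglomerative clustering with a predefined distance threshold is shown below (average linkage over toy points). A production system would more likely use a library implementation such as scikit-learn's AgglomerativeClustering with n_clusters=None and a distance_threshold; this hand-rolled version only illustrates the merge-until-threshold behaviour described above:

```python
import numpy as np

def agglomerative(points, threshold):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters (average linkage) until the smallest inter-cluster
    distance exceeds `threshold`. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between clusters.
                d = np.mean([np.linalg.norm(points[i] - points[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # the predefined cluster distance stops further merging
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerative(pts, threshold=1.0))
```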
  • the method includes generating a cluster hierarchy and returning the clusters and identified summaries therein in accordance with the cluster order.
  • Generating a cluster hierarchy may comprise: calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding.
  • the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster. This cluster is also referred to herein as the dominant cluster.
  • generating a cluster hierarchy may comprise identifying the similarity between the cluster embedding of each cluster and the average cluster embedding. Therefore, a respective similarity between the cluster embedding of each cluster and the average cluster embedding is calculated for each cluster within the plurality of clusters.
  • Generating a cluster hierarchy may further comprise selecting as a first cluster the cluster that has the greatest similarity between its respective cluster embedding and the average cluster embedding.
  • this first cluster is also referred to as the dominant cluster. Being closest to the average or centre of all clusters, the dominant cluster may be seen to be particularly relevant to the search query received by the method. As such, prioritising the dominant cluster helps enable a user to receive important information quicker.
  • the method may involve identifying a next closest cluster that is most similar to the first (dominant) cluster, finding a cluster that is most similar to the next closest cluster, and so on until all clusters are ordered.
  • the method may involve iteratively: calculating a similarity between each of the respective cluster embeddings of the remaining clusters of the plurality of clusters and the cluster embedding of the previously selected cluster (i.e. the dominant cluster in the first iteration); and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster.
  • the clusters may be ordered and returned in a manner that is particularly clear and easy to understand, with each cluster being particularly closely related and flowing clearly from the preceding cluster.
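The dominant-cluster-then-nearest-neighbour ordering described above can be sketched as follows, using cosine similarity (the measure preferred later in this document). The toy embeddings and function names are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_clusters(cluster_embs):
    """Select as the first (dominant) cluster the one whose embedding is
    closest to the average of all cluster embeddings, then iteratively
    append the remaining cluster most similar to the previously
    selected one."""
    centre = np.mean(cluster_embs, axis=0)
    remaining = list(range(len(cluster_embs)))
    first = max(remaining, key=lambda i: cosine(cluster_embs[i], centre))
    order = [first]
    remaining.remove(first)
    while remaining:
        prev = cluster_embs[order[-1]]
        nxt = max(remaining, key=lambda i: cosine(cluster_embs[i], prev))
        order.append(nxt)
        remaining.remove(nxt)
    return order

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(order_clusters(embs))
```

Each selected cluster is thus closely related to, and flows from, the preceding one.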
  • the method may comprise ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding.
  • the clusters and identified summaries therein may be returned in an order from a cluster which has the most similar cluster embedding in comparison to the average cluster embedding to a cluster which has the least similar cluster embedding in comparison to the average cluster embedding.
  • the first or dominant cluster remains the cluster that has a cluster embedding which is most similar to the average cluster embedding.
  • This approach again offers an order of clusters that is easy to understand because clusters are arranged from the clusters closest to the centre of the overall collection of clusters and which may therefore be perceived as most relevant to the search query at hand, to the clusters which are furthest from the centre and therefore appear less relevant. This helps provide information that is more relevant to a search quicker.
  • the similarity between cluster embeddings and average cluster embeddings may be determined by any suitable approach including using the cosine similarity, Manhattan similarity or Euclidean similarity.
  • the cosine similarity is preferred, being particularly quick and computationally efficient.
  • the average of the plurality of cluster embeddings is the mean of the plurality of cluster embeddings. However, other averages may also be used.
  • These steps of generating a cluster hierarchy indirectly arrange the clusters into a hierarchy or structure based on their approximate relevance to the search query. This is because the average of the cluster embeddings is determined by the summary embeddings for each identified summary, and the identified summaries are selected from the pool of stored summaries by their relevance to the search query.
  • the search results returned by the method are particularly valuable when returned in accordance with the cluster hierarchies created using the approaches described above since they help present the most relevant and valuable information first and/or help guide those receiving the results from one cluster to the next. Nevertheless, it will be appreciated that a variety of other cluster hierarchies could be created.
  • the method comprises further steps of: generating a summary order for the identified summaries within each cluster; and returning the identified summaries within each cluster in accordance with this summary order.
  • Generating a summary order may comprise calculating a summary embedding for each identified summary in the cluster, calculating an average summary embedding based on the summary embeddings of all identified summaries in the cluster, and ordering the identified summaries based on their summary embeddings and the average summary embedding.
  • ordering the identified summaries may comprise identifying the similarity between the summary embedding of each identified summary and the average summary embedding, and selecting as a first (dominant) summary the summary that has the greatest similarity between its summary embedding and the average summary embedding. Being closest to the average or centre of all summaries within the cluster, the dominant summary appears most relevant to the search query received by the method. As such, prioritising the dominant summary and presenting it first helps enable a user or system receiving the results of a search query to receive important information quicker.
  • the method may then find the next closest summary to the dominant summary and so on until all summaries are ordered.
  • the method may involve iteratively: calculating a similarity between the respective summary embeddings of the remaining identified summaries within the cluster and the summary embedding of the previously selected summary (i.e. the dominant summary in the first iteration); and selecting as the next summary in the order the identified summary that has a summary embedding that is most similar to the embedding of the previously selected summary.
  • the summaries may be ordered and presented in a manner that is particularly clear and easy to understand, with each identified summary being similar and flowing clearly from the preceding identified summary.
  • the remaining identified summaries other than the dominant summary may be ordered based on the similarity of their summary embeddings to the average summary embedding.
  • identified summaries therein may be returned in an order from a summary which has the most similar summary embedding in comparison to the average summary embedding within the cluster, to a summary which has the least similar summary embedding in comparison to the average summary embedding.
  • This approach again offers an order of summaries that is easy to understand because summaries are arranged from the summary closest to the centre of the overall collection of summaries within each cluster, and which may therefore be perceived as most relevant to the search query at hand, to the summaries which are furthest from the centre and therefore appear less relevant. This helps provide information that is more relevant to a search quicker.
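The alternative "centre-out" ordering of summaries within a cluster can be sketched as follows (toy vectors and an illustrative function name; cosine similarity as the similarity measure):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_by_centrality(summary_embs):
    """Order summaries from most to least similar to the cluster's
    average summary embedding, so the most central (and apparently most
    relevant) summary is returned first."""
    centre = np.mean(summary_embs, axis=0)
    return sorted(range(len(summary_embs)),
                  key=lambda i: cosine(summary_embs[i], centre),
                  reverse=True)

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(order_by_centrality(embs))
```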
  • the similarity between summary embeddings and average summary embeddings may be determined by any suitable approach including using the cosine similarity, Manhattan similarity or Euclidean similarity.
  • the cosine similarity, Manhattan similarity or Euclidean similarity methods may be used in the calculation of any of the similarities discussed herein.
  • the method involves generating and storing a summary title for each of the stored summaries and/or generating a summary title for each identified summary that corresponds to the search query.
  • the method may further involve returning each identified summary with its corresponding summary title.
  • the summary title for each of these summaries may be generated using a title generator model.
  • a particularly preferred model is the title generator T5 model from Hugging Face (RTM) although any suitable title generator could be used.
  • the method may involve generating a cluster title for each cluster and returning the identified summaries arranged in their clusters with their corresponding cluster title.
  • the summary title and/or cluster title may help assist users or systems to understand the results of a search query and to quickly understand the contents of the summaries within the cluster.
  • generating a cluster title for each cluster may comprise setting each cluster title to be identical to the summary title of the dominant summary within the respective cluster.
  • the method may involve identifying a dominant summary and summary titles as discussed above.
  • generating a title for each cluster of identified summaries may be based on the contents of the identified summaries therein.
  • generating a title for each cluster may comprise: concatenating the identified summaries in the cluster; generating a cluster summary from the concatenated summaries, wherein the cluster summary is shorter than the concatenated summaries; and, generating a title for the cluster based on the cluster summary.
  • Each cluster summary may be generated using abstractive summarisation of the concatenated text from all identified summaries within the respective cluster. Suitable tools for performing the abstractive summarisation include Distill-BART, although other models can be used.
  • the title may be generated for the cluster using any suitable title generator model, but a preferred example is the title generator T5 model from Hugging Face (RTM).
  • the method may comprise generating a cluster summary for each cluster based on the identified summaries within each cluster or the corresponding document segments, wherein each cluster summary is shorter than the concatenated length of the identified summaries within each cluster and/or the concatenated length of the corresponding document segments.
  • the cluster summary may be the same as the dominant summary within the cluster or a summarisation of the concatenated summaries or corresponding document segments from within the cluster. This summarisation may use any of the summarisation processes discussed above.
  • in a further aspect of the invention there is provided a method performed by one or more processors, wherein the one or more processors perform the steps of obtaining one or more text documents, segmenting the one or more text documents into a plurality of document segments, receiving a search query, identifying document segments that correspond to the search query, generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment, and, returning the summaries of the identified document segments that correspond to the search query.
  • this method is based on the first aspect of the invention discussed above and shares many similarities with the directly preceding aspect of the invention. Indeed, to a user, the method of this aspect will appear very similar to those of the preceding aspects - in response to a query a set of corresponding summaries are received.
  • this aspect relates to methods where the segmentation of text documents is performed before searching is performed but where summarisation is performed after a search query is received.
  • a search query is compared to the document segments obtained from the text documents, and summarisation is only performed for document segments that have been identified to correspond to the search query.
  • This approach offers strong search results and efficient use of computational resources. Since the steps performed before searching do not involve summarisation, this pre-processing is quicker and less computationally expensive. This is particularly important where the corpus of text documents to be searched is particularly large. Similarly, the overall number of document segments which need to be summarised using the method is reduced, especially in situations where the number of searches to be performed is relatively low in comparison to the number of documents and document segments to be stored. Storage requirements can also be reduced as there is no need to store summaries for all document segments.
  • the one or more processors perform the step of storing the plurality of document segments, wherein preferably the step of storing the plurality of document segments is performed prior to receiving a search query.
  • Suitable storage mediums are the same as those discussed for the previous aspects of the invention.
  • the method involves generating a segment embedding for each of the document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment.
  • the method involves storing the segment embeddings, for instance in a vector database, database or cloud container. More preferably still the steps of generating a segment embedding for each of the document segments and storing the segment embeddings is performed prior to receiving a search query.
  • the segment embedding may be used during the search process and when clustering and ordering summaries.
  • Preferably identifying document segments that correspond to the search query is based on a comparison of the segment embeddings and a sample embedding contained within or derived from the search query, and/or a comparison of the named entities and/or keywords within the document segments and sample named entities and/or keywords contained within or derived from the search query.
  • This search process may be similar to the search processes described in relation to the preceding aspect of the invention, with the exception that the search is performed against the document segments rather than summaries generated from these segments. The same sub-processes and tools discussed above may be used during both search steps.
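The comparison of segment embeddings against a query embedding can be sketched as a cosine-similarity top-k search. The toy vectors and function name are illustrative; in practice the embeddings would come from an embedding model such as those discussed elsewhere in this document:

```python
import numpy as np

def top_segments(query_emb, segment_embs, k=2):
    """Return the indices of the k document segments whose embeddings
    are most similar (by cosine similarity) to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    segs = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    sims = segs @ q                      # cosine similarity per segment
    return list(np.argsort(-sims)[:k])   # indices, best first

segments = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
query = np.array([1.0, 0.0])
print(top_segments(query, segments))
```

Only the segments returned by such a search would then be summarised, in line with the efficiency argument above.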
  • the method according to the present aspect involves steps of clustering and ordering the summaries of the identified document segments that are equivalent to the corresponding steps in the preceding aspects.
  • the summaries may be returned in these clusters and orders.
  • the clustering and ordering steps may be substantially the same as those previously described and use corresponding tools, with the exception that they may be based on summary embeddings which represent the summaries and/or segment embeddings which represent the identified document segments.
  • These clustering and ordering steps may include any other optional or preferable features of the equivalent processes discussed above.
  • the method may include steps of generating summary titles, cluster titles and cluster summaries that are equivalent to those discussed above in reference to the previous aspects and include the step of returning the summaries and clusters with these features.
  • the one or more processors perform the further steps of: arranging the summaries of the identified document segments into a plurality of clusters based on the similarity of the identified document segments and/or the summaries, and returning said summaries arranged in their clusters.
  • the one or more processors perform the further steps of: comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query, the similarity of the summaries contained therein and/or the similarity of the identified document segments that correspond to the summaries contained therein; wherein the clusters are returned in accordance with the cluster hierarchy.
  • arranging the summaries into a plurality of clusters comprises obtaining a segment embedding for each of the identified document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment, and arranging the summaries into a plurality of clusters based on the similarity of the segment embeddings of their respective identified document segments, and/or obtaining a summary embedding for each of the summaries generated from the identified document segments, wherein each summary embedding is a numerical vector representing the corresponding summary, and arranging said summaries into a plurality of clusters based on the similarity of their summary embeddings.
  • the summary embeddings may be generated directly from the summaries - e.g. by passing the text of the summary through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’ (both available from Hugging Face (RTM)).
  • segment embeddings may be generated directly from the identified document segments - e.g. by passing the text of the document segment through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’.
  • generating the summary and/or segment embeddings may comprise generating at least one metadata embedding for each of the summaries or document segments; wherein the metadata embedding is a numerical vector representing metadata associated with at least a part of the respective summary or document segment.
  • the summary embedding or segment embedding may then be calculated based on said metadata embeddings (e.g. by taking an average or a weighted average of the metadata embeddings).
  • Suitable examples of metadata to form the metadata embeddings and the processes for forming these metadata embeddings are discussed above.
  • the summaries are arranged into a plurality of clusters using a hierarchical clustering algorithm and/or an agglomerative clustering algorithm.
  • Preferably generating a cluster hierarchy comprises calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the summaries contained within the respective cluster and/or the segment embeddings of the identified document segments that correspond to the summaries contained within the respective cluster, calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding.
  • Preferably ordering the clusters based on their cluster embeddings and the average cluster embedding comprises identifying a similarity between the cluster embedding of each cluster within the plurality of clusters and the average cluster embedding, and either: ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding; or selecting as a first cluster in a cluster order the cluster that has the greatest similarity between its respective cluster embedding and the average cluster embedding, and generating the remainder of the cluster hierarchy by iteratively, calculating a similarity between each of the respective cluster embeddings of the remaining clusters and the cluster embedding of the previously selected cluster and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster.
  • the one or more processors perform the further steps of generating a summary order for the summaries within each cluster; and returning the summaries within each cluster in accordance with the summary order; wherein preferably generating a summary order for the summaries within each cluster comprises identifying a similarity between the respective summary embedding of each summary and the average summary embedding of said cluster and/or identifying a similarity between the segment embeddings of the respective identified document segments corresponding to each summary and the average segment embedding of said cluster, and either: ordering the summaries based on the similarity between their respective summary embeddings and the average summary embedding of said cluster or the similarity between the segment embeddings of their respective identified document segments and the average segment embedding of said cluster; or selecting as a first summary in the summary order the summary that has the greatest similarity between its summary embedding and the average summary embedding for said cluster or between the segment embedding of its corresponding identified document segment and the average segment embedding for said cluster, and generating the remainder of the summary order by iteratively calculating a similarity between the respective summary embeddings (or corresponding segment embeddings) of the remaining summaries and that of the previously selected summary, and selecting as the next summary in the summary order the summary whose embedding is most similar to that of the previously selected summary.
  • the one or more processors perform the further steps of generating and storing a summary title for each of the summaries, and returning each summary with its corresponding summary title, generating a cluster title for each cluster, and returning the summaries arranged in their clusters with their corresponding cluster titles, and/or generating a cluster summary for each cluster based on the summaries within each cluster or the identified document segments corresponding to the summaries within each cluster, wherein each cluster summary is shorter than the concatenated length of the summaries and/or the concatenated length of the identified document segments corresponding to the summaries within each cluster, and returning the summaries arranged in their clusters with their corresponding cluster summaries.
• In a further aspect of the invention there is provided a method performed by one or more processors, in which the one or more processors perform the steps of obtaining one or more text documents, segmenting the one or more text documents into a plurality of document segments, receiving a search query, identifying document segments that correspond to the search query, arranging the identified document segments into a plurality of clusters based on the similarity of said identified document segments, generating a cluster title and/or cluster summary for each cluster based on the identified document segments within each cluster, wherein each cluster summary is shorter than the concatenated length of the identified document segments in each cluster, and returning the identified document segments arranged in their clusters with their corresponding cluster titles and/or cluster summaries.
  • This method returns relevant document segments in response to a search query rather than summaries of said document segments.
  • This provides a further search process that is computationally efficient since the amount of information which must be transferred and stored is reduced. Moreover, the results of the search are easily understood and discrepancies between document segments and their summaries and/or data loss during the summarisation process are avoided.
  • the steps of this method may comprise any of the optional or preferable features of the analogous steps of the preceding aspects of the invention.
  • the clustering and order processes discussed above with reference to summaries of document segments can be implemented using the document segments themselves. As such, further discussion of appropriate techniques for obtaining text documents, segmenting text documents, searching against document segments, clustering document segments, ordering document segments within clusters, ordering clusters and creating titles for document segments and clusters and creating summaries for clusters is provided above.
  • a system comprising a plurality of processors, the processors configured to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and returning the identified summaries that correspond to the search query.
• a system comprising a plurality of processors, the processors configured to perform the steps of any of the methods discussed above.
  • Systems in accordance with these aspects offer corresponding benefits to the methods discussed above.
  • the systems may be configured to perform any of the optional or preferable steps discussed above.
• a non-transitory computer readable medium storing instructions which, when run by one or more processors, cause the one or more processors to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and returning the identified summaries that correspond to the search query.
  • a non-transitory computer readable medium storing instructions which when run by one or more processors cause the one or more processors to perform the steps of any of the methods discussed above.
  • the stored instructions of these aspects offer corresponding benefits to the methods discussed above.
  • the stored instructions may include instructions that cause the processors to perform any of the optional or preferable steps discussed above with reference to the previous aspects of the invention.
  • the invention provides particularly computationally efficient searching without loss of information content.
  • the invention offers significant support to the task of information retrieval from large bodies of data.
• Figure 1 shows a flow chart of a method in accordance with embodiments of the invention;
• Figure 2 shows schematically a method of segmenting and summarising text documents suitable for use in embodiments of the invention;
• Figure 3 shows schematically a system for performing search in accordance with embodiments of the invention;
• Figure 4 shows schematically a report returned in accordance with embodiments of the invention;
• Figure 5 shows a flow chart of a further method in accordance with embodiments of the invention;
  • Figure 6 shows a flow chart of a further method in accordance with embodiments of the invention.
  • the method begins with ingestion of text data.
• In step s101 one or more text documents are obtained.
  • the text documents typically contain unstructured text data.
  • the text documents may for instance be submitted by a user, obtained from the internal database of an organisation, or located from public sources (e.g. websites, journals or encyclopaedias).
  • Obtaining the text documents may comprise generating them by transcribing video and/or audio files, extracting text from multimodal sources or translating text documents from different languages.
  • the method typically involves ingesting a large number of text documents (e.g. up to, and over, 1 million). Each text document may range in length. For instance, suitable text documents are typically at least 5 to 10 sentences long but can be hundreds of pages and thousands of sentences long.
  • the text documents are segmented into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document.
  • the text documents are segmented based on their semantic similarities. As such, each segment will relate to a different subject or topic.
  • the segments produced from each text document are a plurality of extracts or excerpts from the text. Each segment is typically in the region of 4 to 20 sentences long, and preferably 5 to 16 sentences.
  • the text is split into individual sentences using a sentencizer.
  • Sentence embeddings representing the semantic content of each sentence as a numerical vector are generated using a tool such as BERT sentence-transformer.
  • the sentence embeddings of adjacent or consecutive sentences may then be compared to determine whether the sentences relate to similar content and should be kept together in a segment or whether the content of the sentences is sufficiently different that the sentences should be split apart into different segments.
  • a cosine similarity is calculated between the sentence embedding of a given sentence and the average sentence embeddings of the two or more (e.g. three) sentences that immediately precede the given sentence. If the cosine similarity is greater than a predetermined threshold the given sentence is considered sufficiently similar to the preceding sentences that it should be included in their segment. If the cosine similarity is lower than the threshold it is determined that the given sentence is the start of a new segment as it is sufficiently semantically different to the preceding text. It will be appreciated that a wide variety of alternative methods for calculating the similarity of adjacent sentences are equally possible.
• calculating the average sentence embeddings of the two or more preceding sentences involves weighting the preceding sentences based on their adjacency or closeness to the given sentence. For example, the sentence that immediately precedes the given sentence may be weighted higher than the sentence that in turn precedes it. That is, if a document includes N sentences that are numbered consecutively through the document as 1, 2, 3 ... n ... N, the cosine similarity of the sentence embedding of sentence n may be calculated with reference to an average sentence embedding calculated based on the sentence embeddings of sentences n - 1, n - 2 and n - 3.
  • the embedding of the immediately preceding sentence n - 1 may be weighted more highly than its preceding sentence n - 2, and sentence n - 2 may be weighted more highly than the sentence n - 3 that precedes it in turn. Suitable weightings may be selected by a user and/or a designer.
• the segmentation threshold may be fixed - e.g. being set by a user or technician - or may be varied dynamically for each document. When varied dynamically, the segmentation threshold may be set based on the mean and/or standard deviation of cosine similarities between the embedding of each sentence in a document and the embedding or average embedding of its respective preceding sentence(s). For example, the segmentation threshold may be set as the mean of the cosine similarities minus the standard deviation of the cosine similarities multiplied by a factor in the range of 0.2 to 0.8.
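The weighted-similarity segmentation described above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the weights (0.5, 0.3, 0.2) and the factor 0.5 are assumed example values, and in practice the sentence embeddings would come from a model such as a BERT sentence-transformer.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment_sentences(embeddings, weights=(0.5, 0.3, 0.2), factor=0.5):
    """Group sentence indices into segments, starting a new segment where
    a sentence is semantically dissimilar to the sentences preceding it."""
    sims = []
    for n in range(1, len(embeddings)):
        # weighted average of up to three preceding embeddings,
        # with the nearest sentence (n - 1) weighted highest
        prev = embeddings[max(0, n - 3):n][::-1]
        w = np.array(weights[:len(prev)])
        avg = np.average(prev, axis=0, weights=w)
        sims.append(cosine(embeddings[n], avg))
    # dynamic threshold: mean similarity minus a multiple of its std. dev.
    threshold = float(np.mean(sims) - factor * np.std(sims))
    segments, current = [], [0]
    for n, s in enumerate(sims, start=1):
        if s < threshold:  # semantic boundary: start a new segment
            segments.append(current)
            current = []
        current.append(n)
    segments.append(current)
    return segments
```

With toy two-dimensional "embeddings" covering two clearly different topics, the sketch splits the document at the topic change.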
• the text documents may be segmented in an alternative manner based on a predetermined length for the segments.
  • the text documents may be repeatedly or iteratively divided after a predetermined number of characters, words or sentences.
  • the document may be split into a new segment after a predetermined number of characters, words or sentences (e.g. every 400 characters or every 8 sentences).
  • the document may be iteratively split into a new segment at the end of each word or sentence in which a character corresponding to the predetermined number of characters occurs (e.g. at the end of the word or sentence in which the 400th character occurs) or at the end of each sentence which includes the word corresponding to a predetermined number of words for each segment (e.g. at the end of the sentence containing the 80th word).
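The length-based alternative can be illustrated with the word-count variant (closing a segment at the end of the sentence in which the word budget is reached); the budget of 80 words is the example value from the text, and the tokenisation by whitespace is a simplifying assumption.

```python
def split_by_word_budget(sentences, words_per_segment=80):
    """Close a segment at the end of the sentence in which the word
    budget is reached (e.g. the sentence containing the 80th word)."""
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= words_per_segment:
            segments.append(current)
            current, count = [], 0
    if current:  # keep any trailing sentences as a final segment
        segments.append(current)
    return segments
```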
  • the method involves the summarisation of the segments.
  • a summary for each document segment is generated, wherein each summary is shorter than the corresponding document segment.
• the summaries are generated through abstractive summarisation. Suitable tools for performing the abstractive summarisation include DistilBART.
  • a summary of each document segment may be generated using a Large Language Model (LLM) such as ChatGPT produced by OpenAI (RTM).
  • a single summary is produced for each document segment.
  • Each summary is typically in the region of 2 to 6 sentences, and preferably 3 to 4 sentences long.
  • the generated summaries are then stored in step s104.
• the summaries may be stored in a database or cloud container such as an S3 bucket from Amazon Web Services (RTM).
  • the stored summaries provide a text corpus for future searching.
  • steps s101 to s104 may be seen conceptually as forming part of a preliminary process 100a that is completed before a search process 100b occurs. Multiple searches can be performed on the same set of stored summaries. As such, the segmentation and summarization processes do not need to be repeated each time a search query is received.
  • the search query may include one or more keywords, one or more embeddings, and/or a natural language query.
  • the search query is used to identify relevant content within the stored summaries.
  • the search query may additionally define one or more filters which restrict the summaries which should be searched and/or returned in response to the search query.
  • the search query may include a date filter defining a range of dates that a user is interested in or a filter on the types of documents or sources a user wants to receive information and summaries from.
• the filters may indicate that the search should be restricted to summaries taken from documents from one or more publishers or institutions, summaries of documents of specific file type(s), summaries from documents that come from trusted or untrusted sources, summaries from documents of a specific country of origin and/or summaries from internal or external data.
• In step s106 the method involves identifying stored summaries that correspond to the search query.
  • the method involves the identification of a subset of the stored summaries (so called “identified summaries”) that correspond to the search query.
  • keywords of the search query may be compared to the text of each of the stored summaries using the BM25 algorithm or any other suitable means.
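The BM25 keyword match mentioned above can be sketched in a few lines. This is a simplified, self-contained illustration over pre-tokenised summaries: the parameter values k1 = 1.5 and b = 0.75 are conventional defaults rather than values taken from the source, and a production system would more likely use a search engine or library implementation.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised summary against the query terms using BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()  # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency in this summary
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Summaries containing the query terms receive positive scores; summaries containing none of them score zero.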
  • a retrieval model configured to respond to a natural language query may identify summaries relevant to a natural language query.
  • a model may be configured to identify summaries that are similar or relevant to an embedding contained within the search query. Stored summaries that match the keywords, natural language query and/or sample embedding (and meet any filter requirements set out in the search query) are identified as appropriate results to return to the user or machine issuing the search query.
  • a sample embedding contained within or derived from the search query may be compared to summary embeddings which each represent a corresponding summary.
  • the summaries which have a summary embedding which are sufficiently similar or most similar to the sample embedding may be identified as corresponding to the search query. For example, all summaries which are more similar than a predetermined similarity may be identified as corresponding to the search query or a predetermined number of most similar summaries may be identified as corresponding to the search query.
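The embedding comparison described above - returning either all summaries above a similarity threshold or the top-k most similar - can be sketched as follows; cosine similarity is an assumed (though common) choice of similarity measure.

```python
import numpy as np

def retrieve(query_emb, summary_embs, top_k=3, min_sim=None):
    """Return indices of stored summaries ranked by cosine similarity to
    the query embedding, optionally keeping only those above min_sim."""
    q = np.asarray(query_emb, dtype=float)
    m = np.asarray(summary_embs, dtype=float)
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q               # cosine similarity per stored summary
    order = np.argsort(-sims)  # most similar first
    if min_sim is not None:
        order = [i for i in order if sims[i] >= min_sim]
    return [int(i) for i in order[:top_k]]
```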
• the identified summaries - that is, the summaries identified as corresponding to or matching the search query in step s106 - are returned or output in step s108.
  • the identified summaries may be returned in the form of a list or report.
  • an electronic or paper report may be generated that includes the identified summaries.
• the report may include, in addition to the identified summaries, a contents page listing the identified summaries and copies of, or links to, the original documents obtained in step s101.
  • the report may be a web or html report and available via a website or web portal.
• the method may involve the additional step of arranging the identified summaries into a plurality of clusters based on the similarity of the identified summaries. This is shown by step s107 in Figure 1, surrounded by a dashed box. Following this clustering process, the identified summaries may be returned (i.e. output) in their clusters in step s108. During step s107 the identified summaries are structured into semantically similar groups, such that the summaries in each group share a thematic or contextual similarity.
  • Arranging the identified summaries into a plurality of clusters in step s107 may comprise generating at least one metadata embedding for each of the identified and/or stored summaries, the metadata embedding being a numerical vector representing metadata associated with the respective stored summary or metadata associated with a part of the respective summary. Thereafter, a summary embedding may be calculated for each stored summary based on the at least one metadata embedding.
  • the metadata embeddings may include one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the stored summaries, wherein the sentence embeddings are numerical vectors representing the respective sentences, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences.
  • the summary embedding may be considered a “compound embedding”, being an average or a weighted average of a series of metadata embeddings.
  • a summary embedding may be derived or calculated directly from the respective summary.
• the text of the summary may be passed through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’, both of which are available from Hugging Face (RTM).
  • Step s107 may then include arranging the identified summaries into a plurality of clusters based on their summary embeddings.
  • Suitable metadata embeddings include sentence embeddings, key word embeddings and named entity embeddings, as well as embeddings representing the title or date of the original document from which the respective summary is generated or representing names, events or dates listed in the summary itself. Therefore, the clustering may involve as an initial step dividing one or more of the summaries into one or more sentences using a sentenciser. Subsequently one or more of a sentence embedding, a plurality of word embeddings and named entity embeddings can be generated for each sentence within said one or more summaries.
  • the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings and can be generated by BERT sentence-transformer.
  • the word embeddings are numerical vectors representing the semantic meaning of individual words or phrases in the respective sentences and may be generated by the sentence-transformer BERT model.
• the named entity embeddings are numerical representations of named entities identified using NER classification (also referred to as NER tagging). The named entity embeddings are therefore based on the actual entities identified using NER tagging.
  • a summary embedding representing the semantic meaning of the summary as a whole may be generated based on the generated sentence embeddings, word embeddings and/or named entity embedding of its sentences. For example these metrics may be averaged or weighted to generate the summary embedding.
  • the method may include dividing a summary into individual sentences and generating a sentence embedding for each of the sentences within the stored summaries. Thereafter a summary embedding can be calculated for each summary based on the plurality of sentence embeddings (e.g. by averaging or weighting the sentence embeddings).
  • the method may additionally involve generating one or more further metadata embedding(s) for each summary, and generating a summary embedding based on the sentence embeddings and the one or more further metadata embedding(s).
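The "compound" summary embedding described above can be illustrated as a (weighted) average of per-summary metadata embeddings - sentence, key-word and named-entity vectors. The weights used here are purely illustrative; the description leaves the weighting scheme open to the designer.

```python
import numpy as np

def compound_embedding(metadata_embs, weights=None):
    """Combine metadata embeddings (sentence, key-word and named-entity
    vectors) into one summary embedding by (weighted) averaging."""
    embs = np.asarray(metadata_embs, dtype=float)
    w = np.ones(len(embs)) if weights is None else np.asarray(weights, dtype=float)
    return (embs * w[:, None]).sum(axis=0) / w.sum()
```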
  • These processes for generating summary embeddings may be performed for all of the stored summaries - e.g. as part of the preliminary process 100a discussed above. Alternatively, these processes may be performed for the identified summaries in response to a search query during the search process 100b. Thereafter, in response to receiving a search query and following the identification of summaries that are relevant to the search query, the method may involve arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings. This provides an efficient method of comparing and grouping the identified summaries into clusters. Preferably the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using an agglomerative clustering algorithm. However, other clustering approaches may also be used.
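As a minimal stand-in for a library implementation of agglomerative clustering (such as scikit-learn's AgglomerativeClustering), the merge loop can be sketched as follows. The centroid linkage, cosine distance and the threshold of 0.5 are illustrative assumptions, not details fixed by the description.

```python
import numpy as np

def agglomerative_clusters(embs, threshold=0.5):
    """Merge the two closest clusters (cosine distance between centroids)
    until no pair is closer than `threshold`; returns lists of indices."""
    embs = np.asarray(embs, dtype=float)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(embs))]

    def dist(a, b):
        ca, cb = embs[a].mean(axis=0), embs[b].mean(axis=0)
        return 1.0 - float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

    while len(clusters) > 1:
        # find and merge the closest pair of clusters
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

With two well-separated groups of toy summary embeddings, the loop merges each group and then stops at the threshold.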
  • the method may optionally involve arranging or structuring the clusters themselves in a cluster hierarchy.
  • the clusters and the identified summaries therein may be returned (i.e. output) in accordance with this cluster hierarchy.
  • Generating a cluster hierarchy may involve calculating a plurality of cluster embeddings, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding.
• the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster (also termed the dominant cluster). The model then finds the next closest cluster to the starting cluster, and so on, until all clusters are ordered.
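The cluster-ordering process just described - select the dominant cluster, then greedily chain to the nearest remaining cluster - can be sketched as follows, with cosine similarity assumed as the closeness measure.

```python
import numpy as np

def order_clusters(cluster_embs):
    """Order clusters starting from the 'dominant' cluster (the one whose
    embedding is closest to the average of all cluster embeddings), then
    repeatedly appending the remaining cluster closest to the last one."""
    cluster_embs = np.asarray(cluster_embs, dtype=float)
    avg = cluster_embs.mean(axis=0)

    def sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    remaining = list(range(len(cluster_embs)))
    current = max(remaining, key=lambda i: sim(cluster_embs[i], avg))
    order = [current]
    remaining.remove(current)
    while remaining:
        nxt = max(remaining, key=lambda i: sim(cluster_embs[i], cluster_embs[current]))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order
```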
  • a corresponding method may be used to order the identified summaries within each cluster.
  • a similarity between the respective summary embedding of each identified summary and an average summary embedding of said cluster may be calculated.
  • the summaries may be ordered based on the similarity between their respective summary embeddings and the average summary embedding of said cluster.
  • Equally the summaries may be ordered by selecting as a first summary in the summary order, the summary that has the greatest similarity between its summary embedding and the average summary embedding for said cluster, and generating the remainder of the summary order by iteratively calculating a similarity between the respective summary embeddings of the remaining identified summaries within said cluster and the summary embedding of the previously selected summary and selecting as the next summary in the summary order the identified summary which has a summary embedding that is most similar to the embedding of the previously selected summary.
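The simpler of the two orderings described above - sorting the summaries within a cluster by similarity to the cluster's average summary embedding - can be sketched as follows (cosine similarity is again an assumed choice).

```python
import numpy as np

def order_by_centroid(summary_embs):
    """Sort summary indices by cosine similarity to the cluster's average
    summary embedding, most similar first."""
    embs = np.asarray(summary_embs, dtype=float)
    centroid = embs.mean(axis=0)

    def sim(v):
        return float(np.dot(v, centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid)))

    return sorted(range(len(embs)), key=lambda i: -sim(embs[i]))
```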
  • the summaries in each cluster may be returned in accordance with the generated summary order.
• Titles can be generated for either all stored summaries and/or the identified summaries. Additionally, titles may be generated for the clusters. The search results (the clusters and identified summaries) may be presented with these titles. Titles for each summary may be generated using a title generator tool such as the T5 model available from Hugging Face (RTM). Titles for clusters may be taken from one or more of the identified summaries therein (e.g. from the dominant summary) and/or by concatenating the text of the identified summaries therein and subsequently generating a title based on this concatenated text - e.g. again using a title generator tool such as the T5 model available from Hugging Face (RTM).
• Titles for stored summaries may be generated during the preliminary process 100a and the titles stored in association with the stored summaries to reduce latency when a search query is received. However, this is not essential.
  • Breaking documents into segments and individually summarising each segment before searching as discussed herein offers significant benefits.
  • search latency and resource requirements are significantly reduced.
  • a user may wish to search documents using the keywords ‘drones’ and ‘Ukraine’. If documents are summarised without segmenting and the resulting document-level summaries searched, then a search based on these document-level summaries would likely not return any content on drones from a relatively long document that primarily covers Ukraine and mentions drones in only a single paragraph.
  • the single, isolated paragraph is likely to be lost during summarisation and would not be presented to a user in a search.
  • the paragraph may contain very important insight, but because it does not make up a significant element of the semantic meaning of the original document, it is likely it would not be captured in the document-level summary.
  • the document covering Ukraine will have been segmented multiple times based on semantic similarity.
  • the paragraph on drones will be considered semantically different from other topics relating to Ukraine and will be separated into an individual segment.
  • This segment on drones may be summarised to reduce its length, whilst maintaining its semantic meaning, to make it quicker to search and easier to understand by a user.
  • the important insight only found in an isolated paragraph of a document which primarily covers a different topic entirely is still presented to a user in response to their keyword search. It is therefore possible to retrieve information efficiently and without careful inspection of the full content of all documents.
  • Figure 2 shows a set of text documents 10 which may be obtained through step s101.
• three text documents 10 are shown in Figure 2.
  • the text documents 10 comprise text data.
  • the data in these text documents is typically unstructured and therefore difficult to search by conventional methods.
  • the method may involve obtaining structured data.
  • each of the text documents 10 is divided into one or more segments 12. As shown, each text document 10 is segmented into a corresponding plurality of segments 12. Each segment 12 is a sub-section of the respective original document 10. The segments 12 are a series of consecutive passages or excerpts from the respective text documents 10.
  • each text document 10 is shown as being divided into four segments 12. However, this is not essential. As discussed above with reference to step s102, the text documents 10 are divided into segments 12 based on semantic similarity between consecutive sentences within their text. Therefore, the number of segments 12 generated from each document 10 will depend in part on the number of different topics and subjects discussed in each document 10.
  • the segments 12 obtained from the text documents 10 are summarised to generate a corresponding plurality of summaries 14.
  • abstractive summarisation is performed to reduce the length of the segments 12 whilst maintaining semantic meaning.
  • the summaries 14 may be seen as a plurality of relatively short text fragments that maintain the information in the original text documents 10 but which are easy to search and understand due to their reduced length.
  • the summaries 14 obtained from the documents 10 may be stored and searched.
• the summaries 14 may be stored and searched in the manner discussed above with reference to steps s104 to s108 of Figure 1.
  • the storage and search of summaries 14 will be discussed further with reference to the system and method shown schematically in Figure 3.
  • FIG. 3 shows a system 20 configured to perform the methods discussed herein.
  • the system 20 comprises a summarisation processor 22, a search processor 24 and a database 26 (or other suitable storage medium).
  • the summarisation processor 22, search processor 24 and database 26 are connected by communication lines 28.
  • the communication lines 28 may be wired or wireless and may be the internet or any other network.
  • the summarisation processor 22 is configured to perform the preliminary process 100a described above with reference to Figure 1.
  • the summarisation processor 22 receives a plurality of text documents 10 as shown by arrow D.
  • the summarisation processor 22 divides the text documents 10 into segments (not shown) and summarises the segments to produce a plurality of summaries 14.
  • the summaries 14 are stored in the database 26 (although other storage is possible). These processes are in accordance with steps s101 to s104 discussed above.
  • the search processor 24 is configured to perform the search process 100b discussed above with reference to steps s105 to s108 of Figure 1.
  • the search processor 24 receives search queries from a user or other system as shown by arrow Q.
  • the search processor 24 identifies a subset of the summaries from the plurality of summaries 14 stored on the database 26 that are relevant to the search query.
  • the search processor 24 outputs (i.e. returns) this subset of identified summaries 16 to the user or other system as shown by arrow O.
  • the search processor 24 may attempt to match keywords within the search query Q with words in the stored summaries 14.
  • the processor 24 may attempt to match stored summaries 14 that correspond to a natural language query or stored summaries 14 that are similar or correspond to an embedding within the search query.
  • the search processor 24 may limit its search to stored summaries 14 that meet one or more filters or other criteria defined in the search query.
  • the filters may define for example, appropriate date ranges or sources from which summaries should be identified and returned by the system 20.
  • the search processor 24 may contact the database 26 via the summarisation processor 22 or directly using a further communication line (not shown).
  • the search processor 24 is further configured to cluster the identified summaries 16.
• the search processor 24 arranges the identified summaries 16 into clusters based on their semantic similarities and outputs the identified summaries 16 in these clusters. As shown, five summaries 16 have been identified as relevant to the search query Q and they are returned in two clusters 18. An exemplary approach for returning the search results will be discussed below with reference to Figure 4.
• the search processor 24 may further order the clusters 18 into a cluster hierarchy and output the clusters 18 in an order or format based on this hierarchy.
  • Each cluster 18 may be arranged in the cluster hierarchy based on its relevance to the search query Q and/or their similarity to the other clusters.
• the search processor 24 may generate a title for each cluster 18 and output the title in combination with the cluster 18.
  • the titles may be created by recursively summarising the identified summaries 16 within each cluster 18 and then generating a title from the created cluster summary using a title generator tool.
  • the summarisation processor 22, search processor 24 and database 26 may be located remotely or locally to each other. In addition, one or more of these components may be located remotely from the systems and users that provide text documents 10, that issue search queries and that receive search results.
  • the summarisation processor 22 and search processor 24 are separate processors. However, this is not essential and their functions could alternatively be implemented on a single processor.
  • the architecture shown in Figure 3 should be seen as exemplary rather than restrictive.
  • Figure 4 shows schematically an example of how identified summaries that correspond to a search query may be returned.
• the figure shows a simplified electronic report 30, containing the identified summaries, that may be returned by the systems discussed herein.
  • the report 30 is a graphical user interface displayed by a monitor, screen, touchscreen or other suitable device.
  • the plurality of identified summaries 43 are returned together with their summary titles 42.
  • the identified summaries 43 are arranged in clusters and the report includes links to, and is able to display, the original documents from which each identified summary 43 is taken.
• the report 30 comprises three panels 31, 32, 33.
  • a cluster panel 31 is shown on the left of the report 30.
  • the cluster panel 31 is configured to list cluster titles 41 corresponding to a plurality of clusters that each contain one or more identified summaries 43.
  • the system is configured to receive a selection of a cluster title 41.
• When a selection of a cluster title 41 is received, as shown by the hashed cluster title 41a, the list shown in the cluster panel 31 may be expanded to show the summary titles 42 of the identified summaries 43 within the corresponding cluster.
  • a selection may be received using a mouse, keyboard, touchscreen or any other suitable input device.
• the summary titles 42 of the identified summaries 43 within the cluster are listed in an indented manner below the selected cluster title 41a to avoid confusion with the cluster titles 41.
  • Other options to differentiate the appearance of the summary titles 42 and cluster titles 41 in the cluster panel 31 are also possible (e.g. colour, typeface, font, size, etc.).
  • the selected cluster title 41a may be visually emphasised or highlighted in comparison to the other cluster titles 41 in response to its selection.
  • a summary panel 32 is shown in the centre of the report 30.
  • the summary panel 32 is configured to show the text of the identified summaries 43, and their corresponding summary titles 42, for the cluster corresponding to the selected cluster title 41a.
  • the relatively short identified summaries are easy to understand and offer rapid understanding of the results of a search query.
  • the system is further configured to receive a selection of an identified summary 43 shown within the summary panel 32.
  • the selection may be received using a mouse, keyboard, touchscreen or any other suitable input device.
  • the selected identified summary 43a and its summary title 44 may be visually emphasised or highlighted in the report 30 in comparison to the other identified summaries 43 as shown by the hashed fill in Figure 4.
  • the identified summaries 43 may be indented relative to the summary titles 42.
  • Other options to differentiate the appearance of the selected identified summary 43a in comparison to the other identified summaries 43, and the identified summaries 43 relative to their summary titles 42, are also possible (e.g. colour, typeface, font, size, etc.).
  • a document panel 33 is shown on the right of the report 30.
  • the document panel 33 is configured to display the original text document 44 from which the selected identified summary 43a has been generated. Therefore, a user is able to easily investigate the original text document 44 based on the identified summaries returned within the report 30. This further helps a user understand the results of a search query.
  • the report 30 shown in Figure 4 is a particularly effective approach for returning identified summaries to a user.
  • a user is able to quickly navigate between clusters, to rapidly understand the contents of the clusters based on their titles and text, and investigate the source text documents if necessary. Users may receive and understand information particularly quickly. Nevertheless, the results of a search query may be returned in various other manners including as a text report.
  • the method 200 of Figure 5 will appear very similar to the method shown in Figure 1.
  • Search queries are input to the system and summaries that correspond to each search query are returned, optionally arranged in clusters as discussed above.
  • the search is performed against the document segments.
  • the summarisation step is only performed on document segments that have been identified to correspond to the search query. This maintains the benefits of improved search results whilst offering more efficient use of computational resources.
  • the preliminary (pre-processing) process 200a can be quicker and less computationally expensive, which can be particularly important where the set of text documents to be searched is particularly large.
  • the overall number of document segments which need to be summarised can be reduced in situations where the number of searches to be performed is relatively low in comparison to the number of documents and document segments to be stored. Storage requirements can also be reduced as there is no need to store summaries for all document segments.
  • the flow 200 shown in Figure 5 begins with the ingestion of text data and the segmentation of this text data into smaller segments.
  • in step s201 one or more text documents are obtained.
  • in step s202 each of these text documents is segmented into a plurality of document segments.
  • the text documents may be segmented based on the semantic similarity of consecutive sentences.
  • the text documents may be segmented based on a predetermined length for each segment.
  • the document may be split into a new segment after a predetermined number of characters, words or sentences (e.g. every 400 characters or every 8 sentences).
  • the document may be split into a new segment at the end of the word or sentence in which a character corresponding to the predetermined number of characters occurs (e.g. at the end of the word or sentence in which the 400th character occurs) or at the end of the sentence which includes the word corresponding to a predetermined number of words for each segment (e.g. at the end of the sentence containing the 80th word).
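The length-based splitting rule described above can be sketched as follows. This is a minimal illustration, not the specified implementation; the function name and the 400-character default are assumptions, and the input is taken to be a pre-divided list of sentences:

```python
def segment_by_length(sentences, max_chars=400):
    # Accumulate whole sentences; close the current segment at the end of
    # the sentence in which the character limit is reached.
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence)
        if count >= max_chars:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:                      # flush any trailing partial segment
        segments.append(" ".join(current))
    return segments
```

Because segments only close at sentence boundaries, each segment may slightly exceed the character limit, matching the behaviour described above.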
  • the text documents are obtained and the segments are generated in a preliminary (pre-processing) procedure 200a before a search procedure 200b occurs.
  • the generated document segments may be stored (e.g. in a database, cloud container or other storage medium) for subsequent searching.
  • the predetermined length for the segments may be in the range from 200 to 1000 characters, preferably from 250 to 800 characters, more preferably still from 300 to 600 characters. Equally, the predetermined length for the segments may be in the range from 4 to 20 sentences and preferably from 5 to 16 sentences.
  • the method may further include step s203 of generating a segment embedding for each of the document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment.
  • segment embeddings may be subsequently used in search and clustering steps. After their generation, the segment embeddings may be stored in any suitable storage medium such as a vector database or cloud container for later use.
  • the segment embeddings may be generated from metadata embeddings - i.e. numerical vectors representing metadata associated with the respective document segment or with a part of the respective document segment.
  • Suitable metadata embeddings include one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the document segments, wherein the sentence embeddings are numerical vectors representing the respective sentences, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences.
  • the segment embedding may be considered a “compound embedding”, calculated as an average or a weighted average of said metadata embeddings.
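A weighted-average “compound embedding” of the kind described above can be sketched as follows. The function name and the 0.6/0.2/0.2 weights are illustrative assumptions; each input is a matrix of metadata embeddings of one type (sentence, keyword, named entity):

```python
import numpy as np

def compound_segment_embedding(sentence_embs, keyword_embs, entity_embs,
                               weights=(0.6, 0.2, 0.2)):
    # Average each family of metadata embeddings, then combine the three
    # averages with normalised weights.
    parts = [np.mean(sentence_embs, axis=0),
             np.mean(keyword_embs, axis=0),
             np.mean(entity_embs, axis=0)]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # ensure the weights sum to 1
    return sum(wi * p for wi, p in zip(w, parts))
```

With equal weights this reduces to the plain (unweighted) average also contemplated above.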
  • segment embeddings may be derived or calculated directly from the respective document segments.
  • text of the document segment may be passed through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’, both of which are available from Hugging Face (RTM).
  • the method 200 continues with the search process 200b. Performing a search begins with the receipt of a search query in step s204. Subsequently, in step s205, the method involves identifying document segments that correspond to the search query. These steps correspond to steps s105 and s106 in method 100 of Figure 1 with the exception that the searching is performed against the document segments rather than summaries of these segments.
  • the search query may have any of the features and advantages discussed above with reference to Figure 1 .
  • the search query may include one or more keywords, one or more embeddings, and/or a natural language query.
  • the search query may additionally define one or more filters which restrict the summaries which should be searched and/or returned in response to the search query.
  • the step of identifying document segments which correspond to the search query may have any of the features and advantages discussed above with reference to the equivalent step in Figure 1.
  • a sample embedding contained within or derived from the search query may be compared to the segment embeddings, each of which represents a respective document segment.
  • the document segments which have a segment embedding that is sufficiently similar to, or most similar to, the sample embedding may be identified as corresponding to the search query. For example, all document segments whose similarity exceeds a predetermined threshold may be identified as corresponding to the search query, or a predetermined number of the most similar document segments may be identified as corresponding to the search query.
  • keywords or named entities in the search query may be compared to the text of each of the stored summaries, or a retrieval model configured to respond to a natural language query may identify summaries relevant to a natural language query.
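The embedding-based comparison described above (step s205) can be sketched as follows. This is a minimal sketch under assumed names; both the top-k and similarity-floor variants mentioned above are shown:

```python
import numpy as np

def identify_segments(query_emb, segment_embs, top_k=5, min_sim=None):
    # Normalise so the dot product equals the cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    m = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    sims = m @ q
    order = list(np.argsort(sims)[::-1])           # most similar first
    if min_sim is not None:                        # similarity-floor variant
        order = [i for i in order if sims[i] >= min_sim]
    return order[:top_k], sims                     # top-k variant
```

The returned indices identify the document segments corresponding to the search query; the raw similarities can be kept for later ranking or display.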
  • the method continues with step s206 of generating a summary for each identified document segment - i.e. for all of the document segments that correspond to the search query.
  • Each summary is shorter than the corresponding document segment, and therefore easier to understand when returned to a user.
  • the summarisation of each segment may occur in the manner described above with reference to step s103 in Figure 1 and offer corresponding benefits.
  • a summary of each identified document segment may be generated using a Large Language Model (LLM) such as ChatGPT produced by OpenAI (RTM).
  • ChatGPT has been found to provide rapid response times and high quality summaries.
  • other tools such as Distill-BART may also be used.
  • each summary is typically in the region of 2 to 6 sentences, and preferably 3 to 4 sentences long.
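Step s206 can be sketched as below. This is an illustrative wrapper, not the specified implementation: in a real system `llm_call` would invoke an LLM (e.g. ChatGPT via an API, or a local model such as Distill-BART) with the prompt; a naive extractive fallback (leading sentences) stands in when no LLM client is supplied:

```python
def summarise_segment(segment_text, llm_call=None, max_sentences=4):
    # Prompt asks for the 3-4 sentence length suggested in the text.
    prompt = "Summarise the following text in 3 to 4 sentences:\n\n" + segment_text
    if llm_call is not None:
        return llm_call(prompt)
    # Fallback: return the first few sentences verbatim (no LLM available).
    sentences = [s.strip() for s in segment_text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."
```

Each summary produced this way is shorter than its source segment, as required for the returned results.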
  • the summaries - i.e. the summaries of the document segments identified as corresponding to the search query - are returned in step s208.
  • This return of summaries is equivalent to step s208 of Figure 1 and may include any of the features and advantages previously discussed.
  • the summaries may be returned in the form of a list or report (e.g. according to the schematic report 30 shown in Figure 4).
  • the method may include the step s207 of arranging the summaries - i.e. the summaries of the document segments identified as corresponding to the search query - into a plurality of clusters based on the similarity of the identified segments and/or the corresponding summaries. Moreover, the summaries may be returned in these clusters.
  • the clustering process may be analogous to the clustering process discussed above with reference to step s107 in the method 100 of Figure 1.
  • the clustering process may include any of the sub-steps and features discussed previously and offer corresponding benefits.
  • the clustering process may be performed using either the segment embeddings identified for each segment in step s203 or using summary embeddings produced for each summary generated in step s206.
  • the generation of summary embeddings may use the approach discussed above with reference to Figures 1 to 3.
  • the summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings and/or the segment embeddings that correspond to each summary.
  • This process may use an agglomerative clustering algorithm. However, other clustering approaches may also be used.
  • the summary embeddings and/or segment embeddings may be obtained for these steps by accessing the embeddings from storage (e.g. by accessing the segment embeddings generated in step s203) or by generating the embeddings from the respective summaries or segments using the approaches discussed above.
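The clustering step can be sketched with a deliberately simplified greedy procedure, standing in for the agglomerative clustering named above (a production system would more likely use a library implementation); the 0.75 threshold is an assumed value:

```python
import numpy as np

def cluster_summaries(embs, threshold=0.75):
    # Each embedding joins the first cluster whose centroid it matches
    # above `threshold`; otherwise it starts a new cluster.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    clusters = []                        # each cluster is a list of indices
    for i, e in enumerate(embs):
        for members in clusters:
            centroid = embs[members].mean(axis=0)
            centroid = centroid / np.linalg.norm(centroid)
            if float(e @ centroid) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

The same embeddings (summary or segment) feed this step regardless of which variant of the method is used.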
  • the method may optionally involve arranging or structuring the clusters themselves in a cluster hierarchy.
  • Generating a cluster hierarchy may involve: calculating a plurality of cluster embeddings, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster and/or the segment embeddings of the identified document segments that correspond to the summaries contained within the respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings; and ordering the clusters based on their cluster embeddings and the average cluster embedding.
  • the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster (also termed the dominant cluster).
  • the model finds the next closest cluster to the starting cluster, and so on, until all clusters are ordered.
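The ordering steps above (cluster embeddings, dominant cluster, then nearest-neighbour chaining) can be sketched as follows; the function name is an assumption:

```python
import numpy as np

def order_clusters(cluster_embs):
    # Start from the dominant cluster (closest to the average of all
    # cluster embeddings), then repeatedly append the nearest remaining one.
    cluster_embs = np.asarray(cluster_embs, dtype=float)
    avg = cluster_embs.mean(axis=0)
    remaining = list(range(len(cluster_embs)))
    current = min(remaining, key=lambda i: np.linalg.norm(cluster_embs[i] - avg))
    order = [current]
    remaining.remove(current)
    while remaining:
        last = cluster_embs[order[-1]]
        current = min(remaining, key=lambda i: np.linalg.norm(cluster_embs[i] - last))
        order.append(current)
        remaining.remove(current)
    return order
```

Euclidean distance is used here for brevity; a cosine-based metric could be substituted without changing the structure of the procedure.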
  • Summaries may also be ordered within the clusters as previously discussed. This ordering process may use the steps discussed above with reference to Figure 1 , but can be based on either summary embeddings generated for each of the summaries and/or on the segment embeddings of the document segments corresponding to each summary.
  • titles can be generated for each of the summaries to be returned.
  • cluster titles and/or cluster summaries may be generated for the clusters. Suitable processes for these steps are discussed above. The summaries and clusters may be returned with these features.
  • in step s301 one or more text documents are obtained.
  • in step s302 the one or more text documents are segmented into a plurality of document segments.
  • in step s303 segment embeddings are generated for each document segment. Further details of these steps are discussed above with reference to Figure 5.
  • a search query is received in step s304 and document segments that correspond to the search query are identified in step s305. Again these steps are identical or equivalent to steps s204 and s205. Further details of these steps are discussed above with reference to Figure 5.
  • the identified document segments are arranged into a plurality of clusters based on the similarity of the identified document segments.
  • the clustering may be performed based on the segment embeddings of the identified document segments and be consistent with the corresponding clustering processes discussed above. Furthermore, the clusters may be ordered and the document segments within each cluster ordered as previously discussed.
  • a cluster summary for each cluster is generated based on the identified document segments within said cluster, wherein each cluster summary is shorter than the identified document segments within said cluster.
  • the cluster summary may be generated by summarising a dominant document segment within said cluster (i.e. the document segment with a segment embedding most similar to the average of the segment embeddings of the cluster) or by summarising a concatenation of document segments from within the cluster. Equally, the cluster and/or the document segments may be titled using the mechanisms discussed above.
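Selecting the dominant document segment of a cluster can be sketched as below (the name `dominant_segment` is an assumption); its summary can then serve as the cluster summary:

```python
import numpy as np

def dominant_segment(segment_embs, segments):
    # The dominant segment is the one whose embedding is closest to the
    # average of the cluster's segment embeddings.
    segment_embs = np.asarray(segment_embs, dtype=float)
    avg = segment_embs.mean(axis=0)
    dists = np.linalg.norm(segment_embs - avg, axis=1)
    return segments[int(np.argmin(dists))]
```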
  • the identified document segments (i.e. the document segments identified as corresponding to the search query) are returned arranged in their clusters.
  • the clusters are returned with the corresponding cluster summaries and with any cluster or document segment titles.
  • any of the functionality described in this text or illustrated in the figures can be implemented using software, firmware (e.g., fixed logic circuitry), programmable or non-programmable hardware, or a combination of these implementations.
  • the terms “component” or “function” as used herein generally represent software, firmware, hardware or a combination of these.
  • the terms “component” or “function” may refer to program code that performs specified tasks when executed on a processing device or devices.
  • the illustrated separation of components and functions into distinct units within the block diagrams above may reflect an actual physical grouping and allocation of such software and/or hardware, or can correspond to a conceptual allocation of different tasks performed by a single software program and/or hardware unit.
  • the various processes described herein can be implemented on the same processor or different processors in any combination.
  • methods and processes described herein can be embodied as code (e.g., software code) and/or data.
  • code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system.
  • a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium.
  • the code may be executed by a processor (e.g., a processor of a computer system or data storage system).
  • Computer-readable media include removable and nonremovable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment.
  • a computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that is capable of storing computer-readable information/data.
  • Computer-readable media should not be construed or interpreted to include any propagating signals.

Abstract

The invention relates to methods and systems for searching large numbers of text documents and other unstructured data. The method involves obtaining one or more text documents, segmenting the one or more text documents into a plurality of document segments, receiving a search query, returning one or more summaries of respective document segments within the plurality of document segments, wherein the summaries and/or the corresponding document segments have been identified to correspond to the search query, wherein each returned summary is shorter than its respective document segment.

Description

METHODS AND SYSTEMS FOR SEARCH AND INFORMATION RETRIEVAL
FIELD OF THE INVENTION
The invention relates to methods and systems for searching large numbers of text documents and other unstructured data. The methods reduce the computational resources required during search processes and avoid loss of information typically associated with condensing data into shorter formats. Retrieving information from large bodies of text data is simplified and made more efficient.
BACKGROUND
Huge amounts of publicly available data exist, and increasing amounts are generated on a daily basis. Searching this information is extremely resource intensive. Moreover, when faced with a query it is difficult to extract relevant or valuable information from a large corpus of unstructured data.
Equally, whilst many organisations store large amounts of data internally, they lack an effective way to search and order this data. Whilst answers to questions may exist in their data, this information may be very difficult to retrieve.
Conventional approaches for searching text are based on matching keywords within a search query with matching or similar words within the text to be searched. Individual instances of the keywords and sections of documents containing the keywords are returned to a user.
However, storing and accessing large bodies of data during searching is very computationally intensive.
Moreover, the results returned using conventional keyword searches can be difficult to interpret. It is not always easy to understand the intended meaning of a document or a section of a document in which an isolated keyword is located. As such, it is possible to become confused or to overlook important insights when presented within traditional search results. There is a need for improved systems and methods that overcome at least some of the disadvantages and limitations discussed above.
SUMMARY OF INVENTION
The present invention matches search queries against text summaries taken from ingested documents. The summaries are created by segmenting text within the documents based on the similarity of consecutive sentences and then condensing the resulting segments. In response to the search, a set of summaries are returned (i.e. output) that have been identified as matching the query or for which the corresponding document segments have been identified as matching the query.
This approach provides an efficient search process whilst avoiding loss of significant information content in the search results. The methods allow for the rapid and efficient retrieval of information from large collections of text documents in response to a search query.
In accordance with an aspect of the invention there is provided a method performed by one or more processors, in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments; receiving a search query; returning one or more summaries of respective document segments within the plurality of document segments, wherein the summaries and/or the corresponding document segments have been identified to correspond to the search query; wherein each returned summary is shorter than its respective document segment.
Preferably the one or more processors perform the further steps of either: in response to receiving the search query, identifying document segments that correspond to the search query, and generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment, and wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the summaries generated for the identified document segments; or, prior to receiving the search query, generating a summary for each document segment in the plurality of document segments and storing said summaries, wherein each summary is shorter than the corresponding document segment, and, in response to receiving the search query, identifying summaries that correspond to the search query, and, wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the identified summaries.
These methods require reduced computational resources in comparison to traditional search methodologies as there is no need to access and search the full body of all text documents in response to each search query. Instead, only the summaries and/or text segments must be searched. Furthermore, the amount of data that is returned in response to a search query is reduced, since the summaries can be returned rather than excerpts of the full text. Therefore, searching can be quicker and less computationally expensive. Storage requirements can also be reduced as there may be no need to store the full text of all ingested text documents. As such, it will be seen that the invention provides an alternative data structure that supports improved searching in view of the technical issue of the amount of computational resources required to search large numbers of text documents and large amounts of data.
This approach also preserves high levels of the information content within the original text documents and can return (i.e. output) substantially all the information within the original documents in response to a query. If an entire document were summarised to provide a document-level summary, significant amounts of information would be lost. Important information that is contained in only a small portion of the document can be lost when summarising entire documents. As such, the methods discussed herein offer accurate search results whilst requiring fewer resources than traditional search methodologies.
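The "identify then summarise" variant described above can be sketched end-to-end as follows. All names are illustrative assumptions; `summarise` is any callable (e.g. an LLM wrapper) and the embeddings are assumed to be precomputed:

```python
import numpy as np

def search_and_summarise(segments, seg_embs, query_emb, summarise, top_k=3):
    # Match the query embedding against the segment embeddings, then
    # summarise only the matching segments.
    m = seg_embs / np.linalg.norm(seg_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = m @ q                                   # cosine similarities
    best = np.argsort(sims)[::-1][:top_k]          # most similar first
    return [(int(i), summarise(segments[int(i)])) for i in best]
```

Because only the matching segments are summarised, the full corpus is never summarised up front, reflecting the resource savings discussed above.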
Preferably the method involves the processors performing the further steps of arranging the summaries to be returned into a plurality of clusters based on the similarity of said summaries and returning said summaries arranged in their clusters. More preferably still the method involves the processors performing the further steps of comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein, wherein the clusters are returned in accordance with the cluster hierarchy.
In some embodiments each text document is segmented based on the similarity of consecutive sentences in the document. By segmenting each document based on the similarity of consecutive sentences, a set of segments can be created that each relate to a singular topic. This allows for substantially all information within the original document to be split out into segments before summarisation occurs. Therefore, relatively little information is lost during the segmenting and summarisation process and substantially all of the information in the originally ingested documents can be efficiently interrogated and searched. In addition, the search results produced through this process - the identified summaries that are relevant to the search query - are short and easy to comprehend. Therefore, a user may easily identify information from the identified summaries returned in response to their query. As will be discussed further below, in preferred embodiments, the identified summaries may be automatically grouped and returned in appropriate clusters which further aids understanding.
Preferably each text document is segmented based on the similarity of sentences to the at least one preceding sentence, and wherein preferably each text document is segmented based on the similarity of each sentence to the two or three preceding sentences. As such, the documents may be segmented based on the semantic similarity between two sentences, or each block of three or more consecutive sentences (e.g. 3 to 6 consecutive sentences). In particularly preferred examples, each text document is segmented based on the similarity of each sentence to its two, three, four or five preceding sentences.
Preferably segmenting the one or more text documents into a plurality of document segments comprises: dividing each text document into a plurality of sentences; generating sentence embeddings for each sentence within the plurality of sentences, wherein the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings; and iteratively: (a) determining a cosine similarity between a sentence embedding of a given sentence and a sentence embedding for the sentence preceding the given sentence or between a sentence embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence; (b) comparing the cosine similarity to a segmentation threshold; and (c) in response to determining that said cosine similarity is lower than the segmentation threshold, determining that the given sentence is the start of a new segment; whereas when said cosine similarity is greater than the segmentation threshold, the given sentence is included in the same segment as the sentence that directly precedes the given sentence; wherein steps (a) to (c) are repeated for different given sentences in each document.
This approach compares each given sentence to its preceding sentence(s). In preferred examples the text documents are segmented based on the semantic similarity of sets of three or more consecutive sentences (the given sentence and its two or more preceding sentences). However, in further examples segmentation may comprise determining the similarity between one or more given sentences and the single immediately preceding sentence.
Where the given sentence is sufficiently semantically similar to the preceding sentences, it is included in the existing segment which includes at least the immediately preceding sentence. Where the given sentence is semantically different from the preceding sentence a new segment is created which begins with this new sentence. Where new segments are created the iterative process may be adjusted such that subsequent sentences are only compared to sentences within the new segment, rather than semantically different sentences from a previous segment.
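Steps (a) to (c) above, including the position-based weighting of preceding sentences, can be sketched as follows. The function name, the two-sentence window and the 2:1 weighting are illustrative assumptions consistent with the preferred examples in the text:

```python
import numpy as np

def segment_by_similarity(sentence_embs, threshold, window=2, weights=(2.0, 1.0)):
    # weights[0] applies to the immediately preceding (n-1th) sentence,
    # weights[1] to the n-2th: the nearer sentence is weighted more heavily.
    embs = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    boundaries = [0]                     # index of the first sentence of each segment
    for n in range(1, len(embs)):
        start = boundaries[-1]           # compare only within the current segment
        prev = embs[max(start, n - window):n]
        w = np.asarray(weights[:len(prev)][::-1], dtype=float)
        avg = (w[:, None] * prev).sum(axis=0) / w.sum()
        avg = avg / np.linalg.norm(avg)
        if float(embs[n] @ avg) < threshold:
            boundaries.append(n)         # similarity below threshold: new segment
    return boundaries
```

Restricting the comparison window to the current segment implements the adjustment noted above, whereby sentences after a boundary are not compared to semantically different sentences from the previous segment.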
The sentences within the text documents may be identified using a sentencizer. Sentencizers are tools that divide unstructured text into separate sentences based on (for instance) punctuation, line breaks and the semantic meaning of text. The sentencizer may be a model trained on labelled sentence data, or may use writing heuristics, to identify sentence boundaries. Suitable sentencizers include examples from NLTK, SpaCy, and PySBD.
The sentence embeddings may be generated from the respective sentences by a sentence transformer, such as Sentence-BERT. Other suitable models for generating sentence embeddings could alternatively be used including embedding APIs produced by OpenAI.
Using cosine similarities is a particularly computationally efficient method to identify similarities between the vectors of the sentence embeddings. However, in further examples, alternative metrics for similarity between vectors could be used, such as Euclidean distance or Manhattan distance. For example, the method may instead involve determining a Euclidean or Manhattan distance between the sentence embedding of the given sentence and the average of the embeddings for the at least two preceding sentences, comparing that distance to a segmentation threshold, and determining on the basis of this comparison whether the given sentence should be included in the same segment as its preceding sentence.
Where the method comprises the step of determining a cosine similarity between a sentence embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence, preferably the average of sentence embeddings is calculated as a weighted average in which the sentence embeddings of the two or more preceding sentences are weighted based on the position of their respective sentence in the document relative to the given sentence. By the position of the sentences it is understood that their embeddings are weighted depending on how close they are to the given sentence. Sentences that are further from the given sentence are less heavily weighted.
For instance, the sentence embedding of the sentence which immediately precedes the given sentence may be weighted more or less highly than the sentence embedding of the sentence that immediately precedes it. Placing additional emphasis on the immediately preceding sentence offers increased accuracy when segmenting. The similarity between a given sentence and the immediately preceding sentence is a particularly strong indicator whether said given sentence should be included as part of the same segment as the preceding sentence or whether a new segment should be created. The weighting may be predefined by a user or a manufacturer of the system. For example, where attempting to determine whether a given sentence (the nth sentence) should be included in a segment based on the similarity between the given sentence and the two or more preceding sentences, the sentence embedding of the immediately preceding sentence (the n-1 th sentence) may be weighted at least 1.2 times, and preferably 2 times greater, than the sentence embedding of the n-2th sentence which precedes said immediately preceding sentence.
The segmentation threshold may be predefined or determined dynamically for different documents. These thresholds allow segments of appropriate length to be created.
Where segmentation is performed based on the cosine similarity of each sentence to the one or more preceding sentences, preferably the segmentation threshold is in the range of 0.1 to 0.5 and more preferably from 0.2 to 0.4. For example, the segmentation threshold may be predetermined or set by a user or a system designer as a value between 0.2 and 0.4. However, a single figure is often not suitable for all documents since different documents will tend to have different distributions of cosine similarities between adjacent sentence groups. Using a fixed segmentation threshold can result in a particularly large number of very short summaries and/or a small number of very long summaries. Therefore, in some preferred examples the segmentation threshold is determined dynamically, such that a different segmentation threshold is determined for each text document.
Setting a segmentation threshold dynamically - such that the segmentation threshold is specific to each text document - may involve calculating the cosine similarities between the sentence embeddings of all sentences within said text document, or the cosine similarities between the sentence embeddings of the sentences within all groups of two, three or more consecutive sentences within said document. The method may then involve calculating the mean and/or the standard deviation of these cosine similarities. The segmentation threshold may be set based on the mean and/or standard deviation of these cosine similarities. For example, the segmentation threshold may be set as the mean of the cosine similarity scores, the mean of the cosine similarity scores minus their standard deviation, or the mean of the cosine similarity scores minus the standard deviation multiplied by an empirically set factor (e.g. a factor in a range from 0.2 to 0.8, and more preferably in the range from 0.25 to 0.5). Using the mean of the cosine similarity scores minus the standard deviation multiplied by an empirically set factor in the range from 0.25 to 0.5 is particularly preferred as it provides summaries of appropriate lengths for a large range of documents. When calculating the mean and standard deviation it may be assumed that the cosine similarity scores follow a Gaussian distribution. The processes described above may also be implemented using an alternative similarity metric - e.g. the Manhattan or Euclidean distances between sentences within the text document in question.
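The dynamic threshold described above (mean minus an empirically set fraction of the standard deviation) might be computed as follows; the default factor of 0.35 is an assumption within the preferred 0.25 to 0.5 range:

```python
import math

def dynamic_threshold(similarities, factor=0.35):
    """Per-document segmentation threshold: mean minus factor * std of the
    document's adjacent-sentence cosine similarities."""
    n = len(similarities)
    mean = sum(similarities) / n
    variance = sum((s - mean) ** 2 for s in similarities) / n
    return mean - factor * math.sqrt(variance)
```

A document whose adjacent-sentence similarities vary widely thus receives a lower threshold, avoiding the over-segmentation that a fixed threshold can cause.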
The summaries generated from the document segments are shorter than the corresponding document segments but preferably retain substantially all the semantic meaning of the longer segments. Therefore, searches may be made more efficient without losing information. Preferably, generating a summary for each document segment comprises abstractive summarisation. Abstractive summarisation rephrases text in a shorter form whilst retaining its semantic meaning. Suitable tools for performing the abstractive summarisation include DistilBART. Although abstractive summarisation can generate new phrases and wording, preferably a relatively large amount of the original wording of each segment is retained to reduce the risk that information is lost. Another suitable summarisation model could alternatively be used, such as GPT-4 from OpenAI (RTM).
Alternatively, summaries of text could be created using extractive summarisation - a process in which existing sentences or phrases are isolated from the original text and concatenated or otherwise combined together - or any other form of summarisation.
In further examples, segmenting the one or more text documents comprises dividing each text document based on a predetermined length for each segment. Preferably each text document is repeatedly split into a new segment after a predetermined number of characters, words or sentences. For example, the document may be iteratively divided into segments after a predetermined number of characters in the range from 200 to 1000 characters, preferably from 250 to 800 characters, and more preferably still from 300 to 600 characters. Therefore, the text document may be divided into a set of text segments with fixed lengths corresponding to the predetermined length, with the exception of the final text segment, which includes the remainder of the document once the earlier segments are formed. In further examples, each text document may be repeatedly split into a new segment at the end of the word or sentence containing a character corresponding to a predetermined number of characters for each segment, or at the end of a sentence containing a word corresponding to a predetermined number of words for each segment. In such examples the text document is divided into a series of text segments which are approximately equal to (just greater than) the predetermined length for each segment.
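The length-based variant that splits at the end of the sentence containing the boundary character could be sketched as follows (a naive illustration that treats a full stop as the sentence boundary; a real implementation would use a proper sentencizer):

```python
def segment_by_length(text, max_chars=500):
    """Split text into segments of roughly max_chars characters, extending
    each segment to the end of the sentence containing the boundary so
    sentences are never cut in half."""
    segments = []
    start = 0
    while start < len(text):
        end = start + max_chars
        if end >= len(text):
            segments.append(text[start:])
            break
        # extend to the next full stop (naive sentence-boundary heuristic)
        stop = text.find('.', end)
        end = stop + 1 if stop != -1 else len(text)
        segments.append(text[start:end])
        start = end
    return segments
```

Each segment is therefore just greater than `max_chars` in length, except the final segment, which holds the remainder of the document.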
In particularly preferred examples each summary comprises from 2 to 6 sentences and preferably from 3 to 4 sentences and/or each segment comprises from 4 to 20 sentences and preferably from 5 to 16 sentences. Additionally or alternatively, each document segment may comprise from 200 to 1000 characters, preferably from 250 to 800 characters, more preferably still from 300 to 600 characters. These lengths provide a strong compromise between search efficiency, information loss and ease of information retrieval. The thresholds and parameters used during segmentation and summarisation may be selected or adjusted to maintain the sizes of segments and summaries within these ranges.
In accordance with a further aspect of the invention there is provided a method performed by one or more processors, in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and, returning the identified summaries that correspond to the search query. As will be seen, this method is based on the aspect of the invention discussed above and relates to methods where the segmentation of text documents and summarisation of segments are performed before searching is performed.
This method requires reduced computational resources in comparison to traditional search methodologies as there is no need to access and search the full body of all text documents in response to each search query. Instead, only the summaries must be searched. Furthermore, the amount of data that is returned in response to a search query is reduced, since the summaries can be returned rather than excerpts of the full text. Therefore, searching can be quicker and less computationally expensive. Storage requirements can also be reduced as there may be no need to store the full text of all ingested text documents. As such, it will be seen that the invention provides an alternative data structure that supports improved searching in view of the technical issue of the amount of computational resources required to search large numbers of text documents and large amounts of data.
This approach also preserves high levels of the information content within the original text documents and can return (i.e. output) substantially all the information within the original documents in response to a query. If an entire document were summarised to provide a document-level summary, significant amounts of information would be lost. Important information that is contained in only a small portion of the document can be lost when summarising entire documents. As such, the methods discussed herein offer accurate search results whilst requiring fewer resources than traditional search methodologies. By segmenting each document based on the similarity of consecutive sentences, a set of segments can be created that each relate to a singular topic. This allows for substantially all information within the original document to be split out into segments before summarisation occurs. Therefore, relatively little information is lost during the segmenting and summarisation process and substantially all of the information in the originally ingested documents can be efficiently interrogated and searched.
In addition, the search results produced through this process - the identified summaries that are relevant to the search query - are short and easy to comprehend. Therefore, a user may easily identify information from the identified summaries returned in response to their query. As will be discussed further below, in preferred embodiments, the identified summaries may be automatically grouped and returned in appropriate clusters which further aids understanding.
It will be seen that the method steps according to this aspect can be divided into two processes: a preliminary process in which summaries are generated from text documents; and a search process in which search queries are received and relevant summaries are identified and returned.
The preliminary process may involve the steps discussed above of obtaining the one or more text documents, segmenting the one or more text documents into a plurality of document segments, generating summaries for each document segment, and storing the summaries. The search process may involve the steps discussed above of receiving a search query (e.g. a query comprising one or more keywords, a natural language query, and/or a query comprising an embedding that represents suitable summaries numerically), identifying stored summaries that correspond to the search query, and returning the identified summaries that correspond to the search query.
The steps of the preliminary process and the search process may be performed by a single processor or by different processors. Where the preliminary process and search process are performed by different processors, the processors may be located either locally or remotely from one another. These different processors may be connected by any suitable means, including the internet or another network.
Performing segmentation and summarisation, and storing the summaries during an initial step before receiving any search queries reduces search latency significantly and avoids the need to perform these computationally expensive steps each time a query is received. Indeed, the segmentation and summarisation steps tend to take significantly more time and resources than the search process and optional clustering process (discussed further below). Searching and clustering (if performed) take relatively few resources and can be performed in real time after receiving a search query and without a substantial delay.
Although the steps of the method may be conceptually divided into a preliminary process and a subsequent search process, in practice the steps of the method may be performed in various different orders where appropriate. For example, steps of the preliminary process and search process may be performed consecutively in the same flow, sequentially, or in parallel.
The method starts with obtaining one or more text documents which are intended to be searched. The method may be used with a wide range of data from a wide range of sources, both public and private.
The documents may take the form of substantially any written text including articles, books, scientific papers and other texts. However, this is not essential. For instance, obtaining one or more text documents may additionally include:
• transcribing a video and/or audio file;
• extracting text from a webpage, spreadsheet, presentation or an unstructured data file;
• and/or translating a text document into a new language.
Text documents (or the files from which text documents are generated using the steps of transcription, extraction, translation, etc.) may be obtained from an external source. An external source is a source that is remote from the system and processors performing the method. However, in some cases the source may be local. In preferred examples, the source is a cloud-based source such as a proprietary SharePoint site, a proprietary cloud storage container, or an S3 bucket (a cloud object storage service offered by Amazon Web Services (RTM)). However, this is not essential and the text documents, summaries and metadata discussed herein may be stored in substantially any suitable system for data storage. In further examples, text documents are obtained via an API allowing access to an external data supplier. As discussed above, the text documents may be publicly available or may be private documents supplied by an organisation that wishes to perform searching using the methods discussed herein.
As mentioned previously, preferably the text documents are segmented based on the semantic similarity of consecutive sentences in the document. As such, the document is divided up based on the shared meaning and content of the consecutive sentences. Consequently, each segment tends to refer to a single topic or subject. As the text changes topic or subject a new segment will be created. Since each segment is directed to a single semantic topic, the subsequent summarisation process will retain a greater proportion of the original information than would be achieved otherwise.
Alternatively, summaries of text could be created using extractive summarisation - a process in which existing sentences or phrases are isolated from the original text and concatenated or otherwise combined together - or any other form of summarisation. In further examples, segmenting the one or more text documents comprises dividing each text document based on a predetermined length for each segment.
Further detail of the segmentation process has been provided above in relation to the previous aspect of the invention. The summaries of the text segments created through the previously discussed methods, and any associated metadata, may be stored in a database or any other suitable storage. The storage may be located within, local to, or remote from the one or more processors. The storage may be located in the cloud, such as when the storage is an S3 bucket, and/or connected to the processors by the internet, a LAN or another suitable network.
The stored summaries may subsequently be searched. The search query may comprise one or more keywords, a natural language query or a sample embedding to be compared to the stored summaries. However, other means for prompting models may also be used. Summaries that contain content relevant to the search query can be identified and returned as results. Any suitable search algorithm may be used, such as the BM25 algorithm or a retrieval model configured to respond to a natural language query.
For example, the search query may comprise one or more keywords, and the method may comprise identifying one or more stored summaries that correspond to the keywords; and/or the search query may comprise a natural language query, and the method may comprise identifying one or more stored summaries that correspond to the natural language query; and/or the search query may comprise a sample embedding, and the method may comprise identifying one or more stored summaries that have summary embeddings that are similar to or match the sample embedding (e.g. by calculating the cosine similarity between the sample embedding and the summary embeddings of the stored summaries).
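The sample-embedding variant of the search might be sketched as below; the function name, the `top_k` cutoff and the dictionary layout of the stored summaries are illustrative assumptions, not the claimed implementation:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def find_matching_summaries(sample_embedding, stored, top_k=3):
    """Return the stored summaries whose precomputed summary embeddings are
    most similar to the sample embedding in the query. `stored` maps each
    summary's text to its summary embedding."""
    scored = [(cosine_similarity(sample_embedding, emb), text)
              for text, emb in stored.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]
```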
In preferred examples the search query may be received from a user and the identified summaries matching the query may be returned to the user. However, this is not essential and the query may be automatically generated by a machine and/or the results may be returned to a machine for further processing.
The search query may include one or more filters defining which of the stored summaries should be searched and/or returned in the method. In response to receiving a search query with such filters, the method may comprise searching a subset of the stored summaries based on the filter. For instance, a search query may include (but is not limited to):
• a date filter that restricts the search to summaries from a document within a defined date range, restricts the search to summaries created within a defined date range and/or restricts the search to documents or summaries created either before or after a specific date;
• a file origin filter that restricts the search to summaries produced from internal data, external data, or a combination of the two;
• a trusted filter indicating whether to search summaries based on trusted and/or non-trusted documents;
• a source filter indicating whether to include or exclude from the search summaries based on documents from specific sources;
• a country of origin filter indicating whether to include or exclude from the search summaries based on documents from specific countries (e.g. summaries based on webpages with a specific country of origin of their URL);
• an institution filter indicating whether to include or exclude from the search summaries based on documents from specific institutions or from specific types of institutions (e.g. private, university, government, etc.);
• a publisher filter indicating whether to include or exclude one or more summaries from the search based on the publisher of the document(s) on which the summaries are based;
• a file type filter indicating whether to include or exclude one or more summaries from the search based on the file type of the document(s) on which the summaries are based (e.g. pdf, webpage, presentation, video file, audio file, etc.).
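The filtering of stored summaries prior to searching could be sketched as follows; the metadata field names (`date`, `source`) and the parameters shown are illustrative assumptions covering only two of the filter types listed above:

```python
def apply_filters(summaries, date_range=None, sources_excluded=()):
    """Restrict the set of stored summaries searched, based on metadata.
    Each summary is a dict carrying a 'date' (ISO string, so lexicographic
    comparison matches chronological order) and a 'source' field."""
    results = []
    for s in summaries:
        if date_range and not (date_range[0] <= s["date"] <= date_range[1]):
            continue  # outside the requested date range
        if s["source"] in sources_excluded:
            continue  # source explicitly excluded by the query
        results.append(s)
    return results
```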
As the results of the search, the identified summaries may be returned in the form of a list or, more preferably, a report. For example, a report containing the identified summaries may be in electronic or paper form. The report may include, in addition to the identified summaries, a contents page listing the identified summaries and/or any clusters of the identified summaries, and copies of, or links to, the original documents. The report may be a web or HTML report available via a website or web portal.
Preferably the method comprises the further steps of arranging the identified summaries into a plurality of clusters based on the similarity of the identified summaries; and returning the identified summaries arranged in their clusters. Therefore, the identified summaries may be clustered based on their semantic similarity. As will be discussed in more detail below, the clusters may in turn be ordered based on their relevance to the search query, such that the identified summaries are clustered and ordered based on their similarity.
In other words, the summaries that have been identified as relevant to the search query may be arranged into groups or clusters that are similar - i.e. semantically similar - before they are returned. Therefore, the summaries presented in response to the query are returned in groups on the basis of their thematic or contextual similarity. This enables a user or machine that receives the search results to quickly and easily interpret the information contained in the identified summaries. The clustering process creates a data structure for the identified summaries that would otherwise be unstructured and difficult to navigate. Clustering can guide users through the results of a search query, enabling users to more efficiently identify and retrieve relevant information from the ingested text documents.
However, even using clusters of identified summaries, it can be difficult to understand and retrieve information as the number of clusters increases - e.g. to 10 or more clusters.
Therefore, in particularly preferred examples, the method further comprises comparing the clusters and creating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein; wherein the clusters of identified summaries are returned in accordance with the cluster hierarchy. For instance, the clusters may be ordered based on their semantic similarities. Through these steps a hierarchy or structure is automatically applied to the clusters. The data within the clusters is made easier to navigate and understand. Therefore, the cluster hierarchy again helps guide users through the results of a search query, enabling users to more efficiently identify and retrieve relevant information from the ingested text documents. However, this is not essential and in other examples the clusters might be presented in any order.
Arranging the identified summaries into a plurality of clusters may comprise: generating at least one metadata embedding for each of the stored summaries; wherein the metadata embedding is a numerical vector representing metadata associated with the respective stored summary or metadata associated with a part of the respective stored summary; and calculating a summary embedding for each stored summary based on the at least one metadata embedding; and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings.
The metadata from which metadata embeddings are generated may include one or more of key words, named entities identified through named entity recognition (NER), a title or date of the original document from which the respective stored summary is generated, and names, events or dates listed within the stored summary. As such, the metadata embeddings may represent these pieces of metadata as a numerical vector. For instance, the metadata embeddings may include key word embeddings or named entity embeddings that represent key words or named entities within the respective stored summary. Equally, the metadata embeddings may include embeddings of the summary or a portion of the summary itself. For instance, the metadata embeddings may include one or more sentence embeddings that are numerical vectors representing individual sentences within the respective stored summaries. Generating the summary embeddings may include averaging or weighting different metadata embeddings together.
Therefore, in preferred examples the method may comprise generating a sentence embedding for each of the sentences within the stored summaries, wherein the sentence embeddings are numerical vectors representing the respective sentences, and calculating a summary embedding for each stored summary based on the plurality of sentence embeddings (e.g. by averaging the sentence embeddings).
Optionally, the method may comprise generating sentence embeddings for each stored summary, as well as one or more further metadata embedding(s) for each stored summary, and generating a summary embedding based on the sentence embeddings and the one or more further metadata embedding(s). For instance the further metadata embeddings may be one or more key word embeddings that are numerical vectors which each represent an individual word or phrase in the respective stored summary and/or one or more named entity embeddings that are numerical vectors which each represent an individual word or phrase in the respective stored summary identified as a named entity by an NER classifier. Again the generation of a summary embedding may comprise averaging or weighting the sentence embeddings and the one or more further metadata embeddings.
As such, in particularly preferred examples, arranging the identified summaries into a plurality of clusters comprises: dividing each of the stored summaries into one or more sentences; generating one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the stored summaries; wherein the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences and their semantic meanings, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences; calculating a summary embedding for each stored summary, wherein the summary embedding is a numerical vector representing a weighted average of the sentence embedding, the key word embeddings and/or the named entity embeddings for each sentence of said stored summary; and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings. This provides an efficient method of comparing and grouping the identified summaries into clusters.
It should be noted that the steps of dividing stored summaries into sentences, generating metadata embeddings, sentence embeddings, key word embeddings and/or named entity embeddings, and calculating a summary embedding for the stored summaries discussed above are performed for each stored summary. The processes for calculating summary embeddings are preferably performed as part of the preliminary process discussed above in which text documents are obtained, segmented and summarised and before any search query is received. The summary embeddings can be seen as metadata associated with the respective stored summaries. The summary embeddings may be stored in a database or other storage format, preferably with the corresponding stored summaries.
Calculating summary embeddings during a preliminary process and before any search queries are received is efficient since the calculated summary embeddings may be used to create clusters of summaries for all subsequent search queries. Creating the summary embeddings - which are a further example of metadata for the stored summaries - before search queries are received reduces search latency, since creating embeddings requires relatively large amounts of resources whilst arranging the summaries into clusters and ordering the clusters within a hierarchy (if performed) is relatively inexpensive. However, the calculation of summary embeddings during a preliminary process and before search is not essential.
In some cases the calculation of summary embeddings could be performed during the search process, after the receipt of a search query. For instance, the calculation of summary embeddings could be performed for a given summary the first time that this summary is to be returned in response to a search query - i.e. the first time the summary is considered relevant to a search query. In a further alternative, the calculation of summary embeddings may be performed each time a search query is received for all summaries which have been identified as relevant to the search query (i.e. for each of the identified summaries during every search). Similarly, the steps involved in calculating summary embeddings may be split across the preliminary process and search process discussed above. Nevertheless, in each case the steps for calculating summary embeddings may be as discussed above.
In more detail, the sentences within the stored summaries may be identified and divided using a sentencizer. Sentencizers are tools that divide unstructured text into separate sentences based on (for instance) punctuation, line breaks and the semantic meaning of the text. The key word embeddings may be calculated using a keyword extraction model such as a sentence-transformer BERT model. Sentence embeddings may be generated from the respective sentences by a tool such as the BERT sentence-transformer.
Key words may be obtained from a keyword extraction model, such as KeyBERT, a publicly available library that leverages sentence embeddings and n-gram word/phrases. The method may include calculating the cosine similarity between the document embedding and n-gram word or phrase embeddings. The key words with the highest cosine similarity score may be taken as the key words most representative of the summary itself. Note that any suitable model can be used here, not limited to KeyBERT, for example, a supervised model trained using document keyword pairs. The key words may include single words and/or phrases formed of multiple words.
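The ranking step that KeyBERT performs - scoring candidate n-gram phrases by the cosine similarity of their embeddings to the document embedding - might be sketched as below. The embeddings here are toy vectors; in practice both the document embedding and the candidate embeddings would come from a sentence-transformer model:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def top_keywords(doc_embedding, candidate_embeddings, top_k=3):
    """Rank candidate n-gram words/phrases by cosine similarity of their
    embeddings to the document embedding; the highest-scoring candidates
    are taken as most representative of the summary."""
    scored = sorted(candidate_embeddings.items(),
                    key=lambda kv: cosine_similarity(doc_embedding, kv[1]),
                    reverse=True)
    return [phrase for phrase, _ in scored[:top_k]]
```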
Named entities may be obtained through an NER classifier such as the RoBERTa model developed by Facebook AI (RTM). However, any suitable model for finding words that follow a given NER ontology may be used. These models identify words from a given text that match a given class from the model's ontology. These words are extracted from the unstructured text and classified into predefined classes from the ontology (e.g. person, date, street address, company name, business, monetary value, event, brand, etc.). It may be appreciated that the key words and named entities on which the key word embeddings and named entity embeddings are based may include some overlap. This is to be expected: the processes of identifying key words and identifying named entities offer different ways to interpret the word content of each summary. Using both key word embeddings and named entity embeddings in addition to sentence embeddings allows the method to identify relatively large amounts of information regarding each summary. This enables summaries to be more easily linked and connected together automatically. Therefore, clusters can be made larger whilst maintaining conceptual similarity between the summaries within each cluster. Larger clusters of conceptually linked summaries are desirable since smaller clusters (especially lots of very small clusters) do not offer a significant improvement to users in comparison to simply listing summaries individually. Alternatively, only one of key word embeddings and named entity embeddings may be used in the process.
Preferably key word embeddings and named entity embeddings are weighted greater in the summary embeddings than the sentence embedding. Advantageously, this approach achieves relatively large clusters of conceptually linked summaries. In contrast, if the summary embeddings were based on the sentence embeddings only or if sentence embeddings were relatively heavily weighted in comparison to key word embeddings or named entities identified using NER tags the process would produce larger numbers of relatively small clusters. This is because the sentence embeddings relating to individual sentences within a summary tend to be highly specific and distinct, which reflects how different sentences in each summary can contain very different content following the summarisation process despite being in the same summary and discussing the same topic.
In preferred examples, when calculating the summary embeddings the key word embeddings from the summary are weighted at least twice as heavily as the sentence embeddings. For example, the key word embeddings may be weighted between two and three times more heavily than the sentence embeddings, and preferably 2.5 times more heavily. Additionally or alternatively, when calculating the summary embeddings the named entity embeddings may be weighted at least 1.2 times as heavily as the sentence embeddings. For example, the named entity embeddings may be weighted between 1.2 and 2 times more heavily than the sentence embeddings, and preferably 1.5 times more heavily. In particularly preferred examples the summary embedding may be calculated as 0.2 times the sentence embeddings plus 0.5 times the key word embeddings plus 0.3 times the named entity embeddings.
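Assuming the intended weighting is 0.2 for sentence embeddings, 0.5 for key word embeddings and 0.3 for named entity embeddings (which sums to 1 and matches the 2.5x and 1.5x ratios stated above), the weighted combination might look like this:

```python
def summary_embedding(sentence_emb, keyword_emb, entity_emb,
                      weights=(0.2, 0.5, 0.3)):
    """Weighted combination of the (averaged) sentence, key word and named
    entity embeddings for one stored summary:
    0.2 * sentence + 0.5 * key word + 0.3 * named entity."""
    w_s, w_k, w_e = weights
    return [w_s * s + w_k * k + w_e * e
            for s, k, e in zip(sentence_emb, keyword_emb, entity_emb)]
```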
Preferably the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using an agglomerative clustering algorithm. Agglomerative clustering methods are examples of hierarchical clustering algorithms, which start with clusters formed of individual objects (in this case identified summaries) and repeatedly merge similar clusters to form sets of combined clusters. A predefined cluster distance is used to define a threshold at which the distance between the clusters of identified summaries is too great, and the clusters are not merged further. The predefined cluster distance may be set by the designer of the system performing the method empirically so as to produce acceptable numbers and sizes of clusters. The final set of clusters of identified summaries can then be returned in response to the query.
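A minimal sketch of such agglomerative clustering is shown below; the use of Euclidean distance and single linkage here are illustrative choices (a production system might instead use a library implementation such as scikit-learn's AgglomerativeClustering with a distance threshold):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative_cluster(embeddings, max_distance=0.5):
    """Start with one cluster per summary embedding and repeatedly merge
    the closest pair of clusters (single linkage) until no pair is closer
    than the predefined cluster distance."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break  # remaining clusters are too far apart to merge
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```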
In alternative examples, clusters may be generated by divisive clustering where all segments begin in an initial cluster which is repeatedly split to form smaller clusters, or any other hierarchical clustering method.
As discussed above, preferably the method includes generating a cluster hierarchy and returning the clusters and identified summaries therein in accordance with the cluster order. Generating a cluster hierarchy may comprise: calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding. Preferably, the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster. This cluster is also referred to herein as the dominant cluster.
In preferred examples, generating a cluster hierarchy may comprise identifying the similarity between the cluster embedding of each cluster and the average cluster embedding. Therefore, a respective similarity between the cluster embedding of each cluster and the average cluster embedding is calculated for each cluster within the plurality of clusters.
Generating a cluster hierarchy may further comprise selecting as a first cluster the cluster that has the greatest similarity between its respective cluster embedding and the average cluster embedding. Herein this first cluster is also referred to as the dominant cluster. Being closest to the average or centre of all clusters, the dominant cluster may be seen to be particularly relevant to the search query received by the method. As such, prioritising the dominant cluster helps enable a user to receive important information quicker. To order the remaining clusters within the cluster hierarchy the method may involve identifying a next closest cluster that is most similar to the first (dominant) cluster, then finding the cluster that is most similar to that cluster, and so on until all clusters are ordered. As such, the method may involve iteratively: calculating a similarity between each of the respective cluster embeddings of the remaining clusters of the plurality of clusters and the cluster embedding of the previously selected cluster (i.e. the dominant cluster in the first iteration); and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster. In this way the clusters may be ordered and returned in a manner that is particularly clear and easy to understand, with each cluster being particularly closely related and flowing clearly from the preceding cluster.
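The dominant-cluster selection and the iterative ordering described above may be sketched as follows. Function names are illustrative, and cosine similarity is assumed as the similarity measure in line with the preference expressed below.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_clusters(cluster_embeddings):
    """Order clusters starting from the dominant cluster.

    The dominant cluster is the one whose embedding is most similar
    to the mean of all cluster embeddings; each subsequent cluster is
    the remaining cluster most similar to the previously selected one.
    """
    embs = [np.asarray(e, dtype=float) for e in cluster_embeddings]
    avg = np.mean(embs, axis=0)
    remaining = list(range(len(embs)))
    # select the dominant cluster (closest to the average embedding)
    current = max(remaining, key=lambda i: cosine(embs[i], avg))
    order = [current]
    remaining.remove(current)
    while remaining:
        # chain to the cluster most similar to the last selected one
        current = max(remaining, key=lambda i: cosine(embs[i], embs[current]))
        order.append(current)
        remaining.remove(current)
    return order
```

The same pattern applies to ordering the identified summaries within a cluster, substituting summary embeddings for cluster embeddings.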
Alternatively, the method may comprise ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding. As such, the clusters and identified summaries therein may be returned in an order from a cluster which has the most similar cluster embedding in comparison to the average cluster embedding to a cluster which has the least similar cluster embedding in comparison to the average cluster embedding. As such, the first or dominant cluster remains the cluster that has a cluster embedding which is most similar to the average cluster embedding. This approach again offers an order of clusters that is easy to understand because clusters are arranged from the clusters closest to the centre of the overall collection of clusters and which may therefore be perceived as most relevant to the search query at hand, to the clusters which are furthest from the centre and therefore appear less relevant. This helps provide information that is more relevant to a search quicker.
In the examples discussed above, the similarity between cluster embeddings and average cluster embeddings may be determined by any suitable approach including using the cosine similarity, Manhattan similarity or Euclidean similarity. The cosine similarity is preferred, being particularly quick and computationally efficient.
Preferably the average of the plurality of cluster embeddings is the mean of the plurality of cluster embeddings. However, other averages may also be used.
These steps of generating a cluster hierarchy indirectly arrange the clusters into a hierarchy or structure based on their approximate relevance to the search query. This is because the average of the cluster embeddings is determined by the summary embeddings for each identified summary, and the identified summaries are selected from the pool of stored summaries by their relevance to the search query. The search results returned by the method are particularly valuable when returned in accordance with the cluster hierarchies created using the approaches described above since they help present the most relevant and valuable information first and/or help guide those receiving the results from one cluster to the next. Nevertheless, it will be appreciated that a variety of other cluster hierarchies could be created.
Preferably, the method comprises further steps of: generating a summary order for the identified summaries within each cluster; and returning the identified summaries within each cluster in accordance with this summary order. This enables the method to provide results of a search query that are particularly valuable since the summaries may be ordered to provide the most relevant information first and/or help guide those receiving the results from one cluster to the next. Generating a summary order may comprise calculating a summary embedding for each identified summary in the cluster, calculating an average summary embedding based on the summary embeddings of all identified summaries in the cluster, and ordering the identified summaries based on their summary embeddings and the average summary embedding.
In preferred examples, ordering the identified summaries may comprise identifying the similarity between the summary embedding of each identified summary and the average summary embedding, and selecting as a first (dominant) summary the summary that has the greatest similarity between its summary embedding and the average summary embedding. Being closest to the average or centre of all summaries within the cluster, the dominant summary appears most relevant to the search query received by the method. As such, prioritising the dominant summary and presenting it first helps enable a user or system receiving the results of a search query to receive important information quicker.
To order the remaining identified summaries the method may then find the next closest summary to the dominant summary and so on until all summaries are ordered. As such, the method may involve iteratively: calculating a similarity between the respective summary embeddings of the remaining identified summaries within the cluster and the summary embedding of the previously selected summary (i.e. the dominant summary in the first iteration); and selecting as the next summary in the order the identified summary that has a summary embedding that is most similar to the embedding of the previously selected summary. In this way the summaries may be ordered and presented in a manner that is particularly clear and easy to understand, with each identified summary being similar and flowing clearly from the preceding identified summary.
Alternatively, the remaining identified summaries other than the dominant summary may be ordered based on the similarity of their summary embeddings to the average summary embedding. As such, identified summaries therein may be returned in an order from a summary which has the most similar summary embedding in comparison to the average summary embedding within the cluster, to a summary which has the least similar summary embedding in comparison to the average summary embedding. This approach again offers an order of summaries that is easy to understand because summaries are arranged from the summary closest to the centre of the overall collection of summaries within each cluster, and which may therefore be perceived as most relevant to the search query at hand, to the summaries which are furthest from the centre and therefore appear less relevant. This helps provide information that is more relevant to a search quicker.
Similar steps may be provided in examples where the identified summaries are not clustered, which can be seen as equivalent to a situation where all identified summaries are arranged in a single overarching cluster. In these situations the plurality of identified summaries are ordered using the steps discussed above.
In these examples, the similarity between summary embeddings and average summary embeddings may be determined by any suitable approach including using the cosine similarity, Manhattan similarity or Euclidean similarity. Indeed, the cosine similarity, Manhattan similarity or Euclidean similarity methods may be used in the calculation of any of the similarities discussed herein.
It will be seen that these methods for ordering the identified summaries or the identified summaries within each cluster are equivalent to the approaches discussed above for generating cluster hierarchies and ordering clusters. Other suitable methods for ordering summaries are also possible.
In particularly preferred examples, the method involves generating and storing a summary title for each of the stored summaries and/or generating a summary title for each identified summary that corresponds to the search query. The method may further involve returning each identified summary with its corresponding summary title. The summary title for each of these summaries may be generated using a title generator model. A particularly preferred model is the title generator T5 model from Hugging Face (RTM) although any suitable title generator could be used. In further examples, the method may involve generating a cluster title for each cluster and returning the identified summaries arranged in their clusters with their corresponding cluster title. The summary title and/or cluster title may help assist users or systems to understand the results of a search query and to quickly understand the contents of the summaries within the cluster.
In preferred examples, generating a cluster title for each cluster may comprise setting each cluster title to be identical to the summary title of the dominant summary within the respective cluster. As such, the method may involve identifying a dominant summary and summary titles as discussed above.
In further examples generating a title for each cluster of identified summaries may be based on the contents of the identified summaries therein. For example, generating a title for each cluster may comprise: concatenating the identified summaries in the cluster; generating a cluster summary from the concatenated summaries, wherein the cluster summary is shorter than the concatenated summaries; and, generating a title for the cluster based on the cluster summary. Each cluster summary may be generated using abstractive summarisation of the concatenated text from all identified summaries within the respective cluster. Suitable tools for performing the abstractive summarisation include Distill-BART, although other models can be used. The title may be generated for the cluster using any suitable title generator model, but a preferred example is the title generator T5 model from Hugging Face (RTM).
In further examples the method may comprise generating a cluster summary for each cluster based on the identified summaries within each cluster or the corresponding document segments, wherein each cluster summary is shorter than the concatenated length of the identified summaries within each cluster and/or the concatenated length of the corresponding document segments. The cluster summary may be the same as the dominant summary within the cluster or a summarisation of the concatenated summaries or corresponding document segments from within the cluster. This summarisation may use any of the summarisation processes discussed above.

According to a further aspect of the invention there is provided a method performed by one or more processors, wherein the one or more processors perform the steps of obtaining one or more text documents, segmenting the one or more text documents into a plurality of document segments, receiving a search query, identifying document segments that correspond to the search query, generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment, and, returning the summaries of the identified document segments that correspond to the search query. As will be seen, this method is based on the first aspect of the invention discussed above and shares many similarities with the directly preceding aspect of the invention. Indeed, to a user, the method of this aspect will appear very similar to those of the preceding aspects - in response to a query a set of corresponding summaries is received.
However, this aspect relates to methods where the segmentation of text documents is performed before searching is performed but where summarisation is performed after a search query is received. A search query is compared to the document segments obtained from the text documents, and summarisation is only performed for document segments that have been identified to correspond to the search query. This approach offers strong search results and efficient use of computational resources. Since the steps performed before searching do not involve summarisation this pre-processing is quicker and less computationally expensive. This is particularly important where the corpus of text documents to be searched is particularly large. Similarly, the overall number of document segments which need to be summarised using the method is reduced, especially in situations where the number of searches to be performed is relatively low in comparison to the number of documents and document segments to be stored. Storage requirements can also be reduced as there is no need to store summaries for all document segments.
Preferably the one or more processors perform the step of storing the plurality of document segments, wherein preferably the step of storing the plurality of document segments is performed prior to receiving a search query. Suitable storage mediums are the same as those discussed for the previous aspects of the invention.
Preferably the method involves generating a segment embedding for each of the document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment. Preferably the method involves storing the segment embeddings, for instance in a vector database, database or cloud container. More preferably still, the steps of generating a segment embedding for each of the document segments and storing the segment embeddings are performed prior to receiving a search query. The segment embeddings may be used during the search process and when clustering and ordering summaries.
Preferably identifying document segments that correspond to the search query is based on a comparison of the segment embeddings and a sample embedding contained within or derived from the search query, and/or a comparison of the named entities and/or keywords within the document segments and sample named entities and/or keywords contained within or derived from the search query. This search process may be similar to the search processes described in relation to the preceding aspect of the invention, with the exception that the search is performed against the document segments rather than summaries generated from these segments. The same sub-processes and tools discussed above may be used during both search steps.
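A hedged sketch of such a combined comparison is given below. The segment representation, the 0.7/0.3 score weighting and the function name are illustrative assumptions rather than features prescribed by the method; any suitable blending of embedding similarity and keyword overlap could be used.

```python
import numpy as np

def search_segments(query_emb, query_keywords, segments, top_k=3,
                    w_emb=0.7, w_kw=0.3):
    """Rank stored document segments against a search query.

    Each segment is a dict with an 'embedding' vector and a 'keywords'
    set. The score blends cosine similarity between the sample
    embedding and each segment embedding with keyword overlap.
    """
    q = np.asarray(query_emb, dtype=float)
    scores = []
    for idx, seg in enumerate(segments):
        e = np.asarray(seg["embedding"], dtype=float)
        cos = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        overlap = (len(query_keywords & seg["keywords"]) / len(query_keywords)
                   if query_keywords else 0.0)
        scores.append((w_emb * cos + w_kw * overlap, idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:top_k]]
```

The returned indices identify the top-ranked document segments, which would then be passed to the summarisation step.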
Preferably the method according to the present aspect involves steps of clustering and ordering the summaries of the identified document segments that are equivalent to the corresponding steps in the preceding aspects. The summaries may be returned in these clusters and orders. The clustering and ordering steps may be substantially the same as those previously described and use corresponding tools, with the exception that they may be based on summary embeddings which represent the summaries and/or segment embeddings which represent the identified document segments. These clustering and ordering steps may include any other optional or preferable features of the equivalent processes discussed above. Analogously, the method may include steps of generating summary titles, cluster titles and cluster summaries that are equivalent to those discussed above in reference to the previous aspects and include the step of returning the summaries and clusters with these features.
Preferably the one or more processors perform the further steps of: arranging the summaries of the identified document segments into a plurality of clusters based on the similarity of the identified document segments and/or the summaries, and returning said summaries arranged in their clusters.
Preferably the one or more processors perform the further steps of: comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query, the similarity of the summaries contained therein and/or the similarity of the identified document segments that correspond to the summaries contained therein; wherein the clusters are returned in accordance with the cluster hierarchy.

Preferably, arranging the summaries into a plurality of clusters comprises obtaining a segment embedding for each of the identified document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment, and arranging the summaries into a plurality of clusters based on the similarity of the segment embeddings of their respective identified document segments, and/or obtaining a summary embedding for each of the summaries generated from the identified document segments, wherein each summary embedding is a numerical vector representing the corresponding summary, and arranging said summaries into a plurality of clusters based on the similarity of their summary embeddings.
As discussed with reference to the previous aspects, the summary embeddings may be generated directly from the summaries - e.g. by passing the text of the summary through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’ (both available from Hugging Face (RTM)). Similarly, segment embeddings may be generated directly from the identified document segments - e.g. by passing the text of the document segment through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’. Alternatively, generating the summary and/or segment embeddings may comprise generating at least one metadata embedding for each of the summaries or document segments; wherein the metadata embedding is a numerical vector representing metadata associated with at least a part of the respective summary or document segment. The summary embedding or segment embedding may then be calculated based on said metadata embeddings (e.g. by taking an average or a weighted average of the metadata embeddings). Suitable examples of metadata to form the metadata embeddings and the processes for forming these metadata embeddings are discussed above.
Preferably the summaries are arranged into a plurality of clusters using a hierarchical clustering algorithm and/or an agglomerative clustering algorithm.
Preferably generating a cluster hierarchy comprises calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the summaries contained within the respective cluster and/or the segment embeddings of the identified document segments that correspond to the summaries contained within the respective cluster, calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding.
Preferably ordering the clusters based on their cluster embeddings and the average cluster embedding comprises identifying a similarity between the cluster embedding of each cluster within the plurality of clusters and the average cluster embedding, and either: ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding; or selecting as a first cluster in a cluster order the cluster that has the greatest similarity between its respective cluster embedding and the average cluster embedding, and generating the remainder of the cluster hierarchy by iteratively, calculating a similarity between each of the respective cluster embeddings of the remaining clusters and the cluster embedding of the previously selected cluster and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster. Preferably, the one or more processors perform the further steps of generating a summary order for the summaries within each cluster; and returning the summaries within each cluster in accordance with the summary order; wherein preferably generating a summary order for the summaries within each cluster comprises identifying a similarity between the respective summary embedding of each summary and the average summary embedding of said cluster and/or identifying a similarity between the segment embeddings of the respective identified document segments corresponding to each segment and the average segment embedding of said cluster, and either: ordering the summaries based on the similarity between their respective summary embeddings and the average summary embedding of said cluster or the similarity between the segment embeddings of their respective identified document segments and the average segment embedding of said cluster; or selecting as a first summary in the summary order, the summary that has the greatest 
similarity between its summary embedding and the average summary embedding for said cluster or between the segment embedding of its corresponding identified document segment and the average segment embedding for said cluster, and generating the remainder of the summary order by iteratively calculating a similarity between the respective summary embeddings of the remaining summaries within said cluster and the summary embedding of the previously selected summary, and selecting as the next summary in the summary order the summary which has a summary embedding that is most similar to the summary embedding of the previously selected summary, or calculating a similarity between the respective segment embeddings of the identified document segments corresponding to the remaining summaries and the segment embedding of the identified document segment corresponding to the previously selected summary, and selecting as the next summary in the summary order the summary whose corresponding segment embedding is most similar to the segment embedding of the identified document segment corresponding to the previously selected summary.
Preferably the one or more processors perform the further steps of generating and storing a summary title for each of the summaries, and returning each summary with its corresponding summary title, generating a cluster title for each cluster, and returning the summaries arranged in their clusters with their corresponding cluster titles, and/or generating a cluster summary for each cluster based on the summaries within each cluster or the identified document segments corresponding to the summaries within each cluster, wherein each cluster summary is shorter than the concatenated length of the summaries and/or the concatenated length of the identified document segments corresponding to the summaries within each cluster, and returning the summaries arranged in their clusters with their corresponding cluster summaries.
According to a further aspect of the invention there is provided a method performed by one or more processors, in which the one or more processors perform the steps of obtaining one or more text documents, segmenting the one or more text documents into a plurality of document segments, receiving a search query, identifying document segments that correspond to the search query, arranging the identified document segments into a plurality of clusters based on the similarity of said identified document segments, generating a cluster title and/or cluster summary for each cluster based on the identified document segments within each cluster, wherein each cluster summary is shorter than the concatenated length of the identified document segments in each cluster, and returning the identified document segments arranged in their clusters with their corresponding cluster titles and/or cluster summaries. This method returns relevant document segments in response to a search query rather than summaries of said document segments. This provides a further search process that is computationally efficient since the amount of information which must be transferred and stored is reduced. Moreover, the results of the search are easily understood and discrepancies between document segments and their summaries and/or data loss during the summarisation process are avoided. The steps of this method may comprise any of the optional or preferable features of the analogous steps of the preceding aspects of the invention. In particular, the clustering and ordering processes discussed above with reference to summaries of document segments can be implemented using the document segments themselves.
As such, further discussion of appropriate techniques for obtaining text documents, segmenting text documents, searching against document segments, clustering document segments, ordering document segments within clusters, ordering clusters and creating titles for document segments and clusters and creating summaries for clusters is provided above.
According to a further aspect of the invention there is provided a system comprising a plurality of processors, the processors configured to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and returning the identified summaries that correspond to the search query.
According to a further aspect of the invention there is provided a system comprising a plurality of processors, the processors configured to perform the steps of any of the methods discussed above.
Systems in accordance with these aspects offer corresponding benefits to the methods discussed above. The systems may be configured to perform any of the optional or preferable steps discussed above.
According to a further aspect of the invention there is provided a non-transitory computer readable medium storing instructions which when run by one or more processors cause the one or more processors to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and returning the identified summaries that correspond to the search query.

According to a further aspect of the invention there is provided a non-transitory computer readable medium storing instructions which when run by one or more processors cause the one or more processors to perform the steps of any of the methods discussed above.
The stored instructions of these aspects offer corresponding benefits to the methods discussed above. The stored instructions may include instructions that cause the processors to perform any of the optional or preferable steps discussed above with reference to the previous aspects of the invention.
As discussed above, the invention provides particularly computationally efficient searching without loss of information content. The invention offers significant support to the task of information retrieval from large bodies of data.
BRIEF DESCRIPTION OF DRAWINGS
Specific examples of the invention will now be discussed with reference to the following figures:
Figure 1 shows a flow chart of a method in accordance with embodiments of the invention;
Figure 2 shows schematically a method of segmenting and summarising text documents suitable for use in embodiments of the invention;
Figure 3 shows schematically a system for performing search in accordance with embodiments of the invention;
Figure 4 shows schematically a report returned in accordance with embodiments of the invention;
Figure 5 shows a flow chart of a further method in accordance with embodiments of the invention;
Figure 6 shows a flow chart of a further method in accordance with embodiments of the invention.
DETAILED DESCRIPTION
Methods and systems in accordance with the invention for performing search and information retrieval from large collections of text documents will now be discussed in reference to the flow chart of Figure 1 and the schematic process shown in Figure 2.
The method begins with ingestion of text data. In step s101, one or more text documents are obtained. The text documents typically contain unstructured text data. The text documents may for instance be submitted by a user, obtained from the internal database of an organisation, or located from public sources (e.g. websites, journals or encyclopaedias). Obtaining the text documents may comprise generating them by transcribing video and/or audio files, extracting text from multimodal sources or translating text documents from different languages. The method typically involves ingesting a large number of text documents (e.g. up to, and over, 1 million). Each text document may range in length. For instance, suitable text documents are typically at least 5 to 10 sentences long but can be hundreds of pages and thousands of sentences long.
In step s102, the text documents are segmented into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document. The text documents are segmented based on their semantic similarities. As such, each segment will relate to a different subject or topic. The segments produced from each text document are a plurality of extracts or excerpts from the text. Each segment is typically in the region of 4 to 20 sentences long, and preferably 5 to 16 sentences.
To divide the text documents into segments, the text is split into individual sentences using a sentencizer. Sentence embeddings representing the semantic content of each sentence as a numerical vector are generated using a tool such as BERT sentence-transformer. The sentence embeddings of adjacent or consecutive sentences may then be compared to determine whether the sentences relate to similar content and should be kept together in a segment or whether the content of the sentences is sufficiently different that the sentences should be split apart into different segments.
A cosine similarity is calculated between the sentence embedding of a given sentence and the average sentence embeddings of the two or more (e.g. three) sentences that immediately precede the given sentence. If the cosine similarity is greater than a predetermined threshold the given sentence is considered sufficiently similar to the preceding sentences that it should be included in their segment. If the cosine similarity is lower than the threshold it is determined that the given sentence is the start of a new segment as it is sufficiently semantically different to the preceding text. It will be appreciated that a wide variety of alternative methods for calculating the similarity of adjacent sentences are equally possible.
In preferred implementations, calculating the average sentence embeddings of the two or more preceding sentences involves weighting the preceding sentences based on their adjacency or closeness to the given sentence. For example, the sentence that immediately precedes the given sentence may be weighted higher than the sentence that in turn precedes it. That is, if a document includes N sentences that are numbered consecutively through the document as 1, 2, 3 ... n ... N, the cosine similarity of the sentence embedding of sentence n may be calculated with reference to an average sentence embedding calculated based on the sentence embeddings of sentences n - 1, n - 2 and n - 3. When calculating the average sentence embedding, the embedding of the immediately preceding sentence n - 1 may be weighted more highly than its preceding sentence n - 2, and sentence n - 2 may be weighted more highly than the sentence n - 3 that precedes it in turn. Suitable weightings may be selected by a user and/or a designer. The segmentation threshold may be fixed - e.g. being set by a user or technician - or may be varied dynamically for each document. When varied dynamically, the segmentation threshold may be set based on the mean and/or standard deviation of cosine similarities between the embedding of each sentence in a document and the embedding or average embedding of its respective preceding sentence(s). For example, the segmentation threshold may be set as the mean of the cosine similarities minus the standard deviation of the cosine similarities multiplied by a factor in the range of 0.2 to 0.8.
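The segmentation logic described above may be sketched as follows. This is a simplified illustration in Python, assuming the sentence embeddings have already been generated (e.g. by a BERT sentence-transformer); the function name, default weights and default factor are illustrative only and do not form part of any particular implementation.

```python
import numpy as np

def segment_by_similarity(embeddings, weights=(0.5, 0.3, 0.2), factor=0.5):
    """Split a document into segments of consecutive sentences.

    `embeddings` is an (N, d) array of sentence embeddings. Each sentence
    is compared (cosine similarity) against a weighted average of up to
    three preceding sentence embeddings, with the nearest preceding
    sentence weighted most highly. The segmentation threshold is set
    dynamically as mean(similarities) - factor * std(similarities).
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = []
    for n in range(1, len(emb)):
        prev = emb[max(0, n - 3):n][::-1]      # nearest preceding sentence first
        w = np.array(weights[:len(prev)])
        avg = (prev * w[:, None]).sum(axis=0) / w.sum()
        avg /= np.linalg.norm(avg)
        sims.append(float(emb[n] @ avg))
    sims = np.array(sims)
    threshold = sims.mean() - factor * sims.std()
    # A new segment starts wherever the similarity drops below the threshold.
    boundaries = [0] + [n + 1 for n, s in enumerate(sims) if s < threshold] + [len(emb)]
    return [list(range(a, b)) for a, b in zip(boundaries, boundaries[1:])]
```

The function returns segments as lists of sentence indices; sentence n is compared with the weighted average of the embeddings of sentences n - 1, n - 2 and n - 3, as described above.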
In alternative examples the text documents may instead be segmented based on a predetermined length for the segments. For example, the document may be repeatedly or iteratively split into a new segment after a predetermined number of characters, words or sentences (e.g. every 400 characters or every 8 sentences). Alternatively the document may be iteratively split into a new segment at the end of each word or sentence in which a character corresponding to the predetermined number of characters occurs (e.g. at the end of the word or sentence in which the 400th character occurs), or at the end of each sentence which includes the word corresponding to a predetermined number of words for each segment (e.g. at the end of the sentence containing the 80th word).
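A purely illustrative sketch of such length-based segmentation (in Python; the function names and default budgets are hypothetical examples, not part of any particular implementation):

```python
def segment_by_sentence_count(sentences, max_sentences=8):
    """Split a list of sentences into consecutive segments each containing
    at most a fixed number of sentences."""
    return [sentences[i:i + max_sentences]
            for i in range(0, len(sentences), max_sentences)]

def segment_by_character_budget(sentences, max_chars=400):
    """Start a new segment at the end of the sentence in which the
    character corresponding to the predetermined budget occurs."""
    segments, current, count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence)
        if count >= max_chars:          # budget reached within this sentence
            segments.append(current)
            current, count = [], 0
    if current:
        segments.append(current)
    return segments
```

The second variant never splits mid-sentence: the sentence containing the 400th character is kept whole and closes its segment.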
Having divided the text documents into segments in step s102, the method involves the summarisation of the segments. In step s103 a summary for each document segment is generated, wherein each summary is shorter than the corresponding document segment. The summaries are generated through abstractive summarisation. Suitable tools for performing the abstractive summarisation include Distill-BART. Equally, a summary of each document segment may be generated using a Large Language Model (LLM) such as ChatGPT produced by OpenAI (RTM). The summaries generated in this manner are shorter than the corresponding document segment but retain semantic meaning from the segment. A single summary is produced for each document segment. Each summary is typically in the region of 2 to 6 sentences, and preferably 3 to 4 sentences long.
The generated summaries are then stored in step s104. For instance, the summaries may be stored in a database or cloud container such as an S3 bucket from Amazon Web Services (RTM).
The stored summaries provide a text corpus for future searching. As such, steps s101 to s104 may be seen conceptually as forming part of a preliminary process 100a that is completed before a search process 100b occurs. Multiple searches can be performed on the same set of stored summaries. As such, the segmentation and summarization processes do not need to be repeated each time a search query is received.
Performing a search begins with the receipt of a search query in step s105. The search query may include one or more keywords, one or more embeddings, and/or a natural language query. The search query is used to identify relevant content within the stored summaries. The search query may additionally define one or more filters which restrict the summaries which should be searched and/or returned in response to the search query. For instance, the search query may include a date filter defining a range of dates that a user is interested in or a filter on the types of documents or sources a user wants to receive information and summaries from. In further examples, the filters indicate that the search should be restricted to summaries taken from documents from one or more publishers or institutions, summaries of documents of specific file type(s), summaries from documents that come from trusted or untrusted sources, summaries from documents of a specific country of origin and/or summaries from internal or external data.
In step s106 the method involves identifying stored summaries that correspond to the search query. In other words, in step s106 the method involves the identification of a subset of the stored summaries (so called “identified summaries”) that correspond to the search query. For example, keywords of the search query may be compared to the text of each of the stored summaries using the BM25 algorithm or any other suitable means. Equally, a retrieval model configured to respond to a natural language query may identify summaries relevant to a natural language query. In further examples a model may be configured to identify summaries that are similar or relevant to an embedding contained within the search query. Stored summaries that match the keywords, natural language query and/or sample embedding (and meet any filter requirements set out in the search query) are identified as appropriate results to return to the user or machine issuing the search query.
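By way of illustration, the BM25 keyword scoring mentioned above may be sketched as follows. This is a minimal pure-Python implementation of the standard BM25 ranking formula; in practice a library or search-engine implementation would typically be used, and the parameter defaults shown are conventional values rather than requirements of the method.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query terms using the
    standard BM25 ranking function."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)                             # term frequencies
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Summaries whose scores exceed a chosen cut-off (and which satisfy any filters in the search query) would then be identified as results.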
In preferred examples a sample embedding contained within or derived from the search query may be compared to summary embeddings which each represent a corresponding summary. The summaries which have a summary embedding which are sufficiently similar or most similar to the sample embedding may be identified as corresponding to the search query. For example, all summaries which are more similar than a predetermined similarity may be identified as corresponding to the search query or a predetermined number of most similar summaries may be identified as corresponding to the search query.
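A simplified sketch of this embedding comparison (Python with NumPy; the function name and parameters are illustrative, covering both the top-k and the similarity-threshold variants described above):

```python
import numpy as np

def top_k_summaries(query_embedding, summary_embeddings, k=3, min_similarity=None):
    """Return (index, similarity) pairs for the stored summaries whose
    embeddings are most similar (by cosine similarity) to the query
    embedding: either the k most similar, or all summaries above
    `min_similarity` if that threshold is given."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = summary_embeddings / np.linalg.norm(summary_embeddings, axis=1, keepdims=True)
    sims = s @ q
    if min_similarity is not None:
        idx = np.where(sims >= min_similarity)[0]
        idx = idx[np.argsort(-sims[idx])]           # most similar first
    else:
        idx = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in idx]
```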
Subsequently, the identified summaries - that is the summaries identified as corresponding to or matching the search query in step s106 - are returned or output in step s108. The identified summaries may be returned in the form of a list or report. For example, an electronic or paper report may be generated that includes the identified summaries. The report may include, in addition to the identified summaries, a contents page listing the identified summaries and copies of, or links to, the original documents obtained in step s101. The report may be a web or html report and available via a website or web portal.
Optionally, the method may involve the additional step of arranging the identified summaries into a plurality of clusters based on the similarity of the identified summaries. This is shown by step s107 in Figure 1, surrounded by a dashed box. Following this clustering process, the identified summaries may be returned (i.e. output) in their clusters in step s108. During step s107 the identified summaries are structured into semantically similar groups, such that the summaries in each group share a thematic or contextual similarity.
Arranging the identified summaries into a plurality of clusters in step s107 may comprise generating at least one metadata embedding for each of the identified and/or stored summaries, the metadata embedding being a numerical vector representing metadata associated with the respective stored summary or metadata associated with a part of the respective summary. Thereafter, a summary embedding may be calculated for each stored summary based on the at least one metadata embedding. The metadata embeddings may include one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the stored summaries, wherein the sentence embeddings are numerical vectors representing the respective sentences, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences. As such, the summary embedding may be considered a “compound embedding”, being an average or a weighted average of a series of metadata embeddings.
Alternatively, a summary embedding may be derived or calculated directly from the respective summary. To generate a summary embedding the text of the summary may be passed through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’ both of which are available from Hugging Face (RTM).
Step s107 may then include arranging the identified summaries into a plurality of clusters based on their summary embeddings. Suitable metadata embeddings include sentence embeddings, key word embeddings and named entity embeddings, as well as embeddings representing the title or date of the original document from which the respective summary is generated or representing names, events or dates listed in the summary itself. Therefore, the clustering may involve, as an initial step, dividing one or more of the summaries into one or more sentences using a sentencizer. Subsequently one or more of a sentence embedding, a plurality of word embeddings and named entity embeddings can be generated for each sentence within said one or more summaries. The sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings and can be generated by BERT sentence-transformer. The word embeddings are numerical vectors representing the semantic meaning of individual words or phrases in the respective sentences and may be generated by the sentence-transformer BERT model. The named entity embeddings are numerical representations of named entities identified using NER classification. NER classification (also referred to as NER tagging) applies tags to each token or named entity in the respective sentences that corresponds to a category or class of entities in a given ontology. The named entity embeddings are therefore based on the actual entities identified using NER tagging. A summary embedding representing the semantic meaning of the summary as a whole may be generated based on the generated sentence embeddings, word embeddings and/or named entity embeddings of its sentences. For example, these embeddings may be averaged or weighted to generate the summary embedding.
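The "compound embedding" calculation may be illustrated as follows. This is a sketch assuming the per-sentence metadata embeddings (sentence, key word and named entity embeddings) have already been generated; the weights shown are arbitrary examples, not values prescribed by the method.

```python
import numpy as np

def compound_embedding(sentence_embs, keyword_embs=None, entity_embs=None,
                       weights=(0.6, 0.2, 0.2)):
    """Combine metadata embeddings into a single summary-level
    'compound embedding' as a weighted average. Any metadata that is
    unavailable for a given summary simply drops out of the average."""
    parts = [(np.mean(sentence_embs, axis=0), weights[0])]
    if keyword_embs is not None and len(keyword_embs):
        parts.append((np.mean(keyword_embs, axis=0), weights[1]))
    if entity_embs is not None and len(entity_embs):
        parts.append((np.mean(entity_embs, axis=0), weights[2]))
    total = sum(w for _, w in parts)
    return sum(v * w for v, w in parts) / total
```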
Equally, the method may include dividing a summary into individual sentences and generating a sentence embedding for each of the sentences within the stored summaries. Thereafter a summary embedding can be calculated for each summary based on the plurality of sentence embeddings (e.g. by averaging or weighting the sentence embeddings). Optionally, the method may additionally involve generating one or more further metadata embedding(s) for each summary, and generating a summary embedding based on the sentence embeddings and the one or more further metadata embedding(s).
These processes for generating summary embeddings may be performed for all of the stored summaries - e.g. as part of the preliminary process 100a discussed above. Alternatively, these processes may be performed for the identified summaries in response to a search query during the search process 100b. Thereafter, in response to receiving a search query and following the identification of summaries that are relevant to the search query, the method may involve arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings. This provides an efficient method of comparing and grouping the identified summaries into clusters. Preferably the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using an agglomerative clustering algorithm. However, other clustering approaches may also be used.
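As an illustration of the clustering step, the following sketch implements a simplified, greedy centroid-linkage variant of agglomerative clustering on cosine distance. In practice a library implementation (e.g. scikit-learn's AgglomerativeClustering) might be used instead, and the distance threshold shown is an arbitrary example.

```python
import numpy as np

def cluster_summaries(embeddings, threshold=0.3):
    """Agglomerative clustering of summary embeddings: repeatedly merge
    the two closest clusters until no pair of cluster centroids is
    closer than `threshold` in cosine distance. Returns a list of
    clusters, each a list of summary indices."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(emb))]
    while len(clusters) > 1:
        centroids = np.array([emb[c].mean(axis=0) for c in clusters])
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        dists = 1 - centroids @ centroids.T         # pairwise cosine distances
        np.fill_diagonal(dists, np.inf)
        i, j = np.unravel_index(np.argmin(dists), dists.shape)
        if dists[i, j] > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```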
In turn, the method may optionally involve arranging or structuring the clusters themselves in a cluster hierarchy. The clusters and the identified summaries therein may be returned (i.e. output) in accordance with this cluster hierarchy.
Generating a cluster hierarchy may involve calculating a plurality of cluster embeddings, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding. Preferably, the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster (also termed the dominant cluster). The model then finds the next closest cluster to the starting cluster and so on until all clusters are ordered.
A corresponding method may be used to order the identified summaries within each cluster. A similarity between the respective summary embedding of each identified summary and an average summary embedding of said cluster may be calculated. The summaries may be ordered based on the similarity between their respective summary embeddings and the average summary embedding of said cluster. Equally the summaries may be ordered by selecting as a first summary in the summary order, the summary that has the greatest similarity between its summary embedding and the average summary embedding for said cluster, and generating the remainder of the summary order by iteratively calculating a similarity between the respective summary embeddings of the remaining identified summaries within said cluster and the summary embedding of the previously selected summary and selecting as the next summary in the summary order the identified summary which has a summary embedding that is most similar to the embedding of the previously selected summary. The summaries in each cluster may be returned in accordance with the generated summary order.
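The ordering scheme described above - selecting a dominant item closest to the average embedding and then iteratively chaining to the most similar remaining item - can be sketched as follows (illustrative Python; the same function applies to ordering clusters by their cluster embeddings and to ordering summaries within a cluster by their summary embeddings):

```python
import numpy as np

def order_by_similarity(embeddings):
    """Order items by embedding similarity. The item closest to the
    average embedding is selected first (the 'dominant' item); each
    subsequent item is the remaining one whose embedding is most
    similar to that of the previously selected item."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    mean = emb.mean(axis=0)
    mean /= np.linalg.norm(mean)
    remaining = list(range(len(emb)))
    current = max(remaining, key=lambda i: float(emb[i] @ mean))
    order = [current]
    remaining.remove(current)
    while remaining:
        prev = order[-1]
        current = max(remaining, key=lambda i: float(emb[i] @ emb[prev]))
        order.append(current)
        remaining.remove(current)
    return order
```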
Titles can be generated for either all stored summaries and/or the identified summaries. Additionally, titles may be generated for the clusters. The search results (the clusters and identified summaries) may be presented with these titles. Titles for each summary may be generated using a title generator tool such as the T5 model available from Hugging Face (RTM). Titles for clusters may be taken from one or more of the identified summaries therein (e.g. from the dominant summary) and/or by concatenating the text of the identified summaries therein and subsequently generating a title based on this concatenated text - e.g. again using a title generator tool such as the T5 model available from Hugging Face (RTM).
Titles for stored summaries may be generated during the preliminary process 100a and the titles stored in association with the stored summaries to reduce latency when a search query is received. However, this is not essential.
Breaking documents into segments and individually summarising each segment before searching as discussed herein offers significant benefits. In particular, search latency and resource requirements are significantly reduced.
Many text documents discuss or reference more than one topic. If a text document were summarised as a whole to produce a short summary (e.g. 3 to 4 sentences), a large amount of the content would be ignored or lost. As such, entire topics discussed in a document which was only summarised and not previously segmented may be lost. Therefore, whilst simply summarising documents and searching against these document-level summaries could reduce the resources required to perform each search, the information contained in the search results would be very poor. In contrast, by segmenting documents based on semantic similarity, multiple summaries will be created from any one document assuming the document covers a range of different topics. Therefore, more information is retained and a more accurate response with additional information can be provided in response to a search.
For example, a user may wish to search documents using the keywords ‘drones’ and ‘Ukraine’. If documents are summarised without segmenting and the resulting document-level summaries searched, then a search based on these document-level summaries would likely not return any content on drones from a relatively long document that primarily covers Ukraine and mentions drones in only a single paragraph. The single, isolated paragraph is likely to be lost during summarisation and would not be presented to a user in a search. The paragraph may contain very important insight, but because it does not make up a significant element of the semantic meaning of the original document, it is likely it would not be captured in the document-level summary.
In contrast, in the method 100 discussed above, the document covering Ukraine will have been segmented multiple times based on semantic similarity. The paragraph on drones will be considered semantically different from other topics relating to Ukraine and will be separated into an individual segment. This segment on drones may be summarised to reduce its length, whilst maintaining its semantic meaning, to make it quicker to search and easier to understand by a user. The important insight only found in an isolated paragraph of a document which primarily covers a different topic entirely is still presented to a user in response to their keyword search. It is therefore possible to retrieve information efficiently and without careful inspection of the full content of all documents.
The steps of segmenting and summarising documents are illustrated schematically in the flow of Figure 2. As shown, a plurality of documents 10 are broken down into segments 12, and each segment 12 is summarised to create a corresponding summary 14. Figure 2 shows a set of text documents 10 which may be obtained through step s101. For illustrative purposes, three text documents 10 are shown in Figure 2. However, in practice the method may be used with very large bodies of documents (e.g. at least 10, at least 100 or at least 1000 text documents). The text documents 10 comprise text data. The data in these text documents is typically unstructured and therefore difficult to search by conventional methods. However, the method may involve obtaining structured data.
As discussed above with reference to step s102, each of the text documents 10 is divided into one or more segments 12. As shown, each text document 10 is segmented into a corresponding plurality of segments 12. Each segment 12 is a sub-section of the respective original document 10. The segments 12 are a series of consecutive passages or excerpts from the respective text documents 10.
For the purpose of illustration, each text document 10 is shown as being divided into four segments 12. However, this is not essential. As discussed above with reference to step s102, the text documents 10 are divided into segments 12 based on semantic similarity between consecutive sentences within their text. Therefore, the number of segments 12 generated from each document 10 will depend in part on the number of different topics and subjects discussed in each document 10.
Subsequently, the segments 12 obtained from the text documents 10 are summarised to generate a corresponding plurality of summaries 14. As discussed above with reference to step s103, abstractive summarisation is performed to reduce the length of the segments 12 whilst maintaining semantic meaning. The summaries 14 may be seen as a plurality of relatively short text fragments that maintain the information in the original text documents 10 but which are easy to search and understand due to their reduced length.
Thereafter, the summaries 14 obtained from the documents 10 may be stored and searched. For instance, the summaries 14 may be stored and searched in the manner discussed above with reference to steps s104 to s108 of Figure 1. The storage and search of summaries 14 will be discussed further with reference to the system and method shown schematically in Figure 3.
Figure 3 shows a system 20 configured to perform the methods discussed herein. The system 20 comprises a summarisation processor 22, a search processor 24 and a database 26 (or other suitable storage medium). The summarisation processor 22, search processor 24 and database 26 are connected by communication lines 28. The communication lines 28 may be wired or wireless and may be the internet or any other network.
The summarisation processor 22 is configured to perform the preliminary process 100a described above with reference to Figure 1. The summarisation processor 22 receives a plurality of text documents 10 as shown by arrow D. The summarisation processor 22 divides the text documents 10 into segments (not shown) and summarises the segments to produce a plurality of summaries 14. The summaries 14 are stored in the database 26 (although other storage is possible). These processes are in accordance with steps s101 to s104 discussed above.
The search processor 24 is configured to perform the search process 100b discussed above with reference to steps s105 to s108 of Figure 1. The search processor 24 receives search queries from a user or other system as shown by arrow Q. The search processor 24 identifies a subset of the summaries from the plurality of summaries 14 stored on the database 26 that are relevant to the search query. The search processor 24 outputs (i.e. returns) this subset of identified summaries 16 to the user or other system as shown by arrow O. The search processor 24 may attempt to match keywords within the search query Q with words in the stored summaries 14. Additionally or alternatively, the processor 24 may attempt to match stored summaries 14 that correspond to a natural language query or stored summaries 14 that are similar or correspond to an embedding within the search query. The search processor 24 may limit its search to stored summaries 14 that meet one or more filters or other criteria defined in the search query. The filters may define for example, appropriate date ranges or sources from which summaries should be identified and returned by the system 20.
To identify and obtain the summaries 16 that are relevant to the search query, the search processor 24 may contact the database 26 via the summarisation processor 22 or directly using a further communication line (not shown). The search processor 24 is further configured to cluster the identified summaries 16. The search processor 24 arranges the identified summaries 16 into clusters based on their semantic similarities and outputs the identified summaries 16 in these clusters. As shown, five summaries 16 have been identified as relevant to the search query Q and they are returned in two clusters 18. An exemplary approach for returning the search results will be discussed below with reference to Figure 4.
The search processor 24 may further order the clusters 18 into a cluster hierarchy and output the clusters 18 in an order or format based on this hierarchy. Each cluster 18 may be arranged in the cluster hierarchy based on its relevance to the search query Q and/or its similarity to the other clusters. The search processor 24 may generate a title for each cluster 18 and output the title in combination with the cluster 18. The titles may be created by recursively summarising the identified summaries 16 within each cluster 18 and then generating a title from the created cluster summary using a title generator tool.
The summarisation processor 22, search processor 24 and database 26 may be located remotely or locally to each other. In addition, one or more of these components may be located remotely from the systems and users that provide text documents 10, that issue search queries and that receive search results.
As shown, the summarisation processor 22 and search processor 24 are separate processors. However, this is not essential and their functions could alternatively be implemented on a single processor. The architecture shown in Figure 3 should be seen as exemplary rather than restrictive.
Figure 4 shows schematically an example of how identified summaries that correspond to a search query may be returned. The figure shows a simplified electronic report 30, containing the identified summaries, that may be returned by the systems discussed herein. The report 30 is a graphical user interface displayed by a monitor, screen, touchscreen or other suitable device.
In the report the plurality of identified summaries 43 are returned together with their summary titles 42. The identified summaries 43 are arranged in clusters and the report includes links to, and is able to display, the original documents from which each identified summary 43 is taken.
The report 30 comprises three panels 31, 32, 33. A cluster panel 31 is shown on the left of the report 30. The cluster panel 31 is configured to list cluster titles 41 corresponding to a plurality of clusters that each contain one or more identified summaries 43. The system is configured to receive a selection of a cluster title 41. When a selection of a cluster title 41 is received, as shown by the hashed cluster title 41a, the list shown in the cluster panel 31 may be expanded to show the summary titles 42 of the identified summaries 43 within the corresponding cluster. A selection may be received using a mouse, keyboard, touchscreen or any other suitable input device. As shown, the summary titles 42 of the identified summaries 43 within the cluster are listed in an indented manner below the selected cluster title 41a to avoid confusion with the cluster titles 41. Other options to differentiate the appearance of the summary titles 42 and cluster titles 41 in the cluster panel 31 are also possible (e.g. colour, typeface, font, size, etc.). The selected cluster title 41a may be visually emphasised or highlighted in comparison to the other cluster titles 41 in response to its selection. By listing cluster titles 41 and summary titles 42, the cluster panel can be seen as equivalent to a contents page in a conventional text report.
A summary panel 32 is shown in the centre of the report 30. The summary panel 32 is configured to show the text of the identified summaries 43 and their corresponding summary titles 42 for the cluster corresponding to the selected cluster title 41a. The relatively short identified summaries are easy to understand and offer rapid understanding of the results of a search query.
The system is further configured to receive a selection of an identified summary 43 shown within the summary panel 32. Again, the selection may be received using a mouse, keyboard, touchscreen or any other suitable input device. The selected identified summary 43a and its summary title 42 may be visually emphasised or highlighted in the report 30 in comparison to the other identified summaries 43, as shown by the hashed fill in Figure 4. In addition, within the summary panel 32 the identified summaries 43 may be indented relative to the summary titles 42. Other options to differentiate the appearance of the selected identified summary 43a in comparison to the other identified summaries 43, and of the identified summaries 43 relative to their summary titles 42, are also possible (e.g. colour, typeface, font, size, etc.).
A document panel 33 is shown on the right of the report 30. The document panel 33 is configured to display the original text document 44 from which the selected identified summary 43a has been generated. Therefore, a user is able to easily investigate the original text document 44 based on the identified summaries returned within the report 30. This further helps a user understand the results of a search query.
Thus the report 30 shown in Figure 4 is a particularly effective approach for returning identified summaries to a user. A user is able to quickly navigate between clusters, to rapidly understand the contents of the clusters based on their titles and text, and investigate the source text documents if necessary. Users may receive and understand information particularly quickly. Nevertheless, the results of a search query may be returned in various other manners including as a text report.
Further methods and systems in accordance with the invention for performing search and information retrieval from large collections of text documents will now be discussed in reference to the flow charts of Figures 5 and 6. These methods may be implemented by one or more processors, which may comprise the features discussed above with reference to Figure 3.
To a user, the method 200 of Figure 5 will appear very similar to the method shown in Figure 1. Search queries are input to a system and summaries that correspond to each search query are returned, optionally arranged in clusters as discussed above. However, rather than generating and storing a summary for each document segment during a preliminary procedure before searching is performed on the stored summaries (as in flow 100 of Figure 1), in the method 200 of Figure 5 the search is performed against the document segments. The summarisation step is only performed on document segments that have been identified as corresponding to the search query. This maintains the benefits of improved search results whilst offering more efficient use of computational resources. When compared to the method 100 of Figure 1, in the method 200 of Figure 5 the preliminary (pre-processing) process 200a can be quicker and less computationally expensive, which can be particularly important where the set of text documents to be searched is particularly large. Similarly, the overall number of document segments which need to be summarised can be reduced in situations where the number of searches to be performed is relatively low in comparison to the number of documents and document segments to be stored. Storage requirements can also be reduced as there is no need to store summaries for all document segments.
The flow 200 shown in Figure 5 begins with the ingestion of text data and the segmentation of this text data into smaller segments. In step s201, one or more text documents are obtained. In step s202 each of these text documents is segmented into a plurality of document segments. These steps broadly correspond respectively to steps s101 and s102 discussed above with reference to Figure 1. They may involve any of the specific, optional or alternative features discussed above with reference to Figure 1 and offer corresponding benefits.
For example, the text documents may be segmented based on the semantic similarity of consecutive sentences. Alternatively, the text documents may be segmented based on a predetermined length for each segment. For example, the document may be split into a new segment after a predetermined number of characters, words or sentences (e.g. every 400 characters or every 8 sentences). Alternatively the document may be split into a new segment at the end of the word or sentence in which a character corresponding to the predetermined number of characters occurs (e.g. at the end of the word or sentence in which the 400th character occurs) or at the end of the sentence which includes the word corresponding to a predetermined number of words for each segment (e.g. at the end of the sentence containing the 80th word). Optionally the text documents are obtained and the segments are generated in a preliminary (pre-processing) procedure 200a before a search procedure 200b occurs. The generated document segments may be stored (e.g. in a database, cloud container or other storage medium) for subsequent searching. The predetermined length for the segments may be in the range from 200 to 1000 characters, preferably from 250 to 800 characters, more preferably still from 300 to 600 characters. Equally, the predetermined length for the segments may be in the range from 4 to 20 sentences and preferably from 5 to 16 sentences.
Optionally, the method may further include step s203 of generating a segment embedding for each of the document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment. These segment embeddings may be subsequently used in search and clustering steps. After their generation, the segment embeddings may be stored in any suitable storage medium such as a vector database or cloud container for later use.
The segment embeddings may be generated from metadata embeddings - i.e. numerical vectors representing metadata associated with the respective document segment or metadata associated with a part of the respective document segment. Suitable metadata embeddings include one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the document segments, wherein the sentence embeddings are numerical vectors representing the respective sentences, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences, and the named entity embeddings are numerical vectors representing words in the respective sentences identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier. The segment embedding may be considered a “compound embedding”, calculated as an average or a weighted average of said metadata embeddings.
However, in more preferred examples the segment embeddings may be derived or calculated directly from the respective document segments. To generate a segment embedding, the text of the document segment may be passed through an embedding model such as ‘gte-base’ or ‘all-mpnet-base-v2’, both of which are available from Hugging Face (RTM).
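By way of illustration only, the interface of such an embedding step may be sketched as follows. The pretrained-model call is shown in comments; the toy hashing embedder below is an illustrative stand-in of our own devising (not part of the disclosed method) so that the mapping of text to a fixed-size, normalised numerical vector can be demonstrated without downloading model weights.

```python
import hashlib
import math

# In practice a pretrained model would be used, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-mpnet-base-v2")
#   segment_embedding = model.encode(segment_text)
# The toy embedder below merely stands in for such a model.

def embed(text, dim=64):
    """Map text to a fixed-size, L2-normalised numerical vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket index.
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

seg_embedding = embed("The committee approved the new budget.")
```

The essential properties relied on by the later search and clustering steps are only that the embedding is deterministic, of fixed dimension, and normalised so that cosine similarities are well behaved.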
The method 200 continues with the search process 200b. Performing a search begins with the receipt of a search query in step s204. Subsequently, in step s205 the method involves identifying document segments that correspond to the search query. These steps correspond to steps s105 and s106 in method 100 of Figure 1, with the exception that the searching is performed against the document segments rather than summaries of these segments.
The search query may have any of the features and advantages discussed above with reference to Figure 1. The search query may include one or more keywords, one or more embeddings, and/or a natural language query. As previously discussed, the search query may additionally define one or more filters which restrict the document segments which should be searched and/or the summaries which should be returned in response to the search query.
Similarly, the step of identifying document segments which correspond to the search query may have any of the features and advantages discussed above with reference to the equivalent step in Figure 1. In preferred examples a sample embedding contained within or derived from the search query may be compared to the segment embeddings, each of which represents a respective document segment. The document segments which have a segment embedding that is sufficiently similar to, or most similar to, the sample embedding may be identified as corresponding to the search query. For example, all document segments which are more similar than a predetermined similarity may be identified as corresponding to the search query, or a predetermined number of most similar document segments may be identified as corresponding to the search query. Alternatively, keywords or named entities in the search query may be compared to the text of each of the document segments, or a retrieval model configured to respond to a natural language query may identify document segments relevant to a natural language query.
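By way of illustration only, the embedding-based identification step described above, returning either a predetermined number of most similar segments or all segments above a similarity threshold, may be sketched as follows. The function names and the two-dimensional toy embeddings are illustrative assumptions rather than part of the disclosed method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two numerical vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def identify_segments(sample_emb, segment_embs, top_k=None, threshold=None):
    """Return indices of document segments corresponding to the query:
    either all segments above a similarity threshold, or the top_k
    most similar segments."""
    scored = sorted(
        ((cosine(sample_emb, e), i) for i, e in enumerate(segment_embs)),
        reverse=True,
    )
    if threshold is not None:
        return [i for s, i in scored if s >= threshold]
    return [i for _, i in scored[:top_k]]

# Toy segment embeddings and a query embedding.
segment_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
hits = identify_segments(query, segment_embs, top_k=2)
```

In a deployed system the linear scan would typically be replaced by an approximate nearest-neighbour index in a vector database, but the identification criterion is the same.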
After identifying document segments that correspond to the search query (s205), the method continues with step s206 of generating a summary for each identified document segment - i.e. for all of the document segments that correspond to the search query. Each summary is shorter than the corresponding document segment, and therefore easier to understand when returned to a user. The summarisation of each segment may occur in the manner described above with reference to step s103 in Figure 1 and offer corresponding benefits. Preferably, a summary of each identified document segment may be generated using a Large Language Model (LLM) such as ChatGPT produced by OpenAI (RTM). In particular, ChatGPT has been found to provide rapid response times and high quality summaries. However, other tools such as DistilBART may also be used. As previously discussed, each summary is typically in the region of 2 to 6 sentences, and preferably 3 to 4 sentences long.
Subsequently, the summaries - i.e. the summaries of the document segments identified as corresponding to the search query - are returned in step s208. This return of summaries is equivalent to step s108 of Figure 1 and may include any of the features and advantages previously discussed. For example, the summaries may be returned in the form of a list or report (e.g. according to the schematic report 30 shown in Figure 4).
Optionally, the method may include the step s207 of arranging the summaries - i.e. the summaries of the document segments identified as corresponding to the search query - into a plurality of clusters based on the similarity of the identified segments and/or the corresponding summaries. Moreover, the summaries may be returned in these clusters.
The clustering process may be analogous to the clustering process discussed above with reference to step s107 in the method 100 of Figure 1. As such, the clustering process may include any of the sub-steps and features discussed previously and offer corresponding benefits. Notably, the clustering process may be performed using either the segment embeddings generated for each segment in step s203 or using summary embeddings produced for each summary generated in step s206. The generation of summary embeddings may use the approach discussed above with reference to Figures 1 to 3.
Preferably the summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings and/or the segment embeddings that correspond to each summary. This process may use an agglomerative clustering algorithm. However, other clustering approaches may also be used. The summary embeddings and/or segment embeddings may be obtained for these steps by accessing the embeddings from storage (e.g. by accessing the segment embeddings generated in step s203) or by generating the embeddings from the respective summaries or segments using the approaches discussed above.
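By way of illustration only, the agglomerative clustering referred to above may be sketched as follows. In practice a library implementation such as scikit-learn's AgglomerativeClustering would typically be used; the minimal single-linkage version below, with its function names and toy embeddings, is an illustrative assumption of our own, showing only the core merge-closest-pairs behaviour.

```python
import math

def agglomerative(points, n_clusters):
    """Minimal agglomerative clustering (single linkage): start with
    one cluster per point and repeatedly merge the two closest
    clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Two obvious groups of toy summary/segment embeddings.
embs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
clusters = agglomerative(embs, n_clusters=2)
```

The same routine applies whether the input vectors are summary embeddings or the segment embeddings of the corresponding document segments.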
In turn, the method may optionally involve arranging or structuring the clusters themselves in a cluster hierarchy. Generating a cluster hierarchy may involve: calculating a plurality of cluster embeddings, wherein each cluster embedding is an average of the summary embeddings of the identified summaries contained within a respective cluster and/or the segment embeddings of the identified document segments that correspond to the summaries contained within the respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings; and ordering the clusters based on their cluster embeddings and the average cluster embedding. Preferably, the cluster having a cluster embedding closest to the average of the cluster embeddings is selected as the starting cluster (also termed the dominant cluster). The model then finds the next closest cluster to the starting cluster, and so on, until all clusters are ordered. Summaries may also be ordered within the clusters as previously discussed. This ordering process may use the steps discussed above with reference to Figure 1, but can be based on summary embeddings generated for each of the summaries and/or on the segment embeddings of the document segments corresponding to each summary.
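By way of illustration only, the cluster-ordering step described above, starting from the dominant cluster nearest the average cluster embedding and then chaining to the nearest remaining cluster, may be sketched as follows. The function names and toy cluster embeddings are illustrative assumptions, not part of the disclosed method.

```python
import math

def average(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def order_clusters(cluster_embeddings):
    """Order clusters starting with the one closest to the average
    cluster embedding (the dominant cluster), then repeatedly select
    the remaining cluster closest to the previously selected one."""
    centre = average(cluster_embeddings)
    remaining = list(range(len(cluster_embeddings)))
    current = min(remaining,
                  key=lambda i: math.dist(cluster_embeddings[i], centre))
    order = [current]
    remaining.remove(current)
    while remaining:
        current = min(remaining,
                      key=lambda i: math.dist(cluster_embeddings[i],
                                              cluster_embeddings[order[-1]]))
        order.append(current)
        remaining.remove(current)
    return order

# Toy cluster embeddings; index 2 lies closest to their average.
cluster_embs = [[0.0, 0.0], [10.0, 10.0], [4.0, 4.0]]
order = order_clusters(cluster_embs)
```

The same nearest-neighbour chaining can be reused to order summaries within each cluster, substituting summary embeddings (or corresponding segment embeddings) for the cluster embeddings.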
As discussed above, titles can be generated for each of the summaries to be returned. Additionally, cluster titles and/or cluster summaries may be generated for the clusters. Suitable processes for these steps are discussed above. The summaries and clusters may be returned with these features.
Therefore, it will be seen that substantially the same processes are used to cluster summaries, order clusters, order summaries within clusters and generate summary titles, cluster titles and cluster summaries in both method 100 of Figure 1 and method 200 of Figure 5. These processes can be performed regardless of whether summaries are generated from all document segments in a preliminary (pre-processing) process or whether summaries are generated only for document segments that are identified as corresponding to a search query. Moreover, as mentioned, the steps may be performed based on either summary embeddings or segment embeddings for the corresponding document segments.
The flexibility of the clustering and ordering processes is further demonstrated in the method 300 shown in Figure 6. This method 300 is very similar to the method 200 shown in Figure 5, with the exception that the document segments that are identified as corresponding to the search query are not summarised before they are returned. Instead, the document segments themselves are returned to a user. The document segments are clustered in the manner discussed above based on their segment embeddings, and the document segments are returned in accordance with their clusters. Optionally, the clusters are ordered, the document segments within each cluster are ordered, and a cluster title or cluster summary is generated as discussed with reference to the previous examples. The preliminary process 300a is equivalent to the preliminary process 200a of the method 200 in Figure 5. In step s301, one or more text documents are obtained. In step s302, the one or more text documents are segmented into a plurality of document segments. In step s303, segment embeddings are generated for each document segment. Further details of these steps are discussed above with reference to Figure 5.
In a search process 300b, a search query is received in step s304 and document segments that correspond to the search query are identified in step s305. Again, these steps are identical or equivalent to steps s204 and s205. Further details of these steps are discussed above with reference to Figure 5.
Thereafter, in step s306, the identified document segments are arranged into a plurality of clusters based on the similarity of the identified document segments. The clustering may be performed based on the segment embeddings of the identified document segments and be consistent with the corresponding clustering processes discussed above. Furthermore, the clusters may be ordered and the document segments within each cluster ordered as previously discussed.
Optionally, in step s307 a cluster summary for each cluster is generated based on the identified document segments within said cluster, wherein each cluster summary is shorter than the identified document segments within said cluster. The cluster summary may be generated by summarising a dominant document segment within said cluster (i.e. the document segment with a segment embedding most similar to the average of the segment embeddings of the cluster) or by summarising a concatenation of document segments from within the cluster. Equally, the cluster and/or the document segments may be titled using the mechanisms discussed above.
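By way of illustration only, the selection of the dominant document segment described above may be sketched as follows; the subsequent LLM summarisation of the selected segment is not shown. The function name and toy embeddings are illustrative assumptions, not part of the disclosed method.

```python
import math

def dominant_segment(segment_embeddings):
    """Return the index of the segment whose embedding is closest to
    the cluster's average embedding; that segment (or its summary)
    can then serve as the basis of the cluster summary."""
    n, dim = len(segment_embeddings), len(segment_embeddings[0])
    centre = [sum(e[k] for e in segment_embeddings) / n for k in range(dim)]
    return min(range(n),
               key=lambda i: math.dist(segment_embeddings[i], centre))

# Toy cluster of three segment embeddings; the middle one is dominant.
cluster = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
idx = dominant_segment(cluster)
```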
Finally, in method 300 the identified document segments (i.e. the document segments identified as corresponding to the search query) are returned arranged in their clusters in step s308. Preferably, the clusters are returned with the corresponding cluster summaries and with any cluster or document segment titles. This method, which returns relevant document segments in response to a search query rather than summaries of said document segments, offers a computationally efficient search process as the amount of information which must be transferred and stored is reduced. Moreover, the results of the search are easily understood. Discrepancies between document segments and their summaries and/or data loss during the summarisation process are avoided.
The methods described above demonstrate the flexibility of the clustering and ordering processes discussed herein. These steps may be performed on a large range of search results, including on summaries generated and stored in a preliminary or pre-processing step and subsequently identified as being relevant to a search query (as in Figure 1), on summaries generated following the receipt of a search query from document segments identified as corresponding to the search query (as in Figure 5), and on document segments themselves (as in Figure 6).
It is important to note that while the present invention has been described in a context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of a particular type of signal bearing media actually used to carry out distribution.
Generally, any of the functionality described in this text or illustrated in the figures can be implemented using software, firmware (e.g., fixed logic circuitry), programmable or non-programmable hardware, or a combination of these implementations. The terms "component" or "function" as used herein generally represent software, firmware, hardware or a combination of these. For instance, in the case of a software implementation, the terms "component" or "function" may refer to program code that performs specified tasks when executed on a processing device or devices. The illustrated separation of components and functions into distinct units within the block diagrams above may reflect an actual physical grouping and allocation of such software and/or hardware, or can correspond to a conceptual allocation of different tasks performed by a single software program and/or hardware unit. Thus, the various processes described herein can be implemented on the same processor or different processors in any combination.
As mentioned, methods and processes described herein can be embodied as code (e.g., software code) and/or data. Such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein can be performed by a processor (e.g., a processor of a computer system or data storage system). It should be appreciated by those skilled in the art that computer-readable media include removable and nonremovable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that is capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any propagating signals.
Although specific embodiments of the disclosure have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the disclosure. Embodiments of the present disclosure are not restricted to operation within certain specific data processing environments, but are free to operate within a plurality of data processing environments. Additionally, although embodiments of the present disclosure have been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present disclosure is not limited to the described series of transactions and steps. Various features and aspects of the above-described embodiments may be used individually or jointly.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope as set forth in the claims. Thus, although specific disclosure embodiments have been described, these are not intended to be limiting. Various modifications and equivalents are within the scope of the following claims. The modifications and variations include any relevant combination of the disclosed features.

Claims

1. A method performed by one or more processors, in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments; receiving a search query; returning one or more summaries of respective document segments within the plurality of document segments, wherein the summaries and/or the corresponding document segments have been identified to correspond to the search query; wherein each returned summary is shorter than its respective document segment.
2. A method according to claim 1, wherein the one or more processors perform the further steps of either: in response to receiving the search query, identifying document segments that correspond to the search query, and generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment; and wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the summaries generated for the identified document segments; or, prior to receiving the search query, generating a summary for each document segment in the plurality of document segments and storing said summaries, wherein each summary is shorter than the corresponding document segment; and, in response to receiving the search query, identifying summaries that correspond to the search query;
RECTIFIED SHEET (RULE 91) ISA/EP
and, wherein the step of returning one or more summaries of respective document segments within the plurality of document segments comprises returning the identified summaries.
3. The method of any preceding claim, wherein the one or more processors perform the further steps of: arranging the summaries to be returned into a plurality of clusters based on the similarity of said summaries; and returning said summaries arranged in their clusters.
4. The method of claim 3, wherein the one or more processors perform the further steps of: comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein; wherein the clusters are returned in accordance with the cluster hierarchy.
5. The method of any preceding claim, wherein each text document is segmented based on the similarity of each sentence to at least one preceding sentence, and wherein preferably each text document is segmented based on the similarity of each sentence to the two, three or more preceding sentences.
6. The method of any preceding claim, wherein segmenting the one or more text documents into a plurality of document segments comprises: dividing each text document into a plurality of sentences; generating sentence embeddings for each sentence within the plurality of sentences, wherein the sentence embeddings are numerical vectors representing the respective sentences and their semantic meanings; and iteratively:
(a) determining a cosine similarity between a sentence embedding of a given sentence and a sentence embedding for the sentence preceding the given sentence or between a sentence
embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence;
(b) comparing the cosine similarity to a segmentation threshold; and
(c) in response to determining that said cosine similarity is lower than the segmentation threshold, determining that the given sentence is the start of a new segment; whereas when said cosine similarity is greater than the segmentation threshold, the given sentence is included in the same segment as the sentence that directly precedes the given sentence; wherein steps (a) to (c) are repeated for different given sentences in each document.
7. The method of claim 6, wherein, when the method comprises the step of determining a cosine similarity between a sentence embedding of a given sentence and an average of sentence embeddings for the at least two sentences preceding the given sentence, the average of sentence embeddings is calculated as a weighted average in which the sentence embeddings of the two or more preceding sentences are weighted based on the position of their respective sentence in the document relative to the given sentence.
8. The method of any of claims 1 to 4, wherein segmenting the one or more text documents comprises dividing each text document based on a predetermined length for each segment; wherein preferably each text document is repeatedly split into a new segment after a predetermined number of characters, words or sentences; wherein more preferably each text document is repeatedly split into a new segment at the end of the word or sentence containing a character corresponding to a predetermined number of characters for each segment, or at the end of a sentence containing a word corresponding to a predetermined number of words for each segment.
9. The method of any preceding claim, wherein each summary comprises from 2 to 6 sentences and preferably from 3 to 4 sentences and/or each document segment comprises from 4 to 20 sentences and preferably from 5 to 16 sentences.
10. The method of any preceding claim, wherein each document segment comprises from 200 to 1000 characters, preferably from 250 to 800 characters, more preferably still from 300 to 600 characters.
11. A method according to any preceding claim, wherein the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments; receiving a search query; identifying document segments that correspond to the search query; generating a summary for each identified document segment, wherein each summary is shorter than the corresponding identified document segment; and, returning the summaries of the identified document segments that correspond to the search query.
12. A method according to claim 11, wherein the one or more processors perform the step of storing the plurality of document segments, wherein preferably the step of storing the plurality of document segments is performed prior to receiving a search query.
13. A method according to any of claims 11 to 12, wherein the one or more processors perform the further steps of: generating a segment embedding for each of the document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment; and preferably storing the segment embeddings;
wherein preferably the steps of generating a segment embedding for each of the document segments and storing the segment embeddings are performed prior to receiving a search query.
14. A method according to claim 13, wherein identifying document segments that correspond to the search query is based on: a comparison of the segment embeddings and a sample embedding contained within or derived from the search query; and/or a comparison of the named entities and/or keywords within the document segments and sample named entities and/or keywords contained within or derived from the search query.
15. A method according to any of claims 11 to 14, wherein the one or more processors perform the further steps of: arranging the summaries of the identified document segments into a plurality of clusters based on the similarity of the identified document segments and/or the summaries; and returning said summaries arranged in their clusters.
16. A method according to claim 15, wherein the one or more processors perform the further steps of: comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query, the similarity of the summaries contained therein and/or the similarity of the identified document segments that correspond to the summaries contained therein; wherein the clusters are returned in accordance with the cluster hierarchy.
17. The method of claim 15 or 16, wherein arranging the summaries into a plurality of clusters comprises: obtaining a segment embedding for each of the identified document segments, wherein each segment embedding is a numerical vector representing the corresponding document segment, and arranging the summaries into a
plurality of clusters based on the similarity of the segment embeddings of their respective identified document segments; and/or obtaining a summary embedding for each of the summaries generated from the identified document segments, wherein each summary embedding is a numerical vector representing the corresponding summary, and arranging said summaries into a plurality of clusters based on the similarity of their summary embeddings.
18. The method of claim 16 or claim 17, wherein the summaries are arranged into a plurality of clusters using a hierarchical clustering algorithm and/or an agglomerative clustering algorithm.
19. The method of any of claims 16 to 18, wherein generating a cluster hierarchy comprises: calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the summaries contained within the respective cluster and/or the segment embeddings of the identified document segments that correspond to the summaries contained within the respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings; and ordering the clusters based on their cluster embeddings and the average cluster embedding.
20. The method of claim 19, wherein ordering the clusters based on their cluster embeddings and the average cluster embedding comprises: identifying a similarity between the cluster embedding of each cluster within the plurality of clusters and the average cluster embedding, and either: ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding; or: selecting as a first cluster in a cluster order the cluster that has the greatest similarity between its respective cluster embedding and the
average cluster embedding, and generating the remainder of the cluster hierarchy by iteratively: calculating a similarity between each of the respective cluster embeddings of the remaining clusters and the cluster embedding of the previously selected cluster; and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster.
21. The method of any of claims 15 to 20, wherein the one or more processors perform the further steps of: generating a summary order for the summaries within each cluster; and returning the summaries within each cluster in accordance with the summary order; wherein preferably generating a summary order for the summaries within each cluster comprises identifying a similarity between the respective summary embedding of each summary and the average summary embedding of said cluster and/or identifying a similarity between the segment embeddings of the respective identified document segments corresponding to each summary and the average segment embedding of said cluster, and either: ordering the summaries based on the similarity between their respective summary embeddings and the average summary embedding of said cluster or the similarity between the segment embeddings of their respective identified document segments and the average segment embedding of said cluster; or: selecting as a first summary in the summary order, the summary that has the greatest similarity between its summary embedding and the average summary embedding for said cluster or between the respective segment embeddings of its corresponding identified document segment and the average segment embedding for said cluster, and generating the remainder of the summary order by iteratively:
calculating a similarity between the respective summary embeddings of the remaining summaries within said cluster and the summary embedding of the previously selected summary, and selecting as the next summary in the summary order the summary which has a summary embedding that is most similar to the summary embedding of the previously selected summary; or calculating a similarity between the respective segment embeddings of the identified document segments corresponding to the remaining summaries and the segment embedding of the identified document segment corresponding to the previously selected summary, and selecting as the next summary in the summary order the summary whose corresponding segment embedding is most similar to the segment embedding of the identified document segment corresponding to the previously selected summary.
22. The method of any of claims 15 to 21, wherein the one or more processors perform the further steps of: generating and storing a summary title for each of the summaries, and returning each summary with its corresponding summary title; generating a cluster title for each cluster, and returning the summaries arranged in their clusters with their corresponding cluster titles; and/or generating a cluster summary for each cluster based on the summaries within each cluster or the identified document segments corresponding to the summaries within each cluster, wherein each cluster summary is shorter than the concatenated length of the summaries and/or the concatenated length of the identified document segments corresponding to the summaries within each cluster, and returning the summaries arranged in their clusters with their corresponding cluster summaries.
23. The method of any of claims 1 to 10, wherein the one or more processors perform the steps of: obtaining one or more text documents;
segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and, returning the identified summaries that correspond to the search query.
24. The method of claim 23, wherein the one or more processors perform the further steps of: arranging the identified summaries into a plurality of clusters based on the similarity of the identified summaries; and returning the identified summaries arranged in their clusters.
25. The method of claim 24, wherein the one or more processors perform the further steps of: comparing the clusters and generating a cluster hierarchy in which the clusters are ordered depending on their relevance to the search query and/or the similarity of the summaries contained therein; wherein the clusters of identified summaries are returned in accordance with the cluster hierarchy.
26. The method of either claim 24 or claim 25, wherein arranging the identified summaries into a plurality of clusters comprises: generating at least one metadata embedding for each of the stored summaries, wherein the metadata embedding is a numerical vector representing metadata associated with the respective stored summary or metadata associated with a part of the respective stored summary; and calculating a summary embedding for each stored summary based on the at least one metadata embedding;
and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings; wherein preferably, the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using a hierarchical clustering algorithm and/or an agglomerative clustering algorithm.
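The agglomerative clustering that claim 26 permits can be sketched with a minimal single-linkage implementation. This is one simplified variant of the hierarchical/agglomerative family the claim names; the distance metric (Euclidean) and the merge cut-off are assumptions for illustration.

```python
from math import dist  # Euclidean distance between two equal-length vectors


def agglomerative_clusters(embeddings, max_distance=1.0):
    """Single-linkage agglomerative clustering over numeric embedding vectors.

    Returns clusters as lists of indices into `embeddings`. Merging stops when
    the closest remaining pair of clusters is farther apart than `max_distance`.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(embeddings[a], embeddings[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_distance:
            break  # remaining clusters are too dissimilar to merge
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

In practice a library routine (e.g. an off-the-shelf hierarchical clustering implementation) would replace this O(n³) loop, but the merge-until-dissimilar behaviour is the same.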
27. The method of any of claims 24 to 26, wherein arranging the identified summaries into a plurality of clusters comprises: dividing each of the stored summaries into one or more sentences; generating one or more of a sentence embedding, a key word embedding and a named entity embedding for each of the sentences within the stored summaries; wherein the sentence embeddings are numerical vectors representing the respective sentences, the key word embeddings are numerical vectors representing individual words or phrases in the respective sentences, and the named entity embeddings are numerical vectors representing words identified as named entities that correspond to a Named Entity Recognition (NER) tag by an NER classifier in the respective sentences; calculating a summary embedding for each stored summary, wherein the summary embedding is a numerical vector representing a weighted average of the sentence embeddings, the key word embeddings and/or the named entity embeddings for each sentence in said stored summary; and wherein the method comprises in response to receiving a search query: arranging the identified summaries into a plurality of clusters based on the similarity of their summary embeddings; wherein preferably the identified summaries are arranged into a plurality of clusters based on the similarity of their summary embeddings using a hierarchical clustering algorithm and/or an agglomerative clustering algorithm.
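Claim 27's summary embedding is a weighted average over the sentence, key word and named entity embeddings of a summary. A minimal sketch of that combination step is below; the vectors are toy inputs and the weights are illustrative, since the claim does not fix particular values.

```python
def weighted_average(groups, weights):
    """Weighted mean over groups of equal-length vectors.

    `groups` holds one list of vectors per embedding type (e.g. sentence,
    key word, named entity); `weights` gives one weight per group.
    """
    dim = len(next(v for g in groups for v in g))
    acc = [0.0] * dim
    total = 0.0
    for group, w in zip(groups, weights):
        for vec in group:
            for k, x in enumerate(vec):
                acc[k] += w * x
            total += w
    return [x / total for x in acc]
```

A summary's embedding would then be `weighted_average([sentence_vecs, keyword_vecs, entity_vecs], weights)`, with the weights tuned to taste.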
28. The method of claim 26 or claim 27, wherein generating a cluster hierarchy comprises: calculating a cluster embedding for each cluster, wherein each cluster embedding is an average of the summary embeddings of the summaries contained within the respective cluster; calculating an average cluster embedding using an average of the plurality of cluster embeddings, and ordering the clusters based on their cluster embeddings and the average cluster embedding.
29. The method of claim 28, wherein ordering the clusters based on their cluster embeddings and the average cluster embedding comprises: identifying a similarity between the cluster embedding of each cluster within the plurality of clusters and the average cluster embedding, and either: ordering the clusters based on the similarity between their respective cluster embeddings and the average cluster embedding; or: selecting as a first cluster in a cluster order the cluster that has the greatest similarity between its respective cluster embedding and the average cluster embedding, and generating the remainder of the cluster hierarchy by iteratively: calculating a similarity between each of the respective cluster embeddings of the remaining clusters and the cluster embedding of the previously selected cluster; and selecting as the next cluster in the cluster hierarchy the cluster which has a cluster embedding that is most similar to the embedding of the previously selected cluster.
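The second ordering option of claim 29 — start from the cluster closest to the average embedding, then greedily chain to the nearest remaining cluster — can be sketched as follows. Cosine similarity is used here as one reasonable choice of similarity measure; the claim does not mandate a specific one.

```python
from math import sqrt


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def order_clusters(cluster_embeddings):
    """Greedy chain ordering per claim 29: first the cluster most similar to
    the average cluster embedding, then repeatedly the remaining cluster most
    similar to the previously selected one. Returns cluster indices in order."""
    n = len(cluster_embeddings)
    dim = len(cluster_embeddings[0])
    avg = [sum(v[k] for v in cluster_embeddings) / n for k in range(dim)]
    remaining = list(range(n))
    order = [max(remaining, key=lambda i: cosine(cluster_embeddings[i], avg))]
    remaining.remove(order[0])
    while remaining:
        prev = cluster_embeddings[order[-1]]
        nxt = max(remaining, key=lambda i: cosine(cluster_embeddings[i], prev))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The same greedy chain, applied to summary embeddings within a cluster, gives the within-cluster summary ordering of claim 30.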
30. The method of any of claims 24 to 29, further comprising: generating a summary order for the identified summaries within each cluster; and returning the identified summaries within each cluster in accordance with the summary order;
wherein preferably generating a summary order for the identified summaries within each cluster comprises identifying a similarity between the respective summary embedding of each identified summary and the average summary embedding of said cluster, and either: ordering the summaries based on the similarity between their respective summary embeddings and the average summary embedding of said cluster; or: selecting as a first summary in the summary order, the summary that has the greatest similarity between its summary embedding and the average summary embedding for said cluster, and generating the remainder of the summary order by iteratively: calculating a similarity between the respective summary embeddings of the remaining identified summaries within said cluster and the summary embedding of the previously selected summary; and selecting as the next summary in the summary order the identified summary which has a summary embedding that is most similar to the embedding of the previously selected summary.
31. The method of any of claims 23 to 30, further comprising: generating and storing a summary title for each of the stored summaries and/or generating a summary title for each identified summary that corresponds to the search query, and returning each identified summary with its corresponding summary title; and/or generating a cluster title for each cluster and returning the identified summaries arranged in their clusters with their corresponding cluster titles; and/or generating a cluster summary for each cluster based on the identified summaries within each cluster or the corresponding document segments, wherein each cluster summary is shorter than the concatenated length of the identified summaries within each cluster and/or the concatenated length of the corresponding document segments.
32. A method performed by one or more processors, in which the one or more processors perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments; receiving a search query; identifying document segments that correspond to the search query; arranging the identified document segments into a plurality of clusters based on the similarity of said identified document segments; generating a cluster title and/or cluster summary for each cluster based on the identified document segments within each cluster, wherein each cluster summary is shorter than the concatenated length of the identified document segments in each cluster; and returning the identified document segments arranged in their clusters with their corresponding cluster titles and/or cluster summaries.
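The cluster-title generation of claim 32 could take many forms (the application leaves the generator open); a deliberately naive sketch is to surface the most frequent content words across a cluster's segments. The stop-word list and word count below are arbitrary assumptions, and a real system would more likely use an abstractive model.

```python
import re
from collections import Counter


def cluster_title(segments, n_words=3):
    """Naive extractive cluster title: the most frequent non-stop-word terms
    across the cluster's document segments. A placeholder for an abstractive
    title generator."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for", "as"}
    words = [w
             for seg in segments
             for w in re.findall(r"\w+", seg.lower())
             if w not in stop]
    return " ".join(w for w, _ in Counter(words).most_common(n_words))
```

A cluster summary (shorter than the concatenated segments, as the claim requires) would be produced analogously, e.g. by summarising the concatenated segment text.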
33. A system comprising a plurality of processors, the processors configured to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and, returning the identified summaries that correspond to the search query.
34. A system comprising a plurality of processors, the processors configured to perform the steps of the method of any of claims 1 to 32.
35. A non-transitory computer readable medium storing instructions which when run by one or more processors cause the one or more processors to perform the steps of: obtaining one or more text documents; segmenting the one or more text documents into a plurality of document segments, wherein each text document is segmented based on the similarity of consecutive sentences in the document; generating a summary for each document segment, wherein each summary is shorter than the corresponding document segment; storing the summaries generated for the plurality of document segments; receiving a search query; identifying stored summaries that correspond to the search query; and, returning the identified summaries that correspond to the search query.
36. A non-transitory computer readable medium storing instructions which when run by one or more processors cause the one or more processors to perform the steps of the method of any of claims 1 to 32.
PCT/EP2024/067524 2023-06-23 2024-06-21 Methods and systems for search and information retrieval Pending WO2024261293A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP23181312 2023-06-23
EP23181312.2 2023-06-23

Publications (1)

Publication Number Publication Date
WO2024261293A1 2024-12-26

Family

ID=87047924

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/067524 Pending WO2024261293A1 (en) 2023-06-23 2024-06-21 Methods and systems for search and information retrieval

Country Status (1)

Country Link
WO (1) WO2024261293A1 (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MARSHALL IAIN J ET AL: "Trialstreamer: A living, automatically updated database of clinical trial reports", vol. 27, no. 12, 9 December 2020 (2020-12-09), AMSTERDAM, NL, pages 1903 - 1912, XP093095834, ISSN: 1067-5027, Retrieved from the Internet <URL:http://academic.oup.com/jamia/article-pdf/27/12/1903/34838541/ocaa163.pdf> [retrieved on 20231027], DOI: 10.1093/jamia/ocaa163 *

Similar Documents

Publication Publication Date Title
Bennani-Smires et al. Simple unsupervised keyphrase extraction using sentence embeddings
Sumathy et al. Text mining: concepts, applications, tools and issues-an overview
US9659084B1 (en) System, methods, and user interface for presenting information from unstructured data
US12032915B2 (en) Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model
CN108319583B (en) Method and system for extracting knowledge from Chinese language material library
WO2010014082A1 (en) Method and apparatus for relating datasets by using semantic vectors and keyword analyses
Krishnaveni et al. Automatic text summarization by local scoring and ranking for improving coherence
Jabbar et al. A survey on Urdu and Urdu like language stemmers and stemming techniques
CN118535710A (en) Text processing method and device, and computer system for improving information retrieval and generation quality
JP2024091709A (en) Sentence creation device, sentence creation method, and sentence creation program
Jain et al. Context sensitive text summarization using k means clustering algorithm
Ramachandran et al. Document clustering using keyword extraction
CN112949287B (en) Hot word mining method, system, computer equipment and storage medium
CN117688140B (en) Document query method, device, computer equipment and storage medium
US11853356B1 (en) System and method for generating hierarchical mind map and index table
Saikumar et al. Two-Level text summarization using topic modeling
Sariki A book recommendation system based on named entities
Ghorpade et al. A comparative analysis of TextRank and LexRank algorithms using text summarization
WO2024261293A1 (en) Methods and systems for search and information retrieval
Modi et al. Multimodal web content mining to filter non-learning sites using NLP
Helmy et al. Towards building a standard dataset for arabic keyphrase extraction evaluation
US20060248037A1 (en) Annotation of inverted list text indexes using search queries
US20210073258A1 (en) Information processing apparatus and non-transitory computer readable medium
JP2000105769A (en) Document display method
Ali et al. Towards Sindhi named entity recognition: Challenges and opportunities

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24733243

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024733243

Country of ref document: EP