Disclosure of Invention
The invention provides a text abstract generating method, which solves the problem that the core information of a research report cannot be quickly acquired when the research report is read.
The invention provides a text abstract generating method, which specifically comprises the following steps:
randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein each leaf node of the tree represents one article;
and fusing the articles connected in the tree-shaped balanced binary tree pairwise, layer by layer, to generate a target text abstract that fuses the key information of the at least two articles.
Preferably, the tree-shaped balanced binary tree includes 1st to Nth layer nodes, the Nth layer nodes are leaf nodes, N is a positive integer, and the fusing of the articles connected in the tree-shaped balanced binary tree layer by layer includes:
fusing the Nth layer node articles into (N-1)th layer node articles, wherein each node of the (N-1)th layer represents one (N-1)th layer node article, and each (N-1)th layer node article is generated by fusing the Nth layer node articles connected to the same (N-1)th layer node;
and fusing the (N-1)th to 1st layers layer by layer to generate a 1st layer node article, wherein the 1st layer node article is generated by fusing the 2nd layer node articles connected to the 1st layer node, and the 1st layer node article is the target text abstract.
Preferably, the fusing of the articles in pairs specifically includes the following steps:
Determining two connected articles through the same upper node;
Identifying the requirement information in the two connected articles by using a named entity identification technology, and extracting key sentences from the requirement information, wherein the key sentences are sentences containing the key information in the articles;
Screening the extracted key sentences based on the similarity between the key sentences, wherein the similarity between the key sentences is determined by the similarity calculated based on anchor points and the cosine similarity calculated based on semantics;
and splicing the screened key sentences to obtain a key sentence set after the two articles are fused.
Preferably, the identifying of the requirement information in the two connected articles by using a named entity identification technology, and the extracting of key sentences from the requirement information, specifically comprises the following steps:
identifying the requirement information in the two connected articles through a named entity identification technology, and filling the requirement information into a preset question template, thereby generating a plurality of questions;
and extracting an answer fragment from each article for each question by using a Mengzi-BERT model, wherein the obtained answer fragments are the key sentences.
Preferably, the filtering of the extracted key sentences based on the similarity between the key sentences specifically includes the following steps:
matching elements in the two key sentence sets according to the similarity to form a bipartite graph;
and selecting a key sentence set with the highest information content and the least sentences through a greedy algorithm.
Preferably, the similarity between the key sentences is calculated as follows:
For the cosine similarity calculated based on semantics, the vector representation v_i of each key sentence is first computed by the Mengzi pre-training model, where i = 1, 2, ..., x, and x, a positive integer, is the number of key sentences; the cosine similarity between two vectors is then calculated by the formula
sim_cos(v_i, v_j) = (v_i · v_j) / (|v_i| |v_j|);
For the similarity calculated based on anchor points, it can be confirmed by a numerical comparison mode or a character comparison mode; the two parts are combined as sim = λ·sim_anchor + (1 - λ)·sim_cos, wherein λ represents the weight coefficient of the similarity calculated based on the anchor points.
Preferably, the text abstract generating method further comprises the following steps:
after the fusion of all the articles is completed, the key sentences in the finally fused article are ordered by using a Mengzi-BERT pre-training model to obtain an initial target text abstract, and transition texts are generated between the key sentences of the initial target text abstract through a Mengzi-T5 model, thereby obtaining the target text abstract with the transition texts.
Preferably, the generating of the transition text between the key sentences of the initial target text abstract through the Mengzi-T5 model, thereby obtaining the target text abstract with the transition text, specifically comprises the following steps:
setting masks among key sentences of the initial target text abstract respectively;
and predicting the contents of the masks through a Mengzi-T5 model to obtain the generated transition text, thereby obtaining the target text abstract with the transition text, wherein the transition text is used for perfecting the logical relationship between adjacent key sentences in the initial target text abstract.
Preferably, the text abstract generating method further comprises the following steps:
Summary and/or summary text is generated for the target text summary with transition text, the summary text being a paragraph or sentence, resulting in a final target text summary.
Preferably, the generating summary and/or summary text for the target text summary with transition text, wherein the summary text is paragraphs or sentences, thereby obtaining a final target text summary, specifically includes the following steps:
Generating a topic vocabulary for each key sentence in sequence through a Mengzi-T5 model and marking the topic vocabulary;
designing a prompt question-answer template for querying the main content of each topic vocabulary in the target text abstract with the transition text, setting the answers of the corresponding questions as masks, and predicting the masked contents by using a Mengzi-T5 model to obtain the summary and/or summary text;
and combining the summary and/or summary text and the target text abstract with the transition text to obtain the final target text abstract.
Compared with the prior art, the text abstract generation method has the following advantages:
1. The text abstract generation method specifically comprises the steps of randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article, and fusing articles connected in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles. It can be understood that the target text abstract is generated by fusing the key information of the articles, so that a reader can quickly acquire the key information of the articles, and the reading efficiency of the reader is improved.
2. The tree balanced binary tree comprises 1 st to N th layers of nodes, wherein the N th layers of nodes are leaf nodes, N is a positive integer, the articles connected in the tree balanced binary tree are fused in pairs layer by layer, the tree balanced binary tree comprises the steps of fusing the N th layers of node articles into N-1 st layers of node articles, each node of the N-1 st layer represents one N-1 st layer of node articles, each N-1 st layer of node articles are generated by fusing the N th layer of node articles connected to the same N-1 st layer of node, the N-1 st to 1 st layer of node articles are fused in layers to generate the 1 st layer of node articles, the 1 st layer of node articles are generated by fusing the 2 nd layer of node articles connected to the 1 st layer of node, and the 1 st layer of node articles are the target text abstract. The abstract main body obtained by the method obtains the final key set by fusing two articles connected in the tree-shaped balanced binary tree layer by layer, so that the final key set can be ensured to contain all key information, and less important information can be removed in the layer-by-layer fusion, so that the information redundancy is reduced under the condition of ensuring comprehensive information coverage, and the reading experience is improved.
3. The method specifically comprises the steps of determining two connected articles through the same upper node, identifying required information in the two connected articles through a named entity identification technology, extracting key sentences from the required information, wherein the key sentences are sentences containing the key information in the articles, screening the extracted key sentences based on similarity among the key sentences, determining the similarity among the key sentences through similarity calculated based on anchor points and cosine similarity calculated based on semantic, and splicing the screened key sentences to obtain a key sentence set after the fusion of the two articles. It can be understood that the key sentences are screened and spliced based on the similarity between the key sentences, so that the obtained key sentence set can contain all important information.
4. In the invention, the requirement information in two connected articles is identified by a named entity identification technology and is filled into a preset question template to generate a plurality of questions, and an answer fragment is extracted from each article for each question by using a Mengzi-BERT model, wherein the obtained answer fragments are key sentences. It can be understood that, on the one hand, the generated questions can be controlled through the preset question templates, so that the user can conveniently introduce a preference for a subject or entity of particular interest, and on the other hand, the use of the extractive question-answering model makes it convenient to find more accurate key sentence fragments in the paragraphs.
5. In the method, elements in two key sentence sets are matched pairwise according to the similarity to form a bipartite graph, and the key sentence set with the largest information content but the smallest sentence number is selected through a greedy algorithm to be used as the key sentence set after the articles are fused. It can be understood that the more compact and better the key sentence set after the fusion of the articles, the more comprehensive the content is ensured under the condition of reducing the reading quantity, and the user can be helped to acquire the information quickly.
6. In the invention, the similarity between key sentences is comprehensively determined by the similarity calculated based on anchor points and cosine similarity calculated based on semantics. The design is beneficial to enhancing the reliability of the similarity comparison result between the key sentences, so that the readability of the sequenced texts is enhanced, and the reading experience of a user is improved.
7. In the invention, transition text is generated between key sentences of the initial target text abstract through the Mengzi-T5 model, so that the target text abstract with the transition text is obtained. It can be appreciated that, through the previous steps, the initial target text abstract already contains substantially all the information necessary for an article abstract, but this information is only spliced together very directly, so that continuity and readability are poor when reading; generating transition text between the key sentences of the initial target text abstract can further improve the readability of the target text abstract, thereby further improving the reading experience of the user.
8. In the method, the transition text is generated by setting masks between the key sentences of the initial target text abstract and predicting the contents of the masks through a Mengzi-T5 model, so that the target text abstract with the transition text is obtained; the transition text is used for perfecting the logical relationship between adjacent key sentences in the initial target text abstract. Through the preceding algorithm steps, the initial target text abstract already contains all the necessary information of a research report abstract, so the transition text is usually a short logical phrase, connective or subtitle, and does not carry much additional information. The generation of the transition text can thus be simplified into the generation of logical phrases, connectives or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
9. The text abstract generating method further comprises the steps of generating summary and/or summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or sentence, and the summary and/or summary text and the target text abstract with the transition text are combined, so that the final target text abstract is obtained. It can be appreciated that generating the summary and/or summary text can make the final target text abstract more convenient to read, assist the user in more quickly acquiring information, and further improve the user's reading experience.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples of implementation in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The terms "vertical," "horizontal," "left," "right," "upper," "lower," "upper left," "upper right," "lower left," "lower right," and the like are used herein for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a text summary generating method, which specifically includes the following steps:
Step S1, randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article;
Step S2, fusing articles connected in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles.
It can be understood that the target text abstract is generated by fusing the key information of a plurality of articles, so that a reader can quickly acquire the key information of the plurality of articles, and the reading efficiency of the reader is improved.
Specifically, in this embodiment, the article is a report.
Further, the tree-shaped balanced binary tree comprises 1st to Nth layer nodes, the Nth layer nodes are leaf nodes, N is a positive integer, and the articles connected in the tree-shaped balanced binary tree are fused pairwise layer by layer, which specifically comprises the following steps:
Fusing the Nth layer node articles into (N-1)th layer node articles, wherein each node of the (N-1)th layer represents one (N-1)th layer node article, and each (N-1)th layer node article is generated by fusing the Nth layer node articles connected to the same (N-1)th layer node;
Fusing the (N-1)th to 1st layers layer by layer to generate the 1st layer node article, wherein the 1st layer node article is generated by fusing the 2nd layer node articles connected to the 1st layer node, and the 1st layer node article is the target text abstract. Specifically, the (N-1)th layer node articles are fused into (N-2)th layer node articles, wherein each node of the (N-2)th layer represents one (N-2)th layer node article and each (N-2)th layer node article is generated by fusing the (N-1)th layer node articles connected to the same (N-2)th layer node; this continues until the 3rd layer node articles are fused into 2nd layer node articles, wherein each node of the 2nd layer represents one 2nd layer node article and each 2nd layer node article is generated by fusing the 3rd layer node articles connected to the same 2nd layer node.
The abstract main body obtained by the method obtains the final key set by fusing two articles connected in the tree-shaped balanced binary tree layer by layer, so that the final key set can be ensured to contain all key information, and less important information can be removed in the layer-by-layer fusion, so that the information redundancy is reduced under the condition of ensuring comprehensive information coverage, and the reading experience is improved.
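The layer-by-layer pairwise fusion described above can be sketched as follows. This is a minimal illustration: `fuse_pair` is a hypothetical placeholder that simply concatenates, standing in for the full key-sentence extraction and screening pipeline of steps S21 to S24.

```python
def fuse_pair(a, b):
    # Hypothetical placeholder: a real implementation would extract,
    # screen and splice key sentences (steps S21-S24).
    return a + b

def fuse_articles(articles, fuse=fuse_pair):
    """Treat the articles as the leaf layer (layer N) of a balanced
    binary tree and fuse adjacent pairs upward until a single
    layer-1 node article (the target text abstract) remains."""
    layer = list(articles)              # leaf nodes, one article each
    while len(layer) > 1:
        next_layer = [fuse(layer[i], layer[i + 1])
                      for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:              # odd leftover is carried upward
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]

print(fuse_articles([["a"], ["b"], ["c"], ["d"]]))  # → ['a', 'b', 'c', 'd']
```

With four articles this performs two fusion layers (N = 3), mirroring the bottom-up traversal in the text.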
Referring to fig. 1 and fig. 2, the pairwise fusing of articles in step S2 specifically includes the following steps:
step S21, determining two connected articles through the same upper node;
Step S22, identifying the requirement information in two connected articles by using a named entity identification technology, and extracting key sentences from the requirement information, wherein the key sentences are sentences containing the key information in the articles;
Step S23, screening the extracted key sentences based on the similarity between the key sentences, wherein the similarity between the key sentences is jointly determined by the similarity calculated based on anchor points and the cosine similarity calculated based on semantics;
And step S24, splicing the screened key sentences to obtain a key sentence set after the fusion of the two articles.
It can be understood that the key sentences are screened and spliced based on the similarity between the key sentences, so that the obtained key sentence set can contain all important information.
Referring to fig. 1 to 3, step S22 specifically includes the following steps:
Step S221, identifying the requirement information in the two connected articles by using a named entity identification technology, and filling the requirement information into a preset question template, so as to generate a plurality of questions;
Step S222, extracting answer fragments from each article for each question by using a Mengzi-BERT model, wherein the obtained answer fragments are key sentences.
It can be understood that the generated questions can be controlled through the preset question templates, so that the user can conveniently introduce the preference of the user on the important attention subject or entity, and on the other hand, the more accurate key sentence set fragments in the paragraphs can be conveniently found through the use of the extraction question-answering model.
It should be noted that the named entity recognition technology is prior art; for details, refer to: Che, W., Feng, Y., Qin, L., & Liu, T. (2021). N-LTP: An Open-source Neural Language Technology Platform for Chinese. EMNLP.
Referring to fig. 1 to 4, step S221 specifically includes the following steps:
Step S2211, identifying the requirement information in each article through a named entity identification technology, and filling the requirement information into a preset question template to generate a plurality of questions, wherein the requirement information is determined according to the slots in the question template;
Step S2212, extracting answer fragments from each article for each question by using a Mengzi-BERT model, wherein the obtained answer fragments are the key sentences.
It can be understood that the generated questions can be controlled through the preset question templates, so that the user can conveniently introduce the preference of the user on the important attention subject or entity, and on the other hand, the more accurate key sentence set fragments in the paragraphs can be conveniently found through the use of the extraction question-answering model.
Further, a large number of questions may be prepared in advance in the preset question template, such as "What is the business model of [company]?"; the entity names of several companies in the paragraph are then identified by the named entity identification technology and filled into the question template, so as to generate the questions.
Further, when the answer fragments are extracted, the plurality of questions generated by each report are sequentially recorded as Q1, Q2, ..., QY, where Y is a positive integer; for example, two questions generated by report A are recorded as A.Q1 and A.Q2, and the corresponding answers are recorded as A1 and A2, so as to facilitate subsequent matching.
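The question construction and numbering above can be illustrated with a short sketch. The template strings and the `build_questions` helper are hypothetical; the NER pass producing the entity list and the Mengzi-BERT answer extraction are assumed to happen elsewhere.

```python
# Hypothetical question templates; real templates would be domain-specific.
TEMPLATES = ["What is the business model of {entity}?",
             "What are the main risks of {entity}?"]

def build_questions(report_id, entities, templates=TEMPLATES):
    """Fill NER-identified entity names into the preset question
    templates and number the questions per report (A.Q1, A.Q2, ...)."""
    questions = {}
    qid = 0
    for template in templates:
        for entity in entities:
            qid += 1
            questions[f"{report_id}.Q{qid}"] = template.format(entity=entity)
    return questions

print(build_questions("A", ["Acme Corp"]))
```

Keeping the `report.Qn` keys alongside the question text makes the later answer matching (A.Q1 with A1, and so on) straightforward.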
Referring to fig. 2 and 5, step S23 specifically includes the following steps:
step S231, matching elements in the two key sentence sets according to the similarity to form a bipartite graph;
and S232, selecting a key sentence set with the highest information content and the lowest sentence quantity as a key sentence set obtained by fusing a plurality of articles through a greedy algorithm.
It can be understood that the more compact and better the key sentence set after the fusion of the articles, the more comprehensive the content is ensured under the condition of reducing the reading quantity, and the user can be helped to acquire the information quickly.
It should be noted that the matching algorithm of the bipartite graph can refer to the prior art; for details, see Wang Junli, Zhou Qing, Yang Yaxing. A text semantic similarity analysis method [P]. Shanghai: CN106547739B, 2019-04-02. The present invention uses key sentences rather than topics as nodes: if the similarity of two key sentences is greater than a set threshold, an edge is added between the two sentences. After iteration, a bipartite graph is finally generated, whose nodes represent the sentence numbers in the research reports and whose edges represent the similarity relationship between two sentences. The out-degree of each node, i.e., the number of edges with the node as an endpoint, is first calculated and taken as a representation of the information content of the corresponding sentence; then, through a greedy algorithm, the key sentence set with the highest information content but the fewest sentences is selected as the key sentence set after the fusion of the research reports.
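A minimal sketch of the edge construction and greedy selection described above, under the simplifying assumptions that a similarity function and threshold are given, that degree (number of incident edges) stands in for information content, and that each cluster of mutually similar sentences should contribute exactly one representative.

```python
def greedy_select(sents_a, sents_b, sim, threshold=0.8):
    """Connect sentences from the two sets whose similarity exceeds the
    threshold, then greedily keep the highest-degree sentence of each
    similar cluster and drop its neighbours, so the result covers the
    information with as few sentences as possible."""
    nodes = [("A", i) for i in range(len(sents_a))] + \
            [("B", j) for j in range(len(sents_b))]
    adj = {n: set() for n in nodes}
    for i, sa in enumerate(sents_a):
        for j, sb in enumerate(sents_b):
            if sim(sa, sb) > threshold:
                adj[("A", i)].add(("B", j))
                adj[("B", j)].add(("A", i))
    remaining, keep = set(nodes), []
    while remaining:
        best = max(remaining, key=lambda n: len(adj[n] & remaining))
        keep.append(best)
        remaining -= {best} | adj[best]   # drop the covered neighbours
    text = dict(zip(nodes, list(sents_a) + list(sents_b)))
    return [text[n] for n in keep]

same = lambda a, b: 1.0 if a == b else 0.0   # toy similarity function
print(sorted(greedy_select(["x", "y"], ["x", "z"], same)))  # → ['x', 'y', 'z']
```

Here the duplicated sentence "x" appears once in the output while the unique sentences of both sets are retained.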
Further, the similarity between the key sentences is calculated as follows:
For the cosine similarity calculated based on semantics, the vector representation v_i of each key sentence is first computed by the Mengzi pre-training model, where i = 1, 2, ..., x, and x, a positive integer, is the number of key sentences; the cosine similarity between two vectors is then calculated by the formula
sim_cos(v_i, v_j) = (v_i · v_j) / (|v_i| |v_j|);
For the similarity calculated based on anchor points, it can be confirmed by a numerical comparison mode or a character comparison mode; the two parts are combined as sim = λ·sim_anchor + (1 - λ)·sim_cos, wherein λ represents the weight coefficient of the similarity calculated based on the anchor points.
It can be appreciated that the design is beneficial to enhancing the reliability of the similarity comparison result between key sentences, so as to enhance the readability of the sequenced texts, thereby enhancing the reading experience of users.
Further, in the anchor-based similarity calculation, the numerical comparison mode is applicable when numbers are an important component of the article content: if two sentences contain the same number, the two sentences are considered to be similar sentences.
Further, in the anchor-based similarity calculation, for the character comparison mode, the similarity may be calculated using, for example, the edit distance of the two sentences, the length of their longest common subsequence, or an N-gram similarity.
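The anchor modes above and their weighted combination with the semantic cosine similarity might be sketched as follows. The weight `lam`, the longest-common-subsequence ratio normalization, and the number-matching regular expression are illustrative assumptions; in practice the sentence vectors would come from the Mengzi pre-training model.

```python
import math
import re

def anchor_sim(s1, s2):
    """Anchor-based similarity: numerical comparison first (a shared
    number marks the sentences as similar), otherwise a character
    comparison via a longest-common-subsequence ratio."""
    nums1 = set(re.findall(r"\d+\.?\d*", s1))
    nums2 = set(re.findall(r"\d+\.?\d*", s2))
    if nums1 & nums2:
        return 1.0
    m, n = len(s1), len(s2)
    if not m or not n:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # LCS dynamic program
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s1[i] == s2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / min(m, n)

def cosine_sim(v1, v2):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def combined_sim(s1, s2, v1, v2, lam=0.5):
    # lam: weight of the anchor-based part (assumed combination form)
    return lam * anchor_sim(s1, s2) + (1 - lam) * cosine_sim(v1, v2)

print(combined_sim("revenue was 300", "profit hit 300", [1, 0], [1, 0]))  # → 1.0
```

The shared figure "300" triggers the numerical anchor, and identical vectors give cosine similarity 1, so the combined score is 1 regardless of `lam`.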
Further, in step S2, the arrangement of the key sentences is completed through a Mengzi-BERT pre-training model, which can calculate the probability that a key sentence is arranged at the i-th position, where i is a positive integer, and the key sentences are ordered based on the calculation result.
The Mengzi-BERT pre-training model adopts a sentence-order-prediction pre-training task, so it adapts well to the downstream task of inter-sentence coherence ordering.
Referring to FIG. 6, the arrangement of key sentences is exemplified. Three sentences s1, s2, s3 are given and randomly shuffled into, for example, s2, s3, s1. An input sample for the Mengzi encoder is constructed as "[CLS] <s> ... <s> ... <s> ... [SEP]", where "[CLS]" represents the input sample start symbol, "<s>" represents the start of a sentence, "..." represents the text content of the corresponding sentence, and "[SEP]" represents the separator (which may be regarded here as the end of the input sample). The Mengzi encoder then encodes the sample, and the hidden vectors h1, h2, h3 obtained at the "<s>" positions are used as the key vectors K and value vectors V of the Mengzi decoder (MengziDecoder), while the position vectors p1, p2, p3 are decoded as its query vectors Q. Finally, a pointer network yields P(i, j), representing the probability that the j-th sentence is arranged at the i-th position. The coherence ordering of the three sentences is thus completed; it can be understood that ordering more than three sentences proceeds in the same way.
Details of how a Mengzi-BERT pre-training model can accomplish the coherence ordering are disclosed in Lee, H., Hudson, D. A., Lee, K., & Manning, C. D. (2020). SLM: Learning a Discourse Language Representation with Sentence Unshuffling. EMNLP.
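Given the pointer-network output P(i, j) described above, the final ordering can be recovered with a simple greedy decode; the probability matrix below is a toy stand-in for actual model output.

```python
def decode_order(P):
    """Greedy decode of a pointer-network probability matrix:
    P[i][j] is the probability that sentence j is placed at position i.
    Each position is filled with the most probable unused sentence."""
    used, order = set(), []
    for i in range(len(P)):
        j = max((c for c in range(len(P[i])) if c not in used),
                key=lambda c: P[i][c])
        order.append(j)
        used.add(j)
    return order

# Toy matrix for three shuffled sentences.
P = [[0.1, 0.2, 0.7],
     [0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2]]
print(decode_order(P))  # → [2, 0, 1]
```

A greedy decode is the simplest choice; an exact assignment (e.g. the Hungarian method) could replace it without changing the interface.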
Referring to fig. 7, the summary generating method herein further includes the following steps:
Step S3, after the fusion of all the articles is completed, ordering the key sentences in the finally fused article by adopting a Mengzi-BERT pre-training model to obtain an initial target text abstract, and generating transition texts between the key sentences of the initial target text abstract by adopting a Mengzi-T5 model to obtain the target text abstract with the transition texts.
It will be appreciated that, through the previous steps, the initial target text summary already contains substantially all the information necessary for an article summary, but these information are only spliced very directly, so that continuity and readability are poor when reading, and generating transition text between key sentences of the initial target text summary can further improve the readability of the target text summary, thereby further improving the reading experience of the user.
Referring to fig. 7 and 8, step S3 specifically includes the following steps:
step S31, setting masks among key sentences of the initial target text abstract respectively;
Step S32, predicting the mask contents through a Mengzi-T5 model to obtain the generated transition text, thereby obtaining the target text abstract with the transition text, wherein the transition text is used for perfecting the logical relationship between adjacent key sentences in the initial target text abstract.
It can be appreciated that, through the preceding algorithm steps, the initial target text abstract already contains substantially all of the information necessary for an abstract; therefore, the transition text is usually a relatively short logical phrase, connective or subtitle, and does not carry much additional information. The generation of the transition text can thus be simplified into the generation of logical phrases, connectives or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
For example, three well-ordered key sentences are input and respectively recorded as s1, s2 and s3. The input template is set as "s1 <mask 1> s2 <mask 2> s3 </s>", where "</s>" represents the input end symbol. Finally, the contents of <mask 1>, <mask 2>, etc. are predicted through the Mengzi-T5 model, so that the generated transition texts t1 and t2 are obtained; the target text abstract with the transition text, recorded as Sum, is then "s1 t1 s2 t2 s3".
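Constructing the masked input template above is plain string assembly; the sketch below builds the template only, with the Mengzi-T5 mask-filling call omitted, and the exact `<mask_k>` token spelling is an assumption.

```python
def build_transition_input(key_sents):
    """Interleave mask tokens between the ordered key sentences,
    producing e.g. 's1 <mask_1> s2 <mask_2> s3 </s>'; a Mengzi-T5-style
    model would then fill each mask with a short connective."""
    parts = []
    for k, sent in enumerate(key_sents):
        if k:                            # a mask before every sentence but the first
            parts.append(f"<mask_{k}>")
        parts.append(sent)
    return " ".join(parts) + " </s>"

print(build_transition_input(["s1", "s2", "s3"]))  # → s1 <mask_1> s2 <mask_2> s3 </s>
```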
Further, the Mengzi-T5 model is fine-tuned with a fine-tuning dataset before use, and the fine-tuning dataset is constructed based on part-of-speech tagging and/or punctuation recognition and/or subtitles.
It can be understood that the generation of the transition text can be simplified into the generation of logical phrases, connectives or subtitles between sentences, so constructing the fine-tuning dataset in a targeted manner can effectively ensure the generation effect of the Mengzi-T5 model.
The application scenario of constructing the fine-tuning dataset based on part-of-speech tagging is as follows: an article generally contains many words whose part of speech is a preposition or connective; these prepositions, connectives, etc. are replaced by masks, and when the model is fine-tuned, the replaced words serve as labels and generating the text at the mask positions serves as the training task.
The application scenario of constructing the fine-tuning dataset based on punctuation recognition is as follows: paragraphs in formats such as "Investment advice: ..." commonly appear in research reports, where the text before the colon is a summary of the subsequent content. Therefore, the text before the colon is replaced with a mask symbol, and the fine-tuning task is likewise to generate the text at the mask position.
The application scenario of constructing the fine-tuning dataset based on subtitles is as follows: a research report often contains a large number of subtitles, so the data is reconstructed as subtitle text-paragraph text pairs and the dataset is constructed by the same method as the fine-tuning dataset based on punctuation recognition.
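The punctuation-recognition construction above can be sketched as follows: the text before a colon becomes the generation target and is replaced by a mask token. The regular expression, the 20-character subtitle limit, and the mask token spelling are illustrative assumptions.

```python
import re

def make_masked_examples(paragraphs):
    """Build (model input, label) pairs: the subtitle-style text before
    a colon is replaced by a mask token and becomes the generation
    target during fine-tuning."""
    examples = []
    for paragraph in paragraphs:
        match = re.match(r"([^:]{1,20}):(.+)", paragraph)
        if match:
            subtitle, body = match.group(1), match.group(2)
            examples.append((f"<mask>:{body}", subtitle))
    return examples

print(make_masked_examples(["Investment advice: hold the stock."]))
# → [('<mask>: hold the stock.', 'Investment advice')]
```

The subtitle-based dataset can reuse the same helper by feeding it "subtitle: paragraph" strings.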
With continued reference to fig. 7, the summary generating method herein further includes the following steps:
Step S4, generating the summary and/or summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or sentence, thereby obtaining the final target text abstract.
It can be appreciated that generating the summary and/or summary text can assist the user in more quickly obtaining information, further improving the user's reading experience.
Referring to fig. 7 and 9, step S4 specifically includes the following steps:
s41, sequentially generating and marking a topic vocabulary for each key sentence through a Monte-T5 model;
step S42, designing a prompt question-answer template for inquiring main lecture contents about each topic word in a target text abstract with a transition text, setting answer answers of corresponding questions as masks, and predicting the masked contents by using a Monte-T5 model to obtain summarized and/or summarized text;
And step S43, combining the summary and/or summary text and the target text abstract with the transition text to obtain the final target text abstract.
It can be appreciated that generating the summary and/or summary text can make the final target text abstract more convenient to read, assist the user in more quickly acquiring information, and further improve the user's reading experience.
Specifically, taking the summary as an example, three key sentences are input and respectively recorded as s1, s2 and s3. First, a topic vocabulary is generated for each key sentence in sequence through the Mengzi-T5 model, respectively marked as w1, w2 and w3. A prompt template is then designed as "Sum. What do w1, w2 and w3 mainly talk about? Answer: <mask>", and finally the content of "<mask>" is predicted by the Mengzi-T5 model to obtain the summary text. This step requires fine-tuning the model with a small amount of data; dataset source references include Dou, Z.-Y., Liu, P., Hayashi, H., et al. (2021). GSum: A General Framework for Guided Neural Abstractive Summarization. abs/2010.08014, and He, J., Kryscinski, W., McCann, B., et al. (2020). CTRLsum: Towards Generic Controllable Text Summarization. abs/2012.04281.
The topic word generation dataset is constructed as follows. For a given paragraph and its corresponding subtitle in a research report, the longest common subsequence of the two is extracted as the candidate topic word, and the paragraph serves as the key sentence, so that a large amount of training data can be constructed. The prompt template is designed to contain the keyword, the paragraph text and the topic word, wherein the topic word is replaced by a mask; the Monte-T5 model is then fine-tuned on the training data to generate the corresponding topic word at the mask position.
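The longest-common-subsequence extraction used for candidate topic words can be sketched with the standard dynamic-programming recurrence; this is a generic character-level implementation for illustration (the patent does not specify the granularity, so character level is an assumption):

```python
def longest_common_subsequence(a, b):
    """Return a longest common subsequence of strings a and b,
    used here to extract a candidate topic word from (paragraph, subtitle)."""
    m, n = len(a), len(b)
    # dp[i][j] holds an LCS string of a[:i] and b[:j]
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]
```

For example, when the subtitle is contained verbatim in the paragraph, the extracted candidate topic word is the subtitle itself.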
The answer generation dataset is constructed as follows. This dataset is essentially the same as the topic word generation dataset; however, since one paragraph in the topic word generation dataset corresponds to only one subtitle and one topic word, several paragraphs together with their corresponding subtitles and keywords are combined, and the combined subtitles are then replaced with mask marks, thereby constructing a large amount of pseudo data. Finally, the Monte-T5 model is fine-tuned with the pseudo data so that a corresponding outline or summary text can be generated at the "<mask>" position.
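The pseudo-data construction described above can be sketched as a simple pairing routine; the concatenation format and separator are assumptions for illustration, not the patented format:

```python
def build_pseudo_example(sections):
    """sections: list of (subtitle, paragraph) pairs from a research report.
    Returns (input_text, target): the body paragraphs preceded by a mask,
    and the combined subtitles as the pseudo outline/summary target."""
    body = " ".join(paragraph for _, paragraph in sections)
    target = "; ".join(subtitle for subtitle, _ in sections)
    # During fine-tuning the model learns to generate `target` at "<mask>".
    return f"<mask> {body}", target

inp, tgt = build_pseudo_example([
    ("Growth", "Revenue rose year over year."),
    ("Risk", "Input costs increased."),
])
```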
Generally, the outline and the summary do not differ greatly in content, so the outline and/or the summary may be generated as needed.
Specifically, in this embodiment, only the summary is generated.
Compared with the prior art, the text abstract generation method has the following advantages:
1. The text abstract generating method specifically comprises: randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein each leaf node of the tree represents one article; and fusing the articles connected in the tree-shaped balanced binary tree in pairs, layer by layer, to generate a target text abstract that fuses the key information of the at least two articles. It can be understood that, since the target text abstract is generated by fusing the key information of the articles, a reader can quickly acquire the key information of the articles, and the reading efficiency of the reader is improved.
2. The tree-shaped balanced binary tree comprises 1st to Nth layers of nodes, wherein the Nth-layer nodes are leaf nodes and N is a positive integer. Fusing the articles connected in the tree-shaped balanced binary tree in pairs, layer by layer, comprises: fusing the Nth-layer node articles into (N-1)th-layer node articles, wherein each node of the (N-1)th layer represents one (N-1)th-layer node article, and each (N-1)th-layer node article is generated by fusing the Nth-layer node articles connected to the same (N-1)th-layer node; and fusing layer by layer from the (N-1)th layer to the 1st layer to generate the 1st-layer node article, wherein the 1st-layer node article is generated by fusing the 2nd-layer node articles connected to the 1st-layer node, and the 1st-layer node article is the target text abstract. The abstract body obtained by this method reaches the final key sentence set by fusing, layer by layer, the pairs of articles connected in the tree-shaped balanced binary tree, which ensures that the final set contains all the key information while less important information is removed during the layer-by-layer fusion; thus information redundancy is reduced while comprehensive information coverage is ensured, and the reading experience is improved.
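The layer-by-layer fusion over the balanced binary tree can be sketched as follows. This is an illustrative skeleton only: the real pairwise fusion involves named entity recognition, key sentence extraction and similarity screening, which are replaced here by a simple set union of key sentences.

```python
def fuse_pair(a, b):
    """Placeholder for the pairwise article fusion; the patented method
    extracts and screens key sentences, here simplified to a union."""
    return sorted(set(a) | set(b))

def fuse_layer_by_layer(articles):
    """Fuse the leaf-node articles of a balanced binary tree upward,
    two at a time, until a single root article (the abstract) remains."""
    layer = list(articles)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(fuse_pair(layer[i], layer[i + 1]))
        if len(layer) % 2:            # an unpaired node is promoted upward
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]
```

For four leaf articles the loop runs twice (N = 3 layers), producing one root article.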
3. The pairwise fusion of articles specifically comprises: determining two connected articles through the same upper-layer node; identifying the required information in the two connected articles by a named entity recognition technology and extracting key sentences from the required information, wherein the key sentences are sentences containing the key information in the articles; screening the extracted key sentences based on the similarity between the key sentences, wherein the similarity between key sentences is determined jointly by an anchor-based similarity and a semantics-based cosine similarity; and splicing the screened key sentences to obtain the key sentence set after the two articles are fused. It can be understood that, because the key sentences are screened and spliced based on the similarity between them, the resulting key sentence set can contain all the important information.
4. In the invention, the required information in the two connected articles is identified by a named entity recognition technology and filled into preset question templates to generate a plurality of questions, and for each question an answer fragment is extracted from each article by a Monte-BERT model, the obtained answer fragments being the key sentences. It can be understood that, on the one hand, the generated questions can be controlled through the preset question templates, which makes it convenient for the user to introduce a preference for topics or entities of particular interest; on the other hand, the use of an extractive question-answering model makes it convenient to locate more accurate key sentence fragments within the paragraphs.
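The template-filling step can be sketched as below; the template wordings are invented examples for illustration, and the extractive question-answering step (Monte-BERT) is omitted since that model is not specified here.

```python
# Hypothetical question templates; a real system would tailor these to the
# user's topics or entities of interest.
QUESTION_TEMPLATES = [
    "What is the outlook for {entity}?",
    "What risks does {entity} face?",
]

def generate_questions(entities):
    """Fill each recognized named entity into every preset template."""
    return [t.format(entity=e) for e in entities for t in QUESTION_TEMPLATES]

questions = generate_questions(["Company A", "Company B"])
```

Each generated question would then be answered against each article by an extractive QA model, and the extracted answer spans become the key sentences.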
5. In the method, the elements of the two key sentence sets are matched pairwise according to their similarity to form a bipartite graph, and a greedy algorithm selects the key sentence set with the largest information content but the smallest number of sentences as the key sentence set after the articles are fused. It can be understood that the more compact the fused key sentence set, the better: comprehensive content is ensured while the amount of reading is reduced, helping the user to acquire information quickly.
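One way such a greedy selection could work is sketched below; it is an assumption-laden simplification in which sentence length stands in for information content and token Jaccard overlap stands in for the patented similarity measure.

```python
def jaccard(a, b):
    """Token-overlap similarity used as a stand-in similarity function."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def greedy_select(sentences, sim, threshold=0.8):
    """Greedily keep the most informative sentences while discarding
    near-duplicates whose similarity to an already-chosen sentence
    exceeds the threshold."""
    chosen = []
    for s in sorted(sentences, key=len, reverse=True):
        if all(sim(s, c) < threshold for c in chosen):
            chosen.append(s)
    return chosen

merged = greedy_select(
    ["revenue grew strongly this year",
     "revenue grew strongly this year",   # duplicate across the two articles
     "costs fell"],
    jaccard,
)
```

The duplicate sentence coming from the second article is dropped, yielding a compact fused key sentence set.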
6. In the invention, the similarity between key sentences is determined jointly by an anchor-based similarity and a semantics-based cosine similarity. This design enhances the reliability of the similarity comparison between key sentences, so that the readability of the ordered text is enhanced and the reading experience of the user is improved.
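A minimal sketch of such a combined score is given below. The anchor set, the bag-of-words cosine, and the equal weighting `alpha=0.5` are all illustrative assumptions; the patent does not fix these details.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Bag-of-words cosine similarity (a crude stand-in for a semantic model)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def anchor_sim(a, b, anchors):
    """Overlap of anchor terms (e.g. named entities) present in each sentence."""
    ia = {x for x in anchors if x in a}
    ib = {x for x in anchors if x in b}
    return len(ia & ib) / len(ia | ib) if ia | ib else 0.0

def combined_sim(a, b, anchors, alpha=0.5):
    """Weighted combination of anchor-based and cosine similarity."""
    return alpha * anchor_sim(a, b, anchors) + (1 - alpha) * cosine_sim(a, b)
```

Combining the two signals guards against cases where surface wording differs but the anchors agree, and vice versa.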
7. In the invention, transition text is generated between the key sentences of the initial target text abstract through the Monte-T5 model, so as to obtain the target text abstract with the transition text. It will be appreciated that, after the preceding steps, the initial target text abstract already contains substantially all the information necessary for an article abstract, but this information is merely spliced together directly, so continuity and readability are poor; generating transition text between the key sentences further improves the readability of the target text abstract and thus the reading experience of the user.
8. In the method, the transition text is generated by setting a mask between each pair of adjacent key sentences of the initial target text abstract and predicting the content of each mask through the Monte-T5 model, thereby obtaining the target text abstract with the transition text; the transition text perfects the logical relationship between adjacent key sentences in the initial target text abstract. After the preceding algorithm steps, the initial target text abstract already contains all the information necessary for a research report abstract, so the transition text is usually a short logical phrase, connective or subtitle and does not contain much additional informative text; the generation of the transition text can therefore be simplified to generating logical phrases, connectives or subtitles between sentences. The Monte-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a high generation quality can be achieved with few-shot fine-tuning.
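The mask-insertion step can be sketched as follows; the mask-filling model is again replaced by a stub, and the fixed connective it emits is purely an assumption for demonstration.

```python
def insert_masks(key_sentences):
    """Join the key sentences with a mask token between each adjacent pair,
    so the model can predict a transition at every mask position."""
    return " <mask> ".join(key_sentences)

def fill_transitions(text, connective="Furthermore,"):
    """Hypothetical stand-in for Monte-T5: the real model predicts a short
    logical phrase, connective or subtitle at each mask position."""
    return text.replace("<mask>", connective)

draft = insert_masks(["Revenue grew.", "Costs fell.", "Margins improved."])
smoothed = fill_transitions(draft)
```

With k key sentences the draft contains k-1 masks, one per adjacent pair.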
9. The text abstract generating method further comprises generating an outline and/or a summary text for the target text abstract with the transition text, wherein the outline or summary text is a paragraph or a sentence, and combining the outline and/or summary text with the target text abstract with the transition text to obtain the final target text abstract. It can be appreciated that generating the outline and/or summary text makes the final target text abstract more convenient to read, assists the user in acquiring information more quickly, and further improves the user's reading experience.
While the principles and embodiments of the invention have been described above in some detail to facilitate understanding, it will be apparent to those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention; the invention is therefore not limited to the foregoing embodiments, but is to be accorded the full scope defined by the appended claims.