

A text summary generation method

Info

Publication number
CN114611520B
CN114611520B
Authority
CN
China
Prior art keywords
text
articles
key
layer
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210380604.8A
Other languages
Chinese (zh)
Other versions
CN114611520A (en)
Inventor
刘明童
王泽坤
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202210380604.8A priority Critical patent/CN114611520B/en
Publication of CN114611520A publication Critical patent/CN114611520A/en
Application granted granted Critical
Publication of CN114611520B publication Critical patent/CN114611520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract


The present invention relates to the technical field of text summary generation, and more particularly to a method for generating a text summary, comprising the following steps: randomly combining at least two preset articles to generate a balanced binary tree, wherein each leaf node of the tree represents one article; and fusing the connected articles within the tree pairwise, layer by layer, to generate a target text summary that incorporates key information from the at least two articles. Fusing the key information of multiple articles into one target text summary lets readers quickly obtain that key information and improves their reading efficiency.

Description

Text summary generation method
Technical Field
The invention relates to the technical field of text summary generation, and in particular to a text summary generation method.
Background
Research reports in professional fields, such as industry development reports and securities analysis reports, are an important source of high-quality information. Because such reports are logical and specialized, a single report often contains very rich information, and many institutions and professionals study and report on the same events and subjects. As a result, a reader such as a financial investor must read a large volume of report content about a target company to find the answers of interest and make better-informed decisions. Faced with this overload of information, technology that improves report reading and information processing in professional fields is indispensable for improving working efficiency.
Traditional intelligent report-reading systems concentrate on information collection and classification, for example aggregating related reports on the same company through a keyword clustering algorithm so that people can read and search the information more conveniently. However, a single report often contains tens of thousands of characters, and simple aggregation cannot satisfy the need to quickly obtain the core information of interest. Moreover, the technology used in existing intelligent report reading is mainly based on N-gram matching, such as keyword-based content retrieval and clustering; on that basis, understanding report content still requires reading through entire reports to find the answers to one's core questions, and this search for key information has to be done manually, which is time-consuming and labor-intensive.
Disclosure of Invention
The invention provides a text summary generation method to solve the problem that the core information of a research report cannot be quickly obtained when reading it.
The text summary generation method provided by the invention specifically comprises the following steps:
randomly combining at least two preset articles to generate a balanced binary tree, wherein each leaf node of the tree represents one article;
and fusing the articles connected in the balanced binary tree pairwise, layer by layer, to generate a target text summary fused with the key information of the at least two articles.
Preferably, the balanced binary tree comprises layer-1 to layer-N nodes, the layer-N nodes are leaf nodes, N is a positive integer, and fusing the articles connected in the balanced binary tree layer by layer comprises:
fusing the layer-N node articles into layer-(N-1) node articles, wherein each node of layer N-1 represents a layer-(N-1) node article, and each layer-(N-1) node article is generated by fusing the layer-N node articles connected to the same layer-(N-1) node;
and fusing layer by layer from layer N-1 to layer 1 to generate a layer-1 node article, wherein the layer-1 node article is generated by fusing the layer-2 node articles connected to the layer-1 node, and the layer-1 node article is the target text summary.
Preferably, the pairwise fusion of articles specifically includes the following steps:
determining two connected articles through their common upper-layer node;
identifying the required information in the two connected articles by named entity recognition, and extracting key sentences from it, the key sentences being the sentences that contain the articles' key information;
filtering the extracted key sentences based on the similarity between them, the similarity being jointly determined by an anchor-based similarity and a semantics-based cosine similarity;
and splicing the filtered key sentences to obtain the key sentence set after the two articles are fused.
Preferably, identifying the required information in the two connected articles by named entity recognition and extracting key sentences from it specifically includes the following steps:
identifying the required information in the two connected articles by named entity recognition, and filling it into preset question templates, thereby generating a plurality of questions;
and extracting an answer fragment from each article for each question by using the Mengzi-BERT model, the obtained answer fragments being the key sentences.
Preferably, filtering the extracted key sentences based on the similarity between them specifically includes the following steps:
matching elements of the two key sentence sets pairwise according to their similarity to form a bipartite graph;
and selecting, by a greedy algorithm, the key sentence set with the most information content and the fewest sentences.
Preferably, the similarity between key sentences is calculated as follows. For the semantics-based cosine similarity, the vector representation v_x of each key sentence is computed by the Mengzi pre-training model, where x = 1, 2, ..., n, n being the number of key sentences and x a positive integer; the cosine similarity between two vectors is then

cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||).

The anchor-based similarity sim_anchor(s_i, s_j) can be confirmed by numerical comparison or character comparison, and the overall similarity is the weighted combination

sim(s_i, s_j) = λ · sim_anchor(s_i, s_j) + (1 - λ) · cos(v_i, v_j),

where λ is the weight coefficient of the anchor-based similarity.
Preferably, the text summary generation method further includes the following step:
after the fusion of all the articles is completed, ordering the key sentences in the finally fused article with the Mengzi-BERT pre-training model to obtain an initial target text summary, and generating transition texts between the key sentences of the initial target text summary with the Mengzi-T5 model, thereby obtaining the target text summary with transition texts.
Preferably, generating the transition text between key sentences of the initial target text summary with the Mengzi-T5 model, thereby obtaining the target text summary with transition text, specifically includes the following steps:
setting masks between the key sentences of the initial target text summary;
and predicting the contents of the masks with the Mengzi-T5 model to obtain the generated transition text, thereby obtaining the target text summary with transition text, the transition text serving to complete the logical relationships between adjacent key sentences in the initial target text summary.
Preferably, the text summary generation method further includes the following step:
generating summary and/or overview text, a paragraph or a sentence, for the target text summary with transition text, resulting in the final target text summary.
Preferably, generating summary and/or overview text for the target text summary with transition text, thereby obtaining the final target text summary, specifically includes the following steps:
generating a topic word for each key sentence in turn with the Mengzi-T5 model and marking it;
designing a prompt question-answer template that asks what each topic word in the target text summary with transition text mainly discusses, setting the answer to each question as a mask, and predicting the masked content with the Mengzi-T5 model to obtain the summary and/or overview text;
and combining the summary and/or overview text with the target text summary with transition text to obtain the final target text summary.
Compared with the prior art, the text summary generation method of the invention has the following advantages:
1. The method randomly combines at least two preset articles into a balanced binary tree whose leaf nodes each represent one article, and fuses the articles connected in the tree pairwise, layer by layer, to generate a target text summary fused with the key information of the at least two articles. It can be understood that generating the target text summary by fusing the key information of several articles lets readers quickly obtain that key information and improves their reading efficiency.
2. The balanced binary tree comprises layer-1 to layer-N nodes, the layer-N nodes being leaf nodes and N a positive integer. The layer-N node articles are fused into layer-(N-1) node articles, each layer-(N-1) node representing an article generated by fusing the layer-N node articles connected to the same layer-(N-1) node; fusion continues layer by layer from layer N-1 to layer 1, and the layer-1 node article, generated by fusing the layer-2 node articles connected to the layer-1 node, is the target text summary. Because the summary body is obtained by fusing pairs of connected articles layer by layer, the final key set is guaranteed to contain all the key information while less important information is removed along the way, so information redundancy is reduced under comprehensive coverage and the reading experience improves.
3. Pairwise fusion specifically comprises determining two connected articles through their common upper-layer node; identifying the required information in the two articles by named entity recognition and extracting the key sentences, i.e. the sentences containing the articles' key information; filtering the extracted key sentences based on their mutual similarity, which is jointly determined by an anchor-based similarity and a semantics-based cosine similarity; and splicing the surviving key sentences into the fused key sentence set. Filtering and splicing by similarity ensures that the resulting key sentence set contains all the important information.
4. In the invention, the required information identified by named entity recognition in the two connected articles is filled into preset question templates to generate a plurality of questions, and the Mengzi-BERT model extracts an answer fragment from each article for each question; the answer fragments are the key sentences. The preset templates control which questions are generated, letting users introduce their preferences for the subjects or entities they care most about, while the extractive question-answering model makes it easier to locate the precise key-sentence fragments within paragraphs.
5. In the method, elements of the two key sentence sets are matched pairwise by similarity to form a bipartite graph, and a greedy algorithm selects the key sentence set with the most information but the fewest sentences as the fused set. The more compact the fused set, the better: content remains comprehensive while the amount of reading drops, helping users acquire information quickly.
6. In the invention, the similarity between key sentences is jointly determined by an anchor-based similarity and a semantics-based cosine similarity. This design strengthens the reliability of similarity comparisons between key sentences, which improves the readability of the ordered text and thus the user's reading experience.
7. In the invention, transition text is generated between the key sentences of the initial target text summary by the Mengzi-T5 model, yielding the target text summary with transition text. After the previous steps the initial summary already contains essentially all the information an article summary needs, but only as direct splices, so continuity and readability suffer; generating transition text between the key sentences further improves readability and thus the reading experience.
8. In the method, masks are set between the key sentences of the initial target text summary and the Mengzi-T5 model predicts their contents, producing transition text that completes the logical relationships between adjacent key sentences. Since the initial summary already holds all the necessary information of a research report summary, the transition text is usually a short logical phrase, connective or subtitle that carries little information of its own, so its generation reduces to generating such connective material between sentences. The Mengzi-T5 model, pre-trained on 300 GB of data and storing rich prior knowledge, achieves a strong generation effect under few-sample fine-tuning.
9. The method further generates summary and/or overview text, a paragraph or sentence, for the target text summary with transition text and combines the two to obtain the final target text summary, which makes the result more convenient to read, helps users acquire information even faster, and further improves the reading experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a text summary generating method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of step S2 of the text summarization method according to the first embodiment of the present invention.
Fig. 3 is a flowchart of step S22 of the text summarization method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of step S221 of the text summarization method according to the first embodiment of the present invention.
Fig. 5 is a flowchart of step S23 of the text summarization method according to the first embodiment of the present invention.
Fig. 6 is a schematic diagram of a consistency ranking in a text summarization method according to a first embodiment of the present invention.
Fig. 7 is another flowchart of a text summarization method provided by the first embodiment of the present invention.
Fig. 8 is a flowchart of step S3 of the text summarization method according to the first embodiment of the present invention.
Fig. 9 is a flowchart of step S4 of the text summarization method according to the first embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples of implementation in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The terms "vertical," "horizontal," "left," "right," "upper," "lower," "upper left," "upper right," "lower left," "lower right," and the like are used herein for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a text summary generation method, which specifically includes the following steps:
Step S1, randomly combining at least two preset articles to generate a balanced binary tree, wherein each leaf node of the tree represents one article;
Step S2, fusing the articles connected in the balanced binary tree pairwise, layer by layer, to generate a target text summary fused with the key information of the at least two articles.
It can be understood that generating the target text summary by fusing the key information of several articles lets readers quickly obtain that key information and improves their reading efficiency.
Specifically, in this embodiment, each article is a research report.
Further, the balanced binary tree comprises layer-1 to layer-N nodes, the layer-N nodes are leaf nodes, N is a positive integer, and fusing the articles connected in the tree pairwise, layer by layer, specifically comprises the following steps:
fusing the layer-N node articles into layer-(N-1) node articles, wherein each node of layer N-1 represents a layer-(N-1) node article, and each layer-(N-1) node article is generated by fusing the layer-N node articles connected to the same layer-(N-1) node;
and fusing layer by layer from layer N-1 to layer 1 to generate a layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node and which is the target text summary. Specifically, the layer-(N-1) node articles are fused into layer-(N-2) node articles, each node of layer N-2 representing a layer-(N-2) node article generated by fusing the layer-(N-1) node articles connected to the same layer-(N-2) node, and so on until the layer-3 node articles are fused into layer-2 node articles, each layer-2 node article being generated by fusing the layer-3 node articles connected to the same layer-2 node.
Because the summary body is obtained by fusing, layer by layer, pairs of articles connected in the balanced binary tree, the final key set is guaranteed to contain all the key information, while less important information is removed during the layer-by-layer fusion; information redundancy is thus reduced while comprehensive coverage is preserved, improving the reading experience.
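As an illustration only, the following Python sketch shows this layer-by-layer reduction; the code and the fuse_pair placeholder are assumptions for exposition, with the real pairwise fusion being steps S21 to S24 described below.

```python
import random

def fuse_pair(article_a, article_b):
    # Placeholder for the pairwise fusion of steps S21-S24 (key-sentence
    # extraction, similarity filtering, splicing); here it merely concatenates.
    return article_a + " " + article_b

def fuse_tree(articles):
    """Randomly pair the articles and fuse them layer by layer until one remains."""
    layer = list(articles)
    random.shuffle(layer)                 # random combination into leaf nodes
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            next_layer.append(fuse_pair(layer[i], layer[i + 1]))
        if len(layer) % 2 == 1:           # an unpaired article is carried up,
            next_layer.append(layer[-1])  # keeping the reduction tree balanced
        layer = next_layer
    return layer[0]                       # the layer-1 node article (target summary)
```

Each pass of the while loop corresponds to fusing one layer of the tree into the layer above it.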
Referring to fig. 1 and fig. 2, the pairwise fusion of articles in step S2 specifically includes the following steps:
Step S21, determining two connected articles through their common upper-layer node;
Step S22, identifying the required information in the two connected articles by named entity recognition, and extracting key sentences from it, the key sentences being the sentences that contain the articles' key information;
Step S23, filtering the extracted key sentences based on the similarity between them, the similarity being jointly determined by an anchor-based similarity and a semantics-based cosine similarity;
Step S24, splicing the filtered key sentences to obtain the key sentence set after the two articles are fused.
It can be understood that, because the key sentences are filtered and spliced based on their mutual similarity, the resulting key sentence set can contain all the important information.
Referring to fig. 1 to 3, step S22 specifically includes the following steps:
Step S221, identifying the required information in the two connected articles by named entity recognition, and filling it into preset question templates, thereby generating a plurality of questions;
Step S222, extracting an answer fragment from each article for each question by using the Mengzi-BERT model, the obtained answer fragments being the key sentences.
It can be understood that the preset question templates control which questions are generated, which conveniently lets users introduce their preferences for the subjects or entities they care most about; on the other hand, using an extractive question-answering model makes it easier to locate the precise key-sentence fragments within paragraphs.
It should be noted that named entity recognition is prior art; for details see Che, W., Feng, Y., Qin, L., & Liu, T. (2021). N-LTP: An Open-source Neural Language Technology Platform for Chinese. EMNLP.
Referring to fig. 1 to 4, step S221 specifically includes the following steps:
Step S2211, identifying the required information in each article by named entity recognition and filling it into preset question templates to generate a plurality of questions, wherein the required information is determined by the gaps in the question templates;
Step S2212, extracting an answer fragment from each article for each question by using the Mengzi-BERT model, the obtained answer fragments being the key sentences.
Further, a large number of questions may be prepared in advance in the preset question templates, such as "What is the business model of [company]?"; the entity names of the companies in a paragraph are then identified by named entity recognition and filled into the template gaps to generate the questions.
Further, when the answer fragments are extracted, the questions generated from each report are recorded in order as Q1, Q2, ..., QY, Y being a positive integer. For example, two questions generated from report A are recorded as A.Q1 and A.Q2, and the corresponding answers as A1 and A2, which facilitates subsequent matching.
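For illustration, a sketch of this template-filling step follows; the template strings and the toy entity recognizer are assumptions, standing in for user-supplied templates and a real NER system such as the N-LTP platform cited above.

```python
QUESTION_TEMPLATES = [
    "What is the business model of {company}?",   # illustrative templates;
    "What are the main risks facing {company}?",  # real ones are preset by users
]

def extract_entities(text):
    # Stand-in for a real named entity recognizer; here, a toy lookup
    # against a known-companies list, for illustration only.
    known = ["Acme Corp", "Globex"]
    return [c for c in known if c in text]

def generate_questions(article_text):
    """Fill each recognized entity into every preset question template."""
    questions = []
    for company in extract_entities(article_text):
        for template in QUESTION_TEMPLATES:
            questions.append(template.format(company=company))
    return questions
```

Each generated question is then answered per article by the extractive Mengzi-BERT question-answering model, and the answer fragments become the key sentences.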
Referring to fig. 2 and fig. 5, step S23 specifically includes the following steps:
Step S231, matching elements of the two key sentence sets pairwise according to their similarity to form a bipartite graph;
Step S232, selecting, by a greedy algorithm, the key sentence set with the most information content and the fewest sentences as the key sentence set obtained by fusing the articles.
It can be understood that the more compact the fused key sentence set, the better: content remains comprehensive while the amount of reading drops, helping users acquire information quickly.
It should be noted that the bipartite graph matching algorithm follows the prior art; see Wang Junli, Zhou Qing, Yang Yaxing. A text semantic similarity analysis method [P]. Shanghai: CN106547739B, 2019-04-02. The present invention uses key sentences rather than topics as nodes: if the similarity of two key sentences exceeds a set threshold, an edge is added between them. After iterating over all pairs, a bipartite graph is generated whose nodes represent sentence numbers in the research reports and whose edges represent similarity relationships between sentences. The out-degree of each node, i.e. the number of edges with that node as an endpoint, is first computed and taken as a measure of the information content of the corresponding sentence; a greedy algorithm then selects the key sentence set with the most information content but the fewest sentences as the fused key sentence set of the research reports.
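As one possible reading of this procedure, the sketch below builds the similarity graph over the two sentence sets and greedily keeps high-degree sentences while discarding the neighbours they cover; the 0.8 threshold and the coverage heuristic are assumptions, not values fixed by the patent, and the similarity callable is one such as defined just below.

```python
def fuse_key_sentences(sents_a, sents_b, similarity, threshold=0.8):
    """Greedily fuse two key-sentence sets via a similarity graph.

    An edge joins any pair of sentences whose similarity exceeds the
    threshold; a sentence's degree serves as a proxy for how much shared
    information it carries.
    """
    sentences = list(sents_a) + list(sents_b)
    edges = {i: set() for i in range(len(sentences))}
    for i, a in enumerate(sents_a):
        for j, b in enumerate(sents_b, start=len(sents_a)):
            if similarity(a, b) > threshold:
                edges[i].add(j)
                edges[j].add(i)

    uncovered = set(range(len(sentences)))
    fused = []
    while uncovered:
        # keep the sentence that covers the most not-yet-covered sentences
        best = max(uncovered, key=lambda k: len(edges[k] & uncovered))
        fused.append(sentences[best])
        uncovered -= edges[best] | {best}  # its near-duplicates are now covered
    return fused
```

Sentences with many high-similarity neighbours are retained once, so the fused set stays informative while near-duplicates are dropped.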
Further, the similarity between key sentences is calculated as follows. For the semantics-based cosine similarity, the vector representation v_x of each key sentence is computed by the Mengzi pre-training model, where x = 1, 2, ..., n, n being the number of key sentences and x a positive integer; the cosine similarity between two vectors is then

cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||).

The anchor-based similarity sim_anchor(s_i, s_j) can be confirmed by numerical comparison or character comparison, and the overall similarity is the weighted combination

sim(s_i, s_j) = λ · sim_anchor(s_i, s_j) + (1 - λ) · cos(v_i, v_j),

where λ is the weight coefficient of the anchor-based similarity.
It can be appreciated that this design enhances the reliability of the similarity comparison between key sentences, which improves the readability of the ordered text and thereby the user's reading experience.
Further, in the anchor-based part, numerical comparison applies when numbers are an important component of the article content: if two sentences contain the same number, they are considered similar.
Further, in the anchor-based part, character comparison may calculate the similarity using, for example, the edit distance between two sentences, the length of their longest common subsequence, or an N-gram similarity.
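Combining the two components, a sketch of the overall similarity might look as follows; the Jaccard treatment of shared numbers, the default weight lam = 0.5, and the plain-Python vectors are assumptions, with the sentence vectors presumed to come from the Mengzi pre-training model.

```python
import math
import re

def anchor_similarity(s1, s2):
    """Anchor similarity via number comparison (one of the modes described
    above); generalized here to the Jaccard overlap of the numbers found."""
    nums1 = set(re.findall(r"\d+\.?\d*", s1))
    nums2 = set(re.findall(r"\d+\.?\d*", s2))
    if not (nums1 and nums2):
        return 0.0
    return len(nums1 & nums2) / len(nums1 | nums2)

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def sentence_similarity(s1, s2, v1, v2, lam=0.5):
    """Weighted combination of anchor-based and semantic similarity,
    where lam is the weight coefficient of the anchor-based term."""
    return lam * anchor_similarity(s1, s2) + (1 - lam) * cosine_similarity(v1, v2)
```

A character-comparison anchor (edit distance, longest common subsequence, or N-gram overlap) could be substituted for the number-based one without changing the combination.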
Further, in step S2, the ordering of the key sentences is completed by the Mengzi-BERT pre-training model, which calculates the probability of a key sentence being placed at the i-th position, i being a positive integer, and orders the key sentences based on the result.
Because the Mengzi-BERT pre-training model adopts a sentence-order prediction pre-training task, it adapts well to the downstream task of inter-sentence coherence ordering.
Referring to FIG. 6, the ordering of key sentences is exemplified with three sentences s1, s2 and s3 that are randomly shuffled to construct the input sample of the Mengzi encoder, where "[CLS]" represents the start of the input sample, "<s>" represents the start of a sentence, "..." represents the text content of the corresponding sentence, and "[SEP]" represents a separator (regarded here as the end of the input sample). The sample is then encoded with the Mengzi encoder, yielding the hidden vectors h1, h2 and h3 at the sentence-start positions.
These hidden vectors serve as the key vectors K and value vectors V of the Mengzi decoder (MengziDecoder), the decoder state is used as the query vector Q, and a pointer network finally yields P(i, j), the probability that the j-th sentence is placed at the i-th position. This completes the coherence ordering of the three sentences; it can be understood that ordering more than three sentences proceeds in the same way.
Details of how such a pre-training model can accomplish coherence ordering are given in Lee, H., Hudson, D.A., Lee, K., & Manning, C.D. (2020). SLM: Learning a Discourse Language Representation with Sentence Unshuffling. EMNLP.
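The probabilities P(i, j) still have to be turned into a concrete ordering; the patent does not fix a decoding strategy, so the following greedy argmax decoder is offered only as one plausible sketch.

```python
def greedy_order(P):
    """Decode an ordering from P, where P[i][j] is the probability that
    sentence j belongs at position i (as produced by the pointer network)."""
    remaining = set(range(len(P)))
    order = []
    for i in range(len(P)):
        j = max(remaining, key=lambda k: P[i][k])  # best sentence for position i
        order.append(j)
        remaining.remove(j)
    return order

# Example: greedy_order([[0.1, 0.7, 0.2],
#                        [0.6, 0.2, 0.2],
#                        [0.3, 0.1, 0.6]]) returns [1, 0, 2]
```

Beam search over the same probabilities would trade extra computation for a globally better ordering.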
Referring to fig. 7, the text summary generation method further includes the following step:
Step S3, after the fusion of all the articles is completed, ordering the key sentences in the finally fused article with the Mengzi-BERT pre-training model to obtain an initial target text summary, and generating transition texts between the key sentences of the initial target text summary with the Mengzi-T5 model, thereby obtaining the target text summary with transition texts.
It can be understood that, after the previous steps, the initial target text summary already contains essentially all the information an article summary needs, but that information is simply spliced together, so continuity and readability suffer. Generating transition text between the key sentences further improves the readability of the target text summary and thus the user's reading experience.
Referring to fig. 7 and fig. 8, step S3 specifically includes the following steps:
Step S31, setting masks between the key sentences of the initial target text summary;
Step S32, predicting the contents of the masks with the Mengzi-T5 model to obtain the generated transition text, thereby obtaining the target text summary with transition text; the transition text serves to complete the logical relationships between adjacent key sentences in the initial target text summary.
It can be understood that, after the preceding algorithm steps, the initial target text summary already contains essentially all the information a summary needs, so the transition text is usually a short logical phrase, connective or subtitle and does not carry much informative text of its own. Generating transition text can therefore be reduced to generating logical phrases, connectives or subtitles between sentences. Because the Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, a strong generation effect can be achieved under few-sample fine-tuning.
For example, three well-ordered key sentences are input and recorded as s1, s2 and s3. The input template is set as "s1 <mask1> s2 <mask2> s3 </s>", where </s> represents the end-of-input symbol. Finally, the contents of <mask1>, <mask2> and so on are predicted by the Mengzi-T5 model, yielding the generated transition text; the target text summary with transition text is denoted Sum.
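As a concrete illustration, the sketch below performs this mask prediction with the Hugging Face transformers library; the checkpoint name "Langboat/mengzi-t5-base" and the use of T5 sentinel tokens <extra_id_0>, <extra_id_1> in place of <mask1>, <mask2> are assumptions, not part of the patent.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
model = T5ForConditionalGeneration.from_pretrained("Langboat/mengzi-t5-base")

s1, s2, s3 = "Key sentence one.", "Key sentence two.", "Key sentence three."
# Sentinel tokens mark the spans the model must fill in, i.e. the transitions.
inputs = tokenizer(f"{s1}<extra_id_0>{s2}<extra_id_1>{s3}", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=48)
# The decoded output interleaves the sentinels with the predicted transition
# texts, which are then spliced back between the key sentences.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

The same call pattern would serve both this transition-text step and the summary/overview prediction in step S4.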
Further, the Mengzi-T5 model is fine-tuned with a fine-tuning dataset before use, the dataset being constructed based on part-of-speech tagging and/or punctuation recognition and/or subtitles.
It can be understood that, since transition-text generation reduces to generating logical phrases, connectives or subtitles between sentences, constructing the fine-tuning dataset in this targeted way effectively guarantees the generation quality of the Mengzi-T5 model.
The application scenario for constructing the fine-tuning dataset based on part-of-speech tagging is as follows: articles generally contain many words whose part of speech is a preposition, connective or the like; these words are replaced by masks, and during fine-tuning the replaced words serve as labels, with generating the text at the mask positions as the training task.
The application scenario for constructing the fine-tuning dataset based on punctuation recognition is as follows: paragraphs in formats such as "Investment advice: first, ...; second, ...; third, ..." commonly appear in research reports, where the text before the colon summarizes the content that follows. The fine-tuning task therefore replaces the text before the colon with a mask symbol and generates the text at the mask position.
The application scenario for constructing the fine-tuning dataset based on subtitles is as follows: research reports often contain a large number of subtitles, so data are reconstructed from subtitle-paragraph text pairs, using the same method as the punctuation-based construction.
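Under the part-of-speech scheme just described, one fine-tuning pair can be built per masked connective, as in the sketch below; the hard-coded connective list stands in for a real POS tagger's output (and would be Chinese connectives for Mengzi-T5), and the sentinel format follows T5's span-corruption convention.

```python
def build_finetune_examples(sentences, connectives=("however", "therefore", "moreover")):
    """Build (masked input, label) pairs by masking one connective per match."""
    examples = []
    for sent in sentences:
        for word in connectives:
            if word in sent:
                masked = sent.replace(word, "<extra_id_0>", 1)
                label = f"<extra_id_0>{word}<extra_id_1>"  # T5 target format
                examples.append((masked, label))
    return examples

pairs = build_finetune_examples(
    ["Revenue fell; however, margins improved.",
     "Costs rose; therefore, guidance was cut."])
# [('Revenue fell; <extra_id_0>, margins improved.', '<extra_id_0>however<extra_id_1>'), ...]
```

The punctuation- and subtitle-based datasets can be built with the same masking pattern, masking the text before a colon or the subtitle of a paragraph instead of a connective.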
With continued reference to fig. 7, the text summary generation method further includes the following step:
Step S4, generating summary and/or overview text, a paragraph or a sentence, for the target text summary with transition text, thereby obtaining the final target text summary.
It can be appreciated that generating the summary and/or overview text helps users obtain information even faster, further improving the reading experience.
Referring to fig. 7 and fig. 9, step S4 specifically includes the following steps:
Step S41, generating a topic word for each key sentence in turn with the Mengzi-T5 model and marking it;
Step S42, designing a prompt question-answer template that asks what each topic word in the target text summary with transition text mainly discusses, setting the answer to each question as a mask, and predicting the masked content with the Mengzi-T5 model to obtain the summary and/or overview text;
Step S43, combining the summary and/or overview text with the target text summary with transition text to obtain the final target text summary.
It can be appreciated that generating the summary and/or overview text makes the final target text summary more convenient to read and helps users acquire information more quickly, further improving the reading experience.
Specifically, taking the overview as an example, three key sentences are input and recorded as s1, s2 and s3. First, a topic word is generated for each key sentence in turn by the Mengzi-T5 model, marked t1, t2 and t3 respectively. A prompt template is then designed along the lines of "Sum. What do t1, t2 and t3 mainly discuss? Answer: <mask>", and the content of <mask> is finally predicted with the Mengzi-T5 model to obtain the overview text. This step requires fine-tuning the model with a small amount of data; dataset sources include Dou, Z.-Y., Liu, P., Hayashi, H., et al. (2021). GSum: A General Framework for Guided Neural Abstractive Summarization. abs/2010.08014, and He, J., Kryscinski, W., McCann, B., et al. (2020). CTRLsum: Towards Generic Controllable Text Summarization. abs/2012.04281.
The topic-generation dataset is constructed as follows: for a given paragraph and its corresponding subtitle in a research report, the largest common subsequence of the two is extracted as the candidate topic word, with the paragraph serving as the key sentence, so a large amount of training data can be built. The prompt template is designed as "keyword, paragraph text, topic word", where the topic word is masked, and the training data are used to fine-tune the Mengzi-T5 model to generate the corresponding topic word at the mask position.
The answer-generation dataset is constructed as follows: it is basically the same as the topic-generation dataset, but since each paragraph there corresponds to only one subtitle and one topic word, several paragraphs together with their subtitles and keywords must be combined, and the combined subtitles are then replaced by mask marks, producing a large amount of pseudo data. Finally, the Mengzi-T5 model fine-tuned on this pseudo data can generate the corresponding summary or overview text at the "<mask>" position.
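The extraction of a candidate topic word as the largest common span of a subtitle and its paragraph might be sketched as follows; a contiguous common substring is used here as the simpler reading of "largest common subsequence", and the prompt layout is an assumption based on the template described above.

```python
def longest_common_substring(a, b):
    """Longest contiguous span shared by strings a and b (dynamic programming)."""
    best, best_end = 0, 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best:
                    best, best_end = dp[i][j], i
    return a[best_end - best:best_end]

def build_topic_example(subtitle, paragraph):
    """One training pair: masked prompt plus the candidate topic-word label."""
    topic = longest_common_substring(subtitle, paragraph)
    prompt = f"Keyword: {subtitle}. Paragraph: {paragraph}. Topic word: <extra_id_0>"
    return prompt, topic
```

Combining several such paragraph-subtitle pairs and masking the combined subtitles yields the pseudo data for the answer-generation dataset in the same way.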
In general, the summary and overview content do not differ much, and either or both may be generated as needed.
Specifically, in this embodiment, only the summary is generated.
While the foregoing has described the principles and embodiments of the invention in further detail with reference to the accompanying drawings, it will be apparent to those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention; the invention is not limited to the embodiments described, and its scope is defined by the appended claims.

Claims (9)

1. A text summary generation method, characterized by comprising the following steps:
randomly combining at least two preset articles to generate a balanced binary tree, wherein each leaf node of the tree represents one article;
fusing the articles connected in the balanced binary tree pairwise, layer by layer, to generate a target text summary fused with the key information of the at least two articles;
wherein fusing the articles connected in the balanced binary tree pairwise, layer by layer, specifically comprises the following steps:
determining two connected articles through their common upper-layer node;
identifying the required information in the two connected articles by named entity recognition, and extracting key sentences from it, the key sentences being the sentences that contain the articles' key information;
filtering the extracted key sentences based on the similarity between them, the similarity being jointly determined by an anchor-based similarity and a semantics-based cosine similarity;
and splicing the filtered key sentences to obtain the key sentence set after the two articles are fused.
2. The text summary generation method of claim 1, wherein the balanced binary tree comprises layer-1 to layer-N nodes, the layer-N nodes are leaf nodes, N is a positive integer, and fusing the articles connected in the balanced binary tree layer by layer comprises:
fusing the layer-N node articles into layer-(N-1) node articles, wherein each node of layer N-1 represents a layer-(N-1) node article, and each layer-(N-1) node article is generated by fusing the layer-N node articles connected to the same layer-(N-1) node;
and fusing layer by layer from layer N-1 to layer 1 to generate a layer-1 node article, wherein the layer-1 node article is generated by fusing the layer-2 node articles connected to the layer-1 node, and the layer-1 node article is the target text summary.
3. The text summary generation method of claim 1, wherein identifying the required information in the two connected articles by named entity recognition and extracting key sentences from it comprises the following steps:
identifying the required information in the two connected articles by named entity recognition, and filling it into preset question templates, thereby generating a plurality of questions;
and extracting an answer fragment from each article for each question by using the Mengzi-BERT model, the obtained answer fragments being the key sentences.
4. The text summary generation method of claim 1, wherein filtering the extracted key sentences based on the similarity between them comprises the following steps:
matching elements of the two key sentence sets pairwise according to their similarity to form a bipartite graph;
and selecting, by a greedy algorithm, the key sentence set with the most information content and the fewest sentences.
5. The text summary generation method of claim 1, wherein:
the similarity between key sentences is calculated as follows: for the semantics-based cosine similarity, the vector representation v_x of each key sentence is computed by the Mengzi pre-training model, where x = 1, 2, ..., n, n being the number of key sentences and x a positive integer, and the cosine similarity between two vectors is

cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||);

the anchor-based similarity sim_anchor(s_i, s_j) can be confirmed by numerical comparison or character comparison; and the overall similarity is the weighted combination

sim(s_i, s_j) = λ · sim_anchor(s_i, s_j) + (1 - λ) · cos(v_i, v_j),

where λ is the weight coefficient of the anchor-based similarity.
6. The text summary generation method of claim 1, further comprising the following step:
after the fusion of all the articles is completed, ordering the key sentences in the finally fused article with the Mengzi-BERT pre-training model to obtain an initial target text summary, and generating transition texts between the key sentences of the initial target text summary with the Mengzi-T5 model, thereby obtaining the target text summary with transition texts.
7. The text summary generation method of claim 6, wherein generating the transition text between key sentences of the initial target text summary with the Mengzi-T5 model, thereby obtaining the target text summary with transition text, comprises the following steps:
setting masks between the key sentences of the initial target text summary;
and predicting the contents of the masks with the Mengzi-T5 model to obtain the generated transition text, thereby obtaining the target text summary with transition text, the transition text serving to complete the logical relationships between adjacent key sentences in the initial target text summary.
8. The text summary generation method of claim 6, further comprising the following step:
generating summary and/or overview text, a paragraph or a sentence, for the target text summary with transition text, resulting in the final target text summary.
9. The text summary generation method of claim 8, wherein generating summary and/or overview text for the target text summary with transition text, thereby obtaining the final target text summary, specifically comprises the following steps:
generating a topic word for each key sentence in turn with the Mengzi-T5 model and marking it;
designing a prompt question-answer template that asks what each topic word in the target text summary with transition text mainly discusses, setting the answer to each question as a mask, and predicting the masked content with the Mengzi-T5 model to obtain the summary and/or overview text;
and combining the summary and/or overview text with the target text summary with transition text to obtain the final target text summary.
CN202210380604.8A 2022-04-12 2022-04-12 A text summary generation method Active CN114611520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520B (en) 2022-04-12 2022-04-12 A text summary generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520B (en) 2022-04-12 2022-04-12 A text summary generation method

Publications (2)

Publication Number Publication Date
CN114611520A CN114611520A (en) 2022-06-10
CN114611520B true CN114611520B (en) 2025-09-23

Family

ID=81869041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210380604.8A Active CN114611520B (en) 2022-04-12 2022-04-12 A text summary generation method

Country Status (1)

Country Link
CN (1) CN114611520B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997143B (en) * 2022-08-04 2022-11-15 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium
CN116501862B (en) * 2023-06-25 2023-09-12 桂林电子科技大学 Automatic text extraction system based on dynamic distributed collection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532369A (en) * 2019-09-04 2019-12-03 腾讯科技(深圳)有限公司 A kind of generation method of question and answer pair, device and server
CN112052308A (en) * 2020-08-21 2020-12-08 腾讯科技(深圳)有限公司 Abstract text extraction method and device, storage medium and electronic equipment
CN112541364A (en) * 2020-12-03 2021-03-23 昆明理工大学 Chinese-transcendental neural machine translation method fusing multilevel language feature knowledge

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008242612A (en) * 2007-03-26 2008-10-09 Kyushu Institute Of Technology Document summarization apparatus, method and program thereof
CN101706996A (en) * 2009-11-12 2010-05-12 北京交通大学 Method for identifying traffic status of express way based on information fusion
CN111814465A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Machine learning-based information extraction method, device, computer equipment and medium
CN113157907B (en) * 2021-03-16 2022-05-03 中南大学 A method, system, terminal device and readable storage medium for obtaining hierarchical text abstracts based on discourse structure
CN113468854A (en) * 2021-06-24 2021-10-01 浙江华巽科技有限公司 Multi-document automatic abstract generation method
CN113705210B (en) * 2021-08-06 2025-01-28 北京搜狗科技发展有限公司 A method and device for generating article outline and a device for generating article outline

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532369A (en) * 2019-09-04 2019-12-03 腾讯科技(深圳)有限公司 A kind of generation method of question and answer pair, device and server
CN112052308A (en) * 2020-08-21 2020-12-08 腾讯科技(深圳)有限公司 Abstract text extraction method and device, storage medium and electronic equipment
CN112541364A (en) * 2020-12-03 2021-03-23 昆明理工大学 Chinese-transcendental neural machine translation method fusing multilevel language feature knowledge

Also Published As

Publication number Publication date
CN114611520A (en) 2022-06-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant