CN113688633B - Method and device for determining outline - Google Patents
- Publication number
- CN113688633B (application number CN202110880841.6A)
- Authority
- CN
- China
- Prior art keywords
- outline
- text
- paragraph
- preset
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the invention provide a method and a device for determining an outline. The method comprises: obtaining description information of a text to be generated; obtaining semantic features of the description information as first semantic features; and determining the outline of the text to be generated from preset outlines based on the semantic features of the preset outlines and the first semantic features. Applying the solution provided by the embodiments of the invention to determine an outline can improve the efficiency of outline determination.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining an outline.
Background
Automatic text generation is a research branch of natural language processing that aims to enable electronic devices to generate text. Text generated by an electronic device can assist a user in efficiently creating high-quality text. Before text is generated, it is often necessary to determine the outline of the text, and the electronic device then generates the text based on that outline.
In the prior art, the outline is usually selected manually by a worker from a large number of outlines. However, this approach is time-consuming and laborious, resulting in inefficient outline determination.
Disclosure of Invention
The embodiments of the invention aim to provide a method and a device for determining an outline, so as to improve the efficiency of outline determination. The specific technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a method for determining an outline, where the method includes:
obtaining description information of a text to be generated;
obtaining semantic features of the description information as first semantic features;
and selecting the outline of the text to be generated from the preset outline based on the semantic features of the preset outline and the first semantic features.
In an embodiment of the present invention, determining the outline of the text to be generated from the preset outline based on the semantic feature of the preset outline and the first semantic feature includes:
and selecting the outline of the text to be generated from the preset outline based on the similarity between the semantic features of the preset outline and the first semantic features.
In one embodiment of the present invention, the selecting the outline of the text to be generated from the preset outline based on the similarity between the semantic feature of the preset outline and the first semantic feature includes:
Selecting, as an alternative outline group, an outline group to which the outline of the text to be generated belongs from each outline group based on similarity between semantic features of a clustering center of each outline group and the first semantic features, wherein each outline group is an outline group obtained by clustering according to the similarity between semantic features of the outline;
And selecting the outline of the text to be generated from each outline of the alternative outline group according to the similarity between the semantic features of each outline in the alternative outline group and the first semantic features.
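The two selection steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the data layout (a list of groups, each with a `center` vector and `outlines` given as text/vector pairs), the function names, and the use of cosine similarity are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_outline(first_feature, groups):
    """Two-stage selection over a hypothetical group layout.

    groups: list of dicts with
      'center':   semantic feature vector of the clustering center
      'outlines': list of (outline_text, feature_vector) pairs
    """
    # Stage 1: pick the group whose clustering center is most similar
    # to the first semantic feature.
    candidate = max(
        groups, key=lambda g: cosine_similarity(g["center"], first_feature)
    )
    # Stage 2: pick the most similar outline inside that group only.
    best_text, _ = max(
        candidate["outlines"],
        key=lambda item: cosine_similarity(item[1], first_feature),
    )
    return best_text
```

Comparing against the clustering centers first means that only the outlines of one group need a per-outline similarity calculation.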
In one embodiment of the present invention, selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature includes:
selecting an alternative outline group with the highest similarity between the semantic features of the clustering center and the first semantic features from the alternative outline groups;
and selecting the outline of the text to be generated from all the outlines in the selected alternative outline group according to the similarity between the semantic features of all the outlines in the selected alternative outline group and the first semantic features.
In one embodiment of the present invention, selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature includes:
calculating the similarity between the semantic features of each outline in the alternative outline group and the first semantic features;
and selecting, in descending order of the calculated similarity of each outline, a first preset number of outlines from the outlines as the outlines of the text to be generated.
In one embodiment of the present invention, the selecting, from the respective outline groups, the outline group to which the outline of the text to be generated belongs based on the similarity between the semantic feature of the clustering center of the respective outline groups and the first semantic feature includes:
Calculating the similarity between the semantic features of the clustering centers of each outline group and the first semantic features;
selecting a second preset number of outline groups from the outline groups in descending order of the calculated similarity of each outline group;
determining, in each selected outline group, the number of outlines that contain the description information;
and determining the outline group to which the outline of the text to be generated belongs from the selected outline groups according to the determined outline number.
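The four steps above might be sketched as below. The text only says the group is determined "according to the determined outline number"; choosing the group with the largest count is an assumption of this sketch, as are the data layout, the substring test for "contains the description information", and all function names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_outline_group(first_feature, keywords, groups, second_preset_number=2):
    """Shortlist groups by center similarity, then decide by keyword hit count.

    groups: list of dicts with 'center' (vector) and 'outlines' (list of str).
    keywords: description information, here simplified to a list of strings.
    """
    # Steps 1 and 2: rank groups by center similarity and keep the top ones.
    ranked = sorted(
        groups,
        key=lambda g: cosine_similarity(g["center"], first_feature),
        reverse=True,
    )
    shortlisted = ranked[:second_preset_number]

    # Step 3: count the outlines in a group that contain any keyword.
    def hit_count(group):
        return sum(
            1 for text in group["outlines"] if any(k in text for k in keywords)
        )

    # Step 4 (assumed rule): the group with the most hits wins; on a tie,
    # max() keeps the earlier, i.e. more similar, group.
    return max(shortlisted, key=hit_count)
```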
In one embodiment of the present invention, the method further includes:
and selecting the paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated.
In an embodiment of the present invention, the selecting a paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated includes:
And selecting the paragraph of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated based on the similarity between the semantic features of the preset paragraphs corresponding to the outline of the text to be generated and the first semantic features.
In one embodiment of the present invention, the preset paragraphs are determined by the following steps:
acquiring a pre-selected text corresponding to a preset outline;
and extracting paragraphs from each paragraph corresponding to the preset outline in the text, and taking the extracted paragraphs as preset paragraphs corresponding to the outline.
In one embodiment of the present invention, the extracting paragraphs from each paragraph corresponding to a preset outline in the text as preset paragraphs corresponding to an outline includes:
Determining characteristic information of each paragraph corresponding to the preset outline in the text;
And selecting an alternative paragraph from the paragraphs based on the characteristic information of the paragraphs, and determining the alternative paragraph as a preset paragraph corresponding to the preset outline in the text.
In an embodiment of the present invention, the determining the candidate paragraph as the preset paragraph corresponding to the preset outline in the text includes:
Determining semantic features of each alternative paragraph and word meaning features of each word in the alternative paragraph aiming at each alternative paragraph, inputting the determined semantic features and word meaning features into a pre-trained paragraph quality evaluation model to obtain a quality score value of the alternative paragraph, and taking a paragraph with the quality score value larger than a preset quality score threshold value as a preset paragraph corresponding to the preset outline in the text;
The paragraph quality evaluation model is obtained by training a preset neural network model by taking semantic features of a sample paragraph and word sense features of words in the sample paragraph as model input and taking a labeled quality score value of the sample paragraph as a training reference, and is used for obtaining the quality score value of the paragraph.
In one embodiment of the present invention, the method further includes:
And sorting the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generating the text containing the outline of the text to be generated and the sorted paragraphs.
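One possible reading of this sorting step is sketched below: paragraphs are ordered by the position of their outline and emitted under it. The pair-based input format and the function name `assemble_text` are hypothetical choices made for the example.

```python
def assemble_text(outlines, selected_paragraphs):
    """Order paragraphs by their outline and join outline headings with paragraphs.

    outlines: outline entries in their intended order.
    selected_paragraphs: list of (outline_entry, paragraph_text) pairs.
    """
    position = {entry: index for index, entry in enumerate(outlines)}
    # sorted() is stable, so paragraphs under the same outline entry keep
    # the order in which they were selected.
    ordered = sorted(selected_paragraphs, key=lambda pair: position[pair[0]])

    lines = []
    current = None
    for entry, paragraph in ordered:
        if entry != current:
            lines.append(entry)  # emit each outline entry once, as a heading
            current = entry
        lines.append(paragraph)
    return "\n".join(lines)
```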
In one embodiment of the present invention, the description information includes at least one of a user portrait, keywords, entity words, key sentences, and a text type.
In a second aspect, an embodiment of the present invention provides an outline determining apparatus, including:
the information acquisition module is used for acquiring the description information of the text to be generated;
the feature obtaining module is used for obtaining semantic features of the description information and taking the semantic features as first semantic features;
The outline selecting module is used for selecting the outline of the text to be generated from the preset outline based on the semantic features of the preset outline and the first semantic features.
In one embodiment of the present invention, the outline selection module is specifically configured to select the outline of the text to be generated from the preset outlines based on a similarity between the semantic features of the preset outlines and the first semantic features.
In one embodiment of the present invention, the outline selecting module includes:
The outline group selection submodule is used for selecting an outline group to which the outline of the text to be generated belongs from all outline groups as an alternative outline group based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features, wherein each outline group is an outline group obtained by clustering according to the similarity between the semantic features of the outline;
And the outline selection sub-module is used for selecting the outline of the text to be generated from all the outline of the alternative outline group according to the similarity between the semantic features of all the outline of the alternative outline group and the first semantic features.
In one embodiment of the invention, the outline selection submodule is specifically configured to select an alternative outline group with the highest similarity between the semantic features of the clustering center and the first semantic features from the alternative outline groups, and select the outline of the text to be generated from each outline in the selected alternative outline groups according to the similarity between the semantic features of each outline in the selected alternative outline groups and the first semantic features.
In one embodiment of the invention, the outline selection submodule is specifically configured to calculate a similarity between a semantic feature of each outline in the candidate outline group and the first semantic feature, and select a first preset number of outlines from each outline according to a sequence from high to low of the calculated corresponding similarity of each outline, as the outline of the text to be generated.
In one embodiment of the present invention, the outline group selecting submodule includes:
The similarity calculation unit is used for calculating the similarity between the semantic features of the clustering centers of each outline group and the first semantic features;
the outline group selection unit is used for selecting a second preset number of outline groups from the outline groups according to the sequence that the calculated correspondence similarity of the outline groups is from high to low;
a number determining unit configured to determine the number of outline items including the description information in each of the selected outline groups;
and the outline group determining unit is used for determining the outline group to which the outline of the text to be generated belongs from the selected outline groups according to the determined outline number.
In one embodiment of the present invention, the apparatus further comprises a paragraph selection module,
The paragraph selection module is specifically configured to select a paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated.
In an embodiment of the present invention, the paragraph selection module is specifically configured to select a paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated based on a similarity between the semantic features of the preset paragraphs corresponding to the outline of the text to be generated and the first semantic features.
In one embodiment of the present invention, the apparatus further includes a preset paragraph determining module, where the preset paragraph determining module includes:
The text acquisition sub-module is used for acquiring a pre-selected text corresponding to a preset outline;
And the paragraph determining submodule is used for extracting paragraphs from all paragraphs corresponding to preset outline in the text and taking the extracted paragraphs as preset paragraphs corresponding to outline.
In one embodiment of the present invention, the paragraph determining submodule includes:
The information determining unit is used for determining the characteristic information of each paragraph corresponding to the preset outline in the text;
And the paragraph determining unit is used for selecting alternative paragraphs from the paragraphs based on the characteristic information of the paragraphs and determining the alternative paragraphs as preset paragraphs corresponding to the preset outline in the text.
In one embodiment of the present invention, the paragraph determining unit is specifically configured to determine, for each alternative paragraph, a semantic feature of the alternative paragraph and a word meaning feature of each word in the alternative paragraph, input the determined semantic feature and the word meaning feature into a pre-trained paragraph quality evaluation model, obtain a quality score value of the alternative paragraph, and use a paragraph with a quality score value greater than a preset quality score threshold as a preset paragraph corresponding to the preset outline in the text;
The paragraph quality evaluation model is obtained by training a preset neural network model by taking semantic features of a sample paragraph and word sense features of words in the sample paragraph as model input and taking a labeled quality score value of the sample paragraph as a training reference, and is used for obtaining the quality score value of the paragraph.
In one embodiment of the present invention, the apparatus further comprises a text generation module,
The text generation module is specifically configured to sort the paragraphs of the selected text to be generated based on the outline of the text to be generated, and generate a text including the outline of the text to be generated and the sorted paragraphs.
In one embodiment of the present invention, the description information includes at least one of a user portrait, keywords, entity words, key sentences, and a text type.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
And a processor, configured to implement the method steps described in the first aspect when executing the program stored in the memory.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method steps of the first aspect described above.
From the above, when the solution provided by the embodiments of the invention is applied to determine an outline, the outline of the text to be generated is determined from the preset outlines based on the semantic features of the preset outlines and the first semantic features. Compared with the prior art, no worker needs to determine the outline manually, so the efficiency of outline determination is improved.
In addition, as the first semantic features can reflect the semantics expressed by the description information of the text to be generated, the semantic features of the preset outline can reflect the semantics expressed by each preset outline, and according to the two types of information, the outline of the text to be generated can be determined more accurately from the preset outline.
Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a first outline determining method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for selecting an outline according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a second method for determining an outline according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a third method for determining an outline according to an embodiment of the present invention;
FIG. 5 is a block flow diagram of a paragraph obtaining method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a text information obtaining method according to an embodiment of the present invention;
FIG. 7 is a block flow diagram of calculating semantic similarity based on a model according to an embodiment of the present invention;
FIG. 8 is a block flow diagram of a text generation method according to an embodiment of the present invention;
FIG. 9a is a schematic flow chart of a preset paragraph obtaining process according to an embodiment of the present invention;
FIG. 9b is a schematic flow chart of an alternative segment quality evaluation process according to an embodiment of the present invention;
FIG. 9c is a schematic flow chart of an Attention mechanism according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a first outline determining apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an outline selection module according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a second outline determining apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a third outline determining apparatus according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic flow chart of a first outline determining method according to an embodiment of the present invention, where the method includes S101-S103.
The execution body of the embodiment of the invention can be electronic equipment, such as a server, a notebook computer and the like.
S101, obtaining description information of the text to be generated.
The text to be generated may be news, articles, mote, etc.
The description information of the text to be generated may be understood as information for describing the text to be generated, for example, the description information of the text to be generated may include keywords, titles, text types, etc. of the text to be generated.
The description information is information for describing the text to be generated, and can reflect basic characteristics of the text to be generated, such as characteristics of types, styles, keywords and the like of the text to be generated, and can be called as characteristic information.
Specifically, the information for describing the text to be generated, which is input by the user, may be the description information of the text to be generated.
In one embodiment, the user may directly input descriptive information of the text to be generated.
For example, the user inputs keywords, types, titles and the like of the text to be generated, and the electronic equipment takes the information input by the user as description information of the text to be generated.
In another embodiment, the user may further input a description text segment of the text to be generated, and the electronic device extracts description information of the description text segment to obtain the description information of the text to be generated.
Specifically, after the electronic device obtains the description text segment, the description text segment may be cleaned and filtered, for example by sensitive-word filtering and stop-word filtering, and the description information may then be extracted from the cleaned and filtered description text segment.
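A minimal sketch of such cleaning and extraction is shown below. The word lists are hypothetical placeholders, the tokenizer assumes space-delimited text (Chinese text would need a word segmenter instead), and treating the remaining unique tokens as keywords is only one simple way to extract description information.

```python
import re

# Hypothetical word lists; a real system would load much larger ones.
SENSITIVE_WORDS = {"classified"}
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "on", "for", "about"}


def extract_description_info(segment):
    """Clean a description text segment and return its keywords in order."""
    tokens = re.findall(r"[a-z0-9]+", segment.lower())
    # Sensitive-word filtering and stop-word filtering.
    kept = [t for t in tokens if t not in SENSITIVE_WORDS and t not in STOP_WORDS]
    # Deduplicate while preserving first-occurrence order.
    seen = set()
    keywords = []
    for token in kept:
        if token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords
```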
In another embodiment, the electronic device may further obtain a user portrait, and use the description information input by the user and the user portrait as the description information of the text to be generated.
Specifically, when the user portrait is obtained, the user portrait may be obtained based on the user identifier and a preset correspondence between the user identifier and the user portrait. The user identification may be an ID number, a login name, etc. of the user.
A user portrait can be understood as information describing the user. The user portrait may include descriptive information in a plurality of different dimensions; for example, it may include attribute information, interest information, behavior information, scene information, etc. of the user.
S102, obtaining semantic features of the descriptive information as first semantic features.
The semantic features are used for reflecting the semantics expressed by a certain object, and in the embodiment of the invention, the semantic features of different objects appear, and for each object, the semantic features of the object are used for reflecting the semantics expressed by the object.
Semantic features of the descriptive information are used to reflect the semantics expressed by the descriptive information. When the semantic features of the descriptive information are obtained, they may be obtained in a vectorized form. For example, the description information can be vectorized and encoded, and the encoded result is used as the semantic feature of the description information. Alternatively, the semantic information of the description information can be analyzed, and semantic features can be extracted based on the analysis result.
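To make the idea of a vectorized semantic feature concrete, the sketch below encodes text as a normalized hashed bag-of-words vector. This is a deliberately simple stand-in chosen for illustration: the source does not specify the encoder, and a practical system would more likely use a trained language model. The function name `encode` and the dimension of 64 are assumptions.

```python
import hashlib
import math


def encode(text, dim=64):
    """Encode text as a unit-length hashed bag-of-words vector (toy encoder)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```

Because the vector only counts tokens, word order is ignored; it captures lexical overlap rather than true semantics, which is the main limitation of this stand-in.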
S103, determining the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features.
The outline is used to reflect the structural information of the text. Specifically, the outline may be each subtitle in the text, and the outline may also be a central sentence of each paragraph in the text.
For example, the outline of a patent text may comprise the abstract, the drawing description, the claims, the description, and the drawings; the outline of an academic paper may comprise the abstract, the keywords, the specific content, and the references.
Since the subtitles of the text are typically summary titles, the central sentence of each paragraph in the text is typically used to summarize the central ideas of the paragraphs, and the text is typically formed according to a certain logic order based on the central ideas of each paragraph or the contents of each subtitle, the outline can more accurately reflect the structural information of the text.
The preset outlines may be outlines extracted from various types of texts acquired in advance. The electronic device may store historical texts and extract outlines from the stored texts as preset outlines. It may also use an automatic crawler system to crawl texts from specified websites at regular intervals and store them in a database, thereby obtaining the incremental texts of those websites. In addition, it may monitor text information on the Internet in real time, discover text data sources, and detect dynamic changes in text data, so as to obtain a large number of texts. The outlines in the obtained texts are then extracted and stored as preset outlines.
The semantic features of the preset outline are used for reflecting the semantics expressed by the preset outline. Specifically, the semantic features of the preset outline may be that the electronic device performs vectorization encoding on each preset outline in advance, and the encoded vector is used as the semantic feature of each preset outline.
In one embodiment of the present invention, the outline of the text to be generated may be selected from the preset outline based on a similarity between the semantic features of the preset outline and the first semantic features.
The similarity between the semantic features of a preset outline and the first semantic features may be determined by calculating the distance between the two and then deriving the similarity from the calculated distance. The distance may be a Euclidean distance, a cosine distance, etc. For example, a distance-to-similarity conversion algorithm may be employed to convert the calculated distance into a similarity.
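The two distances named above, and one possible conversion from distance to similarity, can be sketched as follows. The source does not specify the conversion algorithm; the mapping 1 / (1 + d) is an assumption chosen because it sends a distance of 0 to a similarity of 1 and decreases as the distance grows.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_to_similarity(distance):
    """Assumed conversion: map a non-negative distance into the range (0, 1]."""
    return 1.0 / (1.0 + distance)
```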
When the outline of the text to be generated is selected from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features, a preset number of preset outlines with the highest similarity may be selected as the outlines of the text to be generated, or the preset outlines whose similarity is greater than a preset similarity may be selected as the outlines of the text to be generated.
The preset number may be set empirically by a worker; for example, the preset number may be 1, 2, 3, 4, etc. Taking a preset number of 3 as an example, the 3 preset outlines with the highest similarity are selected as the outlines of the text to be generated.
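Both selection rules, by preset number and by preset similarity, can be sketched in a few lines. The similarity scores are assumed to be already computed and held in a dictionary keyed by outline text; the function names and the example values are illustrative only.

```python
def select_top_k(similarities, preset_number):
    """Return the preset_number outlines with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [outline for outline, _ in ranked[:preset_number]]


def select_above_threshold(similarities, preset_similarity):
    """Return every outline whose similarity exceeds the preset similarity."""
    return [
        outline
        for outline, score in similarities.items()
        if score > preset_similarity
    ]


# Example with a preset number of 3, as in the text above.
scores = {"Background": 0.91, "Market size": 0.78, "Risks": 0.42, "Conclusion": 0.80}
top_three = select_top_k(scores, 3)
```

The first rule always returns a fixed number of outlines, while the second may return none or many depending on how the threshold is set.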
In this way, the outline of the text to be generated is determined based on the similarity between the semantic features of the preset outline and the first semantic features, the first semantic features can reflect the semantics expressed by the description information of the text to be generated, and the semantic features of the preset outline can reflect the semantics expressed by each preset outline, so that the outline of the text to be generated can be determined more accurately based on the similarity between the semantic features.
From the above, when the solution provided by this embodiment is applied to determine an outline, the outline of the text to be generated is determined from the preset outlines based on the semantic features of the preset outlines and the first semantic features. Compared with the prior art, no worker needs to determine the outline manually, so the efficiency of outline determination is improved.
In addition, as the first semantic features can reflect the semantics expressed by the description information of the text to be generated, the semantic features of the preset outline can reflect the semantics expressed by each preset outline, and according to the two types of information, the outline of the text to be generated can be determined more accurately from the preset outline.
In one embodiment of the present invention, the preset outline may be an outline included in a plurality of outline groups obtained by clustering according to similarity between semantic features of the outline.
Specifically, the semantic features of each preset outline may be clustered based on the similarity between the semantic features of each preset outline. For example, a K-means (K-means clustering) algorithm can be adopted to cluster semantic features of each preset outline to obtain clustered outline groups.
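For illustration, a compact pure-Python K-means over outline feature vectors is sketched below. It uses squared Euclidean distance, a fixed iteration count, and seeded random initialization; these choices, and the function names, are assumptions of the sketch, and a production system would more likely rely on an existing library implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, vector in enumerate(vectors):
            assignments[i] = min(
                range(k), key=lambda c: squared_distance(vector, centroids[c])
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

Each resulting cluster corresponds to one outline group; the number of groups `k` has to be chosen in advance.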
After each clustered outline group is determined, a clustering center is determined in each outline group, where the clustering center refers to the outline that has a high similarity with each of the preset outlines included in the outline group.
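One way to pick such a clustering center is to choose the group member with the highest average similarity to the other members. This rule, the use of cosine similarity, and the function names are assumptions of the sketch below; the source only states that the center is an outline with higher similarity to the outlines in its group.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_cluster_center(group_vectors):
    """Return the index of the member most similar, on average, to the others."""

    def mean_similarity(i):
        others = [
            cosine_similarity(group_vectors[i], v)
            for j, v in enumerate(group_vectors)
            if j != i
        ]
        # A single-member group is trivially its own center.
        return sum(others) / len(others) if others else 1.0

    return max(range(len(group_vectors)), key=mean_similarity)
```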
Specifically, a distributed file storage system may be used to store the semantic feature vectors of each outline in each clustered outline group together with the semantic feature vectors of the clustering centers, and the semantic feature vectors of the clustering centers of the clustered outline groups may additionally be stored in the memory of the electronic device.
On the basis of the above embodiment, refer to Fig. 2, which is a schematic flow chart of an outline selection method according to an embodiment of the present invention. Step S103 above, in which the outline of the text to be generated is selected from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features, may be implemented according to the following steps S201 to S202.
S201, based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features, selecting, from the outline groups, the outline group to which the outline of the text to be generated belongs as a candidate outline group.
The outline groups are obtained by clustering according to the similarity between the semantic features of the outlines, so each outline group has a clustering center. The clustering center is the outline that has a relatively high similarity to the preset outlines included in the outline group, and its semantic features reflect the overall semantic features of that outline group.
The first semantic features are the semantic features of the description information.
Specifically, since the semantic features of the clustering center of each outline group can be stored in the memory of the electronic device, they can be read from memory when their similarity to the first semantic features is calculated, which improves the efficiency of obtaining the semantic features of the clustering centers.
When determining the similarity between the semantic features of the clustering center of each outline group and the first semantic features, the distance between the two may be calculated and the similarity determined from the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example, a distance-to-similarity conversion algorithm may be used to convert the calculated distance into a similarity.
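The distance calculation and the conversion to a similarity can be sketched as below. The text does not name a particular conversion, so the mapping 1 / (1 + d) is an assumption chosen because it sends a distance of 0 to a similarity of 1 and decreases as the distance grows. All function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_to_similarity(distance):
    """Assumed conversion: identical vectors give 1.0, far vectors approach 0."""
    return 1.0 / (1.0 + distance)


center_features = [0.0, 0.0]  # hypothetical clustering-center features
first_features = [3.0, 4.0]   # hypothetical first semantic features
d = euclidean_distance(center_features, first_features)
print(d, distance_to_similarity(d))
```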
When the candidate outline group is selected in this way, either a preset number of outline groups with the highest similarity may be selected as candidate outline groups, or the outline groups whose similarity is greater than a preset similarity may be selected as candidate outline groups.
S202, selecting the outline of the text to be generated from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features.
Specifically, when determining the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, the distance between the semantic features of each outline and the first semantic features may be calculated, and the similarity determined from the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example, a distance-to-similarity conversion algorithm may be used to convert the calculated distance into a similarity.
Specifically, when the outline of the text to be generated is selected in this way, either a preset number of outlines with the highest similarity may be selected, or the outlines whose similarity is greater than a preset similarity may be selected.
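The two selection rules, a preset number of highest-similarity items or all items above a preset similarity, apply equally to outline groups and to outlines. A minimal sketch follows, assuming the similarities have already been computed and stored in a dictionary. The names and values are hypothetical.

```python
def select_top_k(similarities, preset_number):
    """Return the ids of the preset_number items with the highest similarity."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [item_id for item_id, _ in ranked[:preset_number]]


def select_above_threshold(similarities, preset_similarity):
    """Return the ids whose similarity is greater than preset_similarity."""
    return [
        item_id
        for item_id, similarity in similarities.items()
        if similarity > preset_similarity
    ]


# Hypothetical similarities between three outlines and the first semantic features.
similarities = {"outline 1": 0.9, "outline 2": 0.4, "outline 3": 0.7}
print(select_top_k(similarities, 2))              # ['outline 1', 'outline 3']
print(select_above_threshold(similarities, 0.5))  # ['outline 1', 'outline 3']
```

The top-k rule always returns a fixed number of results, while the threshold rule may return none or many, so the choice depends on whether a bounded result size or a minimum quality matters more.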
In this way, the outline group to which the outline of the text to be generated belongs is first selected from the clustered outline groups as a candidate outline group, and the outline of the text to be generated is then selected from the candidate outline group. Because the outline groups are obtained by clustering on the similarity between the semantic features of the outlines, similar outlines fall into the same outline group. Compared with selecting the outline of the text to be generated outline by outline, selecting it group by group is more efficient.
In one embodiment of the present invention, step S202 above, in which the outline of the text to be generated is selected from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, may be implemented according to the following steps A1 to A2.
Step A1, selecting, from the candidate outline groups, the candidate outline group whose clustering center has the semantic features with the highest similarity to the first semantic features.
Specifically, the candidate outline group with the highest similarity may be selected from the candidate outline groups based on the similarity between the semantic features of the clustering center of each candidate outline group and the first semantic features.
Step A2, selecting the outline of the text to be generated from the outlines in the selected candidate outline group according to the similarity between the semantic features of each outline in the selected candidate outline group and the first semantic features.
Specifically, when the outline of the text to be generated is selected, either a preset number of outlines with the highest similarity may be selected as the outlines of the text to be generated, or the outlines whose similarity is greater than a preset similarity may be selected.
In this way, the semantic features of the clustering center of the selected candidate outline group have the highest similarity to the first semantic features, so the semantic information of that clustering center is closest to the description information of the text to be generated. Determining the outline of the text to be generated from the selected candidate outline group therefore yields an outline whose semantic information is closest to the description information, which improves the accuracy of outline determination.
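Steps A1 and A2 can be sketched as a narrowing to the single best candidate outline group followed by a ranking of its outlines. The sketch assumes cosine similarity and a simple dictionary layout for the groups. Both the layout and the names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_from_best_group(first_features, candidate_groups, preset_number=1):
    # Step A1: keep the candidate group whose clustering center is most
    # similar to the first semantic features.
    best_group = max(
        candidate_groups,
        key=lambda group: cosine_similarity(group["center"], first_features),
    )
    # Step A2: rank the outlines of that group by similarity and keep the top ones.
    outlines = best_group["outlines"]
    ranked = sorted(
        outlines,
        key=lambda oid: cosine_similarity(outlines[oid], first_features),
        reverse=True,
    )
    return ranked[:preset_number]


# Hypothetical candidate outline groups with 2-D feature vectors.
candidate_groups = [
    {"center": [1.0, 0.0], "outlines": {"t1": [0.9, 0.1], "t2": [0.8, 0.3]}},
    {"center": [0.0, 1.0], "outlines": {"s1": [0.1, 0.9]}},
]
print(select_from_best_group([1.0, 0.1], candidate_groups, preset_number=2))
```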
In one embodiment of the present invention, step S202 above, in which the outline of the text to be generated is selected from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, may be implemented according to the following steps B1 to B2.
Step B1, calculating the similarity between the semantic features of each outline in the candidate outline group and the first semantic features.
Specifically, when calculating the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, the distance between the semantic features of each outline and the first semantic features may be calculated, and the similarity determined from the calculated distance.
Step B2, selecting, in descending order of the calculated similarity, the top first preset number of outlines from the outlines of the candidate outline groups as the outlines of the text to be generated.
The first preset number may be set empirically by a worker, for example to 6 or 8. Taking a first preset number of 6 as an example, the 6 outlines with the highest similarity are selected.
After the similarity between the semantic features of each outline and the first semantic features is obtained, the top first preset number of outlines are selected in descending order of the calculated similarity. It will be appreciated that the selected outlines may come from different candidate outline groups or from the same candidate outline group.
For example, assuming that the outlines ranked in descending order of similarity are outline 1, outline 2, outline 3, outline 4 and outline 5, and that the first preset number is 3, the selected outlines are outline 1, outline 2 and outline 3.
In this way, the top first preset number of outlines, in descending order of similarity, are taken as the outlines of the text to be generated. Because the similarity reflects how close the semantic information expressed by an outline is to the description information of the text to be generated, the accuracy of selecting the outlines of the text to be generated is improved.
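Steps B1 and B2 pool the outlines of all candidate outline groups and keep the top first preset number overall, so the result may mix outlines from several groups. A minimal sketch follows, with hypothetical similarity values chosen to match the ordering in the example above.

```python
def top_outlines(similarity_by_group, first_preset_number):
    """similarity_by_group maps group id -> {outline id: similarity}."""
    # Pool the outlines of every candidate group into one list.
    pooled = [
        (similarity, outline_id)
        for outlines in similarity_by_group.values()
        for outline_id, similarity in outlines.items()
    ]
    # Rank by similarity, highest first, and keep the first preset number.
    pooled.sort(key=lambda pair: pair[0], reverse=True)
    return [outline_id for _, outline_id in pooled[:first_preset_number]]


# Hypothetical similarities; the group membership is also an assumption.
similarity_by_group = {
    "group A": {"outline 1": 0.95, "outline 3": 0.80, "outline 5": 0.40},
    "group B": {"outline 2": 0.90, "outline 4": 0.60},
}
print(top_outlines(similarity_by_group, 3))
# ['outline 1', 'outline 2', 'outline 3']
```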
In one embodiment of the present invention, step S201 above, in which the outline group to which the outline of the text to be generated belongs is selected from the outline groups based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features, may be implemented according to the following steps C1 to C4.
Step C1, calculating the similarity between the semantic features of the clustering center of each outline group and the first semantic features.
Specifically, when calculating the similarity between the semantic features of the clustering center of each outline group and the first semantic features, the distance between the semantic features of each clustering center and the first semantic features may be calculated, and the similarity determined from the calculated distance.
Step C2, selecting, in descending order of the calculated similarity, the top second preset number of outline groups from the outline groups.
The second preset number may be set empirically by a worker, for example to 5 or 10.
After the similarity between the semantic features of the clustering center of each outline group and the first semantic features is obtained, the top second preset number of outline groups are selected from the outline groups in descending order of the calculated similarity.
For example, assuming that the outline groups ranked in descending order of similarity are outline group 1, outline group 2, outline group 3, outline group 4 and outline group 5, and that the second preset number is 3, the selected outline groups are outline group 1, outline group 2 and outline group 3.
Step C3, determining, for each selected outline group, the number of outlines that contain the description information.
The description information is the description information of the text to be generated obtained in S101.
An outline may contain the description information in two ways: it may contain all of the description information of the text to be generated, or it may contain only part of it.
Specifically, in determining the number of outlines, it may first be determined whether each outline in each outline group contains the description information. In one embodiment, the description information of each outline in the outline group may be extracted, and whether the extracted description information is the description information of the text to be generated is determined; if so, the outline is considered to contain the description information.
In another embodiment, the description information of each outline may be extracted in advance and stored in a database. The description information of the text to be generated is then matched against the description information stored in the database for each outline, and if the match succeeds, the outline is considered to contain the description information.
After it has been determined which outlines in each outline group contain the description information, the outlines that contain it are counted for each outline group, which gives the number of outlines containing the description information in each selected outline group.
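The counting in step C3 can be sketched as below. The sketch assumes that the description information is a list of keywords and that an outline "contains" it when at least one keyword appears in the outline text, which covers both the full and the partial case. The matching rule, names and sample data are hypothetical.

```python
def contains_description(outline_text, description_terms):
    """True if the outline mentions at least one description term."""
    text = outline_text.lower()
    return any(term.lower() in text for term in description_terms)


def count_matching_outlines(selected_groups, description_terms):
    """Map each group id to the number of its outlines containing the terms."""
    return {
        group_id: sum(
            contains_description(outline, description_terms) for outline in outlines
        )
        for group_id, outlines in selected_groups.items()
    }


# Hypothetical outline groups selected in step C2.
selected_groups = {
    "S1": ["Battery safety overview", "Charging standards", "Market outlook"],
    "S2": ["Battery recycling", "Policy summary"],
}
print(count_matching_outlines(selected_groups, ["battery", "charging"]))
# {'S1': 2, 'S2': 1}
```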
Step C4, determining, from the selected outline groups and according to the determined outline numbers, the outline group to which the outline of the text to be generated belongs, and taking it as the candidate outline group.
In one embodiment, the same initial weight is first assigned to each outline group, where the weight of an outline group represents the probability that the outline group is selected as a candidate outline group: the larger the weight, the higher that probability, and the smaller the weight, the lower that probability.
The weights of the outline groups are then adjusted based on the determined outline numbers, and after the adjustment a preset number of outline groups with the highest weights are selected as candidate outline groups.
For example, assume that the selected outline groups are outline group S1 and outline group S2, where outline group S1 contains 5 outlines that include the description information and outline group S2 contains 3, the initial weights of S1 and S2 are both 10, and the preset number is 1.
The weight of each outline group is adjusted based on the number of outlines containing the description information. The adjusted weights are 15 for outline group S1 and 12 for outline group S2; since the weight of outline group S1 is the highest, outline group S1 is taken as the candidate outline group.
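The weighting in step C4 can be sketched as follows. The text does not state the adjustment rule, so the additive rule used here (initial weight plus outline number) is an assumption, and its numbers are not meant to reproduce the adjusted weights in the example above. The function name is hypothetical.

```python
def pick_candidate_groups(outline_numbers, initial_weight=10, preset_number=1):
    """outline_numbers maps group id -> number of outlines containing the
    description information. Returns (candidate group ids, adjusted weights)."""
    # Assumed adjustment rule: add the outline number to the initial weight.
    weights = {
        group_id: initial_weight + number
        for group_id, number in outline_numbers.items()
    }
    # Keep the preset number of groups with the highest adjusted weight.
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:preset_number], weights


candidates, weights = pick_candidate_groups({"S1": 5, "S2": 3})
print(candidates)  # ['S1']
print(weights)     # {'S1': 15, 'S2': 13}
```

Any monotonically increasing adjustment would select the same group here, since only the ranking of the adjusted weights matters.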
In another embodiment, the same initial weight is first assigned to each outline group, and the number of items of description information contained in each outline group is determined. The weight of each outline group is then adjusted based on this number and on the number of outlines containing the description information, and a preset number of outline groups with the highest weights are selected as candidate outline groups.
For example, assume that the selected outline groups are outline group S3 and outline group S4, where outline group S3 contains 2 outlines that include the description information and outline group S4 contains 3, outline group S3 contains 2 items of description information and outline group S4 contains 3, the initial weights of S3 and S4 are both 10, and the preset number is 1.
The weight of each outline group is adjusted based on the number of outlines containing the description information and the number of items of description information contained in each outline group. The adjustment may use a preset weight corresponding to the outline number and a preset weight corresponding to the number of items of description information contained.
In this way, the candidate outline group is determined based on the determined outline numbers, which reflect how many outlines in each outline group contain the description information. A larger outline number indicates that more outlines in the outline group contain the description information, so the semantic information expressed by the outlines in that group can be considered close to the description information of the text to be generated. Determining the candidate outline group based on the outline numbers therefore improves its accuracy.
Refer to Fig. 3, which is a flow chart of a second outline determining method according to an embodiment of the present invention. On the basis of the above embodiment, the method further includes the following step S104.
S104, selecting a paragraph of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated.
The preset paragraphs may be obtained by extracting the outlines from texts obtained in advance and taking the paragraph that corresponds to each extracted outline as the preset paragraph for that outline.
Taking a patent text as an example, the paragraph corresponding to the abstract outline of the patent is the content of the abstract of the specification, and the paragraph corresponding to the claims outline of the patent is the content of the claims.
The electronic device may store historical texts, extract the outlines in the stored texts, and take the paragraphs corresponding to the extracted outlines as preset paragraphs. Alternatively, it may use an automatic crawler system to crawl the texts of specified websites at regular intervals and store them in a database, thereby obtaining the incremental texts of those websites, and may monitor text information on the Internet in real time, discover text data sources and detect dynamic text data, so as to obtain a large number of texts. The outlines in the obtained texts are then extracted, and the paragraphs corresponding to those outlines are taken as the preset paragraphs.
The preset outlines correspond to the preset paragraphs; specifically, one preset outline may correspond to one preset paragraph or to a plurality of preset paragraphs. Therefore, after the outline of the text to be generated is determined, the preset paragraphs corresponding to that outline can be determined.
When a paragraph of the text to be generated is selected from preset paragraphs corresponding to the outline of the text to be generated, in one embodiment, a paragraph may be randomly selected from preset paragraphs corresponding to the outline of the text to be generated as a paragraph of the text to be generated.
In one embodiment of the present invention, a paragraph of the text to be generated may also be selected from preset paragraphs corresponding to the outline of the text to be generated based on a similarity between the semantic feature of the preset paragraph corresponding to the outline of the text to be generated and the first semantic feature.
The semantic features of a preset paragraph corresponding to the outline of the text to be generated reflect the semantics expressed by that preset paragraph. The first semantic features are the semantic features of the description information and reflect the semantics expressed by the description information.
Specifically, the distance between the semantic features of a preset paragraph corresponding to the outline of the text to be generated and the first semantic features may be calculated, and the similarity determined from the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example, a distance-to-similarity conversion algorithm may be used to convert the calculated distance into a similarity.
When a paragraph of the text to be generated is selected in this way, either the preset paragraph with the highest similarity may be selected as the paragraph of the text to be generated, or the preset paragraphs whose similarity is greater than a preset similarity may be selected.
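Selecting the highest-similarity paragraph for one outline can be sketched as below. The sketch assumes Euclidean distance and the conversion 1 / (1 + d), neither of which is fixed by the text. The paragraph ids and vectors are hypothetical.

```python
import math


def similarity(a, b):
    """Assumed similarity: 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def select_paragraph(first_features, preset_paragraphs):
    """preset_paragraphs maps paragraph id -> feature vector for one outline.
    Return the id of the paragraph most similar to the first semantic features."""
    return max(
        preset_paragraphs,
        key=lambda pid: similarity(preset_paragraphs[pid], first_features),
    )


# Hypothetical preset paragraphs corresponding to one outline.
preset_paragraphs = {"paragraph 11": [0.2, 0.9], "paragraph 12": [0.9, 0.1]}
print(select_paragraph([1.0, 0.0], preset_paragraphs))  # paragraph 12
```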
Because the paragraphs of the text to be generated are determined based on the similarity between the semantic features of the preset paragraphs and the first semantic features, where the first semantic features reflect the semantics expressed by the description information of the text to be generated and the semantic features of the preset paragraphs reflect the semantics expressed by each preset paragraph, the semantic information expressed by the selected paragraphs is relatively close to that of the text to be generated, and the generated text is relatively accurate.
In this way, the paragraphs of the text to be generated are selected from the preset paragraphs corresponding to the outline of the text to be generated, so the selected paragraphs fit the outline well and the text generated subsequently is logically coherent.
Refer to Fig. 4, which is a flow chart of a third outline determining method according to an embodiment of the present invention. On the basis of the above embodiment, the method further includes the following step S105.
Step S105, sorting the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generating a text that contains the outline of the text to be generated and the sorted paragraphs.
When the selected paragraphs of the text to be generated are sorted based on the outline of the text to be generated, the order of the paragraphs may be determined based on the position of each outline in the text.
For example, assume that the outlines of the text to be generated include a beginning outline, a middle outline and an ending outline, that the selected paragraphs are paragraph 1, paragraph 2 and paragraph 3, and that the beginning outline corresponds to paragraph 1, the middle outline to paragraph 2 and the ending outline to paragraph 3. Since a text is generally structured as a beginning, a middle and an ending, the order of the selected paragraphs can be determined from the outlines of the text to be generated as paragraph 1, paragraph 2, paragraph 3.
When generating the text that contains the outline of the text to be generated and the sorted paragraphs, each outline may be used as the subtitle of its corresponding sorted paragraph, and the subtitles and sorted paragraphs combined to generate the text. Alternatively, each outline may be used as the opening sentence of its corresponding sorted paragraph, thereby generating the text.
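The sorting and assembly in step S105 can be sketched as follows. The sketch assumes each outline carries a numeric position and maps to exactly one selected paragraph. The data layout, function name and sample strings are hypothetical.

```python
def generate_text(outlines, paragraph_for_outline, as_subtitle=True):
    """outlines is a list of (position, outline text) pairs;
    paragraph_for_outline maps outline text -> selected paragraph."""
    parts = []
    # Order the paragraphs by the position of their outline in the text.
    for _, outline in sorted(outlines, key=lambda item: item[0]):
        paragraph = paragraph_for_outline[outline]
        if as_subtitle:
            # Variant 1: the outline becomes the subtitle of its paragraph.
            parts.append(f"{outline}\n{paragraph}")
        else:
            # Variant 2: the outline becomes the opening sentence.
            parts.append(f"{outline} {paragraph}")
    return "\n\n".join(parts)


outlines = [(2, "Ending"), (0, "Beginning"), (1, "Middle")]
paragraphs = {
    "Beginning": "Paragraph 1.",
    "Middle": "Paragraph 2.",
    "Ending": "Paragraph 3.",
}
print(generate_text(outlines, paragraphs))
```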
In this way, the selected paragraphs are sorted based on the outline of the text to be generated, and a text containing the outline and the sorted paragraphs is generated. Because the outline reflects the structural information of the text, sorting the selected paragraphs by the outline gives the sorted paragraphs a clear structure, which improves both the structure and the overall logic of the generated text.
In addition, since the outline of the text to be generated is determined first and the paragraphs of the text to be generated are then selected from the preset paragraphs corresponding to that outline, the number of preset paragraphs to be considered is far smaller than the total number of preset paragraphs. Selecting the paragraphs in this way is therefore efficient, which improves the efficiency of text generation.
In the prior art, electronic devices typically generate text word by word. Specifically, the first word of the text to be generated is determined according to the obtained keywords of the text to be generated, the word adjacent to the first word is determined based on the first word, each subsequent word is determined in the same way, and the text is generated from the determined words. Text generated word by word in this way tends to have poor structure and weak overall logic. In the present scheme, the text is not generated word by word; instead, the selected paragraphs are sorted based on the outline of the text to be generated, so the sorted paragraphs are structured, which improves the structure and the overall logic of the generated text.
In one embodiment of the present invention, the description information of the text to be generated in step S101 above may include at least one of a user portrait, keywords, entity words, key sentences and a text type.
Specifically, the user portrait describes the feature information of the user, and may do so along different dimensions. For example, the user portrait may contain user attribute feature information, user interest feature information, user behavior feature information and user scene feature information.
The user attribute feature information includes the gender, age, location, occupation and the like of the user. The user interest feature information includes the types of articles the user is interested in, the types of articles the user has authored, the user's authoring patterns and the like. The user behavior feature information includes the types of articles the user has recently read and recently authored. The user scene feature information includes the user's current authoring scene.
Specifically, the user portrait may be determined according to the user identifier and a pre-stored correspondence between user identifiers and user portraits. To obtain this correspondence, the electronic device may monitor the online and offline behavior data of users in real time, construct a comprehensive, accurate and multidimensional user portrait for each user, and determine the correspondence between user identifiers and user portraits from the constructed portraits.
Keywords may be understood as words that express the central idea or main content of a text. Specifically, the keywords may be determined based on the frequency with which each word occurs in the description text segment of the text to be generated that is input by the user. For example, the TF-IDF (term frequency-inverse document frequency) algorithm may be used to extract keywords, mainly using the following formula:
$$W_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$
where $tf_{i,j}$ represents the frequency with which the i-th term occurs in the description text segment, $df_i$ represents the number of texts in the preset text library that contain the i-th term, $N$ represents the total number of texts in the preset text library, and $W_{i,j}$ is the importance value of the i-th term in the description text segment. The higher $W_{i,j}$ is, the more important the i-th term is in the description text segment, and the term with the highest $W_{i,j}$ may be taken as a keyword.
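Keyword extraction along these lines can be sketched as below, assuming the common TF-IDF weighting in which the importance of a term is its frequency in the text segment multiplied by log(N / df). The relative-frequency form of tf, the guard for terms absent from the library, and all names and sample data are assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(segment_terms, text_library):
    """segment_terms: terms of the description text segment, in order.
    text_library: list of term sets, one per text in the preset library."""
    total_terms = len(segment_terms)
    n_texts = len(text_library)
    weights = {}
    for term, count in Counter(segment_terms).items():
        tf = count / total_terms  # frequency of the term in the segment
        df = sum(1 for text in text_library if term in text)
        df = max(df, 1)  # assumption: avoid division by zero for unseen terms
        weights[term] = tf * math.log(n_texts / df)
    return weights


segment = ["battery", "battery", "the", "safety"]
library = [
    {"the", "battery"},
    {"the", "market"},
    {"the", "policy"},
    {"the", "safety"},
]
weights = tf_idf_weights(segment, library)
print(max(weights, key=weights.get))  # battery
```

A term such as "the" that occurs in every text of the library gets weight 0, which is why frequent but uninformative words are not chosen as keywords.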
Entity words may be understood as proper nouns appearing in a text. The entity words of the description text segment of the text to be generated that is input by the user may be extracted using a bidirectional LSTM combined with a conditional random field.
Key sentences may be understood as sentences that express the central idea or main content of a text. Specifically, the description text segment of the text to be generated that is input by the user may be divided into a plurality of sentence units, a graph model established according to the contextual relationship between the sentence units, and the sentence units of higher importance determined based on the graph model, so as to extract the key sentences of the text segment. This may be implemented based on the TextRank algorithm.
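A TextRank-style key sentence extraction can be sketched as below. It is a simplified illustration: sentences are split on end punctuation, edge weights use a word-overlap similarity normalized by the log sentence lengths, and the scores are obtained by a fixed number of power iterations. The splitting rule, the similarity measure and the function names are assumptions.

```python
import math
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return set(re.findall(r"\w+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared words divided by the sum of the log sentence lengths."""
    wa, wb = words(a), words(b)
    denom = math.log(len(wa)) + math.log(len(wb)) if wa and wb else 0.0
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    weight = [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weight]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of the score of every neighbor.
        scores = [
            (1 - damping) + damping * sum(
                weight[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def key_sentence(text):
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = max(range(len(sentences)), key=lambda i: scores[i])
    return sentences[best]


text = ("Battery safety matters. Battery safety tests protect electric cars. "
        "Electric cars are popular.")
print(key_sentence(text))  # Battery safety tests protect electric cars.
```

The middle sentence shares words with both of the others, so it collects score from two neighbors and ranks highest.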
When the text type is obtained, the semantic features of the description text segment of the text to be generated that is input by the user may be identified and matched against the semantic features of preset text types, and the text type determined from the matching result.
The following specific embodiment illustrates the process of obtaining the paragraphs of the text to be generated. Refer to Fig. 5, which is a flow chart of a paragraph obtaining method according to an embodiment of the present invention.
In Fig. 5, an optimal outline group is first selected from the clustered outline groups based on the semantic features of the text to be generated. The outline groups include outline group 1, outline group 2, and so on.
The outline of the text to be generated is then determined from the selected optimal outline group based on the feature information of the text to be generated. Specifically, the outline may be determined according to whether each outline in the optimal outline group contains the feature information of the text to be generated. The outlines in the optimal outline group include outline 1, outline 2, and so on.
Next, the paragraphs of the text to be generated are determined, based on the feature information of the text to be generated, from the preset paragraphs corresponding to each outline. The preset paragraphs corresponding to outline 1 include paragraph 11, paragraph 12, and so on; the preset paragraphs corresponding to outline 2 include paragraph 21, paragraph 22, and so on. Specifically, the paragraphs of the text to be generated may be determined according to whether each paragraph contains the feature information of the text to be generated.
In one embodiment of the present invention, the semantic features and feature information of the preset outlines and of the preset paragraphs may be determined in the following manner. Refer to Fig. 6, which is a flow chart of a text information obtaining method according to an embodiment of the present invention.
In Fig. 6, data acquisition is performed first. Specifically, based on an automatic crawler system, the electronic device may monitor text information on the Internet in real time, periodically crawl the texts of specified websites and store them in a message queue that serves as a message cache, and use a preset data scheduling system to discover text data sources from the message queue, so that dynamic text data on the Internet is monitored and crawled in real time.
Cleaning and filtering are performed next. Specifically, this may include similarity-based de-duplication of texts, spelling correction, filtering of sensitive topics, handling of abnormal characters, sentiment analysis, correction of mixed traditional and simplified Chinese characters, unification of word formats, deletion of useless information, and the like.
And constructing the document portrait again. Specifically, the outline and the paragraph of the text after cleaning and filtering can be respectively extracted, and the characteristic information in the outline and the paragraph can be respectively obtained. The feature information may include document type, keywords, key sentences, entity words, text body, etc. The same processing method can be adopted for processing when feature information in the outline and the paragraph is obtained.
Specifically, when the above-described keywords are obtained, it may be determined based on the frequency of occurrence of each word in the text segment of the description text of the text to be generated, which is input by the user. For example, the key words may be extracted using the TF-IDF algorithm. When the entity words are obtained, the entity words of the descriptive text segment of the text to be generated, which are input by the user, can be extracted by using the bidirectional LSTM and the conditional random field. When the key sentences are obtained, the descriptive text segment of the text to be generated, which is input by the user, can be divided into a plurality of sentence units, a graph model is built according to the context relation among the sentence units, and sentence units with higher importance are determined based on the built graph model, so that the key sentences in the text segment are extracted. The method can be realized based on a TextRank algorithm. When the text type is obtained, the semantic features of the text describing the text segment to be generated, which are input by the user, can be identified, the semantic features are matched with the semantic features of the preset text type based on the identified semantic features, and the text type is determined based on the matching result.
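The frequency-based keyword extraction mentioned above can be illustrated with a small TF-IDF sketch in plain Python. The function name, the smoothed IDF variant, and the toy corpus are assumptions made for illustration; the embodiment only states that TF-IDF may be used, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_n=3):
    """Rank the words of one tokenized document by TF-IDF against a tokenized corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for doc in corpus_tokens if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumed variant)
        scores[word] = (count / len(doc_tokens)) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical, already-segmented corpus and description text.
corpus = [["sales", "report", "the"], ["travel", "diary", "the"], ["sales", "plan", "the"]]
description = ["sales", "report", "the", "report"]
keywords = tfidf_keywords(description, corpus, top_n=2)
```

Words that occur in every document, such as "the" here, receive the lowest IDF and drop out of the top of the ranking.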
Finally, feature extraction and data storage are performed.
Specifically, a vectorized encoding technique may be used to encode the outline or the paragraph, and the encoding result is used as the semantic feature of the outline or the paragraph.
After the text portrait and the semantic features are obtained, the data may be stored in two distributed data storage systems. One system stores the feature information of the outlines and paragraphs; it may hold the original text together with feature information such as keywords, word segmentation results, and inverted indexes. The other system stores the semantic features of the outlines and paragraphs and is used to calculate the similarity of semantic features.
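The split between the two stores can be pictured with the following in-memory sketch: one dictionary keeps the original text and keywords and is served by an inverted index, while a second dictionary keeps only the semantic feature vectors. The identifiers, record layout, and vector values are hypothetical, and a real deployment would use distributed storage systems rather than Python dictionaries.

```python
from collections import defaultdict


def build_inverted_index(feature_store):
    """Map each keyword to the set of item ids whose feature record contains it."""
    index = defaultdict(set)
    for item_id, record in feature_store.items():
        for keyword in record["keywords"]:
            index[keyword].add(item_id)
    return index


# Store 1 (hypothetical records): original text plus feature information.
feature_store = {
    "outline-1": {"text": "Sales report outline", "keywords": ["sales", "report"]},
    "outline-2": {"text": "Travel diary outline", "keywords": ["travel", "diary"]},
    "outline-3": {"text": "Sales plan outline", "keywords": ["sales", "plan"]},
}
# Store 2 (hypothetical vectors): semantic features used for similarity calculation.
vector_store = {
    "outline-1": [0.9, 0.1],
    "outline-2": [0.1, 0.9],
    "outline-3": [0.8, 0.2],
}

index = build_inverted_index(feature_store)
```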
In one embodiment of the present invention, step S102 may be implemented as follows: the outline of the text to be generated is selected from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features.
Specifically, the semantic features of a preset outline and the first semantic features are input into a pre-trained semantic similarity calculation model to obtain the similarity between them, and the outline of the text to be generated is selected from the preset outlines based on the obtained similarity.
The semantic similarity calculation model is obtained by training a preset neural network model, taking the semantic features of a large number of sample preset outlines and the semantic features of sample texts as input and the similarity between them as the training reference. The model is used to obtain the similarity between the semantic features of a preset outline and the first semantic features.
The semantic features of the sample preset outlines and of the sample texts may be taken from the semantic feature vector library in fig. X.
Specifically, referring to fig. 7, fig. 7 is a flow chart of calculating semantic similarity based on a model according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of the semantic similarity calculation model. Specifically, when the similarity between the semantic features of a preset outline and the first semantic features is calculated, the semantic features of the preset outline and the first semantic features are each encoded to obtain encoding results; the encoding results are concatenated, and their difference and inner product are calculated; the similarity between the semantic features of the preset outline and the first semantic features is then output after processing by the fully connected layer and the Softmax layer.
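The matching head described for fig. 7 might be sketched as follows in plain Python. This is only an illustration under stated assumptions: the two inputs are taken to be already-encoded vectors, the "inner product" is interpreted as an element-wise product so that it can be concatenated with the other features, the fully connected layer is a single linear layer with hypothetical untrained weights, and the Softmax output has two classes of which the second is read as "similar".

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def match_features(u, v):
    """Concatenate the two encodings with their absolute difference and element-wise product."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


def similarity(u, v, weights, bias):
    """Apply one fully connected layer and Softmax; return the 'similar' class probability."""
    feats = match_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) + b for row, b in zip(weights, bias)]
    return softmax(logits)[1]


# Hypothetical encodings and untrained (all-zero) weights, for shape illustration only.
u, v = [1.0, 0.0], [1.0, 0.0]
weights = [[0.0] * 8, [0.0] * 8]
score = similarity(u, v, weights, bias=[0.0, 0.0])
```

With all-zero weights both classes are equally likely, so the returned score is 0.5; in practice the weights would come from the training procedure the text describes.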
Referring to fig. 8, fig. 8 is a flowchart of a text generation method according to an embodiment of the present invention. Fig. 8 includes a data platform construction module, a user portrait and user intention recognition module, a vectorized semantic retrieval module, and a text generation module.
The data platform construction module covers data acquisition, data cleaning, extraction of feature information and semantic features, and data storage.
In the user portrait and user intention recognition module, the information about the text to be generated that is input by the user is acquired first. The information is then filtered, for example by filtering sensitive words and stop words to remove noise. Feature information is then extracted from the cleaned information, where the feature information may include the keywords, entity words, and text type of the text to be generated, and semantic features are extracted from the feature information.
In the vectorized semantic retrieval module, the extracted semantic features are used to recall similar semantic features from the semantic features of the outlines and paragraphs stored in the distributed vector database.
In the text generation module, outline groups similar to the semantic features are first determined from the distributed vector database; the extracted outline groups are sorted based on the feature information, and a first preset number of the sorted outline groups are selected; the paragraphs of the text to be generated are then determined based on the semantic features and the semantic features of the preset paragraphs corresponding to the selected outline groups; and the text is generated based on the determined outline and paragraphs.
In an embodiment of the present invention, the preset paragraphs corresponding to the outline in step S104 are paragraphs determined in advance. Specifically, the corresponding preset paragraphs may be determined according to the following steps D1-D2.
Step D1: acquire a pre-selected text corresponding to the preset outline.
As can be seen from the description of step S103 above, the electronic device may store historical texts and extract the outlines from them, or may periodically crawl the text of specified websites with the automatic crawler system, extract the outlines of the crawled text, store the outlines in the database, and store the correspondence between each outline and its text. On this basis, the electronic device may select in advance, from the texts stored in the database, the text corresponding to the preset outline and use it as the pre-selected text.
Step D2: extract paragraphs from the paragraphs corresponding to the preset outline in the text, and use the extracted paragraphs as the preset paragraphs corresponding to the outline.
In one embodiment, paragraphs may be selected at random from the paragraphs corresponding to the preset outline in the text and used as the preset paragraphs corresponding to the outline; the number of selected paragraphs may be one or more.
In one embodiment of the present invention, feature information of each paragraph corresponding to the preset outline in the text may be further determined, and based on the feature information of each paragraph, an alternative paragraph is selected from each paragraph, and the alternative paragraph is determined as the preset paragraph corresponding to the outline.
The characteristic information of the paragraph is used for describing basic information of the paragraph, and the characteristic information of the paragraph can comprise the length of each sentence in the paragraph, punctuation mark composition of each sentence, the number of sentences in the paragraph and the like.
Taking the length of each sentence in a paragraph as an example of the feature information, the length of each sentence in each paragraph of the text may be calculated when the feature information is determined.
When selecting the alternative paragraph, it may be to judge whether the feature information of each paragraph meets the preset information screening rule, and if so, take the paragraph as the alternative paragraph.
The preset information screening rule may be set by a worker based on experience, or the worker may extract the feature information of high-quality paragraphs and use a screening rule built from that feature information as the preset information screening rule.
For example, the preset information filtering rule may include that the length of the sentence in the paragraph is greater than 8 bytes, the punctuation mark in the sentence includes comma and period, and the number of sentences in the paragraph is greater than 5.
When the characteristics of the paragraphs meet the preset information screening rules, the paragraphs can be used as alternative paragraphs.
For example, assume that the preset information screening rule retains a paragraph in which the sentence length is greater than 8 bytes, the preset punctuation marks (commas and periods) are present, and the number of sentences is greater than 5, and assume that the paragraph is: "We should study hard and make progress every day, we must respect the old and cherish the young, we should lend a hand in time when classmates are in difficulty, listen attentively in class, and write carefully." This paragraph contains a sentence longer than 8 bytes, "We should study hard and make progress every day"; it contains commas and periods; and the number of sentences in the paragraph is greater than 5. The paragraph therefore satisfies the preset information screening rule and can be used as an alternative paragraph.
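A screening rule of this kind could be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the thresholds are the example values from the text exposed as parameters, byte length is measured in UTF-8, and clauses delimited by commas or periods are counted as "sentences", which is an assumption about how the example counts them.

```python
import re


def split_sentences(paragraph):
    """Split on ASCII or full-width commas and periods; drop empty pieces."""
    return [s.strip() for s in re.split(r"[,.，。]", paragraph) if s.strip()]


def passes_screening(paragraph, min_bytes=8, required_marks=(",", "."), min_sentences=5):
    """Return True if the paragraph satisfies all three example screening conditions."""
    sentences = split_sentences(paragraph)
    has_long_sentence = any(len(s.encode("utf-8")) > min_bytes for s in sentences)
    has_required_marks = all(mark in paragraph for mark in required_marks)
    enough_sentences = len(sentences) > min_sentences
    return has_long_sentence and has_required_marks and enough_sentences


# Hypothetical inputs.
kept = passes_screening("aaaaaaaaa, bb, cc, dd, ee, ff.")
dropped = passes_screening("Short. Text.")
```

For text that uses full-width punctuation, `required_marks` would need to be set to the corresponding characters.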
In determining the candidate paragraph as the preset paragraph corresponding to the outline, in one embodiment, the candidate paragraph may be directly determined as the preset paragraph corresponding to the preset outline in the text. For example, assuming that the determined alternative paragraph is paragraph 1, paragraph 1 may be directly determined as a preset paragraph corresponding to the preset outline in the text.
In one embodiment of the present invention, for each alternative paragraph, the semantic features of the alternative paragraph and the word sense features of each word in the alternative paragraph may be determined, and the determined semantic features and word sense features are input into a pre-trained paragraph quality evaluation model to obtain the quality score value of the alternative paragraph. The alternative paragraphs whose quality score value is greater than the preset quality score threshold are used as the preset paragraphs corresponding to the preset outline in the text.
The semantic features of the alternative paragraphs are used for reflecting the semantics expressed by the alternative paragraphs, and the word sense features of the words are used for reflecting the semantics expressed by the words.
Specifically, to determine the word sense features of each word in the alternative paragraph, word segmentation may be performed on the alternative paragraph to obtain its words, and each word may then be vectorized to obtain its word sense feature. For example, the jieba word segmentation tool may be used for segmentation, and a Word2Vec (Word to Vector) model may be used to vectorize each word.
Upon determining the semantic features of the alternative paragraph, semantic information of the alternative paragraph may be extracted, and the semantic features of the alternative paragraph may be determined based on the extracted semantic information.
Alternatively, the degree of association between words may be determined based on the word sense features of the words in the alternative paragraph, and the semantic features of the alternative paragraph may be determined based on that degree of association. Specifically, an Attention mechanism may be adopted: first, the degree of association between words is determined based on their word sense features, and weights are assigned to the words accordingly to obtain a weight matrix; then, the word vectors of the words are weighted and summed using the weight matrix to obtain a weighted-sum result; finally, the semantic features of the alternative paragraph are determined based on the word vectors of the words and the weighted-sum result.
The paragraph quality evaluation model is obtained by training a preset neural network model, taking the semantic features of a sample paragraph and the word sense features of the words in the sample paragraph as model input and the labeled quality score value of the sample paragraph as the training reference. The neural network model may be TextCNN (Text Convolutional Neural Networks).
Specifically, when the semantic features of the alternative paragraph and the word sense features of its words are input into the paragraph quality evaluation model, a convolution layer in the model first performs a convolution operation on the semantic features and word sense features to obtain a convolution result; pooling and a softmax calculation are then applied to the convolution result to obtain the quality score value of the alternative paragraph.
Specifically, the softmax (normalized exponential function) calculation may use the following formula:
$$A_p = \frac{e^{a_p}}{\sum_{j=1}^{k} e^{a_j}}$$
where p represents the sequence number of a preset quality class, k represents the total number of preset quality classes, a_p represents the quality score value for the quality of the alternative paragraph being the p-th preset quality class, and A_p represents the normalized quality score value for the quality of the alternative paragraph being the p-th preset quality class. A quality score value calculated with the above formula lies in the range [0, 1].
Referring to fig. 9a, fig. 9a is a schematic flow chart of a preset paragraph obtaining process according to an embodiment of the present invention.
In fig. 9a, in the first step, a large amount of document data containing a large number of texts is obtained.
In the second step, the paragraphs in each text are extracted to form a paragraph list.
In the third step, the paragraphs in the paragraph list are coarsely screened based on preset paragraph description information such as length, punctuation marks, and number of sentences, to obtain the alternative paragraphs.
Next, a Word2Vec model is used to determine the word sense features of each word in each alternative paragraph, an Attention mechanism is used to determine the semantic features of each alternative paragraph, and the determined semantic features and word sense features are input into a pre-trained TextCNN model to obtain the quality score value of each alternative paragraph.
Finally, based on the quality score value of each alternative paragraph, the alternative paragraphs whose quality score value is greater than the preset quality threshold are taken as preset paragraphs.
In this way, the quality score value of a paragraph is determined with a deep learning method, and the quality of the paragraph can be determined accurately based on that score.
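The overall flow of fig. 9a, from raw texts to preset paragraphs, can be outlined as a short pipeline. In this sketch the coarse screening rule and the quality model are passed in as callables; `coarse_ok` and `quality_score` are hypothetical stand-ins (the latter for the trained TextCNN model), and splitting texts on newlines is an assumed way of extracting paragraphs.

```python
def select_preset_paragraphs(texts, coarse_ok, quality_score, threshold=0.5):
    """Extract paragraphs, screen them coarsely, then keep those scoring above the threshold."""
    # Build the paragraph list from all texts.
    paragraphs = [p.strip() for text in texts for p in text.split("\n") if p.strip()]
    # Coarse screening yields the alternative paragraphs.
    candidates = [p for p in paragraphs if coarse_ok(p)]
    # Quality scoring keeps only paragraphs above the preset quality threshold.
    return [p for p in candidates if quality_score(p) > threshold]


# Hypothetical stand-ins for the screening rule and the trained quality model.
def coarse_ok(paragraph):
    return len(paragraph) > 5


def quality_score(paragraph):
    return 0.9 if "good" in paragraph else 0.1


texts = ["a good paragraph\nbad", "another poor one"]
preset = select_preset_paragraphs(texts, coarse_ok, quality_score)
```

Running the cheap screening rule first means the more expensive scoring model only sees paragraphs that already meet the basic length and punctuation requirements.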
Referring to fig. 9b, fig. 9b is a schematic flow chart of an alternative paragraph quality evaluation process according to an embodiment of the present invention.
In fig. 9b, following the direction of the arrows, the list of paragraphs obtained after coarse screening, i.e. the alternative paragraphs, is obtained first.
Secondly, the alternative paragraphs are processed in batches.
And then, aiming at each alternative paragraph, obtaining each Word in the alternative paragraph in a Word segmentation mode, and carrying out Word vectorization on each Word in a Word2Vec mode to obtain the Word vector of each Word in the alternative paragraph. And then, based on the obtained word vector and the Attention mechanism, obtaining the Attention characteristic diagram (Attention Feature Map) of the alternative paragraph.
And finally, inputting word vectors and attention characteristic diagrams of all words in the alternative paragraphs into a CNN (Convolutional Neural Networks, convolutional neural network) model, and outputting quality score values of the alternative paragraphs after convolution operation, pooling and softmax in the CNN model.
In this way, the batch processing mode is adopted to process each alternative paragraph, so that the processing efficiency is improved, and the paragraph obtaining efficiency is further improved.
Referring to fig. 9c, fig. 9c is a schematic flow chart of an Attention mechanism according to an embodiment of the present invention. In fig. 9c, a dot product operation (dot) is first performed on the word sense features of the words in the alternative paragraph to obtain the weight (weights) of each word. A softmax calculation is then applied to the word weights to obtain the normalized weights. Next, the word sense features of the words are weighted and summed with the normalized weights (summarize), and a feature map is output. Finally, a final feature map is obtained based on the feature map and the word sense features of the words in the alternative paragraph.
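The dot, softmax, and weighted-sum steps of fig. 9c correspond to a simple self-attention computation, sketched below in plain Python. The function names are hypothetical, and the way the final feature map combines the feature map with the word sense features is not specified in the text; concatenation is used here purely as an assumption.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_feature_map(word_vecs):
    """For each word: dot products with all words, softmax, then a weighted sum of word vectors."""
    feature_map = []
    for query in word_vecs:
        # dot: association between this word and every word in the paragraph
        scores = [sum(q * k for q, k in zip(query, key)) for key in word_vecs]
        weights = softmax(scores)  # weight normalization
        # weighted sum of the word sense features
        context = [sum(w * vec[d] for w, vec in zip(weights, word_vecs))
                   for d in range(len(query))]
        feature_map.append(context)
    return feature_map


def final_feature_map(word_vecs):
    # Assumption: combine each word vector with its context vector by concatenation.
    return [list(w) + c for w, c in zip(word_vecs, attention_feature_map(word_vecs))]


# Hypothetical two-word paragraph with two-dimensional word vectors.
vecs = [[1.0, 0.0], [1.0, 0.0]]
fmap = attention_feature_map(vecs)
final = final_feature_map(vecs)
```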
Corresponding to the outline determining method, the embodiment of the invention also provides an outline determining device.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a first outline determining apparatus according to an embodiment of the present invention, where the apparatus includes the following modules 1001 to 1003.
An information obtaining module 1001, configured to obtain description information of a text to be generated;
a feature obtaining module 1002, configured to obtain semantic features of the description information as first semantic features;
The outline selection module 1003 is configured to select the outline of the text to be generated from the preset outline based on the semantic feature of the preset outline and the first semantic feature.
In one embodiment of the present invention, the outline selection module 1003 is specifically configured to select the outline of the text to be generated from the preset outline based on a similarity between the semantic feature of the preset outline and the first semantic feature.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an outline selection module 1003 according to an embodiment of the present invention, where the module includes the following submodules 10031-10032.
The outline group selection submodule 10031 is configured to select, from the outline groups, the outline group to which the outline of the text to be generated belongs as an alternative outline group, based on the similarity between the semantic feature of the clustering center of each outline group and the first semantic feature, where the outline groups are obtained by clustering the outlines according to the similarity between their semantic features;
and an outline selection submodule 10032, configured to select the outline of the text to be generated from the outlines in the alternative outline group according to the similarity between the semantic feature of each outline in the alternative outline group and the first semantic feature.
In one embodiment of the present invention, the outline selection submodule 10032 is specifically configured to select, from the alternative outline groups, the alternative outline group whose clustering-center semantic feature has the highest similarity to the first semantic feature, and to select the outline of the text to be generated from the outlines in the selected alternative outline group according to the similarity between the semantic feature of each of those outlines and the first semantic feature.
In one embodiment of the present invention, the outline selection submodule 10032 is specifically configured to calculate the similarity between the semantic feature of each outline in the alternative outline group and the first semantic feature, and to select, in descending order of the calculated similarity, a first preset number of outlines as the outlines of the text to be generated.
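The two-level selection performed by submodules 10031 and 10032 can be sketched as follows: the group whose clustering center is closest to the first semantic feature is chosen, and the outlines in that group are then ranked by similarity. Cosine similarity is an assumed choice of similarity measure, and the data layout (`center`, `outlines`, `title`, `vec`) and sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_outlines(first_feature, groups, first_preset_number=2):
    """Pick the group with the closest clustering center, then its most similar outlines."""
    best = max(groups, key=lambda g: cosine(g["center"], first_feature))
    ranked = sorted(best["outlines"],
                    key=lambda o: cosine(o["vec"], first_feature),
                    reverse=True)
    return [o["title"] for o in ranked[:first_preset_number]]


# Hypothetical clustered outline groups.
groups = [
    {"center": [1.0, 0.0],
     "outlines": [{"title": "A", "vec": [1.0, 0.1]},
                  {"title": "B", "vec": [1.0, 1.0]},
                  {"title": "C", "vec": [0.9, 0.0]}]},
    {"center": [0.0, 1.0],
     "outlines": [{"title": "D", "vec": [0.0, 1.0]}]},
]
selected = select_outlines([1.0, 0.0], groups)
```

Comparing against clustering centers first limits the per-outline similarity calculations to a single group instead of the whole outline library.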
In one embodiment of the present invention, the outline group selection submodule 10031 includes:
The similarity calculation unit is used for calculating the similarity between the semantic features of the clustering centers of each outline group and the first semantic features;
the outline group selection unit is used for selecting a second preset number of outline groups from the outline groups in descending order of the calculated similarity of each outline group;
a number determining unit configured to determine the number of outline items including the description information in each of the selected outline groups;
and the outline group determining unit is used for determining the outline group to which the outline of the text to be generated belongs from the selected outline groups according to the determined outline number.
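The units of submodule 10031 could be combined as in the sketch below, which assumes the outline groups have already been sorted by clustering-center similarity. Treating "outlines including the description information" as outlines that contain at least one description term, and choosing the group with the largest such count, are assumptions; the function name and sample data are hypothetical.

```python
def choose_group(ranked_groups, description_terms, second_preset_number=2):
    """From the top-ranked groups, return the one with the most outlines containing a description term."""
    shortlisted = ranked_groups[:second_preset_number]

    def outlines_with_description(group):
        return sum(1 for outline in group
                   if any(term in outline for term in description_terms))

    return max(shortlisted, key=outlines_with_description)


# Hypothetical groups, already sorted by similarity from high to low.
ranked_groups = [
    ["sales plan", "travel notes"],
    ["sales report", "sales summary"],
    ["sales a", "sales b", "sales c"],
]
chosen = choose_group(ranked_groups, ["sales"])
```

The third group is never considered here because it falls outside the second preset number of shortlisted groups.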
Referring to fig. 12, fig. 12 is a schematic structural diagram of a second outline determining apparatus according to an embodiment of the present invention, where the apparatus further includes a paragraph selecting module 1004.
The paragraph selection module 1004 is specifically configured to select a paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated.
In one embodiment of the present invention, the paragraph selection module 1004 is specifically configured to select a paragraph of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated based on a similarity between the semantic features of the preset paragraphs corresponding to the outline of the text to be generated and the first semantic features.
In one embodiment of the present invention, the apparatus further includes a preset paragraph determining module, where the preset paragraph determining module includes:
The text acquisition sub-module is used for acquiring a pre-selected text corresponding to a preset outline;
And the paragraph determining submodule is used for extracting paragraphs from all paragraphs corresponding to preset outline in the text and taking the extracted paragraphs as preset paragraphs corresponding to outline.
In one embodiment of the present invention, the paragraph determining submodule includes:
The information determining unit is used for determining the characteristic information of each paragraph corresponding to the preset outline in the text;
And the paragraph determining unit is used for selecting alternative paragraphs from the paragraphs based on the characteristic information of the paragraphs and determining the alternative paragraphs as preset paragraphs corresponding to the preset outline in the text.
In one embodiment of the present invention, the paragraph determining unit is specifically configured to determine, for each alternative paragraph, a semantic feature of the alternative paragraph and a word meaning feature of each word in the alternative paragraph, input the determined semantic feature and the word meaning feature into a pre-trained paragraph quality evaluation model, obtain a quality score value of the alternative paragraph, and use a paragraph with a quality score value greater than a preset quality score threshold as a preset paragraph corresponding to the preset outline in the text;
The paragraph quality evaluation model is obtained by training a preset neural network model by taking semantic features of a sample paragraph and word sense features of words in the sample paragraph as model input and taking a labeled quality score value of the sample paragraph as a training reference, and is used for obtaining the quality score value of the paragraph.
Referring to fig. 13, fig. 13 is a schematic structural diagram of a third outline determining apparatus according to an embodiment of the present invention, where the apparatus further includes a text generating module 1005.
The text generation module 1005 is specifically configured to sort the paragraphs of the selected text to be generated based on the outline of the text to be generated, and generate a text including the outline of the text to be generated and the sorted paragraphs.
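One way to picture the ordering performed by the text generation module 1005 is the following sketch, which groups the selected paragraphs under their outline headings and emits them in outline order. The function name, the representation of the selection as (heading, paragraph) pairs, and the newline-joined output are illustrative assumptions.

```python
def generate_text(outline, selected):
    """Arrange (heading, paragraph) pairs in outline order and join them into one text."""
    by_heading = {}
    for heading, paragraph in selected:
        by_heading.setdefault(heading, []).append(paragraph)
    lines = []
    for heading in outline:
        lines.append(heading)
        lines.extend(by_heading.get(heading, []))
    return "\n".join(lines)


# Hypothetical outline and selected paragraphs in arbitrary order.
outline = ["Introduction", "Body"]
selected = [("Body", "Body paragraph."), ("Introduction", "Opening paragraph.")]
text = generate_text(outline, selected)
```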
In one embodiment of the present invention, the description information includes at least one of a user portrait, keywords, entity words, key sentences, and a text type.
Corresponding to the outline determining method, the embodiment of the invention also provides electronic equipment.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which includes a processor 1401, a communication interface 1402, a memory 1403, and a communication bus 1404, where the processor 1401, the communication interface 1402, and the memory 1403 communicate with each other through the communication bus 1404.
A memory 1403 for storing a computer program;
the processor 1401 is configured to implement the outline determining method provided by the embodiment of the present invention when executing the program stored in the memory 1403.
The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In still another embodiment of the present invention, a computer-readable storage medium is provided, in which a computer program is stored; when executed by a processor, the computer program implements the outline determining method provided by the embodiments of the present invention.
In yet another embodiment of the present invention, a computer program product comprising instructions which, when run on a computer, cause the computer to perform the outline determining method provided by the embodiment of the present invention is also provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, electronic device, and computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.
Claims (13)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110880841.6A CN113688633B (en) | 2021-08-02 | 2021-08-02 | Method and device for determining outline |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110880841.6A CN113688633B (en) | 2021-08-02 | 2021-08-02 | Method and device for determining outline |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113688633A CN113688633A (en) | 2021-11-23 |
CN113688633B CN113688633B (en) | 2025-06-27 |
Family
ID=78578889
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110880841.6A Active CN113688633B (en) | 2021-08-02 | 2021-08-02 | Method and device for determining outline |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113688633B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114969549A (en) * | 2022-06-24 | 2022-08-30 | 北银金融科技有限责任公司 | A kind of automatic recommendation method and system for financial news morning paper |
CN115952279B (en) * | 2022-12-02 | 2023-09-12 | 杭州瑞成信息技术股份有限公司 | Text outline extraction method and device, electronic device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106970898A (en) * | 2017-03-31 | 2017-07-21 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating article |
CN112148857A (en) * | 2020-09-23 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | A system and method for automatically generating military official documents |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503255B (en) * | 2016-11-15 | 2020-05-12 | 科大讯飞股份有限公司 | Method and system for automatically generating article based on description text |
CN111813947A (en) * | 2019-04-09 | 2020-10-23 | 北京国双科技有限公司 | Method and device for automatic generation of outline of court inquiry |
Also Published As
Publication number | Publication date |
---|---|
CN113688633A (en) | 2021-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112131350B (en) | Text label determining method, device, terminal and readable storage medium | |
CN111753060B (en) | Information retrieval method, apparatus, device and computer readable storage medium | |
CN108829822B (en) | Media content recommendation method and device, storage medium and electronic device | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN107357889B (en) | A cross-social platform image recommendation algorithm based on content or emotional similarity | |
CN108073568B (en) | Keyword extraction method and device | |
US10755177B1 (en) | Voice user interface knowledge acquisition system | |
CN107357793B (en) | Information recommendation method and device | |
CN110674252A (en) | High-precision semantic search system for judicial domain | |
CN112464656A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN111985228A (en) | Text keyword extraction method and device, computer equipment and storage medium | |
CN113961666B (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN116431919B (en) | Intelligent news recommendation method and system based on user intention characteristics | |
CN114661861B (en) | Text matching method and device, storage medium and terminal | |
CN117056575B (en) | Method for data acquisition based on intelligent book recommendation system | |
CN113761125B (en) | Dynamic summary determination method and device, computing device and computer storage medium | |
CN119577459B (en) | Intelligent customer service training method and device for multi-mode large model and storage medium | |
CN111859955A (en) | A public opinion data analysis model based on deep learning | |
CN112188312A (en) | Method and apparatus for determining video material of news | |
CN113688633B (en) | Method and device for determining outline | |
CN113656575A (en) | Training data generation method and device, electronic equipment and readable medium | |
Lindén et al. | Evaluating combinations of classification algorithms and paragraph vectors for news article classification | |
CN119938824A (en) | Interaction method and related equipment | |
CN112163415A (en) | User intention identification method and device for feedback content and electronic equipment | |
CN117763133A (en) | Class recommendation method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||