CN111930929A

CN111930929A - Article title generation method and device and computing equipment

Info

Publication number: CN111930929A
Application number: CN202010658716.6A
Authority: CN
Inventors: 胡阿沛
Original assignee: CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Current assignee: CHEZHI HULIAN (BEIJING) SCIENCE & TECHNOLOGY CO LTD
Priority date: 2020-07-09
Filing date: 2020-07-09
Publication date: 2020-11-13
Anticipated expiration: 2040-07-09
Also published as: CN111930929B

Abstract

The invention discloses an article title generation method, which is executed in computing equipment and comprises the following steps: selecting paragraphs of the target article based on multiple selection modes to obtain multiple input texts, wherein each input text comprises at least one paragraph of the target article, and each selection mode corresponds to different input texts; inputting each input text in the plurality of input texts into a trained text generation model for processing to generate a plurality of candidate article titles, wherein all the candidate article titles corresponding to the plurality of input texts form a candidate title set; and screening out the final title of the target article from the candidate title set based on a preset strategy. The invention also discloses a corresponding device and a computing device.

Description

Article title generation method and device and computing equipment

Technical Field

The invention relates to the technical field of internet information, in particular to an article title generation method, an article title generation device and computing equipment.

Background

Text header generation is one of the core problems in the field of natural language processing. The corresponding attractive titles are automatically generated for an article according to the content of the article, so that not only can readers be attracted to read the article, but also the workload of the writer of the article can be reduced.

Currently, the implementation of text title generation can be divided into two types: one is a title generation method of a generation formula, and the other is to extract key sentences from an article. The title generation method of the generation formula is based on a deep learning technology, a model for automatically generating a title from an article is learned by using massive articles and title data thereof, and when an article without the title exists, the article is input into the model, so that the title can be automatically generated according to a model learning result. And the key sentence is extracted from the article, so that the key sentence not only can summarize the article content to a certain extent, but also can independently form a sentence without being influenced by the context.

The key sentence extraction method needs to summarize the article content due to the specificity of the title, and needs to be able to form sentences independently, which may cause a situation of semantic incoherence, so it cannot be ensured whether the sentences extracted from the article can be used as the title directly.

The method for automatically generating a title through deep learning can generate a title which has strong diversity and is close to the content of an article according to the content of the article, but the method has low stability, cannot predict the form of the generated title, and cannot ensure whether the generated title is correctly available. And when the number of articles in a certain field is small, the model effect and generalization capability of deep learning are poor.

Disclosure of Invention

To this end, the present invention provides an article title generation method, apparatus and computing device in an effort to solve or at least alleviate at least one of the problems identified above.

According to one aspect of the invention, there is provided an article title generation method, executed in a computing device, the method comprising the steps of: selecting paragraphs of the target article based on multiple selection modes to obtain multiple input texts, wherein each input text comprises at least one paragraph of the target article, and each selection mode corresponds to different input texts; inputting each input text in the plurality of input texts into a trained text generation model for processing to generate a plurality of candidate article titles, wherein all the candidate article titles corresponding to the plurality of input texts form a candidate title set; and screening the final title of the target article from the candidate title set based on a preset strategy.

Optionally, in the article title generating method according to the present invention, paragraph selection is performed on the target article based on multiple selection manners, including the steps of: extracting keywords of the article, scoring the paragraphs according to the frequency of the keywords appearing in each paragraph in the target article, and selecting the paragraphs from top to bottom according to the score to obtain an input text.

Optionally, in the article title generating method according to the present invention, paragraph selection is performed on the target article based on multiple selection manners, and the method further includes the steps of: and selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text.

Optionally, in the article title generating method according to the present invention, paragraph selection is performed on the target article based on multiple selection manners, and the method further includes the steps of: and selecting a first segment and a last segment of the target article, and then selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text.

Optionally, in the article title generating method according to the present invention, inputting the input text into a trained text generating model for processing includes: performing word segmentation processing on an input text to obtain a plurality of words; converting each vocabulary in the plurality of vocabularies into a word vector to obtain a word vector sequence; and inputting the word vector sequence into a text generation model for processing.

Optionally, in the article title generating method according to the present invention, performing word segmentation processing on the input text includes: the method comprises the steps of segmenting input texts based on a predetermined word bank, and segmenting texts which do not belong to the predetermined word bank into single words, wherein the predetermined word bank comprises a plurality of specific words.

Optionally, in the article title generating method according to the present invention, the step of screening out the final title of the target article from the candidate title set based on a predetermined policy includes: for each candidate title in the candidate title set, a perplexity of the candidate title is calculated, and a predetermined number of candidate titles are selected in order from high to low according to the perplexity.

Alternatively, in the article title generating method according to the present invention, calculating the perplexity of the candidate title includes the steps of: the candidate title is divided into a plurality of clauses, clause confusion is calculated according to a loss function of each position in each clause, and the confusion of the title of the article to be determined is obtained according to the clause confusion.

Alternatively, in the article title generating method according to the present invention, after selecting a predetermined number of candidate titles in order from high to low in the degree of confusion, the method comprises the steps of: and grading the screened titles of the articles to be determined by using the trained click rate estimation model, and acquiring the final title corresponding to the target article according to the grade.

Optionally, in the article title generating method according to the present invention, the step of screening out the final title of the target article from the candidate title set based on a predetermined policy includes: and for each candidate title in the candidate title set, scoring by using a trained click rate estimation model, and acquiring a final title corresponding to the target article according to the score.

Optionally, in the article title generating method according to the present invention, the computing device is connected to a data storage device, a plurality of article texts and corresponding article titles are stored in the data storage device, and the training of the text generation model includes the following steps: selecting paragraphs of each article text based on multiple selection modes to obtain multiple training texts, wherein each training text comprises at least one paragraph of a target article, and each selection mode corresponds to different training texts; and taking each training text as the input of the text generation model, and taking the article title corresponding to the training text as the target output of the text generation model so as to train the text generation model.

Optionally, in the article title generating method according to the present invention, paragraph selection is performed on each article text based on multiple selection modes to obtain multiple training texts, including the steps of: and calculating the similarity of each paragraph in the article text and the corresponding article title, and selecting the paragraphs according to the similarity from high to low to obtain a training text.

Optionally, in the article title generating method according to the present invention, a plurality of paragraphs are selected according to the paragraph order of the body of the article to obtain a training text.

Optionally, in the article title generating method according to the present invention, paragraph selection is performed on the target article based on multiple selection manners, including the steps of: and selecting the first segment and the last segment of the text of the article, and then selecting a plurality of segments according to the sequence of the segments of the target article to obtain a training text.

Optionally, in the article title generating method according to the present invention, taking each training text as an input of a text generation model, includes the steps of: performing word segmentation processing on an input text to obtain a plurality of words; converting each vocabulary in the plurality of vocabularies into a word vector to obtain a word vector sequence; and inputting the word vector sequence into the text generation model for processing.

Optionally, in the article title generating method according to the present invention, performing word segmentation on the training text includes: the method comprises the steps of segmenting a training text based on a preset word bank, and segmenting the text which does not belong to the preset word bank into single words, wherein the preset word bank comprises a plurality of specific words.

According to another aspect of the present invention, there is provided a method for training a text generation model, which is executed in a computing device, the computing device being connected to a data storage device, the data storage device storing a plurality of article texts and corresponding article titles, the method including the steps of: selecting paragraphs of each article text based on multiple selection modes to obtain multiple training texts, wherein each training text comprises at least one paragraph of a target article, and each selection mode corresponds to different training texts; and taking each training text as the input of the text generation model, and taking the article title corresponding to the training text as the target output of the text generation model so as to train the text generation model.

In the training method of the text generation model according to the present invention, paragraph selection is performed on each article text based on a plurality of selection modes to obtain a plurality of training texts, including the steps of: and calculating the similarity of each paragraph in the article text and the corresponding article title, and selecting the paragraphs according to the similarity from high to low to obtain a training text.

In the training method of the text generation model according to the present invention, paragraph selection is performed on each article text based on a plurality of selection modes to obtain a plurality of input texts, including the steps of: and selecting a plurality of paragraphs according to the paragraph sequence of the text of the article to obtain a training text.

In the training method of the text generation model, paragraph selection is performed on a target article based on a plurality of selection modes, and the method comprises the following steps: and selecting the first segment and the last segment of the text of the article, and then selecting a plurality of segments according to the sequence of the segments of the target article to obtain a training text.

In the training method of the text generation model according to the present invention, taking each training text as an input of the text generation model, comprising the steps of: performing word segmentation processing on an input text to obtain a plurality of words; converting each vocabulary in the plurality of vocabularies into a word vector to obtain a word vector sequence; and inputting the word vector sequence into a text generation model for processing.

In the training method of the text generation model according to the present invention, the word segmentation process is performed on the training text, and the method includes the steps of: the method comprises the steps of segmenting a training text based on a preset word bank, and segmenting the text which does not belong to the preset word bank into single words, wherein the preset word bank comprises a plurality of specific words.

In the training method of a text generation model according to the present invention, the text generation model is an end-to-end structure including an encoder and a decoder.

In the training method of the text generation model according to the present invention, the encoder is a transform model and the decoder is an LSTM model.

According to still another aspect of the present invention, there is provided an article title generating apparatus, including: the input text acquisition module is used for carrying out paragraph selection on the target article according to a plurality of selection modes to obtain a plurality of input texts, each input text comprises at least one paragraph of the target article, and each selection mode corresponds to different input texts; the title generation module is used for inputting each input text in the plurality of input texts into a trained text generation model for processing to generate a plurality of candidate article titles, wherein all the candidate article titles corresponding to the plurality of input texts form a candidate title set; and the title screening module is used for screening the final title of the target article from the candidate title set according to a preset strategy.

In the title generation device according to the present invention, the title generation device is connected to a data storage device, the data storage device stores a plurality of article texts and corresponding article titles, and the title generation device further includes: the text generation model training module is used for carrying out paragraph selection on the text of each article according to a plurality of selection modes to obtain a plurality of training texts, each training text comprises at least one paragraph of a target article, each selection mode corresponds to different training texts, each training text is used as the input of the text generation model, the article title corresponding to the training text is used as the output of the text generation model, and the training text generation model is used.

According to yet another aspect of the invention, there is provided a computing device comprising at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the article title generation method according to the present invention.

According to still another aspect of the present invention, there is provided a readable storage medium storing program instructions that, when read and executed by a client, cause the client to execute an article title generation method according to the present invention.

According to the technical scheme, paragraph selection is carried out on a target article based on multiple selection modes to obtain multiple input texts, data of article contents to be generated are enhanced, for each input text in the multiple input texts, the input text is input into a trained text generation model to be processed, multiple candidate article titles are generated, so that the titles have better diversity, and then the final title of the target article is screened out from a candidate title set based on a preset strategy. By enhancing the article content of the input title to be generated and improving the title generation method, more possible titles can be recalled, and the probability of obtaining available titles is increased.

Furthermore, in the training process of the text generation model, the similarity between the paragraphs and the titles is calculated to enhance the data, so that the training data volume is effectively increased, and the effect and generalization capability of the model are improved. In the screening process of the article titles, the persentence confusion degree is calculated through the loss function of each position in the persentence, the confusion degree of the titles of the pending articles is obtained according to the persentence confusion degree and serves as a sequencing basis, the condition that the titles are not smooth and smooth due to the fact that one place and two places are not smooth and smooth in the titles can be effectively found, the problem that the condition that the two places are not smooth and smooth due to the fact that the whole title confusion degree is directly calculated is solved, and the quality of the generated titles is improved.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a schematic diagram of a computing device 100, according to an embodiment of the invention;

FIG. 2 shows a flow diagram of an article title generation method 200 according to one embodiment of the invention;

fig. 3 shows a schematic diagram of an article title generation apparatus 300 according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. In some embodiments, computing device 100 is configured to perform the article title generation method, and program data 124 includes instructions for performing the method. According to an embodiment of the present invention, when the article title generation method is executed, the computing device 100 is further connected to a data storage device (not shown in the figure), in which a plurality of article texts and corresponding article titles are stored, and the data storage device is adapted to execute training of a text generation model.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, image input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164. In this embodiment, the target article may be obtained through a data input device such as a keyboard, or of course, the target article may also be obtained through the communication device 146.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer-readable medium, and the one or more programs include instructions for performing certain methods, such as the method for generating an article title according to the present invention, which is performed by the computing device 100 according to the embodiments of the present invention.

The computing device 100 has installed thereon a client application that supports network file transfer and storage, including native applications or browsers such as those including IE, Chrome, and Firefox, and stores locally various files such as photos, audio, video, documents (e.g., documents in the format of Word, PDF, etc.). The application client may run on an operating system such as Windows, MacOS, etc. Computing device 100 may be implemented as part of a small-form factor portable (or mobile) electronic device such as a cellular telephone, a digital camera, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that include any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations.

In the computing device 100 according to the present invention, the application 122 includes an article title generating apparatus 300, and the article title generating apparatus 300 resides in the computing device 100, so that the computing device 100 performs article title generation by executing the article title generating method 200.

FIG. 2 shows a flow diagram of an article title generation method 200 according to one embodiment of the invention. The method 200 is suitable for execution in a computing device, such as the computing device 100 described above. As shown in fig. 2, the article title generation method begins at step 210.

In step S210, paragraph selection is performed on the target article based on multiple selection modes to obtain multiple input texts. Each input text comprises at least one paragraph of the target article, wherein each selection mode corresponds to different input texts.

According to one embodiment of the invention, paragraph selection is performed on a target article based on a plurality of selection modes, which comprises the following steps: extracting keywords of the article, scoring the paragraphs according to the frequency of the keywords appearing in each paragraph in the target article, and selecting the paragraphs from top to bottom according to the score to obtain an input text. When paragraph selection is performed, one way is to select a predetermined number of paragraphs, and the other way is to select paragraphs to the maximum extent according to the input text within a range not exceeding a preset number of words. For example, if the target article includes 13 paragraphs, the predetermined number of paragraphs is 4, and the first four paragraphs with scores from high to low are paragraphs 1,6, 7 and 9, then paragraphs 1,6, 7 and 9 are combined in the original order of the article as an input text.

Further, a network is constructed through adjacent relations among words by adopting a TextRank algorithm in keyword extraction, then a weight value of each node is iteratively calculated by using PageRank, keywords can be obtained by sorting the weight values, and the PageRank approximates the possibility that a user randomly clicks a node link on the Internet to reach a specific webpage. The TextRank algorithm comprises the following specific steps:

a. segmenting a given text T into sentences, i.e. T ═ S₁,S₂,…,S_m]；

b. For each sentence S_iE.g. T, performing word segmentation and part-of-speech tagging, filtering out stop words, and only retaining words with specified part-of-speech, such as noun, verb and adjective, i.e. S_i＝[t_i,1,t_i,2,…,t_i,n]Wherein t is_i,jIs the reserved candidate keyword;

c. constructing a candidate keyword graph G (V, E), wherein V is a node set and consists of the candidate keywords generated in the previous step, then constructing an edge between any two points by adopting a co-occurrence relation, the edges exist between the two nodes only when the corresponding vocabularies co-occur in a window with the length of k, and k represents the size of the window, namely, at most k words co-occur;

d. iteratively propagating the weight of each node according to a calculation formula of the weight of each word until convergence;

e. carrying out reverse order arrangement on the node weights so as to obtain the most important T words as candidate keywords;

f. the most important T words are obtained from the previous step (the T value can be adjusted according to the actual situation), the marking is carried out in the original text, and if adjacent phrases are formed, the multiword keywords are combined.

Wherein, the calculation formula of each word score is as follows:

and (4) counting the frequency of the keywords appearing in each paragraph to score the paragraphs according to the keywords extracted by using the TextRank algorithm. The keyword extraction can also adopt a TF-IDF (Term Frequency/Inverse Document Frequency) mode, and the keyword extraction mode is not limited by the invention. Further, after keywords of the article are extracted based on the algorithm, preset keywords in the field to which the article belongs are added. The article belongs to the fields of medicine, literature, finance and the like, or is continuously subdivided into vertical fields according to business scenes in the existing fields, and corresponding keywords are set for each field.

According to an embodiment of the present invention, paragraph selection is performed on a target article based on multiple selection manners, which further includes the steps of: and selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text. For example, if the target article includes 13 paragraphs and the predetermined selection number of paragraphs is 4, the 1 st, 2 nd, 3 nd and 4 th paragraphs are combined as an input text according to the original sequence of the article. When paragraph selection is carried out, the paragraph can be selected to the maximum extent according to the range that the input text does not exceed the preset word number.

According to another embodiment of the present invention, paragraph selection is performed on a target article based on multiple selection manners, which further includes the steps of: and selecting a first segment and a last segment of the target article, and then selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text. For example, if the target article includes 13 paragraphs and the predetermined selection number of paragraphs is 4, the 1 st, 2 nd, 3 rd and 13 th paragraphs are combined as an input text according to the original sequence of the article. When paragraph selection is carried out, the paragraph can be selected to the maximum extent according to the range that the input text does not exceed the preset word number. Paragraph selection is carried out on the target article based on a plurality of selection modes, and training data are effectively increased. It should be noted that, in the above multiple selection manners, one of the paragraphs or the preset words is selected, so as to achieve that the lengths of the multiple input texts generated according to the same target article are close, and more context information of the original data can be utilized while the input and the output are aligned

Next, in step S220, for each of the plurality of input texts, the input text is input to the trained text generation model and processed, thereby generating a plurality of candidate sentence headings. All candidate article titles corresponding to the input texts form a candidate title set. The text generation model employed by the present invention is a sequence-to-sequence (Seq2Seq) model, which is typically an end-to-end structure that includes an encoder that converts source text into a vector and a decoder that converts the vector into target text.

In the embodiment of the invention, the encoder is a transform model, a Multi-head attribute mechanism is used, and a text sequence of the article content is modeled through position coding, so that longer content can be semantically coded and then trained in parallel, and the generation strength of title details and the correlation with the article are improved. The decoder is a short-term memory network (LSTM) model to solve the long-term dependence problem of sequences. By setting a special placeholder in the first position of the encoder, the output vector of the placeholder is used as the initial input state of the decoder, so that the two are combined. Meanwhile, a copy mechanism and a coverage mechanism of the pointer are adopted, so that the content error rate of the generated title is reduced, and the problem caused by the unregistered word is solved.

Further, after the encoder generates word distribution vectors of each position according to the text, various search algorithms, for example, a beam search (BeamSearch) algorithm, may be used to search the word distribution vectors of each position, so as to generate article titles corresponding to the text. Assuming that the parameter of the BeamSearch is k, it indicates that k optimal versions are maintained simultaneously when calculation is performed at each time step, wherein k decoded versions with the maximum probability are calculated at the t-th step according to the results of k t-1 steps and the word distribution vector at the t step at the same time until the end tag is decoded.

In order to avoid that the decoding effect is not good after the character is determined to be wrong or the form is single during the first search, according to one embodiment of the invention, the characters to be generated are sorted according to the generation probability for the first generation position of the article title, a plurality of characters with the top sorting are selected, and the search is started from the second position according to the cluster search by taking each character as the start. In the bundle searching process, a function punishing repeated contents is added, and the score of repeated generation is reduced. According to the requirement of actually generating the title, a function of pertinence is designed, and the title which meets the requirement is preferentially selected.

In one embodiment according to the present invention, the computing device 200 is connected to a data storage device, the data storage device stores a plurality of article texts and corresponding article titles, and the training of the text generation model comprises the following steps: selecting paragraphs of each article text based on multiple selection modes to obtain multiple training texts, wherein each training text comprises at least one paragraph of a target article, and each selection mode corresponds to different training texts; and taking each training text as the input of the text generation model, and taking the article title corresponding to the training text as the target output of the text generation model so as to train the text generation model.

Specifically, for each training text, inputting the training text into a pre-trained text generation model, outputting a title by the text generation model, calculating model loss according to the output title and a text title corresponding to the training text, and adjusting network parameters of the text generation model according to the model loss. And when the model converges or the training times reach a preset threshold, stopping training to obtain a trained text generation model.

Further, the paragraph selection way for the text of the article includes any one of the following three ways:

1. calculating the similarity of each paragraph in the article text and the title of the corresponding article, and selecting the paragraphs according to the similarity from high to low to obtain a training text;

2. selecting a plurality of paragraphs according to the paragraph sequence of the text of the article to obtain a training text;

3. and selecting the first segment and the last segment of the text of the article, and then selecting a plurality of segments according to the sequence of the segments of the target article to obtain a training text.

The method for calculating the similarity between each paragraph in the article text and the title of the corresponding article can use the Jaccard formula:

wherein, A is a paragraph of which the similarity is to be calculated, and B is a title of the article. The Jaccard formula is used for comparing similarity and difference between limited sample sets, the larger the Jaccard coefficient value is, the higher the sample similarity is, and other similarity calculation modes are not limited in the invention.

According to one embodiment of the invention, the step of segmenting the input text or the training text is included before the input text is input into the trained text generation model for processing and before each training text is input as the text generation model.

Specifically, the words are segmented based on a predetermined word bank, and texts which do not belong to the predetermined word bank are segmented into single words, wherein the predetermined word bank comprises a plurality of specific words. For example, English, acronym, and domain-specific word are divided into words, integers are divided into whole numbers with decimal points, for example, 11.30 is divided into 11.30, and all the others are processed into single characters. It should be noted that the specific vocabulary here may be preset keywords in the domain to which the article belongs in step S210, and the domain to which the article belongs is not limited to medical treatment, literature, finance, etc., or the specific vocabulary is further subdivided into vertical domains according to the service scene in the existing domain, and a corresponding keyword is set for each domain as a predetermined word bank in word segmentation. By controlling the size of the model dictionary, the efficiency and sufficiency of the training model are increased. Independent segmentation of core field words and segmentation method of special words such as numbers effectively solve the problem of unknown words while ensuring the independent semantics of main information of the field, and improve the semantic coding capability and generalization capability of the model to key information.

After the article content of the input headline to be generated is enhanced and the headline generation method is improved, and a plurality of candidate headlines are obtained according to a target article, in step S230, the final headline of the target article is screened from the candidate headline set based on a predetermined strategy. The candidate title set should be primarily screened, for example, for a generated title, the title probability is calculated according to the n-gram, the n-gram is deleted to obtain a title with a too low probability, and the n-gram considers that the current word is only related to the previous n words; and identifying key entities for generating titles, mainly comprising numerical values and factual entities, checking whether the titles are consistent with the articles or not, deleting the titles which do not exist in the articles, and reducing the complexity of subsequent steps through primary screening.

In the process of evaluating a plurality of output results of text-based deep learning, the confusion is generally used for measuring the degree of how good a probability distribution or a probability model predicts a sample, and the smaller the confusion, the larger the sentence probability. According to one embodiment of the present invention, for each candidate title in a candidate title set, a degree of confusion of the candidate title is calculated, and a predetermined number of candidate titles are selected in order from high to low degree of confusion.

Further, for each candidate title in the candidate title set, the candidate title is divided into a plurality of clauses, the clause confusion degree is calculated according to the loss function of each position in the clauses, and the confusion degree of the title of the article to be determined is obtained according to the clause confusion degree.

The specific steps for acquiring the perplexity of the title of the pending article according to the perplexity are as follows:

1. the first step is to generate a title based on punctuation marks and spaces, such as ","! ",". ","? ", the title is divided into several clauses.

2. Calculating the confusion degree of the clauses respectively, calculating a language model (based on a text generation model) which is trained in advance, outputting the loss of each position according to a loss function, filtering out the positions with the loss larger than a certain threshold value, calculating according to a confusion degree formula based on all the filtered positions, and obtaining the confusion degree of the clauses, wherein the confusion degree (perplexity) calculation formula comprises the following steps:

perplexity＝exp(loss/num_of_tokens)；

3. the confusion of all the clauses is calculated for each of the plurality of titles generated in the above manner, and the titles are sorted according to the highest confusion among all the clauses of the title and the average confusion among all the clauses.

For example, for the candidate heading "another black horse" in the automotive business area, the sales volume does not decrease but increases by 3.58%! Top-matching 8 universal heads ", and executing the following steps:

1. splitting into a plurality of clauses: "another black horse", "no reduction and no increase of 3.58% in sales amount", and "8 ten thousand head in top;

2. calculate the perplexity of each subtitle:

"[ '< start >', 'again', 'one', 'match', 'black', 'horse' ]

Loss at each position [5.4245,0.3301,5.7638,0.4389,0.1253],

[ 'pin', 'amount', 'no', 'down', 'reverse', 'up', 'num', '%' ]

Loss at each position [16.7474,19.3273,11.7242,7.1811,6.7205,5.7435,2.5558]

"(< start > ', ' Top ', ' fitting ', ' num ', ' ten thousand ', ' Out ', ' head ', ' eOS > ' ]

Loss at each position [7.4824,0.6683,2.1158,0.1098,4.9877,0.0422,8.7800]

3. For all clauses, filtering all positions with loss larger than 5.0, and calculating to obtain the perplexity of each clause: [268.7655,76152.0526,3397.6265]

The highest and average perplexity among all clauses are obtained (76152.0526,26606.1482) as perplexity for the title of the pending article. In the process of selecting a predetermined number of candidate titles according to the sequence from high to low in the confusion degree, the selection may be performed according to the confusion degree of the clauses, or the selection may be performed according to the average confusion degree, or the titles satisfying that the highest confusion degree of the clauses does not exceed a preset range are sorted according to the average confusion degree, which is not limited by the present invention. The problem that the condition that one or two places are not smooth can be covered when the whole title confusion degree is directly calculated is avoided, and the quality of the generated title is improved.

Further, after selecting a predetermined number of candidate titles in order from high to low in accordance with the degree of confusion, the method includes the steps of: and grading the screened titles of the articles to be determined by using the trained click rate estimation model, and acquiring the final title corresponding to the target article according to the grade. The click rate pre-estimation model is used for predicting the click probability of a user on a certain article, two aspects of data are needed, namely the data of the article on one hand and the data of the user on the other hand, common click rate pre-estimation models have logistic regression, feature selection is carried out by utilizing a nonlinear model GBDT, user features and advertisement features are respectively regularized by using Group Lasso in a loss function, and the like, and details are not repeated here.

According to another embodiment of the invention, the final headline of the target article is screened from the candidate headline set based on a predetermined strategy, each candidate headline in the candidate headline set can be directly scored by using a trained click rate estimation model, and the final headline corresponding to the target article is obtained according to the score, so that the efficiency of article headline generation is improved. The principle of the click-through rate estimation model is as described above, and it should be noted that, in the training process of the text generation model, the method further includes a step of adjusting the text generation model generated by the article title according to the click-through rate estimation model. Specifically, on the basis that a plurality of versions of titles are generated by the model, the titles of all the versions are put on line, then the user behaviors (the clicking condition of the titles by the user) of the online versions are obtained, and the user behaviors are added into the loss function of the model to continue training, so that the most needed result of the user can be continuously generated through evolution.

Fig. 3 is a schematic diagram of an article title generation apparatus 300 according to an embodiment of the present invention, where the article title generation apparatus 300 includes an input text acquisition module 310, a title generation module 320, and a title filtering module 330.

The input text acquiring module 310 is configured to perform paragraph selection on the target article according to multiple selection manners to obtain multiple input texts, where each input text includes at least one paragraph of the target article, and each selection manner corresponds to a different input text.

The headline generation module 320 is configured to input each of the plurality of input texts into a trained text generation model for processing, and generate a plurality of candidate article headlines, where all candidate article headlines corresponding to the plurality of input texts form a candidate headline set.

The headline screening module 330 is configured to screen the final headline of the target article from the candidate headline set according to a predetermined policy.

Further, the article title generating apparatus further includes a text generation model training module (not shown in the figure) configured to perform paragraph selection on each article text according to multiple selection manners to obtain multiple training texts, where each training text includes at least one paragraph of the target article, each selection manner corresponds to a different training text, each training text is used as an input of the text generation model, and an article title corresponding to the training text is used as an output of the text generation model, so as to train the text generation model.

A7, the method as in any one of A1-A6, the method for screening a final title of a target article from the set of candidate titles based on a predetermined policy, comprising the steps of:

for each candidate title in the candidate title set, a perplexity of the candidate title is calculated, and a predetermined number of candidate titles are selected in order from high to low according to the perplexity.

A8, the method as in a7, said calculating the perplexity of the candidate title, comprising the steps of:

the candidate title is divided into a plurality of clauses, clause confusion is calculated according to a loss function of each position in each clause, and the confusion of the title of the article to be determined is obtained according to the clause confusion.

A9, the method as defined in a7 or A8, comprising the steps of, after selecting a predetermined number of candidate titles in order from high to low degree of confusion:

and grading the screened titles of the articles to be determined by using the trained click rate estimation model, and acquiring the final title corresponding to the target article according to the grade.

A10, the method as in any one of A1-A6, the method for screening a final title of a target article from the set of candidate titles based on a predetermined policy, comprising the steps of:

and for each candidate title in the candidate title set, scoring by using a trained click rate estimation model, and acquiring a final title corresponding to the target article according to the score.

A11, the method according to any one of a1-a10, the computing device being connected to a data storage device having a plurality of article texts and corresponding article titles stored therein, the training of the text generation model comprising the steps of:

selecting paragraphs of each article text based on multiple selection modes to obtain multiple training texts, wherein each training text comprises at least one paragraph of a target article, and each selection mode corresponds to different training texts;

and taking each training text as the input of the text generation model, and taking the article title corresponding to the training text as the target output of the text generation model so as to train the text generation model.

A12, the method as in a11, wherein the paragraph selection is performed on each article text based on multiple selection modes to obtain multiple training texts, the method comprises the steps of:

and calculating the similarity of each paragraph in the article text and the title of the corresponding article, and selecting the paragraphs according to the similarity from high to low to obtain a training text.

A13, the method as in a11 or a12, wherein the paragraph selection is performed on each article text based on a plurality of selection modes to obtain a plurality of input texts, comprising the steps of:

and selecting a plurality of paragraphs according to the paragraph sequence of the text of the article to obtain a training text.

A14, the method as in any one of A11-A13, the paragraph selection for the target article based on multiple selection modes comprises the following steps:

and selecting the first segment and the last segment of the text of the article, and then selecting a plurality of segments according to the sequence of the segments of the target article to obtain a training text.

A15, the method as in any one of A11-A14, the taking each training text as an input of a text generation model, comprising the steps of:

performing word segmentation processing on the input text to obtain a plurality of words;

converting each vocabulary in the plurality of vocabularies into a word vector to obtain a word vector sequence;

and inputting the word vector sequence into the text generation model for processing.

A16, the method of A15, wherein the word segmentation processing is performed on the training text, comprising the steps of:

the method comprises the steps of segmenting a training text based on a preset word bank, and segmenting the text which does not belong to the preset word bank into single words, wherein the preset word bank comprises a plurality of specific words.

A18, the method as in a17, wherein the paragraph selection is performed on each article text based on multiple selection modes to obtain multiple training texts, the method comprises the steps of:

A19, the method as in a17 or a18, wherein the paragraph selection is performed on each article text based on a plurality of selection modes to obtain a plurality of input texts, comprising the steps of:

A20, the method as in any one of A17-A19, the paragraph selection for the target article based on multiple selection modes comprises the following steps:

A21, the method as in any one of A11-A14, the taking each training text as an input of a text generation model, comprising the steps of:

A22, the method of A21, the segmenting the training text, comprising the steps of:

the method comprises the steps of carrying out word segmentation processing on a training text based on a preset word bank, and segmenting the text which does not belong to the preset word bank into single words, wherein the preset word bank comprises a plurality of specific words.

A23, the method of any one of A17-A22, the text generation model being an end-to-end structure comprising an encoder and a decoder.

A24, the method as in A23, the encoder being a transformer model and the decoder being an LSTM model.

A26, the apparatus of a25, the apparatus being connected to a data storage device, the data storage device having a plurality of article texts and corresponding article titles stored therein, the apparatus further comprising:

the text generation model training module is used for carrying out paragraph selection on the text of each article according to multiple selection modes to obtain multiple training texts, each training text comprises at least one paragraph of a target article, each selection mode corresponds to different training texts, each training text is used as the input of a text generation model, the article title corresponding to the training text is used as the output of the text generation model, and the text generation model is generated.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims

1. An article title generation method, executed in a computing device, the method comprising the steps of:

selecting paragraphs of the target article based on multiple selection modes to obtain multiple input texts, wherein each input text comprises at least one paragraph of the target article, and each selection mode corresponds to different input texts;

inputting each input text in the plurality of input texts into a trained text generation model for processing to generate a plurality of candidate article titles, wherein all candidate article titles corresponding to the plurality of input texts form a candidate title set;

and screening out the final title of the target article from the candidate title set based on a preset strategy.

2. The method of claim 1, wherein paragraph selection for the target article based on a plurality of selection modes comprises:

extracting keywords of the article, scoring the paragraphs according to the frequency of the keywords appearing in each paragraph in the target article, and selecting the paragraphs from top to bottom according to the score to obtain an input text.

3. The method of claim 1 or 2, wherein paragraph selection is performed on the target article based on a plurality of selection modes, further comprising the steps of:

and selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text.

4. The method of any one of claims 1-3, wherein paragraph selection for the target article is based on a plurality of selection modes, further comprising:

and selecting a first segment and a last segment of the target article, and then selecting a plurality of paragraphs according to the paragraph sequence of the target article to obtain an input text.

5. The method of any one of claims 1-4, wherein inputting the input text into a trained text generation model for processing, comprises:

6. The method of claim 5, wherein the word segmentation process is performed on the input text, comprising the steps of:

the method comprises the steps of segmenting input texts based on a preset word bank, and segmenting texts which do not belong to the preset word bank into single words, wherein the preset word bank comprises a plurality of specific words.

7. A training method of a text generation model is executed in a computing device, the computing device is connected with a data storage device, a plurality of article texts and corresponding article titles are stored in the data storage device, and the training method comprises the following steps:

8. An article title generation apparatus, the apparatus comprising:

the system comprises an input text acquisition module, a text selection module and a text selection module, wherein the input text acquisition module is used for carrying out paragraph selection on a target article according to a plurality of selection modes to obtain a plurality of input texts, each input text comprises at least one paragraph of the target article, and each selection mode corresponds to different input texts;

the title generation module is used for inputting each input text in the plurality of input texts into a trained text generation model for processing to generate a plurality of candidate article titles, wherein all the candidate article titles corresponding to the plurality of input texts form a candidate title set;

and the title screening module is used for screening out the final title of the target article from the candidate title set according to a preset strategy.

9. A computing device, comprising:

at least one processor; and

a memory storing program instructions configured for execution by the at least one processor, the program instructions comprising instructions for performing the article title generation method of any of claims 1-6.

10. A readable storage medium storing program instructions which, when read and executed by a client, cause the client to perform the method of any one of claims 1-6.