Background
The task belongs to text generation: given an article and a specified answer, generate a corresponding question such that the question can be answered by that answer in the original text. The method can be used in inquiry systems, tutoring systems, story-based questioning, factual question-and-answer data generation, and the like. It can also serve as a data augmentation means to expand datasets for question-answering tasks. Question-answering data can be used to train the question generation model; in actual use, named entity recognition is performed on the article, entities are extracted, and these entities serve as the answers about which questions are asked.
Conventionally, key entities are extracted via rules based on a syntax tree or a knowledge base, and the extracted entities are filled into predefined templates to produce questions of a fixed format. The currently common approach is an Encoder-Decoder framework for text generation: the article is encoded by an Encoder, the answer is encoded by another network, and a Decoder generates the question by decoding the encodings of the article and the answer.
The question generation task can also be learned jointly with other tasks. Combined with question answering, regularization terms can be added to the loss functions of the respective models for dual learning, or adversarial training can be performed with the question generation model as a generator and the question-answering model as a discriminator. Combined with abstractive summarization, the high-level parameters of the Encoder and the low-level parameters of the Decoder can be shared, i.e., the parameters of the middle network layers are shared while the layers close to the input and output keep task-specific parameters for multi-task learning.
The question generation task can be evaluated with the BLEU and ROUGE metrics from text generation, which measure the similarity between the generated question and the reference question. In addition, some samples can be manually evaluated for the fluency, semantic reasonableness, answer matching degree, and diversity of the generated questions.
In existing research, the fluency and semantic reasonableness of generated questions already reach a relatively satisfactory level, but the answer matching degree still has large room for improvement. The main current method is to encode the answer and add it as a constraint to the Decoder's output when predicting the distribution of generated words. Adding this answer constraint on top of the Encoder-Decoder does improve the answer matching degree considerably, but the constraint is not strong enough, so the problem of unmatched answers cannot be solved completely and further reinforcement of the constraint is needed.
In generative adversarial research, if a binary classifier is used as the discriminator, the discriminator is simple and easy to train, its accuracy usually exceeds the generator's, and the generator and discriminator are hard to keep in balance; if a question-answering model is used as the discriminator, the discriminator is more complex and the model is hard to tune.
Disclosure of Invention
The present invention provides a question generation method based on progressive multi-discriminators to overcome at least one of the above drawbacks of the prior art, mainly solving the mismatch between question and answer in question generation.
The technical scheme of the invention is as follows: the generator uses a pointer-generator model from abstractive summarization, applying a copy mechanism to extract original-text details and alleviate the OOV problem, and a coverage mechanism, here improved, to suppress repeated generation; the answer constraint is embodied mainly in using an answer vector to predict the word distribution in the decoder, and an answer constraint is additionally added in the encoder: after the article is encoded, it is adjusted with the answer's constraint so that the parts related to the answer are attended to;
the method provides three sequentially progressive discriminators, namely a true/false discriminator, an attribute discriminator, and a question-answer discriminator. First, the true/false discriminator judges the authenticity of a generated question; once its result reaches the standard, the attribute discriminator judges whether the type of the generated question matches the answer; once that result also reaches the standard, the question-answer discriminator judges whether the generated question can be answered by the answer. The discriminators progress from easy to hard: the next discrimination is performed only if the result of the previous discriminator reaches a specified threshold, otherwise the previous discriminator continues to be trained. Training the discriminators in this progressive order makes the generated questions gradually better: first they look realistic, then the question type matches the answer, and finally question and answer match completely.
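As a non-limiting illustration, the progressive gating described above may be sketched as follows; the discriminator names and threshold values are assumptions chosen for the example, not fixed by the method:

```python
# Assumed stage names and pass thresholds, ordered from easiest to hardest.
THRESHOLDS = {"true_false": 0.8, "attribute": 0.8, "question_answer": 0.8}
ORDER = ["true_false", "attribute", "question_answer"]

def next_discriminator_to_train(scores, order=ORDER, thresholds=THRESHOLDS):
    """Return the first discriminator (easiest first) whose score has not yet
    reached its threshold; training only advances to the next stage once the
    current stage passes. Returns None when every stage has reached standard."""
    for name in order:
        if scores.get(name, 0.0) < thresholds[name]:
            return name
    return None
```

For example, if the true/false discriminator already passes but the attribute discriminator does not, the schedule keeps training the attribute discriminator before the question-answer discriminator is engaged.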
The invention designs three discriminators: the true/false discriminator judges whether the question is fluent and reasonable, the attribute discriminator further judges whether the question belongs to the category corresponding to the answer, and the question-answer discriminator further judges whether the question can be answered by the corresponding answer.
Compared with the prior art, the beneficial effects are: for question-answer mismatch in the text generation task, answer attribute information is added into the encoder and decoder, and progressive multi-discriminators are designed whose answer constraints strengthen in order from easy to hard: first the semantic quality of the generated question is ensured, then the question type is constrained, and finally the question's direct answer is constrained.
1) The progressive discriminators have a progressive relationship both in the order in which they are used and in their functions.
2) Data enhancement is performed with the discriminator: the question-answering model can act as a discriminator supervising the generator and can also provide enhanced data for the generator, serving a dual function.
3) Answer type constraints are enforced in the generator.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
As shown in fig. 1, the generator uses a pointer-generator model: the Decoder attends to different parts of the original text with an attention mechanism, the copy mechanism copies original-text details and generates OOV words, and the coverage mechanism penalizes repeated generation; the coverage mechanism and its penalty for repeated generation are further improved.
the model structure comprises an Encoder model and a Decoder model.
The Encoder model is as follows:
First, the article word vectors are encoded by a Bi-LSTM network. Named entity recognition is then performed on the answer to obtain its entity type, which is embedded into a low-dimensional answer-type vector. For the answer's positions in the article, the word vector, the answer entity vector, and the output of the original LSTM are concatenated and encoded by another Bi-LSTM network, and finally averaged with equal weights to obtain the answer vector. Adding the answer entity vector strengthens the constraint on the answer type.
Next, attention over the article encodings is computed with the answer vector and normalized with softmax, and the resulting attention is used to update the article's encoding vectors, so that the original-text encoding related to the answer is emphasized.
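As an illustrative sketch, the answer-guided attention over the article encodings may look as follows; the exact update rule is not specified above, so re-scaling each position by its attention weight is an assumed choice for the example:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_aware_update(article_enc, answer_vec):
    """article_enc: (T, d) article encodings; answer_vec: (d,) answer vector.
    Compute the answer's attention over the article, then emphasize each
    article position according to its relevance to the answer."""
    scores = article_enc @ answer_vec            # (T,) dot-product relevance
    attn = softmax(scores)                       # normalized attention weights
    # Assumed update: scale each position by (1 + its attention weight),
    # so answer-relevant positions are emphasized.
    return article_enc * (1.0 + attn)[:, None], attn
```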
decoder model:
the Decoder also uses a Bi-LSTM model, during training, an attention mechanism is used for generating a context vector for the Decoder under each step, and then the real words of the last step and the context vector are input into the Decoder together;
the Copy mechanism has a word list probability distribution predicted by using a certain probability, a part of the probability is reserved for directly copying a certain word from the original text, the probability of the word is obtained by directly using the attention probability of the encoder as Copy, and the final prediction is as follows:
the Coverage mechanism accumulates previous attention information under each step, and punishs the words which are repeatedly concerned:
the loss function of the final model is:
During testing there is no ground-truth word as supervision, so the generated word is obtained directly from the probability vector of the previous step, and that word's vector is used as the input to the Decoder LSTM.
The discriminators:
In generative adversarial networks, the common practice is to have the generator predict the word probability distribution, take the highest-probability word at each step as the generated word to obtain a generated text sequence, and then use the generated text as the discriminator's input, training the generator from the discriminator's result. Because selecting discrete words is non-differentiable, the gradient cannot be passed back directly and must be estimated with methods such as reinforcement learning, which is hard to train and suffers from an excessively large action space.
Therefore, this method passes a continuous gradient back to the generator, avoiding the training problem caused by the non-differentiable discrete step. At each step of the generator's decoder, the vocabulary probability distribution is used to weight and sum all word vectors in the vocabulary, and the resulting weighted word vector, instead of the predicted word's vector, is used as the discriminator's input.
This, on one hand, solves the gradient problem; on the other hand, supervising the generator through the weighted word vector drives it to produce a good word probability distribution, which yields a better weighted word vector as the discriminator's input and in turn a better adversarial discriminator; moreover, the weighted word vector carries richer semantic information than a one-hot vector.
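The weighted word vector described above is simply the expected embedding under the predicted distribution; a minimal sketch (array shapes assumed for the example):

```python
import numpy as np

def weighted_word_vector(vocab_probs, embedding):
    """vocab_probs: (V,) softmax output of the decoder at one step;
    embedding: (V, d) word-vector table.
    The expected embedding under the predicted distribution is
    differentiable with respect to the probabilities, unlike
    looking up the embedding of the argmax word."""
    return vocab_probs @ embedding
```

When the distribution collapses to a one-hot vector, this reduces exactly to the ordinary word-vector lookup.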
Three discriminators are designed, with a hierarchically progressive function. In order, they are:
true and false discriminator:
As shown in fig. 2, the question vectors are classified into two classes with a simplified FastText classification model: the question vectors at each step are averaged with equal weights and linearly combined, and a sigmoid function predicts the probability of the positive class, p = σ(w · q̄ + b), where q̄ is the equal-weight average of the step-wise question vectors.
Its loss function is defined as the negative log-likelihood: L_tf = −[y log p + (1 − y) log(1 − p)], where y = 1 for a real question and y = 0 for a generated one.
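A minimal sketch of this simplified true/false discriminator; the parameter shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def true_false_discriminator(question_vecs, w, b):
    """question_vecs: (T, d) per-step (weighted) word vectors of the question.
    Equal-weight average -> linear layer -> sigmoid = P(question is real)."""
    q_bar = question_vecs.mean(axis=0)
    return sigmoid(w @ q_bar + b)

def nll_loss(p, y):
    """Negative log-likelihood for a binary label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```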
Attribute discriminator:
the entity types of the answers are obtained in the front, the questions are classified in multiple ways, and the types corresponding to the questions are the entity types of the answers; then, a FastText classification model is also used for classifying the problems, meanwhile, hierarchical classification skills are used, Huffman numbers are used for coding the categories, and the hierarchical structure of the tree is used for replacing the flat standard softmax, so that the training can be accelerated;
the loss function is defined as the multi-class cross entropy: L_attr = −Σ_c y_c log p_c, where y_c is the one-hot label of the answer's entity type.
question-answer discriminator:
The question-answering model adopts the R-Net model, shown in FIG. 3: first, LSTMs model the article and the question respectively; at each article step, attention probabilities over the question are computed to obtain the article's question-aware interaction vectors, with a gate mechanism added to filter out unimportant information; LSTM network learning is then performed, followed by self-attention over the article; finally, two networks respectively predict the start and end positions of the answer in the article;
the loss function is defined as the cross entropy of the answer's start and end positions in the text: L_qa = −log p_start(s*) − log p_end(e*), where s* and e* are the true start and end positions.
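A sketch of the gated question-aware attention for one article step, in the style of R-Net; the gate parameter Wg and the shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_question_attention(p_t, question_enc, Wg):
    """p_t: (d,) article encoding at one step; question_enc: (Tq, d) question
    encodings; Wg: (2d, 2d) gate weight. Attend over the question, concatenate
    [p_t; context], then apply a sigmoid gate that filters out unimportant
    dimensions before the next LSTM layer."""
    attn = softmax(question_enc @ p_t)        # (Tq,) relevance to this step
    context = attn @ question_enc             # (d,) question summary
    merged = np.concatenate([p_t, context])   # (2d,) interaction vector
    gate = sigmoid(Wg @ merged)               # (2d,) elementwise gate
    return gate * merged
```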
Finally, the overall loss function of the discriminator is defined as L_D = α·L_tf + β·L_attr + γ·L_qa,
where α, β, and γ are the weights of the three discriminators' loss functions. The weights are scheduled from small to large: when the result of a preceding discriminator has reached the standard, its training weight is reduced and the training weight of the following discriminator is increased, so that more attention is paid to the later discriminator and its effect is improved while the earlier discriminator's effect is maintained.
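One possible weight schedule implementing this from-small-to-large rule; the low/high values are assumptions for illustration:

```python
def schedule_weights(passed_tf, passed_attr, low=0.1, high=1.0):
    """Weights (alpha, beta, gamma) for the true/false, attribute, and
    question-answer losses: the first stage whose standard has not yet been
    reached gets the high weight; stages already passed, and stages not yet
    unlocked, get the low weight."""
    alpha = low if passed_tf else high
    beta = high if (passed_tf and not passed_attr) else low
    gamma = high if (passed_tf and passed_attr) else low
    return alpha, beta, gamma
```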
FIG. 4 is a diagram of a discriminator model.
The training method comprises the following steps: the generator and the discriminators are pre-trained separately and then trained jointly.
Pre-training the generator: the pointer-generator model is pre-trained directly; the input is an article and an answer, the output is the generated question, and the loss function is the cross entropy between the generated question and the real question.
Pre-training the discriminators: only the attribute discriminator and the question-answer discriminator are pre-trained. The attribute discriminator uses the questions and answer attributes, with the cross entropy between the predicted and real attributes as its loss; the question-answer discriminator uses the articles, questions, and answers in the question-answering data, with the cross entropy between the predicted and real answer positions as its loss.
Joint training: generators for n batches and discriminators for m batches are trained alternately; if the discriminators' accuracy is too high, the number of discriminator training steps is reduced or the number of generator training steps is increased.
When training the generator, the discriminator parameters are first fixed; the article and answer are input, the question generation model predicts the vocabulary probability distribution, and a loss is computed from the probabilities of the real words; the probabilities are then multiplied by the vocabulary word vectors to obtain weighted word vectors, which are input to the discriminator model with the discriminator flag set to 1 to compute the discriminator loss. The two losses are added as the generator's loss, and the generator parameters are updated.
When training the discriminators, the generator parameters are likewise fixed and the discriminator flag is set to 0. For the true/false discriminator, the input is a real or generated question, with output class 1 for real and 0 for generated; for the attribute discriminator, the input is a real question and the output category is the corresponding answer category; for the question-answer discriminator, the input is an article and a real question, and the output is the answer's start and end positions in the article.
Data enhancement:
In addition, the question-answering model in the discriminator is used for data enhancement of the question generation task, as follows:
The question generation model generates questions: a question generation model is trained in advance; the input is an article and an answer, and the output is the generated question. The BLEU and ROUGE scores between the generated question and the real question are computed and averaged to obtain a matching metric, and a threshold is set; if the metric is below the threshold, the generated question has low similarity to the real question and may not match the answer. These articles and unmatched questions are composed into new data, and the question-answering model is used to predict the answers again.
The question-answering model predicts the answer: given an article and a generated question, it outputs the probabilities of the answer's positions in the original text; the product of the start-position and end-position probabilities is taken as the prediction probability, and a threshold is set. If the prediction probability is above the threshold, the question can, with high probability, be answered from the article, and the article, question, and new answer form a new datum used as enhanced data for the question generation model.
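The two threshold tests above can be sketched as one filtering function; the field names and threshold values are assumptions for the example, and the matching metric and QA probabilities are assumed precomputed:

```python
def select_enhanced_data(candidates, match_threshold=0.3, qa_threshold=0.5):
    """candidates: list of dicts with keys 'article', 'question', 'answer'
    (re-predicted by the QA model), 'match' (averaged BLEU/ROUGE against the
    real question), and 'p_start'/'p_end' (QA model position probabilities).
    Keep a candidate as enhanced data only if its generated question did NOT
    match the real question (match below threshold) but the QA model is
    confident the question can be answered from the article."""
    enhanced = []
    for c in candidates:
        if c["match"] < match_threshold and c["p_start"] * c["p_end"] > qa_threshold:
            enhanced.append((c["article"], c["question"], c["answer"]))
    return enhanced
```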
Retraining the question generation model: the model is trained on the original data and the enhanced data together. Since some enhanced data may be of poor quality, different weights are set for the two, with the weight of the original data's loss slightly larger than that of the enhanced data; the final loss function is defined as the weighted sum of the losses over the two parts of the data: L = α Σ_{(x,y)∈S1} L(x,y) + β Σ_{(x,y)∈S2} L(x,y),
where S1 denotes the original dataset, S2 the enhanced data, α the weight of the original dataset, and β the weight of the enhanced data.
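A minimal sketch of this weighted loss; the per-example losses are assumed precomputed, and the default α > β reflects the rule that original data counts more:

```python
def weighted_total_loss(orig_losses, enh_losses, alpha=1.0, beta=0.5):
    """Weighted sum of per-example losses over the original set S1 and the
    enhanced set S2, with alpha > beta so the original data dominates."""
    return alpha * sum(orig_losses) + beta * sum(enh_losses)
```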
The advantage of data enhancement is that a large amount of new data can be expanded from existing or unlabeled data, and thresholding the question-answering model's probability ensures the reliability of the predicted answers. Although a small amount of noisy data may remain, a large amount of reliable data is obtained, and adding it to the training data improves the robustness of the model.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.