WO2020052405A1 - Corpus annotation set generation method and apparatus, electronic device, and storage medium - Google Patents
- Publication number
- WO2020052405A1 (PCT/CN2019/100823)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- query
- corpus
- labeling
- results
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the field of computer technology, and in particular, to a method and an apparatus for generating a corpus annotation set, an electronic device, and a computer-readable storage medium.
- the query statements are mainly labeled manually by annotators.
- the query intent of a query statement may include, for example, chat intent, music-on-demand intent, weather-query intent, and so on.
- the cognitive level of the annotator determines the accuracy of the query labeling.
- because an annotator's understanding may differ from that of ordinary users, or may be biased for a particular query sentence, the query-sentence labels included in the training set can easily become inaccurate, causing the model trained on that data to have large errors and fail to provide users with accurate answers.
- this application provides a method for generating a corpus labeling set.
- this application provides a method for generating a corpus annotation set, which is executed by an electronic device and includes:
- the query log includes query statements
- a corpus annotation set is generated from the query sentences with similar annotation results and corresponding annotation results.
- this application provides another apparatus for generating a corpus annotation set, including:
- a log acquisition module configured to acquire a query log, the query log including query statements;
- a corpus obtaining module configured to extract query sentences to be labeled from the query log to obtain the corpus to be labeled;
- a result acquisition module configured to acquire multiple parties' annotation results for the query sentences in the corpus to be labeled;
- a statement filtering module configured to filter query statements with similar annotation results from the corpus to be labeled according to the multiple parties' annotation results for the same query statement;
- an annotation set generation module configured to generate a corpus annotation set from the query sentences with similar annotation results and their corresponding annotation results.
- the electronic device includes:
- a memory for storing processor-executable instructions;
- the processor is configured to execute the method for generating the corpus annotation set.
- the present application provides a computer-readable storage medium.
- the computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to complete the method for generating the corpus annotation set.
- FIG. 1 is a schematic diagram of an implementation environment involved in this application
- Fig. 2 is a block diagram of a server device according to an exemplary embodiment
- Fig. 3 is a flow chart showing a method for generating a corpus annotation set according to an exemplary embodiment
- FIG. 5 is a schematic diagram of the division principle of multiple corpus annotation sets
- FIG. 6 is a schematic diagram of the influence curve of the corpus annotation set of each batch on the model performance
- FIG. 7 is a detailed flowchart of step 330 in the embodiment corresponding to FIG. 3;
- Fig. 8 is a schematic diagram illustrating a generation principle of a corpus annotation set according to an exemplary embodiment
- FIG. 9 is a detailed flowchart of step 350 in the embodiment corresponding to FIG. 3;
- FIG. 10 is a detailed flowchart of step 370 in the embodiment corresponding to FIG. 3;
- FIG. 11 is a flowchart of a method for generating a corpus annotation set based on the embodiment of FIG. 3;
- Fig. 12 is a block diagram of a device for generating a corpus annotation set according to an exemplary embodiment
- FIG. 13 is a detailed block diagram of a corpus obtaining module in the embodiment corresponding to FIG. 12;
- FIG. 14 is a detailed block diagram of a result acquisition module in the embodiment corresponding to FIG. 12;
- FIG. 15 is a detailed block diagram of the sentence filtering module in the embodiment corresponding to FIG. 12.
- Fig. 1 is a schematic diagram of an implementation environment involved in the present application according to an exemplary embodiment.
- the implementation environment involved in this application includes a server device 110.
- the query log is stored in the server device 110, so that the server device 110 can use the corpus annotation set generation method provided in the present application to generate a corpus annotation set from the query log, improving the accuracy of the query-sentence annotation results in the corpus annotation set; the highly accurate corpus annotation set is then used as a training set to train the data analysis model, improving the model's accuracy.
- the data analysis model is used to analyze the query sentences entered by the user online to provide users with accurate answers.
- the implementation environment also includes a data source that provides the data, that is, the source of the query log.
- the data source may be a smart terminal 130.
- the server device 110 may be connected to the smart terminal 130 through a wired or wireless network, and obtain a query log collected and uploaded by the smart terminal 130.
- the query log refers to a record, made by the smart terminal 130, of the process performed when the user inputs a query sentence.
- the query log may include a time point, a query entered by a user, and a query result returned to the user.
- the query entered by the user may be in the form of text or voice.
- the query log may also include a large number of query statements entered by one or more users.
- the server device 110 then uses the method provided in this application to generate a corpus annotation set.
- the smart terminal 130 may be a smart phone, a smart speaker, or a tablet computer.
- the implementation environment also includes an intelligent terminal 140 that submits query statements and awaits responses for the user.
- the intelligent terminal 130 that provides query logs and the intelligent terminal 140 that submits query statements and awaits responses may be the same device or different devices.
- the server device 110, having generated a corpus annotation set from the query log provided by the smart terminal 130, is connected to the smart terminal 140 through a wired or wireless network and receives query sentences input by the user.
- the data analysis model trained on the generated corpus annotation set analyzes the query entered by the user, identifies the user's intent, generates an accurate response, and feeds it back to the smart terminal 140 through the wired or wireless network.
- the method for generating the corpus annotation set of the present application is not limited to deploying corresponding processing logic in the server device 110, and may also be processing logic deployed in other machines.
- processing logic for generating a corpus annotation set is deployed in a terminal device with computing capabilities.
- FIG. 2 is a schematic structural diagram of a server device according to an embodiment of the present application.
- the server device 200 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 222 (for example, one or more processors), a memory 232, and one or more storage media 230 (for example, one or more mass storage devices) storing application programs 242 or data 244.
- the memory 232 and the storage medium 230 may be temporary storage or persistent storage.
- the program stored in the storage medium 230 may include one or more modules (not shown), and each module may include a series of instruction operations on the server device 200.
- the central processing unit 222 may be configured to communicate with the storage medium 230 and execute a series of instruction operations in the storage medium 230 on the server device 200.
- the server device 200 may also include one or more power sources 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
- the steps performed by the server device described in the embodiments shown in FIG. 3, FIG. 7, FIG. 8, and FIG. 10 to FIG. 12 below may be based on the server device structure shown in FIG. 2.
- the program may be stored in a computer-readable storage medium.
- the aforementioned storage medium may be a read-only memory, a magnetic disk, or an optical disk.
- Fig. 3 is a flow chart showing a method for generating a corpus annotation set according to an exemplary embodiment.
- the execution subject of the method for generating the corpus annotation set may be a server device or another electronic device, where the server device may be the server device 110 of the implementation environment shown in FIG. 1.
- the method for generating the corpus annotation set may be executed by an electronic device or the server device 110, and may include the following steps.
- step 310 a query log is obtained
- the query log refers to a record collected by a device when a user inputs a query sentence, and the device may be a smart speaker, a mobile terminal, or the like.
- the query log can be viewed as a raw corpus containing a large number of query statements.
- the so-called raw corpus refers to query sentences from real users that have not been manually labeled.
- step 330 extract the query sentence to be labeled from the query log to obtain a corpus to be labeled
- the query log contains a large number of query statements, but not all of them are valid: some may be entered by users at random and carry no meaning, some may be too long or too short, and many may be duplicates. If the annotation results of these query statements were included in the corpus annotation set, the accuracy of the annotation results in the set would be reduced, which in turn would affect the accuracy of the data analysis model trained on the corpus annotation set.
- the present application can extract the query sentence to be labeled from the query log according to a pre-configured strategy, and the query sentence to be labeled constitutes the corpus to be labeled.
- the extraction of the query sentences to be labeled may involve analyzing the query log and, according to a configured useless/stop character library, removing query sentences containing useless/stop characters, removing meaningless query sentences (such as a few randomly entered, incoherent characters), removing query statements that are too long or too short, removing duplicate query statements, and removing already-labeled query statements; the remaining query statements become the query statements to be labeled.
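The extraction strategy above can be sketched as follows; the stop-character set, the length bounds, and the function name are illustrative assumptions, not details fixed by the patent.

```python
# Hypothetical sketch of the pre-configured extraction strategy described above.
STOP_CHARS = {"#", "@", "~"}   # assumed useless/stop characters
MIN_LEN, MAX_LEN = 2, 50       # assumed sentence-length bounds

def extract_corpus_to_label(query_log, labeled_queries):
    """Keep only the query sentences worth labeling, per the strategy above."""
    seen = set()
    corpus = []
    for query in query_log:
        q = query.strip()
        if not (MIN_LEN <= len(q) <= MAX_LEN):    # too long / too short
            continue
        if any(ch in STOP_CHARS for ch in q):     # contains useless/stop characters
            continue
        if q in labeled_queries:                  # already labeled
            continue
        if q in seen:                             # duplicate within the log
            continue
        seen.add(q)
        corpus.append(q)
    return corpus
```

In practice the stop-character library and length bounds would come from the pre-configured strategy the patent mentions.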
- step 350 acquire the annotation results of multiple parties on the query sentence in the corpus to be annotated
- multiple parties can be multiple labeling personnel, multiple labeling devices, or multiple labeling programs in one device, which are used to indicate that there are multiple sources of labeling results for the query sentence in the corpus to be labeled.
- annotation personnel, devices, or programs are collectively referred to as annotators.
- each annotator can annotate the query sentences in the corpus to be annotated (referred to as "voting").
- annotation refers to adding a classification label to a query sentence in the corpus to be annotated; multiple "voting" results can reflect the correct classification of the query sentence.
- the labeling result is the classification label added by the labeler for the query.
- the labeling results can be intent labeling results, NER (Named Entity Recognition) labeling results, slot labeling results, or segmentation labeling results.
- the intent labeling result refers to the result of intent classification. For example, for "I'm in a bad mood today", the annotator's intent labeling result is "chat intent"; for "please give me a soothing song", the annotator's intent labeling result is "music-on-demand intent".
- the NER labeling result refers to the identification of person names, place names, institution names, proper nouns, and so on in the query sentence.
- the slot labeling result refers to adding slot labels to each phrase in the query, such as the weather business field.
- the slot labels include time words, place words, weather business keywords, weather phenomenon words, and interrogative words.
- the word segmentation result refers to dividing a query sentence into multiple phrases, and multiple phrases are used as the word segmentation result. Each phrase can be regarded as a classification label.
- the labeler can perform intent labeling, NER labeling, slot labeling, or word segmentation labeling to obtain labeling results for each labeling task.
- each party may first perform intent annotation on the query sentences in the corpus to be annotated (according to the intent annotation document specification) to obtain an intent annotation set including the intent annotation results of the query sentences.
- the query statements can then be divided into fields according to the intent labeling, and NER labeling (according to the NER labeling document specification) and slot labeling (according to the slot labeling document specification) can be performed in parallel within the divided fields, yielding an NER annotation set containing the NER labeling results and a slot annotation set containing the slot labeling results.
- each labeler can also perform segmentation labeling on the corpus to be labeled to obtain a segmentation labeling set including the result of the segmentation labeling.
- the intent annotation set, slot annotation set, NER annotation set, or word segmentation annotation set may be stored in the storage medium of the server device, and the server device may obtain the annotation results of the query statements in the corpus to be annotated by the multiparty from the storage medium.
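As an illustration only (the patent does not prescribe a storage format), one annotator's results for a single query sentence across the four labeling tasks above might be represented as a record like this; all field names and label values are assumptions.

```python
# Hypothetical representation of one annotator's results for a single query.
annotation = {
    "query": "please give me a soothing song",
    "intent": "music-on-demand",                             # intent labeling result
    "ner": [("song", "PROPER_NOUN")],                        # NER result: (span, entity type)
    "slots": [("soothing", "style"), ("song", "keyword")],   # slot result: (phrase, slot label)
    "segmentation": ["please", "give", "me", "a", "soothing", "song"],  # word segmentation result
}
```

Each annotator would produce one such record per query sentence, and the four result types would be collected into the intent, NER, slot, and word segmentation annotation sets described above.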
- step 370 according to the annotation results of the same query sentence by multiple parties, a query sentence with a similar annotation result is filtered from the corpus to be annotated;
- query statements with similar labeling results are those whose labeling results from the multiple parties are the same or similar; when the similarity of the multi-party labeling results is greater than a preset value, the results can be considered similar. The preset value may be, for example, 80% or 90%.
- the server device obtains the multiple parties' intent labeling results for the query statements in the corpus to be labeled, and for each query statement in turn compares the multiple parties' intent labeling results to determine whether they are consistent (a labeling-result similarity greater than the preset value can be regarded as consistent); query sentences with consistent annotation results are then selected from the corpus to be labeled.
- if the multi-party labeling results are inconsistent, the query sentence may be a multi-label sample or a difficult sample; after review, it is added to a multi-label annotation set or a difficult-sample set.
- a labeling task goes through the labeling process and finally gets three labeling sets: a single-label labeling set, a multi-label labeling set, and a difficult sample set.
- a query in the single-label annotation set can be considered one for which the multi-party annotation results are the same.
- the single-label annotation set can be regarded as a reliable annotation set, and can be used as the training set and test set of the intent recognition model.
- the labeling result is a NER labeling result, a slot labeling result, or a word segmentation labeling result
- a corpus annotation set is generated from the query sentences with similar annotation results and corresponding annotation results.
- the corpus annotation set includes query sentences and their corresponding annotation results.
- the query sentences are those filtered out in step 370, whose multi-party annotation results are similar.
- the server device uses the filtered query sentences, whose multi-party annotation results are similar, together with their annotation results to generate a corpus annotation set composed of the query sentences and the annotation results.
- because the query sentences in the corpus annotation set have similar multi-party labeling results, the likelihood of divergence in their annotation results is small and the accuracy of the annotation results is higher.
- the corpus annotation set is used as a training set to train data analysis models such as intent recognition models, which can improve the accuracy of the data analysis model.
- the server device obtains annotation result 1 from annotator 1 for the query sentences in the corpus to be annotated, annotation result 2 from annotator 2, annotation result 3 from annotator 3, and annotation result 4 from annotator 4.
- the server device merges annotation results 1, 2, 3, and 4, filters out the query statements for which the four labeling results are the same, and adds them to the single-label annotation set.
- if the labeling results 1, 2, 3, and 4 of some query statements are inconsistent and the inconsistent results are evenly distributed (for example, 2:2), those query statements may be multi-label samples; in that case they are added to the multi-label annotation set. If the labeling results 1, 2, 3, and 4 of some query statements are all inconsistent, those query statements may be multi-label samples or difficult samples and can be added to the multi-label annotation set or the difficult-sample set.
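The merging described above can be sketched as follows. The even-split heuristic for multi-label samples is an assumption drawn from the 2:2 example in the text, and in practice the multi-label and difficult samples would still go through human review.

```python
from collections import Counter

def merge_annotations(results_by_party):
    """Merge multiple parties' labeling results into three sets.

    results_by_party: {query: [label from party 1, label from party 2, ...]}
    """
    single_label, multi_label, difficult = {}, [], []
    for query, labels in results_by_party.items():
        counts = Counter(labels)
        if len(counts) == 1:
            # all parties agree -> single-label annotation set
            single_label[query] = labels[0]
        elif len(counts) == 2 and len(set(counts.values())) == 1:
            # even split between two labels (e.g. 2:2) -> likely multi-label sample
            multi_label.append(query)
        else:
            # all different, or other splits -> difficult sample (pending review)
            difficult.append(query)
    return single_label, multi_label, difficult
```

The single-label set returned here corresponds to the reliable annotation set the text describes as usable for training and testing.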
- the corpus annotation set here is a single-label annotation set, containing query statements whose multi-party labeling results are the same. That is, there is no divergence in the annotation results of the query sentences in the corpus annotation set, and the accuracy of the annotation results is high. Therefore, the corpus annotation set can be used as a highly accurate training set or test set for training the data analysis model.
- the corpus labeling set can be used as a training set to train the intent recognition model.
- the labeling result is a NER labeling result
- the corpus labeling set includes a query sentence with the same NER labeling result and its corresponding NER labeling result
- the corpus labeling set can be used as a training set to train a named entity recognition model.
- the labeling result is a slot labeling result
- the corpus labeling set can be used as a training set to train the slot labeling model.
- the labeling result is a word segmentation labeling result
- the corpus labeling set can be used as a training set to train the word segmentation model.
- the corpus annotation set can be added incrementally to the existing training set to retrain the data analysis model; the same test set is used to test the model's performance, and the performance improvement brought by the new corpus annotation set is evaluated. This improvement reflects the quality and value of the new corpus annotation set.
- a staggered annotation schedule can be adopted within each cycle, as shown in Table 1 below.
- the table uses four annotators as an example. To prevent the annotators from referencing each other's labeling results, each annotator labels different content on the same day. Annotation can be arranged according to the schedule shown in Table 1, with five days as one cycle; consistent and inconsistent labeling results among the annotators are then counted.
- step 330 specifically includes:
- step 331: the query statements in the query log that do not satisfy the preset conditions are removed;
- the query statements that do not satisfy the preset conditions may include one or more of the following forms: query statements containing useless/stop characters, meaningless query statements, query statements that are too long or too short, duplicate query statements, and so on. Removing them avoids subsequently annotating these worthless query sentences, which would not only increase the workload but also affect the accuracy of the corpus annotation set.
- step 332: the remaining query statements in the query log are input into the multiple label prediction models that have been constructed, and the label prediction results of the multiple label prediction models for the same query statement are output; the multiple label prediction models are trained from different training sample sets;
- the label prediction model may be an intent recognition model for identifying the intent of a query sentence.
- the label prediction result may be an intent recognition result.
- a label prediction model can be trained using a large number of query sentences with known intent labeling results (that is, a training sample set). Multiple label prediction models can be trained using different training samples. For example, all query sentences with known intent labeling results can be divided into 4 batches, with each batch used to train a corresponding intent recognition model, yielding 4 intent recognition models.
- the remaining query statements in the query log are input into 4 intent recognition models, and the intent recognition results of the 4 intent recognition models for the same query sentence are output.
- the label prediction model can also be a named entity recognition model, a slot labeling model, or a word segmentation model; these models can be obtained by training on large numbers of query statements with known NER labeling results, known slot labeling results, and known word segmentation labeling results, respectively.
- the label prediction result can correspondingly be a named entity recognition result, a slot labeling result, or a word segmentation result. The construction of label prediction models belongs to the prior art and is not repeated here.
- step 333: according to the label prediction results of the multiple label prediction models for the same query sentence, the query sentences with inconsistent label prediction results are filtered from the remaining query sentences to obtain the corpus to be labeled.
- boundary sample points are of great significance to the decision boundary of the model being trained: if sample points are found that have multiple possible intents and comparable probability in different classes, adding such sample points to the training set helps improve the model's performance more than adding easily classified sample points.
- the present application therefore filters out, from the remaining query sentences in the query log, the query sentences whose label prediction results are inconsistent across the multiple label prediction models. The models have low recognition accuracy for these query sentences, so they can be regarded as boundary sample points; adding them to the corpus to be labeled for model training can improve the accuracy of the model.
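The disagreement-based filtering of step 333 can be sketched as follows, assuming each label prediction model is a callable mapping a query to a predicted label (an assumption made for illustration; the patent does not fix a model interface).

```python
def select_boundary_samples(queries, models):
    """Keep queries on which the label prediction models disagree.

    queries: iterable of query sentences
    models:  list of callables, each mapping a query to a predicted label
    """
    corpus_to_label = []
    for q in queries:
        predictions = {predict(q) for predict in models}
        if len(predictions) > 1:
            # inconsistent predictions -> treat as a boundary sample point
            corpus_to_label.append(q)
    return corpus_to_label
```

This is essentially a committee-style selection: queries where all models agree are considered easy and skipped, while disagreements flag candidates worth human annotation.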
- the foregoing step 331 may include the following steps:
- a meaningless query statement is one without a specific intention; it may be a user error or a randomly entered statement.
- a classifier is a classification model; its role here is to identify which query statements in the query log are meaningful and which are meaningless. A classifier can be trained from a large number of meaningful and meaningless query sentences; for example, the parameters of a logistic regression model can be trained on such data to obtain a classifier. "Classifier" is a collective term in data mining for methods that classify samples; classifiers can be constructed in a variety of ways, including decision trees, logistic regression, naive Bayes, and neural networks.
- the query sentence in the query log can be input to the trained classifier, and a meaningful or meaningless judgment result is output, thereby removing the meaningless query sentence in the query log.
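A minimal sketch of such a classifier is shown below: a tiny bag-of-words logistic regression trained by gradient ascent. The whitespace tokenization, the features, and the training data are assumptions for illustration; a production system would use a richer feature set and a proper training corpus.

```python
import math

class MeaningfulQueryClassifier:
    """Bag-of-words logistic regression sketch of the classifier described above."""

    def __init__(self, lr=0.5, epochs=200):
        self.lr, self.epochs = lr, epochs
        self.w, self.b = {}, 0.0  # per-token weights and bias

    def _score(self, query):
        # linear score over the query's tokens
        return self.b + sum(self.w.get(tok, 0.0) for tok in query.split())

    def fit(self, queries, labels):
        """Train on (query, label) pairs; label 1 = meaningful, 0 = meaningless."""
        for _ in range(self.epochs):
            for q, y in zip(queries, labels):
                p = 1.0 / (1.0 + math.exp(-self._score(q)))  # sigmoid
                g = y - p  # gradient of the log-likelihood w.r.t. the score
                self.b += self.lr * g
                for tok in q.split():
                    self.w[tok] = self.w.get(tok, 0.0) + self.lr * g
        return self

    def is_meaningful(self, query):
        return self._score(query) > 0.0
```

Queries judged not meaningful would then be removed from the query log before annotation, as the bullet above describes.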
- step 331 may further include the following steps:
- the labeled query statements in the query log and query statements similar to the labeled query statements are removed.
- the set of labeled query statements refers to a set of query statements with known labeling results.
- the labeled query statement set may be a previously generated corpus annotation set; according to the query statements contained in that set, the query statements belonging to it can be removed from the query statements of the query log.
- the labeled query statement refers to the query statements in the query statement set.
- query statements similar to labeled query statements can be found by computing the similarity between query statements: query statements in the query log with a high similarity to labeled query statements are identified and removed.
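One possible similarity measure (the patent does not fix one) is Jaccard similarity over character bigrams; the 0.8 threshold below is likewise an assumption for illustration.

```python
def char_bigrams(s):
    """Set of overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(q1, q2):
    """Jaccard similarity of the two queries' character-bigram sets."""
    a, b = char_bigrams(q1), char_bigrams(q2)
    return len(a & b) / len(a | b) if a | b else 1.0

def remove_similar(query_log, labeled_queries, threshold=0.8):
    """Drop queries too similar to any already-labeled query."""
    return [q for q in query_log
            if all(similarity(q, lq) < threshold for lq in labeled_queries)]
```

Edit-distance or embedding-based measures would serve the same purpose; the point is only that near-duplicates of labeled queries add little new information to the corpus to be labeled.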
- the remaining query statements referred to in step 332 can be those left in the query log after removing meaningless query statements, query statements containing useless or stop characters, labeled query statements, and query statements highly similar to labeled query statements.
- step 331 may further include the following steps:
- a query sentence containing only a single entity word, a query sentence with a sentence length greater than a preset number of characters, or a duplicate query sentence is removed from the query log.
- the entity word refers to the name of a real specific thing, for example, a song name, a singer name, and the like.
- a query sentence that contains only one entity word makes both intent and word segmentation difficult to distinguish, so it is not suitable for joining the corpus annotation set to participate in modeling.
- query statements whose length is greater than the preset number of characters are long query statements. Such query statements are difficult to label, and their length would undoubtedly increase the amount of computation during modeling; they are therefore not suitable for the corpus annotation set.
- similarly, it is unnecessary for duplicate query statements in the query log to all join the corpus annotation set, so duplicates are removed; for example, if three query statements are identical, one is retained and the other two are removed.
- the remaining query statements in step 332 can also be those left in the query log after removing query sentences that contain only a single entity word, query sentences longer than the preset number of characters, and duplicate query sentences.
- new query statements are first preprocessed: meaningless queries are removed, queries containing useless/stop characters are removed, single-entity queries are removed, and overly long and duplicate queries are removed.
- then, according to the labeled query statement set, labeled query statements and query statements highly similar to labeled query statements are removed. Further, through steps 332 and 333 above, the query sentences whose label prediction results are inconsistent are filtered out; the filtered query sentences constitute the corpus to be labeled.
- query sentences with similar annotation results can be filtered to generate a corpus annotation set.
- the corpus annotation set can be added to the annotated query sentence set to participate in the training of the model together.
- step 350 specifically includes:
- step 351 a labeling task for the corpus to be labeled is distributed to multiple parties, and the distribution of the labeling task triggers multiple parties to execute the labeling task in parallel;
- the labeling task may be an intent labeling task, a NER labeling task, a slot labeling task, or a word segmentation labeling task.
- multiple parties may be multiple labeling devices, and the server device sends the labeling tasks carrying the corpus to be labeled to the multiple labeling devices, triggering the multiple labeling devices to execute the labeling tasks in parallel.
- the labeling device may be an intelligent labeling device obtained through training in advance with a large amount of sample data. Each labeling device is trained with different sample data sets, so the labeling accuracy of each labeling device is different.
- the server device may issue a labeling task carrying a set of corpora to be labeled to terminal devices to which multiple labelers belong.
- the terminal device to which an annotator belongs can display the corpus to be labeled and prompt the labeling task; the user can click options or make marks to perform intent labeling, NER labeling, slot labeling, and word segmentation labeling.
- the terminal devices of the multiple annotators obtain labeling results according to the users' clicked options or marks, completing the labeling task for the corpus to be labeled.
- the distribution of the labeling task triggers multiple parties to execute the labeling task in parallel, which specifically includes: the distribution of the labeling task triggers each party to input the corpus to be labeled into its own configured labeling model and output the labeling results for the corpus to be labeled, respectively; the labeling models configured by the multiple parties are obtained by training with different training sample sets.
- multiple parties can represent multiple labeling devices or multiple labeling programs.
- Each labeling party is configured with a labeling model. Since the labeling models configured by the multiple parties are trained using different training sample sets, the multiple labeling devices or labeling programs have different labeling accuracy.
- the training sample sets used by the multi-party labeling models in this embodiment are also different from those used by the label prediction models described above. For example, all samples can be divided into 10 training sample sets, each of which is used to train a corresponding model; some of the 10 models then serve as label prediction models and the rest as labeling models. The multiple label prediction models select query sentences with inconsistent label prediction results to obtain the corpus to be labeled, and the multi-party labeling models then compute the labeling results for the query statements in the corpus to be labeled, yielding the multi-party labeling results.
- Multiple labeling programs can perform the following steps in parallel: input the corpus to be labeled into a pre-built labeling model, and output the labeling results of the corpus to be labeled.
- the construction of the labeling model can refer to the construction of the label prediction model.
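The parallel multi-party labeling described above can be sketched as follows. This is a minimal local simulation under assumed names: each "party" is represented by a labeling-model callable, whereas in the disclosure the parties may be separate devices or programs reached over a network.

```python
from concurrent.futures import ThreadPoolExecutor

def run_labeling(models, corpus):
    """Each labeling party labels the whole corpus in parallel.
    Returns {query: [label_from_party_0, label_from_party_1, ...]}."""
    with ThreadPoolExecutor() as pool:
        # one task per labeling model, each over the full corpus
        per_model = list(pool.map(lambda m: [m(q) for q in corpus], models))
    return {q: [labels[i] for labels in per_model]
            for i, q in enumerate(corpus)}
```

Because each model was trained on a different sample set, the per-query label lists returned here can disagree, which is exactly what the later consistency filtering exploits.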
- step 352 receiving the labeling results returned by multiple parties executing the labeling task in parallel.
- the labeling result can be an intent labeling result, an NER labeling result, a slot labeling result, or a segmentation labeling result.
- the corpus to be annotated includes a plurality of buried-point sentences with known label information; a buried-point sentence is a query sentence whose accurate labeling result is already known, and the multiple parties also produce labeling results for the buried-point sentences.
- the accurate labeling result of a buried-point sentence is called its label information.
- the above step 370 specifically includes:
- step 371 according to the labeling results of the multiple buried-point statements by multiple parties, comparing whether the labeling results of the multiple buried-point statements are consistent with corresponding label information, and calculating an accuracy rate of the multi-party labeling results;
- the accuracy of the multi-party labeling result refers to the accuracy of labeling multiple buried-point sentences by each labeling party.
- the calculation of the labeling accuracy of the buried-point sentence is used to evaluate the labeling accuracy of the current labeling party.
- a "buried point" method is used to verify the accuracy of each labeling party. For example, query statements with consistent annotation results can be extracted from the previous batch of labeled data (for example, around 5% of it) to serve as the buried-point statements with known label information for the current batch.
- the proportion of a labeler's labeling results that are consistent with the label information is calculated, yielding the accuracy rate of that labeler's labeling results.
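The buried-point accuracy check can be sketched as below. The data shapes and function name are assumptions for illustration: each party's results and the known label information are held as simple dictionaries.

```python
def annotator_accuracy(results, gold):
    """results: {party: {query: label}}; gold: {buried_query: true_label}.
    Accuracy = fraction of buried-point queries whose label matches the
    known label information."""
    acc = {}
    for party, labels in results.items():
        hits = sum(labels.get(q) == lab for q, lab in gold.items())
        acc[party] = hits / len(gold)
    return acc
```

These per-party accuracy rates are then compared against a threshold (or ranked) to decide which parties' labeling results to keep.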
- step 372 according to the accuracy rate of the multi-party labeling result, the source of the labeling result whose accuracy rate is not up to standard is removed from the multi-party sources.
- a threshold value can be set; according to each labeler's accuracy rate, a labeling party whose accuracy rate is below the threshold is regarded as providing labeling results whose accuracy does not meet the standard, and the labeling results provided by such labelers can be deleted.
- alternatively, all labelers can be sorted by accuracy from high to low, and several of the lowest-ranked labelers can be regarded as labelers whose accuracy is not up to standard.
- the labeling results provided by the labeler with a substandard accuracy can be removed.
- step 373 according to the labeling results of the remaining sources, query sentences with similar multi-source labeling results are filtered from the corpus to be labeled.
- the labeling results of the remaining sources refer to the labeling results for the corpus to be labeled provided by the remaining labeling parties, after the substandard labeling results have been deleted from those provided by the multiple parties. That is, the subsequent selection of query sentences with similar annotation results no longer relies on the removed parties' results; based on the results of the remaining, higher-accuracy labelers, query sentences with similar labeling results are filtered out of the corpus to be labeled.
- the method for generating a corpus annotation set provided in the present application further includes:
- step 1101 according to the annotation results of multiple parties on the same query sentence, a query sentence with inconsistent annotation results is filtered from the corpus to be annotated;
- the boundary samples are very helpful for model optimization and characterization of clearer classification boundaries.
- the boundary samples can be selected and used from samples inconsistent from multiple parties.
- the server device can filter out query statements with inconsistent labeling results from the corpus to be labeled according to the labeling results of the same query statement from multiple parties.
- step 1102 multi-label query statements are obtained from the query statements with inconsistent labeling results, serving as boundary sample points for optimizing a data analysis model.
- multi-label query statements are query statements with multiple labeling results.
- Such multi-label query statements can be considered as boundary sample points.
- the recognition of this kind of query is difficult, so if the model can accurately identify the intent, slot, etc. of this kind of query, it will greatly improve the accuracy of the model.
- the data analysis model may be an intent recognition model, a named entity recognition model, a slot labeling model, a word segmentation model, and the like. The optimization of the data analysis model through such query statements can improve the recognition accuracy of the model.
- for example, a query whose intent labeling results contain both a chat intent and another, task-specific intent is a boundary sample for intent classification, and such a sample can help model training establish accurate intent boundaries.
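Selecting the boundary samples from the disagreement set can be sketched directly from the vote structure used above (names assumed for illustration):

```python
def boundary_samples(votes):
    """votes: {query: {party: label}}. Return the multi-label queries,
    i.e. those on which the parties disagree, with their distinct labels;
    these serve as boundary sample points for model optimization."""
    return {q: sorted(set(by_party.values()))
            for q, by_party in votes.items()
            if len(set(by_party.values())) > 1}
```

Feeding these hard examples back into training is what sharpens the model's classification boundaries.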
- the following is a device embodiment of the present application, which can be used to execute the embodiments of the method for generating a corpus annotation set performed by the server device 110 of the present application. For details not disclosed in the device embodiment, please refer to the embodiments of the method for generating a corpus annotation set of the present application.
- Fig. 12 is a block diagram of a device for generating a corpus annotation set according to an exemplary embodiment.
- the device for generating a corpus annotation set can be used in the server device 110 of the implementation environment shown in Fig. 1 to execute all or part of the steps of the method for generating a corpus annotation set shown in any one of Fig. 3 and Figs. 7 to 11.
- the device includes, but is not limited to, a log acquisition module 1210, a corpus acquisition module 1230, a result acquisition module 1250, a sentence filtering module 1270, and a label set generation module 1290.
- the log acquisition module 1210 is configured to acquire a query log, where the query log includes a query statement;
- a corpus obtaining module 1230 configured to extract query sentences to be labeled from the query log to obtain the corpus to be labeled;
- a result acquisition module 1250 configured to acquire the annotation results of multiple parties on the query sentence in the corpus to be annotated;
- a sentence filtering module 1270 configured to filter, based on the annotation results of multiple parties for the same query sentence, query sentences with similar annotation results from the corpus to be annotated;
- the annotation set generating module 1290 is configured to generate a corpus annotation set from a query sentence with a similar annotation result and a corresponding annotation result.
- the log acquisition module 1210 may be, for example, the wired or wireless network interface 250 having a specific physical structure in Fig. 2.
- the corpus acquisition module 1230, the result acquisition module 1250, the sentence filtering module 1270, and the annotation set generation module 1290 may also be functional modules for performing corresponding steps in the method for generating the corpus annotation set. It can be understood that these modules can be implemented by hardware, software, or a combination of both. When implemented in hardware, these modules may be implemented as one or more hardware modules, such as one or more application specific integrated circuits. When implemented in software, these modules may be implemented as one or more computer programs executing on one or more processors, such as programs stored in the memory 232 executed by the central processor 222 of FIG. 2.
- the corpus obtaining module 1230 includes:
- a sentence removing unit 1231 configured to remove a query sentence in the query log that does not satisfy a preset condition
- a label prediction unit 1232 is configured to input the remaining query statements in the query log into multiple label prediction models that have been constructed, and output the label prediction results of multiple label prediction models for the same query sentence; the multiple label prediction models Obtained by training with different training sample sets;
- a sentence extraction unit 1233 is configured to filter query sentences with inconsistent tag prediction results from the remaining query sentences according to the tag prediction results of the same query sentence by the multiple tag prediction models to obtain the corpus to be labeled.
- the sentence removing unit 1231 includes:
- the classification removing subunit is configured to classify the query sentences recorded in the query log by using the constructed classifier, and remove meaningless query sentences obtained by the classification.
- the sentence removing unit 1231 further includes:
- the first removing subunit is configured to remove the labeled query sentence and the query sentence similar to the labeled query sentence in the query log according to the labeled query sentence set.
- the sentence removing unit 1231 further includes:
- the second removing subunit is configured to remove a query sentence containing only a single entity word, a query sentence with a sentence length greater than a preset number of characters, or a duplicate query sentence in the query log.
- the result acquisition module 1250 includes:
- a task dispatching unit 1251 configured to dispatch a labeling task to the corpus to be labeled to multiple parties, and the dispatching of the labeling task triggers multiple parties to execute the labeling task in parallel;
- the result receiving unit 1252 is configured to receive a labeling result returned by multiple parties executing the labeling task in parallel.
- the distribution of the labeling task triggers multiple parties to execute the labeling task in parallel, including:
- the distribution of the labeling task triggers multiple parties to input the corpus to be labeled into their own configured labeling models in parallel, and to output their respective labeling results for the corpus to be labeled.
- the labeling models configured by the multiple parties are obtained by training with different training sample sets.
- the corpus to be annotated includes a plurality of buried point sentences with known tag information; as shown in FIG. 15, the sentence filtering module 1270 includes:
- An accuracy calculation unit 1271 configured to compare whether the labeling results of the multiple buried-point statements are consistent with corresponding label information according to the labeling results of the multiple buried-point statements by multiple parties, and calculate the accuracy of the multi-party labeling results;
- a source culling unit 1272 configured to remove, according to the accuracy rate of the multi-party labeling result, the source of the labeling result whose accuracy rate does not meet the standard;
- a sentence filtering unit 1273 is configured to filter query sentences with similar multi-source annotation results from the corpus to be annotated according to the annotation results of the remaining sources.
- This application also provides an electronic device that can be used in the server device 110 of the implementation environment shown in Fig. 1 to execute all or part of the steps of the method for generating a corpus annotation set shown in any of Fig. 3 and Figs. 7 to 11.
- the electronic device includes:
- a processor;
- a memory for storing instructions executable by the processor;
- the processor is configured to execute the method for generating a corpus annotation set according to the above exemplary embodiment.
- a storage medium is also provided; the storage medium is a computer-readable storage medium, and may be, for example, a temporary or non-transitory computer-readable storage medium including instructions.
- the storage medium stores a computer program that can be executed by the central processing unit 222 of the server device 200 to complete the above-mentioned method for generating a corpus annotation set.
Abstract
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 10, 2018, with application number 201811048957.8 and the invention name "Method and Device for Generating Corpus Label Sets, Electronic Equipment, Storage Medium", the entire contents of which are incorporated in this application by reference.

The present application relates to the field of computer technology, and in particular to a method and an apparatus for generating a corpus annotation set, an electronic device, and a computer-readable storage medium.

Background of the Invention

In the field of voice interaction, query sentences entered by users are mainly analyzed online through various data analysis models to identify user intent and provide users with accurate answers. A data analysis model is obtained by training on a large number of labeled query sentences (referred to as the training set). Therefore, the accuracy of the annotation results of the query sentences in the training set directly affects the accuracy of the data analysis model and determines the intelligence level of the voice interaction function.

At present, query sentences are mainly labeled manually by labeling personnel. For example, the query intent of a query sentence is labeled (including a chat intent, a music-on-demand intent, a weather query intent, and so on). The cognitive level of the labeler therefore determines the labeling accuracy of the query sentence.

Because the cognitive level of a labeler may differ from that of ordinary people, or the labeler's understanding of a particular query sentence may be biased, the query sentences included in the training set can easily be labeled inaccurately, causing the trained data analysis model to have large errors and preventing it from providing users with accurate answers.

Summary of the Invention

In order to solve the problem in the related art that the annotation results of query sentences in the training set are inaccurate due to bias in the labelers' cognition, this application provides a method for generating a corpus annotation set.
In one aspect, this application provides a method for generating a corpus annotation set, executed by an electronic device, including:

obtaining a query log, where the query log includes query sentences;

extracting query sentences to be annotated from the query log to obtain a corpus to be annotated;

obtaining the annotation results of multiple parties on the query sentences in the corpus to be annotated;

filtering out query sentences with similar annotation results from the corpus to be annotated according to the annotation results of the multiple parties on the same query sentence;

generating a corpus annotation set from the query sentences with similar annotation results and the corresponding annotation results.
In another aspect, this application provides an apparatus for generating a corpus annotation set, including:

a log acquisition module, configured to acquire a query log, where the query log includes query sentences;

a corpus obtaining module, configured to extract query sentences to be annotated from the query log to obtain a corpus to be annotated;

a result acquisition module, configured to acquire the annotation results of multiple parties on the query sentences in the corpus to be annotated;

a sentence filtering module, configured to filter out query sentences with similar annotation results from the corpus to be annotated according to the annotation results of the multiple parties on the same query sentence;

an annotation set generation module, configured to generate a corpus annotation set from the query sentences with similar annotation results and the corresponding annotation results.
Further, this application provides an electronic device, including:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the above method for generating a corpus annotation set.

Further, this application provides a computer-readable storage medium storing a computer program, and the computer program can be executed by a processor to complete the above method for generating a corpus annotation set.

It should be understood that the above general description and the following detailed description are merely exemplary and do not limit the present application.
Brief Description of the Drawings

The drawings herein are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application, and together with the description serve to explain the principles of the application.

Fig. 1 is a schematic diagram of an implementation environment involved in this application;

Fig. 2 is a block diagram of a server device according to an exemplary embodiment;

Fig. 3 is a flowchart of a method for generating a corpus annotation set according to an exemplary embodiment;

Fig. 4 is a schematic diagram of the annotation results of various annotation tasks;

Fig. 5 is a schematic diagram of the division principle of multiple corpus annotation sets;

Fig. 6 is a schematic diagram of curves showing the influence of each batch's corpus annotation set on model performance;

Fig. 7 is a detailed flowchart of step 330 in the embodiment corresponding to Fig. 3;

Fig. 8 is a schematic diagram of the generation principle of a corpus annotation set according to an exemplary embodiment;

Fig. 9 is a detailed flowchart of step 350 in the embodiment corresponding to Fig. 3;

Fig. 10 is a detailed flowchart of step 370 in the embodiment corresponding to Fig. 3;

Fig. 11 is a flowchart of a method for generating a corpus annotation set based on the embodiment corresponding to Fig. 3;

Fig. 12 is a block diagram of an apparatus for generating a corpus annotation set according to an exemplary embodiment;

Fig. 13 is a detailed block diagram of the corpus obtaining module in the embodiment corresponding to Fig. 12;

Fig. 14 is a detailed block diagram of the result acquisition module in the embodiment corresponding to Fig. 12;

Fig. 15 is a detailed block diagram of the sentence filtering module in the embodiment corresponding to Fig. 12.
Implementation

Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
Fig. 1 is a schematic diagram of an implementation environment involved in the present application according to an exemplary embodiment. The implementation environment involved in this application includes a server device 110. The server device 110 stores a query log, so the server device 110 can adopt the method for generating a corpus annotation set provided in this application, use the query log to generate a corpus annotation set, and improve the accuracy of the annotation results of the query sentences in the corpus annotation set. The corpus annotation set with higher accuracy is used as a training set to train a data analysis model, improving the accuracy of the data analysis model; the data analysis model subsequently analyzes the query sentences input by users online to provide users with accurate answers.

As needed, the implementation environment also includes the source of the data, that is, of the query log. Specifically, in this implementation environment, the data source may be a smart terminal 130. Although only one smart terminal 130 is shown in Fig. 1, those skilled in the art should understand that the number of smart terminals providing query logs may be multiple. As shown in Fig. 1, the server device 110 can connect to the smart terminal 130 through a wired or wireless network and obtain the query log collected and uploaded by the smart terminal 130, where the query log is a record made by the smart terminal 130 of the query sentences input by users. The query log may include time points, the query sentences input by users, the query results returned to users, and so on. The query sentences input by users may be in text or voice form. The query log may also include a large number of query sentences input by one or more users. The server device 110 then uses the method provided in this application to generate a corpus annotation set. The smart terminal 130 may be a smart phone, a smart speaker, or a tablet computer.

As needed, the implementation environment also includes a smart terminal 140 that provides query sentences and waits for answers to be provided to users. Those skilled in the art can understand that the smart terminal 130 providing the query log and the smart terminal 140 providing query sentences and waiting for answers may be the same or different. As shown in Fig. 1, the server device 110, which has generated a corpus annotation set from the query log provided by the smart terminal 130, connects to the smart terminal 140 through a wired or wireless network, receives the query sentences input by users, analyzes them with the data analysis model trained on the generated corpus annotation set, identifies the users' intent, generates accurate answers, and then feeds the answers back to the smart terminal 140 through the wired or wireless network.

It should be noted that the method for generating a corpus annotation set of this application is not limited to deploying the corresponding processing logic in the server device 110; the processing logic may also be deployed in other machines, for example, in a terminal device with computing capability.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a server device according to an embodiment of the present application. The server device 200 may vary greatly depending on configuration or performance, and may include one or more central processing units (CPUs) 222 (for example, one or more processors), a memory 232, and one or more storage media 230 (for example, one or more mass storage devices) storing application programs 242 or data 244. The memory 232 and the storage medium 230 may be transient storage or persistent storage. The program stored in the storage medium 230 may include one or more modules (not shown), and each module may include a series of instruction operations on the server device 200. Further, the central processing unit 222 may be configured to communicate with the storage medium 230 and execute, on the server device 200, the series of instruction operations in the storage medium 230. The server device 200 may also include one or more power supplies 226, one or more wired or wireless network interfaces 250, one or more input/output interfaces 258, and/or one or more operating systems 241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on. The steps performed by the server device described in the embodiments shown in Fig. 3, Fig. 7, Fig. 8, and Figs. 10 to 12 below may be based on the server device structure shown in Fig. 2.
A person of ordinary skill in the art may understand that all or part of the steps for implementing the following embodiments may be completed by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Fig. 3 is a flowchart of a method for generating a corpus annotation set according to an exemplary embodiment. The method may be executed by a server device, or by the electronic device for a server device described below, where the server device may be the server device 110 of the implementation environment shown in Fig. 1. As shown in Fig. 3, the method for generating a corpus annotation set may be executed by the electronic device or the server device 110 and may include the following steps.
In step 310, a query log is obtained.

The query log is a record made by a device of the query sentences input by users; the device may be a smart speaker, a mobile terminal, or the like. The query log can be viewed as a raw corpus containing a large number of query sentences. A raw corpus consists of query sentences from original, real users that have not been manually annotated.

In step 330, query sentences to be annotated are extracted from the query log to obtain a corpus to be annotated.
It should be noted that although the query log contains a large number of query sentences, not all of them are valid: some may be entered by users at random and carry no meaning, some may be too long or too short, and many may be duplicates. If the annotation results of such query sentences were included in the corpus annotation set, they would reduce the accuracy of the annotation results in the set and in turn affect the accuracy of the data analysis model trained with the corpus annotation set as training samples.

Therefore, this application can extract the query sentences to be annotated from the query log according to a pre-configured strategy, and the query sentences to be annotated constitute the corpus to be annotated. Extracting the query sentences to be annotated may involve analyzing the query log and, according to a configured useless/stop character library, removing query sentences containing useless/stop characters, removing meaningless query sentences (for example, a few incoherent characters entered at random), removing query sentences that are too long or too short, removing duplicate query sentences, and removing query sentences that have already been annotated; the remaining query sentences are taken as the query sentences to be annotated.
In step 350, the annotation results of multiple parties on the query sentences in the corpus to be annotated are obtained.

The multiple parties may be multiple labeling personnel, multiple labeling devices, or multiple labeling programs in one device, indicating that the annotation results of the query sentences in the corpus to be annotated have multiple sources. For convenience of description, labeling personnel, labeling devices, and labeling programs are collectively referred to below as labeling parties. Each labeling party can annotate the query sentences in the corpus to be annotated (referred to as "voting"). Annotation means adding classification labels to the query sentences in the corpus to be annotated; multiple "voting" results can reflect the correct classification of a query sentence.
The annotation result is the classification label added by the labeling party for the query sentence. Depending on the annotation task, the annotation result may be an intent annotation result, an NER (Named Entity Recognition) annotation result, a slot annotation result, or a word segmentation annotation result. The intent annotation result refers to the intent classification result. For example, for "I'm in a bad mood today", the labeling party's intent annotation result for the query sentence is "chat intent"; for "please give me a soothing song", the labeling party's intent annotation result is "music-on-demand intent".

The NER annotation result labels the person names, place names, organization names, proper nouns, and so on in the query sentence. The slot annotation result refers to adding slot labels to the phrases in the query sentence; for example, in the weather business domain, the slot labels include time words, place words, weather business keywords, weather phenomenon words, interrogative words, and so on. The word segmentation annotation result refers to dividing the query sentence into multiple phrases; the phrases serve as the segmentation result, and each phrase can be regarded as a classification label.
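One fully annotated query sentence covering these task types might be held as a record like the following. The field names, label values, and structure are purely illustrative assumptions; the disclosure does not fix a storage format.

```python
# Hypothetical annotated record for one query sentence.
record = {
    "query": "please give me a soothing song",
    "intent": "music_on_demand",          # intent annotation result
    "ner": [("song", "WORK")],            # hypothetical NER span and type
    "slots": {"style": "soothing"},       # hypothetical slot annotation
    "segments": ["please", "give", "me", "a", "soothing", "song"],
}
```

A corpus annotation set is then simply a collection of such records on which the multiple parties' results were consistent.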
如图4所示,针对待标注语料集,标注方可以进行意图标注、NER标注、槽位标注或分词标注,得到每项标注任务的标注结果。具体的,每一方可以先对待标注语料集中的查询语句进行意图标注(按照意图标注文档规范),得到包含查询语句意图标注结果的意图标注集。进而根据意图标注可以将查询语句进行领域划分,并在 划分好的领域同时进行NER标注(按照NER标注文档规范)和槽位标注(按照槽位标注文档规范),分别得到包含NER标注结果的NER标注集和包含槽位标注结果的槽位标注集。其中,在意图标注的同时,每一标注方还可以进行对待标注语料集进行分词标注,得到包含分词标注结果的分词标注集。As shown in FIG. 4, for the corpus to be labeled, the labeler can perform intent labeling, NER labeling, slot labeling, or word segmentation labeling to obtain labeling results for each labeling task. Specifically, each party may first perform intent annotation on the query sentence in the annotated corpus (according to the intent annotation document specification) to obtain an intent annotation set including the intent annotation result of the query sentence. Furthermore, the query statement can be divided into fields according to the intent labeling, and NER labeling (according to the NER labeling document specification) and slot labeling (according to the slot labeling document specification) can be performed at the same time in the divided fields to obtain NERs containing the NER labeling results. Callout set and slot callout set containing slot callout results. Among them, at the same time as intent labeling, each labeler can also perform segmentation labeling on the corpus to be labeled to obtain a segmentation labeling set including the result of the segmentation labeling.
The intent labeling set, slot labeling set, NER labeling set, or word segmentation labeling set may be stored in the storage medium of the server device, and the server device may obtain, from the storage medium, the labeling results of multiple parties for the query sentences in the corpus to be labeled.
In step 370, according to the labeling results of multiple parties for the same query sentence, query sentences with similar labeling results are filtered out from the corpus to be labeled;
Here, query sentences with similar labeling results are those for which the multi-party labeling results are identical or close. If the similarity among the multi-party labeling results is greater than a preset value, the query sentence can be regarded as one with similar labeling results; the preset value may be, for example, 80% or 90%.
In one embodiment, assuming the labeling results are intent labeling results, the server device obtains the intent labeling results of multiple parties for the query sentences in the corpus to be labeled. For each query sentence in turn, it compares the intent labeling results given by the parties and determines whether they are consistent (results whose similarity exceeds the preset value may be regarded as consistent), and then filters out from the corpus to be labeled the query sentences with consistent multi-party labeling results.
Specifically, for the same query sentence, if the labeling results of all parties are consistent, the sentence is added to the single-label labeling set. If they are inconsistent, a final adjudicator is required to review the specific circumstances of the inconsistency:
i) If more than half of the labeling parties agree, the label on which they agree is taken as the labeling result, and the sentence is added to the single-label labeling set;
ii) If the inconsistent results are distributed 1:1, it may be a multi-label case (a sample may carry multiple labels). If the reviewer confirms it is a multi-label case, the sentence is added to the multi-label labeling set;
iii) If all the parties' labeling results differ, the sentence may be a multi-label sample or a difficult sample. After review, it is added to the multi-label labeling set or the difficult sample set.
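The three adjudication rules above can be sketched as a small routing function. This is an illustrative sketch only: the function name, the return values, and the tie-detection condition are assumptions, and the human review step for cases ii) and iii) happens outside the code.

```python
from collections import Counter

def adjudicate(labels):
    """Route one query sentence's multi-party labels to a labeling set.

    Returns ("single", label) when a strict majority agrees,
    ("multi", labels) for a 1:1 tie (candidate multi-label case,
    pending reviewer confirmation), and ("difficult", labels) when
    all parties disagree (candidate difficult sample).
    """
    n = len(labels)
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if top_count > n / 2:
        # Rule i): more than half of the parties agree.
        return "single", top_label
    if len(counts) == 2 and len(set(counts.values())) == 1:
        # Rule ii): inconsistent results distributed 1:1.
        return "multi", sorted(counts)
    # Rule iii): results are all over the place.
    return "difficult", sorted(counts)
```

With four annotators, a 3:1 split still lands in the single-label set, while a 2:2 split is flagged for multi-label review.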
In this way, a labeling task goes through the labeling process and ultimately produces three labeling sets: a single-label labeling set, a multi-label labeling set, and a difficult sample set. For a query sentence in the single-label labeling set, the labeling results of the parties can be considered identical. The single-label labeling set can be regarded as a reliable labeling set and can serve as the training set, test set, and so on, for the intent recognition model.
Similarly, when the labeling results are NER labeling results, slot labeling results, or word segmentation labeling results, query sentences with identical multi-party labeling results can likewise be filtered out.
In step 390, a corpus labeling set is generated from the query sentences with similar labeling results and their corresponding labeling results.
The corpus labeling set contains query sentences and their corresponding labeling results, where each query sentence is one of the sentences with similar multi-party labeling results filtered out in step 370. The server device uses the filtered query sentences and their labeling results to generate a corpus labeling set composed of those sentences and results. In other words, since the query sentences in the corpus labeling set have similar multi-party labeling results, the likelihood of disagreement over their labeling results is small and the accuracy of the labeling results is high; using this high-accuracy corpus labeling set as a training set to train data analysis models such as intent recognition models can therefore improve the accuracy of those models.
As shown in FIG. 5, multiple labeling parties label the corpus to be labeled. The server device obtains labeling result 1 from labeling party 1, labeling result 2 from labeling party 2, labeling result 3 from labeling party 3, and labeling result 4 from labeling party 4 for the query sentences in the corpus. The server device merges labeling results 1 through 4, filters out the query sentences for which all four labeling results are consistent, and adds them to the single-label labeling set. If more than half of the labeling results for a query sentence are consistent, the multi-party labeling results can also be considered the same: the labeling result agreed on by the majority is taken as the labeling result of that sentence, and the sentence is added to the single-label labeling set. The single-label labeling set serves as the corpus labeling set and can be merged with the already labeled corpus to form training and test sets.
As shown in FIG. 5, if labeling results 1, 2, 3, and 4 of certain query sentences are inconsistent and the inconsistent results are distributed 1:1, those sentences may be multi-label cases; once a reviewer confirms this, they are added to the multi-label labeling set. If labeling results 1, 2, 3, and 4 of certain query sentences are all inconsistent, the sentences may be multi-label samples or difficult samples, and they can be added to the multi-label labeling set or the difficult sample set.
It should be explained that the corpus labeling set is the single-label labeling set, containing the query sentences with identical multi-party labeling results. That is, there is no disagreement over the labeling results of the query sentences in the corpus labeling set, and the labeling results are highly accurate; the corpus labeling set can therefore be used as a high-accuracy training set or test set for training the data analysis model.
For example, if the labeling results are intent labeling results, the corpus labeling set includes query sentences with identical intent labeling results together with those results, and can serve as a training set for the intent recognition model. If the labeling results are NER labeling results, the corpus labeling set includes query sentences with identical NER labeling results together with those results, and can serve as a training set for the named entity recognition model. Similarly, if the labeling results are slot labeling results, the corpus labeling set can serve as a training set for the slot labeling model; if they are word segmentation labeling results, it can serve as a training set for the word segmentation model.
In the technical solution provided by the above exemplary embodiments of this application, a corpus to be labeled is obtained from the query log, the labeling results of multiple users for the query sentences in the corpus are obtained, query sentences with identical labeling results are filtered out, and those sentences together with their labeling results constitute the corpus labeling set. Since the query sentences in the corpus labeling set have identical or similar multi-party labeling results, the likelihood of disagreement over their labeling results is small and the accuracy of the labeling results is high; using this high-accuracy corpus labeling set as a training set for data analysis models such as intent recognition models can improve the accuracy of those models.
As needed, the corpus labeling set can be added incrementally to the existing training set, the data analysis model retrained, and the performance of the retrained model tested on the same test set. The performance improvement that each newly added corpus labeling set brings to the model reflects the quality and value of that set.
Take the intent corpus labeling set as an example. The performance of the intent recognition model on the test set is used as the baseline. Each batch of the corpus labeling set obtained is then added to the model training set, and the performance indicators of the model trained after each batch is added are recorded. The curve in FIG. 6 records the performance of the model trained after each batch of the corpus labeling set is added; the sixth batch (s6) yields a clear performance gain on the trained model, so that batch can be selected and added to the training data.
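The batch-selection logic described above can be sketched as follows. The model training itself is omitted: the sketch only takes the recorded test-set accuracy after each batch is added and keeps the batches whose gain over the previous measurement exceeds a threshold. The accuracy figures and the `min_gain` threshold are assumptions for illustration.

```python
def pick_valuable_batches(baseline_acc, batch_accs, min_gain=0.01):
    """Select batches of the corpus labeling set worth adding.

    baseline_acc: test-set accuracy of the model before any new batch.
    batch_accs:   list of (batch_name, accuracy_after_adding_batch),
                  in the order the batches were added incrementally.
    Keeps batches whose accuracy gain over the previous model is at
    least min_gain (an assumed cutoff).
    """
    kept, prev = [], baseline_acc
    for name, acc in batch_accs:
        if acc - prev >= min_gain:
            kept.append(name)  # clear performance gain, like s6 in FIG. 6
        prev = acc
    return kept
```

In a run mirroring FIG. 6, only the batch with a visible jump in accuracy is selected.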
In addition, if the labeling parties are human annotators, annotation tasks can be staggered within each cycle to prevent annotators from copying one another's work during labeling, as shown in Table 1 below.
Table 1. Staggered annotation task schedule
The table above takes four annotators as an example. To prevent the annotators from referring to one another's labeling results, the content each annotator labels on the same day differs; labeling can be scheduled according to the plan shown in Table 1, with results collected every five days as one cycle and the consistent and inconsistent labeling results among the annotators counted.
In an exemplary embodiment, as shown in FIG. 7, the above step 330 specifically includes:
In step 331, query sentences in the query log that do not satisfy preset conditions are removed;
Query sentences that do not satisfy the preset conditions may take one or more of the following forms: sentences containing useless or stop characters, meaningless sentences, sentences that are too long or too short, and duplicate sentences. Removing them avoids labeling these worthless query sentences later, which would both increase the workload and reduce the accuracy of the corpus labeling set.
In step 332, the query sentences remaining in the query log are input into multiple pre-built label prediction models, and the label prediction results of the models for the same query sentence are output; the multiple label prediction models are trained on different training sample sets;
Specifically, the label prediction model may be an intent recognition model for identifying the intent of a query sentence, in which case the label prediction result is an intent recognition result. A label prediction model can be trained on a large number of query sentences with known intent labeling results (i.e., a training sample set), and the multiple label prediction models are trained on different training samples. For example, all query sentences with known intent labeling results can be divided into 4 batches, each batch used to train a corresponding intent recognition model, thereby yielding 4 intent recognition models.
After removing the non-conforming query sentences described above, the query sentences remaining in the query log are input into the 4 intent recognition models, and the intent recognition results of the 4 models for each query sentence are output.
It should be noted that, depending on the labeling task, the label prediction model may also be a named entity recognition model, a slot labeling model, or a word segmentation model, trained respectively on large numbers of query sentences with known NER labeling results, known slot labeling results, or known word segmentation labeling results. Correspondingly, the label prediction result may be a named entity recognition result, a slot labeling result, or a word segmentation result. The construction of label prediction models belongs to the prior art and is not repeated here.
In step 333, according to the label prediction results of the multiple models for the same query sentence, the query sentences with inconsistent label prediction results are filtered out from the remaining query sentences to obtain the corpus to be labeled.
Because boundary sample points matter greatly for the decision boundary of a trained model, finding samples with multiple intents or non-trivial probability distributions over several classes, and adding such samples to the training set, improves model performance more than adding samples the model can already classify accurately.
In this application, based on the label prediction results of the multiple label prediction models for the same query sentence, the query sentences with inconsistent prediction results are filtered out from the remaining query sentences in the query log. That is, the models recognize these query sentences with low accuracy, so the sentences can be regarded as boundary sample points; adding them to the corpus to be labeled for model training can improve the accuracy of the model.
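The disagreement-based selection of steps 332 and 333 can be sketched as follows. The trained models are represented here by simple callables mapping a query sentence to a predicted label; the function name and the stand-in models in the usage example are illustrative assumptions.

```python
def select_boundary_samples(queries, models):
    """Keep the query sentences on which the ensemble of label
    prediction models disagrees; these are treated as boundary
    samples worth sending out for annotation.

    models: list of callables, each mapping query -> predicted label
            (stand-ins for the trained intent recognition models).
    """
    selected = []
    for q in queries:
        predictions = {m(q) for m in models}
        if len(predictions) > 1:
            # At least two models predict different labels: the
            # sentence joins the corpus to be labeled.
            selected.append(q)
    return selected
```

Queries on which every model agrees are dropped, since labeling them adds little information to the training set.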
In an exemplary embodiment, the above step 331 may include the following step:
classifying the query sentences recorded in the query log by a pre-built classifier, and removing the meaningless query sentences identified by the classification.
A meaningless query sentence is one without a concrete intent, possibly entered by the user in error or at random. The classifier, i.e., a classification model, identifies which query sentences in the query log are meaningful and which are meaningless. It can be trained on a large number of meaningful and meaningless query sentences; for example, the parameters of a logistic regression model can be trained on such data to obtain the classifier. "Classifier" is a general term in data mining for methods that classify samples, and classifiers can be constructed with algorithms such as decision trees, logistic regression, naive Bayes, and neural networks.
Specifically, the query sentences in the query log can be input into the trained classifier, which outputs a meaningful/meaningless judgment for each, so that the meaningless query sentences can be removed from the query log. Query sentences containing useless or stop characters can also be removed according to a configured library of useless or stop characters.
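A minimal sketch of this pre-filtering step is shown below. The trained classifier is replaced by a trivial stand-in heuristic (queries with almost no character variety are treated as meaningless), and the stop-character library is a small assumed set; in practice, as the text notes, a logistic regression or similar model would be used.

```python
STOP_CHARS = set("@#~%")  # assumed useless/stop character library


def is_meaningless(query):
    # Stand-in for the trained classifier (logistic regression,
    # naive Bayes, etc.): flags queries made of a single repeated
    # character as having no concrete intent.
    return len(set(query)) <= 1


def prefilter(queries):
    """Remove queries containing stop characters, then queries the
    (stand-in) classifier judges meaningless."""
    kept = []
    for q in queries:
        if any(c in STOP_CHARS for c in q):
            continue  # contains useless / stop characters
        if is_meaningless(q):
            continue  # classifier says: no concrete intent
        kept.append(q)
    return kept
```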
In another exemplary embodiment, the above step 331 may further include the following step:
removing, according to the set of labeled query sentences, the query sentences in the query log that are already labeled and the query sentences similar to labeled ones.
The set of labeled query sentences is the set of query sentences with known labeling results, and may be a corpus labeling set that has already been generated. According to the query sentences contained in this set, the sentences belonging to it can be removed from the query log. A labeled query sentence is simply a sentence in this set; query sentences similar to labeled ones can be found by computing the similarity between sentences and identifying those in the query log with high similarity to a labeled sentence, which are then removed from the query log.
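The similarity-based removal can be sketched as below. The patent does not specify a similarity measure, so a character-level Jaccard similarity and a 0.8 cutoff are assumed here purely for illustration.

```python
def char_jaccard(a, b):
    """Character-level Jaccard similarity between two query sentences
    (an assumed measure; the patent leaves the metric unspecified)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def remove_near_labeled(queries, labeled, threshold=0.8):
    """Drop queries that are already in the labeled set, or whose
    similarity to any labeled query exceeds the assumed threshold."""
    labeled_set = set(labeled)
    kept = []
    for q in queries:
        if q in labeled_set:
            continue  # exact duplicate of a labeled sentence
        if any(char_jaccard(q, l) > threshold for l in labeled):
            continue  # too close to an already labeled sentence
        kept.append(q)
    return kept
```

A brute-force scan over the labeled set is shown for clarity; at query-log scale an inverted index or minhashing would typically replace the inner loop.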
That is, the query sentences remaining in the above step 332 may be those left in the query log after removing meaningless sentences, sentences containing useless or stop characters, labeled sentences, and sentences similar to labeled sentences.
In another exemplary embodiment, the above step 331 may further include the following step:
removing from the query log the query sentences containing only a single entity word, the query sentences whose length exceeds a preset number of characters, and duplicate query sentences.
An entity word is the name of a real, concrete thing, such as a song title or a singer name. A query sentence containing only one entity word makes it difficult to distinguish intent, word segmentation, and so on, and is therefore unsuitable for joining the corpus labeling set and participating in modeling. A query sentence whose length exceeds the preset number of characters is a long sentence; such sentences are difficult to label, and their length would undoubtedly increase the amount of computation during modeling, so they are likewise unsuitable. Similarly, duplicate query sentences in the query log need not all join the corpus labeling set, so duplicates are removed; for example, if three copies of a query sentence exist, two are removed and only one is kept.
In summary, the query sentences remaining in the above step 332 may also be those finally left in the query log after removing the sentences containing only a single entity word, the sentences longer than the preset number of characters, and the duplicate sentences.
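These three filtering rules can be combined into one pass over the log, as sketched below. The entity-word dictionary and the 30-character length cap are assumed inputs for illustration; the patent only says the cap is "a preset number of characters".

```python
def clean_query_log(queries, entity_words, max_len=30):
    """Remove single-entity-word queries, over-long queries, and
    duplicates (keeping the first copy), per the rules above.

    entity_words: set of known entity words (song titles, singer
                  names, ...) -- an assumed dictionary.
    max_len:      assumed preset character limit.
    """
    seen, kept = set(), []
    for q in queries:
        if q in seen:
            continue  # duplicate: keep only one copy
        seen.add(q)
        if q in entity_words:
            continue  # the whole query is a single entity word
        if len(q) > max_len:
            continue  # too long to label and model cheaply
        kept.append(q)
    return kept
```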
As shown in FIG. 8, newly added query sentences are preprocessed: meaningless sentences are removed, useless/stop characters are removed, single-entity-word sentences are removed, and over-long and duplicate sentences are removed. In addition, according to the set of labeled query sentences, the already labeled sentences and the sentences highly similar to them are removed. Further, through the above steps 332 and 333, the query sentences with inconsistent label prediction results are filtered out, and these filtered sentences constitute the corpus to be labeled. Then, according to the multi-party labeling results for the corpus to be labeled, the query sentences with similar labeling results can be filtered out to generate the corpus labeling set. The corpus labeling set can in turn be added to the set of labeled query sentences and participate in model training together with it.
In an exemplary embodiment, as shown in FIG. 9, the above step 350 specifically includes:
In step 351, a labeling task for the corpus to be labeled is distributed to multiple parties, and the distribution of the labeling task triggers the parties to execute the task in parallel;
The labeling task may be an intent labeling task, an NER labeling task, a slot labeling task, or a word segmentation labeling task. For example, the multiple parties may be multiple labeling devices: the server device sends the labeling task carrying the corpus to be labeled to the devices, triggering them to execute it in parallel. It should be noted that a labeling device may be an intelligent labeling device trained in advance on a large amount of sample data; because each device is trained on a different sample data set, the labeling accuracy of each device differs.
In one embodiment, the server device may send the labeling task carrying the corpus to be labeled to the terminal devices of multiple annotators. An annotator's terminal device can display the corpus to be labeled and prompt the labeling task. The user can perform intent labeling, NER labeling, slot labeling, and word segmentation labeling by clicking options or marking spans; the annotators' terminal devices obtain the labeling results from these click or marking operations and complete the labeling task for the corpus to be labeled.
In an exemplary embodiment, the distribution of the labeling task triggering multiple parties to execute it in parallel specifically includes: the distribution of the labeling task triggers each party, in parallel, to input the corpus to be labeled into its own configured labeling model and output its labeling results for the corpus; the labeling models configured by the parties are trained on different training sample sets.
That is, the multiple parties here may represent multiple labeling devices or multiple labeling programs. Each labeling party is configured with a labeling model; since the parties' models are trained on different training sample sets, the devices or programs have different labeling accuracies. It should be noted that the training sample sets used by the parties' labeling models in this embodiment also differ from those used by the label prediction models above. For example, all samples can be divided into 10 training sample sets, each trained into a corresponding model; some of the 10 models then serve as label prediction models and the others as labeling models. The multiple label prediction models filter out the query sentences with inconsistent label prediction results to obtain the corpus to be labeled, after which the parties' labeling models compute the labeling results for the query sentences in that corpus, yielding the multi-party labeling results.
Suppose the multiple parties are multiple labeling programs deployed in the server device; the programs can execute the following steps in parallel: input the corpus to be labeled into a pre-built labeling model and output the labeling results for the corpus. The labeling models can be constructed in the same way as the label prediction models.
In step 352, the labeling results returned by the parties executing the labeling task in parallel are received.
The multiple labeling devices, or the terminal devices of the multiple annotators, execute the labeling task in parallel, obtain the labeling results, and return them to the server device, which receives the results. Corresponding to the labeling task, the labeling result may be an intent labeling result, an NER labeling result, a slot labeling result, or a word segmentation labeling result.
In an exemplary embodiment, the corpus to be labeled includes multiple embedded check sentences ("buried points") with known label information. An embedded check sentence is a query sentence whose accurate labeling result is already known; to distinguish it from the parties' labeling results for that sentence, its accurate labeling result is called its label information. As shown in FIG. 10, the above step 370 specifically includes:
In step 371, according to the parties' labeling results for the multiple embedded check sentences, whether those labeling results are consistent with the corresponding label information is compared, and the accuracy rate of each party's labeling results is calculated;
It should be noted that when filtering the corpus to be labeled according to the multi-party labeling results, the accuracy rate of each labeling party's results must first be determined, so that the results provided by low-accuracy parties can be removed.
The accuracy rate of a party's labeling results refers to how accurately that party labels the multiple embedded check sentences; computing this accuracy is used to evaluate the current party's labeling quality. During labeling, each party's accuracy is verified via these embedded check sentences. For example, 5% of the query sentences on which multiple annotators agreed in the previous labeled batch can be extracted as the embedded check sentences with known label information for the current batch. For each labeling party, its labeling results for the embedded check sentences are compared with the known label information, and the proportion of results consistent with the label information is computed as that party's labeling accuracy rate.
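The accuracy computation over the embedded check sentences reduces to a simple proportion, sketched below; the dict-based representation of the annotations and gold labels is an illustrative assumption.

```python
def annotator_accuracy(annotations, gold):
    """Accuracy of one labeling party on the embedded check
    ('buried point') sentences.

    annotations: dict mapping sentence -> label given by the party
                 (may contain non-check sentences, which are ignored).
    gold:        dict mapping check sentence -> known label information.
    Returns the fraction of check sentences labeled consistently with
    the known label information.
    """
    hits = sum(1 for s, label in gold.items() if annotations.get(s) == label)
    return hits / len(gold)
```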
In step 372, according to the accuracy rates of the parties' labeling results, the sources of labeling results whose accuracy does not meet the standard are removed from the multi-party sources.
Specifically, a threshold can be set. Based on each annotating party's accuracy rate, any party whose accuracy falls below the threshold is considered to have provided substandard annotation results, and the annotation results provided by such parties can be deleted.
Alternatively, all annotating parties can be ranked from high to low by their accuracy rates; the lowest-ranked parties are considered substandard, and the annotation results they provided can be removed.
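Both selection strategies (an absolute threshold, or dropping the lowest-ranked parties) can be sketched in one helper. The names and defaults below are illustrative assumptions, not values from the patent:

```python
def keep_annotators(accuracies, threshold=None, drop_bottom=0):
    """Return the annotating parties whose accuracy meets the bar.

    accuracies:  dict mapping party -> accuracy on the buried-point sentences.
    threshold:   if given, drop parties whose accuracy is below it.
    drop_bottom: additionally drop this many lowest-ranked parties.
    """
    kept = dict(accuracies)
    if threshold is not None:
        kept = {p: a for p, a in kept.items() if a >= threshold}
    ranked = sorted(kept, key=kept.get, reverse=True)  # high-to-low ranking
    if drop_bottom:
        ranked = ranked[:max(0, len(ranked) - drop_bottom)]
    return set(ranked)

acc = {"party_a": 0.95, "party_b": 0.62, "party_c": 0.88}
print(keep_annotators(acc, threshold=0.8))   # party_b removed (below threshold)
print(keep_annotators(acc, drop_bottom=1))   # party_b removed (lowest-ranked)
```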
In step 373, according to the annotation results of the remaining sources, query sentences whose annotation results are similar across multiple sources are filtered out of the corpus to be annotated.
The annotation results of the remaining sources are the annotations of the corpus to be annotated given by the parties that remain after the substandard parties' results have been deleted. In other words, when query sentences with similar multi-party annotations are subsequently selected from the corpus to be annotated, the annotation results supplied by the substandard parties are no longer taken into account. Based on the annotations of the remaining, higher-accuracy parties, the query sentences that multiple parties annotated similarly are filtered out of the corpus to be annotated.
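Step 373 can then be sketched as a consensus vote over the remaining sources. A hypothetical Python illustration; the data layout and the `min_agree` parameter are assumptions, not details from the patent:

```python
from collections import Counter

def consensus_queries(labels_by_source, kept_sources, min_agree=2):
    """From the remaining (kept) sources, select the queries on which at
    least min_agree sources gave the same label; return {query: label}."""
    all_queries = set()
    for source in kept_sources:
        all_queries.update(labels_by_source[source])
    selected = {}
    for query in all_queries:
        votes = Counter(labels_by_source[s][query]
                        for s in kept_sources if query in labels_by_source[s])
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            selected[query] = label
    return selected

labels = {
    "party_a": {"play a song": "music", "hello": "chat"},
    "party_c": {"play a song": "music", "hello": "weather"},
}
# "play a song" is labeled consistently; "hello" is not.
print(consensus_queries(labels, ["party_a", "party_c"]))  # {'play a song': 'music'}
```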
In an exemplary embodiment, as shown in FIG. 11, the method for generating a corpus annotation set provided in the present application further includes:
In step 1101, according to the multiple parties' annotation results for the same query sentence, query sentences with inconsistent annotation results are filtered out of the corpus to be annotated;
It should be noted that boundary samples are very helpful for optimizing a model and for drawing clearer classification boundaries. Such boundary samples can be selected from the samples on which the parties disagree. Specifically, the server device can filter query sentences with inconsistent annotations out of the corpus to be annotated, based on the multiple parties' annotation results for each query sentence.
In step 1102, multi-label query sentences are obtained from the query sentences with inconsistent annotation results, yielding boundary sample points for optimizing a data analysis model.
For the query sentences on which multiple users' annotations disagree, reviewers can identify multi-label query sentences (that is, query sentences that legitimately admit several annotation results). Such multi-label query sentences can be treated as boundary sample points. These query sentences are difficult to recognize, so if the model can accurately identify their intent, slots, and so on, its accuracy improves substantially. The data analysis model may be an intent recognition model, a named entity recognition model, a slot labeling model, a word segmentation model, or the like. Optimizing a data analysis model with such query sentences improves its recognition accuracy.
For example, consider "I'm in a bad mood today, please play me a soothing song." The intent of this query includes both a chit-chat intent and a music-on-demand intent; the sample lies on the boundary between intent classes and can help the model learn accurate intent boundaries.
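The candidate-selection portion of steps 1101 and 1102 can be sketched as collecting queries that received disagreeing labels across parties; the reviewer confirmation itself remains manual. Names and labels below are illustrative assumptions:

```python
def boundary_candidates(labels_by_source):
    """Queries that received more than one distinct label across the
    annotating parties: candidates for multi-label boundary samples."""
    labels_per_query = {}
    for labels in labels_by_source.values():
        for query, label in labels.items():
            labels_per_query.setdefault(query, set()).add(label)
    return {q: ls for q, ls in labels_per_query.items() if len(ls) > 1}

labels = {
    "party_a": {"play a soothing song, I'm in a bad mood": "chat"},
    "party_b": {"play a soothing song, I'm in a bad mood": "music_on_demand"},
}
for query, labels_found in boundary_candidates(labels).items():
    print(query, sorted(labels_found))  # both a chat and a music label
```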
The following is an apparatus embodiment of the present application, which can be used to perform the embodiments of the method for generating a corpus annotation set performed by the server device 110 described above. For details not disclosed in the apparatus embodiment, refer to the embodiments of the method for generating a corpus annotation set.
Fig. 12 is a block diagram of an apparatus for generating a corpus annotation set according to an exemplary embodiment. The apparatus can be deployed in the server device 110 of the implementation environment shown in FIG. 1, and performs all or part of the steps of the method for generating a corpus annotation set shown in any of FIG. 3 and FIG. 7 to FIG. 11. As shown in FIG. 12, the apparatus includes, but is not limited to: a log acquisition module 1210, a corpus obtaining module 1230, a result acquisition module 1250, a sentence filtering module 1270, and an annotation set generation module 1290.
The log acquisition module 1210 is configured to acquire a query log, where the query log includes query sentences;
The corpus obtaining module 1230 is configured to extract the query sentences to be annotated from the query log, obtaining a corpus to be annotated;
The result acquisition module 1250 is configured to acquire the annotation results given by multiple parties to the query sentences in the corpus to be annotated;
The sentence filtering module 1270 is configured to filter, from the corpus to be annotated, the query sentences whose annotation results are similar, according to the multiple parties' annotation results for each query sentence;
The annotation set generation module 1290 is configured to generate a corpus annotation set from the query sentences with similar annotation results and the corresponding annotation results.
For the implementation of the functions of each module in the above apparatus, see the implementation of the corresponding steps in the method for generating a corpus annotation set; the details are not repeated here.
The log acquisition module 1210 may be, for example, a physical component in FIG. 2, such as the wired or wireless network interface 250.
The corpus obtaining module 1230, the result acquisition module 1250, the sentence filtering module 1270, and the annotation set generation module 1290 may also be functional modules configured to perform the corresponding steps of the above method for generating a corpus annotation set. It should be understood that these modules can be implemented in hardware, in software, or in a combination of both. When implemented in hardware, these modules can be implemented as one or more hardware modules, such as one or more application-specific integrated circuits. When implemented in software, these modules can be implemented as one or more computer programs executed on one or more processors, such as a program stored in the memory 232 and executed by the central processing unit 222 of FIG. 2.
In an exemplary embodiment, as shown in FIG. 13, the corpus obtaining module 1230 includes:
A sentence removal unit 1231, configured to remove the query sentences in the query log that do not satisfy a preset condition;
A label prediction unit 1232, configured to input the query sentences remaining in the query log into multiple pre-built label prediction models, and to output the multiple models' label prediction results for each query sentence; the multiple label prediction models are trained on different training sample sets;
A sentence extraction unit 1233, configured to filter, from the remaining query sentences, the query sentences for which the multiple label prediction models produce inconsistent label prediction results, obtaining the corpus to be annotated.
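The pipeline of units 1232 and 1233 is essentially query-by-committee selection: sentences on which models trained from different sample sets disagree are the ones worth sending out for annotation. A toy sketch, with keyword heuristics standing in for trained label prediction models (all names are assumed):

```python
def disagreement_queries(queries, models):
    """Keep the queries for which the committee of label prediction
    models produces inconsistent predictions."""
    return [q for q in queries if len({model(q) for model in models}) > 1]

# Stand-ins for models trained on different training sample sets.
model_1 = lambda q: "music" if "song" in q else "other"
model_2 = lambda q: "music" if ("song" in q or "play" in q) else "other"

queries = ["play some jazz", "play a song", "weather today"]
print(disagreement_queries(queries, [model_1, model_2]))  # ['play some jazz']
```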
In an exemplary embodiment, the sentence removal unit 1231 includes:
A classification removal subunit, configured to classify the query sentences recorded in the query log by means of a pre-built classifier, and to remove the meaningless query sentences identified by the classification.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
A first removal subunit, configured to remove, according to a set of already annotated query sentences, the query sentences in the query log that have already been annotated as well as the query sentences similar to them.
In an exemplary embodiment, the sentence removal unit 1231 further includes:
A second removal subunit, configured to remove the query sentences in the query log that contain only a single entity word, the query sentences whose length exceeds a preset number of characters, and duplicate query sentences.
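The second removal subunit's three rules can be sketched directly. The character limit, the entity list, and the single-entity-word check below are illustrative assumptions:

```python
def prefilter_queries(queries, entity_words, max_chars=50):
    """Drop queries that are a single entity word, exceed the preset
    character count, or duplicate an earlier query."""
    seen, kept = set(), []
    for query in queries:
        if query in seen:             # duplicate query sentence
            continue
        seen.add(query)
        if query in entity_words:     # query is only a single entity word
            continue
        if len(query) > max_chars:    # sentence longer than the preset limit
            continue
        kept.append(query)
    return kept

queries = ["Beijing", "weather in Beijing", "weather in Beijing", "x" * 80]
print(prefilter_queries(queries, entity_words={"Beijing"}))
# ['weather in Beijing']
```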
In an exemplary embodiment, as shown in FIG. 14, the result acquisition module 1250 includes:
A task dispatch unit 1251, configured to dispatch to multiple parties the task of annotating the corpus to be annotated, where the dispatch of the annotation task triggers the multiple parties to execute the annotation task in parallel;
A result receiving unit 1252, configured to receive the annotation results returned by the multiple parties executing the annotation task in parallel.
The dispatch of the annotation task, which triggers the multiple parties to execute the annotation task in parallel, includes:
The dispatch of the annotation task triggers each party, in parallel, to input the corpus to be annotated into its own configured annotation model and to output its annotation results for the corpus to be annotated; the annotation models configured by the multiple parties are trained on different training sample sets.
In an exemplary embodiment, the corpus to be annotated includes multiple buried-point sentences with known label information; as shown in FIG. 15, the sentence filtering module 1270 includes:
An accuracy calculation unit 1271, configured to compare, according to the multiple parties' annotation results for the buried-point sentences, whether those annotation results are consistent with the corresponding label information, and to compute the accuracy rate of each party's annotation results;
A source removal unit 1272, configured to remove, according to the accuracy rates of the multi-party annotation results, the annotation result sources whose accuracy rate does not meet the standard;
A sentence filtering unit 1273, configured to filter, from the corpus to be annotated, the query sentences whose annotation results are similar across the remaining sources, according to the annotation results of the remaining sources.
The present application further provides an electronic device, which can be deployed in the server device 110 of the implementation environment shown in FIG. 1 and performs all or part of the steps of the method for generating a corpus annotation set shown in any of FIG. 3 and FIG. 7 to FIG. 11. The electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method for generating a corpus annotation set described in the above exemplary embodiments.
The specific manner in which the processor of the electronic device in this embodiment performs its operations has been described in detail in the embodiments of the method for generating a corpus annotation set, and is not elaborated here.
In an exemplary embodiment, a storage medium is also provided. The storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium containing instructions. The storage medium stores a computer program, which can be executed by the central processing unit 222 of the server device 200 to carry out the above method for generating a corpus annotation set.
It should be understood that the present application is not limited to the precise constructions described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the application is limited only by the appended claims.
Claims (17)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811048957.8 | 2018-09-10 | ||
| CN201811048957.8A CN110209764B (en) | 2018-09-10 | 2018-09-10 | Corpus annotation set generation method and device, electronic equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020052405A1 true WO2020052405A1 (en) | 2020-03-19 |
Family
ID=67779909
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2019/100823 Ceased WO2020052405A1 (en) | 2018-09-10 | 2019-08-15 | Corpus annotation set generation method and apparatus, electronic device, and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN110209764B (en) |
| WO (1) | WO2020052405A1 (en) |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111611797A (en) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Prediction data labeling method, device and equipment based on Albert model |
| CN111629267A (en) * | 2020-04-30 | 2020-09-04 | 腾讯科技(深圳)有限公司 | Audio labeling method, device, equipment and computer readable storage medium |
| CN111651988A (en) * | 2020-06-03 | 2020-09-11 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for training a model |
| CN112052356A (en) * | 2020-08-14 | 2020-12-08 | 腾讯科技(深圳)有限公司 | Multimedia classification method, apparatus and computer-readable storage medium |
| CN112541070A (en) * | 2020-12-25 | 2021-03-23 | 北京百度网讯科技有限公司 | Method and device for excavating slot position updating corpus, electronic equipment and storage medium |
| CN112686022A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Method and device for detecting illegal corpus, computer equipment and storage medium |
| CN112700763A (en) * | 2020-12-26 | 2021-04-23 | 科大讯飞股份有限公司 | Voice annotation quality evaluation method, device, equipment and storage medium |
| CN113255879A (en) * | 2021-01-13 | 2021-08-13 | 深延科技(北京)有限公司 | Deep learning labeling method, system, computer equipment and storage medium |
| CN113569546A (en) * | 2021-06-16 | 2021-10-29 | 上海淇玥信息技术有限公司 | Intent labeling method, device and electronic device |
| CN113642329A (en) * | 2020-04-27 | 2021-11-12 | 阿里巴巴集团控股有限公司 | Method and device for establishing term identification model, term identification method and device |
| CN113722289A (en) * | 2021-08-09 | 2021-11-30 | 杭萧钢构股份有限公司 | Method, device, electronic equipment and medium for constructing data service |
| CN114025216A (en) * | 2020-04-30 | 2022-02-08 | 网易(杭州)网络有限公司 | Media material processing method, device, server and storage medium |
| CN114783424A (en) * | 2022-03-21 | 2022-07-22 | 北京云迹科技股份有限公司 | Text corpus screening method, device, equipment and storage medium |
Families Citing this family (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110675862A (en) * | 2019-09-25 | 2020-01-10 | 招商局金融科技有限公司 | Corpus acquisition method, electronic device and storage medium |
| CN110852109B (en) * | 2019-11-11 | 2024-11-22 | 腾讯科技(深圳)有限公司 | Corpus generation method, corpus generation device, and storage medium |
| CN111177412B (en) * | 2019-12-30 | 2023-03-31 | 成都信息工程大学 | Public logo bilingual parallel corpus system |
| CN111160044A (en) * | 2019-12-31 | 2020-05-15 | 出门问问信息科技有限公司 | Text-to-speech conversion method and device, terminal and computer readable storage medium |
| CN111179904B (en) * | 2019-12-31 | 2022-12-09 | 出门问问创新科技有限公司 | Mixed text-to-speech conversion method and device, terminal and computer readable storage medium |
| CN111259134B (en) * | 2020-01-19 | 2023-08-08 | 出门问问信息科技有限公司 | Entity identification method, equipment and computer readable storage medium |
| CN113743117B (en) * | 2020-05-29 | 2024-04-09 | 华为技术有限公司 | Method and apparatus for entity annotation |
| CN111785272B (en) * | 2020-06-16 | 2021-06-11 | 杭州云嘉云计算有限公司 | Online labeling method and system |
| CN114078470B (en) * | 2020-08-17 | 2025-05-09 | 阿里巴巴集团控股有限公司 | Model processing method and device, speech recognition method and device |
| CN112163424B (en) * | 2020-09-17 | 2024-07-19 | 中国建设银行股份有限公司 | Data labeling method, device, equipment and medium |
| CN113407713B (en) * | 2020-10-22 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Corpus mining method and device based on active learning and electronic equipment |
| CN112509578A (en) * | 2020-12-10 | 2021-03-16 | 北京有竹居网络技术有限公司 | A voice information recognition method, device, electronic device and storage medium |
| CN112925910B (en) * | 2021-02-25 | 2024-10-25 | 中国平安人寿保险股份有限公司 | Auxiliary corpus labeling method, device, equipment and computer storage medium |
| CN114757267B (en) * | 2022-03-25 | 2024-06-21 | 北京爱奇艺科技有限公司 | Method, device, electronic equipment and readable storage medium for identifying noise query |
| CN115293141A (en) * | 2022-06-23 | 2022-11-04 | 中国第一汽车股份有限公司 | Method, system and electronic device for preparing vehicle-mounted normalized vocabulary |
| CN115311668A (en) * | 2022-08-22 | 2022-11-08 | 支付宝(杭州)信息技术有限公司 | Test text picture generation method and device and marking quality determination method |
| CN115438184A (en) * | 2022-09-29 | 2022-12-06 | 北京明略昭辉科技有限公司 | New word labeling method and device, electronic equipment and storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020107844A1 (en) * | 2000-12-08 | 2002-08-08 | Keon-Hoe Cha | Information generation and retrieval method based on standardized format of sentence structure and semantic structure and system using the same |
| CN101105801A (en) * | 2007-04-20 | 2008-01-16 | 清华大学 | An automatic positioning method for key network resource pages |
| CN103136210A (en) * | 2011-11-23 | 2013-06-05 | 北京百度网讯科技有限公司 | Method and device for mining query with similar requirements |
| CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
| CN105912724A (en) * | 2016-05-10 | 2016-08-31 | 黑龙江工程学院 | Time-based microblog document expansion method oriented to microblog retrieval |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102541838B (en) * | 2010-12-24 | 2015-03-11 | 日电(中国)有限公司 | Method and equipment for optimizing emotional classifier |
| CN105389340B (en) * | 2015-10-20 | 2019-02-19 | 北京云知声信息技术有限公司 | A kind of information test method and device |
| CN106021461A (en) * | 2016-05-17 | 2016-10-12 | 深圳市中润四方信息技术有限公司 | Text classification method and text classification system |
| CN106202177B (en) * | 2016-06-27 | 2017-12-15 | 腾讯科技(深圳)有限公司 | A kind of file classification method and device |
| CN106372132A (en) * | 2016-08-25 | 2017-02-01 | 北京百度网讯科技有限公司 | Artificial intelligence-based query intention prediction method and apparatus |
| US10614043B2 (en) * | 2016-09-30 | 2020-04-07 | Adobe Inc. | Document replication based on distributional semantics |
| US11176188B2 (en) * | 2017-01-11 | 2021-11-16 | Siemens Healthcare Gmbh | Visualization framework based on document representation learning |
| CN107256267B (en) * | 2017-06-19 | 2020-07-24 | 北京百度网讯科技有限公司 | Inquiry method and device |
| CN108334496B (en) * | 2018-01-30 | 2020-06-12 | 中国科学院自动化研究所 | Man-machine conversation understanding method and system for specific field and related equipment |
- 2018-09-10: CN application CN201811048957.8A filed; granted as CN110209764B (Active)
- 2019-08-15: PCT application PCT/CN2019/100823 filed, published as WO2020052405A1 (Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| CN110209764B (en) | 2023-04-07 |
| CN110209764A (en) | 2019-09-06 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19860116; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19860116; Country of ref document: EP; Kind code of ref document: A1 |