CN112860919B - Data labeling method, device, equipment and storage medium based on generation model - Google Patents
Data labeling method, device, equipment and storage medium based on generation model
- Publication number
- CN112860919B (application CN202110193454.5A)
- Authority
- CN
- China
- Prior art keywords
- labeling
- sample
- target
- text
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/381—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The application relates to the technical field of artificial intelligence and discloses a data labeling method, apparatus, device and storage medium based on a generative model. The method comprises: obtaining a text to be labeled, then splitting it, segmenting it into words and merging the segments to obtain target phrases; labeling the target phrases with a plurality of preset labeling rules to obtain label samples; obtaining the sample labeling probability of each label sample over the target phrases, iteratively updating the initial parameters of a generative model with the sample labeling probabilities to obtain a trained generative model, and outputting labeling accuracies through the trained generative model; and determining a target label sample according to the labeling accuracies. The application also relates to blockchain technology: the text to be labeled is stored in a blockchain. The data are labeled with a plurality of preset rules, and the label sample with the highest labeling accuracy is selected according to the generative model, which helps to improve the accuracy of data labeling.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a data labeling method, apparatus, device and storage medium based on a generative model.
Background
As knowledge graphs become increasingly significant in every vertical field, how to label large-scale unlabeled data has become a focus of attention in the knowledge graph field.
Although named entity recognition over annotated data currently reaches accuracies of up to 99%, manually constructing text annotation data for different fields takes an extremely long time. Moreover, annotation data from one field is not fully reusable in another: differences in business scenarios, target users and product definitions make large-scale annotation data applicable across fields difficult to obtain in the text domain. Improving the efficiency of labeling large-scale data has therefore become a challenge.
To address these problems, the existing solution obtains the word sequence corresponding to the original text and converts and maps it to obtain entity labeling vectors, then counts the amount of preset entity information in the entity labeling vectors, thereby labeling the data. However, because this labeling is obtained by converting and mapping word vectors, labeling errors occur easily, so the accuracy of labeling large-scale data is low. A method that can improve the accuracy of data labeling is therefore needed.
Disclosure of Invention
The embodiments of the present application aim to provide a data labeling method, apparatus, device and storage medium based on a generative model, so as to improve the accuracy of data labeling.
In order to solve the above technical problem, an embodiment of the present application provides a data labeling method based on a generative model, comprising:
obtaining a text to be labeled, and splitting the text to be labeled to obtain split sentences;
performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases;
obtaining a plurality of preset labeling rules, and labeling the target phrases with each of the preset labeling rules to obtain a label sample corresponding to each preset rule;
obtaining the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases, and obtaining initial parameters of a generative model according to the sample labeling probabilities and the label samples;
iteratively updating the initial parameters of the generative model with the sample labeling probabilities to obtain a trained generative model, and outputting the labeling accuracy corresponding to each label sample through the trained generative model;
and selecting the label sample with the highest labeling accuracy as the target label sample.
In order to solve the above technical problem, an embodiment of the present application provides a data labeling apparatus based on a generative model, comprising:
a text-to-be-labeled splitting module, configured to obtain a text to be labeled and split it to obtain split sentences;
a target phrase obtaining module, configured to obtain target word segments by performing word segmentation on the split sentences, and merge the target word segments to obtain target phrases;
a label sample generating module, configured to obtain a plurality of preset labeling rules and label the target phrases with each of the preset labeling rules to obtain a label sample corresponding to each preset rule;
an initial parameter generating module, configured to obtain the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases, and obtain initial parameters of a generative model according to the sample labeling probabilities and the label samples;
a labeling accuracy output module, configured to iteratively update the initial parameters of the generative model with the sample labeling probabilities to obtain a trained generative model, and output the labeling accuracy corresponding to each label sample through the trained generative model;
and a label sample selecting module, configured to select the label sample with the highest labeling accuracy as the target label sample.
In order to solve the above technical problem, the invention adopts the following technical solution: a computer device is provided, comprising one or more processors and a memory storing one or more programs which, when executed, cause the one or more processors to implement the above data labeling method based on a generative model.
In order to solve the above technical problem, the invention adopts the following technical solution: a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements any of the above data labeling methods based on a generative model.
Embodiments of the invention provide a data labeling method, apparatus, device and storage medium based on a generative model. In the embodiments of the invention, the obtained text to be labeled is split, segmented into words and merged to obtain the target phrases, which makes it convenient to label the text to be labeled according to the target phrases. A plurality of preset labeling rules are obtained, and the target phrases are labeled with each of them to obtain a label sample corresponding to each preset rule. The sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases is obtained, the initial parameters of a generative model are obtained according to the sample labeling probabilities and the label samples, the initial parameters are iteratively updated with the sample labeling probabilities to obtain a trained generative model, the labeling accuracy corresponding to each label sample is output through the trained generative model, and the label sample with the highest labeling accuracy is selected as the target label sample. Data are thus labeled with a plurality of preset rules, and the label sample with the highest labeling accuracy is selected according to the generative model, which helps to improve the accuracy of data labeling.
Drawings
In order to illustrate the solution of the present application more clearly, the drawings required for describing its embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application environment of a data labeling method based on a generative model according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a data labeling method based on a generative model according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 4 is a flowchart of still another implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 5 is a flowchart of still another implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 6 is a flowchart of still another implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 7 is a flowchart of still another implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 8 is a flowchart of still another implementation of a sub-process of the data labeling method based on a generative model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data labeling apparatus based on a generative model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description, claims and drawings of the application are intended to cover a non-exclusive inclusion. The terms "first", "second" and the like in the description, claims and drawings are used to distinguish different objects, not to describe a particular order.
Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail with reference to the drawings and embodiments.
Referring to fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a search class application, an instant messaging tool, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the data labeling method based on the generated model provided by the embodiment of the present application is generally executed by a server, and accordingly, the data labeling device based on the generated model is generally configured in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 illustrates one embodiment of a data labeling method based on a generative model.
It should be noted that, as long as substantially the same results are obtained, the method of the present invention is not limited to the flow sequence shown in FIG. 2. The method includes the following steps:
S1: obtaining a text to be labeled, and splitting the text to be labeled to obtain split sentences.
Specifically, after obtaining the text to be labeled, the server preprocesses it, for example by data cleaning, and then splits it by paragraph, sentence and the like according to the separators in the text, thereby obtaining the split sentences. The text to be labeled is the text on which data labeling needs to be performed in order to generate a labeled text.
S2: performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases.
Specifically, the text to be labeled has already been split into split sentences in the above step, and these split sentences exist in the form of short clauses. To better label them later, word segmentation is performed on the split sentences through a preset word segmentation tool to generate the target word segments; part-of-speech tagging is then performed according to the part of speech of each target word segment; and the target word segments are merged by means of dependency syntactic analysis to generate the target phrases.
It should be noted that the preset word segmentation tools include, but are not limited to: jieba, the NLPIR word segmentation system, SnowNLP and the like. Preferably, the split sentences are segmented with jieba to obtain the target word segments. Jieba can cut a sentence into the most accurate segmentation, which suits text analysis, and can scan out all the words in a sentence that can form words; it is fast, making it suitable for segmenting the split sentences in this embodiment.
Dependency syntactic analysis was first proposed by the French linguist L. Tesnière. It analyzes a sentence into a dependency syntax tree that describes the dependency relations among the words, i.e., it points out the syntactic collocation relations between words, and these collocations are related to semantics. In the embodiment of the present application, the target word segments are merged by means of dependency syntactic analysis.
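As an illustration of step S2, the following minimal sketch uses the jieba library for word segmentation and part-of-speech tagging; the sample sentence and the printed tags are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of S2: segmenting a split sentence and tagging parts of speech.
# Requires the jieba package; the input sentence is a hypothetical example.
import jieba
import jieba.posseg as pseg

split_sentence = "我吃苹果"  # hypothetical split sentence ("I eat apples")
target_segments = jieba.lcut(split_sentence)          # e.g. ['我', '吃', '苹果']
pos_segments = [(w, flag) for w, flag in pseg.lcut(split_sentence)]
print(target_segments)
print(pos_segments)  # e.g. [('我', 'r'), ('吃', 'v'), ('苹果', 'n')]
```

The merged target phrases would then be built from these tagged segments by dependency syntactic analysis, as described in steps S2A and S2B below.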
S3: obtaining a plurality of preset labeling rules, and labeling the target phrases with each of the preset labeling rules to obtain a label sample corresponding to each preset rule.
Specifically, in the embodiment of the present application, after the text to be labeled has been split, segmented and merged, the target phrases are labeled through a plurality of labeling rules, and a generative model then determines the accuracy with which each rule labels the data, so that the label sample with the highest accuracy can be selected and the labeling of the data completed. The server obtains a plurality of preset labeling rules and labels the target phrases with the labels corresponding to each preset labeling rule, so that the target phrases generate a label sample corresponding to each preset rule.
It should be noted that the plurality of preset labeling rules include, but are not limited to: regular-expression recognition, remote knowledge-base matching and external data matching. Regular-expression recognition presets different SQL query statements and matches them to the corresponding labeling rules, so that different rules can label the target phrases. Remote knowledge-base matching matches the target phrases one by one against the knowledge base of a peripheral device, thereby labeling the target phrases. External data matching matches the target phrases against external data provided, for example, by a crowdsourcing platform, thereby completing the labeling. Preferably, the target phrases are labeled with several different labeling rules, so that the accuracy of each labeling mode can be screened, which improves the accuracy of data labeling.
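A minimal sketch of the three rule families in step S3 is given below; every rule body, dictionary and label name is an illustrative assumption, since the patent only names the rule families. Each rule returns a label for a target phrase, or None when it cannot decide, so each rule produces one label sample.

```python
# Minimal sketch of S3: one labeling function per preset labeling rule.
import re

KNOWLEDGE_BASE = {"苹果": "FRUIT"}            # hypothetical remote knowledge base
EXTERNAL_DATA = {"我吃苹果": "EATING_EVENT"}   # hypothetical crowdsourced data

def rule_regex(phrase):            # regular-expression recognition
    return "FRUIT" if re.search("苹果", phrase) else None

def rule_knowledge_base(phrase):   # remote knowledge-base matching
    return KNOWLEDGE_BASE.get(phrase)

def rule_external(phrase):         # external data matching
    return EXTERNAL_DATA.get(phrase)

rules = [rule_regex, rule_knowledge_base, rule_external]
target_phrases = ["苹果", "我吃苹果", "喝水"]
# one label sample per rule: its labels over all target phrases
label_samples = {r.__name__: [r(p) for p in target_phrases] for r in rules}
```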
S4: obtaining the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases, and obtaining initial parameters of the generative model according to the sample labeling probabilities and the label samples.
Specifically, the sample labeling probability is the coverage rate, over the target phrases, of the sample labels obtained with a preset labeling rule, and it is used to iteratively update the parameters of the generative model. Because each preset labeling rule has a different sample labeling probability over different target phrases, the sample labeling probability corresponding to each preset labeling rule must be obtained first. The server also obtains initial estimated parameters of the generative model, i.e., its initial parameters, by initializing them from the sample labeling probabilities and the label samples.
A generative model is a model that can randomly generate observed data, particularly given certain implicit parameters; it assigns a joint probability distribution to the observations and the sequences of annotation data. In the embodiment of the present application, the implicit parameters correspond to the true labels of the target phrases, the observations correspond to the sample labeling probabilities, and the annotation data sequences correspond to the label samples. Therefore, a model of the observed data is randomly generated from the implicit parameters, i.e., the true data labels, and the labeling probability of each preset labeling rule over the target phrases can be judged.
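One common way to write such a joint distribution, given here only as an illustrative assumption since the patent states no explicit formula, treats the output of each rule as conditionally independent given the true label:

$$p(y, \lambda_1, \ldots, \lambda_k; \theta) \;=\; p(y)\prod_{j=1}^{k} p(\lambda_j \mid y; \theta_j),$$

where $y$ is the (implicit) true label of a target phrase, $\lambda_j$ is the label assigned by the $j$-th preset labeling rule, and $\theta_j$ are the parameters describing that rule's accuracy.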
S5: iteratively updating the initial parameters of the generative model with the sample labeling probabilities to obtain a trained generative model, and outputting the labeling accuracy corresponding to each label sample through the trained generative model.
Specifically, the initial parameters of the generative model are fitted to the sample label probabilities: the sample label probabilities are back-propagated by stochastic gradient descent to iteratively update the initial parameters, so that the difference between the sample label probabilities and the parameters of the generative model shrinks and the parameters approach the probabilities, yielding the trained generative model. Probability estimation is then performed on the label samples with the trained parameters of the generative model, and weighted average processing is applied, thereby obtaining the labeling accuracy of the label sample under each preset rule.
Iterative updating means fitting the initial parameters of the generative model to the sample label probabilities, back-propagating the sample label probabilities by stochastic gradient descent to iteratively recompute the initial parameters so that the parameters of the generative model approach the sample label probabilities.
S6: selecting the label sample with the highest labeling accuracy as the target label sample.
Specifically, the above steps obtain the labeling accuracy of the label sample under each preset labeling rule, so the label sample with the highest labeling accuracy is selected as the target label sample. Several labeling rules are thus tried on the target phrases and the most accurate label sample is chosen, which improves the accuracy of data labeling.
In this embodiment, the obtained text to be labeled is split, segmented into words and merged to obtain the target phrases, which makes it convenient to label the text to be labeled according to the target phrases. A plurality of preset labeling rules are obtained, the target phrases are labeled with each of them to obtain a label sample corresponding to each preset rule, the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases is obtained, the initial parameters of the generative model are obtained according to the sample labeling probabilities and the label samples, the initial parameters are iteratively updated with the sample labeling probabilities to obtain the trained generative model, the labeling accuracy corresponding to each label sample is output through the trained generative model, and the label sample with the highest labeling accuracy is selected as the target label sample. Data are thus labeled with a plurality of preset rules, and the label sample with the highest labeling accuracy is selected according to the generative model, which helps to improve the accuracy of data labeling.
Referring to FIG. 3, FIG. 3 shows a specific implementation of step S4, i.e., obtaining the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases and obtaining the initial parameters of the generative model according to the sample labeling probabilities and the label samples, described in detail as follows:
S41: calculating the coverage rate, over the target phrases, of the label sample corresponding to each preset labeling rule, and taking the coverage rate as the sample labeling probability.
Specifically, in order to subsequently train the generative model so that its parameters approach the sample labeling probabilities, the sample labeling probabilities must be obtained first. The coverage rate of the label sample corresponding to each preset labeling rule over the target phrases is calculated and taken as the sample labeling probability. In the embodiment of the present application, the coverage rate is obtained by calculating the degree to which the label sample covers the target phrases.
In a specific embodiment, when remote knowledge-base matching is used to label the target phrases, the peripheral knowledge base may fail to match the target word segments of a target phrase one by one; in that case the target word segments cannot be labeled in this mode and the labeling of that target phrase fails. If all target word segments of the target phrase can be matched one by one against the peripheral knowledge base, the labeling of the target phrase succeeds. Dividing the number of target phrases labeled successfully in the label sample by the total number of target phrases gives the coverage rate of remote knowledge-base matching over the target phrases, which is taken as the sample labeling probability. For example, if 9,000 target phrases are labeled successfully out of a total of 10,000, the sample labeling probability is 90%.
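A minimal sketch of step S41 under these assumptions (a rule's failed labelings are represented as None, as in the earlier sketch):

```python
# Minimal sketch of S41: coverage rate = successfully labeled phrases / total.
def sample_labeling_probability(labels):
    """labels: one rule's output per target phrase; None means labeling failed."""
    return sum(label is not None for label in labels) / len(labels)

# e.g. 9,000 successes out of 10,000 target phrases -> 0.9, i.e. 90%
assert sample_labeling_probability(["X"] * 9000 + [None] * 1000) == 0.9
```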
S42: initializing the sample label probabilities and the label samples to obtain the initial parameters of the generative model.
Specifically, initialization means assigning estimated parameter values to the initial parameters of the generative model according to the sample label probabilities and the label samples, thereby obtaining the initial parameters of the generative model.
In this implementation, the coverage rate of the label sample corresponding to each preset labeling rule over the target phrases is calculated and taken as the sample labeling probability, and the sample labeling probabilities and the label samples are initialized to obtain the initial parameters of the generative model. Obtaining the sample labeling probabilities and the initial parameters of the generative model facilitates the subsequent training of the generative model and thereby improves the accuracy of data labeling.
Referring to FIG. 4, FIG. 4 shows a specific implementation of step S5, i.e., iteratively updating the initial parameters of the generative model with the sample labeling probabilities to obtain a trained generative model and outputting the labeling accuracy corresponding to each label sample through the trained generative model, described in detail as follows:
S51: taking the difference between the parameters of the generative model and the sample labeling probability as the optimization feature value.
Specifically, in the embodiment of the present application, the parameters of the generative model are brought ever closer to the sample labeling probability by iteratively updating them; the difference between the parameters of the generative model and the sample labeling probability is therefore used as the optimization feature value, and the training progress of the generative model is judged by evaluating this value.
Specifically, once the data volume reaches a certain scale, labeling the target phrases by the plurality of preset labeling rules and training the resulting generative model estimates the true labels of the target phrases better than randomly guessing the sample labels. Because the parameters of the generative model are used to estimate the accuracy of the label samples, and the sample label probability is calculated as the number of successfully labeled target phrases over the total number of target phrases, the closer the parameters of the generative model are to the sample labeling probability, i.e., the smaller the optimization feature value, the closer the generative model is to completing training. For example, if the initial parameter of the generative model is 0.4 and the sample label probability is 0.92, the optimization feature value is 0.52; the optimization feature value decreases gradually as the iterative updating continues, and when it reaches 0.01 the parameter of the generative model is very close to the sample label probability and the iterative updating ends.
S52: back-propagating the sample labeling probability by stochastic gradient descent to iteratively update the initial parameters, where each iterative update yields new parameters of the generative model and changes the optimization feature value.
Specifically, the sample labeling probability is back-propagated by stochastic gradient descent to iteratively update the initial parameters; each update yields new parameters of the generative model, and taking the difference between the new parameters and the sample labeling probability yields a new optimization feature value. Since the optimization feature value is calculated as the difference between the parameters of the generative model and the sample labeling probability, and the parameters change after each iterative update, the optimization feature value changes after each iterative update as well.
The gradient descent method is an iterative method that can be used to solve least-squares problems. Gradient descent is one of the most commonly used methods for solving the model parameters of machine learning algorithms, i.e., unconstrained optimization problems. When minimizing a loss function, the minimum loss and the corresponding model parameter values can be obtained by step-by-step iterative solution with gradient descent; conversely, if the maximum of a function is needed, gradient ascent is used instead. In machine learning, two variants have been developed from basic gradient descent: stochastic gradient descent and batch gradient descent. In the embodiment of the present application, stochastic gradient descent is used to back-propagate the sample labeling probability and iteratively update the initial parameters.
The back-propagation algorithm is a learning algorithm suitable for multi-layer neural networks and is based on gradient descent. The input-output relationship of a back-propagation network is essentially a mapping: a back-propagation neural network with n inputs and m outputs performs a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, and this mapping is highly nonlinear.
In a specific embodiment, the sample labeling probability is fed into the input layer of the neural network, passes through the hidden layers and finally reaches the output layer, which outputs a result; this is the forward propagation process. Because the output of the neural network differs from the actual result, the error between the estimated and actual values, i.e., the optimization feature value, is propagated backwards from the output layer through the hidden layers until it reaches the input layer. During back-propagation, the parameters are adjusted according to the gradient of the optimization feature value so that the value decreases. These steps are iterated until the optimization feature value reaches the preset threshold.
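The following toy sketch condenses steps S51–S53 to a single scalar parameter; the learning rate and the squared-error loss are illustrative assumptions, as the patent only prescribes stochastic gradient descent toward the sample labeling probability with a stopping threshold on the optimization feature value.

```python
# Minimal sketch of S51-S53: push the model parameter toward the sample
# labeling probability until |theta - p| (the optimization feature value)
# falls below the preset threshold.
def train_generative_parameter(p_sample, theta=0.4, lr=0.1, threshold=0.01):
    feature = abs(theta - p_sample)        # S51: optimization feature value
    while feature > threshold:             # S53: stopping criterion
        grad = 2 * (theta - p_sample)      # gradient of the loss (theta - p)^2
        theta -= lr * grad                 # S52: one gradient descent step
        feature = abs(theta - p_sample)    # recompute after each update
    return theta

print(train_generative_parameter(0.92))    # starts at 0.4, converges near 0.92
```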
S53: stopping the iterative updating when the optimization feature value reaches the preset threshold, obtaining the trained generative model.
Specifically, when the optimization feature value reaches the preset threshold, the parameters of the generative model are close to the sample labeling probability; the updating of the parameters is stopped at this point, yielding the trained generative model.
The preset threshold is set according to the actual situation and is not limited here. In one embodiment, the preset threshold is 0.01.
S54: outputting the labeling accuracy corresponding to each label sample through the trained generative model.
Specifically, the above steps produce the trained generative model; probability estimation is then performed on the label samples through the trained generative model, and the labeling accuracy corresponding to each label sample is output.
In this embodiment, the difference between the parameters of the generative model and the sample labeling probability is taken as the optimization feature value, the sample labeling probability is back-propagated by stochastic gradient descent to iteratively update the initial parameters, and the iterative updating stops when the optimization feature value reaches the preset threshold, yielding the trained generative model, through which the labeling accuracy corresponding to each label sample is output. The training of the generative model and the output of the labeling accuracies are thereby realized, which helps to improve the accuracy of data labeling.
Referring to FIG. 5, FIG. 5 shows a specific implementation of step S54, i.e., outputting the labeling accuracy corresponding to each label sample through the trained generative model, described in detail as follows:
S541: performing probability estimation on the label samples through the current parameters of the trained generative model to obtain base probabilities.
Specifically, probability estimation is performed on the label samples through the current parameters to obtain the base probabilities, which can then be conveniently processed further to obtain the final labeling accuracy. The current parameters are the parameters of the generative model obtained by iterative updating at the moment the optimization feature value reaches the preset threshold.
In particular, since a generative model can randomly generate observed data given certain implicit parameters, it assigns a joint probability distribution to the observations and the sequences of annotation data. In the embodiment of the present application, the implicit parameters correspond to the true labels of the target phrases, the observations correspond to the sample labeling probabilities, and the annotation data sequences correspond to the label samples. A model of the observed data is therefore randomly generated from the implicit parameters, i.e., the true data labels; this model consists of the current parameters, and the probability estimate of each preset labeling rule over the label samples can be judged from it, yielding the base probabilities.
S542: performing weighted average processing on the base probabilities to obtain the labeling accuracy corresponding to each label sample.
Specifically, the base probabilities are weighted and averaged, so that the resulting labeling accuracy is more precise.
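A minimal sketch of steps S541–S542 under stated assumptions: the per-phrase base probabilities and the weights are illustrative, since the patent specifies a weighted average without fixing the weighting scheme.

```python
# Minimal sketch of S542: weighted average of one rule's base probabilities
# gives the labeling accuracy of its label sample.
def labeling_accuracy(base_probs, weights):
    return sum(p * w for p, w in zip(base_probs, weights)) / sum(weights)

# hypothetical base probabilities estimated by the trained generative model
print(labeling_accuracy([0.90, 0.80, 0.95], [1.0, 0.5, 1.0]))  # 0.9
```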
In this embodiment, probability estimation is performed on the label samples through the current parameters of the trained generative model to obtain the base probabilities, and weighted average processing is applied to the base probabilities to obtain the labeling accuracy corresponding to each label sample, so that the resulting labeling accuracy is more precise, which helps to improve the accuracy of data labeling.
Referring to FIG. 6, FIG. 6 shows a specific implementation of step S1, i.e., obtaining a text to be labeled and splitting it to obtain split sentences, described in detail as follows:
S11: obtaining the text to be labeled, and preprocessing it to obtain a base text.
Specifically, the preprocessing includes data cleaning of the text to be labeled. Data cleaning is the process of re-inspecting and verifying data; its purpose is to delete duplicate information, correct existing errors and ensure data consistency.
S12: obtaining the text separators contained in the base text by regular-expression matching.
S13: splitting the base text by the text separators to obtain the split sentences.
Specifically, regular-expression matching is used to obtain the text separators contained in the base text, and the text separators are used to split the text in the subsequent step.
Optionally, the text separators include format separators and punctuation separators.
A format separator is a separator determined by the encoding type of the text or by its structure; the base text is split by format separators according to the text's encoding type or structure.
A punctuation separator separates the text at punctuation marks; punctuation separators enable fast splitting of the base text.
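A minimal sketch of steps S12–S13; the separator pattern is an illustrative assumption standing in for the separators found by regular-expression matching.

```python
# Minimal sketch of S12-S13: split the base text on punctuation/format
# separators matched by a regular expression.
import re

SEPARATORS = r"[。！？；!?;\n]"  # hypothetical punctuation and format separators
base_text = "我吃苹果。你喝水！今天天气好"
split_sentences = [s for s in re.split(SEPARATORS, base_text) if s]
print(split_sentences)  # ['我吃苹果', '你喝水', '今天天气好']
```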
In this embodiment, the text to be labeled is obtained and preprocessed to obtain the base text, the text separators contained in the base text are obtained by regular-expression matching, and the base text is split by the text separators to obtain the split sentences, which facilitates the subsequent generation of the target phrases and the subsequent labeling with the corresponding labels.
Referring to FIG. 7, FIG. 7 shows a specific implementation following step S6. This embodiment includes:
S61: obtaining the storage path of the text to be labeled as the target storage path;
S62: mapping the target label sample into the target storage path by a preset data mapping mode.
Specifically, for data tracing and to make it convenient to query the target label sample corresponding to a text to be labeled, the target label sample and the text to be labeled are stored under the same path.
The preset data mapping modes include, but are not limited to: manual coding (hand-coded) and visualization operations (graphical manual mapping). Manual coding directly defines the correspondence of the data in programming languages such as XSLT, Java and C++; a visualization operation typically lets the user draw a line between data items to define the correspondence between them. In a specific embodiment, the target label sample is mapped into the target storage path by a visualization operation.
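A minimal sketch of steps S61–S62 under stated assumptions: the file layout and JSON serialization are illustrative, the patent prescribing only that the target label sample be mapped into the storage path of the text to be labeled.

```python
# Minimal sketch of S61-S62: store the target label sample under the same
# path as the text to be labeled, so the sample can be traced back to it.
import json
from pathlib import Path

def map_to_storage_path(text_path, target_label_sample):
    text_path = Path(text_path)
    target_dir = text_path.parent                       # S61: target storage path
    out = target_dir / (text_path.stem + ".labels.json")
    out.write_text(json.dumps(target_label_sample, ensure_ascii=False))
    return out
```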
Referring to FIG. 8, FIG. 8 shows a specific implementation of merging the target word segments to obtain the target phrases, described in detail as follows:
S2A: tagging the parts of speech of the target word segments by part-of-speech tagging to obtain part-of-speech word segments.
Part-of-speech tagging, also called grammatical tagging or part-of-speech disambiguation, is a text data processing technique that tags the part of speech of each word in a corpus according to its meaning and context. Part-of-speech tagging can be done manually or by a specific algorithm; achieving it with machine learning methods is a research topic in natural language processing. Common part-of-speech tagging algorithms include hidden Markov models, conditional random fields and the like. In the embodiment of the present application, the parts of speech of the target word segments are tagged by part-of-speech tagging to obtain the part-of-speech word segments.
S2B: merging the part-of-speech word segments that conform to the consistency rule by dependency syntactic analysis to obtain the target phrases.
The consistency rule tags the corresponding words using the subject-verb-object (SBV) relation. For example, "I eat apples" is tagged as (I, subject), (eat, predicate), (apples, object); the extracted part-of-speech word segments are matched to these syntactic roles, and the part-of-speech word segments that conform to the consistency rule are merged to obtain the target phrase.
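A minimal sketch of step S2B; the token tuples and relation names are illustrative assumptions standing in for the output of a dependency parser (SBV is the subject-verb relation label used by common Chinese parsers, HED the head, VOB the verb-object relation).

```python
# Minimal sketch of S2B: merge part-of-speech word segments whose dependency
# relations satisfy the subject-verb-object consistency rule.
tokens = [                    # (word, part of speech, head index, relation)
    ("我", "r", 1, "SBV"),    # subject, depends on the verb
    ("吃", "v", -1, "HED"),   # predicate, head of the sentence
    ("苹果", "n", 1, "VOB"),  # object, depends on the verb
]

CONSISTENT = {"SBV", "HED", "VOB"}  # assumed subject-verb-object roles
target_phrase = "".join(w for w, pos, head, rel in tokens if rel in CONSISTENT)
print(target_phrase)  # 我吃苹果
```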
In this embodiment, the parts of speech of the target word segments are tagged by part-of-speech tagging to obtain the part-of-speech word segments, and the part-of-speech word segments that conform to the consistency rule are merged by dependency syntactic analysis to obtain the target phrases, thereby realizing the merging of the target word segments and facilitating the subsequent data labeling.
It should be emphasized that, to further ensure the privacy and security of the text to be annotated, the text to be annotated may also be stored in a node of a blockchain.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium which, when executed, may include the procedures of the above method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
Referring to FIG. 9, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of a data labeling apparatus based on a generative model; the apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus may be applied to various electronic devices.
As shown in FIG. 9, the data labeling apparatus based on a generative model of this embodiment includes: a text-to-be-labeled splitting module 71, a target phrase obtaining module 72, a label sample generating module 73, an initial parameter generating module 74, a labeling accuracy output module 75 and a label sample selecting module 76, wherein:
the text-to-be-labeled splitting module 71 is configured to obtain a text to be labeled and split it to obtain split sentences;
the target phrase obtaining module 72 is configured to obtain target word segments by performing word segmentation on the split sentences, and merge the target word segments to obtain target phrases;
the label sample generating module 73 is configured to obtain a plurality of preset labeling rules and label the target phrases with each of the preset labeling rules, so as to obtain a label sample corresponding to each preset rule;
the initial parameter generating module 74 is configured to obtain the sample labeling probability of the label sample corresponding to each preset labeling rule over the target phrases, and obtain initial parameters of a generative model according to the sample labeling probabilities and the label samples;
the labeling accuracy output module 75 is configured to iteratively update the initial parameters of the generative model with the sample labeling probabilities to obtain a trained generative model, and output the labeling accuracy corresponding to each label sample through the trained generative model;
the label sample selecting module 76 is configured to select the label sample with the highest labeling accuracy as the target label sample.
Further, the initial parameter generating module 74 includes:
the sample labeling probability obtaining unit, configured to calculate the coverage rate, over the target phrases, of the label sample corresponding to each preset labeling rule, and take the coverage rate as the sample labeling probability;
and the initialization processing unit, configured to initialize the sample label probabilities and the label samples to obtain the initial parameters of the generative model.
Further, the labeling accuracy output module 75 includes:
the optimization feature value defining unit, configured to take the difference between the parameters of the generative model and the sample labeling probability as the optimization feature value;
the iterative updating unit, configured to back-propagate the sample labeling probability by stochastic gradient descent to iteratively update the initial parameters, where each iterative update yields new parameters of the generative model and changes the optimization feature value;
the iterative updating stopping unit, configured to stop the iterative updating when the optimization feature value reaches the preset threshold, obtaining the trained generative model;
and the labeling accuracy obtaining unit, configured to output the labeling accuracy corresponding to each label sample through the trained generative model.
Further, the labeling accuracy obtaining unit includes:
the base probability obtaining subunit, configured to perform probability estimation on the label samples through the current parameters of the trained generative model to obtain base probabilities;
and the base probability processing subunit, configured to perform weighted average processing on the base probabilities to obtain the labeling accuracy corresponding to each label sample.
Further, the text-to-be-labeled splitting module 71 includes:
the base text generating unit, configured to obtain a text to be labeled and preprocess it to obtain a base text;
the text separator obtaining unit, configured to obtain the text separators contained in the base text by regular-expression matching;
and the split sentence generating unit, configured to split the base text by the text separators to obtain the split sentences.
Further, after the label sample selecting module 76, the data labeling apparatus based on a generative model further includes:
the target storage path obtaining module, configured to obtain the storage path of the text to be labeled as the target storage path;
and the data mapping module, configured to map the target label sample into the target storage path by a preset data mapping mode.
Further, the target phrase obtaining module 72 further includes:
the part-of-speech word segment generating unit, configured to tag the parts of speech of the target word segments by part-of-speech tagging to obtain part-of-speech word segments;
and the target phrase generating unit, configured to merge the part-of-speech word segments that conform to the consistency rule by dependency syntactic analysis to obtain the target phrases.
It should be emphasized that, to further ensure the privacy and security of the text to be annotated, the text to be annotated may also be stored in a node of a blockchain.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring specifically to FIG. 10, FIG. 10 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 8 comprises a memory 81, a processor 82 and a network interface 83 communicatively connected to one another via a system bus. It should be noted that the figure shows only a computer device 8 having the three components memory 81, processor 82 and network interface 83, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as its hard disk or internal memory. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the computer device 8. Of course, the memory 81 may also include both an internal storage unit and an external storage device of the computer device 8. In this embodiment, the memory 81 is generally used to store the operating system and the various application software installed on the computer device 8, such as the program code of the data labeling method based on a generative model. Further, the memory 81 may be used to temporarily store various types of data that have been output or are to be output.
The processor 82 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 82 is generally used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to run the program code stored in the memory 81 or to process data, for example to run the program code of the data labeling method based on a generative model, so as to implement the various embodiments of that method.
The network interface 83 may comprise a wireless network interface or a wired network interface, which network interface 83 is typically used to establish a communication connection between the computer device 8 and other electronic devices.
The present application also provides another embodiment, namely a computer-readable storage medium storing a computer program executable by at least one processor, so that the at least one processor performs the steps of the data labeling method based on a generative model as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk or optical disc) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the methods of the embodiments of the present application.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a string of data blocks generated in association using cryptographic methods, each of which contains information on a batch of network transactions used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and the like.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit the scope of the patent claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application and applied directly or indirectly in other related technical fields likewise falls within the scope of patent protection of the application.
Claims (5)
1. A data labeling method based on a generative model, characterized by comprising the following steps:
obtaining a text to be labeled, and splitting the text to be labeled to obtain split sentences;
performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases;
obtaining a plurality of preset labeling rules, and labeling the target phrases through each of the plurality of preset labeling rules to obtain a label sample corresponding to each preset rule, wherein the preset labeling rules comprise regular-expression recognition, remote-matching knowledge base recognition, and external data matching;
calculating the coverage rate, over the target phrases, of the label sample corresponding to each preset labeling rule, and taking the coverage rate as a sample labeling probability, wherein, if the target phrases are labeled by the remote-matching knowledge base recognition, the coverage rate is obtained by dividing the number of target phrases successfully labeled by the total number of target phrases, and the coverage rate is taken as the sample labeling probability;
initializing the sample labeling probability and the label samples to obtain initial parameters of a generative model;
taking the difference between the parameters of the generative model and the sample labeling probability as an optimized characteristic value;
back-propagating the sample labeling probability by stochastic gradient descent so as to iteratively update the initial parameters, wherein each iterative update yields new parameters of the generative model and a change in the optimized characteristic value;
stopping the iterative updating when the optimized characteristic value reaches a preset threshold, to obtain a trained generative model;
performing probability estimation on the label samples through the current parameters of the trained generative model to obtain basic probabilities, wherein the current parameters are the parameters obtained by the iterative updating when the optimized characteristic value reaches the preset threshold;
performing weighted average processing on the basic probabilities to obtain a labeling accuracy corresponding to each label sample;
selecting the label sample with the highest labeling accuracy as a target label sample;
acquiring the storage path of the text to be labeled as a target storage path; and
mapping the target label sample into the target storage path through a visualization operation;
wherein the step of merging the target word segments to obtain the target phrases comprises:
tagging the parts of speech of the target word segments by part-of-speech tagging to obtain part-of-speech word segments; and
merging the part-of-speech word segments that conform to a consistency rule according to dependency syntactic analysis, to obtain the target phrases.
2. The data labeling method based on a generative model according to claim 1, wherein the step of obtaining the text to be labeled and splitting the text to be labeled to obtain the split sentences comprises:
acquiring the text to be labeled, and preprocessing the text to be labeled to obtain a basic text;
acquiring the text separators contained in the basic text by regular-expression matching; and
splitting the basic text at the text separators to obtain the split sentences.
3. A data labeling apparatus based on a generative model, comprising:
a to-be-labeled text splitting module, configured to obtain a text to be labeled and split the text to be labeled to obtain split sentences;
a target phrase acquisition module, configured to perform word segmentation on the split sentences to obtain target word segments and merge the target word segments to obtain target phrases;
a label sample generation module, configured to obtain a plurality of preset labeling rules and label the target phrases through each of the plurality of preset labeling rules to obtain a label sample corresponding to each preset rule, wherein the preset labeling rules comprise regular-expression recognition, remote-matching knowledge base recognition, and external data matching;
a sample labeling probability obtaining unit, configured to calculate the coverage rate, over the target phrases, of the label sample corresponding to each preset labeling rule and take the coverage rate as the sample labeling probability, wherein, if the target phrases are labeled by the remote-matching knowledge base recognition, the coverage rate is obtained by dividing the number of target phrases successfully labeled by the total number of target phrases, and the coverage rate is taken as the sample labeling probability;
an initialization processing unit, configured to initialize the sample labeling probability and the label samples to obtain initial parameters of a generative model;
an optimized characteristic value definition unit, configured to take the difference between the parameters of the generative model and the sample labeling probability as an optimized characteristic value;
an iterative updating unit, configured to back-propagate the sample labeling probability by stochastic gradient descent so as to iteratively update the initial parameters, wherein each iterative update yields new parameters of the generative model and a change in the optimized characteristic value;
an iterative update stopping unit, configured to stop the iterative updating when the optimized characteristic value reaches a preset threshold, to obtain a trained generative model;
a basic probability obtaining subunit, configured to perform probability estimation on the label samples through the current parameters of the trained generative model to obtain basic probabilities, wherein the current parameters are the parameters obtained by the iterative updating when the optimized characteristic value reaches the preset threshold;
a basic probability processing subunit, configured to perform weighted average processing on the basic probabilities to obtain a labeling accuracy corresponding to each label sample;
a label sample selection module, configured to select the label sample with the highest labeling accuracy as a target label sample;
a target storage path acquisition module, configured to acquire the storage path of the text to be labeled as a target storage path; and
a data mapping module, configured to map the target label sample into the target storage path through a visualization operation;
wherein the target phrase acquisition module comprises:
a part-of-speech word segment generation unit, configured to tag the parts of speech of the target word segments by part-of-speech tagging to obtain part-of-speech word segments; and
a target phrase generation unit, configured to merge the part-of-speech word segments that conform to a consistency rule according to dependency syntactic analysis, to obtain the target phrases.
4. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the data labeling method based on a generative model according to claim 1 or 2.
5. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the data labeling method based on a generative model according to claim 1 or 2.
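The following sketches are provided after the claims as non-limiting aids to understanding; they are not the implementation of this application, and every function name, separator set, weight, and sample value in them is an assumption introduced for illustration. First, the splitting step recited in claims 1 and 2, sketched in Python with an assumed regular-expression separator set standing in for the separators that the regular-matching step would acquire:

```python
import re

def split_text(text: str) -> list:
    """Split raw text into split sentences at regex-matched separators."""
    # Whitespace normalization as a simple stand-in for the preprocessing
    # that yields the basic text.
    basic_text = re.sub(r"\s+", " ", text).strip()
    # Split after common Chinese/English sentence separators (assumed set).
    sentences = re.split(r"(?<=[。！？.!?;；])", basic_text)
    return [s.strip() for s in sentences if s.strip()]

print(split_text("患者出现发热。伴有咳嗽！建议复查."))
# ['患者出现发热。', '伴有咳嗽！', '建议复查.']
```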
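Next, the merging of part-of-speech word segments into target phrases (the final steps of claim 1). The Token structure, the set of mergeable dependency relations, and the neighbour-head approximation of the consistency rule below are all assumptions; a real embodiment would obtain tags and heads from a part-of-speech tagger and dependency parser:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str    # part-of-speech tag
    head: int   # index of the syntactic head token
    dep: str    # dependency relation to the head

def merge_phrases(tokens: list) -> list:
    """Merge modifier tokens into the following head token to form phrases."""
    MERGEABLE = {"compound", "amod", "nn"}   # assumed relation set
    phrases, buffer = [], ""
    for i, tok in enumerate(tokens):
        if tok.dep in MERGEABLE and tok.head == i + 1:
            buffer += tok.text               # modifier attaches to next token
        else:
            phrases.append(buffer + tok.text)
            buffer = ""
    if buffer:                               # flush any dangling modifier
        phrases.append(buffer)
    return phrases

tokens = [
    Token("急性", "JJ", 1, "amod"),   # modifier of the next token
    Token("肺炎", "NN", 2, "dobj"),
    Token("治疗", "VV", 2, "root"),
]
print(merge_phrases(tokens))  # ['急性肺炎', '治疗']
```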
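The coverage-rate calculation of claim 1 divides the number of target phrases a rule labeled successfully by the total number of target phrases, and uses the result as that rule's sample labeling probability. The rule names and outputs below are hypothetical:

```python
def coverage_rate(labels: list) -> float:
    """Fraction of target phrases labeled by one rule (None = not labeled)."""
    return sum(lbl is not None for lbl in labels) / len(labels)

# Hypothetical rule outputs over five target phrases.
rule_outputs = {
    "regex":        ["DRUG", None, "DRUG", None, None],
    "knowledge_kb": ["DRUG", "DISEASE", None, "DISEASE", None],
    "external":     [None, "DISEASE", None, None, None],
}
sample_label_prob = {rule: coverage_rate(out)
                     for rule, out in rule_outputs.items()}
print(sample_label_prob)
# {'regex': 0.4, 'knowledge_kb': 0.6, 'external': 0.2}
```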
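The iterative update of claim 1 can be sketched as a toy stochastic gradient descent loop that stops once the optimized characteristic value (the gap between the model parameters and the sample labeling probabilities) falls below a preset threshold. The squared-error surrogate objective, learning rate, and random initialization are assumptions; an actual generative model would have a far richer parameterization:

```python
import numpy as np

def train_generative_model(sample_label_prob: np.ndarray,
                           lr: float = 0.1,
                           threshold: float = 1e-3,
                           max_iter: int = 10_000) -> np.ndarray:
    """Toy SGD loop driving parameters toward the sample labeling probabilities."""
    rng = np.random.default_rng(0)
    params = rng.random(sample_label_prob.shape)     # initial parameters
    for _ in range(max_iter):
        i = rng.integers(len(params))                # stochastic: one coordinate
        grad = 2.0 * (params[i] - sample_label_prob[i])
        params[i] -= lr * grad                       # gradient-descent update
        # Optimized characteristic value: largest remaining parameter gap.
        if np.abs(params - sample_label_prob).max() < threshold:
            break                                    # preset threshold reached
    return params

probs = np.array([0.4, 0.6, 0.2])
print(train_generative_model(probs).round(3))        # ~[0.4 0.6 0.2]
```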
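Finally, the probability-estimation, weighted-average, and selection steps of claim 1: each label sample receives basic probabilities from the trained model's current parameters, a weighted average turns them into a labeling accuracy, and the highest-accuracy sample is chosen as the target label sample. The per-phrase weights and basic probabilities below are hypothetical placeholders for values that would come from the trained model:

```python
import numpy as np

def pick_target_sample(base_prob: dict, phrase_weights: np.ndarray) -> str:
    """Weighted-average each sample's basic probabilities into a labeling
    accuracy, then return the name of the highest-accuracy label sample."""
    accuracy = {
        name: float(np.average(p, weights=phrase_weights))
        for name, p in base_prob.items()
    }
    return max(accuracy, key=accuracy.get)

# Hypothetical basic probabilities for three label samples over four phrases.
base_prob = {
    "regex":        np.array([0.9, 0.2, 0.8, 0.1]),
    "knowledge_kb": np.array([0.7, 0.8, 0.6, 0.9]),
    "external":     np.array([0.3, 0.4, 0.2, 0.5]),
}
weights = np.array([1.0, 1.0, 2.0, 1.0])   # e.g. assumed phrase frequencies
print(pick_target_sample(base_prob, weights))  # 'knowledge_kb'
```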
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110193454.5A CN112860919B (en) | 2021-02-20 | 2021-02-20 | Data labeling method, device, equipment and storage medium based on generation model |
PCT/CN2021/083758 WO2022174496A1 (en) | 2021-02-20 | 2021-03-30 | Data annotation method and apparatus based on generative model, and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110193454.5A CN112860919B (en) | 2021-02-20 | 2021-02-20 | Data labeling method, device, equipment and storage medium based on generation model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112860919A CN112860919A (en) | 2021-05-28 |
CN112860919B (en) | 2024-07-12 |
Family
ID=75988385
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110193454.5A Active CN112860919B (en) | 2021-02-20 | 2021-02-20 | Data labeling method, device, equipment and storage medium based on generation model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112860919B (en) |
WO (1) | WO2022174496A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113515587B (en) * | 2021-06-02 | 2024-06-21 | 中国神华国际工程有限公司 | Target information extraction method, device, computer equipment and storage medium |
CN113590729B (en) * | 2021-07-30 | 2023-06-20 | 博米智能科技(杭州)有限公司 | Building equipment point location identification method and device, computer equipment and storage medium |
CN113761577B (en) * | 2021-09-10 | 2024-05-31 | 平安科技(深圳)有限公司 | Big data desensitization method, device, computer equipment and storage medium |
CN114020877B (en) * | 2021-11-18 | 2024-05-10 | 中科雨辰科技有限公司 | Data processing system for labeling text |
CN116796356A (en) * | 2022-03-07 | 2023-09-22 | 华为云计算技术有限公司 | Data segmentation method and related device |
CN115391537B (en) * | 2022-08-26 | 2025-06-27 | 达观数据有限公司 | A method, device, computer equipment and medium for generating a flow sample set |
CN118626594A (en) * | 2024-06-25 | 2024-09-10 | 国网智能电网研究院有限公司 | Automatic labeling method, system and storage medium for power standard knowledge |
CN119166823A (en) * | 2024-11-20 | 2024-12-20 | 上海数字治理研究院有限公司 | A method and system for automatically labeling a large language model fine-tuning instruction set |
CN120408421A (en) * | 2025-07-03 | 2025-08-01 | 浪潮云洲工业互联网有限公司 | A data annotation method, system, terminal and medium based on large model |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7725417B2 (en) * | 2006-02-09 | 2010-05-25 | Ebay Inc. | Method and system to analyze rules based on popular query coverage |
CN106997382B (en) * | 2017-03-22 | 2020-12-01 | 山东大学 | Automatic labeling method and system for innovative creative labels based on big data |
US11132507B2 (en) * | 2019-04-02 | 2021-09-28 | International Business Machines Corporation | Cross-subject model-generated training data for relation extraction modeling |
CN110196908A (en) * | 2019-04-17 | 2019-09-03 | 深圳壹账通智能科技有限公司 | Data classification method, device, computer installation and storage medium |
CN111507104B (en) * | 2020-03-19 | 2022-03-25 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
CN112084752B (en) * | 2020-09-08 | 2023-07-21 | 中国平安财产保险股份有限公司 | Sentence marking method, device, equipment and storage medium based on natural language |
2021
- 2021-02-20 CN CN202110193454.5A patent/CN112860919B/en active Active
- 2021-03-30 WO PCT/CN2021/083758 patent/WO2022174496A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
Jingru Yang et al., "Cost-Effective Data Annotation using Game-Based Crowdsourcing," Proceedings of the VLDB Endowment, Vol. 12, No. 1, pp. 57-70. *
Also Published As
Publication number | Publication date |
---|---|
CN112860919A (en) | 2021-05-28 |
WO2022174496A1 (en) | 2022-08-25 |
Similar Documents
Publication | Title |
---|---|
CN112860919B (en) | Data labeling method, device, equipment and storage medium based on generation model |
US11151177B2 (en) | Search method and apparatus based on artificial intelligence |
CN109493977B (en) | Text data processing method and device, electronic equipment and computer readable medium |
WO2021135469A1 (en) | Machine learning-based information extraction method, apparatus, computer device, and medium |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language |
US12380142B2 (en) | Sequenced data processing method and device, and text processing method and device |
US10755048B2 (en) | Artificial intelligence based method and apparatus for segmenting sentence |
CN111709240A (en) | Entity relationship extraction method, device, device and storage medium thereof |
US11036996B2 (en) | Method and apparatus for determining (raw) video materials for news |
CN112084752B (en) | Sentence marking method, device, equipment and storage medium based on natural language |
CN113947095B (en) | Multilingual text translation method, multilingual text translation device, computer equipment and storage medium |
CN112328761A (en) | Intention label setting method and device, computer equipment and storage medium |
CN111221936B (en) | Information matching method and device, electronic equipment and storage medium |
CN112949320A (en) | Sequence labeling method, device, equipment and medium based on conditional random field |
CN114912450B (en) | Information generation method and device, training method, electronic device and storage medium |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment |
CN112528654A (en) | Natural language processing method and device and electronic equipment |
CN115438149A (en) | End-to-end model training method and device, computer equipment and storage medium |
CN114330358A (en) | Intent recognition method, apparatus, computer equipment and storage medium |
CN110717333A (en) | Method and device for automatically generating article abstract and computer readable storage medium |
CN113688268A (en) | Picture information extraction method and device, computer equipment and storage medium |
CN116776900A (en) | Enhanced data screening method, device, equipment and medium based on multilingual model |
CN110442858B (en) | Question entity identification method and device, computer equipment and storage medium |
CN119127150A (en) | Application construction method, system, medium and electronic device |
CN116502624A (en) | Corpus expansion method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |