CN114676775B - Sample information labeling method, device, equipment, program and storage medium - Google Patents
- Publication number
- CN114676775B (application CN202210301392.XA)
- Authority
- CN
- China
- Prior art keywords
- sample information
- labeling
- information
- data source
- positive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a sample information labeling method, a sample information labeling apparatus, and an electronic device; the related embodiments can be applied to various scenarios such as cloud technology, cloud security, and intelligent traffic. The method comprises: obtaining negative example sample information from a data source; obtaining static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively; determining model parameters of a sample information labeling model; and labeling the sample information in the data source through the sample information labeling model to obtain labeled sample information. Labor cost is thereby saved, and the dependence of the quality of the determined text relevance on the quality of manual labeling is reduced.
Description
Technical Field
The present invention relates to information processing technology, and in particular to a sample information labeling method, apparatus, electronic device, computer program product, and storage medium. The applicable fields of the present solution include, but are not limited to, automatic driving, internet of vehicles, and intelligent transportation.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science. By researching the design principles and implementation methods of various intelligent machines, it gives machines the capabilities of sensing, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline that touches a wide range of fields, such as natural language processing and machine learning/deep learning; it is believed that with the development of technology, artificial intelligence will be applied in more fields and become increasingly valuable.
In the related art, pre-trained language models have made breakthrough progress in machine reading comprehension tasks. The core idea is to pre-train a language model on a large-scale unsupervised text corpus to obtain semantic representations of the text. These semantic representations can then be applied, as features or through fine-tuning, to a series of natural language understanding tasks, including machine reading comprehension, which requires text relevance labeling: labeling a text with one or more levels in a relevance hierarchy. Text relevance labeling is widely used in business scenarios such as advertising, recommendation, and search, and determining the relevance level to which a text belongs is an important step in it. In the traditional text relevance determination approach, the relevance of a number of texts is first labeled manually to obtain training samples; a machine learning model, such as a neural network, is then trained on these samples to obtain a mapping model; finally, the text to be processed is input into the mapping model, which determines its relevance. However, manually labeling training samples consumes a significant amount of manpower, and because the mapping model is trained on manually labeled samples, the quality of the determined relevance depends heavily on the quality of the manual labeling.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a sample information labeling method, apparatus, electronic device, software program, and storage medium, which can determine the relevance of any sample information by machine learning and label the sample information accordingly. This removes the manual relevance-labeling step of the traditional approach, saves labor cost, reduces the dependence of relevance quality on manual labeling quality, improves the accuracy of text relevance labeling, and improves the user experience.
The embodiment of the invention provides a sample information labeling method, which comprises the following steps:
acquiring positive example sample information from a data source, and labeling the positive example sample information to obtain labeled positive example sample information;
acquiring negative example sample information from the data source;
acquiring static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively;
training a sample information labeling model based on the labeled positive example sample information, the labeled negative example sample information, and the static statistical feature sets, and determining model parameters of the sample information labeling model;
and labeling the sample information in the data source through the sample information labeling model to obtain labeled sample information.
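The five steps above can be sketched in code. The following is a hypothetical, minimal sketch: all function names, the click-based selection of positives/negatives, and the trivial length-threshold "model" are illustrative assumptions, not the patent's actual model (which is a neural network).

```python
# Hypothetical sketch of the five claimed steps; a real implementation would
# train a neural model rather than this toy length threshold.

def acquire_positive_samples(data_source):
    # Step 1: treat clicked samples as positive examples (an assumption).
    return [s for s in data_source if s["clicks"] > 0]

def acquire_negative_samples(data_source):
    # Step 2: treat un-clicked samples as negative examples.
    return [s for s in data_source if s["clicks"] == 0]

def train_labeling_model(positives, negatives):
    # Steps 3-4: derive a "model parameter" from a static statistical
    # feature (query length) of both sample groups.
    avg_pos = sum(len(s["query"]) for s in positives) / max(len(positives), 1)
    avg_neg = sum(len(s["query"]) for s in negatives) / max(len(negatives), 1)
    return {"threshold": (avg_pos + avg_neg) / 2}

def label_samples(model, data_source):
    # Step 5: label every sample in the data source with the trained model.
    return [
        {**s, "label": int(len(s["query"]) <= model["threshold"])}
        for s in data_source
    ]

data_source = [
    {"query": "weather today", "clicks": 3, "searches": 10},
    {"query": "very long unrelated question text", "clicks": 0, "searches": 1},
]
model = train_labeling_model(
    acquire_positive_samples(data_source),
    acquire_negative_samples(data_source),
)
labeled = label_samples(model, data_source)
```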
The embodiment of the invention also provides a sample information labeling apparatus, which comprises:
an information transmission module, configured to acquire positive example sample information from a data source and label the positive example sample information to obtain labeled positive example sample information;
an information processing module, configured to acquire negative example sample information from the data source;
the information processing module is configured to acquire static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively;
the information processing module is configured to train a sample information labeling model based on the labeled positive example sample information, the labeled negative example sample information, and the static statistical feature sets, and to determine model parameters of the sample information labeling model;
the information processing module is configured to label the sample information in the data source through the sample information labeling model to obtain labeled sample information.
In the above scheme, the information processing module is configured to randomly acquire at least one piece of initial sample information from the data source, where the initial sample information includes at least a question sentence and a reply sentence corresponding to the question sentence;
the information processing module is configured to label the initial sample information according to the degree of semantic relevance between the question sentence and the reply sentence, to obtain initial sample information with different degrees of semantic relevance;
the information processing module is configured to screen positive example sample information from the initial sample information with different degrees of semantic relevance;
the information processing module is configured to label the positive example sample information based on its degree of relevance, to obtain labeled positive example sample information.
In the above scheme, the information processing module is configured to determine the screening dimensions of the static statistical features in the static statistical feature set according to the type of the data source;
the information processing module is configured to screen sample information from the positive example sample information and the negative example sample information of the data source according to the screening dimensions of the static statistical features;
the information processing module is configured to perform one-hot encoding on the sample information to form static statistical features corresponding to the positive example sample information and the negative example sample information respectively;
the information processing module is configured to combine the static statistical features of different screening dimensions to form static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively.
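The one-hot encoding step above can be illustrated with a small sketch. The bucketing of a continuous statistic (sample length) into discrete categories before one-hot encoding is an assumption for illustration; the patent does not specify bucket boundaries.

```python
# Illustrative one-hot encoding of one static statistical feature
# (bucketed sample length); bucket boundaries are assumptions.

def one_hot(index, size):
    # Standard one-hot vector: a 1 at `index`, 0 elsewhere.
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_length_feature(text, buckets=(10, 20, 40)):
    # Map the raw length onto a bucket index, then one-hot encode it.
    length = len(text)
    index = sum(1 for b in buckets if length > b)  # 0 .. len(buckets)
    return one_hot(index, len(buckets) + 1)

def feature_set(sample_texts):
    # One encoded screening dimension per sample; further dimensions
    # (click-through rate, search count) would be concatenated here.
    return [encode_length_feature(t) for t in sample_texts]
```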
In the above scheme, when the type of the data source is a search question-answer data source, the set of static statistical features includes at least one of the following:
the length, click-through rate, and search count of the positive example sample information, and the length, click-through rate, and search count of the negative example sample information;
when the type of the data source is an intelligent chat data source, the set of static statistical features includes at least one of the following:
the length and search count of the positive example sample information, and the length and search count of the negative example sample information.
In the above scheme, the information processing module is configured to acquire sample information that has not been clicked in the data source as negative example sample information; or
the information processing module is configured to screen negative example sample information from the initial sample information with different degrees of semantic relevance.
In the above scheme, the information processing module is configured to combine the labeled positive example sample information, the labeled negative example sample information, and the static statistical feature sets to form a training sample set;
the information processing module is configured to determine a convergence function of the sample information labeling model;
the information processing module is configured to train the bidirectional attention neural network of the sample information labeling model based on the training sample set and determine network parameters of the bidirectional attention neural network;
the information processing module is configured to train the feature encoding network of the sample information labeling model based on the training sample set and determine network parameters of the feature encoding network;
and the information processing module is configured to update the network parameters of the bidirectional attention neural network and the network parameters of the feature encoding network until the convergence condition corresponding to the convergence function is reached.
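The training procedure above (alternately updating two parameter groups until the convergence function stops improving) can be sketched schematically. This is not the patent's BERT-plus-feature-encoder training; the scalar "networks", the finite-difference gradient, and the toy convex loss are all stand-ins chosen so the convergence logic is visible.

```python
# Schematic training loop: update each parameter group in turn (mimicking the
# bidirectional attention network and the feature encoding network) until the
# convergence condition of the loss ("convergence function") is reached.

def train(loss_fn, params, lr=0.1, tol=1e-4, max_steps=1000):
    prev = float("inf")
    for _ in range(max_steps):
        for name in params:  # alternate over the two parameter groups
            eps = 1e-6
            base = loss_fn(params)
            params[name] += eps
            grad = (loss_fn(params) - base) / eps  # finite-difference gradient
            params[name] -= eps + lr * grad        # undo probe, apply step
        cur = loss_fn(params)
        if abs(prev - cur) < tol:  # convergence condition reached
            break
        prev = cur
    return params

# Toy convex loss standing in for the model's convergence function;
# the minimum is at bert = 2.0, feat = -1.0.
loss = lambda p: (p["bert"] - 2.0) ** 2 + (p["feat"] + 1.0) ** 2
trained = train(loss, {"bert": 0.0, "feat": 0.0})
```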
In the above scheme, the information processing module is configured to acquire sample information to be processed and static statistical features from the data source, where the sample information to be processed includes a question sentence and a reply sentence;
the information processing module is configured to trigger a corresponding word segmentation library according to the text parameter information carried by the sample information to be processed;
the information processing module is configured to perform word segmentation on the sample information to be processed through the word dictionary of the triggered word segmentation library, to form different word-level feature vectors;
the information processing module is configured to perform noise removal on the different word-level feature vectors, to form a word-level feature vector set corresponding to the sample information to be processed;
the information processing module is configured to process the word-level feature vector set and the static statistical features through the sample information labeling model, to determine the relevance between the question sentence and the reply sentence;
the information processing module is configured to label the sample information to be processed according to the relevance between the question sentence and the reply sentence, to obtain a relevance labeling result for the sample information to be processed.
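The inference path above (segment, remove noise, score relevance, label) can be sketched as follows. The regex segmenter, the stop-word list, and the Jaccard-overlap score are illustrative stand-ins for the word segmentation library and the trained labeling model; they are assumptions, not the patent's method.

```python
# Hypothetical inference path: segment the question and reply, drop noise
# tokens, and let a stand-in scorer rate question/reply relevance.

import re

STOP_WORDS = {"the", "a", "an", "is", "of"}  # illustrative noise list

def segment(text):
    # Simple regex segmentation in place of a word segmentation library.
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_noise(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def relevance(question, reply):
    q = set(remove_noise(segment(question)))
    r = set(remove_noise(segment(reply)))
    return len(q & r) / max(len(q | r), 1)  # Jaccard overlap as a stand-in

def label(question, reply, threshold=0.2):
    # Produce the relevance labeling result for one question/reply pair.
    return "relevant" if relevance(question, reply) >= threshold else "irrelevant"
```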
The embodiment of the invention also provides an electronic device, which comprises:
a memory, configured to store executable instructions;
and a processor, configured to implement the above sample information labeling method when running the executable instructions stored in the memory.
The embodiment of the invention also provides a computer program product, comprising a computer program or instructions, wherein the above sample information labeling method is implemented when the computer program or instructions are executed by a processor.
The embodiment of the invention also provides a computer readable storage medium which stores executable instructions, wherein the executable instructions realize the sample information labeling method when being executed by a processor.
The embodiment of the invention has the following beneficial effects:
Positive example sample information is obtained from a data source and labeled to obtain labeled positive example sample information; negative example sample information is obtained from the data source; static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively are obtained; a sample information labeling model is trained based on the labeled positive example sample information, the negative example sample information, and the static statistical feature sets, to determine model parameters of the sample information labeling model; and the sample information in the data source is labeled through the sample information labeling model to obtain labeled sample information. The relevance of any sample information can thus be determined and labeled by machine learning, which removes the manual relevance-labeling step of the traditional approach, saves labor cost, reduces the dependence of relevance quality on manual labeling quality, improves the accuracy of text relevance labeling, and improves the user experience.
Drawings
FIG. 1 is a schematic diagram of a usage scenario of a sample information labeling method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a composition structure of a sample information labeling device according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of an alternative method for labeling sample information according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a sample information labeling model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of an alternative method for labeling sample information according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an application environment of a sample information labeling model in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments", which describes a subset of all possible embodiments; it is to be understood that "some embodiments" may denote the same subset or different subsets of all possible embodiments, and that these subsets may be combined with one another where no conflict arises.
Before describing the embodiments of the present invention in further detail, the terms and terminology involved in the embodiments of the present invention are explained as follows.
1) "In response to": indicates a condition or state on which a performed operation depends. When the dependent condition or state is satisfied, the one or more performed operations may be in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which they are performed.
2) "Based on": indicates a condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more performed operations may be in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which they are performed.
3) Softmax: a very common and important function in machine learning, widely used especially in multi-class scenarios; it maps inputs to real numbers between 0 and 1 and normalizes them so that they sum to 1.
4) Neural Network (NN): in the fields of machine learning and cognitive science, an Artificial Neural Network (ANN), abbreviated as neural network or neural-like network, is a mathematical or computational model that mimics the structure and function of a biological neural network (the central nervous system of an animal, particularly the brain) and is used to estimate or approximate functions.
5) Click log (Clicklog): for each input query and each clicked document (docid), the search engine backend records the docid and the number of clicks, in the following format:
< query, docid > \t search count, click count, docid rank position.
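A small parser for the tab-separated click-log record format described above can look as follows; the field names follow the text, but the exact delimiter layout (a comma-space between query and docid, tabs between the counters) is an assumption for illustration.

```python
# Parse one click-log record of the form:
#   < query, docid > \t search count \t click count \t rank position
# The exact delimiters are assumptions based on the format described above.

def parse_clicklog(line):
    key, search_count, click_count, rank = line.split("\t")
    query, docid = key.strip("<> ").split(", ")
    return {
        "query": query,
        "docid": docid,
        "searches": int(search_count),
        "clicks": int(click_count),
        "rank": int(rank),
    }

record = parse_clicklog("<weather today, doc123>\t120\t37\t2")
```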
6) BERT (Bidirectional Encoder Representations from Transformers): a bidirectional attention neural network model proposed by Google.
7) Token (word unit): before any actual processing is performed on the input text, it needs to be divided into language units such as words, punctuation marks, numbers, or pure alphanumerics. These units are referred to as word units.
8) Softmax: the normalized exponential function, a generalization of the logistic function. It "compresses" a K-dimensional vector of arbitrary real numbers into another K-dimensional real vector such that each element lies in the range (0, 1) and all elements sum to 1.
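A direct implementation of the definition above is short; the max-shift is a standard numerical-stability trick (it does not change the result, since softmax is invariant to adding a constant to every input).

```python
# Numerically stable softmax: exponentiate (shifted by the max) and
# normalise so the K outputs lie in (0, 1) and sum to 1.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```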
The embodiment of the invention can be implemented in combination with cloud technology. Cloud technology refers to a hosting technology that integrates hardware, software, network, and other resources in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data; it can also be understood as the general term for network technology, information technology, integration technology, management platform technology, application technology, and the like applied on the basis of the cloud computing business model. Background services of technical network systems, such as video websites, picture websites, and other portal websites, require a large amount of computing and storage resources, so cloud technology needs the support of cloud computing.
Through cloud technology, the sample information labeling method provided by the application can record the target media information domain, used as a recommendation reference, in the corresponding cloud server. When the target object browses media information on different terminals, corresponding media information is recommended to the target object through the target media information domain stored in the cloud server as the recommendation reference, so that the target object obtains a more accurate media information recommendation result.
It should be noted that cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space, and information services as required. The network that provides the resources is referred to as the "cloud". From the user's perspective, resources in the cloud are infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for by use. As a basic capability provider of cloud computing, a cloud computing resource pool platform, referred to as a cloud platform for short and generally called Infrastructure as a Service (IaaS), deploys multiple types of virtual resources in the resource pool for external clients to select and use. The cloud computing resource pool mainly comprises computing devices (which may be virtualized machines, including an operating system), storage devices, and network devices.
Before introducing the sample information labeling method provided by the application, the drawbacks of media information recommendation in the related art are briefly described. In relevance labeling in the related art, a user inputs a question sentence (query); the search engine backend recalls a series of document sets containing the query information according to a pre-established index, ranks the candidate documents by computing relevance scores between the query and the candidate document set, fuses ranking factors such as document quality, authority, and timeliness for re-ranking, and returns the final ranking result to the user. The key search ranking algorithm is a relevance ranking algorithm; researchers need a large amount of < query, doc > relevance labeling data when training the model. In the related art, this data is labeled manually by data outsourcing companies, but the economic cost is high, manual labeling is inefficient, and a large number of labeled samples cannot be obtained in time during model training and development.
To solve the above drawbacks in media information recommendation, the present application provides a sample information labeling method. Referring to fig. 1, fig. 1 is a schematic view of a usage scenario of a media information processing method based on a sample information labeling model provided in an embodiment of the present application. A terminal (including a terminal 10-1 and a terminal 10-2) is provided with a client of application software with a text input function; through this text input client, a user can input a corresponding question sentence, and the client can also receive a corresponding text relevance labeling result and display it to the user. The terminal is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two, and uses wireless links for data transmission.
As an example, the server 200 is configured to deploy the sample information labeling apparatus and train the sample information labeling model: it acquires positive example sample information from a data source and labels it to obtain labeled positive example sample information; acquires negative example sample information from the data source; acquires static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively; trains a sample information labeling model based on the labeled positive example sample information, the negative example sample information, and the static statistical feature sets, determining model parameters of the sample information labeling model; and labels the sample information in the data source through the sample information labeling model to obtain labeled sample information.
The sample information labeling method provided by the embodiment of the application is based on Artificial Intelligence (AI): the theory, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of sensing, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the application, the artificial intelligence software technologies mainly involved include speech processing and machine learning. For example, Automatic Speech Recognition (ASR) in speech technology may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, Machine Learning (ML) may be involved, which is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence, and it is applied throughout all areas of artificial intelligence. Machine learning typically includes techniques such as deep learning, which includes artificial neural networks such as the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and the Deep Neural Network (DNN).
It can be appreciated that the sample information labeling method and speech processing provided by the application can be applied to an intelligent device, which may be any device with an information display function, for example an intelligent terminal, an intelligent home device (such as an intelligent speaker or an intelligent washing machine), an intelligent wearable device (such as a smart watch), a vehicle-mounted intelligent central control system (displaying media information to a user through applets executing different tasks), or an AI medical device (displaying treatment cases through media information).
The structure of the sample information labeling apparatus according to the embodiment of the present invention is described in detail below. The sample information labeling apparatus may be implemented in various forms, such as a dedicated terminal with a sample information labeling function, or a server provided with the sample information labeling function, for example the server 200 in fig. 1. Fig. 2 is a schematic diagram of the composition structure of the sample information labeling apparatus according to the embodiment of the present invention. It can be understood that fig. 2 only shows an exemplary structure, not the entire structure, of the sample information labeling apparatus, and part or all of the structure shown in fig. 2 can be implemented as required.
The sample information labeling apparatus provided by the embodiment of the invention comprises at least one processor 201, a memory 202, a user interface 203, and at least one network interface 204. The various components in the sample information labeling apparatus are coupled together by a bus system 205. It can be understood that the bus system 205 is used to enable connection and communication between these components. In addition to the data bus, the bus system 205 includes a power bus, a control bus, and a status signal bus; however, for clarity of illustration, the various buses are all labeled as the bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, keyboard, mouse, trackball, click wheel, keys, buttons, touch pad, or touch screen, etc.
It will be appreciated that the memory 202 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The memory 202 in embodiments of the present invention is capable of storing data to support operation of the terminal (e.g., 10-1). Examples of such data include any computer program for operating on a terminal (e.g., 10-1), such as an operating system and application programs. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application may comprise various applications.
In some embodiments, the sample information labeling device provided by the embodiment of the present invention may be implemented by combining software and hardware. As an example, the sample information labeling device provided by the embodiment of the present invention may be a processor in the form of a hardware decoding processor, which is programmed to execute the training method of the sample information labeling model provided by the embodiment of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of implementing the sample information labeling device provided by the embodiment of the present invention by combining software and hardware, the sample information labeling device may be directly embodied as a combination of software modules executed by the processor 201. The software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads the executable instructions included in the software modules in the memory 202 and, in combination with necessary hardware (including, for example, the processor 201 and other components connected to the bus 205), completes the training method of the sample information labeling model provided by the embodiment of the present invention.
By way of example, the processor 201 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like.
As an example of a hardware implementation of the sample information labeling apparatus provided by the embodiment of the present invention, the apparatus may be directly implemented by the processor 201 in the form of a hardware decoding processor; for example, one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), or other electronic components may implement the training method of the sample information labeling model provided by the embodiment of the present invention.
The memory 202 in embodiments of the present invention is used to store various types of data to support the operation of the sample information labeling apparatus. Examples of such data include any executable instructions for operation on the sample information labeling device; a program implementing the training method of the sample information labeling model of the embodiment of the present invention may be included in these executable instructions.
In other embodiments, the sample information labeling device provided by the embodiments of the present invention may be implemented in software. Fig. 2 shows the sample information labeling device stored in the memory 202, which may be software in the form of a program, a plug-in, or a series of modules. As an example of the program stored in the memory 202, the sample information labeling device includes the following software modules:
An information transmission module 2081 and an information processing module 2082. When the software modules in the sample information labeling device are read by the processor 201 into the RAM and executed, the training method of the sample information labeling model provided by the embodiment of the invention is implemented, where the functions of each software module in the sample information labeling device include:
The information transmission module 2081 is configured to obtain positive example sample information from a data source, and label the positive example sample information to obtain labeled positive example sample information.
An information processing module 2082, configured to obtain negative example sample information from the data source.
The information processing module 2082 is configured to obtain static statistics feature sets corresponding to the positive example sample information and the negative example sample information respectively.
The information processing module 2082 is configured to train a sample information labeling model based on the labeled positive sample information, the labeled negative sample information, and the static statistical feature set, and determine model parameters of the sample information labeling model.
The information processing module 2082 is configured to perform labeling processing on the sample information in the data source through the sample information labeling model, so as to obtain labeled sample information.
According to the electronic device shown in fig. 2, in one aspect of the application, the application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the embodiments and combinations of embodiments provided in the various alternative implementations of the sample information labeling method described above.
Referring to fig. 3, fig. 3 is an optional flowchart of the sample information labeling method provided by the embodiment of the present invention. It will be understood that the steps shown in fig. 3 may be performed by various electronic devices running the sample information labeling apparatus, such as a dedicated terminal or a server equipped with the sample information labeling apparatus. The steps shown in fig. 3 are described below.
Step 301, a sample information labeling device obtains positive sample information from a data source, and labels the positive sample information to obtain labeled positive sample information.
In some embodiments of the present invention, positive example sample information is obtained from a data source, and the positive example sample information is labeled, so as to obtain labeled positive example sample information, which may be implemented by the following ways:
At least one piece of initial sample information is randomly acquired from the data source, wherein the initial sample information at least comprises a question statement and a reply statement corresponding to the question statement. The initial sample information is labeled according to the degree of semantic relevance between the question statement and the reply statement to obtain initial sample information with different degrees of semantic relevance; positive example sample information is screened from the initial sample information with different degrees of semantic relevance, and the positive example sample information is labeled based on its degree of relevance to obtain labeled positive example sample information. The degree of semantic relevance between the question sentence (query) and the reply sentence (doc) may comprise 5 different gears, respectively:
5 (excellent; result can be prioritized): the text semantics match completely, fully meeting the main requirement of the user's search query.
4 (satisfactory; result can be prioritized): the text semantics match highly, largely meeting the user's search query requirements.
3 (qualified; ranked below gears 4 and 5): the text is relevant or partially relevant, the semantics are partially relevant, part or a small portion of the search requirement is met, and there is no semantic drift.
2 (general; result ranked toward the back or not shown): the text is relevant or partially relevant and in the same coarse domain as the search requirement, but there is semantic drift, and the search query intent is not met.
1 (poor; result not shown): the text semantics are irrelevant.
Wherein 1 is the lowest gear and 5 the highest; a semantic relevance gear of 5 between the question sentence query and the reply sentence doc indicates that their semantic relevance is the highest.
Step 302, the sample information labeling device acquires negative sample information from the data source.
In some embodiments of the present invention, the obtaining of negative example sample information may be achieved by:
Sample information that has not been clicked in the data source is acquired as negative example sample information, or negative example sample information is screened from the initial sample information with different degrees of semantic relevance. In this way, non-clicked sample information in the data source can be selected as negative example sample information according to a preset negative-sample proportion, and sample information with relevance gears 1-3 from the previous embodiment can also be selected as negative example sample information, which improves the collection speed of negative example sample information and saves training time of the sample labeling model.
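The selection of positives and negatives described above can be sketched in Python (an illustrative sketch only; the sample structure, the function name `collect_training_samples`, and the exact handling of the preset negative-sample proportion are assumptions, not taken from the embodiment):

```python
import random

def collect_training_samples(labeled_samples, non_clicked, negative_ratio=1.0, seed=0):
    """Split gear-labeled <query, doc> samples into positives (gears 4-5) and
    negatives (gears 1-3), topping negatives up with randomly sampled
    non-clicked pairs until the preset negative:positive ratio is reached."""
    positives = [s for s in labeled_samples if s["gear"] >= 4]
    negatives = [s for s in labeled_samples if s["gear"] <= 3]
    target = int(len(positives) * negative_ratio)
    if len(negatives) < target:
        rng = random.Random(seed)
        extra = rng.sample(non_clicked, min(target - len(negatives), len(non_clicked)))
        negatives.extend(extra)
    return positives, negatives
```

Reusing already-labeled low-gear samples as negatives avoids a separate collection pass, which is the speed gain the text refers to.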
Step 303, the sample information labeling device acquires static statistical feature sets corresponding to the positive sample information and the negative sample information respectively.
In some embodiments of the present invention, obtaining the static statistical feature sets corresponding respectively to the positive example sample information and the negative example sample information may be implemented in the following way:
The screening dimension of the static statistical features in the static statistical feature set is determined according to the type of the data source; sample information in the positive example sample information and the negative example sample information of the data source is screened according to the screening dimension of the static statistical features; one-hot encoding processing is performed on the sample information to form the static statistical features corresponding respectively to the positive example sample information and the negative example sample information; and the static statistical features of different screening dimensions are combined to form the static statistical feature sets corresponding respectively to the positive example sample information and the negative example sample information. One-Hot Encoding encodes N sample states using an N-bit state register: each state has its own independent register bit, and only one bit is valid at any time. This may be accomplished by means of sklearn.
Specifically, in order to make different features in the static statistical feature set consistent in encoding and ensure the training accuracy of the sample information labeling model, the following describes the onehot encoding process, taking the number of searches of a question sentence query as an example:
When the number of searches of question sentence query1 is 49 and the number of searches of question sentence query2 is 50, representing the feature directly by the numerical value makes two queries whose search counts differ only slightly completely different and sparse in the feature dimension. Moreover, the value ranges of different feature dimensions differ greatly; for example, the real-time search counts in the search applet of an instant messaging client reach the millions, so directly encoding the original feature values would cause feature-value explosion and affect the training accuracy of the sample information labeling model. These drawbacks are therefore addressed by interval-mapped onehot encoding. For example, the search counts (i.e., query pv, qv for short) of different question sentences are mapped to different code values; that is, assume a dictionary with a total of 4 value ranges, referring to table 1:
TABLE 1
From the above table, different qv value ranges correspond to a 1 at the corresponding position, with the other bits encoded as 0. If the qv of a query in the log is less than 100, the query belongs to grade 4 of the qv feature, i.e., position 4 is 1, and the code is 0001.
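The interval-mapped qv encoding can be illustrated as follows (only the "less than 100 → grade 4 → 0001" range is stated above; the other interval boundaries in this sketch are assumptions):

```python
def qv_onehot(qv, boundaries=(100, 10_000, 1_000_000)):
    """Map a raw search count (qv) into one of 4 value ranges and one-hot
    encode the range. Only the '<100 -> grade 4 -> 0001' mapping is given
    in the text; the upper boundaries here are illustrative assumptions."""
    if qv < boundaries[0]:
        grade = 4   # long-tail queries, fewest searches
    elif qv < boundaries[1]:
        grade = 3
    elif qv < boundaries[2]:
        grade = 2
    else:
        grade = 1   # head queries with millions of searches
    code = [0, 0, 0, 0]
    code[grade - 1] = 1   # position `grade` (1-based, left to right) set to 1
    return "".join(str(b) for b in code)
```

Note that query1 (qv = 49) and query2 (qv = 50) now receive the same code, which is exactly the consistency the interval mapping is meant to provide.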
For other features in the static statistical feature set, the onehot classification encoding may be performed according to the encoding mode of the query search count in the foregoing embodiment (for example, divided into 4 grades, that is, the code length of each type of feature is a fixed length of 4) and then spliced directly. For example, the feature encoding process for a certain question statement Query-A is as follows:
The Query-A length onehot code is 0010, the Query-A search-count onehot code is 0001, and the Query-A click-rate onehot code is 0010; combining these onehot codes, the feature code of Query-A can be expressed as [ 0010 ] [ 0001 ] [ 0010 ].
The features of a certain reply sentence Doc-A clicked under Query-A are onehot-encoded in the same way: the Doc-A length onehot code is 0100 and the Doc-A click-count onehot code is 1000, so the feature code of Doc-A can be expressed as [ 0100 ] [ 1000 ]. The feature codes of Query-A and Doc-A are then spliced, i.e., [ 0010 ] [ 0001 ] [ 0010 ] [ 0100 ] [ 1000 ]. It should be noted that numerical attributes may be used directly as computable features, while non-numerical attributes need to be converted into computable numerical features. For example, if the samples have 10 categories, the 10 categories are numbered 1-10 in turn and converted into one-hot sparse vectors for feature calculation. For example, if sample a belongs to class 3 of the 10 categories, its attribute feature is [0,0,1,0,0,0,0,0,0,0].
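The splicing of the onehot codes and the categorical one-hot conversion can be reproduced as a short sketch (the helper name `onehot` and the list-of-codes representation are illustrative assumptions; the code values follow the Query-A / Doc-A example above):

```python
def onehot(index, length):
    """One-hot encode a 1-based category index into a fixed-length 0/1 list."""
    vec = [0] * length
    vec[index - 1] = 1
    return vec

# Each feature is encoded to a fixed length of 4; the query-side and doc-side
# codes are then spliced directly, as in the Query-A / Doc-A example.
query_a_codes = ["0010", "0001", "0010"]   # length, search count, click rate
doc_a_codes = ["0100", "1000"]             # length, click count
spliced = "".join(query_a_codes + doc_a_codes)

# A sample with 10 categories, belonging to class 3:
class_feature = onehot(3, 10)
```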
In some embodiments of the present invention, when the type of the data source is a search question and answer data source, the static statistical feature set includes at least one of a length, a click rate, and a number of searches of the positive example sample information, and a length, a click rate, and a number of searches of the negative example sample information.
In some embodiments of the invention, when the type of data source is an intelligent chat data source, the set of static statistical features includes at least one of a length and a number of searches of the positive example sample information and a length and a number of searches of the negative example sample information. Therefore, according to the type of the data source, a user can flexibly adjust the static statistical feature type in the static statistical feature set so as to reduce the training time of the sample annotation model and facilitate the rapid deployment of the sample annotation model.
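The source-type-dependent choice of screening dimensions can be sketched as a simple lookup (the type keys and dimension names are hypothetical; the embodiment only fixes which statistical features each source type contributes):

```python
# Hypothetical mapping from data-source type to the screening dimensions of
# the static statistical features, following the two cases in the text:
# search Q&A sources also use click rate, intelligent chat sources do not.
STATIC_FEATURE_DIMS = {
    "search_qa": ["length", "click_rate", "search_count"],
    "smart_chat": ["length", "search_count"],
}

def screening_dimensions(source_type):
    """Return the static-feature screening dimensions for a data-source type."""
    return STATIC_FEATURE_DIMS[source_type]
```

Keeping the mapping in one table is what lets a user adjust the feature types per data source without retraining logic changes.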
And 304, training a sample information labeling model by the sample information labeling device based on the labeled positive sample information, the labeled negative sample information and the static statistical feature set, and determining model parameters of the sample information labeling model.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a sample information labeling model in an embodiment of the present invention, where the sample information labeling model includes a bidirectional attention neural network and a feature encoding network. During training, the labeled positive example sample information, the labeled negative example sample information, and the static statistical feature set are combined to form a training sample set; a convergence function of the sample information labeling model is determined; based on the training sample set, the bidirectional attention neural network of the sample information labeling model is trained and its network parameters are determined; based on the training sample set, the feature encoding network of the sample information labeling model is trained and its network parameters are determined; and the network parameters of the bidirectional attention neural network and of the feature encoding network are updated until the convergence condition corresponding to the convergence function is reached. As shown in fig. 4, the text semantic vectors of the question sentence query and the reply sentence doc are extracted through the bidirectional attention neural network, and the CLS vector is taken as the semantic feature vector of the text; a pooling layer is connected to extract the sentence feature vectors u and v of the query and doc, and the difference vector of the query and doc, i.e., u-v, is calculated from the cross features of u and v and spliced with u and v to form the similarity calculation vector representation corresponding to the query and doc. This representation is then spliced with the output result of the feature encoding network to finally form the relevance label of any sample.
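The construction of the similarity calculation vector — splicing u, v, and the difference vector u-v, then appending the feature encoding output — can be sketched in plain Python (the vector contents below are illustrative; in the actual model u and v are pooled outputs of the bidirectional attention network):

```python
def similarity_vector(u, v):
    """Splice the pooled sentence vectors u (query) and v (doc) with their
    difference vector u - v, as in the structure of fig. 4."""
    diff = [a - b for a, b in zip(u, v)]
    return u + v + diff

def labeling_input(u, v, static_code):
    # The similarity representation is spliced with the output of the
    # feature encoding network (represented here by a static feature code).
    return similarity_vector(u, v) + static_code
```

For d-dimensional sentence vectors the similarity representation has 3d dimensions, plus the length of the static feature code.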
In some embodiments of the present invention, after the sample information labeling model is trained, if the relevance gear automatically labeled by the model for a < query, doc > pair with a high click rate is very low (i.e., not relevant), this may mean that the high click rate of the < query, doc > pair is not reliable. An alarm message may therefore be sent to prompt a user to perform manual intervention screening, and the < query, doc > pair is prohibited from being used as sample information for training other sample information labeling models.
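The alarm condition described above — a high click rate paired with a low automatically labeled relevance gear — can be sketched as a filter (the click-rate threshold and gear cutoff below are assumptions, not values from the embodiment):

```python
def flag_unreliable_pairs(pairs, ctr_threshold=0.3, low_gear=2):
    """Flag <query, doc> pairs whose click rate is high but whose
    model-labeled relevance gear is low; such pairs are suspicious and
    should be withheld from further training pending manual screening."""
    return [p for p in pairs
            if p["click_rate"] >= ctr_threshold and p["gear"] <= low_gear]
```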
In some embodiments of the present invention, when static statistics are collected, various behavior data of a user matched with the corresponding client may be collected through different program components, and the original log of the user behavior data may be effectively extracted; for example, the device number (user account) of the user, the media information type, the browsing duration of media information, and the browsing integrity parameter of the media information are extracted. The historical click behaviors of users and the corresponding browsing durations are recorded through the subscription service and stored in Redis, and the online recommendation system pulls the historical click behaviors of the corresponding user when a user request arrives.
And 305, the sample information labeling device labels the sample information in the data source through the sample information labeling model to obtain labeled sample information.
Referring to fig. 5, fig. 5 is an optional flowchart of a sample information labeling method according to an embodiment of the present invention. It will be appreciated that the steps shown in fig. 5 may be performed by various electronic devices running the sample correlation labeling apparatus, for example, a dedicated terminal with a sample correlation labeling function, a server with a sample processing function, or a server cluster. The steps shown in fig. 5 are described below.
Step 501, obtaining sample information to be processed in the data source and static statistical characteristics, wherein the sample information to be processed comprises a question statement and a reply statement.
Step 502, triggering a corresponding word segmentation library according to text parameter information carried by the sample information to be processed.
And 503, performing word segmentation processing on the sample information to be processed through the triggered word dictionary of the word segmentation library to form different word-level feature vectors.
In some embodiments of the present invention, the language habits and operation habits of different users differ, and the word segmentation manner needs to be adjusted to suit different users. This is especially true for Chinese: Chinese characters are the basic ideographic units, while the smallest meaningful semantic units are words, and since there is no space between words to serve as segmentation, as there is between English words, which characters in a sentence of text form words is uncertain. Word segmentation of Chinese text is therefore important work. Moreover, the text to be processed contains content that is valuable only for natural language understanding; for the sample information labeling model, it is necessary to determine what constitutes a truly valuable retrieval basis for querying the relevant content. Through the word segmentation process shown in steps 502-503, a word-level feature vector set corresponding to the text can thus be formed while meaningless word-level feature vectors, such as "ground" and "get", are avoided.
And 504, carrying out denoising processing on the different word-level feature vectors to form a word-level feature vector set corresponding to the sample information to be processed.
In some embodiments of the invention, a dynamic noise threshold matched with the use environment of the sample information labeling model can be determined; the question text set is denoised according to the dynamic noise threshold, a dynamic word segmentation strategy matched with the dynamic noise threshold is triggered, and the question text is segmented according to this strategy to form a dynamic word-level feature vector set corresponding to the question text. Because the use environments of sample information labeling models (such as the users of the corpus) differ, the matched dynamic noise thresholds also differ. For example, in an academic translation environment, the question text displayed by the terminal and the corresponding reply sentences include only question text from academic papers, and the matched dynamic noise threshold needs to be smaller than the dynamic noise threshold in a reading environment of entertainment information text.
Alternatively, a fixed noise threshold corresponding to the use environment of the sample information labeling model is determined; the question text set is denoised according to the fixed noise threshold, a fixed word segmentation strategy matched with the fixed noise threshold is triggered, and the question text is segmented according to this strategy to form a fixed word-level feature vector set corresponding to the question text. When the sample information labeling model is solidified in a corresponding hardware mechanism, such as a vehicle-mounted terminal or an intelligent medical system, and the use environment is professional-terminology text (or text in a certain field), the noise is relatively uniform, so using a fixed noise threshold corresponding to the solidified sample information labeling model can effectively improve the processing speed of the model, reduce the waiting time of users, and improve the user experience.
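The dictionary-driven word segmentation and denoising of steps 502-504 can be illustrated with a minimal forward-maximum-matching segmenter and a stopword filter (a stand-in sketch; the actual triggered word segmentation library, its dictionary, and its stopword list are not specified in the text):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Minimal forward-maximum-matching word segmenter: at each position,
    greedily take the longest dictionary word, falling back to a single
    character when no dictionary word matches."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in dictionary:
                words.append(cand)
                i += l
                break
    return words

def denoise(words, stopwords):
    # Drop meaningless function words before building word-level vectors.
    return [w for w in words if w not in stopwords]
```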
And 505, processing the word-level feature vector set and the static statistical features through the sample information labeling model to determine the relevance of the question statement and the answer statement.
And 506, marking the sample information to be processed according to the relevance of the question statement and the reply statement to obtain a relevance marking result of the sample information to be processed.
The chat corpus labeling method provided by the embodiment of the application is described below with a chat corpus labeling model packaged in a WeChat applet. Fig. 6 is a schematic view of the application environment of the sample information labeling model in the embodiment of the application. As shown in fig. 6, with the development of human-computer interaction technology, more and more intelligent products based on human-computer interaction technology have emerged. For example, when a question statement is input through a search applet provided by an instant messaging client, the sample information in the data source can be labeled using the sample information labeling method provided by the application, and corresponding reply information can be generated using the labeled sample information. In this process, the sample information labeling method provided by the application determines the relevance of any sample information and labels it, saving labor cost. Taking a question text input by a user, query = "A company financial report", as an example, the corresponding reply sentences are shown in table 2:
TABLE 2
Using the sample information labeling model provided by the application, only reply sentences with relevance gears 5 and 4 can be provided, so that the user obtains reply sentences that better meet the requirements.
The beneficial technical effects are as follows:
Positive example sample information is obtained from a data source and labeled to obtain labeled positive example sample information; negative example sample information is obtained from the data source; the static statistical feature sets corresponding respectively to the positive example sample information and the negative example sample information are obtained; a sample information labeling model is trained based on the labeled positive example sample information, the negative example sample information, and the static statistical feature set to determine model parameters of the sample information labeling model; and the sample information in the data source is labeled through the sample information labeling model to obtain labeled sample information. In this way, the relevance of any sample information is determined through machine learning technology and labeled, which eliminates the manual relevance-labeling step of the traditional approach, saves labor cost, reduces the dependence of the relevance quality of the text to be processed on manual labeling quality, improves the accuracy of text relevance labeling, and improves the user experience.
The foregoing description of the embodiments of the invention is not intended to limit the scope of the invention, but is intended to cover any modifications, equivalents, and alternatives falling within the spirit and principles of the invention.
Claims (10)
1. A method for labeling sample information, the method comprising:
Acquiring positive sample information from a data source, and marking the positive sample information to obtain marked positive sample information;
Acquiring negative example sample information from the data source;
Determining screening dimension of static statistical features in a static statistical feature set according to the type of the data source;
Screening sample information in the positive example sample information and the negative example sample information of the data source according to the screening dimension of the static statistical characteristics;
Performing one-hot encoding processing on the sample information to form static statistical features corresponding to the positive sample information and the negative sample information respectively;
Combining static statistical features of different screening dimensions to form static statistical feature sets corresponding to the positive example sample information and the negative example sample information respectively;
Training a sample information labeling model based on the labeling positive example sample information, the labeling negative example sample information and the static statistical feature set, and determining model parameters of the sample information labeling model;
and labeling the sample information in the data source through the sample information labeling model to obtain labeled sample information.
2. The method of claim 1, wherein the obtaining the positive example sample information in the data source and labeling the positive example sample information to obtain labeled positive example sample information includes:
randomly acquiring at least one piece of initial sample information in the data source, wherein the initial sample information at least comprises a question sentence and a reply sentence corresponding to the question sentence;
Labeling the initial sample information according to the degree of semantic relevance between the question sentences and the reply sentences to obtain initial sample information with different degrees of semantic relevance;
screening positive sample information from the initial sample information with different degrees of semantic relevance;
and labeling the positive example sample information based on the correlation degree of the positive example sample information to obtain labeled positive example sample information.
3. The method according to claim 1, wherein the method further comprises:
When the type of the data source is a search question and answer data source, the set of static statistical features includes at least one of:
The length, click rate and search times of the positive example sample information and the length, click rate and search times of the negative example sample information;
when the data source type is an intelligent chat data source, the set of static statistical features includes at least one of:
the length and the searching times of the positive example sample information and the length and the searching times of the negative example sample information.
4. The method of claim 1, wherein the obtaining negative example sample information in the data source comprises:
acquiring non-clicked sample information in the data source as negative sample information, or
And screening negative example sample information from the initial sample information with different degrees of semantic relevance.
5. The method of claim 1, wherein the training a sample information labeling model based on the labeled positive example sample information, the negative example sample information, and the set of static statistical features, determining model parameters of the sample information labeling model, comprises:
Combining the marked positive example sample information, the negative example sample information and the static statistical feature set to form a training sample set;
determining a convergence function of the sample information annotation model;
Training a bidirectional attention neural network of the sample information labeling model based on the training sample set, and determining network parameters of the bidirectional attention neural network;
training a feature coding network of the sample information labeling model based on the training sample set, and determining network parameters of the feature coding network;
and updating the network parameters of the bidirectional attention neural network and the network parameters of the feature coding network until reaching the convergence condition corresponding to the convergence function.
6. The method of claim 1, wherein the labeling the sample information in the data source by the sample information labeling model to obtain labeled sample information comprises:
obtaining sample information to be processed and static statistical features in the data source, wherein the sample information to be processed comprises a question sentence and a reply sentence;
triggering a corresponding word segmentation library according to text parameter information carried by the sample information to be processed;
performing word segmentation on the sample information to be processed through the word dictionary of the triggered word segmentation library to form different word-level feature vectors;
denoising the different word-level feature vectors to form a word-level feature vector set corresponding to the sample information to be processed;
processing the word-level feature vector set and the static statistical features through the sample information labeling model to determine the relevance between the question sentence and the reply sentence;
and labeling the sample information to be processed according to the relevance between the question sentence and the reply sentence to obtain a relevance labeling result for the sample information to be processed.
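The claim-6 flow (segment, vectorize, score relevance, label) can be sketched end to end with toy stand-ins; the real model is the bidirectional attention network plus feature coding network, and real systems would use a proper word segmentation library. Everything below is hypothetical:

```python
# Hypothetical end-to-end sketch of the labeling flow: segment the question
# and reply sentences, score their relevance, and emit a label.

def segment(text: str) -> list:
    """Stand-in for the word segmentation library (whitespace split)."""
    return text.lower().split()

def relevance(question: str, reply: str) -> float:
    """Toy word-overlap relevance score in [0, 1] (Jaccard similarity)."""
    q, r = set(segment(question)), set(segment(reply))
    return len(q & r) / max(len(q | r), 1)

def label_sample(question: str, reply: str, threshold: float = 0.2) -> str:
    """Label the question/reply pair according to its relevance score."""
    return "relevant" if relevance(question, reply) >= threshold else "irrelevant"

print(label_sample("how to reset my password", "reset your password via settings"))
```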
7. A sample information labeling apparatus, the apparatus comprising:
an information transmission module, configured to acquire positive example sample information from a data source and label the positive example sample information to obtain labeled positive example sample information;
an information processing module, configured to acquire negative example sample information from the data source;
the information processing module is configured to determine screening dimensions of static statistical features in a static statistical feature set according to the type of the data source, screen sample information in the positive example sample information and the negative example sample information of the data source according to the screening dimensions of the static statistical features, perform one-hot encoding on the screened sample information to form static statistical features respectively corresponding to the positive example sample information and the negative example sample information, and combine the static statistical features of different screening dimensions to form static statistical feature sets respectively corresponding to the positive example sample information and the negative example sample information;
the information processing module is configured to train a sample information labeling model based on the labeled positive example sample information, the negative example sample information, and the static statistical feature set, and determine model parameters of the sample information labeling model;
the information processing module is configured to label the sample information in the data source through the sample information labeling model to obtain labeled sample information.
8. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the sample information labeling method of any of claims 1-6.
9. An electronic device, the electronic device comprising:
a memory for storing executable instructions;
A processor configured to implement the sample information labeling method of any of claims 1-6 when executing the executable instructions stored in the memory.
10. A computer readable storage medium storing executable instructions which when executed by a processor implement the sample information labeling method of any of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210301392.XA CN114676775B (en) | 2022-03-24 | 2022-03-24 | Sample information labeling method, device, equipment, program and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114676775A CN114676775A (en) | 2022-06-28 |
| CN114676775B true CN114676775B (en) | 2025-02-11 |
Family
ID=82073560
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210301392.XA Active CN114676775B (en) | 2022-03-24 | 2022-03-24 | Sample information labeling method, device, equipment, program and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114676775B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118485046B (en) * | 2024-07-08 | 2024-09-17 | 北京中科闻歌科技股份有限公司 | Labeling data processing method and device, electronic equipment and computer storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105975459A (en) * | 2016-05-24 | 2016-09-28 | 北京奇艺世纪科技有限公司 | Lexical item weight labeling method and device |
| CN113609847A (en) * | 2021-08-10 | 2021-11-05 | 北京百度网讯科技有限公司 | Information extraction method, device, electronic device and storage medium |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101866337B (en) * | 2009-04-14 | 2014-07-02 | 日电(中国)有限公司 | Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model |
| CN111492364B (en) * | 2017-12-15 | 2022-09-23 | 华为技术有限公司 | Data labeling method and device and storage medium |
| CN109766540B (en) * | 2018-12-10 | 2022-05-03 | 平安科技(深圳)有限公司 | General text information extraction method and device, computer equipment and storage medium |
| CN111125323B (en) * | 2019-11-21 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Chat corpus labeling method and device, electronic equipment and storage medium |
| CN111191078B (en) * | 2020-01-08 | 2024-05-07 | 深圳市雅阅科技有限公司 | Video information processing method and device based on video information processing model |
| CN111737952A (en) * | 2020-06-24 | 2020-10-02 | 深圳前海微众银行股份有限公司 | A training method and device for a sequence labeling model |
| CN113392295A (en) * | 2021-06-24 | 2021-09-14 | 上海商汤科技开发有限公司 | Data annotation method, platform, electronic equipment and computer storage medium |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105975459A (en) * | 2016-05-24 | 2016-09-28 | 北京奇艺世纪科技有限公司 | Lexical item weight labeling method and device |
| CN113609847A (en) * | 2021-08-10 | 2021-11-05 | 北京百度网讯科技有限公司 | Information extraction method, device, electronic device and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114676775A (en) | 2022-06-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
| CN117708274A (en) | Search question-answering system based on large model, method and electronic equipment thereof | |
| CN113282711B (en) | A text matching method, device, electronic device and storage medium for Internet of Vehicles | |
| CN112905768A (en) | Data interaction method, device and storage medium | |
| CN115688792B (en) | Document-based question generation method, device and server | |
| CN113806510B (en) | Legal provision retrieval method, terminal equipment and computer storage medium | |
| Liu et al. | Application of entity relation extraction method under CRF and syntax analysis tree in the construction of military equipment knowledge graph | |
| CN103559193A (en) | Topic modeling method based on selected cell | |
| CN119669530B (en) | Knowledge graph generation-assisted teaching question answering method and system based on LLM | |
| CN113934835A (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
| CN110427519B (en) | Video processing method and device | |
| CN119088915A (en) | Integrated retrieval dialogue method, system and medium based on large model knowledge base | |
| CN111694967B (en) | Attribute extraction method, attribute extraction device, electronic equipment and medium | |
| CN114676775B (en) | Sample information labeling method, device, equipment, program and storage medium | |
| KR102685135B1 (en) | Video editing automation system | |
| CN111914201B (en) | Processing method and device of network page | |
| Rogushina et al. | Use of ontologies for metadata records analysis in big data | |
| CN117235138B (en) | A cross-library API recommendation method during code migration | |
| Mgarbi et al. | Building a recommendation system based on the job offers extracted from the web and the skills of job seekers | |
| CN116933069B (en) | Content resource detection model training method, content resource detection method and device | |
| CN117390191A (en) | Policy text topic classification method and device | |
| Rahman et al. | ChartSumm: A large scale benchmark for Chart to Text Summarization | |
| CN117009170A (en) | Training sample generation method, device, equipment and storage medium | |
| CN116150464B (en) | Media information recommendation methods, devices, equipment, programs, and storage media | |
| Amsa–nguan et al. | Development of the Topic Tagging System for Thai and English-Translated Web Contents |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||