
CN114564557B - A method for updating a corpus processing model, a method for determining a category, and a device - Google Patents

A method for updating a corpus processing model, a method for determining a category, and a device

Info

Publication number
CN114564557B
Authority
CN
China
Prior art keywords: corpus, standard, sample, category, vector
Prior art date
Legal status (assumption, not a legal conclusion): Active
Application number
CN202011363647.2A
Other languages
Chinese (zh)
Other versions
CN114564557A
Inventor
尚航
杨森
Current Assignee (listing may be inaccurate)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011363647.2A
Publication of CN114564557A
Application granted
Publication of CN114564557B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3347 Query execution using vector based model
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/319 Inverted lists
    • G06F 16/328 Indexing structures; Management therefor
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Creation or modification of classes or clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present disclosure relates to a method for updating a corpus processing model, a method for determining a category, and a device. The method includes: obtaining a current batch sample set; grouping the sample corpora in the current batch sample set according to the category annotation information they carry, so that sample corpora carrying the same category annotation information are in the same sample corpus group; obtaining a characterization vector of each sample corpus based on the current corpus processing model; calculating the correlation between the characterization vector of a sample corpus and the characterization vectors of same-group sample corpora to obtain a first correlation of the sample corpus; calculating the correlation between the characterization vector of the sample corpus and the characterization vectors of different-group sample corpora to obtain a second correlation of the sample corpus; adjusting the parameters of the current corpus processing model according to the first correlation and the second correlation until a model convergence condition is met, and using the current corpus processing model that meets the model convergence condition as the target corpus processing model. The present disclosure can improve the efficiency of model updating.

Description

Corpus processing model updating method, category determining method and device
Technical Field
The disclosure relates to the field of artificial intelligence, and in particular to a corpus processing model updating method, a category determining method, and a device.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, i.e. the language people use daily, and is closely related to research in linguistics.
Corresponding categories are determined for a corpus to be processed, and these categories can, to a certain extent, reflect a profile of that corpus. In the related art, a corpus processing model is often used when determining corresponding categories for a corpus to be processed. The corpus processing model takes a sample corpus carrying category labeling information as its modeling unit and focuses on the relationship between the sample corpus and the category labeling information corresponding to it. Over time, changes in the related business categories (e.g., adding new categories or deleting old categories) require updating the model based on sample corpora carrying the changed category labeling information, which results in slower model updates and longer model update cycles.
Disclosure of Invention
The disclosure provides a corpus processing model updating method, a category determining method, and a device, so as to at least solve the problems of slow model updating and long update cycles in the related art. The technical solution of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a method for updating a corpus processing model, including:
Acquiring a current batch sample set;
Grouping according to category marking information carried by sample corpus in the current batch of sample sets, so that sample corpus carrying the same category marking information is located in the same sample corpus group;
obtaining a characterization vector of the sample corpus based on a current corpus processing model;
Calculating the correlation degree between the characterization vector of the sample corpus and the characterization vectors of same-group sample corpora to obtain a first correlation degree of the sample corpus, wherein the same-group sample corpora are other sample corpora located in the same sample corpus group as the sample corpus;
Calculating the correlation degree between the characterization vector of the sample corpus and the characterization vectors of different-group sample corpora to obtain a second correlation degree of the sample corpus, wherein the different-group sample corpora are other sample corpora located in a different sample corpus group from the sample corpus;
And adjusting parameters of the current corpus processing model according to the first correlation degree and the second correlation degree until a model convergence condition is satisfied, and taking the current corpus processing model that satisfies the model convergence condition as a target corpus processing model.
In an exemplary embodiment, the step of adjusting parameters of the current corpus processing model according to the first relevance and the second relevance until a model convergence condition is satisfied includes:
Obtaining the actual global relevance of the current batch of the sample corpus according to the first relevance and the second relevance;
Acquiring the current batch expected global correlation of the sample corpus;
Calculating a loss function value according to the actual global correlation of the current batch and the expected global correlation of the current batch;
And adjusting parameters of the current corpus processing model based on the loss function value until a model convergence condition is met.
In an exemplary embodiment, the step of obtaining the actual global relevance of the current batch of the sample corpus according to the first relevance and the second relevance includes:
And normalizing the first correlation degree and the second correlation degree to obtain the actual global correlation degree of the current batch.
In an exemplary embodiment, the step of obtaining the current batch sample set includes:
receiving a category expansion instruction, wherein the category expansion instruction comprises a target category and a sample corpus carrying target category labeling information;
And constructing the current batch of sample set based on the sample corpus carrying the target category labeling information.
In an exemplary embodiment, the step of obtaining the token vector of the sample corpus based on the current corpus processing model includes:
performing word segmentation on the sample corpus by using a first corpus processing structure of the current corpus processing model to obtain at least two sample corpus fragments, performing vector conversion on each sample corpus fragment, and obtaining a matrix representing the sample corpus based on vectors corresponding to each sample corpus fragment;
and carrying out coding processing on the matrix by utilizing a second corpus processing structure of the current corpus processing model to obtain a representation vector of the sample corpus.
According to a second aspect of embodiments of the present disclosure, there is provided a category determining method, including:
acquiring a corpus to be processed of an indication target object;
Taking the corpus to be processed as input, and obtaining a characterization vector of the corpus to be processed by using the target corpus processing model of the first aspect;
Determining a standard characterization vector matched with the characterization vector of the corpus to be processed based on the similarity between the characterization vector of the corpus to be processed and a plurality of standard characterization vectors, wherein each standard characterization vector carries corresponding category marking information;
And determining the category of the target object based on the category marking information corresponding to the matched standard characterization vector.
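Purely as an illustration of the category determining flow above, the following Python sketch matches the characterization vector of a corpus to be processed against standard characterization vectors by cosine similarity and returns the category labeling information carried by the best match. The function names, the use of NumPy, and the single-nearest-match decision rule are assumptions made for this sketch, not the disclosure's exact implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity between two characterization vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def determine_category(query_vector: np.ndarray,
                       standard_vectors: list,
                       standard_categories: list) -> str:
    # standard_vectors[i] is a standard characterization vector and
    # standard_categories[i] the category labeling information it carries.
    similarities = [cosine_similarity(query_vector, v) for v in standard_vectors]
    best = int(np.argmax(similarities))   # matched standard characterization vector
    return standard_categories[best]      # category of the target object
```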
In an exemplary embodiment, before the step of determining a standard token vector matching the token vector of the corpus to be processed based on the similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, the method further comprises the step of determining the plurality of standard token vectors;
the step of determining the plurality of standard token vectors comprises:
Obtaining a standard corpus, wherein the standard corpus records standard corpus and characterization vectors of the standard corpus, each standard corpus carries corresponding category marking information, and the characterization vectors of the standard corpus are obtained by utilizing the target corpus processing model;
Word segmentation is carried out on the corpus to be processed to obtain at least two corpus fragments;
Querying the standard corpus based on each corpus fragment to obtain a standard corpus set corresponding to each corpus fragment, wherein the standard corpus in the standard corpus set corresponding to the corpus fragment comprises the corpus fragments;
Obtaining an overall standard corpus collection from the standard corpus sets corresponding to the corpus fragments;
determining at least two target standard corpora based on the occurrence frequency of each standard corpus in the overall standard corpus collection;
And obtaining a characterization vector of each target standard corpus based on the standard corpus, and taking the characterization vector of the target standard corpus as the standard characterization vector.
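As an illustrative aid only, the following sketch shows how the steps above might rank candidate standard corpora by how often they appear across the per-fragment query results and keep the most frequent ones as target standard corpora. The `segment` callable, the dictionary-shaped inverted index and the `top_k` parameter are hypothetical stand-ins introduced for this sketch.

```python
from collections import Counter

def candidate_target_corpora(query: str,
                             segment,              # callable: str -> list of corpus fragments
                             inverted_index: dict, # fragment -> list of standard corpora containing it
                             top_k: int = 2) -> list:
    # Query the index with every fragment of the corpus to be processed,
    # merge the per-fragment standard corpus sets, and count occurrence frequency.
    frequency = Counter()
    for fragment in segment(query):
        for standard_corpus in inverted_index.get(fragment, []):
            frequency[standard_corpus] += 1
    # Keep at least two target standard corpora, ranked by frequency.
    return [corpus for corpus, _ in frequency.most_common(top_k)]
```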
In an exemplary embodiment, before the step of obtaining the standard corpus, the method further includes a step of constructing an inverted index for the standard corpus, where the inverted index is used to query a standard corpus containing the corpus segments based on the corpus segments;
correspondingly, the step of querying the standard corpus based on each corpus fragment to obtain the standard corpus corresponding to each corpus fragment comprises the following steps:
and inquiring the standard corpus based on the inverted index to obtain a standard corpus corresponding to each corpus fragment.
In an exemplary embodiment, the step of constructing an inverted index for the standard corpus includes:
word segmentation is respectively carried out on a plurality of standard corpus to obtain at least one corpus fragment corresponding to each standard corpus;
Each corpus fragment is used as an index keyword;
determining a standard corpus containing each index keyword;
constructing a first corpus list based on the standard corpus containing each index keyword;
and constructing an inverted index of each index keyword and a first corpus list corresponding to each index keyword.
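A minimal sketch of the inverted-index construction steps listed above, assuming the standard corpora are plain strings and that a `segment` callable performs the word segmentation; the disclosure does not prescribe this particular data layout.

```python
from collections import defaultdict

def build_inverted_index(standard_corpora: list, segment) -> dict:
    # Each corpus fragment becomes an index keyword mapping to the first
    # corpus list, i.e. the standard corpora that contain that keyword.
    index = defaultdict(list)
    for standard_corpus in standard_corpora:
        for fragment in set(segment(standard_corpus)):
            index[fragment].append(standard_corpus)
    return dict(index)
```

Under these assumptions, a lookup such as `index.get("mouse", [])` would return the first corpus list for the index keyword "mouse".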
In an exemplary embodiment, the method further comprises adjusting the inverted index:
Detecting, in response to received negative feedback on a category determination, whether abnormal data exists in the inverted index;
and deleting index keywords corresponding to the abnormal data when the abnormal data exist in the inverted index.
In an exemplary embodiment, before the step of obtaining the standard corpus, the method further includes a step of establishing a mapping relationship for the standard corpus, where the mapping relationship is used to query the category to which the standard corpus points based on the standard corpus;
Correspondingly, the step of determining the category of the target object based on the category marking information corresponding to the matched standard characterization vector comprises the following steps:
determining a standard corpus corresponding to the matched standard characterization vector;
And inquiring the standard corpus based on the mapping relation, determining the category pointed by the corresponding standard corpus, and taking the pointed category as the category of the target object.
In an exemplary embodiment, the step of establishing a mapping relationship for the standard corpus includes:
determining a plurality of categories based on category labeling information carried by the plurality of standard corpus;
determining a standard corpus pointing to each category;
constructing a second corpus list based on the standard corpus pointing to each category;
And establishing a mapping relation between each category and a second corpus list corresponding to each category.
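The mapping relation in the steps above can be pictured as a plain dictionary from each category to its second corpus list, queried in reverse to find the category a standard corpus points to. The sketch below is an assumption-laden illustration, with labeled pairs standing in for the standard corpora and their category labeling information.

```python
from collections import defaultdict
from typing import Optional

def build_category_mapping(labeled_corpora: list) -> dict:
    # labeled_corpora: (standard corpus, category) pairs.
    mapping = defaultdict(list)                 # category -> second corpus list
    for standard_corpus, category in labeled_corpora:
        mapping[category].append(standard_corpus)
    return dict(mapping)

def category_pointed_to(standard_corpus: str, mapping: dict) -> Optional[str]:
    # Query the mapping relation to find the category the corpus points to.
    for category, corpus_list in mapping.items():
        if standard_corpus in corpus_list:
            return category
    return None
```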
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for updating a corpus processing model, including:
A sample set acquisition unit configured to perform acquisition of a current batch of sample sets;
The grouping unit is configured to perform grouping according to category marking information carried by the sample corpus in the current batch of sample sets, so that the sample corpuses carrying the same category marking information are located in the same sample corpus group;
the representation vector obtaining unit is configured to obtain a representation vector of the sample corpus based on the current corpus processing model;
the first correlation calculation unit is configured to calculate the correlation between the characterization vector of the sample corpus and the characterization vector of the same group of sample corpus, so as to obtain the first correlation of the sample corpus, wherein the same group of sample corpus is other sample corpus which is positioned in the same sample corpus group with the sample corpus;
the second correlation degree calculation unit is configured to calculate the correlation degree between the characterization vector of the sample corpus and the characterization vector of a different group of sample corpus, so as to obtain the second correlation degree of the sample corpus, wherein the different group of sample corpus is other sample corpus which is positioned in a different sample corpus group with the sample corpus;
and a parameter adjustment unit configured to perform adjustment of parameters of the current corpus processing model to satisfy a model convergence condition according to the first correlation degree and the second correlation degree, and to take the current corpus processing model satisfying the model convergence condition as a target corpus processing model.
In an exemplary embodiment, the parameter adjustment unit includes:
the current batch actual global correlation obtaining unit is configured to obtain the current batch actual global correlation of the sample corpus according to the first correlation and the second correlation;
The current batch expected global correlation acquisition unit is configured to acquire the current batch expected global correlation of the sample corpus;
A loss function value calculation unit configured to perform calculation of a loss function value from the current lot actual global correlation and the current lot expected global correlation;
And a parameter adjustment subunit configured to perform adjustment of parameters of the current corpus processing model based on the loss function value to satisfy a model convergence condition.
In an exemplary embodiment, the current lot actual global correlation obtaining unit includes:
And the normalization processing unit is configured to perform normalization processing on the first correlation and the second correlation to obtain the actual global correlation of the current batch.
In an exemplary embodiment, the sample set acquisition unit includes:
A category expansion instruction receiving unit configured to receive a category expansion instruction, wherein the category expansion instruction comprises a target category and a sample corpus carrying target category labeling information;
And the sample set construction unit is configured to construct the current batch of sample sets based on the sample corpus carrying the target category labeling information.
In an exemplary embodiment, the token vector obtaining unit includes:
The first corpus processing unit is configured to execute word segmentation processing on the sample corpus by using a first corpus processing structure of the current corpus processing model to obtain at least two sample corpus fragments, perform vector conversion on each sample corpus fragment, and obtain a matrix representing the sample corpus based on vectors corresponding to each sample corpus fragment;
And the second corpus processing unit is configured to execute encoding processing on the matrix by using a second corpus processing structure of the current corpus processing model to obtain a representation vector of the sample corpus.
According to a fourth aspect of embodiments of the present disclosure, there is provided a category determining apparatus including:
the corpus acquisition unit is configured to acquire the corpus to be processed of the indication target object;
the model application unit is configured to execute the operation of taking the corpus to be processed as input and obtaining a characterization vector of the corpus to be processed by using the target corpus processing model in the first aspect;
the matching unit is configured to execute the determination of a standard token vector matched with the token vector of the corpus to be processed based on the similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, and each standard token vector carries corresponding category labeling information;
And the category determining unit is configured to determine the category of the target object based on the category labeling information corresponding to the matched standard characterization vector.
In an exemplary embodiment, the apparatus further includes a standard token vector determination unit including:
The standard corpus obtaining unit is configured to obtain a standard corpus, the standard corpus records standard corpus and characterization vectors of the standard corpus, each standard corpus carries corresponding category marking information, and the characterization vectors of the standard corpus are obtained by utilizing the target corpus processing model;
the first word segmentation processing unit is configured to perform word segmentation processing on the corpus to be processed to obtain at least two corpus fragments;
The first query unit is configured to perform query on the standard corpus based on each corpus fragment to obtain a standard corpus set corresponding to each corpus fragment, wherein standard corpus in the standard corpus set corresponding to the corpus fragment contains the corpus fragment;
The standard corpus collection obtaining unit is configured to obtain an overall standard corpus collection from the standard corpus sets corresponding to the corpus fragments;
The target standard corpus determining unit is configured to determine at least two target standard corpora based on the occurrence frequency of each standard corpus in the overall standard corpus collection;
And the standard token vector determination subunit is configured to acquire a token vector of each target standard corpus based on the standard corpus, and take the token vector of the target standard corpus as the standard token vector.
In an exemplary embodiment, the apparatus further includes an inverted index construction unit, and the inverted index constructed by the inverted index construction unit is used for querying a standard corpus including the corpus segments based on the corpus segments;
correspondingly, the first query unit includes:
and the first query subunit is configured to query the standard corpus based on the inverted index to obtain a standard corpus corresponding to each corpus fragment.
In an exemplary embodiment, the inverted index construction unit includes:
The second word segmentation processing unit is configured to perform word segmentation processing on a plurality of standard corpus respectively to obtain at least one corpus fragment corresponding to each standard corpus;
An index keyword determining unit configured to take each corpus fragment as an index keyword;
A first standard corpus determining unit configured to perform determination of a standard corpus containing each index keyword;
A first corpus list construction unit configured to execute construction of a first corpus list based on the standard corpus containing each index keyword;
And the inverted index construction subunit is configured to construct an inverted index of each index keyword and the first corpus list corresponding to each index keyword.
In an exemplary embodiment, the apparatus further includes an inverted index adjustment unit, the inverted index adjustment unit including:
A negative feedback receiving unit configured to, in response to received negative feedback on a category determination, detect whether abnormal data exists in the inverted index;
And the deleting unit is configured to delete the index key words corresponding to the abnormal data when the abnormal data exists in the inverted index.
In an exemplary embodiment, the apparatus further includes a mapping relationship establishing unit, where the mapping relationship established by the mapping relationship establishing unit is used for querying, based on a standard corpus, a category to which the standard corpus points;
accordingly, the category determining unit includes:
The second standard corpus determining unit is configured to determine standard corpus corresponding to the matched standard characterization vector;
And a category determination subunit configured to perform querying the standard corpus based on the mapping relation, determining a category to which the corresponding standard corpus points, and regarding the pointed category as the category of the target object.
In an exemplary embodiment, the mapping relation establishing unit includes:
The standard corpus processing unit is configured to determine a plurality of categories based on category labeling information carried by the plurality of standard corpora;
A third standard corpus determining unit configured to perform determination of standard corpus directed to each category;
a second corpus list construction unit configured to execute construction of a second corpus list based on the standard corpus directed to each category;
and the mapping relation establishing subunit is configured to establish a mapping relation between each category and a second corpus list corresponding to each category.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of updating the corpus processing model as described in the first aspect or the category determining method as described in the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of updating a corpus processing model as described in the first aspect or the category determining method as described in the second aspect.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, which when run on a computer, causes the computer to perform the method of updating a corpus processing model according to the first aspect or the method of determining categories according to the second aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
grouping the sample corpora in the current batch sample set according to category labeling information, outputting characterization vectors of the sample corpora with the current corpus processing model, obtaining a first correlation degree and a second correlation degree based on the characterization vectors and the grouping information, and then adjusting parameters of the current corpus processing model according to the first correlation degree and the second correlation degree to obtain a target corpus processing model. The first correlation degree reflects the correlation between same-group sample corpora, and the second correlation degree reflects the correlation between different-group sample corpora. When the corpus processing model is updated with the current batch sample set, attention is paid to the first and second correlation degrees; compared with focusing on a sample corpus and the category labeling information it carries, this technical solution does not need to combine sample sets of previous batches to update the model, so the efficiency of updating the model when related business categories change can be improved. In addition, in online application, a characterization vector of the corpus to be processed is output by the target corpus processing model, a standard characterization vector matching it is determined based on similarity calculation, and the category is determined using the category labeling information carried by the matching standard characterization vector, which improves the efficiency of category determination.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart illustrating a method of updating a corpus processing model according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a category determination method according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating the determination of a plurality of standard token vectors, according to one exemplary embodiment.
FIG. 4 is a flowchart illustrating building an inverted index for a standard corpus, according to an example embodiment.
FIG. 5 is a flow chart illustrating adjustment of an inverted index according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating the creation of a mapping relationship for a standard corpus, according to an example embodiment.
FIG. 7 is a flowchart illustrating adjusting parameters of a current corpus processing model according to a first relevance and a second relevance, according to an example embodiment.
FIG. 8 is a block diagram illustrating an apparatus for updating a corpus processing model according to an exemplary embodiment.
Fig. 9 is a block diagram of a category determining device, according to an example embodiment.
FIG. 10 is an architectural diagram illustrating a method of performing category determination according to an exemplary embodiment.
Fig. 11 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The method for updating the corpus processing model can be applied to a terminal or a server provided with the corpus processing model updating system or the category determining system, the method for determining the category can be applied to the terminal or the server provided with the category determining system, and the terminal and the server can be connected through a wired network or a wireless network. The terminal may specifically be a smart phone, a desktop computer, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a digital assistant, an intelligent sound box, an intelligent wearable device, etc. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service.
Fig. 1 is a flowchart illustrating a method of updating a corpus processing model according to an exemplary embodiment, and the method includes the following steps S101 to S106 as shown in fig. 1.
In step S101, a current batch sample set is acquired.
The sample corpora in the current batch sample set carry category labeling information, which can indicate the category to which each sample corpus belongs. For example, sample corpus 1 is "hard disk", and its category labeling information can indicate the "digital accessory" category; sample corpus 2 is "capsule coffee machine", and its category labeling information can indicate the "kitchen small household appliance" category. A sample corpus can carry at least one piece of category labeling information, that is, the number of categories to which it belongs can be greater than or equal to 1. When the number of categories is 2 or more, the relationship between two of those categories may be a hierarchical (superior-subordinate) relationship or a same-level relationship. For example, sample corpus 3 is "body skirt", and its category labeling information can indicate the "women's clothing" category and the "under clothing" category; sample corpus 4 is "apple", and its category labeling information can indicate the "fresh" category, the "fresh fruit" category, the "mobile phone" category and the "video entertainment" category. Of course, the sample corpus is not limited to natural corpus in text form, and may also include natural corpus in image form and voice form.
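For illustration only, the sample corpora and their category labeling information described above could be represented as records like the following; the field names are assumptions made for this sketch.

```python
# Each record pairs a sample corpus with one or more pieces of
# category labeling information (multi-category labels are allowed).
sample_corpora = [
    {"corpus": "hard disk",              "categories": ["digital accessory"]},
    {"corpus": "capsule coffee machine", "categories": ["kitchen small household appliance"]},
    {"corpus": "apple",                  "categories": ["fresh", "fresh fruit",
                                                        "mobile phone", "video entertainment"]},
]
```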
In one embodiment, step S101 may include step S1011, receiving a category expansion instruction, the category expansion instruction including a target category and a sample corpus carrying target category labeling information, and step S1012, constructing the current batch sample set based on the sample corpus carrying target category labeling information.
The category expansion instruction may indicate that a new category exists and that the model needs to be updated with the sample corpora corresponding to the new category. The category expansion instruction may also indicate that new sample corpora corresponding to an existing category exist and that the model needs to be updated with these new sample corpora, which can be regarded as expanding the sample corpora corresponding to the existing category. For example, the historical batch sample set includes sample corpus a (belonging to category A), sample corpus b (belonging to category B), and sample corpus c (belonging to category C). Category expansion instruction 1 may indicate that a new category D exists and the model needs to be updated with the sample corpus d corresponding to category D; the sample corpus corresponding to the new category may also be an original sample corpus, such as sample corpus a. Category expansion instruction 2 may indicate that a new sample corpus corresponding to category B exists, namely sample corpus e, and the model needs to be updated with sample corpus e; the new sample corpus corresponding to the existing category may also be an original sample corpus, such as sample corpus a. According to the embodiments of the present disclosure, constructing the current batch sample set in response to the category expansion instruction, based on the sample corpora carrying the target category labeling information, can improve the flexibility and adaptability of constructing the current batch sample set. It also makes the updating of the current corpus processing model more flexible and adaptive, and can improve the timeliness of category identification when the target corpus processing model is subsequently used.
In another embodiment, sample corpora may be randomly acquired from a sample corpus library, and the current batch sample set may be constructed based on the randomly acquired sample corpora. The sample corpus library can be divided into a plurality of sub-libraries by category, with the sample corpora in each sub-library pointing to the same category. When the current batch sample set is constructed, a preset number of sample corpora can be randomly extracted from each selected sub-library. For example, if there are 200 categories in the sample corpus library and correspondingly 200 sub-libraries, and the preset number is 2 while the target number of sample corpora in the current batch sample set is 128, then 64 sub-libraries are randomly selected from the 200 sub-libraries and 2 sample corpora are extracted from each of the 64 sub-libraries, so as to construct a current batch sample set composed of 128 sample corpora. Each batch sample set can be constructed in the same manner. In practical applications, a current batch sample set constructed from sample corpora carrying labeling information of relevant business categories can be obtained according to business requirements; the business requirement may be, for example, expanding commodity categories for an e-commerce platform, or performing fine-grained classification of animal and plant images for an animal and plant identification website. The number of sample corpora in the current batch sample set may be 64 or 128, and can be set flexibly according to actual requirements.
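A hedged sketch of the random batch construction just described: a number of category sub-libraries are sampled, then a preset number of sample corpora is drawn from each. The dictionary layout of the library and the function name are assumptions for illustration.

```python
import random

def build_current_batch(corpus_library: dict,
                        num_categories: int = 64,
                        per_category: int = 2) -> list:
    # corpus_library: category -> list of sample corpora pointing to that category.
    chosen_categories = random.sample(list(corpus_library), num_categories)
    batch = []
    for category in chosen_categories:
        for corpus in random.sample(corpus_library[category], per_category):
            batch.append({"corpus": corpus, "category": category})
    return batch  # e.g. 64 categories x 2 corpora each = 128 sample corpora
```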
In step S102, grouping is performed according to category labeling information carried by the sample corpus in the current batch of sample sets, so that the sample corpuses carrying the same category labeling information are located in the same sample corpus group.
According to the category labeling information carried by the sample corpora, sample corpora carrying the same category labeling information are divided into the same sample corpus group; correspondingly, sample corpora carrying different category labeling information are divided into different sample corpus groups.
In one embodiment, the sample corpora in the current batch sample set may be further processed. The number of sample corpus groups in the current batch sample set may be limited, as well as the number of sample corpora in each group. For example, the number of sample corpus groups is limited to 32 or 64, and the number of sample corpora in each group is limited to 2. Based on these constraints, the sample corpora to be subsequently input into the current corpus processing model can be determined. When the sample corpora in the current batch sample set do not satisfy the constraints (for example, there are fewer than the required 32 sample corpus groups), the sample corpora to be input into the current corpus processing model can be determined together with sample corpora acquired later. When the sample corpora in the current batch sample set satisfy the constraints but there are more sample corpora than the limited number (for example, the required 32 sample corpus groups are filled and 6 sample corpora remain), the sample corpora satisfying the constraints can be input into the current corpus processing model, and the remaining sample corpora are combined with sample corpora acquired later to construct the next batch sample set. It should be noted that the limit on the number of sample corpus groups can be set flexibly according to actual requirements, and the limit on the number of sample corpora in each group can be greater than or equal to 2.
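The grouping of step S102 amounts to bucketing the batch by category labeling information, as in the following sketch (record layout as assumed above; the constraint checks on group counts are omitted).

```python
from collections import defaultdict

def group_by_category(batch: list) -> dict:
    # Sample corpora carrying the same category labeling information
    # end up in the same sample corpus group.
    groups = defaultdict(list)
    for sample in batch:
        groups[sample["category"]].append(sample["corpus"])
    return dict(groups)
```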
In step S103, a token vector of the sample corpus is obtained based on the current corpus processing model.
In one embodiment, step S103 may include step S1031 of performing word segmentation on the sample corpus by using a first corpus processing structure of the current corpus processing model to obtain at least two sample corpus fragments, performing vector conversion on the at least two sample corpus fragments, and obtaining a matrix representing the sample corpus based on a vector corresponding to each of the sample corpus fragments, and step S1032 of performing encoding processing on the matrix by using a second corpus processing structure of the current corpus processing model to obtain a representation vector of the sample corpus.
The expression form of the sample corpus may be a sentence, a phrase, a word, and so on. First, word segmentation is performed on the sample corpus by using the first corpus processing structure to obtain at least two sample corpus fragments; for example, for the sample corpus "wireless mouse", the two sample corpus fragments obtained by word segmentation are "wireless" and "mouse". Then, vector conversion is performed on each of the at least two sample corpus fragments by using the first corpus processing structure, such as converting the fragment "wireless" into vector 1 and the fragment "mouse" into vector 2. Further, a matrix representing the sample corpus is obtained by the first corpus processing structure based on the vectors corresponding to the sample corpus fragments, such as matrix 1 representing the sample corpus obtained from vector 1 and vector 2. Finally, the matrix is encoded by the second corpus processing structure to obtain a characterization vector of the sample corpus, such as encoding matrix 1 to obtain characterization vector 1 of the sample corpus "wireless mouse". This characterization vector is also called the hidden-layer representation. According to the embodiments of the present disclosure, combining the steps of word segmentation, vector conversion and matrix encoding enables better and deeper mining of the sample corpus: the obtained characterization vector can reflect the semantics of the sample corpus more accurately and express them more comprehensively, and in particular can reflect and express the more hidden semantics in the sample corpus.
In practical applications, "performing word segmentation on the sample corpus to obtain at least two sample corpus fragments" may also be implemented as character-level segmentation of the sample corpus. For example, character-by-character segmentation of the sample corpus "wireless mouse" yields four single-character fragments (literally "without", "line", "mouse" and "mark", i.e. the four Chinese characters of "wireless mouse"). Correspondingly, vector conversion is performed on the fragments obtained by character segmentation, such as converting "without" into vector 11, "line" into vector 12, "mouse" into vector 21 and "mark" into vector 22, and a matrix representing the sample corpus is obtained from the vectors corresponding to these fragments, such as matrix 1' representing the sample corpus "wireless mouse" obtained from vector 11, vector 12, vector 21 and vector 22. Compared with the word segmentation approach, segmenting the sample corpus character by character breaks through the limitation that word segmentation mainly applies to sample corpora expressed as fixed phrases, words and the like, and the finer segmentation granularity allows mining a characterization vector that restores the sample corpus with higher fidelity.
When the sample corpus is in image or voice form, it can first be converted into natural corpus in text form, and then the processes of word (or character) segmentation, vector conversion and matrix encoding can be performed to obtain the characterization vector of the sample corpus.
The first corpus processing structure may correspond to a word vector tool (e.g., a word embedding table), and the second corpus processing structure may correspond to an encoder. When a sample corpus group comprises two sample corpora, each of the two obtains its corresponding matrix through the word embedding table, and the two matrices are passed through the same encoder to obtain their corresponding vectors, i.e. the hidden-layer representation of each sample corpus.
The current corpus processing model may be built on one of TextCNN (a convolutional neural network model for text classification), fastText (an open-source word vector and text classification tool related to word2vec) or the Transformer (a classical NLP model), and the model structure used may be modified. The second corpus processing structure in the current corpus processing model may directly use the encoder of the Transformer.
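As one possible reading of the two corpus processing structures, the PyTorch sketch below uses an embedding table as the first structure (fragments to a matrix) and a Transformer encoder as the second (matrix to a characterization vector). The vocabulary size, dimensions, mean pooling and other details are assumptions, not the disclosure's configuration.

```python
import torch
import torch.nn as nn

class CorpusEncoder(nn.Module):
    def __init__(self, vocab_size: int = 30000, dim: int = 256,
                 heads: int = 4, layers: int = 2):
        super().__init__()
        # First corpus processing structure: word embedding table.
        self.embed = nn.Embedding(vocab_size, dim)
        # Second corpus processing structure: Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, fragment_ids: torch.Tensor) -> torch.Tensor:
        # fragment_ids: (batch, seq_len) ids of sample corpus fragments.
        matrix = self.embed(fragment_ids)      # matrix representing the sample corpus
        encoded = self.encoder(matrix)         # encoding processing
        return encoded.mean(dim=1)             # pooled characterization (hidden-layer) vector
```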
In step S104, a correlation degree between the characterization vector of the sample corpus and the characterization vector of the same group of sample corpus is calculated, so as to obtain a first correlation degree of the sample corpus, where the same group of sample corpus is other sample corpuses in the same sample corpus group as the sample corpus.
The first correlation degree is computed from characterization vectors and reflects the actual degree of correlation between one sample corpus and the other sample corpora in the same group; that is, through the concept of the first correlation degree, the actual correlation between sample corpora carrying the same category labeling information can be quantified. When a sample corpus group comprises two sample corpora, the similarity between the characterization vectors of the two sample corpora is calculated to obtain the first correlation degree of each sample corpus. When a sample corpus group comprises a plurality of sample corpora, the similarities between the characterization vector of the target sample corpus and the characterization vectors of the other same-group sample corpora are calculated respectively, and the first correlation degree of the target sample corpus is then determined based on the obtained similarity results.
In step S105, a correlation degree between the characterization vector of the sample corpus and the characterization vector of a different set of sample corpus is calculated, so as to obtain a second correlation degree of the sample corpus, where the different set of sample corpus is other sample corpus located in a different set of sample corpus than the sample corpus.
The second correlation degree is computed from characterization vectors and reflects the actual degree of correlation between one sample corpus and sample corpora in other (different) groups; that is, through the concept of the second correlation degree, the actual correlation between sample corpora carrying different category labeling information can be quantified. When the number of different-group sample corpora is 1, the similarity between the characterization vector of the target sample corpus and the characterization vector of that different-group sample corpus is calculated to obtain the second correlation degree of the target sample corpus. When the number of different-group sample corpora is greater than or equal to 2, the similarities between the characterization vector of the target sample corpus and the characterization vectors of the different-group sample corpora are calculated respectively, and the second correlation degree of the target sample corpus is then determined based on the at least 2 obtained similarity results.
The similarity index referred to in step S104 and step S105 may be cosine similarity, Euclidean distance, relative entropy, or the like. Considering the generality of the data, the method for updating the corpus processing model provided by the present disclosure may, for example, select cosine similarity as the index.
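To make steps S104 and S105 concrete, the sketch below computes, for one target sample corpus, cosine similarities against same-group and different-group characterization vectors; taking the mean over the same group is one possible way (an assumption here) of turning several similarity results into a single first correlation degree.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def correlation_degrees(target_vec: np.ndarray,
                        same_group_vecs: list,
                        other_group_vecs: list):
    # First correlation degree: against same-group sample corpora.
    first = float(np.mean([cosine(target_vec, v) for v in same_group_vecs]))
    # Second correlation degree(s): against different-group sample corpora,
    # kept per pair (cf. a1.b1, a1.b2, ... in the example below).
    second = [cosine(target_vec, v) for v in other_group_vecs]
    return first, second
```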
In step S106, according to the first correlation degree and the second correlation degree, the parameters of the current corpus processing model are adjusted to meet the model convergence condition, and the current corpus processing model meeting the model convergence condition is used as a target corpus processing model.
Because the first correlation degree embodies the actual correlation between same-group sample corpora, for same-group sample corpora carrying the same category labeling information, the value representing the highest similarity can be used to measure the expected correlation between them. Because the second correlation degree embodies the actual correlation between different-group sample corpora, for different-group sample corpora carrying different category labeling information, the value representing the lowest similarity can be used to measure the expected correlation between them.
For example, the characterization vectors of the sample corpora in the first sample corpus group are a1 (corresponding to sample corpus A1) and a2 (corresponding to sample corpus A2), those in the second group are b1 (sample corpus B1) and b2 (sample corpus B2), those in the third group are c1 (sample corpus C1) and c2 (sample corpus C2), and those in the fourth group are d1 (sample corpus D1) and d2 (sample corpus D2). When the similarity index is cosine similarity, the first correlation degree of sample corpus A1 is a1.a2, and the second correlation degree of sample corpus A1 is determined by a1.b1, a1.b2, a1.c1, a1.c2, a1.d1 and a1.d2. The expected correlation corresponding to the first correlation degree may be 1, and the expected correlation corresponding to the second correlation degree may be 0 (the expected similarity of each of a1.b1, a1.b2, a1.c1, a1.c2, a1.d1 and a1.d2 is 0); other sample corpora are handled similarly. Based on the difference between the first correlation degree of a sample corpus and its corresponding expected correlation, and the difference between the second correlation degree of the sample corpus and its corresponding expected correlation, the parameters of the current corpus processing model are adjusted until the model convergence condition is satisfied, and the current corpus processing model satisfying the model convergence condition is then taken as the target corpus processing model.
In one embodiment, as shown in fig. 7, step S106 may include step S1061 of obtaining a current batch actual global correlation of the sample corpus according to the first correlation and the second correlation, step S1062 of obtaining a current batch expected global correlation of the sample corpus, step S1063 of calculating a loss function value according to the current batch actual global correlation and the current batch expected global correlation, and step S1064 of adjusting parameters of the current corpus processing model based on the loss function value until a model convergence condition is satisfied.
The current-batch actual global correlation of a sample corpus is its actual correlation, computed from characterization vectors, with the sample corpora of the current batch sample set (or with the sample corpora that satisfy the constraints and are input into the current corpus processing model; see the related description in step S102). The current-batch expected global correlation of a sample corpus is the corresponding expected correlation with the sample corpora of the current batch sample set (or with the sample corpora that satisfy the constraints and are input into the current corpus processing model; see step S102), likewise defined over characterization vectors.
The first correlation degree and the second correlation degree can be normalized based on the same normalization rule and then combined to obtain the current-batch actual global correlation. For example, the normalized similarities of the characterization vector of one sample corpus with respect to the characterization vectors of the other sample corpora in the current batch sample set (or of the sample corpora that satisfy the constraints and are input into the current corpus processing model; see step S102) are integrated to obtain the current-batch actual global correlation of that sample corpus, and the current-batch actual global correlations of the individual sample corpora are then combined. As a simplified calculation, normalization converts the first correlation degree and the second correlation degree into scalar-form normalized similarities, yielding a scalar-form current-batch actual global correlation. Based on these scalar normalized similarities, the efficiency of computing the difference between the first correlation degree and its expected similarity, and between the second correlation degree and its expected similarity, can be improved, which improves the efficiency of constructing the loss function and ensures the timeliness of model updating.
The current-batch expected global correlation of a sample corpus may include the expected correlations between characterization vectors of same-group sample corpora (corresponding to the first correlation degree) and the expected correlations between characterization vectors of different-group sample corpora (corresponding to the second correlation degree). Correspondingly, a loss function for the sample corpus is constructed based on its current-batch actual global correlation and current-batch expected global correlation, and the parameters of the current corpus processing model are then adjusted by gradient descent according to the loss function of each sample corpus. According to the embodiments of the present disclosure, when the parameters of the current corpus processing model are adjusted, the model can fully learn the similarity between same-group corpora and the difference between different-group corpora based on the current-batch actual global correlation and current-batch expected global correlation pointed to by the loss function, which facilitates extracting characterization vectors that better reflect semantics and, in turn, supports efficient category determination when the model is subsequently applied to category determination scenarios.
Continuing the example with the first through fourth sample corpora described above, a similarity square matrix may be constructed from the pairwise similarities of the characterization vectors:
a1.a2 a1.b2 a1.c2 a1.d2
b1.a2 b1.b2 b1.c2 b1.d2
c1.a2 c1.b2 c1.c2 c1.d2
d1.a2 d1.b2 d1.c2 d1.d2
In practical applications, a first characterization vector group and a second characterization vector group may be constructed, where the first group contains a1, b1, c1, and d1 and the second group contains a2, b2, c2, and d2. The characterization vectors within each group correspond to different categories, and taking the dot product of every vector in the first group with every vector in the second group yields the similarity square matrix above. In this matrix, the pairwise similarities on the diagonal from top left to bottom right (a1.a2, b1.b2, c1.c2, and d1.d2) correspond to the first correlation degrees, i.e., the actual correlations between same-group sample corpora, while the similarities at the other positions correspond to the second correlation degrees, i.e., the actual correlations between different-group sample corpora.
The Softmax function (a normalized exponential function) can be applied to each row (or column) to obtain the current batch actual global correlation degree of the corresponding sample corpus. The loss function constructed from the current batch actual global correlation degree and the current batch expected global correlation degree can then be regarded as a cross-entropy loss function. An Adam optimizer may be used when adjusting the parameters of the current corpus processing model based on this loss function. During updating, the current corpus processing model learns that the correlation between same-group sample corpora should be high and the correlation between different-group sample corpora should be low, and on this basis the model update for the current batch sample set is completed.
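As an illustration only, the row-wise Softmax normalization, cross-entropy loss, and Adam-based parameter adjustment described above can be sketched roughly as follows. PyTorch is assumed here, and all function and variable names are hypothetical; this is a minimal sketch, not the disclosed implementation itself.

```python
# Minimal sketch (assumption: PyTorch; names are illustrative only).
import torch
import torch.nn.functional as F

def current_batch_loss(group1_vectors, group2_vectors):
    """group1_vectors, group2_vectors: (N, D) characterization vectors; row i
    of both groups carries the same category label, so the diagonal of the
    similarity square matrix holds the same-group (first) correlations."""
    # Pairwise dot products -> N x N similarity square matrix
    sim = group1_vectors @ group2_vectors.t()
    # Expected global relevance: each row should concentrate on its diagonal entry
    targets = torch.arange(sim.size(0), device=sim.device)
    # Row-wise Softmax normalization + cross-entropy loss in one call
    return F.cross_entropy(sim, targets)

# Parameter adjustment with an Adam optimizer (hypothetical model object):
# optimizer = torch.optim.Adam(current_model.parameters(), lr=1e-4)
# loss = current_batch_loss(vectors_1, vectors_2)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```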
According to the corpus processing model updating method described above, the sample corpora in the current batch sample set are grouped by category labeling information, the current corpus processing model outputs the characterization vector of each sample corpus, the first correlation degree and the second correlation degree are computed from the characterization vectors and the grouping information, and the parameters of the current corpus processing model are then adjusted according to the first and second correlation degrees to obtain the target corpus processing model. The first correlation degree reflects the correlation between same-group sample corpora and the second correlation degree reflects the correlation between different-group sample corpora. When the corpus processing model is updated with the current batch sample set, attention is paid to these two correlation degrees rather than only to the sample corpora and their category labeling information, so the technical scheme does not need to combine sample sets of previous batches to update the model, which improves update efficiency when the relevant business categories change. When a category expansion requirement arises, corresponding sample corpora can be added under the new category to construct a current batch sample set, and the current corpus processing model can then be updated based on that sample set, realizing dynamic model updating.
Fig. 2 is a flowchart illustrating a category determining method according to an exemplary embodiment, which includes steps S201 to S204 described below, as shown in fig. 2.
In step S201, a corpus to be processed indicating target objects is acquired.
The corpus to be processed that indicates the target object can be regarded as a corpus in which the target object is characterized. It may be a natural corpus in text form, for example presented as sentences, phrases, or individual words, or it may be a natural corpus in image or voice form.
In practical applications, the target object may indicate different content depending on the service scenario. When the business scenario involves an e-commerce platform, the target object may indicate a commodity whose category has not yet been determined. When the business scenario involves an animal and plant identification website, the target object may indicate an animal or plant whose category has not yet been determined.
In step S202, the corpus to be processed is taken as input, and a representation vector of the corpus to be processed is obtained by using the target corpus processing model provided by the disclosure.
Inputting the corpus to be processed into the target corpus processing model, and outputting the characterization vector of the corpus to be processed by using the target corpus processing model. The process of outputting the token vector of the corpus to be processed by the target corpus processing model may refer to the description in step S103, and will not be described herein.
In step S203, based on the similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, a standard token vector matched with the token vector of the corpus to be processed is determined, and each standard token vector carries corresponding category labeling information.
The standard characterization vector is the characterization vector of a standard corpus, which may be a representative corpus of a preset category, such as the preset category "watch" with a standard corpus such as "Rolex". Standard corpora pointing to the same preset category may differ somewhat, e.g., the preset category "vinegar" with the standard corpora "mature vinegar" and "aromatic vinegar".
The standard characterization vector is obtained by inputting a standard corpus into the target corpus processing model. After the target corpus processing model is determined, each standard corpus is input into the model, and the standard corpus (i.e., the collection of standard corpora) is built from the obtained characterization vectors and the standard corpora. The standard corpus may be stored in redis (a key-value database). In practical applications, sample corpora can also be placed into the standard corpus as standard corpora; correspondingly, their characterization vectors, obtained with the target corpus processing model, are placed into the standard corpus as standard characterization vectors.
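As an illustration of the key-value storage mentioned above, the following sketch writes a standard corpus, its characterization vector, and its category labeling information into redis. The redis-py client, the "std:" key prefix, and the JSON layout are assumptions made for the example, not part of the disclosed scheme.

```python
# Minimal sketch (assumptions: redis-py client, "std:" key prefix, JSON values).
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def store_standard_corpus(corpus_text, characterization_vector, category):
    """Persist one standard corpus together with its characterization vector
    and category labeling information as a key-value pair."""
    record = {
        "corpus": corpus_text,
        "vector": [float(x) for x in characterization_vector],
        "category": category,
    }
    r.set("std:" + corpus_text, json.dumps(record, ensure_ascii=False))

def load_standard_corpus(corpus_text):
    """Return the stored record for a standard corpus, or None if absent."""
    raw = r.get("std:" + corpus_text)
    return json.loads(raw) if raw is not None else None
```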
In one embodiment, as shown in fig. 3, before step S203, the method further includes determining the plurality of standard characterization vectors. In step S301, a standard corpus is obtained, where the standard corpus records standard corpora and their characterization vectors, each standard corpus carries corresponding category labeling information, and the characterization vectors of the standard corpora are obtained by using the target corpus processing model. In step S302, word segmentation is performed on the corpus to be processed to obtain at least two corpus fragments. In step S303, the standard corpus is queried based on each corpus fragment to obtain a standard corpus set corresponding to each corpus fragment, where each standard corpus in the set corresponding to a corpus fragment contains that fragment. In step S304, a standard corpus collection is obtained from the standard corpus sets corresponding to the corpus fragments. In step S305, at least two target standard corpora are determined based on the occurrence frequency of each standard corpus in the standard corpus collection. In step S306, the characterization vector of each target standard corpus is obtained from the standard corpus, and the characterization vectors of the target standard corpora are taken as the standard characterization vectors.
The corpus to be processed may take the form of sentences, phrases, and the like. Word segmentation is performed on the corpus to be processed to obtain corpus fragments, for example based on the jieba word segmentation component (a word segmentation component). The "word segmentation" referred to here may also be carried out at the level of individual terms or characters, especially for a corpus to be processed that is expressed as a fixed phrase or a single word. When the corpus to be processed is a natural corpus in image or voice form, it can first be converted into a natural corpus in text form and then segmented to obtain corpus fragments. For example, word segmentation of the corpus to be processed yields corpus fragment 1, corpus fragment 2, and corpus fragment 3.
The standard corpus is queried based on each corpus fragment, and the standard corpora found to contain a given fragment form the standard corpus set corresponding to that fragment. For example, if the standard corpora containing corpus fragment 1 are standard corpora 1, 2, and 3, they form standard corpus set 1 corresponding to corpus fragment 1; if the standard corpora containing corpus fragment 2 are standard corpora 2 and 4, they form standard corpus set 2 corresponding to corpus fragment 2; and if the standard corpora containing corpus fragment 3 are standard corpora 2 and 3, they form standard corpus set 3 corresponding to corpus fragment 3.
The standard corpus collection is obtained from the standard corpus sets corresponding to the corpus fragments, i.e., by merging those sets while counting how many times each standard corpus appears. For example, the standard corpus collection obtained from standard corpus sets 1-3 includes standard corpora 1, 2, 3, and 4, in which standard corpus 1 occurs once, standard corpus 2 occurs three times, standard corpus 3 occurs twice, and standard corpus 4 occurs once. At least two target standard corpora can then be determined from the occurrence frequency of each standard corpus in the collection; sorting by frequency gives standard corpus 2 > standard corpus 3 > standard corpus 1 = standard corpus 4, so standard corpora 2 and 3 can be taken as target standard corpora. Of course, standard corpora 1-4 could all be taken as target standard corpora; the number of target standard corpora may be set according to a preset number. Correspondingly, the characterization vector of each target standard corpus is obtained and used as a standard characterization vector, as sketched below.
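A rough sketch of this recall-by-frequency step follows. The jieba component is assumed for word segmentation, and a plain in-memory dictionary stands in for the per-fragment standard corpus sets; none of the names come from the disclosure itself.

```python
# Minimal sketch (assumptions: jieba for segmentation, dict-based corpus sets).
from collections import Counter
import jieba

def recall_target_standard_corpora(corpus_to_process, fragment_to_corpora, top_k=2):
    """fragment_to_corpora maps a corpus fragment to the list of standard
    corpora containing it (the standard corpus set of that fragment)."""
    fragments = [w for w in jieba.lcut(corpus_to_process) if w.strip()]
    # Merge the per-fragment sets into the standard corpus collection,
    # counting how many fragments hit each standard corpus.
    hits = Counter()
    for fragment in fragments:
        hits.update(fragment_to_corpora.get(fragment, []))
    # Standard corpora hit most frequently become the target standard corpora.
    return [corpus for corpus, _ in hits.most_common(top_k)]
```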
Accordingly, step S203 may include calculating the similarity between the characterization vector of each target standard corpus and the characterization vector of the corpus to be processed, and determining the standard characterization vector that matches the characterization vector of the corpus to be processed based on the similarity results. For example, with standard corpora 2 and 3 as target standard corpora, the characterization vector of standard corpus 2 is standard characterization vector 2 and that of standard corpus 3 is standard characterization vector 3. The similarity between the characterization vector of the corpus to be processed and standard characterization vector 2 gives similarity 2, the similarity with standard characterization vector 3 gives similarity 3, and the better-matching standard characterization vector is then determined from the two according to similarity 2 and similarity 3. The similarity measure used here may be cosine similarity, Euclidean distance, relative entropy, or the like.
In the embodiments of the disclosure, when selecting the plurality of standard characterization vectors, the corpus fragments of the corpus to be processed are used for a preliminary screening of the standard corpus, the target standard corpora are then determined from the screening result and the occurrence frequencies of the standard corpora, and the characterization vectors of the target standard corpora are used as the standard characterization vectors. Compared with matching against all standard characterization vectors in the standard corpus, this effectively narrows the range of standard characterization vectors to be matched, improves matching efficiency, and reduces the consumption of computing resources. Compared with the other standard corpora in the standard corpus, the standard corpora behind these standard characterization vectors are more relevant to the corpus to be processed, which helps ensure the accuracy of the subsequent category determination.
In practical applications, cosine similarity can be calculated between the characterization vector of the corpus to be processed and the characterization vector of each recalled target standard corpus, and the TOPN categories with the highest matching degree can be selected from the category set corresponding to the recalled target standard corpora and returned, according to the similarity ranking and the occurrence frequency of the hit categories, where N is a positive integer greater than 0.
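As an illustration, the cosine-similarity matching and TOPN selection could look roughly like the following; numpy is assumed, and the candidate structure of (vector, category) pairs is a simplification of the recalled target standard corpora rather than the disclosed data layout.

```python
# Minimal sketch (assumptions: numpy, (vector, category) candidate pairs).
import numpy as np

def top_n_categories(query_vector, candidates, n=3):
    """candidates: list of (standard characterization vector, category) pairs
    taken from the recalled target standard corpora."""
    q = np.asarray(query_vector, dtype=float)
    q = q / (np.linalg.norm(q) + 1e-12)
    scored = []
    for vector, category in candidates:
        v = np.asarray(vector, dtype=float)
        cosine = float(q @ (v / (np.linalg.norm(v) + 1e-12)))
        scored.append((cosine, category))
    # Rank by similarity and return the TOPN categories with highest matching degree.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [category for _, category in scored[:n]]
```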
In another embodiment, the present disclosure further provides an embodiment for constructing an inverted index for a standard corpus, where the construction and application of the inverted index are described below.
1) As shown in FIG. 4, the step of constructing an inverted index for the standard corpus includes: step S401, performing word segmentation on a plurality of standard corpora respectively to obtain at least one corpus fragment corresponding to each standard corpus; step S402, using each corpus fragment as an index keyword; step S403, determining the standard corpora containing each index keyword; step S404, constructing a first corpus list based on the standard corpora containing each index keyword; and step S405, constructing an inverted index from each index keyword to the first corpus list corresponding to that index keyword.
First, a standard corpus may take the form of a sentence, a phrase, and the like. Word segmentation is performed on the standard corpus to obtain its corpus fragments, for example based on the jieba word segmentation component (a word segmentation component). As above, the "word segmentation" referred to here may also be carried out at the level of individual terms or characters, especially for standard corpora expressed as fixed phrases or single words. When a standard corpus is a natural corpus in image or voice form, it can first be converted into a natural corpus in text form and then segmented to obtain standard corpus fragments. For example, word segmentation of standard corpus 1 yields standard corpus fragments 1, 2, and 3; word segmentation of standard corpus 2 yields standard corpus fragments 1 and 4; and word segmentation of standard corpus 3 yields standard corpus fragment 2.
Each corpus fragment may then be used as an index keyword: standard corpus fragment 1 as index keyword 1, standard corpus fragment 2 as index keyword 2, standard corpus fragment 3 as index keyword 3, and standard corpus fragment 4 as index keyword 4.
Furthermore, the standard corpora containing each index keyword are determined, and a first corpus list is constructed from them: first corpus list 1 is constructed from standard corpora 1 and 2, which contain index keyword 1; first corpus list 2 is constructed from standard corpora 1 and 3, which contain index keyword 2; first corpus list 3 is constructed from standard corpus 1, which contains index keyword 3; and first corpus list 4 is constructed from standard corpus 2, which contains index keyword 4.
Next, an inverted index is constructed from each index keyword to its corresponding first corpus list. Each index keyword may be used as a key and its first corpus list as the value, establishing a key-value inverted index. For example, inverted index 1 is built with index keyword 1 as the key and first corpus list 1 as the value; correspondingly, inverted index 2 maps index keyword 2 to first corpus list 2, inverted index 3 maps index keyword 3 to first corpus list 3, and inverted index 4 maps index keyword 4 to first corpus list 4. A minimal sketch of this construction is given below.
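The sketch below corresponds to steps S401-S405; jieba is assumed for segmentation and a plain dictionary stands in for the key-value store, so it is illustrative rather than the disclosed implementation.

```python
# Minimal sketch (assumptions: jieba for segmentation, dict as key-value store).
from collections import defaultdict
import jieba

def build_inverted_index(standard_corpora):
    """standard_corpora: iterable of standard corpus strings.
    Returns a mapping from index keyword (corpus fragment) to its first corpus list."""
    inverted_index = defaultdict(list)
    for corpus in standard_corpora:
        # S401: word segmentation of the standard corpus into corpus fragments
        for fragment in set(jieba.lcut(corpus)):
            if fragment.strip():
                # S402-S404: the fragment is the index keyword; the standard
                # corpora containing it form the first corpus list
                inverted_index[fragment].append(corpus)
    # S405: the keyword -> first corpus list mapping is the inverted index
    return dict(inverted_index)
```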
In the embodiments of the disclosure, the corpus fragments of the standard corpora are used as index keywords, and the first corpus lists are obtained from the standard corpora containing those keywords, so that an inverted index from each index keyword to its first corpus list is constructed; a corpus fragment can then be used as a query object against the inverted index in a fragment-matching manner. As the standard corpora in the standard corpus are updated, the original inverted index can be updated with good flexibility and in near real time, which safeguards both the accuracy and the efficiency of fragment-matching queries.
Further, as shown in fig. 5, the constructed inverted index may be adjusted according to service requirements: in step S501, in response to received category determination negative feedback, whether abnormal data exists in the inverted index is detected; in step S502, when abnormal data exists in the inverted index, the index keywords corresponding to the abnormal data are deleted.
The category determination negative feedback may be generated during testing or in an online application. It indicates that the category determined by the system differs from the expected category, for example because the determined category ("food") is too broad compared with the expected category ("suspension coffee"), or because the determined category ("food") does not intersect with the expected category ("jewelry"), and so on. When the category of the target object determined in step S204 differs from the expected category, the negative feedback may be generated by reporting the error.
Based on the category determination negative feedback, abnormal data in the inverted index can be detected, the abnormal data being the data in the inverted index that caused the difference. The abnormal data may be dirty data, which is often unavoidable. When abnormal data is detected, the index keywords corresponding to it can be deleted; the abnormal data may lie in the index keywords themselves or in the first corpus lists. Deleting the index keywords corresponding to the abnormal data allows errors between categories to be adjusted dynamically and effectively prevents the abnormal data, in particular dirty data with an obvious impact on the online effect, from influencing category determination when retrieval is performed over the inverted index. Index keywords can thus be fine-tuned dynamically according to business requirements without updating the model, which further reduces the coupling between offline model updating and the online category determination application and improves the flexibility of the latter. A minimal sketch of this adjustment is given below.
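The deletion step can be sketched as below; how the abnormal index keywords are identified from the negative feedback is business specific and is therefore left as an assumed input.

```python
# Minimal sketch (assumption: abnormal keywords are already identified upstream).
def handle_negative_feedback(inverted_index, abnormal_keywords):
    """inverted_index: dict built as in the sketch above; abnormal_keywords:
    index keywords judged to be abnormal data (e.g. dirty data) that caused
    the category determination negative feedback."""
    for keyword in abnormal_keywords:
        # S502: delete the index keyword together with its first corpus list
        inverted_index.pop(keyword, None)
    return inverted_index
```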
2) The inverted index can be used to query, from a corpus fragment, the standard corpora containing that fragment. Accordingly, in step S303, when the standard corpus set corresponding to each corpus fragment is obtained, the standard corpus may be queried through the inverted index. For the corpus fragments obtained by segmenting the corpus to be processed, retrieving the standard corpora containing each fragment via the inverted index effectively avoids wasting computing resources on invalid calculations, improves computational efficiency, and shortens the online feedback time.
In step S204, the category of the target object is determined based on the category label information corresponding to the matched standard token vector.
The category labeling information carried by a standard characterization vector indicates a preset category. The standard characterization vectors and their corresponding preset categories can be stored in advance; after the matched standard characterization vector is determined, the preset category corresponding to it is obtained from the stored correspondence and taken as the category of the target object.
In one embodiment, the present disclosure further provides an embodiment for establishing a mapping relationship for a standard corpus, and the establishment and application of the mapping relationship will be described below.
1) As shown in FIG. 6, the step of establishing a mapping relationship for the standard corpus includes: step S601, determining a plurality of categories based on the category labeling information carried by the plurality of standard corpora; step S602, determining the standard corpora pointing to each category; step S603, constructing a second corpus list based on the standard corpora pointing to each category; and step S604, establishing a mapping relationship between each category and the second corpus list corresponding to that category.
First, a plurality of categories are determined based on the category labeling information carried by the plurality of standard corpora. The category labeling information carried by different standard corpora may differ, and two standard corpora may also carry the same category labeling information; determining the plurality of categories can therefore be regarded as constructing a category library from the category labeling information carried by the standard corpora.
Then, a standard corpus is determined that points to each category and a second corpus list is constructed based on the standard corpus that points to each category. For example, category labeling information carried by standard corpus 1 may indicate category 1, category labeling information carried by standard corpus 2 may indicate category 2, category labeling information carried by standard corpus 3 may indicate category 3, category labeling information carried by standard corpus 4 may indicate category 1, and category labeling information carried by standard corpus 5 may indicate category 2. Then, a second corpus list 1 is constructed based on the standard corpus 1 and the standard corpus 4 pointing to category 1, a second corpus list 2 is constructed based on the standard corpus 2 and the standard corpus 5 pointing to category 2, and a second corpus list 3 is constructed based on the standard corpus 3 pointing to category 3.
Further, a mapping relationship is established between each category and its corresponding second corpus list. Each category may be used as a key and its second corpus list as the value, so the mapping is established in key-value form. For example, mapping 1 is established with category 1 as the key and second corpus list 1 as the value; correspondingly, mapping 2 links category 2 to second corpus list 2, and mapping 3 links category 3 to second corpus list 3. A minimal sketch of building and querying this mapping is given below.
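The sketch below illustrates steps S601-S604 together with the lookup used in step S204, building the category-to-second-corpus-list mapping with plain dictionaries and then resolving a matched standard corpus to its category; the data structures are assumptions made for the example.

```python
# Minimal sketch (assumptions: in-memory dicts instead of a key-value store).
from collections import defaultdict

def build_category_mapping(labeled_standard_corpora):
    """labeled_standard_corpora: iterable of (standard corpus, category) pairs,
    where the category comes from the corpus's category labeling information."""
    category_to_corpora = defaultdict(list)  # key: category, value: second corpus list
    for corpus, category in labeled_standard_corpora:
        category_to_corpora[category].append(corpus)
    return dict(category_to_corpora)

def category_of(matched_standard_corpus, category_to_corpora):
    """Return the category whose second corpus list contains the matched
    standard corpus; that category is taken as the category of the target object."""
    for category, corpora in category_to_corpora.items():
        if matched_standard_corpus in corpora:
            return category
    return None
```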
In this embodiment of establishing a mapping relationship, the categories are determined from the category labeling information carried by the standard corpora in the standard corpus, and each second corpus list is obtained from the standard corpora pointing to the corresponding category, so that a mapping relationship between each category and its second corpus list is established; a standard corpus can then be used as a query object against the mapping in a corpus-matching manner. As the standard corpus is updated, the original mapping relationship can be updated with good flexibility and in near real time, which safeguards both the accuracy and the efficiency of standard-corpus-matching queries.
2) The mapping relationship is used to look up, from a standard corpus, the category it points to. Step S204 may therefore include determining the standard corpus corresponding to the matched standard characterization vector, querying that standard corpus against the mapping relationship, determining the category it points to, and taking that category as the category of the target object. In other words, once the standard characterization vector matching the characterization vector of the corpus to be processed has been determined, the corresponding standard corpus is identified, its category is found through the mapping relationship, and that category is taken as the category of the target object. Retrieving the category of the matched standard corpus through the mapping relationship effectively avoids wasting computing resources on invalid calculations, improves computational efficiency, and shortens the online feedback time.
In an exemplary embodiment, as shown in fig. 10, the category determination method provided in the present disclosure may be applied to an e-commerce category retrieval scenario. For example, when a merchant uploads a commodity to an e-commerce platform, the merchant enters commodity feature keywords, and the platform recommends the category the commodity belongs to based on those keywords. When a category expansion requirement arises, corresponding sample corpora can be added under the new category to construct a current batch sample set, and the current corpus processing model is then updated based on that sample set; this update is performed in an offline environment. The online service portion includes the characterization vectors of the training entries and the inverted index constructed for the training entries. The online response portion, together with the online service portion, performs category determination in response to a received online request.
E-commerce platforms often involve hundreds of categories, each containing hundreds of keywords or phrases. With the corpus processing model updating method provided by the present disclosure, the model can be updated dynamically and efficiently when a category expansion requirement arises, reducing the impact of category expansion on the online application. Moreover, e-commerce platforms with immature category systems often need frequent category expansion and modification; the updating method provided by the disclosure shortens the model update cycle and reduces the computing-resource cost of model updates.
According to the category determination method provided by the above embodiments, the target corpus processing model outputs the characterization vector of the corpus to be processed, the standard characterization vector matching it is determined through similarity calculation, and the category is determined using the category labeling information carried by that standard characterization vector, thereby improving the efficiency of category determination.
FIG. 8 is a block diagram illustrating an updating apparatus of a corpus processing model according to an exemplary embodiment. Referring to fig. 8, the apparatus includes a sample set acquisition unit 810, a grouping unit 820, a characterization vector obtaining unit 830, a first correlation calculation unit 840, a second correlation calculation unit 850, and a parameter adjustment unit 860.
The sample set acquisition unit 810 is configured to perform acquisition of a current batch sample set.
The grouping unit 820 is configured to perform grouping according to category labeling information carried by the sample corpus in the current batch of sample sets, so that the sample corpuses carrying the same category labeling information are located in the same sample corpus group;
the token vector obtaining unit 830 is configured to obtain a token vector of the sample corpus based on the current corpus processing model;
the first correlation calculation unit 840 is configured to calculate a correlation between the token vector of the sample corpus and the token vector of the same group of sample corpuses, so as to obtain a first correlation of the sample corpus, where the same group of sample corpuses are other sample corpuses that are located in the same sample corpus group as the sample corpus;
The second correlation calculating unit 850 is configured to calculate a correlation between the token vector of the sample corpus and the token vector of a different set of sample corpuses, so as to obtain a second correlation of the sample corpus, where the different set of sample corpuses are other sample corpuses that are located in a different set of sample corpuses from the sample corpus;
The parameter adjustment unit 860 is configured to perform adjusting parameters of the current corpus processing model to satisfy a model convergence condition according to the first correlation and the second correlation, and taking the current corpus processing model satisfying the model convergence condition as a target corpus processing model.
The specific manner in which the various units perform their operations in the apparatus of the above embodiment has been described in detail in the corresponding method embodiments and will not be repeated here.
Fig. 9 is a block diagram of a category determining device, according to an example embodiment. Referring to fig. 9, the apparatus includes a corpus to be processed acquisition unit 910, a model application unit 920, a matching unit 930, and a category determination unit 940.
The corpus to be processed obtaining unit 910 is configured to perform obtaining a corpus to be processed indicating a target object;
The model application unit 920 is configured to take the corpus to be processed as input and obtain a characterization vector of the corpus to be processed by using the target corpus processing model obtained through steps S101 to S106;
The matching unit 930 is configured to determine a standard token vector matched with the token vector of the corpus to be processed based on the similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, where each standard token vector carries corresponding category labeling information;
The category determination unit 940 is configured to perform determining the category of the target object based on the category label information corresponding to the matched standard token vector.
The specific manner in which the various units perform their operations in the apparatus of the above embodiment has been described in detail in the corresponding method embodiments and will not be repeated here.
In an exemplary embodiment, there is also provided an electronic device, including a processor, and a memory for storing instructions executable by the processor, where the processor is configured to implement, when executing the instructions stored on the memory, the steps of the method for updating a corpus processing model or the method for determining categories in any of the foregoing embodiments.
The electronic device may be a terminal, a server, or a similar computing device. Taking a server as an example, fig. 11 is a block diagram of an electronic device for updating a corpus processing model or determining a category. The electronic device 1100 may vary considerably in configuration and performance and may include one or more central processing units (Central Processing Units, CPU) 1110 (the processor 1110 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1130 for storing data, and one or more storage media 1120 (such as one or more mass storage devices) storing an application 1123 or data 1122. The memory 1130 and the storage medium 1120 may be transitory or persistent storage. The program stored on the storage medium 1120 may include one or more modules, each of which may include a series of instruction operations for the electronic device. Further, the central processor 1110 may be configured to communicate with the storage medium 1120 and execute the series of instruction operations in the storage medium 1120 on the electronic device 1100. The electronic device 1100 may also include one or more power supplies 1160, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1140, and/or one or more operating systems 1121, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 1140 may be used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the electronic device 1100. In one example, the input/output interface 1140 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the Internet. In an exemplary embodiment, the input/output interface 1140 may be a radio frequency (RF) module used to communicate with the Internet wirelessly.
It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 11 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the electronic device 1100 may also include more or fewer components than shown in fig. 11, or have a different configuration than shown in fig. 11.
In an exemplary embodiment, a storage medium is also provided, which when instructions in the storage medium are executed by a processor of an electronic device, enable the electronic device to perform the steps of the method of updating a corpus processing model or the method of determining categories in any of the above embodiments.
In an exemplary embodiment, a computer program product is also provided, the computer program product comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device performs the corpus processing model updating method or the category determining method provided in any one of the above embodiments.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (SYNCHLINK) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

1. A method for updating a corpus processing model, the method comprising:
Acquiring a current batch sample set;
Grouping according to category marking information carried by sample corpus in the current batch of sample sets, so that sample corpus carrying the same category marking information is located in the same sample corpus group;
obtaining a characterization vector of the sample corpus based on a current corpus processing model;
Calculating the correlation degree between the characterization vector of the sample corpus and the characterization vectors of the same group of sample corpora, so as to obtain a first correlation degree of the sample corpus, wherein the same group of sample corpora are other sample corpora located in the same sample corpus group as the sample corpus;
Calculating the correlation degree between the characterization vector of the sample corpus and the characterization vectors of a different group of sample corpora, so as to obtain a second correlation degree of the sample corpus, wherein the different group of sample corpora are other sample corpora located in a different sample corpus group from the sample corpus;
The first correlation degree and the second correlation degree are normalized to obtain the actual global correlation degree of the current batch, the expected global correlation degree of the current batch of the sample corpus is obtained, a loss function value is calculated according to the actual global correlation degree of the current batch and the expected global correlation degree of the current batch, and parameters of the current corpus processing model are adjusted based on the loss function value until model convergence conditions are met;
And taking the current corpus processing model meeting the model convergence condition as a target corpus processing model.
2. The method of claim 1, wherein the obtaining a current batch sample set comprises:
receiving a category expansion instruction, wherein the category expansion instruction comprises a target category and a sample corpus carrying target category labeling information;
And constructing the current batch of sample set based on the sample corpus carrying the target category labeling information.
3. The method of claim 1, wherein the obtaining a token vector for the sample corpus based on the current corpus processing model comprises:
performing word segmentation on the sample corpus by using a first corpus processing structure of the current corpus processing model to obtain at least two sample corpus fragments, performing vector conversion on each sample corpus fragment, and obtaining a matrix representing the sample corpus based on vectors corresponding to each sample corpus fragment;
and carrying out coding processing on the matrix by utilizing a second corpus processing structure of the current corpus processing model to obtain a representation vector of the sample corpus.
4. A category determining method, the method comprising:
acquiring a corpus to be processed of an indication target object;
Taking the corpus to be processed as input, and obtaining a characterization vector of the corpus to be processed by using the target corpus processing model according to any one of claims 1 to 3;
Determining a standard characterization vector matched with the characterization vector of the corpus to be processed based on the similarity between the characterization vector of the corpus to be processed and a plurality of standard characterization vectors, wherein each standard characterization vector carries corresponding category marking information;
And determining the category of the target object based on the category marking information corresponding to the matched standard characterization vector.
5. The method of claim 4, wherein prior to determining a standard token vector that matches a token vector of the corpus to be processed based on similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, the method further comprises the step of determining the plurality of standard token vectors;
the step of determining the plurality of standard token vectors comprises:
Obtaining a standard corpus, wherein the standard corpus records standard corpus and characterization vectors of the standard corpus, each standard corpus carries corresponding category marking information, and the characterization vectors of the standard corpus are obtained by utilizing the target corpus processing model;
Word segmentation is carried out on the corpus to be processed to obtain at least two corpus fragments;
Querying the standard corpus based on each corpus fragment to obtain a standard corpus set corresponding to each corpus fragment, wherein the standard corpus in the standard corpus set corresponding to the corpus fragment comprises the corpus fragments;
Obtaining a standard corpus set according to the standard corpus set corresponding to each corpus fragment;
determining at least two target standard corpus based on the occurrence frequency of each standard corpus in the standard corpus set;
And obtaining a characterization vector of each target standard corpus based on the standard corpus, and taking the characterization vector of the target standard corpus as the standard characterization vector.
6. The method of claim 5, further comprising, prior to the step of obtaining the standard corpus, the step of constructing an inverted index for the standard corpus, the inverted index being used to query a standard corpus containing the corpus segments based on the corpus segments;
correspondingly, the querying the standard corpus based on each corpus fragment to obtain the standard corpus corresponding to each corpus fragment includes:
and inquiring the standard corpus based on the inverted index to obtain a standard corpus corresponding to each corpus fragment.
7. The method of claim 6, wherein the step of constructing an inverted index for the standard corpus comprises:
word segmentation is respectively carried out on a plurality of standard corpus to obtain at least one corpus fragment corresponding to each standard corpus;
Each corpus fragment is used as an index keyword;
determining a standard corpus containing each index keyword;
constructing a first corpus list based on the standard corpus containing each index keyword;
and constructing an inverted index of each index keyword and a first corpus list corresponding to each index keyword.
8. The method of claim 7, further comprising adjusting the inverted index:
Detecting whether abnormal data exists in the inverted index or not in response to the received category determining negative feedback;
and deleting index keywords corresponding to the abnormal data when the abnormal data exist in the inverted index.
9. The method of claim 5, further comprising, prior to the step of obtaining the standard corpus, a step of establishing a mapping relationship for the standard corpus, the mapping relationship being used to query the category to which the standard corpus is directed based on the standard corpus;
correspondingly, the determining the category of the target object based on the category marking information corresponding to the matched standard characterization vector comprises the following steps:
determining a standard corpus corresponding to the matched standard characterization vector;
And inquiring the standard corpus based on the mapping relation, determining the category pointed by the corresponding standard corpus, and taking the pointed category as the category of the target object.
10. The method of claim 9, wherein the step of establishing a mapping relationship for the standard corpus comprises:
determining a plurality of categories based on category labeling information carried by the plurality of standard corpus;
determining a standard corpus pointing to each category;
constructing a second corpus list based on the standard corpus pointing to each category;
And establishing a mapping relation between each category and a second corpus list corresponding to each category.
11. An apparatus for updating a corpus processing model, the apparatus comprising:
A sample set acquisition unit configured to perform acquisition of a current batch of sample sets;
The grouping unit is configured to perform grouping according to category marking information carried by the sample corpus in the current batch of sample sets, so that the sample corpuses carrying the same category marking information are located in the same sample corpus group;
the representation vector obtaining unit is configured to obtain a representation vector of the sample corpus based on the current corpus processing model;
the first correlation calculation unit is configured to calculate the correlation between the characterization vector of the sample corpus and the characterization vector of the same group of sample corpus, so as to obtain the first correlation of the sample corpus, wherein the same group of sample corpus is other sample corpus which is positioned in the same sample corpus group with the sample corpus;
the second correlation degree calculation unit is configured to calculate the correlation degree between the characterization vector of the sample corpus and the characterization vector of a different group of sample corpus, so as to obtain the second correlation degree of the sample corpus, wherein the different group of sample corpus is other sample corpus which is positioned in a different sample corpus group with the sample corpus;
A parameter adjustment unit configured to perform adjustment of parameters of the current corpus processing model to satisfy model convergence conditions according to the first correlation degree and the second correlation degree, and to take the current corpus processing model satisfying the model convergence conditions as a target corpus processing model;
The parameter adjustment unit comprises a current batch actual global correlation obtaining unit, a current batch expected global correlation obtaining unit, a loss function value calculation unit, and a parameter adjustment subunit, wherein the current batch actual global correlation obtaining unit is configured to obtain the current batch actual global correlation of the sample corpus according to the first correlation and the second correlation;
the current batch actual global correlation obtaining unit comprises a normalization processing unit, wherein the normalization processing unit is configured to perform normalization processing on the first correlation and the second correlation to obtain the current batch actual global correlation.
12. The apparatus according to claim 11, wherein the sample set acquisition unit includes:
A category expansion instruction receiving unit configured to execute a receiving category expansion instruction, wherein the category expansion instruction comprises a target category and a sample corpus carrying target category labeling information;
And the sample set construction unit is configured to construct the current batch of sample sets based on the sample corpus carrying the target category labeling information.
13. The apparatus of claim 11, wherein the token vector derivation unit comprises:
The first corpus processing unit is configured to execute word segmentation processing on the sample corpus by using a first corpus processing structure of the current corpus processing model to obtain at least two sample corpus fragments, perform vector conversion on each sample corpus fragment, and obtain a matrix representing the sample corpus based on vectors corresponding to each sample corpus fragment;
And the second corpus processing unit is configured to execute encoding processing on the matrix by using a second corpus processing structure of the current corpus processing model to obtain a representation vector of the sample corpus.
14. A category determining device, the device comprising:
the corpus acquisition unit is configured to acquire the corpus to be processed of the indication target object;
A model application unit configured to perform obtaining a characterization vector of the corpus to be processed using the target corpus processing model according to any one of claims 1 to 3, taking the corpus to be processed as an input;
the matching unit is configured to execute the determination of a standard token vector matched with the token vector of the corpus to be processed based on the similarity between the token vector of the corpus to be processed and a plurality of standard token vectors, and each standard token vector carries corresponding category labeling information;
And the category determining unit is configured to determine the category of the target object based on the category labeling information corresponding to the matched standard characterization vector.
15. The apparatus of claim 14, further comprising a criteria-token-vector determination unit comprising:
The standard corpus obtaining unit is configured to obtain a standard corpus, the standard corpus records standard corpus and characterization vectors of the standard corpus, each standard corpus carries corresponding category marking information, and the characterization vectors of the standard corpus are obtained by utilizing the target corpus processing model;
the first word segmentation processing unit is configured to perform word segmentation processing on the corpus to be processed to obtain at least two corpus fragments;
The first query unit is configured to perform query on the standard corpus based on each corpus fragment to obtain a standard corpus set corresponding to each corpus fragment, wherein standard corpus in the standard corpus set corresponding to the corpus fragment contains the corpus fragment;
The standard corpus collection obtaining unit is configured to obtain the standard corpus collection according to the standard corpus set corresponding to each corpus fragment;
The target standard corpus determining unit is configured to determine at least two target standard corpuses based on the occurrence frequency of each standard corpus in the standard corpus set;
And the standard token vector determination subunit is configured to acquire a token vector of each target standard corpus based on the standard corpus, and take the token vector of the target standard corpus as the standard token vector.
16. The apparatus according to claim 15, further comprising an inverted index construction unit, wherein the inverted index constructed by the inverted index construction unit is used to query the standard corpora containing a corpus fragment based on that corpus fragment;
correspondingly, the first query unit includes:
and the first query subunit is configured to query the standard corpus based on the inverted index to obtain a standard corpus corresponding to each corpus fragment.
17. The apparatus of claim 16, wherein the inverted index construction unit comprises:
The second word segmentation processing unit is configured to perform word segmentation processing on a plurality of standard corpus respectively to obtain at least one corpus fragment corresponding to each standard corpus;
An index keyword determining unit configured to perform each of the corpus fragments as an index keyword, respectively;
A first standard corpus determining unit configured to perform determination of a standard corpus containing each index keyword;
A first corpus list construction unit configured to execute construction of a first corpus list based on the standard corpus containing each index keyword;
And the inverted index construction subunit is configured to construct an inverted index of each index keyword and the first corpus list corresponding to each index keyword.
18. The apparatus of claim 17, further comprising an inverted index adjustment unit, the inverted index adjustment unit including:
A negative feedback receiving unit configured to, in response to received category determination negative feedback, detect whether abnormal data exists in the inverted index;
And the deleting unit is configured to delete the index key words corresponding to the abnormal data when the abnormal data exists in the inverted index.
19. The apparatus according to claim 15, further comprising a mapping relation establishing unit, wherein the mapping relation established by the mapping relation establishing unit is used for querying the category to which the standard corpus is directed based on the standard corpus;
accordingly, the category determining unit includes:
The second standard corpus determining unit is configured to determine standard corpus corresponding to the matched standard characterization vector;
And a category determination subunit configured to perform querying the standard corpus based on the mapping relation, determining a category to which the corresponding standard corpus points, and regarding the pointed category as the category of the target object.
20. The apparatus according to claim 19, wherein the mapping relation establishing unit includes:
The standard corpus processing unit is configured to determine a plurality of categories based on the category labeling information carried by the plurality of standard corpora;
A third standard corpus determining unit configured to perform determination of standard corpus directed to each category;
a second corpus list construction unit configured to execute construction of a second corpus list based on the standard corpus directed to each category;
and the mapping relation establishing subunit is configured to establish a mapping relation between each category and a second corpus list corresponding to each category.
21. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method for updating a corpus processing model according to any one of claims 1 to 3 or the category determining method according to any one of claims 4 to 10.
22. A non-transitory computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method for updating a corpus processing model according to any one of claims 1 to 3 or the category determining method according to any one of claims 4 to 10.
CN202011363647.2A 2020-11-27 2020-11-27 A method for updating a corpus processing model, a method for determining a category, and a device Active CN114564557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011363647.2A CN114564557B (en) 2020-11-27 2020-11-27 A method for updating a corpus processing model, a method for determining a category, and a device


Publications (2)

Publication Number Publication Date
CN114564557A CN114564557A (en) 2022-05-31
CN114564557B true CN114564557B (en) 2025-06-13

Family

ID=81712751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011363647.2A Active CN114564557B (en) 2020-11-27 2020-11-27 A method for updating a corpus processing model, a method for determining a category, and a device

Country Status (1)

Country Link
CN (1) CN114564557B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325231A (en) * 2018-09-21 2019-02-12 中山大学 A Method for Generating Word Vectors from Multi-task Models
CN109902285A (en) * 2019-01-08 2019-06-18 平安科技(深圳)有限公司 Corpus classification method, device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9761220B2 (en) * 2015-05-13 2017-09-12 Microsoft Technology Licensing, Llc Language modeling based on spoken and unspeakable corpuses
CN109299459B (en) * 2018-09-17 2023-08-22 北京神州泰岳软件股份有限公司 Word vector training method and device for single semantic supervision
CN111354354B (en) * 2018-12-20 2024-02-09 深圳市优必选科技有限公司 Training method, training device and terminal equipment based on semantic recognition
CN111563209B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium
CN110502610A (en) * 2019-07-24 2019-11-26 深圳壹账通智能科技有限公司 Intelligent sound endorsement method, device and medium based on text semantic similarity
CN111160017B (en) * 2019-12-12 2021-09-03 中电金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method
CN111191442B (en) * 2019-12-30 2024-02-02 杭州远传新业科技股份有限公司 Similar problem generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN114564557A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN109960800B (en) Weak supervision text classification method and device based on active learning
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
CN111581923A (en) Method, device and equipment for generating file and computer readable storage medium
CN110795913A (en) Text encoding method and device, storage medium and terminal
CN107145485B (en) Method and apparatus for compressing topic models
CN111353049A (en) Data update method, apparatus, electronic device, and computer-readable storage medium
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN107291840B (en) User attribute prediction model construction method and device
CN113821635A (en) Text abstract generation method and system for financial field
CN111078842A (en) Method, device, server and storage medium for determining query result
CN115344698B (en) Tag processing methods, apparatus, computer equipment, storage media and software products
CN119538118A (en) Data classification method and device
CN111737406A (en) Text retrieval method, device and equipment and training method of text retrieval model
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN117494815A (en) Archive-oriented trusted large language model training, inference methods and devices
CN115129871B (en) Text category determining method, apparatus, computer device and storage medium
CN116628594A (en) User information management system and method based on AI intelligence
CN113761126B (en) Text content recognition method, device, equipment and readable storage medium
CN117077680A (en) Question and answer intention recognition method and device
CN114676227A (en) Sample generation method, model training method, and retrieval method
CN114564557B (en) A method for updating a corpus processing model, a method for determining a category, and a device
CN114325384A (en) A crowdsourcing acquisition system and method based on motor fault knowledge
CN110287270B (en) Entity relationship mining method and equipment
CN118410182A (en) Distribution network knowledge graph completion method, device, electronic device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant