CN110348001B - Word vector training method and server - Google Patents
- Publication number
- CN110348001B (application CN201810299633.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- context
- word vector
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the invention discloses a word vector training method and a server, which integrate direction information into word vectors so that the resulting vectors can serve the semantic and syntactic tasks of natural language processing. The word vector training method provided by the embodiment of the invention comprises the following steps: acquiring a corresponding input word vector according to a word in a training sample text; acquiring a corresponding original output word vector according to a context word corresponding to the word in the training sample text; generating a target output word vector according to the original output word vector, wherein the target output word vector carries direction information indicating the position direction of the context word relative to the word; and training a word vector learning model using the input word vector and the target output word vector.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a word vector training method and a server.
Background
The SG (Skip-Gram) model is currently a general-purpose word vector learning model and is widely used in industrial environments. On large-scale corpora, the SG model can produce word vector models of high quality, and when combined with the negative sampling technique, word vectors can be computed efficiently and quickly, so that computing efficiency and result quality are guaranteed at the same time.
In the prior art, the SG model is established by modeling the relationship between one word and the other words around it. Specifically, in a given corpus, for a sequence of words, the SG model learns the relationship between each pair of words, i.e., it predicts the probability of outputting the other words given a word as input. The vector of each word is finally updated by optimizing these probability values.
Although the current SG model can effectively train word vectors, the prior art still has corresponding disadvantages. For example, the SG model treats every word in the context window of a target word equally, so the context structure information around the target word cannot be reflected in the vector of the target word: all words around a word are given equal importance. As a result, the word vectors learned by the SG model cannot embody context structure information, are not sensitive to the position information of the context words relative to the target word, and cannot be effectively applied to the semantic and syntactic tasks of natural language processing.
Disclosure of Invention
The embodiment of the invention provides a word vector training method and a server, which integrate direction information into word vectors so that the resulting vectors can serve the semantic and syntactic tasks of natural language processing.
In order to solve the above technical problem, the embodiments of the present invention provide the following technical solutions:
in a first aspect, an embodiment of the present invention provides a word vector training method, including:
acquiring corresponding input word vectors according to words in the training sample text;
obtaining corresponding original output word vectors according to the context words corresponding to the words in the training sample text;
generating a target output word vector according to the original output word vector, wherein the target output word vector carries direction information for indicating the position direction of the context word relative to the word;
training a word vector learning model using the input word vector and the target output word vector.
In a second aspect, an embodiment of the present invention further provides a server, including:
the input word vector acquisition module is used for acquiring corresponding input word vectors according to words in the training sample text;
an output word vector acquisition module, configured to acquire a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
and the model training module is used for training a word vector learning model by using the input word vector and the target output word vector.
In the second aspect, the constituent modules of the server may further perform the steps described in the foregoing first aspect and its various possible implementations; for details, see the foregoing description of the first aspect and its various possible implementations.
In a third aspect, an embodiment of the present invention provides a server, where the server includes: a processor, a memory; the memory is used for storing instructions; the processor is configured to execute the instructions in the memory to cause the server to perform the method of any of the preceding first aspects.
In a fourth aspect, the present invention provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method of the above aspects.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the above aspects.
According to the technical scheme, the embodiment of the invention has the following advantages:
in the embodiment of the invention, a corresponding input word vector is first acquired according to a word in a training sample text, and a corresponding original output word vector is acquired according to the context word corresponding to that word in the training sample text. A target output word vector is then generated according to the original output word vector, where the target output word vector carries direction information indicating the position direction of the context word relative to the word, and a word vector learning model is trained using the input word vector and the target output word vector. Because the contexts of the input word in different position directions are modeled separately, the structural information of the context words is integrated into word vector learning. The word vectors learned by the model can therefore embody the structural information of the context, and the word vectors obtained by the word vector learning model provided by the embodiment of the invention are applicable to various tasks of natural language processing, especially semantic and syntax related tasks.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings.
Fig. 1 is a schematic flowchart of a word vector training method according to an embodiment of the present invention;
fig. 2 is a schematic view of an application scenario of the word vector training method according to the embodiment of the present invention;
fig. 3 is a schematic diagram of an SG model as a word vector learning model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of joint optimization provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the SSG model as a word vector learning model provided by an embodiment of the present invention;
fig. 6-a is a schematic structural diagram of a server according to an embodiment of the present invention;
fig. 6-b is a schematic structural diagram of a reconfiguration module for output word vectors according to an embodiment of the present invention;
FIG. 6-c is a schematic diagram of a structure of a model training module according to an embodiment of the present invention;
FIG. 6-d is a schematic diagram illustrating a structure of another output word vector reconfiguration module according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a server to which the word vector training method according to the embodiment of the present invention is applied.
Detailed Description
The embodiment of the invention provides a word vector training method and a server, which integrate direction information into word vectors so that the resulting vectors can serve the semantic and syntactic tasks of natural language processing.
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein, are intended to be within the scope of the present invention.
The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The following are detailed below.
The word vector training method provided by the embodiment of the invention trains the word vector learning model using the direction information of the context. The word vector learning model can be an SG (Skip-Gram) model whose context carries direction; for convenience of description, the direction-aware SG model adopted by the embodiment of the invention is called a DSG (Directional Skip-Gram) model, and the DSG model provided by the embodiment of the invention helps to learn word vectors. The DSG model considers that the sequence information of words is a very important indicating signal in any language. For every input-output word pair, direction information is introduced into the output word vector to indicate whether the target word lies to the left or to the right of the input word (i.e., in its above or below context), thereby strengthening the guiding effect of the target word on the input word and yielding better word vectors. In the embodiment of the invention, the structural information of the text is integrated into word vector learning by modeling the above context and the below context of the target word separately. Therefore, the word vectors learned by the DSG model can embody the structural information of the context; the direction information of the context enhances the semantic expression capability of the word vectors and increases their syntactic capability, so that the word vectors obtained by the embodiment of the invention are applicable to the semantic and syntactic tasks of natural language processing.
The word vector training method provided in the embodiment of the present invention may be applied to a word vector learning scenario and may be executed by a server. The server may include a processor and a memory; the memory in the server stores the input word vector and the target output word vector, where the target output word vector carries direction information indicating the position direction of the context word relative to the word. For example, the input word vector and the target output word vector are stored in the memory of the server, and the processor may read a program from the memory to execute the word vector training method provided by the embodiment of the present invention.
Referring to fig. 1, a word vector training method according to an embodiment of the present invention includes the following steps:
101. Acquire a corresponding input word vector according to a word in the training sample text.
In the embodiment of the present invention, a corpus stores training sample texts. A training sample text may include a segment of vocabulary, where each vocabulary item may be a word, and the word corresponds to context words. For example, if the training sample text includes a continuous segment of vocabulary ABC, then for word B, word A and word C constitute the context words of B. First, words and their context words are obtained from the training sample text, and the corresponding input word vector is acquired according to a word in the training sample text. The input word vector corresponds to the word and can be input into the word vector learning model; the input vector is continuously updated during model training, that is, new words are continuously read from the corpus and written into the input vector.
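As an illustrative sketch only (not the embodiment's implementation; the corpus, the vocabulary-building strategy, the dimension and the initialization are assumptions), the correspondence between words in the training sample text and input word vectors can be pictured as follows:

```python
import numpy as np

def build_vocab(corpus_tokens):
    """Assign an index to every distinct word in the corpus."""
    vocab = {}
    for token in corpus_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

# Hypothetical corpus "A B C A C", used only for illustration.
corpus = ["A", "B", "C", "A", "C"]
vocab = build_vocab(corpus)

dim = 100                      # word vector dimension (assumed value)
rng = np.random.default_rng(0)
# One input word vector per word; updated continuously during training.
input_vectors = (rng.random((len(vocab), dim)) - 0.5) / dim

def input_word_vector(word):
    """Acquire the input word vector corresponding to a word."""
    return input_vectors[vocab[word]]

print(input_word_vector("B").shape)  # (100,)
```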
102. Acquire a corresponding original output word vector according to the context word corresponding to the word in the training sample text.
In the embodiment of the present invention, after a word and its context words are obtained from the training sample text, the original output word vector corresponding to each context word may also be acquired. The original output word vector corresponds to the context word of the word and serves as the reference value for the prediction output of the word vector learning model. As the input vector is continuously updated during model training, new context words corresponding to the word are continuously read from the corpus and written into the original output word vector.
It should be noted that, in the embodiment of the present invention, the output word vector corresponding to the context word is described as the "original output word vector". After the input word vector and the original output word vector are obtained, the word vector learning model cannot be trained directly; the original output word vector first needs to be reconfigured so that the output word vector carries the position direction of the context word relative to the word.
103. Generate a target output word vector according to the original output word vector, where the target output word vector carries direction information indicating the position direction of the context word relative to the word.
In the embodiment of the present invention, after the original output word vector is obtained, a target output word vector needs to be generated from it. The target output word vector carries direction information indicating the position direction of the context word relative to the word; that is, the original output word vector is reconfigured so that it carries the position direction of the context word relative to the word. For convenience of distinction, the output word vector obtained after the original output word vector is reconfigured is referred to as the "target output word vector".
In the embodiment of the present invention, the target output word vector carries direction information indicating the position direction of the context word relative to the word. The position direction indicates in which direction of the word the context word appears, and the direction information may be a one-dimensional quantity indicating the position direction. For example, the position direction of the context word relative to the word may include: the context word appears above the word (i.e., to its left) or below the word (i.e., to its right). If the context word appears above the word (left direction), the direction information may take the value 1; if the context word appears below the word (right direction), the direction information may take the value 0.
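A minimal sketch of how (word, context word, direction) training pairs could be enumerated under the convention just described (the window size and function names are assumptions):

```python
def training_pairs(tokens, window=2):
    """Yield (word, context_word, direction) triples from a token sequence.

    direction is 1 when the context word appears above (to the left of)
    the word and 0 when it appears below (to the right), matching the
    convention in the text. The window size is an assumed parameter.
    """
    for t, word in enumerate(tokens):
        for i in range(-window, window + 1):
            if i == 0 or not (0 <= t + i < len(tokens)):
                continue
            direction = 1 if i < 0 else 0
            yield word, tokens[t + i], direction

# For the hypothetical text "A B C": pairs for word B include
# ("B", "A", 1), since A is above B, and ("B", "C", 0), since C is below B.
for pair in training_pairs(["A", "B", "C"], window=1):
    print(pair)
```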
In some embodiments of the present invention, step 103 generating a target output word vector from the original output word vector comprises:
generating a direction vector according to whether the context word appears above or below the word, wherein the direction vector is used for indicating that the context word appears above or below the word;
obtaining a target output word vector through the original output word vector and the direction vector, wherein the target output word vector comprises: the original output word vector and direction vector.
The sequence information of words is an important indicating signal in any language, and the context words of a word in the corpus indicate the sequence information corresponding to that word. The direction vector is used to indicate whether the context word appears above or below the word, and the target output word vector is obtained from the original output word vector and the direction vector. Introducing the direction vector to indicate whether the target word lies in the left or right direction of the input word strengthens the guiding effect of the target word on the input word and yields better word vectors.
In some embodiments of the present invention, step 103 generating a target output word vector from the original output word vector comprises:
acquiring an above output word vector from the original output word vector according to the context word appearing above the word;

acquiring a below output word vector from the original output word vector according to the context word appearing below the word;

obtaining a target output word vector from the above output word vector and the below output word vector, wherein the target output word vector comprises: the above output word vector and the below output word vector.
In the embodiment of the present invention, direction information may also be carried implicitly. Unlike the foregoing implementation, in which the target output word vector carries a direction vector, two sets of output word vectors may be designed to express, respectively, the above context and the below context of any input word. Each word then has three vectors: one expressing the word as an input word vector, one expressing it as an above output word vector, and the last expressing it as a below output word vector. When calculating word vectors, for any input word, the words above it use their above output word vectors, and the words below it use their below output word vectors, together with the input word vector of the input word, to calculate the log-probability likelihood estimate. This implementation also actively distinguishes the above and below contexts of a word during learning: because each context word is updated either as above context or as below context at any one time, each of its output vectors is updated with only half the probability.
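A rough sketch of this three-vector arrangement (the vocabulary size, dimension and initialization are assumed values, not from the embodiment):

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 10_000, 100            # assumed vocabulary size and dimension

# Three vectors per word: input, above-output, below-output.
input_vecs = (rng.random((V, dim)) - 0.5) / dim
above_out_vecs = np.zeros((V, dim))
below_out_vecs = np.zeros((V, dim))

def score(word_id, ctx_id, direction):
    """Dot-product score of a context word against the input word,
    choosing the above or below output vector by direction
    (1 = context word above the word, 0 = below)."""
    out = above_out_vecs if direction == 1 else below_out_vecs
    return float(input_vecs[word_id] @ out[ctx_id])
```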
104. Train the word vector learning model using the input word vector and the target output word vector.
In the embodiment of the present invention, after the input word vector and the target output word vector are obtained, the word vector learning model may be trained using them. The word vector learning model provided in the embodiment of the present invention may be an SG model that uses a directional context, referred to as the DSG model for short. Because the target output word vector carries direction information indicating the position direction of the context word relative to the word, training the word vector learning model integrates the structural information of the context words into word vector learning, so the learned word vectors embody the structural information of the context and are applicable to the semantic and syntactic tasks of natural language processing. The embodiment of the invention extends the existing word vector learning model with direction information, so that different model variants can be derived for different usage scenarios, adapting to different tasks and yielding word vectors of higher quality.
In some embodiments of the invention, in the case where the target output word vector comprises the original output word vector and the direction vector, training the word vector learning model using the input word vector and the target output word vector in step 104 includes:
obtaining an interactive function calculation result according to the input word vector and the direction vector, and performing iterative updating on the input word vector and the direction vector according to the interactive function calculation result;
obtaining a conditional probability calculation result according to the input word vector and the original output word vector, and performing iterative updating on the input word vector and the original output word vector according to the conditional probability calculation result;
and estimating the optimal target of the word vector learning model according to the interactive function calculation result and the conditional probability calculation result.
In the embodiment of the present invention, the interaction function result may be calculated from the input word vector and the direction vector, that is, the interaction function calculation result is obtained. For example, the interaction between the input word vector and the direction vector may be calculated with a softmax function, so that the direction information is integrated into the final word vector. The values of the input word vector and the direction vector are updated synchronously according to the interaction function calculation result so that the result tends toward the expected value; for example, when the interaction between the input word vector and the direction vector is calculated with a softmax function, the value of the interaction function tends to 1 when the context word is to the left of the word and tends to 0 when the context word is to the right of the word. Besides the interaction function between the input word vector and the direction vector, the conditional probability between the word and its context words needs to be calculated synchronously; that is, the conditional probability calculation result is obtained from the input word vector and the original output word vector, for example by calculating the conditional probability between words with the SG model, thereby modeling the semantic relationship between words. After the interaction function calculation result and the conditional probability calculation result are obtained through the above steps, joint optimization can be performed on the two, that is, the optimization target of the word vector learning model is estimated, so that the optimization target of each word is updated iteratively through the interaction function calculation result and the conditional probability calculation result. After the training of the word vector learning model is completed, the model yields high-quality word vectors for the input words.
Optionally, in some embodiments of the present invention, taking a softmax function as the interaction function as an example, the interaction function calculation result is obtained from the input word vector and the direction vector as follows:

the interaction function between the input word vector and the direction vector is calculated as

g(ω_{t+i}, ω_t) = exp(δ_{ω_{t+i}}^T v_{ω_t}) / Σ_{ω ∈ V} exp(δ_ω^T v_{ω_t})   (formula one)

where g(ω_{t+i}, ω_t) represents the interaction function calculation result, δ_{ω_{t+i}} represents the direction vector of the context word ω_{t+i}, v_{ω_t} represents the input vector of the word ω_t, and V represents the set of all words in the corpus. In the above formula, exp denotes the exponential (e) function and the superscript T denotes transposition.
Optionally, in some embodiments of the present invention, iteratively updating the input word vector and the direction vector according to the interactive function calculation result includes:
the input word vector and the direction vector are iteratively updated as follows:

v_{ω_t}^{(new)} = v_{ω_t}^{(old)} + γ (D − σ(v_{ω_t}^T δ_{ω_{t+i}})) δ_{ω_{t+i}}   (formula two)

δ_{ω_{t+i}}^{(new)} = δ_{ω_{t+i}}^{(old)} + γ (D − σ(v_{ω_t}^T δ_{ω_{t+i}})) v_{ω_t}   (formula three)

where v_{ω_t}^{(new)} represents the updated input vector of the word ω_t, v_{ω_t}^{(old)} represents the input vector before the update, γ represents the learning rate, δ_{ω_{t+i}} represents the direction vector of the context word ω_{t+i}, v_{ω_t} represents the input vector of the word ω_t, σ(v_{ω_t}^T δ_{ω_{t+i}}) represents the predicted position direction value of the context word relative to the word, D represents the position direction marker value of the context word relative to the word, δ_{ω_{t+i}}^{(new)} represents the updated direction vector of the context word ω_{t+i}, and δ_{ω_{t+i}}^{(old)} represents the direction vector of the context word ω_{t+i} before the update.
In the above formulas, the superscript (new) indicates a vector after the update and the superscript (old) indicates a vector before the update. γ is the learning rate, a numerical variable that decreases continuously as the training of the word vectors progresses; for example, the learning rate may be defined as the ratio of the size of the untrained text to the total text size.
Optionally, the position direction flag value D satisfies the following condition:

D = 1 if i < 0; D = 0 if i > 0   (formula four)

where i < 0 indicates that the position direction of the context word relative to the word is above, and i > 0 indicates that the position direction of the context word relative to the word is below.
For example, D is the marker information of the context word being in the left or right direction of the input word. As mentioned above, it has two values: i < 0 corresponds to the above vocabulary and i > 0 corresponds to the below vocabulary, and in each training sample the value of D is a marker obtained automatically according to the position of the word during training.
Optionally, in some embodiments of the present invention, the optimization target of the word vector learning model is estimated according to the interaction function calculation result and the conditional probability calculation result as follows:

the global log maximum likelihood estimate f(ω_{t+i}, ω_t) is calculated as

f(ω_{t+i}, ω_t) = p(ω_{t+i} | ω_t) + g(ω_{t+i}, ω_t)   (formula five)

where g(ω_{t+i}, ω_t) represents the interaction function calculation result and p(ω_{t+i} | ω_t) represents the conditional probability calculation result;

and the joint log-likelihood estimate L of the probability of the word to its context words is calculated in the SG manner as

L = Σ_{ω_t ∈ V} Σ_{−c ≤ i ≤ c, i ≠ 0} log f(ω_{t+i}, ω_t)   (formula six)

where V represents the set of all words in the corpus, ω_{t+i} is the context word, ω_t is the word, and c represents the context window size.
For example, the global log maximum likelihood estimate can be optimized through formula five, so that the estimation of the optimization target of the word vector learning model in the embodiment of the present invention is converted into a joint optimization problem over two correlated functions, thereby realizing the estimation of the optimization target of the word vector learning model.
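For illustration, a minimal single-pair update combining the two parts of the objective might look as follows; the use of negative sampling for the conditional-probability part, the sign convention of the updates, and all names are assumptions rather than the embodiment's reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsg_update(v_in, v_out, delta, direction_flag, negative_outs, lr):
    """One sketched training step for a (word, context word) pair.

    v_in:           input word vector of the word (v_{w_t})
    v_out:          original output word vector of the context word
    delta:          direction vector of the context word (delta_{w_{t+i}})
    direction_flag: D, 1 if the context word appears above the word, 0 if below
    negative_outs:  output word vectors of negatively sampled words (assumed)
    lr:             learning rate gamma
    """
    grad_in = np.zeros_like(v_in)

    # Conditional-probability part p(context | word): skip-gram style
    # negative-sampling update over the positive sample plus negatives.
    for out, label in [(v_out, 1.0)] + [(neg, 0.0) for neg in negative_outs]:
        err = label - sigmoid(v_in @ out)
        grad_in += lr * err * out
        out += lr * err * v_in           # update the (original) output word vector

    # Interaction-function part g: direction information, following the
    # reconstructed formulas two and three (new = old + lr * (D - sigma) * other).
    err_d = direction_flag - sigmoid(v_in @ delta)
    grad_in += lr * err_d * delta
    delta += lr * err_d * v_in           # update the direction vector

    v_in += grad_in                      # update the input word vector
```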
As can be seen from the description in the above embodiments, a corresponding input word vector is first acquired according to a word in the training sample text, and a corresponding original output word vector is acquired according to the context word corresponding to that word in the training sample text. A target output word vector is then generated according to the original output word vector; the target output word vector carries direction information indicating the position direction of the context word relative to the word, and the word vector learning model is trained using the input word vector and the target output word vector. Because the contexts of the input word in different position directions are modeled separately in the embodiment of the invention, the structural information of the context words is merged into word vector learning, so the word vectors learned by the model embody the structural information of the context and are applicable to various tasks of natural language processing, especially tasks related to semantics and syntax.
In order to better understand and implement the above solution of the embodiment of the present invention, the following description specifically illustrates a corresponding application scenario.
The word vector learning model used in the embodiment of the present invention may be an improved SG model (hereinafter referred to as the DSG model). The SG model learns word vectors by establishing the relationship between one word and the other words around it: in a given corpus, for a word sequence, the SG model learns the relationship between each pair of words, i.e., it predicts the probability of outputting the other words given a word as input, and the vector of each word is finally updated by optimizing these probability values. The method provided by the invention enhances the semantic capability of the SG model while increasing its syntactic capability.
The word vector training method provided by the embodiment of the invention is a basic algorithm and can be used in all application scenarios related to natural language processing technologies and in the products these scenarios require. The usual usage is to generate or update word vectors with the word vector learning model provided by the invention and deliver the generated vectors to subsequent natural language processing tasks. For example, the generated word vectors can be used in a word segmentation and part-of-speech tagging system to improve the accuracy of word segmentation and part-of-speech tagging, thereby improving subsequent processing capability. As another example, in search and related scenarios, the obtained search results often need to be ranked, and ranking often requires calculating the semantic similarity of each result to the search query statement (query). This similarity measurement can be achieved through similarity calculation over word vectors, so the quality of the vectors largely determines the effect of the semantic similarity calculation method. Beyond these tasks, because the word vectors trained by the embodiments of the present invention effectively combine and distinguish the context information of different words, they can perform better especially on semantic and syntactic tasks (e.g., part-of-speech tagging, chunking analysis, constituent syntactic analysis, dependency syntactic analysis, etc.).
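As a sketch of the search-ranking use just described (the averaging strategy, the names, and the form of the vector store are assumptions, not part of the embodiment):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sentence_vector(words, word_vectors):
    """Average the vectors of the in-vocabulary words (assumed strategy)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0)

def rank_results(query_words, candidate_results, word_vectors):
    """Sort candidate results (lists of words) by semantic similarity to the query."""
    q = sentence_vector(query_words, word_vectors)
    scored = [(cosine(q, sentence_vector(r, word_vectors)), r) for r in candidate_results]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```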
Fig. 2 is a schematic view of an application scenario of the word vector training method according to the embodiment of the present invention. Human languages have a linear characteristic: the words of any language usually follow a certain sequential relationship, so collocations between words form relatively fixed front-back order relationships. For example, in a sentence, one word may often appear to the left of another word, especially in a language such as Chinese, whose requirements on word order are higher than its requirements on syntax. Based on this analysis, the embodiment of the present invention models the above (left text) and below (right text) relations of an input word separately to reflect the word order relationship formed by the context of a word. On the basis of the SG model, the embodiment of the invention introduces an additional direction vector δ for each word, and this vector δ is used to express and calculate whether the word, when appearing as a context word, is on the left side or the right side of an input word.
For this purpose, a softmax function g is defined, and the interaction between the direction vector of the context word and the word vector of the current input word is calculated as in formula one, so as to integrate the direction information into the final word vector.
In particular, the interaction function is used to compute, for an input word w_t, the direction of a context word w_{t+i}, and the values of δ and v are updated synchronously according to the calculation result of formula one, so that the value of g tends to 1 when w_{t+i} is on the left side of w_t and tends to 0 when w_{t+i} is on the right side of w_t. To achieve this effect, δ and v may be updated as in formulas two and three of the foregoing embodiment, where the superscript (new) indicates the vector after the update and (old) indicates the vector before the update; the learning rate is a numerical variable that decreases continuously as the training of the word vectors progresses and is generally defined as the ratio of the size of the untrained text to the total text size. D is the marker information of the context word being to the left or right of the input word and has two values, as given in formula four of the foregoing embodiment: i < 0 corresponds to the above vocabulary and i > 0 corresponds to the below vocabulary. In each training sample, the value of D is distinguished according to the above and below contexts, that is, a marker obtained automatically according to the position of the word during training.
The g function defined by the above formulas can be regarded as an effective means of modeling the structural information of the context. Besides the g function, the embodiment of the present invention uses the SG model to calculate the conditional probability between words, modeling the semantic relationship between words.
Fig. 3 is a schematic diagram of the SG model as a word vector learning model provided in the embodiment of the present invention. Here w_0 is the current word, and w_{-2}, w_{-1}, w_1 and w_2 are the context of w_0. The SG model uses w_0 as input and maximizes the probability from w_0 to the other words, so the optimization goal of the SG model over the entire corpus is to maximize the joint log-likelihood estimate of the probability from each word w_t to its context, which may be estimated, for example, by the aforementioned formula six.
For convenience of explanation of the subsequent methods, formula six uses the f function to express the probability from w_t to its context. In the SG model, f(w_{t+i}, w_t) is defined as a softmax function expressed by word vectors, for example as shown in formula seven below:

f(w_{t+i}, w_t) = exp(v'_{w_{t+i}}^T v_{w_t}) / Σ_{w ∈ V} exp(v'_w^T v_{w_t})   (formula seven)

where v_{w_t} refers to the input vector expression of w_t, v'_{w_{t+i}} refers to the output vector expression of w_{t+i}, and so on. Each word in the SG model has two vectors: one for the word as an input word (denoted v) and the other for the word as a predicted output context word (denoted v'). The SG model therefore increases the value of the joint likelihood estimate in formula six by computing formula seven and iteratively updating the vectors of the words over the entire corpus, and outputs the vectors of all words after the specified number of iterations.
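A small numerical sketch of formula seven (the array layout and names are assumptions): the probability of a context word given an input word is a softmax over output-vector scores.

```python
import numpy as np

def sg_probability(input_vecs, output_vecs, w_t, w_ti):
    """p(w_{t+i} | w_t) as in formula seven: softmax over output-vector scores.

    input_vecs:  |V| x d matrix of input word vectors (v)
    output_vecs: |V| x d matrix of output word vectors (v')
    w_t, w_ti:   integer indices of the input word and the context word
    """
    scores = output_vecs @ input_vecs[w_t]          # v'_w^T v_{w_t} for all w
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[w_ti])
```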
Fig. 4 is a schematic diagram of joint optimization provided in the embodiment of the present invention. In the embodiment of the present invention, the optimization target of the DSG model is consistent with the function defined by formula six, and the global log maximum likelihood estimate is optimized, for example, through formula five in the foregoing embodiment. The optimization can therefore be regarded as a joint optimization problem over two correlated functions, and the optimization target of each word can be expressed in the form shown in fig. 4, where the solid arrows represent the prediction relationship and the dotted arrows represent the vector update process of the input word.
In the implementation process, the method provided by the embodiment of the invention has no special hardware requirements. As with the word vector learning model (such as the SG model), the calculation can be completed with an ordinary processor, in a single thread or in multiple threads. The word vectors and direction vectors involved in the present invention are stored in memory (RAM) during calculation and are output to disk or another carrier for storage after calculation is completed. In the embodiment of the invention, the whole algorithm only needs to be given a training corpus, and the vectors of the words contained in the corpus can be calculated according to parameters such as the predefined window size and the number of iterations.
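A sketch of a possible training entry point reflecting the parameters mentioned above; every name and default value here is an assumption for illustration, not part of the embodiment:

```python
from dataclasses import dataclass

@dataclass
class DSGConfig:
    dim: int = 100             # word vector dimension
    window: int = 5            # context window size c
    iterations: int = 5        # passes over the training corpus
    negative: int = 5          # number of negative samples n
    threads: int = 1           # single- or multi-threaded training
    initial_lr: float = 0.025  # starting learning rate

def train(corpus_path: str, output_path: str, cfg: DSGConfig | None = None) -> None:
    """Read the training corpus, train word and direction vectors in memory,
    and write all vectors to output_path once training completes.
    (Body omitted; only the interface is sketched here.)"""
    cfg = cfg or DSGConfig()
    raise NotImplementedError
```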
The embodiment of the present invention further provides a Structured SG (Structured Skip-Gram, SSG) model. The SSG model considers not only the context words of the input word but also the influence of the positions of the context words on the input word, where the position of a context word refers to its relative position to the input word in the corpus, and the probability of the context word is predicted separately for each different position. The structure of the SSG model is similar to that of the SG model, as shown in fig. 5, except that the SSG model estimates the probability of each context word at the corresponding position using different parameters, denoted O_{-2}, O_{-1}, O_1 and O_2 in fig. 5, as distinguished from the uniform O used to predict the different words in fig. 3. Here O expresses the prediction relationship: the same O represents the same prediction relationship, and O with different subscripts represents different prediction relationships.
In the embodiment of the invention, the optimization target of the SSG model is consistent with that of the SG model: maximizing the joint log-likelihood estimate over the whole corpus. The only difference is that the SSG model has multiple output vectors corresponding to the different positions, so f is defined as formula eight below:

f(w_{t+i}, w_t) = Σ_{r = −c, r ≠ 0}^{c} exp(v'_{r, w_{t+i}}^T v_{w_t}) / Σ_{w ∈ V} exp(v'_{r, w}^T v_{w_t})   (formula eight)

where r is the relative position, c is the size of the context window, v'_{r, w} is the output vector of word w for relative position r, and the meaning of the remaining quantities is the same as in the formulas above. The probability of a context word w_{t+i} given the input word must take the position of w_{t+i} relative to w_t into account, and thus the SSG model formally defines a series of different "roles" (prototypes) for each context word to distinguish the effect of the word on the input word when it occurs at different positions. Compared with the SG model, distinguishing context words at different positions enables the SSG model to model the structural information of the context (here, information such as the arrangement and order relationships of words) to a certain extent, so that it can learn richer inter-word relationships than SG.
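A minimal sketch of position-specific scoring in the SSG manner (the storage layout and all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim, c = 10_000, 100, 2      # assumed vocabulary size, dimension, window

input_vecs = (rng.random((V, dim)) - 0.5) / dim
# One output matrix per relative position r in {-c, ..., -1, 1, ..., c}.
position_out = {r: np.zeros((V, dim)) for r in range(-c, c + 1) if r != 0}

def ssg_position_prob(w_t, w_ti, r):
    """Softmax probability of context word w_ti at relative position r,
    using the position-specific output vectors (cf. formula eight)."""
    scores = position_out[r] @ input_vecs[w_t]
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[w_ti])
```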
As can be seen from the foregoing descriptions of the SG, DSG and SSG models, although each method can effectively train word vectors, there are differences in many respects. For example, the SG model does not distinguish between different types of context and treats every word within the context window of each target (input) word equally. Therefore, the context structure information around the target word cannot be reflected in the vector of the target word, all words around a word have equal importance to the word, and much collocation information (especially fixed forward or backward collocations) cannot be reflected in the learning of the vectors. In contrast, the SSG model solves the SG model's problem of not distinguishing the context and ensures that the context word at each position has a specific and unique role; however, this significantly increases the computational complexity, and for training on a corpus of the same size, the SSG model may take several times as long as the SG model.
Table 1 lists the space and time complexity of the SG and SSG models, where d represents the dimension of a word vector (e.g., 50, 100 or 200), and S is the size (total number of tokens) of the corpus used for training the word vectors. Here "token" denotes an occurrence of a word in the corpus, which differs from the notion of a word in the vocabulary: for example, a corpus may contain 10,000 tokens but only 100 different words (i.e., the vocabulary). V is the set of all words in the corpus, o is the time required for one vector update, and n is the number of negative samples; negative sampling is an algorithm that effectively reduces the computational complexity of word vector computation. As can be seen from Table 1 below, when the context window grows, the space and time complexity of the SSG model exceeds that of the SG model by a multiple of the window size c; taking the context window size of 5 words typically used in word vector computation as an example, training an SSG model requires about 5 times the space and time of the SG model.
TABLE 1 spatiotemporal complexity analysis of the various models
| Method | Space complexity | Time complexity |
| SG | 2|V|d | 2cS(n+1)o |
| SSG | (2c+1)|V|d | 4c²S(n+1)o |
| DSG | 3|V|d | 2cS(n+2)o |
The lower the space-time complexity, the easier the implementation and the lower the hardware requirements on the processor. The third row of Table 1 also shows the space-time complexity of the DSG model compared with the SG and SSG models. Compared with the SG and SSG models, the DSG model can take into account the learning of a certain degree of context structure information while, compared with the SG model, not increasing the computational complexity significantly. The SSG model described above has higher computational complexity than the DSG model, and because of the sparsity of word occurrences at different positions, the SSG model is difficult to extend to larger context windows in actual computation, while the DSG model is less affected by this data sparsity problem. Considering the expression characteristics of Chinese, which is more sensitive to word order than to syntax, the DSG model is more suitable for learning Chinese word vectors and is more conducive to semantic understanding of the Chinese environment and further processing of Chinese vocabulary.
In the word vector training method provided by the invention, the DSG model introduces a set of additional direction vectors to express the position information of the context words with respect to the input word, so that the method can learn the structural information of the context. Compared with the SG and SSG models, the DSG model requires 1.5 times the space of the conventional SG model, and its time complexity is close to that of the conventional SG model and far lower than that of the SSG model. In particular, its space complexity is not affected by the size of the context window, and its time complexity is linearly proportional to the window size, whereas that of the SSG model grows with the square of the window size.
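A quick arithmetic check of the space column of Table 1; the vocabulary size, vector dimension, window size and 4-byte floats below are assumed illustration values, not figures from the embodiment:

```python
V, d, c = 1_000_000, 200, 5          # assumed |V|, dimension, window size
bytes_per_float = 4

space = {
    "SG":  2 * V * d,                # input + output vectors
    "SSG": (2 * c + 1) * V * d,      # input + one output set per position
    "DSG": 3 * V * d,                # input + output + direction vectors
}
for name, floats in space.items():
    print(f"{name}: {floats * bytes_per_float / 2**30:.1f} GiB")
# Under these assumptions DSG needs 1.5x the space of SG,
# while SSG needs (2c + 1) / 2 = 5.5x.
```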
In the embodiment of the present application, because a set of additional direction vectors is introduced, this set of direction vectors can be output separately and used directly to calculate the positional relationship between one word and another, for example by directly computing the cosine of a word vector and a direction vector: sim = cosine(v1, d2), where v1 is the word vector of word 1 and d2 is the direction vector of word 2. Calculating similarity between the position vector of one word and the word vector of another simplifies the expression of the position information, distinguishing only the above context from the below context. Meanwhile, since the sequence information of the context is integrated into the word vector expression, the word vectors learned by the DSG model have a certain syntactic adaptability, that is, the final word vectors implicitly contain text structure information, so the method can, to a certain extent, help the semantic and syntax related tasks of natural language processing (such as part-of-speech tagging, chunk recognition, dependency syntactic analysis, etc.).
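A small sketch of the sim = cosine(v1, d2) calculation just mentioned (the vector stores and names are assumptions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def position_similarity(word1, word2, word_vectors, direction_vectors):
    """sim = cosine(v1, d2): the word vector of word 1 against the direction
    vector of word 2. Under the assumed reading of the g function, a large
    value suggests that word 2, as a context word of word 1, tends to appear
    above (to the left of) word 1."""
    return cosine(word_vectors[word1], direction_vectors[word2])
```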
Owing to the context-distinguishing capability of the present invention, the word vectors learned by this method can acquire more accurate word class discrimination capability. This is because, owing to the characteristics of language structure, words belonging to certain categories tend to follow a certain degree of ordering; for example, adjectives tend to precede nouns, and adverbs before and after verbs differ in function (as with the syntactic adaptability mentioned above). Therefore, when similar words are computed using word vectors learned by DSG, it is easier to obtain words of the same type (compared with the SG model), and word vectors with this capability can be computed more efficiently than with complex models such as SSG.
Without limitation, in the word vector training method provided in the embodiment of the present invention, the negative sampling algorithm may be replaced with the hierarchical softmax algorithm to calculate the probability of the target word and predict the target word. Compared with negative sampling, hierarchical softmax can yield better results when the training data is small, but as the training data grows, the required computational space increases significantly.
It should be noted that for simplicity of description, the above-mentioned method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides relevant means for implementing the above-described aspects.
Referring to fig. 6-a, a server 600 according to an embodiment of the present invention may include: an input word vector obtaining module 601, an output word vector obtaining module 602, an output word vector reconfiguring module 603, and a model training module 604, wherein,
an input word vector acquisition module 601, configured to acquire a corresponding input word vector according to a word in a training sample text;
an output word vector obtaining module 602, configured to obtain a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module 603, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
a model training module 604, configured to train a word vector learning model using the input word vector and the target output word vector.
In some embodiments of the present application, referring to fig. 6-b, the output word vector reconfiguration module 603 includes:
a direction vector generation module 6031 configured to generate a direction vector according to the context word appearing above or below the word, where the direction vector is used to indicate that the context word appears above or below the word;
a first target output word vector generating module 6032, configured to obtain the target output word vector by using the original output word vector and the direction vector, where the target output word vector includes: the original output word vector and the direction vector.
In some embodiments of the present application, referring to fig. 6-c, the model training module 604 comprises:
an interactive function calculating module 6041, configured to obtain an interactive function calculation result according to the input word vector and the direction vector, and perform iterative update on the input word vector and the direction vector according to the interactive function calculation result;
a conditional probability calculation module 6042, configured to obtain a conditional probability calculation result according to the input word vector and the original output word vector, and iteratively update the input word vector and the original output word vector according to the conditional probability calculation result;
and an object estimation module 6043, configured to estimate an optimal object of the word vector learning model according to the interaction function calculation result and the conditional probability calculation result.
Further, in some embodiments of the present application, the interaction function calculation module 6041 is specifically configured to calculate an interaction function between the input word vector and the direction vector, wherein,
g(ω_{t+i}, ω_t) = exp(δ_{ω_{t+i}}^T v_{ω_t}) / Σ_{ω ∈ V} exp(δ_ω^T v_{ω_t})

where g(ω_{t+i}, ω_t) represents the interaction function calculation result, δ_{ω_{t+i}} represents the direction vector of the context word ω_{t+i}, v_{ω_t} represents the input vector of the word ω_t, and V represents the set of all words in the corpus.
Further, in some embodiments of the present application, the interactive function calculating module 6041 is specifically configured to perform an iterative update on the input word vector and the direction vector, where,
v_{ω_t}^{(new)} = v_{ω_t}^{(old)} + γ (D − σ(v_{ω_t}^T δ_{ω_{t+i}})) δ_{ω_{t+i}}

δ_{ω_{t+i}}^{(new)} = δ_{ω_{t+i}}^{(old)} + γ (D − σ(v_{ω_t}^T δ_{ω_{t+i}})) v_{ω_t}

where v_{ω_t}^{(new)} represents the updated input vector of the word ω_t, v_{ω_t}^{(old)} represents the input vector before the update, γ represents the learning rate, δ_{ω_{t+i}} represents the direction vector of the context word ω_{t+i}, v_{ω_t} represents the input vector of the word ω_t, σ(v_{ω_t}^T δ_{ω_{t+i}}) represents the predicted position direction value of the context word relative to the word, D represents the position direction marker value of the context word relative to the word, δ_{ω_{t+i}}^{(new)} represents the updated direction vector of the context word ω_{t+i}, and δ_{ω_{t+i}}^{(old)} represents the direction vector of the context word ω_{t+i} before the update.
In some embodiments of the present application, the position direction flag value D satisfies the following condition:
D = 1 if i < 0; D = 0 if i > 0

where i < 0 indicates that the position direction of the context word relative to the word is above, and i > 0 indicates that the position direction of the context word relative to the word is below.
Further, in some embodiments of the application, the target estimation module 6043 is configured to calculate a global log maximum likelihood estimate f(ω_{t+i}, ω_t), where f(ω_{t+i}, ω_t) = p(ω_{t+i} | ω_t) + g(ω_{t+i}, ω_t), g(ω_{t+i}, ω_t) represents the interaction function calculation result, and p(ω_{t+i} | ω_t) represents the conditional probability calculation result; and to calculate, in the SG manner, a joint log-likelihood estimate L of the probability of the word to its context words, L = Σ_{ω_t ∈ V} Σ_{−c ≤ i ≤ c, i ≠ 0} log f(ω_{t+i}, ω_t), where V represents the set of all words in the corpus, ω_{t+i} is the context word, ω_t is the word, and c represents the context window size.
In some embodiments of the present application, referring to fig. 6-d, the output word vector reconfiguration module 603 includes:
an above output word vector generation module 6033, configured to obtain an above output word vector from the original output word vector according to the context word appearing above the word;

a below output word vector generation module 6034, configured to obtain a below output word vector from the original output word vector according to the context word appearing below the word;
a second target output word vector generating module 6035, configured to obtain the target output word vector by using the above output word vector and the below output word vector, where the target output word vector includes: the above output word vector and the below output word vector.
As can be seen from the above description of the embodiments of the present invention, a corresponding input word vector is first acquired according to a word in the training sample text, and a corresponding original output word vector is acquired according to the context word corresponding to that word in the training sample text. A target output word vector is then generated according to the original output word vector; the target output word vector carries direction information indicating the position direction of the context word relative to the word, and the word vector learning model is trained using the input word vector and the target output word vector. Because the contexts of the input word in different position directions are modeled separately in the embodiment of the invention, the structural information of the context words is integrated into word vector learning, so the word vectors learned by the model embody the structural information of the context and are applicable to the semantic and syntactic tasks of natural language processing.
Fig. 7 is a schematic diagram of a server 1100 according to an embodiment of the present invention. The server 1100 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing applications 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transient storage or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 so as to execute, on the server 1100, the series of instruction operations stored in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps of the word vector training method performed by the server in the above embodiment may be based on the server structure shown in fig. 7.
It should be noted that the above-described embodiments of the apparatus are merely illustrative, where the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can be readily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may vary, e.g., analog circuits, digital circuits, or dedicated circuits. For the present invention, however, implementation by a software program is in most cases the preferred embodiment. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk of a computer, and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (13)
1. A method for word vector training, comprising:
acquiring corresponding input word vectors according to words in the training sample text;
obtaining corresponding original output word vectors according to the context words corresponding to the words in the training sample text;
generating a direction vector according to the context word appearing above or below the word, and obtaining a target output word vector through the original output word vector and the direction vector, wherein the direction vector is used for indicating that the context word appears above or below the word, the target output word vector comprises the original output word vector and the direction vector, and the target output word vector carries direction information for indicating the position direction of the context word relative to the word;
obtaining an interactive function calculation result according to the input word vector and the direction vector, and performing iterative updating on the input word vector and the direction vector according to the interactive function calculation result;
obtaining a conditional probability calculation result according to the input word vector and the original output word vector, and performing iterative updating on the input word vector and the original output word vector according to the conditional probability calculation result;
and estimating the optimal target of the word vector learning model according to the interactive function calculation result and the conditional probability calculation result.
2. The method of claim 1, wherein obtaining interactive function computation results according to the input word vector and the direction vector comprises:
calculating an interaction function $g(\omega_{t+i},\omega_t)$ between the input word vector and the direction vector, wherein $g(\omega_{t+i},\omega_t)$ represents the interaction function calculation result, $\delta_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$, $v_{\omega_t}$ represents the input vector of the word $\omega_t$, and $V$ represents the set of all words in the corpus.
3. The method of claim 1, wherein iteratively updating the input word vector and the direction vector according to the interactive function computation result comprises:
iteratively updating the input word vector and the direction vector, wherein $v^{\text{new}}_{\omega_t}$ represents the updated input vector of the word $\omega_t$, $v^{\text{old}}_{\omega_t}$ represents the input vector before the update, $\gamma$ represents the learning rate, $\delta_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$, $v_{\omega_t}$ represents the input vector of the word $\omega_t$, $\sigma(v_{\omega_t}^{\top}\delta_{\omega_{t+i}})$ represents the position direction predicted value of the context word relative to the word, $D$ represents the position direction marker value of the context word relative to the word, $\delta^{\text{new}}_{\omega_{t+i}}$ represents the updated direction vector of the context word $\omega_{t+i}$, and $\delta^{\text{old}}_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$ before the update.
4. The method according to claim 3, characterized in that the position direction flag value D satisfies the condition:
wherein when i <0, it means that the position direction of the context word with respect to the word is above, and when i >0, it means that the position direction of the context word with respect to the word is below.
5. The method according to any one of claims 1 to 4, wherein the estimation of the optimal target of the word vector learning model from the interaction function calculation result and the conditional probability calculation result is performed by:
calculating the global log maximum likelihood estimate $f(\omega_{t+i},\omega_t)$, wherein
$f(\omega_{t+i},\omega_t)=p(\omega_{t+i}\mid\omega_t)+g(\omega_{t+i},\omega_t)$,
wherein $g(\omega_{t+i},\omega_t)$ represents the interaction function calculation result and $p(\omega_{t+i}\mid\omega_t)$ represents the conditional probability calculation result;
calculating a joint log-likelihood estimate $L_{SG}$ of the probability of the word to the context word, wherein $V$ represents the set of all words in the corpus, the context word is $\omega_{t+i}$, the word is $\omega_t$, and $c$ represents the context window size.
6. The method according to claim 1, wherein the generating a direction vector according to the context word appearing above or below the word and obtaining the target output word vector through the original output word vector and the direction vector comprises:
acquiring an above output word vector from the original output word vector when the context word appears above the word;
acquiring a below output word vector from the original output word vector when the context word appears below the word;
obtaining the target output word vector through the above output word vector and the below output word vector, where the target output word vector includes: the above output word vector and the below output word vector.
7. A server, comprising:
the input word vector acquisition module is used for acquiring corresponding input word vectors according to words in the training sample text;
an output word vector obtaining module, configured to obtain a corresponding original output word vector according to a context word corresponding to the word in the training sample text;
an output word vector reconfiguration module, configured to generate a target output word vector according to the original output word vector, where the target output word vector carries direction information used to indicate a position direction of the context word relative to the word;
the model training module is used for training a word vector learning model by using the output word vector and the target output word vector;
the output word vector reconfiguration module comprises:
a direction vector generation module for generating a direction vector according to the context word appearing above or below the word, wherein the direction vector is used for indicating that the context word appears above or below the word;
a first target output word vector generation module, configured to obtain the target output word vector through the original output word vector and the direction vector, where the target output word vector includes: the original output word vector and the direction vector;
the model training module comprises:
the interactive function calculation module is used for acquiring an interactive function calculation result according to the input word vector and the direction vector and performing iterative update on the input word vector and the direction vector according to the interactive function calculation result;
the conditional probability calculation module is used for acquiring a conditional probability calculation result according to the input word vector and the original output word vector and performing iterative update on the input word vector and the original output word vector according to the conditional probability calculation result;
and the target estimation module is used for estimating the optimal target of the word vector learning model according to the interactive function calculation result and the conditional probability calculation result.
8. The server according to claim 7, wherein the interaction function computation module is specifically configured to compute an interaction function $g(\omega_{t+i},\omega_t)$ between the input word vector and the direction vector, wherein $g(\omega_{t+i},\omega_t)$ represents the interaction function calculation result, $\delta_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$, $v_{\omega_t}$ represents the input vector of the word $\omega_t$, and $V$ represents the set of all words in the corpus.
9. The server according to claim 7, wherein the interaction function computation module is configured to iteratively update the input word vector and the direction vector, wherein $v^{\text{new}}_{\omega_t}$ represents the updated input vector of the word $\omega_t$, $v^{\text{old}}_{\omega_t}$ represents the input vector before the update, $\gamma$ represents the learning rate, $\delta_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$, $v_{\omega_t}$ represents the input vector of the word $\omega_t$, $\sigma(v_{\omega_t}^{\top}\delta_{\omega_{t+i}})$ represents the position direction predicted value of the context word relative to the word, $D$ represents the position direction marker value of the context word relative to the word, $\delta^{\text{new}}_{\omega_{t+i}}$ represents the updated direction vector of the context word $\omega_{t+i}$, and $\delta^{\text{old}}_{\omega_{t+i}}$ represents the direction vector of the context word $\omega_{t+i}$ before the update.
10. The server according to claim 9, wherein the position direction flag value D satisfies the following condition:
wherein when i <0, it means that the position direction of the context word with respect to the word is above, and when i >0, it means that the position direction of the context word with respect to the word is below.
11. The server according to any of claims 7 to 10, wherein the target estimation module is configured to calculate a global log maximum likelihood estimate $f(\omega_{t+i},\omega_t)$, wherein $f(\omega_{t+i},\omega_t)=p(\omega_{t+i}\mid\omega_t)+g(\omega_{t+i},\omega_t)$, $g(\omega_{t+i},\omega_t)$ represents the interaction function calculation result, and $p(\omega_{t+i}\mid\omega_t)$ represents the conditional probability calculation result; and to calculate a joint log-likelihood estimate $L_{SG}$ of the probability of the word to the context word, wherein $V$ represents the set of all words in the corpus, the context word is $\omega_{t+i}$, the word is $\omega_t$, and $c$ represents the context window size.
12. A server, characterized in that the server comprises: a processor and a memory, the memory to store instructions, the processor to execute the instructions in the memory, causing the server to perform the method of any of claims 1-6.
13. A computer-readable storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of any one of claims 1-6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810299633.5A CN110348001B (en) | 2018-04-04 | 2018-04-04 | Word vector training method and server |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110348001A CN110348001A (en) | 2019-10-18 |
| CN110348001B (en) | 2022-11-25 |
Family
ID=68172691
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810299633.5A Active CN110348001B (en) | 2018-04-04 | 2018-04-04 | Word vector training method and server |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110348001B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114860909A (en) * | 2022-05-09 | 2022-08-05 | 中国农业银行股份有限公司 | Answer recommendation method, device, equipment and medium based on articles |
| CN115293156B (en) * | 2022-09-29 | 2023-02-03 | 四川大学华西医院 | Method and device for extracting abnormal events of prison short messages, computer equipment and medium |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010013228A1 (en) * | 2008-07-31 | 2010-02-04 | Ginger Software, Inc. | Automatic context sensitive language generation, correction and enhancement using an internet corpus |
| CN104067340A (en) * | 2012-01-27 | 2014-09-24 | 三菱电机株式会社 | Method for enhancing speech in mixed signals |
| CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
| CN106383816A (en) * | 2016-09-26 | 2017-02-08 | 大连民族大学 | Chinese minority region name identification method based on deep learning |
| CN107180247A (en) * | 2017-05-19 | 2017-09-19 | 中国人民解放军国防科学技术大学 | Relation grader and its method based on selective attention convolutional neural networks |
| CN107239444A (en) * | 2017-05-26 | 2017-10-10 | 华中科技大学 | A kind of term vector training method and system for merging part of speech and positional information |
| CN107239443A (en) * | 2017-05-09 | 2017-10-10 | 清华大学 | The training method and server of a kind of term vector learning model |
| CN107273355A (en) * | 2017-06-12 | 2017-10-20 | 大连理工大学 | A kind of Chinese word vector generation method based on words joint training |
| CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
| CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10248718B2 (en) * | 2015-07-04 | 2019-04-02 | Accenture Global Solutions Limited | Generating a domain ontology using word embeddings |
| WO2017090051A1 (en) * | 2015-11-27 | 2017-06-01 | Giridhari Devanathan | A method for text classification and feature selection using class vectors and the system thereof |
Non-Patent Citations (4)
| Title |
|---|
| Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings; Yan Song et al.; Proceedings of NAACL-HLT 2018; 20180606; 175-180 * |
| Principles of word2vec (1): CBOW and Skip-Gram model basics; 刘建平Pinard; https://www.cnblogs.com/pinard/p/7160330.html; 20170713; full text * |
| Research on a region-based authenticity identification algorithm for traditional Chinese paintings; 王民 et al.; Computer Engineering and Applications; 20150914; Vol. 52, No. 24; 162-165 * |
| Research on an improved convolutional neural network method for relation classification; 李博 et al.; online publication: http://kns.cnki.net/kcms/detail/11.5602.TP.20170608.1424.012.html; 20170608; 1-12 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110348001A (en) | 2019-10-18 |
Similar Documents
| Publication | Title |
|---|---|
| US20240013055A1 | Adversarial pretraining of machine learning models |
| CN112115700B | Aspect-level emotion analysis method based on dependency syntax tree and deep learning |
| CN107220232B | Keyword extraction method and device based on artificial intelligence, equipment and readable medium |
| US8566260B2 | Structured prediction model learning apparatus, method, program, and recording medium |
| Lyu et al. | Long short-term memory RNN for biomedical named entity recognition |
| US11604956B2 | Sequence-to-sequence prediction using a neural network model |
| Song et al. | Leveraging dependency forest for neural medical relation extraction |
| US20150095017A1 | System and method for learning word embeddings using neural language models |
| CN106557563B | Method and device for query sentence recommendation based on artificial intelligence |
| CN107818085B | Answer selection method and system for reading comprehension by reading robot |
| CN108280061A | Text handling method based on ambiguity entity word and device |
| CN109086265B | Semantic training method and multi-semantic word disambiguation method in short text |
| CN112818670A | Split syntax and semantics in a decomposable variational auto-encoder sentence representation |
| EP4064111A1 | Word embedding with disentangling prior |
| WO2014073206A1 | Information-processing device and information-processing method |
| Seilsepour et al. | Topic sentiment analysis based on deep neural network using document embedding technique |
| CN109635184B | Financial product recommendation method, device and computer equipment based on data analysis |
| CN104536979A | Generation method and device of topic model and acquisition method and device of topic distribution |
| Sun et al. | Probabilistic Chinese word segmentation with non-local information and stochastic training |
| US20250336189A1 | Image-text data processing |
| CN110348001B | Word vector training method and server |
| Sun et al. | Conditional random fields for multiview sequential data modeling |
| Zhang et al. | Multi-document extractive summarization using window-based sentence representation |
| CN110851600A | Text data processing method and device based on deep learning |
| CN111767710B | Indonesia emotion classification method, device, equipment and medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | TG01 | Patent term adjustment | |