WO2019228014A1 - Method and apparatus for performing, for training corpus, negative sampling in word frequency table
- Publication number: WO2019228014A1
- Application: PCT/CN2019/077438
- Authority: WIPO (PCT)
- Prior art keywords: vocabulary, sampling, current, remaining, negative
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Description
- One or more embodiments of the present specification relate to the field of computer technology, and in particular to a computer-implemented method and device for performing negative sampling from a word frequency table for a training corpus.
- Noise contrastive estimation (NCE) is a loss function commonly used in unsupervised algorithms such as Word2Vec and Node2Vec. When applying this loss function, a certain number of negative examples must first be generated for each word and its context in the training corpus. For a given word in the training corpus, any word other than that word and its context words can serve as a negative example. Generally, these negative examples are randomly sampled according to the vocabulary distribution of the training corpus, which is approximated, for example, by a word frequency table.
- In conventional techniques, a negative example is generated for the corresponding vocabulary at the moment it is used. Specifically, the vocabulary distribution (such as the word frequency table) is mapped onto an interval, a value within the interval is generated, and the corresponding vocabulary is looked up as a negative example.
- When the dictionary used in training contains many words (for example, on the order of hundreds of millions) and a large number of negative examples is required, an improved solution is desired that reduces the sampling time, so that negative sampling can be performed quickly and efficiently.
- One or more embodiments of the present specification describe a method and device that, when the training dictionary contains many words (for example, on the order of hundreds of millions) and the required number of negative examples is large, reduce the sampling time so that negative sampling can be performed quickly and efficiently.
- According to a first aspect, a method for performing negative sampling from a word frequency table for a training corpus is provided, the word frequency table including multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The method includes:
- obtaining a current vocabulary from the unsampled vocabulary set of the multiple candidate words, and the occurrence frequency corresponding to the current vocabulary;
- obtaining the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set;
- determining a current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
- determining the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability;
- adding the current vocabulary to a negative example set according to the number of times it is sampled;
- updating the remaining number of samples according to the number of times the current vocabulary is sampled, and updating the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, for sampling the other candidate words in the word frequency table, until a predetermined condition is detected to be satisfied.
- In one embodiment, determining the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability includes: determining the current sampling probability as the ratio of the occurrence frequency corresponding to the current vocabulary to the remaining sampling probability.
- According to an embodiment, determining the number of times the current vocabulary is sampled includes: simulating the sampling operation for the remaining number of samples, where in each sampling operation the probability that the current vocabulary is sampled is the current sampling probability; and determining the number of times sampled as the number of times the current vocabulary is sampled in these sampling operations.
- In one embodiment, updating the remaining number of samples according to the number of times the current vocabulary is sampled includes: updating the remaining number of samples to the difference between the remaining number of samples and the number of times sampled.
- Further, in one embodiment, the predetermined condition includes: the number of negative examples in the negative example set reaches a preset number; or the remaining number of samples after the update is zero; or the unsampled vocabulary set is empty.
- In a possible embodiment, updating the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary includes: updating the remaining sampling probability to the difference between the remaining sampling probability and the occurrence frequency corresponding to the current vocabulary.
- According to a possible design, the method further includes: outputting the negative example set when the number of negative examples in the negative example set meets the predetermined condition.
- In some possible embodiments, the method further includes: selecting negative examples from the negative example set for the training vocabulary in the training corpus.
- Further, in some embodiments, selecting negative examples from the negative example set includes: generating a random number on a predetermined interval, where each value on the predetermined interval corresponds to a negative example in the negative example set and the random number is taken from these values; and obtaining from the negative example set the negative example corresponding to the random number.
- According to an embodiment, obtaining the negative example corresponding to the random number from the negative example set includes: comparing whether the obtained negative example is consistent with the training vocabulary; and, if it is consistent, re-executing the step of generating a random number on the predetermined interval.
- According to a possible design, the method further includes: detecting whether an update condition of the negative example set is satisfied; and, when the update condition is satisfied, regenerating the negative example set.
- According to a second aspect, a device for performing negative sampling from a word frequency table for a training corpus is provided, the word frequency table including multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The device includes:
- a first acquiring unit configured to acquire a current vocabulary from the unsampled vocabulary set of the multiple candidate words, and the occurrence frequency corresponding to the current vocabulary;
- a second acquiring unit configured to acquire the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set;
- a first determining unit configured to determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
- a second determining unit configured to determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability;
- an adding unit configured to add the current vocabulary to the negative example set according to the number of times it is sampled;
- an updating unit configured to update the remaining number of samples according to the number of times the current vocabulary is sampled and to update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, for sampling the other candidate words in the word frequency table, until the number of negative examples in the negative example set is detected to meet a predetermined condition.
- According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
- According to a fourth aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
- With the method and device provided by the embodiments of this specification, when negative sampling is performed from the word frequency table for a training corpus, a candidate word is obtained from the word frequency table as the current vocabulary, and the remaining number of samples and the remaining sampling probability are obtained. Based on the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability, the number of times the current vocabulary is sampled is determined, and the current vocabulary is then added to the negative example set that number of times. Because all samples of the current vocabulary are added to the negative example set in a single pass, the overall number of sampling operations is reduced, which reduces the time spent on negative sampling and allows negative sampling to be performed quickly and efficiently.
- FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
- FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus according to an embodiment;
- FIG. 3 shows a specific example of selecting a negative example from a negative example set;
- FIG. 4 shows another specific example of selecting a negative example from a negative example set;
- FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table for a training corpus according to an embodiment.
- FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
- During the training of an unsupervised model (such as Word2Vec or Node2Vec), the loss function can be noise contrastive estimation (NCE), whose expression is as follows:
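The expression itself is not reproduced in this extraction. For reference, a standard negative-sampling form of the NCE objective, written with the symbols defined below, is (this is a reconstruction, not the patent's verbatim formula; $\sigma$ denotes the sigmoid function and $s(\cdot,\cdot)$ the model's score for a word-context pair):

$$\mathcal{L} = \sum_{w_i \in V} \Big[ \log \sigma\big(s(w_i, c_i)\big) + \sum_{j=1}^{k} \log \sigma\big(-s(w_{ij}, c_j)\big) \Big]$$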
- where: V represents the dictionary; w_i represents the i-th training vocabulary; c_i represents the context vocabulary adjacent to the i-th vocabulary; k represents the number of negative examples corresponding to w_i; w_ij represents the j-th negative example of w_i; and c_j represents the context vocabulary adjacent to the j-th negative example.
- As the above formula shows, during corpus training each training vocabulary w_i requires k random samples from its probability distribution over the dictionary to obtain k negative examples.
- The multiple words in the dictionary and the occurrence frequency of each word in the training corpus are usually represented by a word frequency table. The word frequency table corresponding to the dictionary V is typically projected onto an interval [0, 1], with the length of each segment in the interval proportional to the occurrence frequency of the corresponding vocabulary.
- Further, in one negative sampling method, the interval segment corresponding to each vocabulary is divided into multiple "cells" according to the minimum frequency unit, and the number of each cell is recorded as an index. The greater a vocabulary's occurrence frequency, the longer its interval segment and the more cells it contains. Each time a negative example is sampled, a random number over the indexes is generated, and the vocabulary whose index equals that random number is taken as the negative example; the more indexes there are, the more accurately the word frequency table is approximated.
- For example, since each index corresponds to one cell, the least frequent vocabulary corresponds to at least one index so that every vocabulary has an index, while other vocabularies may correspond to many: if vocabulary 1 has an occurrence frequency of 0.03 and vocabulary 2 has an occurrence frequency of 0.001, then vocabulary 2 corresponds to 1 index and vocabulary 1 corresponds to 30 indexes.
- When the dictionary V contains a large vocabulary (for example, hundreds of millions of words), the number of indexes is even larger. This requires large storage space, possibly even storage on a remote server, in which case fetching each negative example costs extra communication time, as illustrated in the sketch below.
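A minimal Python sketch of this conventional index-table approach; the dict-based frequency table and the minimum frequency unit of 0.001 are assumptions taken from the example above:

```python
import random

def build_index_table(word_freq, min_unit=0.001):
    # Expand each vocabulary into index "cells", one cell per minimum
    # frequency unit, so cell counts are proportional to occurrence frequency.
    table = []
    for word, p in word_freq.items():
        table.extend([word] * round(p / min_unit))
    return table

# vocabulary 1 gets 30 indexes, vocabulary 2 gets 1 index, as in the text
table = build_index_table({"vocab1": 0.03, "vocab2": 0.001})
negative = table[random.randrange(len(table))]  # one negative per random index
```

Each negative example then costs only one random draw, but the table grows with the number of cells, which is exactly the storage and communication problem described above.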
- As shown in FIG. 1, an embodiment of the present specification provides a solution: negative examples are first pre-sampled from the word frequency table, and the sampled vocabularies are added to a negative example set.
- In the pre-sampling process, batch sampling is performed: each vocabulary in the word frequency table is sampled only once, the number of samples drawn for it may be greater than one, and the final number of samples of each vocabulary is consistent with its occurrence frequency in the word frequency table.
- For example, as shown in FIG. 1, the vocabulary w1 in the word frequency table is sampled s1 times, the vocabulary w2 is sampled s2 times, the vocabulary w3 is sampled s3 times, and so on. In this way, the number of sampling operations during pre-sampling is reduced, while the number of samples of each vocabulary in the negative example set remains consistent with its occurrence frequency in the word frequency table.
- During model training, k1 negative examples are randomly taken from the negative example set for the training vocabulary u1, k2 negative examples for the training vocabulary u2, k3 negative examples for the training vocabulary u3, and so on.
- Because the negative example set is pre-sampled so that the number of samples of each vocabulary is consistent with its occurrence frequency in the word frequency table, only the required number of negative examples need to be drawn at random at use time, without considering the occurrence frequencies again; the sampling probability of each negative example still matches the occurrence frequency of the corresponding vocabulary in the word frequency table. In this way, the computational complexity is greatly reduced.
- Moreover, the pre-sampled negative example set can be used multiple times, which further improves the effectiveness of negative sampling in model training.
- The computing platform in FIG. 1 may be any device or equipment with sufficient computing capability, such as a desktop computer or a server. It can be understood that the computing platform may also be a cluster composed of such devices. When the computing platform comprises multiple devices, according to one embodiment, some of the devices may perform the negative sampling operation to generate the negative example set, while other devices obtain the negative example set and randomly take negative examples from it when training vocabulary.
- FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus according to an embodiment of the present specification. The execution subject of this method is, for example, the computing platform of FIG. 1.
- The method includes the following steps. Step 21: obtain a current vocabulary from the unsampled vocabulary set of the word frequency table, together with the occurrence frequency of the current vocabulary. Step 22: obtain the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set. Step 23: determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency of the current vocabulary and the remaining sampling probability. Step 24: determine the number of times the current vocabulary is sampled according to its binomial distribution under the remaining number of samples and the current sampling probability. Step 25: add the current vocabulary to the negative example set according to the number of times it is sampled. Step 26: update the remaining number of samples according to the number of times the current vocabulary is sampled, and update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary.
- First, in step 21, a current vocabulary is obtained from the unsampled vocabulary set of the word frequency table, together with the occurrence frequency corresponding to the current vocabulary.
- The word frequency table may include multiple candidate words and the occurrence frequency of each candidate word in the training corpus.
- It can be understood that the multiple candidate words may include all the words in the training corpus.
- The word frequency table may take various forms, such as a table, a vector, an array, or key-value pairs, which this specification does not limit.
- Generally, the number of occurrences of each candidate vocabulary in the training corpus differs.
- The word frequency table can measure the proportion of each vocabulary in the training corpus by its occurrence frequency.
- Specifically, the occurrence frequency of a candidate vocabulary may be the ratio of the total number of occurrences of that vocabulary in the training corpus to the total number of words in the training corpus, where repeated words are not merged when counting: each occurrence of any word increases the total word count by 1.
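As a small worked example of this definition (the counts are assumed for illustration), a vocabulary appearing 300 times in a corpus of 10,000 total word occurrences has

$$p(w) = \frac{\mathrm{count}(w)}{N} = \frac{300}{10000} = 0.03$$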
- In the embodiments of the present specification, the candidate words in the word frequency table may be batch-sampled in sequence. The word frequency table can therefore be divided into a sampled vocabulary set and an unsampled vocabulary set, containing the sampled and the unsampled candidate words respectively.
- In step 21, the current vocabulary and its occurrence frequency are obtained from the unsampled vocabulary set, to be used next for sampling the current vocabulary.
- In one embodiment, candidate words can be obtained sequentially as the current vocabulary according to the storage address of each word in the word frequency table. Taking words in this order ensures that no word is sampled repeatedly, that is, each word obtained has never been sampled.
- For example, the storage address of the word frequency table is obtained, and a candidate vocabulary is obtained according to its offset from that address; the storage address of the word frequency table plus the offset is the storage address of the candidate vocabulary. If the offsets lie in [0000-FFFF], the candidate word at offset 0000 can be obtained as the current vocabulary, the candidate word at offset 0001 in the next round of the process, and so on.
- In one case, a candidate vocabulary and its occurrence frequency may be stored in the storage unit corresponding to the same storage address, in which case the current vocabulary and its occurrence frequency may be obtained at the same time. In another case, the candidate vocabulary and its occurrence frequency may be stored in different storage units, in which case the associated occurrence frequency may be obtained according to the storage address of the candidate vocabulary.
- In another embodiment, a candidate vocabulary can also be obtained as the current vocabulary according to the arrangement order of the candidate words in the word frequency table, again ensuring that the current vocabulary is taken from the unsampled vocabulary set each time.
- For example, when the word frequency table is a table, the candidate words can be obtained in the order of the rows of the table; a candidate vocabulary can also be obtained in column order, such as first column first row, first column second row, and so on.
- Next, in step 22, the remaining number of samples s and the remaining sampling probability r determined for the unsampled vocabulary set are obtained.
- The remaining number of samples s can be understood as the number of negative examples still needed in the negative example set, which is also the total number of times that the words in the unsampled vocabulary set still need to be sampled.
- Initially, the remaining number of samples s is the total number S0 of negative examples required for the entire negative example set.
- The remaining sampling probability r may be the total sampling probability of all unsampled words during the negative sampling process that generates the negative example set.
- For example, if the candidate words in the word frequency table are w_0, w_1, w_2, ..., with occurrence frequencies p_0, p_1, p_2, ..., the remaining sampling probability r represents the total sampling probability of the words not yet sampled.
- Initially, no candidate word has been sampled, so the remaining sampling probability r is the total sampling probability of all candidate words in the word frequency table; the initial value of r is therefore 1.
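Equivalently, following the definitions above:

$$r = \sum_{w_j \in \text{unsampled}} p_j, \qquad r_{\text{initial}} = \sum_{j} p_j = 1$$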
- Next, in step 23, the current sampling probability P corresponding to the current vocabulary is determined based on the occurrence frequency p_i corresponding to the current vocabulary and the remaining sampling probability r.
- The current sampling probability P can be understood as the sampling probability of the current vocabulary within the entire unsampled vocabulary set.
- As mentioned above, the candidate words are sampled in batches, that is, all samples of a candidate word are collected at once. After a candidate word has been sampled, it is added to the sampled vocabulary set and its probability of being sampled subsequently is 0. The subsequent sampling process therefore need not consider the sampled candidate words and proceeds within the unsampled vocabulary set, which includes the current vocabulary (since it has not yet been sampled).
- For example, suppose the candidate words w_0, w_1, w_2, ... have occurrence frequencies p_0, p_1, p_2, ... respectively. When no word has been sampled, r = 1, and the current sampling probability of w_0 is p_0; once w_0 has been sampled, the current sampling probability of w_1 is determined against the remaining probability r = 1 - p_0.
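In formula form, with the ratio rule described in the embodiments above (the numbers are illustrative, borrowed from the worked example later in this description):

$$P = \frac{p_i}{r}, \qquad \text{e.g.}\ P(w_1) = \frac{p_1}{1 - p_0} = \frac{0.05}{1 - 0.03} \approx 0.052$$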
- Next, in step 24, the number of times b that the current vocabulary w_i is sampled is determined according to the binomial distribution of w_i under the remaining number of samples s and the current sampling probability P.
- It can be understood that in the embodiments of the present specification, each candidate word in the word frequency table corresponds to a number of times it is sampled; for example, as shown in FIG. 1, the vocabulary w1 is sampled s1 times, w2 is sampled s2 times, w3 is sampled s3 times, and so on, so that the candidate words are sampled in batches.
- When the occurrence frequency of a candidate word is small, the number of times it is sampled may be zero.
- In one embodiment, the binomial distribution is used to determine the number of times sampled.
- As is known, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of independent Bernoulli trials. In each trial, only one of two possible outcomes can occur, the trials are mutually independent, and the probability of each outcome is the same in every trial. When the number of trials is 1, the binomial distribution reduces to the 0-1 (Bernoulli) distribution: a given outcome either occurs (success) or does not.
- In one embodiment, a binomial distribution function Binomial(s, P) is called to determine the number of times b the current vocabulary is sampled, as in the sketch below. The parameters of the binomial distribution function are the remaining number of samples s and the current sampling probability P, meaning that b is the number of times w_i is sampled in s sampling trials when the probability of sampling the current vocabulary w_i in each trial is P.
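A minimal sketch of this call; the abstract Binomial(s, P) in the text is realized here with NumPy's binomial sampler (an assumption, not the patent's implementation), using the example numbers that appear below:

```python
import numpy as np

rng = np.random.default_rng()

s = 8000   # remaining number of samples
P = 0.03   # current sampling probability
b = int(rng.binomial(s, P))  # number of times the current vocabulary is sampled
print(b)   # a value near s * P = 240, varying from run to run
```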
- In one embodiment, executing the above binomial distribution function may include simulating the sampling operation (Bernoulli trial) for the remaining number of samples, which is equivalent to a sampling experiment performed over the remaining candidate words.
- In each sampling operation (Bernoulli trial), the probability that the current vocabulary w_i is sampled (a successful trial) is the current sampling probability P.
- The number of times the current vocabulary is sampled is counted, and the number of times sampled b is determined as the number of times the current vocabulary is sampled in the s sampling operations.
- In another embodiment, a value may instead be obtained at random from the values satisfying the sampling condition of the binomial distribution and used as the number of times the current vocabulary is sampled. It can be understood that, according to the meaning of the binomial distribution, if the current vocabulary "wealth" is ultimately sampled b times, the condition on b can be that its ratio to the remaining number of samples is consistent with the current sampling probability. For example, if the remaining number of samples s is 8000 and the current sampling probability P is 0.03, b/8000 can be rounded to 0.03 when b is in the range 200-272, so a random number between 200 and 272 can be taken as the number of times the current vocabulary "wealth" is sampled.
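A literal sketch of this alternative; rounding b/s to two decimal places is an assumption, and floating-point rounding makes the resulting range approximate (roughly 200-280, compare the 200-272 given in the text):

```python
import random

s, P = 8000, 0.03
# values of b whose ratio to the remaining number of samples
# rounds to the current sampling probability
candidates = [b for b in range(s + 1) if round(b / s, 2) == P]
b = random.choice(candidates)
```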
- Next, in step 25, the current vocabulary w_i is added to the negative example set according to the number of times sampled b.
- It can be understood that the number of times sampled b determined in step 24 is the number of copies of the current vocabulary added to the negative example set. For example, if the value of b in the example above is 232, then 232 copies of the current vocabulary "wealth" are added to the negative example set.
- Next, in step 26, the remaining number of samples s is updated according to the number of times b the current vocabulary is sampled, and the remaining sampling probability r is updated according to the occurrence frequency p_i corresponding to the current vocabulary.
- The updated remaining number of samples s and remaining sampling probability r can be used to sample the other candidate words in the word frequency table: for the next candidate word, the remaining number of samples and the remaining sampling probability obtained in step 22 are the values updated in this step.
- It can be understood that after each candidate word is sampled, the sampling conditions change. For example, if the negative example set needs 10,000 negative examples, the initial remaining number of samples is 10,000 and the initial remaining sampling probability is 1. After a candidate word w0 with occurrence frequency 0.03 has been sampled 200 times, the next candidate word, with occurrence frequency 0.05, is sampled with the remaining number of samples being 9,800 and the remaining sampling probability being 0.97.
- In one embodiment, the remaining number of samples s may be updated to the difference between the original remaining number of samples and the number of times b the current vocabulary is sampled.
- In one embodiment, the remaining sampling probability r is updated to the difference between the original remaining sampling probability and the occurrence frequency p_i corresponding to the current vocabulary.
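In formula form, the two updates of step 26 are

$$s \leftarrow s - b, \qquad r \leftarrow r - p_i$$

consistent with the worked numbers above: 10000 - 200 = 9800 and 1 - 0.03 = 0.97.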
- In the embodiments of the present specification, a predetermined condition related to the number of negative examples in the negative example set can be set in advance. When this condition is met, negative sampling stops; otherwise, the above sampling process is continued for the other candidate words of the word frequency table.
- This detection step may be performed after update step 26, or in parallel with step 26; it can be part of step 26 or a subsequent step 27. The specific implementation of this detection step is described in step 27 below.
- In step 27, it is detected whether the predetermined condition is satisfied. If the predetermined condition is satisfied, the negative sampling process ends; if not, the other candidate words in the word frequency table are sampled using the updated remaining number of samples and remaining sampling probability. An end-to-end sketch of the whole loop follows.
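Putting steps 21 through 27 together, a minimal sketch of the pre-sampling loop; the dict-based word frequency table and the function name are assumptions for illustration, and NumPy supplies the binomial draw:

```python
import numpy as np

def presample_negatives(word_freq, total_negatives, seed=None):
    # word_freq: dict mapping each candidate vocabulary to its occurrence
    # frequency (frequencies sum to 1); total_negatives: initial remaining
    # number of samples S0.
    rng = np.random.default_rng(seed)
    negatives = []
    s = total_negatives                  # remaining number of samples
    r = 1.0                              # remaining sampling probability
    for word, p in word_freq.items():    # steps 21-22: visit each word once
        if s <= 0 or r <= 0:             # step 27: predetermined condition
            break
        P = min(p / r, 1.0)              # step 23: current sampling probability
        b = int(rng.binomial(s, P))      # step 24: Binomial(s, P) draw
        negatives.extend([word] * b)     # step 25: add b copies to the set
        s -= b                           # step 26: update remaining samples
        r -= p                           # step 26: update remaining probability
    return negatives
```

For example, with an assumed three-word table, presample_negatives({"w0": 0.5, "w1": 0.3, "w2": 0.2}, 10000) returns a list containing roughly 5,000, 3,000, and 2,000 copies of the three vocabularies respectively, while calling the binomial sampler only three times.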
- In one embodiment, the predetermined condition may include that the total number of negative examples in the negative example set reaches the initial remaining number of samples, for example a manually set number of 10,000 negative examples.
- In one embodiment, the predetermined condition may include that the remaining number of samples after the update is zero, meaning that no more candidate words need to be collected as negative examples.
- In one embodiment, the predetermined condition may include that the unsampled vocabulary set is empty, meaning that all words in the word frequency table have been sampled.
- According to one embodiment, the negative example set may also be output when the above predetermined condition is satisfied.
- The negative example set can be output locally or to other devices.
- The vocabularies in the negative example set can be arranged in sampling order or arranged randomly, which is not limited in this application.
- When training on the vocabulary of the training corpus, negative examples can be selected from the negative example set. For example, if k negative examples are needed for the training vocabulary u_i in the training corpus, k vocabularies can be taken directly from the negative example set.
- In one embodiment, the vocabularies in the negative example set may correspond to the values in a predetermined interval.
- As shown in FIG. 3, the candidate negative examples in the negative example set 31 correspond one-to-one with the values in the numerical interval 32.
- For example, the positive integers on the interval [1, 10000] can be used, each value corresponding to one negative example vocabulary.
- When a negative example is selected for a training vocabulary, a random number on this predetermined interval is generated; for example, if the random number 5 is generated on the numerical interval 32, the negative example vocabulary w_1 corresponding to the value 5 in the negative example set 31 can be selected.
- As many random numbers are generated as negative examples are needed.
- One random number can be generated at a time to obtain the corresponding negative example, or multiple random numbers can be generated at once to obtain the corresponding negative examples in batches, which is not limited in this application.
- It is worth noting that the obtained negative example may happen to coincide with the training vocabulary itself or with its associated vocabulary.
- The associated vocabulary is, for example, the context words of the training vocabulary in a context prediction model, or its synonyms in a synonym prediction model.
- In such cases, the selected vocabulary cannot be used as a negative example of the training vocabulary. Therefore, when selecting a negative example from the negative example set for the training vocabulary, if the selected vocabulary coincides with the training vocabulary itself or its associated vocabulary, the step of generating a random number on the predetermined interval is re-executed to generate a new random number and obtain the negative example vocabulary corresponding to it, as in the sketch below.
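A minimal sketch of this selection-with-rejection procedure; the list-based negative example set and the excluded-word set are assumptions for illustration (the set is assumed to contain vocabularies outside excluded, so the loop terminates):

```python
import random

def select_negatives(negative_set, k, excluded):
    # Draw k negative examples from the pre-sampled negative example set,
    # regenerating the random number whenever the drawn vocabulary coincides
    # with the training vocabulary or its associated vocabulary (excluded).
    out = []
    while len(out) < k:
        idx = random.randrange(len(negative_set))  # random number on the interval
        word = negative_set[idx]
        if word in excluded:
            continue                               # re-generate the random number
        out.append(word)
    return out
```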
- In another embodiment, k vocabularies may be selected sequentially from a selected position and used as negative examples.
- The selected position may be determined according to a certain rule, or the position corresponding to a generated random number may be used as the selected position; for example, the first vocabulary identical to the training vocabulary is found, and the position of the next vocabulary is used as the selected position.
- Alternatively, as shown in FIG. 4, a random number on the numerical interval 42 can be generated, for example a random number between 1 and 10,000. In this case only one random number needs to be generated, and the amount of computation is small.
- It should be noted that the process shown in FIG. 2 may further include the following steps: detecting whether the update condition of the negative example set is satisfied, and, when the update condition is satisfied, re-executing the method of performing negative sampling from the word frequency table for the training corpus to regenerate the negative example set. It can be understood that when the negative example set contains a very large number of vocabularies, such as several hundred million, the computation is also very large; a smaller negative example set, such as 10 million vocabularies, can therefore be generated at one time, and an update condition for the negative example set can be set (for example, the number of uses reaching 10 million), upon which the negative example set is updated.
- Because the sampling is random, the negative example set generated by each execution of the method of performing negative sampling from the word frequency table for the training corpus may be different.
- Reviewing the above process: because the negative examples are pre-sampled, only the required number of negative examples need to be drawn at use time, without using the occurrence frequencies of the vocabularies, and the computational complexity is greatly reduced.
- In the pre-sampling process, batch sampling is performed: each vocabulary in the word frequency table is sampled only once, while the number of samples drawn for it can be greater than one, which reduces the time of negative sampling and further enables negative sampling to be performed quickly and efficiently.
- Therefore, the process shown in FIG. 2 can improve the effectiveness of negative sampling.
- FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table for a training corpus according to an embodiment. As shown in FIG. 5, the apparatus 500 includes: a first acquiring unit 51 configured to acquire a current vocabulary from the unsampled vocabulary set of the word frequency table, and the occurrence frequency corresponding to the current vocabulary; a second acquiring unit 52 configured to acquire the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set; a first determining unit 53 configured to determine the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability; a second determining unit 54 configured to determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability; an adding unit 55 configured to add the current vocabulary to the negative example set according to the number of times it is sampled; and an updating unit 56 configured to update the remaining number of samples according to the number of times the current vocabulary is sampled and to update the remaining sampling probability according to the occurrence frequency of the current vocabulary, for sampling the other candidate words in the word frequency table, until the number of negative examples in the negative example set is detected to meet a predetermined condition.
- In one embodiment, the first acquiring unit 51 may first obtain a candidate vocabulary from the unsampled vocabulary set of the multiple candidate words in the word frequency table as the current vocabulary, and obtain the occurrence frequency corresponding to the current vocabulary.
- The occurrence frequency may be the frequency with which the current vocabulary appears in the training corpus.
- The second acquiring unit 52 is configured to obtain the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set.
- The remaining number of samples can be the number of negative examples still required in the negative example set; in other words, it is the total number of samplings of the unsampled vocabulary in the negative sampling process that generates the negative example set.
- The remaining sampling probability may be the total sampling probability of the unsampled vocabulary during the negative sampling process that generates the negative example set.
- The initial value of the remaining sampling probability r is generally 1.
- The first determining unit 53 may determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability.
- The current sampling probability can be the sampling probability of the current vocabulary in the unsampled vocabulary set.
- In one embodiment, the current sampling probability may be the ratio of the occurrence frequency corresponding to the current vocabulary to the remaining sampling probability.
- Next, the second determining unit 54 may determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability.
- As is known, the binomial distribution is the discrete probability distribution of the number of successes in several independent Bernoulli trials. Specifically, in an embodiment, the sampling operation is treated as s trials, and in each trial the probability that the current vocabulary is sampled is the current sampling probability.
- The main function of the second determining unit 54 is thus to determine the number of times b that the current vocabulary w_i is successfully sampled in the s trials.
- In one embodiment, the second determining unit 54 may simulate performing the sampling operation for the remaining number of samples, ensuring in each sampling operation that the probability of the current vocabulary being sampled is the current sampling probability, count the occurrences, and determine that count as the number of times the current vocabulary is sampled.
- In another embodiment, the second determining unit 54 may also randomly obtain a value from the values that satisfy the condition and use it as the number of times the current vocabulary is sampled.
- The condition that the value must satisfy here is that its ratio to the remaining number of samples is consistent with the current sampling probability.
- Next, the adding unit 55 may add the current vocabulary to the negative example set according to the number of times sampled determined by the second determining unit 54: as many times as the current vocabulary was sampled, that many copies of it are added to the negative example set.
- Then, the updating unit 56 updates the remaining number of samples according to the number of times the current vocabulary is sampled, and updates the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary. It can be understood that after each candidate word is sampled, the remaining number of samples decreases by a corresponding amount and the remaining sampling probability also changes; in other words, the sampling conditions change for the next candidate word. In some possible designs, the updating unit 56 may update the remaining number of samples to the difference between the original remaining number of samples and the number of times the current vocabulary was sampled, and update the remaining sampling probability to the difference between the original remaining sampling probability and the occurrence frequency corresponding to the current vocabulary.
- In one embodiment, the device 500 further includes a detection unit 57 configured to detect whether the predetermined condition is satisfied after the updating unit 56 updates the remaining number of samples and the remaining sampling probability, and, if the predetermined condition is not satisfied, to sample the other candidate words in the word frequency table according to the updated remaining number of samples and remaining sampling probability.
- The predetermined condition may include that the total number of negative examples in the negative example set reaches the initial remaining number of samples, or that the updated remaining number of samples is zero, or that the unsampled vocabulary set is empty.
- In one embodiment, the apparatus 500 may further include an output module (not shown) configured to output the negative example set when the number of negative examples in the negative example set meets the predetermined condition.
- The negative example set can be output locally or to other devices.
- In one embodiment, the device 500 may further include a selection unit (not shown) configured to select negative examples from the negative example set for the training vocabularies in the training corpus.
- In one embodiment, the vocabularies in the negative example set may correspond to the values in a predetermined interval. Accordingly, the selection unit may further include: a generation module configured to generate a random number on the predetermined interval, where the generated random number is taken from these values; and an obtaining module configured to obtain from the negative example set the negative example corresponding to the random number.
- As mentioned above, the obtained negative example may coincide with the training vocabulary or its context vocabulary, in which case it cannot serve as a negative example of the training vocabulary. Therefore, the obtaining module may be further configured to: compare whether the obtained negative example is consistent with the training vocabulary; and, if it is consistent, have the generation module generate a random number on the predetermined interval again.
- In one embodiment, the device 500 may further include a detection unit (not shown) configured to detect whether the update condition of the negative example set is satisfied, so that the device 500 regenerates the negative example set when the update condition is satisfied, thereby updating the negative example set.
- Through the apparatus 500 shown in FIG. 5, the effectiveness of negative sampling can be improved.
- According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
- According to an embodiment of yet another aspect, a computing device is further provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method described in conjunction with FIG. 2.
- Those skilled in the art should be aware that the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in, or transmitted over as one or more instructions or code on, a computer-readable medium.
Abstract
Embodiments of the description provide a method and apparatus for performing, for a training corpus, negative sampling in a word frequency table. The method comprises: first obtaining a candidate word from the word frequency table as the current word, together with the remaining sampling quantity, the remaining sampling probability, and the current sampling probability; determining the number of times the current word is sampled on the basis of the binomial distribution of the current word under the remaining sampling quantity and the current sampling probability; and then adding the current word to a negative example set according to that number of times. Because executing these steps adds the current word to the negative example set as many times as it is sampled, in one pass, the total number of negative sampling operations is reduced, thereby reducing the negative sampling time and enabling quick and effective negative sampling.
Description
本说明书一个或多个实施例涉及计算机技术领域,尤其涉及通过计算机针对训练语料从词频表中进行负例采样的方法和装置。One or more embodiments of the present specification relate to the field of computer technology, and in particular, to a method and device for performing negative sampling from a word frequency table on a training corpus by a computer.
噪声对比估计NCE(Noise Contrastive Estimation)是Word2Vec、Node2Vec等无监督算法中通常用到的损失函数。应用该损失函数时,需要先对训练语料中每一个词汇及其上下文生成一定数量的负例。其中,对训练预料中的某个词汇而言,该词及其上下文词汇之外的任一个词都可以是一个负例。一般地,上述负例根据训练语料的词汇分布随机采样,该词汇分布例如采用词频表近似表示。Noise contrast estimation (NCE) is a loss function commonly used in unsupervised algorithms such as Word2Vec and Node2Vec. When applying this loss function, it is necessary to generate a certain number of negative examples for each word and its context in the training corpus. Among them, for a vocabulary expected in training, any word other than the word and its context vocabulary can be a negative example. Generally, the above negative examples are randomly sampled according to the vocabulary distribution of the training corpus, and the vocabulary distribution is, for example, approximated by a word frequency table.
常规技术中,在使用负例时对相应词汇生成负例。具体地,将词汇分布(如词频表)映射到一个区间上,生成区间内的数值,从而查找相对应的词汇作为负例。在训练预料中的字典中词汇较多(例如数亿级别),需要负例数量较大的情况下,希望能有改进的方案,减少采样的时间,从而能够快速有效地进行负例采样。In the conventional technique, when a negative example is used, a negative example is generated for the corresponding vocabulary. Specifically, a vocabulary distribution (such as a word frequency table) is mapped to an interval, and a value in the interval is generated, so as to find a corresponding vocabulary as a negative example. When the dictionary expected in the training has many words (for example, hundreds of millions of levels), and a large number of negative examples are required, it is hoped that an improved solution can be used to reduce the sampling time, so that negative example sampling can be performed quickly and efficiently.
发明内容Summary of the Invention
本说明书一个或多个实施例描述了一种方法和装置,在训练预料中的字典中词汇较多(例如数亿级别),需要负例数量较大的情况下,可以减少采样的时间,从而能够快速有效地进行负例采样。One or more embodiments of the present specification describe a method and device. In a dictionary expected in training, there are many vocabularies (for example, hundreds of millions of levels), and when the number of negative examples is large, the sampling time can be reduced, thereby Can quickly and efficiently perform negative sampling.
根据第一方面,提供了一种针对训练语料从词频表中进行负例采样的方法,所述词频表包括多个备选词汇和各个备选词汇在所述训练语料中的出现频率,所述方法包括:According to a first aspect, a method for performing negative sampling from a word frequency table for a training corpus is provided. The word frequency table includes multiple candidate words and the frequency of occurrence of each candidate word in the training corpus. Methods include:
从所述多个备选词汇的未采样词汇集合中获取当前词汇,以及当前词汇对应的出现频率;Obtaining a current vocabulary from the unsampled vocabulary set of the plurality of candidate vocabularies, and a frequency of occurrence corresponding to the current vocabulary;
获取针对所述未采样词汇集合确定的剩余采样个数和剩余采样概率;Acquiring the number of remaining samples and the remaining sampling probability determined for the unsampled vocabulary set;
基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前采样概率;Determining a current sampling probability based on a frequency of occurrence corresponding to the current vocabulary and the remaining sampling probability;
根据所述当前词汇在所述剩余采样个数和所述当前采样概率条件下的二项分布,确 定所述当前词汇被采样次数;Determining the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability;
将所述当前词汇按照所述被采样次数添加到负例集中;Adding the current vocabulary to a negative example set according to the number of samples;
根据所述当前词汇被采样次数更新所述剩余采样个数,并根据所述当前词汇对应的出现频率更新所述剩余采样概率,用于对所述词频表中的其他备选词汇进行采样,直到检测到预定条件得到满足。Updating the remaining number of samples according to the number of times the current vocabulary is sampled, and updating the remaining sampling probability according to the frequency of occurrence of the current vocabulary, for sampling other candidate words in the word frequency table until A predetermined condition is detected to be satisfied.
在一个实施例中,基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前采样概率包括:将所述当前采样概率确定为,所述当前词汇对应的出现频率与所述剩余采样概率的比值。In one embodiment, determining the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability includes: determining the current sampling probability as, the occurrence frequency corresponding to the current vocabulary and the remaining sampling Probability ratio.
根据一个实施例,所述确定所述当前词汇被采样次数包括:According to an embodiment, the determining the number of times the current vocabulary is sampled includes:
模拟执行所述剩余采样个数次的采样操作,其中,在各次采样操作中,所述当前词汇被采样到的概率为所述当前采样概率;Performing the sampling operation for the remaining sampling times by simulation, wherein in each sampling operation, the probability that the current word is sampled is the current sampling probability;
确定所述被采样次数为,所述当前词汇在所述剩余采样个数次的采样操作中被采样到的次数。It is determined that the number of times to be sampled is the number of times that the current word is sampled in the sampling operation of the remaining number of sampling times.
在一个实施方式中,根据所述当前词汇被采样次数更新所述剩余采样个数包括:将剩余采样个数更新为,所述剩余采样个数与所述被采样次数的差。In one embodiment, updating the number of remaining samples according to the number of times the current vocabulary is sampled includes: updating the number of remaining samples to a difference between the number of remaining samples and the number of times of sampling.
进一步地,在一个实施例中,所述预定条件包括:所述负例集中的负例个数达到预设数目;或者更新后的剩余采样个数为零;或者所述未采样词汇集合为空。Further, in one embodiment, the predetermined condition includes: the number of negative examples in the negative example set reaches a preset number; or the number of remaining samples after the update is zero; or the unsampled vocabulary set is empty .
在一个可能的实施例中,所述根据所述当前词汇对应的出现频率更新所述剩余采样概率包括:将剩余采样概率更新为,所述剩余采样概率与所述当前词汇对应的出现频率的差。In a possible embodiment, the updating the remaining sampling probability according to an appearance frequency corresponding to the current vocabulary includes: updating the remaining sampling probability to be a difference between the remaining sampling probability and an appearance frequency corresponding to the current vocabulary. .
根据一种可能的设计,所述方法还包括:在所述负例集中的负例个数满足预定条件的情况下,输出所述负例集。According to a possible design, the method further includes: outputting the negative example set if the number of negative examples in the negative example set meets a predetermined condition.
在一些可能的实施例中,所述方法还包括:针对训练语料中的训练词汇,从所述负例集中选取负例。In some possible embodiments, the method further includes: selecting a negative example from the negative example set for the training vocabulary in the training corpus.
进一步地,在一些实施例中,从所述负例集中选取负例包括:生成预定区间上的随机数,其中,所述预定区间上的各个数值分别与所述负例集中的各个负例相对应,所述随机数取自所述各个数值;从所述负例集中获取与所述随机数相对应的负例。Further, in some embodiments, selecting negative examples from the negative example set includes generating a random number on a predetermined interval, wherein each value on the predetermined interval is related to each negative example in the negative example set. Correspondingly, the random number is taken from the respective numerical values; a negative example corresponding to the random number is obtained from the negative example set.
根据一种实施方式,所述从所述负例集中获取与所述随机数相对应的负例包括:According to an embodiment, the obtaining a negative example corresponding to the random number from the negative example set includes:
比较所获取的负例与所述训练词汇是否一致;在一致的情况下,重新执行所述生成预定区间上的随机数的步骤。Compare whether the obtained negative example is consistent with the training vocabulary; if they are consistent, perform the step of generating a random number on a predetermined interval again.
根据一种可能的设计,所述方法还包括:检测所述负例集的更新条件是否满足;在所述更新条件满足的情况下,重新生成负例集。According to a possible design, the method further includes: detecting whether an update condition of the negative example set is satisfied; and when the update condition is satisfied, regenerating a negative example set.
根据第二方面,提供一种针对训练语料从词频表中进行负例采样的装置,所述词频表包括多个备选词汇和各个备选词汇在所述训练语料中的出现频率,所述装置包括:According to a second aspect, there is provided a device for performing negative sampling on a training corpus from a word frequency table, the word frequency table including a plurality of candidate words and a frequency of occurrence of each candidate word in the training corpus, the device include:
第一获取单元,配置为从所述多个备选词汇的未采样词汇集合中获取当前词汇,以及当前词汇对应的出现频率;A first acquiring unit configured to acquire a current vocabulary from an unsampled vocabulary set of the plurality of candidate vocabularies, and a frequency of occurrence corresponding to the current vocabulary;
第二获取单元,配置为获取针对所述未采样词汇集合确定的剩余采样个数和剩余采样概率;A second obtaining unit configured to obtain a remaining sampling number and a remaining sampling probability determined for the unsampled vocabulary set;
第一确定单元,配置为基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前词汇对应的当前采样概率;A first determining unit configured to determine a current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
第二确定单元,配置为根据所述当前词汇在所述剩余采样个数和所述当前采样概率条件下的二项分布,确定所述当前词汇被采样次数;A second determining unit configured to determine the number of times the current vocabulary is sampled according to a binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability;
添加单元,配置为将所述当前词汇按照所述被采样次数添加到负例集中;An adding unit configured to add the current vocabulary to the negative example set according to the number of samples;
更新单元,配置为根据所述当前词汇被采样次数更新所述剩余采样个数,根据所述当前词汇对应的出现频率更新所述剩余采样概率,用于对所述词频表中的其他备选词汇进行采样,直到检测到所述负例集中的负例个数满足预定条件。An update unit is configured to update the remaining sampling number according to the number of times the current vocabulary is sampled, and update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, and is used for other candidate words in the word frequency table. Sampling is performed until the number of negative examples in the negative example set is detected to meet a predetermined condition.
根据第三方面,提供了一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行第一方面的方法。According to a third aspect, there is provided a computer-readable storage medium having stored thereon a computer program that, when executed on a computer, causes the computer to execute the method of the first aspect.
根据第四方面,提供了一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现第一方面的方法。According to a fourth aspect, there is provided a computing device including a memory and a processor, wherein the memory stores executable code, and the processor implements the method of the first aspect when the processor executes the executable code. .
通过本说明书实施例提供的方法和装置,针对训练语料从词频表中进行负例采样时,对于一个从词频表获取一个备选词汇作为当前词汇,并获取剩余采样个数和剩余采样概率,基于当前词汇在剩余采样个数和当前采样概率条件下的二项分布,确定当前词汇被采样次数,然后将当前词汇按照被采样次数添加到负例集中。由于对一个当前词汇执行以上步骤的情况下,可以添加被采样次数的当前词汇到负例集,使总体的负例采样次数减少,从而减少负例采样的时间,进而能够快速有效地进行负例采样。According to the method and device provided by the embodiments of the present specification, when a negative sample is taken from the word frequency table for the training corpus, one candidate word is obtained from the word frequency table as the current word, and the number of remaining samples and the remaining sampling probability are obtained based on The binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability, determine the number of times the current vocabulary has been sampled, and then add the current vocabulary to the negative set according to the number of samples. Since the above steps are performed on a current vocabulary, the current vocabulary that has been sampled can be added to the negative example set, so that the overall number of negative samples is reduced, thereby reducing the time of negative sampling, and thus the negative examples can be performed quickly and efficiently sampling.
为了更清楚地说明本发明实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the present invention. Those of ordinary skill in the art can also obtain other drawings according to these drawings without paying creative labor.
图1示出本说明书披露的一个实施例的实施场景示意图;FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification; FIG.
图2示出根据一个实施例的针对训练语料从词频表中进行负例采样的方法流程图;2 shows a flowchart of a method for performing negative sampling from a word frequency table on a training corpus according to an embodiment;
图3示出从负例集中选取负例的一个具体例子;FIG. 3 shows a specific example of selecting a negative example from a negative example set;
图4示出从负例集中选取负例的另一个具体例子;4 shows another specific example of selecting a negative example from a negative example set;
图5示出根据一个实施例的用于针对训练语料从词频表中进行负例采样的装置的示意性框图。FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table on a training corpus according to an embodiment.
下面结合附图,对本说明书提供的方案进行描述。The solutions provided in this specification are described below with reference to the drawings.
图1为本说明书披露的一个实施例的实施场景示意图。在一个无监督模型(例如Word2Vec、Node2Vec)训练过程中,损失函数可以为噪声对比估计NCE,表达式如下:FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. During the training of an unsupervised model (such as Word2Vec, Node2Vec), the loss function can be used to estimate the NCE for noise contrast. The expression is as follows:
where V denotes the dictionary; $w_i$ denotes the i-th training word; $c_i$ denotes the context word adjacent to the i-th word; k denotes the number of negative examples corresponding to $w_i$; $w_{ij}$ denotes the j-th negative example of $w_i$; and $c_j$ denotes the context word adjacent to that j-th negative example.
It can be seen from the above formula that, during corpus training, for each training word $w_i$, k random samples need to be drawn from its probability distribution over the dictionary to obtain k negative examples.
The multiple words in the dictionary and the frequency of occurrence of each word in the training corpus are usually represented by a word frequency table. The word frequency table corresponding to the dictionary V is often projected onto an interval [0, 1], where the length of each segment is proportional to the occurrence frequency of the corresponding word. Further, in one negative sampling approach, the interval segment corresponding to each word is divided into multiple "cells" according to a minimum frequency unit, and the number of each cell is recorded as an index. The higher the occurrence frequency of a word, the longer its segment and the more cells it contains. Each time a negative example is sampled, a random number over the index range is generated, and the word whose index equals that random number is taken as the negative example. In practice, the more indexes there are, the more accurately the dictionary's word frequency table is approximated. For example, since each index corresponds to one cell, in order to ensure that every word has a corresponding index, the word with the lowest occurrence frequency corresponds to at least one index, while other words may correspond to multiple indexes: if word 1 has an occurrence frequency of 0.03 and word 2 has an occurrence frequency of 0.001, then word 2 may correspond to 1 index while word 1 corresponds to 30 indexes. When the dictionary V contains a large vocabulary (e.g., hundreds of millions of words), the number of indexes is even larger, requiring substantial storage space, possibly even storage on a remote server, with extra communication time spent each time a negative example is fetched.
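As an illustration of this conventional approach (a minimal sketch, not the method of this specification; the function name build_index_table and the dict representation are assumptions):

```python
import random

def build_index_table(word_freq, min_freq):
    """Expand each word into 'cells' in proportion to its occurrence frequency.

    word_freq: dict mapping word -> occurrence frequency.
    min_freq: the smallest frequency in the table, used as the cell size.
    """
    table = []
    for word, freq in word_freq.items():
        cells = max(1, round(freq / min_freq))  # at least one index per word
        table.extend([word] * cells)
    return table

table = build_index_table({"word1": 0.03, "word2": 0.001, "word3": 0.969},
                          min_freq=0.001)
negative = table[random.randrange(len(table))]  # one random draw per negative
```

The memory cost named in the paragraph above is visible here: the table holds one entry per cell, so its size grows with the inverse of the smallest frequency.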
As shown in FIG. 1, an embodiment of this specification provides a solution in which negative examples are first pre-sampled from the word frequency table and the sampled words are added to a negative example set. During pre-sampling, batch sampling is performed: each word in the word frequency table is sampled only once, but the number of samples drawn for it may be greater than one, and the final sample count of each word is kept consistent with its occurrence frequency in the word frequency table. As shown in FIG. 1, word w1 in the word frequency table is sampled s1 times, word w2 is sampled s2 times, word w3 is sampled s3 times, and so on. In this way, the number of sampling operations during pre-sampling is reduced, while the sample count of each word in the negative example set remains consistent with its occurrence frequency in the word frequency table.
During word training, whenever negative examples are needed, the corresponding number of negative examples is simply drawn at random from the negative example set. As shown in FIG. 1, k1 negative examples are randomly taken from the negative example set for training word u1, k2 negative examples for training word u2, k3 negative examples for training word u3, and so on. Because the negative example set consists of pre-sampled negative examples in which the sample count of each word is consistent with its occurrence frequency in the word frequency table, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary; the sampling probability of each negative example is still guaranteed to be consistent with the occurrence frequency of the corresponding word in the word frequency table. The computational complexity is thus greatly reduced. Moreover, the pre-sampled negative example set can be reused multiple times, further improving the efficiency of negative sampling in model training.
It can be understood that the computing platform in FIG. 1 may be any of various apparatuses and devices with a certain computing capability, such as a desktop computer or a server. The computing platform may also be a cluster composed of such apparatuses and devices. Where the computing platform consists of multiple devices or apparatuses, according to one implementation, some of the devices or apparatuses may perform the negative sampling operation to generate the negative example set, while others obtain the negative example set and randomly draw negative examples from it when training words.
The specific process of performing negative sampling from the word frequency table for a training corpus is described below.
FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus, according to an embodiment of this specification. The method is executed, for example, by the computing platform of FIG. 1. As shown in FIG. 2, the method includes the following steps. Step 21: obtain a current word and its corresponding occurrence frequency from the unsampled word set of the word frequency table. Step 22: obtain the remaining sample count and the remaining sampling probability determined for the unsampled word set. Step 23: determine the current sampling probability of the current word based on its occurrence frequency and the remaining sampling probability. Step 24: determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability. Step 25: add the current word to the negative example set as many times as it was sampled. Step 26: update the remaining sample count according to the number of times the current word was sampled, and update the remaining sampling probability according to the occurrence frequency of the current word, for use in sampling the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied. The specific execution of each step is described below.
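Before walking through the steps one by one, the loop as a whole can be sketched as follows (a minimal illustration; the function names presample_negatives and binomial_draw, and the list-of-pairs representation of the word frequency table, are assumptions of this sketch, not part of the original disclosure):

```python
import random

def binomial_draw(n, p):
    """Simulate n Bernoulli trials with success probability p (step 24)."""
    return sum(random.random() < p for _ in range(n))

def presample_negatives(word_freq, total_negatives):
    """Pre-sample a negative example set in one pass over the word frequency table.

    word_freq: list of (word, frequency) pairs whose frequencies sum to 1.
    total_negatives: S0, the total number of negative examples required.
    """
    negatives = []
    s = total_negatives            # remaining sample count
    r = 1.0                        # remaining sampling probability
    for word, p_i in word_freq:    # step 21: next word from the unsampled set
        if s <= 0:                 # predetermined condition: nothing left to draw
            break
        P = p_i / r                # step 23: current sampling probability
        b = binomial_draw(s, P)    # step 24: draw from Binomial(s, P)
        negatives.extend([word] * b)   # step 25: add the word b times
        s -= b                     # step 26: update remaining count ...
        r -= p_i                   #          ... and remaining probability
    return negatives

negatives = presample_negatives([("apple", 0.5), ("pear", 0.3), ("plum", 0.2)],
                                10000)
```

Here binomial_draw simulates the Bernoulli trials explicitly; step 24 below also discusses equivalent ways to obtain this draw.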
First, in step 21, the current word and its corresponding occurrence frequency are obtained from the unsampled word set of the word frequency table. It can be understood that the word frequency table may include multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The multiple candidate words may include all words appearing in the training corpus. The word frequency table may take various forms such as a table, a vector, an array, or key-value pairs; this specification does not limit it in this regard.
Each candidate word appears a different number of times in the training corpus; the word frequency table can thus also measure the weight of each word in the training corpus through its occurrence frequency. The occurrence frequency of a candidate word may be the ratio of the total number of occurrences of that word in the training corpus to the total number of words in the training corpus, where repeated words are not merged when computing the total: each occurrence of every word increments the total word count by 1.
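A word frequency table of this kind can be built, for example, as follows (an illustrative sketch only; the toy corpus and the dict format are placeholders):

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(corpus)                    # occurrences of each word
total = len(corpus)                         # repeats are not merged
word_freq = {w: c / total for w, c in counts.items()}
# word_freq["the"] == 2 / 6
```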
As mentioned above, according to the method of the embodiments of this specification, the candidate words in the word frequency table can be batch-sampled one by one. The word frequency table can therefore be divided into a sampled word set and an unsampled word set, containing the sampled candidate words and the unsampled candidate words respectively. In step 21, the current word and its corresponding occurrence frequency are obtained from the unsampled word set, to be used for sampling the current word next.
In one embodiment, candidate words may be obtained as the current word in the order of their storage addresses in the word frequency table. Taking words in this order guarantees that no word is fetched and sampled twice, i.e., each time the current word is obtained from the unsampled word set. For example, the storage address of the word frequency table is obtained, and a candidate word is fetched according to its offset relative to that storage address; the storage address of the word frequency table plus the offset is then the storage address of each candidate word. If the offsets range over [0000-FFFF], the candidate word at offset 0000 may be obtained first as the current word, the candidate word at offset 0001 in the next round of the flow, and so on. Optionally, a candidate word and its occurrence frequency may be stored in the storage unit of the same storage address, in which case the current word and its occurrence frequency can be obtained at the same time. Alternatively, a candidate word and its occurrence frequency may be stored in different storage units, in which case the associated occurrence frequency can be obtained according to the storage address of the candidate word.
In another embodiment, candidate words may be obtained as the current word in the order in which they are arranged in the word frequency table, again ensuring that the current word is always taken from the unsampled word set. For example, where the word frequency table is a table, candidate words are fetched row by row: the candidate word in the first row is obtained in the first round of the flow, the candidate word in the second row in the second round, and so on. Where the table has multiple columns, a candidate word may also be obtained in the order first column first row, first column second row, and so forth.
In step 22, the remaining sample count s and the remaining sampling probability r determined for the unsampled word set are obtained.
The remaining sample count s may be the number of negative examples still needed in the negative example set, which is also the total number of times that all unsampled words in the unsampled word set still need to be sampled.
Initially, the remaining sample count s is the total number of negative examples S0 required for the entire negative example set. In one embodiment, the number S0 of negative examples required for the entire negative example set may be computed from the number of words in the training corpus, or may be set manually; this application is not limited in this regard. For example, in the aforementioned loss function, k negative examples are needed for each training word; assuming the training corpus contains n words, the number of negative examples may be set to S0 = n*k. In another embodiment, the initially required number S0 of negative examples may be set to a predetermined proportion of the number of words in the training corpus, and so on.
After this initial setting, each time a candidate word has been sampled, the remaining sample count is updated, i.e., decreased by the corresponding number. For example, if the negative example set is manually set to require 10,000 negative examples, and candidate word $w_0$ has been sampled 5 times, the total number of samples still required for the remaining words is 10000 - 5 = 9995.
The remaining sampling probability r may be the total sampling probability of all unsampled words in the negative sampling process of generating the negative example set. As an example, suppose the candidate words in the word frequency table include $w_0, w_1, w_2, \ldots$ with corresponding occurrence frequencies $p_0, p_1, p_2, \ldots$; the remaining sampling probability r then denotes the total sampling probability of the unsampled words. Initially, no candidate word has been sampled, so r equals the total sampling probability of all candidate words in the word frequency table; the initial value of r is therefore 1.
It can be understood that, in order to ensure that the proportion of each negative example in the final negative example set is consistent with the occurrence frequency of the corresponding candidate word, the remaining sampling probability is also updated each time a candidate word has been sampled. For example, after the first candidate word $w_0$ has been sampled, the remaining sampling probability is updated to $r' = r - p_0 = 1 - p_0$; by analogy, after the second candidate word $w_1$ has been sampled, it is updated to $r'' = r' - p_1 = 1 - p_0 - p_1$; and so on.
Therefore, if the current word $w_i$ is the first word in the word frequency table, then in step 22, the initial value S0 of the number of negative examples required by the negative example set is obtained as the remaining sample count s, and the initial value r = 1 is obtained as the remaining sampling probability r. If the current word $w_i$ is not the first word in the table, then in step 22, the remaining sample count s and remaining sampling probability r as updated after sampling the previous word $w_{i-1}$ are read.
Step 23: determine the current sampling probability P of the current word based on its occurrence frequency $p_i$ and the remaining sampling probability r. The current sampling probability P may be the sampling probability of the current word within the entire unsampled set.
It can be understood that in this embodiment the candidate words are batch-sampled; in other words, the required number of copies of a given candidate word is collected in one go. Once a candidate word has been sampled, it is added to the sampled word set, and the probability of it being sampled thereafter is 0. The subsequent sampling process therefore does not need to consider the already sampled candidate words, and proceeds within the unsampled word set. Since the current word has not yet been sampled, the unsampled word set includes the current word.
Still referring to the above example, it is easy to see that candidate words $w_0, w_1, w_2, \ldots$ have occurrence frequencies $p_0, p_1, p_2, \ldots$ respectively. When the first candidate word $w_0$ is sampled, its sampling probability is $p_0$, and the total sampling probability of the remaining candidate words (the unsampled word set) is $r = 1 - p_0 = p_1 + p_2 + \cdots$. The second candidate word $w_1$ has occurrence frequency $p_1$, so its sampling probability within the remaining candidate words (the unsampled word set) is $p_1 / (p_1 + p_2 + \cdots) = p_1 / (1 - p_0)$. By analogy, for the current word $w_i$, the current sampling probability can be expressed as $P = p_i / r$, i.e., the ratio of the occurrence frequency $p_i$ of the current word to the remaining sampling probability r.
Step 24: determine the number of times b that the current word $w_i$ is sampled, according to a binomial distribution of the current word under the remaining sample count s and the current sampling probability P. It can be understood that each candidate word in the word frequency table corresponds to a sample count; for example, as shown in FIG. 1, word w1 is sampled s1 times, word w2 is sampled s2 times, word w3 is sampled s3 times, and so on, to complete the batch sampling of the candidate words. Optionally, when a candidate word has a small occurrence frequency, its sample count may be 0.
In one embodiment, the binomial distribution is used to determine the sample count. The binomial distribution is the discrete probability distribution of the number of successes in a number of independent Bernoulli trials. In each trial, only one of two possible outcomes can occur, and the outcomes of the trials are mutually independent. The probability of each outcome remains the same across the independent trials. When the number of trials is 1, the binomial distribution reduces to the 0-1 (Bernoulli) distribution: a given outcome either occurs (success) or does not.
Let ξ denote the outcome of a random trial. If an event occurs with probability p, the probability of non-occurrence is q = 1 - p, and the probability P that the event occurs k times in n independent repeated trials is:
$$P(\xi = k) = C(n, k) \times p^{k} \times (1-p)^{n-k};$$

where $C(n, k) = \dfrac{n!}{k!\,(n-k)!}$.
This is the binomial probability of the event occurring k times, under n trials with probability p.
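As a quick numerical check (an illustrative snippet that merely evaluates the formula above):

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom_pmf(10, 3, 0.5))  # 0.1171875
```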
Specifically, in step 24, in one embodiment, a binomial distribution function Binomial(s, P) is called to determine the sample count b of the current word. As can be seen, the parameters of this binomial distribution function are the remaining sample count s and the current sampling probability P, and the function represents the number of times $w_i$ is drawn under the condition that, in s sampling trials, the probability of drawing the current word $w_i$ in each trial is P.
Execution of the above binomial distribution function may include simulating s sampling operations (Bernoulli trials), equivalent to s sampling trials performed over the remaining candidate words. In each sampling operation, the probability that the current word $w_i$ is drawn (a successful trial) is the current sampling probability P. The number of times the current word is drawn is counted, and the sample count b of the current word is determined as the number of times it is drawn across the s sampling operations.
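In practice this draw can also be delegated to a standard library routine, for example NumPy's binomial sampler, which is statistically equivalent to simulating the s Bernoulli trials explicitly (a sketch; the original does not prescribe a particular library):

```python
import numpy as np

s, P = 9995, 0.03
b = np.random.binomial(s, P)                  # one draw of Binomial(s, P)
b_sim = int(np.sum(np.random.random(s) < P))  # equivalent explicit simulation
```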
In another embodiment, a value may instead be obtained at random from among the values satisfying the binomial sampling condition, and used as the sample count of the current word. It can be understood that, according to the meaning of the binomial distribution, if the current word "wealth" is ultimately sampled b times, the condition on the value b may be that its ratio to the remaining sample count should be consistent with the current sampling probability. For example, if the remaining sample count s is 8000 and the current sampling probability P is 0.03, then b/8000 may round to 0.03 whenever b is in the range 200-272. Thus, a random number between 200 and 272 may be taken as the sample count of the current word "wealth".
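This alternative can be sketched as follows (illustrative only; the rounding granularity ndigits and the exhaustive scan over candidate values are assumptions of the sketch):

```python
import random

def draw_count(s, P, ndigits=2):
    """Pick b uniformly among the values whose ratio b/s rounds to P."""
    candidates = [b for b in range(s + 1)
                  if round(b / s, ndigits) == round(P, ndigits)]
    return random.choice(candidates)

b = draw_count(8000, 0.03)  # a value whose ratio to 8000 rounds to 0.03
```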
In step 25, the current word $w_i$ is added to the negative example set according to the sample count b determined above: as many copies of the current word are added to the negative example set as the sample count b determined in step 24. In the above example, where b takes the value 232, 232 copies of the current word "wealth" are added to the negative example set.
Step 26: update the remaining sample count s according to the sample count b of the current word, and update the remaining sampling probability r according to the occurrence frequency $p_i$ of the current word. The updated remaining sample count s and remaining sampling probability r can be used to sample the other candidate words in the word frequency table. For example, for the next candidate word, the remaining sample count and remaining sampling probability obtained in step 22 are those updated in this step.
It can be understood that after each candidate word has been sampled, it can be moved from the unsampled word set to the sampled word set. Accordingly, the remaining sample count s kept for the unsampled word set decreases by the corresponding amount, and the remaining sampling probability r changes as well; in other words, the sampling conditions change for the next candidate word. For example, if the negative example set requires 10,000 negative examples, the initial remaining negative example count is 10000 and the initial remaining sampling probability is 1; after a candidate word w0 with occurrence frequency 0.03 has been sampled 200 times, the next candidate word, with occurrence frequency 0.05, is sampled under a remaining negative example count of 9800 and a remaining sampling probability of 0.97.
In one embodiment, after the current word has been sampled, the remaining sample count s may be updated to the difference between the previous remaining sample count and the sample count b of the current word. The implementation logic is: s = s - b, where s is the remaining sample count and b is the sample count of the current word $w_i$.
In one embodiment, the remaining sampling probability r is updated to the difference between the previous remaining sampling probability and the occurrence frequency $p_i$ of the current word. The implementation logic is: r = r - $p_i$, where r is the remaining sampling probability and $p_i$ is the occurrence frequency of the current word $w_i$.
It is worth noting that, since the number of negative examples required in the negative example set is finite, a predetermined condition related to that number may be set in advance; when the condition is met, negative sampling stops, and otherwise the above sampling flow continues for the other candidate words in the word frequency table. This detection step may be performed after update step 26 or in parallel with step 26; it may be part of step 26 or a subsequent step 27. Its specific implementation is described in detail below as a subsequent step 27.
In step 27, it is detected whether the predetermined condition is satisfied. If the predetermined condition is satisfied, the negative sampling flow ends; if not, the other candidate words in the word frequency table are sampled according to the updated remaining sample count and remaining sampling probability.
In one embodiment, the predetermined condition may include the total number of negative examples in the negative example set reaching the initial remaining sample count, such as a manually set number of 10,000 negative examples.
In another embodiment, the predetermined condition may include the updated remaining sample count being 0, meaning that no further candidate words need to be collected as negative examples.
In yet another embodiment, the predetermined condition may include the unsampled word set being empty, meaning that all words in the word frequency table have been sampled.
According to an embodiment of another aspect, when the above predetermined condition is satisfied, the negative example set may also be output. The negative example set may be output locally or to other devices. The words in the negative example set may be arranged in sampling order or in randomly shuffled order; this application is not limited in this regard.
In a further embodiment, for a training word in the training corpus, negative examples can be selected from this negative example set. For example, when k negative examples are needed for a training word $U_i$ in the training corpus, k words can be taken directly from the negative example set.
According to an embodiment of one aspect, the words in the negative example set may correspond to values on a predetermined interval. As shown in FIG. 3, each candidate negative example in negative example set 31 corresponds one-to-one to a value in numeric interval 32. If negative example set 31 contains 10,000 pre-sampled negative example words, the positive integers on the interval [1, 10000] may be chosen, with each value corresponding to one negative example word. When selecting a negative example for a training word, a random number on this predetermined interval is generated; for example, for the random number 5 on numeric interval 32, the negative example word $w_1$ corresponding to the value 5 in negative example set 31 is selected. In practice, as many random numbers are generated as negative examples are needed. Random numbers may be generated one at a time, each fetching a corresponding negative example, or multiple random numbers may be generated at once to fetch the corresponding negative examples in a batch; this application is not limited in this regard.
It can be understood that, with a very small probability, an obtained negative example may coincide with the training word itself or with a word associated with it; an associated word is, for example, the context of the training word in a context prediction model, or a synonym of the training word in a synonym prediction model. In such a case, the word selected from the negative example set cannot serve as a negative example of that training word. Therefore, when selecting negative examples from the negative example set for a training word, if the selected word coincides with the training word itself or an associated word, the step of generating a random number on the predetermined interval is re-executed to produce a new random number, and the negative example word corresponding to the new random number is obtained.
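This selection-with-rejection procedure can be sketched as follows (illustrative only; the function name pick_negatives, the list representation of the negative example set, and the excluded set covering the training word and its associated words are assumptions):

```python
import random

def pick_negatives(negative_set, k, excluded):
    """Draw k negatives by random index, redrawing whenever the draw hits
    the training word itself or one of its associated words."""
    negatives = []
    while len(negatives) < k:
        candidate = negative_set[random.randrange(len(negative_set))]
        if candidate not in excluded:   # otherwise generate a new random number
            negatives.append(candidate)
    return negatives

negatives = pick_negatives(["cat", "dog", "mat", "sat"] * 2500, k=3,
                           excluded={"cat"})
```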
According to an embodiment of another aspect, when the words in the negative example set are arranged in randomly shuffled order, k words may also be selected in sequence starting from a selected position, to serve as negative examples. The selected position may be determined according to a certain rule, or the position corresponding to a generated random number may be taken as the selected position. For example: the first word identical to the training word is located, and the position of the next word is taken as the selected position. As another example, in the predetermined-interval example above, a random number between 1 and 10000 is generated; in this case only one random number needs to be generated, so the computation cost is small. As shown in FIG. 4, for negative example set 41, when 7 negative examples need to be taken for a training word, a random number on numeric interval 42 can be generated, for example the value 5; the position corresponding to the value 5 is then taken as the selected position, and starting from it the 7 candidate negative examples $w_3, w_9, w_3, w_7, w_6, w_4, w_8$ on interval 43 of negative example set 41 are obtained as the negative examples of that training word.
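A sketch of this contiguous-window variant (illustrative; wrap-around at the end of the set is an assumption the original does not spell out):

```python
import random

def pick_window(negative_set, k):
    """Take k consecutive negatives starting from one random position."""
    start = random.randrange(len(negative_set))  # the single random number
    # wrap around so a start near the end still yields k items (assumed)
    return [negative_set[(start + j) % len(negative_set)] for j in range(k)]

negatives = pick_window(["w3", "w9", "w3", "w7", "w6", "w4", "w8", "w2"], k=7)
```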
In this way, the process of obtaining negative examples for the training words of the training corpus is greatly simplified, and the speed of acquisition is improved.
In some possible designs, the flow shown in FIG. 2 may further include the following steps: detecting whether an update condition of the negative example set is satisfied; and, when the update condition is satisfied, re-executing the method of performing negative sampling from the word frequency table for the training corpus, so as to regenerate the negative example set. It can be understood that when the required negative example set contains a large number of words, for example several hundred million, the computation is also very heavy; a smaller negative example set, for example ten million, may therefore be generated at a time, and an update condition (for example, ten million uses) may be set for the negative example set, upon which it is updated. Since, during execution of the above method, obtaining the sample count for each candidate word involves simulating s sampling operations (Bernoulli trials), s being the remaining sample count, or randomly taking a value from those satisfying the condition, and so on, the negative example set generated by each re-execution of the method may differ.
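One way to realize such a refresh policy is sketched below (illustrative only; the class name NegativePool and the fixed use-count threshold are assumptions, and presample_negatives refers to the sketch given earlier):

```python
import random

class NegativePool:
    """Regenerates the pre-sampled pool after a fixed number of uses."""

    def __init__(self, word_freq, size, max_uses=10_000_000):
        self.word_freq, self.size, self.max_uses = word_freq, size, max_uses
        self.uses = 0
        self.pool = presample_negatives(word_freq, size)  # sketch from above

    def draw(self, k):
        self.uses += k
        if self.uses > self.max_uses:   # update condition met: regenerate
            self.pool = presample_negatives(self.word_freq, self.size)
            self.uses = k
        return [self.pool[random.randrange(len(self.pool))] for _ in range(k)]
```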
Reviewing the above process: on the one hand, because the negative example set consists of pre-sampled negative examples, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary, so the computational complexity is greatly reduced. On the other hand, during pre-sampling, batch sampling is performed and each word in the word frequency table is sampled only once, while the number of samples drawn may be greater than one, which reduces the time spent on negative sampling and allows it to be performed quickly and efficiently. In short, the flow shown in FIG. 2 can improve the efficiency of negative sampling.
According to an embodiment of another aspect, an apparatus for performing negative sampling from a word frequency table for a training corpus is also provided. FIG. 5 shows a schematic block diagram of such an apparatus according to an embodiment. As shown in FIG. 5, the apparatus 500 for performing negative sampling from a word frequency table for a training corpus includes: a first acquiring unit 51, configured to obtain a current word and its corresponding occurrence frequency from the unsampled word set of the word frequency table; a second acquiring unit 52, configured to obtain the remaining sample count and remaining sampling probability determined for the unsampled word set; a first determining unit 53, configured to determine the current sampling probability based on the occurrence frequency of the current word and the remaining sampling probability; a second determining unit 54, configured to determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability; an adding unit 55, configured to add the current word to the negative example set as many times as it was sampled; and an updating unit 56, configured to update the remaining sample count according to the number of times the current word was sampled and update the remaining sampling probability according to the occurrence frequency of the current word, for sampling the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
The first acquiring unit 51 may first obtain one candidate word as the current word from the unsampled word set of the multiple candidate words in the word frequency table, and obtain the occurrence frequency corresponding to that current word, namely its occurrence frequency in the training corpus.
The second acquiring unit 52 is configured to obtain the remaining sample count and remaining sampling probability determined for the unsampled word set. The remaining sample count may be the number of negative examples still needed in the negative example set; in other words, the total number of times the unsampled words are to be sampled in the negative sampling process of generating the negative example set. The remaining sampling probability may be the total sampling probability of the unsampled words in that process; its initial value r is generally 1.
The first determining unit 53 may determine the current sampling probability of the current word based on its occurrence frequency and the remaining sampling probability. The current sampling probability may be the sampling probability of the current word within the unsampled word set. In an optional embodiment, the current sampling probability may be the ratio of the occurrence frequency of the current word to the remaining sampling probability.
The second determining unit 54 may determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability. The binomial distribution is the discrete probability distribution of the number of successes in a number of independent Bernoulli trials. In a specific embodiment, one trial is carried out for each negative example still to be sampled, and in each trial the probability that the current word is drawn is the current sampling probability. The main role of the second determining unit 54 is to determine the number of times b that the i-th word is successfully drawn in the s trials.
According to an embodiment of another aspect, the second determining unit 54 may simulate execution of the sampling operation as many times as the remaining sample count, ensuring that in each sampling operation the probability that the current word is drawn is the current sampling probability; it counts the number of times the current word is drawn, and determines the sample count of the current word as that number.
According to an embodiment of another aspect, the second determining unit 54 may also randomly take a value from those satisfying the condition, as the sample count of the current word. The condition on the value may be that its ratio to the remaining sample count should be consistent with the current sampling probability.
The adding unit 55 may add the current word to the negative example set according to the sample count determined by the second determining unit 54: as many copies of the current word are added to the negative example set as the sample count indicates.
The updating unit 56 updates the remaining sample count according to the number of times the current word was sampled, and updates the remaining sampling probability according to the occurrence frequency of the current word. It can be understood that after each candidate word has been sampled, the remaining sample count decreases by the corresponding amount and the remaining sampling probability changes; in other words, the sampling conditions change for the next candidate word. In some possible designs, the updating unit 56 may update the remaining sample count to the difference between the previous remaining sample count and the number of times the current word was sampled, and update the remaining sampling probability to the difference between the previous remaining sampling probability and the occurrence frequency of the current word.
On the other hand, since the number of negative examples required in the negative example set is finite, a predetermined condition may be set in advance; when it is met, negative sampling stops, and otherwise the sampling flow continues for the other candidate words in the word frequency table. This detection function may be implemented by the updating unit 56 or by an independent detection unit. Accordingly, in some embodiments, the apparatus 500 further includes a detection unit 57, configured to detect, after the updating unit 56 has updated the remaining sample count and remaining sampling probability, whether the predetermined condition is satisfied, and, if not, to have the other candidate words in the word frequency table sampled according to the updated remaining sample count and remaining sampling probability. Here, the predetermined condition may include the total number of negative examples in the negative example set reaching the initial remaining sample count, the updated remaining sample count being 0, or the unsampled word set being empty.
In some possible designs, the apparatus 500 may further include:
an output module (not shown), configured to output the negative example set when the number of negative examples in the negative example set satisfies the predetermined condition. The negative example set may be output locally or to other devices. In a further embodiment, the apparatus 500 may also include a selection unit (not shown), configured to select, for a training word in the training corpus, negative examples from the negative example set.
According to an embodiment of one aspect, the words in the negative example set may correspond to values on a predetermined interval, and the selection unit may further include: a generating module, configured to generate a random number on the predetermined interval, the generated random number being taken from the aforementioned values; and an acquiring module, configured to obtain, from the negative example set, the negative example corresponding to the random number.
In some implementations, the obtained negative example may coincide with the training word or its context word, in which case it cannot serve as a negative example of that training word. The acquiring module may therefore be further configured to: compare whether the obtained negative example coincides with the training word, and, if so, have the generating module regenerate a random number on the predetermined interval.
According to a possible design, the apparatus 500 may further include a detection unit (not shown), configured to detect whether an update condition of the negative example set is satisfied, so that the apparatus 500 regenerates the negative example set when the update condition is satisfied, thereby updating the negative example set.
With the above apparatus, on the one hand, a pre-sampled negative example set can be generated; since it consists of pre-sampled negative examples, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary, so the computational complexity is greatly reduced. On the other hand, batch sampling can be performed during pre-sampling, with each word in the word frequency table sampled only once while the number of samples drawn may be greater than one, which reduces the time spent on negative sampling and allows it to be performed quickly and efficiently. In short, the apparatus 500 shown in FIG. 5 can improve the efficiency of negative sampling.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in connection with FIG. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, the memory storing executable code, and the processor, when executing the executable code, implementing the method described in connection with FIG. 2.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.
Claims (24)
- A method for performing negative sampling from a word frequency table for a training corpus, the word frequency table including a plurality of candidate words and an occurrence frequency of each candidate word in the training corpus, the method comprising: obtaining a current word and an occurrence frequency pi corresponding to the current word from an unsampled word set of the plurality of candidate words; obtaining a remaining sample count s and a remaining sampling probability r determined for the unsampled word set; determining a current sampling probability P corresponding to the current word based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r; determining a number of times b that the current word is sampled according to a binomial distribution of the current word under the remaining sample count s and the current sampling probability P; adding the current word to a negative example set the number of times b that it was sampled; and updating the remaining sample count s according to the number of times b, and updating the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word, for sampling other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
- The method according to claim 1, wherein determining the current sampling probability P based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r comprises: determining the current sampling probability P as the ratio of the occurrence frequency pi corresponding to the current word to the remaining sampling probability r.
- The method according to claim 1, wherein determining the number of times b that the current word is sampled comprises: simulating execution of s sampling operations, s being the remaining sample count, wherein in each sampling operation the probability that the current word is drawn is the current sampling probability P; and determining the number of times b as the number of times the current word is drawn in the s sampling operations.
- The method according to claim 1, wherein updating the remaining sample count s according to the number of times b that the current word is sampled comprises: updating the remaining sample count s to the difference between the remaining sample count s and the number of times b.
- The method according to claim 1, wherein the predetermined condition comprises: the number of negative examples in the negative example set reaching a preset number; or the updated remaining sample count s being zero; or the unsampled word set being empty.
- The method according to claim 1, wherein updating the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word comprises: updating the remaining sampling probability r to the difference between the remaining sampling probability r and the occurrence frequency pi corresponding to the current word.
- The method according to claim 1, further comprising: outputting the negative example set when the predetermined condition is satisfied.
- The method according to claim 7, further comprising: selecting, for a training word in the training corpus, negative examples from the negative example set.
- The method according to claim 8, wherein selecting negative examples from the negative example set comprises: generating a random number on a predetermined interval, wherein the predetermined interval includes a plurality of values, each value corresponding to a respective negative example in the negative example set, and the random number is taken from the plurality of values; and obtaining, from the negative example set, the negative example corresponding to the random number.
- The method according to claim 9, wherein obtaining the negative example corresponding to the random number from the negative example set comprises:
comparing whether the obtained negative example is identical to the training word;
and if identical, re-executing the step of generating a random number over the predetermined interval.
- The method according to any one of claims 1-10, further comprising:
detecting whether an update condition of the negative example set is satisfied;
and regenerating the negative example set when the update condition is satisfied.
- An apparatus for performing negative sampling from a word frequency table for a training corpus, the word frequency table comprising a plurality of candidate words and the occurrence frequency of each candidate word in the training corpus, the apparatus comprising:
a first acquiring unit configured to acquire a current word from the set of unsampled words among the plurality of candidate words, together with the occurrence frequency pi corresponding to the current word;
a second acquiring unit configured to acquire a remaining sampling count s and a remaining sampling probability r determined for the set of unsampled words;
a first determining unit configured to determine a current sampling probability P corresponding to the current word, based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r;
a second determining unit configured to determine the number of times b that the current word is sampled, according to a binomial distribution for the current word under the remaining sampling count s and the current sampling probability P;
an adding unit configured to add the current word to a negative example set b times;
and an updating unit configured to update the remaining sampling count s according to the number of times b, and the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word, so as to sample the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
- The apparatus according to claim 12, wherein the first determining unit is further configured to:
determine the current sampling probability P as the ratio of the occurrence frequency pi corresponding to the current word to the remaining sampling probability r.
- The apparatus according to claim 12, wherein the second determining unit comprises:
a trial module configured to simulate s sampling operations, s being the remaining sampling count, wherein in each sampling operation the probability that the current word is sampled is the current sampling probability P;
and a determining module configured to determine the number of times b as the number of times the current word is sampled in the s sampling operations.
- The apparatus according to claim 12, wherein the updating unit is further configured to:
update the remaining sampling count s to the difference between the remaining sampling count s and the number of times b.
- The apparatus according to claim 12, wherein the predetermined condition comprises:
the number of negative examples in the negative example set reaching a preset number; or
the updated remaining sampling count being zero; or
the set of unsampled words being empty.
- The apparatus according to claim 12, wherein the updating unit is further configured to:
update the remaining sampling probability r to the difference between the remaining sampling probability r and the occurrence frequency pi corresponding to the current word.
- The apparatus according to claim 12, further comprising:
an output module configured to output the negative example set when the predetermined condition is satisfied.
- The apparatus according to claim 18, further comprising:
a selecting unit configured to select, for a training word in the training corpus, negative examples from the negative example set.
- The apparatus according to claim 19, wherein the selecting unit comprises:
a generating module configured to generate a random number over a predetermined interval, wherein the predetermined interval comprises a plurality of values, each value corresponding to a respective negative example in the negative example set, and the random number is taken from the plurality of values;
and an obtaining module configured to obtain, from the negative example set, the negative example corresponding to the random number.
- The apparatus according to claim 20, wherein the obtaining module is further configured to:
compare whether the obtained negative example is identical to the training word;
and if identical, cause the generating module to re-execute the step of generating a random number over the predetermined interval.
- The apparatus according to any one of claims 12-21, further comprising:
a detecting unit configured to detect whether an update condition of the negative example set is satisfied, so that the apparatus regenerates the negative example set when the update condition is satisfied.
- A computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method according to any one of claims 1-11.
- A computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method according to any one of claims 1-11.
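The sampling loop of claims 1, 2, 4 and 6 can be read as drawing a fixed number of negatives from the word-frequency distribution through a chain of conditional binomial draws: the count for the first word follows Binomial(s, p1), and each later word is drawn from the renormalized remainder. The Python sketch below is illustrative only, not the patented implementation; the names sample_negatives, word_freq and num_negatives are introduced here for the example.

```python
import numpy as np

def sample_negatives(word_freq, num_negatives, seed=0):
    """Sketch of claims 1/2/4/6. word_freq maps each candidate word to its
    occurrence frequency pi; the frequencies are assumed to sum to 1."""
    rng = np.random.default_rng(seed)
    s = num_negatives                 # remaining sampling count
    r = 1.0                           # remaining sampling probability
    negatives = []
    for word, p_i in word_freq.items():   # iterate over the unsampled set
        if s == 0:                        # predetermined condition (claim 5)
            break
        # Current probability P = pi / r (claim 2); the guard absorbs
        # floating-point drift and forces P = 1 for the last word.
        P = 1.0 if r <= p_i else p_i / r
        b = rng.binomial(s, P)            # binomial draw under s and P (claim 1)
        negatives.extend([word] * b)      # add the word b times
        s -= b                            # s := s - b (claim 4)
        r -= p_i                          # r := r - pi (claim 6)
    return negatives

# Example run on an assumed three-word frequency table.
print(sample_negatives({"apple": 0.5, "ant": 0.3, "axe": 0.2}, 10))
```

Worked through on the assumed frequencies 0.5, 0.3 and 0.2: the first word is drawn with P = 0.5/1.0 = 0.5; after the update r = 0.5, so the second word uses P = 0.3/0.5 = 0.6; the last word uses P = 0.2/0.2 = 1, so exactly s negatives are produced in total. Each candidate word is visited once, so the work grows with the vocabulary size plus the number of negatives, rather than requiring an interval lookup for every single negative.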
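Claim 3 (and the trial module of claim 14) realizes the binomial draw by simulating s Bernoulli trials, in each of which the current word is hit with probability P. A minimal sketch, assuming a uniform random source; binomial_by_trials is a name introduced here:

```python
import random

def binomial_by_trials(s, P, rng=random):
    # Simulate s independent sampling operations; each one samples the
    # current word with probability P, and the number of hits is the
    # number of times b the word is sampled (claim 3).
    return sum(1 for _ in range(s) if rng.random() < P)
```

The trial-by-trial simulation costs O(s) per word; a library generator such as numpy's binomial routine draws from the same distribution without looping over the individual trials, which matters when s is large.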
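Claims 8-10 (mirrored by the selecting unit of claims 19-21) pick a negative for a given training word by drawing a uniform random value over an interval whose points correspond one-to-one to entries of the negative example set, re-drawing whenever the pick coincides with the training word itself. An illustrative sketch under the assumption that the set is held as a non-empty list; pick_negative_for is a name introduced here:

```python
import random

def pick_negative_for(training_word, negatives, rng=random):
    # Indices 0 .. len(negatives)-1 play the role of the predetermined
    # interval in claim 9: each index corresponds to one negative example,
    # and frequent words occupy more indices, so a uniform draw reproduces
    # the word-frequency distribution.
    while True:
        candidate = negatives[rng.randrange(len(negatives))]
        if candidate != training_word:   # claim 10: re-draw on a match
            return candidate
```

Selecting from the precomputed set replaces the per-negative interval search of the conventional scheme with a single array access, at the cost of materializing the set once.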
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810555518.X | 2018-06-01 | ||
CN201810555518.XA CN108875810B (en) | 2018-06-01 | 2018-06-01 | Method and device for sampling negative examples from word frequency table aiming at training corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019228014A1 (en) | 2019-12-05 |
Family
ID=64336301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/077438 WO2019228014A1 (en) | 2018-06-01 | 2019-03-08 | Method and apparatus for performing, for training corpus, negative sampling in word frequency table |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN108875810B (en) |
TW (1) | TWI698761B (en) |
WO (1) | WO2019228014A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875810B (en) * | 2018-06-01 | 2020-04-28 | 阿里巴巴集团控股有限公司 | Method and device for sampling negative examples from word frequency table aiming at training corpus |
CN112364130B (en) * | 2020-11-10 | 2024-04-09 | 深圳前海微众银行股份有限公司 | Sample sampling method, apparatus and readable storage medium |
CN114201603B (en) * | 2021-11-04 | 2025-08-12 | 阿里巴巴(中国)有限公司 | Entity classification method, entity classification device, storage medium, processor and electronic device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7729901B2 (en) * | 2005-12-13 | 2010-06-01 | Yahoo! Inc. | System for classifying words |
US7681147B2 (en) * | 2005-12-13 | 2010-03-16 | Yahoo! Inc. | System for determining probable meanings of inputted words |
US7825937B1 (en) * | 2006-06-16 | 2010-11-02 | Nvidia Corporation | Multi-pass cylindrical cube map blur |
US9084096B2 (en) * | 2010-02-22 | 2015-07-14 | Yahoo! Inc. | Media event structure and context identification using short messages |
WO2012126180A1 (en) * | 2011-03-24 | 2012-09-27 | Microsoft Corporation | Multi-layer search-engine index |
US8957707B2 (en) * | 2011-11-30 | 2015-02-17 | Egalax—Empia Technology Inc. | Positive/negative sampling and holding circuit |
CN103870447A (en) * | 2014-03-11 | 2014-06-18 | 北京优捷信达信息科技有限公司 | Keyword extracting method based on implied Dirichlet model |
TWI605353B (en) * | 2016-05-30 | 2017-11-11 | Chunghwa Telecom Co Ltd | File classification system, method and computer program product based on lexical statistics |
CN106547735B (en) * | 2016-10-25 | 2020-07-07 | 复旦大学 | Construction and usage of context-aware dynamic word or word vector based on deep learning |
CN107239444B (en) * | 2017-05-26 | 2019-10-08 | 华中科技大学 | A kind of term vector training method and system merging part of speech and location information |
2018
- 2018-06-01 CN CN201810555518.XA patent/CN108875810B/en active Active

2019
- 2019-02-27 TW TW108106638A patent/TWI698761B/en active
- 2019-03-08 WO PCT/CN2019/077438 patent/WO2019228014A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004095421A1 (en) * | 2003-03-19 | 2004-11-04 | Intel Corporation | A coupled hidden markov model (chmm) for continuous audiovisual speech recognition |
CN106257441A (en) * | 2016-06-30 | 2016-12-28 | 电子科技大学 | A kind of training method of skip language model based on word frequency |
CN107220233A (en) * | 2017-05-09 | 2017-09-29 | 北京理工大学 | A kind of user knowledge demand model construction method based on gauss hybrid models |
CN108021934A (en) * | 2017-11-23 | 2018-05-11 | 阿里巴巴集团控股有限公司 | The method and device of more key element identifications |
CN108875810A (en) * | 2018-06-01 | 2018-11-23 | 阿里巴巴集团控股有限公司 | The method and device of negative example sampling is carried out from word frequency list for training corpus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756573A (en) * | 2023-08-16 | 2023-09-15 | 国网智能电网研究院有限公司 | Negative example sampling method, training method, defect grading method, device and system |
CN116756573B (en) * | 2023-08-16 | 2024-01-16 | 国网智能电网研究院有限公司 | Negative example sampling method, training method, defect grading method, device and system |
Also Published As
Publication number | Publication date |
---|---|
CN108875810B (en) | 2020-04-28 |
TWI698761B (en) | 2020-07-11 |
TW202004533A (en) | 2020-01-16 |
CN108875810A (en) | 2018-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019228014A1 (en) | Method and apparatus for performing, for training corpus, negative sampling in word frequency table | |
JP2020522055A5 (en) | ||
CN108491302B (en) | Method for detecting spark cluster node state | |
US20160026557A1 (en) | System and method for testing software | |
CN111881023B (en) | Software aging prediction method and device based on multi-model comparison | |
EP3660705A1 (en) | Optimization device and control method of optimization device | |
US20150127990A1 (en) | Error Report Processing Using Call Stack Similarity | |
JP2019114158A (en) | Coverage teste support apparatus and coverage test support method | |
WO2021056914A1 (en) | Automatic modeling method and apparatus for object detection model | |
US12340285B2 (en) | Testing models in data pipeline | |
US20230359449A1 (en) | Learning-augmented application deployment pipeline | |
CN104021072A (en) | Machine and methods for evaluating failing software programs | |
CN116166967B (en) | Data processing method, equipment and storage medium based on meta learning and residual error network | |
CN119003350A (en) | Test method and device for code generation model | |
CN113641905A (en) | Model training method, information pushing method, device, equipment and storage medium | |
CN113537614A (en) | Construction method, system, equipment and medium of power grid engineering cost prediction model | |
CN117785517A (en) | Equipment reliability assessment methods, devices, computer equipment and storage media | |
CN112423031B (en) | KPI monitoring method, device and system based on IPTV | |
KR102255470B1 (en) | Method and apparatus for artificial neural network | |
CN113836005A (en) | A method, apparatus, electronic device and storage medium for generating a virtual user | |
Qiu et al. | Availability analysis of systems deploying sequences of environmental-diversity-based recovery methods | |
CN110442508B (en) | Test task processing method, device, equipment and medium | |
CN120104394B (en) | Method, device, equipment, medium and program for predicting recovery duration | |
US20240289605A1 (en) | Proxy Task Design Tools for Neural Architecture Search | |
US20250245216A1 (en) | Machine learning model prompt hydration via prompt registry and context store |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19812199; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19812199; Country of ref document: EP; Kind code of ref document: A1 |