WO2019228014A1 - Method and apparatus for performing, for training corpus, negative sampling in word frequency table
- Publication number: WO2019228014A1
- Application: PCT/CN2019/077438
- Authority: WIPO (PCT)
- Prior art keywords: vocabulary, sampling, current, remaining, negative
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Description
- One or more embodiments of the present specification relate to the field of computer technology, and in particular to a computer-implemented method and device for performing negative sampling from a word frequency table for a training corpus.
- Noise contrastive estimation (NCE) is a loss function commonly used in unsupervised algorithms such as Word2Vec and Node2Vec. When applying this loss function, a certain number of negative examples must first be generated for each word and its context in the training corpus. For a given word in the training corpus, any word other than that word and its context words can serve as a negative example. Generally, these negative examples are randomly sampled according to the vocabulary distribution of the training corpus, which is approximated, for example, by a word frequency table.
- In conventional techniques, a negative example is generated for the corresponding vocabulary at the moment it is used. Specifically, the vocabulary distribution (such as the word frequency table) is mapped onto an interval, a value within the interval is generated, and the corresponding vocabulary is looked up as a negative example.
- When the dictionary used in training contains many words (for example, on the order of hundreds of millions) and a large number of negative examples is required, an improved solution is desired that reduces the sampling time, so that negative sampling can be performed quickly and efficiently.
- One or more embodiments of the present specification describe a method and device that, when the training dictionary contains many words (for example, on the order of hundreds of millions) and the required number of negative examples is large, reduce the sampling time so that negative sampling can be performed quickly and efficiently.
- According to a first aspect, a method for performing negative sampling from a word frequency table for a training corpus is provided, the word frequency table including multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The method includes:
- obtaining a current vocabulary from the unsampled vocabulary set of the multiple candidate words, and the occurrence frequency corresponding to the current vocabulary;
- obtaining the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set;
- determining a current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
- determining the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability;
- adding the current vocabulary to a negative example set according to the number of times it is sampled;
- updating the remaining number of samples according to the number of times the current vocabulary is sampled, and updating the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, for sampling the other candidate words in the word frequency table, until a predetermined condition is detected to be satisfied.
- In one embodiment, determining the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability includes: determining the current sampling probability as the ratio of the occurrence frequency corresponding to the current vocabulary to the remaining sampling probability.
- According to an embodiment, determining the number of times the current vocabulary is sampled includes: simulating the sampling operation for the remaining number of samples, where in each sampling operation the probability that the current vocabulary is sampled is the current sampling probability; and determining the number of times sampled as the number of times the current vocabulary is sampled in these sampling operations.
- In one embodiment, updating the remaining number of samples according to the number of times the current vocabulary is sampled includes: updating the remaining number of samples to the difference between the remaining number of samples and the number of times sampled.
- Further, in one embodiment, the predetermined condition includes: the number of negative examples in the negative example set reaches a preset number; or the remaining number of samples after the update is zero; or the unsampled vocabulary set is empty.
- In a possible embodiment, updating the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary includes: updating the remaining sampling probability to the difference between the remaining sampling probability and the occurrence frequency corresponding to the current vocabulary.
- According to a possible design, the method further includes: outputting the negative example set when the number of negative examples in the negative example set meets the predetermined condition.
- In some possible embodiments, the method further includes: selecting negative examples from the negative example set for the training vocabulary in the training corpus.
- Further, in some embodiments, selecting negative examples from the negative example set includes: generating a random number on a predetermined interval, where each value on the predetermined interval corresponds to a negative example in the negative example set and the random number is taken from these values; and obtaining from the negative example set the negative example corresponding to the random number.
- According to an embodiment, obtaining the negative example corresponding to the random number from the negative example set includes: comparing whether the obtained negative example is consistent with the training vocabulary; and, if it is consistent, re-executing the step of generating a random number on the predetermined interval.
- According to a possible design, the method further includes: detecting whether an update condition of the negative example set is satisfied; and, when the update condition is satisfied, regenerating the negative example set.
- According to a second aspect, a device for performing negative sampling from a word frequency table for a training corpus is provided, the word frequency table including multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The device includes:
- a first acquiring unit configured to acquire a current vocabulary from the unsampled vocabulary set of the multiple candidate words, and the occurrence frequency corresponding to the current vocabulary;
- a second acquiring unit configured to acquire the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set;
- a first determining unit configured to determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
- a second determining unit configured to determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability;
- an adding unit configured to add the current vocabulary to the negative example set according to the number of times it is sampled;
- an updating unit configured to update the remaining number of samples according to the number of times the current vocabulary is sampled and to update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, for sampling the other candidate words in the word frequency table, until the number of negative examples in the negative example set is detected to meet a predetermined condition.
- According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
- According to a fourth aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
- With the method and device provided by the embodiments of this specification, when negative sampling is performed from the word frequency table for a training corpus, a candidate word is obtained from the word frequency table as the current vocabulary, and the remaining number of samples and the remaining sampling probability are obtained. Based on the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability, the number of times the current vocabulary is sampled is determined, and the current vocabulary is then added to the negative example set that number of times. Because all samples of the current vocabulary are added to the negative example set in a single pass, the overall number of sampling operations is reduced, which reduces the time spent on negative sampling and allows negative sampling to be performed quickly and efficiently.
- FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
- FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus according to an embodiment;
- FIG. 3 shows a specific example of selecting a negative example from a negative example set;
- FIG. 4 shows another specific example of selecting a negative example from a negative example set;
- FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table for a training corpus according to an embodiment.
- FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification.
- During the training of an unsupervised model (such as Word2Vec or Node2Vec), the loss function can be noise contrastive estimation (NCE), whose expression is as follows:
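The expression itself is not reproduced in this extraction. For reference, a standard negative-sampling form of the NCE objective, written with the symbols defined below, is (this is a reconstruction, not the patent's verbatim formula; $\sigma$ denotes the sigmoid function and $s(\cdot,\cdot)$ the model's score for a word-context pair):

$$\mathcal{L} = \sum_{w_i \in V} \Big[ \log \sigma\big(s(w_i, c_i)\big) + \sum_{j=1}^{k} \log \sigma\big(-s(w_{ij}, c_j)\big) \Big]$$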
- where: V represents the dictionary; w_i represents the i-th training vocabulary; c_i represents the context vocabulary adjacent to the i-th vocabulary; k represents the number of negative examples corresponding to w_i; w_ij represents the j-th negative example of w_i; and c_j represents the context vocabulary adjacent to the j-th negative example.
- As the above formula shows, during corpus training each training vocabulary w_i requires k random samples from its probability distribution over the dictionary to obtain k negative examples.
- The multiple words in the dictionary and the occurrence frequency of each word in the training corpus are usually represented by a word frequency table. The word frequency table corresponding to the dictionary V is typically projected onto an interval [0, 1], with the length of each segment in the interval proportional to the occurrence frequency of the corresponding vocabulary.
- Further, in one negative sampling method, the interval segment corresponding to each vocabulary is divided into multiple "cells" according to the minimum frequency unit, and the number of each cell is recorded as an index. The greater a vocabulary's occurrence frequency, the longer its interval segment and the more cells it contains. Each time a negative example is sampled, a random number over the indexes is generated, and the vocabulary whose index equals that random number is taken as the negative example; the more indexes there are, the more accurately the word frequency table is approximated.
- For example, since each index corresponds to one cell, the least frequent vocabulary corresponds to at least one index so that every vocabulary has an index, while other vocabularies may correspond to many: if vocabulary 1 has an occurrence frequency of 0.03 and vocabulary 2 has an occurrence frequency of 0.001, then vocabulary 2 corresponds to 1 index and vocabulary 1 corresponds to 30 indexes.
- When the dictionary V contains a large vocabulary (for example, hundreds of millions of words), the number of indexes is even larger. This requires large storage space, possibly even storage on a remote server, in which case fetching each negative example costs extra communication time, as illustrated in the sketch below.
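A minimal Python sketch of this conventional index-table approach; the dict-based frequency table and the minimum frequency unit of 0.001 are assumptions taken from the example above:

```python
import random

def build_index_table(word_freq, min_unit=0.001):
    # Expand each vocabulary into index "cells", one cell per minimum
    # frequency unit, so cell counts are proportional to occurrence frequency.
    table = []
    for word, p in word_freq.items():
        table.extend([word] * round(p / min_unit))
    return table

# vocabulary 1 gets 30 indexes, vocabulary 2 gets 1 index, as in the text
table = build_index_table({"vocab1": 0.03, "vocab2": 0.001})
negative = table[random.randrange(len(table))]  # one negative per random index
```

Each negative example then costs only one random draw, but the table grows with the number of cells, which is exactly the storage and communication problem described above.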
- As shown in FIG. 1, an embodiment of the present specification provides a solution: negative examples are first pre-sampled from the word frequency table, and the sampled vocabularies are added to a negative example set.
- In the pre-sampling process, batch sampling is performed: each vocabulary in the word frequency table is sampled only once, the number of samples drawn for it may be greater than one, and the final number of samples of each vocabulary is consistent with its occurrence frequency in the word frequency table.
- For example, as shown in FIG. 1, the vocabulary w1 in the word frequency table is sampled s1 times, the vocabulary w2 is sampled s2 times, the vocabulary w3 is sampled s3 times, and so on. In this way, the number of sampling operations during pre-sampling is reduced, while the number of samples of each vocabulary in the negative example set remains consistent with its occurrence frequency in the word frequency table.
- During model training, k1 negative examples are randomly taken from the negative example set for the training vocabulary u1, k2 negative examples for the training vocabulary u2, k3 negative examples for the training vocabulary u3, and so on.
- Because the negative example set is pre-sampled so that the number of samples of each vocabulary is consistent with its occurrence frequency in the word frequency table, only the required number of negative examples need to be drawn at random at use time, without considering the occurrence frequencies again; the sampling probability of each negative example still matches the occurrence frequency of the corresponding vocabulary in the word frequency table. In this way, the computational complexity is greatly reduced.
- Moreover, the pre-sampled negative example set can be used multiple times, which further improves the effectiveness of negative sampling in model training.
- The computing platform in FIG. 1 may be any device or equipment with sufficient computing capability, such as a desktop computer or a server. It can be understood that the computing platform may also be a cluster composed of such devices. When the computing platform comprises multiple devices, according to one embodiment, some of the devices may perform the negative sampling operation to generate the negative example set, while other devices obtain the negative example set and randomly take negative examples from it when training vocabulary.
- FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus according to an embodiment of the present specification. The execution subject of this method is, for example, the computing platform of FIG. 1.
- The method includes the following steps. Step 21: obtain a current vocabulary from the unsampled vocabulary set of the word frequency table, together with the occurrence frequency of the current vocabulary. Step 22: obtain the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set. Step 23: determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency of the current vocabulary and the remaining sampling probability. Step 24: determine the number of times the current vocabulary is sampled according to its binomial distribution under the remaining number of samples and the current sampling probability. Step 25: add the current vocabulary to the negative example set according to the number of times it is sampled. Step 26: update the remaining number of samples according to the number of times the current vocabulary is sampled, and update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary.
- First, in step 21, a current vocabulary is obtained from the unsampled vocabulary set of the word frequency table, together with the occurrence frequency corresponding to the current vocabulary.
- The word frequency table may include multiple candidate words and the occurrence frequency of each candidate word in the training corpus.
- It can be understood that the multiple candidate words may include all the words in the training corpus.
- The word frequency table may take various forms, such as a table, a vector, an array, or key-value pairs, which this specification does not limit.
- Generally, the number of occurrences of each candidate vocabulary in the training corpus differs.
- The word frequency table can measure the proportion of each vocabulary in the training corpus by its occurrence frequency.
- Specifically, the occurrence frequency of a candidate vocabulary may be the ratio of the total number of occurrences of that vocabulary in the training corpus to the total number of words in the training corpus, where repeated words are not merged when counting: each occurrence of any word increases the total word count by 1.
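As a small worked example of this definition (the counts are assumed for illustration), a vocabulary appearing 300 times in a corpus of 10,000 total word occurrences has

$$p(w) = \frac{\mathrm{count}(w)}{N} = \frac{300}{10000} = 0.03$$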
- In the embodiments of the present specification, the candidate words in the word frequency table may be batch-sampled in sequence. The word frequency table can therefore be divided into a sampled vocabulary set and an unsampled vocabulary set, containing the sampled and the unsampled candidate words respectively.
- In step 21, the current vocabulary and its occurrence frequency are obtained from the unsampled vocabulary set, to be used next for sampling the current vocabulary.
- In one embodiment, candidate words can be obtained sequentially as the current vocabulary according to the storage address of each word in the word frequency table. Taking words in this order ensures that no word is sampled repeatedly, that is, each word obtained has never been sampled.
- For example, the storage address of the word frequency table is obtained, and a candidate vocabulary is obtained according to its offset from that address; the storage address of the word frequency table plus the offset is the storage address of the candidate vocabulary. If the offsets lie in [0000-FFFF], the candidate word at offset 0000 can be obtained as the current vocabulary, the candidate word at offset 0001 in the next round of the process, and so on.
- In one case, a candidate vocabulary and its occurrence frequency may be stored in the storage unit corresponding to the same storage address, in which case the current vocabulary and its occurrence frequency may be obtained at the same time. In another case, the candidate vocabulary and its occurrence frequency may be stored in different storage units, in which case the associated occurrence frequency may be obtained according to the storage address of the candidate vocabulary.
- In another embodiment, a candidate vocabulary can also be obtained as the current vocabulary according to the arrangement order of the candidate words in the word frequency table, again ensuring that the current vocabulary is taken from the unsampled vocabulary set each time.
- For example, when the word frequency table is a table, the candidate words can be obtained in the order of the rows of the table; a candidate vocabulary can also be obtained in column order, such as first column first row, first column second row, and so on.
- Next, in step 22, the remaining number of samples s and the remaining sampling probability r determined for the unsampled vocabulary set are obtained.
- The remaining number of samples s can be understood as the number of negative examples still needed in the negative example set, which is also the total number of times that the words in the unsampled vocabulary set still need to be sampled.
- Initially, the remaining number of samples s is the total number S0 of negative examples required for the entire negative example set.
- The remaining sampling probability r may be the total sampling probability of all unsampled words during the negative sampling process that generates the negative example set.
- For example, if the candidate words in the word frequency table are w_0, w_1, w_2, ..., with occurrence frequencies p_0, p_1, p_2, ..., the remaining sampling probability r represents the total sampling probability of the words not yet sampled.
- Initially, no candidate word has been sampled, so the remaining sampling probability r is the total sampling probability of all candidate words in the word frequency table; the initial value of r is therefore 1.
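Equivalently, following the definitions above:

$$r = \sum_{w_j \in \text{unsampled}} p_j, \qquad r_{\text{initial}} = \sum_{j} p_j = 1$$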
- Next, in step 23, the current sampling probability P corresponding to the current vocabulary is determined based on the occurrence frequency p_i corresponding to the current vocabulary and the remaining sampling probability r.
- The current sampling probability P can be understood as the sampling probability of the current vocabulary within the entire unsampled vocabulary set.
- As mentioned above, the candidate words are sampled in batches, that is, all samples of a candidate word are collected at once. After a candidate word has been sampled, it is added to the sampled vocabulary set and its probability of being sampled subsequently is 0. The subsequent sampling process therefore need not consider the sampled candidate words and proceeds within the unsampled vocabulary set, which includes the current vocabulary (since it has not yet been sampled).
- For example, suppose the candidate words w_0, w_1, w_2, ... have occurrence frequencies p_0, p_1, p_2, ... respectively. When no word has been sampled, r = 1, and the current sampling probability of w_0 is p_0; once w_0 has been sampled, the current sampling probability of w_1 is determined against the remaining probability r = 1 - p_0.
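In formula form, with the ratio rule described in the embodiments above (the numbers are illustrative, borrowed from the worked example later in this description):

$$P = \frac{p_i}{r}, \qquad \text{e.g.}\ P(w_1) = \frac{p_1}{1 - p_0} = \frac{0.05}{1 - 0.03} \approx 0.052$$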
- Next, in step 24, the number of times b that the current vocabulary w_i is sampled is determined according to the binomial distribution of w_i under the remaining number of samples s and the current sampling probability P.
- It can be understood that in the embodiments of the present specification, each candidate word in the word frequency table corresponds to a number of times it is sampled; for example, as shown in FIG. 1, the vocabulary w1 is sampled s1 times, w2 is sampled s2 times, w3 is sampled s3 times, and so on, so that the candidate words are sampled in batches.
- When the occurrence frequency of a candidate word is small, the number of times it is sampled may be zero.
- In one embodiment, the binomial distribution is used to determine the number of times sampled.
- As is known, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of independent Bernoulli trials. In each trial, only one of two possible outcomes can occur, the trials are mutually independent, and the probability of each outcome is the same in every trial. When the number of trials is 1, the binomial distribution reduces to the 0-1 (Bernoulli) distribution: a given outcome either occurs (success) or does not.
- In one embodiment, a binomial distribution function Binomial(s, P) is called to determine the number of times b the current vocabulary is sampled, as in the sketch below. The parameters of the binomial distribution function are the remaining number of samples s and the current sampling probability P, meaning that b is the number of times w_i is sampled in s sampling trials when the probability of sampling the current vocabulary w_i in each trial is P.
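A minimal sketch of this call; the abstract Binomial(s, P) in the text is realized here with NumPy's binomial sampler (an assumption, not the patent's implementation), using the example numbers that appear below:

```python
import numpy as np

rng = np.random.default_rng()

s = 8000   # remaining number of samples
P = 0.03   # current sampling probability
b = int(rng.binomial(s, P))  # number of times the current vocabulary is sampled
print(b)   # a value near s * P = 240, varying from run to run
```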
- In one embodiment, executing the above binomial distribution function may include simulating the sampling operation (Bernoulli trial) for the remaining number of samples, which is equivalent to a sampling experiment performed over the remaining candidate words.
- In each sampling operation (Bernoulli trial), the probability that the current vocabulary w_i is sampled (a successful trial) is the current sampling probability P.
- The number of times the current vocabulary is sampled is counted, and the number of times sampled b is determined as the number of times the current vocabulary is sampled in the s sampling operations.
- In another embodiment, a value may instead be obtained at random from the values satisfying the sampling condition of the binomial distribution and used as the number of times the current vocabulary is sampled. It can be understood that, according to the meaning of the binomial distribution, if the current vocabulary "wealth" is ultimately sampled b times, the condition on b can be that its ratio to the remaining number of samples is consistent with the current sampling probability. For example, if the remaining number of samples s is 8000 and the current sampling probability P is 0.03, b/8000 can be rounded to 0.03 when b is in the range 200-272, so a random number between 200 and 272 can be taken as the number of times the current vocabulary "wealth" is sampled.
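A literal sketch of this alternative; rounding b/s to two decimal places is an assumption, and floating-point rounding makes the resulting range approximate (roughly 200-280, compare the 200-272 given in the text):

```python
import random

s, P = 8000, 0.03
# values of b whose ratio to the remaining number of samples
# rounds to the current sampling probability
candidates = [b for b in range(s + 1) if round(b / s, 2) == P]
b = random.choice(candidates)
```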
- Next, in step 25, the current vocabulary w_i is added to the negative example set according to the number of times sampled b.
- It can be understood that the number of times sampled b determined in step 24 is the number of copies of the current vocabulary added to the negative example set. For example, if the value of b in the example above is 232, then 232 copies of the current vocabulary "wealth" are added to the negative example set.
- Next, in step 26, the remaining number of samples s is updated according to the number of times b the current vocabulary is sampled, and the remaining sampling probability r is updated according to the occurrence frequency p_i corresponding to the current vocabulary.
- The updated remaining number of samples s and remaining sampling probability r can be used to sample the other candidate words in the word frequency table: for the next candidate word, the remaining number of samples and the remaining sampling probability obtained in step 22 are the values updated in this step.
- It can be understood that after each candidate word is sampled, the sampling conditions change. For example, if the negative example set needs 10,000 negative examples, the initial remaining number of samples is 10,000 and the initial remaining sampling probability is 1. After a candidate word w0 with occurrence frequency 0.03 has been sampled 200 times, the next candidate word, with occurrence frequency 0.05, is sampled with the remaining number of samples being 9,800 and the remaining sampling probability being 0.97.
- In one embodiment, the remaining number of samples s may be updated to the difference between the original remaining number of samples and the number of times b the current vocabulary is sampled.
- In one embodiment, the remaining sampling probability r is updated to the difference between the original remaining sampling probability and the occurrence frequency p_i corresponding to the current vocabulary.
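In formula form, the two updates of step 26 are

$$s \leftarrow s - b, \qquad r \leftarrow r - p_i$$

consistent with the worked numbers above: 10000 - 200 = 9800 and 1 - 0.03 = 0.97.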
- In the embodiments of the present specification, a predetermined condition related to the number of negative examples in the negative example set can be set in advance. When this condition is met, negative sampling stops; otherwise, the above sampling process is continued for the other candidate words of the word frequency table.
- This detection step may be performed after update step 26, or in parallel with step 26; it can be part of step 26 or a subsequent step 27. The specific implementation of this detection step is described in step 27 below.
- In step 27, it is detected whether the predetermined condition is satisfied. If the predetermined condition is satisfied, the negative sampling process ends; if not, the other candidate words in the word frequency table are sampled using the updated remaining number of samples and remaining sampling probability. An end-to-end sketch of the whole loop follows.
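Putting steps 21 through 27 together, a minimal sketch of the pre-sampling loop; the dict-based word frequency table and the function name are assumptions for illustration, and NumPy supplies the binomial draw:

```python
import numpy as np

def presample_negatives(word_freq, total_negatives, seed=None):
    # word_freq: dict mapping each candidate vocabulary to its occurrence
    # frequency (frequencies sum to 1); total_negatives: initial remaining
    # number of samples S0.
    rng = np.random.default_rng(seed)
    negatives = []
    s = total_negatives                  # remaining number of samples
    r = 1.0                              # remaining sampling probability
    for word, p in word_freq.items():    # steps 21-22: visit each word once
        if s <= 0 or r <= 0:             # step 27: predetermined condition
            break
        P = min(p / r, 1.0)              # step 23: current sampling probability
        b = int(rng.binomial(s, P))      # step 24: Binomial(s, P) draw
        negatives.extend([word] * b)     # step 25: add b copies to the set
        s -= b                           # step 26: update remaining samples
        r -= p                           # step 26: update remaining probability
    return negatives
```

For example, with an assumed three-word table, presample_negatives({"w0": 0.5, "w1": 0.3, "w2": 0.2}, 10000) returns a list containing roughly 5,000, 3,000, and 2,000 copies of the three vocabularies respectively, while calling the binomial sampler only three times.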
- In one embodiment, the predetermined condition may include that the total number of negative examples in the negative example set reaches the initial remaining number of samples, for example a manually set number of 10,000 negative examples.
- In one embodiment, the predetermined condition may include that the remaining number of samples after the update is zero, meaning that no more candidate words need to be collected as negative examples.
- In one embodiment, the predetermined condition may include that the unsampled vocabulary set is empty, meaning that all words in the word frequency table have been sampled.
- According to one embodiment, the negative example set may also be output when the above predetermined condition is satisfied.
- The negative example set can be output locally or to other devices.
- The vocabularies in the negative example set can be arranged in sampling order or arranged randomly, which is not limited in this application.
- When training on the vocabulary of the training corpus, negative examples can be selected from the negative example set. For example, if k negative examples are needed for the training vocabulary u_i in the training corpus, k vocabularies can be taken directly from the negative example set.
- In one embodiment, the vocabularies in the negative example set may correspond to the values in a predetermined interval.
- As shown in FIG. 3, the candidate negative examples in the negative example set 31 correspond one-to-one with the values in the numerical interval 32.
- For example, the positive integers on the interval [1, 10000] can be used, each value corresponding to one negative example vocabulary.
- When a negative example is selected for a training vocabulary, a random number on this predetermined interval is generated; for example, if the random number 5 is generated on the numerical interval 32, the negative example vocabulary w_1 corresponding to the value 5 in the negative example set 31 can be selected.
- As many random numbers are generated as negative examples are needed.
- One random number can be generated at a time to obtain the corresponding negative example, or multiple random numbers can be generated at once to obtain the corresponding negative examples in batches, which is not limited in this application.
- It is worth noting that the obtained negative example may happen to coincide with the training vocabulary itself or with its associated vocabulary.
- The associated vocabulary is, for example, the context words of the training vocabulary in a context prediction model, or its synonyms in a synonym prediction model.
- In such cases, the selected vocabulary cannot be used as a negative example of the training vocabulary. Therefore, when selecting a negative example from the negative example set for the training vocabulary, if the selected vocabulary coincides with the training vocabulary itself or its associated vocabulary, the step of generating a random number on the predetermined interval is re-executed to generate a new random number and obtain the negative example vocabulary corresponding to it, as in the sketch below.
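A minimal sketch of this selection-with-rejection procedure; the list-based negative example set and the excluded-word set are assumptions for illustration (the set is assumed to contain vocabularies outside excluded, so the loop terminates):

```python
import random

def select_negatives(negative_set, k, excluded):
    # Draw k negative examples from the pre-sampled negative example set,
    # regenerating the random number whenever the drawn vocabulary coincides
    # with the training vocabulary or its associated vocabulary (excluded).
    out = []
    while len(out) < k:
        idx = random.randrange(len(negative_set))  # random number on the interval
        word = negative_set[idx]
        if word in excluded:
            continue                               # re-generate the random number
        out.append(word)
    return out
```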
- In another embodiment, k vocabularies may be selected sequentially from a selected position and used as negative examples.
- The selected position may be determined according to a certain rule, or the position corresponding to a generated random number may be used as the selected position; for example, the first vocabulary identical to the training vocabulary is found, and the position of the next vocabulary is used as the selected position.
- Alternatively, as shown in FIG. 4, a random number on the numerical interval 42 can be generated, for example a random number between 1 and 10,000. In this case only one random number needs to be generated, and the amount of computation is small.
- It should be noted that the process shown in FIG. 2 may further include the following steps: detecting whether the update condition of the negative example set is satisfied, and, when the update condition is satisfied, re-executing the method of performing negative sampling from the word frequency table for the training corpus to regenerate the negative example set. It can be understood that when the negative example set contains a very large number of vocabularies, such as several hundred million, the computation is also very large; a smaller negative example set, such as 10 million vocabularies, can therefore be generated at one time, and an update condition for the negative example set can be set (for example, the number of uses reaching 10 million), upon which the negative example set is updated.
- Because the sampling is random, the negative example set generated by each execution of the method of performing negative sampling from the word frequency table for the training corpus may be different.
- Reviewing the above process: because the negative examples are pre-sampled, only the required number of negative examples need to be drawn at use time, without using the occurrence frequencies of the vocabularies, and the computational complexity is greatly reduced.
- In the pre-sampling process, batch sampling is performed: each vocabulary in the word frequency table is sampled only once, while the number of samples drawn for it can be greater than one, which reduces the time of negative sampling and further enables negative sampling to be performed quickly and efficiently.
- Therefore, the process shown in FIG. 2 can improve the effectiveness of negative sampling.
- FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table for a training corpus according to an embodiment. As shown in FIG. 5, the apparatus 500 includes: a first acquiring unit 51 configured to acquire a current vocabulary from the unsampled vocabulary set of the word frequency table, and the occurrence frequency corresponding to the current vocabulary; a second acquiring unit 52 configured to acquire the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set; a first determining unit 53 configured to determine the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability; a second determining unit 54 configured to determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability; an adding unit 55 configured to add the current vocabulary to the negative example set according to the number of times it is sampled; and an updating unit 56 configured to update the remaining number of samples according to the number of times the current vocabulary is sampled and to update the remaining sampling probability according to the occurrence frequency of the current vocabulary, for sampling the other candidate words in the word frequency table, until the number of negative examples in the negative example set is detected to meet a predetermined condition.
- In one embodiment, the first acquiring unit 51 may first obtain a candidate vocabulary from the unsampled vocabulary set of the multiple candidate words in the word frequency table as the current vocabulary, and obtain the occurrence frequency corresponding to the current vocabulary.
- The occurrence frequency may be the frequency with which the current vocabulary appears in the training corpus.
- The second acquiring unit 52 is configured to obtain the remaining number of samples and the remaining sampling probability determined for the unsampled vocabulary set.
- The remaining number of samples can be the number of negative examples still required in the negative example set; in other words, it is the total number of samplings of the unsampled vocabulary in the negative sampling process that generates the negative example set.
- The remaining sampling probability may be the total sampling probability of the unsampled vocabulary during the negative sampling process that generates the negative example set.
- The initial value of the remaining sampling probability r is generally 1.
- The first determining unit 53 may determine the current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability.
- The current sampling probability can be the sampling probability of the current vocabulary in the unsampled vocabulary set.
- In one embodiment, the current sampling probability may be the ratio of the occurrence frequency corresponding to the current vocabulary to the remaining sampling probability.
- Next, the second determining unit 54 may determine the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the remaining number of samples and the current sampling probability.
- As is known, the binomial distribution is the discrete probability distribution of the number of successes in several independent Bernoulli trials. Specifically, in an embodiment, the sampling operation is treated as s trials, and in each trial the probability that the current vocabulary is sampled is the current sampling probability.
- The main function of the second determining unit 54 is thus to determine the number of times b that the current vocabulary w_i is successfully sampled in the s trials.
- In one embodiment, the second determining unit 54 may simulate performing the sampling operation for the remaining number of samples, ensuring in each sampling operation that the probability of the current vocabulary being sampled is the current sampling probability, count the occurrences, and determine that count as the number of times the current vocabulary is sampled.
- In another embodiment, the second determining unit 54 may also randomly obtain a value from the values that satisfy the condition and use it as the number of times the current vocabulary is sampled.
- The condition that the value must satisfy here is that its ratio to the remaining number of samples is consistent with the current sampling probability.
- Next, the adding unit 55 may add the current vocabulary to the negative example set according to the number of times sampled determined by the second determining unit 54: as many times as the current vocabulary was sampled, that many copies of it are added to the negative example set.
- Then, the updating unit 56 updates the remaining number of samples according to the number of times the current vocabulary is sampled, and updates the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary. It can be understood that after each candidate word is sampled, the remaining number of samples decreases by a corresponding amount and the remaining sampling probability also changes; in other words, the sampling conditions change for the next candidate word. In some possible designs, the updating unit 56 may update the remaining number of samples to the difference between the original remaining number of samples and the number of times the current vocabulary was sampled, and update the remaining sampling probability to the difference between the original remaining sampling probability and the occurrence frequency corresponding to the current vocabulary.
- In one embodiment, the device 500 further includes a detection unit 57 configured to detect whether the predetermined condition is satisfied after the updating unit 56 updates the remaining number of samples and the remaining sampling probability, and, if the predetermined condition is not satisfied, to sample the other candidate words in the word frequency table according to the updated remaining number of samples and remaining sampling probability.
- The predetermined condition may include that the total number of negative examples in the negative example set reaches the initial remaining number of samples, or that the updated remaining number of samples is zero, or that the unsampled vocabulary set is empty.
- In one embodiment, the apparatus 500 may further include an output module (not shown) configured to output the negative example set when the number of negative examples in the negative example set meets the predetermined condition.
- The negative example set can be output locally or to other devices.
- In one embodiment, the device 500 may further include a selection unit (not shown) configured to select negative examples from the negative example set for the training vocabularies in the training corpus.
- In one embodiment, the vocabularies in the negative example set may correspond to the values in a predetermined interval. Accordingly, the selection unit may further include: a generation module configured to generate a random number on the predetermined interval, where the generated random number is taken from these values; and an obtaining module configured to obtain from the negative example set the negative example corresponding to the random number.
- As mentioned above, the obtained negative example may coincide with the training vocabulary or its context vocabulary, in which case it cannot serve as a negative example of the training vocabulary. Therefore, the obtaining module may be further configured to: compare whether the obtained negative example is consistent with the training vocabulary; and, if it is consistent, have the generation module generate a random number on the predetermined interval again.
- In one embodiment, the device 500 may further include a detection unit (not shown) configured to detect whether the update condition of the negative example set is satisfied, so that the device 500 regenerates the negative example set when the update condition is satisfied, thereby updating the negative example set.
- Through the apparatus 500 shown in FIG. 5, the effectiveness of negative sampling can be improved.
- According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
- According to an embodiment of yet another aspect, a computing device is further provided, including a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method described in conjunction with FIG. 2.
- Those skilled in the art should be aware that the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in, or transmitted over as one or more instructions or code on, a computer-readable medium.
Abstract
Embodiments of the description provide a method and apparatus for performing, for a training corpus, negative sampling in a word frequency table. The method comprises: first obtaining a candidate word from the word frequency table as the current word, together with the remaining sampling quantity, the remaining sampling probability, and the current sampling probability; determining the number of times the current word is sampled on the basis of the binomial distribution of the current word under the remaining sampling quantity and the current sampling probability; and then adding the current word to a negative example set according to that number of times. Because executing these steps adds the current word to the negative example set as many times as it is sampled, in one pass, the total number of negative sampling operations is reduced, thereby reducing the negative sampling time and enabling quick and effective negative sampling.
Description
本说明书一个或多个实施例涉及计算机技术领域,尤其涉及通过计算机针对训练语料从词频表中进行负例采样的方法和装置。One or more embodiments of the present specification relate to the field of computer technology, and in particular, to a method and device for performing negative sampling from a word frequency table on a training corpus by a computer.
噪声对比估计NCE(Noise Contrastive Estimation)是Word2Vec、Node2Vec等无监督算法中通常用到的损失函数。应用该损失函数时,需要先对训练语料中每一个词汇及其上下文生成一定数量的负例。其中,对训练预料中的某个词汇而言,该词及其上下文词汇之外的任一个词都可以是一个负例。一般地,上述负例根据训练语料的词汇分布随机采样,该词汇分布例如采用词频表近似表示。Noise contrast estimation (NCE) is a loss function commonly used in unsupervised algorithms such as Word2Vec and Node2Vec. When applying this loss function, it is necessary to generate a certain number of negative examples for each word and its context in the training corpus. Among them, for a vocabulary expected in training, any word other than the word and its context vocabulary can be a negative example. Generally, the above negative examples are randomly sampled according to the vocabulary distribution of the training corpus, and the vocabulary distribution is, for example, approximated by a word frequency table.
常规技术中,在使用负例时对相应词汇生成负例。具体地,将词汇分布(如词频表)映射到一个区间上,生成区间内的数值,从而查找相对应的词汇作为负例。在训练预料中的字典中词汇较多(例如数亿级别),需要负例数量较大的情况下,希望能有改进的方案,减少采样的时间,从而能够快速有效地进行负例采样。In the conventional technique, when a negative example is used, a negative example is generated for the corresponding vocabulary. Specifically, a vocabulary distribution (such as a word frequency table) is mapped to an interval, and a value in the interval is generated, so as to find a corresponding vocabulary as a negative example. When the dictionary expected in the training has many words (for example, hundreds of millions of levels), and a large number of negative examples are required, it is hoped that an improved solution can be used to reduce the sampling time, so that negative example sampling can be performed quickly and efficiently.
发明内容Summary of the Invention
本说明书一个或多个实施例描述了一种方法和装置,在训练预料中的字典中词汇较多(例如数亿级别),需要负例数量较大的情况下,可以减少采样的时间,从而能够快速有效地进行负例采样。One or more embodiments of the present specification describe a method and device. In a dictionary expected in training, there are many vocabularies (for example, hundreds of millions of levels), and when the number of negative examples is large, the sampling time can be reduced, thereby Can quickly and efficiently perform negative sampling.
根据第一方面,提供了一种针对训练语料从词频表中进行负例采样的方法,所述词频表包括多个备选词汇和各个备选词汇在所述训练语料中的出现频率,所述方法包括:According to a first aspect, a method for performing negative sampling from a word frequency table for a training corpus is provided. The word frequency table includes multiple candidate words and the frequency of occurrence of each candidate word in the training corpus. Methods include:
从所述多个备选词汇的未采样词汇集合中获取当前词汇,以及当前词汇对应的出现频率;Obtaining a current vocabulary from the unsampled vocabulary set of the plurality of candidate vocabularies, and a frequency of occurrence corresponding to the current vocabulary;
获取针对所述未采样词汇集合确定的剩余采样个数和剩余采样概率;Acquiring the number of remaining samples and the remaining sampling probability determined for the unsampled vocabulary set;
基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前采样概率;Determining a current sampling probability based on a frequency of occurrence corresponding to the current vocabulary and the remaining sampling probability;
根据所述当前词汇在所述剩余采样个数和所述当前采样概率条件下的二项分布,确 定所述当前词汇被采样次数;Determining the number of times the current vocabulary is sampled according to the binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability;
将所述当前词汇按照所述被采样次数添加到负例集中;Adding the current vocabulary to a negative example set according to the number of samples;
根据所述当前词汇被采样次数更新所述剩余采样个数,并根据所述当前词汇对应的出现频率更新所述剩余采样概率,用于对所述词频表中的其他备选词汇进行采样,直到检测到预定条件得到满足。Updating the remaining number of samples according to the number of times the current vocabulary is sampled, and updating the remaining sampling probability according to the frequency of occurrence of the current vocabulary, for sampling other candidate words in the word frequency table until A predetermined condition is detected to be satisfied.
在一个实施例中,基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前采样概率包括:将所述当前采样概率确定为,所述当前词汇对应的出现频率与所述剩余采样概率的比值。In one embodiment, determining the current sampling probability based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability includes: determining the current sampling probability as, the occurrence frequency corresponding to the current vocabulary and the remaining sampling Probability ratio.
根据一个实施例,所述确定所述当前词汇被采样次数包括:According to an embodiment, the determining the number of times the current vocabulary is sampled includes:
模拟执行所述剩余采样个数次的采样操作,其中,在各次采样操作中,所述当前词汇被采样到的概率为所述当前采样概率;Performing the sampling operation for the remaining sampling times by simulation, wherein in each sampling operation, the probability that the current word is sampled is the current sampling probability;
确定所述被采样次数为,所述当前词汇在所述剩余采样个数次的采样操作中被采样到的次数。It is determined that the number of times to be sampled is the number of times that the current word is sampled in the sampling operation of the remaining number of sampling times.
在一个实施方式中,根据所述当前词汇被采样次数更新所述剩余采样个数包括:将剩余采样个数更新为,所述剩余采样个数与所述被采样次数的差。In one embodiment, updating the number of remaining samples according to the number of times the current vocabulary is sampled includes: updating the number of remaining samples to a difference between the number of remaining samples and the number of times of sampling.
进一步地,在一个实施例中,所述预定条件包括:所述负例集中的负例个数达到预设数目;或者更新后的剩余采样个数为零;或者所述未采样词汇集合为空。Further, in one embodiment, the predetermined condition includes: the number of negative examples in the negative example set reaches a preset number; or the number of remaining samples after the update is zero; or the unsampled vocabulary set is empty .
在一个可能的实施例中,所述根据所述当前词汇对应的出现频率更新所述剩余采样概率包括:将剩余采样概率更新为,所述剩余采样概率与所述当前词汇对应的出现频率的差。In a possible embodiment, the updating the remaining sampling probability according to an appearance frequency corresponding to the current vocabulary includes: updating the remaining sampling probability to be a difference between the remaining sampling probability and an appearance frequency corresponding to the current vocabulary. .
根据一种可能的设计,所述方法还包括:在所述负例集中的负例个数满足预定条件的情况下,输出所述负例集。According to a possible design, the method further includes: outputting the negative example set if the number of negative examples in the negative example set meets a predetermined condition.
在一些可能的实施例中,所述方法还包括:针对训练语料中的训练词汇,从所述负例集中选取负例。In some possible embodiments, the method further includes: selecting a negative example from the negative example set for the training vocabulary in the training corpus.
进一步地,在一些实施例中,从所述负例集中选取负例包括:生成预定区间上的随机数,其中,所述预定区间上的各个数值分别与所述负例集中的各个负例相对应,所述随机数取自所述各个数值;从所述负例集中获取与所述随机数相对应的负例。Further, in some embodiments, selecting negative examples from the negative example set includes generating a random number on a predetermined interval, wherein each value on the predetermined interval is related to each negative example in the negative example set. Correspondingly, the random number is taken from the respective numerical values; a negative example corresponding to the random number is obtained from the negative example set.
根据一种实施方式,所述从所述负例集中获取与所述随机数相对应的负例包括:According to an embodiment, the obtaining a negative example corresponding to the random number from the negative example set includes:
比较所获取的负例与所述训练词汇是否一致;在一致的情况下,重新执行所述生成预定区间上的随机数的步骤。Compare whether the obtained negative example is consistent with the training vocabulary; if they are consistent, perform the step of generating a random number on a predetermined interval again.
根据一种可能的设计,所述方法还包括:检测所述负例集的更新条件是否满足;在所述更新条件满足的情况下,重新生成负例集。According to a possible design, the method further includes: detecting whether an update condition of the negative example set is satisfied; and when the update condition is satisfied, regenerating a negative example set.
根据第二方面,提供一种针对训练语料从词频表中进行负例采样的装置,所述词频表包括多个备选词汇和各个备选词汇在所述训练语料中的出现频率,所述装置包括:According to a second aspect, there is provided a device for performing negative sampling on a training corpus from a word frequency table, the word frequency table including a plurality of candidate words and a frequency of occurrence of each candidate word in the training corpus, the device include:
第一获取单元,配置为从所述多个备选词汇的未采样词汇集合中获取当前词汇,以及当前词汇对应的出现频率;A first acquiring unit configured to acquire a current vocabulary from an unsampled vocabulary set of the plurality of candidate vocabularies, and a frequency of occurrence corresponding to the current vocabulary;
第二获取单元,配置为获取针对所述未采样词汇集合确定的剩余采样个数和剩余采样概率;A second obtaining unit configured to obtain a remaining sampling number and a remaining sampling probability determined for the unsampled vocabulary set;
第一确定单元,配置为基于所述当前词汇对应的出现频率和所述剩余采样概率,确定当前词汇对应的当前采样概率;A first determining unit configured to determine a current sampling probability corresponding to the current vocabulary based on the occurrence frequency corresponding to the current vocabulary and the remaining sampling probability;
第二确定单元,配置为根据所述当前词汇在所述剩余采样个数和所述当前采样概率条件下的二项分布,确定所述当前词汇被采样次数;A second determining unit configured to determine the number of times the current vocabulary is sampled according to a binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability;
添加单元,配置为将所述当前词汇按照所述被采样次数添加到负例集中;An adding unit configured to add the current vocabulary to the negative example set according to the number of samples;
更新单元,配置为根据所述当前词汇被采样次数更新所述剩余采样个数,根据所述当前词汇对应的出现频率更新所述剩余采样概率,用于对所述词频表中的其他备选词汇进行采样,直到检测到所述负例集中的负例个数满足预定条件。An update unit is configured to update the remaining sampling number according to the number of times the current vocabulary is sampled, and update the remaining sampling probability according to the occurrence frequency corresponding to the current vocabulary, and is used for other candidate words in the word frequency table. Sampling is performed until the number of negative examples in the negative example set is detected to meet a predetermined condition.
根据第三方面,提供了一种计算机可读存储介质,其上存储有计算机程序,当所述计算机程序在计算机中执行时,令计算机执行第一方面的方法。According to a third aspect, there is provided a computer-readable storage medium having stored thereon a computer program that, when executed on a computer, causes the computer to execute the method of the first aspect.
根据第四方面,提供了一种计算设备,包括存储器和处理器,其特征在于,所述存储器中存储有可执行代码,所述处理器执行所述可执行代码时,实现第一方面的方法。According to a fourth aspect, there is provided a computing device including a memory and a processor, wherein the memory stores executable code, and the processor implements the method of the first aspect when the processor executes the executable code. .
通过本说明书实施例提供的方法和装置,针对训练语料从词频表中进行负例采样时,对于一个从词频表获取一个备选词汇作为当前词汇,并获取剩余采样个数和剩余采样概率,基于当前词汇在剩余采样个数和当前采样概率条件下的二项分布,确定当前词汇被采样次数,然后将当前词汇按照被采样次数添加到负例集中。由于对一个当前词汇执行以上步骤的情况下,可以添加被采样次数的当前词汇到负例集,使总体的负例采样次数减少,从而减少负例采样的时间,进而能够快速有效地进行负例采样。According to the method and device provided by the embodiments of the present specification, when a negative sample is taken from the word frequency table for the training corpus, one candidate word is obtained from the word frequency table as the current word, and the number of remaining samples and the remaining sampling probability are obtained based on The binomial distribution of the current vocabulary under the conditions of the number of remaining samples and the current sampling probability, determine the number of times the current vocabulary has been sampled, and then add the current vocabulary to the negative set according to the number of samples. Since the above steps are performed on a current vocabulary, the current vocabulary that has been sampled can be added to the negative example set, so that the overall number of negative samples is reduced, thereby reducing the time of negative sampling, and thus the negative examples can be performed quickly and efficiently sampling.
为了更清楚地说明本发明实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are just some embodiments of the present invention. Those of ordinary skill in the art can also obtain other drawings according to these drawings without paying creative labor.
图1示出本说明书披露的一个实施例的实施场景示意图;FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification; FIG.
图2示出根据一个实施例的针对训练语料从词频表中进行负例采样的方法流程图;2 shows a flowchart of a method for performing negative sampling from a word frequency table on a training corpus according to an embodiment;
图3示出从负例集中选取负例的一个具体例子;FIG. 3 shows a specific example of selecting a negative example from a negative example set;
图4示出从负例集中选取负例的另一个具体例子;4 shows another specific example of selecting a negative example from a negative example set;
图5示出根据一个实施例的用于针对训练语料从词频表中进行负例采样的装置的示意性框图。FIG. 5 shows a schematic block diagram of an apparatus for performing negative sampling from a word frequency table on a training corpus according to an embodiment.
下面结合附图,对本说明书提供的方案进行描述。The solutions provided in this specification are described below with reference to the drawings.
图1为本说明书披露的一个实施例的实施场景示意图。在一个无监督模型(例如Word2Vec、Node2Vec)训练过程中,损失函数可以为噪声对比估计NCE,表达式如下:FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. During the training of an unsupervised model (such as Word2Vec, Node2Vec), the loss function can be used to estimate the NCE for noise contrast. The expression is as follows:
where V denotes the dictionary; $w_i$ denotes the i-th training word; $c_i$ denotes the context word adjacent to the i-th word; k denotes the number of negative examples corresponding to $w_i$; $w_{ij}$ denotes the j-th negative example of $w_i$; and $c_j$ denotes the context word adjacent to that j-th negative example.
It can be seen from the above formula that, during corpus training, for each training word $w_i$, k random samples need to be drawn from its probability distribution over the dictionary to obtain k negative examples.
The multiple words in the dictionary and the frequency of occurrence of each word in the training corpus are usually represented by a word frequency table. The word frequency table corresponding to the dictionary V is often projected onto an interval [0, 1], where the length of each segment is proportional to the occurrence frequency of the corresponding word. Further, in one negative sampling approach, the interval segment corresponding to each word is divided into multiple "cells" according to a minimum frequency unit, and the number of each cell is recorded as an index. The higher the occurrence frequency of a word, the longer its segment and the more cells it contains. Each time a negative example is sampled, a random number over the index range is generated, and the word whose index equals that random number is taken as the negative example. In practice, the more indexes there are, the more accurately the dictionary's word frequency table is approximated. For example, since each index corresponds to one cell, in order to ensure that every word has a corresponding index, the word with the lowest occurrence frequency corresponds to at least one index, while other words may correspond to multiple indexes: if word 1 has an occurrence frequency of 0.03 and word 2 has an occurrence frequency of 0.001, then word 2 may correspond to 1 index while word 1 corresponds to 30 indexes. When the dictionary V contains a large vocabulary (e.g., hundreds of millions of words), the number of indexes is even larger, requiring substantial storage space, possibly even storage on a remote server, with extra communication time spent each time a negative example is fetched.
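As an illustration of this conventional approach (a minimal sketch, not the method of this specification; the function name build_index_table and the dict representation are assumptions):

```python
import random

def build_index_table(word_freq, min_freq):
    """Expand each word into 'cells' in proportion to its occurrence frequency.

    word_freq: dict mapping word -> occurrence frequency.
    min_freq: the smallest frequency in the table, used as the cell size.
    """
    table = []
    for word, freq in word_freq.items():
        cells = max(1, round(freq / min_freq))  # at least one index per word
        table.extend([word] * cells)
    return table

table = build_index_table({"word1": 0.03, "word2": 0.001, "word3": 0.969},
                          min_freq=0.001)
negative = table[random.randrange(len(table))]  # one random draw per negative
```

The memory cost named in the paragraph above is visible here: the table holds one entry per cell, so its size grows with the inverse of the smallest frequency.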
As shown in FIG. 1, an embodiment of this specification provides a solution in which negative examples are first pre-sampled from the word frequency table and the sampled words are added to a negative example set. During pre-sampling, batch sampling is performed: each word in the word frequency table is sampled only once, but the number of samples drawn for it may be greater than one, and the final sample count of each word is kept consistent with its occurrence frequency in the word frequency table. As shown in FIG. 1, word w1 in the word frequency table is sampled s1 times, word w2 is sampled s2 times, word w3 is sampled s3 times, and so on. In this way, the number of sampling operations during pre-sampling is reduced, while the sample count of each word in the negative example set remains consistent with its occurrence frequency in the word frequency table.
During word training, whenever negative examples are needed, the corresponding number of negative examples is simply drawn at random from the negative example set. As shown in FIG. 1, k1 negative examples are randomly taken from the negative example set for training word u1, k2 negative examples for training word u2, k3 negative examples for training word u3, and so on. Because the negative example set consists of pre-sampled negative examples in which the sample count of each word is consistent with its occurrence frequency in the word frequency table, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary; the sampling probability of each negative example is still guaranteed to be consistent with the occurrence frequency of the corresponding word in the word frequency table. The computational complexity is thus greatly reduced. Moreover, the pre-sampled negative example set can be reused multiple times, further improving the efficiency of negative sampling in model training.
It can be understood that the computing platform in FIG. 1 may be any of various apparatuses and devices with a certain computing capability, such as a desktop computer or a server. The computing platform may also be a cluster composed of such apparatuses and devices. Where the computing platform consists of multiple devices or apparatuses, according to one implementation, some of the devices or apparatuses may perform the negative sampling operation to generate the negative example set, while others obtain the negative example set and randomly draw negative examples from it when training words.
The specific process of performing negative sampling from the word frequency table for a training corpus is described below.
FIG. 2 shows a flowchart of a method for performing negative sampling from a word frequency table for a training corpus, according to an embodiment of this specification. The method is executed, for example, by the computing platform of FIG. 1. As shown in FIG. 2, the method includes the following steps. Step 21: obtain a current word and its corresponding occurrence frequency from the unsampled word set of the word frequency table. Step 22: obtain the remaining sample count and the remaining sampling probability determined for the unsampled word set. Step 23: determine the current sampling probability of the current word based on its occurrence frequency and the remaining sampling probability. Step 24: determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability. Step 25: add the current word to the negative example set as many times as it was sampled. Step 26: update the remaining sample count according to the number of times the current word was sampled, and update the remaining sampling probability according to the occurrence frequency of the current word, for use in sampling the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied. The specific execution of each step is described below.
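Before walking through the steps one by one, the loop as a whole can be sketched as follows (a minimal illustration; the function names presample_negatives and binomial_draw, and the list-of-pairs representation of the word frequency table, are assumptions of this sketch, not part of the original disclosure):

```python
import random

def binomial_draw(n, p):
    """Simulate n Bernoulli trials with success probability p (step 24)."""
    return sum(random.random() < p for _ in range(n))

def presample_negatives(word_freq, total_negatives):
    """Pre-sample a negative example set in one pass over the word frequency table.

    word_freq: list of (word, frequency) pairs whose frequencies sum to 1.
    total_negatives: S0, the total number of negative examples required.
    """
    negatives = []
    s = total_negatives            # remaining sample count
    r = 1.0                        # remaining sampling probability
    for word, p_i in word_freq:    # step 21: next word from the unsampled set
        if s <= 0:                 # predetermined condition: nothing left to draw
            break
        P = p_i / r                # step 23: current sampling probability
        b = binomial_draw(s, P)    # step 24: draw from Binomial(s, P)
        negatives.extend([word] * b)   # step 25: add the word b times
        s -= b                     # step 26: update remaining count ...
        r -= p_i                   #          ... and remaining probability
    return negatives

negatives = presample_negatives([("apple", 0.5), ("pear", 0.3), ("plum", 0.2)],
                                10000)
```

Here binomial_draw simulates the Bernoulli trials explicitly; step 24 below also discusses equivalent ways to obtain this draw.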
First, in step 21, the current word and its corresponding occurrence frequency are obtained from the unsampled word set of the word frequency table. It can be understood that the word frequency table may include multiple candidate words and the occurrence frequency of each candidate word in the training corpus. The multiple candidate words may include all words appearing in the training corpus. The word frequency table may take various forms such as a table, a vector, an array, or key-value pairs; this specification does not limit it in this regard.
Each candidate word appears a different number of times in the training corpus; the word frequency table can thus also measure the weight of each word in the training corpus through its occurrence frequency. The occurrence frequency of a candidate word may be the ratio of the total number of occurrences of that word in the training corpus to the total number of words in the training corpus, where repeated words are not merged when computing the total: each occurrence of every word increments the total word count by 1.
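A word frequency table of this kind can be built, for example, as follows (an illustrative sketch only; the toy corpus and the dict format are placeholders):

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]
counts = Counter(corpus)                    # occurrences of each word
total = len(corpus)                         # repeats are not merged
word_freq = {w: c / total for w, c in counts.items()}
# word_freq["the"] == 2 / 6
```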
As mentioned above, according to the method of the embodiments of this specification, the candidate words in the word frequency table can be batch-sampled one by one. The word frequency table can therefore be divided into a sampled word set and an unsampled word set, containing the sampled candidate words and the unsampled candidate words respectively. In step 21, the current word and its corresponding occurrence frequency are obtained from the unsampled word set, to be used for sampling the current word next.
In one embodiment, candidate words may be obtained as the current word in the order of their storage addresses in the word frequency table. Taking words in this order guarantees that no word is fetched and sampled twice, i.e., each time the current word is obtained from the unsampled word set. For example, the storage address of the word frequency table is obtained, and a candidate word is fetched according to its offset relative to that storage address; the storage address of the word frequency table plus the offset is then the storage address of each candidate word. If the offsets range over [0000-FFFF], the candidate word at offset 0000 may be obtained first as the current word, the candidate word at offset 0001 in the next round of the flow, and so on. Optionally, a candidate word and its occurrence frequency may be stored in the storage unit of the same storage address, in which case the current word and its occurrence frequency can be obtained at the same time. Alternatively, a candidate word and its occurrence frequency may be stored in different storage units, in which case the associated occurrence frequency can be obtained according to the storage address of the candidate word.
In another embodiment, candidate words may be obtained as the current word in the order in which they are arranged in the word frequency table, again ensuring that the current word is always taken from the unsampled word set. For example, where the word frequency table is a table, candidate words are fetched row by row: the candidate word in the first row is obtained in the first round of the flow, the candidate word in the second row in the second round, and so on. Where the table has multiple columns, a candidate word may also be obtained in the order first column first row, first column second row, and so forth.
In step 22, the remaining sample count s and the remaining sampling probability r determined for the unsampled word set are obtained.
The remaining sample count s may be the number of negative examples still needed in the negative example set, which is also the total number of times that all unsampled words in the unsampled word set still need to be sampled.
Initially, the remaining sample count s is the total number of negative examples S0 required for the entire negative example set. In one embodiment, the number S0 of negative examples required for the entire negative example set may be computed from the number of words in the training corpus, or may be set manually; this application is not limited in this regard. For example, in the aforementioned loss function, k negative examples are needed for each training word; assuming the training corpus contains n words, the number of negative examples may be set to S0 = n*k. In another embodiment, the initially required number S0 of negative examples may be set to a predetermined proportion of the number of words in the training corpus, and so on.
After this initial setting, each time a candidate word has been sampled, the remaining sample count is updated, i.e., decreased by the corresponding number. For example, if the negative example set is manually set to require 10,000 negative examples, and candidate word $w_0$ has been sampled 5 times, the total number of samples still required for the remaining words is 10000 - 5 = 9995.
The remaining sampling probability r may be the total sampling probability of all unsampled words in the negative sampling process of generating the negative example set. As an example, suppose the candidate words in the word frequency table include $w_0, w_1, w_2, \ldots$ with corresponding occurrence frequencies $p_0, p_1, p_2, \ldots$; the remaining sampling probability r then denotes the total sampling probability of the unsampled words. Initially, no candidate word has been sampled, so r equals the total sampling probability of all candidate words in the word frequency table; the initial value of r is therefore 1.
It can be understood that, in order to ensure that the proportion of each negative example in the final negative example set is consistent with the occurrence frequency of the corresponding candidate word, the remaining sampling probability is also updated each time a candidate word has been sampled. For example, after the first candidate word $w_0$ has been sampled, the remaining sampling probability is updated to $r' = r - p_0 = 1 - p_0$; by analogy, after the second candidate word $w_1$ has been sampled, it is updated to $r'' = r' - p_1 = 1 - p_0 - p_1$; and so on.
Therefore, if the current word $w_i$ is the first word in the word frequency table, then in step 22, the initial value S0 of the number of negative examples required by the negative example set is obtained as the remaining sample count s, and the initial value r = 1 is obtained as the remaining sampling probability r. If the current word $w_i$ is not the first word in the table, then in step 22, the remaining sample count s and remaining sampling probability r as updated after sampling the previous word $w_{i-1}$ are read.
Step 23: determine the current sampling probability P of the current word based on its occurrence frequency $p_i$ and the remaining sampling probability r. The current sampling probability P may be the sampling probability of the current word within the entire unsampled set.
It can be understood that in this embodiment the candidate words are batch-sampled; in other words, the required number of copies of a given candidate word is collected in one go. Once a candidate word has been sampled, it is added to the sampled word set, and the probability of it being sampled thereafter is 0. The subsequent sampling process therefore does not need to consider the already sampled candidate words, and proceeds within the unsampled word set. Since the current word has not yet been sampled, the unsampled word set includes the current word.
Still referring to the above example, it is easy to see that candidate words $w_0, w_1, w_2, \ldots$ have occurrence frequencies $p_0, p_1, p_2, \ldots$ respectively. When the first candidate word $w_0$ is sampled, its sampling probability is $p_0$, and the total sampling probability of the remaining candidate words (the unsampled word set) is $r = 1 - p_0 = p_1 + p_2 + \cdots$. The second candidate word $w_1$ has occurrence frequency $p_1$, so its sampling probability within the remaining candidate words (the unsampled word set) is $p_1 / (p_1 + p_2 + \cdots) = p_1 / (1 - p_0)$. By analogy, for the current word $w_i$, the current sampling probability can be expressed as $P = p_i / r$, i.e., the ratio of the occurrence frequency $p_i$ of the current word to the remaining sampling probability r.
Step 24: determine the number of times b that the current word $w_i$ is sampled, according to a binomial distribution of the current word under the remaining sample count s and the current sampling probability P. It can be understood that each candidate word in the word frequency table corresponds to a sample count; for example, as shown in FIG. 1, word w1 is sampled s1 times, word w2 is sampled s2 times, word w3 is sampled s3 times, and so on, to complete the batch sampling of the candidate words. Optionally, when a candidate word has a small occurrence frequency, its sample count may be 0.
In one embodiment, the binomial distribution is used to determine the sample count. The binomial distribution is the discrete probability distribution of the number of successes in a number of independent Bernoulli trials. In each trial, only one of two possible outcomes can occur, and the outcomes of the trials are mutually independent. The probability of each outcome remains the same across the independent trials. When the number of trials is 1, the binomial distribution reduces to the 0-1 (Bernoulli) distribution: a given outcome either occurs (success) or does not.
Let ξ denote the outcome of a random trial. If an event occurs with probability p, the probability of non-occurrence is q = 1 - p, and the probability P that the event occurs k times in n independent repeated trials is:
$$P(\xi = k) = C(n, k) \times p^{k} \times (1-p)^{n-k};$$

where $C(n, k) = \dfrac{n!}{k!\,(n-k)!}$.
This is the binomial probability of the event occurring k times, under n trials with probability p.
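As a quick numerical check (an illustrative snippet that merely evaluates the formula above):

```python
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom_pmf(10, 3, 0.5))  # 0.1171875
```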
Specifically, in step 24, in one embodiment, a binomial distribution function Binomial(s, P) is called to determine the sample count b of the current word. As can be seen, the parameters of this binomial distribution function are the remaining sample count s and the current sampling probability P, and the function represents the number of times $w_i$ is drawn under the condition that, in s sampling trials, the probability of drawing the current word $w_i$ in each trial is P.
Execution of the above binomial distribution function may include simulating s sampling operations (Bernoulli trials), equivalent to s sampling trials performed over the remaining candidate words. In each sampling operation, the probability that the current word $w_i$ is drawn (a successful trial) is the current sampling probability P. The number of times the current word is drawn is counted, and the sample count b of the current word is determined as the number of times it is drawn across the s sampling operations.
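In practice this draw can also be delegated to a standard library routine, for example NumPy's binomial sampler, which is statistically equivalent to simulating the s Bernoulli trials explicitly (a sketch; the original does not prescribe a particular library):

```python
import numpy as np

s, P = 9995, 0.03
b = np.random.binomial(s, P)                  # one draw of Binomial(s, P)
b_sim = int(np.sum(np.random.random(s) < P))  # equivalent explicit simulation
```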
In another embodiment, a value may instead be obtained at random from among the values satisfying the binomial sampling condition, and used as the sample count of the current word. It can be understood that, according to the meaning of the binomial distribution, if the current word "wealth" is ultimately sampled b times, the condition on the value b may be that its ratio to the remaining sample count should be consistent with the current sampling probability. For example, if the remaining sample count s is 8000 and the current sampling probability P is 0.03, then b/8000 may round to 0.03 whenever b is in the range 200-272. Thus, a random number between 200 and 272 may be taken as the sample count of the current word "wealth".
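This alternative can be sketched as follows (illustrative only; the rounding granularity ndigits and the exhaustive scan over candidate values are assumptions of the sketch):

```python
import random

def draw_count(s, P, ndigits=2):
    """Pick b uniformly among the values whose ratio b/s rounds to P."""
    candidates = [b for b in range(s + 1)
                  if round(b / s, ndigits) == round(P, ndigits)]
    return random.choice(candidates)

b = draw_count(8000, 0.03)  # a value whose ratio to 8000 rounds to 0.03
```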
In step 25, the current word $w_i$ is added to the negative example set according to the sample count b determined above: as many copies of the current word are added to the negative example set as the sample count b determined in step 24. In the above example, where b takes the value 232, 232 copies of the current word "wealth" are added to the negative example set.
Step 26: update the remaining sample count s according to the sample count b of the current word, and update the remaining sampling probability r according to the occurrence frequency $p_i$ of the current word. The updated remaining sample count s and remaining sampling probability r can be used to sample the other candidate words in the word frequency table. For example, for the next candidate word, the remaining sample count and remaining sampling probability obtained in step 22 are those updated in this step.
It can be understood that after each candidate word has been sampled, it can be moved from the unsampled word set to the sampled word set. Accordingly, the remaining sample count s kept for the unsampled word set decreases by the corresponding amount, and the remaining sampling probability r changes as well; in other words, the sampling conditions change for the next candidate word. For example, if the negative example set requires 10,000 negative examples, the initial remaining negative example count is 10000 and the initial remaining sampling probability is 1; after a candidate word w0 with occurrence frequency 0.03 has been sampled 200 times, the next candidate word, with occurrence frequency 0.05, is sampled under a remaining negative example count of 9800 and a remaining sampling probability of 0.97.
In one embodiment, after the current word has been sampled, the remaining sample count s may be updated to the difference between the previous remaining sample count and the sample count b of the current word. The implementation logic is: s = s - b, where s is the remaining sample count and b is the sample count of the current word $w_i$.
In one embodiment, the remaining sampling probability r is updated to the difference between the previous remaining sampling probability and the occurrence frequency $p_i$ of the current word. The implementation logic is: r = r - $p_i$, where r is the remaining sampling probability and $p_i$ is the occurrence frequency of the current word $w_i$.
It is worth noting that, since the number of negative examples required in the negative example set is finite, a predetermined condition related to that number may be set in advance; when the condition is met, negative sampling stops, and otherwise the above sampling flow continues for the other candidate words in the word frequency table. This detection step may be performed after update step 26 or in parallel with step 26; it may be part of step 26 or a subsequent step 27. Its specific implementation is described in detail below as a subsequent step 27.
In step 27, it is detected whether the predetermined condition is satisfied. If the predetermined condition is satisfied, the negative sampling flow ends; if not, the other candidate words in the word frequency table are sampled according to the updated remaining sample count and remaining sampling probability.
In one embodiment, the predetermined condition may include the total number of negative examples in the negative example set reaching the initial remaining sample count, such as a manually set number of 10,000 negative examples.
In another embodiment, the predetermined condition may include the updated remaining sample count being 0, meaning that no further candidate words need to be collected as negative examples.
In yet another embodiment, the predetermined condition may include the unsampled word set being empty, meaning that all words in the word frequency table have been sampled.
According to an embodiment of another aspect, when the above predetermined condition is satisfied, the negative example set may also be output. The negative example set may be output locally or to other devices. The words in the negative example set may be arranged in sampling order or in randomly shuffled order; this application is not limited in this regard.
In a further embodiment, for a training word in the training corpus, negative examples can be selected from this negative example set. For example, when k negative examples are needed for a training word $U_i$ in the training corpus, k words can be taken directly from the negative example set.
According to an embodiment of one aspect, the words in the negative example set may correspond to values on a predetermined interval. As shown in FIG. 3, each candidate negative example in negative example set 31 corresponds one-to-one to a value in numeric interval 32. If negative example set 31 contains 10,000 pre-sampled negative example words, the positive integers on the interval [1, 10000] may be chosen, with each value corresponding to one negative example word. When selecting a negative example for a training word, a random number on this predetermined interval is generated; for example, for the random number 5 on numeric interval 32, the negative example word $w_1$ corresponding to the value 5 in negative example set 31 is selected. In practice, as many random numbers are generated as negative examples are needed. Random numbers may be generated one at a time, each fetching a corresponding negative example, or multiple random numbers may be generated at once to fetch the corresponding negative examples in a batch; this application is not limited in this regard.
It can be understood that, with a very small probability, an obtained negative example may coincide with the training word itself or with a word associated with it; an associated word is, for example, the context of the training word in a context prediction model, or a synonym of the training word in a synonym prediction model. In such a case, the word selected from the negative example set cannot serve as a negative example of that training word. Therefore, when selecting negative examples from the negative example set for a training word, if the selected word coincides with the training word itself or an associated word, the step of generating a random number on the predetermined interval is re-executed to produce a new random number, and the negative example word corresponding to the new random number is obtained.
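This selection-with-rejection procedure can be sketched as follows (illustrative only; the function name pick_negatives, the list representation of the negative example set, and the excluded set covering the training word and its associated words are assumptions):

```python
import random

def pick_negatives(negative_set, k, excluded):
    """Draw k negatives by random index, redrawing whenever the draw hits
    the training word itself or one of its associated words."""
    negatives = []
    while len(negatives) < k:
        candidate = negative_set[random.randrange(len(negative_set))]
        if candidate not in excluded:   # otherwise generate a new random number
            negatives.append(candidate)
    return negatives

negatives = pick_negatives(["cat", "dog", "mat", "sat"] * 2500, k=3,
                           excluded={"cat"})
```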
According to an embodiment of another aspect, when the words in the negative example set are arranged in randomly shuffled order, k words may also be selected in sequence starting from a selected position, to serve as negative examples. The selected position may be determined according to a certain rule, or the position corresponding to a generated random number may be taken as the selected position. For example: the first word identical to the training word is located, and the position of the next word is taken as the selected position. As another example, in the predetermined-interval example above, a random number between 1 and 10000 is generated; in this case only one random number needs to be generated, so the computation cost is small. As shown in FIG. 4, for negative example set 41, when 7 negative examples need to be taken for a training word, a random number on numeric interval 42 can be generated, for example the value 5; the position corresponding to the value 5 is then taken as the selected position, and starting from it the 7 candidate negative examples $w_3, w_9, w_3, w_7, w_6, w_4, w_8$ on interval 43 of negative example set 41 are obtained as the negative examples of that training word.
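A sketch of this contiguous-window variant (illustrative; wrap-around at the end of the set is an assumption the original does not spell out):

```python
import random

def pick_window(negative_set, k):
    """Take k consecutive negatives starting from one random position."""
    start = random.randrange(len(negative_set))  # the single random number
    # wrap around so a start near the end still yields k items (assumed)
    return [negative_set[(start + j) % len(negative_set)] for j in range(k)]

negatives = pick_window(["w3", "w9", "w3", "w7", "w6", "w4", "w8", "w2"], k=7)
```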
In this way, the process of obtaining negative examples for the training words of the training corpus is greatly simplified, and the speed of acquisition is improved.
In some possible designs, the flow shown in FIG. 2 may further include the following steps: detecting whether an update condition of the negative example set is satisfied; and, when the update condition is satisfied, re-executing the method of performing negative sampling from the word frequency table for the training corpus, so as to regenerate the negative example set. It can be understood that when the required negative example set contains a large number of words, for example several hundred million, the computation is also very heavy; a smaller negative example set, for example ten million, may therefore be generated at a time, and an update condition (for example, ten million uses) may be set for the negative example set, upon which it is updated. Since, during execution of the above method, obtaining the sample count for each candidate word involves simulating s sampling operations (Bernoulli trials), s being the remaining sample count, or randomly taking a value from those satisfying the condition, and so on, the negative example set generated by each re-execution of the method may differ.
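One way to realize such a refresh policy is sketched below (illustrative only; the class name NegativePool and the fixed use-count threshold are assumptions, and presample_negatives refers to the sketch given earlier):

```python
import random

class NegativePool:
    """Regenerates the pre-sampled pool after a fixed number of uses."""

    def __init__(self, word_freq, size, max_uses=10_000_000):
        self.word_freq, self.size, self.max_uses = word_freq, size, max_uses
        self.uses = 0
        self.pool = presample_negatives(word_freq, size)  # sketch from above

    def draw(self, k):
        self.uses += k
        if self.uses > self.max_uses:   # update condition met: regenerate
            self.pool = presample_negatives(self.word_freq, self.size)
            self.uses = k
        return [self.pool[random.randrange(len(self.pool))] for _ in range(k)]
```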
Reviewing the above process: on the one hand, because the negative example set consists of pre-sampled negative examples, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary, so the computational complexity is greatly reduced. On the other hand, during pre-sampling, batch sampling is performed and each word in the word frequency table is sampled only once, while the number of samples drawn may be greater than one, which reduces the time spent on negative sampling and allows it to be performed quickly and efficiently. In short, the flow shown in FIG. 2 can improve the efficiency of negative sampling.
According to an embodiment of another aspect, an apparatus for performing negative sampling from a word frequency table for a training corpus is also provided. FIG. 5 shows a schematic block diagram of such an apparatus according to an embodiment. As shown in FIG. 5, the apparatus 500 for performing negative sampling from a word frequency table for a training corpus includes: a first acquiring unit 51, configured to obtain a current word and its corresponding occurrence frequency from the unsampled word set of the word frequency table; a second acquiring unit 52, configured to obtain the remaining sample count and remaining sampling probability determined for the unsampled word set; a first determining unit 53, configured to determine the current sampling probability based on the occurrence frequency of the current word and the remaining sampling probability; a second determining unit 54, configured to determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability; an adding unit 55, configured to add the current word to the negative example set as many times as it was sampled; and an updating unit 56, configured to update the remaining sample count according to the number of times the current word was sampled and update the remaining sampling probability according to the occurrence frequency of the current word, for sampling the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
The first acquiring unit 51 may first obtain one candidate word as the current word from the unsampled word set of the multiple candidate words in the word frequency table, and obtain the occurrence frequency corresponding to that current word, namely its occurrence frequency in the training corpus.
The second acquiring unit 52 is configured to obtain the remaining sample count and remaining sampling probability determined for the unsampled word set. The remaining sample count may be the number of negative examples still needed in the negative example set; in other words, the total number of times the unsampled words are to be sampled in the negative sampling process of generating the negative example set. The remaining sampling probability may be the total sampling probability of the unsampled words in that process; its initial value r is generally 1.
The first determining unit 53 may determine the current sampling probability of the current word based on its occurrence frequency and the remaining sampling probability. The current sampling probability may be the sampling probability of the current word within the unsampled word set. In an optional embodiment, the current sampling probability may be the ratio of the occurrence frequency of the current word to the remaining sampling probability.
The second determining unit 54 may determine the number of times the current word is sampled according to a binomial distribution of the current word under the remaining sample count and the current sampling probability. The binomial distribution is the discrete probability distribution of the number of successes in a number of independent Bernoulli trials. In a specific embodiment, one trial is carried out for each negative example still to be sampled, and in each trial the probability that the current word is drawn is the current sampling probability. The main role of the second determining unit 54 is to determine the number of times b that the i-th word is successfully drawn in the s trials.
According to an embodiment of another aspect, the second determining unit 54 may simulate execution of the sampling operation as many times as the remaining sample count, ensuring that in each sampling operation the probability that the current word is drawn is the current sampling probability; it counts the number of times the current word is drawn, and determines the sample count of the current word as that number.
According to an embodiment of another aspect, the second determining unit 54 may also randomly take a value from those satisfying the condition, as the sample count of the current word. The condition on the value may be that its ratio to the remaining sample count should be consistent with the current sampling probability.
The adding unit 55 may add the current word to the negative example set according to the sample count determined by the second determining unit 54: as many copies of the current word are added to the negative example set as the sample count indicates.
The updating unit 56 updates the remaining sample count according to the number of times the current word was sampled, and updates the remaining sampling probability according to the occurrence frequency of the current word. It can be understood that after each candidate word has been sampled, the remaining sample count decreases by the corresponding amount and the remaining sampling probability changes; in other words, the sampling conditions change for the next candidate word. In some possible designs, the updating unit 56 may update the remaining sample count to the difference between the previous remaining sample count and the number of times the current word was sampled, and update the remaining sampling probability to the difference between the previous remaining sampling probability and the occurrence frequency of the current word.
On the other hand, since the number of negative examples required in the negative example set is finite, a predetermined condition may be set in advance; when it is met, negative sampling stops, and otherwise the sampling flow continues for the other candidate words in the word frequency table. This detection function may be implemented by the updating unit 56 or by an independent detection unit. Accordingly, in some embodiments, the apparatus 500 further includes a detection unit 57, configured to detect, after the updating unit 56 has updated the remaining sample count and remaining sampling probability, whether the predetermined condition is satisfied, and, if not, to have the other candidate words in the word frequency table sampled according to the updated remaining sample count and remaining sampling probability. Here, the predetermined condition may include the total number of negative examples in the negative example set reaching the initial remaining sample count, the updated remaining sample count being 0, or the unsampled word set being empty.
In some possible designs, the apparatus 500 may further include:
an output module (not shown), configured to output the negative example set when the number of negative examples in the negative example set satisfies the predetermined condition. The negative example set may be output locally or to other devices. In a further embodiment, the apparatus 500 may also include a selection unit (not shown), configured to select, for a training word in the training corpus, negative examples from the negative example set.
According to an embodiment of one aspect, the words in the negative example set may correspond to values on a predetermined interval, and the selection unit may further include: a generating module, configured to generate a random number on the predetermined interval, the generated random number being taken from the aforementioned values; and an acquiring module, configured to obtain, from the negative example set, the negative example corresponding to the random number.
In some implementations, the obtained negative example may coincide with the training word or its context word, in which case it cannot serve as a negative example of that training word. The acquiring module may therefore be further configured to: compare whether the obtained negative example coincides with the training word, and, if so, have the generating module regenerate a random number on the predetermined interval.
According to a possible design, the apparatus 500 may further include a detection unit (not shown), configured to detect whether an update condition of the negative example set is satisfied, so that the apparatus 500 regenerates the negative example set when the update condition is satisfied, thereby updating the negative example set.
With the above apparatus, on the one hand, a pre-sampled negative example set can be generated; since it consists of pre-sampled negative examples, at use time it suffices to randomly take out the required number of negative examples, without considering the occurrence frequency of each word in the vocabulary, so the computational complexity is greatly reduced. On the other hand, batch sampling can be performed during pre-sampling, with each word in the word frequency table sampled only once while the number of samples drawn may be greater than one, which reduces the time spent on negative sampling and allows it to be performed quickly and efficiently. In short, the apparatus 500 shown in FIG. 5 can improve the efficiency of negative sampling.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in connection with FIG. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, the memory storing executable code, and the processor, when executing the executable code, implementing the method described in connection with FIG. 2.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.
Claims (24)
- A method for performing negative sampling from a word frequency table for a training corpus, the word frequency table including a plurality of candidate words and an occurrence frequency of each candidate word in the training corpus, the method comprising: obtaining a current word and an occurrence frequency pi corresponding to the current word from an unsampled word set of the plurality of candidate words; obtaining a remaining sample count s and a remaining sampling probability r determined for the unsampled word set; determining a current sampling probability P corresponding to the current word based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r; determining a number of times b that the current word is sampled according to a binomial distribution of the current word under the remaining sample count s and the current sampling probability P; adding the current word to a negative example set the number of times b that it was sampled; and updating the remaining sample count s according to the number of times b, and updating the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word, for sampling other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
- The method according to claim 1, wherein determining the current sampling probability P based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r comprises: determining the current sampling probability P as the ratio of the occurrence frequency pi corresponding to the current word to the remaining sampling probability r.
- The method according to claim 1, wherein determining the number of times b that the current word is sampled comprises: simulating execution of s sampling operations, s being the remaining sample count, wherein in each sampling operation the probability that the current word is drawn is the current sampling probability P; and determining the number of times b as the number of times the current word is drawn in the s sampling operations.
- The method according to claim 1, wherein updating the remaining sample count s according to the number of times b that the current word is sampled comprises: updating the remaining sample count s to the difference between the remaining sample count s and the number of times b.
- The method according to claim 1, wherein the predetermined condition comprises: the number of negative examples in the negative example set reaching a preset number; or the updated remaining sample count s being zero; or the unsampled word set being empty.
- The method according to claim 1, wherein updating the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word comprises: updating the remaining sampling probability r to the difference between the remaining sampling probability r and the occurrence frequency pi corresponding to the current word.
- The method according to claim 1, further comprising: outputting the negative example set when the predetermined condition is satisfied.
- The method according to claim 7, further comprising: selecting, for a training word in the training corpus, negative examples from the negative example set.
- The method according to claim 8, wherein selecting negative examples from the negative example set comprises: generating a random number on a predetermined interval, wherein the predetermined interval includes a plurality of values, each value corresponding to a respective negative example in the negative example set, and the random number is taken from the plurality of values; and obtaining, from the negative example set, the negative example corresponding to the random number.
- The method according to claim 9, wherein obtaining the negative example corresponding to the random number from the negative example set comprises:
comparing whether the obtained negative example is identical to the training word;
and if identical, re-executing the step of generating a random number over the predetermined interval.
- The method according to any one of claims 1-10, further comprising:
detecting whether an update condition of the negative example set is satisfied;
and regenerating the negative example set when the update condition is satisfied.
- An apparatus for performing negative sampling from a word frequency table for a training corpus, the word frequency table comprising a plurality of candidate words and the occurrence frequency of each candidate word in the training corpus, the apparatus comprising:
a first acquiring unit configured to acquire a current word from the set of unsampled words among the plurality of candidate words, together with the occurrence frequency pi corresponding to the current word;
a second acquiring unit configured to acquire a remaining sampling count s and a remaining sampling probability r determined for the set of unsampled words;
a first determining unit configured to determine a current sampling probability P corresponding to the current word, based on the occurrence frequency pi corresponding to the current word and the remaining sampling probability r;
a second determining unit configured to determine the number of times b that the current word is sampled, according to a binomial distribution for the current word under the remaining sampling count s and the current sampling probability P;
an adding unit configured to add the current word to a negative example set b times;
and an updating unit configured to update the remaining sampling count s according to the number of times b, and the remaining sampling probability r according to the occurrence frequency pi corresponding to the current word, so as to sample the other candidate words in the word frequency table until a predetermined condition is detected to be satisfied.
- The apparatus according to claim 12, wherein the first determining unit is further configured to:
determine the current sampling probability P as the ratio of the occurrence frequency pi corresponding to the current word to the remaining sampling probability r.
- The apparatus according to claim 12, wherein the second determining unit comprises:
a trial module configured to simulate s sampling operations, s being the remaining sampling count, wherein in each sampling operation the probability that the current word is sampled is the current sampling probability P;
and a determining module configured to determine the number of times b as the number of times the current word is sampled in the s sampling operations.
- The apparatus according to claim 12, wherein the updating unit is further configured to:
update the remaining sampling count s to the difference between the remaining sampling count s and the number of times b.
- The apparatus according to claim 12, wherein the predetermined condition comprises:
the number of negative examples in the negative example set reaching a preset number; or
the updated remaining sampling count being zero; or
the set of unsampled words being empty.
- The apparatus according to claim 12, wherein the updating unit is further configured to:
update the remaining sampling probability r to the difference between the remaining sampling probability r and the occurrence frequency pi corresponding to the current word.
- The apparatus according to claim 12, further comprising:
an output module configured to output the negative example set when the predetermined condition is satisfied.
- The apparatus according to claim 18, further comprising:
a selecting unit configured to select, for a training word in the training corpus, negative examples from the negative example set.
- The apparatus according to claim 19, wherein the selecting unit comprises:
a generating module configured to generate a random number over a predetermined interval, wherein the predetermined interval comprises a plurality of values, each value corresponding to a respective negative example in the negative example set, and the random number is taken from the plurality of values;
and an obtaining module configured to obtain, from the negative example set, the negative example corresponding to the random number.
- The apparatus according to claim 20, wherein the obtaining module is further configured to:
compare whether the obtained negative example is identical to the training word;
and if identical, cause the generating module to re-execute the step of generating a random number over the predetermined interval.
- The apparatus according to any one of claims 12-21, further comprising:
a detecting unit configured to detect whether an update condition of the negative example set is satisfied, so that the apparatus regenerates the negative example set when the update condition is satisfied.
- A computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method according to any one of claims 1-11.
- A computing device comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method according to any one of claims 1-11.
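The sampling loop of claims 1, 2, 4 and 6 can be read as drawing a fixed number of negatives from the word-frequency distribution through a chain of conditional binomial draws: the count for the first word follows Binomial(s, p1), and each later word is drawn from the renormalized remainder. The Python sketch below is illustrative only, not the patented implementation; the names sample_negatives, word_freq and num_negatives are introduced here for the example.

```python
import numpy as np

def sample_negatives(word_freq, num_negatives, seed=0):
    """Sketch of claims 1/2/4/6. word_freq maps each candidate word to its
    occurrence frequency pi; the frequencies are assumed to sum to 1."""
    rng = np.random.default_rng(seed)
    s = num_negatives                 # remaining sampling count
    r = 1.0                           # remaining sampling probability
    negatives = []
    for word, p_i in word_freq.items():   # iterate over the unsampled set
        if s == 0:                        # predetermined condition (claim 5)
            break
        # Current probability P = pi / r (claim 2); the guard absorbs
        # floating-point drift and forces P = 1 for the last word.
        P = 1.0 if r <= p_i else p_i / r
        b = rng.binomial(s, P)            # binomial draw under s and P (claim 1)
        negatives.extend([word] * b)      # add the word b times
        s -= b                            # s := s - b (claim 4)
        r -= p_i                          # r := r - pi (claim 6)
    return negatives

# Example run on an assumed three-word frequency table.
print(sample_negatives({"apple": 0.5, "ant": 0.3, "axe": 0.2}, 10))
```

Worked through on the assumed frequencies 0.5, 0.3 and 0.2: the first word is drawn with P = 0.5/1.0 = 0.5; after the update r = 0.5, so the second word uses P = 0.3/0.5 = 0.6; the last word uses P = 0.2/0.2 = 1, so exactly s negatives are produced in total. Each candidate word is visited once, so the work grows with the vocabulary size plus the number of negatives, rather than requiring an interval lookup for every single negative.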
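Claim 3 (and the trial module of claim 14) realizes the binomial draw by simulating s Bernoulli trials, in each of which the current word is hit with probability P. A minimal sketch, assuming a uniform random source; binomial_by_trials is a name introduced here:

```python
import random

def binomial_by_trials(s, P, rng=random):
    # Simulate s independent sampling operations; each one samples the
    # current word with probability P, and the number of hits is the
    # number of times b the word is sampled (claim 3).
    return sum(1 for _ in range(s) if rng.random() < P)
```

The trial-by-trial simulation costs O(s) per word; a library generator such as numpy's binomial routine draws from the same distribution without looping over the individual trials, which matters when s is large.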
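Claims 8-10 (mirrored by the selecting unit of claims 19-21) pick a negative for a given training word by drawing a uniform random value over an interval whose points correspond one-to-one to entries of the negative example set, re-drawing whenever the pick coincides with the training word itself. An illustrative sketch under the assumption that the set is held as a non-empty list; pick_negative_for is a name introduced here:

```python
import random

def pick_negative_for(training_word, negatives, rng=random):
    # Indices 0 .. len(negatives)-1 play the role of the predetermined
    # interval in claim 9: each index corresponds to one negative example,
    # and frequent words occupy more indices, so a uniform draw reproduces
    # the word-frequency distribution.
    while True:
        candidate = negatives[rng.randrange(len(negatives))]
        if candidate != training_word:   # claim 10: re-draw on a match
            return candidate
```

Selecting from the precomputed set replaces the per-negative interval search of the conventional scheme with a single array access, at the cost of materializing the set once.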
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810555518.X | 2018-06-01 | ||
CN201810555518.XA CN108875810B (en) | 2018-06-01 | 2018-06-01 | Method and device for sampling negative examples from word frequency table aiming at training corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019228014A1 (en) | 2019-12-05 |
Family
ID=64336301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/077438 WO2019228014A1 (en) | 2018-06-01 | 2019-03-08 | Method and apparatus for performing, for training corpus, negative sampling in word frequency table |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN108875810B (en) |
TW (1) | TWI698761B (en) |
WO (1) | WO2019228014A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875810B (en) * | 2018-06-01 | 2020-04-28 | 阿里巴巴集团控股有限公司 | Method and device for sampling negative examples from word frequency table aiming at training corpus |
CN112364130B (en) * | 2020-11-10 | 2024-04-09 | 深圳前海微众银行股份有限公司 | Sample sampling method, apparatus and readable storage medium |
CN114201603B (en) * | 2021-11-04 | 2025-08-12 | 阿里巴巴(中国)有限公司 | Entity classification method, entity classification device, storage medium, processor and electronic device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7729901B2 (en) * | 2005-12-13 | 2010-06-01 | Yahoo! Inc. | System for classifying words |
US7681147B2 (en) * | 2005-12-13 | 2010-03-16 | Yahoo! Inc. | System for determining probable meanings of inputted words |
US7825937B1 (en) * | 2006-06-16 | 2010-11-02 | Nvidia Corporation | Multi-pass cylindrical cube map blur |
US9084096B2 (en) * | 2010-02-22 | 2015-07-14 | Yahoo! Inc. | Media event structure and context identification using short messages |
WO2012126180A1 (en) * | 2011-03-24 | 2012-09-27 | Microsoft Corporation | Multi-layer search-engine index |
US8957707B2 (en) * | 2011-11-30 | 2015-02-17 | Egalax—Empia Technology Inc. | Positive/negative sampling and holding circuit |
CN103870447A (en) * | 2014-03-11 | 2014-06-18 | 北京优捷信达信息科技有限公司 | Keyword extracting method based on implied Dirichlet model |
TWI605353B (en) * | 2016-05-30 | 2017-11-11 | Chunghwa Telecom Co Ltd | File classification system, method and computer program product based on lexical statistics |
CN106547735B (en) * | 2016-10-25 | 2020-07-07 | 复旦大学 | Construction and usage of context-aware dynamic word or word vector based on deep learning |
CN107239444B (en) * | 2017-05-26 | 2019-10-08 | 华中科技大学 | A kind of term vector training method and system merging part of speech and location information |
2018
- 2018-06-01 CN CN201810555518.XA patent/CN108875810B/en active Active

2019
- 2019-02-27 TW TW108106638A patent/TWI698761B/en active
- 2019-03-08 WO PCT/CN2019/077438 patent/WO2019228014A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004095421A1 (en) * | 2003-03-19 | 2004-11-04 | Intel Corporation | A coupled hidden markov model (chmm) for continuous audiovisual speech recognition |
CN106257441A (en) * | 2016-06-30 | 2016-12-28 | 电子科技大学 | A kind of training method of skip language model based on word frequency |
CN107220233A (en) * | 2017-05-09 | 2017-09-29 | 北京理工大学 | A kind of user knowledge demand model construction method based on gauss hybrid models |
CN108021934A (en) * | 2017-11-23 | 2018-05-11 | 阿里巴巴集团控股有限公司 | The method and device of more key element identifications |
CN108875810A (en) * | 2018-06-01 | 2018-11-23 | 阿里巴巴集团控股有限公司 | The method and device of negative example sampling is carried out from word frequency list for training corpus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116756573A (en) * | 2023-08-16 | 2023-09-15 | 国网智能电网研究院有限公司 | Negative example sampling method, training method, defect grading method, device and system |
CN116756573B (en) * | 2023-08-16 | 2024-01-16 | 国网智能电网研究院有限公司 | Negative example sampling method, training method, defect grading method, device and system |
Also Published As
Publication number | Publication date |
---|---|
CN108875810B (en) | 2020-04-28 |
TWI698761B (en) | 2020-07-11 |
TW202004533A (en) | 2020-01-16 |
CN108875810A (en) | 2018-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019228014A1 (en) | Method and apparatus for performing, for training corpus, negative sampling in word frequency table | |
JP2020522055A5 (en) | ||
CN108491302B (en) | Method for detecting spark cluster node state | |
US20160026557A1 (en) | System and method for testing software | |
CN111881023B (en) | Software aging prediction method and device based on multi-model comparison | |
EP3660705A1 (en) | Optimization device and control method of optimization device | |
US20150127990A1 (en) | Error Report Processing Using Call Stack Similarity | |
JP2019114158A (en) | Coverage teste support apparatus and coverage test support method | |
WO2021056914A1 (en) | Automatic modeling method and apparatus for object detection model | |
US12340285B2 (en) | Testing models in data pipeline | |
US20230359449A1 (en) | Learning-augmented application deployment pipeline | |
CN104021072A (en) | Machine and methods for evaluating failing software programs | |
CN116166967B (en) | Data processing method, equipment and storage medium based on meta learning and residual error network | |
CN119003350A (en) | Test method and device for code generation model | |
CN113641905A (en) | Model training method, information pushing method, device, equipment and storage medium | |
CN113537614A (en) | Construction method, system, equipment and medium of power grid engineering cost prediction model | |
CN117785517A (en) | Equipment reliability assessment methods, devices, computer equipment and storage media | |
CN112423031B (en) | KPI monitoring method, device and system based on IPTV | |
KR102255470B1 (en) | Method and apparatus for artificial neural network | |
CN113836005A (en) | A method, apparatus, electronic device and storage medium for generating a virtual user | |
Qiu et al. | Availability analysis of systems deploying sequences of environmental-diversity-based recovery methods | |
CN110442508B (en) | Test task processing method, device, equipment and medium | |
CN120104394B (en) | Method, device, equipment, medium and program for predicting recovery duration | |
US20240289605A1 (en) | Proxy Task Design Tools for Neural Architecture Search | |
US20250245216A1 (en) | Machine learning model prompt hydration via prompt registry and context store |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19812199; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19812199; Country of ref document: EP; Kind code of ref document: A1 |