WO2017012243A1 - Voice recognition method and apparatus, terminal device and storage medium - Google Patents
Voice recognition method and apparatus, terminal device and storage medium
- Publication number
- WO2017012243A1 (PCT/CN2015/096622)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- array
- probability score
- text
- recognition result
- language model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- The embodiments of the present invention relate to the field of voice recognition technologies, and in particular to a voice recognition method, apparatus, terminal device, and storage medium.
- In embedded speech recognition, the recognition result is determined by two components: the acoustic model and the language model.
- The language model plays an important role. For example, the place name "北戴河" (Beidaihe) and the nonsense phrase "被带河" are pronounced almost identically, so their acoustic model scores are nearly the same; the language model is then needed to decide which word sequence actually occurs in the language. In other words, the language model solves the problem of evaluating natural-language word order in speech recognition.
- The voice recognition method provided in the prior art mainly includes the following steps:
- The language model resource is read from the hard disk; the resource is stored as nodes.
- Each node corresponds to one character and consists of three parts: node information (the corresponding character or word, plus child information such as the characters of its child nodes and the number of children), a probability list (ProbList) storing the occurrence probability, and a back-off probability list (BackOff), as shown in Table 1.
- A multi-way lookup tree is then constructed from the loaded resource. Specifically, after the language model resource is loaded into the cache, the storage addresses of the nodes change, so each node knows only which character its child node represents, not where that child is stored. The storage address of each child must therefore be looked up one by one, based on the child information recorded in each node, and written into the parent node to build the lookup tree.
- During decoding, the acoustic model is used for speech recognition to obtain pronunciation information, and the multi-way lookup tree of the language model is searched according to that pronunciation information to obtain scores.
- After reading the language model resource, the existing speech recognition method thus has to load it dynamically and construct the multi-way lookup tree. This process wastes considerable time and leads to low recognition efficiency.
- The embodiments of the present invention provide a voice recognition method, apparatus, terminal device, and storage medium that can greatly shorten the startup time.
- An embodiment of the present invention provides a voice recognition method, including:
- obtaining pronunciation information from the input voice information;
- loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- selecting a text recognition result according to the probability scores as the final recognition result.
- An embodiment of the present invention further provides a voice recognition apparatus, including:
- a pronunciation information obtaining module configured to obtain pronunciation information from the input voice information;
- a probability score query module configured to load a language model lookup tree according to lookup tree information, and to query the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- a text recognition module configured to select a text recognition result according to the probability scores as the final recognition result.
- An embodiment of the present invention further provides a terminal device for implementing voice recognition, including:
- one or more processors;
- a memory;
- one or more modules stored in the memory which, when executed by the one or more processors, perform the following operations:
- obtaining pronunciation information from the input voice information;
- loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- selecting a text recognition result according to the probability scores as the final recognition result.
- An embodiment of the present invention further provides a non-volatile computer storage medium storing one or more modules which, when executed by a device that performs the voice recognition method, cause the device to perform the following operations:
- obtaining pronunciation information from the input voice information;
- loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- selecting a text recognition result according to the probability scores as the final recognition result.
- In the technical solution of the embodiments of the present invention, the language model lookup tree is stored directly in terms of the storage location offsets between each node and its children, so the tree does not need to be constructed dynamically at startup, which greatly shortens the startup time.
- FIG. 1 is a schematic flow chart of a voice recognition method provided by the prior art
- FIG. 2A is a schematic flowchart of a voice recognition method according to Embodiment 1 of the present invention.
- FIG. 2B is a schematic structural diagram of a first lookup subtree in the speech recognition method according to Embodiment 1 of the present invention.
- FIG. 2C is a schematic structural diagram of a second lookup subtree in the speech recognition method according to Embodiment 1 of the present invention.
- FIG. 2D is a schematic structural diagram of a third lookup subtree in the speech recognition method according to Embodiment 1 of the present invention.
- FIG. 2E is a schematic structural diagram of a fourth lookup subtree in the speech recognition method according to Embodiment 1 of the present invention.
- FIG. 3 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 2 of the present invention.
- FIG. 4 is a schematic structural diagram of a terminal device for implementing voice recognition according to Embodiment 3 of the present invention.
- The voice recognition method provided by the embodiments of the present invention may be executed by the voice recognition apparatus provided by the embodiments of the present invention, or by a terminal device (for example, a smart phone or a tablet computer) into which the apparatus is integrated; the apparatus may be implemented in hardware or software.
- FIG. 2A is a schematic flowchart of a voice recognition method according to Embodiment 1 of the present invention. As shown in FIG. 2A, the method specifically includes:
- Specifically, the user can input voice information through the voice recognition apparatus provided by the embodiments of the present invention.
- For example, a voice recording button can be placed in the input field of the apparatus; clicking the button starts the recording function and records the user's speech, thereby obtaining the voice information.
- The voice information is then recognized by the pre-loaded acoustic model and other speech recognition resources to obtain the required pronunciation information. For example, if the voice the user intends to input is "北戴河" (Beidaihe), the pronunciation information obtained by this recognition process is "beidaihe".
- The language model lookup tree is loaded according to the lookup tree information, and the lookup tree is queried to determine the probability scores of text recognition results that match the pronunciation information.
- The lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least the storage location offset between the current node and its child node.
- The lookup tree information is similar to an ordinary language model resource in that it consists of a plurality of nodes corresponding to text; the difference is that each node records at least the storage location offset between itself and its child node.
- In addition, each node of the lookup tree may also record a storage probability (ProbList, the probability that the current node occurs), a back-off probability (BackOff), and the number of its child nodes.
- A parent node and its child node represent characters that occur together. For example, for the word "北京" (Beijing), the "京" node is the parent node of the "北" node.
- The storage location offset is the distance between the storage locations of a node and its child node.
- Specifically, the lookup tree information of the language model is written directly into the language model resource in advance, so that initialization does not require dynamically constructing the lookup tree; instead, the pointer information of a dynamically constructed tree is recorded as offsets in the language model resource. In other words, the language model lookup tree is built offline in advance.
- The storage location offset between each node and its child node is written directly into the language model resource; the resulting lookup tree information is shown in Table 2 below. At startup, the lookup tree built offline is loaded directly according to this information.
- When the lookup tree information is loaded into the cache, the relative storage distance between nodes does not change. The storage locations of all other nodes can therefore be determined from the storage location of the initial node and the offsets to the other nodes, as illustrated in the sketch below.
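A minimal sketch of this idea, assuming a hypothetical fixed-size node record (the patent does not specify the binary layout): because each node stores its child's position as a relative offset, the serialized tree can be used directly from the buffer it was read into, with no pointer-fixup pass.

```python
import struct

# Hypothetical node record: character id, ProbList, BackOff, child offset.
NODE_FMT = "<IffI"
NODE_SIZE = struct.calcsize(NODE_FMT)

def load_tree(path: str) -> bytes:
    # Loading is a single read; the buffer itself is the lookup tree.
    with open(path, "rb") as f:
        return f.read()

def read_node(tree: bytes, index: int):
    return struct.unpack_from(NODE_FMT, tree, index * NODE_SIZE)

def child_of(tree: bytes, index: int):
    # Follow a stored offset: the child's index is the current index plus
    # the relative offset, wherever the buffer happens to be loaded.
    _, _, _, offset = read_node(tree, index)
    return read_node(tree, index + offset)
```

Dynamic construction, by contrast, would have to walk every node after loading and patch in absolute child addresses, which is exactly the startup cost the stored offsets avoid.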
- According to the information in Table 2, the language model lookup tree can be loaded online and queried according to the pronunciation information.
- For example, the pronunciation information obtained in step S21 above is "beidaihe". First, the text nodes corresponding to "he" are looked up in the root node list (RootProbList) of the lookup tree; there are several, for example "荷", "喝", and "河", as shown in FIG. 2B. Then the text nodes corresponding to "dai" are looked up among the children of each "he" node; again there are several, such as "带", "戴", and "待". The ProbList and BackOff scores corresponding to "daihe" are queried for each "dai" node, giving, for example, the results shown in Table 3.
- From Table 3, the text nodes "戴", "带", and "待" under "荷" all have a BackOff above 60% and a ProbList below 60%, so the subtree under "荷" is pruned. Likewise, the nodes "戴", "带", and "待" under "河" all have a ProbList above 60% and a BackOff below 60%, so the subtree under "河" is kept.
- Under "喝", the node "待" has a ProbList above 60% and a BackOff below 60%, so the child "待" of "喝" is kept, while the children "戴" and "带" of "喝" have a ProbList below 60% and a BackOff above 60% and are pruned.
- The result of this selection process is the two subtrees shown in FIG. 2C and FIG. 2D.
- On the basis of the subtrees shown in FIG. 2C and FIG. 2D, the child nodes corresponding to "bei" in "beidaihe" are queried next; again there are several, such as "被", "北", and "背". The ProbList and BackOff scores corresponding to "beidaihe" are queried for each "bei" node, giving, for example, the results shown in Table 4.
- The final recognition results are "北戴河" (Beidaihe) and "被戴河", with probability scores of 99% and 60% respectively. The text recognition results can be ranked by score, with higher-scoring results displayed first, and both returned to the user, i.e., "北戴河" and "被戴河" are returned together for the user to choose. Alternatively, only the highest-scoring result, "北戴河", may be returned to the user.
- In this embodiment, the language model lookup tree does not need to be constructed dynamically at startup; it is recorded in advance in terms of the storage location offsets between each node and its children. When the tree needs to be loaded, it can be placed into the cache directly according to those offsets, with no dynamic construction, which greatly shortens the startup time.
- Illustratively, for faster queries, the following step may be added before the loaded language model lookup tree is queried according to the pronunciation information to determine the probability scores of matching text recognition results:
- querying, according to the pronunciation information, a probability score of a matching text recognition result in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries.
- The common word sequence contains characters and hot words that people use frequently in daily life, for example names of tourist attractions, names of municipalities, names of internet celebrities, song titles, and so on. Placing this common vocabulary in the cache can greatly improve query efficiency.
- The text recognition results of historical queries can also be recorded in the cache. When the user inputs the same voice information again, the result can be returned directly from the cache, which likewise saves query time.
- Since the lookup tree consumes a relatively large amount of memory, this embodiment of the present invention converts the lookup tree into a more memory-efficient common word model. Specifically, before the probability score of a text recognition result matching the pronunciation information is queried in the common word sequence in the cache, the following operations are added to form the common word sequence:
- single characters in the language model lookup tree whose probability score exceeds a set threshold are stored, together with their probability scores, as a first array;
- character combinations of at least two characters in the language model lookup tree whose probability score exceeds the set threshold are stored, together with their probability scores, as a second array;
- the first array and the second array are stored as the common word sequence.
- In other words, some or all of the single characters contained in the root node of the lookup tree, with their probability scores, are stored in array form, and the character combinations corresponding to each parent node and its children, with their probability scores, are likewise stored in array form. Single characters and character combinations with low probability can be removed in the process, improving query efficiency.
- For example, the characters contained in the root node of the language model lookup tree include "北", "京", "河", "杭", "喝", and so on, with storage probabilities P1, P2, P3, P4, and P5 respectively. This can be realized as a two-dimensional array, stored in the form shown in Table 5 below.
- The character combinations corresponding to each parent node and its children, with their probability scores, may also be stored in a two-dimensional array, for example as shown in Table 6 below for two-character combinations. A small illustrative sketch of the two arrays follows.
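A minimal sketch of forming the two arrays described above; the threshold, characters, and scores are made-up illustrations, not values from the patent.

```python
# Illustrative threshold; the patent only says "a set threshold".
THRESHOLD = 0.6

# Single characters from the root node with their storage probabilities.
root_chars = {"北": 0.9, "京": 0.8, "河": 0.7, "荷": 0.2}
# Multi-character combinations (parent-child chains) with their scores.
combos = {"北京": 0.95, "北戴河": 0.99, "喝带": 0.1}

# First array: frequent single characters; second array: frequent combinations.
first_array = [(ch, p) for ch, p in root_chars.items() if p > THRESHOLD]
second_array = [(w, p) for w, p in combos.items() if p > THRESHOLD]

# Together the two arrays form the cached common word sequence.
common_word_sequence = (first_array, second_array)
```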
- Illustratively, to speed up queries further, a positioning table may be used to quickly locate the target area to be queried. The following steps may be added after the first array and the second array are stored as the common word sequence:
- dividing the plurality of character combinations in the second array into an ordered sequence array and an unordered sequence array according to a predetermined rule.
- The predetermined rule may be set according to the specific scenario, with different division rules for different scenarios, so that better text recognition results can be matched faster.
- Specifically, for two-character combinations, the identifier value corresponding to the first character may be shifted right by a first specified number of bits and the identifier value corresponding to the second character shifted left by a second specified number of bits to obtain a feature value K. Two-character combinations whose feature value K is shared by a number of combinations greater than or equal to a preset value are classified into the ordered sequence array; two-character combinations whose feature value K is shared by fewer combinations than the preset value are classified into the unordered sequence array.
- For two-character combinations, Formula 1 can be used to compute the feature value K from which the ordered and unordered sequence arrays are built:
- K is obtained from M1 >> 3 combined with M2 << 13, where:
- the first specified number of bits has the value 3;
- the second specified number of bits has the value 13;
- ">>" is the right-shift operator;
- "<<" is the left-shift operator;
- M1 is the identifier value corresponding to the first character;
- M2 is the identifier value corresponding to the second character.
- Two-character combinations whose feature value K is shared by a number of combinations greater than or equal to the preset value are classified into the ordered sequence array; those whose feature value K is shared by fewer combinations than the preset value are classified into the unordered sequence array.
- The identifier value is a value that uniquely identifies a character; for example, when characters are encoded in ASCII, the ASCII code of the character is its identifier value. Shifting the identifier values and computing the feature value K is equivalent to classifying the character combinations by feature value, grouping combinations with the same K together. If the number of combinations in a group is too small, there is no need to set up a separate group for it.
- The predetermined rule may also be some other formula; it is not limited to left and right shifts, nor to the specific shift amounts above. A sketch of Formula 1 follows.
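A minimal sketch of Formula 1 under an explicit assumption: the patent text gives the two shift amounts but not the operator that merges the shifted values, so bitwise OR is used here purely for illustration.

```python
def feature_value_k(m1: int, m2: int, shift1: int = 3, shift2: int = 13) -> int:
    # m1, m2: identifier values of the two characters (e.g. their code points).
    # The combining operator (|) is an assumption, not stated in the patent.
    return (m1 >> shift1) | (m2 << shift2)

# Example: feature value of the two-character combination "戴河".
k = feature_value_k(ord("戴"), ord("河"))
```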
- For three-character combinations, the feature value K of the first two characters is first computed with Formula 1 above; then K is shifted right by the first specified number of bits and the identifier value corresponding to the third character is shifted left by the second specified number of bits to obtain a feature value T.
- Three-character combinations whose feature value T is shared by a number of combinations greater than or equal to the preset value are classified into the ordered sequence array; those whose feature value T is shared by fewer combinations than the preset value are classified into the unordered sequence array.
- That is, for three-character combinations, the ordered and unordered sequence arrays are obtained by combining Formula 1 above with the following Formula 2:
- T is obtained from K >> 3 combined with M3 << 13, where:
- the first specified number of bits has the value 3;
- the second specified number of bits has the value 13;
- K is the feature value corresponding to the combination of M1 and M2, obtained with Formula 1;
- M3 is the identifier value corresponding to the third character.
- A sketch of Formula 2 follows.
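Continuing the sketch above, Formula 2 reuses the binary feature value K in place of the first identifier; the combining operator is again an assumption.

```python
def feature_value_k(m1: int, m2: int) -> int:
    return (m1 >> 3) | (m2 << 13)          # Formula 1 (operator assumed)

def feature_value_t(k: int, m3: int) -> int:
    # Formula 2: same shifts, with K standing in for the first identifier.
    return (k >> 3) | (m3 << 13)           # operator assumed

# Example: feature value of the three-character combination "被戴河".
k = feature_value_k(ord("被"), ord("戴"))
t = feature_value_t(k, ord("河"))
```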
- The ordered sequence array can be further divided into multiple sub-arrays by feature value, each sub-array storing the character combinations that share one feature value.
- For two-character combinations the feature value is calculated with Formula 1 above; for three-character combinations it is calculated by combining Formula 1 and Formula 2.
- Specifically, all character combinations in the lookup tree are counted, including two-character combinations, three-character combinations, and n-gram combinations where n is a natural number greater than 3; the most common are the two- and three-character combinations.
- For two-character combinations the feature value is calculated with Formula 1 above; for three-character combinations it is calculated with Formula 1 and Formula 2.
- Suppose, for example, that the combinations with feature value K1 are found to be "北京" (Beijing), "天津" (Tianjin), "北戴河" (Beidaihe), "百度" (Baidu), and "搜狐" (Sohu); that the combinations with feature value K2 are four combinations including "被带河", "牛奶" (milk), and "酸奶" (yoghurt); that the only combination with feature value K3 is "苏州" (Suzhou); and that the combinations with feature value K4 are two combinations including "被戴河".
- The number of combinations with feature value K1 is thus 5, with K2 is 4, with K3 is 1, and with K4 is 2. If the preset value is set to 3, the groups in which 3 or more combinations share a feature value are classified into the ordered sequence array, and the rest into the unordered sequence array.
- The resulting ordered sequence array is represented as a list, as shown in Table 7 below; it also contains the feature values and the occurrence probability of each combination, which can be obtained directly from the language model lookup tree.
- The resulting unordered sequence array is likewise represented as a list, as shown in Table 8 below; it also contains the feature values and the occurrence probability of each combination, which can be obtained directly from the language model lookup tree.
- The character combinations in the ordered sequence array are then divided further into sub-arrays by feature value: for example, the combinations in Table 7 with feature value K1 form sub-array 1 and those with feature value K2 form sub-array 2, as shown in Table 9 below.
- A positioning table is then constructed from the sub-arrays of Table 9 and the unordered sequence array of Table 8.
- The feature value and starting storage location of each sub-array, together with the feature values of the combinations in the unordered sequence array and their probability scores, are placed in the positioning table.
- The resulting positioning table is shown in Table 10 below.
- Alternatively, the feature value can be used directly as the array subscript: the subscript corresponding to sub-array 1 is K1, the subscript corresponding to sub-array 2 is K2, the subscript corresponding to unordered sequence entry 1 is K3, and the subscript corresponding to unordered sequence entry 2 is K4. The subscript of each array is then stored directly in the positioning table, giving the table shown in Table 11 below. A sketch of the construction follows.
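A minimal sketch of the grouping and positioning-table construction described above. The feature-value function, preset value, and table layout are simplified assumptions: the patent records start/stop storage locations, for which Python dicts stand in here.

```python
from collections import defaultdict

def feature_value(combo: str) -> int:
    # Formula 1 with an assumed combining operator; two-character case only.
    return (ord(combo[0]) >> 3) | (ord(combo[1]) << 13)

def build_positioning_table(scored_combos, preset=3):
    groups = defaultdict(list)
    for combo, score in scored_combos:
        groups[feature_value(combo)].append((combo, score))

    positioning, subarrays = {}, {}
    for k, entries in groups.items():
        if len(entries) >= preset:
            # Large group: one ordered sub-array, sorted so it can be
            # binary-searched; the table records where to look.
            subarrays[k] = sorted(entries)
            positioning[k] = ("subarray", k)
        else:
            # Small group: scores are kept inline in the positioning table.
            positioning[k] = ("scores", dict(entries))
    return positioning, subarrays
```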
- Accordingly, loading the language model lookup tree according to the lookup tree information and querying it to determine the probability score of a matching text recognition result preferably comprises: querying the positioning table according to the feature value corresponding to the pronunciation information to determine the sub-array that matches the pronunciation information, and then
- using a fast lookup algorithm to search the matched sub-array and determine the probability score of the text recognition result that matches the pronunciation information.
- For example, the probability scores of each character combination pronounced "beidaihe" are obtained as follows, taking the positioning table of Table 10 as an example. To query the score of a combination whose feature value, computed with Formula 1, is K2, the positioning table of Table 10 is consulted; it shows that the corresponding query range is sub-array 2. The query then proceeds within sub-array 2, between the start and stop locations recorded in the positioning table, using a fast lookup algorithm (for example, binary search) to obtain the probability score.
- To query a combination such as "被戴河", whose feature value computed with Formula 1 and Formula 2 is K4, the positioning table of Table 10 is consulted; since the corresponding result is recorded directly in the positioning table, the probability score of "被戴河" is read off as P8.
- The probability scores of all character combinations pronounced "beidaihe" are then compared, the combinations are sorted by probability score, and the top-ranked combinations are returned to the user. A sketch of this query flow follows.
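A minimal sketch of this two-stage query against the structures built in the previous sketch (same assumed feature-value function; two-character case only, for brevity):

```python
import bisect

def feature_value(combo: str) -> int:
    return (ord(combo[0]) >> 3) | (ord(combo[1]) << 13)   # operator assumed

def query_score(combo: str, positioning: dict, subarrays: dict):
    entry = positioning.get(feature_value(combo))
    if entry is None:
        return None                        # feature value not in the table
    kind, payload = entry
    if kind == "scores":
        return payload.get(combo)          # unordered case: score read inline
    sub = subarrays[payload]               # ordered case: locate the sub-array
    i = bisect.bisect_left(sub, (combo,))  # binary search ("dichotomy")
    if i < len(sub) and sub[i][0] == combo:
        return sub[i][1]
    return None
```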
- In the technical solution of this embodiment, the language model lookup tree is loaded according to lookup tree information in which each node records at least the storage location offset between itself and its child node, and the tree is queried to determine the probability scores of text recognition results matching the pronunciation information.
- Because the language model is loaded directly according to these storage location offsets, the startup time is greatly shortened.
- On this basis, the foregoing embodiments further construct a positioning table to locate the approximate region of the character combination to be queried, and then use a fast lookup algorithm to find the exact entry and determine the probability score of the matching text recognition result, which further improves query efficiency.
- FIG. 3 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 2 of the present invention. As shown in FIG. 3, the apparatus specifically includes a pronunciation information obtaining module 31, a probability score query module 32, and a text recognition module 33;
- the pronunciation information obtaining module 31 is configured to obtain pronunciation information from the voice information;
- the probability score query module 32 is configured to load the language model lookup tree according to the lookup tree information, and to query the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- the text recognition module 33 is configured to select a text recognition result according to the probability scores as the final recognition result.
- the voice recognition device is used to perform the voice recognition method described in the foregoing embodiments, and the technical principle and the generated technical effect are similar, and are not described here.
- the device further includes: a cache query module 34 and a trigger module 35;
- the cache query module 34 is configured to query, according to the pronunciation information, a probability score of a matching text recognition result in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries, before the probability score query module 32 loads the language model lookup tree and queries it;
- the triggering module 35 is configured to trigger the query in the language model lookup tree if the cache query module 34 finds no probability score of a matching text recognition result in the cache.
- the device further includes: a first array forming module 36, a second array forming module 37, and a storage module 38;
- the first array forming module 36 is configured to form a first array from the single characters in the language model lookup tree whose probability score exceeds a set threshold, together with their probability scores, before the cache query module 34 queries the common word sequence in the cache;
- the second array forming module 37 is configured to form a second array from the character combinations of at least two characters in the language model lookup tree whose probability score exceeds the set threshold, together with their probability scores;
- the storage module 38 is configured to store the first array and the second array as the common word sequence.
- the device further includes: an array decomposition module 39 and a positioning table construction module 310;
- the array decomposition module 39 is configured to divide, after the storage module 38 stores the first array and the second array as the common word sequence, the plurality of character combinations in the second array into an ordered sequence array and an unordered sequence array according to a predetermined rule, where the ordered sequence array includes at least two sub-arrays and each sub-array stores a plurality of character combinations sharing the same feature value;
- the positioning table construction module 310 is configured to store, in the positioning table, the probability scores in the unordered sequence array together with their starting and/or ending positions, and the feature value, starting position, and/or ending position of each sub-array;
- the cache query module 34 is specifically configured to:
- the array decomposition module 39 is specifically configured to:
- shift the identifier value corresponding to the first character in a two-character combination right by the first specified number of bits and the identifier value corresponding to the second character left by the second specified number of bits to obtain the feature value K;
- classify two-character combinations whose feature value K is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array, and two-character combinations whose feature value K is shared by fewer combinations than the preset value into the unordered sequence array.
- Further, the array decomposition module 39 is specifically configured to:
- shift the feature value K right by the first specified number of bits and the identifier value corresponding to the third character left by the second specified number of bits to obtain the feature value T;
- classify three-character combinations whose feature value T is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array, and three-character combinations whose feature value T is shared by fewer combinations than the preset value into the unordered sequence array.
- The voice recognition apparatus described in each of the above embodiments is likewise used to perform the voice recognition method described in the foregoing embodiments; its technical principle and technical effect are similar and are not repeated here.
- FIG. 4 is a schematic diagram of the hardware structure of a terminal device for implementing voice recognition according to Embodiment 3 of the present invention. The terminal device includes one or more processors 41, a memory 42, and one or more modules (for example, the pronunciation information obtaining module 31, probability score query module 32, text recognition module 33, cache query module 34, triggering module 35, first array forming module 36, second array forming module 37, storage module 38, array decomposition module 39, and positioning table construction module 310 of the voice recognition apparatus shown in FIG. 3) stored in the memory 42. One processor 41 is taken as an example in FIG. 4. The processor 41 and the memory 42 in the terminal device may be connected by a bus or in other ways; a bus connection is shown in FIG. 4. When the one or more modules are executed by the one or more processors 41, the following operations are performed:
- obtaining pronunciation information from the input voice information;
- loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- selecting a text recognition result according to the probability scores as the final recognition result.
- the foregoing terminal device can perform the methods provided in Embodiment 1 and Embodiment 2 of the present invention, and has the corresponding functional modules and beneficial effects of the execution method.
- Further, before querying the loaded language model lookup tree according to the pronunciation information to determine the probability score of a matching text recognition result, the processor 41 queries, according to the pronunciation information, a probability score of a matching text recognition result in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries; if there is no probability score of a matching text recognition result in the cache, the query in the language model lookup tree is triggered.
- Further, before querying the probability score of a matching text recognition result in the common word sequence in the cache according to the pronunciation information, the processor 41 forms a first array from the single characters in the language model lookup tree whose probability score exceeds a set threshold, together with their probability scores; forms a second array from the character combinations of at least two characters whose probability score exceeds the set threshold, together with their probability scores; and stores the first array and the second array as the common word sequence.
- Further, after storing the first array and the second array as the common word sequence, the processor 41 divides the plurality of character combinations in the second array into an ordered sequence array and an unordered sequence array according to a predetermined rule, where the ordered sequence array includes at least two sub-arrays and each sub-array stores a plurality of character combinations sharing the same feature value; stores, in the positioning table, the probability scores in the unordered sequence array together with their starting and/or ending positions, and the feature value, starting position, and/or ending position of each sub-array; queries the positioning table according to the feature value corresponding to the pronunciation information to determine the sub-array matching the pronunciation information; and searches the matched sub-array with a fast lookup algorithm to determine the probability score of the matching text recognition result.
- Further, the processor 41 shifts the identifier value corresponding to the first character in a two-character combination right by the first specified number of bits and the identifier value corresponding to the second character left by the second specified number of bits to obtain the feature value K; classifies two-character combinations whose feature value K is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array; and classifies two-character combinations whose feature value K is shared by fewer combinations than the preset value into the unordered sequence array.
- Further, the processor 41 shifts the feature value K right by the first specified number of bits and the identifier value corresponding to the third character left by the second specified number of bits to obtain the feature value T; classifies three-character combinations whose feature value T is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array; and classifies three-character combinations whose feature value T is shared by fewer combinations than the preset value into the unordered sequence array.
- An embodiment of the present invention further provides a non-volatile computer storage medium storing one or more modules which, when executed by a device that performs the voice recognition method, cause the device to perform the following operations:
- obtaining pronunciation information from the input voice information;
- loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
- selecting a text recognition result according to the probability scores as the final recognition result.
- When the modules stored in the computer storage medium are executed by the device, the method preferably further includes: querying, according to the pronunciation information, a probability score of a matching text recognition result in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries, before querying the language model lookup tree; and triggering the query in the language model lookup tree if there is no matching probability score in the cache.
- The method preferably further includes: forming a first array from the single characters in the language model lookup tree whose probability score exceeds a set threshold, together with their probability scores; forming a second array from the character combinations of at least two characters whose probability score exceeds the set threshold, together with their probability scores; and storing the first array and the second array as the common word sequence.
- The method preferably further includes: dividing the plurality of character combinations in the second array into an ordered sequence array and an unordered sequence array according to a predetermined rule, and storing the positioning information of the sub-arrays and the scores of the unordered sequence array in a positioning table.
- Querying the language model lookup tree to determine the probability score of the text recognition result matching the pronunciation information is then preferably: querying the positioning table according to the feature value corresponding to the pronunciation information to determine the matching sub-array;
- and using a fast lookup algorithm to search the matched sub-array to determine the probability score of the text recognition result that matches the pronunciation information.
- Dividing the two-character combinations in the second array into an ordered sequence array and an unordered sequence array according to the predetermined rule is preferably:
- shifting the identifier value corresponding to the first character right by the first specified number of bits and the identifier value corresponding to the second character left by the second specified number of bits to obtain the feature value K;
- classifying two-character combinations whose feature value K is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array, and two-character combinations whose feature value K is shared by fewer combinations than the preset value into the unordered sequence array.
- Dividing the three-character combinations in the second array into an ordered sequence array and an unordered sequence array according to the predetermined rule is preferably:
- shifting the feature value K right by the first specified number of bits and the identifier value corresponding to the third character left by the second specified number of bits to obtain the feature value T;
- classifying three-character combinations whose feature value T is shared by a number of combinations greater than or equal to the preset value into the ordered sequence array, and three-character combinations whose feature value T is shared by fewer combinations than the preset value into the unordered sequence array.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Abstract
Description
This patent application claims priority to Chinese patent application No. 201510427908.5, filed on July 20, 2015 by Baidu Online Network Technology (Beijing) Co., Ltd. and entitled "Voice Recognition Method and Apparatus", the entire content of which is incorporated herein by reference.
The embodiments of the present invention relate to the field of voice recognition technologies, and in particular to a voice recognition method, apparatus, terminal device, and storage medium.
In embedded speech recognition, the recognition result is determined by two components: the acoustic model and the language model. The language model plays an important role. For example, the place name "北戴河" (Beidaihe) and the nonsense phrase "被带河" are pronounced almost identically, so their acoustic model scores are nearly the same; the language model is then needed to decide which word sequence actually occurs in the language. In other words, the language model solves the problem of evaluating natural-language word order in speech recognition.
As shown in FIG. 1, the voice recognition method provided in the prior art mainly includes the following steps:
S11. The language model resource is read from the hard disk; the resource is stored as nodes.
Each node corresponds to one character and consists of three parts: node information (the corresponding character or word, plus child information such as the characters of its child nodes and the number of children), a probability list (ProbList) storing the occurrence probability, and a back-off probability list (BackOff), as shown in Table 1:
Table 1
S12. A multi-way lookup tree is constructed from the language model resource that was read.
The tree construction process is as follows: after the language model resource is loaded into the cache, the storage addresses of the nodes change, so each node knows only which character its child node represents, not where that child is stored. The storage address of each child must therefore be looked up one by one, based on the child information recorded in each node, and written into the parent node to build the lookup tree.
S13. The acoustic model and other speech recognition resources are loaded.
S14. The input voice information is received and decoded using the Viterbi algorithm.
S15. During decoding, the acoustic model is used for speech recognition to obtain pronunciation information, and the multi-way lookup tree of the language model is searched according to the pronunciation information to obtain scores.
S16. The recognition result of the language model is obtained.
S17. The recognition result is output and the resources are released.
However, after reading the language model resource, the existing speech recognition method has to load it dynamically and construct the multi-way lookup tree. This process wastes considerable time and leads to low recognition efficiency.
SUMMARY
The embodiments of the present invention provide a voice recognition method, apparatus, terminal device, and storage medium that can greatly shorten the startup time.
In a first aspect, an embodiment of the present invention provides a voice recognition method, including:
obtaining pronunciation information from the input voice information;
loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
selecting a text recognition result according to the probability scores as the final recognition result.
In a second aspect, an embodiment of the present invention further provides a voice recognition apparatus, including:
a pronunciation information obtaining module configured to obtain pronunciation information from the input voice information;
a probability score query module configured to load a language model lookup tree according to lookup tree information, and to query the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
a text recognition module configured to select a text recognition result according to the probability scores as the final recognition result.
In a third aspect, an embodiment of the present invention further provides a terminal device for implementing voice recognition, including:
one or more processors;
a memory;
one or more modules stored in the memory which, when executed by the one or more processors, perform the following operations:
obtaining pronunciation information from the input voice information;
loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
selecting a text recognition result according to the probability scores as the final recognition result.
In a fourth aspect, an embodiment of the present invention further provides a non-volatile computer storage medium storing one or more modules which, when executed by a device that performs the voice recognition method, cause the device to perform the following operations:
obtaining pronunciation information from the input voice information;
loading a language model lookup tree according to lookup tree information, and querying the lookup tree to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node;
selecting a text recognition result according to the probability scores as the final recognition result.
In the technical solution of the embodiments of the present invention, the language model lookup tree is stored directly in terms of the storage location offsets between each node and its children, so the tree does not need to be constructed dynamically at startup, which greatly shortens the startup time.
FIG. 1 is a schematic flow chart of a voice recognition method provided by the prior art;
FIG. 2A is a schematic flow chart of a voice recognition method according to Embodiment 1 of the present invention;
FIG. 2B is a schematic structural diagram of a first lookup subtree in the voice recognition method according to Embodiment 1 of the present invention;
FIG. 2C is a schematic structural diagram of a second lookup subtree in the voice recognition method according to Embodiment 1 of the present invention;
FIG. 2D is a schematic structural diagram of a third lookup subtree in the voice recognition method according to Embodiment 1 of the present invention;
FIG. 2E is a schematic structural diagram of a fourth lookup subtree in the voice recognition method according to Embodiment 1 of the present invention;
FIG. 3 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 2 of the present invention;
FIG. 4 is a schematic structural diagram of a terminal device for implementing voice recognition according to Embodiment 3 of the present invention.
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the complete structures.
The voice recognition method provided by the embodiments of the present invention may be executed by the voice recognition apparatus provided by the embodiments of the present invention, or by a terminal device (for example, a smart phone or a tablet computer) into which the apparatus is integrated; the apparatus may be implemented in hardware or software.
Embodiment 1
FIG. 2A is a schematic flow chart of the voice recognition method according to Embodiment 1 of the present invention. As shown in FIG. 2A, the method specifically includes:
S21. Pronunciation information is obtained from the voice information.
Specifically, the user can input voice information through the voice recognition apparatus provided by the embodiments of the present invention. For example, a voice recording button can be placed in the input field of the apparatus; clicking the button starts the recording function and records the user's speech, thereby obtaining the voice information. The voice information is then recognized by the pre-loaded acoustic model and other speech recognition resources to obtain the required pronunciation information. For example, if the voice the user intends to input is "北戴河" (Beidaihe), the pronunciation information obtained by this recognition process is "beidaihe".
S22. A language model lookup tree is loaded according to lookup tree information, and the lookup tree is queried to determine probability scores of text recognition results that match the pronunciation information, where the lookup tree information includes a plurality of nodes corresponding to text, and each node includes at least a storage location offset between the current node and its child node.
The lookup tree information is similar to an ordinary language model resource in that it consists of a plurality of nodes corresponding to text, with each node including at least the storage location offset between the current node and its child node. In addition, each node of the lookup tree may also record a storage probability (ProbList, the probability that the current node occurs), a back-off probability (BackOff), and the number of its child nodes. A parent node and its child node represent characters that occur together; for example, for the word "北京" (Beijing), the "京" node is the parent node of the "北" node. The storage location offset is the distance between the storage locations of a node and its child node.
Specifically, the lookup tree information of the language model is written directly into the language model resource in advance, so that initialization does not require dynamically constructing the lookup tree; instead, the pointer information of a dynamically constructed tree is recorded as offsets in the language model resource. In other words, the language model lookup tree is built offline in advance. The storage location offset between each node and its child node is written directly into the language model resource; the resulting lookup tree information is shown in Table 2 below. At startup, the lookup tree built offline is loaded directly according to this information.
Table 2
When the lookup tree information is loaded into the cache, the relative storage distance between nodes does not change. The storage locations of all other nodes can therefore be determined from the storage location of the initial node and the offsets to the other nodes.
According to the information in Table 2, the language model lookup tree can be loaded online and queried according to the pronunciation information. For example, the pronunciation information obtained in step S21 is "beidaihe". First, the text nodes corresponding to "he" are looked up in the root node list (RootProbList) of the lookup tree; there are several, for example "荷", "喝", and "河", as shown in FIG. 2B. Then the text nodes corresponding to "dai" are looked up among the children of each "he" node; again there are several, such as "带", "戴", and "待". The ProbList and BackOff scores corresponding to "daihe" are queried for each "dai" node, giving, for example, the results shown in Table 3:
Table 3
则通过上述表三,可得到“荷”对应的子节点“戴”、“带”和“待”下的文字节点的退回概率BackOff均高于60%,比较高,而ProbList均低于60%,比较低,则“荷”对应的子树被退回。同理,“河”对应的子节点“戴”、“带”和“待”下的文字节点的ProbList均高于60%,而BackOff均低于60%,则“河”对应的子树保留。“喝”对应的子节点“待”下的文字节点的ProbList高于60%,而BackOff低于60%,则“喝”对应的子节点“待”保留,而“喝”对应的子节点“戴”和“带”的ProbList均低于60%,而BackOff均高于60%,“喝”对应的子节点“戴”和“带”被退回。通过上述选择过程最终可得到的结果如图2C和图2D所示的两个子树。Then, through the above Table 3, the backoff probability BackOff of the text nodes under the "dot", "band" and "to" of the "charge" corresponding to the "charge" is higher than 60%, which is higher, and the ProbList is lower than 60%. If it is low, the subtree corresponding to "荷" is returned. Similarly, the ProbList of the text nodes under the "river" corresponding to "river" are higher than 60%, and the BackOff is lower than 60%, then the subtree corresponding to "river" remains. . The ProbList of the text node under the "children" corresponding to "drink" is higher than 60%, and the BackOff is lower than 60%, then the corresponding child node "drink" is reserved, and the corresponding child node "drinks" The ProbList of Dai and "Band" are both lower than 60%, and BackOff is higher than 60%. The corresponding child nodes "Dai" and "Band" of "Drink" are returned. The final result obtained by the above selection process is shown in the two subtrees shown in Figures 2C and 2D.
On the basis of the subtrees shown in FIG. 2C and FIG. 2D, the child nodes of each "dai" text node are queried again according to "bei" in the pronunciation information "beidaihe"; there are again several, for example "被", "北" and "背".
For each "bei" text node, the probability scores corresponding to "beidaihe" are looked up in its ProbList and BackOff, giving, for example, the results shown in Table 4 below:
Table 4
From Table 4 above, the text nodes "北戴河" and "被戴河" under the child node "戴" have ProbList scores above 60% and BackOff values below 60%, so they are retained, while the text node "背戴河" under "戴" has a BackOff above 60% and a ProbList score below 60%, so it is backed off. Likewise, the subtrees corresponding to "待" and "带" are backed off. The final result of this selection process is shown in FIG. 2E.
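The retain-or-back-off decision applied above can be sketched as follows; the 60% threshold is taken from the worked example, and the flat candidate dictionary is a hypothetical stand-in for one expansion step of the scoring tree:

```python
# Sketch: keep a candidate when its ProbList score is high and its BackOff
# score is low; otherwise the corresponding subtree is backed off (dropped).

THRESHOLD = 0.60  # the 60% used in Tables 3 and 4

def prune(candidates):
    """candidates: {text: (prob_list, back_off)} for one expansion step."""
    return {text: prob
            for text, (prob, backoff) in candidates.items()
            if prob > THRESHOLD and backoff < THRESHOLD}

# A subset of the "daihe" candidates from Table 3 (values illustrative):
print(prune({
    "戴河": (0.99, 0.10),   # under 河: retained
    "带河": (0.80, 0.20),   # under 河: retained
    "待河": (0.30, 0.70),   # under 荷: backed off
}))  # {'戴河': 0.99, '带河': 0.8}
```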
S23. Select a text recognition result according to the probability scores as the final recognition result.
Continuing the example above, the final recognition results are "北戴河" (Beidaihe) and "被戴河", with probability scores of 99% and 60% respectively. Following the principle of displaying the higher-scoring text recognition result first and the lower-scoring one after it, both "北戴河" and "被戴河" can be returned to the user at the same time for the user to choose from. Alternatively, only the highest-scoring result, "北戴河", may be returned to the user.
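Step S23 then amounts to ranking the surviving candidates by probability score; a minimal sketch with the scores from this example:

```python
# Sketch: step S23, rank surviving candidates by probability score.
candidates = {"北戴河": 0.99, "被戴河": 0.60}

ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)        # [('北戴河', 0.99), ('被戴河', 0.6)]: both, best first
print(ranked[0][0])  # or return only the top result: 北戴河
```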
In this embodiment, there is no need to construct the language model scoring tree dynamically at startup: the scoring tree is recorded in advance through the storage location offsets between each current node and its child nodes, so at load time it can be loaded into the cache directly according to these offsets, without dynamic construction, which greatly shortens the startup time.
Exemplarily, for faster queries, the following steps may be added before querying the loaded language model scoring tree according to the pronunciation information to determine the probability score of the text recognition result matching the pronunciation information:
querying, according to the pronunciation information, the probability score of a text recognition result matching the pronunciation information in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries;
if no probability score of a text recognition result matching the pronunciation information exists in the cache, triggering the query in the language model scoring tree.
The common word sequence contains vocabulary and hot words frequently used in daily life, for example names of tourist attractions, place names of provinces, municipalities and autonomous regions, names of Internet celebrities, song titles, and so on. Keeping this common vocabulary in the cache can greatly improve query efficiency.
The text recognition results of historical queries can also be recorded in the cache, so that when the user inputs the same voice information again, the result can be returned to the user directly from the cache, likewise saving query time.
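A minimal sketch of this cache-first lookup, assuming plain dictionaries for the common word sequence and the query history (the concrete array layout of the common word sequence is described below):

```python
# Sketch: consult the history cache and common-word cache first and fall
# back to the scoring tree only on a miss; hits skip the tree entirely.

common_words = {"beidaihe": ("北戴河", 0.99)}  # hot words, place names, etc.
history = {}                                   # pronunciation -> past result

def recognize(pron, query_scoring_tree):
    for cache in (history, common_words):
        if pron in cache:
            return cache[pron]                 # cache hit: no tree query
    result = query_scoring_tree(pron)          # miss: query the scoring tree
    history[pron] = result                     # remember for repeated input
    return result

print(recognize("beidaihe", lambda p: ("北戴河", 0.99)))
```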
Since the language model resources currently in use are rather large, even a pruned language model occupies considerable memory, and pruning also hurts query efficiency. To further save memory, this embodiment of the present invention converts the existing language model scoring tree into a more memory-efficient language model. Specifically, the following operations are added before querying the common word sequence in the cache according to the pronunciation information for the probability score of the text recognition result matching the pronunciation information, so as to form the common word sequence:
forming a first array from the single characters whose probability score of occurrence in the language model scoring tree is higher than a set threshold, together with their probability scores;
forming a second array from the character combinations of at least two characters in the language model scoring tree whose probability score is higher than the set threshold, together with their probability scores;
storing the first array and the second array as the common word sequence.
Specifically, some or all of the single characters contained in the root node of the language model scoring tree are stored, together with their corresponding probability scores, in the form of an array. The character combinations corresponding to a parent node and its child nodes in the scoring tree, together with their probability scores, are likewise stored in the form of an array. By setting the threshold, this embodiment removes the low-probability single characters and character combinations contained in the scoring tree, improving query efficiency.
For example, the characters contained in the root node of the language model scoring tree include "北", "京", "河", "荷", "喝" and so on, with corresponding stored probabilities P1, P2, P3, P4 and P5. This can be implemented with a two-dimensional array, stored in the form shown in Table 5 below:
Table 5
Similarly, the character combinations corresponding to a parent node and its child nodes in the language model scoring tree, together with their probability scores, can also be stored in the form of a two-dimensional array; Table 6 below, for example, shows two-character combinations:
Table 6
In subsequent queries, lookups can be performed directly on the arrays above.
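A minimal sketch of forming the two arrays, assuming the scoring tree is exposed as plain (text, score) pairs; the threshold value 0.001 is illustrative, since the disclosure only requires a set threshold:

```python
# Sketch: keep only entries whose probability score exceeds the threshold;
# single characters form the first array, combinations the second array.

THRESHOLD = 0.001  # illustrative value for the "set threshold"

unigrams = [("北", 0.0123), ("京", 0.0098), ("囧", 0.0002)]
bigrams  = [("北京", 0.0210), ("北囧", 0.0003), ("北戴河", 0.0040)]

first_array  = [(w, p) for w, p in unigrams if p > THRESHOLD]
second_array = [(w, p) for w, p in bigrams  if p > THRESHOLD]

print(first_array)   # [('北', 0.0123), ('京', 0.0098)]
print(second_array)  # [('北京', 0.021), ('北戴河', 0.004)]
```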
Exemplarily, to further improve query efficiency, a positioning table can be used to quickly locate the target region to be queried. Specifically, the following steps may be added after storing the first array and the second array as the common word sequence:
dividing, according to a predetermined rule, the multiple character combinations in the second array into an ordered sequence array and an unordered sequence array, the ordered sequence array containing at least two sub-arrays, each sub-array storing multiple character combinations with the same feature value;
storing the probability scores in the unordered sequence array, together with its starting position and/or ending position, as well as the feature value, starting position and/or ending position of each sub-array, in a positioning table.
The predetermined rule can be set according to the specific scenario, and different scenarios have different division rules, so that a suitable text recognition result can be matched better and faster. Specifically, the sum of the identification value corresponding to the first character of a two-character combination shifted right by a first specified number of bits and the identification value corresponding to the second character shifted left by a second specified number of bits may be taken as the feature value K; two-character combinations whose count for a given feature value K is greater than or equal to a preset value are classified into the ordered sequence array, and two-character combinations whose count for a given feature value K is less than the preset value are classified into the unordered sequence array.
For example, for two-character combinations, Formula 1 below can be used to obtain the ordered sequence array and the unordered sequence array. First, the feature value K of a two-character combination is computed:
K = (M1 >> 3) + (M2 << 13)    (Formula 1)
where the first specified number of bits is 3, the second specified number of bits is 13, ">>" is the right-shift operator, "<<" is the left-shift operator, M1 is the identification value corresponding to the first character, and M2 is the identification value corresponding to the second character.
Two-character combinations whose count for a given feature value K is greater than or equal to the preset value are classified into the ordered sequence array; those whose count is less than the preset value are classified into the unordered sequence array.
An identification value is a number that uniquely identifies a character; typically, when characters are identified by a character encoding, the character's code value (for example its ASCII value) is the identification value. Shifting the identification values of the characters left and right and computing the feature value K amounts to classifying the character combinations by K, grouping combinations with the same feature value K together. If a group would contain too few character combinations, that group is not set up.
Those skilled in the art will understand that the predetermined rule may also be another formula; it is not limited to left and right shifts, nor to the specific numbers of bits shifted above.
For a three-character combination, the feature value K of its first two characters is computed with the formula above, and then the sum of K shifted right by the first specified number of bits and the identification value corresponding to the third character shifted left by the second specified number of bits is taken as the feature value T. Three-character combinations whose count for a given feature value T is greater than or equal to the preset value are classified into the ordered sequence array; those whose count is less than the preset value are classified into the unordered sequence array.
For example, the ordered and unordered sequence arrays can be obtained by combining Formula 1 above with Formula 2 below: first compute the feature value K of the two-character prefix with Formula 1, then obtain the feature value T of the three-character combination with Formula 2:
T = (K >> 3) + (M3 << 13)    (Formula 2)
where the first specified number of bits is 3, the second specified number of bits is 13, K is the feature value corresponding to the combination of M1 and M2, and M3 is the identification value corresponding to the third character;
three-character combinations whose count for a given feature value T is greater than or equal to the preset value are classified into the ordered sequence array, and those whose count is less than the preset value are classified into the unordered sequence array.
The ordered sequence array can be divided into multiple sub-arrays by feature value, each sub-array storing the character combinations with one and the same feature value. The feature value of a two-character combination is computed with Formula 1 above; for a three-character combination it is computed by combining Formula 1 and Formula 2.
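Formulas 1 and 2 translate directly into code; using Unicode code points as the identification values is an assumption, since the disclosure only requires a value that uniquely identifies each character:

```python
# Sketch: Formula 1 (two characters) and Formula 2 (three characters)
# with the example bit widths 3 and 13.

S1, S2 = 3, 13  # first / second specified numbers of bits

def feature_K(m1, m2):
    """Formula 1: K = (M1 >> 3) + (M2 << 13)."""
    return (m1 >> S1) + (m2 << S2)

def feature_T(k, m3):
    """Formula 2: T = (K >> 3) + (M3 << 13)."""
    return (k >> S1) + (m3 << S2)

K = feature_K(ord("北"), ord("京"))                        # two characters
T = feature_T(feature_K(ord("北"), ord("戴")), ord("河"))  # three characters
print(hex(K), hex(T))
```

With these widths, two-character combinations share a feature value exactly when they end in the same character and their first characters agree after the right shift, i.e. their code values differ only in the low three bits; this is what lets groups grow large enough to become ordered sub-arrays.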
The construction process of the positioning table is described in detail below by way of example.
First, all character combinations are enumerated from the language model scoring tree, including two-character combinations, three-character combinations and n-character combinations, where n is a natural number greater than 3; two-character and three-character combinations are the most common. For a two-character combination the feature value is computed with Formula 1 above; for a three-character combination, with Formulas 1 and 2. For example, suppose the combinations computed to have feature value K1 are "北京", "天津", "北戴河", "百度" and "搜狐"; those with feature value K2 are "杭州好", "被带河", "牛奶" and "酸奶"; the combination with feature value K3 is "苏州"; and those with feature value K4 are "夏天热" and "被戴河". The counted numbers of combinations are then 5 for K1, 4 for K2, 1 for K3 and 2 for K4. If the preset value is set to 3, the feature values whose combination count reaches the preset value are classified into the ordered sequence array, and the rest into the unordered sequence array. The resulting ordered sequence array is represented as a list, as shown in Table 7 below; it also contains the feature values and the probability of occurrence of each character combination, which can be obtained directly from the language model scoring tree:
Table 7
The resulting unordered sequence array is likewise represented as a list, as shown in Table 8 below; it also contains the feature values and the probability of occurrence of each character combination, obtained directly from the language model scoring tree:
Table 8
Since the ordered sequence array contains a relatively large number of character combinations and is inconvenient to search directly, its character combinations are further split into multiple sub-arrays by feature value. In Table 7, for example, combinations with the same feature value form one sub-array, giving two sub-arrays in total, as shown in Table 9 below:
Table 9
The unordered sequence array, which contains relatively few character combinations, does not need to be grouped.
Finally, the positioning table is constructed from the sub-arrays of Table 9 and the unordered sequence array of Table 8: the feature value and starting storage position of each sub-array, together with the feature values and probability scores of the character combinations in the unordered sequence array, are placed in the positioning table. The resulting positioning table is shown, for example, in Table 10 below:
Table 10
Alternatively, when constructing the positioning table, the corresponding feature values can be used directly as array subscripts: the subscript of sub-array 1 is K1, the subscript of sub-array 2 is K2, the subscript of unordered sequence array 1 is K3, and the subscript of unordered sequence array 2 is K4. The subscript of each array is then stored directly in the positioning table, giving the positioning table shown in Table 11 below:
Table 11
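A minimal sketch of the construction just described: groups at or above the preset size become ordered sub-arrays addressed by start and end positions, while smaller groups keep their scores directly in the positioning table (the dictionary-based layout mirrors Tables 10 and 11 and is an assumption, as is the use of code points as identification values):

```python
# Sketch: build the positioning table from character combinations grouped
# by feature value.
from collections import defaultdict

S1, S2, PRESET = 3, 13, 3  # shift widths and the preset value of 3

def feat(text):
    # Formula 1 on the first two characters, Formula 2 chained for the rest.
    f = (ord(text[0]) >> S1) + (ord(text[1]) << S2)
    for ch in text[2:]:
        f = (f >> S1) + (ord(ch) << S2)
    return f

def build_positioning_table(combos):
    """combos: {text: probability score} -> (flat ordered storage, table)."""
    groups = defaultdict(list)
    for text, prob in combos.items():
        groups[feat(text)].append((text, prob))
    flat, table = [], {}
    for f, g in groups.items():
        if len(g) >= PRESET:                      # ordered sub-array
            start = len(flat)
            flat.extend(sorted(g))                # sorted for binary search
            table[f] = ("range", start, len(flat))
        else:                                     # unordered: score inline
            table[f] = ("direct", dict(g))
    return flat, table

# "一河", "丁河" and "七河" share a feature value (U+4E00, U+4E01 and U+4E03
# agree after >>3), so they form an ordered sub-array; "牛奶" stays direct.
flat, table = build_positioning_table(
    {"一河": 0.01, "丁河": 0.02, "七河": 0.03, "牛奶": 0.30})
```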
Correspondingly, at query time, loading the language model scoring tree according to the scoring tree information and querying the scoring tree to determine the probability score of the text recognition result matching the pronunciation information specifically includes:
querying the positioning table according to the pronunciation information and the corresponding feature value to determine the sub-array matching the pronunciation information;
querying the matching sub-array using a fast search algorithm to determine the probability score of the text recognition result matching the pronunciation information.
For example, to obtain the text recognition results corresponding to the pronunciation information "beidaihe", all character combinations pronounced "bei", "dai" and "he" are first obtained from the single characters stored in the first array, and the probability score of each combination pronounced "beidaihe" is then looked up via the positioning table. Taking the positioning table of Table 10 as an example, to look up the probability P(被|带河), the feature value of the combination "被带河" is computed with Formulas 1 and 2 as K2; querying the positioning table of Table 10 with K2 shows that the corresponding query range is sub-array 2, so the search goes to sub-array 2 of Table 9 according to the start and end positions recorded in the positioning table, where a fast search algorithm (for example, binary search) yields the probability score P15 for P(被|带河). To look up P(被|戴河), Formulas 1 and 2 give the feature value K4 for "被戴河"; querying the positioning table with K4 shows that the corresponding result is recorded in the positioning table itself, so the probability score P17 for P(被|戴河) is obtained directly. Similarly, the probability score P8 is obtained for P(北|戴河). The probability scores of all character combinations pronounced "beidaihe" are then compared, the combinations are sorted by probability score, and the top-ranked combinations are returned to the user.
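The lookup path just described, sketched over the same assumed layout; the fast search algorithm is realized here as binary search, the dichotomy mentioned above:

```python
# Sketch: the feature value selects either a direct table entry or a
# sub-array range; inside a range, binary search finds the combination.
import bisect

S1, S2 = 3, 13

def feat(text):
    f = (ord(text[0]) >> S1) + (ord(text[1]) << S2)
    for ch in text[2:]:
        f = (f >> S1) + (ord(ch) << S2)
    return f

def score(text, flat, table):
    """Probability score of a character combination, or None if absent."""
    entry = table.get(feat(text))
    if entry is None:
        return None
    if entry[0] == "direct":
        return entry[1].get(text)      # score recorded in the table itself
    _, start, end = entry              # ordered sub-array: binary search
    i = bisect.bisect_left(flat, (text,), start, end)
    return flat[i][1] if i < end and flat[i][0] == text else None

flat = [("一河", 0.01), ("丁河", 0.02), ("七河", 0.03)]   # one sub-array
table = {feat("一河"): ("range", 0, 3),
         feat("牛奶"): ("direct", {"牛奶": 0.30})}
print(score("丁河", flat, table))   # 0.02, via binary search in the range
print(score("牛奶", flat, table))   # 0.30, directly from the table
```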
In the above embodiments, the language model scoring tree is loaded according to the scoring tree information, and the scoring tree is queried to determine the probability score of the text recognition result matching the pronunciation information, where the scoring tree information includes multiple nodes corresponding to characters and each node includes at least the storage location offset between the current node and its child nodes; the text recognition result is then obtained according to the probability scores. There is no need to construct the language model scoring tree dynamically at startup: the embodiments of the present invention load the scoring tree directly according to the storage location offsets between current nodes and child nodes, which greatly shortens the startup time.
In addition, the above embodiments construct a positioning table to roughly locate the character combination to be queried first, and then use a fast search algorithm for the exact lookup that determines the probability score of the text recognition result matching the pronunciation information, further improving query efficiency.
Embodiment 2
FIG. 3 is a schematic structural diagram of a voice recognition apparatus according to Embodiment 2 of the present invention. As shown in FIG. 3, it specifically includes a pronunciation information acquiring module 31, a probability score querying module 32 and a text recognition module 33.
The pronunciation information acquiring module 31 is configured to obtain pronunciation information by recognition from voice information.
The probability score querying module 32 is configured to load the language model scoring tree according to the scoring tree information and query the scoring tree to determine the probability score of the text recognition result matching the pronunciation information, where the scoring tree information includes multiple nodes corresponding to characters and each node includes at least the storage location offset between the current node and its child nodes.
The text recognition module 33 is configured to select a text recognition result according to the probability scores as the final recognition result.
The voice recognition apparatus according to this embodiment of the present invention is used to perform the voice recognition method described in the above embodiments; its technical principle and technical effects are similar and are not repeated here.
Exemplarily, on the basis of the above embodiment, the apparatus further includes a cache querying module 34 and a triggering module 35.
The cache querying module 34 is configured to query, according to the pronunciation information, the probability score of a text recognition result matching the pronunciation information in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries, before the probability score querying module 32 loads the language model scoring tree according to the scoring tree information and queries it to determine the probability score of the text recognition result matching the pronunciation information.
The triggering module 35 is configured to trigger the query in the language model scoring tree if the cache querying module 34 finds no probability score of a text recognition result matching the pronunciation information in the cache.
Exemplarily, the apparatus further includes a first array forming module 36, a second array forming module 37 and a storage module 38.
The first array forming module 36 is configured to form a first array from the single characters whose probability score of occurrence in the language model scoring tree is higher than the set threshold, together with their probability scores, before the cache querying module 34 queries the common word sequence in the cache according to the pronunciation information for the probability score of the text recognition result matching the pronunciation information.
The second array forming module 37 is configured to form a second array from the character combinations of at least two characters in the language model scoring tree whose probability score is higher than the set threshold, together with their probability scores.
The storage module 38 is configured to store the first array and the second array as the common word sequence.
Exemplarily, the apparatus further includes an array decomposition module 39 and a positioning table construction module 310.
The array decomposition module 39 is configured to divide, according to the predetermined rule, the multiple character combinations in the second array into an ordered sequence array and an unordered sequence array after the storage module 38 stores the first array and the second array as the common word sequence, the ordered sequence array containing at least two sub-arrays, each sub-array storing multiple character combinations with the same feature value.
The positioning table construction module 310 is configured to store the probability scores in the unordered sequence array, together with its starting position and/or ending position, as well as the feature value, starting position and/or ending position of each sub-array, in the positioning table.
Correspondingly, the cache querying module 34 is specifically configured to:
query the positioning table according to the pronunciation information and the corresponding feature value to determine the sub-array matching the pronunciation information; and query the matching sub-array using a fast search algorithm to determine the probability score of the text recognition result matching the pronunciation information.
Exemplarily, the array decomposition module 39 is specifically configured to:
take, as the feature value K, the sum of the identification value corresponding to the first character of a two-character combination shifted right by the first specified number of bits and the identification value corresponding to the second character shifted left by the second specified number of bits;
classify two-character combinations whose count for a given feature value K is greater than or equal to the preset value into the ordered sequence array, and classify those whose count is less than the preset value into the unordered sequence array.
Exemplarily, the array decomposition module 39 is further specifically configured to:
take, as the feature value T, the sum of the feature value K shifted right by the first specified number of bits and the identification value corresponding to the third character shifted left by the second specified number of bits;
classify three-character combinations whose count for a given feature value T is greater than or equal to the preset value into the ordered sequence array, and classify those whose count is less than the preset value into the unordered sequence array.
The voice recognition apparatus described in the above embodiments is likewise used to perform the voice recognition method described in the above embodiments; its technical principle and technical effects are similar and are not repeated here.
Embodiment 3
FIG. 4 is a schematic diagram of the hardware structure of a terminal device for implementing voice recognition according to Embodiment 3 of the present invention. The terminal device includes one or more processors 41, a memory 42, and one or more modules (for example, the pronunciation information acquiring module 31, probability score querying module 32, text recognition module 33, cache querying module 34, triggering module 35, first array forming module 36, second array forming module 37, storage module 38, array decomposition module 39 and positioning table construction module 310 of the voice recognition apparatus shown in FIG. 3) stored in the memory 42. FIG. 4 takes one processor 41 as an example; the processor 41 and the memory 42 in the terminal device may be connected by a bus or in other ways, with a bus connection taken as the example in FIG. 4.
When the modules are executed by the one or more processors 41, the following operations are performed:
obtaining pronunciation information by recognition from voice information;
loading the language model scoring tree according to the scoring tree information, and querying the scoring tree to determine the probability score of the text recognition result matching the pronunciation information, where the scoring tree information includes multiple nodes corresponding to characters and each node includes at least the storage location offset between the current node and its child nodes;
selecting a text recognition result according to the probability scores as the final recognition result.
The above terminal device can perform the methods provided in Embodiments 1 and 2 of the present invention, and has the functional modules and beneficial effects corresponding to performing these methods.
Exemplarily, before querying the loaded language model scoring tree according to the pronunciation information to determine the probability score of the text recognition result matching the pronunciation information, the processor 41 queries, according to the pronunciation information, the probability score of a text recognition result matching the pronunciation information in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries; if no probability score of a text recognition result matching the pronunciation information exists in the cache, the query in the language model scoring tree is triggered.
Exemplarily, before querying the common word sequence in the cache according to the pronunciation information for the probability score of the text recognition result matching the pronunciation information, the processor 41 forms a first array from the single characters whose probability score of occurrence in the language model scoring tree is higher than the set threshold, together with their probability scores; forms a second array from the character combinations of at least two characters in the scoring tree whose probability score is higher than the set threshold, together with their probability scores; and stores the first array and the second array as the common word sequence.
Exemplarily, after storing the first array and the second array as the common word sequence, the processor 41 divides, according to the predetermined rule, the multiple character combinations in the second array into an ordered sequence array and an unordered sequence array, the ordered sequence array containing at least two sub-arrays, each sub-array storing multiple character combinations with the same feature value; stores the probability scores in the unordered sequence array, together with its starting position and/or ending position, as well as the feature value, starting position and/or ending position of each sub-array, in the positioning table; queries the positioning table according to the pronunciation information and the corresponding feature value to determine the sub-array matching the pronunciation information; and queries the matching sub-array using a fast search algorithm to determine the probability score of the text recognition result matching the pronunciation information.
Exemplarily, the processor 41 takes, as the feature value K, the sum of the identification value corresponding to the first character of a two-character combination shifted right by the first specified number of bits and the identification value corresponding to the second character shifted left by the second specified number of bits; classifies two-character combinations whose count for a given feature value K is greater than or equal to the preset value into the ordered sequence array; and classifies those whose count is less than the preset value into the unordered sequence array.
Exemplarily, the processor 41 takes, as the feature value T, the sum of the feature value K shifted right by the first specified number of bits and the identification value corresponding to the third character shifted left by the second specified number of bits; classifies three-character combinations whose count for a given feature value T is greater than or equal to the preset value into the ordered sequence array; and classifies those whose count is less than the preset value into the unordered sequence array.
Embodiment 4
An embodiment of the present invention further provides a non-volatile computer storage medium storing one or more modules. When the one or more modules are executed by a device performing the voice recognition method, the device is caused to perform the following operations:
obtaining pronunciation information by recognition from voice information;
loading the language model scoring tree according to the scoring tree information, and querying the scoring tree to determine the probability score of the text recognition result matching the pronunciation information, where the scoring tree information includes multiple nodes corresponding to characters and each node includes at least the storage location offset between the current node and its child nodes;
selecting a text recognition result according to the probability scores as the final recognition result.
When the modules stored in the above storage medium are executed by the device, the operations preferably include, before querying the loaded language model scoring tree according to the pronunciation information to determine the probability score of the text recognition result matching the pronunciation information:
querying, according to the pronunciation information, the probability score of a text recognition result matching the pronunciation information in a common word sequence stored in the cache and/or in recorded text recognition results of historical queries;
if no probability score of a text recognition result matching the pronunciation information exists in the cache, triggering the query in the language model scoring tree.
When the modules stored in the above storage medium are executed by the device, the operations preferably include, before querying the common word sequence in the cache according to the pronunciation information for the probability score of the text recognition result matching the pronunciation information:
forming a first array from the single characters whose probability score of occurrence in the language model scoring tree is higher than the set threshold, together with their probability scores;
forming a second array from the character combinations of at least two characters in the language model scoring tree whose probability score is higher than the set threshold, together with their probability scores;
storing the first array and the second array as the common word sequence.
When the modules stored in the above storage medium are executed by the device, the operations preferably include, after storing the first array and the second array as the common word sequence:
dividing, according to a predetermined rule, the multiple character combinations in the second array into an ordered sequence array and an unordered sequence array, the ordered sequence array containing at least two sub-arrays, each sub-array storing multiple character combinations with the same feature value;
storing the probability scores in the unordered sequence array, together with its starting position and/or ending position, as well as the feature value, starting position and/or ending position of each sub-array, in a positioning table.
Correspondingly, loading the language model scoring tree according to the scoring tree information and querying the scoring tree to determine the probability score of the text recognition result matching the pronunciation information is preferably:
querying the positioning table according to the pronunciation information and the corresponding feature value to determine the sub-array matching the pronunciation information;
querying the matching sub-array using a fast search algorithm to determine the probability score of the text recognition result matching the pronunciation information.
When the modules stored in the above storage medium are executed by the device, dividing the two-character combinations in the second array into the ordered sequence array and the unordered sequence array according to the predetermined rule is preferably:
taking, as the feature value K, the sum of the identification value corresponding to the first character of a two-character combination shifted right by the first specified number of bits and the identification value corresponding to the second character shifted left by the second specified number of bits;
classifying two-character combinations whose count for a given feature value K is greater than or equal to the preset value into the ordered sequence array;
classifying two-character combinations whose count for a given feature value K is less than the preset value into the unordered sequence array.
When the modules stored in the above storage medium are executed by the device, dividing the three-character combinations in the second array into the ordered sequence array and the unordered sequence array according to the predetermined rule is preferably:
taking, as the feature value T, the sum of the feature value K shifted right by the first specified number of bits and the identification value corresponding to the third character shifted left by the second specified number of bits;
classifying three-character combinations whose count for a given feature value T is greater than or equal to the preset value into the ordered sequence array;
classifying three-character combinations whose count for a given feature value T is less than the preset value into the unordered sequence array.
Note that the above are only the preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in considerable detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; its scope is determined by the scope of the appended claims.
Claims (14)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201510427908.5 | 2015-07-20 | ||
| CN201510427908.5A CN105096944B (en) | 2015-07-20 | 2015-07-20 | Audio recognition method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017012243A1 true WO2017012243A1 (en) | 2017-01-26 |
Family
ID=54577230
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2015/096622 Ceased WO2017012243A1 (en) | 2015-07-20 | 2015-12-08 | Voice recognition method and apparatus, terminal device and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN105096944B (en) |
| WO (1) | WO2017012243A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110032716A (en) * | 2019-04-17 | 2019-07-19 | 北京地平线机器人技术研发有限公司 | Character coding method and device, readable storage medium storing program for executing and electronic equipment |
| CN111261165A (en) * | 2020-01-13 | 2020-06-09 | 佳都新太科技股份有限公司 | Station name identification method, device, equipment and storage medium |
| CN111898923A (en) * | 2020-08-12 | 2020-11-06 | 中国人民解放军总医院第二医学中心 | Information analysis method |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105096944B (en) * | 2015-07-20 | 2017-11-03 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
| CN109003608A (en) * | 2018-08-07 | 2018-12-14 | 北京东土科技股份有限公司 | Court's trial control method, system, computer equipment and storage medium |
| CN110164416B (en) * | 2018-12-07 | 2023-05-09 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, equipment and storage medium thereof |
| CN111326147B (en) * | 2018-12-12 | 2023-11-17 | 北京嘀嘀无限科技发展有限公司 | Speech recognition method, device, electronic equipment and storage medium |
| CN113903342B (en) * | 2021-10-29 | 2022-09-13 | 镁佳(北京)科技有限公司 | Voice recognition error correction method and device |
| CN115240644B (en) * | 2022-07-18 | 2025-05-27 | 网易(杭州)网络有限公司 | Speech recognition method, device, storage medium and electronic device |
| CN120636406A (en) * | 2025-07-08 | 2025-09-12 | 杭州灵伴科技有限公司 | Speech recognition method, system, device, medium and product for text matching |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1346112A (en) * | 2000-09-27 | 2002-04-24 | 中国科学院自动化研究所 | Integrated prediction searching method for Chinese continuous speech recognition |
| US20070055525A1 (en) * | 2005-08-31 | 2007-03-08 | Kennewick Robert A | Dynamic speech sharpening |
| JP2009156941A (en) * | 2007-12-25 | 2009-07-16 | Advanced Telecommunication Research Institute International | Storage medium recording tree structure dictionary, tree structure dictionary creating apparatus, and tree structure dictionary creating program |
| CN101604522A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | The embedded Chinese and English mixing voice recognition methods and the system of unspecified person |
| US7810024B1 (en) * | 2002-03-25 | 2010-10-05 | Adobe Systems Incorporated | Efficient access to text-based linearized graph data |
| CN103577548A (en) * | 2013-10-12 | 2014-02-12 | 优视科技有限公司 | Method and device for matching characters with close pronunciation |
| CN104238991A (en) * | 2013-06-21 | 2014-12-24 | 腾讯科技(深圳)有限公司 | Voice input matching method and voice input matching device |
| CN105096944A (en) * | 2015-07-20 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Speech recognition method and apparatus |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030187843A1 (en) * | 2002-04-02 | 2003-10-02 | Seward Robert Y. | Method and system for searching for a list of values matching a user defined search expression |
| CN101398830B (en) * | 2007-09-27 | 2012-06-27 | 阿里巴巴集团控股有限公司 | Thesaurus fuzzy enquiry method and thesaurus fuzzy enquiry system |
| CN101576929B (en) * | 2009-06-16 | 2011-11-30 | 程治永 | Fast vocabulary entry prompting realization method |
| CN103577394B (en) * | 2012-07-31 | 2016-08-24 | 阿里巴巴集团控股有限公司 | A kind of machine translation method based on even numbers group searching tree and device |
| CN104485107B (en) * | 2014-12-08 | 2018-06-22 | 畅捷通信息技术股份有限公司 | Audio recognition method, speech recognition system and the speech recognition apparatus of title |
Also Published As
| Publication number | Publication date |
|---|---|
| CN105096944B (en) | 2017-11-03 |
| CN105096944A (en) | 2015-11-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2017012243A1 (en) | Voice recognition method and apparatus, terminal device and storage medium | |
| US8082270B2 (en) | Fuzzy search using progressive relaxation of search terms | |
| CN110019647B (en) | A keyword search method, device and search engine | |
| CN107247745B (en) | A kind of information retrieval method and system based on pseudo-linear filter model | |
| WO2014000517A1 (en) | Recommendation system and method for input searching | |
| CN109830285B (en) | Medical image file processing method and device | |
| CN111611471B (en) | Searching method and device and electronic equipment | |
| CN106815179B (en) | Text similarity determination method and device | |
| CN102402502A (en) | Word segmentation processing method and device for search engine | |
| CN103365992A (en) | Method for realizing dictionary search of Trie tree based on one-dimensional linear space | |
| CN106681981B (en) | Chinese part-of-speech tagging method and device | |
| CN109918664B (en) | Word segmentation method and device | |
| CN111026281B (en) | Phrase recommendation method of client, client and storage medium | |
| US20190087466A1 (en) | System and method for utilizing memory efficient data structures for emoji suggestions | |
| CN105653546B (en) | Method and system for retrieving a target subject | |
| US20150356173A1 (en) | Search device | |
| Sun et al. | Allies: Prompting large language model with beam search | |
| CN109215636B (en) | A method and system for classifying voice information | |
| CN106997354B (en) | POI data retrieval method and device | |
| TW495736B (en) | Method for generating candidate strings in speech recognition | |
| CN113139383B (en) | Document ordering method, system, electronic device and storage medium | |
| EP3859554A1 (en) | Method and apparatus for indexing multi-dimensional records based upon similarity of the records | |
| CN111221975B (en) | Method and device for extracting field and computer storage medium | |
| CN115455294A (en) | Title core content determination method, search request processing method and related devices | |
| CN115292478A (en) | Method, device, equipment and storage medium for recommending search content |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15898799 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 15898799 Country of ref document: EP Kind code of ref document: A1 |