
WO2018192186A1 - Speech recognition method and apparatus - Google Patents


Info

Publication number
WO2018192186A1
WO2018192186A1, application PCT/CN2017/104382 (CN2017104382W)
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
probability
speech signal
classification result
feature classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/104382
Other languages
French (fr)
Chinese (zh)
Inventor
李忠杰 (Li Zhongjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Publication of WO2018192186A1
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a voice recognition method and apparatus.
  • speech recognition technology mainly converts the speech signal to be recognized into text information through a signal processing module, a feature extraction module, an acoustic model, a language model (LM), a dictionary, and a decoder, completing the speech recognition.
  • the signal processing module and the feature extraction module first divide the speech signal to be recognized into a plurality of speech signal frames and enhance each frame by eliminating noise, channel distortion, and the like; each frame is then transformed from the time domain to the frequency domain, and appropriate acoustic features are extracted from the converted frames.
  • the acoustic model, trained on the characteristic parameters of a training speech library, takes the acoustic features extracted by the feature extraction module as input, maps each speech signal frame to the pronunciations capable of describing its pronunciation features together with the probability of each mapping, and yields the feature classification result.
  • the language model contains the associations between different lexical units (such as characters, words, and phrases) and their probabilities, which are used to estimate the likelihood of the various pieces of text information composed of those units.
  • the decoder can establish an identification network based on the trained acoustic model, the language model, and the dictionary; each path in the network corresponds to a piece of text information and its pronunciations. For the pronunciations output by the acoustic model, the decoder finds the best path in the identification network; based on that path, the text information corresponding to the speech signal can be output with maximum probability, completing the speech recognition.
  • the language model is generally trained on a large corpus and encodes the associations and likelihoods of a very large number of lexical units; the recognition network built on such a model therefore contains many nodes, and each node has very many branches.
  • when a path search is performed in the identification network, the number of nodes involved grows exponentially with the pronunciations of each speech signal frame, yielding a very large search space; the search process takes more time, which in turn reduces speech recognition efficiency.
  • the present invention provides a voice recognition method and apparatus, which solve the problem that the voice recognition process is time-consuming and inefficient.
  • a speech recognition method comprising the steps of: obtaining a feature classification result of the speech signal to be recognized, the feature classification result including pronunciations that describe the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to each corresponding pronunciation; filtering the pronunciations included in the feature classification result based on the probabilities included in the feature classification result; and recognizing the speech signal based on the filtered feature classification result.
  • the filtering, based on the probabilities included in the feature classification result, of the pronunciations included in the feature classification result includes: filtering out a pronunciation when it satisfies a predetermined filtering rule.
  • if the difference between the probability of mapping a speech signal frame to a pronunciation and the maximum mapping probability of that frame falls within a predetermined difference range, the pronunciation is determined to satisfy the predetermined filtering rule;
  • if the probability of mapping a speech signal frame to a pronunciation is less than the probabilities with which that frame is mapped to a predetermined number of other pronunciations, the pronunciation is determined to satisfy the predetermined filtering rule.
  • the predetermined number is either of the following: the number of pronunciations retained in the feature classification result among the pronunciations corresponding to the speech signal frame; or the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the speech signal frame.
  • the filtering of the pronunciations included in the feature classification result, based on the probabilities included in the feature classification result, may also include: deleting the pronunciations that satisfy the predetermined filtering rule from the pronunciations included in the feature classification result.
  • the deleting of the pronunciations that satisfy the predetermined filtering rule from the pronunciations included in the feature classification result includes:
  • determining a pronunciation as a candidate pronunciation when the probability that a speech signal frame is mapped to it satisfies the predetermined filtering rule;
  • deleting the candidate pronunciation from the pronunciations included in the feature classification result if, for any adjacent speech signal frame within a predetermined number of frames of the speech signal frame, the probability of mapping that frame to the candidate pronunciation also satisfies the predetermined filtering rule;
  • retaining the candidate pronunciation in the pronunciations included in the feature classification result otherwise.
  • a speech recognition apparatus comprising:
  • a classification result obtaining module configured to obtain a feature classification result of the speech signal to be identified;
  • the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame and a probability that each speech signal frame is mapped to a corresponding pronunciation;
  • a pronunciation filtering module configured to filter a pronunciation included in the feature classification result based on a probability included in the feature classification result
  • a voice recognition module configured to identify the voice signal based on the filtered feature classification result.
  • the pronunciation filtering module further includes:
  • a first filtering module configured to filter out a pronunciation when the difference between the probability of mapping any speech signal frame to that pronunciation and the maximum mapping probability of the frame falls within a predetermined difference range;
  • the second filtering module is configured to filter the corresponding pronunciation when the probability of mapping any of the speech signal frames to the corresponding pronunciation is less than the probability that the speech signal frame is mapped to a predetermined number of pronunciations.
  • the pronunciation filtering module includes:
  • a probability distribution module configured to obtain a histogram distribution of the probability that any speech signal frame is mapped to each pronunciation;
  • a beam width determining module configured to acquire a beam width corresponding to the histogram distribution;
  • a pronunciation determining module configured to determine the pronunciations whose probabilities are distributed outside the beam width as the pronunciations that satisfy the predetermined filtering rule;
  • the pronunciation deletion module is configured to delete the pronunciation that satisfies the predetermined filtering rule from the pronunciation included in the feature classification result.
  • the pronunciation filtering module includes:
  • a candidate pronunciation module configured to determine the pronunciation as a candidate pronunciation when a probability that any of the speech signal frames are mapped to the corresponding pronunciation meets a predetermined filtering rule
  • a candidate pronunciation deletion module configured to delete the candidate pronunciation from the pronunciations included in the feature classification result when, for any adjacent speech signal frame within a predetermined number of frames of the speech signal frame, the probability of mapping that frame to the candidate pronunciation satisfies a predetermined filtering rule;
  • a candidate pronunciation retaining module configured to retain the candidate pronunciation in the pronunciations included in the feature classification result when the probabilities of mapping the adjacent speech signal frames within the predetermined number of frames to the candidate pronunciation do not satisfy the predetermined filtering rule.
  • in the embodiments provided by the present invention, when a voice signal is recognized, the feature classification result of the voice signal is first acquired, and the pronunciations included in the feature classification result are then filtered based on the probabilities included in it. In the subsequent process of recognizing the voice signal, no recognition operations related to the filtered-out pronunciations need to be performed; for example, the identification network need not be searched for paths related to them. This effectively reduces the time spent in the speech recognition process and in turn improves speech recognition efficiency.
  • FIG. 1 is a flow chart showing a voice recognition method according to an exemplary embodiment of the present invention
  • FIG. 2 is a flow chart showing a voice recognition method according to another exemplary embodiment of the present invention.
  • FIG. 3 is a logic block diagram of a voice recognition apparatus according to an exemplary embodiment of the present invention.
  • FIG. 4 is a logic block diagram of a voice recognition apparatus according to another exemplary embodiment of the present invention.
  • FIG. 5 is a hardware configuration diagram of a voice recognition apparatus according to an exemplary embodiment of the present invention.
  • the terms first, second, third, etc. may be used to describe various information in the present invention, but such information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
  • first information may also be referred to as the second information without departing from the scope of the invention.
  • second information may also be referred to as the first information.
  • word "if” as used herein may be interpreted as "when” or “when” or “in response to a determination.”
  • an acoustic model and a language model are involved in the recognition process. The acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, and speaker characteristics such as gender and accent.
  • an LSTM (Long Short-Term Memory) network, a CTC (Connectionist Temporal Classification) model, or a Hidden Markov Model (HMM) can be trained on the speech contained in the training speech library to obtain the mapping from the acoustic features of the speech to pronunciations; this mapping constitutes the acoustic model, which depends on the chosen modeling unit.
  • if the modeling unit is a syllable, the pronunciation is a syllable; if the modeling unit is a phoneme, the pronunciation is a phoneme; if the modeling unit is a state constituting a phoneme, the pronunciation is a state.
  • pronunciation varies with the words spoken, speed, intonation, stress, dialect, and the like, so the training speech library needs to cover such variation.
  • a smaller pronunciation unit such as a syllable, phoneme, or state can be selected as the modeling unit; based on the large amount of speech contained in the training speech library and the predetermined modeling unit, model training will construct a large number of acoustic models.
  • when such a large number of acoustic models is used to classify the speech signal to be recognized, the resulting feature classification result will contain a large number of pronunciations (categories), for example 3,000 to 10,000.
  • to recognize the text information corresponding to a speech signal, current speech recognition technology must search the identification network for all possible paths involving each pronunciation, and the search space grows exponentially during the search. Searching all possible paths involving 3,000 to 10,000 pronunciations may exceed the storage and computation limits of the speech recognition system. Current speech recognition technology therefore consumes much time and many resources and suffers from low recognition efficiency; the present invention proposes a solution for improving that efficiency.
  • the solution of the present invention improves on this by operating on the feature classification result obtained during speech recognition. A filtering rule is set according to the device resources involved in speech recognition and the required recognition efficiency; then, when a speech signal is recognized, its feature classification result is acquired first, and the pronunciations included in the result are filtered based on the probabilities included in it. In the subsequent recognition of the speech signal, the identification network need not be searched for paths associated with the filtered-out pronunciations, which effectively reduces the time spent in the search and thereby improves speech recognition efficiency.
  • the speech recognition process of the present invention will be described in detail below with reference to the accompanying drawings.
  • FIG. 1 is a flowchart of a voice recognition method according to an exemplary embodiment of the present invention.
  • the embodiment can be applied to various electronic devices having voice processing capabilities; the method may include the following steps S101-S103:
  • Step S101 Acquire a feature classification result of the speech signal to be identified; the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame and a probability that each speech signal frame is mapped to a corresponding pronunciation.
  • Step S102 Filter the pronunciations included in the feature classification result based on the probability included in the feature classification result.
  • Step S103 Identify the voice signal based on the filtered feature classification result.
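For a concrete picture of steps S101-S103, the flow can be sketched in Python. This is an illustrative toy, not the patent's implementation: the function names, the dictionary representation of the feature classification result, and the data are all hypothetical, and a real step S103 would search an identification network rather than take per-frame maxima.

```python
# Toy sketch of steps S101-S103. The feature classification result is modeled
# as a list with one entry per speech signal frame, each entry mapping a
# pronunciation ID to the probability that the frame maps to that pronunciation.

def filter_classification(frames, keep_n=2):
    """Step S102: retain only the keep_n most probable pronunciations per frame."""
    filtered = []
    for probs in frames:
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:keep_n]
        filtered.append(dict(top))
    return filtered

def recognize(frames):
    """Step S103 stand-in: a real decoder searches an identification network;
    here we simply take the most probable pronunciation of each frame."""
    return [max(probs, key=probs.get) for probs in frames]

# Step S101 stand-in: a made-up feature classification result for two frames.
result = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.2, "b": 0.5, "c": 0.3},
]
filtered = filter_classification(result, keep_n=2)
print(recognize(filtered))  # ['a', 'b']
```

The point of the sketch is only the ordering of the steps: filtering (S102) shrinks the per-frame pronunciation set before any decoding work (S103) is done.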
  • the voice signal may be speech uttered by a user and collected in real time by a local voice collection device, or speech transmitted remotely from a voice collection device.
  • the speech signal may be preprocessed in real time by a speech preprocessing module, and features are extracted from the preprocessed signal by the feature extraction module; the extracted features may include PLP (Perceptual Linear Predictive), LPCC (Linear Predictive Cepstral Coefficient), FBANK (Mel-Scale Filter Bank), and MFCC (Mel-Frequency Cepstral Coefficients) features.
  • the solution of the present invention considers that some of the pronunciations included in the feature classification result have low correlation with the speech signal frames of the speech signal to be recognized and little influence on the speech recognition accuracy.
  • such pronunciations can therefore be filtered out of the feature classification result before recognition is performed on it, reducing the number of pronunciations the result contains and thereby improving the efficiency of speech recognition.
  • the pronunciations included in the feature classification result may be filtered based on the probability that each speech signal frame is mapped to each pronunciation; after filtering, the probability that any frame is mapped to a filtered-out pronunciation is smaller than the probability that it is mapped to any retained pronunciation.
  • Filtering method 1: filter low-probability pronunciations according to a predetermined number. The predetermined number may refer to the number of pronunciations retained in the feature classification result among the pronunciations corresponding to a speech signal frame, or to the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the frame.
  • if the probability that a speech signal frame is mapped to a pronunciation is less than the probabilities with which the frame is mapped to a predetermined number of other pronunciations, that pronunciation is determined to satisfy the predetermined filtering rule.
  • the predetermined ratio threshold may be set by the designer of the present invention according to the required speech recognition accuracy, for example to 1/4, meaning the ratio of retained pronunciations to the total number of pronunciations.
  • in actual filtering, pronunciations may be deleted from the feature classification result in order of increasing probability until the ratio of retained pronunciations to original pronunciations satisfies the predetermined ratio threshold, completing the filtering of the feature classification result.
  • alternatively, the predetermined ratio threshold may refer to the ratio of filtered-out pronunciations to retained pronunciations; pronunciations are then selected in order of increasing probability, and when the ratio of selected pronunciations to remaining pronunciations satisfies the threshold, the filtering of the feature classification result is complete.
  • the designer of the present invention may instead set the predetermined number directly according to the required speech recognition accuracy, for example to any value from 2,000 to 9,000, as the number of pronunciations to delete: the pronunciations mapped to each speech signal frame are arranged in descending order of probability, and the lowest-ranked predetermined number of pronunciations are deleted from the feature classification result.
  • the predetermined number may also refer to the number of pronunciations that are not filtered out, for example 1,000: the pronunciations mapped to each speech signal frame are arranged in descending order of probability, the first predetermined number of pronunciations are retained in the feature classification result, and the other pronunciations are deleted from it, completing the filtering of the feature classification result.
  • other technical means may also be used to filter the feature classification result in this filtering manner, which is not limited by the present invention.
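As a hedged illustration of filtering method 1 (with hypothetical names and data, and with one frame's feature classification result modeled as a dict from pronunciation to probability), both the count-based and the ratio-based variants can be sketched as:

```python
def filter_top_n(frame_probs, keep_n):
    """Variant A: keep the keep_n highest-probability pronunciations of a frame."""
    ranked = sorted(frame_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep_n])

def filter_by_ratio(frame_probs, ratio):
    """Variant B: keep a predetermined fraction (e.g. 1/4) of the pronunciations,
    i.e. the predetermined number is ratio * total number of pronunciations."""
    keep_n = max(1, int(len(frame_probs) * ratio))
    return filter_top_n(frame_probs, keep_n)

frame = {"p1": 0.40, "p2": 0.25, "p3": 0.20, "p4": 0.10,
         "p5": 0.03, "p6": 0.01, "p7": 0.005, "p8": 0.005}
print(filter_by_ratio(frame, 1/4))  # keeps 2 of 8: {'p1': 0.4, 'p2': 0.25}
```

In a real system the per-frame tables would come from the acoustic model and hold thousands of entries; the logic is otherwise the same.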
  • Filtering method 2: filter low-probability pronunciations according to a predetermined difference threshold, which can be set by the designer of the present invention according to the required speech recognition accuracy, for example to -0.5. What is examined is the difference between the probability of the pronunciation to be filtered and the maximum probability with which the same speech signal frame is mapped to any pronunciation.
  • if the difference between the probability of mapping a speech signal frame to a pronunciation and the maximum mapping probability of that frame falls within the predetermined difference range, the pronunciation is determined to satisfy the predetermined filtering rule and is filtered out.
  • for example, the pronunciations mapped to each speech signal frame may be arranged in descending order of probability, and the probability of the first-ranked pronunciation is taken as the maximum. Starting from the last-ranked pronunciation, the difference between each pronunciation's mapping probability and the maximum is obtained in turn; if the difference is less than -0.5, the pronunciation is removed from the feature classification result.
  • other technical means may be used to filter the feature classification result according to the filtering method. The present invention does not limit this.
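Filtering method 2 can be sketched similarly. Whether the -0.5 threshold applies to raw probabilities or to log-probabilities is not specified in this translation; the sketch below assumes raw probabilities on [0, 1], and all names and data are hypothetical:

```python
def filter_by_max_gap(frame_probs, threshold=-0.5):
    """Retain a pronunciation only if its probability minus the frame's maximum
    mapping probability stays within the range [threshold, 0]; pronunciations
    falling further below the maximum satisfy the rule and are filtered out."""
    p_max = max(frame_probs.values())
    return {p: v for p, v in frame_probs.items() if v - p_max >= threshold}

frame = {"p1": 0.70, "p2": 0.25, "p3": 0.04, "p4": 0.01}
print(filter_by_max_gap(frame))  # p3 and p4 trail p1 by more than 0.5
```

Unlike method 1, the number of retained pronunciations here varies per frame with how sharply peaked the distribution is.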
  • Filtering method 3: filter the pronunciations distributed outside a beam width according to the histogram distribution of the probabilities.
  • specifically, the histogram distribution of the probabilities with which a speech signal frame is mapped to each pronunciation may be acquired first; a beam width corresponding to the histogram distribution is then acquired; the pronunciations whose probabilities fall outside the beam width are determined to satisfy the predetermined filtering rule; and finally those pronunciations are removed from the pronunciations included in the feature classification result.
  • the beam width can be determined by the designer of the present invention according to the required speech recognition accuracy and the shape of the histogram. For example, if 8,000 low-probability pronunciations are to be filtered out, 8,000 pronunciations can be counted from the low-probability side of the histogram, and the position of the 8,000th pronunciation is taken as the beam width boundary.
  • other technical means may be used to filter the feature classification result according to the filtering manner. The present invention does not limit this.
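A minimal sketch of filtering method 3. Instead of building an explicit histogram, it sorts the per-frame probabilities and counts off the pronunciations to be discarded from the low-probability side, which yields the same boundary; names and data are hypothetical:

```python
def filter_by_beam_width(frame_probs, n_filter):
    """Discard the n_filter lowest-probability pronunciations. Sorting stands in
    for scanning the probability histogram from its low-probability side; the
    n_filter-th entry marks the beam-width boundary."""
    ranked = sorted(frame_probs.items(), key=lambda kv: kv[1])  # low -> high
    return dict(ranked[n_filter:])

frame = {"p1": 0.50, "p2": 0.30, "p3": 0.15, "p4": 0.04, "p5": 0.01}
print(filter_by_beam_width(frame, 3))  # only p1 and p2 stay inside the beam
```

A histogram-based implementation would bucket the probabilities and accumulate bin counts until n_filter is reached, which avoids a full sort when the pronunciation set is large.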
  • after the filtering, the predetermined identification network may be searched directly for the best path among the paths related to the pronunciations included in the filtered feature classification result; based on that path, the text information corresponding to the speech signal to be recognized is output with maximum probability, completing the speech recognition.
  • the identification network mentioned here may refer to the network that the decoder establishes for the speech signal to be recognized from the well-trained acoustic model, language model, and dictionary.
  • during the path search, the probabilities (acoustic scores) contained in the feature classification result can be converted to a scale comparable to the association probabilities (language scores) between lexical units (such as characters, words, and phrases) contained in the language model.
  • for each speech signal frame, the surviving paths are limited by a preset threshold: if a path's score differs from that of the optimal path by more than the threshold, the path is discarded; otherwise it is retained.
  • in addition, the paths are sorted and pruned to a preset maximum number of paths, so that only the best paths are retained, until the path map is completed at the last frame.
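The two pruning criteria just described (a score gap to the best path, and a cap on the number of retained paths) can be sketched as a single pruning step applied per frame; the score representation and all names are hypothetical:

```python
def prune_paths(path_scores, beam, max_paths):
    """path_scores maps a partial path to its cumulative score. First discard
    any path whose score trails the best path by more than `beam`; then keep
    at most `max_paths` of the survivors, best first."""
    best = max(path_scores.values())
    survivors = {p: s for p, s in path_scores.items() if best - s <= beam}
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_paths])

paths = {("a",): -1.0, ("b",): -1.4, ("c",): -3.0, ("d",): -9.0}
print(prune_paths(paths, beam=2.5, max_paths=2))  # ('d',) beamed out, ('c',) capped
```

Real decoders apply both limits for the same reason given in the text: without them, the number of live paths grows exponentially frame by frame.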
  • in addition, the modeling unit of the acoustic model that outputs the feature classification result may be small.
  • for example, when a state is the modeling unit: since a single phoneme can be composed of three to five states, the voice signal formed by pronouncing one phoneme can be divided into a plurality of speech signal frames. The acoustic characteristics of several consecutive frames are therefore often similar, and the pronunciations of those consecutive frames in the feature classification result are likewise prone to be similar.
  • if the present invention filtered the pronunciations mapped to each frame in isolation, based only on the probabilities contained in the feature classification result and the predetermined filtering rules, it could easily filter out pronunciations that have a great influence on the recognition accuracy.
  • to avoid this, the filtering status of consecutive speech signal frames can be considered comprehensively when filtering the classification result.
  • the specific implementation can refer to FIG. 2; the method shown in FIG. 2 includes the following steps S201-S205:
  • Step S201 Acquire a feature classification result of the speech signal to be identified; the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame and a probability that each speech signal frame is mapped to a corresponding pronunciation.
  • Step S202 If the probability that any speech signal frame is mapped to a pronunciation satisfies the predetermined filtering rule, determine that pronunciation as a candidate pronunciation.
  • Step S203 If, for any adjacent speech signal frame within a predetermined number of frames of the speech signal frame, the probability of mapping that frame to the candidate pronunciation also satisfies the predetermined filtering rule, delete the candidate pronunciation from the pronunciations included in the feature classification result.
  • Step S204 If none of the adjacent speech signal frames within the predetermined number of frames satisfies the predetermined filtering rule for the candidate pronunciation, retain the candidate pronunciation in the pronunciations included in the feature classification result.
  • Step S205 Identify the voice signal based on the filtered feature classification result.
  • the predetermined filtering rule may be any one of the filtering methods 1 to 3 described above, or another filtering rule that can limit the influence of the filtered pronunciations on the recognition accuracy.
  • the predetermined number of frames of consecutive speech signal frames can be set by the designer of the present invention according to the required speech recognition accuracy, for example to 6, meaning the three preceding and the three following adjacent frames.
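A sketch of the context-aware filtering of steps S202-S204, assuming (hypothetically) that the filtering rule is given as a predicate over a frame's probability table. A pronunciation flagged in one frame is deleted only if it is also flagged in at least one adjacent frame within the context window; otherwise it is retained:

```python
def filter_with_context(frames, rule, n_context=3):
    """Steps S202-S204: a pronunciation whose probability satisfies `rule` in
    some frame becomes a candidate; it is deleted only if `rule` is also
    satisfied for it in an adjacent frame within n_context frames on either
    side, and retained otherwise."""
    flagged = [{p for p in probs if rule(probs, p)} for probs in frames]
    out = []
    for i, probs in enumerate(frames):
        kept = {}
        for p, v in probs.items():
            if p in flagged[i]:
                lo, hi = max(0, i - n_context), min(len(frames), i + n_context + 1)
                if any(p in flagged[j] for j in range(lo, hi) if j != i):
                    continue  # delete: an adjacent frame also flags this pronunciation
            kept[p] = v
        out.append(kept)
    return out

frames = [{"a": 0.05, "b": 0.95}, {"a": 0.06, "b": 0.94}, {"a": 0.9, "b": 0.1}]
low_prob = lambda probs, p: probs[p] < 0.1  # toy stand-in for the filtering rule
print(filter_with_context(frames, low_prob, n_context=1))
```

In the toy data, "a" is flagged in the first two frames and so is deleted from both, while "b" in the last frame survives because no adjacent frame flags it; this mirrors the retain branch of step S204.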
  • the speech recognition method of the present invention first acquires the feature classification result of the speech signal and then filters the pronunciations included in the result based on the probabilities it contains.
  • in the process of recognizing the voice signal, it is then no longer necessary to perform recognition operations related to the filtered-out pronunciations, such as searching the identification network for paths related to them, thereby effectively reducing the time spent in speech recognition and improving its efficiency.
  • the voice recognition method in the embodiments of the present invention can be applied to the human-computer interaction software of various electronic devices, for example voice dialing, voice control, and voice search in a smartphone. Taking voice search in a smartphone as an example: when the user speaks within a predetermined range of the smartphone, the method obtains the feature classification result of the collected voice, filters the pronunciations included in the result based on the probabilities it contains, and then searches only the paths related to the unfiltered pronunciations in the recognition network. The text information corresponding to the user's voice is quickly identified through the path search, and the voice assistant can then respond to the user quickly based on the recognition result.
  • the invention also provides an embodiment of the device.
  • FIG. 3 is a logic block diagram of a voice recognition apparatus according to an exemplary embodiment of the present invention.
  • the apparatus may include: a classification result obtaining module 310, a pronunciation filtering module 320, and a voice recognition module 330.
  • the classification result obtaining module 310 is configured to obtain a feature classification result of the speech signal to be identified; the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame, and each speech signal frame is mapped to a corresponding pronunciation. Probability.
  • the pronunciation filtering module 320 is configured to filter the pronunciations included in the feature classification result based on the probability included in the feature classification result.
  • the voice recognition module 330 is configured to identify the voice signal based on the filtered feature classification result.
  • the pronunciation filtering module 320 can include:
  • a first filtering module configured to filter out a pronunciation when the difference between the probability of mapping any speech signal frame to that pronunciation and the maximum mapping probability of the frame falls within a predetermined difference range.
  • the second filtering module is configured to filter the corresponding pronunciation when the probability of mapping any of the speech signal frames to the corresponding pronunciation is less than the probability that the speech signal frame is mapped to a predetermined number of pronunciations.
  • the pronunciation filtering module 320 may further include:
  • a probability distribution module is configured to obtain a histogram distribution of the probability that any speech signal frame is mapped to each pronunciation.
  • a beam width determining module for acquiring a beam width corresponding to the histogram distribution.
• The pronunciation determining module is configured to determine pronunciations whose probabilities are distributed outside the beam width as pronunciations satisfying the predetermined filtering rule.
  • the pronunciation deletion module is configured to delete the pronunciation that satisfies the predetermined filtering rule from the pronunciation included in the feature classification result.
  • FIG. 4 is a logic block diagram of a voice recognition apparatus according to another exemplary embodiment of the present invention.
  • the apparatus may include a classification result acquisition module 410, a pronunciation filtering module 420, and a voice recognition module 430.
  • the pronunciation filtering module 420 can include a candidate pronunciation determination module 421, a candidate pronunciation deletion module 422, and a candidate pronunciation retention module 423.
• The classification result obtaining module 410 is configured to obtain a feature classification result of the speech signal to be recognized; the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation.
  • the candidate pronunciation determining module 421 is configured to determine the pronunciation as a candidate pronunciation when the probability that any of the speech signal frames are mapped to the corresponding pronunciation meets a predetermined filtering rule.
• The candidate pronunciation deleting module 422 is configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when the probability that any one of a predetermined number of speech signal frames adjacent to the speech signal frame is mapped to the candidate pronunciation satisfies the predetermined filtering rule.
• The candidate pronunciation retaining module 423 is configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when none of the probabilities that the predetermined number of speech signal frames adjacent to the speech signal frame are mapped to the candidate pronunciation satisfies the predetermined filtering rule.
  • the voice recognition module 430 is configured to identify the voice signal based on the filtered feature classification result.
• Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the partial descriptions of the method embodiments for relevant details.
• The apparatus embodiments described above are merely illustrative. The units or modules described as separate components may or may not be physically separate, and the components displayed as units or modules may or may not be physical units or modules; they can be located in one place or distributed across multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of the present invention. Those of ordinary skill in the art can understand and implement the solution without creative effort.
  • Embodiments of the speech recognition apparatus of the present invention can be applied to an electronic device.
• The apparatus embodiments may be implemented by a computer chip or an entity, or by a product having a certain function.
• The electronic device is, for example, a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email transceiver device, a game console, a tablet, a wearable device, an Internet TV, a smart locomotive, a driverless car, a smart refrigerator, another smart home device, or a combination of any of these devices.
  • the device embodiment may be implemented by software, or may be implemented by hardware or a combination of hardware and software.
• Taking software implementation as an example, the apparatus is formed by the processor of the electronic device in which it is located reading corresponding computer program instructions from a readable medium, such as a non-volatile memory, into memory and running them.
• At the hardware level, as shown in FIG. 5, which is a hardware structure diagram of an electronic device in which the speech recognition apparatus of the present invention is located: in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 5, the electronic device in which the apparatus is located may also include other hardware according to the actual function of the electronic device; details are not described herein again.
• The memory of the electronic device may store program instructions executable by the processor; the processor may be coupled to the memory for reading the program instructions stored in the memory and, in response, performing the following operations: acquiring a feature classification result of the speech signal to be recognized, the feature classification result containing pronunciations for describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation; filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result; and recognizing the speech signal based on the filtered feature classification result.
• For the operations performed by the processor, reference may be made to the related descriptions in the foregoing method embodiments; details are not described herein again.


Abstract

Disclosed are a speech recognition method and apparatus. The method comprises: acquiring a feature classification result of a speech signal to be recognized, wherein the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probability of mapping each speech signal frame to a corresponding pronunciation (S201); if the probability of mapping any speech signal frame to a corresponding pronunciation satisfies a predetermined filtering rule, determining the pronunciation as a candidate pronunciation (S202); if the probability of mapping any of a predetermined number of speech signal frames adjacent to the speech signal frame to the candidate pronunciation satisfies the predetermined filtering rule, deleting the candidate pronunciation from the pronunciations contained in the feature classification result (S203); if none of the probabilities of mapping the predetermined number of adjacent speech signal frames to the candidate pronunciation satisfies the predetermined filtering rule, retaining the candidate pronunciation in the pronunciations contained in the feature classification result (S204); and recognizing the speech signal based on the filtered feature classification result (S205). During speech signal recognition, recognition operations related to the filtered-out pronunciations need not be executed, thereby improving the efficiency of speech recognition.

Description

Speech Recognition Method and Apparatus

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus.

Background

With the development of computer technology, Automatic Speech Recognition (ASR) technology is applied more and more widely in fields such as human-computer interaction. At present, speech recognition technology mainly converts the speech signal to be recognized into text information through a signal processing module, a feature extraction module, an acoustic model, a language model (LM), a dictionary, and a decoder, thereby completing speech recognition.

In the speech recognition process, the signal processing module and the feature extraction module first divide the speech signal to be recognized into multiple speech signal frames, then enhance each speech signal frame by removing noise, channel distortion, and the like, transform each speech signal frame from the time domain to the frequency domain, and extract suitable acoustic features from the transformed speech signal frames. The acoustic model, trained on the characteristic parameters of a training speech library, takes the acoustic features extracted by the feature extraction module as input, maps them to pronunciations capable of describing the pronunciation features of the speech signal frames, and computes the probability that each speech signal frame is mapped to each pronunciation, yielding the feature classification result.

The language model contains the association relationships between different lexical items (such as characters, words, and phrases) and their probabilities (likelihoods), and is used to estimate the likelihood of various pieces of text information composed of different lexical items. The decoder can build a recognition network based on the trained acoustic model, the language model, and the dictionary; the paths in the recognition network correspond to various pieces of text information and the pronunciations of those pieces of text. For the pronunciations output by the acoustic model, the decoder then searches the recognition network for the best path, based on which the text information corresponding to the speech signal can be output with the maximum probability, completing speech recognition.

However, the language model is generally trained on a large corpus and contains the association relationships and likelihoods among a large number of lexical items. Therefore, the recognition network built on the language model contains many nodes, and each node has a very large number of branches. During path search in the recognition network, the number of nodes involved in the pronunciations of each speech signal frame grows exponentially, resulting in an extremely large path search space; the search process is time-consuming, which in turn reduces speech recognition efficiency.

Summary of the Invention

In view of this, the present invention provides a speech recognition method and apparatus to solve the problem that the speech recognition process is time-consuming and the recognition efficiency is low.

According to a first aspect of the present invention, a speech recognition method is provided, comprising the steps of:

acquiring a feature classification result of the speech signal to be recognized, the feature classification result containing pronunciations for describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation;

filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result;

recognizing the speech signal based on the filtered feature classification result.

In one embodiment, filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result includes:

determining whether the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies a predetermined filtering rule;

if the corresponding pronunciation satisfies the predetermined filtering rule, filtering out the corresponding pronunciation.

In one embodiment, if the difference between the probability that any speech signal frame is mapped to a corresponding pronunciation and the maximum mapping probability of that speech signal frame falls within a predetermined difference range, the corresponding pronunciation is determined to satisfy the predetermined filtering rule;

if the probability that any speech signal frame is mapped to a corresponding pronunciation is less than the probability that the speech signal frame is mapped to each of a predetermined number of pronunciations, the corresponding pronunciation is determined to satisfy the predetermined filtering rule.

In one embodiment, the predetermined number is either of the following:

the number of pronunciations, among the pronunciations corresponding to the speech signal frame, that are retained in the feature classification result;

the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the speech signal frame.
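The two filtering rules above can be sketched as follows. This is an illustration, not the patent's implementation: the function names and probability values are invented, and reading "within a predetermined difference range" as the shortfall from the frame's maximum mapping probability reaching a threshold is an assumption.

```python
def filter_by_max_diff(frame_probs, diff_threshold):
    # Rule 1 (one plausible reading): filter out a pronunciation when its
    # probability falls short of the frame's maximum mapping probability by
    # diff_threshold or more; pronunciations close to the best one survive.
    max_prob = max(frame_probs.values())
    return {p: v for p, v in frame_probs.items() if max_prob - v < diff_threshold}

def filter_by_top_n(frame_probs, n):
    # Rule 2: a pronunciation is filtered out when its probability is less
    # than that of each of the n most probable pronunciations, i.e. only the
    # top n mapping probabilities are retained for the frame.
    cutoff = sorted(frame_probs.values(), reverse=True)[:n][-1]
    return {p: v for p, v in frame_probs.items() if v >= cutoff}

def top_n_from_ratio(frame_probs, ratio):
    # The "predetermined number" given as a ratio threshold times the total
    # number of pronunciations for the frame.
    return max(1, int(ratio * len(frame_probs)))

frame = {"a": 0.55, "o": 0.25, "e": 0.12, "i": 0.05, "u": 0.03}
print(filter_by_max_diff(frame, 0.4))   # {'a': 0.55, 'o': 0.25}
print(filter_by_top_n(frame, top_n_from_ratio(frame, 0.6)))
```

Either rule removes only pronunciations whose mapping probability is dominated by the retained ones, which is what keeps the accuracy impact small while shrinking the decoder's search space.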

In one embodiment, filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result includes:

acquiring a histogram distribution of the probabilities that any speech signal frame is mapped to the individual pronunciations;

acquiring a beam width corresponding to the histogram distribution;

determining pronunciations whose probabilities are distributed outside the beam width as pronunciations satisfying the predetermined filtering rule;

deleting the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.
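A minimal sketch of this histogram-based pruning, under two stated assumptions not taken from the patent: the beam width is treated as a count of pronunciations to keep, and ten equal-width probability buckets are used.

```python
def filter_by_histogram_beam(frame_probs, beam_width, num_bins=10):
    # Bucket the frame's mapping probabilities into a histogram over [0, 1),
    # then walk the buckets from the high-probability end, keeping whole
    # buckets until at least beam_width pronunciations are covered; every
    # pronunciation left in the lower buckets lies outside the beam and is
    # deleted from the frame's feature classification result.
    bins = [[] for _ in range(num_bins)]
    for pron, prob in frame_probs.items():
        bins[min(int(prob * num_bins), num_bins - 1)].append(pron)
    kept = []
    for idx in range(num_bins - 1, -1, -1):   # highest-probability bins first
        if len(kept) >= beam_width:
            break
        kept.extend(bins[idx])
    return {p: frame_probs[p] for p in kept}

frame = {"a": 0.55, "o": 0.25, "e": 0.12, "i": 0.05, "u": 0.03}
print(filter_by_histogram_beam(frame, 2))  # {'a': 0.55, 'o': 0.25}
```

The histogram makes the cutoff cheap to find: no full sort of the 3,000 to 10,000 pronunciation probabilities is needed, only one pass to bucket them and a short walk over the buckets.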

In one embodiment, deleting the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result includes:

if the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies the predetermined filtering rule, determining the pronunciation as a candidate pronunciation;

如果该语音信号帧的预定帧数的相邻语音信号帧中的任一帧,映射到该候选发音的概率满足预定过滤规则,则将该候选发音从所述特征分类结果所含的发音中删除;If any one of the adjacent speech signal frames of the predetermined number of frames of the speech signal frame, the probability of mapping to the candidate pronunciation satisfies a predetermined filtering rule, the candidate pronunciation is deleted from the pronunciation included in the feature classification result ;

if none of the probabilities that the predetermined number of speech signal frames adjacent to the speech signal frame are mapped to the candidate pronunciation satisfies the predetermined filtering rule, retaining the candidate pronunciation in the pronunciations contained in the feature classification result.
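The three clauses above can be sketched as follows. This is a hypothetical illustration: `satisfies_rule` stands in for whichever predetermined filtering rule is in use, and treating the "predetermined number" of adjacent frames as a symmetric window on both sides is an assumption.

```python
def filter_with_adjacent_frames(frames_probs, satisfies_rule, num_adjacent=1):
    # frames_probs: list of {pronunciation: probability} dicts, one per frame.
    # A pronunciation that satisfies the rule in frame t is only a candidate;
    # it is deleted only if its mapping probability in some adjacent frame
    # also satisfies the rule, and retained otherwise.
    result = []
    for t, frame in enumerate(frames_probs):
        kept = {}
        for pron, prob in frame.items():
            if not satisfies_rule(frame, pron):
                kept[pron] = prob          # not even a candidate: keep it
                continue
            lo = max(0, t - num_adjacent)
            hi = min(len(frames_probs), t + num_adjacent + 1)
            delete = any(
                u != t and pron in frames_probs[u]
                and satisfies_rule(frames_probs[u], pron)
                for u in range(lo, hi)
            )
            if not delete:                 # isolated dip: retain the candidate
                kept[pron] = prob
        result.append(kept)
    return result

frames = [{"a": 0.9, "b": 0.05}, {"a": 0.85, "b": 0.08}, {"a": 0.5, "b": 0.4}]
low_prob = lambda frame, pron: frame[pron] < 0.1   # toy filtering rule
print(filter_with_adjacent_frames(frames, low_prob))
```

Requiring confirmation from neighbouring frames guards against deleting a pronunciation whose probability dips in a single noisy frame but is clearly present in the surrounding speech.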

According to a second aspect of the present invention, a speech recognition apparatus is provided, comprising:

a classification result acquisition module, configured to acquire a feature classification result of the speech signal to be recognized, the feature classification result containing pronunciations for describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation;

a pronunciation filtering module, configured to filter the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result;

a speech recognition module, configured to recognize the speech signal based on the filtered feature classification result.

In one embodiment, the pronunciation filtering module further includes:

a first filtering module, configured to filter out a corresponding pronunciation when the difference between the probability that any speech signal frame is mapped to that pronunciation and the maximum mapping probability of the speech signal frame falls within a predetermined difference range;

a second filtering module, configured to filter out a corresponding pronunciation when the probability that any speech signal frame is mapped to that pronunciation is less than the probability that the speech signal frame is mapped to each of a predetermined number of pronunciations.

In one embodiment, the pronunciation filtering module includes:

a probability distribution module, configured to acquire a histogram distribution of the probabilities that any speech signal frame is mapped to the individual pronunciations;

a beam width determining module, configured to acquire a beam width corresponding to the histogram distribution;

a pronunciation determining module, configured to determine pronunciations whose probabilities are distributed outside the beam width as pronunciations satisfying the predetermined filtering rule;

a pronunciation deletion module, configured to delete the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.

In one embodiment, the pronunciation filtering module includes:

a candidate pronunciation module, configured to determine a pronunciation as a candidate pronunciation when the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies a predetermined filtering rule;

a candidate pronunciation deletion module, configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when the probability that any one of a predetermined number of speech signal frames adjacent to the speech signal frame is mapped to the candidate pronunciation satisfies the predetermined filtering rule;

a candidate pronunciation retaining module, configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when none of the probabilities that the predetermined number of speech signal frames adjacent to the speech signal frame are mapped to the candidate pronunciation satisfies the predetermined filtering rule.

By implementing the embodiments provided by the present invention, when recognizing a speech signal, the feature classification result of the speech signal is first acquired, and the pronunciations contained in the feature classification result are then filtered based on the probabilities contained therein. In the process of recognizing the speech signal, recognition operations related to the filtered-out pronunciations no longer need to be executed; for example, there is no need to search the recognition network for paths related to the filtered-out pronunciations. This effectively reduces the time consumed by the speech recognition process and thus improves speech recognition efficiency.

Brief Description of the Drawings

FIG. 1 is a flowchart of a speech recognition method according to an exemplary embodiment of the present invention;

FIG. 2 is a flowchart of a speech recognition method according to another exemplary embodiment of the present invention;

FIG. 3 is a logic block diagram of a speech recognition apparatus according to an exemplary embodiment of the present invention;

FIG. 4 is a logic block diagram of a speech recognition apparatus according to another exemplary embodiment of the present invention;

FIG. 5 is a hardware structure diagram of a speech recognition apparatus according to an exemplary embodiment of the present invention.

Detailed Description

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.

The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a", "said", and "the" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in the present invention to describe various pieces of information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".

The speech recognition in the embodiments of the present invention involves an acoustic model and a language model in the recognition process. The acoustic model is a knowledge representation of acoustics, phonetics, environmental variables, and differences such as the gender and accent of the speaker. An LSTM (Long Short-Term Memory) network, a CTC (Connectionist Temporal Classification) model, or a hidden Markov model (HMM) may be used to train on the speech contained in a training speech library, obtaining a mapping from the acoustic features of speech to pronunciations, which constitutes the acoustic model. The pronunciation is related to the modeling unit: if the modeling unit is a syllable, the pronunciation is a syllable; if the modeling unit is a phoneme, the pronunciation is a phoneme; if the modeling unit is a state constituting a phoneme, the pronunciation is a state.

When training the acoustic model, considering that pronunciation varies with factors such as the lexical items spoken, speaking rate, intonation, stress, and dialect, the training speech library needs to cover a large amount of speech spanning all of these factors. In addition, considering the accuracy of speech recognition, smaller pronunciation units such as syllables, phonemes, or states may be selected as the modeling unit. Therefore, performing model training based on the large amount of speech contained in the training speech library and the predetermined modeling unit constructs a large number of acoustic models. In the speech recognition process, the speech signal to be recognized is feature-classified by this large number of acoustic models, and the obtained feature classification result will contain a large number of pronunciations (classes), for example, 3,000 to 10,000 pronunciations.

In addition, to recognize the text information corresponding to a speech signal, the current speech recognition technology needs to search the recognition network for all possible paths for every pronunciation, and this search process produces an exponential growth in the number of paths. If all possible paths involved in 3,000 to 10,000 pronunciations are searched in the recognition network, the storage resources and computation required for the search may exceed what the speech recognition system can bear. Therefore, the current speech recognition technology consumes a large amount of time and resources and suffers from low speech recognition efficiency. The present invention proposes a solution for improving speech recognition efficiency.

To solve the problem of low speech recognition efficiency, the solution of the present invention improves on the feature classification result obtained in the speech recognition process. Filtering rules are set in advance according to the device resources and recognition efficiency requirements involved in speech recognition. Then, when recognizing a speech signal, the feature classification result of the speech signal is first acquired, and the pronunciations contained in the feature classification result are filtered based on the probabilities contained therein. In the process of recognizing the speech signal, there is then no need to search the recognition network for paths related to the filtered-out pronunciations, which effectively reduces the time consumed by the search process and thus improves speech recognition efficiency. The speech recognition process of the present invention is described in detail below with reference to the accompanying drawings.

Please refer to FIG. 1, which is a flowchart of a speech recognition method according to an exemplary embodiment of the present invention. This embodiment can be applied to various electronic devices with speech processing capabilities and may include the following steps S101 to S103:

Step S101: acquire a feature classification result of the speech signal to be recognized; the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation.

Step S102: filter the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result.

Step S103: recognize the speech signal based on the filtered feature classification result.

In the embodiments of the present invention, the speech signal may be speech uttered by a user and collected in real time by a local speech collection device, or may be speech transmitted remotely from another speech collection device. When acquiring the feature classification result of the speech signal, the speech signal may be preprocessed in real time by a speech preprocessing module known in the art, and features may be extracted from the preprocessed speech signal by a feature extraction module. The extracted features may include PLP (Perceptual Linear Prediction) features, LPCC (Linear Predictive Cepstral Coefficients), FBANK (Mel-scale Filter Bank) features, MFCC (Mel-Frequency Cepstral Coefficients), and the like. The extracted features are then processed by the acoustic model to obtain the feature classification result; the probabilities contained in the feature classification result indicate the likelihood that a speech signal frame is mapped to the corresponding pronunciation. In other examples, a feature classification result transmitted by another terminal device may also be received directly.
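As a rough illustration of the framing step that precedes this feature extraction (the frame length and hop below correspond to typical 25 ms windows with a 10 ms hop at 16 kHz; these values are illustrative and not specified by the patent):

```python
def frame_signal(signal, frame_len=400, hop=160):
    # Split a speech signal (a list of samples) into overlapping frames;
    # each frame is one speech signal frame from which acoustic features
    # such as FBANK or MFCC coefficients would subsequently be extracted.
    num_frames = 1 + max(0, len(signal) - frame_len) // hop
    return [signal[i * hop : i * hop + frame_len] for i in range(num_frames)]

signal = [0.0] * 16000              # one second of silence at 16 kHz
frames = frame_signal(signal)
print(len(frames), len(frames[0]))  # 98 400
```

Each of these frames is what the acoustic model later maps to pronunciations with associated probabilities, producing the per-frame entries of the feature classification result.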

After the feature classification result is obtained, the solution of the present invention takes into account that some of the pronunciations contained in the feature classification result have low correlation with the speech signal frames of the speech signal to be recognized and therefore have little influence on speech recognition accuracy. To reduce the impact of the large number of pronunciations contained in the feature classification result on recognition efficiency, these pronunciations that have little influence on recognition accuracy can be filtered out of the feature classification result before speech recognition is performed based on it, thereby reducing the number of pronunciations contained in the feature classification result and improving speech recognition efficiency.

一般情况下,发音与待识别的语音信号帧的相关性越低,在通过声学模型对语音信号的声学特征进行分类时,语音信号帧映射到该发音的概率越低。因此,可以基于语音信号帧映射到各发音的概率,来过滤特征分类结果所含的发音,过滤后,任一语音信号帧映射到被过滤掉的发音的概率,小于该语音信号帧映射到其他发音的概率。In general, the lower the correlation between a pronunciation and the speech signal frame to be recognized, the lower the probability that the speech signal frame maps to that pronunciation when the acoustic features of the speech signal are classified by the acoustic model. Therefore, the pronunciations contained in the feature classification result can be filtered based on the probabilities that the speech signal frames map to the respective pronunciations; after filtering, the probability that any speech signal frame maps to a filtered-out pronunciation is smaller than the probability that the frame maps to the remaining pronunciations.

此外,在过滤相关性较低的发音时,考虑到不同应用场景对语音识别准确率的需求,还需要衡量所过滤掉的发音对语音识别准确率的影响,因此,可以根据语音识别准确率的需求,预先设定能限制过滤掉的发音对识别准确率的影响程度的各种过滤规则。针对各种预定过滤规则,在过滤特征分类结果所含的发音时,判断任一语音信号帧映射到对应的发音的概率是否满足预定过滤规则,如果所述对应的发音满足预定过滤规则,对所述对应的发音进行滤掉。过滤掉的发音一般指从特征分类结果中删除掉的发音。In addition, when filtering out pronunciations of low correlation, the requirements of different application scenarios on speech recognition accuracy should be considered, and the influence of the filtered-out pronunciations on recognition accuracy should be measured. Therefore, according to the required recognition accuracy, various filtering rules that limit the influence of the filtered-out pronunciations on recognition accuracy can be preset. For each predetermined filtering rule, when filtering the pronunciations contained in the feature classification result, it is judged whether the probability that any speech signal frame maps to a corresponding pronunciation satisfies the predetermined filtering rule; if the corresponding pronunciation satisfies the predetermined filtering rule, the corresponding pronunciation is filtered out. A filtered-out pronunciation generally refers to a pronunciation deleted from the feature classification result.

以下列举几种对所述特征分类结果所含的发音进行过滤的方式:Here are a few ways to filter the pronunciations contained in the feature classification results:

过滤方式一:按预定数目过滤掉低概率的发音,该预定数目可以指语音信号帧对应的发音中被保留在特征分类结果内的发音的数量;也可以指预定的比例阈值与语音信号帧对应的发音的总数目的乘积。在过滤时,如果任一语音信号帧映射到对应的发音的概率,小于该语音信号帧映射到预定数目的发音中各发音的概率,则确定所述对应的发音满足预定过滤规则。Filtering mode one: filter out low-probability pronunciations according to a predetermined number. The predetermined number may refer to the number of pronunciations, among those corresponding to a speech signal frame, that are retained in the feature classification result; it may also refer to the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the speech signal frame. During filtering, if the probability that any speech signal frame maps to a corresponding pronunciation is smaller than the probabilities that the frame maps to each of the predetermined number of pronunciations, the corresponding pronunciation is determined to satisfy the predetermined filtering rule.

其中,预定的比例阈值,可以由本发明的设计人员根据需要达到的语音识别准确率来设定,例如,设定为1/4,指被保留的发音与所有发音的数量比例。The predetermined ratio threshold may be set by the designer of the present invention according to the required speech recognition accuracy; for example, it may be set to 1/4, referring to the ratio of the number of retained pronunciations to the number of all pronunciations.

在一例子中,实际过滤时,可以按概率从小到大的顺序,从特征分类结果中删除发音,当保留的发音的数量与原来所有发音的数量的比例,满足预定的比例阈值,完成对特征分类结果的过滤。In one example, during actual filtering, pronunciations may be deleted from the feature classification result in ascending order of probability; when the ratio of the number of retained pronunciations to the original total number of pronunciations satisfies the predetermined ratio threshold, the filtering of the feature classification result is completed.

在其他例子中,预定的比例阈值可以指未被过滤掉的发音与被过滤掉的发音的数量比例。实际过滤时,可以按概率从大到小的顺序,在特征分类结果中挑选发音,当挑选出的发音的数量与剩余的发音的数量的比例,满足预定的比例阈值时,完成对特征分类结果的过滤。In other examples, the predetermined ratio threshold may refer to the ratio of the number of unfiltered pronunciations to the number of filtered-out pronunciations. During actual filtering, pronunciations may be selected from the feature classification result in descending order of probability; when the ratio of the number of selected pronunciations to the number of remaining pronunciations satisfies the predetermined ratio threshold, the filtering of the feature classification result is completed.

实际应用中,预定数目指该帧语音信号帧对应的发音中被保留在特征分类结果内的发音的数量时,可以由本发明的设计人员根据需要达到的语音识别准确率来设定预定数目,例如,设定为2000至9000中的任一数值。过滤时,可以按概率从小到大的顺序,将每一语音信号帧所映射到的发音进行排列,然后将排列在前预定位数的发音从特征分类结果中删除,完成对特征分类结果的过滤,所述预定位数与所述预定数目的数值相等。In practical applications, when the predetermined number refers to the number of pronunciations, among those corresponding to a speech signal frame, that are retained in the feature classification result, it may be set by the designer of the present invention according to the required speech recognition accuracy, for example, to any value from 2000 to 9000. During filtering, the pronunciations to which each speech signal frame maps may be arranged in ascending order of probability, and the pronunciations ranked in the first predetermined positions are then deleted from the feature classification result, completing the filtering of the feature classification result; the number of such positions is equal to the predetermined number.

在其他例子中,预定数目可以指未被过滤掉的发音的数量,例如,设定为1000。实际过滤时,可以按概率从大到小的顺序,将每一语音信号帧所映射到的发音进行排列,然后将排列在前预定位数的发音保留在特征分类结果中,将其他发音从特征分类结果中删除,完成对特征分类结果的过滤,所述预定位数与所述预定数目的数值相等。在其他实施例中,还可以采取其他技术手段按过滤方式一对特征分类结果进行过滤,本发明对此不做限制。In other examples, the predetermined number may refer to the number of pronunciations that are not filtered out, for example, set to 1000. During actual filtering, the pronunciations to which each speech signal frame maps may be arranged in descending order of probability; the pronunciations ranked in the first predetermined positions are retained in the feature classification result and the other pronunciations are deleted from it, completing the filtering; the number of such positions is equal to the predetermined number. In other embodiments, other technical means may also be adopted to filter the feature classification result according to filtering mode one, which is not limited by the present invention.
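作为示意,过滤方式一中“每帧只保留概率最大的前N个发音”的过程可以用如下Python草图表示(数据结构与函数名均为说明用的假设,并非本发明的限定实现):As an illustration, the "keep only the N most probable pronunciations per frame" procedure of filtering mode one can be sketched in Python as follows (the data structures and function name are hypothetical, for explanation only, not a limiting implementation of the invention):

```python
def filter_top_n(frame_probs, keep_n=1000):
    """过滤方式一的示意实现:每帧只保留概率最大的前 keep_n 个发音,
    其余发音被过滤掉。Sketch of filtering mode one: for each frame,
    keep only the keep_n pronunciations with the largest probabilities;
    all others are filtered out of the feature classification result."""
    filtered = []
    for probs in frame_probs:                        # probs: {发音id: 概率}
        order = sorted(probs, key=probs.get, reverse=True)  # 按概率从大到小
        filtered.append({p: probs[p] for p in order[:keep_n]})
    return filtered
```

例如 keep_n=2 时,每帧只剩概率最高的两个发音。For example, with keep_n=2 only the two most probable pronunciations of each frame remain.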

过滤方式二:按预定的差值阈值过滤掉低概率的发音,该差值阈值可以由本发明的设计人员根据需要达到的语音识别准确率来设定,例如,设定为-0.5,指被过滤掉的发音的概率与同一语音信号帧映射到的概率最大的发音之间的概率差。过滤时,如果任一语音信号帧映射到对应的发音的概率,与该语音信号帧的最大映射概率之间的概率差,在预定的差值范围内,则确定所述对应的发音满足预定过滤规则,可以对所述对应的发音进行过滤。Filtering mode two: filter out low-probability pronunciations according to a predetermined difference threshold. The difference threshold may be set by the designer of the present invention according to the required speech recognition accuracy, for example, to -0.5, and refers to the probability difference between a filtered-out pronunciation and the most probable pronunciation to which the same speech signal frame maps. During filtering, if the probability difference between the probability that any speech signal frame maps to a corresponding pronunciation and the maximum mapping probability of that frame falls within the predetermined difference range, the corresponding pronunciation is determined to satisfy the predetermined filtering rule and may be filtered out.

在一例子中,实际过滤时,可以按概率从大到小的顺序,将每一语音信号帧所映射到的发音进行排列,将该语音信号帧映射到排列在第一位的发音的概率,确定为最大概率,然后从排列在最后一位的发音开始,依次获得该语音信号帧映射到每个发音的概率与最大概率的差值,如果差值小于-0.5,则将该发音从特征分类结果中删除。在其他实施例中,还可以采取其他技术手段按过滤方式二对特征分类结果进行过滤,本发明对此不做限制。In one example, during actual filtering, the pronunciations to which each speech signal frame maps may be arranged in descending order of probability, and the probability that the frame maps to the first-ranked pronunciation is taken as the maximum probability. Then, starting from the last-ranked pronunciation, the difference between the probability that the frame maps to each pronunciation and the maximum probability is obtained in turn; if the difference is less than -0.5, that pronunciation is deleted from the feature classification result. In other embodiments, other technical means may also be adopted to filter the feature classification result according to filtering mode two, which is not limited by the present invention.
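作为示意,过滤方式二可以用如下Python草图表示,其中假设特征分类结果给出的是对数概率(声学得分),门限-0.5为文中的示例值:As an illustration, filtering mode two can be sketched in Python as follows, assuming the feature classification result provides log-probabilities (acoustic scores) and using the example threshold of -0.5 from the text:

```python
def filter_by_margin(frame_log_probs, margin=-0.5):
    """过滤方式二的示意实现:删除与该帧最大对数概率之差小于 margin
    的发音。Sketch of filtering mode two: drop any pronunciation whose
    log-probability falls below the frame maximum by more than the
    given margin (difference < margin means filtered out)."""
    filtered = []
    for probs in frame_log_probs:                    # probs: {发音id: 对数概率}
        best = max(probs.values())                   # 该帧的最大映射概率
        filtered.append({p: lp for p, lp in probs.items()
                         if lp - best >= margin})    # 差值在门限内则保留
    return filtered
```

该剪枝不依赖发音总数,因此概率分布越尖锐,被过滤掉的发音越多。This pruning does not depend on the total number of pronunciations, so the sharper the distribution, the more pronunciations are filtered out.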

过滤方式三:按概率的直方图分布过滤分布在束宽之外的发音,实际过滤时,可以先获取任一语音信号帧映射到各发音的概率的直方图分布;获取与所述直方图分布对应的束宽;然后将概率分布在所述束宽之外的发音,确定为满足所述预定过滤规则的发音;最终将满足所述预定过滤规则的发音,从所述特征分类结果所含的发音中删除。实际应用中,束宽可以由本发明的设计人员根据需要达到的语音识别准确率、以及直方图的分布状况来确定,如:预先设定需要过滤掉8000个低概率的发音,可以从直方图中低概率一侧开始查找8000个发音,将第8000个发音所在位置确定为束宽边界。在其他实施例中,还可以采取其他技术手段按过滤方式三对特征分类结果进行过滤,本发明对此不做限制。Filtering mode three: filter out, according to the histogram distribution of the probabilities, the pronunciations distributed outside the beam width. During actual filtering, the histogram distribution of the probabilities that any speech signal frame maps to the respective pronunciations may first be obtained, and the beam width corresponding to the histogram distribution is then obtained; pronunciations whose probabilities fall outside the beam width are determined as pronunciations satisfying the predetermined filtering rule, and finally the pronunciations satisfying the predetermined filtering rule are deleted from the pronunciations contained in the feature classification result. In practical applications, the beam width may be determined by the designer of the present invention according to the required speech recognition accuracy and the distribution of the histogram. For example, if it is preset that 8000 low-probability pronunciations need to be filtered out, 8000 pronunciations may be counted starting from the low-probability side of the histogram, and the position of the 8000th pronunciation is determined as the beam-width boundary. In other embodiments, other technical means may also be adopted to filter the feature classification result according to filtering mode three, which is not limited by the present invention.
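作为示意,过滤方式三的直方图统计与束宽确定可以用如下Python草图表示,此处从高概率一侧累计到保留数目来确定束宽边界,与文中从低概率一侧查找删除数目等价;分箱数等参数均为假设值:As an illustration, the histogram statistics and beam-width determination of filtering mode three can be sketched in Python as follows; here the boundary is found by accumulating from the high-probability side up to the number to be kept, which is equivalent to counting the number to be deleted from the low-probability side as in the text. Parameters such as the number of bins are assumed values:

```python
def filter_by_histogram(probs, keep_n=2, n_bins=10):
    """过滤方式三的示意实现:统计该帧概率的直方图分布,从高概率一侧
    逐箱累计发音个数,累计达到 keep_n 时该箱下沿即为束宽边界,边界
    之外(更低概率)的发音被过滤。Sketch of filtering mode three:
    build a histogram of the frame's probabilities, accumulate counts
    from the high-probability side until keep_n pronunciations are
    covered, and use that bin edge as the beam-width boundary."""
    lo, hi = min(probs.values()), max(probs.values())
    width = (hi - lo) / n_bins or 1.0                # 避免宽度为零
    counts = [0] * n_bins
    for lp in probs.values():
        b = min(int((lp - lo) / width), n_bins - 1)  # 所在箱编号
        counts[b] += 1
    total, boundary = 0, lo
    for b in range(n_bins - 1, -1, -1):              # 从高概率一侧累计
        total += counts[b]
        boundary = lo + b * width
        if total >= keep_n:
            break
    return {p: lp for p, lp in probs.items() if lp >= boundary}
```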

在按以上任一过滤方式,对所述特征分类结果所含的发音进行过滤后,可以直接调取预定的识别网络,搜索与过滤后的特征分类结果所含的发音相关的路径,寻找最佳的一条路径,基于该路径以最大概率输出待识别的语音信号对应的文本信息,完成语音识别,这里提到的识别网络,可以指解码器针对待识别的语音信号,根据已经训练好的声学模型、语言模型及字典建立的识别网络。After the pronunciations contained in the feature classification result are filtered in any of the above filtering modes, a predetermined recognition network may be invoked directly to search the paths related to the pronunciations contained in the filtered feature classification result, find the best path, and, based on that path, output with maximum probability the text information corresponding to the speech signal to be recognized, completing speech recognition. The recognition network mentioned here may refer to the recognition network built by the decoder, for the speech signal to be recognized, from the trained acoustic model, language model, and dictionary.

在寻找最佳的一条路径时,可以将特征分类结果所含的概率(声学得分)转换到和语言模型所含的字词(如:字、词、短语)之间关联概率(语言得分)相近的数值空间,并加权相加,作为路径搜索过程的综合分值,每一语音信号帧都会用一个预设的门限值来限制,与最佳路径的差值大于这个门限值,则该路径丢弃,否则保留;每一语音信号帧完成搜索后,会根据预设的最大路径数量,对所有路径进行排序,只保留此数量的最优路径,直至最后一帧完成,由此得出最后的路径图。When searching for the best path, the probabilities (acoustic scores) contained in the feature classification result can be converted into a numerical space close to that of the association probabilities (language scores) between the words (e.g., characters, words, phrases) contained in the language model, and the two are added with weights as the combined score of the path search. Each speech signal frame is limited by a preset threshold: a path whose score differs from that of the best path by more than this threshold is discarded, otherwise it is retained. After the search for each speech signal frame is completed, all paths are sorted and only a preset maximum number of optimal paths are retained, until the last frame is completed, thereby yielding the final path graph.
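作为示意,上述“综合分值、门限剪枝、限制路径数量”的路径搜索剪枝可以用如下Python草图表示(权重、门限与路径数均为假设值;真实解码器(如WFST解码器)的实现要复杂得多):As an illustration, the path-search pruning described above (combined score, threshold pruning, and a cap on the number of paths) can be sketched in Python as follows (the weight, threshold, and path count are assumed values; a real decoder, e.g. a WFST decoder, is far more complex):

```python
def combined_score(acoustic, language, lm_weight=0.8):
    """声学得分与语言得分加权相加作为综合分值(权重为假设值)。
    Weighted sum of acoustic and language scores as the combined score."""
    return acoustic + lm_weight * language

def prune_paths(paths, beam=10.0, max_paths=3):
    """每帧剪枝的示意实现:综合分值与最优路径之差大于 beam 的路径被
    丢弃,再按分值排序只保留 max_paths 条最优路径。Sketch of per-frame
    pruning: drop paths whose combined score falls more than `beam`
    below the best path, then keep at most `max_paths` of the rest."""
    best = max(score for _, score in paths)          # 最佳路径的分值
    kept = [(p, s) for p, s in paths if best - s <= beam]
    kept.sort(key=lambda ps: ps[1], reverse=True)    # 按分值从高到低排序
    return kept[:max_paths]
```

逐帧对扩展出的路径重复调用该剪枝,即得到文中所述的搜索流程。Calling this pruning on the expanded paths frame by frame yields the search flow described above.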

在某些例子中,输出特征分类结果的声学模型的建模单元较小,如以状态为建模单元,由于单个音素可以由三到五个状态组成,一个音素的发音所成的语音信号可以分割为多个语音信号帧,因此,易出现多个连续的语音信号帧的声学特征较类似的状况,那么特征分类结果中描述这些连续的语音信号帧中的各帧的发音,易出现类似状况。针对这种状况,如果本发明基于特征分类结果所含的概率和预定过滤规则,分别对每帧语音信号帧映射到的发音进行过滤,易将对识别准确率影响较大的发音过滤掉,为了避免误过滤这类发音,在过滤特征分类结果时,可以综合考虑连续的语音信号帧的过滤状况,具体实现过程可以参阅图2所示的方法,包括以下步骤S201-S205:In some examples, the modeling unit of the acoustic model that outputs the feature classification result is small, e.g., a state is used as the modeling unit. Since a single phoneme may consist of three to five states, the speech signal produced by the pronunciation of one phoneme may be divided into multiple speech signal frames; thus the acoustic features of several consecutive speech signal frames are often similar, and the pronunciations describing each of these consecutive frames in the feature classification result tend to be similar as well. In this situation, if the pronunciations to which each speech signal frame maps are filtered frame by frame based only on the probabilities contained in the feature classification result and the predetermined filtering rule, pronunciations that have a large influence on recognition accuracy are easily filtered out. To avoid mistakenly filtering such pronunciations, the filtering status of consecutive speech signal frames can be considered jointly when filtering the feature classification result. The specific implementation may refer to the method shown in FIG. 2, which includes the following steps S201-S205:

步骤S201、获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率。Step S201: Acquire a feature classification result of the speech signal to be identified; the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame and a probability that each speech signal frame is mapped to a corresponding pronunciation.

步骤S202、如果任一语音信号帧映射到对应的发音的概率,满足预定过滤规则,将该发音确定为候选发音。Step S202: If the probability that any speech signal frame maps to a corresponding pronunciation satisfies the predetermined filtering rule, determine that pronunciation as a candidate pronunciation.

步骤S203、如果该语音信号帧的预定帧数的相邻语音信号帧中的任一帧,映射到该候选发音的概率满足预定过滤规则,则将该候选发音从所述特征分类结果所含的发音中删除。Step S203: If, for any one of the adjacent speech signal frames within a predetermined number of frames of this speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule, delete the candidate pronunciation from the pronunciations contained in the feature classification result.

步骤S204、如果该语音信号帧的预定帧数的相邻语音信号帧,映射到该候选发音的概率均不满足预定过滤规则,则将该候选发音保留在所述特征分类结果所含的发音中。Step S204: If, for none of the adjacent speech signal frames within the predetermined number of frames of this speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule, retain the candidate pronunciation in the pronunciations contained in the feature classification result.

步骤S205、基于过滤后的特征分类结果识别所述语音信号。Step S205: Identify the voice signal based on the filtered feature classification result.

本发明实施例中,预定过滤规则可以是以上所述的过滤方式一至过滤方式三涉及的任一种规则,还可以是能限制过滤掉的发音对识别准确率的影响程度的其他过滤规则。In the embodiments of the present invention, the predetermined filtering rule may be any of the rules involved in filtering modes one to three described above, or another filtering rule that can limit the influence of the filtered-out pronunciations on recognition accuracy.

连续的语音信号帧的预定帧数可以由本发明的设计人员根据需要达到的语音识别准确率来设定,例如,设定为6,即相邻的前三帧以及相邻的后三帧。The predetermined number of frames of consecutive speech signal frames may be set by the designer of the present invention according to the required speech recognition accuracy, for example, set to 6, namely the three preceding adjacent frames and the three following adjacent frames.
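作为示意,图2所示“结合相邻帧确认候选发音”的过滤方法可以用如下Python草图表示(过滤规则以可调用对象表示,数据结构均为说明用的假设):As an illustration, the filtering method of FIG. 2, which confirms candidate pronunciations using neighbouring frames, can be sketched in Python as follows (the filtering rule is represented as a callable; the data structures are hypothetical, for explanation only):

```python
def filter_with_neighbors(frame_probs, rule, window=3):
    """图2所示方法的示意实现:某帧中满足过滤规则的发音先作为候选发音
    (S202);仅当其前后 window 帧内的某相邻帧对该发音也满足过滤规则时
    才真正删除(S203),否则保留(S204)。Sketch of the method of FIG. 2:
    a pronunciation satisfying the rule in a frame is only a candidate;
    it is deleted only if some neighbouring frame within `window` frames
    also satisfies the rule for it, and kept otherwise."""
    n = len(frame_probs)                             # frame_probs: [{发音id: 概率}]
    out = []
    for t, probs in enumerate(frame_probs):
        kept = {}
        for p, prob in probs.items():
            if rule(prob):                           # S202: 候选发音 / candidate
                neigh = [u for u in range(max(0, t - window),
                                          min(n, t + window + 1)) if u != t]
                if any(p in frame_probs[u] and rule(frame_probs[u][p])
                       for u in neigh):
                    continue                         # S203: 删除 / delete
            kept[p] = prob                           # S204: 保留 / keep
        out.append(kept)
    return out
```

这样,只在相邻帧也认为某发音是低概率时才删除它,避免因单帧波动误删对识别准确率影响较大的发音。In this way a pronunciation is deleted only when neighbouring frames also consider it low-probability, avoiding mistaken deletion caused by single-frame fluctuation.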

由上述实施例可知:本发明的语音识别方法在识别语音信号时,先获取该语音信号的特征分类结果,然后基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤,那么在识别语音信号的过程中,无需再执行与被过滤掉的发音相关的识别操作,如无需再在识别网络中搜索与被过滤掉的发音相关的路径,因此能有效降低语音识别过程耗费的时间,进而能提高语音识别效率。It can be seen from the above embodiments that, when recognizing a speech signal, the speech recognition method of the present invention first acquires the feature classification result of the speech signal and then filters the pronunciations contained in the feature classification result based on the probabilities it contains. In the process of recognizing the speech signal, it is then no longer necessary to perform recognition operations related to the filtered-out pronunciations, e.g., there is no need to search the recognition network for paths related to the filtered-out pronunciations, which effectively reduces the time consumed by the speech recognition process and thus improves speech recognition efficiency.

进而,本发明实施例的语音识别方法可以应用于各种电子设备的人机交互软件内,例如:智能手机内的语音拨号、语音操控、语音查找。应用于智能手机内的语音查找时,如果用户在距离智能手机的预定范围内发出一段语音,那么应用于语音查找上的语音识别方法,可以在接收到语音采集设备采集的用户语音后,先获取该语音的特征分类结果,然后基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤,然后在识别网络中只搜索未被过滤掉的发音相关的路径,通过路径搜索快速识别出用户语音对应的文本信息,进而使语音助手基于该识别结果快速响应用户。Furthermore, the speech recognition method of the embodiments of the present invention can be applied in the human-computer interaction software of various electronic devices, for example, voice dialing, voice control, and voice search in a smartphone. When applied to voice search in a smartphone, if a user utters speech within a predetermined range of the smartphone, the speech recognition method applied to voice search may, after receiving the user speech collected by the speech collection device, first acquire the feature classification result of the speech, then filter the pronunciations contained in the feature classification result based on the probabilities it contains, and then search in the recognition network only the paths related to the pronunciations that were not filtered out, quickly identifying through the path search the text information corresponding to the user speech, so that the voice assistant can respond to the user quickly based on the recognition result.

与前述方法的实施例相对应,本发明还提供了装置的实施例。Corresponding to the embodiment of the aforementioned method, the invention also provides an embodiment of the device.

参见图3,图3是本发明一示例性实施例示出的语音识别装置的逻辑框图,该装置可以包括:分类结果获取模块310、发音过滤模块320和语音识别模块330。Referring to FIG. 3, FIG. 3 is a logic block diagram of a voice recognition apparatus according to an exemplary embodiment of the present invention. The apparatus may include: a classification result obtaining module 310, a pronunciation filtering module 320, and a voice recognition module 330.

其中,分类结果获取模块310,用于获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率。The classification result acquisition module 310 is configured to acquire a feature classification result of the speech signal to be recognized; the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probabilities that each speech signal frame maps to the corresponding pronunciations.

发音过滤模块320,用于基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤。The pronunciation filtering module 320 is configured to filter the pronunciations included in the feature classification result based on the probability included in the feature classification result.

语音识别模块330,用于基于过滤后的特征分类结果识别所述语音信号。The voice recognition module 330 is configured to identify the voice signal based on the filtered feature classification result.

一些例子中,发音过滤模块320可以包括:In some examples, the pronunciation filtering module 320 can include:

第一过滤模块,用于在任一语音信号帧映射到对应的发音的概率,与该语音信号帧的最大映射概率之间的概率差,在预定的差值范围内时,对所述对应的发音进行过滤。A first filtering module is configured to filter a corresponding pronunciation when the probability difference between the probability that any speech signal frame maps to the corresponding pronunciation and the maximum mapping probability of that speech signal frame is within a predetermined difference range.

第二过滤模块,用于在任一语音信号帧映射到对应的发音的概率,小于该语音信号帧映射到预定数目的发音中各发音的概率时,对所述对应的发音进行过滤。A second filtering module is configured to filter a corresponding pronunciation when the probability that any speech signal frame maps to the corresponding pronunciation is smaller than the probabilities that the speech signal frame maps to each of a predetermined number of pronunciations.

另一些例子中,发音过滤模块320还可以包括:In other examples, the pronunciation filtering module 320 may further include:

概率分布模块,用于获取任一语音信号帧映射到各发音的概率的直方图分布。A probability distribution module is configured to obtain a histogram distribution of the probability that any speech signal frame is mapped to each pronunciation.

束宽确定模块,用于获取与所述直方图分布对应的束宽。a beam width determining module for acquiring a beam width corresponding to the histogram distribution.

发音确定模块,用于将概率分布在所述束宽之外的发音,确定为满足所述预定过滤规则的发音。The pronunciation determining module is configured to determine a pronunciation whose probability is distributed outside the beam width as a pronunciation satisfying the predetermined filtering rule.

发音删除模块,用于将满足所述预定过滤规则的发音从所述特征分类结果所含的发音中删除。The pronunciation deletion module is configured to delete the pronunciation that satisfies the predetermined filtering rule from the pronunciation included in the feature classification result.

参见图4,图4是本发明另一示例性实施例示出的语音识别装置的逻辑框图,该装置可以包括:分类结果获取模块410、发音过滤模块420和语音识别模块430。发音过滤模块420可以包括候选发音确定模块421、候选发音删除模块422和候选发音保留模块423。Referring to FIG. 4, FIG. 4 is a logic block diagram of a voice recognition apparatus according to another exemplary embodiment of the present invention. The apparatus may include a classification result acquisition module 410, a pronunciation filtering module 420, and a voice recognition module 430. The pronunciation filtering module 420 can include a candidate pronunciation determination module 421, a candidate pronunciation deletion module 422, and a candidate pronunciation retention module 423.

其中,分类结果获取模块410,用于获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率。The classification result acquisition module 410 is configured to acquire a feature classification result of the speech signal to be recognized; the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probabilities that each speech signal frame maps to the corresponding pronunciations.

候选发音确定模块421,用于在任一语音信号帧映射到对应的发音的概率满足预定过滤规则时,将该发音确定为候选发音。The candidate pronunciation determining module 421 is configured to determine the pronunciation as a candidate pronunciation when the probability that any of the speech signal frames are mapped to the corresponding pronunciation meets a predetermined filtering rule.

候选发音删除模块422,用于在该语音信号帧的预定帧数的相邻语音信号帧中的任一帧,映射到该候选发音的概率满足预定过滤规则时,将该候选发音从所述特征分类结果所含的发音中删除。The candidate pronunciation deletion module 422 is configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when, for any one of the adjacent speech signal frames within a predetermined number of frames of the speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule.

候选发音保留模块423,用于在该语音信号帧的预定帧数的相邻语音信号帧,映射到该候选发音的概率均不满足预定过滤规则时,将该候选发音保留在所述特征分类结果所含的发音中。The candidate pronunciation retention module 423 is configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when, for none of the adjacent speech signal frames within the predetermined number of frames of the speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule.

语音识别模块430,用于基于过滤后的特征分类结果识别所述语音信号。The voice recognition module 430 is configured to identify the voice signal based on the filtered feature classification result.

上述装置中各个单元(或模块)的功能和作用的实现过程具体详见上述方法中对应步骤的实现过程,在此不再赘述。The implementation process of the functions and functions of each unit (or module) in the above device is specifically described in the implementation process of the corresponding steps in the foregoing method, and details are not described herein again.

对于装置实施例而言,由于其基本对应于方法实施例,所以相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元或模块可以是或者也可以不是物理上分开的,作为单元或模块显示的部件可以是或者也可以不是物理单元或模块,即可以位于一个地方,或者也可以分布到多个网络单元或模块上。可以根据实际的需要选择其中的部分或者全部模块来实现本发明方案的目的。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment. The device embodiments described above are merely illustrative, wherein the units or modules described as separate components may or may not be physically separate, and the components displayed as units or modules may or may not be physical units. Or modules, which can be located in one place, or distributed to multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of the present invention. Those of ordinary skill in the art can understand and implement without any creative effort.

本发明语音识别装置的实施例可以应用在电子设备上。具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。一种典型的实现中,电子设备为计算机,计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备、互联网电视、智能机车、无人驾驶汽车、智能冰箱、其他智能家居设备或者这些设备中的任意几种设备的组合。Embodiments of the speech recognition apparatus of the present invention can be applied to an electronic device. This can be implemented by a computer chip or an entity, or by a product having a certain function. In a typical implementation, the electronic device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver device. , game consoles, tablets, wearables, Internet TVs, smart locomotives, driverless cars, smart refrigerators, other smart home devices, or a combination of any of these devices.

装置实施例可以通过软件实现,也可以通过硬件或者软硬件结合的方式实现。以软件实现为例,作为一个逻辑意义上的装置,是通过其所在电子设备的处理器将非易失性存储器等可读介质中对应的计算机程序指令读取到内存中运行形成的。从硬件层面而言,如图5所示,为本发明语音识别装置所在电子设备的一种硬件结构图,除了图5所示的处理器、内存、网络接口、以及非易失性存储器之外,实施例中装置所在的电子设备通常根据该电子设备的实际功能,还可以包括其他硬件,对此不再赘述。电子设备的存储器可以存储处理器可执行的程序指令;处理器可以耦合存储器,用于读取所述存储器存储的程序指令,并作为响应,执行如下操作:获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率;基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤;基于过滤后的特征分类结果识别所述语音信号。The apparatus embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, as an apparatus in the logical sense, it is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from a readable medium such as a non-volatile memory into memory and running them. At the hardware level, FIG. 5 shows a hardware structure diagram of the electronic device in which the speech recognition apparatus of the present invention is located; besides the processor, memory, network interface, and non-volatile memory shown in FIG. 5, the electronic device in which the apparatus of the embodiments is located may also include other hardware according to its actual functions, which is not described again here. The memory of the electronic device may store program instructions executable by the processor; the processor may be coupled to the memory and configured to read the program instructions stored in the memory and, in response, perform the following operations: acquiring a feature classification result of the speech signal to be recognized, where the feature classification result contains pronunciations for describing the pronunciation features of each speech signal frame and the probabilities that each speech signal frame maps to the corresponding pronunciations; filtering the pronunciations contained in the feature classification result based on the probabilities it contains; and recognizing the speech signal based on the filtered feature classification result.

在其他实施例中,处理器所执行的操作可以参考上文方法实施例中相关的描述,在此不予赘述。In other embodiments, the operations performed by the processor may be referred to the related description in the foregoing method embodiments, and details are not described herein.

以上所述仅为本发明的较佳实施例而已,并不用以限制本发明,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明保护的范围之内。 The above are only the preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalents, improvements, etc., which are made within the spirit and principles of the present invention, should be included in the present invention. Within the scope of protection.

Claims (10)

一种语音识别方法,其特征在于,包括步骤:A speech recognition method, comprising the steps of: 获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率;Acquiring a feature classification result of the speech signal to be identified; the feature classification result includes a pronunciation for describing a pronunciation feature of each speech signal frame and a probability that each speech signal frame is mapped to a corresponding pronunciation; 基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤;Filtering the pronunciations included in the feature classification result based on the probability contained in the feature classification result; 基于过滤后的特征分类结果识别所述语音信号。The speech signal is identified based on the filtered feature classification result. 根据权利要求1所述的方法,其特征在于,所述基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤,包括:The method according to claim 1, wherein the filtering of the pronunciations included in the feature classification result based on the probability contained in the feature classification result comprises: 判断任一语音信号帧映射到对应的发音的概率是否满足预定过滤规则;Determining whether the probability that any speech signal frame is mapped to the corresponding pronunciation satisfies a predetermined filtering rule; 如果所述对应的发音满足预定过滤规则,对所述对应的发音进行滤掉。If the corresponding pronunciation satisfies a predetermined filtering rule, the corresponding pronunciation is filtered out. 
根据权利要求2所述的方法,其特征在于:The method of claim 2 wherein: 如果任一语音信号帧映射到对应的发音的概率,与该语音信号帧的最大映射概率之间的概率差,在预定的差值范围内,则确定所述对应的发音满足预定过滤规则;If the probability difference between the probability of any speech signal frame being mapped to the corresponding pronunciation and the maximum mapping probability of the speech signal frame is within a predetermined difference range, determining that the corresponding pronunciation satisfies a predetermined filtering rule; 如果任一语音信号帧映射到对应的发音的概率,小于该语音信号帧映射到预定数目的发音中各发音的概率,则确定所述对应的发音满足预定过滤规则。If the probability that any of the speech signal frames are mapped to the corresponding utterance is less than the probability that the speech signal frame is mapped to a predetermined number of utterances in the utterance, then the corresponding utterance is determined to satisfy the predetermined filtering rule. 根据权利要求3所述的方法,其特征在于,所述预定数目为以下任一:The method of claim 3 wherein said predetermined number is any of the following: 该帧语音信号帧对应的发音中被保留在特征分类结果内的发音的数量;The number of pronunciations retained in the feature classification result in the pronunciation corresponding to the frame speech signal frame; 预定的比例阈值与该帧语音信号帧对应的发音的总数目的乘积。The product of the predetermined ratio threshold and the total number of pronunciations corresponding to the frame of the speech signal frame. 
根据权利要求1所述的方法,其特征在于,所述基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤,包括:The method according to claim 1, wherein the filtering, based on the probabilities contained in the feature classification result, of the pronunciations contained in the feature classification result comprises: 获取任一语音信号帧映射到各发音的概率的直方图分布;obtaining a histogram distribution of the probabilities that any speech signal frame maps to the respective pronunciations; 获取与所述直方图分布对应的束宽;obtaining a beam width corresponding to the histogram distribution; 将概率分布在所述束宽之外的发音,确定为满足预定过滤规则的发音;determining pronunciations whose probabilities are distributed outside the beam width as pronunciations satisfying a predetermined filtering rule; 将满足预定过滤规则的发音,从所述特征分类结果所含的发音中删除。deleting the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result. 根据权利要求1至5中任一项所述的方法,其特征在于,所述基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤,包括:The method according to any one of claims 1 to 5, wherein the filtering, based on the probabilities contained in the feature classification result, of the pronunciations contained in the feature classification result comprises: 如果任一语音信号帧映射到对应的发音的概率满足预定过滤规则,将该发音确定为候选发音;if the probability that any speech signal frame maps to a corresponding pronunciation satisfies a predetermined filtering rule, determining the pronunciation as a candidate pronunciation; 如果该语音信号帧的预定帧数的相邻语音信号帧中的任一帧,映射到该候选发音的概率满足预定过滤规则,则将该候选发音从所述特征分类结果所含的发音中删除;if, for any one of the adjacent speech signal frames within a predetermined number of frames of the speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule, deleting the candidate pronunciation from the pronunciations contained in the feature classification result; 如果该语音信号帧的预定帧数的相邻语音信号帧,映射到该候选发音的概率均不满足预定过滤规则,则将该候选发音保留在所述特征分类结果所含的发音中。if, for none of the adjacent speech signal frames within the predetermined number of frames of the speech signal frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule, retaining the candidate pronunciation in the pronunciations contained in the feature classification result. 一种语音识别装置,其特征在于,包括:A speech recognition apparatus, comprising: 分类结果获取模块,用于获取待识别的语音信号的特征分类结果;所述特征分类结果包含用于描述各语音信号帧的发音特征的发音以及各语音信号帧映射到对应的发音的概率;a classification result acquisition module, configured to acquire a feature classification result of the speech signal to be recognized, the feature classification result containing pronunciations for describing the pronunciation features of each speech signal frame and the probabilities that each speech signal frame maps to the corresponding pronunciations; 发音过滤模块,用于基于所述特征分类结果所含的概率,对所述特征分类结果所含的发音进行过滤;a pronunciation filtering module, configured to filter the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result; 语音识别模块,用于基于过滤后的特征分类结果识别所述语音信号。a speech recognition module, configured to recognize the speech signal based on the filtered feature classification result. 根据权利要求7所述的装置,其特征在于,所述发音过滤模块还包括:The apparatus according to claim 7, wherein the pronunciation filtering module further comprises: 第一过滤模块,用于在任一语音信号帧映射到对应的发音的概率,与该语音信号帧的最大映射概率之间的概率差,在预定的差值范围内时,对所述对应的发音进行过滤;a first filtering module, configured to filter a corresponding pronunciation when the probability difference between the probability that any speech signal frame maps to the corresponding pronunciation and the maximum mapping probability of the speech signal frame is within a predetermined difference range; 第二过滤模块,用于在任一语音信号帧映射到对应的发音的概率,小于该语音信号帧映射到预定数目的发音中各发音的概率时,对所述对应的发音进行过滤。a second filtering module, configured to filter a corresponding pronunciation when the probability that any speech signal frame maps to the corresponding pronunciation is smaller than the probabilities that the speech signal frame maps to each of a predetermined number of pronunciations.
9. The apparatus according to claim 7, wherein the pronunciation filtering module comprises:

a probability distribution module, configured to obtain a histogram distribution of the probabilities that any speech signal frame maps to each pronunciation;

a beam width determining module, configured to obtain a beam width corresponding to the histogram distribution;

a pronunciation determining module, configured to determine the pronunciations whose probabilities fall outside the beam width as pronunciations satisfying the predetermined filtering rule;

a pronunciation deleting module, configured to delete the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.

10. The apparatus according to any one of claims 7 to 9, wherein the pronunciation filtering module comprises:

a candidate pronunciation determining module, configured to determine a pronunciation as a candidate pronunciation when the probability that any speech signal frame maps to that pronunciation satisfies the predetermined filtering rule;

a candidate pronunciation deleting module, configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when the probability that any one of a predetermined number of speech signal frames adjacent to that speech signal frame maps to the candidate pronunciation satisfies the predetermined filtering rule;

a candidate pronunciation retaining module, configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when none of the probabilities that the predetermined number of adjacent speech signal frames map to the candidate pronunciation satisfies the predetermined filtering rule.
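The two per-frame filtering rules described in the method claims above (a probability difference to the frame's maximum mapping probability falling within a predetermined range, and a probability below each of a predetermined number of top probabilities) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name, the `diff_range` and `top_n` parameters, and the dict representation of a frame's mapping probabilities are all assumptions.

```python
def satisfies_filter_rule(frame_probs, pron, diff_range=None, top_n=None):
    """Return True if `pron` satisfies the predetermined filtering rule
    for one frame, i.e. it is a candidate for removal.

    frame_probs: hypothetical dict of pronunciation -> mapping probability
                 for a single speech signal frame.
    diff_range:  (low, high) range for the difference between the frame's
                 maximum mapping probability and this pronunciation's.
    top_n:       number of top probabilities to compare against.
    """
    prob = frame_probs[pron]
    max_prob = max(frame_probs.values())

    # First rule: the difference to the frame's maximum mapping
    # probability lies within the predetermined difference range.
    if diff_range is not None:
        low, high = diff_range
        if low <= max_prob - prob <= high:
            return True

    # Second rule: the probability is smaller than each of the frame's
    # top-n mapping probabilities (i.e. it falls outside the top n).
    if top_n is not None:
        top = sorted(frame_probs.values(), reverse=True)[:top_n]
        if all(prob < p for p in top):
            return True

    return False
```

With `top_n=2`, for instance, only a pronunciation ranked below both of the frame's two strongest mappings is flagged by the second rule.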
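The histogram/beam-width variant in the method claims can be read as classic histogram pruning, where a probability cutoff is derived from a histogram of the frame's mapping probabilities so that roughly a beam's worth of pronunciations survives. That reading, and every name below (`histogram_prune`, `beam_count`, `num_bins`), is an assumption for illustration, not the patent's definition of beam width.

```python
def histogram_prune(frame_probs, beam_count, num_bins=10):
    """Keep roughly the `beam_count` highest-probability pronunciations of
    one frame by deriving a probability cutoff from a histogram of the
    frame's mapping probabilities; pronunciations whose probabilities fall
    outside the resulting beam are filtered out."""
    probs = list(frame_probs.values())
    lo, hi = min(probs), max(probs)
    if hi == lo:  # all probabilities equal: nothing can be distinguished
        return dict(frame_probs)

    # Bucket the probabilities into equal-width bins.
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for p in probs:
        b = min(int((p - lo) / width), num_bins - 1)
        counts[b] += 1

    # Walk bins from the highest-probability end until the beam is full;
    # the lower edge of the last bin visited becomes the cutoff.
    kept, cutoff = 0, lo
    for b in range(num_bins - 1, -1, -1):
        kept += counts[b]
        cutoff = lo + b * width
        if kept >= beam_count:
            break

    return {pron: p for pron, p in frame_probs.items() if p >= cutoff}
```

Deriving the cutoff from bin counts instead of sorting all probabilities is the usual motivation for histogram pruning in decoders: it approximates n-best selection in linear time.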
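The adjacent-frame confirmation step in the last method claim (a flagged pronunciation is deleted only if a neighbouring frame also flags it, and retained otherwise) can be sketched as below. The representation of the rule hits as per-frame sets, and the names `rule_hits` and `context`, are invented for this sketch.

```python
def filter_with_context(rule_hits, context=1):
    """rule_hits[t] is the set of pronunciations whose mapping probability
    at frame t satisfies the predetermined filtering rule (candidates for
    deletion). Returns, per frame, the set of candidates actually deleted:
    a candidate is confirmed for deletion only when at least one frame
    within `context` neighbours on either side also flags it; otherwise
    it is retained in the feature classification result."""
    deleted = []
    for t, candidates in enumerate(rule_hits):
        confirmed = set()
        for pron in candidates:
            # Adjacent frames within the predetermined context window.
            neighbours = [u for u in range(t - context, t + context + 1)
                          if u != t and 0 <= u < len(rule_hits)]
            if any(pron in rule_hits[u] for u in neighbours):
                confirmed.add(pron)
        deleted.append(confirmed)
    return deleted
```

The effect is to keep pronunciations that are flagged only transiently on a single frame, pruning only those that neighbouring frames agree on.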
PCT/CN2017/104382 2017-04-18 2017-09-29 Speech recognition method and apparatus Ceased WO2018192186A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710254628.8 2017-04-18
CN201710254628.8A CN106875936B (en) 2017-04-18 2017-04-18 Voice recognition method and device

Publications (1)

Publication Number Publication Date
WO2018192186A1 true WO2018192186A1 (en) 2018-10-25

Family

ID=59162735

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/104382 Ceased WO2018192186A1 (en) 2017-04-18 2017-09-29 Speech recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN106875936B (en)
WO (1) WO2018192186A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875936B (en) * 2017-04-18 2021-06-22 广州视源电子科技股份有限公司 Voice recognition method and device
CN110310623B (en) * 2017-09-20 2021-12-28 Oppo广东移动通信有限公司 Sample generation method, model training method, device, medium, and electronic apparatus
CN108694951B (en) * 2018-05-22 2020-05-22 华南理工大学 A Speaker Recognition Method Based on Multi-Stream Hierarchical Fusion Transform Features and Long Short-Term Memory Networks
CN108899013B (en) * 2018-06-27 2023-04-18 广州视源电子科技股份有限公司 Voice search method and device and voice recognition system
CN108877782B (en) * 2018-07-04 2020-09-11 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN108932943A (en) * 2018-07-12 2018-12-04 广州视源电子科技股份有限公司 Command word sound detection method, device, equipment and storage medium
CN109192211A (en) * 2018-10-29 2019-01-11 珠海格力电器股份有限公司 Method, device and equipment for recognizing voice signal
CN109872715A (en) * 2019-03-01 2019-06-11 深圳市伟文无线通讯技术有限公司 A kind of voice interactive method and device
CN115798277A (en) * 2021-09-10 2023-03-14 广州视源电子科技股份有限公司 Online classroom interaction method and online classroom system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030369A (en) * 2007-03-30 2007-09-05 清华大学 Built-in speech discriminating method based on sub-word hidden Markov model
CN101101751A (en) * 2006-07-04 2008-01-09 株式会社东芝 Speech recognition device and method
CN102779510A (en) * 2012-07-19 2012-11-14 东南大学 Speech Emotion Recognition Method Based on Feature Space Adaptive Projection
US9396722B2 (en) * 2013-06-20 2016-07-19 Electronics And Telecommunications Research Institute Method and apparatus for detecting speech endpoint using weighted finite state transducer
CN106875936A (en) * 2017-04-18 2017-06-20 广州视源电子科技股份有限公司 Voice recognition method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IL84902A (en) * 1987-12-21 1991-12-15 D S P Group Israel Ltd Digital autocorrelation system for detecting speech in noisy audio signal
US6714909B1 (en) * 1998-08-13 2004-03-30 At&T Corp. System and method for automated multimedia content indexing and retrieval
CN101894549A (en) * 2010-06-24 2010-11-24 中国科学院声学研究所 Method for fast calculating confidence level in speech recognition application field
CN101944359B (en) * 2010-07-23 2012-04-25 杭州网豆数字技术有限公司 Voice recognition method for specific crowd
CN102426836B (en) * 2011-08-25 2013-03-20 哈尔滨工业大学 Rapid keyword detection method based on quantile self-adaption cutting
CN102436816A (en) * 2011-09-20 2012-05-02 安徽科大讯飞信息科技股份有限公司 Voice data decoding method and device
CN103730115B (en) * 2013-12-27 2016-09-07 北京捷成世纪科技股份有限公司 A kind of method and apparatus detecting keyword in voice
CN105243143B (en) * 2015-10-14 2018-07-24 湖南大学 Recommendation method and system based on real-time phonetic content detection
CN105845128B (en) * 2016-04-06 2020-01-03 中国科学技术大学 Voice recognition efficiency optimization method based on dynamic pruning beam width prediction

Also Published As

Publication number Publication date
CN106875936A (en) 2017-06-20
CN106875936B (en) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2018192186A1 (en) Speech recognition method and apparatus
US10249294B2 (en) Speech recognition system and method
US8972260B2 (en) Speech recognition using multiple language models
US20140207457A1 (en) False alarm reduction in speech recognition systems using contextual information
JP2021105708A (en) Neural speech-to-meaning
KR102094935B1 (en) System and method for recognizing speech
CN115132170B (en) Language classification method, device and computer readable storage medium
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
US20200219487A1 (en) Information processing apparatus and information processing method
JP6552999B2 (en) Text correction device, text correction method, and program
JP6622681B2 (en) Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program
CN112687291A (en) Pronunciation defect recognition model training method and pronunciation defect recognition method
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN114387950A (en) Speech recognition method, apparatus, device and storage medium
CN114627896A (en) Voice evaluation method, device, equipment and storage medium
Mary et al. Searching speech databases: features, techniques and evaluation measures
KR20210081166A (en) Spoken language identification apparatus and method in multilingual environment
KR20210054001A (en) Method and apparatus for providing voice recognition service
TW202129628A (en) Speech recognition system with fine-grained decoding
CN112687296B (en) Audio disfluency identification method, device, equipment and readable storage medium
Liu et al. Deriving disyllabic word variants from a Chinese conversational speech corpus
KR102113879B1 (en) The method and apparatus for recognizing speaker's voice by using reference database
CN115424616B (en) Audio data screening method, device, equipment and computer-readable medium
KR20210094267A (en) Apparatus and method for assessmenting languige proficiency
CN114582373B (en) Method and device for identifying emotion of user in man-machine conversation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17905935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 28.02.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17905935

Country of ref document: EP

Kind code of ref document: A1