CN109243430A - Speech recognition method and device - Google Patents
Speech recognition method and device
- Publication number
- CN109243430A (application number CN201710537548.3A)
- Authority
- CN
- China
- Prior art keywords
- user
- speech recognition
- language model
- recognition result
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
An embodiment of the present invention provides a speech recognition method and device. The method comprises: receiving a user's voice input and recognizing it to obtain candidate speech recognition results; ranking the candidate speech recognition results using the user's language identification model, wherein the personal language model corresponding to the user is a language model built from the user's historical text input data; and obtaining the final speech recognition result from the ranked candidate results. The embodiments of the present invention can effectively improve the accuracy of speech recognition results.
Description
Technical field
The embodiments of the present invention relate to the technical field of speech recognition, and in particular to a speech recognition method and device.
Background art
Speech recognition technology converts human speech into computer-readable input. It is widely used in fields such as voice dialing, voice navigation, and automatic device control. How to improve the accuracy of speech recognition has therefore become an important problem.
In the prior art, a user's voice input is generally recognized with a speech model that converts the input phonetic feature sequence into a character string. A speech model generally comprises an acoustic model and a language model, which respectively compute speech-to-syllable probabilities and syllable-to-character probabilities.
While studying the prior art, the applicant found that it applies the same speech recognition model to the voices of different users. However, different users have different pronunciation characteristics and language usage habits, so the prior art cannot provide accurate, personalized speech recognition results. Although the prior art includes a method that applies a user-specific acoustic model to recognize the user's speech, that method considers only the user's pronunciation characteristics, such as the dialect group the user belongs to, and still cannot provide more accurate, personalized speech recognition results.
Summary of the invention
The embodiments of the present invention are intended to provide a speech recognition method and device that rank candidate speech recognition results using both a general language model and a personal language model corresponding to the user, thereby obtaining more accurate, personalized speech recognition results.
To this end, the embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a speech recognition method, comprising: receiving a user's voice input and recognizing it to obtain candidate speech recognition results; ranking the candidate speech recognition results using the user's language identification model, wherein the user's language identification model is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from the user's historical text input data; and obtaining the final speech recognition result from the ranked candidate speech recognition results.
In a second aspect, an embodiment of the present invention provides a speech recognition device, comprising: a recognition unit for receiving a user's voice input and recognizing it to obtain candidate speech recognition results; a ranking unit for ranking the candidate speech recognition results using the user's language identification model, wherein the user's language identification model is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from the user's historical text input data; and a result obtaining unit for obtaining the final speech recognition result from the ranked candidate speech recognition results.
In a third aspect, an embodiment of the present invention provides a device for speech recognition that includes a memory and one or more programs stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: receiving a user's voice input and recognizing it to obtain candidate speech recognition results; ranking the candidate speech recognition results using the user's language identification model, wherein the user's language identification model is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model built from the user's historical text input data; and obtaining the final speech recognition result from the ranked candidate speech recognition results.
In a fourth aspect, an embodiment of the present invention provides a machine-readable medium storing instructions that, when executed by one or more processors, cause a device to perform the speech recognition method of the first aspect.
The speech recognition method and device provided by the embodiments of the present invention can receive a user's voice input, recognize it to obtain candidate speech recognition results, rank the candidate speech recognition results using the user's language identification model, and obtain the final speech recognition result from the ranked candidates. Because the embodiments of the present invention derive the user's language identification model from both a general language model and the user's personal language model, the ranking considers not only general language usage habits but also the influence of the user's personalized language usage habits on the candidate speech recognition results. Results that better match the user's personalized language usage habits are therefore ranked first, which effectively improves the accuracy of the speech recognition result.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a speech recognition method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of a speech recognition method provided by another embodiment of the present invention;
Fig. 3 is a schematic diagram of a speech recognition device provided by an embodiment of the present invention;
Fig. 4 is a block diagram of a speech recognition device according to an exemplary embodiment;
Fig. 5 is a block diagram of a server according to an exemplary embodiment.
Detailed description of the embodiments
The embodiments of the present invention are intended to provide a speech recognition method and device that rank candidate speech recognition results using both a general language model and a personal language model corresponding to the user, obtaining more accurate, personalized speech recognition results.
To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
The speech recognition method shown in the exemplary embodiments of the present invention is introduced below with reference to Fig. 1 and Fig. 2.
Fig. 1 is a flow chart of the speech recognition method provided by one embodiment of the present invention. As shown in Fig. 1, the method may include:
S101: receive the user's voice input and recognize it to obtain candidate speech recognition results.
It should be noted that when recognizing the user's voice input, a prior-art acoustic model may be used to recognize the input and obtain candidate speech recognition results. The candidate speech recognition results are generally the top N best recognition results, where N is a positive integer whose value can be set empirically or as needed.
S102: rank the candidate speech recognition results using the user's language identification model.
It should be noted that the prior art generally recognizes the user's voice input with a general speech model, without considering the pronunciation characteristics and language usage habits of different users. Each user has different language usage habits, such as different pet phrases, different common words, different industry terms, and different regional vocabulary. The personalized acoustic-model recognition method provided by the prior art considers the different pronunciation characteristics of different users, collecting the user's acoustic features to build a personal acoustic model and improve the accuracy of speech recognition; however, it does not consider the user's language usage habits and still cannot provide more accurate, personalized speech recognition results.
In the embodiments of the present invention, a personal language model corresponding to the user can be established in advance. The personal language model corresponding to the user is a language model built from the user's historical text input data, and it can effectively measure how probable a sentence is.
In a specific implementation, the personal language model corresponding to the user can be established as follows:
A. Obtain the user's historical text input data.
In a specific implementation, the user's text input data can be collected through various channels as the training corpus for the speech model.
B. Obtain the user's word features and/or word-combination features from the user's historical text input data. The word features include words and their occurrence frequencies; the word-combination features include word combinations and their occurrence frequencies.
The frequency of a word is the number of times it occurs in the entire corpus, and the frequency of a word combination is the number of times that combination occurs in the entire corpus; a word combination is a combination of two or more words. For example, when typing, user Zhang San often mistypes and inputs the phrase rendered here as "I strangle go" and seldom inputs "I a go" (two near-homophonous phrases in the original Chinese). By collecting Zhang San's word combinations and counting their occurrences in the corpus, when Zhang San later says the phrase aloud, the candidate speech recognition result "I strangle go" will be ranked before the candidate "I a go". As another example, user Li Si often inputs "storehouse" and seldom inputs "battle" (likewise near-homophones in the original). By collecting word features, in subsequent speech recognition the candidate "storehouse" will be ranked before the candidate "battle".
C. Train with the user's word features and/or word-combination features to obtain the personal language model corresponding to the user.
When training the personal language model, an N-gram language model training method, a recurrent neural network (RNN) language model training method, a long short-term memory (LSTM) network language model training method, or the like may be used. The trigram N-gram training method is introduced below as an example.
(1) Segment each sentence in the corpus into words. For example, the sentence "ABC" is segmented into (A, B, C).
(2) Compute the probability of each of the words A, B and C occurring in the corpus, where:
P(A) = (number of occurrences of A in the corpus) / (total number of words in the corpus)
(3) Compute the probability of the word combination AB occurring in the corpus:
P(B|A) = (number of occurrences of AB in the corpus) / (number of occurrences of A in the corpus)
(4) Compute the conditional probability of word C occurring after the word combination AB:
P(C|AB) = (number of occurrences of ABC in the corpus) / (number of occurrences of AB in the corpus)
(5) Compute the probability of the sentence ABC occurring in the corpus:
P(ABC) = P(A) · P(B|A) · P(C|AB)
Training on all of the user's corpus in this way yields the user's personal language model, which can effectively measure how probable a sentence is.
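Steps (1) to (5) above amount to maximum-likelihood trigram estimation. A minimal sketch follows; the function names and toy corpus are assumptions for illustration, and a production model would add smoothing for unseen n-grams.

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Count unigrams, bigrams and trigrams over pre-segmented sentences
    (lists of tokens), the raw counts behind steps (2)-(4)."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    total_words = 0
    for tokens in sentences:
        total_words += len(tokens)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams, total_words

def sentence_probability(tokens, unigrams, bigrams, trigrams, total_words):
    """P(ABC) = P(A) * P(B|A) * P(C|AB) for a three-word sentence, step (5)."""
    a, b, c = tokens
    p_a = unigrams[a] / total_words                 # P(A), step (2)
    p_b_a = bigrams[(a, b)] / unigrams[a]           # P(B|A), step (3)
    p_c_ab = trigrams[(a, b, c)] / bigrams[(a, b)]  # P(C|AB), step (4)
    return p_a * p_b_a * p_c_ab
```

On a toy corpus of the two sentences (A, B, C) and (A, B, D), this gives P(ABC) = (2/6) × 1 × (1/2) = 1/6.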
In a specific implementation of the invention, after the N candidate speech recognition results are obtained, the user's language identification model can be used to rank them. The user's language identification model is obtained from the general language model and the personal language model corresponding to the user; the general language model is trained on the text input corpora of all users.
In some embodiments, ranking the candidate speech recognition results using the user's language identification model includes: linearly interpolating the general language model and the personal language model corresponding to the user to obtain the user's language identification model; computing the probability of each candidate speech recognition result with the user's language identification model; and ranking the candidates by the computed probabilities.
For example, weights can be preset and the user's language identification model obtained with the following formula:
user's language identification model = a × personal language model + b × general language model
where 0 < a < 1, 0 < b < 1, and a + b = 1. For example, a = 0.7 and b = 0.3.
The probability of each candidate speech recognition result is then computed with the resulting language identification model, and the candidates are sorted in descending order of probability.
In other embodiments, ranking the candidate speech recognition results using the user's language identification model includes: computing the probability of each candidate with the general language model and with the personal language model corresponding to the user; linearly interpolating the probability computed with the general language model and the probability computed with the personal language model corresponding to the user; and ranking the candidates by the interpolation result.
For example, weights can be preset and the interpolation result obtained with the following formula:
final probability = a × personal-language-model probability + b × general-language-model probability
where 0 < a < 1, 0 < b < 1, and a + b = 1.
For example, suppose a = 0.7 and b = 0.3. User Zhang San's personal language model gives the candidate speech recognition result "I strangle go" a probability of 0.00038683, and the general language model gives it 0.00023453. Linear interpolation yields the final probability: 0.7 × 0.00038683 + 0.3 × 0.00023453 = 0.00034114.
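The interpolate-then-rerank step can be sketched as follows. The candidate strings and the second candidate's probabilities are illustrative stand-ins; only the "I strangle go" numbers come from the example above.

```python
def rerank(candidates, p_personal, p_general, a=0.7, b=0.3):
    """Score each candidate as a * P_personal + b * P_general (a + b = 1)
    and return the candidates sorted by descending final probability."""
    assert abs(a + b - 1.0) < 1e-9
    scores = {c: a * p_personal[c] + b * p_general[c] for c in candidates}
    return sorted(candidates, key=scores.get, reverse=True), scores

order, scores = rerank(
    ["I strangle go", "I a go"],
    p_personal={"I strangle go": 0.00038683, "I a go": 0.00005},
    p_general={"I strangle go": 0.00023453, "I a go": 0.0002},
)
```

With these numbers, "I strangle go" scores 0.7 × 0.00038683 + 0.3 × 0.00023453 = 0.00034114 and is ranked first, matching the worked example in the text.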
In some embodiments, the method further includes obtaining the group language model corresponding to the user, which describes the language features of the group the user belongs to. In that case, ranking the candidate speech recognition results using the user's language identification model includes ranking them using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user. For a specific implementation, see the embodiment shown in Fig. 2.
S103: obtain the final speech recognition result from the ranked candidate speech recognition results.
In a specific implementation, the top-ranked candidate speech recognition result can be taken as the final speech recognition result. For example, suppose the user says "call Wang Li for me", where two candidate names are homophones (both rendered here as "Wang Li"). Using the general language model alone, the recognition result may contain the wrong name; but when typing, this user usually inputs the other "Wang Li". With the method of the invention, when the N-best candidate results are re-scored and re-ranked, the sentence containing the name the user actually types is ranked before the other. The personalized recognition result obtained in this way is more accurate than the general recognition result and better matches the user's language usage.
In this embodiment of the invention, the user's voice input is received and recognized to obtain candidate speech recognition results; the candidates are ranked with the user's language identification model; and the final speech recognition result is obtained from the ranked candidates. Because the user's language identification model is obtained from the general language model and the user's personal language model, the ranking comprehensively considers the influence of both general language usage habits and the user's personalized language usage habits on the candidate speech recognition results, so that results better matching the user's personalized habits are ranked first, which effectively improves the accuracy of the speech recognition result.
Fig. 2 is a flow chart of the speech recognition method provided by another embodiment of the present invention. Unlike the embodiment shown in Fig. 1, this embodiment also considers the influence of the group language model corresponding to the user on the recognition result. The group language model corresponding to the user can recognize personalized expressions that the user rarely uses but that are common among similar users, making up for the scarcity of the user's own corpus and improving the accuracy of speech recognition.
S201: establish the personal language model corresponding to the user.
In a specific implementation, the personal language model corresponding to the user can be established as follows: obtain the user's historical text input data; obtain the user's word features and/or word-combination features from that data, where the word features include words and their occurrence frequencies and the word-combination features include word combinations and their occurrence frequencies; and train with the user's word features and/or word-combination features to obtain the personal language model corresponding to the user. For a specific implementation, see the embodiment shown in Fig. 1.
S202: establish each group language model.
A group language model describes the language features of the group a user belongs to. Each user has a corresponding group language model; the correspondence between users and group language models can be saved in advance, so that in S205 the group language model corresponding to the current user can be obtained from the saved correspondence.
Establishing each group language model includes the following steps:
S202A: compute the similarity between different users and obtain similar-user sets from the computed similarities. A similar-user set contains the users whose similarity exceeds a set threshold.
In a specific implementation, S202A may include: obtaining the word feature vector of each user; taking each user in turn as the current user and computing the cosine distance between the current user's word feature vector and each other user's word feature vector, using that cosine distance as their similarity; and adding every user whose similarity with the current user exceeds the set threshold to the similar-user set corresponding to the current user.
For example, a user's word feature vector can be obtained from the user's 1-gram personal language model: the vector holds the probability of each word, representing how often the user uses it. Because similar users use words with similar frequencies, the similarity of users can be measured by the similarity of their word feature vectors.
For instance, the common words of doctor A and doctor B are similar, while the common words of doctor A differ from those of ice hockey coach C and truck driver D; each user's word features can be reflected in a vector of vocabulary size. The similarity between the word feature vectors of doctor A and doctor B is naturally greater than the similarity between the word feature vectors of doctor A and coach C. When computing similarity, the cosine distance between vectors can be used.
The cosine distance of two vectors a and b can be computed with the following formula:
cos θ = (a · b) / (‖a‖ ‖b‖)
It should be noted that if the similarity between two users' word feature vectors exceeds the set threshold, the two users are considered similar. The set of all users whose similarity with the current user exceeds the threshold serves as the similar-user group corresponding to the current user. If the current user's text input corpus is small, the similar-user group can make up for that scarcity by supplementing personalized expressions that the user rarely uses but that similar users commonly use, making the resulting speech recognition more accurate.
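Step S202A can be sketched as below. The 0.8 threshold and the toy two-dimensional vectors are assumptions for illustration; real word feature vectors would have vocabulary-sized dimensions.

```python
import math

def cosine(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||) for two word-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar_users(current, others, threshold=0.8):
    """Return the users whose word feature vectors exceed `threshold`
    cosine similarity with the current user's vector (step S202A)."""
    return [uid for uid, vec in others.items()
            if cosine(current, vec) > threshold]
```

In the doctor/coach example, a doctor's vector is far closer to another doctor's than to an ice hockey coach's, so only the fellow doctor lands in the similar-user set.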
S202B: obtain the word features and/or word-combination features corresponding to the similar-user group from the text input of each user in the similar-user set; the word features include words and their occurrence frequencies, and the word-combination features include word combinations and their occurrence frequencies.
S202C: train with the word features and/or word-combination features corresponding to the similar-user group to obtain the group language model.
It should be noted that the method for training a group language model is the same as that for obtaining a personal language model; only the input corpus differs. The input corpus of the group language model can be the corpora of some or all users in the similar-user group.
S203: establish the general language model.
S201, S202 and S203 have no fixed execution order; they can be executed in reverse order or in parallel.
S204: receive the user's voice input and recognize it to obtain candidate speech recognition results.
S205: rank the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user.
In some embodiments, this ranking includes: linearly interpolating the general language model, the personal language model corresponding to the user, and the group language model to obtain the user's language identification model; computing the probability of each candidate speech recognition result with the user's language identification model; and ranking the candidates by the computed probabilities.
For example, weights can be preset and the user's language identification model obtained with the following formula:
user's language identification model = x × personal language model + y × general language model + z × group language model
where 0 < x < 1, 0 < y < 1, 0 < z < 1, and x + y + z = 1. For example, x = 0.5, y = 0.3 and z = 0.2.
In a specific implementation, the resulting language identification model is used to compute the probability of each candidate speech recognition result, and the candidates are sorted in descending order of probability.
In some embodiments, the ranking the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user includes: calculating the probability of each candidate speech recognition result using the general language model, calculating the probability of each candidate speech recognition result using the personal language model corresponding to the user, and calculating the probability of each candidate speech recognition result using the group language model corresponding to the user; performing linear interpolation on the probability calculated using the general language model, the probability calculated using the personal language model corresponding to the user, and the probability calculated using the group language model corresponding to the user; and ranking the candidate speech recognition results according to the result of the linear interpolation.
For example, weights can be preset, and the result of the linear interpolation is obtained by the following formula:
final probability value = x × probability value of the personal language model + y × probability value of the general language model + z × probability value of the group language model
where 0 < x < 1, 0 < y < 1, 0 < z < 1, and x + y + z = 1.
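The probability-level interpolation above, where each component model first scores every candidate and the scores are then combined, can be illustrated numerically. The per-model probabilities for the three candidates below are made-up values and the weights are the illustrative x = 0.5, y = 0.3, z = 0.2 from the earlier example.

```python
# Sketch of the probability-level linear interpolation: each model has
# already scored every candidate; the final score is a weighted sum.

x, y, z = 0.5, 0.3, 0.2
assert abs(x + y + z - 1.0) < 1e-12  # weights must sum to 1

# Per-candidate probabilities from each component model (hypothetical).
p_personal = [0.50, 0.20, 0.30]
p_general  = [0.25, 0.45, 0.30]
p_group    = [0.40, 0.30, 0.30]

final = [x * pp + y * pg + z * pc
         for pp, pg, pc in zip(p_personal, p_general, p_group)]
# Rank candidate indices by final probability, descending.
order = sorted(range(len(final)), key=final.__getitem__, reverse=True)
print(order)  # [0, 2, 1]: candidate 0 ranks first
```

For n-gram models scored on whole sentences this produces the same ordering as interpolating the models themselves; the patent simply presents both as alternative embodiments.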
S206: obtain the final speech recognition result using the ranked candidate speech recognition results.
In this embodiment of the invention, the influence of the group language model corresponding to the user on the recognition result is taken into account. Through the group language model corresponding to the user, personalized corpus that the user rarely uses but that is common in the similar group corresponding to the user can be recognized, making up for the deficiency of the user's own corpus and improving the accuracy of speech recognition.
Referring to Fig. 3, a schematic diagram of a speech recognition device provided by an embodiment of the invention is shown.
A speech recognition device 300 includes:
a recognition unit 301, configured to receive a voice input of a user and recognize the voice input to obtain candidate speech recognition results, where the specific implementation of the recognition unit 301 may refer to step 101 of the embodiment shown in Fig. 1;
a ranking unit 302, configured to rank the candidate speech recognition results using a language identification model of the user, where the language identification model of the user is obtained from a general language model and a personal language model corresponding to the user, the personal language model corresponding to the user is a language model established using the history text input data of the user, and the specific implementation of the ranking unit 302 may refer to step 102 of the embodiment shown in Fig. 1; and
a result obtaining unit 303, configured to obtain a final speech recognition result using the ranked candidate speech recognition results, where the specific implementation of the result obtaining unit 303 may refer to step 103 of the embodiment shown in Fig. 1.
In some embodiments, the ranking unit 302 is specifically configured to: perform linear interpolation on the general language model and the personal language model corresponding to the user to obtain the language identification model of the user; calculate the probability of each candidate speech recognition result using the language identification model of the user; and rank the candidate speech recognition results according to the calculated probabilities.
In some embodiments, the ranking unit 302 is specifically configured to: calculate the probability of each candidate speech recognition result using the general language model, and calculate the probability of each candidate speech recognition result using the personal language model corresponding to the user; perform linear interpolation on the probability calculated using the general language model and the probability calculated using the personal language model corresponding to the user; and rank the candidate speech recognition results according to the result of the linear interpolation.
In some embodiments, the device further includes a personal language model establishing unit, configured to establish the personal language model corresponding to the user, where the personal language model establishing unit is specifically configured to: obtain the history text input data of the user; obtain word features and/or word combination features of the user according to the history text input data of the user, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and train with the word features and/or word combination features of the user to obtain the personal language model corresponding to the user.
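The feature-extraction step above can be sketched with standard-library counters. This is a simplified illustration under assumed inputs: the history lines are invented, tokenization is naive whitespace splitting, and "word combination" is taken here to mean adjacent word pairs; the actual training of the personal language model from these counts is not shown.

```python
# Sketch: extract word features and word-combination (adjacent-pair)
# features, with statistical frequencies, from history text input.

from collections import Counter

def extract_features(history_lines):
    """Return (word frequencies, adjacent-word-pair frequencies)."""
    words, pairs = Counter(), Counter()
    for line in history_lines:
        tokens = line.split()                  # naive tokenization
        words.update(tokens)                   # word features
        pairs.update(zip(tokens, tokens[1:]))  # word-combination features
    return words, pairs

# Hypothetical history text input of one user.
history = ["open the map", "open the music player", "play jazz music"]
word_freq, pair_freq = extract_features(history)
print(word_freq["open"])           # 2
print(pair_freq[("open", "the")])  # 2
```

These frequency tables are exactly the kind of statistics from which an n-gram personal language model could then be estimated.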
In some embodiments, the device further includes a group language model obtaining unit, configured to obtain the group language model corresponding to the user, where the group language model is used to describe the language features of the group to which the user belongs.
The ranking unit is further configured to: rank the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user.
In some embodiments, the device further includes a group language model establishing unit, which is specifically configured to: calculate the similarity between different users, and obtain similar user group sets according to the calculated similarities, where a similar user group set includes the users whose similarity is greater than a set threshold; obtain word features and/or word combination features corresponding to a similar user group using the text input of the users in the similar user group set, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and train with the word features and/or word combination features corresponding to the similar user group to obtain the group language model.
In some embodiments, the group language model establishing unit is specifically configured to: obtain the word feature vectors of different users; take each of the different users in turn as the current user, calculate the cosine distance between the word feature vector of the current user and the word feature vectors of the other users, and use the cosine distance as the similarity between the current user and the other users; and add each user whose similarity with the current user is greater than the set threshold to the similar user group set corresponding to the current user.
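The similarity computation above can be sketched as follows. The user names, three-dimensional feature vectors, and the 0.9 threshold are illustrative assumptions; a real system would use high-dimensional word-frequency vectors built from each user's text input.

```python
# Sketch: group users whose cosine similarity of word feature vectors
# exceeds a set threshold, as in the group language model establishing unit.

import math

def cosine(u, v):
    """Cosine of the angle between two vectors (used as similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_group(current, users, threshold):
    """Users whose similarity with `current` exceeds `threshold`."""
    return {name for name, vec in users.items()
            if name != current and cosine(users[current], vec) > threshold}

users = {
    "alice": [3.0, 1.0, 0.0],   # word-frequency style feature vectors
    "bob":   [2.0, 1.0, 0.0],   # close to alice
    "carol": [0.0, 0.0, 5.0],   # very different usage
}
print(similar_group("alice", users, threshold=0.9))  # {'bob'}
```

Taking each user in turn as the current user yields one similar user group set per user, from which the group language model for that user can then be trained.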
For the configuration of the units or modules of the device of the present invention, reference may be made to the implementation of the methods shown in Fig. 1 and Fig. 2, which is not repeated here.
Referring to Fig. 4, a block diagram of a speech recognition device according to an exemplary embodiment is shown. For example, the device 400 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 4, the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 typically controls the overall operation of the device 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 402 may include one or more modules to facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store various types of data to support the operation of the device 400. Examples of such data include instructions of any application or method operated on the device 400, contact data, phonebook data, messages, pictures, videos, and the like. The memory 404 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power component 406 provides power to the various components of the device 400. The power component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 400.
The multimedia component 408 includes a screen providing an output interface between the device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the device 400 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC), which is configured to receive external audio signals when the device 400 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 404 or sent via the communication component 416. In some embodiments, the audio component 410 further includes a speaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the device 400. For example, the sensor component 414 may detect the open/closed state of the device 400 and the relative positioning of components, such as the display and keypad of the device 400; the sensor component 414 may also detect a change in position of the device 400 or a component of the device 400, the presence or absence of user contact with the device 400, the orientation or acceleration/deceleration of the device 400, and a change in the temperature of the device 400. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the device 400 and other devices. The device 400 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 416 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
Specifically, an embodiment of the invention provides a speech recognition device 400, including a memory 404 and one or more programs, where the one or more programs are stored in the memory 404 and are configured to be executed by the one or more processors 420, and the one or more programs include instructions for performing the following operations: receiving a voice input of a user, and recognizing the voice input to obtain candidate speech recognition results; ranking the candidate speech recognition results using a language identification model of the user, where the language identification model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model established using the history text input data of the user; and obtaining a final speech recognition result using the ranked candidate speech recognition results.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: performing linear interpolation on the general language model and the personal language model corresponding to the user to obtain the language identification model of the user; calculating the probability of each candidate speech recognition result using the language identification model of the user; and ranking the candidate speech recognition results according to the calculated probabilities.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: calculating the probability of each candidate speech recognition result using the general language model, and calculating the probability of each candidate speech recognition result using the personal language model corresponding to the user; performing linear interpolation on the probability calculated using the general language model and the probability calculated using the personal language model corresponding to the user; and ranking the candidate speech recognition results according to the result of the linear interpolation.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: obtaining the history text input data of the user; obtaining word features and/or word combination features of the user according to the history text input data of the user, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and training with the word features and/or word combination features of the user to obtain the personal language model corresponding to the user.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: obtaining the group language model corresponding to the user, where the group language model is used to describe the language features of the group to which the user belongs; and ranking the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: calculating the similarity between different users, and obtaining similar user group sets according to the calculated similarities, where a similar user group set includes the users whose similarity is greater than a set threshold; obtaining word features and/or word combination features corresponding to a similar user group using the text input of the users in the similar user group set, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and training with the word features and/or word combination features corresponding to the similar user group to obtain the group language model.
Further, the processor 420 is also specifically configured to execute the one or more programs including instructions for performing the following operations: obtaining the word feature vectors of different users; taking each of the different users in turn as the current user, calculating the cosine distance between the word feature vector of the current user and the word feature vectors of the other users, and using the cosine distance as the similarity between the current user and the other users; and adding the users whose similarity with the current user is greater than the set threshold to the similar user group set corresponding to the current user.
An embodiment further provides a machine-readable medium, for example a non-transitory computer-readable storage medium. When the instructions in the medium are executed by the processor of a device (a terminal or a server), the device is enabled to perform a speech recognition method, the method including: receiving a voice input of a user, and recognizing the voice input to obtain candidate speech recognition results; ranking the candidate speech recognition results using a language identification model of the user, where the personal language model corresponding to the user is a language model established using the history text input data of the user; and obtaining a final speech recognition result using the ranked candidate speech recognition results.
Optionally, the ranking the candidate speech recognition results using the language identification model of the user includes: performing linear interpolation on the general language model and the personal language model corresponding to the user to obtain the language identification model of the user; calculating the probability of each candidate speech recognition result using the language identification model of the user; and ranking the candidate speech recognition results according to the calculated probabilities.
Optionally, the ranking the candidate speech recognition results using the language identification model of the user includes: calculating the probability of each candidate speech recognition result using the general language model, and calculating the probability of each candidate speech recognition result using the personal language model corresponding to the user; performing linear interpolation on the probability calculated using the general language model and the probability calculated using the personal language model corresponding to the user; and ranking the candidate speech recognition results according to the result of the linear interpolation.
Optionally, the method further includes: obtaining the history text input data of the user; obtaining word features and/or word combination features of the user according to the history text input data of the user, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and training with the word features and/or word combination features of the user to obtain the personal language model corresponding to the user.
Optionally, the method further includes: obtaining the group language model corresponding to the user, where the group language model is used to describe the language features of the group to which the user belongs. The ranking the candidate speech recognition results using the language identification model of the user includes: ranking the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user.
Optionally, the obtaining the group language model corresponding to the user includes:
pre-establishing each group language model; and
obtaining the group language model corresponding to the user according to the correspondence between users and group language models.
Optionally, the pre-establishing each group language model includes: calculating the similarity between different users, and obtaining similar user group sets according to the calculated similarities, where a similar user group set includes the users whose similarity is greater than a set threshold; obtaining word features and/or word combination features corresponding to a similar user group using the text input of the users in the similar user group set, where the word features include words and the statistical frequencies of the words, and the word combination features include word combinations and the statistical frequencies of the word combinations; and training with the word features and/or word combination features corresponding to the similar user group to obtain the group language model.
Optionally, the calculating the similarity between different users and obtaining similar user group sets according to the calculated similarities includes: obtaining the word feature vectors of different users; taking each of the different users in turn as the current user, calculating the cosine distance between the word feature vector of the current user and the word feature vectors of the other users, and using the cosine distance as the similarity between the current user and the other users; and adding the users whose similarity with the current user is greater than the set threshold to the similar user group set corresponding to the current user.
Fig. 5 is a structural schematic diagram of a server in an embodiment of the invention. The server 500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 522 (for example, one or more processors), a memory 532, and a storage medium 530 (for example, one or more mass storage devices) storing application programs 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The programs stored in the storage medium 530 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the server. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, one or more keyboards 556, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional techniques in the art not disclosed herein. The description and examples are to be considered as illustrative only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims. The foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device including that element. The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
All the embodiments in this specification are described in a progressive manner; the same or similar parts between the embodiments may refer to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively simple, and for related parts reference may be made to the description of the method embodiments. The device embodiments described above are merely illustrative; the units described as separate parts may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment, which those of ordinary skill in the art can understand and implement without creative work. The above is only a specific embodiment of the present invention; it should be pointed out that, for those skilled in the art, several improvements and modifications can be made without departing from the principles of the present invention, and these improvements and modifications should also be considered within the protection scope of the present invention.
Claims (10)
1. A speech recognition method, characterized by comprising:
receiving a voice input of a user, and recognizing the voice input to obtain candidate speech recognition results;
ranking the candidate speech recognition results using a language identification model of the user, wherein the language identification model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model established using history text input data of the user; and
obtaining a final speech recognition result using the ranked candidate speech recognition results.
2. The method according to claim 1, wherein the ranking the candidate speech recognition results using the language identification model of the user comprises:
performing linear interpolation on the general language model and the personal language model corresponding to the user to obtain the language identification model of the user; and
calculating the probability of each candidate speech recognition result using the language identification model of the user, and ranking the candidate speech recognition results according to the calculated probabilities.
3. The method according to claim 1, wherein the ranking the candidate speech recognition results using the language identification model of the user comprises:
calculating the probability of each candidate speech recognition result using the general language model, and calculating the probability of each candidate speech recognition result using the personal language model corresponding to the user; and
performing linear interpolation on the probability calculated using the general language model and the probability calculated using the personal language model corresponding to the user, and ranking the candidate speech recognition results according to the result of the linear interpolation.
4. The method according to any one of claims 1 to 3, wherein the method further comprises:
obtaining the history text input data of the user;
obtaining word features and/or word combination features of the user according to the history text input data of the user, wherein the word features comprise words and the statistical frequencies of the words, and the word combination features comprise word combinations and the statistical frequencies of the word combinations; and
training with the word features and/or word combination features of the user to obtain the personal language model corresponding to the user.
5. The method according to claim 1, wherein the method further comprises:
obtaining a group language model corresponding to the user, wherein the group language model is used to describe the language features of the group to which the user belongs; and
the ranking the candidate speech recognition results using the language identification model of the user comprises:
ranking the candidate speech recognition results using the general language model, the personal language model corresponding to the user, and the group language model corresponding to the user.
6. The method according to claim 5, wherein the obtaining the group language model corresponding to the user comprises:
pre-establishing each group language model; and
obtaining the group language model corresponding to the user according to the correspondence between users and group language models;
wherein the pre-establishing each group language model comprises:
calculating the similarity between different users, and obtaining similar user group sets according to the calculated similarities, wherein a similar user group set comprises the users whose similarity is greater than a set threshold;
obtaining word features and/or word combination features corresponding to a similar user group using the text input of the users in the similar user group set, wherein the word features comprise words and the statistical frequencies of the words, and the word combination features comprise word combinations and the statistical frequencies of the word combinations; and
training with the word features and/or word combination features corresponding to the similar user group to obtain the group language model.
7. The method according to claim 6, wherein calculating the similarities between different users and obtaining the similar-user group sets according to the calculated similarities comprises:
obtaining a word feature vector of each of the different users;
taking each of the different users in turn as a current user, calculating the cosine distance between the word feature vector of the current user and the word feature vector of each other user, and using the cosine distance as the similarity between the current user and the other user;
adding each user whose similarity with the current user is greater than the set threshold to the similar-user group set corresponding to the current user.
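The grouping step of claim 7 can be sketched directly: compute cosine similarity between sparse word-frequency vectors and keep the users above the threshold. The vector representation and the threshold value are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-frequency vectors
    (dicts mapping word -> count)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def similar_user_group(current, others, threshold=0.5):
    """Collect the users whose word feature vectors are more similar to the
    current user's vector than the set threshold (claim 7 sketch)."""
    return [uid for uid, vec in others.items()
            if cosine(current, vec) > threshold]

me = {"football": 3, "match": 2}
others = {
    "u1": {"football": 1, "goal": 1},   # shares vocabulary with me
    "u2": {"opera": 4, "aria": 2},      # no overlap
}
group = similar_user_group(me, others)
```

Note the claim's wording uses "cosine distance" as the similarity measure; cosine similarity as computed here (larger means more similar) matches the "greater than a set threshold" condition.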
8. A speech recognition apparatus, comprising:
a recognition unit, configured to receive a voice input of a user, recognize the voice input, and obtain candidate speech recognition results;
a ranking unit, configured to rank the candidate speech recognition results by using a language identification model of the user, wherein the language identification model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model established by using historical text input data of the user;
a result obtaining unit, configured to obtain a final speech recognition result by using the ranked candidate speech recognition results.
9. An apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
receiving a voice input of a user, recognizing the voice input, and obtaining candidate speech recognition results;
ranking the candidate speech recognition results by using a language identification model of the user, wherein the language identification model of the user is obtained from a general language model and a personal language model corresponding to the user, and the personal language model corresponding to the user is a language model established by using historical text input data of the user;
obtaining a final speech recognition result by using the ranked candidate speech recognition results.
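The three operations recited in claim 9 form a simple pipeline: decode candidates, rerank them with the user's language identification model, and return the top result. The sketch below assumes both interfaces (`acoustic_decoder`, `user_lm_score`) as stand-ins; neither name comes from the patent.

```python
def recognize(audio, acoustic_decoder, user_lm_score):
    """End-to-end flow of claim 9: decode candidate results, rank them
    with the user's language identification model, return the top result.
    Both callables are assumed interfaces, not the patent's API."""
    candidates = acoustic_decoder(audio)                          # operation 1
    ranked = sorted(candidates, key=user_lm_score, reverse=True)  # operation 2
    return ranked[0]                                              # operation 3

best = recognize(
    b"\x00fake-audio",
    acoustic_decoder=lambda a: ["write a vest", "right away"],
    user_lm_score=lambda s: len(s.split()),  # toy scoring function
)
```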
10. A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the speech recognition method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710537548.3A CN109243430B (en) | 2017-07-04 | 2017-07-04 | Voice recognition method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710537548.3A CN109243430B (en) | 2017-07-04 | 2017-07-04 | Voice recognition method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109243430A true CN109243430A (en) | 2019-01-18 |
| CN109243430B CN109243430B (en) | 2022-03-01 |
Family
ID=65083290
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710537548.3A Active CN109243430B (en) | 2017-07-04 | 2017-07-04 | Voice recognition method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109243430B (en) |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110120221A (en) * | 2019-06-06 | 2019-08-13 | Shanghai NIO Automobile Co., Ltd. | User-personalized offline speech recognition method and system for a vehicle system |
| CN110502126A (en) * | 2019-05-28 | 2019-11-26 | Huawei Technologies Co., Ltd. | Input method and electronic device |
| CN110992939A (en) * | 2019-12-18 | 2020-04-10 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Language model training method, decoding method, device, storage medium and equipment |
| CN111145756A (en) * | 2019-12-26 | 2020-05-12 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition method and device for speech recognition |
| CN111554276A (en) * | 2020-05-15 | 2020-08-18 | Shenzhen Qianhai WeBank Co., Ltd. | Speech recognition method, apparatus, device, and computer-readable storage medium |
| CN111627452A (en) * | 2019-02-28 | 2020-09-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice decoding method and device and terminal equipment |
| CN111651599A (en) * | 2020-05-29 | 2020-09-11 | Beijing Sogou Technology Development Co., Ltd. | Method and device for ranking candidate speech recognition results |
| CN111816165A (en) * | 2020-07-07 | 2020-10-23 | Beijing SoundAI Technology Co., Ltd. | Speech recognition method, device and electronic device |
| CN112242142A (en) * | 2019-07-17 | 2021-01-19 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition input method and related device |
| CN112363631A (en) * | 2019-07-24 | 2021-02-12 | Beijing Sogou Technology Development Co., Ltd. | Input method, input device and device for input |
| CN112490516A (en) * | 2019-08-23 | 2021-03-12 | SAIC Motor Corporation Limited | Power battery maintenance mode generation system and method |
| CN114327355A (en) * | 2021-12-30 | 2022-04-12 | iFLYTEK Co., Ltd. | Voice input method, electronic device and computer storage medium |
| CN115394289A (en) * | 2022-07-27 | 2022-11-25 | JD Technology Information Technology Co., Ltd. | Identification information generation method and device, electronic equipment and computer-readable medium |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120173237A1 (en) * | 2003-12-23 | 2012-07-05 | Nuance Communications, Inc. | Interactive speech recognition model |
| CN103577386A (en) * | 2012-08-06 | 2014-02-12 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for dynamically loading a language model based on the user's input scene |
| CN104508739A (en) * | 2012-06-21 | 2015-04-08 | Google Inc. | Dynamic language model |
| US9190055B1 (en) * | 2013-03-14 | 2015-11-17 | Amazon Technologies, Inc. | Named entity recognition with personalized models |
| CN105096940A (en) * | 2015-06-30 | 2015-11-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for voice recognition |
| CN105122354A (en) * | 2012-12-12 | 2015-12-02 | Amazon Technologies, Inc. | Speech model retrieval in distributed speech recognition systems |
| CN105448292A (en) * | 2014-08-19 | 2016-03-30 | Beijing Yushanzhi Information Technology Co., Ltd. | Scene-based real-time voice recognition system and method |
- 2017
- 2017-07-04 CN CN201710537548.3A patent/CN109243430B/en status: Active
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120173237A1 (en) * | 2003-12-23 | 2012-07-05 | Nuance Communications, Inc. | Interactive speech recognition model |
| CN104508739A (en) * | 2012-06-21 | 2015-04-08 | Google Inc. | Dynamic language model |
| CN103577386A (en) * | 2012-08-06 | 2014-02-12 | Tencent Technology (Shenzhen) Co., Ltd. | Method and device for dynamically loading a language model based on the user's input scene |
| CN105122354A (en) * | 2012-12-12 | 2015-12-02 | Amazon Technologies, Inc. | Speech model retrieval in distributed speech recognition systems |
| US9190055B1 (en) * | 2013-03-14 | 2015-11-17 | Amazon Technologies, Inc. | Named entity recognition with personalized models |
| CN105448292A (en) * | 2014-08-19 | 2016-03-30 | Beijing Yushanzhi Information Technology Co., Ltd. | Scene-based real-time voice recognition system and method |
| CN105096940A (en) * | 2015-06-30 | 2015-11-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for voice recognition |
Cited By (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111627452A (en) * | 2019-02-28 | 2020-09-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice decoding method and device and terminal equipment |
| CN111627452B (en) * | 2019-02-28 | 2023-05-23 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice decoding method and device and terminal equipment |
| CN110502126A (en) * | 2019-05-28 | 2019-11-26 | Huawei Technologies Co., Ltd. | Input method and electronic device |
| CN110502126B (en) * | 2019-05-28 | 2023-12-29 | Huawei Technologies Co., Ltd. | Input method and electronic device |
| CN110120221A (en) * | 2019-06-06 | 2019-08-13 | Shanghai NIO Automobile Co., Ltd. | User-personalized offline speech recognition method and system for a vehicle system |
| CN112242142B (en) * | 2019-07-17 | 2024-01-30 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition input method and related device |
| CN112242142A (en) * | 2019-07-17 | 2021-01-19 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition input method and related device |
| CN112363631A (en) * | 2019-07-24 | 2021-02-12 | Beijing Sogou Technology Development Co., Ltd. | Input method, input device and device for input |
| CN112490516A (en) * | 2019-08-23 | 2021-03-12 | SAIC Motor Corporation Limited | Power battery maintenance mode generation system and method |
| CN110992939A (en) * | 2019-12-18 | 2020-04-10 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Language model training method, decoding method, device, storage medium and equipment |
| CN111145756B (en) * | 2019-12-26 | 2022-06-14 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition method and device for speech recognition |
| CN111145756A (en) * | 2019-12-26 | 2020-05-12 | Beijing Sogou Technology Development Co., Ltd. | Speech recognition method and device for speech recognition |
| CN111554276B (en) * | 2020-05-15 | 2023-11-03 | Shenzhen Qianhai WeBank Co., Ltd. | Speech recognition method, device, equipment and computer-readable storage medium |
| CN111554276A (en) * | 2020-05-15 | 2020-08-18 | Shenzhen Qianhai WeBank Co., Ltd. | Speech recognition method, apparatus, device, and computer-readable storage medium |
| CN111651599A (en) * | 2020-05-29 | 2020-09-11 | Beijing Sogou Technology Development Co., Ltd. | Method and device for ranking candidate speech recognition results |
| CN111651599B (en) * | 2020-05-29 | 2023-05-26 | Beijing Sogou Technology Development Co., Ltd. | Method and device for ranking speech recognition candidate results |
| CN111816165A (en) * | 2020-07-07 | 2020-10-23 | Beijing SoundAI Technology Co., Ltd. | Speech recognition method, device and electronic device |
| CN114327355A (en) * | 2021-12-30 | 2022-04-12 | iFLYTEK Co., Ltd. | Voice input method, electronic device and computer storage medium |
| CN115394289A (en) * | 2022-07-27 | 2022-11-25 | JD Technology Information Technology Co., Ltd. | Identification information generation method and device, electronic equipment and computer-readable medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109243430B (en) | 2022-03-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109243430A (en) | Speech recognition method and device | |
| CN109117862B (en) | Image tag recognition method, device and server |
| CN108874967B (en) | Method and device for determining dialog state, dialog system, terminal, storage medium |
| CN107291690B (en) | Punctuation adding method and device, and device for punctuation adding |
| US11138422B2 | Posture detection method, apparatus and device, and storage medium |
| CN110674801B (en) | Method and device for identifying user motion mode based on accelerometer, and electronic equipment |
| CN107992812A | Lip reading recognition method and device |
| CN109389162B (en) | Sample image screening method and device, electronic equipment and storage medium |
| CN107221330A | Punctuation adding method and device, and device for punctuation adding |
| WO2021128880A1 | Speech recognition method, device, and device for speech recognition |
| CN108345581A | Information identification method, device and terminal device |
| CN109819288A | Method, apparatus, electronic equipment and storage medium for determining advertisement delivery video |
| CN107992813A | Lip state detection method and device |
| CN108831508A | Voice activity detection method, device and equipment |
| CN109360197A | Image processing method and device, electronic equipment and storage medium |
| CN109961791A | Voice information processing method, device and electronic equipment |
| CN109961094A | Sample acquisition method, device, electronic equipment and readable storage medium |
| CN108803890A | Input method, input device and device for input |
| CN108628813A | Processing method and apparatus, and device for processing |
| CN111739535A | Voice recognition method and device and electronic equipment |
| CN110968246A | Intelligent Chinese handwriting input recognition method and device |
| CN108628819 | Processing method and apparatus, and device for processing |
| CN109725736 | Candidate ranking method, device and electronic equipment |
| CN109102813B | Voiceprint recognition method and device, electronic equipment and storage medium |
| CN110858099A | Candidate word generation method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |