CN112750428A - Voice interaction method and device and computer equipment
- Publication number
- CN112750428A (application CN202011591154.4A)
- Authority
- CN
- China
- Prior art keywords
- model
- region
- current user
- specified
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The application relates to the field of artificial intelligence and discloses a voice interaction method comprising the following steps: acquiring region information corresponding to a current user; obtaining the model parameters corresponding to each region model at the current moment; calculating, by polling, the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching a designated region model to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user. Users' voice data are distinguished by their regional characteristics and a voice deep-learning model is trained for each region, which improves speech-recognition accuracy and supports voice interaction in the language or dialect familiar to the user, making communication smoother, increasing user familiarity, and improving user stickiness.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a voice interaction method, apparatus, and computer device.
Background
At present, the deep-learning models of the AI voice systems behind customer-service voice lines mostly do not distinguish users by region, either in recognizing user speech or in generating spoken answers. Different regions have different speech characteristics; if users cannot be distinguished according to the regions they are in, the recognition accuracy for speech with dialect features may be low, which hinders voice interaction in regions with strong dialect characteristics.
Disclosure of Invention
The application mainly aims to provide a voice interaction method, so as to solve the technical problem that existing voice interaction cannot distinguish users according to the different regions where they are located.
The application provides a voice interaction method, which comprises the following steps:
acquiring region information corresponding to a current user;
obtaining model parameters corresponding to each region model at the current moment;
polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
matching an appointed region model corresponding to the current user according to the matching probability;
inputting the voice sentence of the current user into the designated region model;
and feeding back the interactive feedback output by the designated region model to the current user.
Preferably, the step of obtaining the model parameters corresponding to each of the region models at the current time includes:
acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
Preferably, the step of calculating the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model includes:
inputting the specified model parameters currently corresponding to the first model and the region information into a first calculation formula, wherein the first calculation formula is $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result;
obtaining a calculation result of the first calculation formula;
taking the calculation result as the matching probability currently corresponding to the first model;
and polling to calculate the matching probability corresponding to each region model according to the calculation mode of the matching probability corresponding to the first model at present.
Preferably, the step of matching the designated region model corresponding to the current user according to the matching probability includes:
sorting the matching probabilities corresponding to the region models in descending order;
determining a second region model corresponding to the maximum matching probability in the descending order;
and taking the second region model as a specified region model matched with the current user.
Preferably, the designated region model includes a Gaussian mixture model, and the step of inputting the voice sentence of the current user into the designated region model includes:
extracting audio data from the voice sentence of the current user;
performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
determining keywords in the audio data according to the prediction probability;
and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
Preferably, the step of extracting audio data from the speech sentence of the current user includes:
preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the voice sentence.
Preferably, the designated region model includes a Gaussian mixture model comprising a plurality of Gaussian distribution functions, and before the step of inputting the voice sentence of the current user into the designated region model, the method includes:
obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
The present application further provides a voice interaction apparatus, including:
the first acquisition module is used for acquiring the region information corresponding to the current user;
the second acquisition module is used for acquiring model parameters corresponding to the regional models at the current moment;
the first calculation module is used for calculating the matching probability corresponding to each region model in a polling mode according to the region information and the model parameters corresponding to each region model;
the matching module is used for matching the appointed region model corresponding to the current user according to the matching probability;
the input module is used for inputting the voice sentence of the current user into the designated region model;
and the feedback module is used for feeding back the interactive feedback output by the designated region model to the current user.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.
According to the method and the device, users' voice data are distinguished by their regional characteristics and a voice deep-learning model is trained for each region, which improves speech-recognition accuracy and supports voice interaction in the language or dialect familiar to the user, making communication smoother, increasing user familiarity, and improving user stickiness.
Drawings
FIG. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a voice interaction system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, a voice interaction method according to an embodiment of the present application includes:
s1: acquiring region information corresponding to a current user;
s2: obtaining model parameters corresponding to each region model at the current moment;
s3: polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
s4: matching an appointed region model corresponding to the current user according to the matching probability;
s5: inputting the voice sentence of the current user into the designated region model;
s6: and feeding back the interactive feedback output by the designated region model to the current user.
In the embodiment of the application, the region information includes, but is not limited to, information such as the location to which the user's mobile phone number belongs, the location to which the landline number belongs, and the location to which the user's identification number belongs. The user voice data is tagged by region, and an AI voice deep-learning model is trained separately on the data of each region tag to obtain the region models. The trained region models then together form a voice recognition system, and the region model matching the regional characteristics of the current user is activated in a targeted manner for speech recognition.
A dictionary matching model and a matching algorithm between Mandarin and the local dialect are embedded in each region model, which enables flexible conversion between Mandarin and the local dialect, allows the machine to interact with the user in dialect, and improves regional usability.
According to the method and the device, users' voice data are distinguished by their regional characteristics and a voice deep-learning model is trained for each region, which improves speech-recognition accuracy and supports voice interaction in the language or dialect familiar to the user, making communication smoother, increasing user familiarity, and improving user stickiness.
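As an illustrative overview, steps S1 to S6 can be orchestrated as in the following sketch; every name in it (the helper functions and the model interface) is an assumption made for illustration and is not an API defined by the application:

```python
# Minimal sketch of S1-S6; model classes and helpers are hypothetical.
def voice_interaction(user, region_models):
    region_info = get_region_info(user)               # S1: e.g. phone number locale
    params = {m: m.current_parameters() for m in region_models}   # S2
    probs = {m: m.match_probability(region_info, params[m])       # S3: polling
             for m in region_models}
    designated = max(probs, key=probs.get)            # S4: best-matching region model
    reply = designated.respond(user.voice_sentence)   # S5: feed in the voice sentence
    return reply                                      # S6: interactive feedback
```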
Further, the step S2 of obtaining the model parameters corresponding to each of the region models at the current time includes:
s21: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
s22: solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
s23: and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
In the embodiment of the application, the parameters of a region model are progressively updated as the training samples grow, so as to gradually optimize them. Multiple groups of parameters of the region model are treated as a single parameter, and the optimized parameters are obtained by solving a likelihood function.

For example, the parameters w and b of the region model are taken together as a parameter $\theta$ and solved according to the likelihood function $L(\theta)=\prod_{i=1}^{m} P(y^{(i)} \mid X^{(i)};\theta)$, where $P(y \mid X;\theta)=h_{\theta}(X)^{y}\,(1-h_{\theta}(X))^{1-y}$, X represents the region information, i denotes the i-th region information sample, $h_{\theta}(X)$ represents the probability function, and P represents the value of the matching probability.
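A minimal sketch of such a maximum-likelihood solve, assuming a logistic model and gradient ascent as the optimizer (the application does not specify a solver, so this choice is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_theta(X, y, lr=0.1, epochs=500):
    """Gradient ascent on the log-likelihood
    sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        h = sigmoid(X @ w + b)                # h_theta(x_i) for every sample
        w += lr * (X.T @ (y - h)) / len(y)    # gradient of the log-likelihood w.r.t. w
        b += lr * np.sum(y - h) / len(y)      # gradient w.r.t. b
    return w, b
```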
Further, step S3 of calculating the matching probability corresponding to each of the region models by polling according to the region information and the model parameters corresponding to each of the region models includes:
s31: inputting the specified model parameters currently corresponding to the first model and the region information into a first calculation formula, wherein the first calculation formula is $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result;
s32: obtaining a calculation result of the first calculation formula;
s33: taking the calculation result as the matching probability currently corresponding to the first model;
s34: and polling to calculate the matching probability corresponding to each region model according to the calculation mode of the matching probability corresponding to the first model at present.
In the embodiment of the application, the parameters of the region model obtained by solving are substituted into the first calculation formula, and the matching probability y of the current region information and the parameters of the region model is calculated, so that the region model matched with the region information is selected according to the matching probability.
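A brief sketch of this polling computation under the reconstructed first calculation formula; the variable names (region_vector, model_params) are illustrative assumptions:

```python
import numpy as np

def match_probability(w, b, X):
    # First calculation formula: y = 1 / (1 + exp(-(w^T X + b)))
    return 1.0 / (1.0 + np.exp(-(np.dot(w, X) + b)))

# Polling: evaluate every region model's parameters against the same
# region-information vector.
probs = [match_probability(w_k, b_k, region_vector)
         for (w_k, b_k) in model_params]
```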
Further, the step S4 of matching the designated region model corresponding to the current user according to the matching probability includes:
s41: sorting the matching probabilities corresponding to the region models in descending order;
s42: determining a second region model corresponding to the maximum matching probability in the descending order;
s43: and taking the second region model as a specified region model matched with the current user.
In the embodiment of the application, the matching probability of the current region information and each region model is calculated through polling, and the region model corresponding to the maximum matching probability is used as the model matched with the current region information.
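For illustration only, the sort-and-select of steps S41 to S43 reduces to taking the index of the maximum matching probability:

```python
# Descending order of matching probabilities; the first entry is the
# second region model, i.e. the designated region model for the user.
order = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
designated_model = region_models[order[0]]
```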
Further, the designated region model includes a Gaussian mixture model, and the step S5 of inputting the voice sentence of the current user into the designated region model includes:
s51: extracting audio data from the voice sentence of the current user;
s52: performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
s53: determining keywords in the audio data according to the prediction probability;
s54: and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
In the embodiment of the application, each region model is a Gaussian mixture model composed of multiple groups of Gaussian distribution functions. The initial parameters of the Gaussian mixture model are set by the K-means algorithm: K values are randomly selected as cluster centers, and the points close to each cluster center are grouped into one class. The prediction probability is obtained by calculating the distance to each cluster center, so as to determine the keywords contained in the voice sentence, and the feedback information corresponding to the voice sentence is determined through a preset list of feedback information corresponding to each keyword.
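A hedged sketch of this K-means-initialized Gaussian mixture step using scikit-learn; the keyword lookup table is a hypothetical placeholder for the preset feedback-information list:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_gmm(frames, k):
    """Initialize a GMM from K randomly seeded K-means centers, then fit."""
    km = KMeans(n_clusters=k, n_init=10).fit(frames)
    gmm = GaussianMixture(n_components=k, means_init=km.cluster_centers_)
    return gmm.fit(frames)

def predict_keywords(gmm, audio_frames, keyword_table):
    post = gmm.predict_proba(audio_frames)   # prediction probability per frame
    clusters = post.argmax(axis=1)           # closest cluster center per frame
    return [keyword_table[c] for c in np.unique(clusters) if c in keyword_table]
```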
Further, the designated region model includes a Gaussian mixture model comprising a plurality of Gaussian distribution functions, and before the step S5 of inputting the voice sentence of the current user into the designated region model, the method includes:
s501: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
s502: taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
s503: calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
s504: obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
s505: deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
s506: and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
In an embodiment of the present application, the complete data comprises the full time-ordered set of audio data corresponding to the voice sentence. The likelihood function corresponding to the complete data is expressed as

$p(x, z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[\pi_{k}\, N(x_{n} \mid \mu_{k}, \Sigma_{k})\right]^{z_{nk}}$

where x represents the voice sample data, $z_{n}$ is the hidden variable of the estimated parameters, $N(x_{n} \mid \mu_{k}, \Sigma_{k})$ represents the generation of sample $x_{n}$ by the k-th Gaussian distribution function, $\mu_{k}$ represents the mean, $\Sigma_{k}$ represents the variance, and $\pi_{k}$ represents the weight coefficient of the k-th Gaussian distribution function. Taking the natural logarithm gives

$\ln p(x, z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\left[\ln \pi_{k} + \ln N(x_{n} \mid \mu_{k}, \Sigma_{k})\right]$

The posterior probability of $z_{n}$ is calculated as

$\gamma(z_{nk}) = \frac{\pi_{k}\, N(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K} \pi_{j}\, N(x_{n} \mid \mu_{j}, \Sigma_{j})}$

The expectation function of the log-likelihood with respect to the hidden variable is expressed as

$\mathbb{E}_{z}[\ln p] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\left[\ln \pi_{k} + \ln N(x_{n} \mid \mu_{k}, \Sigma_{k})\right]$

Setting the derivatives with respect to $\pi$, $\mu$ and $\Sigma$ to zero yields the parameters

$N_{k} = \sum_{n=1}^{N} \gamma(z_{nk}), \quad \pi_{k} = \frac{N_{k}}{N}, \quad \mu_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\, x_{n}, \quad \Sigma_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_{n}-\mu_{k})(x_{n}-\mu_{k})^{T}$
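A from-scratch sketch of the EM iteration just derived (illustrative only, with random initialization standing in for the K-means initialization described above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture, mirroring the update equations above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    sigma = np.array([np.eye(d)] * K)
    for _ in range(iters):
        # E-step: posterior gamma(z_nk) of the hidden variables
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: zero-derivative solutions for pi, mu, sigma
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            # Small ridge keeps the covariance invertible.
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma
```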
further, the step S51 of extracting audio data from the current user' S speech sentence includes:
s511: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
s512: filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
s513: performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
s514: and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the voice sentence.
In the embodiment of the present application, the preprocessing includes, but is not limited to, pre-emphasis, framing, and Hamming windowing. After the Fourier transform of a voice sample, the lowest frequency, the highest frequency, and the number of Mel filters are determined; the Mel frequencies corresponding to the lowest and highest frequencies are computed; the spacing between the center Mel frequencies of adjacent Mel filters is calculated; the center Mel frequencies are converted back into (unequally spaced) linear frequencies; the subscripts of the Fourier bins corresponding to the converted frequencies are calculated; and decorrelation is then performed by discrete cosine transform, yielding the audio features corresponding to each audio.
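A hedged sketch of this feature-extraction pipeline; the frame length, frame step, filter count, and other parameters are illustrative choices rather than values from the application, and the signal is assumed to be at least one frame long:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frames(signal, sr, n_fft=512, n_mels=26, n_ceps=13,
                frame_len=0.025, frame_step=0.01):
    """Pre-emphasis, framing, Hamming window, FFT, Mel filter bank, DCT."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])   # pre-emphasis
    flen, fstep = int(frame_len * sr), int(frame_step * sr)
    n = 1 + max(0, (len(sig) - flen) // fstep)
    frames = np.stack([sig[i * fstep:i * fstep + flen] for i in range(n)])
    frames = frames * np.hamming(flen)                            # Hamming window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # Fourier transform
    # Mel filter bank between the lowest (0 Hz) and highest (sr/2) frequencies
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_mels + 2))       # center mel freqs
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)           # FFT bin subscripts
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = np.log(power @ fbank.T + 1e-10)                       # filtered frequency data
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]   # DCT decorrelation
```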
Referring to fig. 2, a voice interaction apparatus according to an embodiment of the present application includes:
the first acquisition module 1 is used for acquiring region information corresponding to a current user;
the second obtaining module 2 is used for obtaining model parameters corresponding to each region model at the current moment;
the first calculation module 3 is configured to calculate, in a polling manner, the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
the matching module 4 is used for matching the designated region model corresponding to the current user according to the matching probability;
an input module 5, configured to input the voice sentence of the current user into the designated region model;
and the feedback module 6 is used for feeding back the interactive feedback output by the specified region model to the current user.
The modules of the device in the embodiment of the present application correspond to the steps of the method explained above, and details are not repeated here.
Further, the second obtaining module 2 includes:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring multiple groups of model parameters of a first region model in a specified time period, and the first region model is any one of all region models;
the solving unit is used for solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and the first calculation unit is used for calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
Further, the calculation module 3 includes:
an input unit, configured to input the specified model parameters currently corresponding to the first model and the region information into a first calculation formula, wherein the first calculation formula is $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result;
the second acquisition unit is used for acquiring a calculation result of the first calculation formula;
a first acting unit, configured to take the calculation result as a matching probability currently corresponding to the first model;
and the second calculating unit is used for calculating the matching probability corresponding to each region model in a polling mode according to the current matching probability corresponding to the first model.
Further, the matching module 4 includes:
the forming unit is used for sorting the matching probabilities corresponding to the region models in descending order;
the first determining unit is used for determining a second region model corresponding to the maximum matching probability in the descending order;
and the second acting unit is used for taking the second region model as the designated region model matched with the current user.
Further, the input module 5 includes:
the extraction unit is used for extracting audio data from the voice sentences of the current user;
the data processing unit is used for carrying out data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
a second determining unit, configured to determine a keyword in the audio data according to the prediction probability;
and the third determining unit is used for determining feedback information corresponding to the voice sentence of the current user according to the keyword.
Further, the designated region model includes a Gaussian mixture model comprising a plurality of Gaussian distribution functions, and the voice interaction apparatus includes:
the device comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is used for obtaining a likelihood function corresponding to complete data according to the joint probability distribution of the complete data;
a second obtaining module, configured to log the likelihood function corresponding to the complete data to obtain a logarithmic expression;
the second calculation module is used for calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
a third obtaining module, configured to obtain an expected value function of the hidden variable according to the posterior probability of the specified gaussian distribution function and the logarithmic expression;
a fourth obtaining module, configured to derive the expected value function to obtain a parameter of the specified gaussian distribution function;
and the determining module is used for determining the parameters respectively corresponding to all the Gaussian distribution functions in the Gaussian mixture model according to the determining process of the parameters of the specified Gaussian distribution functions.
Further, the extraction unit includes:
the transformation subunit is used for preprocessing the voice sentences of the current user and then performing Fourier transformation to obtain audio frame data;
the filtering subunit is used for filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
the obtaining subunit is used for performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and the composition subunit is used for composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the voice sentence.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The database of the computer device is used to store all data required for the voice interaction process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a voice interaction method.
The processor executes the voice interaction method, and the method comprises the following steps: acquiring region information corresponding to a current user; obtaining model parameters corresponding to each region model at the current moment; polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching an appointed region model corresponding to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user.
According to the computer device, users' voice data are distinguished by their regional characteristics and a voice deep-learning model is trained for each region, which improves speech-recognition accuracy and supports voice interaction in the language or dialect familiar to the user, making communication smoother, increasing user familiarity, and improving user stickiness.
In an embodiment, the step of obtaining, by the processor, the model parameters corresponding to the respective region models at the current time includes: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of the region models; solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model; and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.

In an embodiment, the step of calculating, by the processor, the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model includes: inputting the specified model parameters currently corresponding to the first model and the region information into a first calculation formula $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, wherein w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result; obtaining the calculation result of the first calculation formula; taking the calculation result as the matching probability currently corresponding to the first model; and calculating, by polling, the matching probability corresponding to each region model according to the calculation mode of the matching probability currently corresponding to the first model.

In an embodiment, the step of matching, by the processor, the designated region model corresponding to the current user according to the matching probability includes: sorting the matching probabilities corresponding to the region models in descending order; determining a second region model corresponding to the maximum matching probability in the descending order; and taking the second region model as the designated region model matched with the current user.

In one embodiment, the designated region model includes a Gaussian mixture model, and the step of inputting, by the processor, the voice sentence of the current user into the designated region model includes: extracting audio data from the voice sentence of the current user; performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data; determining keywords in the audio data according to the prediction probability; and determining feedback information corresponding to the voice sentence of the current user according to the keywords.

In an embodiment, the step of extracting, by the processor, audio data from the voice sentence of the current user includes: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data; filtering the audio frame data through a filter to obtain frequency data corresponding to each audio; performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio; and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the voice sentence.

In one embodiment, the Gaussian mixture model includes a plurality of Gaussian distribution functions, and before the step of inputting the voice sentence of the current user into the designated region model, the method includes: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data; taking the logarithm of the likelihood function corresponding to the complete data to obtain a logarithmic expression; calculating the posterior probability that the hidden variable belongs to a specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function is any Gaussian distribution function in the Gaussian mixture model; obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithmic expression; deriving the expected value function to obtain the parameters of the specified Gaussian distribution function; and determining the parameters respectively corresponding to all the Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution function.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a voice interaction method, including: acquiring region information corresponding to a current user; obtaining model parameters corresponding to each region model at the current moment; polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching an appointed region model corresponding to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user.
The computer-readable storage medium distinguishes users' voice data by their regional characteristics and trains a voice deep-learning model for each region, which improves speech-recognition accuracy and supports voice interaction in the language or dialect familiar to the user, making communication smoother, increasing user familiarity, and improving user stickiness.
In an embodiment, the step of obtaining, by the processor, the model parameters corresponding to the respective region models at the current time includes: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of the region models; solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model; and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.

In an embodiment, the step of calculating, by the processor, the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model includes: inputting the specified model parameters currently corresponding to the first model and the region information into a first calculation formula $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, wherein w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result; obtaining the calculation result of the first calculation formula; taking the calculation result as the matching probability currently corresponding to the first model; and calculating, by polling, the matching probability corresponding to each region model according to the calculation mode of the matching probability currently corresponding to the first model.

In an embodiment, the step of matching, by the processor, the designated region model corresponding to the current user according to the matching probability includes: sorting the matching probabilities corresponding to the region models in descending order; determining a second region model corresponding to the maximum matching probability in the descending order; and taking the second region model as the designated region model matched with the current user.

In one embodiment, the designated region model includes a Gaussian mixture model, and the step of inputting, by the processor, the voice sentence of the current user into the designated region model includes: extracting audio data from the voice sentence of the current user; performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data; determining keywords in the audio data according to the prediction probability; and determining feedback information corresponding to the voice sentence of the current user according to the keywords.

In an embodiment, the step of extracting, by the processor, audio data from the voice sentence of the current user includes: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data; filtering the audio frame data through a filter to obtain frequency data corresponding to each audio; performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio; and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the voice sentence.

In one embodiment, the Gaussian mixture model includes a plurality of Gaussian distribution functions, and before the step of inputting the voice sentence of the current user into the designated region model, the method includes: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data; taking the logarithm of the likelihood function corresponding to the complete data to obtain a logarithmic expression; calculating the posterior probability that the hidden variable belongs to a specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function is any Gaussian distribution function in the Gaussian mixture model; obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithmic expression; deriving the expected value function to obtain the parameters of the specified Gaussian distribution function; and determining the parameters respectively corresponding to all the Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution function.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (10)
1. A method of voice interaction, comprising:
acquiring region information corresponding to a current user;
obtaining model parameters corresponding to each region model at the current moment;
polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
matching an appointed region model corresponding to the current user according to the matching probability;
inputting the voice sentence of the current user into the designated region model;
and feeding back the interactive feedback output by the designated region model to the current user.
2. The method of claim 1, wherein the step of obtaining model parameters corresponding to the respective local models at the current time comprises:
acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
3. The voice interaction method according to claim 1, wherein the step of calculating the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model comprises:
inputting the specified model parameters currently corresponding to the first model and the region information into a first calculation formula, wherein the first calculation formula is $y = \frac{1}{1+e^{-(w^{T}X+b)}}$, w and b represent the specified model parameters, X represents the region information, T represents transposition, and y represents the calculation result;
obtaining the calculation result of the first calculation formula;
taking the calculation result as the matching probability currently corresponding to the first model;
and calculating, by polling, the matching probability corresponding to each region model according to the calculation mode of the matching probability currently corresponding to the first model.
4. The voice interaction method according to claim 1, wherein the step of matching the designated region model corresponding to the current user according to the matching probability comprises:
sorting the matching probabilities corresponding to the region models in descending order;
determining a second region model corresponding to the maximum matching probability in the descending order;
and taking the second region model as a specified region model matched with the current user.
5. The method of claim 1, wherein the designated region model comprises a Gaussian mixture model, and the step of inputting the speech sentence of the current user into the designated region model comprises:
extracting audio data from the voice sentence of the current user;
performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
determining keywords in the audio data according to the prediction probability;
and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
6. The method of claim 5, wherein the step of extracting audio data from the speech sentence of the current user comprises:
preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and composing the audio data of the speech sentence of the current user from the audio features corresponding to the audios, in the order in which the audios are arranged in the speech sentence.
7. The method according to claim 1, wherein the designated region model comprises a Gaussian mixture model, the Gaussian mixture model comprises a plurality of Gaussian distribution functions, and before the step of inputting the speech sentence of the current user into the designated region model, the method further comprises:
obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
8. A voice interaction apparatus, comprising:
the first acquisition module is used for acquiring the region information corresponding to the current user;
the second acquisition module is used for acquiring model parameters corresponding to the regional models at the current moment;
the first calculation module is used for calculating the matching probability corresponding to each region model in a polling mode according to the region information and the model parameters corresponding to each region model;
the matching module is used for matching the appointed region model corresponding to the current user according to the matching probability;
the input module is used for inputting the voice sentence of the current user into the designated region model;
and the feedback module is used for feeding back the interactive feedback output by the designated region model to the current user.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011591154.4A CN112750428A (en) | 2020-12-29 | 2020-12-29 | Voice interaction method and device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112750428A true CN112750428A (en) | 2021-05-04 |
Family
ID=75646672
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011591154.4A Pending CN112750428A (en) | 2020-12-29 | 2020-12-29 | Voice interaction method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112750428A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160284344A1 (en) * | 2013-12-19 | 2016-09-29 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech data recognition method, apparatus, and server for distinguishing regional accent |
CN107331384A (en) * | 2017-06-12 | 2017-11-07 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN108766419A (en) * | 2018-05-04 | 2018-11-06 | 华南理工大学 | A kind of abnormal speech detection method based on deep learning |
CN111009233A (en) * | 2019-11-20 | 2020-04-14 | 泰康保险集团股份有限公司 | Voice processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240528 |