

Voice interaction method and device and computer equipment

Info

Publication number
CN112750428A
CN112750428A (application CN202011591154.4A)
Authority
CN
China
Prior art keywords
model
region
current user
specified
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011591154.4A
Other languages
Chinese (zh)
Inventor
姚宏志 (Yao Hongzhi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011591154.4A
Publication of CN112750428A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING OR CALCULATING; COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/23 Clustering techniques
                            • G06F 18/232 Non-hierarchical techniques
                                • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
                                    • G06F 18/23213 Non-hierarchical techniques with a fixed number of clusters, e.g. K-means clustering
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/005 Language recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                            • G10L 2015/0638 Interactive procedures
                        • G10L 15/065 Adaptation
                            • G10L 15/07 Adaptation to the speaker
                    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of artificial intelligence, and discloses a voice interaction method, which comprises the following steps: acquiring region information corresponding to a current user; obtaining model parameters corresponding to each region model at the current moment; polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching a designated region model corresponding to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user. The voice data of users are distinguished by the characteristics of their different regions, and a speech deep learning model is trained for each region, which improves the accuracy of speech recognition. Voice interaction in the language or dialect familiar to the user is supported, making communication smoother, increasing the user's sense of familiarity, and improving user stickiness.

Description

Voice interaction method and device and computer equipment
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a voice interaction method, apparatus, and computer device.
Background
At present, the deep learning models of the AI voice systems connected to customer-service voice lines are mostly not differentiated by user region, in either user speech recognition or spoken replies, and no distinction is made by regional features. Different regions have different speech characteristics; if users cannot be distinguished according to the regions where they are located, the accuracy of recognizing speech with dialect characteristics may be low, which hinders voice interaction in regions with strong dialect features.
Disclosure of Invention
The main purpose of the present application is to provide a voice interaction method, aiming to solve the technical problem that existing voice interaction does not distinguish users according to the different regions where they are located.
The application provides a voice interaction method, which comprises the following steps:
acquiring region information corresponding to a current user;
obtaining model parameters corresponding to each region model at the current moment;
polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
matching a designated region model corresponding to the current user according to the matching probability;
inputting the voice sentence of the current user into the designated region model;
and feeding back the interactive feedback output by the designated region model to the current user.
Preferably, the step of obtaining model parameters corresponding to each region model at the current time includes:
acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
Preferably, the step of calculating the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model includes:
inputting the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result;
obtaining a calculation result of the first calculation formula;
taking the calculation result as the matching probability currently corresponding to the first region model;
and calculating, by polling, the matching probability corresponding to each region model in the same way as the matching probability currently corresponding to the first region model.
Preferably, the step of matching the designated region model corresponding to the current user according to the matching probability includes:
sorting the matching probabilities corresponding to the region models in descending order;
determining a second region model corresponding to the maximum matching probability in the descending order;
and taking the second region model as a specified region model matched with the current user.
Preferably, the designated region model includes a Gaussian mixture model, and the step of inputting the voice sentence of the current user into the designated region model includes:
extracting audio data from the voice sentence of the current user;
performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
determining keywords in the audio data according to the prediction probability;
and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
Preferably, the step of extracting audio data from the speech sentence of the current user includes:
preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
Preferably, the designated region model includes a Gaussian mixture model composed of a plurality of Gaussian distribution functions, and before the step of inputting the voice sentence of the current user into the designated region model, the method includes:
obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
The present application further provides a voice interaction apparatus, including:
the first acquisition module is used for acquiring the region information corresponding to the current user;
the second acquisition module is used for acquiring model parameters corresponding to each region model at the current moment;
the first calculation module is used for calculating the matching probability corresponding to each region model in a polling mode according to the region information and the model parameters corresponding to each region model;
the matching module is used for matching the designated region model corresponding to the current user according to the matching probability;
the input module is used for inputting the voice sentence of the current user into the designated region model;
and the feedback module is used for feeding back the interactive feedback output by the designated region model to the current user.
The present application further provides a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.
According to the method and the device, the voice data of users are distinguished by the characteristics of their different regions, and a speech deep learning model is trained for each region, which improves the accuracy of speech recognition; voice interaction in the language or dialect familiar to the user is supported, making communication smoother, increasing the user's sense of familiarity, and improving user stickiness.
Drawings
FIG. 1 is a schematic flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a voice interaction system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, a voice interaction method according to an embodiment of the present application includes:
s1: acquiring region information corresponding to a current user;
s2: obtaining model parameters corresponding to each region model at the current moment;
s3: polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
s4: matching a designated region model corresponding to the current user according to the matching probability;
s5: inputting the voice sentence of the current user into the designated region model;
s6: and feeding back the interactive feedback output by the designated region model to the current user.
In the embodiment of the application, the region information includes, but is not limited to, information such as the location to which the user's mobile phone number belongs, the location to which the landline number belongs, and the location to which the user's identification number belongs. Region labeling is performed on the user voice data, and the AI speech deep learning model is then trained separately on the data for each region label to obtain the region models. The trained region models form a speech recognition system, and the region model matching the regional characteristics of the current user is activated in a targeted manner for speech recognition.
A dictionary matching model and a matching algorithm for Mandarin and the local dialect are embedded into each region model to realize flexible conversion between Mandarin and the local dialect, so that the machine can interact with the user in dialect, improving regional usability.
According to the method and the device, the voice data of users are distinguished by the characteristics of their different regions, and a speech deep learning model is trained for each region, which improves the accuracy of speech recognition; voice interaction in the language or dialect familiar to the user is supported, making communication smoother, increasing the user's sense of familiarity, and improving user stickiness.
Further, the step S2 of obtaining model parameters corresponding to each region model at the current time includes:
s21: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
s22: solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
s23: and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
In the embodiment of the application, the parameters of a region model are updated progressively as training samples accumulate, so as to gradually optimize them. Multiple groups of parameters of the region model are treated as a single parameter, and the optimized parameters are obtained by solving the likelihood function.
For example, the parameters w and b of the region model are combined into one parameter θ, which is then solved according to the likelihood function

$L(\theta) = \prod_{i=1}^{m} P\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} h_{\theta}\left(x^{(i)}\right)^{y^{(i)}} \left(1 - h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}$

wherein

$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T} x}}$

X represents the region information, i indexes the i-th region information sample, h_θ(X) represents the probability function, and P represents the value of the matching probability.
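As an illustration, here is a minimal sketch of this likelihood-based parameter update, assuming the matching head of a region model reduces to a logistic-regression layer and that the region information has already been encoded as a numeric feature vector; the function and variable names are illustrative, not taken from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_theta(X, y, lr=0.1, epochs=200):
    """Estimate theta = (w, b) by maximizing the log of
    L(theta) = prod_i h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}
    (logistic-regression maximum likelihood, gradient ascent)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])     # fold the bias b into theta
    theta = np.zeros(d + 1)
    for _ in range(epochs):
        h = sigmoid(Xb @ theta)              # h_theta(x) for every sample
        theta += lr * Xb.T @ (y - h) / n     # gradient of the log-likelihood
    return theta[:-1], theta[-1]             # w, b

# toy usage: y = 1 for samples from this region, 0 otherwise
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = fit_theta(X, y)
```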
Further, step S3 of calculating the matching probability corresponding to each of the region models by polling according to the region information and the model parameters corresponding to each of the region models includes:
s31: inputting the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result;
s32: obtaining a calculation result of the first calculation formula;
s33: taking the calculation result as the matching probability currently corresponding to the first region model;
s34: calculating, by polling, the matching probability corresponding to each region model in the same way as the matching probability currently corresponding to the first region model.
In the embodiment of the application, the solved parameters of each region model are substituted into the first calculation formula, and the matching probability y between the current region information and the parameters of that region model is calculated, so that the region model matching the region information can be selected according to the matching probability.
Further, the step S4 of matching the designated region model corresponding to the current user according to the matching probability includes:
s41: sorting the matching probabilities corresponding to the region models in descending order;
s42: determining a second region model corresponding to the maximum matching probability in the descending order;
s43: and taking the second region model as a specified region model matched with the current user.
In the embodiment of the application, the matching probability of the current region information and each region model is calculated through polling, and the region model corresponding to the maximum matching probability is used as the model matched with the current region information.
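A minimal sketch of this polling-and-selection flow, assuming each region model exposes its currently solved (w, b) pair; the model names and data are illustrative:

```python
import numpy as np

def matching_probability(w, b, x):
    # first calculation formula: y = 1 / (1 + e^{-(w^T x + b)})
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def select_region_model(region_models, x):
    """Poll every region model, score it against the current user's
    encoded region information x, and keep the model with the largest
    matching probability (the designated region model)."""
    scored = [(matching_probability(m["w"], m["b"], x), name)
              for name, m in region_models.items()]
    scored.sort(reverse=True)                # descending order
    best_prob, best_name = scored[0]
    return best_name, best_prob

# illustrative per-region parameters, solved as in the previous step
region_models = {
    "region_a": {"w": np.array([2.0, -1.0]), "b": 0.1},
    "region_b": {"w": np.array([-1.0, 2.0]), "b": 0.0},
}
x = np.array([0.9, 0.2])                     # encoded region information
print(select_region_model(region_models, x))
```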
Further, the designated region model includes a Gaussian mixture model, and the step S5 of inputting the voice sentence of the current user into the designated region model includes:
s51: extracting audio data from the voice sentence of the current user;
s52: performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
s53: determining keywords in the audio data according to the prediction probability;
s54: and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
In the embodiment of the application, each region model is a Gaussian mixture model formed by multiple groups of Gaussian distribution functions. The initial parameters of the Gaussian mixture model are initialized by the K-means algorithm: K values are randomly selected as cluster centers, and the points close to each cluster center are clustered into one class. The prediction probability is obtained by calculating the distance between the audio data and each cluster center, so as to determine the keywords contained in the voice sentence; the feedback information corresponding to the voice sentence is then determined through a preset list of feedback information corresponding to each keyword.
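A minimal sketch of this initialization-and-scoring flow, using scikit-learn's KMeans and GaussianMixture as stand-ins for the patent's own implementation; the keyword table and feature dimensions are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_region_gmm(features, n_components=4):
    """Seed the mixture means with K-means cluster centers (randomly
    initialized centers, nearby points clustered together), then fit
    the GMM on the region's audio feature vectors."""
    kmeans = KMeans(n_clusters=n_components, n_init=10).fit(features)
    gmm = GaussianMixture(n_components=n_components,
                          means_init=kmeans.cluster_centers_)
    return gmm.fit(features)

def predict_keywords(gmm, utterance_features, keyword_table):
    """Use per-frame component posteriors as the 'prediction probability'
    and map each frame to the keyword of its most likely component."""
    probs = gmm.predict_proba(utterance_features)
    return [keyword_table[c] for c in probs.argmax(axis=1)]

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 13))           # e.g. 13-dim MFCC frames
gmm = build_region_gmm(train)
keyword_table = {0: "balance", 1: "repay", 2: "rate", 3: "help"}
print(predict_keywords(gmm, train[:5], keyword_table))
```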
Further, the designated region model includes a Gaussian mixture model composed of a plurality of Gaussian distribution functions; before the step S5 of inputting the voice sentence of the current user into the designated region model, the method includes:
s501: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
s502: taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
s503: calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
s504: obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
s505: deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
s506: and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
In an embodiment of the present application, the complete data comprise the full, time-ordered set of audio data corresponding to the voice sentence. The likelihood function corresponding to the complete data is expressed as

$p(x, z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[\pi_{k}\,\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})\right]^{z_{nk}}$

where x represents the voice sample data, z_n is the hidden variable of the estimated parameters, N(x_n | μ_k, Σ_k) represents the generation of sample x_n under the k-th Gaussian distribution function, μ_k represents the mean, Σ_k represents the variance, and π_k represents the weight coefficient of the k-th Gaussian distribution function. Taking the natural logarithm gives

$\ln p(x, z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\left[\ln \pi_{k} + \ln \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})\right]$

The posterior probability of z_n is calculated as

$\gamma(z_{nk}) = \frac{\pi_{k}\,\mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K} \pi_{j}\,\mathcal{N}(x_{n} \mid \mu_{j}, \Sigma_{j})}$

The expectation of the log-likelihood over the hidden variable is expressed as

$Q = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\left[\ln \pi_{k} + \ln \mathcal{N}(x_{n} \mid \mu_{k}, \Sigma_{k})\right]$

Taking derivatives with respect to π, μ, and Σ and setting them to zero yields, with $N_{k} = \sum_{n=1}^{N} \gamma(z_{nk})$,

$\mu_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\,x_{n}, \qquad \Sigma_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_{n} - \mu_{k})(x_{n} - \mu_{k})^{T}, \qquad \pi_{k} = \frac{N_{k}}{N}$
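A minimal numpy sketch of one EM iteration implementing the expressions above; the random initialization shown is illustrative, and a K-means seed (as described earlier) would replace it in practice:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, sigma):
    """One EM iteration for a Gaussian mixture.
    E-step: responsibilities gamma(z_nk);
    M-step: closed-form updates for mu_k, Sigma_k, pi_k."""
    N, K = X.shape[0], len(pi)
    # E-step: gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
    gamma = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                      for k in range(K)], axis=1)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: the derivative-at-zero expressions above
    Nk = gamma.sum(axis=0)
    mu_new = (gamma.T @ X) / Nk[:, None]
    sigma_new = [(gamma[:, k, None] * (X - mu_new[k])).T @ (X - mu_new[k]) / Nk[k]
                 for k in range(K)]
    pi_new = Nk / N
    return pi_new, mu_new, sigma_new

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
K = 2
pi = np.full(K, 1.0 / K)
mu = X[rng.choice(len(X), K, replace=False)]  # K-means seed in practice
sigma = [np.eye(2) for _ in range(K)]
for _ in range(10):
    pi, mu, sigma = em_step(X, pi, mu, sigma)
```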
further, the step S51 of extracting audio data from the current user' S speech sentence includes:
s511: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
s512: filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
s513: performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
s514: composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
In the embodiment of the present application, the preprocessing includes, but is not limited to, pre-emphasis, framing, and Hamming windowing. After the Fourier transform is performed on a speech sample, the lowest frequency, the highest frequency, and the number of Mel filters are determined for the Mel filter bank; the lowest and highest frequencies are converted to their corresponding Mel frequencies; the spacing between the center Mel frequencies of adjacent Mel filters is calculated; the center Mel frequencies are converted back to (non-uniformly spaced) linear frequencies; the indices of the Fourier bins corresponding to the converted frequencies are computed; and decorrelation is then performed through the discrete cosine transform to obtain the audio features corresponding to each audio.
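A minimal sketch of this feature-extraction pipeline, assuming librosa as the implementation; librosa's mfcc bundles the framing, Hamming windowing, Mel filter bank, log, and DCT decorrelation steps described above, and the parameter values shown are illustrative:

```python
import numpy as np
import librosa

def extract_audio_features(path, n_mfcc=13):
    """Pre-emphasis -> framing/windowing + FFT -> Mel filter bank ->
    log energies -> DCT decorrelation (the pipeline described above)."""
    y, sr = librosa.load(path, sr=16000)      # the user's speech audio
    y = librosa.effects.preemphasis(y)        # pre-emphasis step
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=400, hop_length=160,            # 25 ms frames, 10 ms hop
        window="hamming",                     # Hamming window
        n_mels=26, fmin=0.0, fmax=sr / 2,     # Mel filter-bank count/range
    )
    # frames kept in time order form the sentence's audio data
    return mfcc.T.astype(np.float32)
```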
Referring to fig. 2, a voice interaction apparatus according to an embodiment of the present application includes:
the first acquisition module 1 is used for acquiring region information corresponding to a current user;
the second obtaining module 2 is used for obtaining model parameters corresponding to each region model at the current moment;
the first calculation module 3 is configured to calculate, in a polling manner, the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
the matching module 4 is used for matching the designated region model corresponding to the current user according to the matching probability;
an input module 5, configured to input the voice sentence of the current user into the designated region model;
and the feedback module 6 is used for feeding back the interactive feedback output by the specified region model to the current user.
The apparatus in this embodiment corresponds to the method described above; details are not repeated here.
Further, the second obtaining module 2 includes:
the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring multiple groups of model parameters of a first region model in a specified time period, and the first region model is any one of all region models;
the solving unit is used for solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and the first calculation unit is used for calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
Further, the first calculation module 3 includes:
an input unit, configured to input the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result;
the second acquisition unit is used for acquiring a calculation result of the first calculation formula;
a first acting unit, configured to take the calculation result as the matching probability currently corresponding to the first region model;
and a second calculating unit, configured to calculate, by polling, the matching probability corresponding to each region model in the same way as the matching probability currently corresponding to the first region model.
Further, the matching module 4 includes:
a forming unit, configured to sort the matching probabilities corresponding to the region models in descending order;
the first determining unit is used for determining a second region model corresponding to the maximum matching probability in the descending order;
and a second acting unit, configured to take the second region model as the designated region model matched with the current user.
Further, the input module 5 includes:
the extraction unit is used for extracting audio data from the voice sentences of the current user;
the data processing unit is used for carrying out data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
a second determining unit, configured to determine a keyword in the audio data according to the prediction probability;
and the third determining unit is used for determining feedback information corresponding to the voice sentence of the current user according to the keyword.
Further, the designated region model includes a Gaussian mixture model composed of a plurality of Gaussian distribution functions, and the voice interaction apparatus further includes:
the device comprises a first obtaining module, a second obtaining module and a third obtaining module, wherein the first obtaining module is used for obtaining a likelihood function corresponding to complete data according to the joint probability distribution of the complete data;
a second obtaining module, configured to log the likelihood function corresponding to the complete data to obtain a logarithmic expression;
the second calculation module is used for calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
a third obtaining module, configured to obtain an expected value function of the hidden variable according to the posterior probability of the specified gaussian distribution function and the logarithmic expression;
a fourth obtaining module, configured to derive the expected value function to obtain a parameter of the specified gaussian distribution function;
and the determining module is used for determining the parameters respectively corresponding to all the Gaussian distribution functions in the Gaussian mixture model according to the determining process of the parameters of the specified Gaussian distribution functions.
Further, the extraction unit includes:
the transformation subunit is used for preprocessing the voice sentences of the current user and then performing Fourier transformation to obtain audio frame data;
the filtering subunit is used for filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
the obtaining subunit is used for performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and the composition subunit is used for composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store all data required for the voice interaction process. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice interaction method.
The processor executes the voice interaction method, which comprises the following steps: acquiring region information corresponding to a current user; obtaining model parameters corresponding to each region model at the current moment; polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching a designated region model corresponding to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user.
With this computer device, the voice data of users are distinguished by the characteristics of their different regions, and a speech deep learning model is trained for each region, which improves the accuracy of speech recognition; voice interaction in the language or dialect familiar to the user is supported, making communication smoother, increasing the user's sense of familiarity, and improving user stickiness.
In an embodiment, the step of acquiring, by the processor, model parameters corresponding to each region model at the current time includes: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of the region models; solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model; and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
In an embodiment, the step of calculating, by polling, the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model includes: inputting the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result; obtaining a calculation result of the first calculation formula; taking the calculation result as the matching probability currently corresponding to the first region model; and calculating, by polling, the matching probability corresponding to each region model in the same way.
In an embodiment, the step of matching, by the processor, the designated region model corresponding to the current user according to the matching probability includes: sorting the matching probabilities corresponding to the region models in descending order; determining a second region model corresponding to the maximum matching probability in the descending order; and taking the second region model as the designated region model matched with the current user.
In one embodiment, the designated region model includes a Gaussian mixture model, and the step of the processor inputting the voice sentence of the current user into the designated region model includes: extracting audio data from the voice sentence of the current user; performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data; determining keywords in the audio data according to the prediction probability; and determining feedback information corresponding to the voice sentence of the current user according to the keywords.
In an embodiment, the step of extracting audio data from the voice sentence of the current user includes: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data; filtering the audio frame data through a filter to obtain frequency data corresponding to each audio; performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio; and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
In one embodiment, the designated region model includes a Gaussian mixture model composed of a plurality of Gaussian distribution functions; before the processor inputs the voice sentence of the current user into the designated region model, the method includes: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data; taking the logarithm of the likelihood function corresponding to the complete data to obtain a logarithmic expression; calculating the posterior probability that the hidden variable belongs to a specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function is any Gaussian distribution function in the Gaussian mixture model; obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithmic expression; deriving the expected value function to obtain the parameters of the specified Gaussian distribution function; and determining the parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution function.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements a voice interaction method, including: acquiring region information corresponding to a current user; obtaining model parameters corresponding to each region model at the current moment; polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model; matching a designated region model corresponding to the current user according to the matching probability; inputting the voice sentence of the current user into the designated region model; and feeding back the interactive feedback output by the designated region model to the current user.
The computer-readable storage medium distinguishes the voice data of users by the characteristics of their different regions and trains a speech deep learning model for each region, which improves the accuracy of speech recognition; voice interaction in the language or dialect familiar to the user is supported, making communication smoother, increasing the user's sense of familiarity, and improving user stickiness.
In an embodiment, the step of acquiring, by the processor, model parameters corresponding to each region model at the current time includes: acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of the region models; solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model; and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
In an embodiment, the step of calculating, by polling, the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model includes: inputting the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result; obtaining a calculation result of the first calculation formula; taking the calculation result as the matching probability currently corresponding to the first region model; and calculating, by polling, the matching probability corresponding to each region model in the same way.
In an embodiment, the step of matching, by the processor, the designated region model corresponding to the current user according to the matching probability includes: sorting the matching probabilities corresponding to the region models in descending order; determining a second region model corresponding to the maximum matching probability in the descending order; and taking the second region model as the designated region model matched with the current user.
In one embodiment, the designated region model includes a Gaussian mixture model, and the step of the processor inputting the voice sentence of the current user into the designated region model includes: extracting audio data from the voice sentence of the current user; performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data; determining keywords in the audio data according to the prediction probability; and determining feedback information corresponding to the voice sentence of the current user according to the keywords.
In an embodiment, the step of extracting audio data from the voice sentence of the current user includes: preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data; filtering the audio frame data through a filter to obtain frequency data corresponding to each audio; performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio; and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
In one embodiment, the designated region model includes a Gaussian mixture model composed of a plurality of Gaussian distribution functions; before the processor inputs the voice sentence of the current user into the designated region model, the method includes: obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data; taking the logarithm of the likelihood function corresponding to the complete data to obtain a logarithmic expression; calculating the posterior probability that the hidden variable belongs to a specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function is any Gaussian distribution function in the Gaussian mixture model; obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithmic expression; deriving the expected value function to obtain the parameters of the specified Gaussian distribution function; and determining the parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution function.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A method of voice interaction, comprising:
acquiring region information corresponding to a current user;
obtaining model parameters corresponding to each region model at the current moment;
polling and calculating the matching probability corresponding to each region model according to the region information and the model parameters corresponding to each region model;
matching a designated region model corresponding to the current user according to the matching probability;
inputting the voice sentence of the current user into the designated region model;
and feeding back the interactive feedback output by the designated region model to the current user.
2. The method of claim 1, wherein the step of obtaining model parameters corresponding to each region model at the current time comprises:
acquiring multiple groups of model parameters of a first region model in a specified time period, wherein the first region model is any one of all region models;
solving the multiple groups of model parameters through a likelihood function to obtain the specified model parameters currently corresponding to the first region model;
and calculating the model parameters respectively corresponding to the region models according to the calculation mode of the specified model parameters currently corresponding to the first region model.
3. The voice interaction method according to claim 1, wherein the step of calculating the matching probability corresponding to each region model by polling according to the region information and the model parameters corresponding to each region model comprises:
inputting the specified model parameters currently corresponding to the first region model and the region information into a first calculation formula, wherein the first calculation formula is

$y = \frac{1}{1 + e^{-(w^{T}X + b)}}$

where w and b represent the specified model parameters, X represents the region information, T denotes transposition, and y represents the calculation result;
obtaining a calculation result of the first calculation formula;
taking the calculation result as the matching probability currently corresponding to the first model;
and polling to calculate the matching probability corresponding to each region model according to the calculation mode of the matching probability corresponding to the first model at present.
4. The voice interaction method according to claim 1, wherein the step of matching the designated region model corresponding to the current user according to the matching probability comprises:
sorting the matching probabilities corresponding to the region models in descending order;
determining a second region model corresponding to the maximum matching probability in the descending order;
and taking the second region model as a specified region model matched with the current user.
5. The method of claim 1, wherein the designated region model comprises a Gaussian mixture model, and the step of inputting the voice sentence of the current user into the designated region model comprises:
extracting audio data from the voice sentence of the current user;
performing data processing on the audio data through the Gaussian mixture model to obtain a prediction probability corresponding to the audio data;
determining keywords in the audio data according to the prediction probability;
and determining feedback information corresponding to the voice sentence of the current user according to the keyword.
6. The method of claim 5, wherein the step of extracting audio data from the voice sentence of the current user comprises:
preprocessing the voice sentence of the current user and then performing Fourier transform to obtain audio frame data;
filtering the audio frame data through a filter to obtain frequency data corresponding to each audio;
performing discrete cosine transform on the frequency data corresponding to each audio to obtain the audio features corresponding to each audio;
and composing the audio data of the voice sentence of the current user from the audio features corresponding to the audios, in the order in which the audios appear in the voice sentence.
7. The method according to claim 1, wherein the designated region model comprises a Gaussian mixture model, the Gaussian mixture model comprises a plurality of Gaussian distribution functions, and before the step of inputting the voice sentence of the current user into the designated region model, the method comprises:
obtaining a likelihood function corresponding to the complete data according to the joint probability distribution of the complete data;
taking logarithm of the likelihood function corresponding to the complete data to obtain a logarithm formula;
calculating the posterior probability of the hidden variable belonging to the specified Gaussian distribution function according to the logarithmic expression, wherein the specified Gaussian distribution function belongs to any Gaussian distribution function in the Gaussian mixture model;
obtaining an expected value function of the hidden variable according to the posterior probability of the specified Gaussian distribution function and the logarithm expression;
deriving the expected value function to obtain parameters of the specified Gaussian distribution function;
and determining parameters respectively corresponding to all Gaussian distribution functions in the Gaussian mixture model according to the determination process of the parameters of the specified Gaussian distribution functions.
8. A voice interaction apparatus, comprising:
the first acquisition module is used for acquiring the region information corresponding to the current user;
the second acquisition module is used for acquiring model parameters corresponding to each region model at the current moment;
the first calculation module is used for calculating the matching probability corresponding to each region model in a polling mode according to the region information and the model parameters corresponding to each region model;
the matching module is used for matching the designated region model corresponding to the current user according to the matching probability;
the input module is used for inputting the voice sentence of the current user into the designated region model;
and the feedback module is used for feeding back the interactive feedback output by the designated region model to the current user.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011591154.4A, filed 2020-12-29 (priority 2020-12-29): Voice interaction method and device and computer equipment. Status: Pending. Published as CN112750428A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011591154.4A | 2020-12-29 | 2020-12-29 | Voice interaction method and device and computer equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011591154.4A | 2020-12-29 | 2020-12-29 | Voice interaction method and device and computer equipment

Publications (1)

Publication Number
CN112750428A

Family

ID=75646672

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011591154.4A | Voice interaction method and device and computer equipment | 2020-12-29 | 2020-12-29

Country Status (1)

Country | Link
CN | CN112750428A

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160284344A1 (en) * 2013-12-19 2016-09-29 Baidu Online Network Technology (Beijing) Co., Ltd. Speech data recognition method, apparatus, and server for distinguishing regional accent
CN107331384A (en) * 2017-06-12 2017-11-07 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108766419A (en) * 2018-05-04 2018-11-06 华南理工大学 A kind of abnormal speech detection method based on deep learning
CN111009233A (en) * 2019-11-20 2020-04-14 泰康保险集团股份有限公司 Voice processing method and device, electronic equipment and storage medium



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
Effective date of abandoning: 2024-05-28