Detailed Description
To make the objects, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart illustrating an implementation of a voice wake-up method according to an embodiment of the present invention.
Referring to Fig. 1, in one aspect, an embodiment of the present invention provides a voice wake-up method, the method including: step 101, obtaining a bone vibration signal; step 102, performing voice activity detection on the bone vibration signal to obtain a detection result; step 103, determining whether a voice signal exists in the bone vibration signal based on the detection result; step 104, collecting a sound signal when it is determined that a voice signal exists in the bone vibration signal; and step 105, executing a specified task corresponding to the sound signal to generate an interactive behavior for the user.
The wake-up method provided by the embodiment of the present invention is used to wake up a target device quickly on demand, so that the device executes the specified task corresponding to the user's need and generates an interactive behavior for the user. The method is applied to smart devices with a data processing function, such as wearable smart devices, portable smart terminals and fixed terminals, and more specifically: smart glasses, smart earphones, smart gloves, smart watches, smart clothing, smart accessories, notebook computers, mobile phones, smart speakers, desktop computers, and the like. Specifically, the method includes obtaining a bone vibration signal. The bone vibration signal is used to determine whether a bone involved in the user's sound production is moving; such bones include, but are not limited to, one or more of the mandible, the hyoid, the larynx, and the like. When a bone vibration signal is obtained, it can be considered that the user's bone is performing a sound-producing motion at that moment, and hence that the user may be speaking. For example, when the user speaks, the mandible opens and closes, and the device can then obtain a bone vibration signal.
The method further includes performing voice activity detection on the bone vibration signal to obtain a detection result. Voice activity detection is used to detect whether a voice signal is present, so the detection result indicates whether a voice signal exists in the bone vibration signal. The detection result can therefore be used to determine whether the bone motion corresponding to the bone vibration signal is caused by the user speaking. That is, when the detection result indicates that a voice signal exists in the bone vibration signal, the bone motion can be attributed to the user speaking, and the user's sound is then collected to obtain a corresponding sound signal. It should be added that when the detection result indicates that no voice signal exists in the bone vibration signal, the bone motion can be considered not to be caused by the user speaking; the bone vibration signal may then be discarded and the next round of bone vibration signal obtained. Here, a microphone for collecting the user's sound may be preset on the device, or the microphone may be in communication connection with the device, in which case the device obtains the sound signal through signal transmission with the microphone. The microphone may be an ordinary air conduction microphone.
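The gating described above, in which the microphone is used only after voice activity is detected in the bone vibration signal, can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the mean-energy test stands in for the voice activity detection, and the names detect_voice_activity, gate_sound_collection, energy_threshold and the collect_sound callback are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def detect_voice_activity(bone_signal: Sequence[float],
                          energy_threshold: float = 0.01) -> bool:
    """Placeholder detector (assumption): report voice when the mean
    energy of the bone vibration signal exceeds a threshold."""
    if not bone_signal:
        return False
    energy = sum(x * x for x in bone_signal) / len(bone_signal)
    return energy > energy_threshold


def gate_sound_collection(bone_signal: Sequence[float],
                          collect_sound: Callable[[], List[float]]
                          ) -> Optional[List[float]]:
    """Collect a sound signal only if a voice signal is detected in the
    bone vibration signal; otherwise discard the bone vibration signal."""
    if detect_voice_activity(bone_signal):
        return collect_sound()  # the microphone is touched only here
    return None  # caller moves on to the next round of bone vibration signal
```

Passing the microphone access as a callback keeps the sound collection path untouched until the detection result is positive, which mirrors the order of steps 102 to 104.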
The sound signal is analyzed through voice recognition to obtain the target intention corresponding to the user's sound signal. Based on the target intention and on the functions of the device, the device obtains the specified task corresponding to the target intention. Specified tasks may differ from device to device; commonly used specified tasks include, but are not limited to: man-machine conversation, telephone dialing, short message sending, short message reading, playing and switching of entertainment content, playing and switching of Internet messages, map navigation, task switching, intelligent control, and the like. The device executes the specified task according to the user's need so as to generate an interactive behavior for the user.
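The mapping from a recognized target intention to a device-specific specified task can be pictured as a lookup table. The sketch below assumes the voice recognition step has already produced an intention string; the task functions, the SPECIFIED_TASKS table and the intention labels are hypothetical placeholders rather than anything defined by the embodiment.

```python
from typing import Callable, Dict, Optional


def play_song() -> str:
    return "playing song"      # placeholder for the real playback action


def dial_phone() -> str:
    return "dialing"           # placeholder for the real dialing action


# Hypothetical table: which specified tasks this particular device supports.
SPECIFIED_TASKS: Dict[str, Callable[[], str]] = {
    "play songs": play_song,
    "dial": dial_phone,
}


def execute_specified_task(target_intention: str) -> Optional[str]:
    """Run the specified task for the intention, or return None when the
    device has no task for it (assumed handling for unsupported intentions)."""
    task = SPECIFIED_TASKS.get(target_intention)
    if task is None:
        return None
    return task()
```

A different device would supply a different table, which reflects the statement that specified tasks differ from device to device.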
In one implementation scenario, the method is applied to a smart earphone. When the user speaks, the user's oral cavity moves, and the smart earphone obtains a bone vibration signal. The smart earphone performs voice activity detection on the bone vibration signal to obtain a detection result, and determines from the detection result that a voice signal exists in the bone vibration signal. A microphone of the smart earphone, which may be preset on the device, then collects sound in the environment to obtain a sound signal. Voice recognition analysis of the sound signal shows that the user intention contained in the sound signal is 'playing songs', and the smart earphone starts playing songs according to that intention. It should be added that when the user intention obtained by voice recognition analysis of the sound signal does not correspond to any specified task, the sound signal is discarded.
The wake-up method provided by the embodiment of the present invention is applied to a device. In the standby state, the device only needs to enable the module corresponding to the bone vibration signal; it does not need to keep the microphone always on for sound collection, and all algorithm modules other than the module corresponding to the bone vibration signal can remain off, so the device stays in a low power consumption state. The method has high real-time performance and can achieve a delay on the order of 10 ms, whereas the delay of existing wake-up modes is 24 ms to 300 ms; the method can also solve the problem of high power consumption caused by a wake-up algorithm that is always on. Because of this high real-time performance and low delay, the sound signal can be obtained quickly, which avoids losing the leading frames of the sound signal due to excessive delay.
Fig. 2 is a schematic flow chart illustrating an implementation process of obtaining a bone vibration signal in a voice wake-up method according to an embodiment of the present invention.
Referring to Fig. 2, in an embodiment of the present invention, step 101, obtaining a bone vibration signal, includes: step 1011, obtaining a detection signal; step 1012, determining whether the detection signal satisfies a bone vibration condition; and step 1013, when it is determined that the detection signal satisfies the bone vibration condition, determining the detection signal to be a bone vibration signal.
Specifically, the method includes obtaining a detection signal. The detection signal may be collected by a detection device, which may be a bone vibration sensor, a multi-axis sensor, or another device with a detection function. The detection device may be preset on the device to which the method is applied, or may be in communication connection with that device, in which case the device obtains the detection signal through signal transmission. Since the bone vibration signal is used to judge whether the user's bone is moving, the detection device can be arranged at any position where bone vibration during speech can be detected, including but not limited to in the user's ear canal, behind the user's ear, on the user's chin, on the user's throat, and the like. For example, when the device is a smart in-ear earphone, the bone vibration sensor can be arranged at the front end of the earphone housing so that it extends into the user's ear canal and collects the detection signal generated there when the jaw vibrates. When the device is a smart ear-hook earphone, the bone vibration sensor can be arranged at the ear hook so that it is held behind the user's ear and collects the detection signal generated there when the jaw vibrates. The detection device may collect detection signals continuously or discontinuously; continuous collection is preferred.
After the detection signal is obtained, the method further includes determining whether the detection signal satisfies a bone vibration condition. The bone vibration condition characterizes the bone vibration generated when a user speaks; whether it is satisfied can be determined by comparing the amplitude of the detection signal with a preset value.
It should be understood that slight bone vibration may also occur when the user is not speaking, for example from mouth breathing, tooth grinding or swallowing. When setting the bone vibration condition, the amplitude of the bone vibration signal during ordinary speech may be used as the preset value, so that bone vibration that cannot correspond to a voice signal is excluded. The specific preset value needs to be set according to statistics and requirements and is not specifically limited here. It should also be understood that some events, such as sneezing and hiccups, produce bone vibration close to that of speech, so the preset value cannot exclude every case in which the user is not speaking. Whether the detection signal satisfies the bone vibration condition is judged by whether its amplitude is greater than the preset value: when the amplitude of the detection signal is greater than the preset value, the detection signal is considered to satisfy the bone vibration condition; when the amplitude is smaller than the preset value, the detection signal is considered not to satisfy the condition and can be discarded. It should be added that, because bone vibration signals differ from person to person, the preset value is set relatively low in practical applications so as to relax the bone vibration condition. When it is determined that the detection signal satisfies the bone vibration condition, the detection signal can be determined to be a bone vibration signal. This means that the user has a bone motion similar to that of speaking, at which point the subsequent operations, such as step 102, can be performed.
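The amplitude comparison of steps 1012 and 1013 can be sketched in a few lines. The sketch makes several assumptions that the description leaves open: amplitude is taken as the peak absolute sample value, the preset value 0.2 is an arbitrary illustrative number, and a signal whose amplitude exactly equals the preset value is treated as not satisfying the condition. All names are hypothetical.

```python
from typing import List, Optional, Sequence

# Hypothetical preset value; the description says it must be set from
# statistics and requirements, so 0.2 is only a stand-in.
PRESET_AMPLITUDE = 0.2


def peak_amplitude(signal: Sequence[float]) -> float:
    """Amplitude measure (assumption): largest absolute sample value."""
    return max((abs(x) for x in signal), default=0.0)


def satisfies_bone_vibration_condition(detection_signal: Sequence[float],
                                       preset: float = PRESET_AMPLITUDE) -> bool:
    """Step 1012: the condition holds when the amplitude exceeds the preset."""
    return peak_amplitude(detection_signal) > preset


def obtain_bone_vibration_signal(detection_signal: Sequence[float],
                                 preset: float = PRESET_AMPLITUDE
                                 ) -> Optional[List[float]]:
    """Step 1013: keep the detection signal as the bone vibration signal,
    or return None to indicate that it was discarded."""
    if satisfies_bone_vibration_condition(detection_signal, preset):
        return list(detection_signal)
    return None
```

Lowering the preset argument relaxes the condition, which is the adjustment the description suggests for person-to-person variation.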
Fig. 3 is a schematic diagram illustrating an implementation flow of echo cancellation processing in a voice wake-up method according to an embodiment of the present invention.
Referring to Fig. 3, in an embodiment of the present invention, the method further includes: step 301, judging whether a designated object is executing a playing task to obtain a judgment result; and step 302, when the judgment result is that the designated object is executing the playing task, performing echo cancellation processing on the detection signal to obtain a processed signal, the processed signal being used to determine whether the bone vibration condition is satisfied.
When the device is in use, and especially when the device is executing a playing task, the sound played by the device can easily cause false wake-ups. To avoid false wake-ups, the present application also performs echo cancellation on the detection signal to remove the influence of environmental sound on the detection signal.
Specifically, the method includes judging whether the designated object is executing a playing task to obtain a judgment result. The designated object is an object likely to cause misjudgment of the detection signal; it may be a smart device or another article, for example an earphone, a mobile phone, or another smart terminal. The designated object may be the device to which the method is applied, or a device other than that device. When the designated object is not the device to which the method is applied, a query can be sent to the designated object through signal transmission to determine whether it is executing a playing task; this information exchange with the designated object makes it possible to judge whether the designated object is executing a playing task. The playing task here refers to a task that may cause misjudgment of the detection signal, such as a task involving sound playback.
The judgment result indicates whether the designated object is executing a playing task. When the judgment result is that the designated object is executing a playing task, the detection signal undergoes acoustic echo cancellation (AEC) processing to obtain a processed signal. The echo cancellation processing may be implemented by passing the detection signal through an echo canceller. Echo cancellation removes the bone vibration signal misjudgment that the playing task would otherwise cause. The processed signal can then be used to determine whether the bone vibration condition is satisfied. It should be noted that step 302, performing echo cancellation processing on the detection signal to obtain a processed signal when the designated object is executing a playing task, must be performed after the detection signal is obtained, whereas step 301, judging whether the designated object is executing a playing task to obtain a judgment result, may be performed either before or after the detection signal is obtained. When the judgment result is that the designated object is not executing a playing task, the detection signal can be used directly to determine whether the bone vibration condition is satisfied.
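The branch between steps 301 and 302 can be sketched as below. The description only says that an echo canceller is used; the normalized least-mean-squares (NLMS) adaptive filter here is one common way to build such a canceller and is chosen purely for illustration. It also assumes that the samples being played back are available as a reference signal. The function names and the taps, step and eps parameters are hypothetical.

```python
from typing import List, Sequence


def nlms_echo_cancel(detection: Sequence[float], playback: Sequence[float],
                     taps: int = 8, step: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively estimated playback echo from the detection
    signal (NLMS filter; an assumed choice of echo canceller)."""
    weights = [0.0] * taps
    history = [0.0] * taps          # most recent playback samples, newest first
    processed = []
    for d, x in zip(detection, playback):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate   # residual after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
        processed.append(error)
    return processed


def prepare_detection_signal(detection: Sequence[float],
                             playback: Sequence[float],
                             is_playing: bool) -> List[float]:
    """Step 302 when a playing task is running; otherwise pass through."""
    if is_playing:
        return nlms_echo_cancel(detection, playback)
    return list(detection)
```

The returned signal, processed or not, is what the bone vibration condition is then evaluated on.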
Fig. 4 is a schematic diagram illustrating an implementation flow of determining a wake-up condition in a voice wake-up method according to an embodiment of the present invention.
In an embodiment of the present invention, after step 104, collecting a sound signal when it is determined that a voice signal exists in the bone vibration signal, the method further includes: step 401, judging whether the sound signal satisfies a wake-up condition; and step 402, when the sound signal is judged to satisfy the wake-up condition, executing a specified task corresponding to the sound signal.
Because a user who is speaking does not necessarily intend to communicate with the device, a collected sound signal does not necessarily indicate that the user intends to give the device an instruction. For this reason, a wake-up judgment also needs to be performed on the sound signal to determine whether it satisfies the wake-up condition. The wake-up judgment here includes, but is not limited to, judgment by a wake-up word, judgment by a voiceprint, judgment by a combination of a wake-up word and a voiceprint, or judgment by other means. When the wake-up mode is wake-up word judgment, the sound signal is judged to satisfy the wake-up condition if it is judged to contain the wake-up word. When the wake-up mode is voiceprint judgment, the sound signal is analyzed by a voiceprint algorithm to obtain a voiceprint result, and the voiceprint result is compared with a preset voiceprint; when the voiceprint result corresponding to the sound signal is judged to be the same as the preset voiceprint, the sound signal is determined to satisfy the wake-up condition. The preset voiceprint can be recorded in advance by the user: the device processes the recording to obtain the user's voiceprint, which is then set as the preset voiceprint. It should be noted that, because the voiceprint algorithm requires sufficient computing resources, the device to which the method is applied may be in communication connection with a third-party device. The user's sound signal is sent to the third-party device for voiceprint analysis, and the third-party device returns the voiceprint analysis result to the device to which the method is applied for judgment; whether the sound signal satisfies the wake-up condition can also be determined in this way.
The communication connection mode can be a wired connection mode, a wireless connection mode, a Bluetooth connection mode or other communication connection modes.
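The three wake-up judgment modes (wake-up word, voiceprint, or both combined) can be sketched as a single check. This is an illustration under stated assumptions: the sound signal is assumed to have already been turned into a transcript and a voiceprint feature vector by components not shown, and "the same as the preset voiceprint" is approximated by a cosine similarity at or above a hypothetical threshold of 0.8. None of these names or values come from the embodiment.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def meets_wake_condition(transcript: str,
                         voiceprint: Sequence[float],
                         wake_word: Optional[str] = None,
                         preset_voiceprint: Optional[Sequence[float]] = None,
                         threshold: float = 0.8) -> bool:
    """Apply whichever checks are configured; all configured checks must pass.
    With no check configured, the wake-up condition is treated as not met."""
    checks = []
    if wake_word is not None:
        checks.append(wake_word.lower() in transcript.lower())
    if preset_voiceprint is not None:
        similarity = cosine_similarity(voiceprint, preset_voiceprint)
        checks.append(similarity >= threshold)
    return bool(checks) and all(checks)
```

When voiceprint analysis runs on a third-party device, only the voiceprint argument would come from that device; the comparison itself can stay on the device to which the method is applied.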
In another specific implementation scenario, the method is applied to a smart earphone that is in communication connection with a mobile phone, and the mobile phone's audio is played through the smart earphone. First, the smart earphone judges whether it is executing a playing task and obtains a judgment result. The judgment result can also be obtained by the smart earphone sending an inquiry signal to the mobile phone to ask whether the smart earphone is executing a playing task, receiving a reply signal from the mobile phone, and analyzing the reply signal. Here the judgment result is that the smart earphone is playing music. The detection signal from the detection device therefore undergoes echo cancellation in an echo canceller to obtain a processed signal, and the amplitude of the processed signal is compared with the preset value. When the user speaks, the user's oral cavity moves and the amplitude of the processed signal exceeds the preset value; that is, the bone vibration condition is satisfied, and the signal is determined to be a bone vibration signal. After the bone vibration signal is obtained, the smart earphone performs voice activity detection on it to obtain a detection result, determines from the detection result that a voice signal exists in the bone vibration signal, and a microphone of the smart earphone collects sound in the environment to obtain a sound signal. Voice recognition analysis of the sound signal yields the wake-up word "XX", which is the same as the preset wake-up word, so the earphone is woken up. The sound signal is then analyzed, the user intention contained in it is found to be "playing songs", and the smart earphone starts playing songs according to that intention.
It should be added that when the user intention obtained by voice recognition analysis of the sound signal does not correspond to any specified task, the sound signal is discarded.
In an embodiment of the present invention, step 102, performing voice activity detection on the bone vibration signal, includes: performing voice activity detection on the bone vibration signal through a model, wherein the model is obtained by training, and the data used for training the model includes noise and human voice.
In the method, to improve the robustness of the system, voice activity detection is performed on the bone vibration signal through a model. Besides human voice data, the training data of the model can be recorded and collected according to the usage scenarios of the product. For example, in scenarios such as stairs, riding, restaurants, crossroads, markets, vehicles, subways, bars and offices, the noise of the scenario and Chinese-language human voice in the scenario are collected, and noise data collected in common general scenarios is added to the training to improve robustness in each scenario. The model here is a voice activity detection model, for example a deep neural network (DNN) model.
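One common way to use recorded scenario noise when training a voice activity detection model is to mix it into voice recordings at a chosen signal-to-noise ratio (SNR). The description does not say how the noise and voice data are combined, so the sketch below is an assumption offered only to make the idea concrete; mix_at_snr and its parameters are hypothetical names.

```python
import math
from typing import List, Sequence


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square level of a signal (0.0 for an empty signal)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(voice: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Scale the noise so that the voice-to-noise RMS ratio equals snr_db,
    then add it to the voice sample by sample (assumed augmentation)."""
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(voice)          # silent noise clip: nothing to mix in
    target_noise_rms = rms(voice) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [v + gain * n for v, n in zip(voice, noise)]
```

Mixed clips would be labeled as containing voice, while noise-only clips from the same scenarios would be labeled as containing no voice, giving the model both kinds of example for each scenario.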
Fig. 5 is a block diagram of a voice wake-up device according to an embodiment of the present invention.
Referring to Fig. 5, another aspect of the present invention provides a wake-up apparatus, the apparatus including: an obtaining module 501, configured to obtain a bone vibration signal; a detection module 502, configured to perform voice activity detection on the bone vibration signal to obtain a detection result; a determining module 503, configured to determine whether a voice signal exists in the bone vibration signal based on the detection result; an acquisition module 504, configured to collect a sound signal when it is determined that a voice signal exists in the bone vibration signal; and an execution module 505, configured to execute a specified task corresponding to the sound signal so as to generate an interactive behavior for the user.
In an embodiment of the present invention, the obtaining module 501 includes: an obtaining sub-module 5011, configured to obtain a detection signal; a first determining sub-module 5012, configured to determine whether the detection signal satisfies a bone vibration condition; and a second determining sub-module 5013, configured to determine the detection signal to be a bone vibration signal when it is determined that the detection signal satisfies the bone vibration condition.
In an embodiment of the present invention, the apparatus further includes: a judging module 506, configured to judge whether a designated object is executing a playing task to obtain a judgment result; and an echo cancellation module 507, configured to perform echo cancellation processing on the detection signal to obtain a processed signal when the judgment result is that the designated object is executing the playing task, the processed signal being used to determine whether the bone vibration condition is satisfied.
In an embodiment of the present invention, the judging module 506 is further configured to judge whether the sound signal satisfies a wake-up condition; and the execution module 505 is configured to execute a specified task corresponding to the sound signal when it is judged that the sound signal satisfies the wake-up condition.
In an embodiment of the present invention, the detection module 502 is configured to perform voice activity detection on the bone vibration signal through a model, wherein the model is obtained by training, and the data used for training the model includes noise and human voice.
Fig. 6 is a block diagram of another voice wake-up apparatus according to an embodiment of the present invention.
Referring to Fig. 6, another wake-up apparatus according to an embodiment of the present invention is provided. When it is determined that the player of the device is not playing audio, bone vibration condition judgment and voice activity detection are performed on the detection signal collected by the bone vibration sensor. The other modules are off, so the device is in a low power consumption state. When the detection signal satisfies the bone vibration condition and voice activity detection determines that a voice signal exists, the air conduction microphone collects a sound signal, and a beamforming algorithm processes the sound signal to output 16 kHz, 16-bit mono audio. The audio is input into a wake-up module, which detects whether the wake-up condition is met; if the judgment succeeds, a wake-up signal is output.
When it is determined that the player of the device is playing audio, the detection signal collected by the bone vibration sensor first undergoes echo cancellation (AEC) processing and is then subjected to bone vibration condition judgment and voice activity detection. The other modules are off, so the device is in a low power consumption state. When the processed signal obtained by echo cancellation satisfies the bone vibration condition and voice activity detection determines that a voice signal exists, the air conduction microphone collects a sound signal, and a beamforming algorithm processes the sound signal to output 16 kHz, 16-bit mono audio. The audio is input into a wake-up module, which detects whether the wake-up condition is met; if the judgment succeeds, a wake-up signal is output.
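To make the 16 kHz, 16-bit mono output format concrete, the sketch below reduces several microphone channels to one and packs the result as 16-bit PCM. The description does not specify the beamforming algorithm or the number of microphones; plain averaging of time-aligned channels (delay-and-sum with zero delays) is used here only as the simplest stand-in, and every name in the block is hypothetical.

```python
import struct
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000     # 16 kHz, as stated for the beamformer output
BYTES_PER_SAMPLE = 2        # 16-bit samples


def average_beamform(channels: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in beamformer (assumption): average time-aligned channels."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]


def to_pcm16_mono(samples: Sequence[float]) -> bytes:
    """Clip to [-1.0, 1.0] and pack as little-endian signed 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def frame_bytes(duration_ms: int) -> int:
    """Size in bytes of a mono frame of the given duration."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE
```

At this format a 10 ms frame holds 160 samples, or 320 bytes, which gives a sense of how little data the wake-up module has to consume per frame.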
In another aspect, an embodiment of the present invention provides a computer-readable storage medium that includes a set of computer-executable instructions which, when executed, perform any one of the above wake-up methods.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, a person skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.