TWI878975B - Speech recognition device and method - Google Patents
Speech recognition device and method
- Publication number: TWI878975B (application TW112126111A)
- Authority
- TW
- Taiwan
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G07—CHECKING-DEVICES
- G07C—TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
- G07C9/00—Individual registration on entry or exit
- G07C9/30—Individual registration on entry or exit not involving the use of a pass
- G07C9/32—Individual registration on entry or exit not involving the use of a pass in combination with an identity check
- G07C9/37—Individual registration on entry or exit not involving the use of a pass in combination with an identity check using biometric data, e.g. fingerprints, iris scans or voice recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
Description
The present invention relates to a speech recognition device, and more particularly to a speech recognition device with attack-defense capability.
For a smart access-control system that emphasizes security, reaching a higher level of protection may require passing through layer after layer of systems (such as face, fingerprint, and voice recognition) before entry is allowed. However, such a system consumes a great deal of power and its cost rises considerably.
Furthermore, with the rapid development of technology, malicious attack techniques have advanced as well. Taking voice attacks as an example, they include speech synthesis, playback of pre-recorded audio, and even fake audio generation. If the defenses against these attack methods are not updated frequently, the system is exposed to risk at all times.
An embodiment of the present invention provides a speech recognition device, which includes an attack defense module and a speaker verification module. The attack defense module is used to filter out spoofed speech and includes a processing circuit and an inference circuit. The processing circuit processes a first input speech to generate an audio signal. The inference circuit receives the audio signal and processes it with a first deep learning model to determine whether the first input speech is genuine speech or spoofed speech. When the first input speech is genuine speech, the inference circuit outputs a second input speech. The speaker verification module includes a feature extraction circuit, a comparison circuit, and a decision circuit. The feature extraction circuit processes the second input speech with a second deep learning model to generate a speech feature. The comparison circuit compares the speech feature with at least one preset feature to generate a comparison result. The decision circuit determines whether the comparison result exceeds a threshold. When the comparison result exceeds the threshold, the decision circuit performs a specific action; when it does not, the decision circuit does not perform the specific action.
The present invention further provides a speech recognition method, which includes: receiving and processing an input speech to generate an audio signal; processing the audio signal with a first deep learning model to determine whether the input speech is genuine speech or spoofed speech; when the input speech is genuine speech, processing the input speech with a second deep learning model to generate a speech feature; comparing the speech feature with at least one preset feature to generate a comparison result; determining whether the comparison result exceeds a threshold; performing a specific action when the comparison result exceeds the threshold; and not performing the specific action when the comparison result does not exceed the threshold.
The speech recognition method of the present invention may be implemented by the speech recognition device of the present invention, which is hardware or firmware capable of performing specific functions, or it may be embodied as program code stored on a recording medium and implemented in combination with specific hardware. When the program code is loaded and executed by an electronic device, processor, computer, or machine, that device becomes the attack defense module and the speaker verification module for carrying out the present invention.
100: Speech recognition device
110: Attack defense module
120: Speaker verification module
130: Receiving circuit
SP_IN1~SP_IN3: Input speech
SP_RL: Genuine speech
SP_SF: Spoofed speech
ST: Trigger signal
210: Processing circuit
220: Inference circuit
ADS_1~ADS_3: Audio signals
230, 350: Input/output ports
DA_1, DA_2: Update data
221, 311: Deep learning models
310: Feature extraction circuit
330: Comparison circuit
340: Decision circuit
SP_EM: Speech feature
S_CM: Comparison result
SP_PR: Preset feature
S611~S619: Steps
FIG. 1 is a schematic diagram of the speech recognition device of the present invention.
FIG. 2 is a schematic diagram of the attack defense module of the present invention.
FIG. 3A is a schematic diagram of the speaker verification module of the present invention.
FIG. 3B is another schematic diagram of the speaker verification module of the present invention.
FIG. 4 is a schematic diagram of the attack-defense operation of the present invention.
FIG. 5 is a schematic diagram of the speaker verification operation of the present invention.
FIG. 6 is a flow chart of the speech recognition method of the present invention.
To make the purpose, features, and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings. This specification provides different embodiments to illustrate the technical features of different implementations of the present invention. The arrangement of elements in the embodiments is for illustration and is not intended to limit the invention. Reference numerals that are partly repeated across the embodiments are used to simplify the description and do not imply any relationship between different embodiments.
FIG. 1 is a schematic diagram of the speech recognition device of the present invention. As shown in the figure, the speech recognition device 100 includes an attack defense module 110 and a speaker verification module 120. In one possible embodiment, the speech recognition device 100 is a voice-controlled access-control system. In this example, the speech recognition device 100 first determines whether an input speech SP_IN1 is genuine speech (real speech). If the input speech SP_IN1 is not genuine speech, for example if it is spoofed speech (spoof speech), the speech recognition device 100 ignores it. If the input speech SP_IN1 is genuine speech, the speech recognition device 100 then determines whether it matches a registered preset voice. If it does, the person who uttered the input speech SP_IN1 is a legitimate person, so the speech recognition device 100 performs a specific action according to the input speech SP_IN1, such as opening the door. If the input speech SP_IN1 does not match the registered preset voice, the person who uttered it is not a legitimate person, so the speech recognition device 100 does not perform the specific action. In some embodiments, although the speech recognition device 100 does not perform the specific action, it performs a warning action, such as emitting an alert sound to notify the relevant personnel that someone is attempting to enter.
In one possible embodiment, the speech recognition device 100 further includes a receiving circuit 130. The receiving circuit 130 receives an external speech, which may be genuine speech SP_RL or spoofed speech SP_SF. In either case, the receiving circuit 130 takes the external speech as the input speech SP_IN1 and provides it to the attack defense module 110. In some embodiments, the receiving circuit 130 has a microphone. The present invention does not limit the architecture of the receiving circuit 130. In one possible embodiment, the receiving circuit 130 is a recorder.
The attack defense module 110 receives the input speech SP_IN1 and determines its type. When the input speech SP_IN1 is genuine speech, the attack defense module 110 provides an input speech SP_IN2 according to the input speech SP_IN1. In some embodiments, the attack defense module 110 directly uses the input speech SP_IN1 as the input speech SP_IN2. When the input speech SP_IN1 is spoofed speech, the attack defense module 110 stops providing the input speech SP_IN2.
In one possible embodiment, genuine speech is speech uttered by a live person (such as a user), and spoofed speech is speech produced by a speech generator. The speech generator may produce spoofed speech by speech synthesis, by playing back a pre-recorded audio clip, or by fake audio generation. A malicious party may approach the receiving circuit 130 and play back the spoofed speech. In this case, although the spoofed speech originally came from a live person, it was not spoken directly to the receiving circuit 130 by that person; any speech that is not spoken directly to the receiving circuit 130 by a live person is regarded as spoofed speech. In other embodiments, the speech generator may capture fragments of a live person's speech and rearrange them, or even exploit the upper and lower limits of human hearing to simulate an attack speech (that is, spoofed speech). Therefore, any speech that is played back is regarded as spoofed speech.
In this embodiment, the attack defense module 110 has a first deep learning model. The parameters of the first deep learning model are provided by an external device. The external device performs training on genuine and spoofed speech and then writes the trained parameters into the first deep learning model of the attack defense module 110, so the attack defense module 110 is able to distinguish whether the input speech SP_IN1 is spoofed speech or genuine speech. In this example, the external device may update the parameters of the first deep learning model periodically or from time to time, making the speech recognition device 100 more robust and more effective against new types of spoofed-speech attacks. When a new type of attack appears, the external device trains on the attack data and uses the trained parameters to update the parameters of the first deep learning model, so that the first deep learning model can recognize the new attack speech.
In some embodiments, the attack defense module 110 performs machine learning to recognize advanced speech synthesis. The attack defense module 110 may be trained on attack speech produced by fake-audio generators as well as on real human speech, and the parameters produced by the training are written into the first deep learning model, so that genuine speech and spoofed speech can be distinguished and malicious attack speech can be blocked.
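The patent does not fix an architecture for the first deep learning model, so the following is only a minimal sketch of the kind of binary genuine-versus-spoofed classifier it could be: a small PyTorch feed-forward network trained with binary cross-entropy on cepstral feature vectors. The class and function names, layer sizes, and feature dimension are illustrative assumptions, not part of the claimed device.

```python
# Hypothetical sketch of the "first deep learning model": a binary
# genuine-vs-spoofed classifier over cepstral feature vectors.
# Architecture and dimensions are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class SpoofClassifier(nn.Module):
    def __init__(self, feat_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),            # one logit: genuine vs. spoofed
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)   # (batch,) logits

def train_step(model, optimizer, features, labels):
    """One training step; labels are 1.0 for genuine speech, 0.0 for spoofed."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = SpoofClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Random stand-in batch; a real system would use labeled genuine/spoofed recordings.
    feats = torch.randn(32, 40)
    labels = torch.randint(0, 2, (32,)).float()
    print("loss:", train_step(model, opt, feats, labels))
```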
The speaker verification module 120 receives the input speech SP_IN2 and determines whether it is the voice of a registered user. If it is, the speaker verification module 120 performs a specific action; in one possible embodiment, the speaker verification module 120 asserts a trigger signal ST to unlock the door. If the input speech SP_IN2 is not the voice of a registered user, the speaker verification module 120 does not perform the specific action; for example, it may not assert the trigger signal ST, so the door remains locked. In other embodiments, the speaker verification module 120 may emit an alert sound to notify specific persons that someone is attempting to enter.
In some embodiments, during an enrollment period, the speaker verification module 120 receives an input speech SP_IN3, which is the speech of a legitimate person. The speaker verification module 120 extracts the voice feature (voiceprint) of the input speech SP_IN3 and uses it as a preset feature. In this example, the speaker verification module 120 compares the voice features of the input speeches SP_IN2 and SP_IN3. When the voice feature of the input speech SP_IN2 is similar to the preset feature, the input speech SP_IN1 was uttered by a legitimate person, so the speaker verification module 120 performs the specific action, such as unlocking the door.
In one possible embodiment, the speaker verification module 120 has a second deep learning model whose parameters are provided by an external device. The external device uses a plurality of speech samples as training data to learn how to extract a speaker's voice features from speech, and writes the trained parameters into the second deep learning model of the speaker verification module 120, so that the second deep learning model is able to extract voice features. During an enrollment period, the second deep learning model extracts the voice feature of at least one legitimate person and records the extracted feature in a database. In some embodiments, when the second deep learning model extracts the voice features of a plurality of legitimate persons, the database records a plurality of voice features, each corresponding to the voice of one legitimate person.
FIG. 2 is a schematic diagram of the attack defense module 110 of the present invention. The attack defense module 110 includes a processing circuit 210 and an inference circuit 220. The processing circuit 210 processes the input speech SP_IN1 to generate an audio signal ADS_1. The present invention does not limit the architecture of the processing circuit 210. In one possible embodiment, the processing circuit 210 has feature-extraction capability; for example, it may extract constant-Q cepstral coefficients (CQCC) or mel-frequency cepstral coefficients (MFCC) of the input speech SP_IN1 to generate an extraction result. In this example, the processing circuit 210 uses the extraction result as the audio signal ADS_1.
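As a concrete illustration of the front-end processing attributed to the processing circuit 210, the sketch below computes MFCCs with the librosa library and pools them over time into a fixed-length vector that plays the role of the audio signal ADS_1. The patent equally allows CQCC; the library choice, sampling rate, and pooling step are assumptions of this sketch rather than requirements of the invention.

```python
# Minimal sketch of a cepstral front-end standing in for the processing
# circuit 210. MFCC is used here; the patent also names CQCC as an option.
import librosa
import numpy as np

def extract_audio_signal(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Load a recording and return a fixed-length cepstral feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)                     # waveform at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape (n_mfcc, frames)
    # Pool over time (mean and standard deviation) so downstream models
    # receive a fixed-size "audio signal" regardless of utterance length.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Example (hypothetical file name):
# ads_1 = extract_audio_signal("input_speech.wav")
```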
The inference circuit 220 receives the audio signal ADS_1 and processes it with a deep learning model 221 to determine whether the input speech SP_IN1 is genuine speech or spoofed speech. When the input speech SP_IN1 is genuine speech, the inference circuit 220 outputs an input speech SP_IN2; in some embodiments, the inference circuit 220 uses the input speech SP_IN1 or the audio signal ADS_1 as the input speech SP_IN2. When the input speech SP_IN1 is spoofed speech, the inference circuit 220 does not provide the input speech SP_IN2. In one possible embodiment, the inference circuit 220 has a memory device for storing the deep learning model 221.
In some embodiments, the attack defense module 110 further includes an input/output port 230 for receiving update data DA_1. The update data DA_1 is provided by an external device, which trains on training data that includes at least one genuine speech sample and at least one spoofed speech sample. After training, the external device produces at least one parameter and uses the parameter as the update data DA_1. During an update period, when the external device is coupled to the input/output port 230, the inference circuit 220 receives the update data DA_1 through the input/output port 230 and writes it into the deep learning model 221 to update at least one parameter of the deep learning model 221.
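The patent does not specify how the update data DA_1 is encoded. A minimal sketch, assuming the externally trained parameters arrive as an ordinary PyTorch state-dict file, could apply them to the stored model as follows; the file name and the SpoofClassifier class from the earlier sketch are hypothetical.

```python
# Sketch of the update flow through the input/output port 230: the external
# device supplies retrained parameters (DA_1), which are written into the
# stored deep learning model 221. The file format is an assumption.
import torch

def apply_update(model: torch.nn.Module, update_path: str) -> None:
    """Overwrite the model parameters with externally trained ones."""
    update_data = torch.load(update_path, map_location="cpu")  # DA_1
    model.load_state_dict(update_data)
    model.eval()  # return to inference mode after the update

# Example usage with the hypothetical SpoofClassifier from the earlier sketch:
# model = SpoofClassifier()
# apply_update(model, "da_1_update.pt")
```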
The present invention does not limit the type of the input/output port 230. In one possible embodiment, the input/output port 230 is connected to the external device through a cable. In another possible embodiment, the input/output port 230 is a wireless receiver that receives the update data DA_1 from the external device.
FIG. 3A is a schematic diagram of the speaker verification module 120 of the present invention. As shown in the figure, the speaker verification module 120 includes a feature extraction circuit 310, a comparison circuit 330, and a decision circuit 340. In this embodiment, the speaker verification module 120 is used to determine whether the input speech SP_IN2 from the attack defense module 110 is legitimate speech. When the input speech SP_IN1 is legitimate speech, the speaker verification module 120 performs a specific action, such as unlocking the door. When the input speech SP_IN1 is not legitimate speech, the speaker verification module 120 does not perform the specific action, or it performs a warning action, such as emitting a warning sound to notify the persons inside.
The feature extraction circuit 310 processes the input speech SP_IN2 with a deep learning model 311 to generate a speech feature SP_EM. In one possible embodiment, the input speech SP_IN2 is the input speech SP_IN1; in this example, when the attack defense module 110 determines that the input speech SP_IN1 is genuine speech, the attack defense module 110 directly uses the audio signal ADS_1 as the input speech SP_IN2. In this embodiment, the deep learning model 311 extracts a speaker embedding. In one possible embodiment, the feature extraction circuit 310 has a memory device for storing the deep learning model 311.
The comparison circuit 330 compares the speech feature SP_EM with at least one preset feature SP_PR to generate a comparison result S_CM. In some embodiments, the comparison result S_CM indicates the similarity between the speech feature SP_EM and the preset feature SP_PR.
The decision circuit 340 determines whether the comparison result S_CM exceeds a threshold. When the comparison result S_CM exceeds the threshold, the decision circuit 340 performs a specific action, such as asserting the trigger signal ST. When the comparison result S_CM does not exceed the threshold, the decision circuit 340 does not perform the specific action, for example by not asserting the trigger signal ST.
In some embodiments, the speaker verification module 120 further includes a storage circuit 320 for storing the preset feature SP_PR. In one possible embodiment, the preset feature SP_PR is provided by the feature extraction circuit 310. The present invention does not limit the number of preset features SP_PR; the storage circuit 320 may store more preset features SP_PR, each representing the voice feature of one legitimate person (such as a family member). The present invention does not limit the type of the storage circuit 320. In one possible embodiment, the storage circuit 320 is a non-volatile memory.
In one possible embodiment, the parameters of the deep learning model 311 are provided by an external device (not shown). The external device uses a plurality of speech samples as training data to learn to extract voice features, and writes the trained parameters into the deep learning model 311, so that the deep learning model 311 is able to extract voice features.
In other embodiments, the speaker verification module 120 further includes an input/output port 350 for receiving update data DA_2. When an external device is coupled to the input/output port 350, the speaker verification module 120 enters an update mode. In the update mode, the feature extraction circuit 310 updates at least one parameter of the deep learning model 311 according to the update data DA_2. The present invention does not limit the type of the input/output port 350. In one possible embodiment, the input/output port 350 is connected to the external device through a cable. In another possible embodiment, the input/output port 350 is a wireless receiver that receives the update data DA_2 from the external device.
FIG. 3B is another schematic diagram of the speaker verification module 120 of the present invention. FIG. 3B is similar to FIG. 3A except that FIG. 3B further includes a processing circuit 360. The processing circuit 360 processes the input speech SP_IN2 to generate an audio signal ADS_2. Since the characteristics of the processing circuit 360 are similar to those of the processing circuit 210 of FIG. 2, they are not described again.
The feature extraction circuit 310 receives the audio signal ADS_2 and processes it with the deep learning model 311 to generate the speech feature SP_EM. In this embodiment, when the attack defense module 110 determines that the input speech SP_IN1 is genuine speech, the attack defense module 110 uses the input speech SP_IN1 as the input speech SP_IN2.
In some embodiments, during an initial period (also called an enrollment period), the processing circuit 360 processes an input speech SP_IN3 to generate an audio signal ADS_3. In this example, the input speech SP_IN3 is the speech of a legitimate person. The deep learning model 311 extracts the voice feature of the audio signal ADS_3 and records the extraction result (that is, the preset feature SP_PR) in the storage circuit 320.
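A minimal sketch of this enrollment step is shown below. It assumes the embeddings come from something like the EmbeddingExtractor sketched earlier and uses a small pickle file as a stand-in for the storage circuit 320; the persistence format and file name are assumptions made only for illustration.

```python
# Sketch of enrollment: extract the preset feature SP_PR of a legitimate
# person and keep it in a small key-value store standing in for the storage
# circuit 320. The pickle file is a hypothetical persistence choice.
import pickle
import numpy as np

ENROLL_DB = "preset_features.pkl"   # hypothetical file backing the storage circuit

def enroll(name: str, sp_pr: np.ndarray, db_path: str = ENROLL_DB) -> None:
    """Record one legitimate person's preset feature."""
    try:
        with open(db_path, "rb") as f:
            presets = pickle.load(f)
    except FileNotFoundError:
        presets = {}
    presets[name] = sp_pr
    with open(db_path, "wb") as f:
        pickle.dump(presets, f)

# Example: enroll("family_member_1", embedding_of_ads_3)   # embedding from ADS_3
```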
In one possible embodiment, a legitimate person speaks to the receiving circuit 130 of FIG. 1, and the receiving circuit 130 provides the legitimate person's speech (that is, the input speech SP_IN3) to the processing circuit 360. In some embodiments, the receiving circuit 130 may receive the speech of five legitimate persons, so the storage circuit 320 records five preset features SP_PR, one for each of the five legitimate persons. In this example, the comparison circuit 330 compares the speech feature SP_EM with the five preset features SP_PR and provides the five comparison results to the decision circuit 340, which compares each comparison result with a threshold.
FIG. 4 is a schematic diagram of the attack-defense operation of the present invention. In a training phase, many different training data are provided. The training data include genuine speech and spoofed speech. Genuine speech is sound uttered by a live person. Spoofed speech includes speech obtained by various methods, such as generating sound similar to a live person's speech, recording and replaying, splicing captured fragments of speech, and exploiting surround and stereo sound. The features of the training data are then extracted and speech training (such as machine learning) is performed. Through repeated training, at least one parameter is produced and written into a first deep learning model.
In an inference phase, inference data is received. In one possible embodiment, the inference data is an unknown speech, which may come from a recorder. The features of the inference data are then extracted to produce a first extraction result, and the first deep learning model processes the first extraction result to determine whether the inference data is genuine speech or spoofed speech. In one possible embodiment, if the inference data is genuine speech, the system then verifies whether the genuine speech was uttered by a legitimate person; if the inference data is spoofed speech, it is not verified.
FIG. 5 is a schematic diagram of the speaker verification operation of the present invention. In a training phase, many different training data are provided. In one possible embodiment, each training datum is genuine speech. The features of the training data are extracted and speech training is performed. Through repeated training, at least one parameter is produced and written into a second deep learning model.
In an enrollment phase, at least one enrollment datum is received. The features of the enrollment data are extracted to produce a second extraction result, and the second deep learning model processes the second extraction result to obtain the voice feature (also called the preset feature) of the enrollment data. In one possible embodiment, the voice features of the enrollment data are stored in a database.
In a verification phase, verification data is received. In one possible embodiment, the verification data is an unknown genuine speech, such as speech judged to be genuine by the first deep learning model of FIG. 4. The features of the verification data are extracted to produce a third extraction result, and the second deep learning model processes the third extraction result to produce an unknown voice feature. The unknown voice feature is compared with the preset features in the database to determine the similarity between the unknown voice feature and each preset feature. It is then determined whether the similarity between the unknown voice feature and any preset feature exceeds a threshold. If so, the verification data is the voice of a legitimate person, so the verification data is accepted and a specific action is performed. If not, the verification data is not the voice of a legitimate person, so the verification data is rejected and the specific action is not performed.
FIG. 6 is a flow chart of the speech recognition method of the present invention. The speech recognition method of the present invention may exist as program code; when the program code is loaded and executed by a machine, the machine becomes the speech recognition device for carrying out the present invention.
First, an input speech is received and processed to generate an audio signal (step S611). In one possible embodiment, step S611 extracts the constant-Q cepstral coefficients or mel-frequency cepstral coefficients of the input speech and uses the extraction result as the audio signal.
Next, the audio signal is processed with a first deep learning model (step S612). In one possible embodiment, the parameters of the first deep learning model are provided by an external device, which performs training on genuine and spoofed speech and writes the trained parameters into the first deep learning model. From the output of the first deep learning model, it can therefore be determined whether the input speech is spoofed speech or genuine speech.
It is then determined whether the first deep learning model classifies the input speech as genuine speech (step S613). When the input speech is not genuine speech, the spoofed speech is rejected (step S614). When the input speech is genuine speech, the input speech is processed with a second deep learning model to generate a speech feature (step S615).
Next, the speech feature is compared with at least one preset feature to generate a comparison result (step S616). In one possible embodiment, the comparison result is the similarity between the speech feature and the preset feature. It is then determined whether the comparison result exceeds a threshold (step S617). When the comparison result does not exceed the threshold, the input speech is not the voice of a legitimate person, so the specific action is not performed (step S618). When the comparison result exceeds the threshold, the input speech is the voice of a legitimate person, so a specific action is performed (step S619).
The present invention does not limit the type of the specific action. In one possible embodiment, the specific action is unlocking the door. In another possible embodiment, when the comparison result does not exceed the threshold, an illegitimate person is attempting to enter the house, so step S618 emits a warning sound to alert the people inside. In other embodiments, the strictness of the speech recognition can be raised by adjusting the threshold.
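Putting steps S611 to S619 together, one way to wire the method end to end is sketched below. The callables are placeholders for components of the kind sketched earlier (front-end feature extraction, spoofing detection, embedding extraction, and comparison against the enrolled features); they, and the threshold value, are illustrative assumptions rather than the patented implementation.

```python
# End-to-end sketch of the method of FIG. 6 (steps S611-S619). The callables
# are placeholders for hypothetical components like those sketched earlier;
# this is one way to wire the steps, not the patented implementation.
def recognize(input_speech,
              front_end,        # S611: speech -> audio signal (feature vector)
              spoof_detector,   # S612: audio signal -> True if genuine
              embed,            # S615: speech -> speech feature SP_EM
              compare,          # S616: SP_EM -> similarity score S_CM
              threshold=0.7):
    audio_signal = front_end(input_speech)            # step S611
    if not spoof_detector(audio_signal):              # steps S612-S613
        return "reject spoofed speech"                # step S614
    sp_em = embed(input_speech)                       # step S615
    s_cm = compare(sp_em)                             # step S616
    if s_cm > threshold:                              # step S617
        return "perform specific action"              # step S619 (e.g., unlock)
    return "no action / warning"                      # step S618
```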
The speech recognition method of the present invention, or a particular form or part thereof, may exist in the form of program code. The program code may be stored on a physical medium such as a floppy disk, an optical disc, a hard disk, or any other machine-readable (for example, computer-readable) storage medium, or embodied as a computer program product without limitation on its external form, wherein when the program code is loaded and executed by a machine such as a computer, the machine becomes an apparatus participating in the attack defense module and the speaker verification module of the present invention. The program code may also be transmitted over a transmission medium such as an electrical wire or cable, an optical fiber, or any other transmission form, wherein when the program code is received, loaded, and executed by a machine such as a computer, the machine becomes an apparatus participating in the attack defense module and the speaker verification module of the present invention. When implemented on a general-purpose processing unit, the program code combines with the processing unit to provide a unique apparatus that operates analogously to application-specific logic circuits.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by a person of ordinary skill in the art to which the present invention belongs. In addition, unless expressly stated otherwise, terms defined in ordinary dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and should not be interpreted in an idealized or overly formal sense. Although terms such as "first" and "second" may be used to describe various elements, these elements should not be limited by these terms; the terms are only used to distinguish one element from another.
Although the present invention has been disclosed above with preferred embodiments, they are not intended to limit the present invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention. For example, the systems, devices, or methods described in the embodiments of the present invention may be realized in physical embodiments of hardware, software, or a combination of hardware and software. Therefore, the scope of protection of the present invention shall be defined by the appended claims.
100: Speech recognition device 110: Attack defense module 120: Speaker verification module SP_IN1~SP_IN3: Input speech SP_RL: Genuine speech SP_SF: Spoofed speech 130: Receiving circuit ST: Trigger signal
Claims (6)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
| CN202410860316.1A CN119314491A (en) | 2023-07-13 | 2024-06-28 | Speech recognition device and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202503737A (en) | 2025-01-16 |
| TWI878975B (en) | 2025-04-01 |
Family
ID=94183444
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119314491A (en) |
| TW (1) | TWI878975B (en) |
- 2023-07-13: Application TW112126111A filed in Taiwan (granted as TWI878975B, active)
- 2024-06-28: Application CN202410860316.1A filed in China (published as CN119314491A, pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN119314491A (en) | 2025-01-14 |
| TW202503737A (en) | 2025-01-16 |