TWI878975B - Speech recognition device and method - Google Patents
Speech recognition device and method
- Publication number: TWI878975B (application TW112126111A)
- Authority
- TW
- Taiwan
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G07—CHECKING-DEVICES
- G07C—TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
- G07C9/00—Individual registration on entry or exit
- G07C9/30—Individual registration on entry or exit not involving the use of a pass
- G07C9/32—Individual registration on entry or exit not involving the use of a pass in combination with an identity check
- G07C9/37—Individual registration on entry or exit not involving the use of a pass in combination with an identity check using biometric data, e.g. fingerprints, iris scans or voice recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
Description
The present invention relates to a speech recognition device, and more particularly to a speech recognition device with attack-defense capability.
For a smart access-control system that emphasizes security, reaching a higher level of protection may require passing through layer after layer of systems (such as face, fingerprint, and voice recognition) before entry is allowed. However, such a system consumes a great deal of power and its cost rises considerably.
Furthermore, with the rapid development of technology, malicious attack techniques have advanced as well. Taking voice attacks as an example, they include speech synthesis, playback of pre-recorded audio, and even fake audio generation. If the defenses against these attack methods are not updated frequently, the system is exposed to risk at all times.
An embodiment of the present invention provides a speech recognition device, which includes an attack defense module and a speaker verification module. The attack defense module is used to filter out spoofed speech and includes a processing circuit and an inference circuit. The processing circuit processes a first input speech to generate an audio signal. The inference circuit receives the audio signal and processes it with a first deep learning model to determine whether the first input speech is genuine speech or spoofed speech. When the first input speech is genuine speech, the inference circuit outputs a second input speech. The speaker verification module includes a feature extraction circuit, a comparison circuit, and a decision circuit. The feature extraction circuit processes the second input speech with a second deep learning model to generate a speech feature. The comparison circuit compares the speech feature with at least one preset feature to generate a comparison result. The decision circuit determines whether the comparison result exceeds a threshold. When the comparison result exceeds the threshold, the decision circuit performs a specific action; when it does not, the decision circuit does not perform the specific action.
The present invention further provides a speech recognition method, which includes: receiving and processing an input speech to generate an audio signal; processing the audio signal with a first deep learning model to determine whether the input speech is genuine speech or spoofed speech; when the input speech is genuine speech, processing the input speech with a second deep learning model to generate a speech feature; comparing the speech feature with at least one preset feature to generate a comparison result; determining whether the comparison result exceeds a threshold; performing a specific action when the comparison result exceeds the threshold; and not performing the specific action when the comparison result does not exceed the threshold.
The speech recognition method of the present invention may be implemented by the speech recognition device of the present invention, which is hardware or firmware capable of performing specific functions, or it may be embodied as program code stored on a recording medium and implemented in combination with specific hardware. When the program code is loaded and executed by an electronic device, processor, computer, or machine, that device becomes the attack defense module and the speaker verification module for carrying out the present invention.
100: Speech recognition device
110: Attack defense module
120: Speaker verification module
130: Receiving circuit
SP_IN1~SP_IN3: Input speech
SP_RL: Genuine speech
SP_SF: Spoofed speech
ST: Trigger signal
210: Processing circuit
220: Inference circuit
ADS_1~ADS_3: Audio signals
230, 350: Input/output ports
DA_1, DA_2: Update data
221, 311: Deep learning models
310: Feature extraction circuit
330: Comparison circuit
340: Decision circuit
SP_EM: Speech feature
S_CM: Comparison result
SP_PR: Preset feature
S611~S619: Steps
FIG. 1 is a schematic diagram of the speech recognition device of the present invention.
FIG. 2 is a schematic diagram of the attack defense module of the present invention.
FIG. 3A is a schematic diagram of the speaker verification module of the present invention.
FIG. 3B is another schematic diagram of the speaker verification module of the present invention.
FIG. 4 is a schematic diagram of the attack-defense operation of the present invention.
FIG. 5 is a schematic diagram of the speaker verification operation of the present invention.
FIG. 6 is a flow chart of the speech recognition method of the present invention.
To make the purpose, features, and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings. This specification provides different embodiments to illustrate the technical features of different implementations of the present invention. The arrangement of elements in the embodiments is for illustration and is not intended to limit the invention. Reference numerals that are partly repeated across the embodiments are used to simplify the description and do not imply any relationship between different embodiments.
FIG. 1 is a schematic diagram of the speech recognition device of the present invention. As shown in the figure, the speech recognition device 100 includes an attack defense module 110 and a speaker verification module 120. In one possible embodiment, the speech recognition device 100 is a voice-controlled access-control system. In this example, the speech recognition device 100 first determines whether an input speech SP_IN1 is genuine speech (real speech). If the input speech SP_IN1 is not genuine speech, for example if it is spoofed speech (spoof speech), the speech recognition device 100 ignores it. If the input speech SP_IN1 is genuine speech, the speech recognition device 100 then determines whether it matches a registered preset voice. If it does, the person who uttered the input speech SP_IN1 is a legitimate person, so the speech recognition device 100 performs a specific action according to the input speech SP_IN1, such as opening the door. If the input speech SP_IN1 does not match the registered preset voice, the person who uttered it is not a legitimate person, so the speech recognition device 100 does not perform the specific action. In some embodiments, although the speech recognition device 100 does not perform the specific action, it performs a warning action, such as emitting an alert sound to notify the relevant personnel that someone is attempting to enter.
In one possible embodiment, the speech recognition device 100 further includes a receiving circuit 130. The receiving circuit 130 receives an external speech, which may be genuine speech SP_RL or spoofed speech SP_SF. In either case, the receiving circuit 130 takes the external speech as the input speech SP_IN1 and provides it to the attack defense module 110. In some embodiments, the receiving circuit 130 has a microphone. The present invention does not limit the architecture of the receiving circuit 130. In one possible embodiment, the receiving circuit 130 is a recorder.
The attack defense module 110 receives the input speech SP_IN1 and determines its type. When the input speech SP_IN1 is genuine speech, the attack defense module 110 provides an input speech SP_IN2 according to the input speech SP_IN1. In some embodiments, the attack defense module 110 directly uses the input speech SP_IN1 as the input speech SP_IN2. When the input speech SP_IN1 is spoofed speech, the attack defense module 110 stops providing the input speech SP_IN2.
In one possible embodiment, genuine speech is speech uttered by a live person (such as a user), and spoofed speech is speech produced by a speech generator. The speech generator may produce spoofed speech by speech synthesis, by playing back a pre-recorded audio clip, or by fake audio generation. A malicious party may approach the receiving circuit 130 and play back the spoofed speech. In this case, although the spoofed speech originally came from a live person, it was not spoken directly to the receiving circuit 130 by that person; any speech that is not spoken directly to the receiving circuit 130 by a live person is regarded as spoofed speech. In other embodiments, the speech generator may capture fragments of a live person's speech and rearrange them, or even exploit the upper and lower limits of human hearing to simulate an attack speech (that is, spoofed speech). Therefore, any speech that is played back is regarded as spoofed speech.
In this embodiment, the attack defense module 110 has a first deep learning model. The parameters of the first deep learning model are provided by an external device. The external device performs training on genuine and spoofed speech and then writes the trained parameters into the first deep learning model of the attack defense module 110, so the attack defense module 110 is able to distinguish whether the input speech SP_IN1 is spoofed speech or genuine speech. In this example, the external device may update the parameters of the first deep learning model periodically or from time to time, making the speech recognition device 100 more robust and more effective against new types of spoofed-speech attacks. When a new type of attack appears, the external device trains on the attack data and uses the trained parameters to update the parameters of the first deep learning model, so that the first deep learning model can recognize the new attack speech.
In some embodiments, the attack defense module 110 performs machine learning to recognize advanced speech synthesis. The attack defense module 110 may be trained on attack speech produced by fake-audio generators as well as on real human speech, and the parameters produced by the training are written into the first deep learning model, so that genuine speech and spoofed speech can be distinguished and malicious attack speech can be blocked.
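The patent does not fix an architecture for the first deep learning model, so the following is only a minimal sketch of the kind of binary genuine-versus-spoofed classifier it could be: a small PyTorch feed-forward network trained with binary cross-entropy on cepstral feature vectors. The class and function names, layer sizes, and feature dimension are illustrative assumptions, not part of the claimed device.

```python
# Hypothetical sketch of the "first deep learning model": a binary
# genuine-vs-spoofed classifier over cepstral feature vectors.
# Architecture and dimensions are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class SpoofClassifier(nn.Module):
    def __init__(self, feat_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),            # one logit: genuine vs. spoofed
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)   # (batch,) logits

def train_step(model, optimizer, features, labels):
    """One training step; labels are 1.0 for genuine speech, 0.0 for spoofed."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = SpoofClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Random stand-in batch; a real system would use labeled genuine/spoofed recordings.
    feats = torch.randn(32, 40)
    labels = torch.randint(0, 2, (32,)).float()
    print("loss:", train_step(model, opt, feats, labels))
```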
The speaker verification module 120 receives the input speech SP_IN2 and determines whether it is the voice of a registered user. If it is, the speaker verification module 120 performs a specific action; in one possible embodiment, the speaker verification module 120 asserts a trigger signal ST to unlock the door. If the input speech SP_IN2 is not the voice of a registered user, the speaker verification module 120 does not perform the specific action; for example, it may not assert the trigger signal ST, so the door remains locked. In other embodiments, the speaker verification module 120 may emit an alert sound to notify specific persons that someone is attempting to enter.
In some embodiments, during an enrollment period, the speaker verification module 120 receives an input speech SP_IN3, which is the speech of a legitimate person. The speaker verification module 120 extracts the voice feature (voiceprint) of the input speech SP_IN3 and uses it as a preset feature. In this example, the speaker verification module 120 compares the voice features of the input speeches SP_IN2 and SP_IN3. When the voice feature of the input speech SP_IN2 is similar to the preset feature, the input speech SP_IN1 was uttered by a legitimate person, so the speaker verification module 120 performs the specific action, such as unlocking the door.
In one possible embodiment, the speaker verification module 120 has a second deep learning model whose parameters are provided by an external device. The external device uses a plurality of speech samples as training data to learn how to extract a speaker's voice features from speech, and writes the trained parameters into the second deep learning model of the speaker verification module 120, so that the second deep learning model is able to extract voice features. During an enrollment period, the second deep learning model extracts the voice feature of at least one legitimate person and records the extracted feature in a database. In some embodiments, when the second deep learning model extracts the voice features of a plurality of legitimate persons, the database records a plurality of voice features, each corresponding to the voice of one legitimate person.
FIG. 2 is a schematic diagram of the attack defense module 110 of the present invention. The attack defense module 110 includes a processing circuit 210 and an inference circuit 220. The processing circuit 210 processes the input speech SP_IN1 to generate an audio signal ADS_1. The present invention does not limit the architecture of the processing circuit 210. In one possible embodiment, the processing circuit 210 has feature-extraction capability; for example, it may extract constant-Q cepstral coefficients (CQCC) or mel-frequency cepstral coefficients (MFCC) of the input speech SP_IN1 to generate an extraction result. In this example, the processing circuit 210 uses the extraction result as the audio signal ADS_1.
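As a concrete illustration of the front-end processing attributed to the processing circuit 210, the sketch below computes MFCCs with the librosa library and pools them over time into a fixed-length vector that plays the role of the audio signal ADS_1. The patent equally allows CQCC; the library choice, sampling rate, and pooling step are assumptions of this sketch rather than requirements of the invention.

```python
# Minimal sketch of a cepstral front-end standing in for the processing
# circuit 210. MFCC is used here; the patent also names CQCC as an option.
import librosa
import numpy as np

def extract_audio_signal(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Load a recording and return a fixed-length cepstral feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)                     # waveform at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape (n_mfcc, frames)
    # Pool over time (mean and standard deviation) so downstream models
    # receive a fixed-size "audio signal" regardless of utterance length.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Example (hypothetical file name):
# ads_1 = extract_audio_signal("input_speech.wav")
```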
The inference circuit 220 receives the audio signal ADS_1 and processes it with a deep learning model 221 to determine whether the input speech SP_IN1 is genuine speech or spoofed speech. When the input speech SP_IN1 is genuine speech, the inference circuit 220 outputs an input speech SP_IN2; in some embodiments, the inference circuit 220 uses the input speech SP_IN1 or the audio signal ADS_1 as the input speech SP_IN2. When the input speech SP_IN1 is spoofed speech, the inference circuit 220 does not provide the input speech SP_IN2. In one possible embodiment, the inference circuit 220 has a memory device for storing the deep learning model 221.
In some embodiments, the attack defense module 110 further includes an input/output port 230 for receiving update data DA_1. The update data DA_1 is provided by an external device, which trains on training data that includes at least one genuine speech sample and at least one spoofed speech sample. After training, the external device produces at least one parameter and uses the parameter as the update data DA_1. During an update period, when the external device is coupled to the input/output port 230, the inference circuit 220 receives the update data DA_1 through the input/output port 230 and writes it into the deep learning model 221 to update at least one parameter of the deep learning model 221.
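The patent does not specify how the update data DA_1 is encoded. A minimal sketch, assuming the externally trained parameters arrive as an ordinary PyTorch state-dict file, could apply them to the stored model as follows; the file name and the SpoofClassifier class from the earlier sketch are hypothetical.

```python
# Sketch of the update flow through the input/output port 230: the external
# device supplies retrained parameters (DA_1), which are written into the
# stored deep learning model 221. The file format is an assumption.
import torch

def apply_update(model: torch.nn.Module, update_path: str) -> None:
    """Overwrite the model parameters with externally trained ones."""
    update_data = torch.load(update_path, map_location="cpu")  # DA_1
    model.load_state_dict(update_data)
    model.eval()  # return to inference mode after the update

# Example usage with the hypothetical SpoofClassifier from the earlier sketch:
# model = SpoofClassifier()
# apply_update(model, "da_1_update.pt")
```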
The present invention does not limit the type of the input/output port 230. In one possible embodiment, the input/output port 230 is connected to the external device through a cable. In another possible embodiment, the input/output port 230 is a wireless receiver that receives the update data DA_1 from the external device.
FIG. 3A is a schematic diagram of the speaker verification module 120 of the present invention. As shown in the figure, the speaker verification module 120 includes a feature extraction circuit 310, a comparison circuit 330, and a decision circuit 340. In this embodiment, the speaker verification module 120 is used to determine whether the input speech SP_IN2 from the attack defense module 110 is legitimate speech. When the input speech SP_IN1 is legitimate speech, the speaker verification module 120 performs a specific action, such as unlocking the door. When the input speech SP_IN1 is not legitimate speech, the speaker verification module 120 does not perform the specific action, or it performs a warning action, such as emitting a warning sound to notify the persons inside.
The feature extraction circuit 310 processes the input speech SP_IN2 with a deep learning model 311 to generate a speech feature SP_EM. In one possible embodiment, the input speech SP_IN2 is the input speech SP_IN1; in this example, when the attack defense module 110 determines that the input speech SP_IN1 is genuine speech, the attack defense module 110 directly uses the audio signal ADS_1 as the input speech SP_IN2. In this embodiment, the deep learning model 311 extracts a speaker embedding. In one possible embodiment, the feature extraction circuit 310 has a memory device for storing the deep learning model 311.
The comparison circuit 330 compares the speech feature SP_EM with at least one preset feature SP_PR to generate a comparison result S_CM. In some embodiments, the comparison result S_CM indicates the similarity between the speech feature SP_EM and the preset feature SP_PR.
The decision circuit 340 determines whether the comparison result S_CM exceeds a threshold. When the comparison result S_CM exceeds the threshold, the decision circuit 340 performs a specific action, such as asserting the trigger signal ST. When the comparison result S_CM does not exceed the threshold, the decision circuit 340 does not perform the specific action, for example by not asserting the trigger signal ST.
In some embodiments, the speaker verification module 120 further includes a storage circuit 320 for storing the preset feature SP_PR. In one possible embodiment, the preset feature SP_PR is provided by the feature extraction circuit 310. The present invention does not limit the number of preset features SP_PR; the storage circuit 320 may store more preset features SP_PR, each representing the voice feature of one legitimate person (such as a family member). The present invention does not limit the type of the storage circuit 320. In one possible embodiment, the storage circuit 320 is a non-volatile memory.
In one possible embodiment, the parameters of the deep learning model 311 are provided by an external device (not shown). The external device uses a plurality of speech samples as training data to learn to extract voice features, and writes the trained parameters into the deep learning model 311, so that the deep learning model 311 is able to extract voice features.
In other embodiments, the speaker verification module 120 further includes an input/output port 350 for receiving update data DA_2. When an external device is coupled to the input/output port 350, the speaker verification module 120 enters an update mode. In the update mode, the feature extraction circuit 310 updates at least one parameter of the deep learning model 311 according to the update data DA_2. The present invention does not limit the type of the input/output port 350. In one possible embodiment, the input/output port 350 is connected to the external device through a cable. In another possible embodiment, the input/output port 350 is a wireless receiver that receives the update data DA_2 from the external device.
FIG. 3B is another schematic diagram of the speaker verification module 120 of the present invention. FIG. 3B is similar to FIG. 3A except that FIG. 3B further includes a processing circuit 360. The processing circuit 360 processes the input speech SP_IN2 to generate an audio signal ADS_2. Since the characteristics of the processing circuit 360 are similar to those of the processing circuit 210 of FIG. 2, they are not described again.
The feature extraction circuit 310 receives the audio signal ADS_2 and processes it with the deep learning model 311 to generate the speech feature SP_EM. In this embodiment, when the attack defense module 110 determines that the input speech SP_IN1 is genuine speech, the attack defense module 110 uses the input speech SP_IN1 as the input speech SP_IN2.
In some embodiments, during an initial period (also called an enrollment period), the processing circuit 360 processes an input speech SP_IN3 to generate an audio signal ADS_3. In this example, the input speech SP_IN3 is the speech of a legitimate person. The deep learning model 311 extracts the voice feature of the audio signal ADS_3 and records the extraction result (that is, the preset feature SP_PR) in the storage circuit 320.
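A minimal sketch of this enrollment step is shown below. It assumes the embeddings come from something like the EmbeddingExtractor sketched earlier and uses a small pickle file as a stand-in for the storage circuit 320; the persistence format and file name are assumptions made only for illustration.

```python
# Sketch of enrollment: extract the preset feature SP_PR of a legitimate
# person and keep it in a small key-value store standing in for the storage
# circuit 320. The pickle file is a hypothetical persistence choice.
import pickle
import numpy as np

ENROLL_DB = "preset_features.pkl"   # hypothetical file backing the storage circuit

def enroll(name: str, sp_pr: np.ndarray, db_path: str = ENROLL_DB) -> None:
    """Record one legitimate person's preset feature."""
    try:
        with open(db_path, "rb") as f:
            presets = pickle.load(f)
    except FileNotFoundError:
        presets = {}
    presets[name] = sp_pr
    with open(db_path, "wb") as f:
        pickle.dump(presets, f)

# Example: enroll("family_member_1", embedding_of_ads_3)   # embedding from ADS_3
```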
In one possible embodiment, a legitimate person speaks to the receiving circuit 130 of FIG. 1, and the receiving circuit 130 provides the legitimate person's speech (that is, the input speech SP_IN3) to the processing circuit 360. In some embodiments, the receiving circuit 130 may receive the speech of five legitimate persons, so the storage circuit 320 records five preset features SP_PR, one for each of the five legitimate persons. In this example, the comparison circuit 330 compares the speech feature SP_EM with the five preset features SP_PR and provides the five comparison results to the decision circuit 340, which compares each comparison result with a threshold.
FIG. 4 is a schematic diagram of the attack-defense operation of the present invention. In a training phase, many different training data are provided. The training data include genuine speech and spoofed speech. Genuine speech is sound uttered by a live person. Spoofed speech includes speech obtained by various methods, such as generating sound similar to a live person's speech, recording and replaying, splicing captured fragments of speech, and exploiting surround and stereo sound. The features of the training data are then extracted and speech training (such as machine learning) is performed. Through repeated training, at least one parameter is produced and written into a first deep learning model.
In an inference phase, inference data is received. In one possible embodiment, the inference data is an unknown speech, which may come from a recorder. The features of the inference data are then extracted to produce a first extraction result, and the first deep learning model processes the first extraction result to determine whether the inference data is genuine speech or spoofed speech. In one possible embodiment, if the inference data is genuine speech, the system then verifies whether the genuine speech was uttered by a legitimate person; if the inference data is spoofed speech, it is not verified.
FIG. 5 is a schematic diagram of the speaker verification operation of the present invention. In a training phase, many different training data are provided. In one possible embodiment, each training datum is genuine speech. The features of the training data are extracted and speech training is performed. Through repeated training, at least one parameter is produced and written into a second deep learning model.
In an enrollment phase, at least one enrollment datum is received. The features of the enrollment data are extracted to produce a second extraction result, and the second deep learning model processes the second extraction result to obtain the voice feature (also called the preset feature) of the enrollment data. In one possible embodiment, the voice features of the enrollment data are stored in a database.
In a verification phase, verification data is received. In one possible embodiment, the verification data is an unknown genuine speech, such as speech judged to be genuine by the first deep learning model of FIG. 4. The features of the verification data are extracted to produce a third extraction result, and the second deep learning model processes the third extraction result to produce an unknown voice feature. The unknown voice feature is compared with the preset features in the database to determine the similarity between the unknown voice feature and each preset feature. It is then determined whether the similarity between the unknown voice feature and any preset feature exceeds a threshold. If so, the verification data is the voice of a legitimate person, so the verification data is accepted and a specific action is performed. If not, the verification data is not the voice of a legitimate person, so the verification data is rejected and the specific action is not performed.
FIG. 6 is a flow chart of the speech recognition method of the present invention. The speech recognition method of the present invention may exist as program code; when the program code is loaded and executed by a machine, the machine becomes the speech recognition device for carrying out the present invention.
First, an input speech is received and processed to generate an audio signal (step S611). In one possible embodiment, step S611 extracts the constant-Q cepstral coefficients or mel-frequency cepstral coefficients of the input speech and uses the extraction result as the audio signal.
Next, the audio signal is processed with a first deep learning model (step S612). In one possible embodiment, the parameters of the first deep learning model are provided by an external device, which performs training on genuine and spoofed speech and writes the trained parameters into the first deep learning model. From the output of the first deep learning model, it can therefore be determined whether the input speech is spoofed speech or genuine speech.
It is then determined whether the first deep learning model classifies the input speech as genuine speech (step S613). When the input speech is not genuine speech, the spoofed speech is rejected (step S614). When the input speech is genuine speech, the input speech is processed with a second deep learning model to generate a speech feature (step S615).
Next, the speech feature is compared with at least one preset feature to generate a comparison result (step S616). In one possible embodiment, the comparison result is the similarity between the speech feature and the preset feature. It is then determined whether the comparison result exceeds a threshold (step S617). When the comparison result does not exceed the threshold, the input speech is not the voice of a legitimate person, so the specific action is not performed (step S618). When the comparison result exceeds the threshold, the input speech is the voice of a legitimate person, so a specific action is performed (step S619).
The present invention does not limit the type of the specific action. In one possible embodiment, the specific action is unlocking the door. In another possible embodiment, when the comparison result does not exceed the threshold, an illegitimate person is attempting to enter the house, so step S618 emits a warning sound to alert the people inside. In other embodiments, the strictness of the speech recognition can be raised by adjusting the threshold.
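Putting steps S611 to S619 together, one way to wire the method end to end is sketched below. The callables are placeholders for components of the kind sketched earlier (front-end feature extraction, spoofing detection, embedding extraction, and comparison against the enrolled features); they, and the threshold value, are illustrative assumptions rather than the patented implementation.

```python
# End-to-end sketch of the method of FIG. 6 (steps S611-S619). The callables
# are placeholders for hypothetical components like those sketched earlier;
# this is one way to wire the steps, not the patented implementation.
def recognize(input_speech,
              front_end,        # S611: speech -> audio signal (feature vector)
              spoof_detector,   # S612: audio signal -> True if genuine
              embed,            # S615: speech -> speech feature SP_EM
              compare,          # S616: SP_EM -> similarity score S_CM
              threshold=0.7):
    audio_signal = front_end(input_speech)            # step S611
    if not spoof_detector(audio_signal):              # steps S612-S613
        return "reject spoofed speech"                # step S614
    sp_em = embed(input_speech)                       # step S615
    s_cm = compare(sp_em)                             # step S616
    if s_cm > threshold:                              # step S617
        return "perform specific action"              # step S619 (e.g., unlock)
    return "no action / warning"                      # step S618
```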
The speech recognition method of the present invention, or a particular form or part thereof, may exist in the form of program code. The program code may be stored on a physical medium such as a floppy disk, an optical disc, a hard disk, or any other machine-readable (for example, computer-readable) storage medium, or embodied as a computer program product without limitation on its external form, wherein when the program code is loaded and executed by a machine such as a computer, the machine becomes an apparatus participating in the attack defense module and the speaker verification module of the present invention. The program code may also be transmitted over a transmission medium such as an electrical wire or cable, an optical fiber, or any other transmission form, wherein when the program code is received, loaded, and executed by a machine such as a computer, the machine becomes an apparatus participating in the attack defense module and the speaker verification module of the present invention. When implemented on a general-purpose processing unit, the program code combines with the processing unit to provide a unique apparatus that operates analogously to application-specific logic circuits.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by a person of ordinary skill in the art to which the present invention belongs. In addition, unless expressly stated otherwise, terms defined in ordinary dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and should not be interpreted in an idealized or overly formal sense. Although terms such as "first" and "second" may be used to describe various elements, these elements should not be limited by these terms; the terms are only used to distinguish one element from another.
Although the present invention has been disclosed above with preferred embodiments, they are not intended to limit the present invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the present invention. For example, the systems, devices, or methods described in the embodiments of the present invention may be realized in physical embodiments of hardware, software, or a combination of hardware and software. Therefore, the scope of protection of the present invention shall be defined by the appended claims.
100: Speech recognition device 110: Attack defense module 120: Speaker verification module SP_IN1~SP_IN3: Input speech SP_RL: Genuine speech SP_SF: Spoofed speech 130: Receiving circuit ST: Trigger signal
Claims (6)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
| CN202410860316.1A CN119314491A (en) | 2023-07-13 | 2024-06-28 | Speech recognition device and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW202503737A (en) | 2025-01-16 |
| TWI878975B (en) | 2025-04-01 |
Family
ID=94183444
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112126111A TWI878975B (en) | 2023-07-13 | 2023-07-13 | Speech recognition device and method |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119314491A (en) |
| TW (1) | TWI878975B (en) |
- 2023-07-13: Application TW112126111A filed in Taiwan (granted as TWI878975B, active)
- 2024-06-28: Application CN202410860316.1A filed in China (published as CN119314491A, pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN119314491A (en) | 2025-01-14 |
| TW202503737A (en) | 2025-01-16 |