
TWI892389B - Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word - Google Patents

Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word

Info

Publication number
TWI892389B
TWI892389B
Authority
TW
Taiwan
Prior art keywords
audio
feature
audio segment
segment
feature list
Prior art date
Application number
TW112151041A
Other languages
Chinese (zh)
Other versions
TW202526912A (en)
Inventor
趙盈盈
Original Assignee
瑞昱半導體股份有限公司
Priority date
Filing date
Publication date
Application filed by 瑞昱半導體股份有限公司 filed Critical 瑞昱半導體股份有限公司
Priority to TW112151041A priority Critical patent/TWI892389B/en
Priority to US18/908,826 priority patent/US20250218441A1/en
Publication of TW202526912A publication Critical patent/TW202526912A/en
Application granted granted Critical
Publication of TWI892389B publication Critical patent/TWI892389B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Selective Calling Equipment (AREA)

Abstract

A method for performing wake-up control on a voice-controlled device with aid of detecting voice feature of self-defined word and an associated processing circuit are provided. The method may include: performing feature collection on audio data of at least one audio clip to generate at least one feature list of the at least one audio clip, to establish a feature-list-based database in the voice-controlled device; performing the feature collection on audio data of another audio clip to generate another feature list of the other audio clip; and performing at least one screening operation on at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute at least one subsequent operation, where the at least one subsequent operation includes waking up the voice-controlled device.

Description

Method and processing circuit for performing wake-up control on a voice-controlled device with aid of detecting the voice feature of a self-defined word

The present invention relates to voice control technology, and more particularly to a method and processing circuit for performing wake-up control on a voice-controlled device with the aid of detecting the voice feature of a self-defined word.

According to the related art, biometric recognition systems can be used to activate user devices for convenience and security, but they typically rely on a remote system with strong computing power. For example, to identify a speaker accurately, the design of an artificial intelligence (AI) speaker recognition system involves conditions that vary with language characteristics, speaking habits, gender and age, vocal anatomy, and so on; building a speaker model therefore requires a large amount of suitable voice data for training the AI speaker recognition system before automatic recognition can succeed. Since a wired or wireless network connection to the remote system (e.g., the AI speaker recognition system) is usually required, the availability of that system is affected by network interruptions. A novel method and associated architecture are therefore needed to achieve activation control that does not rely on any remote system with strong computing power, with no side effects or fewer side effects.

One objective of the present invention is to provide a method and processing circuit for performing wake-up control on a voice-controlled device with the aid of detecting the voice feature of a self-defined word, to solve the problems of the related art and to prevent a thief or a young child from accidentally activating the voice-controlled device.

The present invention provides a method for performing wake-up control on a voice-controlled device with the aid of detecting the voice feature of a self-defined word. The method may include: in a registration phase among multiple phases, performing feature collection on audio data of at least one audio clip to generate at least one feature list of the at least one audio clip, in order to establish a feature-list-based database in the voice-controlled device, where the at least one audio clip carries at least one self-defined word, the feature-list-based database includes the at least one feature list, any feature list among the at least one feature list includes multiple features of a corresponding audio clip among the at least one audio clip, and the multiple features respectively belong to multiple predetermined types of features; in an identification phase among the multiple phases, performing the feature collection on audio data of another audio clip to generate another feature list of the other audio clip; and in the identification phase, performing at least one screening operation on at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, in order to selectively ignore the other audio clip or execute at least one subsequent operation, where the at least one subsequent operation includes waking up the voice-controlled device.

The present invention further provides a processing circuit for performing wake-up control on a voice-controlled device with the aid of detecting the voice feature of a self-defined word.

One of the advantages of the present invention is that the method and processing circuit can determine whether an unknown speaker is invalid according to the feature-list-based database, in order to selectively ignore the speaker's audio clip or wake up/activate the voice-controlled device, without connecting to any remote system to obtain voice data for the determination. For example, the method does not need to determine which words the self-defined word contains, so there is no need to connect to any cloud database through any network to obtain a large amount of voice data. In addition, the method and processing circuit can implement a compact, fast, secure, and reliable voice-control processing system with no side effects or fewer side effects.

100: Voice-controlled device
110: Processing circuit
111: Processing modules
112: Threshold initialization processing module
113: Short-term energy (STE) and zero-crossing rate (ZCR) processing module
114: Voice type classification processing module
115: Pitch processing module
116: Feature list processing module
120: Audio input device
130: Audio data conversion interface circuit
140: Storage device
141: Program code
142: Feature-list-based database
610, 620: Audio analysis operations
711: Leading audio segment
719: Trailing audio segment
720: Main audio segments
f'1~f'18, f1~f16: Audio frames
Seg1~Seg9: Audio segments
y1, y2: Audio samples
S10~S13, S20~S23, S30~S36, S41~S45, S50~S52: Steps
STE( ): Short-term energy
STE_th: Short-term energy threshold
X1, X2: Axes
ZCR( ): Zero-crossing rate
ZCR_th: Zero-crossing rate threshold

Fig. 1 is a schematic diagram of a voice-controlled device according to an embodiment of the present invention.

Fig. 2 illustrates a working flow of a registration and identification control scheme of a method for performing wake-up control on a voice-controlled device with the aid of detecting the voice feature of a self-defined word according to an embodiment of the present invention.

Fig. 3 illustrates the audio and related signals involved in a feature collection control scheme of the method according to an embodiment of the present invention.

Fig. 4 illustrates an audio pre-processing operation involved in the threshold initialization processing in the feature collection control scheme according to an embodiment of the present invention.

Fig. 5 illustrates the audio samples and audio frames involved in the threshold initialization processing in the feature collection control scheme according to an embodiment of the present invention.

Fig. 6 illustrates the audio analysis operations in the feature collection control scheme according to an embodiment of the present invention.

Fig. 7 illustrates the audio frames and audio segments involved in the voice type classification in the feature collection control scheme according to an embodiment of the present invention.

Fig. 8 illustrates a working flow of a speaker identification control scheme of the method according to an embodiment of the present invention.

Fig. 9 illustrates the data points involved in the k-nearest neighbors (k-NN) algorithm classifier in the speaker identification control scheme according to an embodiment of the present invention.

Fig. 10 illustrates a flowchart of the method according to an embodiment of the present invention.

Fig. 1 is a schematic diagram of a voice-controlled device 100 according to an embodiment of the present invention. The voice-controlled device 100 may include a processing circuit 110 (e.g., a voice-control processing circuit), an audio input device 120, an audio data conversion interface circuit 130, and at least one storage device 140, where the processing circuit 110 may include multiple processing modules 111 for performing the operations of the processing circuit 110. The multiple processing modules 111 may include a threshold initialization processing module 112, a short-term energy (STE) and zero-crossing rate (ZCR) processing module 113, a voice type classification processing module 114, a pitch processing module 115, and a feature list processing module 116. The feature list processing module 116 may perform feature-list-related processing, while at least one other processing module, such as the threshold initialization processing module 112, the STE and ZCR processing module 113, the voice type classification processing module 114, and the pitch processing module 115, may be used to perform the feature collection.

For example, the processing circuit 110 may be implemented as a processor, a microprocessor, etc.; the audio input device 120 may be implemented as a microphone, a headset, etc.; the audio data conversion interface circuit 130 may be implemented as an amplifier, an analog-to-digital converter, etc.; and the storage device 140 may be implemented as an electrically erasable programmable read-only memory (EEPROM), a flash memory, etc. Examples of the voice-controlled device 100 may include, but are not limited to: voice-controlled locks such as door locks and car locks, and voice-controlled toys. The multiple processing modules 111 may represent multiple program modules running on the processing circuit 110, where the voice-controlled device 100 may load the program code 141 onto the processing circuit 110 to become the multiple program modules. In some embodiments, the multiple processing modules 111 may represent multiple sub-circuits of the processing circuit 110.

Fig. 2 illustrates a working flow of a registration and identification control scheme of a method for performing wake-up control on a voice-controlled device (e.g., the voice-controlled device 100) with the aid of detecting the voice feature of a self-defined word according to an embodiment of the present invention. The processing circuit 110 may execute a registration procedure for speakers A and B, respectively, in a registration phase among multiple phases, and may execute an identification procedure for speaker U (e.g., an unknown speaker) in an identification phase among the multiple phases, where the registration procedure for speaker A may include steps S10~S13, the registration procedure for speaker B may include steps S20~S23, and the identification procedure for speaker U may include steps S30~S34 and step S35 or S36. In the registration phase, the processing circuit 110 may record multiple audio clips {Audio_Clip} (e.g., speaker A's audio clips {Audio_Clip_A} and speaker B's audio clips {Audio_Clip_B}), and perform the feature collection on the respective audio data {Audio_Data} of the audio clips {Audio_Clip} (e.g., the respective audio data {Audio_Data_A} of the audio clips {Audio_Clip_A} and the respective audio data {Audio_Data_B} of the audio clips {Audio_Clip_B}) to generate the respective feature lists {L} of the audio clips {Audio_Clip} (e.g., the respective feature lists {L_A} of the audio clips {Audio_Clip_A} and the respective feature lists {L_B} of the audio clips {Audio_Clip_B}), in order to establish a feature-list-based database 142 in the voice-controlled device 100, where the feature-list-based database 142 may include the feature lists {L}, any feature list L among the feature lists {L} may include multiple features of a corresponding audio clip Audio_Clip among the audio clips {Audio_Clip}, and the multiple features respectively belong to multiple predetermined types of features. For example, the audio clips {Audio_Clip} may carry at least one self-defined word W; in particular, each audio clip Audio_Clip_A among the audio clips {Audio_Clip_A} may carry a self-defined word W_A, and each audio clip Audio_Clip_B among the audio clips {Audio_Clip_B} may carry a self-defined word W_B. In addition, in the identification phase, the processing circuit 110 may record another audio clip Audio_Clip (e.g., speaker U's audio clip Audio_Clip_U), perform the feature collection on the audio data Audio_Data of the other audio clip Audio_Clip (e.g., the audio data Audio_Data_U of the audio clip Audio_Clip_U) to generate another feature list L of the other audio clip Audio_Clip (e.g., the feature list L_U of the audio clip Audio_Clip_U), and perform at least one screening operation on at least one feature in the other feature list L according to the feature-list-based database 142 to determine whether the other audio clip Audio_Clip is invalid, in order to selectively ignore the other audio clip Audio_Clip or execute at least one subsequent operation, without connecting to any cloud database through any network to obtain voice data for determining which words the at least one self-defined word contains, where the at least one subsequent operation may include waking up/activating the voice-controlled device 100.

In step S10, the processing circuit 110 may start executing the registration procedure for speaker A.

In step S11, the processing circuit 110 may record a corresponding audio clip Audio_Clip_A among the audio clips {Audio_Clip_A} to record the self-defined word W_A (e.g., Speaker-A-defined words W_A) as a wake-up word dedicated to speaker A.

In step S12, the processing circuit 110 may perform the feature collection on the corresponding audio data Audio_Data_A of the corresponding audio clip Audio_Clip_A to obtain multiple features of the corresponding audio clip Audio_Clip_A. The processing circuit 110 may re-enter step S11, as indicated by the dashed arrow, to repeatedly execute steps S11 and S12 and thereby perform the feature collection on the audio data {Audio_Data_A} of the audio clips {Audio_Clip_A}, respectively, to obtain the respective features of the audio clips {Audio_Clip_A}. For example, the processing circuit 110 may provide a user interface such as a record button and a stop button, and speaker A may press the record button, record the self-defined word W_A with the same tone and volume, and then press the stop button. The processing circuit 110 may detect certain voice characteristics of the corresponding audio clip Audio_Clip_A. If these voice characteristics conform to predetermined recording rules, the processing circuit 110 may record the multiple features of the corresponding audio clip Audio_Clip_A; otherwise, the processing circuit 110 may notify speaker A to record again.

In step S13, the processing circuit 110 may generate the feature list L_A according to the multiple features of the corresponding audio clip Audio_Clip_A; in particular, it may generate the respective feature lists {L_A} of the audio clips {Audio_Clip_A} according to the respective features of the audio clips {Audio_Clip_A}.

In step S20, the processing circuit 110 may start executing the registration procedure for speaker B.

In step S21, the processing circuit 110 may record a corresponding audio clip Audio_Clip_B among the audio clips {Audio_Clip_B} to record the self-defined word W_B (e.g., Speaker-B-defined words W_B) as a wake-up word dedicated to speaker B.

In step S22, the processing circuit 110 may perform the feature collection on the corresponding audio data Audio_Data_B of the corresponding audio clip Audio_Clip_B to obtain multiple features of the corresponding audio clip Audio_Clip_B. The processing circuit 110 may re-enter step S21, as indicated by the dashed arrow, to repeatedly execute steps S21 and S22 and thereby perform the feature collection on the audio data {Audio_Data_B} of the audio clips {Audio_Clip_B}, respectively, to obtain the respective features of the audio clips {Audio_Clip_B}. For example, speaker B may press the record button, record the self-defined word W_B with the same tone and volume, and then press the stop button. The processing circuit 110 may detect certain voice characteristics of the corresponding audio clip Audio_Clip_B. If these voice characteristics conform to the aforementioned predetermined recording rules, the processing circuit 110 may record the multiple features of the corresponding audio clip Audio_Clip_B; otherwise, the processing circuit 110 may notify speaker B to record again.

In step S23, the processing circuit 110 may generate the feature list L_B according to the multiple features of the corresponding audio clip Audio_Clip_B; in particular, it may generate the respective feature lists {L_B} of the audio clips {Audio_Clip_B} according to the respective features of the audio clips {Audio_Clip_B}.

In step S30, the processing circuit 110 may start executing the identification procedure for speaker U.

In step S31, the processing circuit 110 may record the audio clip Audio_Clip_U to record any self-defined word W_U of speaker U (if the self-defined word W_U exists).

In step S32, the processing circuit 110 may perform the feature collection on the audio data Audio_Data_U of the audio clip Audio_Clip_U to obtain multiple features of the audio clip Audio_Clip_U.

In step S33, the processing circuit 110 may generate the feature list L_U according to the multiple features of the audio clip Audio_Clip_U.

In step S34, the processing circuit 110 may perform speaker identification according to the feature-list-based database 142; in particular, it may quickly perform the screening operation on one or more features in the feature list L_U according to the feature-list-based database 142 to determine whether the audio clip Audio_Clip_U is invalid. For example, when the audio clip Audio_Clip_U is determined to be invalid, meaning that speaker U is an invalid speaker, the processing circuit 110 may execute step S36. When the audio clip Audio_Clip_U is determined not to be invalid, meaning that speaker U is speaker A or speaker B, the processing circuit 110 may execute step S35.
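The text does not pin down the exact screening rule used in step S34 at this point; as a minimal, hypothetical sketch (the function name and tolerance value below are assumptions, not taken from the patent), the screen could reject the unknown feature list whenever it lies farther than a fixed tolerance from every enrolled feature list:

```python
# Hypothetical screening sketch for step S34: treat Audio_Clip_U as invalid
# when its normalized feature list L_U is not within `tol` (Euclidean
# distance) of any enrolled feature list in the feature-list-based database.
import numpy as np

def is_invalid(l_u, database, tol=1.0):
    dists = np.linalg.norm(np.asarray(database) - np.asarray(l_u), axis=1)
    return bool(np.min(dists) > tol)  # True: ignore the clip (step S36)
```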

In step S35, the processing circuit 110 may perform the at least one subsequent operation as the action Action( ); in particular, it may wake up/activate the voice-controlled device 100.

In step S36, the processing circuit 110 may ignore the audio clip Audio_Clip_U.

Since registered speakers may have different self-defined words {W}, for any registered speaker there is a first layer of security as long as the self-defined word W is not heard by others; even if the self-defined word W is heard by others, there is a second layer of security, because someone with different voice features cannot wake up/activate the voice-controlled device 100. If a speaker utters the self-defined word W with a different tone or a different volume, the processing circuit 110 will determine that the corresponding audio clip Audio_Clip is invalid/unqualified, so there is no need to worry about everyday conversations being secretly recorded to forge speech for activating the voice-controlled device 100. In addition, the processing circuit 110 may quickly determine whether speaker U (e.g., an unknown speaker) is invalid according to the feature-list-based database 142, in order to selectively ignore the audio clip Audio_Clip_U or wake up/activate the voice-controlled device 100, without connecting to any remote system to obtain voice data for the determination. Since there is no need to determine which words the self-defined word contains, the processing circuit 110 does not need to connect to any cloud database through any network to obtain a large amount of voice data. Therefore, the method and the processing circuit 110 of the present invention can implement a compact, fast, secure, and reliable voice-control processing system with no side effects or fewer side effects.

According to some embodiments, one or more steps may be added to, deleted from, or modified in the working flow shown in Fig. 2. For example, by performing machine learning, the processing circuit 110 may establish a predetermined classifier corresponding to at least one predetermined model in the voice-controlled device 100, where the dimension of a predetermined space of the at least one predetermined model (e.g., the predetermined space spanned by multiple axes {X1, X2, ...}) may be equal to the feature-type count of the multiple predetermined types of features. In addition, the at least one feature in the other feature list L may be at least one of all features in the other feature list L, where all of the features in the other feature list L respectively belong to the multiple predetermined types of features. During the at least one subsequent operation, the processing circuit 110 may use the predetermined classifier to perform machine-learning-based classification according to all of the features in the other feature list L, to determine whether the speaker U of the other audio clip Audio_Clip is speaker A (e.g., user A) or speaker B (e.g., user B), in order to selectively execute at least one action Action(A) corresponding to speaker A (e.g., user A) or at least one action Action(B) corresponding to speaker B (e.g., user B) in step S35.
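Fig. 9 refers to a k-nearest neighbors (k-NN) classifier for this step. Below is a minimal sketch of such a classifier over the five normalized feature types (STE, ZCR, Pos, Pitch, Duration); the function names are illustrative, and the enrolled samples are loosely based on the Table 2 values quoted later:

```python
# Minimal k-NN sketch over 5-dimensional normalized feature lists.
import numpy as np

def classify_speaker(feature_list, enrolled, labels, k=3):
    """Return the majority label among the k nearest enrolled feature lists."""
    dists = np.linalg.norm(enrolled - feature_list, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of k closest
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Example: two enrolled speakers, one unknown utterance.
enrolled = np.array([
    [0.08, 1.45, -0.96, -0.97, -1.38],   # speaker A sample
    [-1.47, -1.76, 0.28, -1.03, -1.38],  # speaker A sample
    [-0.65, 0.28, -0.96, 1.14, -0.23],   # speaker B sample
    [1.80, -0.42, 1.36, 0.95, 0.92],     # speaker B sample
])
labels = ["A", "A", "B", "B"]
print(classify_speaker(np.array([0.0, 1.2, -0.9, -1.0, -1.3]), enrolled, labels))
```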

Fig. 3 illustrates the audio and related signals involved in a feature collection control scheme of the method according to an embodiment of the present invention, where the horizontal axis may represent time, measured in milliseconds (ms), and for the audio the vertical axis may represent the intensity of the audio samples y. According to some embodiments, the audio and the related signals, such as the short-term energy STE( ), the zero-crossing rate ZCR( ), and the classification signal, may vary. As shown in Fig. 3, the processing circuit 110 may generate the classification signal according to whether the short-term energy STE( ) reaches the short-term energy threshold STE_th and whether the zero-crossing rate ZCR( ) reaches the zero-crossing rate threshold ZCR_th, to indicate whether any portion among multiple portions of the audio is unvoiced, voiced, or breathy voice. For example, the processing circuit 110 may detect that the pitch of the audio is equal to 181.07 Hertz (Hz).

Fig. 4 illustrates an audio pre-processing operation involved in the threshold initialization processing in the feature collection control scheme according to an embodiment of the present invention, where the threshold initialization processing module 112 shown in Fig. 1 may be used to perform the threshold initialization processing. The threshold initialization processing module 112 may perform the audio pre-processing operation on the original version of the audio samples y (e.g., the audio samples y1) to generate a new version of the audio samples y (e.g., the audio samples y2). Assuming that the sampling frequency of the audio samples y1 is equal to 48000 Hz and the duration t_NOISE of the noise portion at the beginning of the audio samples y1 is equal to 0.4 seconds (s), the sample count n is equal to (48000 * 0.4) = 19200. According to some embodiments, the related parameters such as the sampling frequency and the duration t_NOISE may vary. The threshold initialization processing module 112 may calculate the mean value MEAN as follows: MEAN = (SUM(y1[0:n-1]) / n) = (SUM(y1[0:19199]) / 19200), where "SUM( )" represents summation. The mean value MEAN indicates the signal offset caused by the components on the audio input path (e.g., the audio input device 120 and/or the audio data conversion interface circuit 130). The threshold initialization processing module 112 may correct the original zero level according to the mean value MEAN; in particular, it may subtract the mean value MEAN from all recorded audio samples y1 to generate the audio samples y2 as follows: y2[0:N-1] = y1[0:N-1] - MEAN, where "N" represents the sample count of the audio samples y1 (or y2).
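A minimal sketch of this pre-processing step, assuming the 48000 Hz sampling rate and 0.4 s leading-noise duration of the example above (the function name is illustrative):

```python
# DC-offset removal: estimate MEAN over the leading noise-only samples and
# re-center the whole recording, following the formulas above.
import numpy as np

def remove_dc_offset(y1, fs=48000, t_noise=0.4):
    n = int(fs * t_noise)          # n = 48000 * 0.4 = 19200 noise samples
    mean = np.sum(y1[:n]) / n      # MEAN = SUM(y1[0:n-1]) / n
    return y1 - mean               # y2[0:N-1] = y1[0:N-1] - MEAN
```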

Fig. 5 illustrates the audio samples {y2} (e.g., the audio samples y2[0:n-1]) and the audio frames {f'} (e.g., the audio frames {f'1, f'2, f'3, ..., f'18}) involved in the threshold initialization processing in the feature collection control scheme according to an embodiment of the present invention, where the threshold initialization processing module 112 may subtract the mean value MEAN from the original zero level shown in Fig. 4 to obtain the new zero level. Assuming that the frame size p1 of the audio frames {f'} of the noise portion (or the noise frames {f'}) is equal to 1024 audio samples, the frame count within the duration t_NOISE of the noise portion is equal to (n / p1) = (19200 / 1024) = 18.75 ≈ 18, which means there are at least 18 frames within the duration t_NOISE. According to some embodiments, the related parameters such as the duration t_NOISE, the sample count n, and the frame size p1 may vary. Let the frame size p = p1; then any audio frame f'(i) among the audio frames {f'} may include the audio samples {y2[(i-1) * p], ..., y2[(i * p) - 1] | i = 1, ..., 18}. The threshold initialization processing module 112 may calculate the short-term energy value STE(f'(i)) of the audio frame f'(i) as follows: STE(f'(i)) = SUM({y2[x]^2 | x = ((i-1) * p), ..., ((i * p) - 1)}), where "y2[x]^2" represents the energy of the audio sample y2[x]. The threshold initialization processing module 112 may calculate the short-term energy threshold STE_th according to the respective short-term energy values {STE(f'(i))} of the audio frames {f'(i)} as follows: STE_th = MAX({STE(f'(i))}) * FACTOR_STE, where "MAX( )" represents the maximum and "FACTOR_STE" represents a predetermined short-term energy factor, e.g., FACTOR_STE = 10, used to determine the short-term energy threshold STE_th for deciding whether a speaker is speaking or only noise is present. According to some embodiments, the short-term energy threshold STE_th and/or the predetermined short-term energy factor FACTOR_STE may vary.
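A minimal sketch of this short-term-energy threshold initialization, following the formulas above with p = 1024, 18 noise frames, and FACTOR_STE = 10 (the function name is illustrative):

```python
# STE threshold initialization over the leading noise frames f'(1)..f'(18).
import numpy as np

def init_ste_threshold(y2, p=1024, n_frames=18, factor_ste=10.0):
    # STE(f'(i)) = sum of squared samples inside frame i
    ste = [float(np.sum(y2[i * p:(i + 1) * p] ** 2)) for i in range(n_frames)]
    return max(ste) * factor_ste   # STE_th = MAX({STE(f'(i))}) * FACTOR_STE
```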

In addition, the threshold initialization processing module 112 may calculate the zero-crossing rate threshold ZCR_th. Let y3[x] be a function of y2[x] that indicates whether y2[x] is greater than zero, as follows: y3[x] = 1 if y2[x] > 0; and y3[x] = -1 if y2[x] ≤ 0.

According to some embodiments, y3[x] and/or the related decision conditions (e.g., y2[x] > 0 and/or y2[x] ≤ 0) may vary. The threshold initialization processing module 112 may calculate the zero-crossing rate value ZCR(f'(i)) of the audio frame f'(i) as follows: ZCR(f'(i)) = SUM({|y3[x+1] - y3[x]| | x = ((i-1) * p), ..., ((i * p) - 1)}) / (2p), where "|y3[x+1] - y3[x]|" represents the absolute value of (y3[x+1] - y3[x]). Regarding the aforementioned predetermined recording rules, the zero-crossing rate of the noise sequence is expected to be large enough; in particular, it should reach a predetermined noise-sequence zero-crossing rate threshold Noise_Sequence_ZCR_th, which indicates that the noise sequence qualifies as noise for correctly performing the threshold initialization processing. In the registration procedure, the threshold initialization processing module 112 may determine, according to the respective zero-crossing rate values {ZCR(f'(i))} of the audio frames {f'(i)} and the predetermined noise-sequence zero-crossing rate threshold Noise_Sequence_ZCR_th, whether to notify the speaker/user being registered (e.g., speaker/user A or speaker/user B) to record again, and the related operations may include: (1) if MIN({ZCR(f'(i))}) < Noise_Sequence_ZCR_th, the processing circuit 110 (or the threshold initialization processing module 112) may control the voice-controlled device 100 to notify the speaker/user to record again; and (2) if MIN({ZCR(f'(i))}) ≥ Noise_Sequence_ZCR_th, the threshold initialization processing module 112 may set ZCR_th = MIN({ZCR(f'(i))}), where "MIN( )" represents the minimum. For example, the predetermined noise-sequence zero-crossing rate threshold Noise_Sequence_ZCR_th may be equal to 0.3. In some examples, the predetermined noise-sequence zero-crossing rate threshold Noise_Sequence_ZCR_th may vary.
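A minimal sketch of the zero-crossing-rate threshold initialization and the re-record check, following the formulas above with Noise_Sequence_ZCR_th = 0.3 (the function name is illustrative):

```python
# ZCR threshold initialization over the leading noise frames, including the
# qualification check against Noise_Sequence_ZCR_th.
import numpy as np

def init_zcr_threshold(y2, p=1024, n_frames=18, noise_seq_zcr_th=0.3):
    y3 = np.where(y2 > 0, 1, -1)   # y3[x] = 1 if y2[x] > 0, else -1
    zcr = []
    for i in range(n_frames):
        frame = y3[i * p:(i + 1) * p + 1]   # p + 1 samples for p differences
        zcr.append(float(np.sum(np.abs(np.diff(frame)))) / (2 * p))
    if min(zcr) < noise_seq_zcr_th:
        return None                # noise not qualified: notify the user to re-record
    return min(zcr)                # ZCR_th = MIN({ZCR(f'(i))})
```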

Fig. 6 illustrates the audio analysis operations 610 and 620 in the feature collection control scheme according to an embodiment of the present invention, where the threshold initialization processing module 112 and the STE and ZCR processing module 113 shown in Fig. 1 may be used to perform the audio analysis operations 610 and 620, respectively. After recording the corresponding audio clip Audio_Clip (e.g., the corresponding audio clip Audio_Clip_A or the corresponding audio clip Audio_Clip_B) to obtain the corresponding audio data Audio_Data of the corresponding audio clip Audio_Clip (e.g., the corresponding audio data Audio_Data_A or the corresponding audio data Audio_Data_B), the threshold initialization processing module 112 may analyze the first audio data Audio_Data1 of a first partial audio clip Audio_Clip1 (e.g., the noise portion) of the corresponding audio clip Audio_Clip, to determine the short-term energy threshold STE_th and the zero-crossing rate threshold ZCR_th according to multiple first audio frames (e.g., the audio frames {f'}) of the first audio data Audio_Data1, for further processing the remaining audio data Audio_Data2 of a remaining partial audio clip Audio_Clip2 of the corresponding audio clip Audio_Clip. As shown in the leftmost portion of Fig. 6, the beginning portion of the audio (e.g., the audio frames {f'(i)} such as the audio frames {f'1, f'2, f'3, ..., f'18}) can be expected to be noise.

In addition, the STE and ZCR processing module 113 may analyze the remaining audio data Audio_Data2 of the remaining partial audio clip Audio_Clip2, to calculate the respective short-term energy values {STE( )} and zero-crossing rates {ZCR( )} of multiple second audio frames (e.g., the audio frames {f}) of the remaining audio data Audio_Data2. According to whether the short-term energy value STE( ) of any second audio frame (e.g., an audio frame f) among the multiple second audio frames reaches the short-term energy threshold STE_th and whether the zero-crossing rate ZCR( ) of that second audio frame reaches the zero-crossing rate threshold ZCR_th, the processing circuit 110 (or the voice type classification processing module 114) may determine the voice type of that second audio frame to be one of multiple predetermined voice types (e.g., an unvoiced type, a voiced type, and a breathy voice type), for determining the multiple features of the corresponding audio clip Audio_Clip according to the respective voice types of the multiple second audio frames (e.g., the audio frames {f}).

Assuming the frame size p = p2, the STE and ZCR processing module 113 may calculate the short-term energy value STE(f(j)) and the zero-crossing rate ZCR(f(j)) of any audio frame f(j) among the audio frames {f(j)} (e.g., the audio frames {f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15, f16}), and based on a set of predetermined classification rules, the processing circuit 110 (or the voice type classification processing module 114) may classify the audio frame f(j) into one of the multiple predetermined voice types according to the short-term energy value STE(f(j)) and the zero-crossing rate ZCR(f(j)), for example: (1) if STE(f(j)) < STE_th, the processing circuit 110 (or the voice type classification processing module 114) may determine the voice type of the audio frame f(j) to be the unvoiced type; (2) if STE(f(j)) ≥ STE_th and ZCR(f(j)) < ZCR_th, the processing circuit 110 (or the voice type classification processing module 114) may determine the voice type of the audio frame f(j) to be the voiced type; and (3) if STE(f(j)) ≥ STE_th and ZCR(f(j)) ≥ ZCR_th, the processing circuit 110 (or the voice type classification processing module 114) may determine the voice type of the audio frame f(j) to be the breathy voice type.

According to some embodiments, the set of predetermined classification rules and the related calculations and/or parameters, such as the frame size p and the frame count of the audio frames {f(j)}, may vary.
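A minimal sketch of the per-frame classification rules listed above; the string labels are illustrative stand-ins for the three predetermined voice types:

```python
# Per-frame voice type classification from STE(f(j)) and ZCR(f(j)).
def classify_frame(ste, zcr, ste_th, zcr_th):
    if ste < ste_th:
        return "unvoiced"          # rule (1): energy below STE_th
    if zcr < zcr_th:
        return "voiced"            # rule (2): STE >= STE_th, ZCR < ZCR_th
    return "breathy"               # rule (3): STE >= STE_th, ZCR >= ZCR_th
```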

Fig. 7 illustrates the audio frames {f(j)} and the audio segments {Seg(k)} involved in the voice type classification in the feature collection control scheme according to an embodiment of the present invention, where the voice type classification processing module 114 shown in Fig. 1 may be used to perform the voice type classification. According to the aforementioned respective voice types of the multiple second audio frames (e.g., the audio frames {f(j)} such as the audio frames {f1, f2, ..., f16}), the voice type classification processing module 114 may partition the corresponding audio clip Audio_Clip into multiple audio segments {Seg(k)} such as the audio segments {Seg1, Seg2, Seg3, Seg4, Seg5, Seg6, Seg7, Seg8, Seg9}. Any two adjacent audio frames having the same predetermined voice type among all audio frames of the corresponding audio data Audio_Data may belong to the same audio segment Seg(k); all of the audio frames of the corresponding audio data Audio_Data may include the multiple first audio frames (e.g., the audio frames {f'(i)}) and the multiple second audio frames (e.g., the audio frames {f(j)}); and a leading audio segment 711 (e.g., the audio segment Seg1) among the multiple audio segments {Seg(k)} may include at least the multiple first audio frames and may correspond to a first predetermined voice type such as the unvoiced type.

In particular, the voice type classification processing module 114 may calculate the total time length, e.g., the duration Duration( ), of at least one main audio segment 720 (e.g., the audio segments {Seg2, Seg3, ..., Seg8}) among the multiple audio segments {Seg(k)}, as one of the multiple features of the corresponding audio clip Audio_Clip. The at least one main audio segment 720 may include the audio segments among the multiple audio segments {Seg(k)} other than the leading audio segment 711 and any trailing audio segment 719 (e.g., the audio segment Seg9) corresponding to the first predetermined voice type (e.g., the unvoiced type), such as the audio segments {Seg2, Seg3, ..., Seg8}. In addition, the voice type classification processing module 114 may use one or more other processing modules among the processing modules 111 to calculate at least one segment-level parameter of each audio segment Seg(k) corresponding to a second predetermined voice type (e.g., the voiced type) among the multiple audio segments {Seg(k)} (e.g., each of the audio segments {Seg2, Seg4, Seg6, Seg8}), to determine at least one parameter of the corresponding audio clip Audio_Clip according to the at least one segment-level parameter, as at least one other feature among the multiple features of the corresponding audio clip Audio_Clip, where each such audio segment Seg(k) may represent a voiced segment Seg(k). For example, the at least one segment-level parameter may include the pitch Pitch( ), the short-term energy value STE( ), and the zero-crossing rate ZCR( ) of each such audio segment Seg(k), and the at least one parameter may include the pitch Pitch( ), the short-term energy value STE( ), and the zero-crossing rate ZCR( ) of the corresponding audio clip Audio_Clip. According to some embodiments, the at least one segment-level parameter and/or the at least one parameter may vary. For example, the at least one parameter may further include the start time point Pos( ) of a certain main audio segment among the at least one main audio segment 720 of the corresponding audio clip Audio_Clip.

The calculation of the pitch Pitch( ) of each such audio segment Seg(k) (e.g., a voiced segment Seg(k), such as each of the audio segments {Seg2, Seg4, Seg6, Seg8}) can be described as follows. The voice type classification processing module 114 may use the pitch processing module 115 to calculate the pitch Pitch( ) of the voiced segment Seg(k) according to any predetermined pitch calculation function among one or more predetermined pitch calculation functions. For example, the one or more predetermined pitch calculation functions may include a first predetermined pitch calculation function such as the auto-correlation function (ACF) ACF( ), and a second predetermined pitch calculation function such as the average magnitude difference function (AMDF) AMDF( ). Assuming that the length of the voiced segment Seg(k) is equal to Q (i.e., Q audio samples), the auto-correlation function ACF( ) can be expressed as follows: ACF(m) = Σ_{q=m}^{Q-1} (y2(q) * y2(q-m)), where "Σ" represents summation, and "q" represents the sample index of the related audio sample y2(q) and is an integer within the interval [m, Q-1]. The pitch processing module 115 may find the maximum ACF(m) within the range M1 < m < M2-1. For example, M1 may be the number of auto-correlation points to be calculated that corresponds to the minimum pitch period to be searched by the pitch processing module 115, and M2 may be the number of auto-correlation points to be calculated that corresponds to the maximum pitch period to be searched by the pitch processing module 115. In addition, the average magnitude difference function AMDF( ) can be expressed as follows: AMDF(m) = Σ_{q=0}^{Q-1} |y2(q) - y2(q+m)|, where "Σ" represents summation, and "q" represents the sample index of the related audio sample y2(q) and is an integer within the interval [0, Q-1]. The pitch processing module 115 may find the minimum AMDF(m) within the range M1 < m < M2-1.

A voiced segment Seg(k) (whose length is equal to Q) can be regarded as a periodic signal with period T0 = m, and the fundamental frequency f0 is equal to (1/T0). For example, when the sampling frequency fs is equal to 48000 Hz, the pitch processing module 115 may calculate the period T0 and the fundamental frequency f0 as follows: T0 = m = 250 (samples) = (250/48000) (seconds); and f0 = (1/T0) = (fs/m) = (48000/250) = 192 (Hz).

According to some embodiments, the one or more predetermined pitch calculation functions and/or the related parameters may vary. Generally speaking, the pitch Pitch( ) may fall within the range of 60~300 Hz, where a male pitch Pitch( ) may fall within the range of 60~180 Hz and a female pitch Pitch( ) may fall within the range of 160~300 Hz. For example, when the sampling frequency fs is equal to 48000 Hz: if f0 = 60 Hz, then T0 = (48000/60) = 800; and if f0 = 300 Hz, then T0 = (48000/300) = 160; so the pitch period may be 160~800 samples. Therefore, the pitch processing module 115 may set M1 = 160 and M2 = 800.
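A minimal sketch of the ACF-based pitch search described above, using the example values fs = 48000 Hz, M1 = 160, and M2 = 800 (the function name is illustrative; the segment is assumed to be longer than M2 samples):

```python
# ACF pitch estimation: maximize ACF(m) over M1 < m < M2 - 1, then convert
# the best lag (pitch period in samples) to a fundamental frequency.
import numpy as np

def estimate_pitch_acf(seg, fs=48000, m1=160, m2=800):
    q = len(seg)
    # ACF(m) = sum_{x=m}^{Q-1} seg[x] * seg[x - m]
    acf = [float(np.dot(seg[m:q], seg[0:q - m])) for m in range(m1 + 1, m2 - 1)]
    best_m = (m1 + 1) + int(np.argmax(acf))
    return fs / best_m             # f0 = fs / T0, e.g., 48000 / 250 = 192 Hz
```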

The calculation of the short-term energy value STE( ) and the zero-crossing rate ZCR( ) of each such audio segment Seg(k) (e.g., a voiced segment Seg(k), such as each of the audio segments {Seg2, Seg4, Seg6, Seg8}) can be described as follows. The voice type classification processing module 114 may use the STE and ZCR processing module 113 to calculate the short-term energy value STE(Seg(k)) and the zero-crossing rate ZCR(Seg(k)) of the voiced segment Seg(k) as follows: STE(Seg(k)) = AVG({STE(f(j)) | j = j1(k), ..., j2(k)}); ZCR(Seg(k)) = AVG({ZCR(f(j)) | j = j1(k), ..., j2(k)}), where "AVG( )" represents averaging, "j1(k)" represents the index of the first audio frame f(j1(k)) of the voiced segment Seg(k), and "j2(k)" represents the index of the last audio frame f(j2(k)) of the voiced segment Seg(k).
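A minimal sketch of this segment-level averaging; `frame_ste` and `frame_zcr` are assumed to be per-frame value lists indexed by j, and the function name is illustrative:

```python
# Segment-level STE/ZCR: average the member frames f(j1(k))..f(j2(k)).
def segment_ste_zcr(frame_ste, frame_zcr, j1, j2):
    n = j2 - j1 + 1
    seg_ste = sum(frame_ste[j] for j in range(j1, j2 + 1)) / n  # STE(Seg(k))
    seg_zcr = sum(frame_zcr[j] for j in range(j1, j2 + 1)) / n  # ZCR(Seg(k))
    return seg_ste, seg_zcr
```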

According to some embodiments, the processing circuit 110 (or the feature list processing module 116) may generate the feature lists {L}, such as the feature lists {L_A} and {L_B}, according to a predetermined feature list format. For example, the multiple predetermined types of features may include the short-term energy value STE( ), the zero-crossing rate ZCR( ), the start time point Pos( ), the pitch Pitch( ), and the duration Duration( ), and the predetermined feature list format may represent a feature list format (STE( ), ZCR( ), Pos( ), Pitch( ), Duration( )) carrying the short-term energy value STE( ), the zero-crossing rate ZCR( ), the start time point Pos( ), the pitch Pitch( ), and the duration Duration( ). In some examples, the multiple predetermined types of features and/or the predetermined feature list format may vary.

Table 1 shows examples of the temporary feature lists {L_tmp_A} and {L_tmp_B} for speakers A and B, and Table 2 shows examples of the feature lists {L_A} and {L_B} for speakers A and B. The processing circuit 110 (or the STE and ZCR processing module 113) may find the maximum among the respective short-term energy values {STE(Seg(k))} of the voiced segments {Seg(k)} (e.g., the audio segments {Seg2, Seg4, Seg6, Seg8}) as the short-term energy value STE( ) of the corresponding audio clip Audio_Clip. In particular, the processing circuit 110 (or the feature list processing module 116) may record the short-term energy value STE(Seg(k)), the zero-crossing rate ZCR(Seg(k)), the start time point Pos(Seg(k)), and the pitch Pitch(Seg(k)) of the voiced segment Seg(k) having that maximum as the short-term energy value STE( ), the zero-crossing rate ZCR( ), the start time point Pos( ), and the pitch Pitch( ) of the corresponding audio clip Audio_Clip, and record the duration Duration( ) of the audio clip Audio_Clip, to establish a corresponding temporary feature list L_tmp among the temporary feature lists {L_tmp} (e.g., the temporary feature lists {L_tmp_A} and {L_tmp_B}), and may perform normalization on the temporary feature lists {L_tmp} (e.g., the temporary feature lists {L_tmp_A} and {L_tmp_B}) to generate the feature lists {L} (e.g., the feature lists {L_A} and {L_B}).

For example, the feature list processing module 116 may perform the aforementioned feature-list-related processing to generate the temporary feature lists {L_tmp_A} and {L_tmp_B}, such as the temporary feature lists shown in Table 1: {{(0.0001776469, 0.057857143, 2800, 111.7718733, 39200), (0.0000499814, 0.021830357, 25200, 109.6723773, 39200), ..., (0.0000897361, 0.059107143, 5600, 112.5243115, 44800)}, {(0.000117627, 0.044642857, 2800, 189.5970772, 42000), (0.0003191778, 0.036785714, 44800, 182.5511925, 44800), ..., (0.0001378857, 0.033214286, 44800, 178.4916067, 44800)}}, and may convert the temporary feature lists {L_tmp_A} and {L_tmp_B} into the feature lists {L_A} and {L_B}, respectively, such as the feature lists shown in Table 2: {{(0.0823451200, 1.45292435, -0.95713666, -0.97132086, -1.3764944), (-1.4695843500, -1.75679355, 0.27787838, -1.02837763, -1.3764944), ..., (-0.9863176100, 1.56429003, -0.80275978, -0.95087229, 0.91766294)}, {(-0.6472588500, 0.27563005, -0.95713666, 1.14368901, -0.22941573), (1.8028253900, -0.42438277, 1.35851655, 0.95220714, 0.91766294), ..., (-0.4010006400, -0.74257042, 1.35851655, 0.84188216, 0.91766294)}}. In other examples, the temporary feature lists {L_tmp_A} and {L_tmp_B} and the feature lists {L_A} and {L_B} may be varied.

In either of Table 1 and Table 2, each row can be regarded as one sample, and each column except column 0 (used to indicate speaker A or B) can be regarded as one type of feature F(Col). The feature list processing module 116 can perform the aforementioned normalization on columns 1-5 of Table 1 (for example, the features {F1(1), F1(2), F1(3), F1(4), F1(5)}) to generate columns 1-5 of Table 2 (for example, the features {F2(1), F2(2), F2(3), F2(4), F2(5)}), respectively, where for any converted feature F2 in Table 2, the mean is 0 and the standard deviation is 1, which can reduce the influence of outliers. For example, the related operations may include: (1) for each type of feature F1(Col) of the self-defined words W_A and W_B of speakers A and B, the feature list processing module 116 can calculate the mean Mean1(Col) and the standard deviation Std1(Col) over all samples in Table 1, where for the features {F1(1), F1(2), F1(3), F1(4), F1(5)} (for example, the short-term energy value STE( ), the zero-crossing rate ZCR( ), the start time point Pos( ), the pitch Pitch( ), and the duration Duration( )), the feature list processing module 116 can calculate the means {Mean1(1), Mean1(2), Mean1(3), Mean1(4), Mean1(5)} and the standard deviations {Std1(1), Std1(2), Std1(3), Std1(4), Std1(5)} over all samples in Table 1, respectively; and (2) the feature list processing module 116 can convert the feature F1(Col) into the feature F2(Col) such that F2(Col) = (F1(Col) - Mean1(Col)) / Std1(Col), and in particular, convert the features {F1(1), F1(2), F1(3), F1(4), F1(5)} in the temporary feature lists {L_tmp} into the features {F2(1), F2(2), F2(3), F2(4), F2(5)} in the feature lists {L}, respectively, where "Col" can represent an integer in the interval [1, 5]. The feature list processing module 116 can store the conversion parameters {(Mean1(1), Std1(1)), (Mean1(2), Std1(2)), (Mean1(3), Std1(3)), (Mean1(4), Std1(4)), (Mean1(5), Std1(5))} between the features {F2(1), F2(2), F2(3), F2(4), F2(5)} and {F1(1), F1(2), F1(3), F1(4), F1(5)} into the storage device 140, for converting a temporary feature list L_tmp_U of the audio data Audio_Data_U of an audio clip Audio_Clip_U of a speaker U into a feature list L_U of the audio data Audio_Data_U in the identification procedure. The feature list processing module 116 can convert the feature F_U1(Col) in the temporary feature list L_tmp_U into the feature F_U2(Col) in the feature list L_U such that F_U2(Col) = (F_U1(Col) - Mean1(Col)) / Std1(Col), and in particular, convert the features {F_U1(1), F_U1(2), F_U1(3), F_U1(4), F_U1(5)} in the temporary feature list L_tmp_U into the features {F_U2(1), F_U2(2), F_U2(3), F_U2(4), F_U2(5)} in the feature list L_U, respectively.
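
A minimal Python sketch of this z-score normalization, assuming the feature lists are held as NumPy arrays with one sample per row; the function names are illustrative, not the patent's own.

    import numpy as np

    def fit_normalization(table1: np.ndarray):
        # Compute Mean1(Col) and Std1(Col) over all samples (rows) for each
        # feature column, as in step (1) of the related operations.
        mean1 = table1.mean(axis=0)
        std1 = table1.std(axis=0)
        return mean1, std1

    def normalize(features: np.ndarray, mean1: np.ndarray, std1: np.ndarray):
        # F2(Col) = (F1(Col) - Mean1(Col)) / Std1(Col); works for the enrollment
        # table and for a single identification-phase list L_tmp_U alike.
        return (features - mean1) / std1

    # Usage: fit on the registration-phase samples, persist (mean1, std1), and
    # reuse the same parameters to convert L_tmp_U during identification.
    table1 = np.array([[0.0001776469, 0.057857143, 2800, 111.7718733, 39200],
                       [0.000117627, 0.044642857, 2800, 189.5970772, 42000]])
    mean1, std1 = fit_normalization(table1)
    table2 = normalize(table1, mean1, std1)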

FIG. 8 illustrates a working flow of a speaker identification control scheme of the method according to an embodiment of the present invention. Steps S34 and S35 shown in FIG. 2 may respectively include multiple sub-steps, such as steps S41-S43 and steps S44-S45, where the aforementioned one or more features may include the pitch Pitch( ) and the duration Duration( ).

In step S41, the processing circuit 110 (or the speech type classification processing module 114) can perform a screening operation on the pitch Pitch(U) in the feature list L_U according to the feature-list-based database 142 to determine whether the audio clip Audio_Clip_U is invalid, and in particular, can screen the pitch Pitch(U) according to the maximum value MAX({Pitch(A)}) and the minimum value MIN({Pitch(A)}) of the pitches {Pitch(A)} in the feature lists {L_A} corresponding to speaker A, and the maximum value MAX({Pitch(B)}) and the minimum value MIN({Pitch(B)}) of the pitches {Pitch(B)} in the feature lists {L_B} corresponding to speaker B. If MIN({Pitch(A)}) < Pitch(U) < MAX({Pitch(A)}) or MIN({Pitch(B)}) < Pitch(U) < MAX({Pitch(B)}), the working flow proceeds to step S42; otherwise, having determined that the audio clip Audio_Clip_U (or the speaker U) is invalid, the working flow proceeds to step S36.

In step S42, the processing circuit 110 (or the speech type classification processing module 114) can perform a screening operation on the duration Duration(U) in the feature list L_U according to the feature-list-based database 142 to determine whether the audio clip Audio_Clip_U is invalid, and in particular, can screen the duration Duration(U) according to the maximum value MAX({Duration(A)}) and the minimum value MIN({Duration(A)}) of the durations {Duration(A)} in the feature lists {L_A} corresponding to speaker A, and the maximum value MAX({Duration(B)}) and the minimum value MIN({Duration(B)}) of the durations {Duration(B)} in the feature lists {L_B} corresponding to speaker B. If MIN({Duration(A)}) < Duration(U) < MAX({Duration(A)}) or MIN({Duration(B)}) < Duration(U) < MAX({Duration(B)}), the working flow proceeds to step S43; otherwise, having determined that the audio clip Audio_Clip_U (or the speaker U) is invalid, the working flow proceeds to step S36.
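
The two screening steps reduce to simple range checks against the per-speaker minima and maxima stored in the feature-list-based database. The sketch below shows one way to express them; the dictionary layout and function names are illustrative assumptions.

    def in_range(value: float, samples: list[float]) -> bool:
        # True if value lies strictly between the minimum and maximum of the
        # enrolled samples, e.g. MIN({Pitch(A)}) < Pitch(U) < MAX({Pitch(A)}).
        return min(samples) < value < max(samples)

    def screen(pitch_u: float, duration_u: float, db: dict) -> bool:
        # Steps S41/S42: the clip survives screening only if its pitch and its
        # duration each fall within the enrolled range of at least one speaker.
        pitch_ok = any(in_range(pitch_u, db[s]["pitch"]) for s in db)
        duration_ok = any(in_range(duration_u, db[s]["duration"]) for s in db)
        return pitch_ok and duration_ok

    # Usage with a toy database for speakers A and B:
    db = {"A": {"pitch": [109.7, 111.8, 112.5], "duration": [39200, 44800]},
          "B": {"pitch": [178.5, 182.6, 189.6], "duration": [42000, 44800]}}
    print(screen(pitch_u=110.9, duration_u=40000, db=db))  # True -> proceed to S43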

In step S43, the processing circuit 110 (or the speech type classification processing module 114) can utilize the predetermined classifier, such as a k-NN algorithm classifier (or "KNN classifier"), to perform the machine-learning-based classification according to all of the features in the feature list L_U to determine whether the speaker U of the audio clip Audio_Clip_U is speaker A or speaker B, for selectively proceeding to step S44 or step S45. If the speaker U is determined to be speaker A, the working flow proceeds to step S44; if the speaker U is determined to be speaker B, the working flow proceeds to step S45.

In step S44, the processing circuit 110 can execute the action Action(A) corresponding to speaker A.

In step S45, the processing circuit 110 can execute the action Action(B) corresponding to speaker B.

According to some embodiments, one or more steps may be added, deleted, or modified in the working flow shown in FIG. 8.

FIG. 9 illustrates the data points involved in the k-NN classifier in the speaker identification control scheme according to an embodiment of the present invention. For example, the KNN classifier operating according to the k-NN algorithm can perform classification by majority vote when handling a classification problem, and the related operations may include: (1) determining the value of k; (2) calculating the distance between each neighbor and a target data point (e.g., a new data point); (3) finding the k neighbors nearest to the target data point (e.g., the new data point); and (4) checking which category/group (e.g., category A or category B) among multiple predetermined categories has the largest number of neighbors, to classify the target data point (e.g., the new data point) into the category/group (e.g., category A or category B) having the most neighbors, where the multiple predetermined categories such as categories A and B may correspond to the registered speakers mentioned above, such as speakers A and B. The dimension of the predetermined space (e.g., the axis count of the multiple axes {X1, X2, ...}) can be equal to the feature-type count (e.g., 5) of the multiple predetermined types of features. For better comprehension, two axes {X1, X2} are depicted in FIG. 9 as an example of the multiple axes {X1, X2, ...}. In addition, the data points belonging to category A can represent the feature lists {L_A}, respectively, the data points belonging to category B can represent the feature lists {L_B}, respectively, and the new data point can represent the feature list L_U. Additionally, the data point count CNT_Data_point_A (e.g., CNT_Data_point_A = 9) of the data points belonging to category A can be equal to the feature list count CNT_L_A (e.g., CNT_L_A = 9) of the feature lists {L_A}, and the data point count CNT_Data_point_B (e.g., CNT_Data_point_B = 8) of the data points belonging to category B can be equal to the feature list count CNT_L_B (e.g., CNT_L_B = 8) of the feature lists {L_B}. For example, for the new data point, 3 neighbors may belong to category A and 2 neighbors may belong to category B; in this situation, the KNN classifier can classify the new data point into category A. According to some embodiments, the new data point, the data points belonging to category A, the data points belonging to category B, the data point count CNT_Data_point_A, the data point count CNT_Data_point_B, the feature list count CNT_L_A, the feature list count CNT_L_B, the number of neighbors belonging to category A, and/or the number of neighbors belonging to category B may be varied. For example, for the new data point, 2 neighbors may belong to category A and 5 neighbors may belong to category B; in this situation, the KNN classifier can classify the new data point into category B.
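
As a self-contained illustration of the majority-vote procedure listed above (a sketch only; the Euclidean distance metric and the k value are assumptions, since the patent does not fix them here):

    import numpy as np
    from collections import Counter

    def knn_classify(new_point: np.ndarray, points: np.ndarray,
                     labels: list[str], k: int = 5) -> str:
        # (2) distance from the target data point to every enrolled point
        # (5-dimensional in the embodiment above; 2-dimensional in this toy example).
        dists = np.linalg.norm(points - new_point, axis=1)
        # (3) the k nearest neighbors, then (4) majority vote over their labels.
        nearest = np.argsort(dists)[:k]
        votes = Counter(labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Usage: points holds the feature lists {L_A} and {L_B} as rows; the new
    # data point is the identification-phase feature list L_U.
    points = np.array([[0.08, 1.45], [-1.47, -1.76], [-0.65, 0.28], [1.80, -0.42]])
    labels = ["A", "A", "B", "B"]
    print(knn_classify(np.array([0.0, 1.0]), points, labels, k=3))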

FIG. 10 illustrates a flowchart of the method according to an embodiment of the present invention.

In step S50, in the registration phase, the processing circuit 110 can perform the feature collection on the audio data Audio_Data of at least one audio clip Audio_Clip (for example, the audio data {Audio_Data_A} and {Audio_Data_B}) to generate at least one feature list L of the at least one audio clip Audio_Clip (for example, the feature lists {L_A} and {L_B}), to establish the feature-list-based database 142 in the voice-controlled device 100, where the at least one audio clip Audio_Clip can carry the at least one self-defined word W, the feature-list-based database 142 can include the at least one feature list L, any feature list L among the at least one feature list L can include multiple features of a corresponding audio clip Audio_Clip among the at least one audio clip Audio_Clip, and the multiple features belong to the multiple predetermined types of features, respectively.

In step S51, in the identification phase, the processing circuit 110 can perform the feature collection on the audio data Audio_Data (for example, the audio data Audio_Data_U) of another audio clip Audio_Clip (for example, the audio clip Audio_Clip_U) to generate another feature list L (for example, the feature list L_U) of the other audio clip Audio_Clip.

In step S52, in the identification phase, the processing circuit 110 can perform at least one screening operation on at least one feature in the other feature list L according to the feature-list-based database 142 to determine whether the other audio clip Audio_Clip is invalid, to selectively ignore the other audio clip Audio_Clip or execute at least one subsequent operation, where the at least one subsequent operation includes waking up the voice-controlled device 100.
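
Putting steps S50-S52 together, the identification-phase control loop might look like the following sketch; screen, classify, and the action callbacks are all illustrative stand-ins for the earlier examples, not the patent's own interfaces.

    from typing import Callable, Optional

    def identification_phase(features: dict,
                             screen: Callable[[dict], bool],
                             classify: Callable[[dict], str],
                             actions: dict[str, Callable[[], None]]) -> Optional[str]:
        # Step S52: screen the identification-phase feature list; an invalid
        # clip is ignored and the device stays asleep.
        if not screen(features):
            return None
        # Subsequent operations: wake the device, then run the per-speaker action.
        speaker = classify(features)
        actions[speaker]()
        return speaker

    # Usage with trivial stand-ins:
    result = identification_phase(
        {"pitch": 110.9, "duration": 40000},
        screen=lambda f: 109.7 < f["pitch"] < 112.5,
        classify=lambda f: "A",
        actions={"A": lambda: print("Action(A): wake up for speaker A"),
                 "B": lambda: print("Action(B): wake up for speaker B")},
    )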

According to some embodiments, one or more steps may be added, deleted, or modified in the working flow shown in FIG. 10.

The above are merely preferred embodiments of the present invention, and all equivalent variations and modifications made according to the claims of the present invention shall fall within the scope of the present invention.

100: voice-controlled device
110: processing circuit
111: processing module
112: threshold initialization processing module
113: short-term energy (STE) and zero-crossing rate (ZCR) processing module
114: speech type classification processing module
115: pitch processing module
116: feature list processing module
120: audio input device
130: audio data conversion interface circuit
140: storage device
141: program code
142: feature-list-based database

Claims (9)

1. A method for performing wake-up control on a voice-controlled device with the aid of detecting a voice feature of a self-defined word, the method comprising:
in a registration phase among multiple phases, performing feature collection on audio data of at least one audio clip to generate at least one feature list of the at least one audio clip, to establish a feature-list-based database in the voice-controlled device, wherein the at least one audio clip carries at least one self-defined word, the feature-list-based database comprises the at least one feature list, any feature list among the at least one feature list comprises multiple features of a corresponding audio clip among the at least one audio clip, and the multiple features belong to multiple predetermined types of features, respectively, wherein performing the feature collection on the audio data of the at least one audio clip to generate the at least one feature list of the at least one audio clip further comprises:
after recording the corresponding audio clip to obtain corresponding audio data of the corresponding audio clip, analyzing first audio data of a first partial audio clip of the corresponding audio clip, to determine an energy threshold and a zero-crossing rate threshold according to multiple first audio frames of the first audio data, for further processing remaining audio data of a remaining partial audio clip of the corresponding audio clip; and
analyzing the remaining audio data of the remaining partial audio clip, to calculate respective energy values and zero-crossing rates of multiple second audio frames of the remaining audio data, and determining a speech type of any second audio frame among the multiple second audio frames to be one of multiple predetermined speech types according to whether the energy value of the any second audio frame reaches the energy threshold and whether the zero-crossing rate of the any second audio frame reaches the zero-crossing rate threshold, for determining the multiple features of the corresponding audio clip according to respective speech types of the multiple second audio frames;
in an identification phase among the multiple phases, performing the feature collection on audio data of another audio clip to generate another feature list of the other audio clip; and
in the identification phase, performing at least one screening operation on at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute at least one subsequent operation, wherein the at least one subsequent operation comprises waking up the voice-controlled device.

2. The method of claim 1, wherein the at least one audio clip comprises multiple audio clips, and the at least one feature list comprises respective feature lists of the multiple audio clips, wherein the any feature list among the at least one feature list represents one feature list among the respective feature lists of the multiple audio clips, and the corresponding audio clip represents one of the multiple audio clips.

3. The method of claim 1, wherein performing the at least one screening operation on the at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute the at least one subsequent operation further comprises:
if the other audio clip is invalid, ignoring the other audio clip; and
if the other audio clip is not invalid, performing the at least one subsequent operation.

4. The method of claim 1, wherein the at least one audio clip comprises at least one first audio clip of a first user, and comprises at least one second audio clip of a second user; and performing the feature collection on the audio data of the at least one audio clip to generate the at least one feature list of the at least one audio clip further comprises:
performing the feature collection on audio data of the at least one first audio clip to generate at least one first feature list of the at least one first audio clip, wherein each first audio clip among the at least one first audio clip carries a first self-defined word, the feature-list-based database comprises the at least one first feature list, any first feature list among the at least one first feature list comprises multiple first features of a corresponding first audio clip among the at least one first audio clip, and the multiple first features belong to the multiple predetermined types of features, respectively; and
performing the feature collection on audio data of the at least one second audio clip to generate at least one second feature list of the at least one second audio clip, wherein each second audio clip among the at least one second audio clip carries a second self-defined word, the feature-list-based database comprises the at least one second feature list, any second feature list among the at least one second feature list comprises multiple second features of a corresponding second audio clip among the at least one second audio clip, and the multiple second features belong to the multiple predetermined types of features, respectively.

5. The method of claim 1, wherein by performing machine learning, a predetermined classifier corresponding to at least one predetermined model is established in the voice-controlled device; the at least one feature in the other feature list is at least one of all features in the other feature list, wherein all of the features in the other feature list belong to the multiple predetermined types of features, respectively; and the at least one subsequent operation further comprises:
according to all of the features in the other feature list, utilizing the predetermined classifier to perform machine-learning-based classification to determine whether a speaker of the other audio clip is a first user or a second user, to selectively execute at least one first action corresponding to the first user or at least one second action corresponding to the second user.

6. The method of claim 5, wherein a dimension of a predetermined space of the at least one predetermined model is equal to a feature-type count of the multiple predetermined types of features.

7. The method of claim 1, wherein performing the at least one screening operation on the at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute the at least one subsequent operation further comprises:
performing the at least one screening operation on the at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute the at least one subsequent operation, without needing to connect to any cloud database through any network to obtain any speech data for determining which characters the at least one self-defined word contains.

8. The method of claim 1, wherein performing the feature collection on the audio data of the at least one audio clip to generate the at least one feature list of the at least one audio clip further comprises:
according to the respective speech types of the multiple second audio frames, dividing the corresponding audio clip into multiple audio segments, wherein any two adjacent audio frames having a same predetermined speech type among all audio frames of the corresponding audio data belong to a same audio segment, all of the audio frames of the corresponding audio data comprise the multiple first audio frames and the multiple second audio frames, and a beginning audio segment among the multiple audio segments at least comprises the multiple first audio frames and corresponds to a first predetermined speech type;
calculating a total time length of at least one main audio segment among the multiple audio segments to act as one feature among the multiple features of the corresponding audio clip, wherein the at least one main audio segment comprises the audio segments among the multiple audio segments other than the beginning audio segment and any ending audio segment corresponding to the first predetermined speech type; and
calculating at least one segment-level parameter of each audio segment corresponding to a second predetermined speech type among the multiple audio segments, to determine at least one parameter of the corresponding audio clip according to the at least one segment-level parameter, to act as at least one other feature among the multiple features of the corresponding audio clip.

9. A processing circuit for performing wake-up control on a voice-controlled device with the aid of detecting a voice feature of a self-defined word, the processing circuit comprising:
multiple processing modules, arranged to perform operations of the processing circuit, wherein the multiple processing modules comprise:
a feature list processing module, arranged to perform feature-list-related processing; and
at least one other processing module, arranged to perform feature collection;
wherein:
in a registration phase among multiple phases, the processing circuit performs the feature collection on audio data of at least one audio clip to generate at least one feature list of the at least one audio clip, to establish a feature-list-based database in the voice-controlled device, wherein the at least one audio clip carries at least one self-defined word, the feature-list-based database comprises the at least one feature list, any feature list among the at least one feature list comprises multiple features of a corresponding audio clip among the at least one audio clip, and the multiple features belong to multiple predetermined types of features, respectively, wherein performing the feature collection on the audio data of the at least one audio clip to generate the at least one feature list of the at least one audio clip further comprises:
after recording the corresponding audio clip to obtain corresponding audio data of the corresponding audio clip, analyzing first audio data of a first partial audio clip of the corresponding audio clip, to determine an energy threshold and a zero-crossing rate threshold according to multiple first audio frames of the first audio data, for further processing remaining audio data of a remaining partial audio clip of the corresponding audio clip; and
analyzing the remaining audio data of the remaining partial audio clip, to calculate respective energy values and zero-crossing rates of multiple second audio frames of the remaining audio data, and determining a speech type of any second audio frame among the multiple second audio frames to be one of multiple predetermined speech types according to whether the energy value of the any second audio frame reaches the energy threshold and whether the zero-crossing rate of the any second audio frame reaches the zero-crossing rate threshold, for determining the multiple features of the corresponding audio clip according to respective speech types of the multiple second audio frames;
in an identification phase among the multiple phases, the processing circuit performs the feature collection on audio data of another audio clip to generate another feature list of the other audio clip; and
in the identification phase, the processing circuit performs at least one screening operation on at least one feature in the other feature list according to the feature-list-based database to determine whether the other audio clip is invalid, to selectively ignore the other audio clip or execute at least one subsequent operation, wherein the at least one subsequent operation comprises waking up the voice-controlled device.
TW112151041A 2023-12-27 2023-12-27 Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word TWI892389B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW112151041A TWI892389B (en) 2023-12-27 2023-12-27 Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word
US18/908,826 US20250218441A1 (en) 2023-12-27 2024-10-08 Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112151041A TWI892389B (en) 2023-12-27 2023-12-27 Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word

Publications (2)

Publication Number Publication Date
TW202526912A TW202526912A (en) 2025-07-01
TWI892389B true TWI892389B (en) 2025-08-01

Family

ID=96174598

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112151041A TWI892389B (en) 2023-12-27 2023-12-27 Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word

Country Status (2)

Country Link
US (1) US20250218441A1 (en)
TW (1) TWI892389B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115914734A (en) * 2021-09-22 2023-04-04 北京字跳网络技术有限公司 Audio and video processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103221996A (en) * 2010-12-10 2013-07-24 松下电器产业株式会社 Apparatus and method for password modeling for speaker verification, and speaker verification system
US20140122243A1 (en) * 2005-09-14 2014-05-01 Millennial Media, Inc. Managing Sponsored Content for Delivery to Mobile Communication Facilities
CN105474166A (en) * 2013-03-15 2016-04-06 先进元素科技公司 Method and system for purposeful computing
TW201729034A (en) * 2015-12-18 2017-08-16 英特爾Ip公司 Security mechanisms for extreme deep sleep state
CN111247582A (en) * 2018-09-28 2020-06-05 搜诺思公司 System and method for selective wake word detection using a neural network model


Also Published As

Publication number Publication date
US20250218441A1 (en) 2025-07-03
TW202526912A (en) 2025-07-01

Similar Documents

Publication Publication Date Title
US11676583B2 (en) System and method for activation of voice interactive services based on user state
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
Ntalampiras et al. Modeling the temporal evolution of acoustic parameters for speech emotion recognition
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
CN109817197B (en) Singing voice generation method and device, computer equipment and storage medium
JP5949550B2 (en) Speech recognition apparatus, speech recognition method, and program
US20100332222A1 (en) Intelligent classification method of vocal signal
Molina et al. SiPTH: Singing transcription based on hysteresis defined on the pitch-time curve
JPS62231997A (en) Voice recognition system and method
US9646592B2 (en) Audio signal analysis
JP6585112B2 (en) Voice keyword detection apparatus and voice keyword detection method
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
JP6622681B2 (en) Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program
CN103778915A (en) Speech recognition method and mobile terminal
TWI892389B (en) Method and processing circuit for performing wake-up control on voice-controlled device with aid of detecting voice feature of self-defined word
CN113241059B (en) Voice wake-up method, device, equipment and storage medium
Nandwana et al. A new front-end for classification of non-speech sounds: a study on human whistle
CN105895079A (en) Voice data processing method and device
CN112786071A (en) Data annotation method for voice segments of voice interaction scene
Moondra et al. Voice feature extraction method analysis for speaker recognition with degraded human voice
CN120260556A (en) Wake-up control method and processing circuit for voice control device
JP5749186B2 (en) Acoustic model adaptation device, speech recognition device, method and program thereof
TWI299855B (en) Detection method for voice activity endpoint
Mallick et al. Beat detection and automatic annotation of the music of bharatanatyam dance using speech recognition techniques
Vikram et al. Acoustic analysis of misarticulated trills in cleft lip and palate children