TWI858768B - Voice activity detection device and voice activity detection method

Info

Publication number: TWI858768B
Application number: TW112121910A
Authority: TW
Inventors: 朱曉鼎
Original assignee: 大陸商星宸科技股份有限公司
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2024-10-11
Also published as: TW202501469A

Abstract

A voice activity detection device includes an audio processing circuit, a first memory, and a processor. The audio processing circuit processes an audio signal from an audio generator circuit to generate first audio data. The first memory stores the first audio data and a first program code. The processor executes the first program code to operate in a first mode, and is switched from operating in the first mode to operating in a second mode in response to an interrupt signal from the audio generator circuit, in order to determine whether the first audio data stored in the first memory includes a human voice signal, in which the power consumption of the processor operating in the first mode is lower than that of the processor operating in the second mode.

Description

Voice activity detection device and voice activity detection method

The present disclosure relates to a voice activity detection device, and more particularly to a voice activity detection device and a voice activity detection method capable of reducing power consumption.

As technology develops, more and more electronic devices accept voice control. In the prior art, a voice activity detection device typically uses a master processor and a slave processor to detect voice commands. During standby, the master processor sleeps while the slave processor runs, waiting to receive a voice command and wake the master processor accordingly. After waking the master processor, the slave processor enters a sleep state, and the awakened master processor performs subsequent operations according to the voice command. In this approach, the master processor and the slave processor share access to the same memory, so the memory cannot actively switch to a low-power mode. Moreover, at any given time one of the two processors is asleep and performs no other substantive work, which increases system cost and power consumption.

In some embodiments, one object of the present disclosure is to provide a voice activity detection device and a voice activity detection method that reduce power consumption and system cost, so as to remedy the shortcomings of the prior art.

In some embodiments, the voice activity detection device includes an audio processing circuit, a first memory, and a processor. The audio processing circuit processes an audio signal provided by an audio generation circuit to generate first audio data. The first memory stores the first audio data and a first program code. The processor executes the first program code to operate in a first mode and, in response to an interrupt signal provided by the audio generation circuit, switches to operate in a second mode, in which it executes a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal. The power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode.

In some embodiments, the voice activity detection method includes the following operations: generating first audio data according to an audio signal provided by an audio generation circuit, and storing the first audio data in a first memory; executing, by a processor, a first program code in the first memory to operate in a first mode; and switching, by the processor in response to an interrupt signal provided by the audio generation circuit, to operate in a second mode, in which the processor executes a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal. The power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode.

The features, implementation, and effects of the present disclosure are described in detail below with reference to the drawings and preferred embodiments.

100: Voice activity detection device

101: Audio generation circuit

102, 150: Memory

110: Interrupt controller

120: Oscillator

130: Processor

140: Audio processing circuit

141: Analog-to-digital converter

142: Audio codec

160: Memory interface unit

170: Clock generator circuit

171, 172: Phase-locked loop

300: Ring buffer

600: Voice activity detection method

CK1, CK2: Clock signal

CKREF: Reference clock signal

D1, D2, D3: Audio data

P1, P2: Program code

RP: Read pointer

S201~S212, S401~S412, S610, S620, S630: Operation

SA: Audio signal

SD: Digital data

ST: Interrupt signal

WP: Write pointer

t0, t1, t2: Time

FIG. 1 is a schematic diagram of a voice activity detection device according to some embodiments of the present disclosure;
FIG. 2 is an operation flowchart of the voice activity detection device of FIG. 1 according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram of the operation of the memories in FIG. 1 according to some embodiments of the present disclosure;
FIG. 4 is an operation flowchart of the voice activity detection device of FIG. 1 according to some embodiments of the present disclosure;
FIG. 5 is a waveform timing diagram of the audio signal in FIG. 1 according to some embodiments of the present disclosure; and
FIG. 6 is a flowchart of a voice activity detection method according to some embodiments of the present disclosure.

All terms used herein have their ordinary meanings. The definitions of these terms in commonly used dictionaries, and any usage examples of terms discussed herein, are illustrative only and should not limit the scope and meaning of the present disclosure. Likewise, the present disclosure is not limited to the embodiments shown in this specification.

As used herein, "coupled" or "connected" may mean that two or more elements are in direct or indirect physical or electrical contact with each other, or that two or more elements operate or act on each other. As used herein, the term "circuit" may refer to a device formed by at least one transistor and/or at least one active or passive element connected in a certain manner to process signals.

FIG. 1 is a schematic diagram of a voice activity detection (VAD) device 100 according to some embodiments of the present disclosure. The voice activity detection device 100 includes an interrupt controller 110, an oscillator 120, a processor 130, an audio processing circuit 140, a memory 150, a memory interface unit 160, and a clock generator circuit 170. The interrupt controller 110 is coupled to the audio generation circuit 101 to receive an interrupt signal ST generated by the audio generation circuit 101, and requests the processor 130 to perform the corresponding hardware/software processing according to the interrupt signal ST. In some embodiments, the audio generation circuit 101 collects sound in the application environment and generates an audio signal SA. The audio generation circuit 101 provides the audio signal SA to the audio processing circuit 140, and determines whether the audio signal SA meets a predetermined condition in order to generate the interrupt signal ST. For example, when the audio generation circuit 101 determines that the volume of the audio signal SA exceeds a predetermined threshold, the audio generation circuit 101 generates the interrupt signal ST.
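The threshold condition described above can be sketched as follows. This is only an illustrative model: the specification does not say how the audio generation circuit 101 measures volume, so the use of peak absolute amplitude, the function name `should_raise_interrupt`, and its parameters are all assumptions.

```python
def should_raise_interrupt(samples, threshold):
    """Return True when the volume of a block of samples exceeds the threshold.

    Assumption: "volume" is modeled as the peak absolute amplitude of the
    block; the specification leaves the actual measure unspecified.
    """
    peak = max((abs(s) for s in samples), default=0)
    return peak > threshold
```

With this model, a quiet block such as `[1, -2, 3]` against a threshold of 100 would not trigger the interrupt signal ST, while a block containing a sample of magnitude 120 would.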

The oscillator 120 generates a reference clock signal CKREF and transmits it to the clock generator circuit 170. In some embodiments, the oscillator 120 may be, but is not limited to, a quartz crystal oscillator. The clock generator circuit 170 generates, according to the reference clock signal CKREF, the clock signals required by the processor 130, the audio processing circuit 140, and other circuits in the system. For example, the clock generator circuit 170 may include a phase-locked loop (PLL) 171 and a phase-locked loop 172. The phase-locked loop 171 generates a clock signal CK1 according to the reference clock signal CKREF and provides the clock signal CK1 to the processor 130. Similarly, the phase-locked loop 172 generates a clock signal CK2 according to the reference clock signal CKREF and provides the clock signal CK2 to the audio processing circuit 140.

The processor 130 accesses the program code P1 stored in the memory 150 to operate in the first mode and wait for the audio generation circuit 101 to generate the interrupt signal ST. When the audio generation circuit 101 generates the interrupt signal ST, the processor 130 responds by executing the program code P2 in the memory 102 instead, thereby switching from the first mode to the second mode, and determines whether one or more pieces of audio data stored in the memory 150 include a human voice signal. In some embodiments, the power consumption of the processor 130 operating in the first mode is lower than its power consumption when operating in the second mode. In other words, the first mode may be a low-power mode (also referred to as a wait-for-interrupt mode). When operating in the first mode, the processor 130 runs at a lower operating speed while waiting for the interrupt signal ST, which saves power. Conversely, when the processor 130 switches to the second mode, it switches to receiving the clock signal CK1, which has a higher frequency, so the processor 130 runs at a higher operating speed and can detect more quickly whether there is a voice control command to be processed.
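The two operating modes of the processor 130 can be pictured as a small state machine. The following is a minimal sketch for illustration only; the class and method names are hypothetical and do not come from the specification.

```python
from enum import Enum


class Mode(Enum):
    FIRST = 1   # low power: runs P1 from memory 150, waits for interrupt ST
    SECOND = 2  # higher clock: runs P2 from memory 102, performs detection


class Processor:
    """Toy model of the two operating modes of processor 130 (hypothetical)."""

    def __init__(self):
        self.mode = Mode.FIRST
        self.running = "P1"

    def on_interrupt(self):
        """Interrupt ST arrives: switch to the second mode and run P2."""
        self.mode = Mode.SECOND
        self.running = "P2"

    def back_to_wait(self):
        """Return to the low-power first mode and run P1 again."""
        self.mode = Mode.FIRST
        self.running = "P1"
```

The model captures only which program code runs in which mode; clock switching and power figures are outside its scope.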

In some embodiments, the memory 150 may be, but is not limited to, a static random access memory (SRAM). The processor 130 is coupled to the memory 102 via the memory interface unit 160. In some embodiments, the memory 102 may be, but is not limited to, a dynamic random access memory (DRAM). The audio processing circuit 140 processes the audio signal SA to generate audio data (for example, the audio data D1~D3 in FIG. 3). For example, the audio processing circuit 140 may include an analog-to-digital converter 141 and an audio codec 142. The analog-to-digital converter 141 converts the audio signal SA according to the clock signal CK2 to generate digital data SD. The audio codec 142 processes the digital data SD to generate the audio data.

FIG. 2 is an operation flowchart of the voice activity detection device 100 of FIG. 1 according to some embodiments of the present disclosure. In some embodiments, the operations of FIG. 2 correspond to a loop mode. In operation S201, the audio processing circuit 140 and the clock generator circuit 170 are initialized. For example, after the voice activity detection device 100 starts up, the processor 130 executes the relevant system software and/or firmware to perform initial settings for various parameters of the audio processing circuit 140 and the clock generator circuit 170 (for example, but not limited to, clock frequency, sampling rate, gain, and codec format). In operation S202, part of the circuitry (excluding the phase-locked loop 172) is set to operate in a low-speed state. For example, after initialization is complete, the aforementioned software and/or firmware may further turn off the phase-locked loop 171 and switch the clock signal received by the processor 130 from the clock signal CK1 to another clock signal generated based on the reference clock signal CKREF (not shown in FIG. 1; its frequency may be lower than that of the reference clock signal CKREF). At the same time, the aforementioned software and/or firmware may switch the clock signal (not shown) received by the memory interface unit 160 to the reference clock signal CKREF, so that the memory interface unit 160 also operates in a low-speed state. On the other hand, since the phase-locked loop 172 is not turned off, it continues to provide the clock signal CK2 to the audio processing circuit 140 according to the reference clock signal CKREF.

In operation S203, the memory 102 is controlled to operate in a third mode, and the program code P1 in the memory 150 is executed. In operation S204, the processor 130 operates in the first mode. For example, as described above, the processor 130 executes the program code P1 in the memory 150 to operate in the first mode and wait for the audio generation circuit 101 to generate the interrupt signal ST. Meanwhile, the aforementioned software and/or firmware controls the memory 102 to operate in the third mode. In some embodiments, the third mode may be a low-power mode of the memory 102. For example, if the memory 102 is a dynamic random access memory, the third mode may be a self-refresh mode. While the processor 130 executes the program code P1 in the memory 150 to operate in the first mode, the processor 130 does not access the memory 102, or needs to access it relatively rarely. Under this condition, the operating speed of part of the circuitry (for example, the memory 102 and the memory interface unit 160) can be reduced to save overall power.

In operation S205, the audio generation circuit 101 is enabled to start generating the audio signal SA, and audio data is written to the memory 150 via the audio processing circuit 140. As described above, while the processor 130 executes the program code P1 in the memory 150, it operates in the first mode and waits for the audio generation circuit 101 to issue the interrupt signal ST. After the audio generation circuit 101 is enabled by the aforementioned software and/or firmware, it starts collecting ambient sound to generate the audio signal SA, and the audio data corresponding to the audio signal SA is stored in the memory 150 via the audio processing circuit 140.

In operation S206, the audio generation circuit 101 issues the interrupt signal ST, and the processor 130, in response to the interrupt signal ST, executes the program code P2 in the memory 102 to switch to the second mode. In operation S207, the clock generator circuit 170 is configured so that the aforementioned part of the circuitry operates at a higher frequency, and the memory 102 is controlled to operate in a fourth mode. For example, when the audio generation circuit 101 determines that the volume of the audio signal SA exceeds the predetermined threshold, it issues the interrupt signal ST to the interrupt controller 110, so that the processor 130 responds by executing the program code P2 in the memory 102 to operate in the second mode. Under this condition, the aforementioned software and/or firmware configures the clock generator circuit 170 accordingly, so that the phase-locked loops 171 and 172 generate the clock signals CK1 and CK2 with higher frequencies, and the memory interface unit 160 switches to operating on a higher-frequency clock signal. At the same time, the aforementioned software and/or firmware controls the memory 102 to operate in the fourth mode, which may be an active mode with a faster operating speed. In other words, the power consumption of the memory 102 operating in the third mode is lower than its power consumption when operating in the fourth mode.
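The standby configuration of operations S202~S204 and the wake-up configuration of operations S206~S207 can be summarized as two settings of one system state. The sketch below is a simplified model for illustration; the field names and string values are assumptions, and no real register interface is implied.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    pll171_on: bool = True       # PLL 171 supplies CK1 to processor 130
    pll172_on: bool = True       # PLL 172 supplies CK2 to audio circuit 140
    cpu_clock: str = "CK1"       # clock source feeding processor 130
    mem_if_clock: str = "high"   # clock of memory interface unit 160
    mem102_mode: str = "active"  # "self-refresh" (third) or "active" (fourth)
    code: str = "P2"             # program code the processor executes


def enter_standby(state: SystemState) -> SystemState:
    """Loop-mode standby (S202~S204): slow clocks, memory 102 in low power."""
    state.pll171_on = False          # PLL 171 is turned off
    state.cpu_clock = "low"          # slower clock derived from CKREF
    state.mem_if_clock = "CKREF"     # memory interface runs on CKREF
    state.mem102_mode = "self-refresh"
    state.code = "P1"                # P1 is fetched from memory 150
    return state                     # PLL 172 stays on in the loop mode


def wake(state: SystemState) -> SystemState:
    """Wake-up on interrupt ST (S206~S207): fast clocks, memory 102 active."""
    state.pll171_on = True
    state.pll172_on = True
    state.cpu_clock = "CK1"
    state.mem_if_clock = "high"
    state.mem102_mode = "active"
    state.code = "P2"                # P2 is fetched from memory 102
    return state
```

The model records only the end state of each transition; it does not attempt to capture the order in which hardware steps occur within an operation.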

In operation S208, it is determined whether the audio data in the memory 150 includes a human voice signal. If so, operation S209 is performed; otherwise, operation S202 is performed. In operation S209, the memory 150 is controlled to transfer the audio data to the memory 102, and the audio processing circuit 140 is controlled to store subsequently generated audio data in the memory 102. For example, the program code P2 in the memory 102 includes a signal processing algorithm for recognizing a human voice signal. The processor 130 executes the program code P2 and determines, according to the algorithm, whether the audio data in the memory 150 includes a human voice signal. If it does, the processor 130 controls the memory 150 to transfer the audio data to the memory 102, and subsequent audio data is no longer stored in the memory 150. After the audio data has been transferred to the memory 102, the processor 130 can release the temporary storage space in the memory 150 that previously held the audio data for use by other circuits in the system. If the audio data does not include a human voice signal, operation S202 is performed again to wait for the next voice command.
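The branch taken in operations S208 and S209 can be sketched as a single function. The specification does not disclose the voice recognition algorithm in the program code P2, so `contains_voice` below is a hypothetical callable supplied by the caller, and the two lists merely stand in for the memory 150 and the memory 102.

```python
def handle_wakeup(sram_audio, dram_audio, contains_voice):
    """Model of S208/S209; returns the label of the next operation.

    sram_audio: list of audio chunks held in memory 150 (stand-in)
    dram_audio: list of audio chunks held in memory 102 (stand-in)
    contains_voice: hypothetical detector, called with the list of chunks
    """
    if not contains_voice(sram_audio):
        return "S202"               # no voice: go back to the low-speed state
    dram_audio.extend(sram_audio)   # S209: transfer audio to memory 102
    sram_audio.clear()              # space in memory 150 is released
    return "S210"                   # continue with keyword detection
```

Passing a trivial detector such as `lambda chunks: "voice" in chunks` is enough to exercise both branches.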

In operation S210, it is determined whether the audio data in the memory 102 includes a keyword message. If so, operation S211 is performed; otherwise, operation S212 is performed. In operation S211, subsequent processing is performed according to the keyword message. In operation S212, detection of the keyword message continues until the audio generation circuit 101 determines that the volume of the subsequently received audio signal SA no longer exceeds the predetermined threshold.

For example, the program code P2 in the memory 102 further includes a signal processing algorithm for recognizing the keyword message. The processor 130 executes the program code P2 and determines, according to the algorithm, whether the audio data in the memory 102 includes the keyword message. In some embodiments, the keyword message may be, but is not limited to, a voice command for controlling a specific device to perform a preset operation. If the processor 130 determines that the audio data in the memory 102 includes the keyword message, the processor 130 performs subsequent processing according to the keyword message to control the specific device to perform the preset operation. Otherwise, the processor 130 keeps checking the audio data subsequently stored in the memory 102 for the keyword message until the audio generation circuit 101 determines that the volume of the subsequently received audio signal SA no longer exceeds the predetermined threshold (for example, because the user has stopped speaking voice commands).
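The keyword loop of operations S210~S212 can be sketched as below. Both callables are hypothetical placeholders: `find_keyword` stands in for the undisclosed keyword algorithm in the program code P2, and `is_loud` stands in for the volume check that the specification assigns to the audio generation circuit 101.

```python
def keyword_loop(chunks, find_keyword, is_loud):
    """Scan successive audio chunks until a keyword is found or volume drops.

    Returns the keyword (S211 would then act on it), or None when the
    volume falls to the threshold or below before any keyword appears.
    """
    heard = []
    for chunk in chunks:
        if not is_loud(chunk):
            return None               # volume no longer exceeds the threshold
        heard.append(chunk)           # audio accumulates in memory 102
        keyword = find_keyword(heard)
        if keyword is not None:
            return keyword            # S211: keyword message detected
    return None
```

Because `find_keyword` sees everything heard so far, a keyword that spans several chunks can still be recognized once the last chunk arrives.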

In the above operations, the processor 130 executes the program code P1 in the memory 150 to operate in the first mode, reducing power consumption while waiting for the interrupt signal ST. The operations performed by the processor 130 in the first mode are therefore relatively simple. The processor 130 executes the program code P2 in the memory 102 to operate in the second mode, running multiple algorithms such as human voice recognition and keyword recognition at a higher processing speed. Consequently, the code size and/or complexity of the program code P1 is lower than that of the program code P2. In some embodiments, since the cost of the memory 102 is higher than the cost of the memory 150, the above arrangement reduces the required capacity of the memory 102 and thus the overall cost. In addition, after switching to the memory 102, the memory 150 can release the temporary storage space that originally held the audio data, so other circuits in the system can time-share the memory 150. This avoids the need for additional memory and reduces extra power consumption and/or system cost.

FIG. 3 is a schematic diagram of the operation of the memory 150 and the memory 102 in FIG. 1 according to some embodiments of the present disclosure. As described above, the memory 150 may be a static random access memory. The processor 130 configures the memory 150 to allocate a temporary storage space, which may be arranged as a ring buffer 300. The processor 130 accesses the ring buffer 300 according to a write pointer WP and a read pointer RP. Before operation S209 is performed, the audio data D1 and the audio data D2, which the audio processing circuit 140 generates from the audio signal SA of the audio generation circuit 101, are stored in the ring buffer 300. In operation S209, the processor 130 reads the audio data D1 according to the read pointer RP and determines that the audio data D1 includes a human voice signal. It therefore transfers the audio data D1 and the audio data D2 to the memory 102 (the write pointer WP corresponds to the end position of the audio data D2, which indicates that the audio data D2 is also valid audio data), and controls the audio processing circuit 140 to store the subsequently generated audio data D3 in the memory 102. That is, the audio processing circuit 140 generates the audio data D3 from the audio signal SA after generating the audio data D1 and D2. After the audio data D1 and D2 are transferred to the memory 102, the corresponding temporary storage space in the ring buffer 300 can be released. The processor 130 then determines, based on the consecutive audio data D1, D2, and D3 in the memory 102, whether a human voice signal and a keyword message are present. In this way, the processor 130 can merge the audio data D1, D2, and D3 and use the merged continuous audio content to determine more completely whether the user has issued a voice command.
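A ring buffer driven by a write pointer WP and a read pointer RP, as used for the temporary storage space in the memory 150, can be sketched as follows. This is a generic illustration rather than the disclosed implementation; in particular, overwriting the oldest entry when the buffer is full is an assumption, since the specification does not say what happens in that case.

```python
class RingBuffer:
    """Fixed-capacity ring buffer with write pointer WP and read pointer RP."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.wp = 0      # write pointer WP: next slot to write
        self.rp = 0      # read pointer RP: oldest valid item
        self.count = 0   # number of valid items between RP and WP

    def write(self, item):
        """Store one item at WP; when full, the oldest item is overwritten."""
        self.data[self.wp] = item
        self.wp = (self.wp + 1) % self.capacity
        if self.count == self.capacity:
            self.rp = (self.rp + 1) % self.capacity  # oldest item was lost
        else:
            self.count += 1

    def drain(self):
        """Return every valid item from RP up to WP and release the space."""
        out = []
        while self.count:
            out.append(self.data[self.rp])
            self.data[self.rp] = None
            self.rp = (self.rp + 1) % self.capacity
            self.count -= 1
        return out
```

In terms of FIG. 3, writing D1 and D2 and then calling `drain()` yields both items in order for transfer to the memory 102, after which D3 can be appended directly on the memory 102 side.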

FIG. 4 is an operation flowchart of the voice activity detection device 100 of FIG. 1 according to some embodiments of the present disclosure. In some embodiments, the operations of FIG. 4 correspond to an interrupt mode, whose power consumption can be even lower than that of the loop mode of FIG. 2. The interrupt mode includes operations S401~S412, of which operations S403~S404, S406, and S408~S412 are the same as operations S203~S204, S206, and S208~S212 of FIG. 2, respectively, and are not described again here. The following mainly describes the parts that differ from the operations of FIG. 2.

In operation S401, the analog-to-digital converter 141 and the clock generator circuit 170 are initialized, and the audio codec 142 is turned off. Unlike operation S201, in this example, right after the voice activity detection device 100 starts up, the processor 130 executes the relevant system software and/or firmware to initialize only the analog-to-digital converter 141 and the clock generator circuit 170, while the audio codec 142 remains off. In operation S402, part of the circuitry (including the phase-locked loop 171 and the phase-locked loop 172) is set to operate in a low-speed state. Unlike operation S202, in this example, the aforementioned software and/or firmware additionally turns off the phase-locked loop 172. Under this condition, the clock generator circuit 170 does not provide the clock signal CK2 to the audio processing circuit 140.

In operation S405, the audio generation circuit 101 is enabled to start generating the audio signal SA. Unlike operation S205, in this example, since the audio processing circuit 140 does not receive the clock signal CK2 and the audio codec 142 has not yet been enabled, the audio processing circuit 140 at this stage neither generates audio data from the audio signal SA nor stores such audio data in the memory 150.

In operation S407, the clock generator circuit 170 is configured, and the phase-locked loop 172 and the audio codec 142 are enabled, so that the aforementioned part of the circuitry operates at a higher frequency, and the memory 102 is controlled to operate in the fourth mode. Unlike FIG. 2, in this example, when the audio generation circuit 101 determines that the volume of the audio signal SA exceeds the predetermined threshold, the aforementioned software/firmware configures the clock generator circuit 170 and enables all of the phase-locked loops (for example, the phase-locked loops 171 and 172) as well as the audio codec 142, so as to start generating the clock signals CK1 and CK2 with higher frequencies. The clock generator circuit 170 thus starts providing the clock signal CK2 to the analog-to-digital converter 141, so that the audio codec 142 can start generating the corresponding audio data and storing it in the memory 150. Meanwhile, the processor 130 executes the program code P2 in the memory 102 to determine whether the audio data in the memory 150 includes a human voice signal.

Accordingly, in the loop mode of FIG. 2, the audio processing circuit 140 stores audio data in the memory 150 while the processor 130 operates in the first mode. In the interrupt mode of FIG. 4, by contrast, the clock generator circuit 170 does not generate the clock signal CK2 while the processor 130 operates in the first mode, so at least part of the audio processing circuit 140 is disabled and no audio data is written to the memory 150. After receiving the interrupt signal ST, the processor 130 switches to operating in the second mode, and the audio processing circuit 140 starts generating audio data and storing it in the memory 150. The interrupt mode of FIG. 4 therefore saves more power, whereas the loop mode of FIG. 2 collects more complete audio data, as described below with reference to FIG. 5.

FIG. 5 is a waveform timing diagram of the audio signal SA in FIG. 1 according to some embodiments of the present disclosure, and illustrates the difference between the loop mode of FIG. 2 and the interrupt mode of FIG. 4. In the loop mode, the audio codec 142 and the phase-locked loop 172 are enabled, so the audio processing circuit 140 can start storing the corresponding audio data in the memory 150 at time t0. When the audio generation circuit 101 determines that the volume of the audio signal SA is greater than the predetermined threshold, it issues the interrupt signal ST, so that the processor 130 switches to operating in the second mode at time t1 and starts determining whether the audio data contains a human voice signal and a keyword message, until the volume of the audio signal SA starts to remain below the predetermined threshold at time t2. In the interrupt mode, by contrast, the audio processing circuit 140 has not started storing audio data in the memory 150 at time t0, and starts doing so only after the interrupt signal ST is received at time t1. In other words, in the loop mode the voice activity detection device 100 can store more complete audio data for detection and can therefore achieve higher voice detection accuracy. In the interrupt mode, the audio processing circuit 140, the clock generator circuit 170, and the memory 150 consume less power before time t1, so the voice activity detection device 100 operating in the interrupt mode can have lower power consumption.
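The difference in captured audio between the two modes can be illustrated with a short sketch. It is a simplified model under stated assumptions: the audio signal is treated as a list of per-frame volume values, storing in the interrupt mode is assumed to begin with the frame that triggers the interrupt signal ST, and the memory 150 is treated as unbounded. The function name `stored_frames` and the mode strings are hypothetical.

```python
from typing import List


def stored_frames(volumes: List[float], threshold: float, mode: str) -> List[int]:
    """Return the indices of the frames written to memory 150.

    mode "loop":      storing starts at frame 0 (time t0).
    mode "interrupt": storing starts at the first frame whose volume
                      exceeds the threshold (time t1, when ST is issued).
    """
    if mode not in ("loop", "interrupt"):
        raise ValueError("mode must be 'loop' or 'interrupt'")

    if mode == "loop":
        # The audio path is already running, so every frame is kept.
        return list(range(len(volumes)))

    # Interrupt mode: find t1, the first frame above the threshold.
    t1 = next((i for i, v in enumerate(volumes) if v > threshold), None)
    if t1 is None:
        # No interrupt was ever issued, so nothing is stored.
        return []
    return list(range(t1, len(volumes)))
```

For volumes `[0.1, 0.2, 0.9, 0.8, 0.1]` and a threshold of `0.5`, the loop mode keeps frames 0 through 4 while the interrupt mode keeps only frames 2 through 4. The two frames before t1 are what the loop mode gains in completeness, and the time spent not capturing them is where the interrupt mode saves power.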

FIG. 6 is a flowchart of a voice activity detection method 600 according to some embodiments of the present disclosure. In operation S610, first audio data is generated according to an audio signal provided by an audio generation circuit, and the first audio data is stored in a first memory. In operation S620, a processor executes a first program code in the first memory to operate in a first mode. In operation S630, the processor switches to operating in a second mode in response to an interrupt signal provided by the audio generation circuit, and executes a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal, wherein the power consumption of the processor operating in the first mode is lower than that of the processor operating in the second mode.
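Operations S610 to S630 can be walked through with a small simulation. This is a hedged sketch, not the claimed implementation: `run_method_600`, its parameters, and the returned dictionary are hypothetical, the interrupt is modeled as a frame index supplied by the caller, and `contains_voice` is a stand-in for the second program code, whose detection algorithm is not modeled here.

```python
from typing import Callable, Dict, List, Optional, Sequence


def run_method_600(
    audio_frames: Sequence[float],
    interrupt_index: Optional[int],
    contains_voice: Callable[[List[float]], bool],
) -> Dict[str, object]:
    """Simulate S610 to S630 on a sequence of audio frames."""
    first_memory: List[float] = []         # holds the first audio data
    processor_mode = "first"               # S620: lower-power first mode
    voice_detected: Optional[bool] = None  # stays None until S630 runs

    for i, frame in enumerate(audio_frames):
        # S610: generate the first audio data and store it in the first memory.
        first_memory.append(frame)
        if interrupt_index is not None and i == interrupt_index:
            # S630: the interrupt switches the processor to the second mode,
            # where the (stand-in) second program code checks the buffer.
            processor_mode = "second"
            voice_detected = contains_voice(first_memory)
            break

    return {
        "mode": processor_mode,
        "voice": voice_detected,
        "buffered": len(first_memory),
    }
```

With a toy detector such as `lambda buf: max(buf) > 0.5`, frames `[0.1, 0.2, 0.9, 0.3]`, and an interrupt at index 2, the processor ends in the second mode with three buffered frames and a positive result. Without an interrupt, it stays in the first mode and no detection runs.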

The operations of the voice activity detection method 600 can be understood with reference to the embodiments described above and are not repeated here. The operations in FIG. 2, FIG. 4, and/or FIG. 6 are examples only and need not be performed in the order shown. Without departing from the manner of operation and scope of the embodiments of the present disclosure, the operations in FIG. 2, FIG. 4, and/or FIG. 6 may be added to, replaced, omitted, or performed in a different order as appropriate (for example, simultaneously or partially simultaneously).

In summary, the voice activity detection device and voice activity detection method in some embodiments of the present disclosure can perform voice activity detection without using an additional slave processor, and can further reduce power consumption.

Although embodiments of the present disclosure are described above, they are not intended to limit the present disclosure. A person having ordinary skill in the art may vary the technical features of the present disclosure based on its explicit or implicit content, and all such variations may fall within the scope of patent protection sought. In other words, the scope of patent protection of the present disclosure is defined by the claims of this specification.

100: Voice activity detection device

101: Audio generation circuit

102, 150: Memory

110: Interrupt controller

120: Oscillator

130: Processor

140: Audio processing circuit

141: Analog-to-digital converter

142: Audio codec

160: Memory interface unit

170: Clock generator circuit

171, 172: Phase-locked loop

CK1, CK2: Clock signal

CKREF: Reference clock signal

P1, P2: Program code

SA: Audio signal

SD: Digital data

ST: Interrupt signal

Claims

A voice activity detection device, comprising: an audio processing circuit, processing an audio signal provided by an audio generating circuit to generate first audio data; a first memory, storing the first audio data and a first program code; and a processor, executing the first program code to operate in a first mode, and switching to operating in a second mode in response to an interrupt signal provided by the audio generating circuit, to execute a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal, wherein the power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode, the processor further controls the second memory to switch from operating in a third mode to operating in a fourth mode in response to the interrupt signal, and the power consumption of the second memory operating in the third mode is lower than the power consumption of the second memory operating in the fourth mode.

The voice activity detection device of claim 1, wherein when the processor determines that the first audio data includes the human voice signal, the processor further controls the first memory to transfer the first audio data to the second memory, and the audio processing circuit further stores second audio data in the second memory, wherein the audio processing circuit generates the second audio data according to the audio signal after generating the first audio data.

The voice activity detection device of claim 2, wherein the processor further determines whether the first audio data and the second audio data in the second memory contain a keyword message.

The voice activity detection device of claim 2, wherein after the first memory transfers the first audio data to the second memory, the first memory further releases a temporary storage space in the first memory previously used to store the first audio data.

The voice activity detection device of claim 1, wherein the second memory is a dynamic random access memory, the third mode is a self-refresh mode, and the fourth mode is an active mode.

The voice activity detection device of claim 1, wherein when the processor operates in the first mode, the audio processing circuit stores the first audio data in the first memory.

The voice activity detection device of claim 1, wherein when the processor operates in the first mode, the audio processing circuit does not store the first audio data in the first memory.

The voice activity detection device of claim 1, wherein the audio processing circuit comprises: an analog-to-digital converter, converting the audio signal into digital data; and an audio codec, processing the digital data to generate the first audio data.

The voice activity detection device of claim 1, further comprising: a clock generator circuit, generating a first clock signal according to a reference clock signal, wherein when the processor operates in the first mode, the clock generator circuit generates the first clock signal and provides it to the audio processing circuit, and the audio processing circuit processes the audio signal according to the first clock signal to generate the first audio data.

The voice activity detection device of claim 1, further comprising: a clock generator circuit, generating a first clock signal according to a reference clock signal, wherein when the processor operates in the first mode, the clock generator circuit does not generate the first clock signal, so that the audio processing circuit does not generate the first audio data.

A voice activity detection device, comprising: an audio processing circuit, processing an audio signal provided by an audio generating circuit to generate first audio data; a first memory, storing the first audio data and a first program code; and a processor, executing the first program code to operate in a first mode, and switching to operating in a second mode in response to an interrupt signal provided by the audio generating circuit, to execute a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal, wherein the power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode, and the code size of the first program code is smaller than the code size of the second program code.

A voice activity detection method, comprising: generating first audio data according to an audio signal provided by an audio generating circuit, and storing the first audio data in a first memory; executing, by a processor, a first program code in the first memory to operate in a first mode; switching, by the processor, to operating in a second mode in response to an interrupt signal provided by the audio generating circuit, to execute a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal, wherein the power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode; and controlling, by the processor, the second memory to switch from operating in a third mode to operating in a fourth mode in response to the interrupt signal, wherein the power consumption of the second memory operating in the third mode is lower than the power consumption of the second memory operating in the fourth mode.

A voice activity detection method, comprising: generating first audio data according to an audio signal provided by an audio generating circuit, and storing the first audio data in a first memory; executing, by a processor, a first program code in the first memory to operate in a first mode; and switching, by the processor, to operating in a second mode in response to an interrupt signal provided by the audio generating circuit, to execute a second program code in a second memory to determine whether the first audio data stored in the first memory includes a human voice signal, wherein the power consumption of the processor operating in the first mode is lower than the power consumption of the processor operating in the second mode, and the code size of the first program code is smaller than the code size of the second program code.