
TWI902105B - Speech enhancement system - Google Patents

Speech enhancement system

Info

Publication number
TWI902105B
TWI902105B
Authority
TW
Taiwan
Prior art keywords
audio
feature
domain
message
actual
Prior art date
Application number
TW112151245A
Other languages
Chinese (zh)
Other versions
TW202527576A (en)
Inventor
秦允求
蔡連枝
黃兆華
Original Assignee
仁寶電腦工業股份有限公司
Priority date
Filing date
Publication date
Application filed by 仁寶電腦工業股份有限公司
Priority to TW112151245A
Publication of TW202527576A
Application granted
Publication of TWI902105B

Landscapes

  • Mobile Radio Communication Systems (AREA)
  • Telephone Function (AREA)

Abstract

The present invention discloses a speech enhancement system. A processing unit in a mobile device executes a preprocessing procedure on a first reference audio received by a wireless audio device and a second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio are multiplied by the corresponding feature mask. When the user communicates through the wireless audio device, the voice of the person speaking with the user serves as the actual audio. The processing unit of the mobile device enhances that voice so the user hears it clearly, which improves the clarity of the audio picked up by the wireless audio device.

Description

Speech enhancement system

The present disclosure relates to a speech enhancement system, and more particularly to a speech enhancement system with improved clarity.

Typically, the sound pickup device of a hearing aid is located on the hearing aid body. When a user wearing the hearing aid communicates, both the voice of the other party and the user's own voice are picked up, processed, and played back by that sound pickup device. Because the other party's voice travels a greater distance, its volume is attenuated by the time it reaches the hearing aid, and it is easily masked by ambient noise in places such as busy roadsides, shopping malls, restaurants, and train stations. The user's own voice, by contrast, is close to the hearing aid's sound pickup device and is reinforced by bone conduction, so it usually reaches the device at a higher volume and quality than the other party's voice. As a result, the other party's voice arrives at the hearing aid's sound pickup device at a lower volume and with lower clarity.

Therefore, there is an urgent need to develop a speech enhancement system that overcomes the above shortcomings.

An object of the present disclosure is to provide a speech enhancement system in which the processing unit of the mobile device performs a preprocessing procedure on the first reference audio received by the wireless audio device and the second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio can be multiplied by the corresponding feature mask. When the user wears the wireless audio device to communicate, the voice of the other party is the actual audio, and the processing unit of the mobile device can enhance that voice so the user hears it clearly, thereby improving the clarity of the audio picked up by the wireless audio device.

To achieve the above object, a broader embodiment of the present disclosure provides a speech enhancement system comprising a wireless audio device and a mobile device. The wireless audio device includes a first sound pickup unit for receiving a first reference audio. The mobile device performs data transmission with the wireless audio device to receive the first reference audio, and includes a second sound pickup unit and a processing unit. The second sound pickup unit receives a second reference audio and an actual audio. The processing unit executes a preprocessing procedure and has a memory unit that stores a feature sound model. The preprocessing procedure includes the following steps: converting the first reference audio and the second reference audio into a first audio embedding feature and a second audio embedding feature, and loading the first audio embedding feature, the second audio embedding feature, and the actual audio into the feature sound model to obtain a first feature mask and a second feature mask. The processing unit converts the actual audio from the time domain to the frequency domain and multiplies the features of the actual audio corresponding to the second reference audio by the second feature mask to provide a first output audio.

Some typical embodiments embodying the features and advantages of the present disclosure are described in detail below. It should be understood that the present disclosure can be varied in many ways without departing from its scope, and that the descriptions and drawings are illustrative in nature and are not intended to limit the disclosure.

Please refer to Figure 1 and Figure 2, in which Figure 1 is a system architecture diagram of the speech enhancement system of the present disclosure, and Figure 2 is a system architecture diagram of some components of the speech enhancement system shown in Figure 1. As shown, the speech enhancement system 1 includes a first wireless audio device 2, a second wireless audio device 3, and a mobile device 4. The first wireless audio device 2 may be, but is not limited to, a hearing aid, wireless earphones, VR/AR/MR glasses, a smartwatch, or a smartphone, and includes a first sound pickup unit 21 and a first playback unit 22. The first sound pickup unit 21 receives a first reference audio, where the first reference audio is in time-domain form and is provided by a sound source near the first sound pickup unit 21. The first playback unit 22 plays a first output audio and/or a second output audio.

The second wireless audio device 3 may be, but is not limited to, a laptop, a tablet, or a smartphone. In one embodiment, the mobile device 4 may be integrated into the second wireless audio device 3, while in another embodiment the mobile device 4 may be separate from it. In the present disclosure, the mobile device 4 is described as being separate from the second wireless audio device 3, as shown in Figure 1. The mobile device 4 exchanges data with the first wireless audio device 2 and the second wireless audio device 3 through a wireless or wired connection, for example Bluetooth, Wi-Fi, 2.4 GHz, or USB 2.0/3.0, so as to receive the first reference audio picked up by the first wireless audio device 2. The mobile device 4 includes a second sound pickup unit 41 and a processing unit 42. The second sound pickup unit 41 receives a second reference audio and an actual audio, both in time-domain form and both provided by the same sound source (for example, a human voice) near the second sound pickup unit 41.

The processing unit 42 executes a preprocessing procedure that includes the following steps. First, the first reference audio and the second reference audio are converted into a first audio embedding feature and a second audio embedding feature, respectively. Next, the first audio embedding feature, the second audio embedding feature, and the actual audio are loaded into a feature sound model stored in a memory unit of the processing unit 42 to obtain a first feature mask and a second feature mask. The internal circuit structure and detailed functions of the processing unit 42 are further described below. In addition, the processing unit 42 can compare the first or second audio embedding feature obtained in the preprocessing procedure with a newly received voice embedding feature by their Euclidean distance to determine whether the two voices come from the same person.
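The patent specifies the comparison metric (Euclidean distance) but not the decision threshold. The minimal sketch below assumes 64-dimensional embeddings as produced by the encoder described later; the function name and threshold value are illustrative only.

```python
import numpy as np

def is_same_speaker(reference_embedding: np.ndarray,
                    new_embedding: np.ndarray,
                    threshold: float = 1.0) -> bool:
    """Return True when two voice embeddings are close enough (in Euclidean
    distance) to be treated as coming from the same person.

    The patent only states that Euclidean distance is compared; the threshold
    here is a hypothetical tuning parameter.
    """
    distance = float(np.linalg.norm(reference_embedding - new_embedding))
    return distance < threshold
```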

As shown in Figure 2, the processing unit 42 includes a time-domain/frequency-domain converter 421, a frequency discriminator 422, a matrix reshaper 423, an encoder 424, a memory unit 425, a mixer 426, and a frequency-domain/time-domain converter 427. The time-domain/frequency-domain converter 421 implements a short-time Fourier transform (STFT); it receives the first reference audio from the first sound pickup unit 21 and the second reference audio from the second sound pickup unit 41 and converts both from the time domain to the frequency domain. The frequency discriminator 422 computes Mel-scale frequency cepstral coefficients (MFCC) and uses the frequency-domain first and second reference audio to identify a first audio feature in the first reference audio and a second audio feature in the second reference audio, where each audio feature is a quantifiable attribute of the information carried in its reference audio. The matrix reshaper 423 converts the first audio feature and the second audio feature into a first matrix and a second matrix, respectively, that is, it quantifies the audio features in matrix form. The encoder 424 encodes the first matrix and the second matrix to output a first audio embedding feature and a second audio embedding feature, respectively, each of dimension 64. The memory unit 425 stores the feature sound model; when the first audio embedding feature, the second audio embedding feature, and the actual audio are loaded into the feature sound model, the first feature mask and the second feature mask are obtained.
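A minimal sketch of this front end is shown below, assuming librosa for the STFT/MFCC stages and PyTorch for the encoder. The frame sizes, number of MFCCs, and encoder layers are illustrative assumptions; the patent only fixes the embedding dimension at 64, and the names `EmbeddingEncoder` and `extract_embedding` are hypothetical.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

class EmbeddingEncoder(nn.Module):
    """Hypothetical encoder 424: maps an MFCC feature matrix to a 64-dim embedding."""

    def __init__(self, n_mfcc: int = 40, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, feature_matrix: torch.Tensor) -> torch.Tensor:
        # feature_matrix: (frames, n_mfcc); average over frames -> (embed_dim,)
        return self.net(feature_matrix).mean(dim=0)

def extract_embedding(reference_audio: np.ndarray, sr: int,
                      encoder: EmbeddingEncoder) -> torch.Tensor:
    """Reference audio (time domain) -> STFT -> MFCC -> matrix -> embedding."""
    # Time-domain/frequency-domain converter 421: short-time Fourier transform.
    spec = np.abs(librosa.stft(reference_audio, n_fft=512, hop_length=128))
    # Frequency discriminator 422: Mel-frequency cepstral coefficients.
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=40)
    # Matrix reshaper 423: arrange the features as (frames, n_mfcc).
    feature_matrix = torch.tensor(mfcc.T, dtype=torch.float32)
    # Encoder 424: produce the 64-dimensional audio embedding feature.
    return encoder(feature_matrix)
```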

In addition, the feature sound model in the memory unit 425 can be continuously trained on the first and second audio embedding features to improve its performance. In this embodiment, the feature sound model of the memory unit 425 includes one gated recurrent unit (GRU) layer and two long short-term memory (LSTM) layers.
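The patent fixes the layer types (one GRU layer and two LSTM layers) but not their widths, inputs, or output activation. The PyTorch sketch below fills those in with illustrative assumptions: each spectrogram frame of the actual audio is concatenated with the 64-dimensional embedding of the target reference audio, and a sigmoid output keeps mask values in [0, 1] for the later spectrogram multiplication.

```python
import torch
import torch.nn as nn

class FeatureSoundModel(nn.Module):
    """Hypothetical feature sound model: one GRU layer followed by two LSTM layers.

    Input:  magnitude spectrogram of the actual audio, shape (batch, frames, bins),
            plus a 64-dim audio embedding feature of the reference audio.
    Output: a feature mask with the same shape as the spectrogram.
    """

    def __init__(self, freq_bins: int = 257, embed_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(freq_bins + embed_dim, hidden, batch_first=True)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, freq_bins)

    def forward(self, spec: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # Repeat the embedding for every frame and append it to each frame.
        frames = spec.shape[1]
        tiled = embedding.unsqueeze(1).expand(-1, frames, -1)
        x = torch.cat([spec, tiled], dim=-1)
        x, _ = self.gru(x)
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        # Sigmoid keeps every mask value in [0, 1].
        return torch.sigmoid(self.out(x))
```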

Please refer to Figure 3 together with Figures 1 and 2. Figure 3 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the second reference audio. Figure 3 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form. The time-domain/frequency-domain converter 421 converts the actual audio from the time domain to the frequency domain. The mixer 426 receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The frequency-domain/time-domain converter 427 converts the first output audio from the frequency domain back to the time domain, and the mobile device 4 outputs the time-domain first output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 attenuates, within the actual audio it receives, the components corresponding to the reference audio provided by the first wireless audio device 2, so that when the user plays the first output audio through the first playback unit 22 of the first wireless audio device 2, the sound source picked up by the first wireless audio device 2 is attenuated and the user can clearly hear the sound source picked up by the mobile device 4.
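A minimal NumPy/librosa sketch of this masking step is given below. The STFT parameters must match those used when the mask was estimated; the values shown, and the choice to reuse the noisy phase for reconstruction, are assumptions not spelled out in the patent.

```python
import numpy as np
import librosa

def apply_feature_mask(actual_audio: np.ndarray, mask: np.ndarray,
                       n_fft: int = 512, hop_length: int = 128) -> np.ndarray:
    """Spectrogram multiplication (Hadamard product) of the actual audio with a
    feature mask, followed by conversion back to the time domain for playback."""
    # Time-domain/frequency-domain converter 421.
    spec = librosa.stft(actual_audio, n_fft=n_fft, hop_length=hop_length)
    # Mixer 426: element-wise product scales each time-frequency bin; the mask
    # must have the same (bins, frames) shape as the spectrogram, and the
    # original phase is kept for reconstruction.
    masked = mask * np.abs(spec) * np.exp(1j * np.angle(spec))
    # Frequency-domain/time-domain converter 427.
    return librosa.istft(masked, hop_length=hop_length)
```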

Please refer to Figure 4 together with Figures 1 and 2. Figure 4 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the first reference audio. Figure 4 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form. The time-domain/frequency-domain converter 421 converts the actual audio from the time domain to the frequency domain. The mixer 426 receives the first feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the first feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the first reference audio, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the second output audio from the frequency domain back to the time domain, and the mobile device 4 outputs the time-domain second output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 attenuates, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 itself, so that when the user plays the second output audio through the first playback unit 22 of the first wireless audio device 2, the sound source picked up by the mobile device 4 is attenuated and the user can clearly hear the sound source picked up by the first wireless audio device 2.

Of course, in some embodiments, the mobile device 4 may also include a second playback unit 43 for playing the first output audio and/or the second output audio. Alternatively, in some embodiments, the second wireless audio device 3 may also have a playback unit for playing the first output audio and/or the second output audio.

In some embodiments, as shown in Figures 5 and 6, the processing unit 42 of the mobile device 4 includes a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.
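The patent does not state how the voice memos are stored. As one possible realization, the sketch below writes a time-domain output audio to a mono 16-bit PCM WAV file using Python's standard wave module; the sample rate and format are assumptions.

```python
import wave
import numpy as np

def save_voice_memo(output_audio: np.ndarray, path: str, sr: int = 16000) -> None:
    """Persist a time-domain output audio (float values in [-1, 1]) as a WAV memo."""
    # Convert the float waveform to 16-bit PCM samples.
    pcm = (np.clip(output_audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as memo:
        memo.setnchannels(1)       # mono
        memo.setsampwidth(2)       # 16-bit samples
        memo.setframerate(sr)
        memo.writeframes(pcm.tobytes())
```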

In some embodiments, in addition to performing multiplication for one of the two reference audios, multiplication may be performed for the other reference audio at the same time to obtain a clearer result. Please refer to Figure 7 together with Figures 1 and 2. Figure 7 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the first reference audio and the second reference audio simultaneously. Figure 7 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form, and the time-domain/frequency-domain converter 421 converts it from the time domain to the frequency domain. In this embodiment, the mixer 426 includes a first sub-mixer 426a and a second sub-mixer 426b. The first sub-mixer 426a receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The second sub-mixer 426b receives the feature sound model from the memory unit 425 and the frequency-domain actual audio, and uses the first feature mask of the feature sound model to perform spectrogram multiplication on the features of the actual audio corresponding to the first reference audio, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the first output audio and the second output audio from the frequency domain back to the time domain. In this embodiment, the processing unit 42 of the mobile device 4 further includes an adder/subtractor 429 that enhances or attenuates the time-domain first output audio and second output audio and combines them into a complete output audio. The mobile device 4 then outputs the time-domain complete output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 simultaneously enhances, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 and the components corresponding to the reference audio provided by the first wireless audio device 2, so that when the user plays the complete output audio through the first playback unit 22 of the first wireless audio device 2, the user can clearly hear both the sound source picked up by the mobile device 4 and the sound source picked up by the first wireless audio device 2. Of course, in this embodiment, the processing unit 42 of the mobile device 4 may further include a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.
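The adder/subtractor 429 boosts or attenuates the two branches before combining them, but the patent does not specify the gain policy. The sketch below combines the two time-domain outputs with hypothetical per-branch gain factors; a gain above 1 enhances a branch, while a gain between 0 and 1 attenuates it.

```python
import numpy as np

def combine_outputs(first_output: np.ndarray, second_output: np.ndarray,
                    gain_first: float = 1.0, gain_second: float = 1.0) -> np.ndarray:
    """Weighted combination of the two masked, time-domain output audios into a
    complete output audio, as performed by the adder/subtractor 429."""
    # Guard against small length differences after the inverse STFT.
    length = min(len(first_output), len(second_output))
    return gain_first * first_output[:length] + gain_second * second_output[:length]
```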

In some embodiments, the mobile device 4 can generate a feature mask for a background sound source, so that the background sound is blended with the sound source picked up by the mobile device 4 and the sound source picked up by the first wireless audio device 2, making the output audio softer. In this embodiment, the first wireless audio device 2 further receives a background sound source and uses it to generate a background feature mask stored in the memory unit 425; the background feature mask is generated in a manner similar to the feature masks of Figure 2 and is therefore not described again here.

Please refer to Figure 8 together with Figures 1 and 2. Figure 8 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the second reference audio and the background sound source. Figure 8 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form, and the time-domain/frequency-domain converter 421 converts it from the time domain to the frequency domain. In this embodiment, the mixer 426 includes a first sub-mixer 426a and a second sub-mixer 426b. The first sub-mixer 426a receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The second sub-mixer 426b receives the feature sound model from the memory unit 425 and the frequency-domain actual audio, and uses the background feature mask of the feature sound model to perform spectrogram multiplication on the features of the actual audio corresponding to the background sound, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the first output audio and the second output audio from the frequency domain back to the time domain. In this embodiment, the processing unit 42 of the mobile device 4 further includes an adder/subtractor 429 that combines the time-domain first output audio and second output audio into a complete output audio. The mobile device 4 then outputs the time-domain complete output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 simultaneously enhances, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 and the components corresponding to the background sound received by the first wireless audio device 2, so that when the user plays the complete output audio through the first playback unit 22 of the first wireless audio device 2, the user can clearly hear the sound source picked up by the mobile device 4, softened by the background sound so that it sounds less rigid. Of course, in this embodiment, the processing unit 42 of the mobile device 4 includes a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.

In summary, the present disclosure provides a speech enhancement system in which the processing unit of the mobile device performs a preprocessing procedure on the first reference audio received by the wireless audio device and the second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio can be multiplied by the corresponding feature mask. For example, when the user wears the wireless audio device to communicate, the voice of the other party is the actual audio, and the processing unit of the mobile device can enhance that voice so the user hears it clearly, thereby improving the clarity of the audio picked up by the wireless audio device.
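As a closing illustration, the fragment below strings together the hypothetical helpers from the earlier sketches (extract_embedding, FeatureSoundModel, apply_feature_mask). The array shapes, sample rate, and tensor conversions are assumptions, and in practice the reference and actual audio would come from the two sound pickup units and the models would be trained.

```python
import numpy as np
import librosa
import torch

def enhance(actual: np.ndarray, second_reference: np.ndarray, sr: int = 16000) -> np.ndarray:
    """End-to-end sketch: reference audio -> embedding -> feature mask -> first output audio."""
    encoder = EmbeddingEncoder()          # untrained, for illustration only
    model = FeatureSoundModel()           # untrained, for illustration only

    # Preprocessing: the second reference audio -> 64-dim audio embedding feature.
    second_embedding = extract_embedding(second_reference, sr, encoder)

    # Feature sound model: spectrogram of the actual audio + embedding -> feature mask.
    spec = np.abs(librosa.stft(actual, n_fft=512, hop_length=128))        # (bins, frames)
    frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0)       # (1, frames, bins)
    mask = model(frames, second_embedding.unsqueeze(0))                   # (1, frames, bins)

    # Spectrogram multiplication and return to the time domain.
    return apply_feature_mask(actual, mask.squeeze(0).detach().numpy().T)
```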

1: speech enhancement system 2: first wireless audio device 21: first sound pickup unit 22: first playback unit 3: second wireless audio device 4: mobile device 41: second sound pickup unit 42: processing unit 43: second playback unit 421: time-domain/frequency-domain converter 422: frequency discriminator 423: matrix reshaper 424: encoder 425: memory unit 426: mixer 426a: first sub-mixer 426b: second sub-mixer 427: frequency-domain/time-domain converter 428: storage unit 429: adder/subtractor

Figure 1 is a system architecture diagram of the speech enhancement system of the present disclosure; Figure 2 is a system architecture diagram of some components of the speech enhancement system shown in Figure 1; Figure 3 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio; Figure 4 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio; Figure 5 is a partial detailed architecture diagram of another embodiment of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio; Figure 6 is a partial detailed architecture diagram of another embodiment of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio; Figure 7 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio and the second reference audio simultaneously; and Figure 8 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio and the background sound source.

2: first wireless audio device
21: first sound pickup unit
22: first playback unit
4: mobile device
41: second sound pickup unit
42: processing unit
421: time-domain/frequency-domain converter
422: frequency discriminator
423: matrix reshaper
424: encoder
425: memory unit
426: mixer
427: frequency-domain/time-domain converter

Claims (10)

1. A speech enhancement system, comprising: a wireless audio device comprising a first sound pickup unit configured to receive a first reference audio; and a mobile device performing data transmission with the wireless audio device to receive the first reference audio, the mobile device comprising: a second sound pickup unit configured to receive a second reference audio and an actual audio; and a processing unit configured to execute a preprocessing procedure, the processing unit comprising: a memory unit storing a feature sound model; a time-domain/frequency-domain converter converting the first reference audio and the second reference audio from the time domain to the frequency domain; a frequency discriminator identifying a first audio feature in the first reference audio and a second audio feature in the second reference audio; a matrix reshaper converting the first audio feature and the second audio feature into a first matrix and a second matrix, respectively; and an encoder encoding the first matrix and the second matrix to output a first audio embedding feature and a second audio embedding feature, respectively; wherein the preprocessing procedure comprises: converting the first reference audio and the second reference audio into the first audio embedding feature and the second audio embedding feature; and loading the first audio embedding feature, the second audio embedding feature, and the actual audio into the feature sound model to obtain a first feature mask and a second feature mask; and wherein the processing unit converts the actual audio from the time domain to the frequency domain and multiplies the features of the actual audio corresponding to the second reference audio by the second feature mask to provide a first output audio.

2. The speech enhancement system of claim 1, wherein the processing unit further comprises a mixer that receives the second feature mask and the actual audio and uses the second feature mask to multiply the features of the actual audio corresponding to the second reference audio.

3. The speech enhancement system of claim 1, wherein when the second sound pickup unit of the mobile device receives the actual audio, the time-domain/frequency-domain converter converts the actual audio from the time domain to the frequency domain, and the features of the actual audio corresponding to the first reference audio are multiplied by the first feature mask to provide a second output audio.

4. The speech enhancement system of claim 3, wherein the speech enhancement system comprises an adder/subtractor configured to enhance or attenuate the first output audio and the second output audio so as to output a complete output audio.

5. The speech enhancement system of claim 3, further comprising a frequency-domain/time-domain converter configured to convert the first output audio and/or the second output audio from the frequency domain to the time domain.

6. The speech enhancement system of claim 5, wherein the wireless audio device comprises a playback unit configured to play the first output audio and/or the second output audio in time-domain form.

7. The speech enhancement system of claim 3, further comprising a storage unit configured to store the first output audio and/or the second output audio and to convert the first output audio and/or the second output audio into a first voice memo and/or a second voice memo, respectively.

8. The speech enhancement system of claim 1, wherein the feature sound model further comprises a background feature mask corresponding to a background sound source of the wireless audio device, and the time-domain/frequency-domain converter converts the actual audio from the time domain to the frequency domain and the features of the actual audio corresponding to the background sound source are multiplied by the background feature mask to provide a second output audio.

9. The speech enhancement system of claim 1, wherein the first audio embedding feature and/or the second audio embedding feature each have a dimension of 64.

10. The speech enhancement system of claim 1, wherein the second reference audio and the actual audio are provided by the same sound source.
TW112151245A 2023-12-28 2023-12-28 Speech enhancement system TWI902105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112151245A TWI902105B (en) 2023-12-28 2023-12-28 Speech enhancement system


Publications (2)

Publication Number Publication Date
TW202527576A TW202527576A (en) 2025-07-01
TWI902105B true TWI902105B (en) 2025-10-21

Family

ID=97224878

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112151245A TWI902105B (en) 2023-12-28 2023-12-28 Speech enhancement system

Country Status (1)

Country Link
TW (1) TWI902105B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170339491A1 (en) * 2016-05-18 2017-11-23 Qualcomm Incorporated Device for generating audio output
US20230298611A1 (en) * 2020-06-30 2023-09-21 Microsoft Technology Licensing, Llc Speech enhancement
US20230276182A1 (en) * 2020-09-01 2023-08-31 Starkey Laboratories, Inc. Mobile device that provides sound enhancement for hearing device
US20230230571A1 (en) * 2021-06-03 2023-07-20 Tencent Technology (Shenzhen) Company Limited Audio processing method and apparatus based on artificial intelligence, device, storage medium, and computer program product

Also Published As

Publication number Publication date
TW202527576A (en) 2025-07-01
