
TWI902105B - Speech enhancement system - Google Patents

Speech enhancement system

Info

Publication number
TWI902105B
TWI902105B
Authority
TW
Taiwan
Prior art keywords
audio
feature
domain
message
actual
Prior art date
Application number
TW112151245A
Other languages
Chinese (zh)
Other versions
TW202527576A (en)
Inventor
秦允求
蔡連枝
黃兆華
Original Assignee
仁寶電腦工業股份有限公司
Priority date
Filing date
Publication date
Application filed by 仁寶電腦工業股份有限公司
Priority to TW112151245A
Publication of TW202527576A
Application granted
Publication of TWI902105B

Landscapes

  • Mobile Radio Communication Systems (AREA)
  • Telephone Function (AREA)

Abstract

The present invention discloses a speech enhancement system. A processing unit in a mobile device executes a preprocessing procedure on a first reference audio received by a wireless audio device and a second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio are multiplied by the corresponding feature mask. When the user communicates through the wireless audio device, the voice of the person speaking with the user serves as the actual audio. The processing unit of the mobile device enhances that voice so the user hears it clearly, which improves the clarity of the audio picked up by the wireless audio device.

Description

Speech enhancement system

The present disclosure relates to a speech enhancement system, and more particularly to a speech enhancement system with improved clarity.

Typically, the sound pickup device of a hearing aid is located on the hearing aid body. When a user wearing the hearing aid communicates, both the voice of the other party and the user's own voice are picked up, processed, and played back by that sound pickup device. Because the other party's voice travels a greater distance, its volume is attenuated by the time it reaches the hearing aid, and it is easily masked by ambient noise in places such as busy roadsides, shopping malls, restaurants, and train stations. The user's own voice, by contrast, is close to the hearing aid's sound pickup device and is reinforced by bone conduction, so it usually reaches the device at a higher volume and quality than the other party's voice. As a result, the other party's voice arrives at the hearing aid's sound pickup device at a lower volume and with lower clarity.

Therefore, there is an urgent need to develop a speech enhancement system that overcomes the above shortcomings.

An object of the present disclosure is to provide a speech enhancement system in which the processing unit of the mobile device performs a preprocessing procedure on the first reference audio received by the wireless audio device and the second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio can be multiplied by the corresponding feature mask. When the user wears the wireless audio device to communicate, the voice of the other party is the actual audio, and the processing unit of the mobile device can enhance that voice so the user hears it clearly, thereby improving the clarity of the audio picked up by the wireless audio device.

To achieve the above object, a broader embodiment of the present disclosure provides a speech enhancement system comprising a wireless audio device and a mobile device. The wireless audio device includes a first sound pickup unit for receiving a first reference audio. The mobile device performs data transmission with the wireless audio device to receive the first reference audio, and includes a second sound pickup unit and a processing unit. The second sound pickup unit receives a second reference audio and an actual audio. The processing unit executes a preprocessing procedure and has a memory unit that stores a feature sound model. The preprocessing procedure includes the following steps: converting the first reference audio and the second reference audio into a first audio embedding feature and a second audio embedding feature, and loading the first audio embedding feature, the second audio embedding feature, and the actual audio into the feature sound model to obtain a first feature mask and a second feature mask. The processing unit converts the actual audio from the time domain to the frequency domain and multiplies the features of the actual audio corresponding to the second reference audio by the second feature mask to provide a first output audio.

Some typical embodiments embodying the features and advantages of the present disclosure are described in detail below. It should be understood that the present disclosure can be varied in many ways without departing from its scope, and that the descriptions and drawings are illustrative in nature and are not intended to limit the disclosure.

Please refer to Figure 1 and Figure 2, in which Figure 1 is a system architecture diagram of the speech enhancement system of the present disclosure, and Figure 2 is a system architecture diagram of some components of the speech enhancement system shown in Figure 1. As shown, the speech enhancement system 1 includes a first wireless audio device 2, a second wireless audio device 3, and a mobile device 4. The first wireless audio device 2 may be, but is not limited to, a hearing aid, wireless earphones, VR/AR/MR glasses, a smartwatch, or a smartphone, and includes a first sound pickup unit 21 and a first playback unit 22. The first sound pickup unit 21 receives a first reference audio, where the first reference audio is in time-domain form and is provided by a sound source near the first sound pickup unit 21. The first playback unit 22 plays a first output audio and/or a second output audio.

The second wireless audio device 3 may be, but is not limited to, a laptop, a tablet, or a smartphone. In one embodiment, the mobile device 4 may be integrated into the second wireless audio device 3, while in another embodiment the mobile device 4 may be separate from it. In the present disclosure, the mobile device 4 is described as being separate from the second wireless audio device 3, as shown in Figure 1. The mobile device 4 exchanges data with the first wireless audio device 2 and the second wireless audio device 3 through a wireless or wired connection, for example Bluetooth, Wi-Fi, 2.4 GHz, or USB 2.0/3.0, so as to receive the first reference audio picked up by the first wireless audio device 2. The mobile device 4 includes a second sound pickup unit 41 and a processing unit 42. The second sound pickup unit 41 receives a second reference audio and an actual audio, both in time-domain form and both provided by the same sound source (for example, a human voice) near the second sound pickup unit 41.

The processing unit 42 executes a preprocessing procedure that includes the following steps. First, the first reference audio and the second reference audio are converted into a first audio embedding feature and a second audio embedding feature, respectively. Next, the first audio embedding feature, the second audio embedding feature, and the actual audio are loaded into a feature sound model stored in a memory unit of the processing unit 42 to obtain a first feature mask and a second feature mask. The internal circuit structure and detailed functions of the processing unit 42 are further described below. In addition, the processing unit 42 can compare the first or second audio embedding feature obtained in the preprocessing procedure with a newly received voice embedding feature by their Euclidean distance to determine whether the two voices come from the same person.
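The patent specifies the comparison metric (Euclidean distance) but not the decision threshold. The minimal sketch below assumes 64-dimensional embeddings as produced by the encoder described later; the function name and threshold value are illustrative only.

```python
import numpy as np

def is_same_speaker(reference_embedding: np.ndarray,
                    new_embedding: np.ndarray,
                    threshold: float = 1.0) -> bool:
    """Return True when two voice embeddings are close enough (in Euclidean
    distance) to be treated as coming from the same person.

    The patent only states that Euclidean distance is compared; the threshold
    here is a hypothetical tuning parameter.
    """
    distance = float(np.linalg.norm(reference_embedding - new_embedding))
    return distance < threshold
```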

As shown in Figure 2, the processing unit 42 includes a time-domain/frequency-domain converter 421, a frequency discriminator 422, a matrix reshaper 423, an encoder 424, a memory unit 425, a mixer 426, and a frequency-domain/time-domain converter 427. The time-domain/frequency-domain converter 421 implements a short-time Fourier transform (STFT); it receives the first reference audio from the first sound pickup unit 21 and the second reference audio from the second sound pickup unit 41 and converts both from the time domain to the frequency domain. The frequency discriminator 422 computes Mel-scale frequency cepstral coefficients (MFCC) and uses the frequency-domain first and second reference audio to identify a first audio feature in the first reference audio and a second audio feature in the second reference audio, where each audio feature is a quantifiable attribute of the information carried in its reference audio. The matrix reshaper 423 converts the first audio feature and the second audio feature into a first matrix and a second matrix, respectively, that is, it quantifies the audio features in matrix form. The encoder 424 encodes the first matrix and the second matrix to output a first audio embedding feature and a second audio embedding feature, respectively, each of dimension 64. The memory unit 425 stores the feature sound model; when the first audio embedding feature, the second audio embedding feature, and the actual audio are loaded into the feature sound model, the first feature mask and the second feature mask are obtained.
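A minimal sketch of this front end is shown below, assuming librosa for the STFT/MFCC stages and PyTorch for the encoder. The frame sizes, number of MFCCs, and encoder layers are illustrative assumptions; the patent only fixes the embedding dimension at 64, and the names `EmbeddingEncoder` and `extract_embedding` are hypothetical.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

class EmbeddingEncoder(nn.Module):
    """Hypothetical encoder 424: maps an MFCC feature matrix to a 64-dim embedding."""

    def __init__(self, n_mfcc: int = 40, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mfcc, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, feature_matrix: torch.Tensor) -> torch.Tensor:
        # feature_matrix: (frames, n_mfcc); average over frames -> (embed_dim,)
        return self.net(feature_matrix).mean(dim=0)

def extract_embedding(reference_audio: np.ndarray, sr: int,
                      encoder: EmbeddingEncoder) -> torch.Tensor:
    """Reference audio (time domain) -> STFT -> MFCC -> matrix -> embedding."""
    # Time-domain/frequency-domain converter 421: short-time Fourier transform.
    spec = np.abs(librosa.stft(reference_audio, n_fft=512, hop_length=128))
    # Frequency discriminator 422: Mel-frequency cepstral coefficients.
    mel = librosa.feature.melspectrogram(S=spec ** 2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=40)
    # Matrix reshaper 423: arrange the features as (frames, n_mfcc).
    feature_matrix = torch.tensor(mfcc.T, dtype=torch.float32)
    # Encoder 424: produce the 64-dimensional audio embedding feature.
    return encoder(feature_matrix)
```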

In addition, the feature sound model in the memory unit 425 can be continuously trained on the first and second audio embedding features to improve its performance. In this embodiment, the feature sound model of the memory unit 425 includes one gated recurrent unit (GRU) layer and two long short-term memory (LSTM) layers.
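The patent fixes the layer types (one GRU layer and two LSTM layers) but not their widths, inputs, or output activation. The PyTorch sketch below fills those in with illustrative assumptions: each spectrogram frame of the actual audio is concatenated with the 64-dimensional embedding of the target reference audio, and a sigmoid output keeps mask values in [0, 1] for the later spectrogram multiplication.

```python
import torch
import torch.nn as nn

class FeatureSoundModel(nn.Module):
    """Hypothetical feature sound model: one GRU layer followed by two LSTM layers.

    Input:  magnitude spectrogram of the actual audio, shape (batch, frames, bins),
            plus a 64-dim audio embedding feature of the reference audio.
    Output: a feature mask with the same shape as the spectrogram.
    """

    def __init__(self, freq_bins: int = 257, embed_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.gru = nn.GRU(freq_bins + embed_dim, hidden, batch_first=True)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, freq_bins)

    def forward(self, spec: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # Repeat the embedding for every frame and append it to each frame.
        frames = spec.shape[1]
        tiled = embedding.unsqueeze(1).expand(-1, frames, -1)
        x = torch.cat([spec, tiled], dim=-1)
        x, _ = self.gru(x)
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        # Sigmoid keeps every mask value in [0, 1].
        return torch.sigmoid(self.out(x))
```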

Please refer to Figure 3 together with Figures 1 and 2. Figure 3 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the second reference audio. Figure 3 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form. The time-domain/frequency-domain converter 421 converts the actual audio from the time domain to the frequency domain. The mixer 426 receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The frequency-domain/time-domain converter 427 converts the first output audio from the frequency domain back to the time domain, and the mobile device 4 outputs the time-domain first output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 attenuates, within the actual audio it receives, the components corresponding to the reference audio provided by the first wireless audio device 2, so that when the user plays the first output audio through the first playback unit 22 of the first wireless audio device 2, the sound source picked up by the first wireless audio device 2 is attenuated and the user can clearly hear the sound source picked up by the mobile device 4.
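A minimal NumPy/librosa sketch of this masking step is given below. The STFT parameters must match those used when the mask was estimated; the values shown, and the choice to reuse the noisy phase for reconstruction, are assumptions not spelled out in the patent.

```python
import numpy as np
import librosa

def apply_feature_mask(actual_audio: np.ndarray, mask: np.ndarray,
                       n_fft: int = 512, hop_length: int = 128) -> np.ndarray:
    """Spectrogram multiplication (Hadamard product) of the actual audio with a
    feature mask, followed by conversion back to the time domain for playback."""
    # Time-domain/frequency-domain converter 421.
    spec = librosa.stft(actual_audio, n_fft=n_fft, hop_length=hop_length)
    # Mixer 426: element-wise product scales each time-frequency bin; the mask
    # must have the same (bins, frames) shape as the spectrogram, and the
    # original phase is kept for reconstruction.
    masked = mask * np.abs(spec) * np.exp(1j * np.angle(spec))
    # Frequency-domain/time-domain converter 427.
    return librosa.istft(masked, hop_length=hop_length)
```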

Please refer to Figure 4 together with Figures 1 and 2. Figure 4 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the first reference audio. Figure 4 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form. The time-domain/frequency-domain converter 421 converts the actual audio from the time domain to the frequency domain. The mixer 426 receives the first feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the first feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the first reference audio, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the second output audio from the frequency domain back to the time domain, and the mobile device 4 outputs the time-domain second output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 attenuates, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 itself, so that when the user plays the second output audio through the first playback unit 22 of the first wireless audio device 2, the sound source picked up by the mobile device 4 is attenuated and the user can clearly hear the sound source picked up by the first wireless audio device 2.

Of course, in some embodiments, the mobile device 4 may also include a second playback unit 43 for playing the first output audio and/or the second output audio. Alternatively, in some embodiments, the second wireless audio device 3 may also have a playback unit for playing the first output audio and/or the second output audio.

In some embodiments, as shown in Figures 5 and 6, the processing unit 42 of the mobile device 4 includes a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.
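The patent does not state how the voice memos are stored. As one possible realization, the sketch below writes a time-domain output audio to a mono 16-bit PCM WAV file using Python's standard wave module; the sample rate and format are assumptions.

```python
import wave
import numpy as np

def save_voice_memo(output_audio: np.ndarray, path: str, sr: int = 16000) -> None:
    """Persist a time-domain output audio (float values in [-1, 1]) as a WAV memo."""
    # Convert the float waveform to 16-bit PCM samples.
    pcm = (np.clip(output_audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as memo:
        memo.setnchannels(1)       # mono
        memo.setsampwidth(2)       # 16-bit samples
        memo.setframerate(sr)
        memo.writeframes(pcm.tobytes())
```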

In some embodiments, in addition to performing multiplication for one of the two reference audios, multiplication may be performed for the other reference audio at the same time to obtain a clearer result. Please refer to Figure 7 together with Figures 1 and 2. Figure 7 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the first reference audio and the second reference audio simultaneously. Figure 7 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form, and the time-domain/frequency-domain converter 421 converts it from the time domain to the frequency domain. In this embodiment, the mixer 426 includes a first sub-mixer 426a and a second sub-mixer 426b. The first sub-mixer 426a receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The second sub-mixer 426b receives the feature sound model from the memory unit 425 and the frequency-domain actual audio, and uses the first feature mask of the feature sound model to perform spectrogram multiplication on the features of the actual audio corresponding to the first reference audio, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the first output audio and the second output audio from the frequency domain back to the time domain. In this embodiment, the processing unit 42 of the mobile device 4 further includes an adder/subtractor 429 that enhances or attenuates the time-domain first output audio and second output audio and combines them into a complete output audio. The mobile device 4 then outputs the time-domain complete output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 simultaneously enhances, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 and the components corresponding to the reference audio provided by the first wireless audio device 2, so that when the user plays the complete output audio through the first playback unit 22 of the first wireless audio device 2, the user can clearly hear both the sound source picked up by the mobile device 4 and the sound source picked up by the first wireless audio device 2. Of course, in this embodiment, the processing unit 42 of the mobile device 4 may further include a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.
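The adder/subtractor 429 boosts or attenuates the two branches before combining them, but the patent does not specify the gain policy. The sketch below combines the two time-domain outputs with hypothetical per-branch gain factors; a gain above 1 enhances a branch, while a gain between 0 and 1 attenuates it.

```python
import numpy as np

def combine_outputs(first_output: np.ndarray, second_output: np.ndarray,
                    gain_first: float = 1.0, gain_second: float = 1.0) -> np.ndarray:
    """Weighted combination of the two masked, time-domain output audios into a
    complete output audio, as performed by the adder/subtractor 429."""
    # Guard against small length differences after the inverse STFT.
    length = min(len(first_output), len(second_output))
    return gain_first * first_output[:length] + gain_second * second_output[:length]
```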

In some embodiments, the mobile device 4 can generate a feature mask for a background sound source, so that the background sound is blended with the sound source picked up by the mobile device 4 and the sound source picked up by the first wireless audio device 2, making the output audio softer. In this embodiment, the first wireless audio device 2 further receives a background sound source and uses it to generate a background feature mask stored in the memory unit 425; the background feature mask is generated in a manner similar to the feature masks of Figure 2 and is therefore not described again here.

Please refer to Figure 8 together with Figures 1 and 2. Figure 8 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system of Figure 1 when multiplication is performed for the second reference audio and the background sound source. Figure 8 shows only the control architecture used after the mobile device 4 has built the feature sound model and then receives the actual audio, that is, part of the control architecture of Figure 2. As shown, the second sound pickup unit 41 of the mobile device 4 further receives the actual audio, which is in time-domain form, and the time-domain/frequency-domain converter 421 converts it from the time domain to the frequency domain. In this embodiment, the mixer 426 includes a first sub-mixer 426a and a second sub-mixer 426b. The first sub-mixer 426a receives the second feature mask from the memory unit 425 and the frequency-domain actual audio, and uses the second feature mask of the feature sound model to perform spectrogram multiplication, for example a Hadamard (element-wise) product, on the features of the actual audio corresponding to the second reference audio, so as to provide a first output audio. The second sub-mixer 426b receives the feature sound model from the memory unit 425 and the frequency-domain actual audio, and uses the background feature mask of the feature sound model to perform spectrogram multiplication on the features of the actual audio corresponding to the background sound, so as to provide a second output audio. The frequency-domain/time-domain converter 427 converts the first output audio and the second output audio from the frequency domain back to the time domain. In this embodiment, the processing unit 42 of the mobile device 4 further includes an adder/subtractor 429 that combines the time-domain first output audio and second output audio into a complete output audio. The mobile device 4 then outputs the time-domain complete output audio to the first playback unit 22 of the first wireless audio device 2. In other words, the mobile device 4 simultaneously enhances, within the actual audio it receives, the components corresponding to the reference audio provided by the mobile device 4 and the components corresponding to the background sound received by the first wireless audio device 2, so that when the user plays the complete output audio through the first playback unit 22 of the first wireless audio device 2, the user can clearly hear the sound source picked up by the mobile device 4, softened by the background sound so that it sounds less rigid. Of course, in this embodiment, the processing unit 42 of the mobile device 4 includes a storage unit 428 for storing the first output audio and/or the second output audio and converting them into a first voice memo and/or a second voice memo, respectively, so that the speech enhancement system 1 also provides a voice memo function.

In summary, the present disclosure provides a speech enhancement system in which the processing unit of the mobile device performs a preprocessing procedure on the first reference audio received by the wireless audio device and the second reference audio received by the mobile device, so as to obtain multiple feature masks from a feature sound model. As needed, the features of the actual audio corresponding to a given reference audio can be multiplied by the corresponding feature mask. For example, when the user wears the wireless audio device to communicate, the voice of the other party is the actual audio, and the processing unit of the mobile device can enhance that voice so the user hears it clearly, thereby improving the clarity of the audio picked up by the wireless audio device.
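As a closing illustration, the fragment below strings together the hypothetical helpers from the earlier sketches (extract_embedding, FeatureSoundModel, apply_feature_mask). The array shapes, sample rate, and tensor conversions are assumptions, and in practice the reference and actual audio would come from the two sound pickup units and the models would be trained.

```python
import numpy as np
import librosa
import torch

def enhance(actual: np.ndarray, second_reference: np.ndarray, sr: int = 16000) -> np.ndarray:
    """End-to-end sketch: reference audio -> embedding -> feature mask -> first output audio."""
    encoder = EmbeddingEncoder()          # untrained, for illustration only
    model = FeatureSoundModel()           # untrained, for illustration only

    # Preprocessing: the second reference audio -> 64-dim audio embedding feature.
    second_embedding = extract_embedding(second_reference, sr, encoder)

    # Feature sound model: spectrogram of the actual audio + embedding -> feature mask.
    spec = np.abs(librosa.stft(actual, n_fft=512, hop_length=128))        # (bins, frames)
    frames = torch.tensor(spec.T, dtype=torch.float32).unsqueeze(0)       # (1, frames, bins)
    mask = model(frames, second_embedding.unsqueeze(0))                   # (1, frames, bins)

    # Spectrogram multiplication and return to the time domain.
    return apply_feature_mask(actual, mask.squeeze(0).detach().numpy().T)
```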

1: speech enhancement system 2: first wireless audio device 21: first sound pickup unit 22: first playback unit 3: second wireless audio device 4: mobile device 41: second sound pickup unit 42: processing unit 43: second playback unit 421: time-domain/frequency-domain converter 422: frequency discriminator 423: matrix reshaper 424: encoder 425: memory unit 426: mixer 426a: first sub-mixer 426b: second sub-mixer 427: frequency-domain/time-domain converter 428: storage unit 429: adder/subtractor

Figure 1 is a system architecture diagram of the speech enhancement system of the present disclosure; Figure 2 is a system architecture diagram of some components of the speech enhancement system shown in Figure 1; Figure 3 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio; Figure 4 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio; Figure 5 is a partial detailed architecture diagram of another embodiment of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio; Figure 6 is a partial detailed architecture diagram of another embodiment of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio; Figure 7 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the first reference audio and the second reference audio simultaneously; and Figure 8 is a partial detailed architecture diagram of the processing unit in the mobile device of the speech enhancement system shown in Figure 1 when multiplication is performed for the second reference audio and the background sound source.

2: first wireless audio device
21: first sound pickup unit
22: first playback unit
4: mobile device
41: second sound pickup unit
42: processing unit
421: time-domain/frequency-domain converter
422: frequency discriminator
423: matrix reshaper
424: encoder
425: memory unit
426: mixer
427: frequency-domain/time-domain converter

Claims (10)

1. A speech enhancement system, comprising: a wireless audio device comprising a first sound pickup unit configured to receive a first reference audio; and a mobile device performing data transmission with the wireless audio device to receive the first reference audio, the mobile device comprising: a second sound pickup unit configured to receive a second reference audio and an actual audio; and a processing unit configured to execute a preprocessing procedure, the processing unit comprising: a memory unit storing a feature sound model; a time-domain/frequency-domain converter converting the first reference audio and the second reference audio from the time domain to the frequency domain; a frequency discriminator identifying a first audio feature in the first reference audio and a second audio feature in the second reference audio; a matrix reshaper converting the first audio feature and the second audio feature into a first matrix and a second matrix, respectively; and an encoder encoding the first matrix and the second matrix to output a first audio embedding feature and a second audio embedding feature, respectively; wherein the preprocessing procedure comprises: converting the first reference audio and the second reference audio into the first audio embedding feature and the second audio embedding feature; and loading the first audio embedding feature, the second audio embedding feature, and the actual audio into the feature sound model to obtain a first feature mask and a second feature mask; and wherein the processing unit converts the actual audio from the time domain to the frequency domain and multiplies the features of the actual audio corresponding to the second reference audio by the second feature mask to provide a first output audio.

2. The speech enhancement system of claim 1, wherein the processing unit further comprises a mixer that receives the second feature mask and the actual audio and uses the second feature mask to multiply the features of the actual audio corresponding to the second reference audio.

3. The speech enhancement system of claim 1, wherein when the second sound pickup unit of the mobile device receives the actual audio, the time-domain/frequency-domain converter converts the actual audio from the time domain to the frequency domain, and the features of the actual audio corresponding to the first reference audio are multiplied by the first feature mask to provide a second output audio.

4. The speech enhancement system of claim 3, wherein the speech enhancement system comprises an adder/subtractor configured to enhance or attenuate the first output audio and the second output audio so as to output a complete output audio.

5. The speech enhancement system of claim 3, further comprising a frequency-domain/time-domain converter configured to convert the first output audio and/or the second output audio from the frequency domain to the time domain.

6. The speech enhancement system of claim 5, wherein the wireless audio device comprises a playback unit configured to play the first output audio and/or the second output audio in time-domain form.

7. The speech enhancement system of claim 3, further comprising a storage unit configured to store the first output audio and/or the second output audio and to convert the first output audio and/or the second output audio into a first voice memo and/or a second voice memo, respectively.

8. The speech enhancement system of claim 1, wherein the feature sound model further comprises a background feature mask corresponding to a background sound source of the wireless audio device, and the time-domain/frequency-domain converter converts the actual audio from the time domain to the frequency domain and the features of the actual audio corresponding to the background sound source are multiplied by the background feature mask to provide a second output audio.

9. The speech enhancement system of claim 1, wherein the first audio embedding feature and/or the second audio embedding feature each have a dimension of 64.

10. The speech enhancement system of claim 1, wherein the second reference audio and the actual audio are provided by the same sound source.
TW112151245A 2023-12-28 2023-12-28 Speech enhancement system TWI902105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112151245A TWI902105B (en) 2023-12-28 2023-12-28 Speech enhancement system


Publications (2)

Publication Number Publication Date
TW202527576A TW202527576A (en) 2025-07-01
TWI902105B true TWI902105B (en) 2025-10-21

Family

ID=97224878

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112151245A TWI902105B (en) 2023-12-28 2023-12-28 Speech enhancement system

Country Status (1)

Country Link
TW (1) TWI902105B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170339491A1 (en) * 2016-05-18 2017-11-23 Qualcomm Incorporated Device for generating audio output
US20230298611A1 (en) * 2020-06-30 2023-09-21 Microsoft Technology Licensing, Llc Speech enhancement
US20230276182A1 (en) * 2020-09-01 2023-08-31 Starkey Laboratories, Inc. Mobile device that provides sound enhancement for hearing device
US20230230571A1 (en) * 2021-06-03 2023-07-20 Tencent Technology (Shenzhen) Company Limited Audio processing method and apparatus based on artificial intelligence, device, storage medium, and computer program product

Also Published As

Publication number Publication date
TW202527576A (en) 2025-07-01
