JP2017092732A

JP2017092732A - Auditory supporting system and auditory supporting device

Info

Publication number: JP2017092732A
Application number: JP2015221387A
Authority: JP
Inventors: イシイ・カルロス・トシノリ; Carlos Toshinori Ishii; 超然劉; Chaoran Liu; イアニ・エヴァン; Evan Yani
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2015-11-11
Filing date: 2015-11-11
Publication date: 2017-05-25
Anticipated expiration: 2035-11-11
Also published as: JP6665379B2

Abstract

[Problem] To provide a hearing support system that can assist a user's hearing by reproducing, for the user, the three-dimensional sound environment observed in a target space.
[Solution] A hearing support system 1000 includes a microphone array group 100 and an LRF group 200 for detecting the positions of people. A sound source localization device 300 estimates the direction of arrival of sound, integrates it with the detection results of the position detection means to identify the position of each sound source, and separates and outputs the sound from each identified position. A face posture estimation unit 520 detects the face posture of user 2. A sound space reconstruction unit 540 uses head-related transfer functions from the position corresponding to the sound source in the target space to each ear of the user, chosen according to the sound source position and the face posture, to synthesize from the separated sound signals the sound signals to be reproduced at each ear of user 2, and corrects the volume in each frequency band to match the user's hearing-loss characteristics.
[Selected drawing] Figure 2

Description

The present invention relates to a technique for assisting a user's hearing by means of sound source localization and sound source separation.

In countries around the world, roughly 10% to 20% of the population is said to have hearing loss or a hearing impairment. A 2009 report by the Japan Hearing Aid Dealer Association, "Research on the Ideal Form of the Hearing Aid Supply System", put the hearing-impaired population of Japan at 15.7% (19.44 million people). This group is divided into people with hearing loss who are unaware of it (7.2%), people who are aware of it (4.5%), hearing aid owners who rarely use their device (1.0%), and hearing aid owners who use it regularly or occasionally (2.7%).

Hearing loss in the elderly is presbycusis, an age-related deterioration of nerve cells and other structures. It is found in 25 to 40% of people aged 65 and over and in 40 to 66% of people aged 75 and over. As the population ages, the number of people with hearing loss is expected to grow further.

About 4 million people in Japan use hearing aids, which means that only about one in five people with hearing loss uses one. Many people with hearing loss also stop using their hearing aids partway.

One reason is a fundamental problem: in an ordinary hearing aid the microphone is built into the device, so ambient noise is amplified as well. Howling (acoustic feedback, heard as a squeal) also occurs easily and is distressing for the user. Recent hearing aids have improved through digital processing, with built-in functions such as per-frequency-band volume adjustment and noise suppression. Some also apply anti-howling signal processing, but this requires holding the volume down, so they cannot deliver enough output for severe hearing loss.

Users are said to give up on hearing aids mostly because they could not choose a device suited to them, or because setup is difficult and they use the device with the wrong settings. Even when the selection and settings are appropriate, however, there is a limit to the comfort (ease of hearing) that a hearing aid can provide on its own.

Patent Document 1 discloses a hearing device with selectable perceived spatial positioning of sound sources. In that technique, the hearing device system comprises a hearing device (a binaural hearing aid consisting of a first hearing aid for the right ear and a second hearing aid for the left ear) and a control device (a smartphone) that lets the user select the perceived direction of arrival of the selected audio signal transmitted to the hearing device. This configuration improves the patient's hearing by making conversational cues audible.

Binaural processing (signal processing that uses the microphones of hearing aids worn on both ears) has thus been studied extensively for hearing aid applications, in Japan and abroad. For example, Non-Patent Document 1 discloses a study that applies binaural signals to binaural hearing aids, centered on blind signal processing and post-filtering. Non-Patent Document 2 reports research and development of a "listening ear" type hearing aid system, and Non-Patent Document 3 reports research on a hearing support system that addresses the decline of auditory function in the elderly.

Some hearing aids can also receive a distant voice over FM from a remote microphone, such as a lapel or pen-type microphone. However, the noise around the remote microphone is amplified too, and the spatial information needed to sense the direction of a sound is not preserved.

The loss of spatial information is resolved to some extent by wearing hearing aids with built-in microphones on both ears, but the problem that the wearer's own voice sounds loud remains.

The problem with conveying spatial information through remote sensors or remote microphones arises even with multi-channel setups that can capture directional information, because the angle of the sound source relative to the sensor differs from its angle relative to the user. Many studies in Japan and abroad have applied multi-channel microphone array technology to hearing support, but most of them enhance a single sound source and output a monaural signal, so spatial information is lost.

Meanwhile, the spatial information of sound described above can be obtained with sound source localization and sound source separation using microphone arrays.

For sound source localization in real environments, the prior art includes the techniques described in Patent Documents 2 and 3. Both use a well-known, high-resolution localization method called the MUSIC method.

In the inventions described in Patent Documents 2 and 3, a microphone array is used, and the current correlation matrix is computed from the received-signal vector, obtained by Fourier-transforming the signals from the array, and the past correlation matrix. The resulting correlation matrix is eigendecomposed to obtain the maximum eigenvalue and the noise subspace, which is spanned by the eigenvectors corresponding to the eigenvalues other than the maximum. Then, taking one microphone of the array as the reference, the direction of the sound source is estimated by the MUSIC method from the phase differences of the microphone outputs, the noise subspace, and the maximum eigenvalue.
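The steps in the preceding paragraph can be written out as follows. This is a sketch of the textbook narrowband MUSIC formulation for a single dominant source, not the exact equations of Patent Documents 2 or 3. The forgetting factor \alpha, the steering vector a(\theta), the delays \tau_m, and the symbols R_t, \lambda_k, e_k, E_n are notation assumed here for illustration, and any weighting by the maximum eigenvalue is omitted.

```latex
% x_t: received-signal vector for one frequency bin f (M microphones)
% Recursive update of the correlation matrix from the past matrix:
R_t = \alpha\, R_{t-1} + (1-\alpha)\, x_t x_t^{\mathsf H}, \qquad 0 < \alpha < 1

% Eigendecomposition, eigenvalues sorted in descending order:
R_t = \sum_{k=1}^{M} \lambda_k\, e_k e_k^{\mathsf H}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M

% Noise subspace: eigenvectors other than the one for the maximum eigenvalue
E_n = [\, e_2, \dots, e_M \,]

% Steering vector of phase differences relative to reference microphone 1:
a(\theta) = \bigl[\, 1,\; e^{-j 2\pi f \tau_2(\theta)},\; \dots,\; e^{-j 2\pi f \tau_M(\theta)} \,\bigr]^{\mathsf T}

% MUSIC pseudo-spectrum and direction estimate:
P(\theta) = \frac{a(\theta)^{\mathsf H} a(\theta)}{a(\theta)^{\mathsf H} E_n E_n^{\mathsf H} a(\theta)}, \qquad
\hat{\theta} = \arg\max_{\theta} P(\theta)
```

The pseudo-spectrum peaks where the steering vector is nearly orthogonal to the noise subspace, which is why the method resolves directions more finely than a plain beamformer scan.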

Patent Document 4 further discloses a system for sound source localization and separation whose aim is to separate human speech from noise accurately when people and other noise sources are mixed. Its sound source localization device includes: a group of LRFs (laser range finders) that detect the positions of people; a sound source localization processing unit that, based on the multi-channel sound source signals obtained from the output of a microphone array group, the positional relationship between the microphones in each array, and the output of the LRF group, computes the MUSIC power for each of a plurality of directions at predetermined intervals and detects its peaks as sound source positions; a sound source separation processing unit that separates, from the microphone array output signals, the audio signal coming from each sound source position detected by the localization processing unit; and a sound source type identification processing unit that determines the attribute of each separated audio signal with high accuracy using the output of the person position measurement device.

Patent Document 1: Japanese Patent Application Laid-Open No. 2015-136100
Patent Document 2: Japanese Patent Application Laid-Open No. 2008-175733
Patent Document 3: Japanese Patent Application Laid-Open No. 2011-220701
Patent Document 4: Japanese Patent Application Laid-Open No. 2012-211768

Non-Patent Document 1: Takafuji, Mori, Saruwatari, Shikano (2008). Binaural hearing aid system combining ICA based on the SIMO model with binary mask processing unaffected by head-related transfer functions. IEICE Technical Report, EA (Applied Acoustics) 108(143), 25-30.
Non-Patent Document 2: Masashi Unoki. Research and development of a "listening ear" type hearing aid system. Strategic Information and Communications R&D Promotion Programme (SCOPE), newly adopted projects for FY2013. http://www.soumu.go.jp/main_content/000242634.pdf
Non-Patent Document 3: Research on a hearing support system addressing the decline of auditory function in the elderly. Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research (C), April 2014 to March 2017.

However, with the technique of Patent Document 1, for example, the user must personally move symbols on a display that represent the sources emitting sound, matching them to the current surroundings, in order to position the perceived sound sources. This places a heavy burden on the user. In addition, when the direction of the user's head changes, the direction from which a sound seems to arrive drifts away from the direction of the sound source in real space, which feels unnatural.

The techniques disclosed in Patent Documents 2 to 4 likewise only estimate the direction of arrival of sound from a source and separate that sound. They give no consideration to the mismatch between the direction from which a sound reaches the user's ears and the direction of the source as the user actually sees it.

Conventional hearing aids also have the following problems.

(1) The user cannot choose between sounds that are needed and sounds that are not.

(2) The spatial information of sound is lost.

(3) The settings are complicated and the device is hard to use.

The present invention was made to solve these problems. Its object is to provide a hearing support system that achieves natural-sounding hearing support by reproducing the observed three-dimensional sound environment according to the position and posture of the head of the person being assisted.

Another object of the invention is to provide a hearing support system that, by separating the individual sounds in the environment, can selectively control which sounds the user receives and which are suppressed.

According to one aspect of the invention, a hearing support system for assisting the hearing of a user in a target space comprises a sound source localization device installed in the target space. The sound source localization device includes: position detection means for detecting the positions of objects in the target space; sound source localization means for estimating the direction of arrival of sound from the output of a microphone array, integrating it with the detection results of the position detection means, and identifying and outputting the position of a sound source; and sound source separation means for separating and outputting the sound from the identified sound source position. The system further comprises a spatial sensation synthesis device for reconstructing the sound in the target space according to the user's face posture. The spatial sensation synthesis device includes: face posture detection means for detecting the face posture of the user in the target space; sound reproduction means, worn by the user, for reproducing the sound environment of the target space at the user's two ears; and sound space reconstruction means that receives the sound source position from the sound source localization means and, according to the detected face posture, uses the head-related transfer functions from the sound source position in the target space to each ear of the user to synthesize, from the separated sound signals supplied by the sound source separation means, the sound signals to be reproduced at each ear by the sound reproduction means.

Preferably, the spatial sensation synthesis device further comprises frequency characteristic correction means for correcting the volume in each frequency band to match the hearing-loss characteristics of each of the user's ears.
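Per-band volume correction of this kind can be sketched as below. The source does not say how the correction is implemented, so everything here is an assumption for illustration: the function names, the band table format `(low_hz, high_hz, gain_db)`, and the use of a naive DFT on a short block instead of a real-time filter bank.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for short illustrative blocks)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def band_gain_db(freq_hz, bands):
    """Gain in dB for a frequency; bands is a list of (low_hz, high_hz, gain_db)."""
    for low, high, gain in bands:
        if low <= freq_hz < high:
            return gain
    return 0.0  # frequencies outside every band are left unchanged


def correct_bands(signal, sample_rate, bands):
    """Scale each frequency bin by the gain of the band it falls in."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k carry the same physical frequency (conjugate pair),
        # so both get the same gain and the output stays real-valued.
        freq = min(k, n - k) * sample_rate / n
        spectrum[k] *= 10 ** (band_gain_db(freq, bands) / 20)
    return idft(spectrum)
```

To correct each ear separately, `correct_bands` would be called once per ear with that ear's own band table, for example one derived from the user's audiogram.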

Preferably, the spatial sensation synthesis device further comprises instruction means with which the user designates the position of an object of interest in the target space, and volume control means for individually controlling the volume of the separated sound signals from the sound source separation means in response to instructions from the instruction means.

Preferably, the sound reproduction means is headphones or earphones, and the face posture detection means includes a gyro and a compass mounted on the headphones.

Preferably, the sound reproduction means is headphones or earphones, and the face posture detection means estimates the user's face posture from a captured image of the user.

Preferably, the sound source localization means identifies the position of a sound source when the direction of arrival of sound obtained from the microphone array intersects a position detected by the position detection means.

Preferably, the system further comprises a database that stores the coefficients of a plurality of head-related transfer functions corresponding to directions from a sound source to each ear of the user. The sound space reconstruction means selects from the database the head-related transfer functions from the sound source position in the target space to each ear of the user, and synthesizes the sound signals that reproduce the spatial sensation at each ear.

According to another aspect of the invention, a hearing support device reproduces the sound environment of a target space according to a user's face posture, based on information from an environment sensor device that transmits information about that sound environment. The environment sensor device transmits position information indicating the position of a sound source in the target space, together with a separated sound signal obtained by separating the sound coming from the sound source position specified by the position information. The hearing support device comprises: face posture detection means for detecting the face posture of the user in the target space; sound reproduction means, worn by the user, for reproducing sound corresponding to the sound environment at the user's two ears; and sound space reconstruction means that receives the position information of the sound source and, according to the detected face posture, uses the head-related transfer functions from the sound source position in the target space to each ear of the user to synthesize, from the separated sound signals, the sound signals to be reproduced at each ear by the sound reproduction means.

Preferably, the device further comprises frequency characteristic correction means for correcting the volume in each frequency band to match the hearing-loss characteristics of each of the user's ears.

Preferably, the device further comprises instruction means with which the user designates the position of an object of interest in the target space, and volume control means for individually controlling the volume of the separated sound signals from the sound source separation means in response to instructions from the instruction means.

According to the invention, natural-sounding hearing support can be achieved by reproducing the observed three-dimensional sound environment according to the position and posture of the head of the person being assisted.

Also according to the invention, separating the individual sounds in the environment makes it possible to selectively control which sounds the user receives and which are suppressed.

FIG. 1 is a conceptual illustration of a usage scene of the hearing support system 1000 of this embodiment.
FIG. 2 is a block diagram for explaining the configuration of the hearing support system 1000 of this embodiment.
FIG. 3 is a functional block diagram for explaining the configuration of the sound source localization device 300.
FIG. 4 is a functional block diagram for explaining the sound source separation processing.
FIG. 5 is a functional block diagram for explaining the spatial sensation synthesis unit 500.
FIG. 6 is a block diagram for explaining the hardware configuration of the sound source localization device 300.
FIG. 7 is a diagram showing an example screen of the interface.

The configuration of the hearing support system according to an embodiment of the invention is described below with reference to the drawings. In the following embodiment, components and processing steps given the same reference numerals are identical or equivalent, and their description is not repeated unless necessary.

In the following description the sound sensor is a microphone, more specifically an electret condenser microphone, but any other sound sensor may be used as long as it can detect sound as an electrical signal.

Reproduction of the sound environment on the operator's side is described using stereo headphones as the example. Earphones that play sound separately to the right and left ears may of course be used instead.

FIG. 1 is a conceptual illustration of a usage scene of the hearing support system 1000 of this embodiment.

In a shared space such as a home for the elderly or a nursing care facility, multiple users share the environment sensors. The hearing support system 1000 suppresses unneeded or unpleasant sounds, such as doors, footsteps, tableware, and air conditioners. It emphasizes the voice of a conversation partner or the sound of a television that the user is attending to (targets of the user's attention) and a voice addressing the user from behind (speech directed at the user), so that each user receives only the sounds that he or she should hear in that situation.

Here, the environment sensors include "microphone arrays" for the sound source localization and separation described later, and "distance sensors (for example, laser range finders: LRFs)" for tracking the positions of objects (people in particular) in the space. The distance sensors need not all be fixed; they may include sensors mounted on autonomously mobile robots that move through the space.

FIG. 2 is a block diagram for explaining the configuration of the hearing support system 1000 of this embodiment.

In FIG. 2, the coordinate system of the space the user occupies is taken to be (x, y, z).

In the hearing support system 1000, the environment sensor network that observes environmental sound comprises: a microphone array group 100 including one or more microphone arrays 10.1 to 10.M; an LRF group 200 including a plurality of laser range finders (LRFs) 20.1 to 20.L; and a sound source localization device 300 that, based on the outputs of the microphone array group 100 and the LRF group 200, localizes and tracks the sound sources present in the user's environment and separates them.

In the sound source localization device 300, the person position detection and tracking unit 310 uses the output of the LRF group 200 to detect information indicating where people are located (called person position information) and tracks each person's position as the person moves, including during periods when the person is not speaking. The sound source localization unit 320 receives the output of the microphone array group 100 and the person position information from the person position detection and tracking unit 310, and performs sound source localization based on the audio signals output by the microphone array group 100. The sound source separation unit 330 separates the sound sources, collects the sound from each separated source, and outputs the separated sounds. Information on the direction and position of each sound source from the sound source localization unit (called direction and position information) is output as well.

The spatial sensation synthesis unit 500 of the hearing support system 1000 comprises: a volume control unit 510 that receives the separated sounds from the sound source separation unit 330 and normalizes their volume; a face posture estimation unit 520 that estimates the orientation of user 2's face from the information supplied by a sensor 600 on the headphones worn by user 2; and a sound space reconstruction unit 540. According to the received direction and position information and the estimated face orientation of user 2, the sound space reconstruction unit 540 selects from a database 530 the head-related transfer functions (HRTFs) for the left and right channels that match the sound source position and the face orientation, convolves them with the separated sounds, and reconstructs and synthesizes the sound to be played to user 2 through stereo headphones 610.
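The selection and convolution step can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the sound space reconstruction unit 540: it works in the horizontal plane only, treats the database as a dictionary from azimuth in degrees to a `(left, right)` pair of impulse responses, and picks the nearest stored direction without interpolation. All function names and the toy database layout are hypothetical.

```python
import math


def relative_azimuth_deg(source_xy, head_xy, head_yaw_deg):
    """Azimuth of the source as seen from the head, in degrees in [-180, 180).

    0 means straight ahead; positive angles are counter-clockwise (to the left).
    """
    dx = source_xy[0] - head_xy[0]
    dy = source_xy[1] - head_xy[1]
    world = math.degrees(math.atan2(dy, dx))
    return (world - head_yaw_deg + 180.0) % 360.0 - 180.0


def nearest_hrir(hrir_db, azimuth_deg):
    """Pick the stored (left, right) impulse responses closest in angle."""
    def angular_gap(stored_deg):
        return abs((stored_deg - azimuth_deg + 180.0) % 360.0 - 180.0)
    return hrir_db[min(hrir_db, key=angular_gap)]


def convolve(signal, kernel):
    """Direct-form linear convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out


def render_binaural(separated, source_xy, head_xy, head_yaw_deg, hrir_db):
    """Return (left, right) signals for one separated source."""
    azimuth = relative_azimuth_deg(source_xy, head_xy, head_yaw_deg)
    left_ir, right_ir = nearest_hrir(hrir_db, azimuth)
    return convolve(separated, left_ir), convolve(separated, right_ir)
```

Because the azimuth is recomputed from the current head yaw on every call, a source that stays fixed in the room keeps its apparent direction when the user turns the head. With several sources, the per-source left and right outputs would be summed per ear.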

A gyro sensor and a compass attached to the top of the headphones 610 can be used as the sensor 600 for tracking the rotation of user 2's head.

The volume control unit 510 may also be configured so that user 2 can adjust the volume of each separated sound source independently through a user interface shown on the display unit 650.
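A minimal sketch of such volume control is shown below. It assumes RMS normalization to a common level followed by a per-source linear gain chosen by the user; the source does not specify the normalization rule, and the names `normalize`, `apply_volume`, and `target_rms` are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square level of a signal (0.0 for an empty signal)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def normalize(signal, target_rms=0.1):
    """Scale a signal so that its RMS level equals target_rms."""
    level = rms(signal)
    if level == 0.0:
        return list(signal)  # silence stays silence
    return [s * target_rms / level for s in signal]


def apply_volume(separated, user_gains, target_rms=0.1):
    """Normalize every separated source, then apply the user's gain for it.

    separated:  {source_id: list of samples}
    user_gains: {source_id: linear gain}; missing sources default to 1.0
    """
    return {
        source_id: [user_gains.get(source_id, 1.0) * s
                    for s in normalize(samples, target_rms)]
        for source_id, samples in separated.items()
    }
```

A gain of 0.0 mutes a source such as a door or air conditioner, while a gain above 1.0 emphasizes a conversation partner. The scaled signals would then be passed on for spatial rendering rather than summed directly.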

FIG. 3 is a functional block diagram for explaining the configuration of the sound source localization device 300.

Referring to FIG. 3, the sound source localization unit 320 comprises three-dimensional spatial DOA evaluation units 3202.1 to 3202.M, each of which estimates the three-dimensional direction of arrival (DOA) of sound from the signals of the corresponding microphone array 10.1 to 10.M, and a three-dimensional spatial map storage unit 3204 that stores a three-dimensional map of the space. The spatial information integration unit 3206 obtains three-dimensional person position information by integrating the positional relationship between the environment represented by the three-dimensional map and the microphone arrays, the DOA of each sound source, and the information from the person position detection and tracking unit 310. The person position detection and tracking unit 310, which constitutes a human tracking system, tracks this person position information continuously, including while the person is not speaking.

In the sound source separation unit 330, the sound source separation processing units 3302.1 to 3302.j (j: the number of speakers or sound sources of interest) separate each person's voice based on the estimated person position information and send it to the spatial sensation synthesis unit 500 together with the position information from the spatial information integration unit 3206.

The operation of each unit is described in more detail below.
(3D sound source localization)
For sound source localization, the three-dimensional DOA evaluation units 3202.1 to 3202.M first perform DOA estimation for each of the microphone arrays 10.1 to 10.M. The spatial information integration unit 3206 then estimates the position of each sound source in three-dimensional space by integrating the DOA information from one or more arrays with the person position information from the person position detection and tracking unit 310.
DOA estimation of sound in real environments has been widely studied. The MUSIC method is one of the most effective methods for localizing multiple sources with high resolution and is disclosed, for example, in Patent Documents 2 and 3 cited above. The number of sound sources is assumed to be a fixed value, and peaks of the MUSIC spectrum that exceed a threshold are recognized as sound sources. Even when the MUSIC method is implemented with, for example, an angular resolution of 1 degree and an update every 100 ms, the directions of the sound sources can be searched in real time on a single-core CPU with an operating clock frequency of 2 GHz.
Furthermore, the most important sound source for the hearing support system 1000 is the human voice. To capture human voices without omission, the sound source localization device 300 therefore uses a human tracking system composed of a plurality of two-dimensional LRFs. When the DOA estimate from a microphone array and an LRF tracking result intersect at the same position (or at positions within a predetermined distance of each other), the spatial information integration unit 3206 judges that a sound source is likely to be present there.
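The intersection test between a DOA estimate and the LRF tracking results can be illustrated with a minimal sketch. It works in the horizontal plane only and assumes a hypothetical function name, a map-frame azimuth convention, and a 0.5 m value for the "predetermined distance"; none of these are specified in the description.

```python
import math

def match_doa_to_people(array_xy, doa_azimuth_deg, people_xy, max_dist=0.5):
    """Return indices of tracked people lying close to the DOA ray.

    array_xy and each entry of people_xy are (x, y) positions in metres
    in the map frame; doa_azimuth_deg is the horizontal DOA in that frame.
    max_dist stands in for the "predetermined distance" (assumed value).
    """
    ux = math.cos(math.radians(doa_azimuth_deg))
    uy = math.sin(math.radians(doa_azimuth_deg))
    matches = []
    for idx, (px, py) in enumerate(people_xy):
        dx, dy = px - array_xy[0], py - array_xy[1]
        along = dx * ux + dy * uy        # signed distance along the ray
        if along <= 0.0:                 # not in the direction the ray points
            continue
        across = abs(dx * uy - dy * ux)  # perpendicular distance to the ray
        if across <= max_dist:
            matches.append(idx)
    return matches
```

A person whose tracked position lies within `max_dist` of the ray is treated as a likely sound source; a full implementation would also use the elevation angle and the outputs of several arrays.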

When two-dimensional LRFs are used, as in the sound source localization device 300, the person position information is limited to two dimensions. The sound source is therefore identified with the additional constraint that the detected source position lies within the range of mouth height (for example, z = 1 to 1.6 m). In silent sections and in sections where the source direction estimate is insufficient, sound source separation is performed using the last estimated mouth height and the latest two-dimensional position information.
(Sound source separation)
The sound source separation unit 330 separates the selected persons (and other sound sources of interest), j in number, in parallel.

FIG. 4 is a functional block diagram for explaining this sound source separation processing.

In sound source separation, the selected persons are separated in parallel.

Here, the number of microphones (Mic) is assumed to be N, and i satisfies 1 ≤ i ≤ N.

As the first step of separation, the stationary noise estimation unit 3310.k performs suppression of stationary noise, such as that from an air conditioner, for each microphone channel. The noise suppression unit 3312.i uses a Wiener filter, as shown in the following equation (1), as the stationary noise suppression method.

X_i(f) represents the frequency component of the observed signal. The stationary noise N_i(f) is estimated as the average spectrum over sections in which the voice of the target person is absent.
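Equation (1) is not reproduced in this text. As an illustration only, the following sketch assumes a common spectral-gain form of the Wiener filter, in which the gain is one minus the ratio of noise power to observed power, limited by a floor. The function names, the gain floor, and the use of power spectra are assumptions, not taken from the description.

```python
import numpy as np

def estimate_noise_psd(X_noise_only):
    """Average power spectrum over frames without target speech.

    X_noise_only: complex STFT frames, shape (frames, bins).
    """
    return np.mean(np.abs(X_noise_only) ** 2, axis=0)

def wiener_suppress(X, noise_psd, gain_floor=0.1):
    """Apply a Wiener-type gain to one microphone channel.

    X: complex STFT of the observed signal, shape (frames, bins).
    noise_psd: stationary noise power per bin, shape (bins,).
    gain_floor: assumed lower bound on the gain.
    """
    power = np.abs(X) ** 2
    gain = 1.0 - noise_psd / np.maximum(power, 1e-12)
    return np.maximum(gain, gain_floor) * X
```

Bins where the observed power barely exceeds the noise estimate are attenuated down to the floor, while bins dominated by speech pass almost unchanged.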

The stationary noise suppression by the noise suppression unit 3312.i could also be applied as a post-filter after the beamformer, but here it is applied before the beamformer in order to suppress the generation of musical noise.

The DS beamformer units 3314.1 to 3314.j apply a beamformer based on the direction (azimuth and elevation) and distance information obtained from the sound source localization unit. Here, a delay-sum beamformer, which is computationally light and robust, is used to separate and enhance the voice of the person in the target direction. The frame length is 20 ms and the shift length is 10 ms.
The number j of speakers or sound sources of interest is assumed to be set to a predetermined value in advance.

The delay-sum beamformer is disclosed, for example, in the following document.

Document 1: International Publication WO 2004/034734 (Japanese Re-publication No. 2004-034734)
The basic principle of beamforming is briefly explained below, taking the case of two microphones as an example.

Consider two omnidirectional microphones with exactly identical characteristics placed at a spacing d, with a plane wave arriving from direction θ. The plane wave is received at the two microphones as signals whose propagation delays differ by the path difference d sin θ. A beamformer, that is, a device that performs beamforming, delays one microphone signal by δ = d sin θ₀ / c (where c is the speed of sound) so as to compensate for the propagation delay of a signal arriving from a certain direction θ₀, and adds the result to, or subtracts it from, the other microphone signal.

At the input of the adder, the phases of the signals arriving from direction θ₀ coincide, so the signal arriving from direction θ₀ is enhanced at the adder output. Signals arriving from directions other than θ₀ are not in phase with each other and are therefore not enhanced as much as the signal from θ₀. As a result, a beamformer that uses the adder output forms a directivity with a beam (a direction of particularly high sensitivity) at θ₀. In contrast, the subtractor completely cancels the signal arriving from direction θ₀, so a beamformer that uses the subtractor output forms a directivity with a null (a direction of particularly low sensitivity) at θ₀. A beamformer that performs only delay and addition in this way is called a "delay-sum beamformer".
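The two-microphone principle described above can be checked numerically with a small sketch. It evaluates the magnitude response to a unit plane wave for the adder and subtractor outputs; the function name, the speed-of-sound value, and the normalization by 2 are assumptions made for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def two_mic_response(freq_hz, d, theta_deg, theta0_deg, subtract=False):
    """Magnitude response of a two-microphone beamformer steered to theta0.

    A plane wave from theta reaches the second microphone
    tau = d*sin(theta)/c later than the first; the beamformer delays
    the first microphone by delta = d*sin(theta0)/c before combining.
    """
    tau = d * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    delta = d * math.sin(math.radians(theta0_deg)) / SPEED_OF_SOUND
    mic1_delayed = cmath.exp(-2j * math.pi * freq_hz * delta)
    mic2 = cmath.exp(-2j * math.pi * freq_hz * tau)
    out = mic1_delayed - mic2 if subtract else mic1_delayed + mic2
    return abs(out) / 2.0
```

With `theta_deg == theta0_deg` the adder response is 1 (the beam) and the subtractor response is 0 (the null); for other directions the adder response is at most 1.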

More generally, assuming that a directional sound source S and an omnidirectional noise source N are present in the space, the output of the delay-sum beamformer takes the following form:

Y_DS(f) is the beamformer output at frequency f, Sdir is the direction of the signal, and w_Sdir is the beamformer response in the direction Sdir. The second term of the equation represents the noise mixed into the separated speech. To reduce this noise component, each frequency is multiplied by the following weight.

Y_i is the beamformer output after weighting. From here on, i is redefined so that 1 ≤ i ≤ j.

Because the DS beamformer alone cannot achieve sufficient sound source separation, the inter-channel suppression unit 3316 performs processing to suppress the leakage of signals (interfering sounds) between channels (inter-channel suppression). A Wiener filter is used for this interference suppression, as shown in the following equation (5).

As shown in equation (6), I_i(f) represents the strongest frequency component among the separated sound sources other than the target sound. One problem with the interference suppression described above is that, when the target sound and an interfering sound lie in the same direction, the target sound is likely to be distorted.

Therefore, when the difference between the direction of the target sound (dir₁) and the direction of the interfering sound (dir₂) is within a predetermined angle, for example 5 degrees, a constraint is imposed, in accordance with the following equation (7), so that the suppression is not performed.
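Equations (5) to (7) are not reproduced in this text. The sketch below is one possible reading, assuming a Wiener-type gain computed from the strongest interfering source in each frequency bin, with sources within the minimum angular separation of the target excluded from suppression. The function name, the gain floor, and the use of azimuth alone are assumptions.

```python
import numpy as np

def inter_channel_suppress(Y, directions_deg, min_sep_deg=5.0, gain_floor=0.1):
    """Suppress leakage between separated sources.

    Y: complex beamformer outputs for one frame, shape (sources, bins).
    directions_deg: azimuth of each separated source, in degrees.
    min_sep_deg: sources closer than this to the target are not used
                 as interferers (the 5-degree example in the text).
    """
    Y = np.asarray(Y)
    dirs = np.asarray(directions_deg, dtype=float)
    power = np.abs(Y) ** 2
    out = np.empty_like(Y)
    for i in range(Y.shape[0]):
        # Wrapped angular difference between every source and target i
        diff = np.abs((dirs - dirs[i] + 180.0) % 360.0 - 180.0)
        mask = diff > min_sep_deg
        mask[i] = False
        if not mask.any():
            out[i] = Y[i]  # no eligible interferer: leave untouched
            continue
        interference = power[mask].max(axis=0)  # strongest interferer per bin
        gain = 1.0 - interference / np.maximum(power[i], 1e-12)
        out[i] = np.maximum(gain, gain_floor) * Y[i]
    return out
```

Skipping interferers that lie close to the target direction avoids distorting the target when both arrive from nearly the same direction.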

Finally, because the observed sound pressure depends on the distance r_i between the sound source and the microphone array, the gain normalization units 3318.1 to 3318.j normalize the amplitude for distance (gain normalization) by applying a gain g_i as follows.

FIG. 5 is a functional block diagram for explaining the spatial sensation synthesis unit 500.

The spatial sensation synthesis unit 500 receives the separated sounds provided from the environment sensor side and reconstructs the spatial sensation of the sound, taking into account the relative positions of the user and the target sound sources. The processing consists of volume adjustment for the multiple sound sources and synthesis of a sound image using head-related transfer functions (HRTFs).

The volume control unit 510 includes volume control processing units 5102.1 to 5102.j, each of which receives a separated sound from the sound source separation unit 330 and normalizes its volume.

To correct for differences caused by the distance between each sound source and the array, the volume control unit 510 normalizes each separated voice according to its distance as follows.

Here, N is the number of sound sources and dist_n is the distance between the n-th sound source and the array. g_i is the normalization factor applied to the separated sound Y_{PF,i} from the i-th sound source, and Y_i is the separation result for the i-th sound source.
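The normalization equation itself is not reproduced in this text. Purely as an illustration, the sketch below assumes that each factor g_i is the distance of the i-th source divided by the mean distance over the N sources, so that distant sources are boosted relative to near ones. This form and the function name are assumptions and may differ from the equation in the original publication.

```python
def distance_gains(distances_m):
    """Hypothetical normalization factors g_i = dist_i / mean(dist_n).

    distances_m: distance from each separated source to the array, in metres.
    """
    mean_dist = sum(distances_m) / len(distances_m)
    return [dist / mean_dist for dist in distances_m]
```

Each separated signal Y_{PF,i} would then be multiplied by its own g_i before binaural synthesis.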

The face posture estimation unit 520 estimates the orientation of the face of the user 2 based on information from the sensor 600 mounted on the headphones worn by the user 2.

The method of estimating the face orientation of the user 2 is not limited to this configuration. For example, an image of the user 2 may be captured and the head posture of the user 2 estimated from the captured image data. The method of estimating head posture from a captured image is not particularly limited; one example is disclosed in the following document.

Document 2: JP 2014-93006 A
In the sound space reconstruction unit 540, the space reconstruction unit 550 reconstructs the position of each sound source in the coordinate system (x, y, z) according to the direction and position information received from the environment sensor side and the estimated face orientation of the user 2, and, based on the estimated face orientation, selects from the database 530 the accurate head-related transfer functions (HRTFs) for the left and right channels.

Here, the head-related transfer function (HRTF) is the impulse response, measured at the entrance of the listener's ear canal, to an impulse signal emitted from an arbitrarily placed sound source. It is also disclosed, for example, in the following document.

Document 3: JP 2010-118978 A
In the sound space reconstruction unit 540, the HRTF processing units 5502.1 to 5502.j convolve the separated, volume-controlled voices with the selected head-related transfer functions. The left-ear sound synthesis unit 5504.1 and the right-ear sound synthesis unit 5504.2 then synthesize the left-ear sound and the right-ear sound, which are played to the user 2 through the left and right speakers of the stereo headphones 610 via the left-ear frequency characteristic correction unit 5506.1 and the right-ear frequency characteristic correction unit 5506.2, respectively.

The left-ear frequency characteristic correction unit 5506.1 and the right-ear frequency characteristic correction unit 5506.2 control the volume of each frequency band for the left ear and the right ear, respectively, in accordance with the hearing loss characteristics of the user 2 measured in advance. For example, if the hearing ability of the user 2 is reduced in the high-frequency range of the right ear, the sound in the high-frequency range for the right ear is emphasized to compensate.
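A per-band volume correction of this kind can be sketched as a gain applied in the frequency domain. The function name, the band edges, and the gain values are hypothetical, and processing the whole signal with a single FFT is a simplification; a real-time system would process short overlapping frames.

```python
import numpy as np

def correct_bands(signal, sample_rate, band_edges_hz, band_gains_db):
    """Apply per-band gains to the signal for one ear.

    band_edges_hz: list of (low, high) band limits in Hz.
    band_gains_db: gain in dB for each band, e.g. taken from a
                   previously measured hearing loss profile (assumed).
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    gain = np.ones_like(freqs)
    for (low, high), gain_db in zip(band_edges_hz, band_gains_db):
        in_band = (freqs >= low) & (freqs < high)
        gain[in_band] = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))
```

For the example in the text, the right-ear signal would be passed through `correct_bands` with a positive gain on the high-frequency bands, while the left-ear signal would use its own profile.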

Reproduction of a 3D sound field with headphones exploits the fact that, in daily life, people localize sound sources from the differences between the sound waves reaching their two ears. By reproducing these differences with the headphones 610, a 3D sound field can be synthesized on stereo headphones.

The head-related transfer function (HRTF) is a function that expresses the differences in the sound waves emitted from a source in space at the point where they reach a person's two ears, and it is widely used for binaural reproduction of 3D sound fields. However, when headphones are used to reproduce a sound source located in space, the virtual sound source moves together with the listener's head and body. In everyday experience, the position of an external sound source is fixed and does not depend on the movement of the listener's body. Because reproduction of a 3D sound field with headphones departs from this experience, it reduces the sense of presence and causes an unnatural impression. In addition, head-related transfer functions are subject to front-back confusion, in which a source in front is heard as if it were behind, or vice versa. In daily life, people turn their heads, consciously or unconsciously, to localize a sound source and use the resulting changes as an aid to localization.

In view of the above, the hearing support system 1000 tracks the head rotation of the user 2 and synthesizes stereo sound using the HRTFs that match the orientation of the head. The continuous sound source position information needed to select the accurate HRTFs is obtained from the DOA estimation results of the multiple microphone arrays and from the person position estimation system.

That is, to make a voice be heard from a specific direction, it is filtered with the HRTF corresponding to that direction and rendered in stereo. The database of HRTF coefficients is not particularly limited; for example, the publicly available HRTF database of the KEMAR (Knowles Electronics Manikin for Acoustic Research) dummy head can be used. KEMAR is a dummy head built with typical head dimensions for HRTF research. The database contains the responses of the dummy head's left and right ears to impulse signals from the surrounding space, as impulse responses for a total of 710 directions at elevations from -40 degrees to 90 degrees. Each impulse response is 512 samples long, and the sampling frequency is 44.1 kHz. It is also possible to synthesize HRTFs that match the shape of the subject's head and use them as the database.

Dynamic synthesis of the sound field with HRTFs requires real-time detection of the head orientation. As described above, a gyro sensor and a compass can therefore be attached to the top of the headphones to track head rotation. The angle information is sent to the system over either a serial connection or Bluetooth. The direction used for sound field synthesis is the sound source direction minus the head angle. The impulse responses of the left and right channels corresponding to this direction are selected from the database and convolved with the separation result, and the resulting sound is played to the user's two ears.
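The selection and convolution steps above can be sketched as follows. The database is modeled as a hypothetical dictionary that maps an azimuth in degrees to a pair of left and right impulse responses; a real database such as the KEMAR set is also indexed by elevation, which this sketch omits.

```python
import numpy as np

def relative_azimuth(source_az_deg, head_yaw_deg):
    """Source direction minus head angle, wrapped to [-180, 180)."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

def render_binaural(mono, source_az_deg, head_yaw_deg, hrir_db):
    """Convolve a separated signal with the nearest stored impulse responses.

    hrir_db: hypothetical dict {azimuth_deg: (left_ir, right_ir)}.
    Returns the (left, right) ear signals.
    """
    rel = relative_azimuth(source_az_deg, head_yaw_deg)
    # Pick the stored direction with the smallest wrapped angular distance
    nearest = min(hrir_db, key=lambda az: abs(relative_azimuth(az, rel)))
    left_ir, right_ir = hrir_db[nearest]
    return np.convolve(mono, left_ir), np.convolve(mono, right_ir)
```

Because the relative direction is recomputed from the tracked head angle, the rendered source stays fixed in the room when the user turns the head.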

FIG. 6 is a block diagram for explaining the hardware configuration of the sound source localization device 300.

The spatial sensation synthesis unit 500 basically has the same configuration. That is, the functions of the functional blocks shown in FIGS. 3 to 5 are realized by software running on hardware such as that described below.

As shown in FIG. 6, the sound source localization device 300 includes a drive device 52 capable of reading data recorded on an external recording medium 64, a central processing unit (CPU) 56 connected to a bus 66, a ROM (Read Only Memory) 58, a RAM (Random Access Memory) 60, a nonvolatile storage device 54, and a data input interface (hereinafter, data input I/F) 68 for taking in audio data from the microphone arrays 10.1 to 10.M and ranging data from the laser range finders 20.1 to 20.L.

As the external recording medium 64, for example, an optical disc such as a CD-ROM or DVD-ROM, or a memory card, can be used. However, as long as the device that realizes the function of the recording medium drive 52 can read data stored on a nonvolatile recording medium such as an optical disc or flash memory, the recording medium is not limited to these. Likewise, as long as the device that realizes the function of the nonvolatile storage device 54 can store data in a nonvolatile manner and allows random access, it may be a magnetic storage device such as a hard disk, or a solid state drive (SSD) that uses nonvolatile semiconductor memory such as flash memory as its storage.

The main part of the sound source localization device 300 is realized by computer hardware and software executed by the CPU 56. In general, such software may be recorded in a mask ROM, programmable ROM, or the like when the sound source localization device 300 is manufactured and read into the RAM 60 at execution time, or it may be read from the recording medium 64 by the drive device 52, stored in the nonvolatile storage device 54, and read into the RAM 60 at execution time. Alternatively, when the device is connected to a network, the software may be copied from a server on the network to the nonvolatile storage device 54, read from the nonvolatile storage device 54 into the RAM 60, and executed by the CPU 56.

The computer hardware shown in FIG. 6 and its operating principle are of a general kind. Accordingly, one of the most essential parts of the present invention is the software stored in a recording medium such as the nonvolatile storage device 54.

In the case of the spatial sensation synthesis unit 500, the database 530 can also be stored in the nonvolatile storage device 54.
(Adjustment of sound source volume)
In the hearing support system 1000, stereo sound reflecting the position information is synthesized for every selected sound source, and the results are summed and played back as an output representing the sound field. With this alone, however, the volume of each selected sound source cannot be predicted. If the user can operate the volume of each sound source independently, the user can create a sound environment focused on the sound sources of interest.
Two user interfaces with different operation patterns for controlling the sound field are described below.

FIG. 7 shows examples of the screen display of such interfaces.

As a premise, the interface screen displays the positions of the speakers (and other sound sources of interest) identified by the sound source localization device 300 as a two-dimensional map. The user's own position is indicated by a hatched circle.

The first interface, shown in FIG. 7(a), lets the user select, from among the surrounding people, a person to be emphasized with a left mouse click and a person to be suppressed with a right mouse click. People to be emphasized are shown as black circles, and people to be suppressed are shown as white circles.

In the second interface, shown in FIG. 7(b), the volume of each sound source is adjusted according to the orientation of the user's face. Because the volume is operated with the face direction, both hands remain free. Sound sources within a predetermined range in front of the user's face are emphasized, and sound sources outside that range are attenuated. The volume adjustment factor may be made proportional to the magnitude of the angle from the frontal direction of the user's face.
In FIG. 7(b), the orientation of the user's face is indicated by the arrow attached to the hatched circle.
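The face-direction interface can be illustrated with a small gain function. The width of the frontal range, the boost value, the minimum gain, and the linear fall-off with angle are all assumed values chosen for the sketch; the description only states that sources inside a predetermined range are emphasized and those outside are attenuated.

```python
def facing_gain(source_az_deg, face_az_deg, half_width_deg=30.0,
                boost=2.0, min_gain=0.25):
    """Volume factor for one source given the user's face direction.

    Sources within half_width_deg of the face direction get `boost`;
    outside that range the gain falls linearly with the angular offset,
    reaching min_gain directly behind the user.
    """
    offset = abs((source_az_deg - face_az_deg + 180.0) % 360.0 - 180.0)
    if offset <= half_width_deg:
        return boost
    fade = (offset - half_width_deg) / (180.0 - half_width_deg)
    return 1.0 - (1.0 - min_gain) * fade
```

The returned factor would be supplied to the volume control processing unit for the corresponding source each time the head orientation is updated.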

With this configuration, the user can indicate the targets of interest, and the volume control processing units 5102.1 to 5102.j individually control the volume of each separated sound signal so that the sound from the sources the user is attending to is emphasized.
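The combination of individual volume control and summation into a single stereo output can be sketched as below. The function name and the array layout (samples by two channels) are assumptions; the gains could come from either of the two interfaces described above.

```python
import numpy as np

def mix_sources(stereo_sources, volumes):
    """Sum binaurally rendered sources with one volume factor per source.

    stereo_sources: list of arrays shaped (samples, 2), equal lengths.
    volumes: one gain per source, e.g. above 1 to emphasize a source
             and below 1 to suppress it.
    """
    mix = np.zeros_like(np.asarray(stereo_sources[0], dtype=float))
    for signal, volume in zip(stereo_sources, volumes):
        mix += volume * np.asarray(signal, dtype=float)
    return mix
```

In practice the mixed output would also need protection against clipping, which this sketch leaves out.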

As described above, by separating the individual sounds in the environment, the hearing support system of this embodiment can selectively control which sounds the user needs and which are unnecessary, something that a hearing aid alone has not been able to do. Using environment sensors makes it possible not only to emphasize target sounds and suppress unwanted sounds, but also to solve the problem of howling and the problem of the user's own voice sounding too loud. As a result, the volume can be raised higher than with a conventional hearing aid, and the target sounds and voices become easier to hear.

In addition, for each sound source separated by the environment sensors, the hearing support system of this embodiment can reconstruct a sound image (the sensation of the spatial information of the sound) according to the relative position and orientation of the sensors and the user. This allows the user to perceive spatial information, such as the direction from which a sound came.

The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

2: user; 10.1 to 10.M: microphone arrays; 20.1 to 20.L: LRFs; 100: microphone array group; 200: LRF group; 300: sound source localization device; 310: person position detection and tracking unit; 320: sound source localization unit; 330: sound source separation unit; 500: speech synthesis device; 510: volume control unit; 520: face posture estimation unit; 530: database; 540: sound space reconstruction unit; 550: space reconstruction unit; 600: sensor; 610: headphones; 650: display unit.

Claims

A hearing support system for assisting a user's hearing in a target space, comprising:
a sound source localization device installed in the target space, the sound source localization device including:
position detection means for detecting the position of an object in the target space;
sound source localization means for estimating the direction of arrival of sound from the output of a microphone array, integrating the estimate with the detection result of the position detection means, and identifying and outputting the position of a sound source; and
sound source separation means for separating and outputting the sound from the identified sound source position; and
a spatial sensation synthesis device for reconstructing the sound in the target space according to the face posture of the user, the spatial sensation synthesis device including:
face posture detection means for detecting the face posture of the user in the target space;
sound reproduction means worn by the user for reproducing the sound environment of the target space to both ears of the user; and
sound space reconstruction means for receiving the position of the sound source from the sound source localization means and, according to the detected face posture, synthesizing from the separated sound signal output by the sound source separation means a sound signal to be reproduced to each ear by the sound reproduction means, using the head-related transfer function from the position of the sound source in the target space to each ear of the user.

The hearing support system according to claim 1, wherein the spatial sensation synthesis device further includes frequency characteristic correction means for correcting the volume of each frequency band in accordance with the hearing loss characteristics of each ear of the user.

The hearing support system according to claim 1, wherein the spatial sensation synthesis device further includes:
instruction means for designating the position of an object of interest to the user in the target space; and
volume control means for individually controlling the volume of the separated sound signals from the sound source separation means according to an instruction from the instruction means.

The hearing support system according to claim 2, wherein the sound reproduction means is headphones or earphones, and
the face posture detection means includes a gyro and a compass attached to the headphones.

The hearing support system according to claim 2 or 3, wherein the sound reproduction means is headphones or earphones, and
the face posture detection means estimates the face posture of the user from a captured image of the user.

The hearing support system according to claim 1, wherein the sound source localization means identifies the position of the sound source when the direction of arrival of sound obtained from the microphone array intersects the position detected by the position detection means.

The hearing support system according to any one of claims 1 to 6, further comprising a database that stores coefficients of a plurality of head-related transfer functions according to directions from the sound source to each ear of the user,
wherein the sound space reconstruction means selects from the database the head-related transfer function from the position of the sound source in the target space to each ear of the user, and synthesizes a sound signal that reproduces a spatial sensation to each ear.
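Selecting coefficients from such a database and synthesizing the two ear signals can be sketched as a nearest-direction lookup followed by convolution. The example assumes the database is indexed by azimuth only and stores one impulse response per ear as a list of coefficients. The entries used in the usage note are toy values, not measured head-related transfer functions, and all names are hypothetical.

```python
def angular_distance(a_deg, b_deg):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def select_hrir(database, azimuth_deg):
    """Pick the (left, right) coefficient pair stored for the nearest direction."""
    nearest = min(database, key=lambda az: angular_distance(az, azimuth_deg))
    return database[nearest]

def convolve(signal, impulse_response):
    """Direct-form convolution of a signal with filter coefficients."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, coeff in enumerate(impulse_response):
            out[i + j] += sample * coeff
    return out

def render_binaural(signal, database, azimuth_deg):
    """Synthesize the left-ear and right-ear signals for one separated source."""
    left_ir, right_ir = select_hrir(database, azimuth_deg)
    return convolve(signal, left_ir), convolve(signal, right_ir)
```

With a toy database such as `{0.0: ([1.0], [1.0]), 90.0: ([0.5], [1.0])}`, calling `render_binaural(samples, database, 80.0)` uses the entry stored for 90 degrees. A practical system would more likely interpolate between neighbouring directions and use FFT-based convolution, which this sketch omits for brevity.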

A hearing support device for reproducing the sound environment of a target space according to the face posture of a user, based on information from an environment sensor device that transmits information about the sound environment of the target space, wherein the environment sensor device transmits position information indicating the position of a sound source in the target space and a separated sound signal obtained by separating the sound coming from the position of the sound source specified by the position information, the hearing support device comprising:
face posture detection means for detecting the face posture of the user in the target space;
sound reproduction means, worn by the user, for reproducing sound corresponding to the sound environment to both ears of the user; and
sound space reconstruction means that receives the separated sound signal and, based on the position information of the sound source and according to the detected face posture, synthesizes a sound signal to be reproduced to each ear by the sound reproduction means, using the head-related transfer function from the position of the sound source in the target space to each ear of the user.

The hearing support device according to claim 8, further comprising frequency characteristic correction means for correcting the volume of each frequency band in accordance with the hearing-loss characteristic of each ear of the user.

The hearing support device according to claim 8 or 9, further comprising:
instruction means for designating the position of a target of interest to the user in the target space; and
volume control means for individually controlling the volume of each separated sound signal from the sound source separation means according to an instruction from the instruction means.

The hearing support device according to claim 9 or 10, wherein the sound reproduction means is a headphone or an earphone, and the face posture detection means includes a gyro and a compass attached to the headphone.

The hearing support device according to claim 9 or 10, wherein the sound reproduction means is a headphone or an earphone, and the face posture detection means estimates the face posture of the user from a captured image of the user.

The hearing support device according to any one of claims 8 to 12, further comprising a database that stores coefficients of a plurality of head-related transfer functions according to directions from the sound source to each ear of the user,
wherein the sound space reconstruction means selects from the database the head-related transfer function from the position of the sound source in the target space to each ear of the user, and synthesizes a sound signal that reproduces a spatial sensation to each ear.