WO2026018859A1 - Information processing method, information processing system, and program

Info

Publication number: WO2026018859A1
Application number: PCT/JP2025/025424
Authority: WO
Inventors: 正之西口; 健太竹内; 貫治渡邉; 幸治安倍; 陽宇佐見; 智一石川; 宏幸江原; 康太中橋
Original assignee: Akita Prefectural University; Panasonic Holdings Corp
Current assignee: Akita Prefectural University; Panasonic Holdings Corp
Priority date: 2024-07-19
Filing date: 2025-07-16
Publication date: 2026-01-22
Anticipated expiration: 2027-01-19

Abstract

Provided is an information processing method comprising: a step of acquiring, in a first terminal, first sound information; a step (S102) of converting, in the first terminal, the first sound information into second sound information for generating, by using an acoustic signal, a representative sound arriving at a reference position from a representative point set in a three-dimensional sound field; a step (S105) of detecting, in a second terminal, the position of a user or the head direction of the user in the three-dimensional sound field; a step of calculating, in the second terminal, a position of a playback representative point corresponding to the position of the representative point on the basis of the reference position and the detected position of the user or head direction of the user; and a step (S106) of generating, in the second terminal, an output sound signal by using the received second sound information and a head-related transfer function corresponding to the calculated position of the playback representative point.

Description

Information processing method, information processing system, and program

The present disclosure relates to an information processing method, an information processing system, and a program.

Conventionally, sound reproduction techniques are known that allow a user to perceive three-dimensional sound in a virtual three-dimensional space (see, for example, Patent Document 1). To make the user perceive a sound as arriving from a sound source object in such a space, processing is required to generate output sound information from the original sound information. In particular, reproducing three-dimensional sound that follows the user's body movements in a virtual space requires a very large amount of processing. Advances in computer graphics (CG) have made it relatively easy to build visually complex virtual environments, so techniques for producing the corresponding auditory information have become important. In addition, when the processing from sound information to output sound information is performed in advance, a large storage area is needed to hold the precomputed results, and transmitting such large results may require a wide communication bandwidth.

Creating a more realistic sound environment means more sound-emitting objects in the virtual three-dimensional space and more secondary sounds arising from acoustic effects such as reflection, diffraction, and reverberation. These secondary sounds must also change appropriately as the user moves, so a large amount of processing is required. To reduce this processing load, a conversion technique called panning processing (or simply panning) is known, which represents sounds in a three-dimensional space by sounds from several representative points set in advance in that space.

Patent Document 1: Japanese Patent Application Laid-Open No. 2020-18620

However, conversion processing such as panning may degrade sound quality as a trade-off for the reduced processing load. An object of the present disclosure is therefore to provide an information processing method and the like for performing conversion processing appropriately.

An information processing method according to one aspect of the present disclosure is executed by a plurality of information processing terminals and includes the steps of: acquiring, at a first terminal that is one of the plurality of information processing terminals, first sound information including an acoustic signal and information on the position of a sound source object in a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal; converting, at the first terminal, the first sound information into second sound information for generating, using the acoustic signal, a representative sound that arrives at a reference position from a representative point set in the three-dimensional sound field; transmitting, at the first terminal, the second sound information to a second terminal that is another of the plurality of information processing terminals; detecting, at the second terminal, the position of a user or the direction of the user's head in the three-dimensional sound field; calculating, at the second terminal, the position of a playback representative point corresponding to the position of the representative point, based on the reference position and the detected position of the user or direction of the user's head; and generating, at the second terminal, an output sound signal using the received second sound information and a head-related transfer function corresponding to the calculated position of the playback representative point.

An information processing system according to one aspect of the present disclosure includes a first terminal and a second terminal. The first terminal includes: an acquisition unit that acquires first sound information including an acoustic signal and information on the position of a sound source object in a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal; a conversion unit that converts the first sound information into second sound information for generating, using the acoustic signal, a representative sound that arrives at a reference position from a representative point set in the three-dimensional sound field; and a transmission unit that transmits the second sound information to the second terminal. The second terminal includes: a detector that detects the position of a user or the direction of the user's head in the three-dimensional sound field; a calculation unit that calculates the position of a playback representative point corresponding to the position of the representative point, based on the reference position and the detected position of the user or direction of the user's head; and an output unit that outputs an output sound signal using the received second sound information and a head-related transfer function corresponding to the calculated position of the playback representative point.

One aspect of the present disclosure can also be implemented as a program for causing a computer to execute the information processing method described above.

These general or specific aspects may be implemented as a system, a device, a method, an integrated circuit, a computer program, or a non-transitory computer-readable recording medium such as a CD-ROM, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.

According to the present disclosure, conversion processing can be performed appropriately.

FIG. 1 is a schematic diagram showing a use case of a sound reproduction system according to an embodiment.
FIG. 2 is a block diagram showing the functional configuration of the sound reproduction system according to the embodiment.
FIG. 3 is a diagram for explaining an example of an audio signal according to the embodiment.
FIG. 4 is a block diagram showing the functional configuration of an acquisition unit according to the embodiment.
FIG. 5 is a block diagram showing the functional configuration of an output sound generation unit according to the embodiment.
FIG. 6 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 7 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 8 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 9 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 10 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 11 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 12 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 13 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 14 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 15 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 16 is a diagram for explaining another example of the sound reproduction system according to the embodiment.
FIG. 17A is a flowchart showing an operation example of an information processing device according to the embodiment.
FIG. 17B is a flowchart showing an operation example of the information processing device according to the embodiment.
FIG. 18 is a flowchart of audio reproduction processing according to the embodiment.
FIG. 19 is a diagram for explaining the synthesis of head-related transfer functions in the audio reproduction processing according to the embodiment.
FIG. 20 is a diagram for explaining the arrangement of representative directions according to the embodiment.
FIG. 21 is a diagram for explaining an example of time shift value calculation according to the embodiment.
FIG. 22 is a diagram for explaining an example of time shift value calculation according to the embodiment.
FIG. 23 is a diagram for explaining an example of gain value calculation according to the embodiment.
FIG. 24 is a diagram showing results of the gain value calculation according to the embodiment.
FIG. 25 is a diagram showing results of the gain value calculation according to the embodiment.
FIG. 26 is a diagram for verifying results of the time shift value calculation according to the embodiment.
FIG. 27 is a diagram for verifying results of the gain value calculation according to the embodiment.
FIG. 28 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.
FIG. 29 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.
FIG. 30 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.
FIG. 31 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.
FIG. 32 is a diagram showing set fixed gain values according to the embodiment.
FIG. 33 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.
FIG. 34 is a diagram showing results of a localization experiment for verifying results of the gain value calculation according to the embodiment.

(Findings underlying the present disclosure)
Conventionally, sound reproduction techniques are known that allow a user to perceive three-dimensional sound in a virtual three-dimensional space (hereinafter sometimes called a three-dimensional sound field) (see, for example, Patent Document 1). With such a technique, the user perceives a sound as if a sound source object existed at a predetermined position in the virtual space and the sound arrived from that direction. Localizing a sound image at a predetermined position in the virtual three-dimensional space requires, for example, computation that gives the signal of the sound produced by the sound source object (also called the sound emitted at the sound source object, or the reproduced sound) an interaural time difference and an interaural level difference (or sound pressure difference) such that the sound is perceived as three-dimensional. This computation is performed by applying a stereophonic filter. A stereophonic filter is an information processing filter such that, when the output sound signal obtained by applying it to the original sound information is reproduced, the position of the sound (such as its direction and distance), the size of the sound source, the size of the space, and the like are perceived with a three-dimensional feel.

One known example of such stereophonic filtering is to convolve the target sound signal with a head-related transfer function (HRTF) so that the sound is perceived as arriving from a given direction. Performing this HRTF convolution at sufficiently fine angular steps over the arrival direction of the reproduced sound, from the position of the sound source object to the position of the user, improves the sense of presence experienced by the user.
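As a minimal sketch of the convolution described above, the following Python example applies a left/right pair of head-related impulse responses (the time-domain form of an HRTF) to a mono signal. The function name `binauralize` and the toy impulse responses are illustrative assumptions and do not come from the disclosure; real HRTF data would be measured or taken from a database.

```python
import numpy as np

def binauralize(signal, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right head-related impulse response pair."""
    left = np.convolve(signal, hrir_left)
    right = np.convolve(signal, hrir_right)
    return np.stack([left, right])  # shape: (2, len(signal) + len(hrir) - 1)

# Toy impulse responses (hypothetical values): the right ear receives the
# sound two samples later and at half the level, as for a source on the left.
hrir_left = np.array([1.0, 0.0, 0.0])
hrir_right = np.array([0.0, 0.0, 0.5])
signal = np.array([1.0, 2.0, 3.0])

out = binauralize(signal, hrir_left, hrir_right)
print(out)
```

The delay and attenuation between the two output rows correspond to the interaural time difference and interaural level difference mentioned earlier.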

In recent years, technologies related to virtual reality (VR) have been actively developed. Virtual reality aims to let the user feel as if moving within a virtual space, with the positions of sound source objects in the virtual three-dimensional space changing appropriately in response to the user's movements. This requires moving the localized position of each sound image in the virtual space relative to the user's movements. Such processing has been performed by applying a stereophonic filter, such as the head-related transfer function described above, to the original sound information. However, when the user moves within the three-dimensional space, the sound transmission path, including reverberation and interference, changes from moment to moment as the positional relationship between the sound source object and the user changes. Each time, the transmission path of the sound from the sound source object must be determined from that positional relationship, and a transfer function must be convolved that takes reverberation and interference into account. The amount of such processing is enormous, and without a large-scale processing device an improved sense of presence may not be achievable.

To reduce this growing processing load, attempts have been made to apply panning processing to the reproduced sound and thereby reduce the number of head-related transfer function convolutions. Specifically, instead of convolving a head-related transfer function with the reproduced sound of each of the many sound source objects in the three-dimensional space, the reproduced sounds from the sound source objects are re-expressed as sounds (representative sounds) from several representative points set in advance in the three-dimensional space. Simply convolving each representative sound with the head-related transfer function from its representative point to the user's position then lets the user perceive three-dimensional sound comparable to the original. If there are fewer representative points than original sound source objects, there are correspondingly fewer convolutions to perform, which is advantageous in terms of processing load.
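The saving described above can be sketched as follows, assuming that panning is a simple gain-weighted mix of N source signals into K representative signals. The function names, the gain matrix, and the random test data are illustrative assumptions; the disclosure's own conversion also involves time shifts, which are left out here.

```python
import numpy as np

def pan_to_representatives(source_signals, gains):
    """Mix N source signals into K representative signals.

    source_signals: (N, T) array, one row per sound source object.
    gains: (N, K) array; gains[n, k] is the assumed weight of source n
           at representative point k.
    """
    return gains.T @ source_signals  # shape (K, T)

def render(rep_signals, hrirs):
    """Convolve each representative signal with its impulse response pair and sum.

    hrirs: (K, 2, L) array of left/right impulse responses.
    Only K pairs of convolutions are needed, regardless of N.
    """
    K, T = rep_signals.shape
    L = hrirs.shape[2]
    out = np.zeros((2, T + L - 1))
    for k in range(K):
        for ear in range(2):
            out[ear] += np.convolve(rep_signals[k], hrirs[k, ear])
    return out

rng = np.random.default_rng(0)
sources = rng.standard_normal((5, 16))   # N = 5 sound source objects
gains = rng.random((5, 3))               # K = 3 representative points
hrirs = rng.standard_normal((3, 2, 4))   # placeholder impulse responses

reps = pan_to_representatives(sources, gains)
out = render(reps, hrirs)
print(reps.shape, out.shape)
```

In this sketch five sources are rendered with three pairs of convolutions instead of five, and the ratio improves as the number of sources grows while the number of representative points stays fixed.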

A more specific outline of the present disclosure is as follows.

An information processing method according to a first aspect of the present disclosure is executed by a plurality of information processing terminals and includes the steps of: acquiring, at a first terminal that is one of the plurality of information processing terminals, first sound information including an acoustic signal and information on the position of a sound source object in a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal; converting, at the first terminal, the first sound information into second sound information for generating, using the acoustic signal, a representative sound that arrives at a reference position from a representative point set in the three-dimensional sound field; transmitting, at the first terminal, the second sound information to a second terminal that is another of the plurality of information processing terminals; detecting, at the second terminal, the position of a user or the direction of the user's head in the three-dimensional sound field; calculating, at the second terminal, the position of a playback representative point corresponding to the position of the representative point, based on the reference position and the detected position of the user or direction of the user's head; and generating, at the second terminal, an output sound signal using the received second sound information and a head-related transfer function corresponding to the calculated position of the playback representative point.

According to this information processing method, when the second terminal is worn by the user, the first terminal can be provided separately from it. The first terminal is free of the constraints of a wearable device and can more easily be given higher processing performance than the second terminal, so it can perform the relatively resource-intensive panning processing on the audio signal, including the movement of sound source objects defined in the content. Because panning consolidates the sound into sounds from a smaller number of representative directions, the amount of information is compressed, and transmission between the first terminal and the second terminal is less constrained by communication bandwidth. In addition, the second terminal worn by the user can easily detect the position and orientation of the user's head and use them directly to generate the output sound signal from the panned audio signal, so the arrival direction of the sound can be updated immediately in response to head movement. Furthermore, the output sound signal can be generated by convolving head-related transfer functions for only the relatively small number of representative directions remaining after panning, which reduces the processing resources required of the second terminal. In this way, the present disclosure allows conversion processing to be performed appropriately so that multiple benefits are obtained.
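The calculation performed at the second terminal can be illustrated with a minimal sketch. It assumes that the representative points stay fixed in the sound field, that only head yaw is tracked, and that the head frame has x pointing forward, y to the left, and z up. These conventions, and the function name `playback_points`, are assumptions made for illustration; the disclosure does not prescribe them in this passage.

```python
import numpy as np

def playback_points(rep_points, user_pos, head_yaw_rad):
    """Express representative points in the listener's head frame.

    rep_points: (K, 3) representative point positions in the sound field.
    user_pos: (3,) detected user position (equal to the reference position
              when the user has not moved).
    head_yaw_rad: detected head rotation about the vertical axis,
                  counterclockwise when seen from above.
    """
    rel = np.asarray(rep_points, dtype=float) - np.asarray(user_pos, dtype=float)
    # Rotating the head by +yaw rotates the scene by -yaw in the head frame.
    c, s = np.cos(-head_yaw_rad), np.sin(-head_yaw_rad)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return rel @ rot.T

reference_pos = np.zeros(3)
rep_points = np.array([[1.0, 0.0, 0.0]])  # one point straight ahead

# The user stays at the reference position and turns the head 90 degrees left.
moved = playback_points(rep_points, reference_pos, np.pi / 2)
print(moved)  # the point now lies to the listener's right (negative y)
```

The head-related transfer function for each representative sound would then be selected according to the direction of the corresponding row of the result, so only this small rotation has to be recomputed when the head moves.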

An information processing method according to a second aspect is the information processing method according to the first aspect, wherein, in the converting step, time shift adjustment and gain adjustment are applied to the reproduced sound to convert it into the representative sound.

This makes it possible to apply time shift adjustment and gain adjustment to the reproduced sound and convert it into a representative sound that is less likely to sound unnatural.
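A minimal sketch of such a conversion is shown below, assuming the time shift is a non-negative whole number of samples and the gain is a single scalar per representative point. The function name and the numeric values are hypothetical; the disclosure may use fractional or frequency-dependent adjustments.

```python
import numpy as np

def to_representative_sound(reproduced, shift_samples, gain):
    """Delay a signal by a whole number of samples and scale it."""
    if shift_samples < 0:
        raise ValueError("this sketch only handles non-negative delays")
    delayed = np.concatenate([np.zeros(shift_samples), reproduced])
    return gain * delayed

reproduced = np.array([1.0, 2.0, 3.0])

# Hypothetical adjustment amounts for two representative points.
rep_a = to_representative_sound(reproduced, shift_samples=0, gain=0.8)
rep_b = to_representative_sound(reproduced, shift_samples=2, gain=0.5)
print(rep_a)
print(rep_b)
```

Each representative point receives its own pair of adjustment amounts, so one reproduced sound yields one representative sound per point.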

An information processing method according to a third aspect is the information processing method according to the second aspect, wherein the adjustment amount of the time shift adjustment is set to be 0 at the representative point.

This makes it possible to set the adjustment amount of the time shift adjustment to 0 at the representative point, which is effectively the correct value there.

An information processing method according to a fourth aspect is the information processing method according to the second or third aspect, wherein the adjustment amount of the time shift adjustment at a predetermined position is set using the adjustment amount at a position closer to the representative point than the predetermined position.

Accordingly, when the adjustment amount of the time shift adjustment at the predetermined position is set, using the adjustment amount at a position closer to the representative point makes it possible, for example, to set an adjustment amount close to that of the nearer position. As a result, outliers at which the adjustment amount of the time shift adjustment changes abruptly are less likely to occur.
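One way to picture this is the following sketch, which assumes that each position has several candidate time shift values and that the candidate closest to the value already fixed at the neighboring position nearer the representative point is chosen. The existence of candidate lists, the selection rule, and the function name are assumptions for illustration; this passage of the disclosure does not specify how the neighboring value is used.

```python
def propagate_time_shifts(candidates_by_position):
    """Choose time shifts outward from a representative point.

    candidates_by_position: list ordered from the position nearest the
    representative point to the farthest; each entry is a list of
    candidate shift values for that position.
    """
    shifts = [0.0]  # the shift is 0 at the representative point itself
    for candidates in candidates_by_position:
        previous = shifts[-1]
        # Pick the candidate closest to the already-fixed neighboring value.
        shifts.append(min(candidates, key=lambda c: abs(c - previous)))
    return shifts

candidates = [[0.1, 1.1], [0.2, -0.8], [1.3, 0.3]]
shifts = propagate_time_shifts(candidates)
print(shifts)  # [0.0, 0.1, 0.2, 0.3]
```

With this rule the selected values change gradually from one position to the next, which is the kind of jump-free behavior described above.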

An information processing method according to a fifth aspect is the information processing method according to any one of the second to fourth aspects, wherein the adjustment amount of the gain adjustment at a predetermined position is calculated as follows: using two of three mutually adjacent representative points surrounding the predetermined position, adjustment amounts that minimize the error are calculated for those two representative points, and then, with the ratio of the calculated adjustment amounts fixed, the adjustment amount for the remaining representative point is calculated.

Accordingly, the adjustment amounts that minimize the error are first calculated for two representative points, and then, with the ratio of those adjustment amounts fixed, the adjustment amount for the remaining representative point is calculated in sequence. This makes it possible to control how strongly the two representative points used first and the remaining representative point each influence the calculated adjustment amount of the gain adjustment at the predetermined position.
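The two-stage calculation can be sketched as a pair of least-squares problems. The sketch assumes that the error is the squared difference between a target response at the predetermined position and the gain-weighted sum of the responses at the representative points, and that fixing the ratio means the first two gains may only be rescaled together. Both assumptions, the function name, and the random vectors standing in for responses are illustrative and are not taken from the disclosure.

```python
import numpy as np

def two_stage_gains(target, h1, h2, h3):
    """Return gains (g1, g2, g3) for three representative points."""
    # Stage 1: least-squares gains for the first two representative points.
    A = np.column_stack([h1, h2])
    g12, *_ = np.linalg.lstsq(A, target, rcond=None)
    # Stage 2: keep g1:g2 fixed by treating their combination as a single
    # basis vector, then solve for its scale and the third gain.
    combined = A @ g12
    B = np.column_stack([combined, h3])
    (scale, g3), *_ = np.linalg.lstsq(B, target, rcond=None)
    return scale * g12[0], scale * g12[1], g3

rng = np.random.default_rng(1)
h1, h2, h3 = rng.standard_normal((3, 8))   # placeholder responses
target = 0.6 * h1 + 0.3 * h2 + 0.2 * h3    # placeholder target response

g1, g2, g3 = two_stage_gains(target, h1, h2, h3)
print(g1, g2, g3)
```

Because the first two gains enter the second stage only through their fixed combination, the pair chosen first keeps its relative weighting, and the third point can only reduce the remaining error.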

An information processing method according to a sixth aspect is the information processing method according to any one of the second to fifth aspects, wherein the adjustment amount of the gain adjustment at a predetermined position is calculated as follows: for a horizontal position obtained by removing the vertical component from the predetermined position, adjustment amounts that minimize the error are calculated for the two horizontally arranged representative points among three mutually adjacent representative points surrounding the predetermined position, and then, with the ratio of the calculated adjustment amounts fixed, the adjustment amount for the remaining representative point is calculated.

Accordingly, the adjustment amounts of the gain adjustment are first calculated for the horizontal position using the two horizontally arranged representative points, and then, with the ratio of the calculated adjustment amounts fixed, the adjustment amount for the remaining representative point is calculated. Sound localization in the horizontal direction is relatively accurate. By calculating the adjustment amounts for the horizontal position first and fixing their ratio, the two horizontal representative points are given a greater influence on the adjustment amount of the gain adjustment at the predetermined position than the remaining representative point. This increases the accuracy of localization in the horizontal direction and allows an output sound signal with a more appropriately localized sound image to be output.

An information processing method according to a seventh aspect is the information processing method according to any one of the second to sixth aspects, wherein the adjustment amount of the time shift adjustment is corrected based on the expected value of its error from the computed correct value, so as to reduce that error.

This makes it possible to set the adjustment amount of the time shift adjustment so as to reduce its error from the computed correct value.

An information processing method according to an eighth aspect is the information processing method according to any one of the second to seventh aspects, wherein the adjustment amount of the gain adjustment at a predetermined position is corrected based on the expected value of its error from the computed correct value, so as to reduce that error.

This makes it possible to set the adjustment amount of the gain adjustment at the predetermined position so as to reduce its error from the computed correct value.

Furthermore, the information processing method according to the ninth aspect is the information processing method according to any one of the second to sixth aspects, in which the adjustment amount of the gain adjustment for the predetermined position is set using two or more of the adjustment amounts at a plurality of virtual representative points. Taking the representative point for the predetermined position as a starting point, the virtual representative points correspond respectively to a plurality of directions within a predetermined angle range in the horizontal direction centered on the direction of that starting representative point.

With this, two or more of the adjustment amounts at the virtual representative points, which correspond respectively to the directions within the predetermined horizontal angle range centered on the direction of the starting representative point, can be used for the gain adjustment at the predetermined position. The adjustment amount of the gain adjustment for the predetermined position can thus be set so that each of the two or more adjustment amounts influences it.

Furthermore, the information processing method according to the tenth aspect is the information processing method according to the ninth aspect, in which the adjustment amount of the gain adjustment for the predetermined position is set using all of the adjustment amounts at the plurality of virtual representative points that, with the representative point for the predetermined position as a starting point, correspond respectively to the plurality of directions within the predetermined angle range in the horizontal direction centered on the direction of that starting representative point.

With this, all of the adjustment amounts at the virtual representative points, which correspond respectively to the directions within the predetermined horizontal angle range centered on the direction of the starting representative point, can be used for the gain adjustment at the predetermined position. The adjustment amount of the gain adjustment for the predetermined position can thus be set so that every one of the adjustment amounts influences it.

Furthermore, the information processing method according to the eleventh aspect is the information processing method according to the ninth aspect, in which the adjustment amount of the gain adjustment for the predetermined position is calculated by averaging the adjustment amounts at the plurality of virtual representative points.

With this, the adjustment amount of the gain adjustment for the predetermined position can be calculated by averaging two or more adjustment amounts.
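The averaging over virtual representative points can be sketched as below. The angular range, the step between virtual directions, and the callable `gain_at` that returns the adjustment amount for a direction are all hypothetical; the disclosure only states that adjustment amounts at several virtual representative points within a predetermined horizontal angle range are averaged.

```python
def virtual_directions(center_deg, half_range_deg, step_deg):
    """Directions within +/- half_range_deg of the starting representative point."""
    n = int(half_range_deg // step_deg)
    return [center_deg + i * step_deg for i in range(-n, n + 1)]


def averaged_adjustment(gain_at, center_deg, half_range_deg=10.0, step_deg=5.0):
    """Plain average of the adjustment amounts at all virtual directions.

    gain_at: hypothetical function mapping a direction in degrees to the
             adjustment amount obtained with a virtual representative point
             placed in that direction.
    """
    directions = virtual_directions(center_deg, half_range_deg, step_deg)
    return sum(gain_at(d) for d in directions) / len(directions)
```

Using every generated direction corresponds to the tenth aspect; passing a subset of at least two directions would correspond to the ninth.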

Furthermore, the information processing method according to the twelfth aspect is the information processing method according to the ninth aspect, in which the adjustment amount of the gain adjustment for the predetermined position is calculated as a weighted average of the adjustment amounts at the plurality of virtual representative points, weighted based on the angle difference between the direction of each virtual representative point and the original direction corresponding to that virtual representative point.

With this, the adjustment amount of the gain adjustment for the predetermined position can be calculated as a weighted average of two or more adjustment amounts.
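A weighted average based on angle difference might look like the following sketch. The triangular weight (largest at zero angle difference, falling linearly to zero at `width_deg`) is an assumption chosen for illustration; the text does not say which weighting function is used.

```python
def weighted_adjustment(samples, width_deg=10.0):
    """Weighted average of adjustment amounts at virtual representative points.

    samples:   list of (angle_difference_deg, adjustment_amount) pairs.
    width_deg: hypothetical angle difference at which the weight reaches zero.
    """
    weights = [max(0.0, 1.0 - abs(diff) / width_deg) for diff, _ in samples]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all samples lie outside the weighting window")
    return sum(w * amount for w, (_, amount) in zip(weights, samples)) / total
```

With this choice, a virtual representative point close to the original direction contributes more than one near the edge of the angle range.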

Furthermore, the information processing method according to the thirteenth aspect is the information processing method according to any one of the second to twelfth aspects, in which, for the gain adjustment at the predetermined position, the adjustment amount of the elevation representative point is set to a predetermined value that does not depend on the direction of the predetermined position in the horizontal plane. Here, the elevation representative point is the one of the three mutually adjacent representative points surrounding the predetermined position that has a vertical component.

With this, the adjustment amount of the gain adjustment for the predetermined position can be set by setting the adjustment amount of the elevation representative point to a predetermined value that does not depend on the direction of the predetermined position in the horizontal plane.

Furthermore, the information processing method according to the fourteenth aspect is the information processing method according to the thirteenth aspect, in which the predetermined value changes depending on the direction of the predetermined position in a vertical plane.

With this, the adjustment amount of the gain adjustment for the predetermined position can be set by setting the adjustment amount of the elevation representative point to a predetermined value that does not depend on the direction of the predetermined position in the horizontal plane but changes depending on its direction in the vertical plane.

Furthermore, the information processing method according to the fifteenth aspect is the information processing method according to the fourteenth aspect, in which the predetermined value gradually increases from 0 to 1 as the direction of the predetermined position in the vertical plane goes from 0° to 90°.

With this, the adjustment amount of the gain adjustment for the predetermined position can be set by setting the adjustment amount of the elevation representative point to a predetermined value that does not depend on the direction of the predetermined position in the horizontal plane and that gradually increases from 0 to 1 as its direction in the vertical plane goes from 0° to 90°.
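The fifteenth aspect only requires a value that rises gradually from 0 to 1 between 0° and 90° of elevation and ignores the horizontal direction. Two candidate curves that satisfy this are sketched below; both are assumptions, since the disclosure does not name a specific curve here.

```python
import math


def elevation_gain_linear(elevation_deg):
    """Linear ramp: 0 at 0 degrees, 1 at 90 degrees (assumed curve)."""
    e = min(max(elevation_deg, 0.0), 90.0)
    return e / 90.0


def elevation_gain_sine(elevation_deg):
    """Sine ramp: 0 at 0 degrees, 1 at 90 degrees (assumed curve)."""
    e = min(max(elevation_deg, 0.0), 90.0)
    return math.sin(math.radians(e))
```

Neither function takes an azimuth argument, which reflects the requirement that the value be independent of the direction in the horizontal plane.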

Furthermore, an information processing system according to a sixteenth aspect includes a first terminal and a second terminal. The first terminal includes: an acquisition unit that acquires, as first sound information, sound information including an acoustic signal and information on the position of a sound source object in a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal; a conversion unit that converts the first sound information into second sound information for generating, using the acoustic signal, a representative sound that arrives at a reference position from a representative point set in the three-dimensional sound field; and a transmission unit that transmits the second sound information to the second terminal. The second terminal includes: a detector that detects the position of a user or the direction of the user's head in the three-dimensional sound field; a calculation unit that calculates, based on the detected position of the user or direction of the head and on the reference position, the position of a reproduction representative point corresponding to the position of the representative point; and an output unit that outputs an output sound signal using the received second sound information and a head-related transfer function corresponding to the calculated position of the reproduction representative point.

With this, the same effects as those of the information processing method described above can be obtained.

Furthermore, a program according to a seventeenth aspect is a program for causing a computer to execute the information processing method described above.

With this, the same effects as those of the information processing method described above can be obtained using a computer.

Furthermore, these general or specific aspects may be realized as a system, a device, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer-readable CD-ROM, or as any combination of a system, a device, a method, an integrated circuit, a computer program, and a recording medium.

Embodiments are described in detail below with reference to the drawings. Each embodiment described below shows a general or specific example. The numerical values, shapes, materials, components, arrangement and connection of components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims are described as optional components. Each figure is a schematic diagram and is not necessarily drawn precisely. In the figures, substantially identical components are given the same reference numerals, and duplicate descriptions may be omitted or simplified.

In the following description, ordinal numbers such as first, second, and third may be attached to elements. These ordinal numbers are attached in order to identify the elements and do not necessarily correspond to a meaningful order. They may be interchanged, newly assigned, or removed as appropriate.

The following description also refers to the acoustic signal included in sound information, and the acoustic signal may be expressed as an audio signal or a sound signal. In other words, in the present disclosure, "acoustic signal" has the same meaning as "audio signal" or "sound signal".

(Embodiment)
[Overview]
First, an overview of the sound reproduction system according to the embodiment is given. FIG. 1 is a schematic diagram showing a use case of the sound reproduction system according to the embodiment. FIG. 1 shows a user 99 using the sound reproduction system 100.

The sound reproduction system 100 shown in FIG. 1 is used, for example, together with a 3D video reproduction device 300. When three-dimensional images and three-dimensional sound are experienced at the same time, the images heighten the auditory sense of presence and the sound heightens the visual sense of presence, so that the user feels as if present at the site where the images and sound were captured. For example, it is known that when an image (moving image) of a person talking is displayed, the user 99 perceives the conversation sound as coming from that person's mouth even if the localization of its sound image (sound source object) is offset from the person's mouth. In this way, combining images and sound can heighten the sense of presence, for example because visual information corrects the perceived position of the sound image.

The 3D video reproduction device 300 is an image display device worn on the head of the user 99. The 3D video reproduction device 300 therefore moves together with the head of the user 99. For example, as illustrated, the 3D video reproduction device 300 is a glasses-type device supported by the ears and nose of the user 99.

The 3D video reproduction device 300 changes the displayed image according to the movement of the head of the user 99, so that the user 99 perceives the head as moving within a three-dimensional image space. That is, when an object in the three-dimensional image space is located in front of the user 99, the object moves to the left of the user 99 when the user 99 turns to the right, and to the right of the user 99 when the user 99 turns to the left. In this way, the 3D video reproduction device 300 moves the three-dimensional image space in the direction opposite to the movement of the user 99.

The 3D video reproduction device 300 displays two images, offset from each other by the parallax, to the left and right eyes of the user 99 respectively. Based on this parallax offset between the displayed images, the user 99 can perceive the three-dimensional position of an object in the images. When the user 99 uses the sound reproduction system 100 with eyes closed, for example to play healing sounds for inducing sleep, the 3D video reproduction device 300 does not need to be used at the same time. In other words, the 3D video reproduction device 300 is not an essential component of the present disclosure. Besides a dedicated video display device, a general-purpose mobile terminal owned by the user 99, such as a smartphone or a tablet device, may be used as the 3D video reproduction device 300.

In addition to a display for showing video, such a general-purpose mobile terminal is equipped with various sensors for detecting the attitude and movement of the terminal. It also has a processor for information processing and can connect to a network to exchange information with a server device such as a cloud server. In other words, the 3D video reproduction device 300 and the sound reproduction system 100 can also be realized by combining a smartphone with general-purpose headphones or the like that have no information processing function.

As in this example, the 3D video reproduction device 300 and the sound reproduction system 100 may be realized by suitably distributing, across one or more devices, the function of detecting head movement, the function of presenting video, the function of processing video information for presentation, the function of presenting sound, and the function of processing sound information for presentation. When the 3D video reproduction device 300 is not needed, it is sufficient to suitably distribute the function of detecting head movement, the function of presenting sound, and the function of processing sound information for presentation across one or more devices. For example, the sound reproduction system 100 can be realized by a processing device, such as a computer or a smartphone, that has the function of processing sound information for presentation, together with headphones or the like that have the function of detecting head movement and the function of presenting sound.

The sound reproduction system 100 is a sound presentation device worn on the head of the user 99. The sound reproduction system 100 therefore moves together with the head of the user 99. For example, the sound reproduction system 100 in the present embodiment is a so-called over-ear headphone type device. The form of the sound reproduction system 100 is not particularly limited; for example, it may be two earplug-type devices worn independently on the left and right ears of the user 99.

The sound reproduction system 100 changes the presented sound according to the movement of the head of the user 99, so that the user 99 perceives the head as moving within a three-dimensional sound field. For this purpose, as described above, the sound reproduction system 100 moves the three-dimensional sound field in the direction opposite to the movement of the user 99.
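For a pure head rotation, moving the sound field in the opposite direction amounts to subtracting the head yaw from each source direction. The sketch below assumes a sign convention in which azimuth and yaw are both positive toward the user's right; the convention and the function names are illustrative assumptions.

```python
def wrap_deg(angle_deg):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Direction of a source as seen from the rotated head.

    Assumed convention: positive angles point to the user's right.
    """
    return wrap_deg(source_azimuth_deg - head_yaw_deg)
```

With this convention, a source straight ahead ends up 30° to the left after the user turns 30° to the right, matching the behaviour described for the image space.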

When the user 99 moves within the three-dimensional sound field, the position of a sound source object relative to the position of the user 99 in the three-dimensional sound field changes. Each time the user 99 moves, calculation based on the positions of the sound source object and the user 99 would then be needed to generate the output sound signal for reproduction. Because the amount of such processing is normally enormous, the present disclosure reduces it by applying panning processing, as one kind of conversion processing, to express the reproduced sound by representative sounds from representative points. As a result, the user 99 can be made to perceive the reproduced sound from the sound source object simply by convolving head-related transfer functions with the representative sounds. The present embodiment describes the case where panning processing is used as an example of the conversion processing, but the conversion processing is not limited to panning processing; any conversion processing may be applied as long as it can be expected, under the given conditions, to reduce the amount of processing. Panning processing is described using a specific example, but it is not limited to the method of that example; existing panning methods such as VBAP (Vector Based Amplitude Panning), DBAP (Distance Based Amplitude Panning), and Ambisonics may also be applied.
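As an illustration of one of the existing panning methods named above, the following sketch computes two-dimensional VBAP gains for a source lying between a pair of representative points. It is a minimal sketch of the pairwise VBAP formulation with power normalisation, not the specific panning method of the embodiment; the function and parameter names are hypothetical.

```python
import math


def unit_vector(azimuth_deg):
    """Unit vector in the horizontal plane for a given azimuth."""
    a = math.radians(azimuth_deg)
    return math.cos(a), math.sin(a)


def vbap_pair_gains(source_deg, rep_a_deg, rep_b_deg):
    """Gains (g_a, g_b) such that g_a * a + g_b * b points toward the source."""
    px, py = unit_vector(source_deg)
    ax, ay = unit_vector(rep_a_deg)
    bx, by = unit_vector(rep_b_deg)
    det = ax * by - bx * ay
    if abs(det) < 1e-12:
        raise ValueError("representative points must not be collinear")
    # Solve the 2x2 system with Cramer's rule.
    g_a = (px * by - bx * py) / det
    g_b = (ax * py - px * ay) / det
    # Normalise so that g_a^2 + g_b^2 = 1 (constant power).
    norm = math.hypot(g_a, g_b)
    return g_a / norm, g_b / norm
```

Each representative sound would then be the acoustic signal scaled by its gain, and only these representative sounds need to be convolved with head-related transfer functions at reproduction time.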

[Configuration]
Next, the configuration of the sound reproduction system 100 according to the present embodiment is described with reference to FIG. 2. FIG. 2 is a block diagram showing the functional configuration of the sound reproduction system according to the embodiment.

As shown in FIG. 2, the sound reproduction system 100 according to the present embodiment includes an information processing device 101, a communication module 102, a detector 103, a driver 104, and a database 105.

The information processing device 101 is an arithmetic device that performs the various kinds of signal processing in the sound reproduction system 100. The information processing device 101 includes a processor and a memory, as a computer does for example, and is realized by the processor executing a program stored in the memory. Executing this program provides the functions of the functional units described below.

The information processing device 101 includes an acquisition unit 111, a path calculation unit 121, an output sound generation unit 131, and a signal output unit 141. The functional units of the information processing device 101 are described in detail below, together with the components other than the information processing device 101.

The communication module 102 is an interface device for receiving input of sound information to the sound reproduction system 100. The communication module 102 includes, for example, an antenna and a signal converter, and receives sound information from an external device by wireless communication. More specifically, the communication module 102 uses the antenna to receive a wireless signal representing sound information that has been converted into a format for wireless communication, and uses the signal converter to convert the wireless signal back into sound information. The sound reproduction system 100 thereby acquires sound information from the external device by wireless communication. The sound information acquired by the communication module 102 is acquired by the acquisition unit 111. The acquisition unit 111 is thus an example of a sound acquisition unit. The sound information is input to the information processing device 101 in this manner. Communication between the sound reproduction system 100 and the external device may instead be performed by wired communication.

The sound information acquired by the sound reproduction system 100 is encoded in a predetermined format such as MPEG-H 3D Audio (ISO/IEC 23008-3). As an example, the encoded sound information includes information about the reproduced sound to be reproduced by the sound reproduction system 100 and information about the localization position used when the sound image of that sound is localized at a predetermined position in the three-dimensional sound field (that is, when the sound is made to be perceived as arriving from a predetermined direction). The sound information can also be read as information about a sound source object. That is, the sound information includes the position of the sound source object in the three-dimensional sound field and the sound emitted by the sound source object. The sound information may also include a flag for determining whether or not to apply panning processing. This flag is described later.

The sound information is obtained as input data as described above, and includes an audio signal (acoustic signal), which is information about the reproduced sound, and, as other information, information on the position of the sound source object in the three-dimensional sound field. The other information may additionally include information for defining the three-dimensional sound field. The other information may therefore be referred to collectively as information about the space (spatial information), which includes the information on the position of the sound source object, the information for defining the three-dimensional sound field, and so on. Viewed with the audio signal as the main element, the input data is sound information in which the other information (metadata) accompanies the audio signal. Viewed with the spatial information as the main element, the input data is information in which the audio signal accompanies the spatial information. Alternatively, because the input data has both of these aspects, it may be regarded as sound space information.

As a specific example, the sound information includes information about a plurality of sounds including a first reproduced sound and a second reproduced sound, and the sound images produced when the respective sounds are reproduced are localized so as to be perceived as sounds arriving from different positions in the three-dimensional sound field. The sound source object of the first reproduced sound is therefore localized at a first position in the three-dimensional sound field, and the sound source object of the second reproduced sound at a second position in the three-dimensional sound field. The sound information may thus include a plurality of sounds. That is, the sound information may include a plurality of audio signals corresponding respectively to the first reproduced sound and the second reproduced sound, and the positions of a plurality of sound source objects, namely the first position and the second position, that correspond one-to-one to the plurality of audio signals.
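The one-to-one pairing of audio signals and sound source positions can be pictured with a small in-memory structure. This is a hypothetical layout for illustration only; it is not the MPEG-H 3D Audio bitstream format, and the class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SoundSource:
    """One reproduced sound and the position of its sound source object."""
    signal: List[float]                   # audio samples (hypothetical format)
    position: Tuple[float, float, float]  # position in the 3D sound field


@dataclass
class SoundInfo:
    """Sound information holding one or more sources."""
    sources: List[SoundSource] = field(default_factory=list)
    # Optional flag deciding whether panning processing is applied;
    # None means the flag is absent from the sound information.
    apply_panning: Optional[bool] = None
```

A `SoundInfo` with two entries in `sources` corresponds to the first and second reproduced sounds localized at the first and second positions.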

FIG. 3 is a diagram for explaining an example of audio signals according to the embodiment. For example, as shown in (a) of FIG. 3, the sound information may include in advance an audio signal of a first direct sound arriving at the position of the user 99 from a first position (from a first direction) and an audio signal of a second direct sound arriving at the position of the user 99 from a second position (from a second direction). The sound information immediately after acquisition may include only information about the reproduced sound. In that case, the information about the predetermined position may be acquired separately, and the subsequent processing may be performed once both are available.

Alternatively, the sound information may include a plurality of audio signals and the position of a single sound source object that corresponds many-to-one to the plurality of audio signals. Such sound information is used, for example, in a situation where a plurality of reproduced sounds are emitted from one sound source object. For example, the plurality of audio signals correspond respectively to a direct sound that arrives directly at the position of the user 99 from the position of the sound source object and to secondary sounds (sounds produced by indirect propagation) that arise along with the direct sound and arrive via paths different from that of the direct sound.

For example, as shown in (b) of FIG. 3, the sound information immediately after acquisition includes an audio signal for the direct sound, and is converted, by conversion processing that calculates secondary sounds, into sound information that includes audio signals for reverberation, first-order reflections, diffracted sound, and so on. This conversion processing uses information on the conditions of the spatial environment of the three-dimensional sound field (for example, the positions and the reflection and diffraction characteristics of objects in the three-dimensional sound field). Because secondary sounds are generated by calculation from the sound information for a single reproduced sound according to the conditions of the spatial environment of the three-dimensional sound field, they are not included in the sound information immediately after acquisition; sound information including these secondary sounds is generated by the conversion processing that calculates them. A secondary sound may in turn give rise to further secondary sounds as it propagates. The information on the conditions of the spatial environment is part of the spatial information and may be acquired together with the audio signal in the input sound information. The audio signal and the spatial information may also be acquired separately. That is, the sound information may be acquired from a single file or bitstream, or may be acquired separately from a plurality of files or bitstreams. For example, the audio signal and the spatial information may be acquired from separate files or bitstreams, or each of them may be acquired from a plurality of files or bitstreams.

　このように、入力される音情報の形態に特に限定はなく、音響再生システム１００に各種の形態の音情報に応じた取得部１１１が備えられればよい。 As such, there are no particular limitations on the form of the input sound information, and the sound reproduction system 100 only needs to be equipped with an acquisition unit 111 that can handle various forms of sound information.

　ここで、取得部１１１の一例を、図４を用いて説明する。図４は、実施の形態に係る取得部の機能構成を示すブロック図である。図４に示すように、本実施の形態における取得部１１１は、例えば、エンコード音情報入力部１１２、デコード処理部１１３、及び、センシング情報入力部１１４を備える。 Here, an example of the acquisition unit 111 will be described using Figure 4. Figure 4 is a block diagram showing the functional configuration of the acquisition unit according to the embodiment. As shown in Figure 4, the acquisition unit 111 in this embodiment includes, for example, an encoded sound information input unit 112, a decoding processing unit 113, and a sensing information input unit 114.

　エンコード音情報入力部１１２は、取得部１１１が取得した、符号化された（言い換えるとエンコードされている）音情報が入力される処理部である。エンコード音情報入力部１１２は、入力された音情報をデコード処理部１１３へと出力する。デコード処理部１１３は、エンコード音情報入力部１１２から出力された音情報を復号する（言い換えるとデコードする）ことにより音情報に含まれる再生音と、音源オブジェクトの位置とを、以降の処理に用いられる形式で生成する処理部である。センシング情報入力部１１４については、検知器１０３の機能とともに、以下に説明する。 The encoded sound information input unit 112 is a processing unit that receives the encoded sound information acquired by the acquisition unit 111. The encoded sound information input unit 112 outputs the input sound information to the decoding processing unit 113. The decoding processing unit 113 is a processing unit that decodes the sound information output from the encoded sound information input unit 112, thereby generating the reproduced sound and the position of the sound source object contained in the sound information in a format that can be used in subsequent processing. The sensing information input unit 114 will be described below, together with the function of the detector 103.

　検知器１０３は、ユーザ９９の頭部の動き速度を検知するための装置である。検知器１０３は、ジャイロセンサ、加速度センサなど動きの検知に使用される各種のセンサを組み合わせて構成される。本実施の形態では、検知器１０３は、音響再生システム１００に内蔵されているが、例えば、音響再生システム１００と同様にユーザ９９の頭部の動きに応じて動作する立体映像再生装置３００等、外部の装置に内蔵されていてもよい。この場合、検知器１０３は、音響再生システム１００に含まれなくてもよい。また、検知器１０３として、外部の撮像装置などを用いて、ユーザ９９の頭部の動きを撮像し、撮像された画像を処理することでユーザ９９の動きを検知してもよい。 The detector 103 is a device for detecting the speed of movement of the user 99's head. The detector 103 is configured by combining various sensors used for detecting movement, such as a gyro sensor and an acceleration sensor. In this embodiment, the detector 103 is built into the sound reproduction system 100, but it may also be built into an external device, such as a 3D image reproduction device 300 that operates in response to the movement of the user 99's head, similar to the sound reproduction system 100. In this case, the detector 103 does not need to be included in the sound reproduction system 100. Alternatively, the detector 103 may be an external imaging device that captures the movement of the user 99's head and detects the movement of the user 99 by processing the captured image.

　検知器１０３は、例えば、音響再生システム１００の筐体に一体的に固定され、筐体の動きの速度を検知する。上記の筐体を含む音響再生システム１００は、ユーザ９９が装着した後、ユーザ９９の頭部と一体的に移動するため、検知器１０３は、結果としてユーザ９９の頭部の動きの速度を検知することができる。 The detector 103 is, for example, fixed integrally to the housing of the sound reproduction system 100 and detects the speed of movement of the housing. After the sound reproduction system 100 including the housing is worn by the user 99, it moves integrally with the user 99's head, and as a result, the detector 103 can detect the speed of movement of the user 99's head.

　検知器１０３は、例えば、ユーザ９９の頭部の動きの量として、三次元空間内で互いに直交する３軸の少なくとも一つを回転軸とする回転量を検知してもよいし、上記３軸の少なくとも一つを変位方向とする変位量を検知してもよい。また、検知器１０３は、ユーザ９９の頭部の動きの量として、回転量及び変位量の両方を検知してもよい。 As the amount of movement of the user 99's head, the detector 103 may, for example, detect the amount of rotation about at least one of three mutually orthogonal axes in three-dimensional space, or may detect the amount of displacement along at least one of those three axes. The detector 103 may also detect both the amount of rotation and the amount of displacement as the amount of movement of the user 99's head.

　センシング情報入力部１１４は、検知器１０３からユーザ９９の頭部の動き速度を取得する。より具体的には、センシング情報入力部１１４は、単位時間あたりに検知器１０３が検知したユーザ９９の頭部の動きの量を動きの速度として取得する。このようにしてセンシング情報入力部１１４は、検知器１０３から回転速度及び変位速度の少なくとも一方を取得する。ここで取得されるユーザ９９の頭部の動きの量は、三次元音場内のユーザ９９の位置及び姿勢（言い換えると座標及び向き）を決定するために用いられる。そのため、取得部１１１は、センシング情報入力部１１４によって位置取得部としても機能する。音響再生システム１００では、決定されたユーザ９９の座標及び向きに基づいて、音像オブジェクトのユーザ９９に対する相対的な位置を決定して音が再生される。具体的には、経路算出部１２１、出力音生成部１３１によって、上記の機能が実現されている。 The sensing information input unit 114 acquires the speed of head movement of the user 99 from the detector 103. More specifically, the sensing information input unit 114 acquires the amount of head movement of the user 99 detected by the detector 103 per unit time as the speed of movement. In this way, the sensing information input unit 114 acquires at least one of the rotation speed and the displacement speed from the detector 103. The amount of head movement of the user 99 acquired here is used to determine the position and posture (in other words, the coordinates and orientation) of the user 99 in the three-dimensional sound field. Therefore, the acquisition unit 111 also functions as a position acquisition unit via the sensing information input unit 114. In the sound reproduction system 100, the position of the sound image object relative to the user 99 is determined based on the determined coordinates and orientation of the user 99, and sound is reproduced. Specifically, the above functions are realized by the path calculation unit 121 and the output sound generation unit 131.
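As an illustration of how a rotation speed and a displacement speed acquired per unit time could be accumulated into the coordinates and orientation of the user 99, the following is a minimal sketch. It assumes rotation about a single vertical axis (yaw) only, and the function name `update_pose` and its parameters are hypothetical and do not appear in the disclosure.

```python
def update_pose(position, yaw_deg, displacement_speed, yaw_speed_deg, dt):
    """Advance a (position, yaw) pose by one unit time dt.

    position           -- (x, y, z) coordinates in the sound field
    yaw_deg            -- head orientation about the vertical axis, in degrees
    displacement_speed -- (vx, vy, vz) displacement per second
    yaw_speed_deg      -- rotation speed about the vertical axis, degrees per second
    """
    # Displacement amount = displacement speed x elapsed time, per axis.
    new_position = tuple(p + v * dt for p, v in zip(position, displacement_speed))
    # Rotation amount = rotation speed x elapsed time, wrapped into [0, 360).
    new_yaw = (yaw_deg + yaw_speed_deg * dt) % 360.0
    return new_position, new_yaw


pose = ((0.0, 0.0, 0.0), 0.0)
pose = update_pose(*pose, displacement_speed=(0.5, 0.0, 0.0),
                   yaw_speed_deg=90.0, dt=0.1)
```

A full implementation would track rotation about all three axes; the single-axis form is used here only to keep the sketch short.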

　経路算出部１２１は、決定されたユーザ９９の座標及び向きに基づいて、再生音について、音源オブジェクトの位置からユーザ９９の位置に到来する相対的な到来方向を算出する到来方向算出機能と、上記に説明した副次音を計算する変換処理とを含んでいる。そのため、経路算出部１２１は、音源オブジェクトからの伝播経路を算出し、算出した再生音の伝播経路に応じた再生音の間接的な伝播によりユーザ９９の位置に到来する副次音及び当該副次音の到来方向を算出する機能を含んでいる。なお、副次音の到来方向には、反射音の場合のどのようなオブジェクトで反射するか、及び、その反射時の減衰率はどの程度かなどの付加情報を含む。付加情報は、入力された音情報によって計算された副次音の到来方向に含まれている。つまり、付加情報は、音情報から計算的に生成し取得される。 The path calculation unit 121 includes an arrival direction calculation function that calculates the relative arrival direction of the reproduced sound from the position of the sound source object to the position of the user 99 based on the determined coordinates and orientation of the user 99, and a conversion process that calculates the secondary sound described above. Therefore, the path calculation unit 121 includes a function that calculates the propagation path from the sound source object and calculates the secondary sound and the arrival direction of the secondary sound that arrives at the position of the user 99 through indirect propagation of the reproduced sound according to the calculated propagation path of the reproduced sound. Note that the arrival direction of the secondary sound includes additional information such as what object the sound will be reflected from in the case of a reflected sound, and the attenuation rate at the time of reflection. The additional information is included in the arrival direction of the secondary sound calculated from the input sound information. In other words, the additional information is computationally generated and obtained from the sound information.

　空間情報について整理すると、空間情報には、空間（三次元音場）における音源オブジェクトの空間位置（音源オブジェクトの位置の情報）、当該音源オブジェクトにおける音の反射、回折特性（併せて、空間環境の条件の情報）、及び、三次元音場の広さなどのさらなる情報を含んでいる。経路算出部１２１が空間情報に基づいて、再生音がどの音源オブジェクトで反射又は回折するかによって副次音を生成し、その副次音の到来方向と、副次音が反射又は回折によって減衰した後の音量などとを付加情報として算出する。音情報（入力データ）は、音声信号と付帯するメタデータの形で空間情報を含んでおり、その空間情報には、上記したように、音声信号以外の情報として、音を立体音にして三次元音場内に音源オブジェクトを位置させるようにするために必要な情報、及び／又は、音を立体音にして三次元音場内に音源オブジェクトを位置させるようにするために必要な情報を計算するのに用いられる情報を含んでいる。 To summarize the spatial information, it includes the spatial position of the sound source object in the space (three-dimensional sound field) (information about the position of the sound source object), sound reflection and diffraction characteristics at the sound source object (also information about the conditions of the spatial environment), and further information such as the size of the three-dimensional sound field. Based on the spatial information, the path calculation unit 121 generates secondary sounds depending on which sound source object the reproduced sound is reflected or diffracted from, and calculates additional information such as the direction from which the secondary sound arrives and the volume of the secondary sound after it has attenuated due to reflection or diffraction. The sound information (input data) includes spatial information in the form of audio signals and accompanying metadata, and as described above, the spatial information includes, as information other than the audio signal, information necessary to turn the sound into stereophonic sound and position the sound source object within the three-dimensional sound field, and/or information used to calculate information necessary to turn the sound into stereophonic sound and position the sound source object within the three-dimensional sound field.

　経路算出部１２１は、再生音が直接音としてユーザに届く際の再生音の到来方向を算出することと、再生音の副次的な伝播によりユーザ９９の位置に到来する副次音をその到来方向とともに算出することとができれば、どのような処理によって実現されてもよい。経路算出部１２１は、再生音及び副次音について三次元音場内のいずれの方向から到来する音としてユーザ９９に知覚させるかを上記のユーザ９９の座標及び向きに基づいて決定し、出力音信号が再生された場合に、そのような音として知覚されるように、音情報を処理する。 The path calculation unit 121 may be realized by any processing as long as it can calculate the direction of arrival of the reproduced sound when it reaches the user as direct sound, and can calculate the direction of arrival of secondary sounds that arrive at the position of the user 99 due to secondary propagation of the reproduced sound. The path calculation unit 121 determines from which direction in the three-dimensional sound field the reproduced sound and secondary sounds will be perceived by the user 99 as coming from, based on the coordinates and orientation of the user 99, and processes the sound information so that when the output sound signal is reproduced, it is perceived as such a sound.
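The disclosure leaves the processing of the path calculation unit 121 open. As one possible sketch, the relative arrival direction of the direct sound can be computed from the source position and the user's coordinates and orientation, and a first-order reflection can be approximated with a mirrored (image) source. The image-source method is an assumption introduced here for illustration and is not stated in the disclosure; the function names and the two-dimensional, counterclockwise-positive angle convention are likewise hypothetical.

```python
import math


def relative_azimuth_deg(source_xy, user_xy, user_yaw_deg):
    """Arrival direction of a sound relative to the user's facing direction.

    Returns an angle in [-180, 180); 0 means straight ahead and positive
    angles are counterclockwise (to the user's left).
    """
    dx = source_xy[0] - user_xy[0]
    dy = source_xy[1] - user_xy[1]
    world_deg = math.degrees(math.atan2(dy, dx))
    return (world_deg - user_yaw_deg + 180.0) % 360.0 - 180.0


def image_source_x(source_xy, wall_x):
    """Mirror a source across a wall lying on the plane x = wall_x.

    The direct path from the mirrored source stands in for the
    first-order reflection path off that wall.
    """
    return (2.0 * wall_x - source_xy[0], source_xy[1])


user = (0.0, 0.0)
source = (0.0, 1.0)
direct = relative_azimuth_deg(source, user, user_yaw_deg=0.0)
reflected = relative_azimuth_deg(image_source_x(source, wall_x=2.0), user, 0.0)
```

The reflection attenuation mentioned as additional information would be applied separately as a gain on the reflected path.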

　出力音生成部１３１は、音情報に含まれる再生音に関する情報を処理することにより、出力音信号を生成する処理部である。 The output sound generation unit 131 is a processing unit that generates an output sound signal by processing information about the reproduced sound contained in the sound information.

　ここで、出力音生成部１３１の一例を、図５を用いて説明する。図５は、実施の形態に係る出力音生成部の機能構成を示すブロック図である。図５に示すように、本実施の形態における出力音生成部１３１は、例えば、生成部１３４、及び、合成部１３５を備える。 Here, an example of the output sound generation unit 131 will be described using FIG. 5. FIG. 5 is a block diagram showing the functional configuration of the output sound generation unit according to the embodiment. As shown in FIG. 5, the output sound generation unit 131 in this embodiment includes, for example, a generation unit 134 and a synthesis unit 135.

　生成部１３４は、パニング処理を適用して、再生音を代表音に変換する変換処理を行った後に、変換後の代表音に対して頭部伝達関数を畳み込む場合に用いられる処理部である。生成部１３４は、再生音及び代表点の位置を取得して、再生音を代表点からの音によって再現するための代表音への変換処理を行う。なお、生成部１３４は図１５において後述するパニング部と同等の機能を持つ。 The generation unit 134 is a processing unit used when applying panning processing to perform conversion processing to convert the reproduced sound into a representative sound, and then convolving a head-related transfer function with the converted representative sound. The generation unit 134 acquires the reproduced sound and the position of the representative point, and performs conversion processing to convert the reproduced sound into a representative sound so that the reproduced sound is reproduced using sound from the representative point. The generation unit 134 has the same function as the panning unit described later in Figure 15.

　例えば、音源オブジェクトが２つの代表点の中間に位置する場合に、再生音と同じ音を２つの代表点のそれぞれから鳴らすように音を生成する。つまり、再生音を２つの代表点に分配する。そして、音源オブジェクトの位置に合うように、生成した音のゲイン調整を行うことで、代表音を生成することができる。再生音からの代表音への変換は、このような例に限られない。例えば、後述するように時間シフト調整とゲイン調整とを行うことで再生音から代表音への変換を行ってもよいし、その他の、既存のあらゆる変換によって、再生音を代表点からの音によって再現するための代表音への変換を行うことができればよい。また、本明細書における、再生音から代表音への変換処理は、再生音を代表点（代表方向）に分配する処理であると読み換えてもよい。具体的には、それぞれの音源オブジェクトの位置に紐づけられた再生音の音信号を、代表点の位置に分配し、代表点（代表方向）からリスナへ到来する代表音を生成する。ここで代表方向は、リスナからみた代表点の位置の方向、または代表点の位置から見たリスナの方向のことを指す。時間シフト調整とゲイン調整とを行う変換の例については後述する。生成部１３４は、変換によって得られた代表点の数と同じだけの代表音と、各代表点からユーザ９９の位置までの代表方向に対応する頭部伝達関数とを取得して、代表音に対して取得した頭部伝達関数の畳み込み処理を行い、音信号を生成する。 For example, if a sound source object is located halfway between two representative points, a sound is generated so that the same sound as the playback sound is played from each of the two representative points. In other words, the playback sound is distributed to the two representative points. Then, a representative sound can be generated by adjusting the gain of the generated sound to match the position of the sound source object. The conversion of the playback sound to a representative sound is not limited to this example. For example, the playback sound may be converted to a representative sound by performing time shift adjustment and gain adjustment, as described below, or any other existing conversion may be used as long as the playback sound can be converted to a representative sound that can be reproduced as sound from a representative point. Furthermore, in this specification, the conversion process of the playback sound to a representative sound may be interpreted as the process of distributing the playback sound to representative points (representative directions). 
Specifically, the sound signals of the playback sound associated with the position of each sound source object are distributed to the positions of the representative points, and a representative sound that arrives at the listener from the representative points (representative directions) is generated. Here, the representative direction refers to the direction of the representative point position as seen from the listener, or the direction of the listener as seen from the representative point position. An example of the conversion that performs time shift adjustment and gain adjustment will be described later. The generation unit 134 acquires the same number of representative sounds as the number of representative points obtained by the conversion, and head-related transfer functions corresponding to the representative directions from each representative point to the position of the user 99, and performs convolution processing of the acquired head-related transfer functions on the representative sounds to generate a sound signal.
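The distribution of a reproduced sound to two representative points by gain adjustment can be sketched as follows. The constant-power (sine/cosine) panning law used here is an assumption chosen for illustration; the disclosure states only that the gain is adjusted to match the position of the sound source object. The function names are hypothetical.

```python
import math


def pan_gains(source_deg, rep_a_deg, rep_b_deg):
    """Gains for two representative points that bracket the source direction."""
    # Normalised position of the source between the two representative points.
    t = (source_deg - rep_a_deg) / (rep_b_deg - rep_a_deg)
    t = min(max(t, 0.0), 1.0)
    # Constant-power law: gain_a**2 + gain_b**2 == 1 for every t.
    return math.cos(t * math.pi / 2.0), math.sin(t * math.pi / 2.0)


def to_representative_sounds(signal, gains):
    """Distribute one reproduced sound to the representative points."""
    return [[g * s for s in signal] for g in gains]


gains = pan_gains(source_deg=15.0, rep_a_deg=0.0, rep_b_deg=30.0)
rep_sounds = to_representative_sounds([1.0, -0.5, 0.25], gains)
```

For a source halfway between the two representative points, both representative sounds carry the same signal at equal gain, which corresponds to the halfway example in the text.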

　つまり、パニング部は、経路算出部１２１により取得された複数個の音源（目的信号）の音源方向に基づいて、特定の代表方向からの音によるパニングを、音源の時間シフトとゲイン調整によって行うことにより、音源を表現するためのパニングを行う。具体的には、パニング部は、音源の音源方向に近似する代表方向のパニングにより、音源（目的信号）を合成する。これにより、パニング部は、等価的に音源の音源方向のＨＲＩＲを生成する。ここで、本実施形態において、「等価」「等価的」とは、後述する実施例で示すように、誤差が特定程度以下であり、ほぼ同様の信号であることをいう。具体的には、パニング部は、音源のパニングによって、音源の音源方向の最寄りの、又は音源方向のＨＲＩＲに最も似ている数個の方向のＨＲＩＲの合成で、等価的に当該方向のＨＲＩＲを生成する。この方向を、本実施形態において、下記で説明する「特定の代表方向」（以下、単に「代表方向」ともいう。）として説明する。これにより、耳元の信号を生成するための演算量を削減する。 In other words, the panning unit performs panning to represent the sound source by panning sounds from a specific representative direction based on the sound source directions of multiple sound sources (target signals) acquired by the path calculation unit 121, by time shifting the sound sources and adjusting the gain. Specifically, the panning unit synthesizes the sound source (target signal) by panning in a representative direction that approximates the sound source direction of the sound source. As a result, the panning unit generates an HRIR for the sound source direction equivalently. Here, in this embodiment, "equivalent" and "equivalently" refer to signals with an error below a specific level and that are nearly similar, as shown in the examples described below. Specifically, the panning unit generates an HRIR for the sound source direction equivalently by panning the sound source by synthesizing HRIRs for several directions that are closest to the sound source direction of the sound source or that are most similar to the HRIR for the sound source direction. In this embodiment, this direction is described as a "specific representative direction" (hereinafter simply referred to as a "representative direction") described below. This reduces the amount of calculation required to generate the signal at the ear.

　すなわち、パニング部は、複数個の音源による音像を、複数の代表方向の音によって合成する。この代表方向は、例えば、２～３方向を用いることが可能であるが、代表方向の数はこれに限らない。具体的には、パニング部は、音源の個数より少ない個数の代表点にまとめ、この代表点に対する代表方向のＨＲＩＲのみで音像を合成することが可能である。 In other words, the panning unit synthesizes a sound image from multiple sound sources using sounds from multiple representative directions. For example, two or three representative directions can be used, but the number of representative directions is not limited to this. Specifically, the panning unit can group together the sound sources into fewer representative points than the number of sound sources, and synthesize a sound image using only the HRIRs of the representative directions for these representative points.

　この際、パニング部は、音源の音源方向のＨＲＩＲと代表方向のＨＲＩＲとの相互相関が最大になる時間シフト（ディレイ、時間遅延）を算出する。ここで得られた時間シフト、又はこの時間シフトに負号を付した時間シフトを音源に付与した、時間シフト後の信号が代表方向にあるものとして、以降の処理を行う。 At this time, the panning unit calculates the time shift (delay) that maximizes the cross-correlation between the HRIR in the sound source direction and the HRIR in the representative direction. The time shift obtained here, or a time shift with a negative sign added to this time shift, is applied to the sound source, and subsequent processing is performed assuming that the signal after the time shift is in the representative direction.
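The search for the time shift that maximizes the cross-correlation between the two HRIRs can be sketched as a brute-force scan over integer shifts. The function name, the search range `max_shift`, and the sign convention (a positive result means the source-direction HRIR lags the representative-direction HRIR) are assumptions made for this sketch.

```python
def best_time_shift(hrir_source, hrir_rep, max_shift):
    """Integer shift k maximising sum_n hrir_rep[n] * hrir_source[n + k]."""

    def xcorr(shift):
        total = 0.0
        for n, value in enumerate(hrir_rep):
            m = n + shift
            if 0 <= m < len(hrir_source):
                total += value * hrir_source[m]
        return total

    # Scan every candidate shift and keep the one with the largest correlation.
    return max(range(-max_shift, max_shift + 1), key=xcorr)


hrir_rep = [0.0, 1.0, 0.5, 0.0, 0.0, 0.0]
hrir_source = [0.0, 0.0, 0.0, 1.0, 0.5, 0.0]  # same shape, two samples later
shift = best_time_shift(hrir_source, hrir_rep, max_shift=3)
```

Whether this shift or its negative is applied to the sound source depends on the sign convention, as the text notes.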

　この時間シフトは、サンプリング周波数より短い時間での時間シフト（サンプル位置が小数で示されるシフト。以下、「小数シフト」という。）も許容してもよい。この小数シフトは、オーバーサンプリングにより行うことが可能である。 This time shift may also be a shift by less than one sampling period (a shift in which the sample position is expressed as a fraction; hereinafter referred to as a "fractional shift"). This fractional shift can be performed by oversampling.

　ここで、パニング部は、音源を時間シフトした代表方向の信号にゲインをかけて、代表点毎に算出されたそれらの値に各代表点におけるＨＲＩＲを畳み込んだものの和を算出することで、音源に音源方向のＨＲＩＲを畳み込んだものと等価な信号を合成する。 Here, the panning unit applies a gain to the time-shifted sound source signal assigned to each representative direction, convolves the resulting signal for each representative point with the HRIR at that representative point, and sums the results, thereby synthesizing a signal equivalent to the sound source convolved with the HRIR of the sound source direction.

　一方、パニング部は、代表方向のＨＲＩＲ（ベクトル）の和で音源方向のＨＲＩＲ（ベクトル）を合成する際、合成されたＨＲＩＲ（ベクトル）と音源方向のＨＲＩＲ（ベクトル）の誤差信号ベクトルが代表方向のＨＲＩＲ（ベクトル）と直交させるようにして、ゲインを算出してもよい。なお、ＨＲＩＲ（ベクトル）とはＨＲＩＲの時間波形をベクトルと見立てたものである。以下、このＨＲＩＲ（ベクトル）を「ＨＲＩＲベクトル」とも記載する。 On the other hand, when synthesizing the HRIR (vector) of the sound source direction as a sum of the HRIRs (vectors) of the representative directions, the panning unit may calculate the gains such that the error signal vector between the synthesized HRIR (vector) and the HRIR (vector) of the sound source direction is orthogonal to the HRIR (vector) of each representative direction. Note that the HRIR (vector) is the time waveform of the HRIR regarded as a vector. Hereinafter, this HRIR (vector) will also be referred to as the "HRIR vector."
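Requiring the error vector to be orthogonal to each representative HRIR vector leads to a small linear system for the gains. The sketch below works this out for two representative directions, assuming the HRIR vectors have already been time-aligned; the function names are hypothetical and the two-direction restriction is made only to keep the solver short.

```python
def dot(a, b):
    """Inner product of two equal-length HRIR vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_gains(h_target, h1, h2):
    """Gains g1, g2 such that (h_target - g1*h1 - g2*h2) is orthogonal
    to both h1 and h2, i.e. the solution of the 2x2 normal equations."""
    a11, a12, a22 = dot(h1, h1), dot(h1, h2), dot(h2, h2)
    b1, b2 = dot(h1, h_target), dot(h2, h_target)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("representative HRIR vectors are linearly dependent")
    g1 = (b1 * a22 - b2 * a12) / det
    g2 = (a11 * b2 - a12 * b1) / det
    return g1, g2


h1 = [1.0, 1.0, 0.0]
h2 = [0.0, 1.0, 1.0]
g1, g2 = orthogonal_gains([1.0, 2.0, 1.0], h1, h2)
```

Here the target equals `h1 + h2`, so both gains come out as 1 and the error vector is zero; for a target outside the span of the representative HRIR vectors, the residual is the part that no choice of gains can remove.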

　パニング部は、このゲインについて、音源位置からの左右の耳のＨＲＩＲのエネルギーバランスが、パニングにより実質的に複数の代表点からのＨＲＩＲで合成されたＨＲＩＲでも維持されるように補正する。すなわち、パニング部は、音源による受聴者の左右の耳のＨＲＩＲのエネルギーバランスが、パニングにより実質的に合成されたＨＲＩＲでも維持されるようにゲインを補正してもよい。 The panning unit corrects this gain so that the energy balance of the HRIRs for the left and right ears from the sound source position is maintained in an HRIR substantially synthesized by panning from HRIRs from multiple representative points. In other words, the panning unit may correct the gain so that the energy balance of the HRIRs for the left and right ears of the listener from the sound source is maintained in an HRIR substantially synthesized by panning.

　本実施形態においては、パニング部は、音源の各音源方向について、代表方向のＨＲＩＲのゲインのゲイン値と、ＨＲＩＲの時間シフトの時間に相当する時間シフト値とを算出して、後述するＨＲＩＲテーブル２００に格納しておくことが可能である。 In this embodiment, the panning unit calculates the gain value of the HRIR gain in the representative direction for each sound source direction and a time shift value corresponding to the time shift of the HRIR, and stores these values in the HRIR table 200, which will be described later.

　この上で、パニング部は、各音源の音源方向に対応する時間シフト値及びゲイン値で、各音源の時間シフトを行い、ゲインをかけて、これの和をとって和信号とする。パニング部は、この和信号が代表点の位置に存在するものとして扱う。パニング部は、この和信号に、代表点の位置のＨＲＩＲを畳み込んで、受聴者の耳元の信号を生成することが可能である。 The panning unit then time-shifts each sound source by the time shift value corresponding to its sound source direction, multiplies it by the corresponding gain value, and sums the results to form a sum signal. The panning unit treats this sum signal as existing at the position of the representative point. The panning unit can then convolve the HRIR at the position of the representative point with this sum signal to generate a signal at the listener's ear.
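The saving in computation comes from convolving once per representative point rather than once per sound source. A minimal sketch for a single representative point and a single ear is shown below, assuming integer time shifts only (fractional shifts are omitted); the function names and the `(shift, gain)` parameter layout are hypothetical.

```python
def shift_and_scale(signal, shift, gain, length):
    """Delay a signal by `shift` samples and multiply it by `gain`."""
    out = [0.0] * length
    for n, s in enumerate(signal):
        m = n + shift
        if 0 <= m < length:
            out[m] += gain * s
    return out


def convolve(x, h):
    """Direct-form linear convolution of x with h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def ear_signal(sources, params, hrir_rep, length):
    """Sum the shifted, scaled sources, then convolve the sum once.

    sources -- list of source signals
    params  -- one (shift, gain) pair per source, e.g. read from a table
    """
    sum_signal = [0.0] * length
    for signal, (shift, gain) in zip(sources, params):
        shifted = shift_and_scale(signal, shift, gain, length)
        sum_signal = [a + b for a, b in zip(sum_signal, shifted)]
    return convolve(sum_signal, hrir_rep)


out = ear_signal([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                 [(0, 0.5), (1, 2.0)], hrir_rep=[1.0, 0.5], length=4)
```

With N sound sources grouped into one representative point, this performs one convolution instead of N.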

　合成部１３５は、出力音信号を生成する。合成部１３５は、音信号について、ＥＱ調整を行うことがある。具体的にはパニング処理において、減衰してしまう可能性が高いとされる高域ドメインについて、ゲインを上昇させるＥＱ調整を行い、この高域ドメインの強調をしてもよい。そのため、合成部１３５は、ＥＱ調整部としての機能を有する。なお、合成部１３５において行われるＥＱ調整は、音信号が複数ある場合には複数の音信号の一部のみに行ってもよいし、全部に行ってもよい。 The synthesis unit 135 generates an output sound signal. The synthesis unit 135 may perform EQ adjustment on the sound signal. Specifically, in the panning process, EQ adjustment may be performed to increase the gain of the high-frequency domain, which is likely to be attenuated, and to emphasize this high-frequency domain. Therefore, the synthesis unit 135 functions as an EQ adjustment unit. Note that, if there are multiple sound signals, the EQ adjustment performed by the synthesis unit 135 may be performed on only some or all of the multiple sound signals.

　図２を再び参照する。出力音生成部１３１は、出力音信号生成のために用いる頭部伝達関数をデータベース１０５から取得する。データベース１０５は情報を記憶するための記憶装置としての機能と、記憶された情報を読み出して、外部の構成に出力する記憶コントローラとしての機能とを併せ持つ情報記憶装置である。データベース１０５には、頭部伝達関数がユーザ９９への到来方向ごとに記憶されている。データベース１０５に含まれる頭部伝達関数は、万人に用いることができる汎用の頭部伝達関数のセット、又は、ユーザ９９個人に最適化された頭部伝達関数のセット、又は、一般に公開されている頭部伝達関数のセットである。データベース１０５は、出力音生成部１３１から、到来方向をクエリとした問い合わせを受け、その到来方向に対応する頭部伝達関数を出力音生成部１３１へと出力する。また、出力音生成部１３１は、頭部伝達関数のセットをすべて出力したり、頭部伝達関数のセット自体の特性などを出力したりする場合もある。 Referring again to Figure 2, the output sound generation unit 131 obtains the head-related transfer function used to generate the output sound signal from the database 105. The database 105 is an information storage device that functions both as a storage device for storing information and as a storage controller that reads out the stored information and outputs it to an external component. The database 105 stores a head-related transfer function for each direction of arrival for the user 99. The head-related transfer functions included in the database 105 are a set of general-purpose head-related transfer functions that can be used by everyone, a set of head-related transfer functions optimized for each individual user 99, or a set of head-related transfer functions that are publicly available. The database 105 receives an inquiry from the output sound generation unit 131 using the direction of arrival as a query, and outputs the head-related transfer function corresponding to that direction of arrival to the output sound generation unit 131. The output sound generation unit 131 may also output the entire set of head-related transfer functions, or may output the characteristics of the set of head-related transfer functions itself.
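The query interface of the database 105 can be pictured as a lookup keyed by arrival direction. The sketch below assumes azimuth-only directions in degrees and returns the entry stored for the nearest direction; the nearest-neighbour rule, the class name `HrtfDatabase`, and the placeholder HRIR values are all assumptions, since the disclosure states only that the head-related transfer function corresponding to the queried direction is output.

```python
class HrtfDatabase:
    """Stores one (left, right) HRIR pair per arrival azimuth in degrees."""

    def __init__(self, table):
        self._table = dict(table)

    def query(self, arrival_deg):
        """Return the HRIR pair stored for the direction nearest the query."""

        def angular_distance(stored_deg):
            d = abs(stored_deg - arrival_deg) % 360.0
            return min(d, 360.0 - d)  # account for wrap-around at 360 degrees

        nearest = min(self._table, key=angular_distance)
        return self._table[nearest]


db = HrtfDatabase({
    0.0: ([1.0, 0.2], [1.0, 0.2]),    # placeholder values, not measured data
    90.0: ([1.0, 0.3], [0.4, 0.1]),
    180.0: ([0.6, 0.2], [0.6, 0.2]),
    270.0: ([0.4, 0.1], [1.0, 0.3]),
})
left, right = db.query(350.0)
```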

　信号出力部１４１は、生成された出力音信号をドライバ１０４へと出力する機能部である。信号出力部１４１は、出力音信号に基づいてデジタル信号からアナログ信号への信号変換などを行うことで、波形信号を生成し、波形信号に基づいてドライバ１０４に音波を発生させ、ユーザ９９に音を提示する。ドライバ１０４は、例えば、振動板とマグネット及びボイスコイルなどの駆動機構とを有する。ドライバ１０４は、波形信号に応じて駆動機構を動作させ、駆動機構によって振動板を振動させる。このようにして、ドライバ１０４は、出力音信号に応じた振動板の振動により、音波を発生させ（出力音信号を「再生」することを意味する、すなわち、ユーザ９９が知覚することは「再生」の意味には含まれない）、音波が空気を伝播してユーザ９９の耳に伝達し、ユーザ９９が音を知覚する。 The signal output unit 141 is a functional unit that outputs the generated output sound signal to the driver 104. The signal output unit 141 generates a waveform signal by performing signal conversion from digital to analog based on the output sound signal, and then generates sound waves in the driver 104 based on the waveform signal, presenting the sound to the user 99. The driver 104 has, for example, a diaphragm and a drive mechanism such as a magnet and voice coil. The driver 104 operates the drive mechanism in accordance with the waveform signal, causing the diaphragm to vibrate. In this way, the driver 104 generates sound waves by vibrating the diaphragm in accordance with the output sound signal (this means "reproducing" the output sound signal; in other words, "reproducing" does not include perception by the user 99), and the sound waves propagate through the air to the ears of the user 99, who then perceives the sound.

　［別の構成例］ [Another configuration example]
　上述の例において、本実施の形態に係る音響再生システム１００は、音提示デバイスであり、情報処理装置１０１と、通信モジュール１０２と、検知器１０３と、データベース１０５と、ドライバ１０４とを備えることを説明したが、音響再生システム１００の機能を複数の装置で実現してもよいし一つの装置で実現してもよい。具体的に、図６～図１５を用いて説明する。図６～図１５は、実施の形態に係る音響再生システムの別の例を説明するための図である。 In the above example, the sound reproduction system 100 according to the present embodiment is a sound presentation device, and has been described as including an information processing device 101, a communication module 102, a detector 103, a database 105, and a driver 104. However, the functions of the sound reproduction system 100 may be realized by a plurality of devices or by a single device. Specific examples will be described using Figures 6 to 15. Figures 6 to 15 are diagrams for explaining other examples of the sound reproduction system according to the embodiment.

　例えば、情報処理装置６０１が音声提示デバイス６０２に含まれ、音声提示デバイス６０２が音響処理と音の提示との両方を行ってもよい。また、情報処理装置６０１と音声提示デバイス６０２とが本開示で説明する音響処理を分担して実施してもよいし、情報処理装置６０１又は音声提示デバイス６０２とネットワークを介して接続されたサーバが本開示で説明する音響処理の一部又は全体を実施してもよい。 For example, the information processing device 601 may be included in the audio presentation device 602, and the audio presentation device 602 may perform both acoustic processing and sound presentation. Furthermore, the information processing device 601 and the audio presentation device 602 may share the acoustic processing described in this disclosure, or a server connected to the information processing device 601 or the audio presentation device 602 via a network may perform some or all of the acoustic processing described in this disclosure.

　なお、上記説明では、情報処理装置６０１と呼んでいるが、情報処理装置６０１が音声信号又は音響処理に用いる空間情報の少なくとも一部のデータを符号化して生成されたビットストリームを復号して音響処理を実施する場合、情報処理装置６０１は復号装置と呼ばれてもよいし、音響再生システム１００（つまり、図中の立体音響再生システム６００）は、復号処理システムと呼ばれてもよい。 Note that although the above description refers to the information processing device 601, if the information processing device 601 performs acoustic processing by decoding a bitstream generated by encoding at least a portion of the data of the audio signal or spatial information used in acoustic processing, the information processing device 601 may be called a decoding device, and the acoustic reproduction system 100 (i.e., the stereophonic sound reproduction system 600 in the figure) may be called a decoding processing system.

　ここでは、音響再生システム１００が復号処理システムとして機能する例について説明する。 Here, we will explain an example in which the sound reproduction system 100 functions as a decoding processing system.

　＜符号化装置の例＞ <Example of an encoding device>
　図７は、本開示の符号化装置の一例である符号化装置７００の構成を示す機能ブロック図である。 FIG. 7 is a functional block diagram showing the configuration of an encoding device 700, which is an example of an encoding device according to the present disclosure.

　入力データ７０１はエンコーダ７０２に入力される空間情報及び／又は音声信号を含む符号化対象となるデータである。空間情報の詳細については後で説明する。 Input data 701 is data to be encoded, including spatial information and/or audio signals, input to encoder 702. Details of spatial information will be explained later.

　エンコーダ７０２は、入力データ７０１を符号化して、符号化データ７０３を生成する。符号化データ７０３は、例えば、符号化処理によって生成されたビットストリームである。 The encoder 702 encodes the input data 701 to generate encoded data 703. The encoded data 703 is, for example, a bit stream generated by the encoding process.

　メモリ７０４は、符号化データ７０３を格納する。メモリ７０４は、例えば、ハードディスク又はＳＳＤ（Ｓｏｌｉｄ－Ｓｔａｔｅ　Ｄｒｉｖｅ）であってもよいし、その他の記憶装置であってもよい。 Memory 704 stores encoded data 703. Memory 704 may be, for example, a hard disk or SSD (Solid-State Drive), or may be some other storage device.

　なお、上記説明ではメモリ７０４に記憶される符号化データ７０３の一例として符号化処理によって生成されたビットストリームを挙げたが、ビットストリーム以外のデータであってもよい。例えば、符号化装置７００は、ビットストリームを所定のデータフォーマットに変換して生成された変換後のデータをメモリ７０４に記憶してもよい。変換後のデータは、例えば、一又は複数のビットストリームを格納したファイル又は多重化ストリームであってもよい。ここで、ファイルは、例えばＩＳＯＢＭＦＦ（ＩＳＯ　Ｂａｓｅ　Ｍｅｄｉａ　Ｆｉｌｅ　Ｆｏｒｍａｔなどのファイルフォーマットを有するファイルである。また、符号化データ７０３は、上記のビットストリーム又はファイルを分割して生成された複数のパケットの形式であってもよい。エンコーダ７０２で生成されたビットストリームをビットストリームとは異なるデータに変換する場合、符号化装置７００は、図示されていない変換部を備えていてもよいし、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）で変換処理を行ってもよい。 In the above explanation, a bitstream generated by the encoding process was given as an example of encoded data 703 stored in memory 704, but data other than a bitstream may also be used. For example, encoding device 700 may convert a bitstream into a predetermined data format and store the converted data in memory 704. The converted data may be, for example, a file or multiplexed stream containing one or more bitstreams. Here, the file may have a file format such as ISOBMFF (ISO Base Media File Format). Encoded data 703 may also be in the form of multiple packets generated by dividing the bitstream or file. When converting the bitstream generated by encoder 702 into data other than the bitstream, encoding device 700 may be equipped with a conversion unit (not shown), or the conversion process may be performed by a CPU (Central Processing Unit).
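The packet form of the encoded data 703 can be illustrated by splitting a bitstream into numbered payloads and reassembling them on the receiving side. The packet layout used here (a sequence number paired with a payload) and the function names are hypothetical; the disclosure does not specify a packet format, and this sketch does not implement ISOBMFF.

```python
def split_into_packets(bitstream: bytes, payload_size: int):
    """Divide a bitstream into (sequence_number, payload) packets."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [
        (seq, bitstream[offset:offset + payload_size])
        for seq, offset in enumerate(range(0, len(bitstream), payload_size))
    ]


def reassemble(packets):
    """Restore the original bitstream from packets received in any order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)


encoded = bytes(range(10))
packets = split_into_packets(encoded, payload_size=4)
restored = reassemble(reversed(packets))
```

The reassembly step corresponds to the conversion that a decoding device performs before passing the bitstream to its decoder.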

　＜復号装置の例＞ <Example of a decoding device>
　図８は、本開示の復号装置の一例である復号装置８００の構成を示す機能ブロック図である。 FIG. 8 is a functional block diagram showing the configuration of a decoding device 800, which is an example of a decoding device according to the present disclosure.

　メモリ８０４は、例えば、符号化装置７００で生成された符号化データ７０３と同じデータを格納している。メモリ８０４は、保存されているデータを読み出し、デコーダ８０２の入力データ８０３として入力する。入力データ８０３は、例えば、復号対象となるビットストリームである。メモリ８０４は、例えば、ハードディスク又はＳＳＤであってもよいし、その他の記憶装置であってもよい。 Memory 804 stores, for example, the same data as the encoded data 703 generated by encoding device 700. Memory 804 reads the stored data and inputs it as input data 803 to decoder 802. Input data 803 is, for example, a bitstream to be decoded. Memory 804 may be, for example, a hard disk or SSD, or may be some other storage device.

　なお、復号装置８００は、メモリ８０４が記憶しているデータをそのまま入力データ８０３とするのではなく、読み出したデータを変換して生成された変換後のデータを入力データ８０３としてもよい。変換前のデータは、例えば、一又は複数のビットストリームを格納した多重化データであってもよい。ここで、多重化データは、例えばＩＳＯＢＭＦＦなどのファイルフォーマットを有するファイルであってもよい。また、変換前のデータは、上記のビットストリーム又はファイルを分割して生成された複数のパケットの形式であってもよい。メモリ８０４から読み出したビットストリームとは異なるデータをビットストリームに変換する場合、復号装置８００は、図示されていない変換部を備えていてもよいし、ＣＰＵで変換処理を行ってもよい。 Note that decoding device 800 may not use the data stored in memory 804 as input data 803 as is, but may convert the read data and generate converted data as input data 803. The data before conversion may be, for example, multiplexed data containing one or more bitstreams. Here, the multiplexed data may be a file having a file format such as ISOBMFF. The data before conversion may also be in the form of multiple packets generated by dividing the bitstream or file. When converting data different from the bitstream read from memory 804 into a bitstream, decoding device 800 may be equipped with a conversion unit (not shown), or the conversion process may be performed by a CPU.

　デコーダ８０２は、入力データ８０３を復号して、リスナに提示される音声信号８０１を生成する。 The decoder 802 decodes the input data 803 to generate an audio signal 801 that is presented to the listener.

　＜符号化装置の別の例＞ <Another example of an encoding device>
　図９は、本開示の符号化装置の別の一例である符号化装置９００の構成を示す機能ブロック図である。図９では、図７の構成と同じ機能を有する構成に図７の構成と同じ符号を付しており、これらの構成については説明を省略する。 FIG. 9 is a functional block diagram showing the configuration of an encoding device 900, which is another example of an encoding device according to the present disclosure. In FIG. 9, components having the same functions as those in FIG. 7 are assigned the same reference numerals as in FIG. 7, and descriptions of these components will be omitted.

　符号化装置７００は符号化データ７０３を記憶するメモリ７０４を備えているのに対し、符号化装置９００は符号化データ７０３を外部に対して送信する送信部９０１を備える点で符号化装置７００と異なる。 The encoding device 900 differs from the encoding device 700 in that, whereas the encoding device 700 includes a memory 704 that stores the encoded data 703, the encoding device 900 includes a transmission unit 901 that transmits the encoded data 703 to the outside.

　送信部９０１は、符号化データ７０３又は符号化データ７０３を変換して生成した別のデータ形式のデータに基づいて送信信号９０２を別の装置又はサーバに対して送信する。送信信号９０２の生成に用いられるデータは、例えば、符号化装置７００で説明したビットストリーム、多重化データ、ファイル、又はパケットである。 The transmitter 901 transmits a transmission signal 902 to another device or server based on the encoded data 703 or data in a different data format generated by converting the encoded data 703. The data used to generate the transmission signal 902 is, for example, a bit stream, multiplexed data, a file, or a packet, as described in the encoding device 700.

　＜復号装置の別の例＞ <Another example of a decoding device>
　図１０は、本開示の復号装置の別の一例である復号装置１０００の構成を示す機能ブロック図である。図１０では、図８の構成と同じ機能を有する構成に図８の構成と同じ符号を付しており、これらの構成については説明を省略する。 Fig. 10 is a functional block diagram showing the configuration of a decoding device 1000, which is another example of a decoding device according to the present disclosure. In Fig. 10, components having the same functions as those in Fig. 8 are assigned the same reference numerals, and descriptions of these components will be omitted.

　復号装置８００は入力データ８０３を読み出すメモリ８０４を備えているのに対し、復号装置１０００は入力データ８０３を外部から受信する受信部１００１を備える点で復号装置８００と異なる。 Decoding device 1000 differs from decoding device 800 in that, whereas decoding device 800 includes a memory 804 from which input data 803 is read, decoding device 1000 includes a receiving unit 1001 that receives input data 803 from outside.

　受信部１００１は、受信信号１００２を受信して受信データを取得し、デコーダ８０２に入力される入力データ８０３を出力する。受信データは、デコーダ８０２に入力される入力データ８０３と同じであってもよいし、入力データ８０３とは異なるデータ形式のデータであってもよい。受信データが、入力データ８０３と異なるデータ形式のデータの場合、受信部１００１が受信データを入力データ８０３に変換してもよいし、復号装置１０００が備える図示されていない変換部又はＣＰＵが受信データを入力データ８０３に変換してもよい。受信データは、例えば、符号化装置９００で説明したビットストリーム、多重化データ、ファイル、又はパケットである。 The receiving unit 1001 receives the received signal 1002, acquires the received data, and outputs the input data 803 to be input to the decoder 802. The received data may be the same as the input data 803 to be input to the decoder 802, or may be data in a different data format from the input data 803. If the received data is in a different data format from the input data 803, the receiving unit 1001 may convert the received data into the input data 803, or a conversion unit or CPU (not shown) provided in the decoding device 1000 may convert the received data into the input data 803. The received data is, for example, a bit stream, multiplexed data, a file, or a packet as described for the encoding device 900.
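
The conversion from received packets into input data 803 could be sketched as follows. This is only an illustration: the packet layout (a zero-based sequence number paired with a payload) and the function name `reassemble_bitstream` are assumptions made for the example and are not defined in the present disclosure.

```python
from typing import Iterable, Tuple


def reassemble_bitstream(packets: Iterable[Tuple[int, bytes]]) -> bytes:
    """Join (sequence_number, payload) packets into one contiguous bitstream.

    The packet format is hypothetical: sequence numbers are assumed to start
    at 0 and increase by 1 with no gaps.
    """
    # Packets may arrive out of order, so sort by sequence number first.
    ordered = sorted(packets, key=lambda p: p[0])
    for expected, (seq, _) in enumerate(ordered):
        if seq != expected:
            raise ValueError(f"missing or duplicate packet near sequence {expected}")
    # Concatenate the payloads to recover the original byte sequence.
    return b"".join(payload for _, payload in ordered)
```

In this sketch the receiving unit would pass the returned bytes to the decoder as input data; a conversion unit or CPU outside the receiving unit could run the same function instead.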

　＜デコーダの機能説明＞ <Decoder function explanation>
　図１１は、図８又は図１０におけるデコーダ８０２の一例であるデコーダ１１００の構成を示す機能ブロック図である。 Fig. 11 is a functional block diagram showing the configuration of a decoder 1100, which is an example of the decoder 802 in Fig. 8 or Fig. 10.

　入力データ８０３は符号化されたビットストリームであり、符号化された音声信号である符号化音声データと音響処理に用いるメタデータとを含んでいる。 The input data 803 is an encoded bitstream and includes encoded audio data, which is an encoded audio signal, and metadata used for acoustic processing.

　空間情報管理部１１０１は、入力データ８０３に含まれるメタデータを取得して、メタデータを解析する。メタデータは、音空間に配置された音に作用する要素を記述した情報を含む。空間情報管理部１１０１は、メタデータを解析して得られた音響処理に必要な空間情報を管理し、レンダリング部１１０３に対して空間情報を提供する。なお、本開示では音響処理に用いる情報が空間情報と呼ばれているが、それ以外の呼び方であってもよい。当該音響処理に用いる情報は、例えば、音空間情報と呼ばれてもよいし、シーン情報と呼ばれてもよい。また、音響処理に用いる情報が経時的に変化する場合、レンダリング部１１０３に入力される空間情報は、空間状態、音空間状態、シーン状態などと呼ばれてもよい。 The spatial information management unit 1101 acquires metadata contained in the input data 803 and analyzes the metadata. The metadata includes information describing elements that act on sounds arranged in a sound space. The spatial information management unit 1101 manages the spatial information necessary for acoustic processing obtained by analyzing the metadata and provides the spatial information to the rendering unit 1103. Note that although the information used for acoustic processing is referred to as spatial information in this disclosure, it may be called something else. The information used for acoustic processing may be called, for example, sound space information or scene information. Furthermore, if the information used for acoustic processing changes over time, the spatial information input to the rendering unit 1103 may be called a space state, sound space state, scene state, etc.

　また、空間情報は音空間ごと又はシーンごとに管理されていてもよい。例えば、異なる部屋を仮想空間として表現する場合、それぞれの部屋が異なる音空間のシーンとして管理されてもよいし、同じ空間であっても表現する場面に応じて異なるシーンとして空間情報が管理されてもよい。空間情報の管理において、それぞれの空間情報を識別する識別子が付与されておいてもよい。空間情報のデータは、入力データ８０３の一形態であるビットストリームに含まれていてもよいし、ビットストリームが空間情報の識別子を含み、空間情報のデータはビットストリーム以外から取得してもよい。ビットストリームに空間情報の識別子のみが含まれる場合、レンダリング時に空間情報の識別子を用いて、音響信号処理装置のメモリ又は外部のサーバに記憶された空間情報のデータが入力データとして取得されてもよい。 Furthermore, spatial information may be managed for each sound space or for each scene. For example, when different rooms are represented as virtual spaces, each room may be managed as a different sound space scene, or even for the same space, spatial information may be managed as different scenes depending on the situation being represented. When managing spatial information, an identifier that identifies each piece of spatial information may be assigned. The spatial information data may be included in a bitstream, which is one form of input data 803, or the bitstream may include an identifier for the spatial information and the spatial information data may be obtained from somewhere other than the bitstream. If the bitstream only includes an identifier for the spatial information, the identifier for the spatial information may be used during rendering to obtain spatial information data stored in the memory of the acoustic signal processing device or an external server as input data.
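
As a minimal sketch of the identifier-based lookup described above, the following assumes that the metadata is available as a dictionary with the hypothetical keys `spatial_info` (inline data) and `spatial_info_id` (identifier only). The key names, the local store, and the `fetch_remote` callback are illustrative assumptions, not part of the present disclosure.

```python
from typing import Callable, Dict, Optional


def resolve_spatial_info(
    metadata: Dict,
    local_store: Dict[str, Dict],
    fetch_remote: Optional[Callable[[str], Dict]] = None,
) -> Dict:
    """Return spatial information for rendering.

    Order of preference (an assumption of this sketch): data carried inline
    in the bitstream metadata, then device memory, then an external server.
    """
    if "spatial_info" in metadata:
        # The bitstream carries the spatial information data itself.
        return metadata["spatial_info"]
    scene_id = metadata["spatial_info_id"]
    if scene_id in local_store:
        # Only the identifier was sent; the data is held in device memory.
        return local_store[scene_id]
    if fetch_remote is not None:
        # Fall back to an external server, represented here by a callback.
        return fetch_remote(scene_id)
    raise KeyError(f"no spatial information available for {scene_id!r}")
```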

　なお、空間情報管理部１１０１が管理する情報は、ビットストリームに含まれる情報に限定されない。例えば、入力データ８０３は、ビットストリームには含まれないデータとして、ＶＲ又はＡＲを提供するソフトウェアアプリケーション又はサーバから取得された空間の特性又は構造を示すデータを含んでいてもよい。また、例えば、入力データ８０３は、ビットストリームには含まれないデータとして、リスナ又はオブジェクトの特性又は位置などを示すデータを含んでいてもよい。また、入力データ８０３は、リスナの位置を示す情報として復号装置を含む端末が備えるセンサで取得された情報、又は、センサで取得された情報に基づいて推定された端末の位置を示す情報を含んでいてもよい。つまり、空間情報管理部１１０１は、外部のシステム又はサーバと通信し、空間情報及びリスナの位置を取得してもよい。また、空間情報管理部１１０１が外部のシステムからクロック同期情報を取得し、レンダリング部１１０３のクロックと同期する処理を実行してもよい。なお、上記の説明における空間は、仮想的に形成された空間、つまりＶＲ空間であってもよいし、実空間又は実空間に対応する仮想空間、つまりＡＲ空間又はＭＲ（Ｍｉｘｅｄ　Ｒｅａｌｉｔｙ）空間であってもよい。また、仮想空間は音場又は音空間と呼ばれてもよい。また、上記の説明における位置を示す情報は、空間内における位置を示す座標値などの情報であってもよいし、所定の基準位置に対する相対位置を示す情報であってもよいし、空間内の位置の動き又は加速度を示す情報であってもよい。 Note that the information managed by the spatial information management unit 1101 is not limited to information contained in the bitstream. For example, the input data 803 may include data indicating the characteristics or structure of a space obtained from a software application or server that provides VR or AR, as data not included in the bitstream. Furthermore, for example, the input data 803 may include data indicating the characteristics or position of a listener or object, as data not included in the bitstream. Furthermore, the input data 803 may include, as information indicating the position of the listener, information obtained by a sensor provided in a terminal including a decoding device, or information indicating the position of the terminal estimated based on information obtained by the sensor. In other words, the spatial information management unit 1101 may communicate with an external system or server to obtain spatial information and the position of the listener. Furthermore, the spatial information management unit 1101 may obtain clock synchronization information from an external system and perform processing to synchronize with the clock of the rendering unit 1103. 
Note that the space in the above description may be a virtually created space, i.e., a VR space, or a real space or a virtual space corresponding to a real space, i.e., an AR space or an MR (Mixed Reality) space. Virtual spaces may also be called sound fields or sound spaces. Furthermore, the information indicating a position in the above description may be information such as coordinate values indicating a position within a space, information indicating a relative position with respect to a predetermined reference position, or information indicating the movement or acceleration of a position within a space.

　音声データデコーダ１１０２は、入力データ８０３に含まれる符号化音声データを復号して、音声信号を取得する。 The audio data decoder 1102 decodes the encoded audio data contained in the input data 803 to obtain an audio signal.

　立体音響再生システム６００が取得する符号化音声データは、例えば、ＭＰＥＧ－Ｈ　３Ｄ　Ａｕｄｉｏ（ＩＳＯ／ＩＥＣ　２３００８－３）等の所定の形式で符号化されたビットストリームである。なお、ＭＰＥＧ－Ｈ　３Ｄ　Ａｕｄｉｏはあくまでビットストリームに含まれる符号化音声データを生成する際に利用可能な符号化方式の一例であり、他の符号化方式で符号化されたビットストリームと符号化音声データとして含んでいてもよい。例えば、用いられる符号化方式は、ＭＰ３（ＭＰＥＧ－１　Ａｕｄｉｏ　Ｌａｙｅｒ－３）、ＡＡＣ（Ａｄｖａｎｃｅｄ　Ａｕｄｉｏ　Ｃｏｄｉｎｇ）、ＷＭＡ（Ｗｉｎｄｏｗｓ　Ｍｅｄｉａ　Ａｕｄｉｏ）、ＡＣ３（Ａｕｄｉｏ　Ｃｏｄｅｃ－３）、Ｖｏｒｂｉｓなどの非可逆コーデックであってもよいし、ＡＬＡＣ（Ａｐｐｌｅ　Ｌｏｓｓｌｅｓｓ　Ａｕｄｉｏ　Ｃｏｄｅｃ）、ＦＬＡＣ（Ｆｒｅｅ　Ｌｏｓｓｌｅｓｓ　Ａｕｄｉｏ　Ｃｏｄｅｃ）などの可逆コーデックであってもよいし、上記以外の任意の符号化方式が用いられてもよい。例えば、ＰＣＭ（Ｐｕｌｓｅ　Ｃｏｄｅ　Ｍｏｄｕｌａｔｉｏｎ）データが符号化音声データの一種であるとしてもよい。この場合、復号処理は、例えば、当該ＰＣＭデータの量子化ビット数がＮである場合、Ｎビットの二進数を、レンダリング部１１０３が処理できる数形式（例えば浮動小数点形式）に変換する処理としてもよい。 The encoded audio data acquired by the stereophonic sound reproduction system 600 is a bitstream encoded in a predetermined format, such as MPEG-H 3D Audio (ISO/IEC 23008-3). Note that MPEG-H 3D Audio is merely one example of an encoding method that can be used to generate the encoded audio data contained in the bitstream, and bitstreams encoded using other encoding methods may also be included as encoded audio data. For example, the encoding method used may be a lossy codec such as MP3 (MPEG-1 Audio Layer-3), AAC (Advanced Audio Coding), WMA (Windows Media Audio), AC3 (Audio Codec-3), or Vorbis, or a lossless codec such as ALAC (Apple Lossless Audio Codec) or FLAC (Free Lossless Audio Codec), or any other encoding method may be used. For example, PCM (Pulse Code Modulation) data may be a type of encoded audio data. In this case, the decoding process may be, for example, a process of converting an N-bit binary number into a number format (e.g., floating-point format) that can be processed by the rendering unit 1103, where the number of quantization bits of the PCM data is N.
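
The PCM case mentioned above, in which decoding amounts to converting N-bit binary numbers into a floating-point format, could look like the sketch below. It assumes signed little-endian samples with N equal to 16, 24, or 32; the disclosure does not fix the byte order or signedness, so these are assumptions of the example.

```python
from typing import List


def pcm_to_float(raw: bytes, bits: int = 16) -> List[float]:
    """Convert signed little-endian N-bit PCM samples to floats in [-1.0, 1.0)."""
    if bits not in (16, 24, 32):
        raise ValueError("this sketch supports 16-, 24-, and 32-bit PCM only")
    width = bits // 8
    if len(raw) % width != 0:
        raise ValueError("byte length is not a multiple of the sample width")
    # Dividing by 2**(N-1) maps the most negative integer to exactly -1.0.
    scale = float(1 << (bits - 1))
    samples = []
    for offset in range(0, len(raw), width):
        value = int.from_bytes(raw[offset:offset + width], "little", signed=True)
        samples.append(value / scale)
    return samples
```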

　レンダリング部１１０３は、音声信号と空間情報とを入力とし、空間情報を用いて音声信号に音響処理を施して、音響処理後の音声信号８０１を出力する。 The rendering unit 1103 receives an audio signal and spatial information as input, performs acoustic processing on the audio signal using the spatial information, and outputs the processed audio signal 801.

　空間情報管理部１１０１は、レンダリングを開始する前に、入力信号のメタデータを読み込み、空間情報で規定されたオブジェクト又は音などのレンダリングアイテムを検出し、レンダリング部１１０３に送信する。レンダリング開始後、空間情報管理部１１０１は、空間情報及びリスナの位置の経時的な変化を把握し、空間情報を更新して管理する。そして、空間情報管理部１１０１は、更新された空間情報をレンダリング部１１０３に送信する。レンダリング部１１０３は入力データに含まれる音声信号と、空間情報管理部１１０１から受信した空間情報とに基づいて音響処理を付加した音声信号を生成し出力する。 Before rendering begins, the spatial information management unit 1101 reads the metadata of the input signal, detects rendering items such as objects or sounds defined in the spatial information, and sends them to the rendering unit 1103. After rendering begins, the spatial information management unit 1101 tracks changes over time in the spatial information and the listener's position, and updates and manages the spatial information. The spatial information management unit 1101 then sends the updated spatial information to the rendering unit 1103. The rendering unit 1103 generates and outputs an audio signal to which acoustic processing has been applied based on the audio signal included in the input data and the spatial information received from the spatial information management unit 1101.

　空間情報の更新処理と、音響処理を付加した音声信号の出力処理とが同じスレッドで実行されてもよいし、空間情報管理部１１０１とレンダリング部１１０３とはそれぞれ独立したスレッドに配分してもよい。空間情報の更新処理と、音響処理を付加した音声信号の出力処理とが異なるスレッドで処理される場合、スレッドの起動頻度が個々に設定されてもよいし、並行して処理が実行されてもよい。 The spatial information update process and the audio signal output process with added acoustic processing may be executed in the same thread, or the spatial information management unit 1101 and rendering unit 1103 may each be assigned to an independent thread. If the spatial information update process and the audio signal output process with added acoustic processing are executed in different threads, the thread startup frequency may be set individually, or the processes may be executed in parallel.

　空間情報管理部１１０１とレンダリング部１１０３とが異なる独立したスレッドで処理を実行することで、レンダリング部１１０３に優先的に演算資源を割り当てることができるので、僅かな遅延も許容できないような出音処理の場合、例えば、１サンプル（０．０２ｍｓｅｃ）でも遅延した場合にプチっというノイズが発生するような出音処理であっても安全に実施することができる。その際、空間情報管理部１１０１には演算資源の割り当てが制限される。しかし、空間情報の更新は、音声信号の出力処理と比較して、低頻度の処理（例えば、受聴者の顔の向きの更新のような処理）である。このため、音声信号の出力処理のように必ずしも瞬間的に応答しなければならないというものではないので、演算資源の割り当てを制限しても受聴者に与えられる音響的な品質に大きな影響はない。 By having the spatial information management unit 1101 and the rendering unit 1103 execute their processing in different, independent threads, it is possible to allocate computational resources preferentially to the rendering unit 1103. This means that sound output processing that cannot tolerate even the slightest delay, such as sound output processing where a delay of even one sample (0.02 msec) would cause a popping noise, can be safely carried out. In this case, the allocation of computational resources to the spatial information management unit 1101 is limited. However, compared to audio signal output processing, updating spatial information is a less frequent process (for example, processing such as updating the direction of the listener's face). For this reason, it does not necessarily require an instantaneous response like audio signal output processing, so limiting the allocation of computational resources does not have a significant impact on the acoustic quality provided to the listener.
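
The thread separation described above could be sketched as follows. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, the spatial state is reduced to a listener yaw angle, and `render_frame` is only a placeholder for the acoustic processing. Python's `threading` module does not expose thread priorities, so the preferential allocation of computational resources to the rendering side is not modeled here.

```python
import threading
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SpatialState:
    """Immutable snapshot of the spatial information (reduced to one field)."""
    version: int = 0
    listener_yaw_deg: float = 0.0


class SpatialInfoManager:
    """Holds the latest spatial state; updated by a low-frequency thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = SpatialState()

    def update(self, listener_yaw_deg: float) -> None:
        # The lock is held only long enough to swap in a new snapshot.
        with self._lock:
            self._state = SpatialState(self._state.version + 1, listener_yaw_deg)

    def snapshot(self) -> SpatialState:
        with self._lock:
            return self._state


def render_frame(samples: Sequence[float], state: SpatialState) -> Tuple[List[float], int]:
    """Placeholder for acoustic processing: passes samples through unchanged
    and reports which state version was used for the frame."""
    return list(samples), state.version


def run_updates(manager: SpatialInfoManager, yaw_values: Sequence[float]) -> None:
    """Body of the information update thread (no sleep between updates here)."""
    for yaw in yaw_values:
        manager.update(yaw)


if __name__ == "__main__":
    manager = SpatialInfoManager()
    updater = threading.Thread(target=run_updates, args=(manager, [10.0, 20.0, 30.0]))
    updater.start()
    # The rendering side takes one snapshot per audio frame and never waits
    # for the update thread to finish.
    frame, used_version = render_frame([0.0, 0.1], manager.snapshot())
    updater.join()
    print(used_version, manager.snapshot())
```

Because each snapshot is immutable, a frame is always rendered with one consistent state even if an update arrives while the frame is being processed.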

　空間情報の更新は、予め設定された時間又は期間ごとに定期的に実行されてもよいし、予め設定された条件が満たされた場合に実行されてもよい。また、空間情報の更新は、リスナ又は音空間の管理者によって手動で実行されてもよいし、外部システムの変化をトリガとして実行されてもよい。例えば、受聴者がコントローラを操作して、自身のアバターの立ち位置を瞬間的にワープしたり、時刻を瞬時に進めたり戻したり、或いは、仮想空間の管理者が、突如、場の環境を変更するような演出を施したりした場合、空間情報管理部１１０１が配置されたスレッドは、定期的な起動に加えて、単発的な割り込み処理として起動されてもよい。 Updating of spatial information may be performed periodically at preset times or intervals, or when preset conditions are met. Furthermore, updating of spatial information may be performed manually by the listener or sound space administrator, or may be triggered by a change in an external system. For example, if a listener operates a controller to instantly warp the position of their avatar, or instantly advance or reverse the time, or if the administrator of the virtual space suddenly changes the environment of the venue, the thread in which the spatial information management unit 1101 is located may be started as a one-off interrupt process in addition to being started periodically.

　空間情報の更新処理を実行する情報更新スレッドが担う役割は、例えば、受聴者が装着しているＶＲゴーグルの位置又は向きに基づいて、仮想空間内に配置された受聴者のアバターの位置又は向きを更新する処理、及び、仮想空間内を移動している物体の位置の更新などであり、数１０Ｈｚ程度の比較的低頻度で起動する処理スレッド内で賄われるものである。そのような、発生頻度の低い処理スレッドで直接音の性質を反映させる処理が行われるようにしてもよい。それは、オーディオ出力のためのオーディオ処理フレームの発生頻度より直接音の性質が変動する頻度が低いためである。むしろそうすることで、当該処理の演算負荷を相対的に小さくすることができるし、不必要に速い頻度で情報を更新するとパルシブなノイズが発生するリスクが生じるので、そのリスクを回避することもできる。 The role of the information update thread that executes spatial information update processing is, for example, to update the position or orientation of the listener's avatar placed in the virtual space based on the position or orientation of the VR goggles worn by the listener, and to update the position of objects moving within the virtual space. These tasks are handled within a processing thread that runs relatively infrequently, on the order of a few tens of Hz. Processing to reflect the properties of direct sound may also be performed in such an infrequently run processing thread. This is because the properties of direct sound change less often than audio processing frames for audio output are generated. Doing so in fact keeps the computational load of this processing relatively small, and it also avoids the risk of impulsive noise that arises when information is updated more frequently than necessary.

　図１２は、図８又は図１０におけるデコーダ８０２の別の一例であるデコーダ１２００の構成を示す機能ブロック図である。 Figure 12 is a functional block diagram showing the configuration of a decoder 1200, which is another example of the decoder 802 in Figure 8 or Figure 10.

　図１２は、入力データ８０３が、符号化音声データではなく符号化されていない音声信号を含んでいる点で図１１と異なる。入力データ８０３は、メタデータを含むビットストリームと音声信号を含む。 FIG. 12 differs from FIG. 11 in that the input data 803 includes an unencoded audio signal rather than encoded audio data. The input data 803 includes a bitstream including metadata and an audio signal.

　空間情報管理部１２０１は、図１１の空間情報管理部１１０１と同じであるため説明を省略する。 The spatial information management unit 1201 is the same as the spatial information management unit 1101 in Figure 11, so a description thereof will be omitted.

　レンダリング部１２０２は、図１１のレンダリング部１１０３と同じであるため説明を省略する。 The rendering unit 1202 is the same as the rendering unit 1103 in Figure 11, so a description will be omitted.

　なお、上記説明では図１２の構成がデコーダと呼ばれているが、音響処理を実施する音響処理部と呼ばれてもよい。また、音響処理部を含む装置が復号装置ではなく音響処理装置と呼ばれてもよい。また、音響信号処理装置（情報処理装置６０１）が音響処理装置と呼ばれてもよい。 Note that in the above explanation, the configuration in Figure 12 is called a decoder, but it may also be called an audio processing unit that performs audio processing. Furthermore, a device that includes an audio processing unit may also be called an audio processing device rather than a decoding device. Furthermore, an audio signal processing device (information processing device 601) may also be called an audio processing device.

　＜符号化装置の物理的構成＞ <Physical configuration of the encoding device>
　図１３は、符号化装置の物理的構成の一例を示す図である。また、図１３に示される符号化装置は、上記の符号化装置７００及び９００などの一例である。 Fig. 13 is a diagram showing an example of the physical configuration of an encoding device. The encoding device shown in Fig. 13 is an example of the encoding devices 700 and 900 described above.

　図１３の符号化装置は、プロセッサと、メモリと、通信ＩＦとを備える。 The encoding device in Figure 13 includes a processor, memory, and a communication interface.

　プロセッサは、例えば、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）又はＤＳＰ（Ｄｉｇｉｔａｌ　Ｓｉｇｎａｌ　Ｐｒｏｃｅｓｓｏｒ）又はＧＰＵ（Ｇｒａｐｈｉｃｓ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）であり、当該ＣＰＵ又はＤＳＰ又はＧＰＵがメモリに記憶されたプログラム実行することで本開示の符号化処理を実施してもよい。また、プロセッサは、本開示の符号化処理を含む音声信号に対する信号処理を行う専用回路であってもよい。 The processor may be, for example, a CPU (Central Processing Unit), DSP (Digital Signal Processor), or GPU (Graphics Processing Unit), and the encoding process of the present disclosure may be performed by the CPU, DSP, or GPU executing a program stored in memory. The processor may also be a dedicated circuit that performs signal processing on audio signals, including the encoding process of the present disclosure.

　メモリは、例えば、ＲＡＭ（Ｒａｎｄｏｍ　Ａｃｃｅｓｓ　Ｍｅｍｏｒｙ）又はＲＯＭ（Ｒｅａｄ　Ｏｎｌｙ　Ｍｅｍｏｒｙ）で構成される。メモリは、ハードディスクなどの磁気記憶媒体又はＳＳＤ（Ｓｏｌｉｄ　Ｓｔａｔｅ　Ｄｒｉｖｅ）などの半導体メモリなどを含んでいてもよい。また、ＣＰＵ又はＧＰＵに組み込まれた内部メモリを含めてメモリと呼ばれてもよい。 Memory may be composed of, for example, RAM (Random Access Memory) or ROM (Read Only Memory). Memory may also include magnetic storage media such as hard disks or semiconductor memory such as SSDs (Solid State Drives). Internal memory built into the CPU or GPU may also be referred to as memory.

　通信ＩＦ（Ｉｎｔｅｒ　Ｆａｃｅ）は、例えば、Ｂｌｕｅｔｏｏｔｈ（登録商標）又はＷＩＧＩＧ（登録商標）などの通信方式に対応した通信モジュールである。符号化装置は、通信ＩＦを介して他の通信装置と通信を行う機能を有し、符号化されたビットストリームを送信する。 The communication IF (Interface) is a communication module compatible with communication methods such as Bluetooth (registered trademark) or WIGIG (registered trademark). The encoding device has the function of communicating with other communication devices via the communication IF and transmits the encoded bitstream.

　通信モジュールは、例えば、通信方式に対応した信号処理回路とアンテナとで構成される。上記の例では、通信方式としてＢｌｕｅｔｏｏｔｈ（登録商標）又はＷＩＧＩＧ（登録商標）を例に挙げたが、ＬＴＥ（Ｌｏｎｇ　Ｔｅｒｍ　Ｅｖｏｌｕｔｉｏｎ）、ＮＲ（Ｎｅｗ　Ｒａｄｉｏ）、又はＷｉ－Ｆｉ（登録商標）などの通信方式に対応していてもよい。また、通信ＩＦは、上記のような無線通信方式ではなく、Ｅｔｈｅｒｎｅｔ（登録商標）、ＵＳＢ（Ｕｎｉｖｅｒｓａｌ　Ｓｅｒｉａｌ　Ｂｕｓ）、ＨＤＭＩ（登録商標）（Ｈｉｇｈ－Ｄｅｆｉｎｉｔｉｏｎ　Ｍｕｌｔｉｍｅｄｉａ　Ｉｎｔｅｒｆａｃｅ）などの有線の通信方式であってもよい。 The communication module is composed of, for example, a signal processing circuit and an antenna that supports the communication method. In the above example, Bluetooth (registered trademark) or WIGIG (registered trademark) was used as the communication method, but communication methods such as LTE (Long Term Evolution), NR (New Radio), or Wi-Fi (registered trademark) may also be supported. Furthermore, instead of the wireless communication methods mentioned above, the communication IF may also be a wired communication method such as Ethernet (registered trademark), USB (Universal Serial Bus), or HDMI (registered trademark) (High-Definition Multimedia Interface).

　＜音響信号処理装置の物理的構成＞ <Physical configuration of the acoustic signal processing device>
　図１４は、音響信号処理装置の物理的構成の一例を示す図である。なお、図１４の音響信号処理装置は、復号装置であってもよい。また、ここで説明する構成の一部は音声提示装置６０２に備えられていてもよい。また、図１４に示される音響信号処理装置は、上記の音響信号処理装置６０１の一例である。 Fig. 14 is a diagram showing an example of the physical configuration of an acoustic signal processing device. Note that the acoustic signal processing device in Fig. 14 may be a decoding device. Part of the configuration described here may be provided in the audio presentation device 602. The acoustic signal processing device shown in Fig. 14 is an example of the acoustic signal processing device 601 described above.

　図１４の音響信号処理装置は、プロセッサと、メモリと、通信ＩＦと、センサと、スピーカとを備える。 The acoustic signal processing device in Figure 14 includes a processor, memory, a communication IF, a sensor, and a speaker.

　プロセッサは、例えば、ＣＰＵ（Ｃｅｎｔｒａｌ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）又はＤＳＰ（Ｄｉｇｉｔａｌ　Ｓｉｇｎａｌ　Ｐｒｏｃｅｓｓｏｒ）又はＧＰＵ（Ｇｒａｐｈｉｃｓ　Ｐｒｏｃｅｓｓｉｎｇ　Ｕｎｉｔ）であり、当該ＣＰＵ又はＤＳＰ又はＧＰＵがメモリに記憶されたプログラム実行することで本開示の音響処理又はデコード処理を実施してもよい。また、プロセッサは、本開示の音響処理を含む音声信号に対する信号処理を行う専用回路であってもよい。 The processor may be, for example, a CPU (Central Processing Unit), DSP (Digital Signal Processor), or GPU (Graphics Processing Unit), and the CPU, DSP, or GPU may execute a program stored in memory to perform the acoustic processing or decoding processing of the present disclosure. The processor may also be a dedicated circuit that performs signal processing on audio signals, including the acoustic processing of the present disclosure.

　通信ＩＦ（Ｉｎｔｅｒ　Ｆａｃｅ）は、例えば、Ｂｌｕｅｔｏｏｔｈ（登録商標）又はＷＩＧＩＧ（登録商標）などの通信方式に対応した通信モジュールである。図１４に示される音響信号処理装置は、通信ＩＦを介して他の通信装置と通信を行う機能を有し、復号対象のビットストリームを取得する。取得したビットストリームは、例えば、メモリに格納される。 The communication IF (Interface) is a communication module compatible with communication methods such as Bluetooth (registered trademark) or WIGIG (registered trademark). The acoustic signal processing device shown in FIG. 14 has the function of communicating with other communication devices via the communication IF, and acquires the bitstream to be decoded. The acquired bitstream is stored in memory, for example.

　センサは、リスナの位置又は向きを推定するためのセンシングを行う。具体的には、センサは、リスナの頭部など身体の一部又は全体の位置、向き、動き、速度、角速度、又は加速度などのうちいずれか一つ又は複数の検出結果に基づいてリスナの位置及び／又は向きを推定し、リスナの位置及び／又は向きを示す位置情報を生成する。なお、位置情報は実空間におけるリスナの位置及び／又は向きを示す情報であってもよいし、所定の時点におけるリスナの位置及び／又は向きを基準としたリスナの位置及び／又は向きの変位を示す情報であってもよい。また、位置情報は、立体音響再生システム又はセンサを備える外部装置との相対的な位置及び／又は向きを示す情報であってもよい。 The sensor performs sensing to estimate the position or orientation of the listener. Specifically, the sensor estimates the position and/or orientation of the listener based on one or more detection results of the position, orientation, movement, velocity, angular velocity, or acceleration of the entire or part of the listener's body, such as the head, and generates position information indicating the position and/or orientation of the listener. Note that the position information may be information indicating the position and/or orientation of the listener in real space, or information indicating the displacement of the position and/or orientation of the listener based on the position and/or orientation of the listener at a specified point in time. The position information may also be information indicating the position and/or orientation relative to the stereophonic reproduction system or an external device equipped with the sensor.

　センサは、例えば、カメラなどの撮像装置又はＬｉＤＡＲ（Ｌｉｇｈｔ　Ｄｅｔｅｃｔｉｏｎ　Ａｎｄ　Ｒａｎｇｉｎｇ）などの測距装置であってもよく、リスナの頭部の動きを撮像し、撮像された画像を処理することでリスナの頭部の動きを検知してもよい。また、センサとして例えばミリ波などの任意の周波数帯域の無線を用いて位置推定を行う装置を用いてもよい。 The sensor may be, for example, an imaging device such as a camera or a ranging device such as LiDAR (Light Detection and Ranging), and may capture the movement of the listener's head and detect the movement of the listener's head by processing the captured image. Alternatively, the sensor may be a device that performs position estimation using wireless signals of any frequency band, such as millimeter waves.

　なお、図１４に示される音響信号処理装置は、センサを備える外部の機器から通信ＩＦを介して位置情報を取得してもよい。この場合、音響信号処理装置はセンサを含んでいなくてもよい。ここで、外部の機器とは、例えば図６で説明した音声提示装置６０２又は、リスナの頭部に装着される立体映像再生装置などである。このときセンサは、例えば、ジャイロセンサ及び加速度センサなど各種のセンサを組み合わせて構成される。 The audio signal processing device shown in FIG. 14 may acquire position information from an external device equipped with a sensor via a communication IF. In this case, the audio signal processing device does not need to include a sensor. Here, the external device is, for example, the audio presentation device 602 described in FIG. 6 or a 3D video playback device worn on the listener's head. In this case, the sensor is configured by combining various sensors such as a gyro sensor and an acceleration sensor.

　センサは、例えば、リスナの頭部の動きの速度として、音空間内で互いに直交する３軸の少なくとも１つを回転軸とする回転の角速度を検知してもよいし、上記３軸の少なくとも１つを変位方向とする変位の加速度を検知してもよい。 The sensor may, for example, detect the angular velocity of rotation about at least one of three mutually orthogonal axes in the sound space as the axis of rotation, or may detect the acceleration of displacement with at least one of the three axes as the direction of displacement, as the speed of movement of the listener's head.

　センサは、例えば、リスナの頭部の動きの量として、音空間内で互いに直交する３軸の少なくとも１つを回転軸とする回転量を検知してもよいし、上記３軸の少なくとも１つを変位方向とする変位量を検知してもよい。具体的には、センサは、リスナの位置として６ＤｏＦ（位置（ｘ、ｙ、ｚ）及び角度（ｙａｗ、ｐｉｔｃｈ、ｒｏｌｌ））を検知する。センサは、ジャイロセンサ及び加速度センサなど動きの検知に使用される各種のセンサを組み合わせて構成される。 For example, the sensor may detect the amount of movement of the listener's head by detecting the amount of rotation around at least one of three mutually orthogonal axes in the sound space, or by detecting the amount of displacement along at least one of the three axes. Specifically, the sensor detects 6 DoF (position (x, y, z) and angle (yaw, pitch, roll)) as the listener's position. The sensor is configured by combining various sensors used for detecting movement, such as gyro sensors and acceleration sensors.
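
A 6DoF detection result, and the displacement relative to the listener's position and orientation at a specified point in time, could be represented as in the sketch below. The class name `Pose6DoF`, the use of degrees, and the wrapping of angles into [-180, 180) are assumptions of this example. For simplicity the positional displacement is taken along the fixed axes of the sound space rather than in the listener's own frame.

```python
from dataclasses import dataclass


def wrap_angle_deg(angle: float) -> float:
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


@dataclass(frozen=True)
class Pose6DoF:
    """Position (x, y, z) and angle (yaw, pitch, roll), angles in degrees."""
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float

    def displacement_from(self, reference: "Pose6DoF") -> "Pose6DoF":
        """Displacement of this pose relative to a reference pose."""
        return Pose6DoF(
            self.x - reference.x,
            self.y - reference.y,
            self.z - reference.z,
            # Angle differences are wrapped so that 350 deg -> 10 deg
            # reads as a +20 deg turn instead of -340 deg.
            wrap_angle_deg(self.yaw - reference.yaw),
            wrap_angle_deg(self.pitch - reference.pitch),
            wrap_angle_deg(self.roll - reference.roll),
        )
```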

　なお、センサは、リスナの位置を検出できればよく、カメラ又はＧＰＳ（Ｇｌｏｂａｌ　Ｐｏｓｉｔｉｏｎｉｎｇ　Ｓｙｓｔｅｍ）受信機などにより実現されてもよい。ＬｉＤＡＲ（Ｌａｓｅｒ　Ｉｍａｇｉｎｇ　Ｄｅｔｅｃｔｉｏｎ　ａｎｄ　Ｒａｎｇｉｎｇ）等を用いて自己位置推定を実施して得られた位置情報を用いてもよい。例えば、センサは、音声信号再生システムがスマートフォンにより実現される場合には、スマートフォンに内蔵される。 The sensor may be any device capable of detecting the position of the listener, such as a camera or a GPS (Global Positioning System) receiver. It may also use location information obtained by performing self-position estimation using LiDAR (Laser Imaging Detection and Ranging). For example, if the audio signal playback system is implemented using a smartphone, the sensor may be built into the smartphone.

　また、センサには、図１４に示される音響信号処理装置の温度を検出する熱電対などの温度センサ、及び、音響信号処理装置が備える、又は音響信号処理装置と接続されたバッテリの残量を検出するセンサなどが含まれていてもよい。 The sensors may also include a temperature sensor such as a thermocouple that detects the temperature of the acoustic signal processing device shown in FIG. 14, and a sensor that detects the remaining charge of a battery included in or connected to the acoustic signal processing device.

　スピーカは、例えば、振動板と、マグネット又はボイスコイル等の駆動機構とアンプとを有し、音響処理後の音声信号を音としてリスナに提示する。スピーカは、アンプを介して増幅させた音声信号（より具体的には、音の波形を示す波形信号）に応じて駆動機構を動作させ、駆動機構によって振動板を振動させる。このようにして、音声信号に応じて振動する振動板は、音波を発生させ、音波が空気を伝播してリスナの耳に伝達し、リスナが音を知覚する。 A speaker, for example, has a diaphragm, a drive mechanism such as a magnet or voice coil, and an amplifier, and presents the processed audio signal as sound to the listener. The speaker operates the drive mechanism in response to the audio signal (more specifically, a waveform signal indicating the waveform of the sound) amplified via the amplifier, causing the diaphragm to vibrate. In this way, the diaphragm vibrates in response to the audio signal, generating sound waves that propagate through the air and reach the listener's ears, allowing the listener to perceive the sound.

　なお、ここでは図１４に示される音響信号処理装置がスピーカを備え、当該スピーカを介して音響処理後の音声信号を提示する場合を例に挙げて説明したが、音声信号の提示手段は上記の構成に限定されない。例えば、通信モジュールで接続された外部の音声提示装置６０２に音響処理後の音声信号が出力されてもよい。通信モジュールで行う通信は有線でも無線でもよい。また別の例として、図１４に示される音響信号処理装置が音声のアナログ信号を出力する端子を備え、端子にイヤホンなどのケーブルを接続してイヤホンなどから音声信号を提示してもよい。上記の場合、音声提示装置６０２であるリスナの頭部又は体の一部に装着されるヘッドフォン、イヤホン、ヘッドマウントディスプレイ、ネックスピーカ、ウェアラブルスピーカ、又は固定された複数のスピーカで構成されたサラウンドスピーカなどが音声信号を再生する。 Note that, while the example described here is one in which the acoustic signal processing device shown in Fig. 14 is equipped with a speaker and presents the audio signal after acoustic processing via that speaker, the means for presenting the audio signal is not limited to this configuration. For example, the audio signal after acoustic processing may be output to an external audio presentation device 602 connected via the communication module. Communication via the communication module may be wired or wireless. As another example, the acoustic signal processing device shown in Fig. 14 may be equipped with a terminal that outputs an analog audio signal, and the audio signal may be presented from earphones or the like whose cable is connected to the terminal. In these cases, the audio signal is reproduced by the audio presentation device 602, which is, for example, headphones, earphones, a head-mounted display, a neck speaker, or a wearable speaker worn on the listener's head or part of the body, or a surround speaker system consisting of multiple fixed speakers.

　＜レンダリング部の機能説明＞ <Functional explanation of the rendering unit>
　図１５は、図１１および図１２のレンダリング部１１０３および１２０２の詳細な構成の一例を示す機能ブロック図である。 Fig. 15 is a functional block diagram showing an example of a detailed configuration of the rendering units 1103 and 1202 shown in Figs. 11 and 12.

　レンダリング部は、解析部と、パニング部と、（上記の合成部１３５とは異なる）合成部とで構成され、入力信号に含まれる音データに対して音響処理を付加し出力する。 The rendering unit is composed of an analysis unit, a panning unit, and a synthesis unit (different from the synthesis unit 135 described above), and applies acoustic processing to the sound data contained in the input signal and outputs it.

　以下、入力信号に含まれる情報について説明する。 The information contained in the input signal is explained below.

　入力信号は、例えば、空間情報とセンサ情報と音データとで構成される。入力信号は、音データとメタデータ（制御情報）とで構成されるビットストリームを含んでいてもよく、その場合メタデータに空間情報が含まれていてもよい。 The input signal may be composed of, for example, spatial information, sensor information, and sound data. The input signal may also include a bitstream composed of sound data and metadata (control information), in which case the metadata may include spatial information.

　空間情報は、立体音響再生システムが作り出す音空間（三次元音場）に関する情報であって、音空間に含まれるオブジェクトに関する情報とリスナに関する情報とで構成される。オブジェクトには、音を発し音源となる音源オブジェクトと、音を発しない非発音オブジェクトが存在する。非発音オブジェクトは、音源オブジェクトが発した音を反射する障害物オブジェクトとして機能するが、音源オブジェクトが別の音源オブジェクトが発した音を反射する障害物オブジェクトとして機能する場合もある。 Spatial information is information about the sound space (three-dimensional sound field) created by the stereophonic playback system, and is composed of information about the objects contained in the sound space and information about the listener. Objects include sound source objects that emit sound and act as sound sources, and non-sound-emitting objects that do not emit sound. Non-sound-emitting objects function as obstacle objects that reflect sounds emitted by sound source objects, but sound source objects can also function as obstacle objects that reflect sounds emitted by other sound source objects.

　音源オブジェクトと非発音オブジェクトに共通して付与される情報として、位置情報や形状情報、オブジェクトが音を反射する際の音量の減衰率などがある。 Information commonly assigned to sound-source objects and non-sound-producing objects includes position information, shape information, and the rate at which the volume attenuates when the object reflects sound.

　位置情報は、ユークリッド空間の例えばＸ軸、Ｙ軸、Ｚ軸の３軸の座標値で表されるが、必ずしも三次元情報でなくてもよい。例えば、Ｘ軸、Ｙ軸の２軸の座標値で表される二次元情報であってもよい。オブジェクトの位置情報は、メッシュやボクセルで表現される形状の代表位置で定められる。 Position information is expressed as coordinate values on three axes in Euclidean space, for example the X, Y, and Z axes, but it does not necessarily have to be three-dimensional information. For example, it may be two-dimensional information expressed as coordinate values on two axes, the X and Y axes. The position information of an object is determined by the representative position of a shape expressed in mesh or voxels.
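
One way to obtain a representative position from a mesh is sketched below. The disclosure does not say how the representative position is chosen, so taking the centroid of the mesh vertices is purely an assumption of this example, as is the function name. The function accepts either two-dimensional or three-dimensional coordinates, matching the two cases described above.

```python
from typing import Sequence, Tuple


def representative_position(vertices: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Return the centroid of the vertices as the object's representative position.

    Works for 2D (x, y) or 3D (x, y, z) vertices; all vertices must have the
    same dimensionality.
    """
    if not vertices:
        raise ValueError("at least one vertex is required")
    dim = len(vertices[0])
    if dim not in (2, 3) or any(len(v) != dim for v in vertices):
        raise ValueError("vertices must all be 2D or all be 3D")
    count = len(vertices)
    # Average each coordinate axis independently.
    return tuple(sum(v[axis] for v in vertices) / count for axis in range(dim))
```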

　形状情報は、表面の素材に関する情報を含んでいてもよい。 The shape information may also include information about the surface material.

　また、オブジェクトが生物に属するか否かを示す情報やオブジェクトが動体であるか否かを示す情報などを含んでいてもよい。オブジェクトが動体である場合、位置情報は時間とともに移動してもよく、変化した位置情報または変化量がレンダリング部に伝送される。 It may also include information indicating whether the object belongs to a living organism or whether the object is a moving object. If the object is a moving object, the position information may change over time, and the changed position information or the amount of change is transmitted to the rendering unit.
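
The two ways of conveying a moving object's position to the rendering unit, either the changed position itself or the amount of change, could be handled as in the following sketch. The tagged-tuple message format (`"absolute"` or `"delta"` plus coordinates) and the function name are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def apply_position_update(
    current: Sequence[float], update: Tuple[str, Sequence[float]]
) -> Tuple[float, ...]:
    """Apply an absolute position or a change amount to the current position."""
    kind, value = update
    if len(value) != len(current):
        raise ValueError("update has a different dimensionality than the position")
    if kind == "absolute":
        # The changed position itself was transmitted.
        return tuple(value)
    if kind == "delta":
        # Only the amount of change was transmitted.
        return tuple(c + d for c, d in zip(current, value))
    raise ValueError(f"unknown update kind: {kind!r}")
```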

　音源オブジェクトに関する情報は、上述した音源オブジェクトと非発音オブジェクトに共通して付与される情報に加えて、音データと音データを音空間内に放射するために必要な情報とを含む。 Information about sound source objects includes the information commonly assigned to sound source objects and non-sound generating objects described above, as well as sound data and information necessary to radiate the sound data into the sound space.

　音データは、音の周波数および強弱に関する情報などを示す、リスナに知覚される音が表現されたデータである。音データは、典型的にはＰＣＭ信号であるが、ＭＰ３等の符号化方式を用いて圧縮されたデータであってもよい。その場合は、少なくとも当該信号が合成部に到達するまでに復号される必要があるため、レンダリング部に図示しない復号部を含んでいてもよい。或いは音声データデコーダ１１０２で復号してもよい。 Sound data is data that represents the sound perceived by a listener, including information about the frequency and intensity of the sound. Sound data is typically a PCM signal, but it may also be data compressed using an encoding method such as MP3. In this case, the signal must be decoded at least before it reaches the synthesis unit, so the rendering unit may include a decoding unit (not shown). Alternatively, the signal may be decoded by the audio data decoder 1102.

　１つの音源オブジェクトに対して少なくとも１つの音データが設定されていればよく、複数の音データが設定されていてもよい。また、それぞれの音データを識別する識別情報を付与し、音源オブジェクトに関する情報として、音データの識別情報を保持してもよい。 At least one piece of sound data needs to be set for one sound source object, but multiple pieces of sound data may also be set. Furthermore, identification information for identifying each piece of sound data may be assigned, and the sound data identification information may be held as information related to the sound source object.

　音データを音空間内に放射するために必要な情報として、例えば、音データを再生する際に基準となる基準音量の情報、音データの性質（特性ともいう）を示す情報、音源オブジェクトの位置に関する情報、音源オブジェクトの向きに関する情報、音源オブジェクトが発する音の指向性に関する情報などを含んでいてもよい。基準音量の情報は、例えば、音データを音空間に放射する際の音源位置における音データの振幅値の実効値であって、デシベル（ｄＢ）値として浮動小数点で表されてもよい。 Information necessary for radiating sound data into the sound space may include, for example, information on the reference volume used as the reference when playing back the sound data, information indicating the properties (also called characteristics) of the sound data, information on the position of the sound source object, information on the orientation of the sound source object, and information on the directivity of the sound emitted by the sound source object. The reference volume information is, for example, the root-mean-square (RMS) value of the amplitude of the sound data at the sound source position when the sound data is radiated into the sound space, and may be expressed as a floating-point decibel (dB) value.

　例えば基準音量が０ｄＢの場合、音データが示す信号レベルの音量を増減させることなくそのままの音量で上記位置に関する情報が指し示す位置から音空間に対して音を放射することを示しているものとしてもよいし、－６ｄＢの場合、音データが示す信号レベルの音量を約半分にして上記位置に関する情報が指し示す位置から音空間に対して音を放射することを示しているものとしてもよい。これらの情報は、１つの音データに対してまたは複数の音データに対してまとめて付与される。 For example, if the reference volume is 0 dB, it may indicate that sound is emitted into the sound space from the position indicated by the information regarding the position at the same volume as the signal level indicated by the sound data, without increasing or decreasing the volume. If the reference volume is -6 dB, it may indicate that sound is emitted into the sound space from the position indicated by the information regarding the position with the volume of the signal level indicated by the sound data reduced to approximately half. This information is assigned to one piece of sound data or to multiple pieces of sound data collectively.
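The relationship between the reference volume in dB and the signal level can be illustrated with a minimal sketch. It assumes the amplitude convention (20·log10), which is consistent with the statement that -6 dB roughly halves the level; the function names are hypothetical and do not appear in the disclosure.

```python
def db_to_linear_gain(reference_volume_db: float) -> float:
    """Convert a reference volume in dB to a linear amplitude gain.

    Assumption: amplitude convention, gain = 10 ** (dB / 20).
    """
    return 10.0 ** (reference_volume_db / 20.0)


def apply_reference_volume(samples, reference_volume_db):
    """Scale PCM samples by the gain implied by the reference volume."""
    gain = db_to_linear_gain(reference_volume_db)
    return [gain * s for s in samples]
```

Under this convention, 0 dB leaves the samples unchanged and -6 dB scales them by about 0.501, that is, approximately half.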

　音データの性質を示す情報は、例えば、音源の音量に関する情報であって、その時系列的な変動を示す情報であってもよい。例えば、音空間が仮想会議室であり、音源が話者である場合、音量は短い時間で断続的に遷移する。それをさらに単純に表現すれば、有音部分と無音部分が交互に発生する、とも言える。 Information indicating the properties of the sound data may be, for example, information regarding the volume of the sound source, and may be information indicating time-series fluctuations. For example, if the sound space is a virtual conference room and the sound source is a speaker, the volume will transition intermittently over short periods of time. Expressed more simply, this can be said to be alternating between sound and silence.

　また、音空間がコンサートホールであり、音源が演奏者である場合、音量は一定の時間長維持される。また、音空間が戦場であり、音源が爆発物である場合、爆発音の音量は一瞬だけ大となり以降は無音であり続ける。このように音源の音量の情報は、音の大きさの情報のみならず、音の大きさの遷移の情報を含むものであり、そのような情報を音データの性質を示す情報としてもよい。 Furthermore, if the sound space is a concert hall and the sound source is a performer, the volume will be maintained for a certain period of time. Furthermore, if the sound space is a battlefield and the sound source is an explosive, the volume of the explosion will increase for a moment and then remain silent. In this way, information about the volume of the sound source includes not only information about the volume of the sound, but also information about the transition in volume of the sound, and such information may be used as information indicating the properties of the sound data.

　ここで、音の大きさの遷移の情報は、周波数特性を時系列に示したデータであってもよい。有音である区間の継続時間長を示したデータであってもよい。有音である区間の継続時間長と無音である区間の時間長の時系列を示したデータであってもよい。音信号の振幅が定常的であるとみなせる（概ね一定であるとみなせる）継続時間とその間の当該信号の振幅値のデータを複数組時系列で列挙したデータなどであってもよい。音信号の周波数特性が定常的であるとみなせる継続時間のデータであってもよい。音信号の周波数特性が定常的であるとみなせる継続時間とその間の当該周波数特性のデータを複数組時系列で列挙したデータなどであってもよい。 Here, the information on transitions in sound volume may be data showing frequency characteristics in a time series. It may be data showing the duration of sections where sound is present. It may be data showing a time series of the duration of sections where sound is present and the duration of sections where sound is absent. It may be data listing multiple sets of data on durations where the amplitude of a sound signal can be considered steady (considered to be roughly constant) and the amplitude values of the signal during that time in a time series. It may be data on durations where the frequency characteristics of a sound signal can be considered steady. It may be data listing multiple sets of data on durations where the frequency characteristics of a sound signal can be considered steady and the frequency characteristics during that time in a time series.
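One of the forms listed above, a time series of the durations of sound sections and silent sections, might be derived as in the following sketch. The per-frame amplitude input, the threshold, and the function name are assumptions made for illustration only; the disclosure does not specify how such data is produced.

```python
from typing import List, Tuple


def sound_silence_segments(
    amplitudes: List[float], threshold: float, frame_seconds: float
) -> List[Tuple[bool, float]]:
    """Return a time series of (is_sound, duration_seconds) segments.

    Each input value is the amplitude of one frame of length frame_seconds.
    A frame counts as "sound" when its absolute amplitude reaches the
    (hypothetical) threshold; consecutive frames of the same kind are merged.
    """
    segments: List[Tuple[bool, float]] = []
    for a in amplitudes:
        is_sound = abs(a) >= threshold
        if segments and segments[-1][0] == is_sound:
            # Extend the current segment by one frame.
            segments[-1] = (is_sound, segments[-1][1] + frame_seconds)
        else:
            # Start a new segment.
            segments.append((is_sound, frame_seconds))
    return segments
```

A talker in a virtual conference room would yield many short alternating segments, whereas an explosion would yield one short sound segment followed by a long silent one.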

　データの形式として例えば、スペクトログラムの概形を示すデータであってもよい。また、上記周波数特性の基準となる音量を上記基準音量としてもよい。基準音量の情報と音データの性質を示す情報は、リスナに知覚させる直接音または反射音の音量を算出する他、リスナに知覚させるか否か選択をするための選択処理に用いられてもよい。音データの性質を示す情報の他の例や具体的な選択処理への用いられ方については後述する。 The data may take the form of, for example, data indicating the outline of a spectrogram. The volume that serves as the reference for the above frequency characteristics may be the above reference volume. The reference volume information and the information indicating the properties of the sound data may be used to calculate the volume of the direct sound or reflected sound to be perceived by the listener, and may also be used in a selection process that decides whether or not the listener is made to perceive a given sound. Other examples of information indicating the properties of the sound data, and the specific ways in which it is used in the selection process, are described later.

　向きに関する情報は、典型的には、ｙａｗ、ｐｉｔｃｈ、ｒｏｌｌで表現される。または、ｒｏｌｌの回転を省略し、アジマス（ｙａｗ）、エレベーション（ｐｉｔｃｈ）で表現してもよい。向き情報は時間とともに変化してもよく、変化した場合、レンダリング部に伝送される。 Orientation information is typically expressed using yaw, pitch, and roll. Alternatively, the roll rotation may be omitted and the information may be expressed using azimuth (yaw) and elevation (pitch). Orientation information may change over time, and if it does change, it is transmitted to the rendering unit.

　リスナに関する情報は、音空間におけるリスナの位置情報と向きに関する情報である。位置情報はユークリッド空間のＸＹＺ軸の位置で表されるが、必ずしも三次元情報でなくてもよく、二次元情報であってもよい。向きに関する情報は、典型的には、ｙａｗ、ｐｉｔｃｈ、ｒｏｌｌで表現される。または、ｒｏｌｌの回転を省略し、アジマス（ｙａｗ）、エレベーション（ｐｉｔｃｈ）で表現してもよい。位置情報と向き情報とは時間とともに変化してもよく、変化した場合、レンダリング部に伝送される。 Information about the listener is information about the listener's position and orientation in sound space. Position information is expressed as a position on the XYZ axes in Euclidean space, but it does not necessarily have to be three-dimensional information; it can also be two-dimensional information. Orientation information is typically expressed using yaw, pitch, and roll. Alternatively, the roll rotation can be omitted and it can be expressed using azimuth (yaw) and elevation (pitch). Position information and orientation information may change over time, and if they do change, they are transmitted to the rendering unit.
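When roll is omitted, an orientation given as azimuth (yaw) and elevation (pitch) can be turned into a unit direction vector. The sketch below assumes a right-handed frame with X forward, Y to the left, and Z up, with azimuth measured counterclockwise from the X axis; the disclosure does not fix an axis convention, so this choice and the function name are assumptions.

```python
import math
from typing import Tuple


def direction_from_azimuth_elevation(
    azimuth_deg: float, elevation_deg: float
) -> Tuple[float, float, float]:
    """Unit vector for an orientation given as azimuth (yaw) and elevation (pitch).

    Assumed convention: X forward, Y left, Z up; azimuth counterclockwise
    from +X in the horizontal plane; elevation upward from that plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )
```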

　センサ情報は、リスナが装着するセンサで検知された回転量又は変位量等とリスナの位置及び向きとを含む情報である。センサ情報はレンダリング部に伝送され、レンダリング部はセンサ情報に基づいてリスナの位置及び向きの情報を更新する。センサ情報は、例えば携帯端末がＧＰＳ、カメラ、又はＬｉＤＡＲ（Ｌａｓｅｒ　Ｉｍａｇｉｎｇ　Ｄｅｔｅｃｔｉｏｎ　ａｎｄ　Ｒａｎｇｉｎｇ）等を用いて自己位置推定を実施して得られた位置情報が用いられてもよい。またセンサ以外から、通信モジュールを通じて外部から取得した情報をセンサ情報として検出してもよい。センサから、音声信号処理装置の温度を示す情報、および、バッテリの残量を示す情報を取得してもよい。音声信号処理装置や音声信号提示装置の演算資源（ＣＰＵ能力、メモリ資源、ＰＣ性能）などをリアルタイムで取得してもよい。 Sensor information includes the amount of rotation or displacement detected by a sensor worn by the listener, as well as the position and orientation of the listener. The sensor information is transmitted to the rendering unit, which updates the position and orientation information of the listener based on the sensor information. The sensor information may be, for example, location information obtained by a mobile device performing self-location estimation using GPS, a camera, or LiDAR (Laser Imaging Detection and Ranging). Information obtained externally from a source other than a sensor via a communications module may also be detected as sensor information. Information indicating the temperature of the audio signal processing device and information indicating the remaining battery capacity may be obtained from the sensor. The computing resources (CPU capacity, memory resources, PC performance) of the audio signal processing device and audio signal presentation device may also be obtained in real time.

　解析部は、上述の例における取得部１１１と同等の機能を担う。つまり、入力信号の解析を行い、経路算出部１２１及び出力音生成部１３１で必要な情報を取得する。 The analysis unit performs the same function as the acquisition unit 111 in the above example. In other words, it analyzes the input signal and acquires the information required by the path calculation unit 121 and output sound generation unit 131.

　合成部は、上述の例における出力音生成部１３１と信号出力部１４１と同等の機能を担う。直接音の音声信号と、解析部が算出した直接音到来時刻と直接音到来時音量の情報とに基づいて、入力された音声信号を加工し直接音を生成する。また、解析部が算出した反射音到来時刻と反射音到来時音量の情報に基づいて、入力された音声信号を加工し反射音を生成する。合成部は、生成した直接音と反射音を合成し出力する。 The synthesis unit performs functions equivalent to those of the output sound generation unit 131 and signal output unit 141 in the above example. It processes the input audio signal to generate direct sound based on the audio signal of the direct sound and information on the time of arrival of the direct sound and the volume at the time of arrival of the direct sound calculated by the analysis unit. It also processes the input audio signal to generate reflected sound based on information on the time of arrival of the reflected sound and the volume at the time of arrival of the reflected sound calculated by the analysis unit. The synthesis unit synthesizes the generated direct sound and reflected sound and outputs the result.

　パニング部は、上述の例における生成部１３４と同等の機能を担う。つまり、解析部により取得された複数個の音源（目的信号）の音源方向に基づいて、特定の代表方向からの音によるパニングを、音源の時間シフトとゲイン調整によって行うことにより、音源を表現するためのパニングを行う。上述のパニング部で行われる処理は、例えば国際公開第２０２１／１８０９３８号で説明されているようなパイプライン処理の一部として実行されてもよい。 The panning unit performs the same function as the generation unit 134 in the above example. That is, based on the sound source directions of the multiple sound sources (target signals) acquired by the analysis unit, it applies time shifts and gain adjustments to the sound sources so that they are represented by sounds from specific representative directions. The processing performed by the panning unit may be executed as part of pipeline processing such as that described in WO 2021/180938, for example.

　図１６は、レンダリング部１３００がパイプライン処理を行うための構成例を示すブロック図である。 Figure 16 is a block diagram showing an example configuration for the rendering unit 1300 to perform pipeline processing.

　図１６のレンダリング部１３００は、残響処理部１３１１、初期反射処理部１３１２、距離減衰処理部１３１３、選択部１３１４、生成部１３１５及びバイノーラル処理部１３１６を備える。これらの複数の構成要素は、図１５に示されたレンダリング部の複数の構成要素で構成されていてもよいし、図１４に示された音響信号処理装置の複数の構成要素の少なくとも一部で構成されていてもよい。 The rendering unit 1300 in FIG. 16 includes a reverberation processing unit 1311, an early reflection processing unit 1312, a distance attenuation processing unit 1313, a selection unit 1314, a generation unit 1315, and a binaural processing unit 1316. These multiple components may be composed of multiple components of the rendering unit shown in FIG. 15, or may be composed of at least some of the multiple components of the acoustic signal processing device shown in FIG. 14.

　パイプライン処理とは、音響効果を付与するための処理を複数の処理に分割し、複数の処理を１つずつ順番に実行することを指す。複数の処理のそれぞれでは、例えば、音声信号に対する信号処理、又は、信号処理に用いられるパラメータの生成等が実行される。 Pipeline processing refers to dividing the processing for adding sound effects into multiple processes, and executing each process sequentially. Each of the multiple processes, for example, performs signal processing on an audio signal, or generates parameters used in signal processing.

　レンダリング部１３００は、パイプライン処理として、残響処理、初期反射処理、距離減衰処理及びバイノーラル処理等を行ってもよい。ただし、これらの処理は一例であり、パイプライン処理は、これら以外の処理を含んでいてもよいし、一部の処理を含んでいなくてもよい。例えば、パイプライン処理は、回折処理及びオクルージョン処理を含んでいてもよい。また、例えば、残響処理が、不要な場合、省略されてもよい。また、すべての音がバイノーラル処理ステージで処理がされなくてもよい。 The rendering unit 1300 may perform reverberation processing, early reflection processing, distance attenuation processing, binaural processing, and the like as the pipeline processing. However, these processes are merely examples, and the pipeline processing may include other processes or may omit some of them. For example, the pipeline processing may include diffraction processing and occlusion processing. The reverberation processing, for example, may be omitted when it is unnecessary. In addition, not all sounds need to be processed in the binaural processing stage.
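The idea of dividing the rendering into processes that run one after another, with any process optionally omitted, can be sketched as a list of functions applied in order. The two stages below are deliberately simplified stand-ins (a single delayed copy for an early reflection, a 1/distance gain for distance attenuation); they and all names are assumptions for illustration, not the processing of the rendering unit 1300.

```python
from typing import Callable, List

Stage = Callable[[List[float]], List[float]]


def early_reflection_stage(delay_samples: int, reflection_gain: float) -> Stage:
    """Hypothetical stage: add one delayed, attenuated copy of the signal."""
    def stage(signal: List[float]) -> List[float]:
        out = list(signal) + [0.0] * delay_samples
        for i, s in enumerate(signal):
            out[i + delay_samples] += reflection_gain * s
        return out
    return stage


def distance_attenuation_stage(distance_m: float) -> Stage:
    """Hypothetical stage: 1/distance gain, clamped at 1 m."""
    def stage(signal: List[float]) -> List[float]:
        gain = 1.0 / max(distance_m, 1.0)
        return [gain * s for s in signal]
    return stage


def run_pipeline(signal: List[float], stages: List[Stage]) -> List[float]:
    """Execute the stages one by one, in the given order."""
    for stage in stages:
        signal = stage(signal)
    return signal
```

Omitting a process then amounts to leaving its stage out of the list, and changing the order amounts to reordering the list.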

　また、各処理がステージと表現されてもよい。また、各処理の結果、生成された反射音等の音声信号は、レンダリングアイテムと表現されてもよい。パイプライン処理における複数のステージ、及び、それらの順番は、図１６に示された例に限られない。例えば、パニング部の処理は、パイプライン処理に含まれる複数のステージのうちの１つであるバイノーラル処理ステージで実行されてもよい。バイノーラル処理部は上述の説明における合成部と同等の機能を担う。 Furthermore, each process may be expressed as a stage. Furthermore, audio signals such as reflected sounds generated as a result of each process may be expressed as rendering items. The multiple stages in the pipeline processing and their order are not limited to the example shown in Figure 16. For example, the processing of the panning unit may be performed in a binaural processing stage, which is one of the multiple stages included in the pipeline processing. The binaural processing unit performs functions equivalent to those of the synthesis unit described above.

　ところで、以上のように説明した、立体音響再生システム６００において、上記の音響再生システム１００の例でも説明したようにユーザ９９の頭部の動きに応じて提示する音を変化させることで、ユーザ９９が三次元音場内で頭部を動かしているようにユーザ９９に知覚させるためには、ユーザ９９の頭部の位置及び向き（音源オブジェクトの位置に対する相対的な向き）を検知する必要がある。 In the stereophonic sound reproduction system 600 described above, in order to make the user 99 perceive as if they are moving their head within a three-dimensional sound field by changing the sound presented in accordance with the movement of the user 99's head, as explained in the example of the sound reproduction system 100 above, it is necessary to detect the position and orientation of the user 99's head (orientation relative to the position of the sound source object).

　このとき、ユーザ９９の頭部の位置及び向きの検知結果を取得して、それに応じた情報処理を行う必要があるので、理想的には図１及び図２に示すように、すべての構成がいわゆるヘッドフォンなどの、最終的な出力部分に係る装置、すなわち、図６でいう音声提示デバイス６０２に内蔵されることが望ましいが、電源確保、情報処理性能、筐体サイズ及び重量などの制約があるため、情報処理の部分を情報処理装置６０１において行い、音声の出力を音声提示デバイス６０２において行うという処理の分割が必要になる。しかも、近年では、音声提示デバイス６０２を情報処理装置６０１に対して無線通信で接続することが望まれるため、そのような無線通信で音声提示デバイス６０２と情報処理装置６０１との間で情報の送受信をするときに、同時送受信可能な情報量の制限（つまり通信帯域の制限）がネックになる。 At this time, the detection result of the position and orientation of the head of the user 99 must be acquired and information processing performed accordingly. Ideally, as shown in Figures 1 and 2, all components would be built into the device responsible for the final output, such as headphones, that is, the audio presentation device 602 in Figure 6. However, because of constraints such as power supply, information processing performance, and housing size and weight, the processing must be divided so that the information processing is performed in the information processing device 601 and the audio output is performed in the audio presentation device 602. Moreover, in recent years it has become desirable to connect the audio presentation device 602 to the information processing device 601 wirelessly, and when information is exchanged between them over such a wireless link, the limit on the amount of information that can be sent and received at one time (that is, the limit on communication bandwidth) becomes a bottleneck.

　例えば、情報処理装置６０１において、出力音信号を出力するのであれば、その出力音信号を無線通信で音声提示デバイス６０２に送信しなければならず、無線通信の通信規格に適合するエンコード及びデコード処理による遅延、ならびに、三次元の音声信号をもつ出力音信号自体の情報量の大きさによる送信遅延などが生じてしまうので、ユーザ９９の体験を損なう可能性が高い。しかも、出力音信号を生成する前にユーザ９９の頭部の位置及び向きの検知結果を一旦情報処理装置６０１で取得する必要があるので、ユーザ９９の頭部の動きに瞬時に音声を追従させることは困難である。そこで、コンテンツ上の規定されている音源オブジェクトの動きについては、情報処理装置６０１において処理したうえで三次元の音声信号を生成し、これをパニング処理して情報量を圧縮する。そして、圧縮したパニング処理済みの音声信号を音声提示デバイス６０２に送信することで、伝送路のバンド幅の制約を受けることを回避する。ここまでの処理は、コンテンツ上で既定の部分であるため、あらかじめ情報を音声提示デバイス６０２に送信してバッファさせることができる。 For example, if the information processing device 601 were to output an output sound signal, it would have to transmit that output sound signal to the audio presentation device 602 via wireless communication. This would result in delays due to encoding and decoding processes conforming to wireless communication standards, as well as transmission delays due to the large amount of information in the output sound signal itself, which contains a three-dimensional audio signal, and would likely impair the user 99's experience. Furthermore, since the information processing device 601 must first obtain the detection results of the user 99's head position and orientation before generating the output sound signal, it would be difficult to instantaneously have the audio follow the movement of the user 99's head. Therefore, the movement of the sound source object specified in the content is processed in the information processing device 601, and a three-dimensional audio signal is generated, which is then panned to compress the amount of information. The compressed, panned audio signal is then transmitted to the audio presentation device 602, thereby avoiding limitations on the bandwidth of the transmission path. Since the processing up to this point is predetermined for the content, the information can be transmitted to the audio presentation device 602 in advance and buffered.

　そして、ユーザ９９の頭部の動きを音声提示デバイス６０２において検知し、検知結果を用いて音声提示デバイス６０２上でパニング処理された音声信号に検知結果に応じた（頭部が移動した後の）代表方向に対応する頭部伝達関数を畳み込むことでユーザ９９の頭部の動きに音声の到来方向を瞬時に追従させることが可能となる。このとき、情報処理装置６０１において、あらかじめパニング処理を行っておくことで、それほど多くない代表方向分の頭部伝達関数の畳み込みをする処理のみを音声提示デバイス６０２において実行するため、音声提示デバイス６０２側に要求される処理リソースを拡大しにくく、かつ、音声提示デバイス６０２上での情報処理による遅延を大きく縮小することができる。頭部運動に伴う代表方向の更新が音声提示デバイス６０２側で行われるため、頭部運動の検知を行ってから頭部伝達関数の更新をするまでに、情報処理装置６０１と音声提示デバイス６０２との間で通信を必要としないため、音声の到来方向の更新に要する時間を最小化することが可能になる。 The audio presentation device 602 then detects the movement of the head of the user 99 and, on the audio presentation device 602, convolves the panned audio signal with the head-related transfer function corresponding to the representative direction given by the detection result (that is, the representative direction after the head has moved). This allows the direction of sound arrival to follow the movement of the head of the user 99 instantaneously. Because the panning processing is performed in advance in the information processing device 601, the audio presentation device 602 only has to convolve head-related transfer functions for a small number of representative directions. This keeps the processing resources required of the audio presentation device 602 from growing and greatly reduces the delay caused by information processing on the audio presentation device 602. Because the representative directions are updated on the audio presentation device 602 in response to head movement, no communication between the information processing device 601 and the audio presentation device 602 is needed between detecting the head movement and updating the head-related transfer function, so the time required to update the direction of sound arrival can be minimized.

　このように、情報処理装置６０１と音声提示デバイス６０２とに分かれた立体音響再生システム６００において、情報処理装置６０１及び音声提示デバイス６０２間で送受信される音声信号をパニング処理済みの音声信号とすることで、情報処理を２つの装置に分割して行いながらもユーザ９９の頭部の動きに対して瞬時に音声の到来方向を追従させることが可能となる。 In this way, in the stereophonic sound reproduction system 600 divided into the information processing device 601 and the audio presentation device 602, using panned audio signals as the audio signals exchanged between the information processing device 601 and the audio presentation device 602 makes it possible to split the information processing between the two devices while still having the direction of sound arrival follow the movement of the head of the user 99 instantaneously.

　情報処理装置６０１を第１端末とし、音声提示デバイス６０２を第２端末としたときに、上記のようにすることで、（１）ユーザ９９が装着するといった制約がなく、第２端末に比べて比較的処理性能を高くし易い第１端末において、比較的処理リソースが要求される、コンテンツ上の規定されている音源オブジェクトの動きを含めて音声信号にパニング処理をすることができる。また、（２）パニング処理によって例えば１０方向以内の代表方向からの音にまとめ情報量が圧縮され、第１端末と第２端末との間で送受信したときに、通信上のバンド幅の制約を受けにくくできる。また、（３）ユーザ９９が（耳がある頭部に）装着する第２端末において、容易にユーザ９９の頭部の位置及び向きを検知し、そのまま、パニング処理された音声信号からの出力音信号の出力のために用いることができるため、頭部運動に呼応して瞬時に音の到来方向を更新できる。さらに、（４）出力音信号の出力において、パニング処理されたときのわずかな数の代表方向分の頭部伝達関数の畳み込みのみで行うことができ、第２端末に要求される処理リソースを縮小することができる。以上のメリットを得ることができる。 When the information processing device 601 is the first terminal and the audio presentation device 602 is the second terminal, the above arrangement provides the following advantages. (1) The first terminal, which is not constrained by having to be worn by the user 99 and can therefore be given higher processing performance than the second terminal relatively easily, can perform the panning processing on the audio signal, including the movement of sound source objects specified in the content, which requires relatively large processing resources. (2) The panning processing consolidates the sound into sounds from representative directions, for example 10 or fewer directions, compressing the amount of information, so that transmission between the first terminal and the second terminal is less susceptible to communication bandwidth constraints. (3) The second terminal, which the user 99 wears (on the head, where the ears are), can easily detect the position and orientation of the head of the user 99 and use the result directly to generate the output sound signal from the panned audio signal, so the direction of sound arrival can be updated instantly in response to head movement. (4) The output sound signal can be generated by convolving head-related transfer functions for only the small number of representative directions used in the panning processing, which reduces the processing resources required of the second terminal.

　上記の図２～図５の構成とする場合、第１端末には、図２における通信モジュール１０２、取得部１１１の一部の機能、経路算出部１２１、出力音生成部１３１の畳み込み処理以外の機能が備えられる。経路算出部１２１は、基準位置、つまり第２端末に音情報を送信する前に第１端末で既に取得しているユーザ９９の座標位置もしくは向き、システム（第１端末及び第２端末）の初期化時のユーザ９９の座標位置もしくは向き、又は、システムが事前に決めた所定の座標位置もしくは方向を用いて、音源オブジェクトの位置から当該基準位置に到来する相対的な到来方向を算出する。基準位置は、第１端末と第２端末とで互いに共有された同じ座標位置もしくは向きであれば、どのように定められてもよい。そのため、基準位置としての向きは絶対座標での「北の方向」などの具体的な方向として決められていてもよいし、ユーザ９９の過去所定期間（例えば１分）に正面となった方向から算出した平均的な方向として決められてもよい。後者の場合には、所定期間ごとに基準位置としての向きが更新される。このように、基準位置によって第１端末と第２端末とで互いに同じ座標位置もしくは向きを共有したうえで、一連のレンダリング処理を行い、その一連のレンダリング処理の中では、共有された基準位置が変更されないようになっていればよい。なお、基準位置を更新する場合、＜デコーダの機能説明＞において触れた空間情報の更新処理における情報更新スレッドに含めて更新を行ってもよいし、その他の情報更新のためのスレッドに含めて更新を行ってもよいし、基準位置を更新するための専用のスレッドで更新を行ってもよい。 When the configurations of Figures 2 to 5 above are used, the first terminal is equipped with the communication module 102 in FIG. 2, some of the functions of the acquisition unit 111, the path calculation unit 121, and the functions of the output sound generation unit 131 other than the convolution processing. The path calculation unit 121 calculates the relative direction of arrival from the position of the sound source object to a reference position, where the reference position is the coordinate position or orientation of the user 99 already acquired by the first terminal before the sound information is transmitted to the second terminal, the coordinate position or orientation of the user 99 at the time the system (first terminal and second terminal) is initialized, or a predetermined coordinate position or direction determined in advance by the system. The reference position may be determined in any way as long as it is the same coordinate position or orientation shared by the first terminal and the second terminal. The orientation serving as the reference position may therefore be a specific direction such as "north" in absolute coordinates, or an average direction calculated from the directions the user 99 faced over a predetermined past period (for example, one minute). In the latter case, the orientation serving as the reference position is updated every predetermined period. It is sufficient that the first terminal and the second terminal share the same coordinate position or orientation through the reference position, then perform a series of rendering processes, and that the shared reference position is not changed within that series of rendering processes. When the reference position is updated, the update may be performed in the information update thread of the spatial information update process mentioned in <Description of Decoder Function>, in a thread for updating other information, or in a dedicated thread for updating the reference position.

　そして、第２端末には、図２における検知器１０３、取得部１１１の他部の機能、出力音生成部１３１における頭部伝達関数の畳み込み処理の機能、信号出力部１４１、データベース１０５、及び、ドライバ１０４が備えられる。信号出力部１４１は、上記の基準位置を、決定されたユーザ９９の座標及び向きにするための方向の変換を行うことで、検知器１０３による検知結果に応じた出力音信号を出力する。 The second terminal is equipped with the detector 103 in Figure 2, other functions of the acquisition unit 111, a function for convolution processing of head-related transfer functions in the output sound generation unit 131, a signal output unit 141, a database 105, and a driver 104. The signal output unit 141 converts the direction to align the above-mentioned reference position with the determined coordinates and orientation of the user 99, and outputs an output sound signal according to the detection result by the detector 103.

　図６～図１５の構成とする場合、第１端末には、図１５における解析部とパニング部が備えられる。解析部では、入力信号である第１の音情報に基づき、基準位置を用いて、音源オブジェクトの位置から当該基準位置に到来する相対的な到来方向を算出する。第１の音情報には、音源オブジェクト位置情報と音声信号が含まれる。解析部は上述の経路算出部１２１と同等の機能を有する。パニング部では、解析部で算出した到来方向に基づいて、代表点から基準位置に到来する代表方向を決定し、代表点（代表方向）に音声信号を分配するパニング処理を行う。つまり、パニング処理を行うことで、第１の音情報を、代表点の位置情報とパニング済みの音声信号を含む第２の音情報に変換する。第２端末には、合成部が備えられる。合成部において上記の基準位置を、第２端末で検知したユーザ９９の座標及び向きにするための方向の変換を行い、当該ユーザ９９の座標及び向きに基づいて、頭部伝達関数の畳み込み処理を行う。 When using the configurations shown in Figures 6 to 15, the first terminal is equipped with the analysis unit and panning unit shown in Figure 15. The analysis unit uses a reference position to calculate the relative arrival direction from the position of the sound source object to the reference position based on the first sound information, which is an input signal. The first sound information includes sound source object position information and an audio signal. The analysis unit has functionality equivalent to the path calculation unit 121 described above. The panning unit determines a representative direction from the representative point to the reference position based on the arrival direction calculated by the analysis unit, and performs panning processing to distribute the audio signal to the representative point (representative direction). In other words, by performing panning processing, the first sound information is converted into second sound information including position information of the representative point and a panned audio signal. The second terminal is equipped with a synthesis unit. The synthesis unit converts the direction to match the reference position with the coordinates and orientation of the user 99 detected by the second terminal, and performs convolution processing of the head-related transfer function based on the coordinates and orientation of the user 99.

　［動作］ [Operation]
　ここで、立体音響再生システム６００において、情報処理装置６０１と音声提示デバイス６０２とに処理を分割して行う場合の動作例を説明する。 Here, an example of the operation of the stereophonic sound reproduction system 600 when processing is divided between the information processing device 601 and the audio presentation device 602 will be described.

　図１７Ａを参照して、情報処理装置６０１の動作について説明する。図１７Ａは、実施の形態に係る情報処理装置６０１（第１端末）の動作例を示すフローチャートである。図中に示す動作例では、図１５において図示しない取得部が通信モジュールを介して再生音に関する情報と、音源オブジェクトの位置に関する情報とを含む第１の音情報を取得する（ステップＳ１０１）。解析部は入力信号である第１の音情報に基づき、基準位置を用いて、音源オブジェクトの位置から当該基準位置に到来する相対的な到来方向を算出する。解析部は、例えば、直接音と、１以上の副次音それぞれを特定し、それぞれが音源オブジェクト位置から基準位置に到来する伝搬経路を算出してもよい。パニング部では、解析部で算出した到来方向に基づいて、代表点から基準位置に到来する代表方向を決定し、代表点（代表方向）に音声信号を分配するパニング処理を行う（ステップＳ１０２）。つまり、パニング処理を行うことで、第１の音情報を、代表点の位置情報とパニング済みの音声信号を含む第２の音情報に変換する。図１５において図示しない送信部は、第２の音情報を、音声提示デバイス６０２（第２端末）に送信する（ステップＳ１０３）。 The operation of the information processing device 601 will be described with reference to Figure 17A. Figure 17A is a flowchart showing an example of the operation of the information processing device 601 (first terminal) according to the embodiment. In the example of operation shown in the figure, an acquisition unit (not shown in Figure 15) acquires, via a communication module, first sound information including information about the reproduced sound and information about the position of the sound source object (step S101). Based on the first sound information, which is the input signal, the analysis unit uses a reference position to calculate the relative direction of arrival from the position of the sound source object to the reference position. The analysis unit may, for example, identify the direct sound and each of one or more secondary sounds, and calculate the propagation path along which each travels from the sound source object position to the reference position. Based on the direction of arrival calculated by the analysis unit, the panning unit determines the representative directions from the representative points to the reference position and performs panning processing that distributes the audio signal to the representative points (representative directions) (step S102). In other words, the panning processing converts the first sound information into second sound information that includes the position information of the representative points and the panned audio signal. A transmitting unit (not shown in Figure 15) transmits the second sound information to the audio presentation device 602 (second terminal) (step S103).

　次に、図１７Ｂを参照して、音声提示デバイス６０２（第２端末）の動作を説明する。図１５において図示しない受信部は、第１端末から第２音情報を受信する（ステップＳ１０４）。次に、図１５において図示しないセンシング情報入力部は、ユーザの位置に関する情報（位置情報または顔の向きの情報の少なくとも一方）を取得する（ステップＳ１０５）。次に、合成部は、上記基準位置を、決定されたユーザの座標及び向きにするための方向の変換を行い、当該ユーザの座標及び向きに基づいて、頭部伝達関数の畳み込み処理を行い（ステップＳ１０６）、音信号を合成して出力する（ステップＳ１０７）。 Next, the operation of the audio presentation device 602 (second terminal) will be described with reference to FIG. 17B. A receiving unit, not shown in FIG. 15, receives second sound information from the first terminal (step S104). Next, a sensing information input unit, not shown in FIG. 15, acquires information related to the user's position (at least one of position information and facial orientation information) (step S105). Next, a synthesis unit converts the direction to match the reference position with the determined user coordinates and orientation, performs convolution processing of the head-related transfer function based on the user coordinates and orientation (step S106), and synthesizes and outputs a sound signal (step S107).
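The direction conversion in step S106 can be illustrated for the simplest case, a change in head yaw only. The sketch assumes that representative directions are given as azimuths relative to the shared reference orientation, that azimuth and yaw are both positive counterclockwise, and that translation of the user is ignored; all names are hypothetical.

```python
from typing import List


def wrap_degrees(angle: float) -> float:
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def playback_azimuths(
    representative_azimuths_deg: List[float],
    reference_yaw_deg: float,
    head_yaw_deg: float,
) -> List[float]:
    """Azimuths of the playback representative points as seen from the head.

    The representative azimuths are defined relative to the shared reference
    orientation; turning the head by some angle moves every representative
    point by the opposite angle in head-relative coordinates.
    """
    head_rotation = head_yaw_deg - reference_yaw_deg
    return [wrap_degrees(az - head_rotation) for az in representative_azimuths_deg]
```

The head-related transfer function for each playback azimuth would then be looked up and convolved with the corresponding panned signal; because only the detected yaw and the shared reference are needed, this step requires no round trip to the first terminal.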

　［パニング処理の具体例］ [Specific example of panning processing]
　再掲するが、パニング処理では、複数個の音源オブジェクトからの再生音を、複数の代表方向からの代表音によって表現する。この代表方向には、例えば、２～３方向を用いることが可能である。具体的には、パニング処理では、音源オブジェクトの個数より少ない個数の代表点にまとめ、この代表点に対する代表方向の頭部伝達関数のみで再生音を到来方向からの音として知覚させることが可能である。 To reiterate, in the panning processing, the reproduced sounds from multiple sound source objects are represented by representative sounds from multiple representative directions. For example, two or three directions can be used as the representative directions. Specifically, the panning processing consolidates the sound source objects into representative points that are fewer in number than the sound source objects, so that the reproduced sounds can be perceived as arriving from their directions of arrival using only the head-related transfer functions of the representative directions for these representative points.

　この際、パニング処理では、音源オブジェクトからの到来方向の頭部伝達関数と代表方向の頭部伝達関数との相互相関が最大になる時間シフト（ディレイ、時間遅延）を算出する。ここで得られた時間シフト、又はこの時間シフトに負号を付した時間シフトを音源オブジェクトの再生音に付与した、時間シフト後の信号が代表方向にあるものとして、以降の処理を行う。 In this case, the panning process calculates the time shift (delay) that maximizes the cross-correlation between the head-related transfer function in the direction of arrival from the sound source object and the head-related transfer function in the representative direction. The time shift obtained here, or a time shift obtained by adding a negative sign to this time shift, is applied to the sound played back from the sound source object, and subsequent processing is performed assuming that the signal after the time shift is in the representative direction.
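The search for the time shift that maximizes the cross-correlation can be sketched as a brute-force scan over integer sample lags. The inputs are assumed to be head-related impulse responses for one ear, sampled at the same rate; the function name and the search range parameter are hypothetical.

```python
from typing import Sequence


def best_time_shift(
    hrir_arrival: Sequence[float],
    hrir_representative: Sequence[float],
    max_shift: int,
) -> int:
    """Integer lag (in samples) maximizing the cross-correlation.

    c(shift) = sum over n of hrir_arrival[n] * hrir_representative[n - shift].
    A positive result means the arrival-direction response is delayed
    relative to the representative-direction response.
    """
    best_shift = 0
    best_score = float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        score = 0.0
        for n, a in enumerate(hrir_arrival):
            m = n - shift
            if 0 <= m < len(hrir_representative):
                score += a * hrir_representative[m]
        if score > best_score:
            best_score = score
            best_shift = shift
    return best_shift
```

Whether this lag or its negative is applied to the reproduced sound depends on the sign convention chosen, which matches the statement above that either the time shift or the time shift with a negative sign may be used.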

Here, for each representative point, the panning process applies a gain to the time-shifted reproduced sound of each sound source object, sums the resulting signals, and convolves the sum with the head-related transfer function of that representative point. This synthesizes a signal equivalent to the reproduced sound of the sound source object convolved with the head-related transfer function of its direction of arrival.

On the other hand, when the panning process synthesizes the head-related transfer function (vector) of the direction of arrival as a sum of the head-related transfer functions (vectors) of the representative directions, the gains may be calculated so that the error signal vector between the synthesized head-related transfer function (vector) and the head-related transfer function (vector) of the direction of arrival is orthogonal to the head-related transfer functions (vectors) of the representative directions. Note that a head-related transfer function (vector) here means the time waveform of the head-related impulse response, which is the time-domain representation of the head-related transfer function, regarded as a vector. Hereinafter, this head-related transfer function (vector) is also simply referred to as a "head-related transfer function vector".

　パニング処理では、このゲインについて、音源オブジェクトの位置からのユーザ９９の左右の耳までの頭部伝達関数のエネルギーバランスが、パニング処理により実質的に複数の代表点からの頭部伝達関数で合成された頭部伝達関数でも維持されるように補正する。すなわち、パニング処理では、音源オブジェクトによるユーザ９９の左右の耳の頭部伝達関数のエネルギーバランスが、パニング処理により実質的に合成された頭部伝達関数でも維持されるようにゲインを補正してもよい。 In the panning process, this gain is corrected so that the energy balance of the head-related transfer functions from the position of the sound source object to the left and right ears of the user 99 is maintained even in the head-related transfer functions substantially synthesized by the panning process from head-related transfer functions from multiple representative points. In other words, in the panning process, the gain may be corrected so that the energy balance of the head-related transfer functions of the left and right ears of the user 99 due to the sound source object is maintained even in the head-related transfer functions substantially synthesized by the panning process.

　本実施形態においては、パニング処理では、音源オブジェクトの各到来方向について、代表方向の頭部伝達関数に乗ずるゲイン値と、代表方向の頭部伝達関数に施す時間シフト値とを算出して、後述する頭部伝達関数テーブルに格納しておくことが可能である。 In this embodiment, the panning process calculates, for each direction of arrival of the sound source object, a gain value to be multiplied by the head-related transfer function of the representative direction and a time shift value to be applied to the head-related transfer function of the representative direction, and stores these in a head-related transfer function table, which will be described later.

Then, the panning process time-shifts each sound source object by the time shift value corresponding to its direction of arrival, multiplies it by the corresponding gain value, and sums the results to form a sum signal. The panning process treats this sum signal as being located at the position of the representative point. By convolving the head-related transfer function for the position of the representative point with this sum signal, the panning process can generate the signal at the ears of the user 99.

　以下、図１８のフローチャートを参照して、パニング処理とともに関連する音声再生処理の詳細をステップ毎に説明する。まず、音源及び方向取得処理を行う（ステップＳ２０１）。例えば、経路算出部１２１が、ユーザ９９からみた音源オブジェクトの方向を取得する。 Below, with reference to the flowchart in Figure 18, the panning process and the associated audio playback process will be explained in detail step by step. First, sound source and direction acquisition processing is performed (step S201). For example, the path calculation unit 121 acquires the direction of the sound source object as seen by the user 99.

Specifically, the acquisition unit acquires the audio signal (target signal) of the sound source object. This audio signal may have any sampling frequency and any number of quantization bits. This embodiment describes an example using an audio signal with a sampling frequency of 48 kHz and 16-bit quantization. Furthermore, the path calculation unit 121 acquires the direction information of the sound source object that is attached to, for example, the audio signal of the content or the audio signals of the participants in a remote call.

　この上で、経路算出部１２１は、音源オブジェクトとユーザ９９との空間的な配置を把握する。この配置は、上述したように、コンテンツ等に設定された仮想空間等を含む空間内の配置であってもよい。そして、経路算出部１２１は、把握された空間内の配置に応じて、ユーザ９９からみた音源オブジェクトの方向、すなわち到来方向として算出する。経路算出部１２１は、コンテンツの音声信号についても、同様に、音源オブジェクトの音声信号の方向情報を参照し、ユーザ９９の配置に基づいて、到来方向を算出可能である。 The path calculation unit 121 then determines the spatial arrangement of the sound source object and the user 99. As described above, this arrangement may be within a space including a virtual space set in content, etc. The path calculation unit 121 then calculates the direction of the sound source object as seen by the user 99, i.e., the arrival direction, based on the determined arrangement within the space. Similarly, the path calculation unit 121 can also calculate the arrival direction for audio signals of content based on the arrangement of the user 99 by referring to directional information of the audio signal of the sound source object.
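As a minimal sketch of this step, the direction of arrival in the horizontal plane can be derived from the positions of the user and the sound source object and from the user's head orientation. The function name `arrival_azimuth_deg` and the angle convention (degrees, counter-clockwise from the +x axis, with the head yaw measured the same way) are assumptions made for illustration and are not taken from the disclosure.

```python
import math

def arrival_azimuth_deg(user_xy, user_yaw_deg, source_xy):
    """Azimuth of the source relative to the user's facing direction, in [0, 360).

    Assumed convention: angles are measured counter-clockwise from the +x axis,
    and user_yaw_deg is the direction the user is facing in that same frame.
    """
    dx = source_xy[0] - user_xy[0]
    dy = source_xy[1] - user_xy[1]
    world_deg = math.degrees(math.atan2(dy, dx))  # direction in the world frame
    return (world_deg - user_yaw_deg) % 360.0     # re-expressed relative to the head

# A source straight "up" the y axis is at 90 degrees for a user facing +x,
# and straight ahead (0 degrees) for a user facing +y.
front = arrival_azimuth_deg((0.0, 0.0), 90.0, (0.0, 1.0))
```

Only the horizontal angle is handled here; an elevation angle would be computed analogously when vertical directions are considered.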

　なお、経路算出部１２１は、音源オブジェクトからみたユーザ９９の方向も算出してもよい。 The path calculation unit 121 may also calculate the direction of the user 99 from the sound source object.

　次に、パニング処理を実行する処理部であるパニング部（図５の機能ブロック図における生成部１３４）が、パニング処理を行う（ステップＳ２０２）。ここでは、パニング部は、方向情報を用いて、音源オブジェクトに対するパニング処理を行う。本実施の形態においては、パニング部は、パニング処理によって耳元で合成された音が、いかに本来あるべき耳もとの音に近づけることができるかという観点で、パニング処理を行う。 Next, the panning unit (generation unit 134 in the functional block diagram of Figure 5), which is a processing unit that executes the panning process, performs the panning process (step S202). Here, the panning unit uses directional information to perform panning on the sound source object. In this embodiment, the panning unit performs the panning process from the perspective of how closely the sound synthesized at the ear by the panning process can be made to resemble the sound that should be heard at the ear.

With reference to Figure 19, the calculation performed when the panning unit pans a sound source object (sound source S-1) using representative points R-1 and R-2 will be described. Although the signal to be panned is the sound source S-1 itself, the optimal shift amounts and optimal gains for panning it are calculated below using the head-related transfer functions from the sound source S-1, the representative point R-1, and the representative point R-2 to the ear.

In the example of Figure 19, the head-related transfer function from the sound source S-1 to the ear, sampled at P points (P taps), is treated as a P-dimensional vector and denoted v{x} (in the following embodiments, vectors are written as "v{ }").

Here, the panning unit denotes the head-related transfer function from representative point R-1 to the ear of the user 99 as v{x₀₁}, and the head-related transfer function from representative point R-2 to the ear as v{x₀₂}. The cross-correlation between v{x} and v{x₀₁} is calculated, and v{x₀₁} time-shifted so that this cross-correlation is maximized is denoted v{x₁}. Similarly, the cross-correlation between v{x} and v{x₀₂} is calculated, and v{x₀₂} time-shifted so that this cross-correlation is maximized is denoted v{x₂}.

This v{x₁} is multiplied by gain A, v{x₂} is multiplied by gain B, and v{x} is approximated by their sum. In other words, v{x} is approximated as A × v{x₁} + B × v{x₂}. This makes it possible to achieve panning processing with reduced error.

　このゲインの算出と時間シフトの詳細について説明する。まずは、ゲインの算出について説明する。ｖ｛ｘ｝の近似による誤差ベクトルを、下記の式（１）で示す。 The details of this gain calculation and time shift will be explained. First, the gain calculation will be explained. The error vector resulting from the approximation of v{x} is shown in equation (1) below.

In equation (1) above, an arrow over a variable indicates that it is a vector. When A and B take their optimal values, that is, when the magnitude of the error vector is minimized, the error vector v{e} is orthogonal to the plane spanned by the source vectors v{x₁} and v{x₂}. Therefore, the relationship in equation (2) below holds.

　これにより、下記の式（３）が算出される。 This results in the following equation (3):

　この式（３）を変形すると、下記の式（４）が得られる。 Transforming this equation (3) gives us the following equation (4).

Multiplying the upper equation of equation (4) by |v{x₂}|² and the lower equation by v{x₁}·v{x₂} gives equation (5) below.

　式（５）の上式から下式を減算し、Ｂを消去することでＡを算出することが可能である。これを式（６）に示す。 A can be calculated by subtracting the lower equation from the upper equation of equation (5) and eliminating B. This is shown in equation (6).

　従って、ゲインＡは、下記の式（７）となる。 Therefore, gain A is expressed as follows:

　同様に、ゲインＡを消去することで、下記の式（８）のように、ゲインＢを算出可能である。 Similarly, by eliminating gain A, gain B can be calculated as shown in equation (8) below.

　このように、ゲインＡ、Ｂは、合成信号と目的信号との誤差ベクトルが、用いた代表方向ベクトルと直交するように決定される。 In this way, gains A and B are determined so that the error vector between the composite signal and the target signal is orthogonal to the representative direction vector used.
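Written out, the derivation above takes the following form. This is a reconstruction from the surrounding description, with the numbering assumed to correspond to equations (1) to (8); arrows mark the vectors written as v{ } in the text.

```latex
\begin{align}
\vec{e} &= \vec{x} - \left(A\,\vec{x}_1 + B\,\vec{x}_2\right) \tag{1} \\
\vec{e}\cdot\vec{x}_1 &= 0, \qquad \vec{e}\cdot\vec{x}_2 = 0 \tag{2} \\
\left(\vec{x} - A\,\vec{x}_1 - B\,\vec{x}_2\right)\cdot\vec{x}_1 &= 0, \qquad
\left(\vec{x} - A\,\vec{x}_1 - B\,\vec{x}_2\right)\cdot\vec{x}_2 = 0 \tag{3} \\
A\,|\vec{x}_1|^2 + B\,(\vec{x}_1\cdot\vec{x}_2) &= \vec{x}\cdot\vec{x}_1 \notag \\
A\,(\vec{x}_1\cdot\vec{x}_2) + B\,|\vec{x}_2|^2 &= \vec{x}\cdot\vec{x}_2 \tag{4} \\
A\,|\vec{x}_1|^2|\vec{x}_2|^2 + B\,(\vec{x}_1\cdot\vec{x}_2)\,|\vec{x}_2|^2
  &= (\vec{x}\cdot\vec{x}_1)\,|\vec{x}_2|^2 \notag \\
A\,(\vec{x}_1\cdot\vec{x}_2)^2 + B\,(\vec{x}_1\cdot\vec{x}_2)\,|\vec{x}_2|^2
  &= (\vec{x}\cdot\vec{x}_2)(\vec{x}_1\cdot\vec{x}_2) \tag{5} \\
A\left(|\vec{x}_1|^2|\vec{x}_2|^2 - (\vec{x}_1\cdot\vec{x}_2)^2\right)
  &= (\vec{x}\cdot\vec{x}_1)\,|\vec{x}_2|^2 - (\vec{x}\cdot\vec{x}_2)(\vec{x}_1\cdot\vec{x}_2) \tag{6} \\
A &= \frac{(\vec{x}\cdot\vec{x}_1)\,|\vec{x}_2|^2 - (\vec{x}\cdot\vec{x}_2)(\vec{x}_1\cdot\vec{x}_2)}
          {|\vec{x}_1|^2|\vec{x}_2|^2 - (\vec{x}_1\cdot\vec{x}_2)^2} \tag{7} \\
B &= \frac{(\vec{x}\cdot\vec{x}_2)\,|\vec{x}_1|^2 - (\vec{x}\cdot\vec{x}_1)(\vec{x}_1\cdot\vec{x}_2)}
          {|\vec{x}_1|^2|\vec{x}_2|^2 - (\vec{x}_1\cdot\vec{x}_2)^2} \tag{8}
\end{align}
```

The denominator is zero only when the two shifted representative vectors are parallel, in which case the two gains are not uniquely determined.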

Multiplying the waveform of the head-related transfer function v{x₁} and that of v{x₂}, each already time-shifted according to the cross-correlation, by the gains A and B obtained from this calculation makes it possible to synthesize the target head-related transfer function. In other words, the panning process is performed by applying these time shift amounts (time shift values) and gains A and B to the sound source S-1.
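The two-gain calculation can be sketched in a few lines. The function names `dot` and `two_point_gains` are hypothetical, and the closed form used is simply the solution of the two orthogonality conditions described above (error vector orthogonal to both shifted representative vectors).

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def two_point_gains(x, x1, x2):
    """Gains (A, B) that make x - (A*x1 + B*x2) orthogonal to x1 and x2."""
    g11, g22, g12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    p1, p2 = dot(x, x1), dot(x, x2)
    det = g11 * g22 - g12 * g12
    if det == 0:
        raise ValueError("x1 and x2 are linearly dependent")
    gain_a = (p1 * g22 - p2 * g12) / det
    gain_b = (p2 * g11 - p1 * g12) / det
    return gain_a, gain_b

# Target built as 2*x1 - 1*x2, so the gains should come back as (2, -1).
x1 = [1.0, 1.0, 0.0]
x2 = [0.0, 1.0, 1.0]
x = [2.0, 1.0, -1.0]
gain_a, gain_b = two_point_gains(x, x1, x2)
error = [xi - (gain_a * a + gain_b * b) for xi, a, b in zip(x, x1, x2)]
```

For a target that is not exactly in the span of `x1` and `x2`, the same function returns the least-squares gains, and `error` is then non-zero but orthogonal to both representative vectors.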

Next, the specific calculation of the time shift that maximizes the cross-correlation will be described. In this embodiment, v{x} and v{x₀₁} each treat a head-related transfer function of P sample points as a vector. The time index (sample position) of the head-related transfer function can therefore be written explicitly, as in equation (9) below.

　この上で、これら式（９）の二つのベクトルの相互相関を「ｋ」の関数として、以下の式（１０）のように定義する。 Then, we define the cross-correlation between the two vectors in equation (9) as a function of "k" as shown in equation (10) below.
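A form of equations (9) and (10) consistent with the surrounding description is sketched here. The sign convention of the lag k is an assumption, chosen so that the maximizing lag equals the shift applied in equation (11); any normalization factor in the original cannot be recovered from the text and would not change which k maximizes the sum.

```latex
\begin{align}
\vec{x} &= \bigl(x(0),\, x(1),\, \dots,\, x(P-1)\bigr), \qquad
\vec{x}_{01} = \bigl(x_{01}(0),\, x_{01}(1),\, \dots,\, x_{01}(P-1)\bigr) \tag{9} \\
\varphi_{xx01}(k) &= \sum_{n=0}^{P-1} x(n)\, x_{01}(n+k),
\qquad x_{01}(m) = 0 \ \text{for}\ m < 0 \ \text{or}\ m \ge P \tag{10}
\end{align}
```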

Here, the k that gives the maximum value of φ_xx01(k) is denoted k_max01. The panning unit calculates k_max01 by, for example, substituting each candidate value for k. Similarly, the k that gives the maximum value of φ_xx02(k) is denoted k_max02, and the panning unit calculates it in the same way as k_max01. Hereinafter, either of k_max01 and k_max02 is simply written as "k_max".
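The exhaustive search over k can be sketched as follows. The names `cross_correlation` and `best_shift` are hypothetical, and the sketch assumes the unnormalized definition φ(k) = Σ x(n)·x₀₁(n+k) with samples outside 0..P−1 treated as zero; the polarity of k depends on which vector is taken as the reference.

```python
def cross_correlation(x, x0, k):
    """phi(k) = sum over n of x[n] * x0[n + k], with out-of-range samples as zero."""
    p = len(x)
    return sum(x[n] * x0[n + k] for n in range(p) if 0 <= n + k < p)

def best_shift(x, x0):
    """Lag k in [-(P-1), P-1] that maximizes the cross-correlation of x and x0."""
    p = len(x)
    return max(range(-(p - 1), p), key=lambda k: cross_correlation(x, x0, k))

# The peak of x0 sits two samples later than the peak of x,
# so x0 has to be advanced by 2 samples to line up with x.
x = [0.0, 1.0, 0.0, 0.0, 0.0]
x0 = [0.0, 0.0, 0.0, 1.0, 0.0]
k_max = best_shift(x, x0)
```

Swapping the arguments flips the sign of the result, which is the polarity reversal the text warns about.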

The panning unit stores, for example, the gains A and B and the shifts k_max01 and k_max02 calculated for each direction of arrival of a sound source object, at 2° intervals over the full 360°, in a head-related transfer function table as gain values and time shift values, respectively, and uses them in the output process described below. Note that it is also possible to perform only the audio output process described below, using a head-related transfer function table in which the gains A and B and the time shifts k_max01 and k_max02 have already been calculated and stored.
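One possible layout for such a table is a list with one row per 2° step. The row format `(gain_a, gain_b, k_max01, k_max02)`, the names `direction_index` and `lookup`, and the nearest-row rounding are all assumptions for illustration; the disclosure does not specify how the table is organized.

```python
STEP_DEG = 2                      # angular resolution given in the text
N_DIRECTIONS = 360 // STEP_DEG    # 180 rows cover the full circle

def direction_index(azimuth_deg):
    """Map an arrival azimuth in degrees to the index of the nearest table row."""
    return round((azimuth_deg % 360.0) / STEP_DEG) % N_DIRECTIONS

def lookup(table, azimuth_deg):
    """Return the (gain_a, gain_b, k_max01, k_max02) row for an arrival azimuth."""
    return table[direction_index(azimuth_deg)]

# Placeholder table: real rows would hold the precomputed gains and shifts.
table = [(1.0, 0.0, i, -i) for i in range(N_DIRECTIONS)]
row = lookup(table, 90.0)
```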

　次に、パニング部及び出力部が音声出力処理を行う（ステップＳ２０３）。まず、パニング部が、各音源オブジェクトについて、頭部伝達関数テーブルから、取得された到来方向に対応するゲイン値及び時間シフト値を取得する。この上で、パニング部は、当該音源オブジェクトの波形の各サンプリング点（サンプル）について、このゲイン値を掛ける。 Next, the panning unit and output unit perform audio output processing (step S203). First, the panning unit obtains a gain value and time shift value corresponding to the obtained direction of arrival from the head-related transfer function table for each sound source object. Then, the panning unit multiplies each sampling point (sample) of the waveform of that sound source object by this gain value.

In this case, the panning unit may correct the gains so that the energy balance between the left and right ears produced by the sound source object is maintained in the head-related transfer functions synthesized by the panning process. In other words, each gain value may be multiplied by an adjustment coefficient that makes the energy balance between the left and right head-related transfer functions match that of the original head-related transfer functions.

　次に、パニング部は、このゲイン値を掛けた信号について、時間シフトを行う。 The panning unit then performs a time shift on the signal multiplied by this gain value.

The details of this time shift will now be described. The vector v{x₁}, obtained by shifting the elements of the vector v{x₀₁} by k_max samples, is generated by the following procedure.

First, when the phase is advanced, that is, when k_max ≧ 0, k_max zeros are appended to the end of the vector so that the length of the vector is maintained. Conversely, when the phase is delayed, that is, when k_max < 0, |k_max| zeros are placed at the beginning of the vector so that the length of the vector is maintained. That is, the vector is set as shown in equation (11) below.

In this way, the time-shifted vector v{x₁} is generated. The sign of the time shift amount is reversed depending on which of the two vectors is taken as the reference when calculating the cross-correlation. Care is also needed with the sign of the time shift amount when convolving the head-related transfer function with the sound source signal.
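The zero-padding rule can be sketched as a small helper. The name `shift_vector` is hypothetical; the sketch assumes an integer shift and the convention that a non-negative k advances the waveform, so that element n of the result is element n + k of the input.

```python
def shift_vector(x0, k):
    """Shift x0 by k samples with zero padding, keeping the length unchanged.

    k >= 0 advances the waveform (zeros appended at the end);
    k < 0 delays it (zeros inserted at the beginning).
    """
    p = len(x0)
    if k >= 0:
        return x0[k:] + [0.0] * min(k, p)
    return [0.0] * min(-k, p) + x0[:max(p + k, 0)]

advanced = shift_vector([1.0, 2.0, 3.0, 4.0, 5.0], 2)   # [3.0, 4.0, 5.0, 0.0, 0.0]
delayed = shift_vector([1.0, 2.0, 3.0, 4.0, 5.0], -2)   # [0.0, 0.0, 1.0, 2.0, 3.0]
```

Shifts whose magnitude reaches or exceeds the vector length simply produce an all-zero vector of the same length.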

Note that, instead of shifting by an integer number of taps, the panning unit can also perform a fractional shift (a shift by a non-integer number of taps, also called a decimal shift) by oversampling. The gain value may also be applied after the time shift rather than before it.

　パニング部は、このようにして算出された、ゲインと時間シフトを行った信号を代表点Ｒの位置に存在する代表点信号として扱う。この上で、パニング部は、代表点Ｒにまとめる音源オブジェクトの代表点信号の和をとり、和信号を生成する。そして、パニング部は、この和信号に、代表点Ｒの位置の頭部伝達関数（代表点方向の頭部伝達関数）を畳み込んで、ユーザ９９の耳元の信号を生成する。 The panning unit treats the signal that has been calculated in this way and that has undergone gain and time shifting as a representative point signal located at the position of representative point R. The panning unit then takes the sum of the representative point signals of the sound source objects that are grouped together at representative point R to generate a sum signal. The panning unit then convolves this sum signal with the head-related transfer function at the position of representative point R (head-related transfer function in the direction of the representative point) to generate a signal at the ear of the user 99.
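The output step for one representative point and one ear can be sketched as follows: shift and scale each source, sum the results, then apply a single convolution. The names `advance`, `convolve`, and `render_ear` are hypothetical, the shift polarity (positive k advances the signal) is an assumption, and in practice the routine would be run once per ear with that ear's gains, shifts, and representative-point impulse response.

```python
def advance(signal, k):
    """Return s'(n) = s(n + k) with zero padding, keeping the length unchanged."""
    p = len(signal)
    if k >= 0:
        return signal[k:] + [0.0] * min(k, p)
    return [0.0] * min(-k, p) + signal[:max(p + k, 0)]

def convolve(signal, h):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(h) - 1)
    for i, s in enumerate(signal):
        for j, c in enumerate(h):
            out[i + j] += s * c
    return out

def render_ear(sources, rep_hrir):
    """sources: list of (signal, gain, k) tuples grouped at one representative point."""
    length = max(len(sig) for sig, _, _ in sources)
    total = [0.0] * length
    for sig, gain, k in sources:
        padded = list(sig) + [0.0] * (length - len(sig))
        for n, v in enumerate(advance(padded, k)):
            total[n] += gain * v          # sum signal at the representative point
    return convolve(total, rep_hrir)      # one convolution for all grouped sources

ear = render_ear(
    [([1.0, 0.0, 0.0, 0.0], 2.0, 0), ([0.0, 1.0, 0.0, 0.0], 1.0, 1)],
    [1.0, 0.5],
)
```

The saving comes from the last line of `render_ear`: however many sources are grouped at the representative point, only one convolution with the head-related impulse response is performed.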

　パニング部により生成されたこの耳元の信号を、出力することで再生させる。この出力は、例えば、ユーザ９９の左耳及び右耳に対応した２チャンネルのアナログ音声信号であってもよい。 The signal generated by the panning unit at the ear is then output and played back. This output may be, for example, a two-channel analog audio signal corresponding to the left and right ears of the user 99.

　これにより、ヘッドフォンによる２チャンネルの音声信号として仮想的な音場に対応した音声信号を再生することが可能となる。以上により、音声再生処理を終了する。 This makes it possible to play back audio signals corresponding to the virtual sound field as two-channel audio signals through headphones. This completes the audio playback process.

　以上のように構成することで、以下のような効果を得ることができる。 By configuring it as described above, the following effects can be achieved:

　近年、映画、ＡＲ、ＶＲ、ＭＲ、ゲーム等のコンテンツ再生をＶＲヘッドフォンやＨＭＤ等で行う際、３Ｄの音場全体を適切に記述、再生するレンダリング技術（バイノーラル化技術）が要求されていた。従来の３Ｄの立体音響（バイノーラル信号）の生成では、複数個の音源信号に、各々に対応する到来方向の頭部伝達関数を個別に畳み込むことで行っていた。このように、個々の音源オブジェクトに頭部伝達関数を畳み込むと、高い臨場感で人の動き（６ＤｏＦ：６　Ｄｅｇｒｅｅｓ　ｏｆ　Ｆｒｅｅｄｏｍ）に追従するために、膨大な演算量が要求され問題になっていた。 In recent years, when playing content such as movies, AR, VR, MR, and games using VR headphones or HMDs, there has been a demand for rendering technology (binauralization technology) that can properly describe and reproduce the entire 3D sound field. Conventional 3D stereophonic sound (binaural signals) is generated by individually convolving multiple sound source signals with the head-related transfer functions for the corresponding directions of arrival. In this way, convolving head-related transfer functions with individual sound source objects requires a huge amount of computation to follow human movement (6DoF: 6 Degrees of Freedom) with a high level of realism, which has been a problem.

　一方、スピーカによるパニング処理では、従来、サイン則、タンジェント則等でスピーカの音量バランスを制御することでスピーカ間に音像を作っていた（音源オブジェクトを定位させていた）。しかしながら、単に音量バランスを制御するだけでは、ヘッドフォンによる立体音響の音像を、適切に再生することはできなかった。 On the other hand, in speaker panning processing, sound images were created between speakers by controlling the volume balance between speakers using the sine law, tangent law, etc. (sound source objects were localized). However, simply controlling the volume balance was not enough to properly reproduce a 3D sound image through headphones.

In contrast, the audio playback process described above is characterized by using the path calculation unit 121, which acquires the direction of arrival of a sound source object, and a panning unit, which represents the sound source object by panning it with sounds from specific representative directions, on the basis of the direction of arrival acquired by the path calculation unit 121, through time shifting and gain adjustment of the sound source object.

　このように構成することで、代表方向のパニングにより音源オブジェクトを合成し、到来方向数を減らすことで、より効率的で効果的なレンダリングが可能になる。これにより、一つ一つの音源オブジェクトの信号に、個別に頭部伝達関を畳み込む従来手法に比べて演算量を削減することができる。すなわち、パニング部は、経路算出部１２１により取得された到来方向に近似する代表方向の頭部伝達関数をパニング処理により等価的に合成し、到来方向の頭部伝達関数を生成することができる。このようにして演算量を削減することで、３Ｄ音場の再生システムとして、ゲーム、映画等のＶＲ／ＡＲアプリへ応用することができる。また、スマートフォンや家電機器に適用することで、立体音響を生成する演算量を抑えることができ、コストが削減できる。さらに、より演算量を削減した方式として、国際標準化等に適用可能となる。 With this configuration, sound source objects are synthesized by panning the representative direction, reducing the number of arrival directions, enabling more efficient and effective rendering. This reduces the amount of calculation compared to conventional methods that individually convolve the head-related transfer function into the signal of each sound source object. In other words, the panning unit uses panning processing to equivalently synthesize head-related transfer functions of representative directions that approximate the arrival direction obtained by the path calculation unit 121, and can generate a head-related transfer function for the arrival direction. By reducing the amount of calculation in this way, the system can be applied to VR/AR apps for games, movies, etc. as a 3D sound field playback system. Furthermore, by applying it to smartphones and home appliances, the amount of calculation required to generate stereophonic sound can be reduced, resulting in cost savings. Furthermore, as a method with even lower calculation volume, it can be applied to international standardization, etc.

　なお、上述の実施の形態においては、パニング部が、音源信号を左右２方向の代表点によるパニング処理で表現する場合、すなわち左右方向の頭部伝達関数のベクトルを用いて等価的に到来方向の頭部伝達関数のベクトルを合成する例について記載した。すなわち、上述の実施の形態においては、方向情報として、ユーザ９９の左右の角度方向を考慮する例について記載した。 In the above-described embodiment, an example was described in which the panning unit represents the sound source signal by panning using representative points in two directions, left and right, i.e., an example in which a vector of a head-related transfer function in the left and right direction is used to equivalently synthesize a vector of a head-related transfer function in the direction of arrival. In other words, the above-described embodiment described an example in which the angular directions of the user 99 to the left and right are taken into account as directional information.

　しかしながら、これらの到来方向として、上下方向についても考慮することが可能である。具体的には、到来方向の頭部伝達関数のベクトルを３方向の頭部伝達関数のベクトルによる補間で等価的に合成することも可能である。すなわち、パニング部は、仰角方向（あるいは俯角方向）を含む３方向の代表点によるパニング処理も同様に実行可能である。 However, it is also possible to consider the vertical direction as the direction of arrival. Specifically, it is also possible to equivalently synthesize the vector of the head-related transfer function for the direction of arrival by interpolating the vectors of the head-related transfer functions for three directions. In other words, the panning unit can also perform panning processing using representative points in three directions, including the elevation angle direction (or depression angle direction).

In this case, as with interpolation from two directions, the head-related transfer functions of the representative directions, each time-shifted so as to maximize its cross-correlation with v{x}, are written in vector notation as v{x₁}, v{x₂}, and v{x₃}. The error vector v{e} is then given by equation (12) below.

　これを、下記式（１３）に当てはめて、解く。 Substitute this into equation (13) below and solve.

　具体的には、下記式（１４）により、最適なゲインＡ、Ｂ、Ｃが算出できる。 Specifically, the optimal gains A, B, and C can be calculated using the following formula (14):

In equation (14) above, the superscript "-1" on the matrix denotes the inverse matrix. The time shift amounts k_max01, k_max02, and k_max03 of the head-related impulse responses (HRIRs) of the representative directions, determined so as to maximize the cross-correlation, are calculated before the above gain values, in the same way as in the two-direction case.
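By analogy with the two-direction case, equations (12) to (14) can be reconstructed as follows. The exact layout of the original equations is an assumption; the content follows from requiring the error vector to be orthogonal to each of the three shifted representative vectors.

```latex
\begin{align}
\vec{e} &= \vec{x} - \left(A\,\vec{x}_1 + B\,\vec{x}_2 + C\,\vec{x}_3\right) \tag{12} \\
\vec{e}\cdot\vec{x}_1 &= 0, \qquad \vec{e}\cdot\vec{x}_2 = 0, \qquad \vec{e}\cdot\vec{x}_3 = 0 \tag{13} \\
\begin{pmatrix} A \\ B \\ C \end{pmatrix} &=
\begin{pmatrix}
\vec{x}_1\cdot\vec{x}_1 & \vec{x}_1\cdot\vec{x}_2 & \vec{x}_1\cdot\vec{x}_3 \\
\vec{x}_2\cdot\vec{x}_1 & \vec{x}_2\cdot\vec{x}_2 & \vec{x}_2\cdot\vec{x}_3 \\
\vec{x}_3\cdot\vec{x}_1 & \vec{x}_3\cdot\vec{x}_2 & \vec{x}_3\cdot\vec{x}_3
\end{pmatrix}^{-1}
\begin{pmatrix} \vec{x}\cdot\vec{x}_1 \\ \vec{x}\cdot\vec{x}_2 \\ \vec{x}\cdot\vec{x}_3 \end{pmatrix} \tag{14}
\end{align}
```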

　また、上述の実施形態においては、代表点Ｒを２～４個用いる例について記載した。 Furthermore, in the above-described embodiment, an example was described in which two to four representative points R were used.

　しかしながら、２個以上の代表点Ｒを用いることも当然可能である。たとえば、後述する実施例で示すように、範囲角９０°及び６０°等に対応する４～６個の代表点Ｒを用いることも可能である。さらに、４個の場合も、ユーザ９９に対して斜め（４５°、１３５°、２２５°、及び３１５°）、縦横（０°、９０°、１８０°、及び２７０°）のように、異なる代表点Ｒの位置に設定することも可能である。４～６個の代表点Ｒから、到来方向に最も近い２点又は３点を選択して、当該音源の合成のための代表点Ｒとして使用することも可能である。 However, it is of course possible to use two or more representative points R. For example, as shown in the examples described below, it is possible to use four to six representative points R corresponding to range angles such as 90° and 60°. Furthermore, even in the case of four representative points, it is possible to set them at different positions, such as diagonally (45°, 135°, 225°, and 315°) or vertically and horizontally (0°, 90°, 180°, and 270°) relative to the user 99. It is also possible to select two or three of the four to six representative points R that are closest to the direction of arrival and use them as representative points R for synthesizing the sound source.
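Selecting the representative points closest to a direction of arrival can be sketched as follows. The names `angular_distance` and `nearest_representatives` are hypothetical, and the sketch considers only the horizontal angle in degrees.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute angle between two azimuths, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def nearest_representatives(arrival_deg, rep_degs, count=2):
    """Return the `count` representative azimuths closest to the arrival azimuth."""
    return sorted(rep_degs, key=lambda r: angular_distance(arrival_deg, r))[:count]

diagonal = [45.0, 135.0, 225.0, 315.0]     # diagonal layout from the text
axis = [0.0, 90.0, 180.0, 270.0]           # front/side/back layout from the text
pair_a = nearest_representatives(100.0, diagonal)   # [135.0, 45.0]
pair_b = nearest_representatives(350.0, axis)       # [0.0, 270.0]
```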

　すなわち、音声再生処理において、パニング処理には、合成されたＨＲＩＲベクトルと音源方向のＨＲＩＲベクトルとの誤差信号ベクトルのエネルギー又はＬ２ノルムを最小化するようにして算出されたゲインを用いてもよい。 In other words, in the audio reproduction process, the panning process may use a gain calculated to minimize the energy or L2 norm of the error signal vector between the synthesized HRIR vector and the HRIR vector in the sound source direction.
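Minimizing the L2 norm of the error for any number of representative vectors amounts to solving the normal equations. The sketch below does this with plain Gauss-Jordan elimination; the function names are hypothetical, and a production implementation would more likely call a linear-algebra library.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def least_squares_gains(x, reps):
    """Gains g minimizing |x - sum_i g[i] * reps[i]|^2 via the normal equations."""
    n = len(reps)
    # Augmented matrix [G | p] with G[i][j] = reps[i].reps[j], p[i] = x.reps[i].
    m = [[dot(reps[i], reps[j]) for j in range(n)] + [dot(x, reps[i])]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("representative vectors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

reps = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
# Target = 2*reps[0] - 1*reps[1] + 0.5*reps[2], plus a component outside their span.
x = [1.0, -0.5, 0.5, 7.0]
gains = least_squares_gains(x, reps)
```

The component of `x` outside the span of the representative vectors (the final element here) does not affect the gains; it remains in the error vector, which is orthogonal to every representative vector.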

[Weighting filter for time shift and gain calculation]
The description above covered an example in which the head-related transfer functions themselves are used when calculating the time shift that maximizes the cross-correlation and the gains. Alternatively, the time shift and/or the gains may be calculated from a cross-correlation taken after a weighting filter on the frequency axis has been applied.

　すなわち、相互相関を最大化する時間シフトおよびゲインの算出時に、周波数軸上の重み付けフィルタ（以下、「周波数重み付けフィルタ」ともいう。）をかけたものを用いることが可能である。 In other words, when calculating the time shift and gain that maximize the cross-correlation, it is possible to use a frequency-axis weighting filter (hereinafter also referred to as a "frequency weighting filter").

For this frequency weighting filter, it is preferable to use a filter whose cutoff frequency is near, or slightly above, the frequency band in which human hearing is most sensitive, and which attenuates the band above it, that is, the band in which human hearing becomes less sensitive. For example, it is preferable to use a low-pass filter (LPF) with a cutoff frequency of 3000 Hz to 6000 Hz and a roll-off of approximately 6 dB/oct (octave) to 12 dB/oct.
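As one illustrative way to obtain such a weighting, the impulse response of a first-order (roughly 6 dB/oct) low-pass filter can be generated directly. The one-pole design, the coefficient formula a = exp(−2π·fc/fs), and the name `one_pole_lowpass_ir` are assumptions made for this sketch; the disclosure states only the cutoff range and slope, not the filter structure.

```python
import math

def one_pole_lowpass_ir(cutoff_hz, sample_rate_hz, length):
    """Impulse response w(n) = (1 - a) * a**n of y[n] = (1 - a)*x[n] + a*y[n-1].

    a = exp(-2*pi*fc/fs) is a common approximation that places the corner
    near fc for cutoffs well below the Nyquist frequency.
    """
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    return [(1.0 - a) * a ** n for n in range(length)]

# 4 kHz cutoff at the 48 kHz sampling rate used in the embodiment.
w = one_pole_lowpass_ir(4000.0, 48000.0, 64)
```

Cascading two such sections would give a slope of roughly 12 dB/oct, the other end of the range mentioned in the text.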

Specifically, since v{x} and v{x₀₁} treat P-point head-related transfer functions as vectors, they can be written with the time index of the head-related transfer function made explicit, as in equation (9) above. Convolving the impulse response w_c(n) of the frequency weighting filter with each of the two vectors of equation (9) and truncating the result to length P gives equation (15) below.

　ここで、演算「＊」は、畳み込みを示す。この上で、式（１５）の二つのベクトルの相互相関を「ｋ」の関数として、以下の式（１６）のように定義する。 Here, the operation "*" indicates convolution. Then, the cross-correlation between the two vectors in equation (15) is defined as a function of "k" as shown in equation (16) below.
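A form of equations (15) and (16) consistent with this description is sketched here. The symbol x_w for the weighted target is an assumption (x_01w appears in the text), as is a causal weighting filter; the lag convention matches the unweighted cross-correlation.

```latex
\begin{align}
x_w(n) &= (w_c * x)(n) = \sum_{m=0}^{n} w_c(m)\, x(n-m), \qquad
x_{01w}(n) = (w_c * x_{01})(n), \qquad n = 0, \dots, P-1 \tag{15} \\
\varphi_{xx01}(k) &= \sum_{n=0}^{P-1} x_w(n)\, x_{01w}(n+k) \tag{16}
\end{align}
```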

Here, the k that gives the maximum value of φ_xx01(k) in equation (16) is denoted k_max. The panning unit generates, for example, the vector v{x₁}, obtained by shifting the elements of the vector v{x₀₁} by k_max samples, by the following procedure, in the same manner as equation (11) above.

Specifically, when the phase is advanced, that is, when k_max ≧ 0, k_max zeros are appended to the end of the vector so that its length is maintained. In other words, when k_max ≧ 0, v{x₁} = (x₀₁(0+k_max), x₀₁(1+k_max), x₀₁(2+k_max), ……, x₀₁(P−1), 0, 0, ……, 0).

When the phase is delayed, that is, when k_max < 0, |k_max| zeros are placed at the beginning of the vector so that its length is maintained. In other words, when k_max < 0, v{x₁} = (0, 0, ……, 0, x₀₁(0), x₀₁(1), x₀₁(2), ……, x₀₁(P−1+k_max)).

In the above, the vector v{x_01w} may be used in place of the vector v{x₀₁}. The vector v{x₁} can be generated in this way. That is, as described above, the cross-correlation can be calculated and used to determine the time shift.
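The weighted shift search can be sketched by filtering both vectors before the lag search. The function names are hypothetical, the weighting filter is assumed to be causal, and the two-tap averaging filter in the example merely stands in for a real low-pass weighting.

```python
def convolve_truncated(w, x):
    """Convolve impulse response w with x and keep only the first len(x) samples."""
    p = len(x)
    return [sum(w[m] * x[n - m] for m in range(min(n + 1, len(w))))
            for n in range(p)]

def best_shift(x, x0):
    """Lag k maximizing sum over n of x[n] * x0[n + k] (out-of-range samples are zero)."""
    p = len(x)
    def phi(k):
        return sum(x[n] * x0[n + k] for n in range(p) if 0 <= n + k < p)
    return max(range(-(p - 1), p), key=phi)

def weighted_best_shift(x, x0, w):
    """Time shift computed from the frequency-weighted versions of x and x0."""
    return best_shift(convolve_truncated(w, x), convolve_truncated(w, x0))

w = [0.5, 0.5]                         # stand-in weighting filter
x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
x0 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
k_max = weighted_best_shift(x, x0, w)
```

Only the shift search uses the filtered vectors here; the shift itself is then applied to the unfiltered representative vector, in line with the remark that the convolved head-related transfer function need not be weighted.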

In the description above, when evaluating the error (similarity) between the synthesized head-related transfer function and the original head-related transfer function, the gains A, B, and C that minimize |v{e}|² of the error signal vector (error vector) v{e} were calculated as in equation (12) above.

Here, a frequency-weighted version of v{e} may be used instead. Specifically, when v{e} is waveform data on the time axis and the result of convolving the impulse response w(n) of the weighting filter with v{e} is denoted v{e_w}, v{e_w} is given by equation (17) below.

　演算「＊」は、畳み込みを示す。ここでベクトルに対して演算子「＊」を用いているが、それは演算子の左右のベクトルを数列表記したもの同士の畳み込みを行った結果得られた数列を、ベクトル表記したものとする。つまりｖ｛ｘ｝＊ｖ｛ｙ｝は、ｘ（ｎ）＊ｙ（ｎ）の結果をベクトル表記したものである。以下、特に指定がない場合、ベクトルに対する演算子「＊」は、同様の扱いとなる。 The operator "*" indicates convolution. Here, the operator "*" is used on vectors, but this is the vector representation of the sequence obtained by convolving the sequence representations of the vectors on the left and right of the operator. In other words, v{x} * v{y} is the vector representation of the result of x(n) * y(n). Hereinafter, unless otherwise specified, the operator "*" on vectors will be treated in the same way.

Then, the gains A, B, and C can be calculated by substituting v{e_w} into equation (18) below and solving it.

Alternatively, v{e_w} can equivalently be calculated using equation (19) below.
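A reconstruction of equations (17) to (19) consistent with the text is given here. Equation (17) follows directly from the definition of v{e_w}; the form of equation (18) is an assumption (the stationarity conditions of the weighted error energy), and equation (19) follows from the linearity of convolution applied to equation (12).

```latex
\begin{align}
\vec{e}_w &= \vec{w} * \vec{e} \tag{17} \\
\frac{\partial\, |\vec{e}_w|^2}{\partial A} &= 0, \qquad
\frac{\partial\, |\vec{e}_w|^2}{\partial B} = 0, \qquad
\frac{\partial\, |\vec{e}_w|^2}{\partial C} = 0 \tag{18} \\
\vec{e}_w &= \vec{w} * \vec{x}
  - \left(A\,\vec{w} * \vec{x}_1 + B\,\vec{w} * \vec{x}_2 + C\,\vec{w} * \vec{x}_3\right) \tag{19}
\end{align}
```

Under this reading, the conditions in (18) are equivalent to requiring the weighted error to be orthogonal to each weighted representative vector, so the gains can be obtained with the same matrix solution as in the unweighted case, applied to the filtered vectors.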

　このようにして求められた時間シフトおよびゲインを用いて、目的信号を代表方向に振り分ける（パニング処理する）ことが可能となる。 Using the time shift and gain calculated in this way, it is possible to distribute the target signal to a representative direction (panning process).

　なお、パニング処理する目的信号及び畳み込む頭部伝達関数は、上述したものと同様であってもよい。すなわち、目的信号及び畳み込む頭部伝達関数には、重み付けフィルタを畳み込まなくてもよい。 Note that the target signal to be panned and the head-related transfer function to be convolved may be the same as those described above. In other words, the target signal and the head-related transfer function to be convolved do not need to be convolved with a weighting filter.

　このような周波数重み付けを導入することで、誤差をより小さく（精度良く）して、近似を行う周波数帯域を設定することが可能になる。とくに音楽や音声信号はその主要なエネルギーが低周波領域に集中しているため、低域側に重みをつける重み付けフィルタを用いることで、良好な性能が得られる。 By introducing this type of frequency weighting, it is possible to set the frequency band for approximation with smaller errors (higher accuracy). In particular, since the majority of energy in music and voice signals is concentrated in the low-frequency range, good performance can be achieved by using a weighting filter that weights the low-frequency side.

　また、インパルス応答がｗ（ｎ）である重み付けフィルタとベクトルの畳み込みを、重み付けフィルタのインパルス応答ｗ（ｎ）を１サンプルずつ時間シフトしたものを各行にもつ畳み込み行列Ｗで表すと、式（１７）を、下記式（２０）のように変形することも可能である。 Furthermore, if the convolution of a weighting filter with impulse response w(n) and a vector is expressed as a convolution matrix W, where each row contains the weighting filter's impulse response w(n) time-shifted by one sample, then equation (17) can be transformed into equation (20) below.

　この上で、下記の式（２１）にて、｜ｖ｛ｅ｝｜²を算出可能である。 Then, |v{e}|² can be calculated using the following equation (21).

　ここで、Ｗ^Tは、Ｗの転置行列を表す。 Here, W^T represents the transpose of W.
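The matrix form can be checked numerically with a small sketch. Equations (20) and (21) are not reproduced in this text, so the layout below is an assumption: W is taken as the Toeplitz matrix with W[i][j] = w[i - j], so that W·e equals the convolution w * e, and the weighted squared error is evaluated as eᵀ·Wᵀ·W·e. All function names are hypothetical.

```python
def convolve(x, y):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def convolution_matrix(w, n):
    """Toeplitz matrix of shape (len(w) + n - 1, n) with W[i][j] = w[i - j]."""
    rows = len(w) + n - 1
    return [
        [w[i - j] if 0 <= i - j < len(w) else 0.0 for j in range(n)]
        for i in range(rows)
    ]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def weighted_energy(w, e):
    """Evaluate e^T W^T W e, the squared norm of the filtered error."""
    big_w = convolution_matrix(w, len(e))
    n = len(e)
    # Gram matrix W^T W, built explicitly to mirror the matrix expression.
    wtw = [
        [sum(big_w[k][i] * big_w[k][j] for k in range(len(big_w))) for j in range(n)]
        for i in range(n)
    ]
    return sum(e[i] * wtw[i][j] * e[j] for i in range(n) for j in range(n))
```

Since WᵀW depends only on the weighting filter and the vector length, it could be computed once and reused for every direction.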

　また、重み付けフィルタは、相互相関の算出時と、ゲインの算出時とで、同じ特性のものを用いてもよく、異なる特性のものを用いてもよい。同じものを用いる場合は、元々の頭部伝達関数のセット全体に重み付けフィルタｗを畳み込んでから、上述したものと同様の処理にて、時間シフト量およびゲインを算出してもよい。 Furthermore, the weighting filter used when calculating the cross-correlation and when calculating the gain may have the same characteristics, or may have different characteristics. If the same filter is used, the weighting filter w may be convolved with the entire set of original head-related transfer functions, and then the time shift amount and gain may be calculated using the same processing as described above.

　なお、上述のように重み付けフィルタとして、ＬＰＦで低域に重み付けをして相互相関および最適ゲインを計算する場合、有効帯域を３０００Ｈｚ程度に制限した際は、上述した小数シフトは、しなくてもよい。この場合、オーバーサンプリングも不要となる。 Note that when an LPF is used as the weighting filter to weight the low frequencies in calculating the cross-correlation and the optimal gain as described above, the fractional shift described above need not be performed if the effective band is limited to around 3000 Hz. In this case, oversampling is also unnecessary.

　上述の実施形態では、音声信号を複数方向の代表方向にパニング処理して分配して、各代表方向の頭部伝達関数を畳み込んで表現している。具体的には、三方向のｖ｛ｘ｝の近似値＝Ａ×ｖ｛ｘ₁｝＋Ｂ×ｖ｛ｘ₂｝＋Ｃ×ｖ｛ｘ₃｝として目的方向の頭部伝達関数を代表方向の頭部伝達関数の和で模擬している。 In the above-described embodiment, the audio signal is distributed to a plurality of representative directions by the panning process and rendered by convolving the head-related transfer function of each representative direction. Specifically, the head-related transfer function of the target direction is simulated by a sum of the head-related transfer functions of three representative directions, using the approximation v{x} ≈ A × v{x₁} + B × v{x₂} + C × v{x₃}.

　このような場合、頭部伝達関数の高域の振幅特性は低域に比べて、オリジナルの頭部伝達関数よりもレベルが落ちる傾向がある。これは、リスニングポイントのわずかな位置ずれによる、わずか時間の誤差であっても、頭部伝達関数の高域成分の位相が大きく回転してしまい、パニング処理による足し算で相殺される傾向が強くなるためであった。 In such a case, the high-frequency amplitude of the synthesized head-related transfer function tends to fall below that of the original head-related transfer function, more so than at low frequencies. This is because even a slight timing error, caused by a slight displacement of the listening point, rotates the phase of the high-frequency components of the head-related transfer function considerably, so these components tend to cancel when summed in the panning process.

　これに対して、本実施形態に係る音声再生処理では、再生高域強調フィルタにより高域が減衰する傾向を補償してもよい。 In contrast, in the audio playback process according to this embodiment, the tendency for high frequencies to attenuate may be compensated for using a playback high-frequency emphasis filter.

　具体的には、パニング処理して代表方向の頭部伝達関数を畳み込んだ信号に、高域強調フィルタをかけることでその高域が減衰する傾向を補償することが可能である。または、等価的に、代表方向の頭部伝達関数そのものに事前に高域強調フィルタ処理をかけておき、高域を強調してもよい。この高域強調フィルタは、例えば、５０００～１５０００Ｈｚ以上をターンオーバー周波数として、＋１～＋１．５ｄＢ程度、高域を強調するようなインパルス応答の重み付けフィルタであってもよい。 Specifically, it is possible to compensate for the tendency for high frequencies to attenuate by applying a high-frequency emphasis filter to the signal obtained by panning and convolving the head-related transfer function of the representative direction. Alternatively, equivalently, the head-related transfer function of the representative direction itself can be subjected to high-frequency emphasis filter processing in advance to emphasize the high frequencies. This high-frequency emphasis filter may be, for example, an impulse response weighting filter that emphasizes high frequencies by approximately +1 to +1.5 dB, with a turnover frequency of 5,000 to 15,000 Hz or higher.
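One possible realization of such a high-frequency emphasis filter is a second-order high-shelf filter. The sketch below uses the biquad high-shelf formulas from the widely used Audio EQ Cookbook, which is a swapped-in design technique and not something the source specifies. The sample rate (48 kHz), the turnover frequency (8000 Hz), the gain (+1.5 dB) used in the usage line, and all function names are assumptions for illustration.

```python
import cmath
import math


def high_shelf_coefficients(turnover_hz, gain_db, sample_rate_hz, slope=1.0):
    """Normalized biquad (b, a) for a high shelf; Audio EQ Cookbook formulas."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * turnover_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2.0 * math.sqrt(
        (amp + 1.0 / amp) * (1.0 / slope - 1.0) + 2.0
    )
    k = 2.0 * math.sqrt(amp) * alpha
    b = [
        amp * ((amp + 1.0) + (amp - 1.0) * cos_w0 + k),
        -2.0 * amp * ((amp - 1.0) + (amp + 1.0) * cos_w0),
        amp * ((amp + 1.0) + (amp - 1.0) * cos_w0 - k),
    ]
    a = [
        (amp + 1.0) - (amp - 1.0) * cos_w0 + k,
        2.0 * ((amp - 1.0) - (amp + 1.0) * cos_w0),
        (amp + 1.0) - (amp - 1.0) * cos_w0 - k,
    ]
    return [c / a[0] for c in b], [c / a[0] for c in a]


def gain_db_at(b, a, freq_hz, sample_rate_hz):
    """Magnitude response in dB at one frequency."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))


def apply_biquad(b, a, signal):
    """Direct-form I filtering of a sequence of samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


# Hypothetical settings within the ranges mentioned in the text.
b, a = high_shelf_coefficients(8000.0, 1.5, 48000.0)
```

Because the filter is linear and time-invariant, applying it to the summed output or to each representative-direction head-related transfer function in advance gives the same result, which matches the equivalence stated above.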

　このように、パニング処理を用いて合成される音声の高域を強調するフィルタ処理を行うことで、より聴感上の立体感を高めることができる。 In this way, by using a panning process to perform filtering that emphasizes the high frequencies of the synthesized audio, the perceived three-dimensionality can be further enhanced.

　なお、上述したものと同様の小数シフトを行った場合であっても、通常の８～１６倍オーバーサンプリングでは、頭部伝達の高域成分のミスマッチは残るため、高域強調フィルタをかけてもよい。 Note that even when a fractional shift similar to that described above is performed, a mismatch in the high-frequency components of the head-related transfer functions remains with the usual 8x to 16x oversampling, so the high-frequency emphasis filter may still be applied.

　なお、パニング処理においては、データベース１０５に含まれる頭部伝達関数に応じて、時間シフト調整及びゲイン調整における調整量を決定し、再生音に対して、決定した調整量で時間シフト調整とゲイン調整とを適用して代表音に変換してもよい。頭部伝達関数に応じて、パニング処理に用いられる時間シフト調整及びゲイン調整における調整量は、最適な値が変化するため、まず、データベース１０５に含まれている頭部伝達関数を読み出した際に、それに合わせた時間シフト調整及びゲイン調整における調整量を決定することで、以降、この頭部伝達関数を用いる限りにおいて、同じ調整量を流用できるので、処理量の観点で有利である。 In addition, in the panning process, the adjustment amounts for time shift adjustment and gain adjustment may be determined according to the head-related transfer functions contained in database 105, and the time shift adjustment and gain adjustment may be applied to the reproduced sound using the determined adjustment amounts to convert it into a representative sound. Since the optimal values for the adjustment amounts for time shift adjustment and gain adjustment used in the panning process change depending on the head-related transfer function, by first reading out the head-related transfer function contained in database 105 and determining the adjustment amounts for time shift adjustment and gain adjustment that match it, the same adjustment amounts can be reused thereafter as long as this head-related transfer function is used, which is advantageous in terms of processing volume.

　頭部伝達関数テーブルは、データベース１０５に記憶された、頭部伝達関数を含むテーブルデータの一例であるが、頭部伝達関数テーブルには、頭部伝達関数とともに、その頭部伝達関数に応じて決定した時間シフト調整及びゲイン調整における調整量が互いに紐づけられて格納されている。つまり、データベース１０５に含まれる頭部伝達関数ごとに、時間シフト調整とゲイン調整における調整量を予め算出して頭部伝達関数テーブルを構築しておいてもよい。このように、それぞれの頭部伝達関数と当該調整量とを紐づけた頭部伝達関数テーブルのテーブルデータを、データベース１０５に記憶しておいてもよい。このように、データベース１０５は、記憶部の一例である。なお、頭部伝達関数ごとの調整量の算出は、生成部１３４、またはデコード処理部１１３で行ってもよい。または、外部の装置で調整量の算出を行い、外部の装置のメモリに記憶しておいてもよい。この場合は、外部の装置のメモリが記憶部の一例に相当する。 The head-related transfer function table is an example of table data containing head-related transfer functions stored in the database 105. The head-related transfer function table stores the head-related transfer functions together with the adjustment amounts for time shift adjustment and gain adjustment determined according to the head-related transfer functions, with the amounts linked to each other. In other words, a head-related transfer function table may be constructed by calculating the adjustment amounts for time shift adjustment and gain adjustment in advance for each head-related transfer function included in the database 105. In this way, table data of the head-related transfer function table linking each head-related transfer function with the adjustment amount may be stored in the database 105. In this way, the database 105 is an example of a storage unit. The calculation of the adjustment amount for each head-related transfer function may be performed by the generation unit 134 or the decoding processing unit 113. Alternatively, the adjustment amount may be calculated by an external device and stored in the memory of the external device. In this case, the memory of the external device corresponds to an example of a storage unit.

　また、時間シフト調整とゲイン調整における調整量を予め算出して、複数の代表方向のそれぞれと紐づけた調整量テーブルを構築しデータベース１０５に記憶しておいてもよい。調整量テーブルは複数の代表方向それぞれの頭部伝達関数と、時間シフト調整とゲイン調整における調整量とを紐づけたテーブルデータを含んでいてもよいし、複数の代表方向それぞれの頭部伝達関数は、予め取得しデータベース１０５に記憶している全天球（複数方向）の頭部伝達関数からレンダリング時またはシステムの初期化時に抽出してもよい。 Furthermore, the adjustment amounts for time shift adjustment and gain adjustment may be calculated in advance, and an adjustment amount table linked to each of multiple representative directions may be constructed and stored in database 105. The adjustment amount table may include table data linking the head-related transfer functions for each of multiple representative directions with the adjustment amounts for time shift adjustment and gain adjustment, or the head-related transfer functions for each of multiple representative directions may be extracted at the time of rendering or system initialization from head-related transfer functions for the entire celestial sphere (multiple directions) that have been acquired in advance and stored in database 105.

　また、調整量テーブルは、全天球の頭部伝達関数データベースの各頭部伝達関数の方向から受聴者の位置に到来する音信号に対して、例えば複数の代表方向のうちいずれの代表方向にその信号を分配するかの情報と、分配する際の代表方向毎に音声信号に乗じる、時間シフト調整量とゲイン調整量の情報とを含むテーブルであってもよい。 The adjustment amount table may also be a table that includes information on which of a plurality of representative directions to distribute a sound signal that arrives at the listener's position from the direction of each head-related transfer function in the spherical head-related transfer function database, and information on the time shift adjustment amount and gain adjustment amount to be multiplied by the sound signal for each representative direction when distributing the signal.

　頭部伝達関数の畳み込みの処理を行う際には、データベース１０５に記憶された調整量テーブルを参照し、適用する方向の頭部伝達関数に紐づけられた、時間シフト調整とゲイン調整における調整量を用いることで、畳み込みの処理ごとに調整量を算出する必要がなく、処理量の削減に寄与できる。 When performing the convolution process of the head-related transfer function, the adjustment amount table stored in the database 105 is referenced, and the adjustment amounts for the time shift adjustment and gain adjustment associated with the head-related transfer function of the direction to be applied are used. This eliminates the need to calculate the adjustment amount for each convolution process, contributing to a reduction in the amount of processing.
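The table lookup described above can be sketched as a simple cache keyed by direction. The key layout (azimuth, elevation), the entry layout (representative index, time shift, gain), and the class and method names are all hypothetical; the source does not specify a concrete data format.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Direction = Tuple[int, int]              # (azimuth_deg, elevation_deg), assumed key
Adjustment = Tuple[int, float, float]    # (representative index, time shift, gain)


class AdjustmentTable:
    """Stores adjustment amounts per direction so they are computed only once."""

    def __init__(self, compute: Callable[[Direction], List[Adjustment]]):
        self._compute = compute          # e.g. cross-correlation plus gain fitting
        self._entries: Dict[Direction, List[Adjustment]] = {}

    def build(self, directions: Iterable[Direction]) -> None:
        """Precompute entries, for example at system initialization."""
        for direction in directions:
            self._entries[direction] = self._compute(direction)

    def lookup(self, direction: Direction) -> List[Adjustment]:
        """Return stored adjustments; compute and store them on a miss."""
        if direction not in self._entries:
            self._entries[direction] = self._compute(direction)
        return self._entries[direction]
```

With this structure, each convolution step performs only a dictionary lookup, and the costly calculation runs once per head-related transfer function set.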

　なお、本発明の実施形態はデータベース１０５に含まれない新たな頭部伝達関数に適用することも可能である。音信号のデコード時や音響再生システム１００の電源投入時、あるいは音響再生システム１００の初期化時に、三次元音場全体の頭部伝達関数を新たに読み込み、本実施形態で開示した手法または別の手法で、頭部伝達関数ごとの調整量の算出を行ってもよい。その場合、頭部伝達関数と当該調整量とを紐づけたテーブルデータを、データベース１０５に記憶しておいてもよい。または、外部の装置で調整量の算出を行い、外部の装置のメモリに記憶しておいてもよい。頭部伝達関数の畳み込みの処理を行う際には、適用する頭部伝達関数に紐づけられた時間シフト調整とゲイン調整における調整量を参照することで、畳み込みの処理ごとに調整量を算出する必要がなく、処理量の削減に寄与できる。 Note that embodiments of the present invention can also be applied to new head-related transfer functions that are not included in the database 105. When decoding a sound signal, when powering on the sound reproduction system 100, or when initializing the sound reproduction system 100, the head-related transfer functions of the entire three-dimensional sound field may be newly read, and the adjustment amount for each head-related transfer function may be calculated using the method disclosed in this embodiment or another method. In this case, table data linking the head-related transfer functions to the adjustment amounts may be stored in the database 105. Alternatively, the adjustment amounts may be calculated by an external device and stored in the memory of the external device. When performing convolution processing of head-related transfer functions, by referencing the adjustment amounts for time shift adjustment and gain adjustment linked to the applied head-related transfer function, it is not necessary to calculate the adjustment amount for each convolution processing, which can contribute to reducing the amount of processing.

　このように、データベース１０５に記憶されていない新たな頭部伝達関数を読み込んだ場合に、データベース１０５に記憶する前に当該新たな頭部伝達関数に対して、パニング処理に用いられる時間シフト調整及びゲイン調整における調整量を決定し、新たな頭部伝達関数と、決定した調整量とを紐づけて頭部伝達関数テーブルを構築し、頭部伝達関数テーブルをデータベース１０５に記憶させてもよい。そして、パニング処理を行う際には、データベース１０５からこの調整量が読み出され、その調整量によって、シフト調整及びゲイン調整が適用される。なお、新たな頭部伝達関数は、それ以前にデータベース１０５に記憶されていたものが、音信号のデコード時、音響再生システム１００の電源投入時、又は、音響再生システム１００の初期化時等に一時的にデータベース１０５から取り除かれ、再びデータベース１０５に記憶しなおされたものであってもよい。再生音に対して、データベースに記憶された新たな頭部伝達関数に紐づけられた調整量で時間シフト調整とゲイン調整とを適用して代表音に変換し、代表点のそれぞれの位置からユーザの位置に向かう代表方向に応じた頭部伝達関数を代表音に畳み込むことで出力音信号を生成する第２生成部を、生成部１３４の代わりに備えてもよい。 In this way, when a new head-related transfer function not stored in database 105 is loaded, the adjustment amounts for the time shift adjustment and gain adjustment to be used in the panning process may be determined for the new head-related transfer function before storing it in database 105, and a head-related transfer function table may be constructed by linking the new head-related transfer function with the determined adjustment amounts, and the head-related transfer function table may be stored in database 105. Then, when performing the panning process, these adjustment amounts are read from database 105, and shift adjustment and gain adjustment are applied based on these adjustment amounts. Note that the new head-related transfer function may be one that was previously stored in database 105, but was temporarily removed from database 105 when decoding a sound signal, when powering on sound reproduction system 100, or when initializing sound reproduction system 100, and then re-stored in database 105. Instead of the generation unit 134, a second generation unit may be provided that applies time shift adjustment and gain adjustment to the playback sound using adjustment amounts linked to a new head-related transfer function stored in the database to convert it into a representative sound, and generates an output sound signal by convolving the representative sound with a head-related transfer function corresponding to a representative direction from each representative point toward the user's position.

　以下、情報処理装置６０１と音声提示デバイス６０２とに分かれた立体音響再生システム６００を用いた場合の上記のパニング処理による音質の劣化が生じる場合がある。具体的には、以上に説明したパニング処理において、時間シフト調整及びゲイン調整を行うことを説明した。時間シフト調整及びゲイン調整は、ユーザ９９の向きが一定である場合に、その一定の向きのユーザ９９に対して、あらかじめ設定された複数の代表方向からの音に頭部伝達関数を畳み込むことにより、代表方向以外の位置にある音源オブジェクトを知覚させるためのものである。より具体的には、ユーザ９９が０°の方向を向いているときに、６０°方向にある静止した音源オブジェクトを３０°及び９０°に設定された代表方向からの音に頭部伝達関数を畳み込んだ出力音信号を出力することで生成する場合、３０°の代表方向からの音と、９０°の代表方向からの音に３０°用及び９０°用に計算された時間シフト値及びゲイン値をかけて３０°及び９０°の頭部伝達関数を畳み込むことで６０°に音源オブジェクトの音像が形成される出力音信号を生成することができる。 Below, when using a stereophonic playback system 600 divided into an information processing device 601 and an audio presentation device 602, the above-mentioned panning process may result in degradation of sound quality. Specifically, the panning process described above involves performing time shift adjustment and gain adjustment. When the user 99 is facing a fixed direction, the time shift adjustment and gain adjustment are performed to allow the user 99 facing a fixed direction to perceive a sound source object located at a position other than the representative direction by convolving a head-related transfer function with sounds from multiple preset representative directions. More specifically, when the user 99 is facing a 0° direction and a stationary sound source object located at a 60° direction is generated by outputting an output sound signal in which a head-related transfer function is convolved with sounds from representative directions set at 30° and 90°, an output sound signal in which a sound image of the sound source object is formed at 60° can be generated by multiplying the sound from the 30° representative direction and the sound from the 90° representative direction by the time shift value and gain value calculated for 30° and 90°, and convolving the 30° and 90° head-related transfer functions.

　しかしながら、ユーザ９９が－３０°頭部を回転させると、位置関係は、３０°用に計算された時間シフト値及びゲイン値をかけたものに、６０°の頭部伝達関数を畳み込み、９０°用に計算された時間シフト値及びゲイン値をかけたものに、１２０°の頭部伝達関数を畳み込むことになる。このとき、ユーザ９９は、音源オブジェクトを、頭部回転後のユーザ９９の正面を０°として、正確に９０°の位置にあるように知覚できないことがある。つまり、３０°用に計算された時間シフト値及びゲイン値は、３０°の頭部伝達関数を畳み込むことを想定して計算されているので、６０°の頭部伝達関数を畳み込む場合に適切でなく、９０°用に計算された時間シフト値及びゲイン値は、９０°の頭部伝達関数を畳み込むことを想定して計算されているので、１２０°の頭部伝達関数を畳み込む場合に適切でない。その結果として、ユーザ９９が音源オブジェクトの定位されている位置を誤認することが生じる。 However, when the user 99 rotates their head by -30 degrees, the positional relationship is determined by convolving a 60° head-related transfer function with the time shift value and gain value calculated for 30 degrees, and convolving a 120° head-related transfer function with the time shift value and gain value calculated for 90 degrees. In this case, the user 99 may not perceive the sound source object as being exactly at a 90° position, with 0° being the front of the user 99 after head rotation. In other words, the time shift value and gain value calculated for 30 degrees are calculated assuming the convolution of a 30° head-related transfer function, and therefore are not appropriate for the convolution of a 60° head-related transfer function, and the time shift value and gain value calculated for 90 degrees are calculated assuming the convolution of a 90° head-related transfer function, and therefore are not appropriate for the convolution of a 120° head-related transfer function. As a result, the user 99 may misperceive the localized position of the sound source object.
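The angular bookkeeping in this example can be made explicit with a short sketch. The sign convention (relative azimuth = world azimuth minus head yaw) is chosen so that a head rotation of -30° moves the 60° source to 90°, as stated above; the function names are hypothetical.

```python
def relative_azimuth(azimuth_deg, head_yaw_deg):
    """Azimuth seen from the head, with the rotated front taken as 0 degrees."""
    return (azimuth_deg - head_yaw_deg) % 360.0


def hrtf_mismatch(representative_dirs_deg, head_yaw_deg):
    """Pairs (direction the adjustments were computed for, HRTF actually convolved)."""
    return [
        (float(rep), relative_azimuth(rep, head_yaw_deg))
        for rep in representative_dirs_deg
    ]
```

For representative directions of 30° and 90° and a head yaw of -30°, the adjustments computed for 30° and 90° end up paired with the 60° and 120° head-related transfer functions, which is the mismatch the paragraph describes.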

　これは、特に、仰角方向の成分を含むときの音源オブジェクトの位置を知覚させる場合において顕著である。 This is particularly noticeable when perceiving the position of a sound source object when the sound contains an elevation component.

　例えば、図２０に示す水平方向に配置された６つの代表点（代表方向）を用いる場合を例に説明する。図２０は、本実施の形態における代表方向の配置について説明するための図である。図２０では、代表方向Ｒ１、Ｒ２、Ｒ３、Ｒ４、Ｒ５、Ｒ６がそれぞれユーザ９９を水平に囲む３０°、９０°、１５０°、２１０°、２７０°、３３０°の位置に配置されている様子が示されている。音源オブジェクトが、エリアＡ１にある場合、代表方向Ｒ１、Ｒ２を用いて出力音信号が出力される。つまり、エリアＡ１にある音源オブジェクトの発する再生音は、代表方向Ｒ１、Ｒ２に分配されて、代表方向Ｒ１に分配された音には３０°の頭部伝達関数が畳み込まれ、代表方向Ｒ２に分配された音には９０°の頭部伝達関数が畳み込まれる。逆に言えば、代表方向Ｒ１には、３３０°から９０°の間の１２０°の範囲内にある音源オブジェクトの再生音が分配される。 For example, a case will be described in which six representative points (representative directions) arranged horizontally as shown in Figure 20 are used. Figure 20 is a diagram for explaining the arrangement of representative directions in this embodiment. Figure 20 shows that representative directions R1, R2, R3, R4, R5, and R6 are arranged at positions of 30°, 90°, 150°, 210°, 270°, and 330°, respectively, horizontally surrounding the user 99. When a sound source object is in area A1, an output sound signal is output using representative directions R1 and R2. In other words, the reproduced sound emitted by a sound source object in area A1 is distributed to representative directions R1 and R2, and a 30° head-related transfer function is convolved with the sound distributed to representative direction R1, and a 90° head-related transfer function is convolved with the sound distributed to representative direction R2. Conversely, the reproduced sound of a sound source object within a 120° range between 330° and 90° is distributed to representative direction R1.
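The assignment of a sound source to two adjacent representative directions in Figure 20 can be sketched as below. The sketch assumes equally spaced representative directions and treats a source lying exactly on a representative direction as belonging to the area that starts there; the source does not state how boundaries are handled, and the names are hypothetical.

```python
REPRESENTATIVE_AZIMUTHS = (30, 90, 150, 210, 270, 330)  # R1..R6 from Figure 20


def bracketing_directions(azimuth_deg, reps=REPRESENTATIVE_AZIMUTHS):
    """Return the two adjacent representative directions enclosing the azimuth."""
    spacing = 360 / len(reps)  # assumes equal spacing (60 degrees here)
    index = int(((azimuth_deg - reps[0]) % 360) // spacing)
    return reps[index], reps[(index + 1) % len(reps)]


def is_served_by(rep_deg, azimuth_deg, reps=REPRESENTATIVE_AZIMUTHS):
    """True if sound from the given azimuth is distributed to rep_deg."""
    return rep_deg in bracketing_directions(azimuth_deg, reps)
```

Under these assumptions, `is_served_by(30, az)` holds for azimuths from 330° up to (but not including) 90°, which corresponds to the 120° range that feeds representative direction R1.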

　ここで、図２１及び図２２それぞれの（ａ）に、上記の例のように、時間シフト値を計算するときに、頭部伝達関数同士の相互相関が最大となる時間シフト値を計算した場合について示している。図２１及び図２２は、実施の形態に係る時間シフト値計算の一例について説明するための図である。ここでは、ユーザ９９が頭部を回転した結果、各図の右肩に示された角度が代表方向として用いられることとなった場合の、時間シフト値（縦軸）を、頭部回転後の音源オブジェクトの方向（横軸）ごとに示している。なお、図２１は、左耳の例を示し、図２２は、右耳の例を示している。各図の（a）に示すように、３６０°のいずれかの箇所で、時間シフト値が飛び値となっており、このことが、定位感を損なう原因と推定される。そこで、このような飛び値が生じにくくなるように、時間シフト値の計算方向を検討した結果を、図２１及び図２２それぞれの（ｂ）に示している。 Here, (a) of each of Figures 21 and 22 shows the case where, as in the above example, the time shift value is calculated as the one that maximizes the cross-correlation between head-related transfer functions. Figures 21 and 22 are diagrams for explaining an example of time shift value calculation according to the embodiment. They show the time shift value (vertical axis) for each direction of the sound source object after head rotation (horizontal axis), for the case where, as a result of the user 99 rotating their head, the angle shown at the upper right of each figure is used as the representative direction. Figure 21 shows an example for the left ear, and Figure 22 shows an example for the right ear. As shown in (a) of each figure, the time shift value jumps discontinuously at some point within the 360° range, which is presumed to be a cause of impaired localization. The results of examining a method of calculating the time shift value that makes such jumps less likely are shown in (b) of each of Figures 21 and 22.

　ここでは、代表方向においては、代表方向そのものの頭部伝達関数の畳み込みを行うので、時間シフト値を０とすることができるという前提に基づいて、飛び値が生じにくいように所定位置の時間シフト値を計算するときに、その所定位置よりも代表方向に近い側の隣接する位置の時間シフト値を用いている。より詳しくは、代表方向の時間シフトを０とし、次に代表方向に最も近い（隣接する）位置の時間シフト値を計算する。その計算値が、上記の相互相関が最大となるときの時間シフト値よりも、代表方向の時間シフト値である０に近くなるように計算をしている。 Here, since the head-related transfer function of the representative direction itself is convolved in the representative direction, the time shift value can be set to 0. Based on this premise, when calculating the time shift value for a predetermined position to prevent value jumps, the time shift value of an adjacent position closer to the representative direction than the predetermined position is used. More specifically, the time shift for the representative direction is set to 0, and the time shift value for the position closest (adjacent) to the representative direction is calculated. The calculation is performed so that this calculated value is closer to the time shift value for the representative direction, 0, than the time shift value when the above cross-correlation is at its maximum.

　さらに詳しくは、所定位置の時間シフト値を計算する場合に、相互相関が示すいくつかのピークのうち、所定位置に隣接する位置の時間シフト値に最も近いピークの時間シフト値を所定位置の時間シフト値として計算する。このようにすることで、図２１及び図２２に示すように、時間シフト値が飛び値を形成しにくくなる。所定位置の時間シフト値を計算する場合に、相互相関が示すいくつかのピークのうち、最大となるピークを基準とした閾値範囲内で、所定位置に隣接する位置の時間シフト値に最も近いピークの時間シフト値を所定位置の時間シフト値として計算する。閾値範囲は、例えば、最大となるピークのときの相関値の係数倍以内（係数は１未満）とすればよい。係数として、例えば０．８などを設定すれば、相互相関が最大となる相関値から０．８倍以内と、比較的近く、かつ、隣接する位置の時間シフト値に近い時間シフト値が得られる。 More specifically, when calculating the time shift value for a predetermined position, the time shift value of the peak closest to the time shift value of a position adjacent to the predetermined position is calculated as the time shift value for the predetermined position, among the several peaks indicated by the cross-correlation. By doing so, it is less likely that the time shift value will form a jump, as shown in Figures 21 and 22. When calculating the time shift value for a predetermined position, the time shift value of the peak closest to the time shift value of a position adjacent to the predetermined position, among the several peaks indicated by the cross-correlation, within a threshold range based on the maximum peak, is calculated as the time shift value for the predetermined position. The threshold range may be, for example, within a coefficient multiple of the correlation value at the maximum peak (the coefficient is less than 1). If the coefficient is set to, for example, 0.8, a time shift value that is relatively close, within 0.8 times the correlation value at which the cross-correlation is maximum, and close to the time shift value of the adjacent position, is obtained.
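The peak selection rule can be sketched as follows, assuming integer lags and a cross-correlation already available as a mapping from lag to correlation value. The 0.8 coefficient is the example value from the text; the function names, the tie-breaking rule, and the fallback to the global maximum when no peak passes the threshold are assumptions.

```python
def local_peaks(corr):
    """Lags whose correlation is a local maximum (corr maps lag -> value)."""
    lags = sorted(corr)
    peaks = []
    for i, lag in enumerate(lags):
        left = corr[lags[i - 1]] if i > 0 else float("-inf")
        right = corr[lags[i + 1]] if i + 1 < len(lags) else float("-inf")
        if corr[lag] > left and corr[lag] >= right:
            peaks.append(lag)
    return peaks


def select_shift(corr, neighbour_shift, coeff=0.8):
    """Pick the peak closest to the neighbour's shift among sufficiently strong peaks."""
    best_lag = max(corr, key=corr.get)
    threshold = coeff * corr[best_lag]
    candidates = [lag for lag in local_peaks(corr) if corr[lag] >= threshold]
    if not candidates:  # assumed fallback, e.g. when the maximum is not positive
        return best_lag
    # Ties in distance are broken in favour of the larger correlation value.
    return min(candidates, key=lambda lag: (abs(lag - neighbour_shift), -corr[lag]))


def track_shifts(corrs_outward):
    """Walk outward from a representative direction, whose shift is taken as 0."""
    shifts = []
    previous = 0
    for corr in corrs_outward:
        previous = select_shift(corr, previous, coeff=0.8)
        shifts.append(previous)
    return shifts
```

Compared with always taking the global maximum, this rule trades a small loss in correlation for continuity of the time shift across neighbouring directions.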

　次に、以上のようにして決定した時間シフト値としたときのゲイン値の計算について説明する。図２３は、実施の形態に係るゲイン値計算の一例について説明するための図である。図２３に示すように、実際には、音源オブジェクトのある到来方向は、三次元空間における３つの代表方向に囲まれている。つまり、３つの代表方向のそれぞれにあるゲインをかけて音を分配することで、３つの代表方向からの音によって、到来方向上に音源オブジェクトを定位させることができる。ここで、式（１２）～（１５）に示すようにして、３つのゲインを計算してもよいが、例えば、以下のように３つの代表方向のうちの２つの代表方向で音源オブジェクトを到来方向に定位させるためのゲイン値を計算した後に、残りの１つの代表方向のゲイン値を計算してもよい。 Next, we will explain how to calculate gain values when the time shift value is determined as described above. Figure 23 is a diagram for explaining an example of gain value calculation according to an embodiment. As shown in Figure 23, in reality, the arrival direction of a sound source object is surrounded by three representative directions in three-dimensional space. In other words, by multiplying each of the three representative directions by a certain gain and distributing the sound, the sound source object can be localized in the arrival direction using sounds from the three representative directions. Here, the three gains may be calculated as shown in equations (12) to (15). However, for example, it is also possible to calculate gain values for localizing the sound source object in the arrival direction in two of the three representative directions as follows, and then calculate the gain value for the remaining representative direction.

　ここでは、最初に選択される２つの代表方向として、天頂に最も近い代表方向を除くようにする。こうすることで、水平方向に近い２つの代表方向のゲイン値を先に決定し、水平方向に近い２つの代表方向のゲイン値を重視したうえで、それに合わせて天頂に最も近い代表方向のゲイン値を決定することができる。すでに説明したように、水平方向のユーザ９９の定位感は、垂直方向に比べて比較的安定しやすいので、水平方向に近い２つの代表方向のゲイン値を重視して、それに合わせて天頂に最も近い代表方向のゲイン値を決定すれば、天頂に最も近い代表方向のゲイン値による定位感の悪化の影響を抑制しやすい。 Here, the representative direction closest to the zenith is excluded from the two representative directions selected initially. In this way, the gain values of the two representative directions close to the horizontal direction are determined first, and the gain values of the two representative directions close to the horizontal direction are emphasized, and the gain value of the representative direction closest to the zenith is determined accordingly. As already explained, the sense of localization of the user 99 in the horizontal direction is relatively more stable than in the vertical direction, so if the gain values of the two representative directions close to the horizontal direction are emphasized and the gain value of the representative direction closest to the zenith is determined accordingly, it is easy to suppress the impact of the gain value of the representative direction closest to the zenith on deteriorating the sense of localization.

　具体的には、以下のように計算する。まず、２つの代表方向のゲイン値を決定するために、以下式（２２）を解く。 Specifically, the calculation is performed as follows: First, to determine the gain values for the two representative directions, solve the following equation (22).

　Ｇ_１は、２つの代表方向のうち一方に対するゲイン値であり、Ｇ_２は、２つの代表方向のうち他方に対するゲイン値である。 G_1 is the gain value for one of the two representative directions, and G_2 is the gain value for the other.

　そして、Ｇ_１及びＧ_２の関係（つまり比率）を保持した状態で、以下式（２３）を解くことで天頂に最も近い代表方向のゲイン値を決定する。 Then, with the relationship (that is, the ratio) between G_1 and G_2 held fixed, the gain value of the representative direction closest to the zenith is determined by solving the following equation (23).

　αは、２つの代表方向に、それぞれのゲイン値をかけて和を取った合成ベクトルに対するゲイン値であり、Ｇ_ｔは、天頂に最も近い代表方向に対するゲイン値である。 α is the gain value applied to the composite vector obtained by multiplying the two representative directions by their respective gain values and summing them, and G_t is the gain value for the representative direction closest to the zenith.
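The two-stage gain calculation can be sketched as two small least-squares problems. Equations (22) and (23) are not reproduced in this text, so the sketch assumes that stage 1 minimizes |x - (G_1·x1 + G_2·x2)|² and stage 2 minimizes |x - (α·(G_1·x1 + G_2·x2) + G_t·xt)|², where x is the head-related transfer function vector of the arrival direction and all vectors are already time-shifted. The function names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def lstsq2(target, u, v):
    """Coefficients (p, q) minimizing |target - p*u - q*v|^2 (u, v independent)."""
    uu, uv, vv = dot(u, u), dot(u, v), dot(v, v)
    ut, vt = dot(u, target), dot(v, target)
    det = uu * vv - uv * uv
    return (ut * vv - vt * uv) / det, (vt * uu - ut * uv) / det


def two_stage_gains(x, x1, x2, xt):
    """Return final gains for (x1, x2, xt); the ratio of the first two is kept."""
    # Stage 1: the two representative directions close to the horizontal plane.
    g1, g2 = lstsq2(x, x1, x2)
    # Stage 2: scale the composite vector by alpha and add the zenith-side direction.
    composite = [g1 * a + g2 * b for a, b in zip(x1, x2)]
    alpha, gt = lstsq2(x, composite, xt)
    return alpha * g1, alpha * g2, gt
```

Because stage 2 only rescales the composite vector, the ratio G_1 : G_2 fixed in stage 1 is preserved in the final gains.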

　以上のようにすることで、例えば、図２４及び図２５に示すような効果が確認された。図２４及び図２５は、実施の形態に係るゲイン値計算の結果を示す図である。図２４は、仰角０°におけるＳＮＲをプロットした結果を示し、図２５は、仰角４６°におけるＳＮＲをプロットした結果を示している。図２４及び図２５では、比較例１がすでに述べた計算方法で計算したＳＮＲを示し、比較例２が別の計算方法におけるＳＮＲを示し、実施例１が上記の２つの代表方向で音源オブジェクトを到来方向に定位させるためのゲイン値を計算した後に、残りの１つの代表方向のゲイン値を計算する方法でのＳＮＲを示しており、それぞれ数値が大きいほど良好な結果であることを示している。 By doing the above, the effects shown in Figures 24 and 25, for example, were confirmed. Figures 24 and 25 are diagrams showing the results of gain value calculations according to the embodiment. Figure 24 shows the results of plotting the SNR at an elevation angle of 0°, and Figure 25 shows the results of plotting the SNR at an elevation angle of 46°. In Figures 24 and 25, Comparative Example 1 shows the SNR calculated using the calculation method already described, Comparative Example 2 shows the SNR using a different calculation method, and Example 1 shows the SNR calculated using a method in which gain values for localizing a sound source object in the arrival direction in the two representative directions are calculated, and then a gain value for the remaining representative direction is calculated; in each case, the larger the value, the better the result.

　図中に示すように、仰角０°では、比較例１及び実施例１に大きな違いはなかったものの、比較例２に比べると、比較例１、実施例１ともにいずれも良好な結果を示した。仰角４６°では、比較例１及び比較例２に比べ、特に回転角が９０°に近づくほど、実施例１が良好な結果を示した。 As shown in the figure, at an elevation angle of 0°, there was no significant difference between Comparative Example 1 and Example 1, but both Comparative Example 1 and Example 1 showed better results than Comparative Example 2. At an elevation angle of 46°, Example 1 showed better results than Comparative Examples 1 and 2, especially as the rotation angle approached 90°.

　次に、時間シフト値及びゲイン値の計算方法の別の例について説明する。以下では、水平方向における仮想的な位置である水平位置の頭部伝達関数ベクトルを設定し、上記同様に、天頂に最も近い代表方向を除く、すなわち、水平方向に近い２つの代表方向の頭部伝達関数ベクトルとの相互相関を計算して、それぞれ相関値が最大となる時間シフト値Ｓ_１及びＳ_２を求める。そして、この時間シフトを適用した頭部伝達関数ベクトルを使って、以下式（２４）を解くことで２つの代表方向のゲイン値Ｇ_１及びＧ_２を計算する。 Next, another example of the method of calculating the time shift value and gain value will be described. Here, a head-related transfer function vector is set for a horizontal position, which is a virtual position in the horizontal direction. As above, the representative direction closest to the zenith is excluded; that is, the cross-correlation with the head-related transfer function vector of each of the two representative directions close to the horizontal direction is calculated, and the time shift values S_1 and S_2 that maximize the respective correlation values are obtained. Then, using the head-related transfer function vectors to which these time shifts have been applied, the gain values G_1 and G_2 of the two representative directions are calculated by solving the following equation (24).

　さらに、新たなベクトルを以下式（２５）のように定義し、到来方向に対して、天頂に最も近い代表方向及び定義した新たなベクトルとの相互相関を計算して、それぞれ相関値が最大となる時間シフト値Ｓ_ｔ及びＳ_Ｘを求める。 Furthermore, a new vector is defined as shown in the following equation (25), and the cross-correlation between the arrival direction and each of the representative direction closest to the zenith and the newly defined vector is calculated to obtain the time shift values S_t and S_X that maximize the respective correlation values.

　そして、この時間シフトを適用した頭部伝達関数ベクトルを使って、以下式（２６）を解くことで天頂に最も近い代表方向及び定義した新たなベクトルのゲイン値を計算する。 Then, using the head-related transfer function vector to which this time shift has been applied, the representative direction closest to the zenith and the gain value of the newly defined vector are calculated by solving the following equation (26).

　結果として、以下式（２７）により、３つの代表方向における時間シフト値、及び、ゲイン値が得られる。 As a result, the time shift values and gain values in the three representative directions can be obtained using the following equation (27).

　このようにすることで、ユーザ９９の定位感が比較的安定している水平方向の代表方向の時間シフト値及びゲイン値が得られるので、頭部回転に伴い、情報処理装置６０１で想定していた頭部伝達関数とは異なる頭部伝達関数が音声提示デバイス６０２で代表方向信号に畳み込まれても定位感が損なわれにくくすることができる。 By doing this, it is possible to obtain time shift values and gain values for the horizontal representative direction, in which the user 99's sense of localization is relatively stable, so that the sense of localization is less likely to be impaired even if a head-related transfer function different from the head-related transfer function assumed by the information processing device 601 is convolved into the representative direction signal by the audio presentation device 602 as the user's head rotates.

　なお、以上のようにして計算した時間シフト値及びゲイン値は、ユーザ９９の頭部の回転後の角度を未回転であると置き換えた場合と比較すると、少しの数値のずれがあることがわかる。 It can be seen that the time shift and gain values calculated in the manner described above have a slight numerical discrepancy when compared to when the angle of the user's 99 head after rotation is replaced with the unrotated angle.

　到来方向が１５°、代表方向が３０°及び３３０°として、式（２４）～（２７）により計算した時間シフト値及びゲイン値を用いて、これらの代表方向に分配した音に、頭部が６０°回転したものと仮定して、代表方向３０°及び９０°の頭部伝達関数を畳み込むことで６０°回転後の７５°の方向の音を生成する場合（実施例２）を、到来方向が７５°、代表方向が３０°及び９０°として計算した時間シフト値及びゲイン値を用いて、これらの代表方向に分配した音に、単に代表方向３０°及び９０°の頭部伝達関数を畳み込むことで７５°の方向の音を生成する場合（計算上の正解値：比較例３）と比較する。結果を図２６及び図２７に示す。図２６は、実施の形態に係る時間シフト値計算の結果を検証するための図である。図２７は、実施の形態に係るゲイン値計算の結果を検証するための図である。 Consider the case (Example 2) in which the time shift values and gain values are calculated by equations (24) to (27) for an arrival direction of 15° and representative directions of 30° and 330°, and, assuming that the head has rotated by 60°, the head-related transfer functions for the representative directions of 30° and 90° are convolved with the sounds distributed to these representative directions to generate the sound in the 75° direction after the 60° rotation. This is compared with the case (the computationally correct values: Comparative Example 3) in which the time shift values and gain values are calculated for an arrival direction of 75° and representative directions of 30° and 90°, and the head-related transfer functions for the representative directions of 30° and 90° are simply convolved with the sounds distributed to these representative directions to generate the sound in the 75° direction. The results are shown in Figures 26 and 27. Figure 26 is a diagram for verifying the results of the time shift value calculation according to the embodiment. Figure 27 is a diagram for verifying the results of the gain value calculation according to the embodiment.

　図２６及び図２７には、破線により比較例３の計算結果を示し、実線により実施例２の計算結果を示している。図２６及び図２７に示すように、時間シフト値及びゲイン値ともに、実施例２と比較例３とでは、少しの誤差が生じている。つまり、本来の正解値との間に誤差が生じていることがわかる。そこで、例えば、このような誤差を縮小するため、実施例２において計算した時間シフト値及びゲイン値のそれぞれを、誤差の期待値を縮小するように、補正してもよい。具体的に、例えば、各代表点での誤差の平均値を減ずるなどして、時間シフト値及びゲイン値のそれぞれを補正してもよい。あるいは、期待値の計算の際に、よりゲイン値の大きな方向に重みをつけるように期待値を算出して、上記の補正をしてもよい。言い換えると、ある代表方向に分配される１２０°範囲のうち、どの方向に信号が集中しているかをゲイン値の分布から推定し、その方向に信号があるものとして、時間シフト値及びゲイン値の補正量を決定しても良い。 In FIGS. 26 and 27, the dashed lines indicate the calculation results for Comparative Example 3, and the solid lines indicate the calculation results for Example 2. As shown in FIGS. 26 and 27, both the time shift values and the gain values differ slightly between Example 2 and Comparative Example 3. In other words, the values of Example 2 deviate from the true correct values. Therefore, to reduce this error, the time shift values and gain values calculated in Example 2 may each be corrected so as to reduce the expected value of the error. Specifically, for example, the time shift values and gain values may each be corrected by subtracting the average value of the errors at each representative point. Alternatively, the expected value may be calculated with greater weight on the directions with larger gain values, and the above correction performed with it. In other words, the direction in which the signal is concentrated within the 120° range distributed to a given representative direction may be estimated from the distribution of gain values, and the correction amounts for the time shift values and gain values may be determined on the assumption that the signal lies in that direction.
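
The correction described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function names `expected_error` and `correct` are hypothetical, and it assumes that the errors (Example 2 value minus correct value) have already been sampled over several arrival directions for one representative point. Passing the gain values as weights gives the gain-weighted expected value mentioned above.

```python
def expected_error(errors, gains=None):
    """Mean error over sampled directions, optionally weighted by gain.

    errors: differences (calculated value - correct value), one per direction.
    gains:  optional non-negative weights, e.g. the gain value per direction.
    """
    weights = [1.0] * len(errors) if gains is None else list(gains)
    if len(weights) != len(errors):
        raise ValueError("errors and gains must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * e for w, e in zip(weights, errors)) / total


def correct(value, errors, gains=None):
    """Subtract the expected error from a time shift value or gain value."""
    return value - expected_error(errors, gains)
```

The same routine applies to time shift values and to gain values, since only the sampled errors differ.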

　ところで、情報処理装置６０１と音声提示デバイス６０２とに分かれた立体音響再生システム６００を用いた場合の上記のパニング処理により生じる音質の劣化を抑制するための方法として、以下に説明する方法も考えられる。 Incidentally, the following method can be considered as a way to suppress the deterioration in sound quality caused by the above-mentioned panning process when using a stereophonic reproduction system 600 that is divided into an information processing device 601 and an audio presentation device 602.

　上記のように、音質の劣化は、本来用いられるべきゲイン値及びシフト値とは異なるゲイン値及びシフト値が用いられることに起因して生じる。そのため、用いられるゲイン値及びシフト値に、本来用いられるべきゲイン値及びシフト値に近くなるような補正がされれば、このような音質の劣化を抑制することができるといえる。以下では具体的にゲイン値に着目して説明するが、同様のことをシフト値に対しても適用することができる。 As mentioned above, the degradation in sound quality arises because gain values and shift values that differ from those that should have been used are applied. Therefore, if the gain values and shift values actually used are corrected so that they approach the values that should have been used, this degradation in sound quality can be suppressed. The following explanation focuses specifically on gain values, but the same can be applied to shift values as well.

　例えば、水平面内の方向（すなわちアジマス方向又は方位）が１５°にある音源オブジェクトについて考える。水平面内においては、代表方向として、－１５０°、－９０°、－３０°、３０°、９０°、及び、１５０°が設定されているものとする。 For example, consider a sound source object whose direction in the horizontal plane (i.e., azimuth direction or bearing) is 15°. In the horizontal plane, the representative directions are set to -150°, -90°, -30°, 30°, 90°, and 150°.

　この音源オブジェクトを代表方向からの代表音として表現する場合に、例えば、－３０°及び３０°の代表方向にパニングされた代表音が生成される。その際のゲイン値としては、－３０°の代表方向を考慮したゲイン値、及び３０°の代表方向を考慮したゲイン値が設定される。しかしながら、ユーザ９９が頭部を６０°回転させた場合、音源オブジェクトは－４５°の位置に変化し、代表方向として、－９０°及び－３０°が用いられることになるが、そのとき、ゲイン値には－３０°の代表方向を考慮したゲイン値、及び３０°の代表方向を考慮したゲイン値が設定されているため音質の劣化が生じる。 When this sound source object is expressed as a representative sound from a representative direction, for example, a representative sound panned to representative directions of -30° and 30° is generated. The gain values at this time are set to a gain value that takes into account the representative direction of -30° and a gain value that takes into account the representative direction of 30°. However, if the user 99 rotates their head by 60°, the sound source object will change to a position of -45°, and the representative directions will be -90° and -30°. However, at that time, the gain values that take into account the representative direction of -30° and a gain value that takes into account the representative direction of 30° are set, resulting in a degradation in sound quality.
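
The arithmetic in this example can be checked with a short sketch. The helper names (`wrap`, `apparent_azimuth`, `bracketing_pair`) are hypothetical; the sign convention (apparent azimuth = source azimuth minus head rotation) is inferred from the 15° to -45° example above.

```python
REPRESENTATIVE_AZIMUTHS = [-150, -90, -30, 30, 90, 150]


def wrap(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180) % 360 - 180


def apparent_azimuth(source_az, head_rotation):
    """Azimuth of the source relative to the head after the head rotates."""
    return wrap(source_az - head_rotation)


def bracketing_pair(azimuth, reps=REPRESENTATIVE_AZIMUTHS):
    """Return the two adjacent representative directions enclosing azimuth."""
    az = wrap(azimuth)
    ordered = sorted(reps)
    for lo, hi in zip(ordered, ordered[1:]):
        if lo <= az <= hi:
            return lo, hi
    # Otherwise the azimuth lies in the sector that crosses +/-180 degrees.
    return ordered[-1], ordered[0]
```

With these definitions, a source at 15° is enclosed by -30° and 30°; after a 60° head rotation it appears at -45° and is enclosed by -90° and -30°, matching the text.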

　そこで、音源オブジェクトに対して設定するゲイン値として、－３０°及び３０°のそれぞれを起点に、例えば±６０°の範囲、すなわち、－９０°～３０°及び－３０°～９０°の範囲で、複数の方向（仮想代表点）を選択してそれぞれゲイン値を計算し、これらのゲイン値を全てあるいはこれらのゲイン値から複数個を選択して用いた１つのゲイン値（以下、包括ゲイン値と称する）を設定するとよい。仮に、－９０°～３０°の範囲で、５°刻みの複数の方向を選択した場合、－９０°、－８５°、－８０°、・・・、２０°、２５°、３０°の２５方向の分のゲイン値を計算できる。そして、これらの２５個のゲイン値を全て用いた１つのゲイン値を音源オブジェクトのゲイン値に設定すれば、上記のように頭部を６０°回転したとしても、包括ゲイン値には－９０°の分のゲイン値が計算に用いられているので、単純に－３０°の分のゲイン値を用いる場合に比べて音質の劣化を抑制することができる。３０°を起点とした－３０°～９０°の範囲についても、包括ゲイン値には－３０°の分のゲイン値が計算に用いられているので、単純に３０°の分のゲイン値を用いる場合に比べて音質の劣化を抑制することができる。 Therefore, as the gain value to be set for the sound source object, it is advisable to select multiple directions (virtual representative points) within a range of, for example, ±60°, starting from -30° and 30°, i.e., -90° to 30° and -30° to 90°, and calculate gain values for each. A single gain value (hereinafter referred to as a global gain value) can then be set using all or several of these gain values. For example, if multiple directions are selected in 5° increments within the range of -90° to 30°, gain values for 25 directions, i.e., -90°, -85°, -80°, ..., 20°, 25°, and 30°, can be calculated. If a single gain value using all 25 gain values is then set as the gain value for the sound source object, even if the head is rotated 60° as described above, the global gain value will already include the gain value for -90°, which reduces degradation in sound quality compared to simply using a gain value for -30°. Even in the range of -30° to 90° starting from 30°, the gain value for -30° is used in the calculation of the global gain value, so deterioration in sound quality can be suppressed compared to simply using the gain value for 30°.
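
The selection of virtual representative points can be illustrated as below. The function name `virtual_directions` is hypothetical, and the sketch assumes that the half range is an integer multiple of the step.

```python
def virtual_directions(origin, half_range=60, step=5):
    """List the directions origin-half_range .. origin+half_range in steps.

    Assumes half_range is an integer multiple of step (degrees).
    """
    if half_range % step != 0:
        raise ValueError("half_range must be a multiple of step")
    count = half_range // step
    return [origin + step * k for k in range(-count, count + 1)]
```

For the origin -30° this yields the 25 directions from -90° to 30°, so the representative direction -90° used after a 60° head rotation is among the directions that contribute to the global gain value.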

　また、例えば、頭部の回転量が１２０°であるときを考える。音源オブジェクトは－１０５°の位置に変化し、代表方向として、－１５０°及び－９０°が用いられることになる。包括ゲイン値は、－９０°～３０°及び－３０°～９０°までの、－１５０°及び－９０°の範囲外の分のゲイン値が考慮されている。この場合でも、－９０°及び－３０°の分のゲイン値までが考慮されているため、単純に－３０°及び３０°の分のゲイン値を用いる場合に比べて、音質の劣化を抑制することができる。あるいは、起点の方向から考慮する範囲を±１２０°の範囲にして、頭部の回転量が１２０°の場合でもその時用いられる代表方向の分のゲイン値が包括ゲイン値に含まれるようにしてもよい。つまり、起点の方向からの考慮する範囲は±１８０°未満であればよく、±１７０°、±１６０°、±１５０°、±１４０°、±１３０°、±１２０°、±１１０°、±１００°、±９０°、±８０°、±７０°、±６０°、±５０°、±４０°、±３０°、±２０°、±１０°、及び、±５°など、任意に設定可能になっていればよい。 Next, consider the case where the head rotation amount is 120°. The sound source object moves to a position of -105°, and -150° and -90° are used as the representative directions. The global gain values take into account the gain values for -90° to 30° and for -30° to 90°, which lie outside the range between -150° and -90°. Even in this case, because gain values as far as those for -90° and -30° are taken into account, degradation in sound quality can be suppressed compared to simply using the gain values for -30° and 30°. Alternatively, the range considered from the starting direction may be set to ±120°, so that even when the head rotation amount is 120°, the gain values for the representative directions used at that time are included in the global gain value. In other words, the range considered from the starting direction only needs to be less than ±180°, and may be set arbitrarily, for example to ±170°, ±160°, ±150°, ±140°, ±130°, ±120°, ±110°, ±100°, ±90°, ±80°, ±70°, ±60°, ±50°, ±40°, ±30°, ±20°, ±10°, or ±5°.

　また、これら範囲内において、選択する方向の密度は、５°刻みに限られない。例えば、範囲内において選択する方向の密度は、１０°刻み、９°刻み、８°刻み、７°刻み、６°刻み、５°刻み、４°刻み、３°刻み、２°刻み、１°刻み、又は、１°未満の刻みなど、任意に設定可能になっていればよい。一例として、範囲内において選択する方向の密度は、読み込まれている頭部伝達関数のセットにおける選択可能な方向の密度と一致していればよい。ただし、範囲内において選択する方向の密度がより密であるほど、計算処理量が増加するため、範囲内において選択する方向の密度として、読み込まれている頭部伝達関数のセットにおける選択可能な方向の密度から処理能力に応じた割合で間引かれた方向の密度に設定されてもよい。 Furthermore, within these ranges, the density of the directions selected is not limited to 5° increments. For example, the density of the directions selected within the ranges can be set arbitrarily, such as 10° increments, 9° increments, 8° increments, 7° increments, 6° increments, 5° increments, 4° increments, 3° increments, 2° increments, 1° increments, or increments of less than 1°. As an example, the density of the directions selected within the ranges only needs to match the density of the selectable directions in the set of head-related transfer functions that has been loaded. However, since the denser the density of the directions selected within the ranges, the greater the amount of calculation processing required, the density of the directions selected within the ranges may be set to a density of directions that has been thinned out from the density of the selectable directions in the set of head-related transfer functions that has been loaded at a rate that corresponds to the processing power.

　以下、包括ゲイン値の計算方法と、その効果について実施例を元に説明する。以下の実施例では、起点を中心に±６０°の範囲について、範囲内において選択する方向の密度が５°刻みである例を説明する。包括ゲイン値の計算は、例えば、範囲内で選択された複数の方向のそれぞれの方向の分のゲイン値を平均した平均値として算出される。ここでの平均は単純平均でも良いし、より高い頻度で起こると考えられる回転角の小さい方が１に、その発生頻度が低いと考えられる回転角が大きい方が０に近づくような重みをつけた加重平均としてもよい。そのため、この加重平均は、仮想代表点のそれぞれの方向と、それらに対応するもとの方向との角度差に基づく加重平均であるといえる。より具体的には、この角度差が小さい仮想代表点での調整量に対する重みを、角度差が大きい仮想代表点での調整量に対する重みよりも、大きくなるように計算した加重平均値を包括ゲイン値としてもよい。あるいは、平均することに代えて、包括ゲイン値として、選択した仮想代表点全てについて、既に述べたような誤差ベクトルを最小化する計算を行い、得られた１つのゲイン値を用いてもよい。 The following describes a method for calculating a global gain value and its effects based on an example. In the following example, an example is described in which the density of directions selected within a range of ±60° from the starting point is in 5° increments. The global gain value is calculated, for example, as the average of the gain values for each of the multiple directions selected within the range. This average may be a simple average, or a weighted average in which smaller rotation angles, which are thought to occur more frequently, approach 1 and larger rotation angles, which are thought to occur less frequently, approach 0. Therefore, this weighted average can be considered a weighted average based on the angular difference between each direction of the virtual representative points and their corresponding original directions. More specifically, the global gain value may be a weighted average calculated so that the weight assigned to the adjustment amount at virtual representative points with small angular differences is greater than the weight assigned to the adjustment amount at virtual representative points with large angular differences. Alternatively, instead of averaging, a single gain value obtained by performing a calculation to minimize the error vector as described above for all selected virtual representative points may be used as the global gain value.
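
The simple and weighted averages described above can be sketched as follows. The names `global_gain` and `raised_cosine_weight` are hypothetical, and the raised-cosine weight is only one assumed shape that equals 1 at zero angular difference and falls to 0 at the edge of the range; the text does not specify the weighting function. The per-direction gain values are taken as given inputs.

```python
import math


def raised_cosine_weight(offset_deg, half_range=60.0):
    """Assumed weight: 1 at offset 0, decreasing to 0 at +/-half_range."""
    x = min(abs(offset_deg) / half_range, 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * x))


def global_gain(gains_by_offset, weight_fn=None):
    """Average per-direction gains into one global gain value.

    gains_by_offset: dict mapping the angular difference (degrees) between a
    virtual representative point and the starting direction to its gain.
    weight_fn: optional weight as a function of that angular difference;
    None gives the simple average.
    """
    if weight_fn is None:
        weight_fn = lambda offset: 1.0
    weights = {off: weight_fn(off) for off in gains_by_offset}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[off] * g for off, g in gains_by_offset.items()) / total
```

Small head rotations then dominate the weighted result, while the simple average treats every virtual representative point equally.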

　以下では、音源オブジェクトの位置を水平方向に挟む２か所の代表点について、それぞれ平均値によって包括ゲイン値を計算する例を示す。ただし、このように包括ゲイン値を計算することは、音源オブジェクトを挟む２か所に限らず、音源オブジェクトを囲む３か所以上の代表点に対して行ってもよい。 Below is an example in which the global gain value is calculated by averaging the two representative points that horizontally sandwich the position of the sound source object. However, calculating the global gain value in this manner is not limited to the two points that sandwich the sound source object, and can also be performed for three or more representative points that surround the sound source object.

　ある音源オブジェクトを水平方向に挟む２つの代表方向にそれぞれ対応する２つの包括ゲイン値は、以下式（２８）によって計算することができる。 Two global gain values corresponding to two representative directions horizontally sandwiching a sound source object can be calculated using the following equation (28).

　なお、φは、２つの代表方向のうち一方の方向を示し、αは、当該代表方向と音源オブジェクトの方向の差分であり音源オブジェクトの方向(目的方向)はφ－αとなる。なお、２つの代表方向のうちもう一方の方向は、φ－６０°として表される。式（２８）は、ＭをＭ＝５ｋ（－１２≦ｋ≦１２）に置き換えることで、５°刻みのゲイン値の平均値にすることができる。以下式（２９）は、式（２８）をＭ＝５ｋ（－１２≦ｋ≦１２）に置き換えたものである。 Note that φ indicates one of the two representative directions, and α indicates the difference between the representative direction and the direction of the sound source object, with the direction of the sound source object (target direction) being φ-α. Note that the other of the two representative directions is expressed as φ-60°. Equation (28) can be made into the average value of gain values in 5° increments by replacing M with M = 5k (-12≦k≦12). Equation (29) below is obtained by replacing equation (28) with M = 5k (-12≦k≦12).

　式（２９）に基づき計算した包括ゲイン値を用いて、定位実験を行った結果を図２８～図３１に示す。図２８～図３１は、実施の形態に係るゲイン値計算の結果を検証するための定位実験の結果を示す図である。なお、以下では、実施例として、包括ゲイン値を用いたものを示し、比較例として、単純に頭部回転前のゲイン値を用いたものを示している。 The results of a localization experiment using the global gain values calculated based on equation (29) are shown in Figures 28 to 31. Figures 28 to 31 show the results of a localization experiment conducted to verify the results of the gain value calculations according to the embodiment. Note that, in the following, the example uses the global gain values, and the comparative example simply uses the gain values from before the head rotation.

　図２８では、仰角０°、すなわち水平面内における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との前後誤り率と、被験者の頭部の回転角との関係を示している。図２８に示すように、実施例では、特に頭部を回転した場合に、比較例に比べて誤り率が小さくなることが示された。 Figure 28 shows the relationship between the front-to-back error rate between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 0°, i.e., in the horizontal plane, and the rotation angle of the subject's head. As shown in Figure 28, the example showed a smaller error rate than the comparative example, especially when the head was rotated.

　また、図２９は、仰角０°における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との平均定位誤差と、被験者の頭部の回転角との関係を示している。図２９に示すように、実施例では、特に頭部を回転した場合に、比較例に比べて誤差が小さくなることが示された。 Furthermore, Figure 29 shows the relationship between the average localization error between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 0°, and the rotation angle of the subject's head. As shown in Figure 29, the example showed a smaller error than the comparative example, especially when the head was rotated.

　また、図３０では、仰角４０°における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との前後誤り率と、被験者の頭部の回転角との関係を示している。図３０に示すように、実施例では、比較例に比べて誤り率の改善効果が小さいことが示された。 Furthermore, Figure 30 shows the relationship between the front-to-back error rate between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 40°, and the rotation angle of the subject's head. As shown in Figure 30, the improvement in error rate of the example over the comparative example was small.

　また、図３１は、仰角４０°における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との平均定位誤差と、被験者の頭部の回転角との関係を示している。図３１に示すように、実施例では、比較例に比べて誤差の縮小効果が小さいことが示された。 Furthermore, Figure 31 shows the relationship between the average localization error between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 40°, and the rotation angle of the subject's head. As shown in Figure 31, the reduction in error of the example relative to the comparative example was small.

　特に、図３０及び図３１に示す仰角４０°における音源オブジェクトについての定位実験の結果から、仰角方向における代表方向（例えば、天頂の代表方向）言い換えると、仰角に関する代表点（仰角代表点）のゲイン値が、包括ゲイン値を用いることの効果を縮小しているものと予想し、以下の改善を考案した。 In particular, based on the results of the localization experiment on the sound source object at an elevation angle of 40° shown in Figures 30 and 31, we predicted that the gain value of the representative direction in the elevation angle direction (for example, the representative direction of the zenith), in other words, the representative point in relation to the elevation angle (elevation angle representative point), reduces the effect of using the global gain value, and devised the following improvement.

　すなわち、仰角代表点（音源オブジェクトを囲む３以上の代表点のうち、垂直成分を含む代表点）のゲイン値を、音源オブジェクトの方向によらず、一定の値（フィックスゲイン値）とすれば、音源オブジェクトに対する、水平面内での音の定位に仰角代表点が及ぼす影響を抑制することができると考えられる。天頂と相対する天底側の俯角代表点についても同様のものと考えられるので、ここでは、仰角代表点についてのみ着目して説明する。 In other words, if the gain value of the elevation representative point (the representative point containing a vertical component among the three or more representative points surrounding the sound source object) is set to a constant value (fixed gain value) regardless of the direction of the sound source object, it is thought that the effect of the elevation representative point on the sound localization in the horizontal plane relative to the sound source object can be suppressed. The same is thought to be true for the depression representative point on the nadir side opposite the zenith, so here we will only focus on the elevation representative point.

　仰角代表点のゲイン値およびシフト値は、計算の過程上、水平方向の音の定位にも影響する成分を含んでいる。そのため、上記のように包括ゲイン値を用いて、水平方向における用いられるべきゲイン値の調整をしたとしても、仰角代表点の分のゲイン値およびシフト値が頭部回転前の方位に対応したゲイン値およびシフト値を維持しているために包括ゲイン値を用いる効果が小さくなったものと考えられる。そのため、第１段階として、仰角代表点の分のゲイン値を音源オブジェクトの方向(目的方向)にかかわらず所定値に維持するように設定したうえで、第２段階として、その所定値のゲイン値に対応した、包括ゲイン値を設定するように、段階分けすることが効果的であると考えられる。つまり、仰角代表点の分のゲイン値を、水平面内における代表点の分のゲイン値と分離して計算することで、仰角代表点の分のゲイン値が水平方向の音の定位に及ぼす影響を縮小することを行う。このことは、特に仰角成分を含む場合に頭部回転を伴うような用途において、包括ゲイン値を用いる場合に限らず有効であるといえる。つまり、仰角代表点の分のゲイン値にフィックスゲイン値を用いることを、上記の別のゲイン値調整に適用することも有効である。 The gain and shift values for the elevation representative points contain components that affect horizontal sound localization during the calculation process. Therefore, even if the global gain value is used to adjust the gain value to be used in the horizontal direction as described above, the effect of using the global gain value is thought to be reduced because the gain and shift values for the elevation representative points maintain the gain and shift values corresponding to the direction before head rotation. Therefore, it is thought to be effective to divide the calculation into stages: in the first stage, the gain value for the elevation representative points is set to maintain a predetermined value regardless of the direction of the sound source object (target direction), and then in the second stage, the global gain value is set corresponding to that predetermined gain value. In other words, by calculating the gain value for the elevation representative points separately from the gain values for representative points in the horizontal plane, the effect of the gain value for the elevation representative points on horizontal sound localization is reduced. This is effective not only when global gain values are used, but also in applications that involve head rotation, especially when the elevation component is included. 
In other words, it is also effective to use a fixed gain value as the gain value for the elevation angle representative point in the separate gain value adjustments mentioned above.

　以下、具体的にフィックスゲイン値の設定方法と、一例として設定されたフィックスゲイン値を用いた定位実験の結果について説明する。 Below, we will explain in detail how to set the fixed gain value and show the results of a localization experiment using the set fixed gain value as an example.

　フィックスゲイン値は、音源オブジェクトの水平面内の方向には依存しないようにするものの、音源オブジェクトの垂直面内の方向にも依存しないようにすると、音源オブジェクトに仰角方向の定位をさせることが困難になる。つまり、フィックスゲイン値は、音源オブジェクトの垂直面内の方向には、依存するように設定する必要がある。そこで、垂直面内のいくつかの方向の音源オブジェクトのそれぞれについて、適切なフィックスゲイン値を設定するようにした。音源オブジェクトが水平面内にあれば（仰角０°）、フィックスゲイン値は、０である必要があり、音源オブジェクトが天頂方向にあれば（仰角９０°）、フィックスゲイン値は、１である必要がある。つまり、フィックスゲイン値は、垂直面内での音源オブジェクトの方向が０°の場合から９０°の場合までに、０から１の間で漸増する値である必要がある。 The fixed gain value is made independent of the direction of the sound source object in the horizontal plane; however, if it were also made independent of the direction of the sound source object in the vertical plane, it would be difficult to localize the sound source object in the elevation direction. In other words, the fixed gain value needs to be set so that it depends on the direction of the sound source object in the vertical plane. Therefore, an appropriate fixed gain value was set for each of several sound source object directions in the vertical plane. If the sound source object is in the horizontal plane (elevation angle 0°), the fixed gain value needs to be 0, and if the sound source object is in the zenith direction (elevation angle 90°), the fixed gain value needs to be 1. In other words, the fixed gain value needs to increase gradually from 0 to 1 as the direction of the sound source object in the vertical plane changes from 0° to 90°.

　以下、音源オブジェクトの仰角方向が０°より大きく、９０°より小さい範囲内について考える。音源オブジェクトが仰角２０°のときに、フィックスゲイン値として、０．０５、０．１０、０．１５、及び、０．３０を設定したとき、頭部の回転角ごとにオリジナル（つまり、回転後の角度に合わせた適切なゲイン値）での頭部伝達関数におけるＩＬＤ（Interaural Level Difference）と、フィックスゲイン値を伴う頭部伝達関数におけるＩＬＤとで、相互相関が最大となる方位のシフト値をとって、位相ずれを算出した。その結果、位相ずれが最小になるゲイン値を、音源オブジェクトが仰角２０°のときのフィックスゲイン値として採用した。結果として、音源オブジェクトが仰角２０°のときのフィックスゲイン値は、０．１０に設定された。 Below, we consider the range where the elevation angle of the sound source object is greater than 0° and less than 90°. When the sound source object has an elevation angle of 20° and the fixed gain values are set to 0.05, 0.10, 0.15, and 0.30, the phase shift was calculated by taking the azimuth shift value that maximizes the cross-correlation between the ILD (Interaural Level Difference) in the head related transfer function at the original (i.e., the appropriate gain value that matches the angle after rotation) and the ILD in the head related transfer function with the fixed gain value. As a result, the gain value that minimizes the phase shift was adopted as the fixed gain value when the sound source object has an elevation angle of 20°. As a result, the fixed gain value when the sound source object has an elevation angle of 20° was set to 0.10.

　音源オブジェクトが仰角４０°のとき、仰角６０°のとき、仰角８０°のときのそれぞれについても同様にいくつかのフィックスゲイン値を設定して実験を行い、０．１５、０．３０、及び、０．６０のフィックスゲイン値を設定した。図３２は、実施の形態に係る設定されたフィックスゲイン値を示す図である。図中に示すように、フィックスゲイン値は、音源オブジェクトの仰角方向が小さい範囲（例えば仰角４０°以下）では、比較的小さな値に設定されることが適切であり、一方で、仰角方向が大きい範囲（例えば仰角６０°以上）では、急速に値が大きくなることが適切であることがわかる。つまり、フィックスゲイン値としては、音源オブジェクトの仰角方向に単純比例するよりも、指数関数的にカーブさせることが適切であるといえる。 Similarly, experiments were conducted by setting several fixed gain values when the sound source object had an elevation angle of 40°, 60°, and 80°, and fixed gain values of 0.15, 0.30, and 0.60 were set. Figure 32 is a diagram showing the fixed gain values set in this embodiment. As shown in the figure, it is appropriate for the fixed gain value to be set to a relatively small value when the elevation angle of the sound source object is small (for example, an elevation angle of 40° or less), while it is appropriate for the value to increase rapidly when the elevation angle is large (for example, an elevation angle of 60° or more). In other words, it can be said that it is more appropriate for the fixed gain value to curve exponentially rather than being simply proportional to the elevation angle of the sound source object.
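
The fixed gain values stated above (0 at 0°, 0.10 at 20°, 0.15 at 40°, 0.30 at 60°, 0.60 at 80°, and 1 at 90°) can be held as a lookup table. The sketch below uses piecewise linear interpolation between these points, which is an assumption made here for illustration; the text only states the values at the listed elevation angles. The names `FIXED_GAIN_TABLE` and `fixed_gain` are hypothetical.

```python
from bisect import bisect_right

# (elevation in degrees, fixed gain value) pairs taken from the text.
FIXED_GAIN_TABLE = [
    (0, 0.0), (20, 0.10), (40, 0.15), (60, 0.30), (80, 0.60), (90, 1.0),
]


def fixed_gain(elevation_deg):
    """Fixed gain for an elevation in [0, 90], linearly interpolated."""
    if not 0 <= elevation_deg <= 90:
        raise ValueError("elevation must be between 0 and 90 degrees")
    elevations = [e for e, _ in FIXED_GAIN_TABLE]
    i = bisect_right(elevations, elevation_deg)
    if i == len(elevations):  # exactly at the zenith
        return FIXED_GAIN_TABLE[-1][1]
    (e0, g0), (e1, g1) = FIXED_GAIN_TABLE[i - 1], FIXED_GAIN_TABLE[i]
    t = (elevation_deg - e0) / (e1 - e0)
    return g0 + t * (g1 - g0)
```

The table reproduces the slow rise below 40° and the steep rise above 60°; a smooth curve fitted through the same points would serve equally well.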

　図３３及び図３４は、実施の形態に係るゲイン値計算の結果を検証するための定位実験の結果を示す図である。なお、以下では、実施例として、適切なフィックスゲイン値を包括ゲイン値に組み合わせて用いたものを示し、比較例として、包括ゲイン値のみを用いたものを示している。 Figures 33 and 34 show the results of localization experiments conducted to verify the results of gain value calculations according to the embodiment. Note that, in the following, the example uses an appropriate fixed gain value in combination with the global gain value, and the comparative example uses only the global gain value.

　図３３は、仰角４０°における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との前後誤り率と、被験者の頭部の回転角との関係を示している。図３３に示すように、実施例では、特に頭部を回転した場合に、比較例に比べて誤り率が小さくなることが示された。 Figure 33 shows the relationship between the front-to-back error rate between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 40°, and the angle of rotation of the subject's head. As shown in Figure 33, the example showed a smaller error rate than the comparative example, especially when the head was rotated.

　また、図３４は、仰角４０°における音源オブジェクトについて、被験者が知覚している方向（方位）と実方向（方位）との平均定位誤差と、被験者の頭部の回転角との関係を示している。図３４に示すように、実施例では、特に頭部を回転した場合に、比較例に比べて誤差が小さくなることが示された。このように、フィックスゲイン値を組み合わせて用いることで、単に包括ゲイン値を用いる場合に比べて、より適切に音源オブジェクトの定位をさせることが可能である。 Furthermore, Figure 34 shows the relationship between the average localization error between the direction (azimuth) perceived by the subject and the actual direction (azimuth) for a sound source object at an elevation angle of 40°, and the rotation angle of the subject's head. As shown in Figure 34, the example showed a smaller error than the comparative example, especially when the head was rotated. In this way, using the fixed gain value in combination with the global gain value makes it possible to localize the sound source object more appropriately than when the global gain value is used alone.

　なお、仰角代表点及び俯角代表点の頭部伝達関数を用いる場合に、上記のようにフィックスゲイン値を用いる代わりに、頭部回転角に応じた初期遅延の補正を行うようにしてもよい。初期遅延の補正は、受聴者の真正面（方位０°）の音が、頭部回転により移動した際のＩＴＤ（Interaural Time Difference）の変化分と同じ量の変化が、仰角代表点及び俯角代表点の頭部伝達関数のＩＴＤに適用される様に、両耳間の仰角代表点及び俯角代表点の頭部伝達関数の初期遅延を調整することで行えばよい。ここでの調整量は、音源オブジェクトの方向によって変化するが、代表的な音源オブジェクトの方向について、予め計算しておいて、受信側（すなわち音声提示デバイス６０２側）でテーブルとして保持しておくことで、推定した音源方向に応じて参照して、選択し、使用するようにしてもよい。 When using head-related transfer functions of elevation and depression representative points, instead of using fixed gain values as described above, it is also possible to correct the initial delay according to the head rotation angle. The initial delay can be corrected by adjusting the initial delay of the head-related transfer functions of the elevation and depression representative points between the ears so that a change equal to the change in ITD (Interaural Time Difference) when a sound directly in front of the listener (azimuth 0°) moves due to head rotation is applied to the ITD of the head-related transfer functions of the elevation and depression representative points. The amount of adjustment here varies depending on the direction of the sound source object, but the directions of representative sound source objects can be calculated in advance and stored as a table on the receiving side (i.e., on the audio presentation device 602 side), so that the table can be referenced, selected, and used according to the estimated sound source direction.
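
The table lookup on the receiving side can be sketched as a nearest-direction search. The names `circular_distance` and `lookup_delay_adjustment` are hypothetical, and the table contents are supplied by the caller; no adjustment values are given in the text, so none are assumed here.

```python
def circular_distance(a, b):
    """Smallest absolute difference between two azimuths in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def lookup_delay_adjustment(table, estimated_source_az):
    """Return the precomputed adjustment for the nearest tabulated direction.

    table: dict mapping a representative source azimuth (degrees) to the
    precomputed initial-delay adjustment for that direction.
    """
    if not table:
        raise ValueError("table must not be empty")
    nearest = min(table, key=lambda az: circular_distance(az, estimated_source_az))
    return table[nearest]
```

In practice one such table would be held per head rotation angle (or the rotation angle would be a second key), since the adjustment is made according to the head rotation angle.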

　（その他の実施の形態） (Other embodiments)
　以上、実施の形態について説明したが、本開示は、上記の実施の形態に限定されるものではない。 Although the embodiments have been described above, the present disclosure is not limited to the above-described embodiments.

　例えば、上記の実施の形態に説明した情報処理システム又は音響再生システムは、構成要素をすべて備える一つの装置として実現されてもよいし、複数の装置に各機能が割り振られ、この複数の装置が連携することで実現されてもよい。後者の場合には、情報処理装置に該当する装置として、スマートフォン、タブレット端末、ＰＣ、又は基地局などの情報処理装置が用いられてもよい。例えば、音響効果を付加した音響信号を生成するレンダラとしての機能を有する音響再生システム１００において、レンダラの機能のすべて又は一部をサーバが担ってもよい。つまり、取得部１１１、経路算出部１２１、出力音生成部１３１、信号出力部１４１のすべて又は一部は、図示しないサーバに存在してもよい。その場合、音響再生システム１００は、例えば、コンピュータ又はスマートフォンなどの情報処理装置と、ユーザ９９に装着されるヘッドマウントディスプレイ（ＨＭＤ）やイヤホンなどの音提示デバイスと、図示しないサーバとを組み合わせて実現される。なお、コンピュータと音提示デバイスとサーバとが同一のネットワークで通信可能に接続されていてもよいし、異なるネットワークで接続されていてもよい。異なるネットワークで接続されている場合、通信に遅延が発生する可能性が高くなるため、コンピュータと音提示デバイスとサーバとが同一ネットワークで通信可能に接続されている場合にのみサーバでの処理を許可してもよい。また、音響再生システム１００が受け付けるビットストリームのデータ量に応じて、レンダラのすべて又は一部の機能をサーバが担うか否かを決定してもよい。 For example, the information processing system or sound reproduction system described in the above embodiments may be realized as a single device that includes all of the components, or may be realized by allocating each function to multiple devices and working together. In the latter case, the device corresponding to the information processing device may be an information processing device such as a smartphone, tablet terminal, PC, or base station. For example, in the sound reproduction system 100 that functions as a renderer that generates sound signals with added sound effects, all or part of the renderer's functions may be performed by a server. In other words, all or part of the acquisition unit 111, path calculation unit 121, output sound generation unit 131, and signal output unit 141 may reside on a server (not shown). In this case, the sound reproduction system 100 is realized by combining, for example, an information processing device such as a computer or smartphone, a sound presentation device such as a head-mounted display (HMD) or earphones worn by the user 99, and a server (not shown). The computer, sound presentation device, and server may be connected to each other so that they can communicate with each other via the same network, or may be connected via different networks. 
When connected via different networks, there is a high possibility of communication delays, so processing on the server may be permitted only when the computer, sound presentation device, and server are connected so that they can communicate via the same network. Furthermore, whether the server will take on all or part of the functions of the renderer may be determined depending on the amount of bitstream data accepted by the sound reproduction system 100.

　また、本開示の情報処理システム又は音響再生システムは、ドライバのみを備える再生装置に接続され、当該再生装置に対して、取得した音情報に基づいて生成された出力音信号を再生するのみの情報処理装置として実現することもできる。この場合、情報処理装置は、専用の回路を備えるハードウェアとして実現してもよいし、汎用のプロセッサに特定の処理を実行させるためのソフトウェアとして実現してもよい。 Furthermore, the information processing system or sound reproduction system disclosed herein can also be realized as an information processing device that is connected to a reproduction device equipped with only a driver and that simply reproduces an output sound signal generated based on acquired sound information for the reproduction device. In this case, the information processing device may be realized as hardware equipped with dedicated circuits, or as software that causes a general-purpose processor to execute specific processes.

　また、上記の実施の形態において、特定の処理部が実行する処理を別の処理部が実行してもよい。また、複数の処理の順序が変更されてもよいし、複数の処理が並行して実行されてもよい。 Furthermore, in the above embodiments, the processing performed by a specific processing unit may be performed by another processing unit. Furthermore, the order of multiple processes may be changed, or multiple processes may be performed in parallel.

　また、上記の実施の形態において、各構成要素は、各構成要素に適したソフトウェアプログラムを実行することによって実現されてもよい。各構成要素は、ＣＰＵ又はプロセッサなどのプログラム実行部が、ハードディスク又は半導体メモリなどの記録媒体に記録されたソフトウェアプログラムを読み出して実行することによって実現されてもよい。 Furthermore, in the above embodiments, each component may be realized by executing a software program appropriate for that component. Each component may also be realized by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory.

　また、各構成要素は、ハードウェアによって実現されてもよい。例えば、各構成要素は、回路（又は集積回路）でもよい。これらの回路は、全体として１つの回路を構成してもよいし、それぞれ別々の回路でもよい。また、これらの回路は、それぞれ、汎用的な回路でもよいし、専用の回路でもよい。 Furthermore, each component may be realized by hardware. For example, each component may be a circuit (or integrated circuit). These circuits may form a single circuit as a whole, or each may be a separate circuit. Furthermore, each of these circuits may be a general-purpose circuit or a dedicated circuit.

　また、本開示の全般的又は具体的な態様は、システム、装置、方法、集積回路、コンピュータプログラム又はコンピュータ読み取り可能なＣＤ－ＲＯＭなどの記録媒体で実現されてもよい。また、本開示の全般的又は具体的な態様は、システム、装置、方法、集積回路、コンピュータプログラム及び記録媒体の任意な組み合わせで実現されてもよい。 Furthermore, general or specific aspects of the present disclosure may be realized as a system, device, method, integrated circuit, computer program, or computer-readable recording medium such as a CD-ROM. Furthermore, general or specific aspects of the present disclosure may be realized as any combination of a system, device, method, integrated circuit, computer program, and recording medium.

　例えば、本開示は、コンピュータによって実行される情報処理方法又は音声信号再生方法として実現されてもよいし、情報処理方法又は音声信号再生方法をコンピュータに実行させるためのプログラムとして実現されてもよい。本開示は、このようなプログラムが記録されたコンピュータ読み取り可能な非一時的な記録媒体として実現されてもよい。 For example, the present disclosure may be realized as an information processing method or audio signal reproduction method executed by a computer, or as a program for causing a computer to execute an information processing method or audio signal reproduction method. The present disclosure may also be realized as a computer-readable non-transitory recording medium on which such a program is recorded.

　その他、各実施の形態に対して当業者が思いつく各種変形を施して得られる形態、又は、本開示の趣旨を逸脱しない範囲で各実施の形態における構成要素及び機能を任意に組み合わせることで実現される形態も本開示に含まれる。 In addition, this disclosure also includes forms obtained by applying various modifications to each embodiment that a person skilled in the art would conceive, or forms realized by arbitrarily combining the components and functions of each embodiment within the scope of the spirit of this disclosure.

　なお、本開示における符号化された音情報は、音響再生システム１００によって再生される所定音についての情報である音信号及び、当該所定音の音像を三次元音場内において所定位置に定位させる際の定位位置に関する情報であるメタデータを含むビットストリームと言い換えることができる。例えばＭＰＥＧ－Ｈ　３Ｄ　Ａｕｄｉｏ（ＩＳＯ／ＩＥＣ　２３００８－３）等の所定の形式で符号化されたビットストリームとして音情報が音響再生システム１００に取得されてもよい。一例として、符号化された音信号は、音響再生システム１００によって再生される所定音についての情報を含む。ここでいう所定音は、三次元音場に存在する音源オブジェクトが発する音又は自然環境音であって、例えば、機械音、又は人を含む動物の音声等を含み得る。なお、三次元音場に音源オブジェクトが複数存在する場合、音響再生システム１００は、複数の音源オブジェクトにそれぞれ対応する複数の音信号を取得することになる。 In this disclosure, the encoded sound information can be rephrased as a bitstream containing a sound signal, which is information about a specific sound to be reproduced by the sound reproduction system 100, and metadata, which is information about the localization position when the sound image of the specific sound is localized at a specific position within a three-dimensional sound field. For example, the sound information may be acquired by the sound reproduction system 100 as a bitstream encoded in a specific format such as MPEG-H 3D Audio (ISO/IEC 23008-3). As an example, the encoded sound signal contains information about the specific sound to be reproduced by the sound reproduction system 100. The specific sound here refers to a sound emitted by a sound source object present in the three-dimensional sound field or a natural environmental sound, and may include, for example, mechanical sounds or the sounds of animals, including humans. In addition, if there are multiple sound source objects in the three-dimensional sound field, the sound reproduction system 100 will acquire multiple sound signals corresponding to the multiple sound source objects.

　一方、メタデータとは、例えば、音響再生システム１００において音信号に対する音響処理を制御するために用いられる情報である。メタデータは、仮想空間（三次元音場）で表現されるシーンを記述するために用いられる情報であってもよい。ここでシーンとは、メタデータを用いて、音響再生システム１００でモデリングされる、仮想空間における三次元映像及び音響イベントを表す全ての要素の集合体を指す用語である。つまり、ここでいうメタデータとは、音響処理を制御する情報だけでなく、映像処理を制御する情報も含んでいてもよい。もちろん、メタデータには、音響処理と映像処理とのいずれか一方だけを制御する情報が含まれていてもよいし、両方の制御に用いられる情報が含まれていてもよい。本開示において音響再生システム１００が取得するビットストリームには、このようなメタデータが含まれている場合がある。あるいは、音響再生システム１００は、後述するようにビットストリームとは別に、メタデータを単体で取得してもよい。 Meanwhile, metadata is information used, for example, to control acoustic processing of sound signals in the sound reproduction system 100. Metadata may also be information used to describe a scene expressed in a virtual space (three-dimensional sound field). Here, a scene is a term that refers to the collection of all elements representing three-dimensional images and acoustic events in a virtual space, modeled by the sound reproduction system 100 using metadata. In other words, the metadata here may include not only information that controls acoustic processing, but also information that controls video processing. Of course, the metadata may include information that controls only either audio processing or video processing, or information used to control both. In the present disclosure, the bitstream acquired by the sound reproduction system 100 may include such metadata. Alternatively, the sound reproduction system 100 may acquire the metadata separately from the bitstream, as described below.

　音響再生システム１００は、ビットストリームに含まれるメタデータ、及び追加で取得されるインタラクティブなユーザ９９の位置情報等を用いて、音信号に音響処理を行うことで、仮想的な音響効果を生成する。例えば、初期反射音生成、後期残響音生成、回折音生成、距離減衰効果、ローカリゼーション、音像定位処理、又はドップラー効果等の音響効果が付加されることが考えられる。また、音響効果の全て又は一部のオンオフを切り替える情報がメタデータとして付加されてもよい。 The sound reproduction system 100 generates virtual sound effects by performing sound processing on the sound signal using metadata contained in the bitstream and additionally acquired position information of the interactive user 99. For example, sound effects such as early reflection sound generation, late reverberation sound generation, diffraction sound generation, distance attenuation effect, localization, sound image localization processing, or Doppler effect may be added. Information for switching all or some of the sound effects on and off may also be added as metadata.
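As a non-limiting illustration of switching an acoustic effect on or off through metadata, the following minimal sketch applies distance attenuation only when a flag is set. The `EffectFlags` container, the `apply_effects` helper, and the inverse-distance gain law are assumptions introduced for illustration; the disclosure does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class EffectFlags:
    """Hypothetical on/off switches carried as metadata (names are illustrative)."""
    distance_attenuation: bool = True


def distance_gain(source_pos, user_pos, ref_distance=1.0):
    """Assumed inverse-distance law; gain is clamped to 1 inside ref_distance."""
    d = math.dist(source_pos, user_pos)
    return 1.0 if d <= ref_distance else ref_distance / d


def apply_effects(samples, source_pos, user_pos, flags):
    """Apply only the effects that the metadata flags enable."""
    out = list(samples)
    if flags.distance_attenuation:
        g = distance_gain(source_pos, user_pos)
        out = [g * s for s in out]
    # Other effects (reverberation, diffraction, Doppler, ...) would be
    # gated by their own flags in the same way.
    return out
```

With the flag cleared, the samples pass through unchanged, which corresponds to turning that effect off while leaving the rest of the processing chain intact.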

　なお、全てのメタデータ又は一部のメタデータは、音情報のビットストリーム以外から取得されてもよい。例えば、音響を制御するメタデータと映像を制御するメタデータとのいずれかがビットストリーム以外から取得されてもよいし、両方のメタデータがビットストリーム以外から取得されてもよい。 Note that all or some of the metadata may be obtained from sources other than the audio information bitstream. For example, either the metadata controlling the audio or the metadata controlling the video may be obtained from sources other than the bitstream, or both may be obtained from sources other than the bitstream.

　また、映像を制御するメタデータが音響再生システム１００で取得されるビットストリームに含まれる場合は、音響再生システム１００は映像の制御に用いることができるメタデータを、画像を表示する表示装置、又は立体映像を再生する立体映像再生装置に対して出力する機能を備えていてもよい。 Furthermore, if metadata for controlling video is included in the bitstream acquired by the audio reproduction system 100, the audio reproduction system 100 may have a function for outputting the metadata that can be used to control video to a display device that displays images or a 3D video reproduction device that reproduces 3D video.

　また、一例として、符号化されたメタデータは、音を発する音源オブジェクト、及び障害物オブジェクトを含む三次元音場に関する情報と、当該音の音像を三次元音場内において所定位置に定位させる（つまり、所定方向から到達する音として知覚させる）際の定位位置に関する情報、すなわち所定方向に関する情報とを含む。ここで、障害物オブジェクトは、音源オブジェクトが発する音がユーザ９９へと到達するまでの間において、例えば音を遮ったり、音を反射したりして、ユーザ９９が知覚する音に影響を及ぼし得るオブジェクトである。障害物オブジェクトは、静止物体の他に、人等の動物、又は機械等の動体を含み得る。また、三次元音場に複数の音源オブジェクトが存在する場合、任意の音源オブジェクトにとっては、他の音源オブジェクトは障害物オブジェクトとなり得る。また、建材又は無生物等の非発音源オブジェクトも、音を発する音源オブジェクトも、いずれも障害物オブジェクトとなり得る。 Also, as an example, the encoded metadata includes information about a three-dimensional sound field that contains sound source objects, which emit sound, and obstacle objects, as well as information about the localization position used when the sound image of the sound is localized at a specific position within the three-dimensional sound field (that is, when the sound is perceived as arriving from a specific direction), in other words, information about the specific direction. Here, an obstacle object is an object that can affect the sound perceived by the user 99, for example by blocking or reflecting the sound emitted by a sound source object before it reaches the user 99. Obstacle objects can include not only stationary objects but also moving bodies such as animals, including people, and machines. Furthermore, when multiple sound source objects exist in the three-dimensional sound field, the other sound source objects can be obstacle objects for any given sound source object. Both non-sound-emitting objects, such as building materials or inanimate objects, and sound source objects that emit sound can be obstacle objects.

　メタデータを構成する空間情報として、三次元音場の形状だけでなく、三次元音場に存在する障害物オブジェクトの形状及び位置と、三次元音場に存在する音源オブジェクトの形状及び位置とをそれぞれ表す情報が含まれていてもよい。三次元音場は、閉空間又は開空間のいずれであってもよく、メタデータには、例えば床、壁、又は天井等の三次元音場において音を反射し得る構造物の反射率、及び三次元音場に存在する障害物オブジェクトの反射率を表す情報が含まれる。ここで、反射率は、入射音に対する反射音のエネルギーの比であって、音の周波数帯域ごとに設定されている。もちろん、反射率は、音の周波数帯域に依らず、一律に設定されていてもよい。また、三次元音場が開空間の場合は、例えば一律で設定された減衰率、回折音、又は初期反射音等のパラメータが用いられてもよい。 The spatial information constituting the metadata may include not only the shape of the three-dimensional sound field, but also information representing the shape and position of obstacle objects present in the three-dimensional sound field, and the shape and position of sound source objects present in the three-dimensional sound field. The three-dimensional sound field may be either a closed or open space, and the metadata includes information representing the reflectance of structures that can reflect sound in the three-dimensional sound field, such as floors, walls, or ceilings, and the reflectance of obstacle objects present in the three-dimensional sound field. Here, reflectance is the ratio of the energy of reflected sound to incident sound, and is set for each frequency band of sound. Of course, the reflectance may also be set uniformly, regardless of the frequency band of the sound. Furthermore, when the three-dimensional sound field is an open space, parameters such as uniform attenuation rate, diffracted sound, or early reflected sound may be used.
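Because reflectance is defined as the ratio of reflected to incident sound energy, either per frequency band or uniformly, its use can be sketched as follows. This is a minimal sketch; the function name `reflected_energy`, the dictionary layout keyed by band center frequency, and the sample values are assumptions made for illustration.

```python
def reflected_energy(incident_energy_by_band, reflectance):
    """Return the reflected energy per frequency band.

    incident_energy_by_band: dict mapping a band label (e.g. center
        frequency in Hz) to incident sound energy.
    reflectance: either a dict with one ratio per band, or a single
        number applied uniformly to every band.
    """
    if isinstance(reflectance, (int, float)):
        # Uniform reflectance, independent of the frequency band.
        return {band: energy * reflectance
                for band, energy in incident_energy_by_band.items()}
    # Band-dependent reflectance.
    return {band: energy * reflectance[band]
            for band, energy in incident_energy_by_band.items()}


# Hypothetical values for a wall surface.
incident = {125: 1.0, 1000: 2.0}
per_band = reflected_energy(incident, {125: 0.5, 1000: 0.25})
uniform = reflected_energy(incident, 0.5)
```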

　上記説明では、メタデータに含まれる障害物オブジェクト又は音源オブジェクトに関するパラメータとして反射率が挙げられたが、メタデータは、反射率以外の情報を含んでいてもよい。例えば、音源オブジェクト及び非発音源オブジェクトの両方に関わるメタデータとして、オブジェクトの素材に関する情報が含まれていてもよい。具体的には、メタデータは、拡散率、透過率、又は吸音率等のパラメータを含んでいてもよい。 In the above explanation, reflectance was mentioned as a parameter related to an obstacle object or sound source object included in the metadata, but the metadata may also include information other than reflectance. For example, metadata related to both sound source objects and non-sound source objects may include information about the material of the object. Specifically, the metadata may include parameters such as diffusion rate, transmittance, or sound absorption rate.

　音源オブジェクトに関する情報として、音量、放射特性（指向性）、再生条件、ひとつのオブジェクトから発せられる音源の数及び種類、又はオブジェクトにおける音源領域を指定する情報等が含まれてもよい。再生条件では、例えば、継続的に流れ続ける音なのかイベント発動する音なのかが定められてもよい。オブジェクトにおける音源領域は、ユーザ９９の位置とオブジェクトの位置との相対的な関係で定められてもよいし、オブジェクトを基準として定められてもよい。ユーザ９９の位置とオブジェクトの位置との相対的な関係で定められる場合、ユーザ９９がオブジェクトを見ている面を基準とし、ユーザ９９から見てオブジェクトの右側からは音Ｘ、左側からは音Ｙが発せられているようにユーザ９９に知覚させることができる。オブジェクトを基準として定められる場合、ユーザ９９の見ている方向に関わらず、オブジェクトのどの領域からどの音を出すかは固定にすることができる。例えばオブジェクトを正面から見たときの右側からは高い音、左側からは低い音が流れているようにユーザ９９に知覚させることができる。この場合、ユーザ９９がオブジェクトの背面に回り込むと、背面から見て右側からは低い音、左側からは高い音が流れているようにユーザ９９に知覚させることができる。 Information about sound source objects may include volume, radiation characteristics (directivity), playback conditions, the number and types of sound sources emitted from a single object, or information specifying the sound source area within the object. The playback conditions may define, for example, whether a sound plays continuously or is triggered by an event. The sound source area within an object may be defined by the relative relationship between the position of the user 99 and the position of the object, or may be defined with the object as the reference. When it is defined by the relative relationship between the position of the user 99 and the position of the object, the face of the object that the user 99 is viewing is used as the reference, and the user 99 can be made to perceive sound X as coming from the right side of the object and sound Y as coming from the left side, as viewed from the user 99. When it is defined with the object as the reference, which sound is emitted from which area of the object can be fixed regardless of the direction in which the user 99 is looking. For example, the user 99 can be made to perceive a high-pitched sound coming from the right side and a low-pitched sound coming from the left side when viewing the object from the front. In this case, if the user 99 moves around to the back of the object, the user 99 can be made to perceive a low-pitched sound coming from the right side and a high-pitched sound coming from the left side as viewed from behind.
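The difference between the two ways of defining the sound source area can be illustrated with a minimal two-dimensional (top-view) sketch. The mode labels `"user_relative"` and `"object_fixed"`, the function name, and the representation of the object's front face as a unit vector are all assumptions introduced for illustration and are not taken from the disclosure.

```python
import math


def perceived_right_sound(mode, object_pos, object_front, user_pos,
                          right_sound, left_sound):
    """Return the sound the user perceives on the right-hand side of the object.

    mode: "user_relative" or "object_fixed" (hypothetical labels).
    object_front: 2D unit vector pointing out of the object's front face.
    right_sound / left_sound: sounds assigned to the right and left side,
        as seen by the user (user_relative) or as seen from the front of
        the object (object_fixed).
    """
    if mode == "user_relative":
        # The assignment follows the user, so the right side never changes.
        return right_sound
    if mode != "object_fixed":
        raise ValueError("unknown mode: " + str(mode))

    # Viewing direction from the user toward the object.
    vx, vy = object_pos[0] - user_pos[0], object_pos[1] - user_pos[1]
    norm = math.hypot(vx, vy)
    if norm == 0.0:
        raise ValueError("user and object positions coincide")
    # Right-hand direction of the user (x to the right, y up, seen from above).
    user_right = (vy / norm, -vx / norm)
    # Right-hand direction of a viewer standing in front of the object.
    fx, fy = object_front
    front_right = (-fy, fx)
    dot = user_right[0] * front_right[0] + user_right[1] * front_right[1]
    # Exactly side-on views (dot == 0) are resolved toward the front view.
    return right_sound if dot >= 0.0 else left_sound
```

For an object at the origin whose front faces the negative y direction, a user standing in front hears the right-side sound on the right, while a user who has moved around to the back hears the left-side sound there, matching the high-pitched/low-pitched example above.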

　空間に関するメタデータとして、初期反射音までの時間、残響時間、又は直接音と拡散音との比率等を含めることができる。直接音と拡散音との比率がゼロの場合、直接音のみをユーザ９９に知覚させることができる。 Spatial metadata can include the time to early reflections, reverberation time, or the ratio of direct sound to diffuse sound. If the ratio of direct sound to diffuse sound is zero, the user 99 will only perceive direct sound.

　また、三次元音場におけるユーザ９９の位置及び向きを示す情報が初期設定として予めメタデータとしてビットストリームに含まれていてもよいし、ビットストリームに含まれていなくてもよい。ユーザ９９の位置及び向きを示す情報がビットストリームに含まれていない場合、ユーザ９９の位置及び向きを示す情報はビットストリーム以外の情報から取得される。例えば、ＶＲ空間におけるユーザ９９の位置情報であれば、ＶＲコンテンツを提供するアプリから取得されてもよいし、ＡＲとして音を提示するためのユーザ９９の位置情報であれば、例えば携帯端末がＧＰＳ、カメラ、又はＬｉＤＡＲ（Ｌａｓｅｒ　Ｉｍａｇｉｎｇ　Ｄｅｔｅｃｔｉｏｎ　ａｎｄ　Ｒａｎｇｉｎｇ）等を用いて自己位置推定を実施して得られた位置情報が用いられてもよい。なお、音信号とメタデータとは、一つのビットストリームに格納されていてもよいし、複数のビットストリームに別々に格納されていてもよい。同様に、音信号とメタデータとは、一つのファイルに格納されていてもよいし、複数のファイルに別々に格納されていてもよい。 Furthermore, information indicating the position and orientation of the user 99 in the three-dimensional sound field may be included in the bitstream as metadata as an initial setting, or it may not be included in the bitstream. If information indicating the position and orientation of the user 99 is not included in the bitstream, the information indicating the position and orientation of the user 99 is obtained from information other than the bitstream. For example, position information of the user 99 in the VR space may be obtained from an app that provides VR content, and position information of the user 99 for presenting sound as AR may be position information obtained by a mobile device performing self-position estimation using GPS, a camera, LiDAR (Laser Imaging Detection and Ranging), or the like. Note that the sound signal and metadata may be stored in a single bitstream, or may be stored separately in multiple bitstreams. Similarly, the sound signal and metadata may be stored in a single file, or may be stored separately in multiple files.
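The choice between a pose carried in the bitstream and a pose obtained elsewhere can be sketched as a simple fallback. The metadata key `"initial_user_pose"` and the `external_source` callback (standing in for a VR application or a self-position estimate on a mobile device) are hypothetical names used only for this illustration.

```python
from typing import Callable, Optional, Tuple

Vec3 = Tuple[float, float, float]
Pose = Tuple[Vec3, Vec3]  # (position, orientation)


def resolve_user_pose(bitstream_metadata: dict,
                      external_source: Optional[Callable[[], Pose]] = None) -> Pose:
    """Use the initial pose in the bitstream metadata if present;
    otherwise obtain the pose from a source other than the bitstream."""
    pose = bitstream_metadata.get("initial_user_pose")  # hypothetical key
    if pose is not None:
        return pose
    if external_source is None:
        raise LookupError("no user pose in the bitstream and no external source")
    return external_source()


def fake_self_localization() -> Pose:
    """Stand-in for GPS, camera, or LiDAR based self-position estimation."""
    return ((1.0, 2.0, 0.0), (0.0, 0.0, 90.0))
```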

　音信号とメタデータとが複数のビットストリームに別々に格納されている場合、関連する他のビットストリームを示す情報が、音信号とメタデータとが格納された複数のビットストリームのうちの一つ又は一部のビットストリームに含まれていてもよい。また、関連する他のビットストリームを示す情報が、音信号とメタデータとが格納された複数のビットストリームの各ビットストリームのメタデータ又は制御情報に含まれていてもよい。音信号とメタデータとが複数のファイルに別々に格納されている場合、関連する他のビットストリーム又はファイルを示す情報が、音信号とメタデータとが格納された複数のファイルのうちの一つ又は一部のファイルに含まれていてもよい。また、関連する他のビットストリーム又はファイルを示す情報が、音信号とメタデータとが格納された複数のビットストリームの各ビットストリームのメタデータ又は制御情報に含まれていてもよい。 When an audio signal and metadata are stored separately in multiple bitstreams, information indicating other related bitstreams may be included in one or some of the multiple bitstreams in which the audio signal and metadata are stored. Furthermore, information indicating other related bitstreams may be included in the metadata or control information of each bitstream of the multiple bitstreams in which the audio signal and metadata are stored. When an audio signal and metadata are stored separately in multiple files, information indicating other related bitstreams or files may be included in one or some of the multiple files in which the audio signal and metadata are stored. Furthermore, information indicating other related bitstreams or files may be included in the metadata or control information of each bitstream of the multiple bitstreams in which the audio signal and metadata are stored.

　ここで、関連するビットストリーム又はファイルはそれぞれ、例えば、音響処理の際に同時に用いられる可能性のあるビットストリーム又はファイルである。また、関連する他のビットストリームを示す情報は、音信号とメタデータとを格納した複数のビットストリームのうちの一つのビットストリームのメタデータ又は制御情報にまとめて記述されていてもよいし、音信号とメタデータとを格納した複数のビットストリームのうちの二以上のビットストリームのメタデータ又は制御情報に分割して記述されていてもよい。同様に、関連する他のビットストリーム又はファイルを示す情報は、音信号とメタデータとを格納した複数のファイルのうちの一つのファイルのメタデータ又は制御情報にまとめて記述されていてもよいし、音信号とメタデータとを格納した複数のファイルのうちの二以上のファイルのメタデータ又は制御情報に分割して記述されていてもよい。また、関連する他のビットストリーム又はファイルを示す情報を、まとめて記述した制御ファイルが音信号とメタデータとを格納した複数のファイルとは別に生成されてもよい。このとき、制御ファイルは音信号とメタデータとを格納していなくてもよい。 Here, the related bitstreams or files are, for example, bitstreams or files that may be used simultaneously during audio processing. Furthermore, information indicating other related bitstreams may be described collectively in the metadata or control information of one bitstream out of multiple bitstreams storing audio signals and metadata, or may be described separately in the metadata or control information of two or more bitstreams out of multiple bitstreams storing audio signals and metadata. Similarly, information indicating other related bitstreams or files may be described collectively in the metadata or control information of one file out of multiple files storing audio signals and metadata, or may be described separately in the metadata or control information of two or more files out of multiple files storing audio signals and metadata. Furthermore, a control file that collectively describes information indicating other related bitstreams or files may be generated separately from the multiple files storing audio signals and metadata. In this case, the control file does not have to store the audio signal and metadata.

　ここで、関連する他のビットストリーム又はファイルを示す情報とは、例えば当該他のビットストリームを示す識別子、他のファイルを示すファイル名、ＵＲＬ（Ｕｎｉｆｏｒｍ　Ｒｅｓｏｕｒｃｅ　Ｌｏｃａｔｏｒ）、又はＵＲＩ（Ｕｎｉｆｏｒｍ　Ｒｅｓｏｕｒｃｅ　Ｉｄｅｎｔｉｆｉｅｒ）等である。この場合、取得部は、関連する他のビットストリーム又はファイルを示す情報に基づいて、ビットストリーム又はファイルを特定又は取得する。また、関連する他のビットストリームを示す情報が音信号とメタデータとを格納した複数のビットストリームのうちの少なくとも一部のビットストリームのメタデータ又は制御情報に含まれていると共に、関連する他のファイルを示す情報が音信号とメタデータとを格納した複数のファイルのうちの少なくとも一部のファイルのメタデータ又は制御情報に含まれていてもよい。ここで、関連するビットストリーム又はファイルを示す情報を含むファイルとは、例えばコンテンツの配信に用いられるマニフェストファイル等の制御ファイルであってもよい。 Here, the information indicating the other related bitstream or file may be, for example, an identifier indicating the other bitstream, a file name indicating the other file, a URL (Uniform Resource Locator), or a URI (Uniform Resource Identifier). In this case, the acquisition unit identifies or acquires the bitstream or file based on the information indicating the other related bitstream or file. Furthermore, the information indicating the other related bitstream may be included in the metadata or control information of at least some of the bitstreams among multiple bitstreams that store audio signals and metadata, and the information indicating the other related file may be included in the metadata or control information of at least some of the files among multiple files that store audio signals and metadata. Here, the file containing information indicating the related bitstream or file may be, for example, a control file such as a manifest file used for content distribution.
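As one possible picture of a control file that lists related bitstreams or files, the following minimal sketch parses a JSON manifest and returns a mapping from identifier to file name or URI. The JSON layout, the keys `"related"`, `"id"`, and `"uri"`, and the example entries are assumptions made for illustration; the disclosure does not define a manifest format.

```python
import json

# Hypothetical control file: it stores neither sound signals nor metadata,
# only references to the related bitstreams or files.
MANIFEST_JSON = """
{
  "related": [
    {"id": "bs-audio", "uri": "audio.bin"},
    {"id": "bs-meta", "uri": "https://example.com/meta.bin"}
  ]
}
"""


def related_resources(manifest_text):
    """Map each identifier to the file name or URI it refers to."""
    manifest = json.loads(manifest_text)
    return {entry["id"]: entry["uri"] for entry in manifest.get("related", [])}


resources = related_resources(MANIFEST_JSON)
```

An acquisition unit could then look up each identifier in `resources` to identify or fetch the bitstreams or files that are expected to be used together during acoustic processing.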

　本開示は、立体的な音をユーザに知覚させる等の音響再生の際に有用である。 The present disclosure is useful for sound reproduction, for example when causing a user to perceive three-dimensional sound.

　　　９９　ユーザ
　　１００　音響再生システム
　　１０１　情報処理装置
　　１０２　通信モジュール
　　１０３　検知器
　　１０４　ドライバ
　　１０５　データベース
　　１１１　取得部
　　１１２　エンコード音情報入力部
　　１１３　デコード処理部
　　１１４　センシング情報入力部
　　１２１　経路算出部
　　１３１　出力音生成部
　　１３４　生成部
　　１３５　合成部
　　１４１　信号出力部
　　３００　立体映像再生装置

99 User
100 Sound reproduction system
101 Information processing device
102 Communication module
103 Detector
104 Driver
105 Database
111 Acquisition unit
112 Encoded sound information input unit
113 Decode processing unit
114 Sensing information input unit
121 Path calculation unit
131 Output sound generation unit
134 Generation unit
135 Synthesis unit
141 Signal output unit
300 3D video reproduction device

Claims

An information processing method executed by a plurality of information processing terminals, the method comprising:
acquiring, in a first terminal that is one of the plurality of information processing terminals, first sound information including an acoustic signal and information about a position of a sound source object within a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal;
converting, in the first terminal, the first sound information into second sound information for generating a representative sound arriving at a reference position from a representative point set in the three-dimensional sound field using the acoustic signal;
transmitting, in the first terminal, the second sound information to a second terminal that is another information processing terminal among the plurality of information processing terminals;
detecting, in the second terminal, a position or a head direction of a user within the three-dimensional sound field;
calculating, in the second terminal, a position of a reproduction representative point corresponding to the position of the representative point based on the detected position or head direction of the user and the reference position; and
generating, in the second terminal, an output sound signal using a head-related transfer function according to the calculated position of the reproduction representative point and the received second sound information.

The information processing method according to claim 1, wherein the converting step converts the reproduced sound into the representative sound by applying a time shift adjustment and a gain adjustment to the reproduced sound.

The information processing method according to claim 2, wherein the adjustment amount of the time shift adjustment is set to be 0 at the representative point.

The information processing method according to claim 2, wherein the adjustment amount of the time shift adjustment at a predetermined position is set using an adjustment amount at a position closer to the representative point than the predetermined position.

The information processing method according to claim 2, wherein the adjustment amount of the gain adjustment at a predetermined position is calculated by first calculating, using two of three adjacent representative points surrounding the predetermined position, the adjustment amount for each of the two representative points that minimizes an error, and then, with the ratio of the calculated adjustment amounts fixed, calculating the adjustment amount for the remaining representative point.

The information processing method according to claim 2, wherein the adjustment amount of the gain adjustment at a predetermined position is calculated by first calculating, for a horizontal position obtained by removing the vertical component from the predetermined position and using the two horizontally located representative points among three adjacent representative points surrounding the predetermined position, the adjustment amount for each of the two representative points that minimizes an error, and then, with the ratio of the calculated adjustment amounts fixed, calculating the adjustment amount for the remaining representative point.

The information processing method according to claim 2, wherein the adjustment amount of the time shift adjustment is corrected, based on an expected value of an error from a calculated correct value, so as to reduce the error.

The information processing method according to claim 2, wherein the adjustment amount of the gain adjustment at a predetermined position is corrected, based on an expected value of an error from a calculated correct value, so as to reduce the error.

The information processing method according to claim 2, wherein the adjustment amount of the gain adjustment at a predetermined position is set using two or more of the adjustment amounts at a plurality of virtual representative points, the virtual representative points corresponding to a plurality of directions within a predetermined angle range in the horizontal direction centered on the direction of a starting representative point, the starting representative point being the representative point for the predetermined position.

The information processing method according to claim 9, wherein the adjustment amount of the gain adjustment at a predetermined position is set using all of the adjustment amounts at the plurality of virtual representative points, the virtual representative points corresponding to a plurality of directions within a predetermined angle range in the horizontal direction centered on the direction of a starting representative point, the starting representative point being the representative point for the predetermined position.

The information processing method according to claim 9, further comprising calculating the adjustment amount of the gain adjustment at the predetermined position by averaging the adjustment amounts at a plurality of the virtual representative points.

The information processing method according to claim 9, wherein the adjustment amount of the gain adjustment at a predetermined position is calculated by taking a weighted average of adjustment amounts at a plurality of the virtual representative points based on an angle difference between the direction of each of the virtual representative points and an original direction corresponding to the virtual representative point.

The information processing method according to any one of claims 2 to 12, wherein, for the gain adjustment at a predetermined position, the adjustment amount of an elevation angle representative point, which is the representative point including a vertical component among three adjacent representative points surrounding the predetermined position, is set to a predetermined value that does not depend on the direction of the predetermined position in a horizontal plane.

The information processing method according to claim 13, wherein the predetermined value varies depending on the direction of the predetermined position in a vertical plane.

The information processing method according to claim 14, wherein the predetermined value is a value that gradually increases between 0 and 1 as the direction of the predetermined position in the vertical plane changes from 0° to 90°.

An information processing system including a first terminal and a second terminal, wherein
the first terminal includes:
an acquisition unit that acquires first sound information including an acoustic signal and information about a position of a sound source object in a three-dimensional sound field, the first sound information being information for causing the sound source object in the three-dimensional sound field to emit a reproduced sound based on the acoustic signal;
a conversion unit that converts the first sound information into second sound information for generating a representative sound arriving at a reference position from a representative point set in the three-dimensional sound field using the acoustic signal; and
a transmitting unit that transmits the second sound information to the second terminal, and
the second terminal includes:
a detector that detects a position or a head direction of a user within the three-dimensional sound field;
a calculation unit that calculates a position of a reproduction representative point corresponding to the position of the representative point based on the detected position or head direction of the user and the reference position; and
an output unit that outputs an output sound signal using a head-related transfer function according to the calculated position of the reproduction representative point and the received second sound information.

A program for causing a computer to execute the information processing method according to claim 1.