
WO2025182639A1 - Information processing device, information processing method, and information processing system - Google Patents

Information processing device, information processing method, and information processing system

Info

Publication number
WO2025182639A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
unit
sound
speech
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2025/005168
Other languages
French (fr)
Japanese (ja)
Inventor
菜琳 岡崎
光 高鳥
康夫 川端
裕 高瀬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Group Corp
Original Assignee
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Group Corp filed Critical Sony Group Corp
Publication of WO2025182639A1 publication Critical patent/WO2025182639A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L 21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/12 Circuits for transducers, loudspeakers or microphones for distributing signals to two or more loudspeakers

Definitions

  • This disclosure relates to an information processing device, an information processing method, and an information processing system.
  • Remote communication systems are known that enable remote communication between multiple people in remote areas.
  • people in one remote area communicate with people in another remote area using the microphone, speaker, camera, etc. of the remote communication system.
  • the remote communication system uses a microphone placed in one remote area to pick up the speaker's voice, and then plays the picked-up voice from a speaker placed in another remote area.
  • This disclosure has been made in consideration of the above-mentioned circumstances, and provides an information processing device, information processing method, and information processing system that can promote smooth dialogue.
  • the information processing device disclosed herein includes a human sensing unit that determines, based on video captured by a camera of multiple speakers present in a specified space, whether each speaker is participating in a conversation; a clarity adjustment unit that adjusts the clarity of the speech of the speakers determined by the human sensing unit to be participating in the conversation, the speech being included in the audio of the specified space picked up by a microphone; and a transmission unit that transmits the audio of the specified space processed by the clarity adjustment unit.
  • FIG. 1 is a block diagram of a remote communication device.
  • FIG. 2A is a bird's-eye view of an environment in which a remote communication device is placed.
  • FIG. 2B is a front view of an environment in which a remote communication device is placed.
  • FIG. 3 is a diagram for explaining processing by a human sensing unit.
  • FIG. 4 is a diagram illustrating a coordinate transformation from screen coordinates to normalized coordinates.
  • FIG. 5 is a diagram illustrating individual sound separation processing.
  • FIG. 6 is a diagram illustrating a multiple facing speaker balancing process.
  • FIG. 7 is a diagram illustrating processing using a normal distribution function for outfield speech.
  • FIG. 8 is a diagram showing an outline of data flow between the local remote communication devices according to the first embodiment.
  • FIG. 9 is a flowchart of a transmission side process.
  • FIG. 10 is a flowchart of a person position determination process.
  • FIG. 11 is a flowchart of an infield/outfield determination process.
  • FIG. 12 is a flowchart of an individual sound separation process.
  • FIG. 13 is a flowchart of a process of adjusting clarity of individual sounds.
  • FIG. 14 is a flowchart of a multiple opposite speaker balancing process.
  • FIG. 15 is a flowchart of a background sound adjustment process.
  • FIG. 16 is a flowchart of a process for adjusting infield speech sounds.
  • FIG. 17 is a flowchart of a process for adjusting outfield speech sounds.
  • FIG. 18 is a flowchart of a process for deriving a localization center speaker unit.
  • FIG. 10 is a block diagram of a remote communication device according to a second embodiment.
  • FIG. 10 is a diagram showing an outline of data flow between the local remote communication devices according to the second embodiment.
  • FIG. 10 is a block diagram of a remote communication device according to a third embodiment.
  • FIG. 13 is a diagram showing an outline of data flow between the local remote communication devices according to the third embodiment.
  • FIG. 13 is a block diagram of a remote communication device according to a fifth embodiment.
  • 13A and 13B are diagrams for explaining audio signal processing by a remote communication device according to a fifth embodiment.
  • FIG. 20 is a diagram for explaining audio reproduction by a remote communication device according to a sixth embodiment.
  • FIG. 20 is a diagram for explaining audio reproduction by a remote communication device according to a seventh embodiment.
  • FIG. 10 is a hardware configuration diagram showing an example of a computer that realizes an arithmetic unit of a remote communication device that is an information processing device according to the first to seventh embodiments.
  • 1. Remote communication device
      1.1. Definition of terms
        1.1.1. Remote and local
        1.1.2. Individual sound
        1.1.3. Infield and outfield
        1.1.4. Multiple facing speakers
      1.2. Transmission unit
        1.2.1. Human sensing unit
        1.2.2. Audio separation unit
        1.2.3. Acoustic signal processing unit
        1.2.4. Output integration unit
        1.2.5. Transmission unit
      1.3. Receiving unit
        1.3.1. Receiving unit
        1.3.2. Audio output control unit
    2. Data flow between remote communication devices according to the first embodiment
    3. Remote communication processing
      3.1. Transmission-side processing
        3.1.1. Human position determination processing
        3.1.2. Infield/outfield determination processing
        3.1.3. Individual sound separation processing
        3.1.4. Clarity adjustment processing
      3.2. Multiple opposing speaker balancing processing (receiving side processing)
        3.2.1. Background sound adjustment
        3.2.2. Infield speech sound adjustment
        3.2.3.
  • Remote communication device: In conventional remote communication systems, when multiple people speak at the same time, the sounds can become mixed together during playback, making them difficult to hear, or the voice of the person you want to hear can be suppressed, creating an unnatural experience. In addition, conventional remote communication systems often cut out sounds other than speech.
  • Figure 1 is a block diagram of a remote communication device.
  • the remote communication device 1 according to the embodiment is able to process background sounds and spoken voices in separate systems, and further divides spoken voices into two types depending on the person being spoken to.
  • the remote communication device 1 then processes each of the divided voices using different parameters, adjusting the ease with which the voices can be heard while maintaining a sense of connection with the remote location.
  • the sense of connection corresponds to a sense of realism that makes it feel as if you are having a face-to-face conversation with your conversation partner in the same space.
  • the remote communication device 1 has a transmitting unit 10 and a receiving unit 20.
  • the remote communication device 1 shown in FIG. 1 is a device that performs two-way communication with a remote communication device 1 of a partner, and the remote communication device 1 of the partner also has a transmitting unit 10 and a receiving unit 20.
  • the information processing system that realizes remote communication in this embodiment has a remote communication device 1 as a transmitting device and a remote communication device 1 as a receiving device.
  • the sending remote communication device 1 and the receiving remote communication device 1 are connected via a network 7.
  • Figure 2A is an overhead view of the environment in which the remote communication device is placed.
  • Figure 2B is a front view of the environment in which the remote communication device is placed.
  • the dotted lines in Figures 2A and 2B indicate the wiring of signal lines.
  • the remote communication device 1 is connected to an audio interface 5.
  • the audio interface 5 is connected to multiple microphones 3 and multiple opposed speakers 4.
  • the audio interface 5 outputs sounds picked up from the multiple microphones 3 to the remote communication device 1.
  • the audio interface 5 also outputs sounds output from the remote communication device 1 to the multiple opposed speakers 4.
  • the camera 2 captures images of the specific space in which it is placed and outputs the captured images to the remote communication device 1.
  • the microphone 3 is an omnidirectional microphone. Multiple microphones 3 are arranged in a predetermined configuration to form a microphone array 30. It is preferable that the camera 2 and the microphone array 30 are placed at approximately the same position, and in particular that they are lined up along the horizontal line extending from the position where the people stand in Figure 2A toward the camera 2 and the microphone array 30.
  • a display 6 is connected to the remote communication device 1.
  • the display 6 displays the video received by the remote communication device 1.
  • a speaker array 41 and a speaker array 42 of multiple opposing speakers 4 are arranged above and below the display 6, respectively.
  • the surface of the display 6 that actually displays the image will be referred to as the "screen.”
  • the direction connecting the head and feet of the person being photographed, and the up and down direction of the image when projected onto the screen will be referred to as the "vertical direction.”
  • the left and right directions when the person being photographed is facing forward, and the left and right directions when the image is projected onto the screen will be referred to as the "horizontal direction.”
  • the left side of the image when projected onto the screen will be referred to as the "left,” and the right side will be referred to as the "right.”
  • the remote communication device 1 connects two physically separated locations and performs remote communication between people in each space. Of the two locations where remote communication is performed, the location where the transmitting remote communication device 1 is located is referred to as the "remote" location, and the location where the receiving remote communication device 1 is located is referred to as the "local" location. In other words, the following description will focus on the case where video and audio from the remote location are sent to the local location.
  • the remote person who is the conversation partner of the local location is referred to as the speaker, and the conversation partner to whom the speaker wants to send a message is referred to as the listener.
  • Individual sounds refer to the individual voices of each speaker. In many-to-many communication, where many remote people converse with many local people, it is expected that multiple people will be speaking simultaneously at one of the locations (remote). In such a situation, simply recording with an omnidirectional microphone will result in the recording of multiple voices. Therefore, the process of extracting each speaker's voice individually from the recorded voices of multiple people is called "individual sound separation.”
  • The infield refers to the position, as seen from the local speaker at that time, of a remote speaker who is currently paying attention to the local side, such as someone who is currently having a conversation with the local speaker. For example, if a conversation is taking place between a local person and a remote person, the remote person is considered to be the "infield" for the local person.
  • the infield can also be anyone who is currently participating in the conversation.
  • the outfield refers to the position of people who are not paying attention to the local situation at that time, such as remote people having a conversation with other remote people, as seen from the perspective of the local speaker at that time. If the listener for the remote speaker is also remote, in other words, if the conversation is between two remote speakers, the remote listener is treated as the "outfield" for the local speaker. People in the outfield can also be said to be people who are not participating in the conversation at that time. However, remote speakers can become infield or outfield depending on changes in their behavior over time.
  • the multiple opposed speakers 4 are a speaker system having a pair of a speaker array 41 consisting of a plurality of speaker units 411 and a speaker array 42 consisting of a plurality of speaker units 412.
  • the speaker array 41 and the speaker array 42 are arranged vertically with the screen of the display 6 in between.
  • the speaker array 41 is installed above the speaker array 42 in the image projected on the screen.
  • the following describes an example in which the speaker array 41 is installed above the speaker array 42.
  • Speaker array 41 and speaker array 42 have the same number of speaker units 411 and 412.
  • In the speaker array 41, multiple speaker units 411 are arranged horizontally.
  • In the speaker array 42, multiple speaker units 412 are arranged horizontally.
  • Each speaker unit 411 and each speaker unit 412 are arranged at corresponding positions in the vertical direction; that is, each speaker unit 411 is vertically paired with the speaker unit 412 in the same column.
  • Multiple opposed speakers 4 have the following advantages over a system that simply consists of an array of speakers. By playing back the voices of separate speakers on individual channels, multiple opposed speakers 4 make it easier to hear each individual voice than when multiple voices are mixed together on a single channel. Additionally, by localizing the sound source in the vertical center of the display 6 screen, multiple opposed speakers 4 can make the voice sound as if it is coming from approximately the position of a person's face.
  • the output direction of speaker array 41 and the output direction of speaker array 42 are opposed to each other to achieve point localization and enhance the sense of realism, but it is also possible to use a general array speaker to achieve surface localization without opposing them.
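As a rough illustration of why the opposed arrangement can localize a voice near face height on the screen, the sketch below (an assumption for illustration, not taken from the disclosure) feeds one speech signal with equal gain to the upper and lower unit of the same column; with identical signals above and below the screen, the phantom image is perceived near the vertical center.

```python
import numpy as np

def route_to_opposed_pair(speech: np.ndarray, column: int, num_columns: int,
                          top_bottom_balance: float = 0.5) -> np.ndarray:
    """Return a (2 * num_columns, n_samples) matrix of per-unit signals.

    Rows 0..num_columns-1 stand for the upper array (speaker units 411),
    rows num_columns..2*num_columns-1 for the lower array (speaker units 412).
    Driving the vertically opposed pair of one column with the same signal
    produces a phantom image near the vertical center of the screen.
    """
    out = np.zeros((2 * num_columns, speech.shape[0]))
    out[column] = top_bottom_balance * speech                        # upper unit 411
    out[num_columns + column] = (1.0 - top_bottom_balance) * speech  # lower unit 412
    return out

# Example: a 1 kHz test tone sent to the 3rd column of a 6-column pair of arrays.
fs = 48_000
tone = 0.1 * np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
unit_signals = route_to_opposed_pair(tone, column=2, num_columns=6)
```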
  • the transmission unit 10 is a function used in the transmitting remote communication device 1. In the following explanation, the case where sound is sent from the remote to the local will be explained, i.e., the remote will be the transmitting side and the local will be the receiving side. As shown in FIG. 1 , the transmission unit 10 has a human sensing unit 11, a voice separation unit 12, an acoustic signal processing unit 13, an output integration unit 14, and a transmission unit 15.
  • the human sensing unit 11 receives an input of an image captured by the camera 2 in a specific space on the remote side where the camera 2 is placed. The human sensing unit 11 then executes human sensing processing including human position determination processing and infield/outfield determination processing, which will be described below.
  • In the following description, the space on the remote side captured by the camera 2 will be referred to as the "remote space."
  • the human sensing unit 11 detects the position of the horizontal origin in the remote space from the image of camera 2.
  • Figure 3 is a diagram for explaining the processing of the human sensing unit.
  • the human sensing unit 11 assumes that the origin is located at the horizontal center 600 of the screen when the image of camera 2 in the remote space is projected onto the screen of display 6.
  • the human sensing unit 11 sets normalized coordinates that indicate the horizontal position from the origin on the screen of display 6.
  • the human sensing unit 11 sets the distance from the horizontal edge of the screen of display 6 to the center 600 as a distance of 0.5 in normalized coordinates.
  • the human sensing unit 11 uses the x coordinate as a normalized coordinate, and sets the normalized coordinate of the right edge of the screen of display 6 to 0.5 and the normalized coordinate of the left edge to -0.5. That is, in this embodiment, the human sensing unit 11 sets normalized coordinates with the screen width set to 1.
  • the human sensing unit 11 performs the following human position determination process.
  • the human sensing unit 11 performs skeletal recognition of each speaker who appears in the image captured by camera 2.
  • the human sensing unit 11 acquires the normalized coordinates of the neck of each speaker who appears in the image captured by camera 2.
  • the human sensing unit 11 can acquire normalized coordinates that indicate the position of each speaker, and can determine the lateral positional relationships between speakers.
  • the human sensing unit 11 can perform skeletal recognition using general tools that are already commercially released.
  • the human sensing unit 11 generates a person list consisting of elements equal to the number of speakers whose normalized coordinates have been acquired. In this embodiment, the human sensing unit 11 arranges the elements in the person list in ascending order of normalized coordinate values.
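A minimal sketch of the person list described above, assuming the neck coordinates have already been obtained from an off-the-shelf skeleton-recognition tool; the elements are sorted in ascending order of the normalized x coordinate, and the infield flag added later is initialized to False. The class and field names are illustrative only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Person:
    normalized_x: float      # horizontal neck position, -0.5 (left) .. 0.5 (right)
    infield: bool = False    # set True later by the infield/outfield determination

def build_person_list(neck_normalized_xs: List[float]) -> List[Person]:
    """Create one element per detected speaker, ordered left to right."""
    return [Person(normalized_x=x) for x in sorted(neck_normalized_xs)]

# Example: three speakers detected at normalized x = 0.30, -0.25 and 0.05.
person_list = build_person_list([0.30, -0.25, 0.05])
# -> [Person(-0.25), Person(0.05), Person(0.30)]
```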
  • the images at the bottom of Figure 3 show details of the person position determination process and the infield/outfield determination process.
  • the person sensing unit 11 performs skeletal recognition to obtain the normalized coordinates of speaker 61's neck 601, the normalized coordinates of speaker 62's neck 602, and the normalized coordinates of speaker 63's neck 603.
  • the person sensing unit 11 then confirms from the acquired normalized coordinates that speakers 61, 62, and 63 are lined up from the left of video 60 in this order.
  • the person sensing unit 11 then sets elements in the person list in the order of speaker 61, speaker 62, and speaker 63, in accordance with video 60.
  • the human sensing unit 11 performs skeleton detection from the video, but the coordinates obtained from the image are usually not expressed in a normalized coordinate system but in screen coordinates using the number of pixels in the video. Therefore, in practice, the human sensing unit 11 obtains the screen coordinates of the speaker's position from the video in the human position determination process, and then performs coordinate conversion from those screen coordinates to normalized coordinates.
  • Figure 4 is a diagram showing the coordinate conversion from screen coordinates to normalized coordinates.
  • the axes of the screen coordinates 621 are the u-axis and v-axis
  • the axes of the normalized coordinates 622 are the x-axis and y-axis.
  • the human sensing unit 11 determines the position of each speaker from the image projected on the screen 620 using screen coordinates 621 having a u-axis and a v-axis, and then converts the determined position of each speaker into normalized coordinates 622 having an x-axis and a y-axis. Furthermore, if the screen width of the screen 620 is W_Scr and the screen height is H_Scr, the ranges that each coordinate system can take are 0 ≤ u ≤ W_Scr and 0 ≤ v ≤ H_Scr, and -0.5 ≤ x ≤ 0.5 and -0.5 ≤ y ≤ 0.5.
  • the vertical normalized coordinate system is expressed in the range from -0.5 to 0.5, with the vertical center of the image set as 0. In this case, the human sensing unit 11 performs coordinate conversion using the following equation (1). By performing coordinate conversion in this manner, processing can be performed regardless of the camera's angle of view or aspect ratio.
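Equation (1) is not reproduced in this excerpt. A plausible form consistent with the ranges stated above (0 to W_Scr maps to -0.5 to 0.5, likewise for the vertical axis) is sketched below; the sign convention for the y axis is an assumption.

```python
def screen_to_normalized(u: float, v: float, w_scr: int, h_scr: int) -> tuple[float, float]:
    """Convert pixel coordinates (u, v) into normalized coordinates (x, y).

    The screen width and height are mapped to the range [-0.5, 0.5] so that
    later processing is independent of the camera's angle of view and aspect ratio.
    """
    x = u / w_scr - 0.5
    y = v / h_scr - 0.5   # assumed orientation; flip the sign if y should grow upward
    return x, y

# Example: the center pixel of a 1920x1080 frame maps to (0.0, 0.0).
print(screen_to_normalized(960, 540, 1920, 1080))
```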
  • the human sensing unit 11 executes the following infield/outfield determination process.
  • the human sensing unit 11 estimates the head direction of each person in the video, i.e., the direction in which the speaker is facing. For example, if the normalized coordinates of p people are obtained in the human position determination process, the human sensing unit 11 detects the faces of the speakers located at each of the normalized coordinates for each of the p people, and estimates the head direction using image analysis, etc. More specifically, the human sensing unit 11 estimates the head direction by taking the vector in the positive direction of the normalized coordinates as 0 degrees and estimating the angle θ of the head direction vector measured from this 0-degree vector.
  • the person sensing unit 11 has an infield range in advance for determining whether each speaker is in the infield or outfield. For example, the person sensing unit 11 defines the state in which the speaker is facing the camera 2 as 90 degrees, and defines a range around a predetermined angle of 90 degrees as the infield range. The person sensing unit 11 then determines that speakers whose head direction is within the infield range are in the infield, and determines that speakers whose head direction is not within the infield range are outfield people. Next, the person sensing unit 11 provides an infield flag item for each element in the person list and sets the initial value to False. The person sensing unit 11 then sets the infield flag to True for speakers determined to be in the infield.
  • the person sensing unit 11 also sets the infield flag to False for speakers determined to be in the outfield.
  • the human sensing unit 11 can set the infield flag of each speaker in the person list to True if the speaker is in the infield, to False if the speaker is in the outfield, and also to False if the head direction of the speaker could not be detected.
  • the person sensing unit 11 estimates head directions 611-613 for speakers 61-63. Here, the infield range is set to 45 degrees on either side of the front direction 610, which corresponds to 90 degrees, and the person sensing unit 11 performs the infield/outfield determination processing. In this case, because head direction 611 is included in the infield range, the person sensing unit 11 determines that speaker 61 is in the infield. Furthermore, because head directions 612 and 613 are not included in the infield range, the person sensing unit 11 determines that speakers 62 and 63 are people in the outfield. Then, the person sensing unit 11 sets the infield flag for speaker 61 in the person list to True, and the infield flags for speakers 62 and 63 to False.
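A sketch of the infield/outfield determination under the conventions above: 0 degrees is the positive x direction of the normalized coordinates, facing the camera is 90 degrees, and the infield range is 90 plus or minus 45 degrees. The head-direction estimator itself is outside the scope of this sketch, and the example angles are assumed values, not taken from FIG. 3.

```python
def classify_infield(head_directions_deg, front_deg=90.0, half_range_deg=45.0):
    """Return one infield flag per speaker.

    Head direction is measured from the positive x direction of the normalized
    coordinates (0 degrees); facing the camera corresponds to 90 degrees.
    """
    return [abs(theta - front_deg) <= half_range_deg for theta in head_directions_deg]

# Example: assumed head directions for speakers 61, 62 and 63.
flags = classify_infield([100.0, 170.0, 20.0])   # -> [True, False, False]
```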
  • the human sensing unit 11 then outputs the created human list to the acoustic signal processing unit 13 and the output integration unit 14.
  • the remote space is an example of a "predetermined space”
  • determining whether a speaker is in the infield or outfield is an example of "determining whether a speaker is participating in a conversation.”
  • the human sensing unit 11 determines whether each speaker is participating in a conversation based on the video captured by the camera 2 that captures multiple speakers present in the predetermined space.
  • the human sensing unit 11 also determines the position of each speaker based on the video.
  • the audio separation unit 12 receives audio input collected by the multiple microphones 3.
  • the input to the microphone array 30 includes not only speech sounds generated by speech but also background sounds generated in the environment, such as background noise, background music (BGM), and other sounds. Therefore, the audio separation unit 12 performs audio separation processing to separate the speech sounds input from the microphone array 30 from the background sounds so that the speech sounds can be processed later to adjust their audibility.
  • the audio separation processing can also be considered speech extraction processing.
  • the audio separation unit 12 performs audio separation using vocal extraction technology, such as that used to extract only the singing voice from music containing vocals.
  • the audio separation unit 12 separates background sound from audio in a specified space that corresponds to the remote space.
  • By separating the speech audio from the background sound, it becomes easier to apply separate processing, such as clarity adjustment, to the speech audio and to the background sound.
  • an omnidirectional microphone is used as the microphone 3, but it is also possible to have each speaker wear a pin microphone as the microphone 3, and to use an omnidirectional microphone to pick up background sound.
  • the audio separation unit 12 may use the audio picked up by the pin microphone as the speech of each speaker, and the audio picked up by the omnidirectional microphone as the background sound.
  • the acoustic signal processing unit 13 includes an individual sound separation unit 131 and a clarity adjustment unit 132.
  • the individual sound separation unit 131 receives input of speech sounds extracted by the speech separation process performed by the speech separation unit 12. The individual sound separation unit 131 also acquires the person list created by the person sensing unit 11. The individual sound separation unit 131 then uses the person list to perform the following individual sound separation process to separate the speech sounds of each person appearing in the video.
  • FIG. 5 is a diagram showing the individual sound separation process.
  • the camera 2 and the microphone array 30 made up of the multiple microphones 3 are arranged in series along the shooting direction of the camera 2.
  • the camera 2 and the microphone array 30 are arranged in the normal direction of the two-dimensional image shot by the camera 2, and when considering the coordinates formed by the vertical, horizontal and depth directions in the image, the vertical and horizontal coordinates of the camera 2 and the microphone array 30 match.
  • the speakers in the remote space line up horizontally at positions where the vertical direction of the two-dimensional image shot by the camera 2 matches.
  • the surface that corresponds vertically and horizontally to the positions where the speakers in the remote space are lined up is called the "virtual screen.”
  • the normalized coordinates shown in the range of -0.5 to 0.5 in Figure 5 match the horizontal coordinates of the virtual screen.
  • the distance L between the camera 2 and the microphone 3 is known.
  • x_ph in Figure 5 is the physical distance from the origin of the coordinates on the virtual screen.
  • x_ph can be calculated from the angle of view of the camera 2 and the value of the distance L.
  • the individual sound separation unit 131 creates threads for the number of elements registered in the acquired person list and assigns an ID (identifier) to each thread as an attribute to identify each speaker. For example, the individual sound separation unit 131 assigns thread IDs of 1, 2, 3, ... in order from left to right in the person list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list in each thread as thread-specific values. Then, the individual sound separation unit 131 calculates x_ph, which is the physical distance of each speaker from the origin, using the normalized coordinates obtained in the person position determination process as coordinates on the virtual screen. Then, the individual sound separation unit 131 can calculate the angle θ to a specific person from the calculated physical distance x_ph and the distance L from the virtual screen to the camera 2 using the following equation (2).
  • For example, for speaker 101, the individual sound separation unit 131 acquires x_1 as the normalized coordinate. The individual sound separation unit 131 then calculates the physical distance from the origin to the speaker 101 from the normalized coordinate x_1, and can calculate, using equation (2), the angle θ_1 of the speaker 101 as seen from the origin at the camera 2 and microphone 3 from that physical distance and the distance L. By performing this calculation for each thread, the individual sound separation unit 131 can obtain the angle of each speaker.
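Equation (2) itself is not shown in this excerpt. The sketch below computes the angle under the stated geometry: the normalized coordinate is first converted into a physical offset x_ph on the virtual screen using the camera's horizontal angle of view and the camera-to-screen distance L, and the angle is then taken so that a speaker directly in front of the camera corresponds to 90 degrees. The exact form of equation (2) and the example parameter values are assumptions.

```python
import math

def speaker_angle_deg(normalized_x: float, distance_l: float, horizontal_fov_deg: float) -> float:
    """Angle from the camera/microphone array to a speaker on the virtual screen.

    normalized_x      : speaker position in [-0.5, 0.5] (person position determination)
    distance_l        : distance L from the camera to the virtual screen
    horizontal_fov_deg: horizontal angle of view of the camera
    """
    # Physical width of the virtual screen covered by the camera at distance L.
    screen_width = 2.0 * distance_l * math.tan(math.radians(horizontal_fov_deg) / 2.0)
    x_ph = normalized_x * screen_width          # physical offset from the origin
    # 90 degrees for a speaker straight ahead, smaller/larger toward the sides.
    return math.degrees(math.atan2(distance_l, x_ph))

# Example: speaker 101 at normalized x = 0.25, L = 2.0 m, 90-degree field of view.
theta_1 = speaker_angle_deg(0.25, 2.0, 90.0)
```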
  • the individual sound separation unit 131 performs individual sound separation processing using beamforming on the speech sound extracted by the speech separation performed by the speech separation unit 12, to obtain the speech sound of each speaker.
  • the speech sound extracted by the speech separation performed by the speech separation unit 12 is sound obtained from the microphone array 30, so sensitivity can be ensured in the direction of each speaker. For example, in the case of speaker 101, the individual sound separation unit 131 sets the beam direction to the angle θ_1 with respect to speaker 101 and sets the beam width to a range 102 of 15 degrees centered on the angle θ_1. The individual sound separation unit 131 then specifies the beam direction and beam width so as to suppress sounds arriving from directions outside the specified angular range, thereby ensuring sensitivity in the target direction and acquiring the individual sound of speaker 101.
  • the individual sound separation unit 131 separates individual sounds, which are the speech sounds of each speaker, from the sound in the specified space that corresponds to the remote space, based on the position of each speaker determined by the human sensing unit 11.
  • the individual sound separation unit 131 of this embodiment uses a person-specific list in which the positions of people on the video are registered in the order they appear on the video, and performs beamforming processing while maintaining that order, making it easy to pair the position of each speaker on the video with the individual sound after separation. This makes it easy to match the video and sound output positions when playing back the speech locally after transmission.
  • the clarity adjustment unit 132 receives input of the individual sounds of each person present in the remote space generated by the individual sound separation unit 131.
  • the clarity adjustment unit 132 also receives input of a person list from the individual sound separation unit 131.
  • the clarity adjustment unit 132 receives input of background sounds extracted by audio separation by the audio separation unit 12.
  • the clarity adjustment unit 132 determines for each individual sound whether each speaker is in the infield or outfield using the infield flag in the person list. Then, the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds and background sounds according to the determination result. In detail, the clarity adjustment unit 132 performs enhancement processing on the individual sounds of people in the infield, making them easier to hear. The clarity adjustment unit 132 also performs degrade processing on the individual sounds and background sounds of people in the outfield, making them harder to hear. The clarity adjustment unit 132 then outputs the individual sounds and background sounds that have been subjected to clarity adjustment processing to the output integration unit 14.
  • the clarity adjustment unit 132 can use formant weighting, an equalizing filter, or a reverberation filter for clarity adjustment processing.
  • For formant weighting, the clarity adjustment unit 132 reinforces the second formant and above as the enhancement process, and suppresses the second formant and above as the degradation process.
  • the clarity adjustment unit 132 can also perform equalizing filter processing, such as degrading background sounds using a low-pass filter, or degrade sounds by adding reverberation.
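The disclosure names formant weighting, equalizing filters, and reverberation filters as candidates but does not fix an algorithm. The sketch below only illustrates the equalizing-filter idea with SciPy, using a high-band boost as a rough enhancement and a low-pass filter as the degradation; the cutoff frequency and gain are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import butter, lfilter

def degrade(audio: np.ndarray, fs: int, cutoff_hz: float = 1500.0) -> np.ndarray:
    """Make a signal harder to hear by low-pass filtering
    (used here for background sound and outfield speech)."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    return lfilter(b, a, audio)

def enhance(audio: np.ndarray, fs: int, cutoff_hz: float = 1500.0, gain: float = 2.0) -> np.ndarray:
    """Make a signal easier to hear by boosting the band above the cutoff,
    roughly emphasizing the higher formants (used here for infield speech)."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="high")
    high_band = lfilter(b, a, audio)
    return audio + (gain - 1.0) * high_band

# Example: process an individual sound sampled at 48 kHz.
fs = 48_000
speech = np.random.randn(fs).astype(np.float32) * 0.01   # placeholder signal
clearer = enhance(speech, fs)
duller = degrade(speech, fs)
```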
  • the clarity adjustment unit 132 adjusts the clarity of the speech of speakers who are participating in the conversation as determined by the human sensing unit 11, which is included in the audio of the specified space corresponding to the remote space picked up by the microphone 3.
  • the clarity adjustment unit 132 also performs a clarity reduction adjustment on the speech of speakers who are not participating in the conversation, which is included in the audio of the specified space corresponding to the remote space, to reduce the clarity compared to the audio pickup state by the microphone 3.
  • the clarity adjustment unit 132 also performs a clarity reduction adjustment on the background sound separated by the audio separation unit 12 to reduce the clarity compared to other audio.
  • the clarity adjustment unit 132 determines whether the transmitted sound is background sound or speech, and even within speech, whether the speaker is in the infield or outfield, thereby making it possible to change the ease of audibility and promote smoother dialogue.
  • the output integration unit 14 receives input of the individual sounds and background sounds that have been subjected to clarity adjustment processing from the clarity adjustment unit 132.
  • the output integration unit 14 also receives input of the person list from the person sensing unit 11. Then, the output integration unit 14 executes the clarity adjustment processing described below.
  • Audio processing of background sounds and speech sounds is performed one channel at a time, corresponding to each sound.
  • the output integration unit 14 receives input data equal to the number of speakers plus one channel of background sound. In other words, if the background sound is one channel of data and the number of speakers is p, the output integration unit 14 obtains (1 + p) channels of data.
  • the output integration unit 14 integrates the sounds of all channels into a single piece of audio data.
  • the output integration unit 14 can generate a single piece of audio data using channel interleaving, which includes elements of each channel for each frame.
  • the output integration unit 14 then outputs the generated single piece of audio data to the transmission unit 15.
  • the output integration unit 14 also outputs the person list to the transmission unit 15, along with information that associates each element of the person list and background sound with each channel included in the audio data.
  • the output integration unit 14 may add the data of the person list to the audio data to create a single piece of data.
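A minimal sketch of the channel interleaving described above: the background sound and the p individual sounds are stacked so that each output frame carries one sample from every channel, giving a single (1 + p)-channel stream plus a channel-to-content mapping. The actual wire format is not specified in this excerpt, so the layout below is an assumption.

```python
import numpy as np

def integrate_channels(background: np.ndarray, individual_sounds: list[np.ndarray]):
    """Interleave 1 background channel and p individual-sound channels.

    Returns an (n_frames, 1 + p) array whose frames each contain one sample
    per channel, plus a mapping from channel index to its content.
    """
    channels = [background] + individual_sounds
    n_frames = min(len(c) for c in channels)
    interleaved = np.stack([c[:n_frames] for c in channels], axis=1)
    channel_map = {0: "background"}
    channel_map.update({i + 1: f"speaker_{i}" for i in range(len(individual_sounds))})
    return interleaved, channel_map

# Example: one background channel and two speakers.
bg = np.zeros(480)
mix, channel_map = integrate_channels(bg, [np.ones(480), np.ones(480) * 0.5])
```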
  • the transmitter 15 receives an input of a single piece of sound data including the individual sounds of each speaker and background sound from the output integration unit 14.
  • the transmitter 15 also receives an input of a person list from the output integration unit 14.
  • the transmitter 15 also receives an input of captured video from the camera 2.
  • the transmitter 15 then transmits the single piece of sound data including the individual sounds of each speaker and background sound, the person list, and the video from the camera 2 to the receiving unit 20 of the remote communication device 1 on the local side via the network 7.
  • the network 7 is, for example, the Internet or a LAN (Local Area Network). In this way, the transmitter 15 transmits the audio of a predetermined space corresponding to the remote space processed by the clarity adjustment unit 132.
  • the receiving unit 20 is a function used in the receiving-side remote communication device 1, which is the local side in this embodiment.
  • the receiving unit 20 includes a receiving section 21 and an audio output control section 22, as shown in FIG.
  • the receiving unit 21 receives one piece of sound data including individual sounds of each speaker in the remote space and background sounds, a person list, and video transmitted from the transmitting unit 15 of the remote communication device 1 on the remote side.
  • the receiving unit 21 outputs the received video to the display 6.
  • the display 6 displays the video output from the receiving unit 21 on a screen.
  • the receiving unit 21 also outputs the sound data and the person list to the audio output control unit 22.
  • the receiving unit 21 receives the audio from the specified space corresponding to the remote space transmitted by the transmitting unit 15.
  • the receiving unit 21 also receives the video together with the audio of the specified space corresponding to the remote space, and projects the video onto a screen at whose upper and lower edges the speaker arrays 41 and 42 are arranged.
  • Audio output control unit 22 performs a multiple opposing speaker balancing process described below to specify the speaker units 411 and 412 that will output the background sound, the infield speech sound, and the outfield speech sound, respectively, and plays them back.
  • the audio output control unit 22 receives sound data including the individual sound and background sound of each speaker, as well as an input of a person list, from the receiving unit 21.
  • the speech sound of people in the infield will be referred to as "infield speech sound”
  • the speech sound of people in the outfield will be referred to as "outfield speech sound.”
  • Figure 6 is a diagram showing the multiple opposing speaker balancing process.
  • the image at the top of Figure 6 shows the length of each part of speaker arrays 41 and 42, and graph 200 at the bottom shows the post-processing output sounds from speaker arrays 41 and 42 for background sound, infield speech sounds, and outfield speech sounds.
  • Graph 200 shows the positions corresponding to speaker arrays 41 and 42 on the horizontal axis, and the volume on the vertical axis.
  • the vertical axis of graph 200 shows a value normalized with the peak volume of infield speech sounds set to 1.
  • the audio output control unit 22 has information on the physical speaker array width W_spk and the display width W_disp in advance.
  • the audio output control unit 22 also has information on the number u of speaker units 411 in the speaker array 41 in advance.
  • the number of speaker units 412 in the speaker array 42 is also u.
  • W_spk represents the distance between the leftmost speaker unit 411 and the rightmost speaker unit 411. The same applies to W_spk for the speaker units 412.
  • the inter-speaker distances W_u between the speaker units 411 and between the speaker units 412 are all equal, and the display 6, speaker array 41, and speaker array 42 are aligned horizontally and centrally.
  • W_u is the actual physical distance.
  • the audio output control unit 22 also converts the speaker array width W_spk into the normalized coordinate system as W_spk_n.
  • the audio output control unit 22 also calculates the normalized coordinate x_u of each of the speaker units 411 and 412.
  • the audio output control unit 22 further applies the following volume adjustments to the background sound, whose clarity has been reduced by the degradation process, at the timing of playback to blur the sense of positioning of the background sound.
  • the audio output control unit 22 acquires the background sound from the sound data. Next, the audio output control unit 22 adjusts the loudness of the background sound to 1/u and outputs it from all speaker units 411 and 412. That is, the audio output control unit 22 scales the amplitude of the background sound by 1/u and outputs the background sound from all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42.
  • background sound is output at the same reduced volume from all speaker units 411 and 412, as shown by curve 201 in graph 200.
  • background sound refers to sounds whose source location is unclear, such as natural environmental sounds, background music in a cafe, and office noise.
  • the audio output control unit 22 can reproduce the remote space with a greater sense of realism than if it were played from a single speaker unit 411 or 412.
  • the audio output control unit 22 performs the following process on infield individual sounds that have been enhanced on the remote side.
  • the audio output control unit 22 acquires the infield speech from the sound data using the person list. Next, the audio output control unit 22 plays the acquired infield speech directly from the speaker arrays 41 and 42, without blurring the sense of localization of the speech and without adjusting the volume.
  • the audio output control unit 22 identifies the speaker units 411 and 412 that are closest to the normalized coordinate of the speaker of the infield speech. In other words, if the normalized coordinate of the speaker is x_p, the audio output control unit 22 identifies the speaker units 411 and 412 located at the normalized coordinate x_u calculated by the following formula (3).
  • the audio output control unit 22 outputs the infield speech sound from the identified speaker units 411 and 412. In this case, the audio output control unit 22 does not output the infield speech sound from the other speaker units 411 and 412.
  • the infield speech sound is output from one specific speaker unit 411 and 412 at a louder volume than other sounds, as shown by curve 202 in graph 200.
  • Outfield speech does not need to be clearly audible, because the local person is not directly speaking with those speakers, but as with background sound, it is undesirable to remove it completely from the perspective of spatial connection. Also, if the clarity is lowered but the speech remains audible when listened to carefully, a local person has an opportunity to talk to someone in the outfield, so that the communication can develop and shift to the infield. Therefore, to make infield speech relatively easier to hear and outfield speech relatively harder to hear, degrading processing is performed on the individual outfield sounds on the remote side. Then, the audio output control unit 22 performs processing to blur the sense of localization of the acquired outfield speech sounds.
  • the audio output control unit 22 blurs the sense of localization of outfield speech by processing using a normal distribution function.
  • the normal distribution function is expressed by the following equation (4), where μ is the mean value and σ is the standard deviation.
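Equation (4) itself is not reproduced in this excerpt; assuming it is the standard normal density with mean μ and standard deviation σ (with μ set to the speaker's normalized position in the processing that follows), it can be written as:

```latex
f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right)
```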
  • FIG. 7 is a diagram showing processing using a normal distribution function for outfield speech.
  • the audio output control unit 22 creates a function f(x) represented by a graph 211 whose peak is the position of the speaker of the outfield speech.
  • f(x) is expressed by the following Equation (5).
  • the x-coordinate range of the normal distribution function is normally from -∞ to ∞, but the audio output control unit 22 restricts the range reproduced by the speaker arrays 41 and 42, for example, by playing back only from the speaker units 411 and 412 within the 95% confidence interval.
  • the audio output control unit 22 plays back one person's outfield speech from five speaker units 411 and 412 each. By narrowing the range of speaker units 411 and 412 to be played back in this way, the audio output control unit 22 can reduce the processing load.
  • sf can be set to 0.8, for example.
  • the audio output control unit 22 then scales the amplitude of the outfield speech sound by g(x) according to the normalized coordinate, and reproduces the outfield speech sound from a specified number of speaker units 411 and 412 centered around the speaker position of the outfield speech sound.
  • the value of the scaling factor sf (0 ⁇ sf ⁇ 1) that defines the peak value and the speaker playback range can be changed by the user using a configuration file, etc.
  • the outfield speech sound is output at a lower volume than the infield speech sound from the speaker units 411 and 412 within a specified range centered around the speaker position, as shown by curve 203 in graph 200 of FIG. 6.
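A sketch, under assumed geometry and parameter values, of how the three playback rules above can be combined: background sound at 1/u from every unit (curve 201), infield speech from the single column nearest to the speaker (curve 202), and outfield speech weighted with a normal-distribution curve centered on the speaker and scaled by sf (curve 203). W_spk_n, sigma, and sf are assumptions; the same per-column gains drive the vertically paired speaker units 411 and 412.

```python
import numpy as np

def unit_positions(num_units: int, w_spk_n: float) -> np.ndarray:
    """Normalized x coordinates of the speaker-unit columns, centered on the screen."""
    return np.linspace(-w_spk_n / 2.0, w_spk_n / 2.0, num_units)

def column_gains(kind: str, speaker_x: float, num_units: int,
                 w_spk_n: float = 1.0, sf: float = 0.8, sigma: float = 0.1) -> np.ndarray:
    """Per-column gains for one sound; each gain feeds both the upper (411) and
    lower (412) unit of that column."""
    xs = unit_positions(num_units, w_spk_n)
    if kind == "background":
        return np.full(num_units, 1.0 / num_units)           # curve 201: all units, reduced
    if kind == "infield":
        gains = np.zeros(num_units)
        gains[int(np.argmin(np.abs(xs - speaker_x)))] = 1.0  # curve 202: nearest column only
        return gains
    if kind == "outfield":
        g = np.exp(-((xs - speaker_x) ** 2) / (2.0 * sigma ** 2))
        return sf * g / g.max()                              # curve 203: blurred, peak = sf
    raise ValueError(kind)

# Example: 6 columns, infield speaker at x = -0.2, outfield speaker at x = 0.3.
print(column_gains("background", 0.0, 6))
print(column_gains("infield", -0.2, 6))
print(column_gains("outfield", 0.3, 6))
```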
  • the audio output control unit 22 performs the following processing on the multiple opposed speakers 4, which have two arrays: a speaker array 41 in which multiple speaker units 411 are arranged in a row, and a speaker array 42 in which multiple speaker units 412 are arranged in a row. Based on the position of each speaker, the audio output control unit 22 selects speaker units 411 and 412 to play back the speech of each speaker from the audio received by the receiving unit 21 in a specified space corresponding to the remote space. The audio output control unit 22 then causes the selected speaker units 411 and 412 to play back the speech of each speaker. Furthermore, for each speaker shown on the screen, the audio output control unit 22 selects speaker units 411 and 412 to play back based on the position of the speaker so that the speech is played back near the position on the screen.
  • the audio output control unit 22 makes the voice of the infield speaker, who is the local conversation partner, relatively easier to hear, while maintaining a sense of connection with the remote space by leaving background sounds and outfield speech.
  • the audio output control unit 22 plays background sounds and outfield speech from multiple speaker units 411 and 412, blurring the sense of sound localization and making them more difficult to hear, while making infield speech relatively easier to hear, allowing you to better concentrate on the conversation with the remote party.
  • Data flow between remote communication devices according to the first embodiment: FIG. 8 is a diagram showing an outline of the data flow between the remote-side and local-side remote communication devices according to the first embodiment.
  • the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described.
  • the transmission of audio will be described.
  • the sending-side process 221 is a process executed by the remote communication device 1 located on the remote side.
  • the receiving-side process 222 is a process executed by the remote communication device 1 located on the local side.
  • the microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S1).
  • data from the microphone array 30 is s channel data.
  • the camera 2 also inputs the video generated by capturing images of the remote space to the remote communication device 1 (step S2).
  • the video input from camera 2 is sent to the human sensing unit 11.
  • the human sensing unit 11 uses the video captured by camera 2 to perform human sensing processing to determine the position of each person in the remote space and to determine whether each person is in the infield or outfield (step S3).
  • the human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Then, the human sensing unit 11 performs skeletal recognition on speakers in the image and obtains normalized coordinates of the position of each speaker. After that, the human sensing unit 11 generates a person list having elements corresponding to the order of speakers and registers the normalized coordinates of each person's position, thereby executing the human position determination process (step S31). Here, the human sensing unit 11 extracts p speakers present in the image. In other words, p elements are registered in the person list.
  • the human sensing unit 11 also estimates the head direction of each speaker in the video. Then, depending on whether the estimated head direction is within a predetermined infield range, the human sensing unit 11 determines whether each speaker is in the infield or outfield, and executes the infield/outfield determination process by registering an infield flag indicating the infield or outfield in the person list based on the determination result (step S32).
  • Audio separation unit 12 performs audio separation processing to separate speech and background sounds contained in the audio input from microphone 3 (step S4). For the s-channel data input from microphone array 30, audio separation unit 12 generates one channel of data as background sound and s-channel data as speech.
  • The background sound, which is the one-channel data extracted by the audio separation unit 12, is sent to the clarity adjustment unit 132 of the acoustic signal processing unit 13.
  • the clarity adjustment unit 132 performs a clarity adjustment process, called a degradation process, to make the background sound less audible (step S5).
  • The speech sound, which is the s-channel data extracted by the speech separation unit 12, is sent to the individual sound separation unit 131 of the acoustic signal processing unit 13.
  • the person list is also sent to the individual sound separation unit 131.
  • the acoustic signal processing unit 13 performs speech sound signal processing on the speech sound using the person list (step S6).
  • the individual sound separation unit 131 creates p threads, p being the number of elements registered in the acquired person list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list as thread-specific values for each thread. The individual sound separation unit 131 then calculates the physical distance using the normalized coordinates of each speaker as coordinates on the virtual screen. The individual sound separation unit 131 then determines the angle θ to each speaker using the calculated physical distance and the distance from the virtual screen to the camera 2. Next, the individual sound separation unit 131 performs individual sound separation processing on the speech sounds extracted by the speech separation performed by the speech separation unit 12, using beamforming according to the angle to each speaker (step S61). As a result, the individual sound separation unit 131 generates individual sounds, which are one-channel data for each speaker. In other words, one-channel individual sound data is generated for each of the p threads.
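The disclosure specifies beamforming toward the angle θ of each speaker with a 15-degree beam width but does not name a particular algorithm. The sketch below shows a plain delay-and-sum beamformer for a uniform linear microphone array as one possible realization; the array spacing, sample rate, speed of sound, and sign convention for the steering delays are assumptions.

```python
import numpy as np

def delay_and_sum(mic_signals: np.ndarray, theta_deg: float,
                  mic_spacing_m: float = 0.04, fs: int = 48_000,
                  speed_of_sound: float = 343.0) -> np.ndarray:
    """Steer a uniform linear array toward theta_deg (90 degrees = straight ahead).

    mic_signals: (num_mics, n_samples) array from the microphone array 30.
    Returns a single-channel signal with sensitivity toward the target speaker.
    """
    num_mics, n_samples = mic_signals.shape
    # Microphone positions along the array, centered at the origin.
    positions = (np.arange(num_mics) - (num_mics - 1) / 2.0) * mic_spacing_m
    # Extra path length per microphone for a plane wave arriving from theta_deg.
    delays_s = positions * np.cos(np.radians(theta_deg)) / speed_of_sound
    delays_smp = np.round(delays_s * fs).astype(int)
    out = np.zeros(n_samples)
    for sig, d in zip(mic_signals, delays_smp):
        out += np.roll(sig, -d)   # shift each channel to time-align the target direction
    return out / num_mics

# Example: extract the individual sound of a speaker at theta = 63 degrees.
mics = np.random.randn(8, 48_000) * 0.01   # placeholder 8-microphone recording
individual_sound = delay_and_sum(mics, 63.0)
```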
  • the clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. Then, as a clarity adjustment process, the clarity adjustment unit 132 performs an enhancement process to make the individual sounds of infield speakers easier to hear, and a degradation process to make the individual sounds of outfield speakers harder to hear (step S62).
  • the background sound that has undergone clarity adjustment processing is input to the output integrating unit 14.
  • the individual sounds that have undergone speech signal processing are also input to the output integrating unit 14.
  • the output integrating unit 14 obtains data for (1 + p) channels.
  • the output integrating unit 14 then executes output integration processing to integrate the sounds from all channels to generate a single sound data set (step S7).
  • the sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 via the network 7 to the local-side remote communication device 1 (step S8).
  • By sending the person list, the normalized coordinates of each speaker determined by the person position determination process and the information on whether each speaker is in the infield or outfield determined by the infield/outfield determination process are sent to the local-side remote communication device 1.
  • the local remote communication device 1 receives the sound data and the person list.
  • the sound data and person list are sent to the audio output control unit 22 via the receiving unit 21.
  • the audio output control unit 22 uses the person list to perform a multiple facing speaker balancing process on the sound data, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound for playback (step S9).
  • the audio output control unit 22 adjusts the loudness of the background sound by dividing it by the number of pairs of speaker units 411 and 412, and plays the adjusted sound on all speaker units 411 and 412.
  • the audio output control unit 22 also plays the infield speech sound directly on the speaker units 411 and 412 closest to the position of the speaker of the infield speech sound.
  • the audio output control unit 22 processes the outfield speech using a normal distribution function with a scaling factor, and limits the range of speaker units 411 and 412 to be used, causing the speaker units 411 and 412 to reproduce the outfield speech.
  • the speaker arrays 41 and 42 output background sounds, infield speech sounds, and outfield speech sounds using the speaker units 411 and 412 specified by the audio output control unit 22 (step S10).
  • <Transmission side processing> FIG. 9 is a flowchart of the transmission side process. The overall flow of the transmission side process will be described with reference to FIG. 9.
  • the human sensing unit 11 acquires video of the remote space captured by the camera 2.
  • the audio separation unit 12 also acquires audio of the remote space picked up by the microphone 3 (step S11).
  • the person sensing unit 11 performs a person position determination process using the video captured by the camera 2 to determine the normalized coordinates of each speaker's position and to create a person list that has one element per person and in which the normalized coordinates of each speaker are registered (step S12).
  • the human sensing unit 11 performs an infield/outfield determination process using the video and the person list to determine whether each speaker is in the infield or outfield (step S13).
  • the audio separation unit 12 performs audio separation processing to separate the speech and background sounds contained in the audio input from the microphone 3 (step S14).
  • the clarity adjustment unit 132 receives the background sound extracted by the audio separation unit 12. The clarity adjustment unit 132 then performs a degrading process to make the background sound more difficult to hear as a clarity adjustment process (step S15).
  • the individual sound separation unit 131 receives input of speech extracted by the audio separation unit 12.
• The individual sound separation unit 131 also acquires the person list from the person sensing unit 11.
• The individual sound separation unit 131 generates threads equal to the number of elements registered in the acquired person list (step S16).
• The individual sound separation unit 131 assigns IDs to the threads, numbered sequentially from 1 up to the number of elements registered in the person list.
• In the following, the process is described for the case where the number of elements registered in the person list is p.
  • the individual sound separation unit 131 performs individual sound separation processing to separate the speech into individual sounds for each speaker (step S17).
  • the clarity adjustment unit 132 receives input of individual sounds from the individual sound separation unit 131. Next, the clarity adjustment unit 132 initializes i to 0 (step S18).
  • the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds in the thread whose ID is i (step S19).
  • the clarity adjustment unit 132 determines whether i is less than p (step S20). If i is less than p (step S20: Yes), the clarity adjustment unit 132 increments i by 1 (step S21). Then, the clarity adjustment unit 132 returns to step S19.
• If i is not less than p (step S20: No), the clarity adjustment unit 132 outputs the background sound and individual sounds that have been subjected to clarity adjustment processing to the output integrator 14.
  • the output integrator 14 also receives video input from the camera 2.
  • the output integrator 14 then executes output integration processing to integrate the background sound and individual sounds to generate a single piece of sound data (step S22).
• The transmission unit 15 transmits the sound data generated by the output integration unit 14 and the video captured by the camera 2 to the local-side remote communication device 1 via the network 7 (step S23).
  • Fig. 10 is a flowchart of the person position determination process. The process shown in Fig. 10 is an example of the process executed in step S12 in Fig. 9. Next, the flow of the person position determination process will be described with reference to Fig. 10.
  • the human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Next, the human sensing unit 11 performs skeletal recognition on the speakers in the image to identify the position of each speaker's neck and obtain the screen coordinates of the neck (step S101).
  • the human sensing unit 11 converts the screen coordinates into normalized coordinates (step S102).
  • the human sensing unit 11 registers the screen coordinates and normalized coordinates of each speaker in the person list (step S103).
• As the screen coordinates and normalized coordinates of each speaker, the human sensing unit 11 stores the horizontal coordinates obtained when the video is projected onto the screen of the display 6.
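• As a minimal illustration of the coordinate conversion in steps S101 to S103, the following Python sketch assumes the convention used later in this description, where the horizontal normalized coordinate runs from -0.5 at the left edge of the frame to 0.5 at the right edge; the 1920-pixel frame width and the dictionary-based person list entry are illustrative assumptions only.

    def to_normalized_x(pixel_x: float, frame_width_px: float) -> float:
        """Convert a horizontal screen (pixel) coordinate of a detected neck
        position into a normalized coordinate with the frame center at 0."""
        return pixel_x / frame_width_px - 0.5

    # Example person list entry for a speaker whose neck was detected at x = 480
    # in a 1920-pixel-wide frame; the infield flag is set later (Fig. 11).
    person = {"screen_x": 480,
              "normalized_x": to_normalized_x(480, 1920),
              "infield": False}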
  • Fig. 11 is a flowchart of the infield/outfield determination process. The process shown in Fig. 11 is an example of the process executed in step S13 in Fig. 9. Next, the flow of the infield/outfield determination process will be described with reference to Fig. 11.
  • the human sensing unit 11 creates an infield flag item in the person list, initializes it, and sets all items equal to the number of people present in the remote space to False (step S111).
  • the human sensing unit 11 initializes i to 0 (step S112).
  • the human sensing unit 11 detects the head direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S113).
  • the human sensing unit 11 determines whether the head direction of the i-th speaker can be detected (step S114). If the head direction cannot be detected (step S114: No), the human sensing unit 11 proceeds to step S118.
• On the other hand, if the head direction can be detected (step S114: Yes), the human sensing unit 11 determines whether the head direction is within the infield range (step S115).
• If the head direction is within the infield range (step S115: Yes), the person sensing unit 11 stores True in the infield flag of the element corresponding to the i-th speaker in the person list (step S116).
• If the head direction is outside the infield range (step S115: No), the person sensing unit 11 stores False in the infield flag of the element corresponding to the i-th speaker in the person list (step S117).
  • the human sensing unit 11 determines whether i is less than p (step S118). If i is less than p (step S118: Yes), the human sensing unit 11 increments i by 1 (step S119). Then, the human sensing unit 11 returns to step S113.
• If i is not less than p (step S118: No), the human sensing unit 11 ends the infield/outfield determination process.
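• The loop of steps S111 to S119 can be pictured with the short Python sketch below; the head-direction detector is passed in as a callable that returns None when detection fails, and the ±30-degree infield range is an illustrative assumption, not a value taken from this description.

    from typing import Callable, Optional

    def set_infield_flags(person_list: list[dict],
                          detect_head_direction: Callable[[dict], Optional[float]],
                          infield_range_deg: float = 30.0) -> None:
        """Initialize every infield flag to False, then set it to True only for
        speakers whose detected head direction falls within the infield range."""
        for person in person_list:
            person["infield"] = False                      # step S111
        for person in person_list:                         # steps S113-S119
            direction = detect_head_direction(person)
            if direction is None:                          # step S114: No
                continue
            person["infield"] = abs(direction) <= infield_range_deg  # steps S115-S117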
  • Fig. 12 is a flowchart of the individual sound separation process. The process shown in Fig. 12 is an example of the process executed in step S17 in Fig. 9. Next, the flow of the individual sound separation process will be described with reference to Fig. 12.
• The individual sound separation unit 131 creates threads for the number of people registered in the person list. Next, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list as thread-specific values for each thread. Then, the individual sound separation unit 131 initializes i to 0 (step S121).
  • the individual sound separation unit 131 calculates the beam direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S122).
  • the individual sound separation unit 131 acquires the individual sound of the i-th speaker by beamforming processing in the beam direction (step S123).
• The individual sound separation unit 131 determines whether i is less than p (step S124). If i is less than p (step S124: Yes), the individual sound separation unit 131 increments i by 1 (step S125). Then, the individual sound separation unit 131 returns to step S122.
• If i is not less than p (step S124: No), the individual sound separation unit 131 terminates the individual sound separation process.
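• A rough sketch of the beam-direction calculation in step S122 is given below; it assumes the geometry described for step S61, in which the speaker's normalized coordinate is mapped to a physical offset on the virtual screen and combined with the distance to the camera 2. The function and parameter names are illustrative.

    import math

    def beam_direction_rad(normalized_x: float,
                           screen_width_m: float,
                           camera_distance_m: float) -> float:
        """Angle theta from the camera/microphone axis to a speaker whose
        horizontal position on the virtual screen is given in normalized form."""
        offset_m = normalized_x * screen_width_m        # physical offset on the screen
        return math.atan2(offset_m, camera_distance_m)  # angle toward the speaker

    # Example: a speaker a quarter of the way to the right on a 2 m wide virtual
    # screen, with the camera 1.5 m away, gives a beam direction of about 18 degrees.
    theta = beam_direction_rad(0.25, 2.0, 1.5)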
  • FIG. 13 is a flowchart of the clarity adjustment process for individual sounds. The process shown in Fig. 13 is an example of the process executed in step S19 in Fig. 9. Next, the flow of the clarity adjustment process will be described with reference to Fig. 13.
  • the clarity adjustment unit 132 initializes i to 0 (step S131).
  • the clarity adjustment unit 132 determines whether the infield flag in the person list for the ith speaker, with the first speaker from the left side of the video being number 0, is True (step S132).
• If the infield flag is True (step S132: Yes), the clarity adjustment unit 132 performs enhancement processing on the individual sound of the i-th speaker (step S133).
• If the infield flag is False (step S132: No), the clarity adjustment unit 132 performs degrading processing on the individual sound of the i-th speaker (step S134).
  • the clarity adjustment unit 132 determines whether i is less than p (step S135). If i is less than p (step S135: Yes), the clarity adjustment unit 132 increments i by 1 (step S136). Then, the clarity adjustment unit 132 returns to step S132.
• If i is not less than p (step S135: No), the clarity adjustment unit 132 terminates the clarity adjustment process.
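• The description above does not fix the exact enhancement and degradation filters, so the sketch below stands in for them with a simple gain change per individual sound; NumPy, the gain values, and the clipping are assumptions for illustration only.

    import numpy as np

    def adjust_clarity(individual_sound: np.ndarray, is_infield: bool) -> np.ndarray:
        """Enhance the individual sound of an infield speaker and degrade that of
        an outfield speaker (steps S132-S134), using a plain gain as a stand-in."""
        gain = 1.5 if is_infield else 0.5      # illustrative values only
        return np.clip(individual_sound * gain, -1.0, 1.0)

    def adjust_all(individual_sounds: list[np.ndarray],
                   person_list: list[dict]) -> list[np.ndarray]:
        """Apply the per-speaker adjustment across all threads (the i loop)."""
        return [adjust_clarity(sound, person["infield"])
                for sound, person in zip(individual_sounds, person_list)]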
• <Multiple opposing speaker balancing process (receiving-side process)> Fig. 14 is a flowchart of the multiple opposing speaker balancing process. The overall flow of the multiple opposing speaker balancing process will be described with reference to Fig. 14. Here, the case where the audio output control unit 22 controls playback using a list for each pair of speaker units 411 and 412 will be described.
  • the audio output control unit 22 receives input of the sound data and the person list from the receiving unit 21.
• The audio output control unit 22 calculates the normalized coordinates for each pair of speaker units 411 and 412, and stores the normalized coordinates in ascending order in a speaker unit list having the same number of elements as the number of pairs of speaker units 411 and 412 (step S32).
• The audio output control unit 22 provides an area for storing output sounds for each element in the speaker unit list, and initializes in each area as many sub-areas as there are channels of background sound and individual sounds (step S33). For example, if the background sound is one-channel data and there are p individual sounds, (1+p) areas for storing the sounds of the (1+p) channels are set in the area for storing output sounds.
  • the audio output control unit 22 initializes i to 0 (step S34).
  • the audio output control unit 22 determines whether the audio of the i-th channel, with the first of the (1+P) channels arranged in order in the audio data being numbered 0, is an individual sound of the speech (step S35).
• If the audio on the i-th channel is the background sound (step S35: No), the audio output control unit 22 performs background sound adjustment (step S36). The audio output control unit 22 then proceeds to step S40.
• If the audio on the i-th channel is an individual speech sound (step S35: Yes), the audio output control unit 22 determines whether the speaker of that individual sound is in the infield using the person list (step S37).
• If the speaker is in the outfield (step S37: No), the audio output control unit 22 performs outfield speech sound adjustment (step S38). The audio output control unit 22 then proceeds to step S40.
• If the speaker is in the infield (step S37: Yes), the audio output control unit 22 performs infield speech sound adjustment (step S39).
  • the audio output control unit 22 stores the audio of the i-th channel in the area for storing output audio in the speaker unit list (step S40).
• The audio output control unit 22 determines whether i is less than p (step S41). If i is less than p (step S41: Yes), the audio output control unit 22 increments i by 1 (step S42). Then, the audio output control unit 22 returns to step S35.
• If i is not less than p (step S41: No), the audio output control unit 22 mixes the audio stored in the areas for storing output audio in the speaker unit list. Then, the audio output control unit 22 outputs the mixed sound from the speaker units 411 and 412 corresponding to the speaker unit list (step S43).
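• The outer structure of Fig. 14 — dispatch each channel according to its type, store the adjusted waveforms per speaker-unit pair, then mix — could look like the following sketch. The three adjust_* callables correspond to the background, infield, and outfield adjustments of Figs. 15 to 17 and are sketched after those figures below; the {pair_index: waveform} return layout is an assumption made for this illustration.

    import numpy as np

    def balance_channels(channels: list[np.ndarray],
                         person_list: list[dict],
                         u: int,
                         adjust_background,
                         adjust_infield,
                         adjust_outfield) -> list[np.ndarray]:
        """Dispatch each received channel to the matching adjustment and mix the
        results per speaker-unit pair (steps S34-S43). Channel 0 is the background
        sound; channel i+1 is the individual sound of speaker i. Each adjust_*
        callable returns a {pair_index: waveform} mapping for that channel."""
        outputs = [np.zeros_like(channels[0]) for _ in range(u)]
        for i, channel in enumerate(channels):
            if i == 0:                                     # step S35: No
                assigned = adjust_background(channel, u)
            elif person_list[i - 1]["infield"]:            # step S37: Yes
                assigned = adjust_infield(channel, person_list[i - 1], u)
            else:                                          # step S37: No
                assigned = adjust_outfield(channel, person_list[i - 1], u)
            for pair_index, waveform in assigned.items():  # step S40
                outputs[pair_index] += waveform
        return outputs                                     # mixed output per pair (step S43)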
  • Fig. 15 is a flowchart of the background sound adjustment process. The process shown in Fig. 15 is an example of the process executed in step S36 in Fig. 14. Next, the flow of the background sound adjustment process will be described with reference to Fig. 15.
  • the audio output control unit 22 adjusts the amplitude of the background sound by a factor of (1/u) (step S201), where u is the number of pairs of speaker units 411 and 412.
  • the audio output control unit 22 assigns waveforms to the channels of all pairs of speaker units 411 and 412 (step S202).
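• A minimal sketch of Fig. 15, assuming NumPy waveforms: the background waveform is scaled by 1/u and the same scaled copy is assigned to every one of the u speaker-unit pairs.

    import numpy as np

    def adjust_background(background: np.ndarray, u: int) -> dict[int, np.ndarray]:
        """Scale the background sound by 1/u (step S201) and assign the same
        waveform to the channel of every speaker-unit pair (step S202)."""
        scaled = background / u
        return {pair_index: scaled.copy() for pair_index in range(u)}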
  • Fig. 16 is a flowchart of the infield speech sound adjustment process. The process shown in Fig. 16 is an example of the process executed in step S39 in Fig. 14. Next, the flow of the infield speech sound adjustment process will be described with reference to Fig. 16.
  • the audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 that are closest to the speaker (step S211). This localization center speaker unit derivation will be explained in detail later.
  • the audio output control unit 22 assigns the waveform to the pair of channels of speaker units 411 and 412 that is closest to the normalized coordinates of the speaker (step S212).
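• Fig. 16 reduces to a single assignment in the same sketch style, assuming the index of the nearest pair obtained by the Fig. 18 derivation is stored with the person entry (the key name "nearest_unit" is illustrative):

    import numpy as np

    def adjust_infield(voice: np.ndarray, person: dict, u: int) -> dict[int, np.ndarray]:
        """Assign the infield voice, unmodified, only to the channel of the
        speaker-unit pair closest to the talker (steps S211-S212)."""
        return {person["nearest_unit"]: voice}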
• <Outfield speech sound adjustment> Fig. 17 is a flowchart of the outfield speech sound adjustment process.
  • the process shown in Fig. 17 is an example of the process executed in step S38 of Fig. 14. Next, the flow of the outfield speech sound adjustment process will be described with reference to Fig. 17.
  • the audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 closest to the speaker (step S221).
  • This localization center speaker unit derivation is the same process as the process used in infield speech sound adjustment, and will be described in detail later.
• When the positions of the pairs of speaker units 411 and 412 are represented by consecutive numbers starting from 0 at the left, the position of the pair closest to the speaker is called the "nearest speaker unit position."
• If the number of pairs of speaker units 411 and 412 is u, the nearest speaker unit position is any one of 0 to u.
  • the audio output control unit 22 initializes k to 0 (step S222).
  • k is a parameter used to control the repetition of the process of determining the playback range.
• The audio output control unit 22 determines whether the value obtained by adding k to the nearest speaker unit position is less than or equal to u (step S225). In other words, the audio output control unit 22 determines whether the position obtained by adding k to the nearest speaker unit position does not extend beyond the right end of the speaker arrays 41 and 42.
• If the value obtained by adding k to the nearest speaker unit position is greater than u (step S225: No), the audio output control unit 22 proceeds to step S227. On the other hand, if the value obtained by adding k to the nearest speaker unit position is equal to or less than u (step S225: Yes), the audio output control unit 22 sets one x_spk to the position obtained by adding k to the nearest speaker unit position (step S226). Thereafter, the audio output control unit 22 proceeds to step S227.
  • the audio output control unit 22 determines whether the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227). In other words, the audio output control unit 22 determines whether the position obtained by subtracting k from the nearest speaker unit position does not extend beyond the left end of the speaker arrays 41 and 42.
• If the value obtained by subtracting k from the nearest speaker unit position is less than 0 (step S227: No), the audio output control unit 22 proceeds to step S229. On the other hand, if the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227: Yes), the audio output control unit 22 sets the other x_spk to the position obtained by subtracting k from the nearest speaker unit position (step S228). Thereafter, the audio output control unit 22 proceeds to step S229.
• The audio output control unit 22 adjusts the amplitude of the audio at each position x_spk by multiplying it by {g(x_spk)}^1.66 (step S229). In other words, the audio output control unit 22 adjusts the amplitude of the audio at both of the positions determined from k.
  • the audio output control unit 22 determines whether k is less than a predetermined range width (step S230).
• If k is equal to or less than the range width (step S230: Yes), the audio output control unit 22 increments k by 1 (step S231). Then, the audio output control unit 22 returns to step S223.
• If k is greater than the range width (step S230: No), the audio output control unit 22 assigns the waveforms to the channels, among the u channels, that lie within ± the range width of the nearest speaker unit position (step S232).
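• The range limiting and attenuation of steps S222 to S232 can be sketched as below. The text above describes g only as a normal distribution function with a scaling factor, so the sigma and scale here are illustrative; the exponent 1.66 is taken from the step S229 description, and treating g as a loudness-domain weight converted to an amplitude factor is an interpretation. The "nearest_unit" key is the same illustrative assumption as in the Fig. 16 sketch.

    import numpy as np

    def g(offset: int, sigma: float = 1.0, scale: float = 1.0) -> float:
        """Illustrative normal-distribution weight over speaker-unit offsets."""
        return scale * float(np.exp(-0.5 * (offset / sigma) ** 2))

    def adjust_outfield(voice: np.ndarray, person: dict, u: int,
                        range_width: int = 2) -> dict[int, np.ndarray]:
        """Assign the outfield voice only to speaker-unit pairs within
        ±range_width of the nearest pair, attenuating each copy by g(x)^1.66."""
        nearest = person["nearest_unit"]          # from the localization center derivation
        channels: dict[int, np.ndarray] = {}
        for k in range(range_width + 1):          # steps S222-S231
            for x_spk in {nearest + k, nearest - k}:
                if 0 <= x_spk <= u:               # stay inside the speaker arrays
                    channels[x_spk] = voice * (g(x_spk - nearest) ** 1.66)
        return channels                           # step S232: assign these waveforms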
  • FIG. 18 is a flowchart of the process of deriving the localization center speaker unit.
  • the process shown in Fig. 18 corresponds to an example of the process executed in step S211 of Fig. 16 and step S221 of Fig. 17. Next, the flow of the process of deriving the localization center speaker unit will be described with reference to Fig. 18.
• The audio output control unit 22 sets the initial value of the shortest distance to the speaker unit 411 to the distance between adjacent speaker units 411 (step S241).
  • the audio output control unit 22 initializes j to 0 (step S242).
  • j is a parameter used to control the repetition of the process of determining whether the distance to the speaker for each pair of speaker units 411 and 412 is the closest speaker unit distance.
• The audio output control unit 22 subtracts, from the speaker's position, the position of the j-th speaker unit 411, where the speaker unit 411 pairs are numbered consecutively from the left starting from 0 (step S243).
• The audio output control unit 22 determines whether the subtraction result is less than the shortest distance (step S244). If the subtraction result is equal to or greater than the shortest distance (step S244: No), the audio output control unit 22 proceeds to step S246.
• If the subtraction result is less than the shortest distance (step S244: Yes), the audio output control unit 22 updates the shortest distance to the subtraction result. Furthermore, the audio output control unit 22 sets the nearest speaker unit position to j (step S245).
  • the audio output control unit 22 determines whether j is less than u, which is the number of speaker units 411 (step S246).
• If j is less than u (step S246: Yes), the audio output control unit 22 increments j by 1 (step S247). Then, the audio output control unit 22 returns to step S243.
• If j is not less than u (step S246: No), the audio output control unit 22 terminates the derivation of the localization center speaker unit.
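• A compact version of the Fig. 18 search, assuming the pair positions and the talker position are all expressed as normalized coordinates and using an absolute distance in place of the raw subtraction:

    def nearest_speaker_unit(speaker_x: float, pair_positions: list[float]) -> int:
        """Return the index of the speaker-unit pair closest to the talker's
        normalized position (steps S241-S247)."""
        shortest = float("inf")                 # initial shortest distance
        nearest_index = 0
        for j, unit_x in enumerate(pair_positions):
            distance = abs(speaker_x - unit_x)  # step S243 (absolute value here)
            if distance < shortest:             # step S244
                shortest = distance
                nearest_index = j               # step S245
        return nearest_index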
  • the remote communication device 1 identifies the positions of speakers from the video, estimates the head direction of each identified speaker, and determines whether each speaker is in the infield or outfield.
  • the remote communication device 1 also separates background sound from the collected audio and acquires the individual sounds of the speakers based on the identified positions.
  • the remote communication device 1 then reduces the volume of the background sound and outfield speech sounds and outputs them from a pair of speaker units 411 and 412 within a certain range.
  • the remote communication device 1 also outputs the infield speech sounds from the speaker units 411 and 412 closest to the speaker's position.
  • the remote communication device 1 according to the second embodiment outputs background sounds without blurring the sense of localization of sudden background sounds.
  • Figure 19 is a block diagram of a remote communication device according to the second embodiment. Note that in Figure 19, the same components as in Figure 1 are designated by the same reference numerals, and the following explanation will focus on the components that differ from Figure 1, and may omit explanations of the components that are the same as in Figure 1.
  • the sound signal processing unit 13 includes a background sound separation unit 133 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132 .
  • the background sound separation unit 133 receives input of the background sound separated by the audio separation unit 12. The background sound separation unit 133 then performs peak detection processing to detect peaks of the background sound. Next, the background sound separation unit 133 performs DoA (Direction Of Arrival) estimation of the sudden sound to estimate the direction from which the sudden sound occurred, and performs sudden sound extraction processing to extract the sudden sound from the background sound.
  • the background sound separation unit 133 performs DoA estimation of the sudden sound by using, for example, technology for visualizing audio or a method using a cross-correlation function of audio signals obtained from the microphone array 30.
  • the background sound separation unit 133 then outputs information about the estimated position of the sudden sound to the output integration unit 14.
  • the background sound separation unit 133 also outputs the extracted sudden sound to the output integration unit 14.
  • the background sound separation unit 133 also performs background sound extraction processing to extract sounds other than sudden sounds from the background sound as steady sounds.
  • the background sound separation unit 133 then outputs the extracted steady sounds to the clarity adjustment unit 132. In this way, the background sound separation unit 133 separates the background sound into sudden sounds and steady sounds.
• The clarity adjustment unit 132 receives an input of the steady sound from the background sound separation unit 133. The clarity adjustment unit 132 then performs a degradation process on the steady sound from the background sound to make it harder to hear. The clarity adjustment unit 132 then outputs the steady sound from the background sound that has been subjected to the clarity adjustment to the output integration unit 14.
  • the output integrating unit 14 receives input of information about sudden sounds and the positions of the sudden sounds in the background sound from the background sound separation unit 133.
  • the output integrating unit 14 also receives input of steady sounds in the background sound from the clarity adjustment unit 132.
  • the output integrating unit 14 acquires individual sounds of the same number of channels as the number of people present in the remote space, as well as one channel of data for each of the sudden sounds and steady sounds in the background sound. In other words, if the number of people present in the remote space is p, the output integrating unit 14 acquires (2+p) channels of data.
  • the output integration unit 14 then integrates each individual sound, the sudden sound from the background sound, and the steady sound from the background sound to generate one piece of audio data.
  • the output integration unit 14 then adds speaker information in the person list corresponding to each individual sound.
  • the output integration unit 14 also associates the sudden sound with positional information of the sudden sound.
  • the output integration unit 14 then outputs the sound data, the person list, and the positional information of the sudden sound to the transmission unit 15, causing it to be transmitted to the local-side remote communication device 1.
  • Audio output control unit 22 in the local remote communication device 1 receives input of audio data including each individual sound, a sudden sound from the background sound, and a steady sound from the background sound.
  • the audio output control unit 22 also receives input of the person list and position information of the sudden sound.
• The audio output control unit 22 acquires the sudden sound from the background sound and the steady sound from the background sound from the sound data. Next, the audio output control unit 22 processes the steady sound from the background sound by multiplying the amplitude by (1/u)^1.66 so that the loudness becomes 1/u. Then, the audio output control unit 22 causes all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42 to output the steady sound with the reduced loudness.
  • the audio output control unit 22 identifies the speaker units 411 and 412 that are closest to the position of the sudden sound. Then, the audio output control unit 22 causes the identified speaker units 411 and 412 to reproduce the sudden sound from the background sound as is.
  • the audio output control unit 22 applies the same processing to the sudden sound as to the infield speech sound in the first embodiment, and outputs the sound; however, it may also apply the same processing to the sudden sound as to the outfield speech sound in the first embodiment, and output the sound.
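• The (1/u)^1.66 amplitude factor described above reflects the fact that perceived loudness grows more slowly than signal amplitude, so reducing the amplitude by (1/u)^1.66 brings the loudness down to roughly 1/u. A one-line NumPy sketch of this scaling, with the exponent taken from the text, is:

    import numpy as np

    def scale_steady_sound(steady: np.ndarray, u: int) -> np.ndarray:
        """Scale the steady background sound so that its loudness on each of the
        u speaker-unit pairs is roughly 1/u of the original."""
        return steady * (1.0 / u) ** 1.66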
• <Data flow between remote communication devices according to the second embodiment> Fig. 20 is a diagram showing an outline of the data flow between the remote-side and local-side remote communication devices according to the second embodiment.
  • the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described.
  • the transmission of audio will be described.
  • the microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S301).
  • data from the microphone array 30 is s channel data.
  • the camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S302).
  • the video input from camera 2 is sent to the human sensing unit 11.
  • the human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S303), including human position determination processing (step S331) and infield/outfield determination processing (step S332).
  • Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S304).
  • the background sound extracted by the audio separation unit 12 is sent to the background sound separation unit 133 of the audio signal processing unit 13.
  • the background sound separation unit 133 executes peak detection processing to detect peaks in the background sound (step S305).
  • the background sound separation unit 133 executes a sudden sound extraction process to estimate the DoA of the sudden sound and extract the sudden sound from the background sound (step S306).
  • the sudden sound in the background sound and its position information are sent to the output integration unit 14.
  • the background sound separation unit 133 also executes a steady sound extraction process to extract steady sounds from the background sounds (step S307).
  • the steady sounds from the background sounds are sent to the clarity adjustment unit 132.
  • the clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S308).
• The speech sounds, which are s-channel data extracted by the audio separation unit 12, and the person list are sent to the audio signal processing unit 13.
  • the audio signal processing unit 13 performs speech signal processing (step S309), including individual sound separation processing by the individual sound separation unit 131 (step S391) and clarity adjustment processing by the clarity adjustment unit 132 (step S392).
• The steady sound, which is one channel of data, the sudden sound, which is one channel of data, and the position information of the sudden sound are input to the output integration unit 14.
• The individual sounds, which are p channels of data that have been subjected to speech signal processing, are also input to the output integration unit 14.
  • the output integration unit 14 obtains (2+P) channel data.
  • the output integration unit 14 executes output integration processing, which integrates the sounds of all channels to generate a single sound data (step S310).
  • the sound data, person list, and sudden sound location information generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S311).
  • the local remote communication device 1 receives the sound data, the person list, and the positional information of the sudden sound.
  • the sound data, the person list, and the positional information of the sudden sound are sent to the audio output control unit 22 via the receiving unit 21.
  • the audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list and the positional information of the sudden sound.
  • the audio output control unit 22 specifies the speaker units 411 and 412 to output for each of the background sound, infield speech sound, and outfield speech sound, and plays them (step S312).
  • the audio output control unit 22 plays the steady sound of the background sound in all speaker units 411 and 412, adjusting the loudness by dividing it by the number of pairs of speaker units 411 and 412.
• The audio output control unit 22 also plays the sudden sound of the background sound as is on the speaker units 411 and 412 closest to the position of the sudden sound.
  • the speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output sudden sounds from the background sound, steady sounds from the background sound, infield speech sounds, and outfield speech sounds (step S313).
  • the remote communication device 1 divides background sounds into two types, sudden sounds and steady sounds, and transmits position information for the background sounds.
  • the remote communication device 1 outputs the sudden sounds among the background sounds without blurring their localization, and outputs the steady sounds among the background sounds with blurring their localization.
  • the remote communication device 1 according to the third embodiment converts the speech voice picked up by the microphone 3 into a mechanical voice to make it easier to hear.
  • FIG. 21 is a block diagram of a remote communication device according to the third embodiment. Note that in FIG. 21, the same components as those in FIG. 1 are designated by the same reference numerals, and the following description will focus on the components that differ from FIG. 1, and may omit a description of the components that are the same as those in FIG. 1.
  • the acoustic signal processing unit 13 includes a voice synthesis unit 134 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132 .
• The speech synthesis unit 134 receives input of the individual sounds of each speaker separated by the individual sound separation unit 131.
  • the speech synthesis unit 134 then executes a speech recognition process to recognize speech and transcribe the speech into text.
• The speech synthesis unit 134 then executes a speech synthesis process to synthesize speech from the transcribed text and regenerate the individual sounds.
  • the speech synthesis unit 134 then outputs the regenerated individual sounds of each speaker to the clarity adjustment unit 132.
  • the speech synthesis unit 134 performs speech recognition for each individual sound to generate characters indicating the spoken content of the individual sound, and performs speech synthesis based on the characters indicating the spoken content to regenerate the speaker's speech.
  • the clarity adjustment unit 132 makes adjustments for each individual sound regenerated by the speech synthesis unit 134 based on whether or not the speaker is participating in the conversation.
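• The description above names no particular recognition or synthesis engine, so the sketch below only shows the shape of the pipeline; recognize_speech and synthesize_speech are hypothetical callables standing in for whatever ASR and TTS components are actually used.

    import numpy as np
    from typing import Callable

    def regenerate_individual_sounds(individual_sounds: list[np.ndarray],
                                     recognize_speech: Callable[[np.ndarray], str],
                                     synthesize_speech: Callable[[str], np.ndarray]
                                     ) -> list[np.ndarray]:
        """Transcribe each speaker's individual sound to text (speech recognition)
        and re-synthesize the same content as a uniform, machine-generated voice."""
        regenerated = []
        for sound in individual_sounds:
            text = recognize_speech(sound)               # hypothetical ASR callable
            regenerated.append(synthesize_speech(text))  # hypothetical TTS callable
        return regenerated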
• <Data flow between remote communication devices> Fig. 22 is a diagram showing an outline of the data flow between the remote-side and local-side remote communication devices according to the third embodiment.
  • the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described.
  • the transmission of audio will be described.
  • the microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S401).
  • data from the microphone array 30 is s channel data.
  • the camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S402).
  • the video input from camera 2 is sent to the human sensing unit 11.
  • the human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S403), including human position determination processing (step S431) and infield/outfield determination processing (step S432).
  • Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S404).
  • the background sound extracted by the audio separation unit 12 is sent to the clarity adjustment unit 132 of the audio signal processing unit 13.
  • the clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S405).
• The speech sounds, which are s-channel data extracted by the audio separation unit 12, and the person list are sent to the acoustic signal processing unit 13.
  • the acoustic signal processing unit 13 performs speech signal processing on the speech using the person list (step S406).
• The individual sound separation unit 131 creates p threads, p being the number of elements registered in the acquired person list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list as thread-specific values for each thread. Next, the individual sound separation unit 131 calculates the angle θ to each speaker. Then, the individual sound separation unit 131 performs individual sound separation processing on the speech sounds extracted by the audio separation unit 12, using beamforming according to the angle to each speaker, and generates individual sounds for each speaker for each thread (step S461).
  • the individual sounds generated by the individual sound separation unit 131 are sent to the speech synthesis unit 134.
  • the speech synthesis unit 134 performs speech recognition processing to recognize the individual sounds and transcribe the speech content into text (step S462).
  • the speech synthesis unit 134 executes a speech synthesis process to synthesize the text of the speech and regenerate the individual sounds (step S463).
  • the individual sounds of each speaker regenerated by the speech synthesis unit 134 are sent to the clarity adjustment unit 132.
  • the clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. The clarity adjustment unit 132 then performs clarity adjustment processing, which performs enhancement processing to make the individual sounds of infield speakers easier to hear, and degrade processing to make the individual sounds of outfield speakers harder to hear (step S464).
• The background sound, which is one channel of data that has undergone clarity adjustment processing, is input to the output integrating unit 14.
• The individual sounds, which are p channels of data that have been subjected to speech signal processing, are also input to the output integrating unit 14.
  • the output integrating unit 14 obtains (1+P) channel data.
  • the output integrating unit 14 then executes output integration processing, which integrates the sounds of all channels to generate a single sound data (step S407).
  • the sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S408).
  • the local remote communication device 1 receives the sound data and the person list.
  • the sound data and the person list are sent to the audio output control unit 22 via the receiving unit 21.
  • the audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound, respectively, and plays them (step S409).
• The speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output the background sound, infield speech sounds, and outfield speech sounds (step S410).
  • the remote communication device 1 performs speech recognition to transcribe the spoken content into text, and then synthesizes the text into speech data to be played back.
  • the remote communication device 1 of this embodiment can reduce variations in voice volume and articulation, making the speech easier to hear overall, thereby facilitating smoother communication.
• <Remote communication device> Next, a remote communication device 1 according to a fourth embodiment will be described.
• The remote communication device 1 according to the fourth embodiment performs the infield/outfield determination using a method other than head direction estimation.
  • the remote communication device 1 according to this embodiment is also represented by the block diagram of FIG. 1. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.
• <Human sensing unit> The human sensing unit 11 performs depth sensing to detect the distance of each speaker from the video. Then, under the assumption that people far from the camera 2 are not conversing with the other side, the human sensing unit 11 determines that people far from the camera 2 are outfield people. In this way, the human sensing unit 11 determines the distance of each speaker based on the video, and determines whether the speaker is participating in the conversation based on that distance.
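• As a sketch of this distance-based rule, assuming a depth estimate is available per person and using an illustrative 2.5 m threshold that does not appear in this description:

    from typing import Callable

    def set_infield_by_distance(person_list: list[dict],
                                estimate_distance_m: Callable[[dict], float],
                                max_conversation_distance_m: float = 2.5) -> None:
        """Treat speakers far from the camera as outfield, on the assumption that
        they are not addressing the remote side."""
        for person in person_list:
            distance = estimate_distance_m(person)   # depth sensing result
            person["infield"] = distance <= max_conversation_distance_m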
• <Acoustic signal processing unit> The acoustic signal processing unit 13 performs speech recognition on each individual sound generated by the individual sound separation unit 131 and transcribes the speech content into text. Then, the acoustic signal processing unit 13 determines whether the speech is directed to a person on the local side based on the context of the transcribed speech content. The acoustic signal processing unit 13 can make this determination, for example, by performing AI analysis of emotions, empathy level, and listening level.
• If the acoustic signal processing unit 13 determines that the speech is directed at a person on the local side, it determines that the speaker is an infield person. The acoustic signal processing unit 13 then sets the infield flag of the speaker determined to be an infield person in the person list to True, and updates the person list. The acoustic signal processing unit 13 then outputs the updated person list to the clarity adjustment unit 132.
• For example, the voice of a person giving a presentation while looking at slides, or the interjections of a person listening while taking notes, are also messages directed to the remote side, and are preferably processed as infield speech. Therefore, by determining infield/outfield based on the context, the speech of a speaker who would be determined to be outfield by head direction estimation but is actually participating in the conversation is transmitted to the other side and played back as the speech of an infield person. This makes it possible to process such speech as easier-to-hear infield speech even in situations where the speaker is addressing the other side without facing it, which can facilitate smooth communication.
• Furthermore, when a local-side user points to a specific position, the local-side human sensing unit 11 extracts the specific position using motion capture, gaze estimation, or the like. Then, the local-side human sensing unit 11 transmits information about the specific position pointed to, via the transmission unit 15, to the remote-side remote communication device 1.
  • the individual sound separation unit 131 extracts the sound at that specific position as an individual sound using beamforming or the like.
  • the clarity adjustment unit 132 applies appropriate processing to the individual sound at the specific position, such as the same processing as that applied to infield speech.
  • the output integration unit 14 then combines this with the other individual sounds into a single data item, and transmits it, along with information about the specific position, to the local remote communication device 1 via the transmission unit 15.
  • the audio output control unit 22 of the local remote communication device 1 extracts individual sounds at specific positions from the sound data and performs appropriate multiple opposing speaker balancing processing, such as the same processing as for infield speech, using the information about the specific positions.
  • the audio output control unit 22 then causes the speaker units 411 and 412 to reproduce the processed individual sounds at the specific positions.
• <Remote communication device> Next, a remote communication device 1 according to a fifth embodiment will be described.
• In the embodiments described above, the camera 2 and microphone 3 are aligned in the normal direction of the virtual screen, and the human sensing unit 11 determines the angle of the direction of each speaker based on the normalized coordinates indicating the position of each speaker in the image captured by the camera 2.
  • the remote communication device 1 according to the fifth embodiment identifies the direction of a sound source using the sound captured by the microphone 3.
  • the remote communication device 1 according to this example is also represented by the block diagram of FIG. 1. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.
  • FIG. 23 is a block diagram of a remote communication device according to the fifth embodiment. Also, FIG. 24 is a diagram for explaining audio signal processing by the remote communication device according to the fifth embodiment.
  • the remote communication device 1 has a sound source direction estimating unit 16 as shown in Fig. 23.
• The sound source direction estimating unit 16 acquires the sounds obtained by the microphone array 30 directing a beam in each direction in turn.
  • the direction to direct the beam may be preset in the microphone 3, or the sound source direction estimating unit 16 may specify the direction to direct the beam in turn to the microphone 3.
  • beamforming is assumed to detect only speech and sudden background sounds, and not pick up steady background sounds. Furthermore, it is assumed that no sounds are generated other than the speaker appearing on the screen and any sudden background sounds occurring within that range, i.e., no sound is heard from outside the field of view.
  • the sound source direction estimation unit 16 selects, from the sounds obtained by directing the beam in each direction, a sound whose output is greater than a threshold, and estimates the sound source direction by assuming that the sound source is located in the direction of the beam that obtained the selected sound (step S501).
  • Step S501 in FIG. 24 shows that a large number of sound source directions have been estimated.
  • the sound source direction estimation unit 16 determines the sound source direction as the position where the speaker is located. In this way, the sound source direction estimation unit 16 determines the position of the speaker based on the sound in a specified space that corresponds to the remote space.
• The sound source direction estimation unit 16 then notifies the person sensing unit 11 of the normalized coordinates of the positions where the determined speakers are located, and a person list is created based on those normalized coordinates.
  • the person sensing unit 11 uses the notified normalized coordinates to identify speakers from among the people shown in the video, performs an infield/outfield determination for each identified speaker, and registers them in the person list.
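• The threshold test of step S501 can be pictured as follows, assuming the microphone array has already produced one captured waveform per scanned beam direction; the power measure and the threshold are illustrative assumptions.

    import numpy as np

    def estimate_source_directions(beam_outputs: dict[float, np.ndarray],
                                   power_threshold: float) -> list[float]:
        """Keep only the beam directions (in degrees) whose captured signal power
        exceeds the threshold and treat them as speaker directions (step S501)."""
        directions = [angle for angle, signal in beam_outputs.items()
                      if float(np.mean(signal ** 2)) > power_threshold]
        return sorted(directions)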
  • the acoustic signal processing unit 13 acquires, from the person sensing unit 11, a person list in which normalized coordinates of the position of each speaker estimated from the voice by the sound source direction estimation unit 16 are registered. Then, the acoustic signal processing unit 13 executes speech signal processing using the acquired person list (step S502).
  • the individual sound separation unit 131 generates threads for each speaker estimated from the audio registered in the person list. Then, the individual sound separation unit 131 performs individual sound separation processing using the normalized coordinates of the position of each speaker for each thread to generate individual sounds for each speaker (step S521). In this way, the individual sound separation unit 131 separates individual sounds, which are the speech sounds of each speaker, from the audio in a specified space corresponding to the remote space, based on the position of the speaker determined by the sound source direction estimation unit 16.
  • the individual sound separation unit 131 also separates the individual sounds from the sound source direction detected using the microphone array 30, starting from the angle on the left side of the image displayed on the display 6, thereby matching the individual sounds with the normalized coordinates indicating the position of each speaker registered in the person list. This allows the image and sound to match even after transmission to the local side.
• For each thread, the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds of the speakers estimated from the audio, using the infield flag of each speaker registered in the person list (step S522). In this way, the clarity adjustment unit 132 adjusts the clarity of each individual sound separated by the individual sound separation unit 131 based on whether the speaker is participating in the conversation.
• <Remote communication device> Next, a remote communication device 1 according to a sixth embodiment will be described.
  • the remote communication device 1 according to this embodiment remaps the audio playback position to match the width of the display 6.
  • the remote communication device 1 according to this example is also represented by the block diagram of Figure 1. The following description will focus on the parts that are different from Figure 1, and descriptions of parts that are the same as those in Figure 1 may be omitted.
• <Audio output control unit> Fig. 25 is a diagram illustrating audio playback by the remote communication device according to the sixth embodiment.
• In the sixth embodiment, the speaker array width W_spk of the speaker arrays 41 and 42 is shorter than the display width W_disp, which is the width of the screen of the display 6.
• Normally, in order to match the image and the audio in the normalized coordinate system, the audio output control unit 22 would determine the positions used in processing by dividing the speaker array width W_spk and the speaker unit distance W_u by the display width W_disp.
• However, when the speaker array width W_spk is shorter than the display width W_disp, it becomes difficult to reproduce the audio of a speaker who appears at the edge of the screen of the display 6.
• Therefore, in this embodiment, the audio output control unit 22 determines the positions used in processing by dividing the speaker array width W_spk and the speaker unit distance W_u by the speaker array width W_spk.
• As a result, the width of the speaker arrays 41 and 42 has a length of 1 in normalized coordinates, and the width between adjacent pairs of speaker units 411 and 412 is the value obtained by dividing the speaker unit distance W_u by the speaker array width W_spk.
• In other words, the audio output control unit 22 sets the speaker array width used in processing to the range -0.5 to 0.5 in normalized coordinates. Then, the audio output control unit 22 executes the multiple facing speaker balancing process using these positions in processing to determine the pairs of speaker units 411 and 412 that will output each sound.
  • the audio output control unit 22 selects the speaker units 411 and 412 to play back based on the speaker's position so that the playback positions of the spoken voices in the speaker arrays 41 and 42 maintain the distance relationship between the speaker's position on the screen.
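• A sketch of this remapping, assuming u speaker-unit pairs spaced W_u apart over an array of width W_spk, with the array occupying -0.5 to 0.5 in the processing coordinates as described above; the helper names are illustrative.

    def pair_positions_in_processing(w_spk_m: float, w_u_m: float, u: int) -> list[float]:
        """Normalized positions of the u speaker-unit pairs when the array width
        itself, rather than the display width, defines the unit length."""
        spacing = w_u_m / w_spk_m                     # pair-to-pair width after remapping
        return [-0.5 + j * spacing for j in range(u)]

    def playback_pair_for(speaker_screen_x: float, pair_positions: list[float]) -> int:
        """Pick the pair whose remapped position is closest to the speaker's
        normalized on-screen position, preserving the left-to-right ordering."""
        return min(range(len(pair_positions)),
                   key=lambda j: abs(pair_positions[j] - speaker_screen_x))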
  • the remote communication device 1 remaps the audio playback position to match the width of the speaker arrays 41 and 42. This makes it possible to play back the voices of all speakers appearing on the screen. Therefore, even if the width of the speaker arrays 41 and 42 is significantly shorter than the screen of the display 6, it is possible to prevent the voices of speakers speaking at the edges of the screen from being cut off, and all speakers appearing on the screen can be played back from the speaker arrays 41 and 42.
• <Remote communication device> Next, a remote communication device 1 according to a seventh embodiment will be described.
• In the embodiments described above, the multiple opposing speakers 4 virtually localize the sound source at the vertical center of the image by reproducing the same monaural sound from both the top and the bottom. However, this may sound unnatural when reproducing the speech of a short speaker, such as a child, or of a tall speaker.
  • the remote communication device 1 plays back audio by changing the vertical position of the sound source according to the height of the sound source.
  • the remote communication device 1 according to this embodiment is represented by the block diagram in Figure 23. The following explanation will focus on the parts that differ from Figure 23, and explanations of parts that are the same as those in Figure 1 may be omitted. However, in this embodiment, the sound source direction estimation unit 16 does not need to estimate the position of each speaker in the normalized coordinate direction.
  • Figure 26 is a diagram for explaining audio playback by a remote communication device according to the seventh embodiment.
  • the human sensing unit 11 acquires the vertical position of the mouth of each speaker shown in the video. For example, the human sensing unit 11 can estimate the position of the mouth through video analysis. Then, the human sensing unit 11 stores the normalized vertical coordinate Y in FIG. 26 for each speaker in the person list. In this case, as shown in FIG. 26, the human sensing unit 11 sets the vertical normalized coordinate Y with the center of the screen as the origin and the vertical coordinate between -0.5 and 0.5.
  • the background sound separation unit 133 identifies the vertical position of the sudden sound.
  • the background sound separation unit 133 transmits and receives signals using a signal line connected to the microphone array 30 via the individual sound separation unit 131 and the audio interface 5.
  • the background sound separation unit 133 can estimate the vertical position of the sound source of the sudden sound from the vertical sound intensity by causing the microphone array 30 to scan a beam.
  • the background sound separation unit 133 then represents the position of the sudden sound in the background sound using normalized coordinates Y and transmits this information to the remote communication device 1 via the output integration unit 14 and the transmission unit 15.
  • the remote communication device 1 of the second embodiment may also be equipped with the sound source direction estimation unit 16 shown in FIG. 23.
  • the vertical position of the sound source estimated by the sound source direction estimation unit 16 using the microphone array 30 is sent to the person sensing unit 11 and registered in the person list.
• The audio output control unit 22 converts the audio signal to stereo, for example by using the speaker array 41 as the L channel and the speaker array 42 as the R channel. Then, the audio output control unit 22 pans the stereo audio signal between the L channel and the R channel in accordance with the vertical normalized coordinate Y registered in the person list, and changes the sound source position so that the speech sound is reproduced at the specified vertical position.
• For example, the audio output control unit 22 plays back the speech of speaker P1 in Fig. 26 using a sound source located below the vertical center. Furthermore, the audio output control unit 22 plays back the speech of speaker P2 using a sound source located above the vertical center.
  • the audio output control unit 22 may perform the following process. That is, the audio output control unit 22 also pans the stereo audio signal for the vertical position of the sudden sound, according to the normalized coordinate Y of the vertical position of the sudden sound notified by the individual sound separation unit 131. As a result, the audio output control unit 22 changes the sound source position so that the sudden sound is reproduced at the specified vertical position. In this way, the audio output control unit 22 stereophonizes the sound in a predetermined space that corresponds to the remote space, and adjusts the position of the sound source between the speaker array 41 and the speaker array 42.
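• A minimal sketch of the vertical panning, assuming the speech signal is mono, the two facing arrays act as the L and R channels, the normalized height Y runs from -0.5 (bottom) to 0.5 (top), and a simple linear pan law (the actual pan law is not specified in this description):

    import numpy as np

    def pan_vertical(mono: np.ndarray, normalized_y: float) -> tuple[np.ndarray, np.ndarray]:
        """Distribute a mono signal between the two facing speaker arrays so that
        the phantom source sits at the requested normalized height Y."""
        upper_gain = 0.5 + normalized_y        # Y = 0.5 -> only the upper array plays
        lower_gain = 1.0 - upper_gain          # Y = -0.5 -> only the lower array plays
        return mono * upper_gain, mono * lower_gain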
• In other words, whereas the sound source would otherwise be virtually localized at the vertical center of the screen, the remote communication device 1 can move the sound source position vertically to an appropriate position when speakers differ in height or when the emitted sound is far from the vertical center. This allows for a more natural conversation and facilitates smooth communication.
  • FIG. 27 is a hardware configuration diagram showing an example of a computer that realizes the arithmetic unit of the remote communication device that is the information processing device according to the first to seventh embodiments.
  • Computer 1000 has a CPU 1100, RAM 1200, ROM (Read Only Memory) 1300, HDD (Hard Disk Drive) 1400, communication interface 1500, and input/output interface 1600. Each part of computer 1000 is connected by bus 1050.
  • the CPU 1100 operates based on programs stored in the ROM 1300 or HDD 1400 and controls each component. For example, the CPU 1100 loads programs stored in the ROM 1300 or HDD 1400 into the RAM 1200 and executes processing corresponding to the various programs.
  • ROM 1300 stores boot programs such as the BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts up, as well as programs that depend on the computer 1000's hardware.
  • HDD 1400 is a computer-readable recording medium that non-temporarily records programs executed by CPU 1100 and data used by such programs.
  • HDD 1400 is a recording medium that records application programs related to the present disclosure, which are an example of program data 1450.
  • the communication interface 1500 is an interface that allows the computer 1000 to connect to an external network 1550 (e.g., the Internet).
  • the CPU 1100 receives data from other devices and transmits data generated by the CPU 1100 to other devices via the communication interface 1500.
  • the input/output interface 1600 is an interface for connecting the input/output device 1650 and the computer 1000.
  • the CPU 1100 receives data from input devices such as a keyboard or mouse via the input/output interface 1600.
  • the CPU 1100 also transmits data to output devices such as the display 6, audio interface 5, or printer via the input/output interface 1600.
  • the input/output interface 1600 may also function as a media interface that reads programs and the like recorded on a specified recording medium. Examples of media include optical recording media such as DVDs (Digital Versatile Discs) and PDs (Phase Change Rewritable Disks), magneto-optical recording media such as MOs (Magneto-Optical Disks), tape media, magnetic recording media, or semiconductor memory.
  • the CPU 1100 reads and executes the program data 1450 from the HDD 1400, but as another example, it may also obtain these programs from another device via the external network 1550.
  • This technology can also be configured as follows:
• An information processing device comprising: a human sensing unit that determines whether each speaker is participating in a conversation based on video of multiple speakers present in a predetermined space captured by a camera; a clarity adjustment unit that adjusts the clarity of the speech of the speakers determined by the human sensing unit to be participating in the conversation, the speech being included in the sound in the predetermined space collected by a microphone; and a transmission unit that transmits the sound in the predetermined space that has been processed by the clarity adjustment unit.
• The information processing device according to (1), wherein the human sensing unit determines the position of each speaker based on the video, the information processing device further comprising an individual sound separation unit that separates individual sounds, which are the speech sounds of each speaker, from the sound in the predetermined space based on the position of each speaker determined by the human sensing unit, and wherein the clarity adjustment unit adjusts the clarity of each of the individual sounds separated by the individual sound separation unit based on whether or not each speaker is participating in a conversation.
  • The information processing device described in (2), further including an audio output control unit that selects, from a speaker having two speaker arrays in which multiple speaker units are arranged in a row, the speaker unit that will play back the speech of each speaker included in the audio of the specified space received by the receiving unit, based on the position of each speaker, and causes the selected speaker unit to play back the speech of each speaker.
  • the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
  • the information processing device described in (3), wherein the audio output control unit selects the speaker unit to be used for playback based on the position of each speaker displayed on the screen, so that the spoken audio is played back near that speaker's position on the screen.
  • the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
  • the audio output control unit selects the speaker unit to be played back based on the position of the speaker so that the playback position of the spoken audio in the speaker array maintains the distance relationship between the positions of each speaker on the screen.
  • the clarity adjustment unit performs a clarity reduction adjustment that reduces the clarity of the speech of a speaker who is not participating in the conversation, included in the audio of the specified space, below the clarity of the sound as collected by the microphone.
  • a speech synthesis unit that performs speech recognition for each of the individual sounds to generate characters indicating the speech content of the individual sounds, and performs speech synthesis based on the characters indicating the speech content to reproduce the speech of the speaker;
  • the information processing device according to any one of (2) to (5), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds regenerated by the speech synthesis unit based on whether or not a speaker is participating in a conversation.
  • the human sensing unit determines the distance of the speaker based on the video and determines whether the speaker is participating in the conversation based on the distance.
  • a sound source direction estimation unit that determines the position of a speaker based on the sound in the predetermined space; an individual sound separation unit separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of the speaker determined by the sound source direction estimation unit;
  • the information processing device according to any one of (1) to (10), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not a speaker is participating in a conversation.
  • the audio output control unit stereophonically converts audio in the predetermined space and adjusts the position of a sound source among the speaker arrays.
  • An information processing method executed by the information processing device, comprising: a human sensing step of determining whether each speaker is participating in a conversation based on video captured by a camera that captures multiple speakers present in a predetermined space; a clarity adjustment step of adjusting the clarity of the speech of a speaker determined in the human sensing step to be participating in the conversation, the speech being included in the audio of the predetermined space collected by a microphone that collects audio of the predetermined space; and a transmitting step of transmitting the sound of the predetermined space processed in the clarity adjustment step.
  • An information processing system comprising a transmitting device and a receiving device, wherein the transmitting device includes: a human sensing unit that determines whether each speaker is participating in a conversation based on video, captured by a camera, of multiple speakers present in a predetermined space; a clarity adjustment unit that adjusts the clarity of the speech of speakers determined by the human sensing unit to be participating in the conversation, the speech being included in the sound of the predetermined space collected by a microphone; and a transmitting unit that transmits the sound of the predetermined space processed by the clarity adjustment unit, and wherein the receiving device includes: a receiving unit that receives the sound of the predetermined space transmitted by the transmitting unit; a speaker having two speaker arrays in which a plurality of speaker units are arranged in a row; and an audio output control unit that selects, based on the position of each speaker, the speaker unit to play back the speech of each speaker from the sound of the predetermined space received by the receiving unit, and causes the selected speaker unit to play back the speech of each speaker.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Otolaryngology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided are an information processing device, an information processing method, and an information processing system that make it possible to promote smoother dialogue. A human sensing unit determines, on the basis of video captured by a camera of a plurality of speakers present in a predetermined space, whether each speaker is participating in a conversation. A clarity adjustment unit adjusts the clarity of the speech of a speaker determined by the human sensing unit to be participating in the conversation, the speech being included in the sound of the predetermined space collected by a microphone. A transmission unit transmits the sound of the predetermined space processed by the clarity adjustment unit.

Description

Information processing device, information processing method, and information processing system

This disclosure relates to an information processing device, an information processing method, and an information processing system.

Remote communication systems are known that enable remote communication between multiple people in remote areas. In such remote communication systems, people in one remote area communicate with people in another remote area using the microphones, speakers, cameras, and the like of the remote communication system.

JP 2010-191544 A

A remote communication system uses a microphone placed in one remote area to pick up a speaker's voice, and plays the picked-up voice from a speaker placed in another remote area.

However, simply extracting and playing back human voices has the problem that even voices unrelated to the conversation are emphasized, making it difficult to concentrate on the conversation with the other party. Furthermore, if surrounding sounds are completely muted in order to emphasize the conversation, it can become difficult to grasp the situation in the other party's space, and the atmosphere becomes unnatural when there is no conversation.

This disclosure has been made in view of the above circumstances, and provides an information processing device, an information processing method, and an information processing system that can promote smooth dialogue.

The information processing device of this disclosure includes: a human sensing unit that determines whether each speaker is participating in a conversation based on video, captured by a camera, of multiple speakers present in a specified space; a clarity adjustment unit that adjusts the clarity of the speech of speakers determined by the human sensing unit to be participating in the conversation, the speech being included in the audio of the specified space picked up by a microphone; and a transmission unit that transmits the audio of the specified space processed by the clarity adjustment unit.

  • A block diagram of a remote communication device.
  • A bird's-eye view of an environment in which the remote communication device is placed.
  • A front view of an environment in which the remote communication device is placed.
  • A diagram for explaining processing by the human sensing unit.
  • A diagram illustrating the coordinate transformation from screen coordinates to normalized coordinates.
  • A diagram illustrating individual sound separation processing.
  • A diagram illustrating multiple opposed speaker balancing processing.
  • A diagram illustrating processing using a normal distribution function for outfield speech.
  • A diagram showing an outline of the data flow between the local remote communication devices according to the first embodiment.
  • A flowchart of transmission-side processing.
  • A flowchart of person position determination processing.
  • A flowchart of infield/outfield determination processing.
  • A flowchart of individual sound separation processing.
  • A flowchart of clarity adjustment processing for individual sounds.
  • A flowchart of multiple opposed speaker balancing processing.
  • A flowchart of background sound adjustment processing.
  • A flowchart of infield speech adjustment processing.
  • A flowchart of outfield speech adjustment processing.
  • A flowchart of localization center speaker unit derivation processing.
  • A block diagram of a remote communication device according to the second embodiment.
  • A diagram showing an outline of the data flow between the local remote communication devices according to the second embodiment.
  • A block diagram of a remote communication device according to the third embodiment.
  • A diagram showing an outline of the data flow between the local remote communication devices according to the third embodiment.
  • A block diagram of a remote communication device according to the fifth embodiment.
  • A diagram for explaining audio signal processing by the remote communication device according to the fifth embodiment.
  • A diagram for explaining audio reproduction by the remote communication device according to the sixth embodiment.
  • A diagram for explaining audio reproduction by the remote communication device according to the seventh embodiment.
  • A hardware configuration diagram showing an example of a computer that realizes the arithmetic unit of the remote communication device, which is the information processing device according to the first to seventh embodiments.

Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and duplicate explanations are omitted. The explanation is given in the following order.

1. Remote communication device according to the first embodiment
 1.1. Definition of terms
  1.1.1. Remote and local
  1.1.2. Individual sounds
  1.1.3. Infield and outfield
  1.1.4. Multiple opposed speakers
 1.2. Transmission unit
  1.2.1. Human sensing unit
  1.2.2. Audio separation unit
  1.2.3. Acoustic signal processing unit
  1.2.4. Output integration unit
  1.2.5. Transmitting unit
 1.3. Receiving unit
  1.3.1. Receiving section
  1.3.2. Audio output control unit
2. Data flow between remote communication devices according to the first embodiment
3. Remote communication processing
 3.1. Transmission-side processing
  3.1.1. Person position determination processing
  3.1.2. Infield/outfield determination processing
  3.1.3. Individual sound separation processing
  3.1.4. Clarity adjustment processing
 3.2. Multiple opposed speaker balancing processing (receiving-side processing)
  3.2.1. Background sound adjustment
  3.2.2. Infield speech adjustment
  3.2.3. Outfield speech adjustment
  3.2.4. Localization center speaker unit derivation
4. Effects
5. Remote communication device according to the second embodiment
 5.1. Acoustic signal processing unit
  5.1.1. Background sound separation unit
  5.1.2. Clarity adjustment unit
 5.2. Output integration unit
 5.3. Audio output control unit
6. Data flow between remote communication devices according to the second embodiment
7. Effects
8. Remote communication device according to the third embodiment
 8.1. Acoustic signal processing unit
  8.1.1. Speech synthesis unit
9. Data flow between remote communication devices according to the third embodiment
10. Effects
11. Remote communication device according to the fourth embodiment
 11.1. Human sensing unit
 11.2. Acoustic signal processing unit
 11.3. Other voice identification
 11.4. Effects
12. Remote communication device according to the fifth embodiment
 12.1. Human sensing unit
 12.2. Acoustic signal processing unit
 12.3. Effects
13. Remote communication device according to the sixth embodiment
 13.1. Audio output control unit
 13.2. Effects
14. Remote communication device according to the seventh embodiment
 14.1. Human sensing unit
 14.2. Acoustic signal processing unit
 14.3. Audio output control unit
 14.4. Effects
15. Hardware configuration

<1. Remote communication device according to the first embodiment>
 In conventional remote communication systems, when multiple people speak at the same time, the played-back voices become mixed together and are hard to hear, or the voice of the person one wants to listen to is suppressed without being asked, which can feel unnatural. In addition, conventional remote communication systems often cut out sounds other than speech.

Figure 1 is a block diagram of a remote communication device. The remote communication device 1 according to the embodiment processes background sound and speech in separate systems, and further divides the speech into two types depending on the person being spoken to. The remote communication device 1 then processes each of the divided sounds with different parameters, adjusting how easily the voices can be heard while maintaining a sense of connection with the remote location. Here, the sense of connection corresponds to a sense of realism that makes it feel as if one were having a face-to-face conversation with the conversation partner in the same space.

The remote communication device 1 has a transmission unit 10 and a receiving unit 20. The remote communication device 1 shown in FIG. 1 is a device that performs two-way communication with a partner remote communication device 1, and the partner remote communication device 1 also has a transmission unit 10 and a receiving unit 20.

However, if two-way transmission is not performed, the remote communication device 1 on the transmitting side only needs to have at least the transmission unit 10, and the remote communication device 1 on the receiving side only needs to have at least the receiving unit 20. In other words, the information processing system that realizes remote communication in this embodiment has a remote communication device 1 as the transmitting device and a remote communication device 1 as the receiving device.

The remote communication device 1 on the transmitting side and the remote communication device 1 on the receiving side are connected via a network 7.

Figure 2A is an overhead view of the environment in which the remote communication device is placed, and Figure 2B is a front view of the same environment. The dotted lines in Figures 2A and 2B indicate the wiring of signal lines.

The remote communication device 1 is connected to an audio interface 5. The audio interface 5 is connected to multiple microphones 3 and to the multiple opposed speakers 4. The audio interface 5 outputs the sound picked up by the multiple microphones 3 to the remote communication device 1, and outputs the sound output from the remote communication device 1 to the multiple opposed speakers 4.

The camera 2 captures video of the specific space in which it is placed and outputs the captured video to the remote communication device 1. The microphones 3 are omnidirectional microphones. The multiple microphones 3 are arranged in a predetermined layout to form a microphone array 30. The camera 2 and the microphone array 30 are preferably aligned at approximately the same position. In particular, they are preferably aligned along the horizontal line running from the position where the people line up in Figure 2A toward the camera 2 and the microphone array 30.

A display 6 is also connected to the remote communication device 1. The display 6 displays the video received by the remote communication device 1. As shown in FIG. 2B, a speaker array 41 and a speaker array 42 of the multiple opposed speakers 4 are arranged above and below the display 6, respectively.

In the following explanation, it is assumed that the people who converse using the remote communication device 1 line up in a row parallel to the display 6 at an equal distance from the display 6. The distance from the display 6 to the people is, for example, approximately 2.0 m. In the following explanation, the surface of the display 6 that actually displays the image is referred to as the "screen." The direction connecting the head and feet of a photographed person, which is the up-and-down direction of the video when it is projected onto the screen, is referred to as the "vertical direction." The left-and-right direction when the photographed person faces forward, which is the left-and-right direction of the video when it is projected onto the screen, is referred to as the "horizontal direction." In particular, facing the video projected on the screen, the left side is called "left" and the right side is called "right."

<1.1. Definition of terms>
 Next, the terms used in the embodiments are defined.

<1.1.1. Remote and local>
 In the embodiments, the remote communication device 1 connects two physically separated locations and performs remote communication in which people in the respective spaces communicate with each other. Of the two locations, the side where the transmitting remote communication device 1 is placed is referred to as the "remote" side, and the side where the receiving remote communication device 1 is placed is referred to as the "local" side. That is, the following description deals with the case where video and audio from the remote side are sent to the local side. A person on the remote side who is the conversation partner of the local side is called a speaker, and the conversation partner to whom the speaker wants to send a message is called a listener.

<1.1.2. Individual sounds>
 Individual sounds are the individual voices of the respective speakers. In many-to-many communication, in which many remote people converse with many local people, it is expected that multiple people will speak simultaneously at one of the locations (the remote side). In such a situation, simply recording with an omnidirectional microphone captures the voices of multiple people. The process of extracting each speaker's voice individually from the recorded voices of multiple people is therefore called "individual sound separation."

<1.1.3. Infield and outfield>
 The infield represents, with a local speaker as the reference, the standing position, as seen from the local speaker at that point in time, of a remote speaker who is paying attention to the local side, for example one who is conversing with the local speaker at that moment. For example, when a conversation is taking place between a local person and a remote person, that remote person is treated as the "infield" for the local person. An infield person can also be said to be a person who is participating in the conversation at that point in time.

The outfield represents, with a local speaker as the reference, the standing position, as seen from the local speaker at that point in time, of a person who is not paying attention to the local side at that moment, for example a remote person who is conversing with another remote person. When the listener for a remote speaker is also on the remote side, that is, when the conversation is between remote speakers, that remote listener is treated as the "outfield" for the local speaker. An outfield person can also be said to be a person who is not participating in the conversation at that point in time. Note that a remote speaker may become infield or outfield as that person's behavior changes over time.

<1.1.4. Multiple opposed speakers>
 The multiple opposed speakers 4 are a speaker system having, as a pair, a speaker array 41 consisting of a plurality of speaker units 411 and a speaker array 42 consisting of a plurality of speaker units 412. In this embodiment, the speaker array 41 and the speaker array 42 are arranged vertically with the screen of the display 6 between them. For example, the speaker array 41 is installed above the speaker array 42 with respect to the video projected on the screen. The following describes an example in which the speaker array 41 is installed above the speaker array 42.

The speaker array 41 and the speaker array 42 have the same number of speaker units 411 and 412. In the speaker array 41, the multiple speaker units 411 are lined up horizontally, and in the speaker array 42, the multiple speaker units 412 are likewise lined up horizontally. Each speaker unit 411 and each speaker unit 412 are arranged at vertically corresponding positions. Hereinafter, a pair of a speaker unit 411 and a speaker unit 412 arranged at vertically corresponding positions is referred to as the speaker units 411 and 412 in the same column. By playing the same sound from the paired speaker units 411 and 412, a sound source is virtually localized at the center between the speaker array 41 and the speaker array 42.
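As a rough illustration of how this same-column pairing could be used, the following sketch is one possible way to pick the column of speaker units closest to a talker's normalized horizontal coordinate and feed the same signal to the upper and lower units of that column, so that the sound localizes vertically between the two arrays. The function name, the mapping from normalized coordinate to column index, and the data layout are assumptions for illustration, not taken from the disclosure.

    import numpy as np

    def route_to_opposed_arrays(signal, x_norm, num_columns):
        """Return per-unit signals for the upper and lower speaker arrays.

        x_norm: talker position in normalized screen coordinates (-0.5 .. 0.5).
        The same signal is sent to the upper and lower unit of one column,
        so the source localizes vertically midway between the two arrays.
        """
        # Map -0.5 .. 0.5 onto a column index 0 .. num_columns - 1.
        col = int(round((x_norm + 0.5) * (num_columns - 1)))
        col = min(max(col, 0), num_columns - 1)

        upper = np.zeros((num_columns, len(signal)))
        lower = np.zeros((num_columns, len(signal)))
        upper[col] = signal          # speaker unit 411 side
        lower[col] = signal          # speaker unit 412 side
        return upper, lower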

The multiple opposed speakers 4 have the following advantages over a system in which speakers are simply lined up. By playing back the separated speakers' voices on individual channels, the multiple opposed speakers 4 make each individual voice easier to hear than when the voices of several people are mixed into one channel and played back. In addition, by localizing the sound source at the vertical center of the screen of the display 6, the multiple opposed speakers 4 can make a voice sound as if it were coming from roughly the position of the person's face.

In the first embodiment, the output direction of the speaker array 41 and the output direction of the speaker array 42 are opposed to each other to achieve point localization and to strengthen the effect of enhancing the sense of realism, but surface localization with general array speakers that are not opposed may also be used.

<1.2. Transmission unit>
 The transmission unit 10 is a function used in the remote communication device 1 on the transmitting side. The following explanation deals with the case where sound is sent from the remote side to the local side, that is, the remote side is the transmitting side and the local side is the receiving side. As shown in FIG. 1, the transmission unit 10 has a human sensing unit 11, an audio separation unit 12, an acoustic signal processing unit 13, an output integration unit 14, and a transmitting unit 15.

<1.2.1. Human sensing unit>
 The human sensing unit 11 receives input of video captured by the camera 2 of the specific space on the remote side in which the camera 2 is placed. The human sensing unit 11 then executes human sensing processing, which includes the person position determination processing and the infield/outfield determination processing described below. Hereinafter, the space on the remote side captured by the camera 2 is referred to as the "remote space."

The human sensing unit 11 detects, from the video of the camera 2, the position that serves as the horizontal origin in the remote space. Figure 3 is a diagram for explaining the processing of the human sensing unit. Here, as shown in the upper image of Figure 3, an example is described in which a video 60 showing three speakers 61 to 63 is displayed on the display 6. The human sensing unit 11 assumes, for example, that the origin is at the horizontal center 600 of the screen when the video of the camera 2 in the remote space is projected onto the screen of the display 6. Next, the human sensing unit 11 sets normalized coordinates that indicate the horizontal position on the screen of the display 6 relative to the origin. For example, the human sensing unit 11 sets the distance from a horizontal edge of the screen of the display 6 to the center 600 to a distance of 0.5 in normalized coordinates. In Figure 3, the human sensing unit 11 uses the x coordinate as the normalized coordinate, and sets the normalized coordinate of the right edge of the screen of the display 6 to 0.5 and that of the left edge to -0.5. That is, in this embodiment, the human sensing unit 11 sets the normalized coordinates with the screen width taken as 1.

Next, the human sensing unit 11 executes the following person position determination processing. The human sensing unit 11 performs skeleton recognition on each speaker appearing in the video of the camera 2, and acquires the normalized coordinate of the neck of each speaker appearing in the video. In this way, the human sensing unit 11 can acquire normalized coordinates indicating the position of each speaker and can determine the horizontal positional relationships between the speakers. The human sensing unit 11 can perform the skeleton recognition using general tools that have already been commercially released. The human sensing unit 11 then generates a person list consisting of as many elements as the number of speakers for which normalized coordinates were acquired. In this embodiment, the human sensing unit 11 arranges the elements of the person list in ascending order of normalized coordinate value.
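A minimal sketch of this person list construction is shown below. The skeleton-recognition call is left out (it would be whichever commercial tool is used), and the data layout, a dictionary per person with a normalized x coordinate and an infield flag, is an assumption for illustration.

    def build_person_list(neck_positions_px, screen_width_px):
        """Build the person list from detected neck positions (pixels), sorted left to right.

        neck_positions_px: horizontal pixel coordinate of each detected person's neck,
        e.g. obtained from an off-the-shelf skeleton-recognition tool.
        """
        person_list = []
        for u in neck_positions_px:
            x = u / screen_width_px - 0.5                    # normalized coordinate, -0.5 .. 0.5
            person_list.append({"x": x, "infield": False})   # infield flag defaults to False
        # Elements are ordered by ascending normalized coordinate (left to right on screen).
        person_list.sort(key=lambda p: p["x"])
        return person_list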

The lower images of Figure 3 show the details of the person position determination processing and the infield/outfield determination processing. For example, when the video 60 is acquired, the human sensing unit 11 performs skeleton recognition to acquire the normalized coordinates of the neck 601 of the speaker 61, the neck 602 of the speaker 62, and the neck 603 of the speaker 63. From the acquired normalized coordinates, the human sensing unit 11 confirms that the speaker 61, the speaker 62, and the speaker 63 are lined up in this order from the left of the video 60. The human sensing unit 11 then sets elements in the person list in the order of the speaker 61, the speaker 62, and the speaker 63, in accordance with the video 60.

Here, the human sensing unit 11 performs skeleton detection from the video, but the coordinates obtained from the video are usually expressed not in a normalized coordinate system but in screen coordinates based on the number of pixels of the video. Therefore, in the person position determination processing, the human sensing unit 11 actually acquires the screen coordinates of each speaker's position from the video and then converts those screen coordinates into normalized coordinates. Figure 4 is a diagram showing the coordinate transformation from screen coordinates to normalized coordinates. Here, the axes of the screen coordinates 621 are the u-axis and the v-axis, and the axes of the normalized coordinates 622 are the x-axis and the y-axis.

The human sensing unit 11 determines the position of each speaker from the video projected on the screen 620 using the screen coordinates 621 having the u-axis and the v-axis, and then converts the determined position of each speaker into the normalized coordinates 622 having the x-axis and the y-axis. When the screen width of the screen 620 is W_Scr and the screen height is H_Scr, the ranges that the respective coordinate systems can take are 0 ≤ u ≤ W_Scr and 0 ≤ v ≤ H_Scr, and -0.5 ≤ x ≤ 0.5 and -0.5 ≤ y ≤ 0.5. Here, the vertical normalized coordinate system is expressed in the range from -0.5 to 0.5, with the vertical center of the video taken as 0. In this case, the human sensing unit 11 performs the coordinate conversion using equation (1). By converting the coordinates in this way, the processing can be performed independently of the camera's angle of view and aspect ratio.
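Equation (1) itself is not reproduced in this text, but given the stated ranges of the two coordinate systems, the conversion is presumably a linear rescaling of the pixel coordinates. The sketch below assumes that form; the sign convention chosen for the vertical axis is also an assumption.

    def screen_to_normalized(u, v, w_scr, h_scr):
        """Convert screen coordinates (u, v) in pixels to normalized coordinates (x, y).

        Assumed form of equation (1): a linear rescaling so that
        0 <= u <= w_scr maps to -0.5 <= x <= 0.5 and
        0 <= v <= h_scr maps to -0.5 <= y <= 0.5.
        """
        x = u / w_scr - 0.5
        y = 0.5 - v / h_scr   # screen v usually grows downward; flip so +y points up (assumption)
        return x, y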

Next, the human sensing unit 11 executes the following infield/outfield determination processing. The human sensing unit 11 estimates the head direction of each person in the video, that is, the direction in which that speaker is facing. For example, when normalized coordinates have been obtained for p people in the person position determination processing, the human sensing unit 11 detects the face of the speaker located at each of the p normalized coordinates and estimates the head direction by image analysis or the like. More specifically, the human sensing unit 11 estimates the head direction by taking the direction of the positive-direction vector of the normalized coordinates as 0 degrees and estimating the angle Φ of the head direction vector from that 0-degree vector.

Here, the human sensing unit 11 has, in advance, an infield range for determining whether each speaker is infield or outfield. For example, with the state of facing the camera 2 defined as 90 degrees, the human sensing unit 11 has, as the infield range, a range extending a predetermined angle on either side of 90 degrees. The human sensing unit 11 determines a speaker whose head direction falls within the infield range to be infield, and determines a speaker whose head direction does not fall within the infield range to be outfield. Next, the human sensing unit 11 provides an infield flag item for each element of the person list and sets False as its initial value. The human sensing unit 11 then sets the infield flag to True for speakers determined to be infield, and sets the infield flag to False for speakers determined to be outfield. In this way, for the infield flag of each speaker in the person list, the human sensing unit 11 sets True if the speaker is infield, False if the speaker is outfield, and False if head direction detection failed for that speaker.

For example, as shown in the lower image of Figure 3, the human sensing unit 11 estimates head directions 611 to 613 for the speakers 61 to 63. Here, with the infield range ψ set to 45 degrees on either side of the front direction 610, which corresponds to 90 degrees, the human sensing unit 11 performs the infield/outfield determination processing. In this case, because the head direction 611 falls within the infield range, the human sensing unit 11 determines that the speaker 61 is infield. Because the head directions 612 and 613 do not fall within the infield range, the human sensing unit 11 determines that the speakers 62 and 63 are outfield. The human sensing unit 11 then sets the infield flag of the speaker 61 in the person list to True and the infield flags of the speakers 62 and 63 to False.
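The infield/outfield decision thus reduces to a range test on the estimated head-direction angle. The sketch below assumes the angle Φ is given in degrees, with the camera-facing direction at 90 degrees and the infield range ψ at 45 degrees as in the example of FIG. 3; it is only an illustrative reading of that example.

    def update_infield_flags(person_list, head_angles_deg, front_deg=90.0, psi_deg=45.0):
        """Set the infield flag of each person from the estimated head direction.

        head_angles_deg: head-direction angle per person, measured from the positive
        direction of the normalized x axis (0 degrees), so facing the camera is 90 degrees.
        A person is infield when the angle lies within +/- psi_deg of the front direction.
        """
        for person, phi in zip(person_list, head_angles_deg):
            person["infield"] = abs(phi - front_deg) <= psi_deg
        return person_list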

Returning to Figure 1, the explanation continues. The human sensing unit 11 then outputs the created person list to the acoustic signal processing unit 13 and the output integration unit 14.

Here, the remote space corresponds to an example of the "predetermined space," and the determination of whether a speaker is infield or outfield corresponds to an example of "determining whether a speaker is participating in a conversation." That is, the human sensing unit 11 determines whether each speaker is participating in a conversation based on the video captured by the camera 2 that captures the multiple speakers present in the predetermined space. The human sensing unit 11 also determines the position of each speaker based on the video.

<1.2.2. Audio separation unit>
 The audio separation unit 12 receives input of the sound picked up by the multiple microphones 3. The input to the microphone array 30 includes, in addition to speech produced by utterances, background sounds generated in the environment, such as background noise, BGM (background music), and other noises. The audio separation unit 12 therefore performs audio separation processing that separates the sound input from the microphone array 30 into speech and background sound, so that the speech can be processed later to adjust its audibility. The audio separation processing can also be regarded as speech extraction processing. For example, the audio separation unit 12 performs the audio separation using vocal extraction or a technique for extracting only the singing voice from music that contains singing.

In this way, the audio separation unit 12 separates background sound from the sound of the predetermined space corresponding to the remote space. Separating the speech from the background sound makes it easier to adjust processing such as clarity adjustment separately for the speech and the background sound.

In this embodiment, omnidirectional microphones are used as the microphones 3, but alternatively, each speaker may wear a pin microphone as the microphone 3 and an omnidirectional microphone may be used to pick up the background sound. In that case, the audio separation unit 12 may treat the sound picked up by the pin microphones as the speech of the respective speakers and the sound picked up by the omnidirectional microphone as the background sound.

<1.2.3. Acoustic signal processing unit>
 The acoustic signal processing unit 13 has an individual sound separation unit 131 and a clarity adjustment unit 132.

The individual sound separation unit 131 receives input of the speech extracted by the audio separation processing of the audio separation unit 12. The individual sound separation unit 131 also acquires the person list created by the human sensing unit 11. The individual sound separation unit 131 then executes the following individual sound separation processing to separate the speech of each person appearing in the video using the person list.

Figure 5 is a diagram showing the individual sound separation processing. Here, the following conditions are assumed to hold. The camera 2 and the microphone array 30 are arranged in series along the shooting direction of the camera 2. That is, the camera 2 and the microphone array 30 are arranged along the normal direction of the two-dimensional video captured by the camera 2, and when coordinates formed by the vertical, horizontal, and depth directions of the video are considered, the vertical and horizontal coordinates of the camera 2 and the microphone array 30 coincide. The speakers in the remote space line up in a horizontal row at positions that share the same vertical coordinate in the two-dimensional video captured by the camera 2. Here, the plane spanned by the vertical and horizontal directions at the positions where the speakers in the remote space line up is called the "virtual screen."

The normalized coordinates shown in the range of -0.5 to 0.5 in Figure 5 coincide with the horizontal coordinates of the virtual screen. The distance L from the virtual screen to the camera 2 and the microphones 3 is known. Furthermore, x_ph in Figure 5 is the physical distance from the origin of the coordinates on the virtual screen. x_ph can be calculated from the angle of view of the camera 2 and the value of the distance L.

In this case, the individual sound separation unit 131 creates as many threads as there are elements registered in the acquired person list and sets, as an attribute of each thread, an ID (identifier) for identifying each speaker. For example, the individual sound separation unit 131 assigns thread IDs 1, 2, 3, ... in the order in which the elements are arranged from the left of the person list. Furthermore, the individual sound separation unit 131 has each thread hold the normalized coordinate value registered in the person list as a thread-specific value. The individual sound separation unit 131 then treats the normalized coordinates obtained in the person position determination processing as coordinates on the virtual screen and calculates x_ph, the physical distance of each speaker from the origin. Using the calculated physical distance x_ph and the distance L from the virtual screen to the camera 2, the individual sound separation unit 131 can obtain the angle θ toward a specific person by equation (2).

For example, for the speaker 101 in Figure 5, the individual sound separation unit 131 acquires x1 as the normalized coordinate. The individual sound separation unit 131 then calculates the physical distance from the origin to the speaker 101 from the normalized coordinate x1, and can calculate the angle θ1 of the speaker 101 from the origin, centered on the camera 2 and the microphones 3, by substituting the calculated physical distance and the distance L into equation (2). By performing this angle calculation for every thread, the individual sound separation unit 131 obtains the angle of each speaker in the corresponding thread.
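Equation (2) is likewise not reproduced here, but from the geometry described (the camera 2 and the microphone array 30 at the origin, the speaker on the virtual screen at horizontal offset x_ph, and the virtual screen at distance L), the angle is presumably obtained from the arctangent of that ratio. The sketch below assumes that form and measures θ from the camera's optical axis; both are assumptions for illustration.

    import math

    def speaker_angle(x_norm, screen_width_m, distance_L_m):
        """Angle theta toward a speaker, measured from the camera/microphone axis.

        x_norm: normalized horizontal coordinate of the speaker (-0.5 .. 0.5).
        screen_width_m: physical width of the virtual screen, in meters.
        distance_L_m: distance from the virtual screen to the camera and microphone array.
        Assumed form of equation (2): theta = atan(x_ph / L).
        """
        x_ph = x_norm * screen_width_m            # physical offset from the origin
        return math.degrees(math.atan2(x_ph, distance_L_m))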

Next, the individual sound separation unit 131 performs individual sound separation processing on the speech extracted by the audio separation of the audio separation unit 12, using beamforming, for the speech of each speaker. The speech extracted by the audio separation of the audio separation unit 12 is sound obtained from the microphone array 30, so sensitivity in the direction of each speaker is ensured. In the case of the speaker 101, for example, the individual sound separation unit 131 sets the angle θ1 toward the speaker 101 as the beam direction and sets the beam width to a range 102 of 15 degrees centered on the angle θ1. The individual sound separation unit 131 then specifies the beam direction and the beam width and suppresses, in the acquired speech, sound coming from directions outside the specified angular range, thereby ensuring sensitivity in the target direction and acquiring the individual sound of the speaker 101.
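As one possible realization of this beamforming step, the following delay-and-sum sketch steers a linear microphone array toward the angle computed for one speaker. The array geometry, the sampling rate, and the use of delay-and-sum itself are assumptions for illustration; the disclosure only specifies a beam direction and a 15-degree beam width.

    import numpy as np

    def delay_and_sum(mic_signals, mic_offsets_m, theta_deg, fs, c=343.0):
        """Steer a linear microphone array toward theta_deg (0 = broadside) by delay-and-sum.

        mic_signals: array of shape (num_mics, num_samples).
        mic_offsets_m: microphone positions along the array axis, in meters.
        Sound arriving from the steered direction adds coherently; sound from other
        directions is attenuated, which approximates suppression outside the beam.
        """
        mic_offsets_m = np.asarray(mic_offsets_m, dtype=float)
        delays = mic_offsets_m * np.sin(np.radians(theta_deg)) / c   # seconds per mic
        delays -= delays.min()
        num_samples = mic_signals.shape[1]
        out = np.zeros(num_samples + int(np.ceil(delays.max() * fs)) + 1)
        for sig, d in zip(mic_signals, delays):
            n = int(round(d * fs))
            out[n:n + num_samples] += sig
        return out[:num_samples] / len(mic_signals)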

In this way, the individual sound separation unit 131 separates individual sounds, which are the speech of the respective speakers, from the sound of the predetermined space corresponding to the remote space, based on the positions of the respective speakers determined by the human sensing unit 11.

In this respect, it has conventionally been difficult to acquire the individual sound of a speaker without a worn microphone while associating it with the person's position in the video. In contrast, the individual sound separation unit 131 according to this embodiment uses the person list, in which the positions of the people in the video are registered in their on-screen order, and performs the beamforming processing while preserving that order, so that pairs of each speaker's position in the video and the corresponding separated individual sound can easily be created. This makes it easy to match the video and the sound output position when the speech is played back locally after transmission.

Returning to Figure 1, the explanation continues. The clarity adjustment unit 132 receives input of the individual sounds of the people present in the remote space generated by the individual sound separation unit 131. The clarity adjustment unit 132 also receives input of the person list from the individual sound separation unit 131. Furthermore, the clarity adjustment unit 132 receives input of the background sound extracted by the audio separation of the audio separation unit 12.

For each individual sound, the clarity adjustment unit 132 determines whether the corresponding speaker is infield or outfield using the infield flag of the person list. The clarity adjustment unit 132 then performs clarity adjustment processing on the individual sounds and the background sound according to the determination result. Specifically, the clarity adjustment unit 132 performs enhance processing, which makes the sound easier to hear, on the individual sounds of infield people, and performs degrade processing, which makes the sound harder to hear, on the individual sounds of outfield people and on the background sound. The clarity adjustment unit 132 then outputs the individual sounds and background sound that have undergone the clarity adjustment processing to the output integration unit 14.

The clarity adjustment unit 132 can use formant weighting, equalizing filters, or reverberation filters for the clarity adjustment processing. In the case of formant weighting, the clarity adjustment unit 132 reinforces the second and higher formants as the enhance processing and suppresses the second and higher formants as the degrade processing. As processing using equalizing filters, the clarity adjustment unit 132 performs degrade processing of the background sound with a low-pass filter and degrade processing by adding reverberation.
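A very simplified sketch of the enhance/degrade idea is given below, using a first-order low-pass filter as the degrade processing and a complementary high-band boost as the enhance processing. The cutoff frequency and gain are illustrative assumptions, and the formant weighting, equalization, and reverberation described above are not implemented here.

    import numpy as np

    def _one_pole_lowpass(signal, fs, cutoff_hz):
        """First-order IIR low-pass filter."""
        alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)
        out = np.zeros(len(signal))
        acc = 0.0
        for i, s in enumerate(signal):
            acc += alpha * (s - acc)
            out[i] = acc
        return out

    def degrade(signal, fs, cutoff_hz=1500.0):
        """Make a sound less intelligible by removing high-frequency content."""
        return _one_pole_lowpass(signal, fs, cutoff_hz)

    def enhance(signal, fs, cutoff_hz=1500.0, gain=1.5):
        """Make speech easier to hear by boosting the band above the cutoff."""
        low = _one_pole_lowpass(signal, fs, cutoff_hz)
        return low + gain * (np.asarray(signal) - low)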

In this way, the clarity adjustment unit 132 adjusts the clarity of the speech of speakers determined by the human sensing unit 11 to be participating in the conversation, the speech being included in the sound of the predetermined space, corresponding to the remote space, picked up by the microphones 3. The clarity adjustment unit 132 also performs a clarity reduction adjustment on the speech of speakers not participating in the conversation included in the sound of the predetermined space, reducing its clarity below the state in which it was picked up by the microphones 3. Furthermore, the clarity adjustment unit 132 performs a clarity reduction adjustment on the background sound separated by the audio separation unit 12, reducing its clarity below that of the other sounds.

In this respect, it has conventionally been difficult to handle background sound appropriately and to adjust clarity according to whether or not a speaker is participating in the conversation. In contrast, the clarity adjustment unit 132 can change the ease of listening by determining whether the sound to be transmitted is background sound or speech and, for speech, whether the speaker is infield or outfield, that is, whether the speaker is participating in the conversation, thereby promoting smoother dialogue.

<1.2.4. Output integration unit>
 The output integration unit 14 receives, from the clarity adjustment unit 132, input of the individual sounds and background sound that have undergone the clarity adjustment processing. The output integration unit 14 also receives input of the person list from the human sensing unit 11. The output integration unit 14 then executes the processing described below.

The audio processing of the background sound and of the speech is performed one channel at a time, with one channel corresponding to each sound. That is, the output integration unit 14 receives input of as many channels of data as the number of speakers plus one channel for the background sound. In other words, if the background sound is one channel of data and there are p speakers, the output integration unit 14 obtains (1 + p) channels of data.

 そして、出力統合部14は、ローカルに音声を伝送するために、すべてのチャンネルの音を統合して1つの音データにする。例えば、出力統合部14は、フレーム毎に各チャンネルの要素を含ませるチャンネルインターリーブを用いて1つの音データを生成することができる。その後、出力統合部14は、生成した1つの音データを送信部15へ出力する。また、出力統合部14は、人別リストの各要素及び背景音のそれぞれと音データに含まれる各チャンネルとを対応付ける情報とともに、人別リストを送信部15へ出力する。ここで、出力統合部14は、音データに人別リストのデータを付加して1つのデータとしてもよい。 Then, in order to transmit the audio locally, the output integration unit 14 integrates the sounds of all channels into a single piece of audio data. For example, the output integration unit 14 can generate a single piece of audio data using channel interleaving, which includes elements of each channel for each frame. The output integration unit 14 then outputs the generated single piece of audio data to the transmission unit 15. The output integration unit 14 also outputs the person list to the transmission unit 15, along with information that associates each element of the person list and background sound with each channel included in the audio data. Here, the output integration unit 14 may add the data of the person list to the audio data to create a single piece of data.
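The following is a minimal Python sketch of integrating the (1+p) per-source channels into one frame-interleaved buffer, as described above. The data layout and function name are illustrative assumptions; the person-list metadata that maps each channel to a speaker is transmitted alongside this buffer.

```python
# Illustrative sketch of output integration by channel interleaving (layout assumed).
import numpy as np

def integrate_channels(background: np.ndarray, individual: list[np.ndarray]) -> np.ndarray:
    """Interleave one background channel and p individual-sound channels frame by frame.

    background:  shape (n_samples,)
    individual:  list of p arrays, each shape (n_samples,)
    returns:     1-D stream in which each sample frame holds all (1 + p) channels
    """
    channels = [background] + list(individual)
    n = min(len(c) for c in channels)
    stacked = np.stack([c[:n] for c in channels], axis=1)  # shape (n, 1 + p)
    return stacked.reshape(-1)
```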

 <1.2.5.送信部>
 送信部15は、各発話者の個別音及び背景音を含む1つの音データの入力を出力統合部14から受ける。また、送信部15は、人別リストの入力を出力統合部14から受ける。また、送信部15は、撮像された映像の入力をカメラ2から受ける。そして、送信部15は、各発話者の個別音及び背景音を含む1つの音データ、人別リスト、並びに、カメラ2の映像を、ネットワーク7を介してローカル側の遠隔コミュニケーション装置1の受信ユニット20へ送信する。ネットワーク7は、例えば、インターネット、LAN(Local Area Network)等である。このように、送信部15は、明瞭度調整部132により処理されたリモート空間にあたる所定空間の音声を送信する。
<1.2.5. Transmitter>
The transmitter 15 receives an input of a single piece of sound data including the individual sounds of each speaker and background sound from the output integration unit 14. The transmitter 15 also receives an input of a person list from the output integration unit 14. The transmitter 15 also receives an input of captured video from the camera 2. The transmitter 15 then transmits the single piece of sound data including the individual sounds of each speaker and background sound, the person list, and the video from the camera 2 to the receiving unit 20 of the remote communication device 1 on the local side via the network 7. The network 7 is, for example, the Internet or a LAN (Local Area Network). In this way, the transmitter 15 transmits the audio of a predetermined space corresponding to the remote space processed by the clarity adjustment unit 132.

 <1.3.受信ユニット>
 受信ユニット20は、本実施の形態ではローカル側である受信側の遠隔コミュニケーション装置1で用いられる機能である。受信ユニット20は、図1に示すように、受信部21及び音声出力制御部22を有する。
<1.3. Receiving unit>
The receiving unit 20 is a function used in the receiving-side remote communication device 1, which is the local side in this embodiment. The receiving unit 20 includes a receiving section 21 and an audio output control section 22, as shown in FIG. 1.

 <1.3.1.受信部>
 受信部21は、リモート側の遠隔コミュニケーション装置1の送信部15から送信されたリモート空間の各発話者の個別音及び背景音を含む1つの音データ、人別リスト、並びに、映像を受信する。受信部21は、受信した映像をディスプレイ6に出力する。ディスプレイ6は、受信部21から出力された映像をスクリーンに表示する。また、受信部21は、音データ及び人別リストを音声出力制御部22へ出力する。
<1.3.1. Receiving section>
The receiving unit 21 receives one piece of sound data including individual sounds of each speaker in the remote space and background sounds, a person list, and video transmitted from the transmitting unit 15 of the remote communication device 1 on the remote side. The receiving unit 21 outputs the received video to the display 6. The display 6 displays the video output from the receiving unit 21 on a screen. The receiving unit 21 also outputs the sound data and the person list to the audio output control unit 22.

 このように、受信部21は、送信部15により送信されたリモート空間にあたる所定空間の音声を受信する。また、受信部21は、リモート空間にあたる所定空間の音声とともに映像を受信して、映像を写した際の映像の上下方向の両端部に1つずつスピーカーアレイ41及び42が配置されたスクリーンに映像を映す。 In this way, the receiving unit 21 receives the audio of the predetermined space corresponding to the remote space transmitted by the transmitting unit 15. The receiving unit 21 also receives video together with the audio of the predetermined space corresponding to the remote space, and projects the video onto a screen in which the speaker arrays 41 and 42 are arranged one each at the upper and lower ends of the displayed video.

 <1.3.2.音声出力制御部>
 音声出力制御部22は、以下に説明する複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット411及び412を指定して再生させる。音声出力制御部22は、各発話者の個別音及び背景音を含む音データ、並びに、人別リストの入力を受信部21から受ける。以下では、内野の人の発話音声を「内野発話音声」と呼び、外野の人の発話音声を「外野発話音声」と呼ぶ。
<1.3.2. Audio output control unit>
The audio output control unit 22 performs the multiple opposing speaker balancing process described below, specifying the speaker units 411 and 412 that are to output the background sound, the infield speech sound, and the outfield speech sound, respectively, and causing them to play those sounds back. The audio output control unit 22 receives the sound data including the individual sound of each speaker and the background sound, as well as the person list, from the receiving unit 21. Hereinafter, the speech of a person in the infield is referred to as "infield speech sound," and the speech of a person in the outfield is referred to as "outfield speech sound."

 図6は、複数対向スピーカーバランシング処理を示す図である。図6の紙面に向って上側の画像はスピーカーアレイ41及び42における各部の長さを示し、紙面に向って下側のグラフ200は処理後の背景音、内野発話音声及び外野発話音声のスピーカーアレイ41及び42からの出音を示す。グラフ200は、横軸でスピーカーアレイ41及び42に対応する位置を示し、縦軸で音量を示す。グラフ200の縦軸は、内野発話音声のピークの音量を1として規格化した値を示す。 Figure 6 is a diagram showing the multiple opposing speaker balancing process. The image at the top of Figure 6 shows the length of each part of speaker arrays 41 and 42, and graph 200 at the bottom shows the post-processing output sounds from speaker arrays 41 and 42 for background sound, infield speech sounds, and outfield speech sounds. Graph 200 shows the positions corresponding to speaker arrays 41 and 42 on the horizontal axis, and the volume on the vertical axis. The vertical axis of graph 200 shows a value normalized with the peak volume of infield speech sounds set to 1.

 音声出力制御部22は、物理的なスピーカーアレイ幅Wspk及びディスプレイ幅Wdispの情報を予め有する。また、音声出力制御部22は、スピーカーアレイ41の中のスピーカーユニット411の数uの情報も予め有する。スピーカーアレイ42の中のスピーカーユニット412の数もuである。ここで、Wspkは左端のスピーカーユニット411と右端のスピーカーユニット411との間の距離を表す。これは、スピーカーユニット412についても同様である。また、スピーカーユニット411同士及びスピーカーユニット412同士の間のスピーカー間距離Wuはすべて等しく、ディスプレイ6とスピーカーアレイ41とスピーカーアレイ42とは左右中央揃えされている。 The audio output control unit 22 has, in advance, information on the physical speaker array width Wspk and the display width Wdisp. The audio output control unit 22 also has, in advance, information on the number u of speaker units 411 in the speaker array 41. The number of speaker units 412 in the speaker array 42 is also u. Here, Wspk represents the distance between the leftmost speaker unit 411 and the rightmost speaker unit 411; the same applies to the speaker units 412. Furthermore, the inter-speaker distances Wu between the speaker units 411 and between the speaker units 412 are all equal, and the display 6, the speaker array 41, and the speaker array 42 are horizontally center-aligned.

 背景音、内野発話音声及び外野発話音声に対する出力調整処理に共通する前段の処理として、音声出力制御部22は、既知のWspk及びuを用いて、Wu=Wspk/uとしてスピーカーユニット411間のスピーカー間距離Wuを求める。ここで、Wuは物理的な実際の距離であるため、音声出力制御部22は、映像上の発話者の位置と音声とを一致させるため、処理上の距離をWu_n=Wu/Wdispとして正規化座標の大きさに揃える。また、音声出力制御部22は、スピーカーアレイ幅Wspkも同様にWspk_nとして正規化座標に揃える。また、音声出力制御部22は、スピーカーユニット411及び412の正規化座標xuを算出する。 As a preliminary process common to the output adjustment processes for the background sound, the infield speech sound, and the outfield speech sound, the audio output control unit 22 calculates the inter-speaker distance Wu between the speaker units 411 as Wu = Wspk/u using the known Wspk and u. Here, since Wu is an actual physical distance, the audio output control unit 22 normalizes the distance used in the processing as Wu_n = Wu/Wdisp so that the speaker positions in the video and the audio coincide. The audio output control unit 22 likewise normalizes the speaker array width Wspk as Wspk_n. The audio output control unit 22 also calculates the normalized coordinates xu of the speaker units 411 and 412.
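The following is a minimal Python sketch of this common preprocessing. The offset of the leftmost unit (derived from center alignment of the arrays and the display) and the variable names are illustrative assumptions.

```python
# Illustrative sketch of deriving Wu, Wu_n, and the normalized unit coordinates xu.
def speaker_unit_coordinates(w_spk: float, w_disp: float, u: int):
    """Return (list of normalized x-coordinates of the u speaker units, Wu_n).

    w_spk:  physical distance between leftmost and rightmost units [m]
    w_disp: physical display width [m]
    u:      number of speaker units in one array
    """
    w_u = w_spk / u                       # physical inter-speaker distance Wu
    w_u_n = w_u / w_disp                  # normalized inter-speaker distance Wu_n
    offset = (1.0 - w_spk / w_disp) / 2   # leftmost unit, assuming center alignment
    return [offset + i * w_u_n for i in range(u)], w_u_n
```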

 ここで、振幅比をx、音圧レベルをy、ラウドネス比をzとすると、y=20log10(x)及びz=2^(y/10)という関係式が成り立つ。これをxについて解くと、x=z^(1/(2log10(2)))という関係が導かれる。これはすなわち、音量をz倍したい場合には、振幅を約z^1.66倍すればよいことを表す。 Here, if the amplitude ratio is x, the sound pressure level is y, and the loudness ratio is z, the relationships y = 20 log10(x) and z = 2^(y/10) hold. Solving these for x yields the relationship x = z^(1/(2 log10 2)). In other words, to make the loudness z times larger, the amplitude should be multiplied by approximately z^1.66.
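As a quick illustration of this relation, the following sketch converts a desired loudness ratio into the corresponding amplitude factor; it is a numerical check only, not part of the embodiment.

```python
# Illustrative check of the loudness-to-amplitude relation above.
import math

def amplitude_factor(loudness_ratio: float) -> float:
    """Amplitude factor x for a loudness ratio z, using y = 20*log10(x) and z = 2**(y/10)."""
    return loudness_ratio ** (1.0 / (2.0 * math.log10(2.0)))  # approximately z**1.66

print(amplitude_factor(2.0))   # ~3.16, i.e. +10 dB doubles the loudness
print(amplitude_factor(0.5))   # ~0.32, i.e. -10 dB halves the loudness
```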

 背景音に対する複数対向スピーカーバランシング処理について説明する。本実施の形態に係る遠隔コミュニケーション装置1では、空間のつながり感を維持しつつ、リモートにいる対話の相手とのコミュニケーションを円滑に行うため、背景音を完全に除去するのではなく敢えてローカルに伝送する。ただし、背景音の音量を大きくしすぎると内野発話が聞き取り難くなる。そこで、音声出力制御部22は、デグレード処理で明瞭度が下げられた背景音に対して、再生のタイミングでさらに以下の音量調整を加えて背景音の定位感をぼかす。 The following describes the multiple opposing speaker balancing process for background sound. In the remote communication device 1 according to this embodiment, in order to facilitate communication with a remote conversation partner while maintaining a sense of spatial connection, background sound is transmitted locally rather than being completely removed. However, if the volume of the background sound is made too high, it becomes difficult to hear infield speech. Therefore, the audio output control unit 22 further applies the following volume adjustments to the background sound, whose clarity has been reduced by the degradation process, at the timing of playback to blur the sense of positioning of the background sound.

 具体的には、音声出力制御部22は、音データから背景音を取得する。次に、音声出力制御部22は、背景音のラウドネスが1/uとなるようにして、全てのスピーカーユニット411及び412から出力させる。すなわち、音声出力制御部22は、背景音の振幅を(1/u)1.66倍になるように処理して、スピーカーアレイ41の全てのスピーカーユニット411及びスピーカーアレイ42の全てのスピーカーユニット412から出音させる。 Specifically, the audio output control unit 22 acquires background sound from the sound data. Next, the audio output control unit 22 adjusts the loudness of the background sound to 1/u and outputs it from all speaker units 411 and 412. That is, the audio output control unit 22 processes the amplitude of the background sound to be 1.66 times (1/u), and outputs the background sound from all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42.

 以上の処理を加えることで、背景音は、グラフ200の曲線201に示すように全てのスピーカーユニット411及び412から同じ抑えられた音量で出音される。ここでは、背景音とは、自然環境音、カフェのBGM及びオフィスの喧騒等の音源位置が明確でない音を対象とする。音声出力制御部22は、背景音の定位感をぼかすことで、ある1つのスピーカーユニット411及び412から再生するよりもリモート空間の様子を臨場感高く再現できる。 By performing the above processing, background sound is output at the same reduced volume from all speaker units 411 and 412, as shown by curve 201 in graph 200. Here, background sound refers to sounds whose source location is unclear, such as natural environmental sounds, background music in a cafe, and office noise. By blurring the sense of positioning of the background sound, the audio output control unit 22 can reproduce the remote space with a greater sense of realism than if it were played from a single speaker unit 411 or 412.
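The following is a minimal Python sketch of this background-sound balancing: the same attenuated signal is assigned to every speaker unit of one array. The function name and per-array handling are assumptions.

```python
# Illustrative sketch: background sound at 1/u loudness on every unit of one array.
import numpy as np

def balance_background(background: np.ndarray, u: int) -> list[np.ndarray]:
    """Return the per-unit background signals for an array of u speaker units."""
    gain = (1.0 / u) ** 1.66          # loudness 1/u expressed as an amplitude factor
    return [gain * background for _ in range(u)]
```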

 次に、内野発話音声に対する複数対向スピーカーバランシング処理について説明する。内野発話音声はローカルの人に向けた発話なので、なるべく聞こえやすくすることが好ましい。そこで、音声出力制御部22は、リモートにおいてエンハンス処理が施された内野の個別音に対して以下のような処理を実施する。 Next, the multiple opposing speaker balancing process for infield speech sound will be described. Since infield speech is directed at the local people, it is preferable to make it as easy to hear as possible. Therefore, the audio output control unit 22 performs the following processing on the individual infield sounds that have been enhanced on the remote side.

 音声出力制御部22は、人別リストを用いて音データから内野発話音声を取得する。次に、音声出力制御部22は、取得したその内野発話音声の定位感はぼかさず、音の大きさも調整せずそのままスピーカーアレイ41及び42に再生させる。ここで、音声出力制御部22は、その内野発話音声の発話者の正規化座標に最も近いスピーカーユニット411及び412を特定する。すなわち、発話者の正規化座標をxpとした場合、音声出力制御部22は、以下の数式(3)で求まる正規化座標xuに存在するスピーカーユニット411及び412を特定する。 The audio output control unit 22 acquires the infield speech sound from the sound data using the person list. Next, the audio output control unit 22 causes the speaker arrays 41 and 42 to play the acquired infield speech sound as it is, without blurring its sense of localization and without adjusting its volume. Here, the audio output control unit 22 identifies the speaker units 411 and 412 closest to the normalized coordinate of the speaker of the infield speech sound. That is, when the normalized coordinate of the speaker is xp, the audio output control unit 22 identifies the speaker units 411 and 412 located at the normalized coordinate xu obtained by the following formula (3).

 そして、音声出力制御部22は、特定したスピーカーユニット411及び412からその内野発話音声を出力させる。この場合、音声出力制御部22は、その内野発話音声を他のスピーカーユニット411及び412からは出力させない。以上の処理を加えることで、内野発話音声は、グラフ200の曲線202に示すように1つの特定のスピーカーユニット411及び412から他の音よりも大きな音量で出音される。 Then, the audio output control unit 22 outputs the infield speech sound from the identified speaker units 411 and 412. In this case, the audio output control unit 22 does not output the infield speech sound from the other speaker units 411 and 412. By performing the above processing, the infield speech sound is output from one specific speaker unit 411 and 412 at a louder volume than other sounds, as shown by curve 202 in graph 200.

 次に、外野発話音声に対する複数対向スピーカーバランシング処理について説明する。外野発話は、ローカルの人が直接会話している相手ではないので明瞭に聞こえなくてもよいが、背景音と同様に空間のつながり感の観点でも完全に除去することは望ましくない。また、明瞭度は下げつつも耳を傾ければ聞き取れるようにすることで、ローカルの人が外野の人に話しかけ、コミュニケーションが発展し内野に変化するというチャンスもある。そこで、相対的に内野発話を聞き取りやすく、外野発話を聞き取り難くするため、リモートにおいて外野の個別音に対してデグレード処理を施している。そのうえで、音声出力制御部22は、取得したその外野発話音声の定位感をぼかす処理を行う。 Next, the multiple opposing speaker balancing process for outfield speech sound will be described. Outfield speech does not have to be heard clearly because its speaker is not someone the local people are directly conversing with, but, as with the background sound, completely removing it is undesirable from the viewpoint of the sense of spatial connection. In addition, by lowering the clarity while keeping the speech audible to someone who listens carefully, there remains a chance that a local person speaks to an outfield person and the communication develops so that the person changes to the infield. Therefore, to make infield speech relatively easier to hear and outfield speech relatively harder to hear, degradation processing is applied to the individual outfield sounds on the remote side. On top of that, the audio output control unit 22 performs processing to blur the sense of localization of the acquired outfield speech sound.

 具体的には、外野発話音声は音源位置が明確であるため、音声出力制御部22は、正規分布関数を用いる処理により外野発話音声の定位感をぼかす。ここで、正規分布関数は、μを平均値とし、σを標準偏差として次の数式(4)で表される。 Specifically, because the sound source position of outfield speech is clear, the audio output control unit 22 blurs the sense of localization of outfield speech by processing using a normal distribution function. Here, the normal distribution function is expressed by the following equation (4), where μ is the mean value and σ is the standard deviation.

 音声出力制御部22は、正規分布関数のピークの部分を外野発話音声の発話者の位置と合致させるため、数式(3)を満たす発話者の正規化座標に最も近いスピーカーユニット411及び412の正規化座標xuの位置をμの値に設定する。また、音声出力制御部22は、σをWuとする。図7は、外野発話音声に対する正規分布関数を用いた処理を示す図である。音声出力制御部22は、ピークを外野発話音声の発話者の位置としたグラフ211で表される関数f(x)を作成する。ここで、f(x)は、次の数式(5)で表される。 In order to align the peak of the normal distribution function with the position of the speaker of the outfield speech sound, the audio output control unit 22 sets μ to the position of the normalized coordinate xu of the speaker units 411 and 412 closest to the normalized coordinate of the speaker, which satisfies formula (3). The audio output control unit 22 also sets σ to Wu. FIG. 7 is a diagram showing the processing using the normal distribution function for outfield speech sound. The audio output control unit 22 creates a function f(x), represented by graph 211, whose peak is at the position of the speaker of the outfield speech sound. Here, f(x) is expressed by the following formula (5).

 正規分布関数のx座標の範囲は通常は-∞から∞までであるが、音声出力制御部22は、スピーカーアレイ41及び42で再生する範囲として、例えば、信頼区間95%までの区間にあたるスピーカーユニット411及び412で再生するといった制約を掛ける。これにより、音声出力制御部22は、μ-2σからμ+2σまで(レンジ幅=2)の範囲で出音させることができ、出音する範囲を有限にする。この場合、音声出力制御部22は、1人の外野発話音声を5つずつのスピーカーユニット411及び412に再生させる。このように再生させるスピーカーユニット411及び412の範囲を狭めることで、音声出力制御部22は、処理負荷を軽減することができる。 The x-coordinate range of the normal distribution function is normally from -∞ to ∞, but the audio output control unit 22 imposes a constraint on the playback range of the speaker arrays 41 and 42, for example, playing back only from the speaker units 411 and 412 within the 95% confidence interval. This allows the audio output control unit 22 to output sound within the range from μ-2σ to μ+2σ (a range width of 2), making the output range finite. In this case, the audio output control unit 22 causes one person's outfield speech sound to be played from five speaker units 411 and five speaker units 412. By narrowing the range of speaker units 411 and 412 used for playback in this way, the audio output control unit 22 can reduce the processing load.

 また、σの値によっては最大値が1より大きくなってしまうことが考えられる。その場合、f(x)を音量調整に用いると内野発話音声よりも外野発話音声が大きくなる可能性がある。そこで、音声出力制御部22は、スケーリング因子sfを用いて、図7のグラフ212で示すように、g(x)=sf/f(μ)×f(x)として音量を全体にわたって小さくすることで、内野発話音声よりも外野発話音声が大きく再生することを防止する。sfは例えば、0.8とすることができる。 Furthermore, depending on the value of σ, it is possible that the maximum value will be greater than 1. In that case, using f(x) to adjust the volume may result in outfield speech sounds being louder than infield speech sounds. Therefore, the audio output control unit 22 uses a scaling factor sf to reduce the overall volume by g(x) = sf/f(μ) × f(x), as shown in graph 212 in Figure 7, thereby preventing outfield speech sounds from being played louder than infield speech sounds. sf can be set to 0.8, for example.

 そして、音声出力制御部22は、正規化座標に合わせて、外野発話音声の音の振幅を{g(x)}^1.66倍になるように処理して、外野発話音声の発話者の位置を中心として指定した数のスピーカーユニット411及び412で外野発話音声を再生させる。ここで、ピークの値を規定するスケーリング因子sfの値(0≦sf≦1)やスピーカー再生範囲は、設定ファイル等を用いて利用者が変更できる。以上の処理を加えることで、外野発話音声は、図6のグラフ200の曲線203に示すように発話者の位置を中心とした所定の範囲のスピーカーユニット411及び412から内野発話音声よりも抑えた音量で出音される。 The audio output control unit 22 then processes the amplitude of the outfield speech sound so that it becomes {g(x)}^1.66 times, according to the normalized coordinates, and causes the specified number of speaker units 411 and 412 centered on the position of the speaker of the outfield speech sound to play the outfield speech sound. Here, the value of the scaling factor sf (0 ≤ sf ≤ 1) that defines the peak value and the speaker playback range can be changed by the user using a configuration file or the like. By performing the above processing, the outfield speech sound is output at a lower volume than the infield speech sound from the speaker units 411 and 412 in a predetermined range centered on the speaker's position, as shown by curve 203 in graph 200 of FIG. 6.
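The following is a minimal Python sketch of the outfield balancing above: per-unit gains follow a normal distribution centered on the unit nearest to the speaker, scaled so that the peak equals sf, and only the units within ±2σ are used. Index handling (0-based unit indices) and default parameter values are illustrative assumptions.

```python
# Illustrative sketch of Gaussian spreading for one outfield speech source.
import math

def outfield_gains(mu_index: int, u: int, sf: float = 0.8, range_units: int = 2):
    """Return {unit_index: amplitude_factor} for one outfield speaker."""
    sigma = 1.0  # sigma equals one inter-speaker spacing Wu, expressed in unit indices

    def f(x):    # normal distribution with mean mu_index, as in formulas (4) and (5)
        return math.exp(-((x - mu_index) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    gains = {}
    for idx in range(max(0, mu_index - range_units), min(u - 1, mu_index + range_units) + 1):
        g = sf / f(mu_index) * f(idx)   # g(x) = sf / f(mu) * f(x), so the peak gain is sf
        gains[idx] = g ** 1.66          # convert the loudness ratio to an amplitude factor
    return gains
```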

 このように、音声出力制御部22は、複数のスピーカーユニット411が一列に並ぶスピーカーアレイ41と複数のスピーカーユニット412が一列に並ぶスピーカーアレイ42とを2つ有する複数対向スピーカー4において、以下の処理を実行する。音声出力制御部22は、発話者それぞれの位置を基に、受信部21により受信されたリモート空間にあたる所定空間の音声のうち各発話者の発話音声を再生させるスピーカーユニット411及び412を選択する。そして、音声出力制御部22は、選択した前記スピーカーユニット411及び412に発話者それぞれの発話音声を再生させる。また、音声出力制御部22は、スクリーンに映された発話者毎に、スクリーン上の位置の近傍で発話音声が再生されるように、発話者の位置を基に再生させるスピーカーユニット411及び412を選択する。 In this way, the audio output control unit 22 performs the following processing on the multiple opposed speakers 4, which have two arrays: a speaker array 41 in which multiple speaker units 411 are arranged in a row, and a speaker array 42 in which multiple speaker units 412 are arranged in a row. Based on the position of each speaker, the audio output control unit 22 selects speaker units 411 and 412 to play back the speech of each speaker from the audio received by the receiving unit 21 in a specified space corresponding to the remote space. The audio output control unit 22 then causes the selected speaker units 411 and 412 to play back the speech of each speaker. Furthermore, for each speaker shown on the screen, the audio output control unit 22 selects speaker units 411 and 412 to play back based on the position of the speaker so that the speech is played back near the position on the screen.

 この点、従来は発話者が画面の向こうの相手と会話しているか否か、すなわち会話に参加しているか否かにより多チャンネルスピーカーの再生方法を制御することは困難であった。これに対して、音声出力制御部22は、ローカルにとっての対話相手である内野の発話者の声が相対的に聞き取り易くすると同時に、背景音や外野発話を残すことでリモートの空間とのつながり感を保持する。また、音声出力制御部22は、背景音や外野発話を複数のスピーカーユニット411及び412から再生することで音の定位感をぼかしてより聞き取り難くし、かつ、内野発話音声を相対的に聞き取り易くすることでリモートの相手との会話により集中できるようにする。 In the past, it was difficult to control the playback method of multi-channel speakers depending on whether the speaker was speaking with the other party on the other side of the screen, i.e., whether they were participating in the conversation. In contrast, the audio output control unit 22 makes the voice of the infield speaker, who is the local conversation partner, relatively easier to hear, while maintaining a sense of connection with the remote space by leaving background sounds and outfield speech. In addition, the audio output control unit 22 plays background sounds and outfield speech from multiple speaker units 411 and 412, blurring the sense of sound localization and making them more difficult to hear, while making infield speech relatively easier to hear, allowing you to better concentrate on the conversation with the remote party.

 <2.第1の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ>
 図8は、第1の実施の形態に係る遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置1の送信ユニット10からローカル側の遠隔コミュニケーション装置1の受信ユニット20へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。
<2. Data flow between remote communication devices according to the first embodiment>
FIG. 8 is a diagram showing an outline of the data flow between the remote communication devices according to the first embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote-side remote communication device 1 to the receiving unit 20 of the local-side remote communication device 1 will be described. Here, the transmission of audio will be described.

 送信側処理221が、リモート側に存在する遠隔コミュニケーション装置1が実行する処理である。また、受信側処理222が、ローカル側に存在する遠隔コミュニケーション装置1が実行する処理である。 The sending-side process 221 is a process executed by the remote communication device 1 located on the remote side. The receiving-side process 222 is a process executed by the remote communication device 1 located on the local side.

 マイク3は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS1)。ここでは、例えば、マイクアレイ30のデータとしてsチャンネルのデータが存在する。また、カメラ2は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS2)。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S1). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing images of the remote space to the remote communication device 1 (step S2).

 カメラ2から入力された映像は、人センシング部11に送られる。人センシング部11は、カメラ2により撮影された映像を用いて、リモート空間に存在する各人の位置の判定及び内野外野の判定を行う人センシング処理を実行する(ステップS3)。 The video input from the camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by the camera 2 to execute human sensing processing that determines the position of each person present in the remote space and determines whether each person is in the infield or the outfield (step S3).

 詳しくは、人センシング部11は、スクリーンに映される映像の横方向に正規化座標を設定する。そして、人センシング部11は、映像内の発話者に対して骨格認識を行い、各発話者の位置の正規化座標を取得する。その後、人センシング部11は、発話者の並びに対応させた要素を有する人別リストを生成して、各人の位置の正規化座標を登録することで、人位置判定処理を実行する(ステップS31)。ここでは、人センシング部11は、映像内に存在するp人の発話者を抽出する。すなわち、人別リストにはp個の要素が登録される。 In more detail, the human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Then, the human sensing unit 11 performs skeletal recognition on speakers in the image and obtains normalized coordinates of the position of each speaker. After that, the human sensing unit 11 generates a person list having elements corresponding to the order of speakers and registers the normalized coordinates of each person's position, thereby executing the human position determination process (step S31). Here, the human sensing unit 11 extracts p speakers present in the image. In other words, p elements are registered in the person list.

 また、人センシング部11は、映像内の各発話者の頭部方向を推定する。そして、人センシング部11は、推定した頭部方向が予め決められた内野範囲に含まれるか否かにより、各発話者が内野か外野かを判定し、判定結果を基に内野及び外野を示す内野フラグを人別リストに登録することで内野外野判定処理を実行する(ステップS32)。 The human sensing unit 11 also estimates the head direction of each speaker in the video. Then, depending on whether the estimated head direction is within a predetermined infield range, the human sensing unit 11 determines whether each speaker is in the infield or outfield, and executes the infield/outfield determination process by registering an infield flag indicating the infield or outfield in the person list based on the determination result (step S32).

 マイク3から入力された音声は、音声分離部12へ送られる。音声分離部12は、マイク3から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う(ステップS4)。音声分離部12は、マイクアレイ30から入力されたsチャンネルのデータに対して、背景音として1チャンネルのデータを生成し、発話音声としてsチャンネルのデータを生成する。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate speech and background sounds contained in the audio input from microphone 3 (step S4). For the s-channel data input from microphone array 30, audio separation unit 12 generates one channel of data as background sound and s-channel data as speech.

 音声分離部12により抽出された1チャンネルのデータである背景音は、音響信号処理部13の明瞭度調整部132へ送られる。明瞭度調整部132は、明瞭度調整処理として背景音に対してより聞こえ難くするデグレード処理を実施する(ステップS5)。 The background sound, which is one-channel data extracted by the audio separation unit 12, is sent to the clarity adjustment unit 132 of the audio signal processing unit 13. The clarity adjustment unit 132 performs a clarity adjustment process, called a degradation process, to make the background sound less audible (step S5).

 また、音声分離部12により抽出されたsチャンネルのデータである発話音声は、音響信号処理部13の個別音分離部131へ送られる。また、人別リストが、個別音分離部131へ送られる。発話音声に対しては、音響信号処理部13により人別リストを用いた発話音声信号処理が実施される(ステップS6)。 Furthermore, the speech sound, which is the s-channel data extracted by the speech separation unit 12, is sent to the individual sound separation unit 131 of the acoustic signal processing unit 13. The person list is also sent to the individual sound separation unit 131. The acoustic signal processing unit 13 performs speech sound signal processing on the speech sound using the person list (step S6).

 詳しくは、個別音分離部131は、取得した人別リストに登録された要素数であるp個分のスレッドを作成する。さらに、個別音分離部131は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。そして、個別音分離部131は、各発話者の正規化座標を仮想スクリーンの座標として、物理的距離を算出する。そして、個別音分離部131は、算出した物理的距離と、仮想スクリーンからカメラ2までの距離とを用いて各発話者までの角度θを求める。次に、個別音分離部131は、音声分離部12による音声分離により抽出された発話音声に対して、各発話者までの角度に応じたビームフォーミングを用いて個別音分離処理を行う(ステップS61)。これにより、個別音分離部131は、発話者毎に1チャンネルのデータである個別音を生成する。すなわち、p個のスレッド1つ1つで、1チャンネルの個別音のデータが生成される。 In more detail, the individual sound separation unit 131 creates p threads, which is the number of elements registered in the acquired person list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list as thread-specific values for each thread. The individual sound separation unit 131 then calculates the physical distance, using the normalized coordinates of each speaker as coordinates on the virtual screen. The individual sound separation unit 131 then determines the angle θ to each speaker using the calculated physical distance and the distance from the virtual screen to the camera 2. Next, the individual sound separation unit 131 performs individual sound separation processing on the speech extracted by the audio separation unit 12, using beamforming according to the angle to each speaker (step S61). As a result, the individual sound separation unit 131 generates individual sounds, which are one-channel data for each speaker. In other words, one channel of individual sound data is generated in each of the p threads.
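The following is a minimal Python sketch of deriving the beam angle θ for one speaker from the normalized coordinate. The geometry (camera on the horizontal center of the virtual screen at distance d_cam) and the names are illustrative assumptions.

```python
# Illustrative sketch of computing the beam direction to a speaker.
import math

def beam_angle_deg(x_norm: float, w_disp: float, d_cam: float) -> float:
    """Angle from the camera/microphone axis to a speaker at normalized x in [0, 1]."""
    offset = (x_norm - 0.5) * w_disp      # physical horizontal offset on the virtual screen [m]
    return math.degrees(math.atan2(offset, d_cam))
```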

 次に、明瞭度調整部132は、スレッド毎に個別音に対応する発話者が内野であるか外野であるかを人別リストの内野フラグを用いて判定する。そして、明瞭度調整部132は、明瞭度調整処理として、内野の発話者の個別音に対してより聞こえやすくするエンハンス処理を実施し、かつ、外野の発話者の個別音に対してより聞こえ難くするデグレード処理を実施する(ステップS62)。 Next, the clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. Then, as a clarity adjustment process, the clarity adjustment unit 132 performs an enhancement process to make the individual sounds of infield speakers easier to hear, and a degradation process to make the individual sounds of outfield speakers harder to hear (step S62).

 明瞭度調整処理が施された背景音は、出力統合部14へ入力される。また、発話音声信号処理が施された個別音も、出力統合部14へ入力される。これにより、出力統合部14は、(1+p)チャンネルのデータを取得する。そして、出力統合部14は、すべてのチャンネルの音を統合して1つの音データを生成する出力統合処理を実行する(ステップS7)。 The background sound that has undergone the clarity adjustment processing is input to the output integration unit 14. The individual sounds that have undergone the speech signal processing are also input to the output integration unit 14. As a result, the output integration unit 14 obtains (1+p) channels of data. The output integration unit 14 then executes the output integration processing that integrates the sounds of all channels to generate a single piece of sound data (step S7).

 出力統合部14で生成された音データ及び人別リストは、送信部15によりネットワーク7を介してローカル側の遠隔コミュニケーション装置1へ送られる(ステップS8)。人別リストの送信により、人位置判定処理により判定された各発話者の正規化座標及び内野外野判定処理により判定された各発話者が内野であるか外野であるかの情報が、ローカル側の遠隔コミュニケーション装置1へ送られる。 The sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 via the network 7 to the local-side remote communication device 1 (step S8). By sending the person list, the normalized coordinates of each speaker determined by the person position determination process and information on whether each speaker is in the infield or outfield determined by the infield/outfield determination process are sent to the local-side remote communication device 1.

 ローカル側の遠隔コミュニケーション装置1は、音データ及び人別リストの送信を受ける。音データ及び人別リストは、受信部21を介して音声出力制御部22へ送られる。音声出力制御部22は、人別リストを用いて音データに対して複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット411及び412を指定して再生させる(ステップS9)。詳細には、音声出力制御部22は、背景音のラウドネスをスピーカーユニット411及び412の組みの数で除算した大きさにして全てのスピーカーユニット411及び412に再生させる。また、音声出力制御部22は、内野発話音声の発話者の位置に最も近いスピーカーユニット411及び412に内野発話音声をそのまま再生させる。また、音声出力制御部22は、スケーリング因子を用いた正規分布関数を使用して外野発話音声に処理を施し、かつ、使用するスピーカーユニット411及び412の範囲を限定してスピーカーユニット411及び412に外野発話音声を再生させる。 The local remote communication device 1 receives the sound data and the person list. The sound data and person list are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 uses the person list to perform a multiple facing speaker balancing process on the sound data, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound for playback (step S9). In detail, the audio output control unit 22 adjusts the loudness of the background sound by dividing it by the number of pairs of speaker units 411 and 412, and plays the adjusted sound on all speaker units 411 and 412. The audio output control unit 22 also plays the infield speech sound directly on the speaker units 411 and 412 closest to the position of the speaker of the infield speech sound. In addition, the audio output control unit 22 processes the outfield speech using a normal distribution function with a scaling factor, and limits the range of speaker units 411 and 412 to be used, causing the speaker units 411 and 412 to reproduce the outfield speech.

 スピーカーアレイ41及び42は、音声出力制御部22から指定されたスピーカーユニット411及び412を用いて、背景音、内野発話音声及び外野発話音声を出音する(ステップS10)。 The speaker arrays 41 and 42 output background sounds, infield speech sounds, and outfield speech sounds using the speaker units 411 and 412 specified by the audio output control unit 22 (step S10).

 <3.遠隔コミュニケーション処理>
 次に、遠隔コミュニケーション処理の流れについて説明する。ここでは、送信側の遠隔コミュニケーション装置1による送信側処理と、受信側の遠隔コミュニケーション装置1による複数対向スピーカーバランシング処理に分けて説明する。
<3. Remote communication processing>
Next, the flow of the remote communication process will be described, which will be divided into a transmitting process by the transmitting remote communication device 1 and a multiple opposing speaker balancing process by the receiving remote communication device 1.

 <3.1.送信側処理>
 図9は、送信側処理のフローチャートである。図9を参照して、送信側処理の全体的な流れを説明する。
<3.1. Transmission side processing>
9 is a flowchart of the transmission side process. The overall flow of the transmission side process will be described with reference to FIG.

 人センシング部11は、カメラ2により撮影されたリモート空間の映像を取得する。また、音声分離部12は、マイク3により収音されたリモート空間の音声を取得する(ステップS11)。 The human sensing unit 11 acquires video of the remote space captured by the camera 2. The audio separation unit 12 also acquires audio of the remote space picked up by the microphone 3 (step S11).

 人センシング部11は、カメラ2により撮影された映像を用いて、各発話者の位置の正規化座標を判定し、かつ、人数分の要素を有し各発話者の正規化座標が登録された人別リストを作成する人位置判定処理を実行する(ステップS12)。 The person sensing unit 11 executes the person position determination process, which uses the video captured by the camera 2 to determine the normalized coordinates of each speaker's position and to create a person list that has elements for the number of people and in which the normalized coordinates of each speaker are registered (step S12).

 次に、人センシング部11は、映像及び人別リストを用いて、各発話者が内野であるか外野であるかを判定する内野外野判定処理を実行する(ステップS13)。 Next, the human sensing unit 11 performs an infield/outfield determination process using the video and the person list to determine whether each speaker is in the infield or outfield (step S13).

 音声分離部12は、マイク3から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う(ステップS14)。 The audio separation unit 12 performs audio separation processing to separate the speech and background sounds contained in the audio input from the microphone 3 (step S14).

 明瞭度調整部132は、音声分離部12により抽出された背景音の入力を受ける。そして、明瞭度調整部132は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する(ステップS15)。 The clarity adjustment unit 132 receives the background sound extracted by the audio separation unit 12. The clarity adjustment unit 132 then performs a degrading process to make the background sound more difficult to hear as a clarity adjustment process (step S15).

 個別音分離部131は、音声分離部12により抽出された発話音声の入力を受ける。また、個別音分離部131は、人別リストを人センシング部11から取得する。次に、個別音分離部131は、取得した人別リストに登録された要素数分のスレッドを生成する(ステップS16)。個別音分離部131は、各スレッドに1から連番で人別リストに登録された要素数分のIDを割り当てる。ここでは、人別リストに登録された要素数がpである場合で説明する。 The individual sound separation unit 131 receives input of the speech extracted by the audio separation unit 12. The individual sound separation unit 131 also acquires the person list from the person sensing unit 11. Next, the individual sound separation unit 131 generates threads equal to the number of elements registered in the acquired person list (step S16). The individual sound separation unit 131 assigns IDs to the threads, numbered consecutively from 1, up to the number of elements registered in the person list. Here, the case where the number of elements registered in the person list is p will be described.

 次に、個別音分離部131は、発話者毎の個別音に発話音声を分離する個別音分離処理を実行する(ステップS17)。 Next, the individual sound separation unit 131 performs individual sound separation processing to separate the speech into individual sounds for each speaker (step S17).

 明瞭度調整部132は、個別音の入力を個別音分離部131から受ける。次に、明瞭度調整部132は、iを初期化して0に設定する(ステップS18)。 The clarity adjustment unit 132 receives input of individual sounds from the individual sound separation unit 131. Next, the clarity adjustment unit 132 initializes i to 0 (step S18).

 次に、明瞭度調整部132は、IDがiであるスレッドの個別音について明瞭度調整処理を実行する(ステップS19)。 Next, the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds in the thread whose ID is i (step S19).

 次に、明瞭度調整部132は、iがp未満か否かを判定する(ステップS20)。iがp未満の場合(ステップS20:肯定)、明瞭度調整部132は、iを1つインクリメントする(ステップS21)。その後、明瞭度調整部132は、ステップS19へ戻る。 Next, the clarity adjustment unit 132 determines whether i is less than p (step S20). If i is less than p (step S20: Yes), the clarity adjustment unit 132 increments i by 1 (step S21). Then, the clarity adjustment unit 132 returns to step S19.

 これに対して、iがp以上の場合(ステップS20:否定)、明瞭度調整部132は、明瞭度調整処理を施した背景音及び個別音を出力統合部14へ出力する。また、出力統合部14は、映像の入力をカメラ2から受ける。そして、出力統合部14は、背景音声及び個別音を統合して1つの音データを生成する出力統合処理を実行する(ステップS22)。 On the other hand, if i is greater than or equal to p (step S20: No), the clarity adjustment unit 132 outputs the background sound and individual sounds that have been subjected to clarity adjustment processing to the output integrator 14. The output integrator 14 also receives video input from the camera 2. The output integrator 14 then executes output integration processing to integrate the background sound and individual sounds to generate a single piece of sound data (step S22).

 その後、送信部15は、出力統合部14により生成された音データ及びカメラ2により撮影された映像をネットワーク7を介してローカル側の遠隔コミュニケーション装置1へ送信する(ステップS23)。 Then, the transmission unit 15 transmits the sound data generated by the output integration unit 14 and the video captured by the camera 2 to the local-side remote communication device 1 via the network 7 (step S23).

 <3.1.1.人位置判定処理>
 図10は、人位置判定処理のフローチャートである。図10に示した処理は、図9のステップS12で実行される処理の一例にあたる。次に、図10を参照して、人位置判定処理の流れを説明する。
<3.1.1. Person position determination process>
Fig. 10 is a flowchart of the person position determination process. The process shown in Fig. 10 is an example of the process executed in step S12 in Fig. 9. Next, the flow of the person position determination process will be described with reference to Fig. 10.

 人センシング部11は、スクリーンに映される映像の横方向に正規化座標を設定する。次に、人センシング部11は、映像内の発話者に対して骨格認識を行って各発話者の首の位置を特定し、首のスクリーン座標を取得する(ステップS101)。 The human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Next, the human sensing unit 11 performs skeletal recognition on the speakers in the image to identify the position of each speaker's neck and obtain the screen coordinates of the neck (step S101).

 次に、人センシング部11は、スクリーン座標を正規化座標に変換する(ステップS102)。 Next, the human sensing unit 11 converts the screen coordinates into normalized coordinates (step S102).

 その後、人センシング部11は、人別リストに各発話者のスクリーン座標及び正規化座標を登録する(ステップS103)。ここで、人センシング部11は、スクリーン座標及び正規化座標のいずれについても、映像をディスプレイ6のスクリーンに映した場合の横方向の座標を格納すればよい。 Then, the human sensing unit 11 registers the screen coordinates and normalized coordinates of each speaker in the person list (step S103). Here, for both the screen coordinates and the normalized coordinates, it suffices for the human sensing unit 11 to store the horizontal coordinate obtained when the video is shown on the screen of the display 6.
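The following is a minimal Python sketch of converting a detected neck position from screen (pixel) coordinates to the normalized horizontal coordinate registered in the person list. The clamping to [0, 1] and the names are illustrative assumptions.

```python
# Illustrative sketch of screen-to-normalized coordinate conversion.
def to_normalized_x(neck_x_px: float, frame_width_px: int) -> float:
    """Map a horizontal pixel position to [0, 1], where 0 is the left edge and 1 the right edge."""
    return min(max(neck_x_px / frame_width_px, 0.0), 1.0)
```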

 <3.1.2.内野外野判定処理>
 図11は、内野外野判定処理のフローチャートである。図11に示した処理は、図9のステップS13で実行される処理の一例にあたる。次に、図11を参照して、内野外野判定処理の流れを説明する。
<3.1.2. Infield/Outfield Determination Processing>
Fig. 11 is a flowchart of the infield/outfield determination process. The process shown in Fig. 11 is an example of the process executed in step S13 in Fig. 9. Next, the flow of the infield/outfield determination process will be described with reference to Fig. 11.

 人センシング部11は、人別リストに内野フラグの項目を設けて、それらを初期化してリモート空間に存在する人数分の項目全てにFalseを設定する(ステップS111)。 The human sensing unit 11 creates an infield flag item in the person list, initializes it, and sets all items equal to the number of people present in the remote space to False (step S111).

 次に、人センシング部11は、 iを初期化して0とする(ステップS112)。 Next, the human sensing unit 11 initializes i to 0 (step S112).

 次に、人センシング部11は、映像の左側から先頭を0番目としてi番目の発話者の頭部方向の検出を実行する(ステップS113)。 Next, the human sensing unit 11 detects the head direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S113).

 人センシング部11は、i番目の発話者の頭部方向が検出可能か否かを判定する(ステップS114)。頭部方向が検出できない場合(ステップS114:否定)、人センシング部11は、ステップS118へ進む。 The human sensing unit 11 determines whether the head direction of the i-th speaker can be detected (step S114). If the head direction cannot be detected (step S114: No), the human sensing unit 11 proceeds to step S118.

 これに対して、頭部方向が検出可能な場合(ステップS114:肯定)、人センシング部11は、頭部方向が内野範囲内か否かを判定する(ステップS115)。 On the other hand, if the head direction can be detected (step S114: Yes), the human sensing unit 11 determines whether the head direction is within the infield (step S115).

 頭部方向が内野範囲内の場合(ステップS115:肯定)、人センシング部11は、人別リストのi番目の発話者に対応する要素の内野フラグにTrueを格納する(ステップS116)。 If the head direction is within the infield range (step S115: Yes), the person sensing unit 11 stores True in the infield flag of the element corresponding to the i-th speaker in the person list (step S116).

 これに対して、頭部方向が内野範囲以外の場合(ステップS115:否定)、人センシング部11は、人別リストのi番目の発話者に対応する要素の内野フラグにFalseを格納する(ステップS117)。 On the other hand, if the head direction is outside the infield range (step S115: No), the person sensing unit 11 stores False in the infield flag of the element corresponding to the i-th speaker in the person list (step S117).

 その後、人センシング部11は、iがp未満か否かを判定する(ステップS118)。iがp未満の場合(ステップS118:肯定)、人センシング部11は、iを1つインクリメントする(ステップS119)。その後、人センシング部11は、ステップS113へ戻る。 Then, the human sensing unit 11 determines whether i is less than p (step S118). If i is less than p (step S118: Yes), the human sensing unit 11 increments i by 1 (step S119). Then, the human sensing unit 11 returns to step S113.

 これに対して、iがp以上の場合(ステップS118:否定)、人センシング部11は、内野外野判定処理を終了する。 On the other hand, if i is greater than or equal to p (step S118: No), the human sensing unit 11 ends the infield/outfield determination process.
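The following is a minimal Python sketch of this infield/outfield determination loop. The head-direction representation (a yaw angle) and the threshold defining the infield range are illustrative assumptions; the embodiment only specifies that the estimated head direction is compared with a predetermined infield range.

```python
# Illustrative sketch of the infield/outfield determination (threshold assumed).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    x_norm: float                          # normalized horizontal position
    head_yaw_deg: Optional[float] = None   # None if head direction could not be detected
    infield: bool = False                  # infield flag, initialized to False

def judge_infield(people: list[Person], infield_range_deg: float = 30.0) -> None:
    for person in people:
        if person.head_yaw_deg is None:
            continue                        # head direction undetected: keep the initial False
        person.infield = abs(person.head_yaw_deg) <= infield_range_deg
```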

 <3.1.3.個別音分離処理>
 図12は、個別音分離処理のフローチャートである。図12に示した処理は、図9のステップS17で実行される処理の一例にあたる。次に、図12を参照して、個別音分離処理の流れを説明する。
<3.1.3. Individual Sound Separation Processing>
Fig. 12 is a flowchart of the individual sound separation process. The process shown in Fig. 12 is an example of the process executed in step S17 in Fig. 9. Next, the flow of the individual sound separation process will be described with reference to Fig. 12.

 個別音分離部131は、人別リストに登録された人数分のスレッドを作成する。次に、個別音分離部131は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。そして、個別音分離部131は、iを初期化して0に設定する(ステップS121)。 The individual sound separation unit 131 creates threads for the number of people registered in the personal list. Next, the individual sound separation unit 131 stores the normalized coordinate values registered in the personal list as thread-specific values for each thread. Then, the individual sound separation unit 131 initializes i to 0 (step S121).

 次に、個別音分離部131は、映像の左側から先頭を0番目としてi番目の発話者のビーム方向を計算する(ステップS122)。 Next, the individual sound separation unit 131 calculates the beam direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S122).

 次に、個別音分離部131は、ビーム方向に対するビームフォーミング処理でi番目の発話者の個別音を取得する(ステップS123)。 Next, the individual sound separation unit 131 acquires the individual sound of the i-th speaker by beamforming processing in the beam direction (step S123).

 その後、個別音分離部131は、iがp未満か否かを判定する(ステップS124)。iがp未満の場合(ステップS124:肯定)、個別音分離部131は、iを1つインクリメントする(ステップS125)。その後、個別音分離部131は、ステップS122へ戻る。 Then, the individual sound separation unit 131 determines whether i is less than p (step S124). If i is less than p (step S124: Yes), the individual sound separation unit 131 increments i by 1 (step S125). Then, the individual sound separation unit 131 returns to step S122.

 これに対して、iがp以上の場合(ステップS124:否定)、個別音分離部131は、個別音分離処理を終了する。 On the other hand, if i is greater than or equal to p (step S124: No), the individual sound separation unit 131 terminates the individual sound separation process.
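The following is a minimal delay-and-sum beamforming sketch for extracting one speaker's individual sound from the s-channel microphone-array signal. The embodiment does not specify the beamforming algorithm at this level of detail; the uniform linear array geometry, sampling rate, and integer-sample delays are simplifying assumptions.

```python
# Illustrative delay-and-sum beamformer (geometry and delays are simplified assumptions).
import numpy as np

C = 343.0      # speed of sound [m/s]
FS = 16000     # assumed sampling rate [Hz]

def delay_and_sum(mics: np.ndarray, theta_deg: float, mic_spacing_m: float) -> np.ndarray:
    """mics: shape (s, n_samples) for a uniform linear array; theta_deg: beam direction."""
    s, n = mics.shape
    theta = np.deg2rad(theta_deg)
    out = np.zeros(n)
    for m in range(s):
        delay_s = m * mic_spacing_m * np.sin(theta) / C   # arrival-time difference at mic m
        shift = int(round(delay_s * FS))                  # integer-sample approximation
        out += np.roll(mics[m], -shift)                   # align the channels toward theta
    return out / s
```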

 <3.1.4.明瞭度調整処理>
 図13は、個別音に対する明瞭度調整処理のフローチャートである。図13に示した処理は、図9のステップS19で実行される処理の一例にあたる。次に、図13を参照して、明瞭度調整処理の流れを説明する。
3.1.4. Clarity Adjustment Processing
Fig. 13 is a flowchart of the clarity adjustment process for individual sounds. The process shown in Fig. 13 is an example of the process executed in step S19 in Fig. 9. Next, the flow of the clarity adjustment process will be described with reference to Fig. 13.

 明瞭度調整部132は、iを初期化して0に設定する(ステップS131)。 The clarity adjustment unit 132 initializes i to 0 (step S131).

 次に、明瞭度調整部132は、映像の左側から先頭を0番目としてi番目の発話者の人別リストの内野フラグがTrueか否かを判定する(ステップS132)。 Next, the clarity adjustment unit 132 determines whether the infield flag in the person list for the ith speaker, with the first speaker from the left side of the video being number 0, is True (step S132).

 内野フラグがTrueの場合(ステップS132:肯定)、明瞭度調整部132は、i番目の発話者の個別音にエンハンス処理を実施する(ステップS133)。 If the infield flag is True (step S132: Yes), the clarity adjustment unit 132 performs enhancement processing on the individual sounds of the i-th speaker (step S133).

 内野フラグがFalseの場合(ステップS132:否定)、明瞭度調整部132は、i番目の発話者の個別音にデグレード処理を実施する(ステップS134)。 If the infield flag is False (step S132: No), the clarity adjustment unit 132 performs degrading processing on the individual sound of the i-th speaker (step S134).

 その後、明瞭度調整部132は、iがp未満か否かを判定する(ステップS135)。iがp未満の場合(ステップS135:肯定)、明瞭度調整部132は、iを1つインクリメントする(ステップS136)。その後、明瞭度調整部132は、ステップS132へ戻る。 Then, the clarity adjustment unit 132 determines whether i is less than p (step S135). If i is less than p (step S135: Yes), the clarity adjustment unit 132 increments i by 1 (step S136). Then, the clarity adjustment unit 132 returns to step S132.

 これに対して、iがp以上の場合(ステップS135:否定)、明瞭度調整部132は、明瞭度調整処理を終了する。 On the other hand, if i is greater than or equal to p (step S135: No), the clarity adjustment unit 132 terminates the clarity adjustment process.
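The following is a minimal Python sketch of this per-speaker clarity adjustment loop, using the infield flag from the person list. It reuses the illustrative enhance() and degrade() helpers sketched earlier in this section, which are assumptions rather than the embodiment's actual filters.

```python
# Illustrative sketch of the clarity adjustment loop over the per-speaker threads.
def adjust_clarity(individual_sounds: list, infield_flags: list) -> list:
    adjusted = []
    for sound, is_infield in zip(individual_sounds, infield_flags):
        # infield speakers are enhanced, outfield speakers are degraded
        adjusted.append(enhance(sound) if is_infield else degrade(sound))
    return adjusted
```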

 <3.2.複数対向スピーカーバランシング処理(受信側処理)>
 図14は、複数対向スピーカーバランシング処理のフローチャートである。図14を参照して、複数対向スピーカーバランシング処理の全体的な流れを説明する。ここでは、音声出力制御部22は、スピーカーユニット411及び412の組合せ毎のリストを使用して再生を制御する場合で説明する。
<3.2. Multiple opposing speaker balancing process (receiving side process)>
14 is a flowchart of the multiple opposed speaker balancing process. The overall flow of the multiple opposed speaker balancing process will be described with reference to FIG. 14. Here, the case where the audio output control unit 22 controls playback using a list for each combination of speaker units 411 and 412 will be described.

 音声出力制御部22は、音データ及び人別リストの入力を受信部21から受ける。音声出力制御部22は、スピーカーユニット411同士の間の距離を算出する(ステップS321)。例えば、双方の端部のスピーカーユニット411間の距離がWspkであり、スピーカーユニット411及び412がいずれもu個存在する場合、音声出力制御部22は、Wu=Wspk/uとして、スピーカーユニット411同士の間のスピーカー間距離Wuを算出できる。 The audio output control unit 22 receives input of the sound data and the person list from the receiving unit 21. The audio output control unit 22 calculates the distance between the speaker units 411 (step S321). For example, if the distance between the speaker units 411 at both ends is Wspk and there are u speaker units 411 and u speaker units 412, the audio output control unit 22 can calculate the inter-speaker distance Wu between the speaker units 411 as Wu = Wspk/u.

 次に、音声出力制御部22は、各スピーカーユニット411及び412の組の正規化座標を算出して、スピーカーユニット411及び412の組の数の要素を有するスピーカーユニットリストに、昇順に正規化座標を格納する(ステップS322)。 Next, the audio output control unit 22 calculates the normalized coordinates for each pair of speaker units 411 and 412, and stores the normalized coordinates in ascending order in a speaker unit list having the same number of elements as the number of pairs of speaker units 411 and 412 (step S322).

 次に、音声出力制御部22は、スピーカーユニットリストの要素毎に出力音格納用の領域を設けて、それぞれの領域を背景音及び個別音のチャンネル数で初期化する(ステップS33)。例えば、背景音が1チャンネルのデータであり、個別音がp個ある場合、出力音格納用の領域には、(1+p)個のチャンネルの音を格納する(1+p)個の領域が設定される。 Next, the audio output control unit 22 provides an area for storing output sounds for each element of the speaker unit list, and initializes each area according to the number of channels of the background sound and individual sounds (step S33). For example, if the background sound is one channel of data and there are p individual sounds, (1+p) areas for storing the sounds of (1+p) channels are set in the area for storing output sounds.

 次に、音声出力制御部22は、iを初期化して0に設定する(ステップS34)。 Next, the audio output control unit 22 initializes i to 0 (step S34).

 次に、音声出力制御部22は、音データにおいて順番に並べられた(1+P)個のチャンネルの先頭を0番目としてi番目のチャンネルの音声が発話音声の個別音であるか否かを判定する(ステップS35)。 Next, the audio output control unit 22 determines whether the audio of the i-th channel, with the first of the (1+P) channels arranged in order in the audio data being numbered 0, is an individual sound of the speech (step S35).

 i番目のチャンネルの音声が背景音の場合(ステップS35:否定)、音声出力制御部22は、背景音調整を実施する(ステップS36)。その後、音声出力制御部22は、ステップS40へ進む。 If the audio on the i-th channel is background sound (step S35: No), the audio output control unit 22 performs background sound adjustment (step S36). The audio output control unit 22 then proceeds to step S40.

 これに対して、i番目のチャンネルの音声が発話音声の個別音である場合(ステップS35:肯定)、音声出力制御部22は、その個別音の発話者が内野であるか否かを人別リストを用いて判定する(ステップS37)。 On the other hand, if the audio of the i-th channel is an individual sound of speech (step S35: Yes), the audio output control unit 22 determines whether the speaker of that individual sound is in the infield, using the person list (step S37).

 発話者が外野の場合(ステップS37:否定)、音声出力制御部22は、外野発話音声調整を実施する(ステップS38)。その後、音声出力制御部22は、ステップS40へ進む。 If the speaker is in the outfield (step S37: No), the audio output control unit 22 performs outfield speech sound adjustment (step S38). The audio output control unit 22 then proceeds to step S40.

 発話者が内野の場合(ステップS37:肯定)、音声出力制御部22は、内野発話音声調整を実施する(ステップS39)。 If the speaker is an infield player (step S37: Yes), the audio output control unit 22 performs infield speech sound adjustment (step S39).

 その後、音声出力制御部22は、スピーカーユニットリストの出力音格納用の領域にi番目のチャンネルの音声を格納する(ステップS40)。 Then, the audio output control unit 22 stores the audio of the i-th channel in the area for storing output audio in the speaker unit list (step S40).

 次に、音声出力制御部22は、iがp未満か否かを判定する(ステップS41)。iがp未満の場合(ステップS41:肯定)、音声出力制御部22は、iを1つインクリメントする(ステップS42)。その後、音声出力制御部22は、ステップS35へ戻る。 Next, the audio output control unit 22 determines whether i is less than p (step S41). If i is less than p (step S41: Yes), the audio output control unit 22 increments i by 1 (step S42). Then, the audio output control unit 22 returns to step S35.

 これに対して、iがp以上の場合(ステップS41:否定)、音声出力制御部22は、スピーカーユニットリストの出力音格納用の領域に格納された音声をミキシングする。そして、音声出力制御部22は、ミキシングした音をスピーカーユニットリストに対応するスピーカーユニット411及び412から出音させる(ステップS43)。 On the other hand, if i is greater than or equal to p (step S41: No), the audio output control unit 22 mixes the audio stored in the area for storing output audio in the speaker unit list. Then, the audio output control unit 22 outputs the mixed sound from the speaker units 411 and 412 corresponding to the speaker unit list (step S43).
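The following is a minimal Python sketch of this receiving-side dispatch: each of the (1+p) channels is routed to background, infield, or outfield handling and accumulated into per-unit output buffers, which are then mixed per speaker-unit pair. It reuses the illustrative outfield_gains() helper sketched earlier; unit_x (the normalized coordinates of the u speaker-unit pairs) and the person-list representation are assumptions.

```python
# Illustrative sketch of the multiple opposing speaker balancing dispatch on the receiver.
import numpy as np

def balance_channels(channels, person_list, unit_x, u):
    """channels[0] = background, channels[1:] = individual sounds in person-list order."""
    unit_out = [np.zeros_like(channels[0]) for _ in range(u)]
    for idx, sig in enumerate(channels):
        if idx == 0:
            gain = (1.0 / u) ** 1.66                      # background: all units, reduced loudness
            for j in range(u):
                unit_out[j] += gain * sig
        else:
            person = person_list[idx - 1]
            nearest = min(range(u), key=lambda j: abs(unit_x[j] - person["x_norm"]))
            if person["infield"]:
                unit_out[nearest] += sig                   # infield: nearest unit only, unchanged
            else:
                for j, g in outfield_gains(nearest, u).items():
                    unit_out[j] += g * sig                 # outfield: spread and attenuated
    return unit_out                                        # one mixed signal per speaker-unit pair
```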

 <3.2.1.背景音調整>
 図15は、背景音調整の処理のフローチャートである。図15に示した処理は、図14のステップS36で実行される処理の一例にあたる。次に、図15を参照して、背景音調整の処理の流れを説明する。
<3.2.1. Background sound adjustment>
Fig. 15 is a flowchart of the background sound adjustment process. The process shown in Fig. 15 is an example of the process executed in step S36 in Fig. 14. Next, the flow of the background sound adjustment process will be described with reference to Fig. 15.

 音声出力制御部22は、背景音の振幅を(1/u)^1.66倍に調整する(ステップS201)。ここで、uは、スピーカーユニット411及び412の組の数である。 The audio output control unit 22 adjusts the amplitude of the background sound to (1/u)^1.66 times (step S201), where u is the number of pairs of speaker units 411 and 412.

 そして、音声出力制御部22は、全てのスピーカーユニット411及び412の組のチャンネルに波形を代入する(ステップS202)。 Then, the audio output control unit 22 assigns waveforms to the channels of all pairs of speaker units 411 and 412 (step S202).

 <3.2.2.内野発話音声調整>
 図16は、内野発話音声調整の処理のフローチャートである。図16に示した処理は、図14のステップS39で実行される処理の一例にあたる。次に、図16を参照して、内野発話音声調整の処理の流れを説明する。
<3.2.2. Infield speech sound adjustment>
Fig. 16 is a flowchart of the infield speech sound adjustment process. The process shown in Fig. 16 is an example of the process executed in step S39 in Fig. 14. Next, the flow of the infield speech sound adjustment process will be described with reference to Fig. 16.

 音声出力制御部22は、定位中心スピーカーユニット導出を実行して、発話者に最も近いスピーカーユニット411及び412の組を特定する(ステップS211)。この定位中心スピーカーユニット導出については、後で詳細に説明する。 The audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 that are closest to the speaker (step S211). This localization center speaker unit derivation will be explained in detail later.

 次に、音声出力制御部22は、スピーカーユニット411及び412の組のチャンネルのうち発話者の正規化座標と最も近くにある組のチャンネルに波形を代入する(ステップS212)。 Next, the audio output control unit 22 assigns the waveform to the pair of channels of speaker units 411 and 412 that is closest to the normalized coordinates of the speaker (step S212).

 <3.2.3.外野発話音声調整>
 図17は、外野発話音声調整の処理のフローチャートである。図17に示した処理は、図14のステップS38で実行される処理の一例にあたる。次に、図17を参照して、外野発話音声調整の処理の流れを説明する。
3.2.3. Outfield speech adjustment
Fig. 17 is a flowchart of the outfield speech sound adjustment process. The process shown in Fig. 17 is an example of the process executed in step S38 of Fig. 14. Next, the flow of the outfield speech sound adjustment process will be described with reference to Fig. 17.

 音声出力制御部22は、定位中心スピーカーユニット導出を実行して、発話者に最も近いスピーカーユニット411及び412の組を特定する(ステップS221)。この定位中心スピーカーユニット導出は内野発話音声調整における処理と同じ処理であり、後で詳細に説明する。ここでは、スピーカーユニット411及び412の組みの左から順に先頭を0として連番の番号でそれぞれの位置を表した場合の、発話者に最も近い組の位置を「最近接スピーカーユニット位置」とよぶ。ここでは、スピーカーユニット411及び412の組の個数がu個であり、最近接スピーカーユニット位置は、0~uのいずれかである。 The audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 closest to the speaker (step S221). This localization center speaker unit derivation is the same process as the process used in infield speech sound adjustment, and will be described in detail later. Here, when the positions of the pair of speaker units 411 and 412 are represented by consecutive numbers starting from the left with 0 as the first, the position of the pair closest to the speaker is called the "closest speaker unit position." Here, the number of pairs of speaker units 411 and 412 is u, and the closest speaker unit position is any one of 0 to u.

 次に、音声出力制御部22は、kを初期化して0に設定する(ステップS222)。kは、再生範囲を決定する処理の繰り返しを制御するためのパラメータである。 Next, the audio output control unit 22 initializes k to 0 (step S222). k is a parameter used to control the repetition of the process of determining the playback range.

 次に、音声出力制御部22は、k=0か否かを判定する(ステップS223)。 Next, the audio output control unit 22 determines whether k = 0 (step S223).

 k=0の場合(ステップS223:肯定)、音声出力制御部22は、xspkを最近接スピーカーユニット位置にkを加算した位置とする(ステップS224)。その後、音声出力制御部22は、ステップS229へ進む。 If k=0 (step S223: Yes), the audio output control unit 22 sets x spk to the position obtained by adding k to the position of the closest speaker unit (step S224). After that, the audio output control unit 22 proceeds to step S229.

 kが0以外の場合(ステップS223:否定)、音声出力制御部22は、最近接スピーカーユニット位置にkを加算した値がu以下かを判定する(ステップS225)。すなわち、音声出力制御部22は、最近接スピーカーユニット位置にkを加算した位置がスピーカーアレイ41及び42の右端からはみ出さないか否かを判定する。 If k is other than 0 (step S223: No), the audio output control unit 22 determines whether the value obtained by adding k to the nearest speaker unit position is less than or equal to u (step S225). In other words, the audio output control unit 22 determines whether the position obtained by adding k to the nearest speaker unit position does not extend beyond the right end of the speaker arrays 41 and 42.

 最近接スピーカーユニット位置にkを加算した値がuより大きい場合(ステップS225:否定)、音声出力制御部22は、ステップS227へ進む。これに対して、最近接スピーカーユニット位置にkを加算した値がu以下の場合(ステップS225:肯定)、音声出力制御部22は、一方のxspkを最近接スピーカーユニットにkを加算した位置とする(ステップS226)。その後、音声出力制御部22は、ステップS227へ進む。 If the value obtained by adding k to the position of the nearest speaker unit is greater than u (step S225: No), the audio output control unit 22 proceeds to step S227. On the other hand, if the value obtained by adding k to the position of the nearest speaker unit is equal to or less than u (step S225: Yes), the audio output control unit 22 sets one x spk to the position obtained by adding k to the nearest speaker unit (step S226). Thereafter, the audio output control unit 22 proceeds to step S227.

 次に、音声出力制御部22は、最近接スピーカーユニット位置からkを減算した値が0以上かを判定する(ステップS227)。すなわち、音声出力制御部22は、最近接スピーカーユニット位置からkを減算した位置がスピーカーアレイ41及び42の左端からはみ出さないか否かを判定する。 Next, the audio output control unit 22 determines whether the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227). In other words, the audio output control unit 22 determines whether the position obtained by subtracting k from the nearest speaker unit position does not extend beyond the left end of the speaker arrays 41 and 42.

 最近接スピーカーユニット位置からkを減算した値が0未満の場合(ステップS227:否定)、音声出力制御部22は、ステップS229へ進む。これに対して、最近接スピーカーユニット位置からkを減算した値が0以上の場合(ステップS227:肯定)、音声出力制御部22は、他方のxspkを最近接スピーカーユニットからkを減算した位置とする(ステップS228)。その後、音声出力制御部22は、ステップS229へ進む。 If the value obtained by subtracting k from the nearest speaker unit position is less than 0 (step S227: No), the audio output control unit 22 proceeds to step S229. On the other hand, if the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227: Yes), the audio output control unit 22 sets the other x spk to the position obtained by subtracting k from the nearest speaker unit (step S228). Thereafter, the audio output control unit 22 proceeds to step S229.

 その後、音声出力制御部22は、kの位置の音声の振幅を{g(xspk)}1.66倍して調整を実施する(ステップS229)。ここで、一方のxspk及び他方のxspkの双方が存在する場合は、音声出力制御部22は、両方のkの位置についての音声の振幅を調整する。 Thereafter, the audio output control unit 22 adjusts the amplitude of the audio at position k by multiplying it by {g(x spk )} 1.66 (step S229). Here, if both one x spk and the other x spk exist, the audio output control unit 22 adjusts the amplitude of the audio at both positions k.

 その後、音声出力制御部22は、kが予め決められたレンジ幅未満か否かを判定する(ステップS230)。 Then, the audio output control unit 22 determines whether k is less than a predetermined range width (step S230).

 kがレンジ幅未満の場合(ステップS230:肯定)、音声出力制御部22は、kを1つインクリメントする(ステップS231)。その後、音声出力制御部22は、ステップS223へ戻る。 If k is less than the range width (step S230: Yes), the audio output control unit 22 increments k by 1 (step S231). Then, the audio output control unit 22 returns to step S223.

 これに対して、kがレンジ幅より大きい場合(ステップS230:否定)、音声出力制御部22は、u個のチャンネルのうち、最近接スピーカーユニット位置を中心に±レンジ幅のチャンネルに波形を代入する(ステップS232)。 On the other hand, if k is greater than the range width (step S230: No), the audio output control unit 22 assigns the waveform to one of the u channels with a ±range width centered on the position of the closest speaker unit (step S232).

 <3.2.4.定位中心スピーカーユニット導出>
 図18は、定位中心スピーカーユニット導出の処理のフローチャートである。図18に示した処理は、図16のステップS211及び図17のステップS221で実行される処理の一例にあたる。次に、図18を参照して、定位中心スピーカーユニット導出の処理の流れを説明する。
<3.2.4. Derivation of the localization center speaker unit>
Fig. 18 is a flowchart of the process of deriving the localization center speaker unit. The process shown in Fig. 18 corresponds to an example of the process executed in step S211 of Fig. 16 and step S221 of Fig. 17. Next, the flow of the process of deriving the localization center speaker unit will be described with reference to Fig. 18.

 音声出力制御部22は、スピーカーユニット411までの最短距離の初期値をスピーカーユニット411の間の距離と設定する(ステップS241)。 The audio output control unit 22 sets the initial value of the shortest distance to the speaker unit 411 as the distance between the speaker units 411 (step S241).

 次に、音声出力制御部22は、jを初期化して0に設定する(ステップS242)。ここで、jは、スピーカーユニット411及び412の組み毎の発話者までの距離を最近接スピーカーユニット距離とするかを判定する処理の繰り返しを制御するためのパラメータである。 Next, the audio output control unit 22 initializes j to 0 (step S242). Here, j is a parameter used to control the repetition of the process of determining, for each pair of speaker units 411 and 412, whether the distance to the speaker is to be taken as the nearest speaker unit distance.

 次に、音声出力制御部22は、発話者の位置からスピーカーユニット411の組に左から先頭を0番として連番を振った場合のj番目のスピーカーユニット411の位置を減算する(ステップS243)。 Next, the audio output control unit 22 subtracts, from the speaker's position, the position of the jth speaker unit 411, where the pairs of speaker units 411 are numbered consecutively from the left starting at 0 (step S243).

 次に、音声出力制御部22は、減算結果が最短距離未満か否かを判定する(ステップS244)。減算結果が最短距離以上の場合(ステップS244:否定)、音声出力制御部22は、ステップS246へ進む。 Next, the audio output control unit 22 determines whether the subtraction result is less than the shortest distance (step S244). If the subtraction result is equal to or greater than the shortest distance (step S244: No), the audio output control unit 22 proceeds to step S246.

 減算結果が最短距離未満の場合(ステップS244:肯定)、音声出力制御部22は、最短距離を減算結果として更新する。さらに、音声出力制御部22は、jを最近接スピーカーユニット位置とする(ステップS245)。 If the subtraction result is less than the shortest distance (step S244: Yes), the audio output control unit 22 updates the shortest distance to the subtraction result. Furthermore, the audio output control unit 22 sets the nearest speaker unit position to j (step S245).

 次に、音声出力制御部22は、jがスピーカーユニット411の個数であるu未満か否かを判定する(ステップS246)。 Next, the audio output control unit 22 determines whether j is less than u, which is the number of speaker units 411 (step S246).

 jがu未満の場合(ステップS246:肯定)、音声出力制御部22は、jを1つインクリメントする(ステップS247)。その後、音声出力制御部22は、ステップS243へ戻る。 If j is less than u (step S246: Yes), the audio output control unit 22 increments j by 1 (step S247). Then, the audio output control unit 22 returns to step S243.

 これに対して、jがu以上の場合(ステップS246:否定)、音声出力制御部22は、定位中心スピーカーユニット導出を終了する。 On the other hand, if j is greater than or equal to u (step S246: No), the audio output control unit 22 terminates the derivation of the localization center speaker unit.
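As a rough illustration of the derivation in FIG. 18, the following Python sketch scans the speaker units from left to right and keeps the index of the unit whose distance to the speaker is smallest. Initializing the shortest distance with the inter-unit spacing follows step S241; taking the absolute value of the subtraction result is an assumption made so that units on either side of the speaker are treated alike.

```python
def find_nearest_speaker_unit(speaker_x, unit_positions, unit_spacing):
    """Illustrative sketch of the localization center speaker unit derivation (FIG. 18).

    speaker_x      : position of the speaker (talker) along the array.
    unit_positions : positions of the u speaker units, numbered 0 .. u-1 from the left.
    unit_spacing   : distance between adjacent units, used as the initial shortest
                     distance as in step S241.
    Returns the index of the nearest speaker unit.
    """
    shortest = unit_spacing                      # step S241: initial value of the shortest distance
    nearest_idx = 0
    for j, unit_x in enumerate(unit_positions):  # steps S242-S247
        distance = abs(speaker_x - unit_x)       # step S243 (absolute value assumed)
        if distance < shortest:                  # step S244
            shortest = distance                  # step S245
            nearest_idx = j
    return nearest_idx
```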

 <4.効果>
 以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置1は、映像から発話者の位置を特定し、特定した各発話者の頭部方向を推定して各発話者が内野であるか外野であるかを判定する。また、遠隔コミュニケーション装置1は、収集した音声から、背景音を分離し、さらに特定した位置を基に発話者の個別音を取得する。そして、遠隔コミュニケーション装置1は、背景音及び外野発話音声を小さくし、かつ、一定の範囲のスピーカーユニット411及び412の組みに出音させる。また、遠隔コミュニケーション装置1は、内野発話音声を発話者の位置に近いスピーカーユニット411及び412から出音させる。
<4. Effects>
As described above, the remote communication device 1 according to this embodiment identifies the positions of speakers from the video, estimates the head direction of each identified speaker, and determines whether each speaker is in the infield or the outfield. The remote communication device 1 also separates the background sound from the collected audio and acquires the individual sounds of the speakers based on the identified positions. The remote communication device 1 then reduces the volume of the background sound and outfield speech sounds and outputs them from the pairs of speaker units 411 and 412 within a certain range. The remote communication device 1 also outputs the infield speech sounds from the speaker units 411 and 412 near each speaker's position.

 このように、発話音声を内野発話音声と外野発話音声に分類することで、内野発話音声と外野発話音声とに対してそれぞれ別の音響処理ができる。また、背景音や外野発話は、音の定位感をぼかしてより聞き取りづらくすることで、内野発話が相対的に聞き取りやすくなる。これらにより、存在感や空気感は残しつつ、対話相手の声が聞こえ易くなり、会話に集中できる。また、頭部方向推定による会話参加の判定により、コミュニケーション参加話者の自動判定ができる。また、背景音、内野発話音声及び外野発話音声毎の処理を行うことで、参加度に基づく信号処理が可能となり、さらに、背景音、内野発話音声及び外野発話音声に合わせた多チャンネルスピーカーの制御を行うことができる。したがって、対話の円滑化を促進することができる。 In this way, classifying speech sounds into infield speech sounds and outfield speech sounds allows different acoustic processing to be applied to each. In addition, blurring the localization of background sounds and outfield speech and making them harder to hear makes infield speech relatively easier to hear. As a result, the voice of the dialogue partner becomes easier to hear while the sense of presence and atmosphere is retained, and the listener can concentrate on the conversation. Determining conversation participation through head direction estimation also enables automatic identification of the speakers taking part in the communication. Furthermore, processing the background sound, infield speech sounds, and outfield speech sounds separately enables signal processing based on the degree of participation, and the multi-channel speakers can be controlled in accordance with the background sound, infield speech sounds, and outfield speech sounds. Smoother dialogue can therefore be promoted.

 <5.第2の実施の形態に係る遠隔コミュニケーション装置>
 次に、第2の実施の形態に係る遠隔コミュニケーション装置1について説明する。第2の実施の形態に係る遠隔コミュニケーション装置1は、背景音のうち突発的な背景音の定位感をぼかさずに出音する。
5. Remote communication device according to the second embodiment
Next, a remote communication device 1 according to a second embodiment will be described. The remote communication device 1 according to the second embodiment outputs sudden background sounds without blurring their sense of localization.

 図19は、第2の実施の形態に係る遠隔コミュニケーション装置のブロック図である。なお、図19では図1と同一の各部には同一符号を付し、以下では、図1と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。 Figure 19 is a block diagram of a remote communication device according to the second embodiment. Note that in Figure 19, the same components as in Figure 1 are designated by the same reference numerals, and the following explanation will focus on the components that differ from Figure 1, and may omit explanations of the components that are the same as in Figure 1.

 <5.1.音響信号処理部>
 本実施の形態に係る音響信号処理部13は、個別音分離部131及び明瞭度調整部132に加えて、背景音分離部133を有する。
<5.1. Acoustic signal processing section>
The acoustic signal processing unit 13 according to this embodiment includes a background sound separation unit 133 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132.

 <5.1.1.背景音分離部>
 背景音分離部133は、音声分離部12により分離された背景音の入力を受ける。そして、背景音分離部133は、背景音のピークを検出するピーク検出処理を実行する。次に、背景音分離部133は、突発音が発生した方向を推定する突発音のDoA(Direction Of Arrival)推定を実施して背景音の中から突発音を抽出する突発音抽出処理を実行する。ここで、背景音分離部133は、音声を可視化する技術やマイクアレイ30から得られる音声信号の相互相関関数を用いる方法等を利用して突発音のDoA推定を実行する。
<5.1.1. Background sound separation section>
The background sound separation unit 133 receives input of the background sound separated by the audio separation unit 12. The background sound separation unit 133 then performs peak detection processing to detect peaks of the background sound. Next, the background sound separation unit 133 performs DoA (Direction Of Arrival) estimation of the sudden sound to estimate the direction from which the sudden sound occurred, and performs sudden sound extraction processing to extract the sudden sound from the background sound. Here, the background sound separation unit 133 performs DoA estimation of the sudden sound by using, for example, technology for visualizing audio or a method using a cross-correlation function of audio signals obtained from the microphone array 30.

 そして、背景音分離部133は、推定した突発音の位置の情報を出力統合部14へ出力する。また、背景音分離部133は、抽出した突発音を出力統合部14へ出力する。 The background sound separation unit 133 then outputs information about the estimated position of the sudden sound to the output integration unit 14. The background sound separation unit 133 also outputs the extracted sudden sound to the output integration unit 14.

 また、背景音分離部133は、背景音のうち突発音以外の音声を定常音として抽出する背景音抽出処理を実行する。そして、背景音分離部133は、抽出した定常音を明瞭度調整部132へ出力する。このように、背景音分離部133は、背景音を突発音と定常音とに分離する。 The background sound separation unit 133 also performs background sound extraction processing to extract sounds other than sudden sounds from the background sound as steady sounds. The background sound separation unit 133 then outputs the extracted steady sounds to the clarity adjustment unit 132. In this way, the background sound separation unit 133 separates the background sound into sudden sounds and steady sounds.
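The following Python sketch shows one possible way to realize the processing of the background sound separation unit 133 described above: frames whose energy clearly exceeds the running level are treated as a sudden sound, the rest is kept as the steady sound, and the arrival direction of the sudden sound is estimated from the cross-correlation of two microphone signals. The frame length, threshold, and two-microphone DoA formula are illustrative assumptions; the specification only requires peak detection, DoA estimation, and separation into sudden and steady sounds.

```python
import numpy as np

def split_background_sound(background, mic_left, mic_right, fs,
                           frame_len=1024, peak_ratio=4.0,
                           mic_distance=0.2, sound_speed=343.0):
    """Rough sketch of the background sound separation unit 133 (assumed parameters).

    background          : background sound separated by the audio separation unit 12.
    mic_left, mic_right : two microphone signals used only for DoA estimation here.
    Returns (sudden, steady, doa_deg); doa_deg is None when no peak is found.
    """
    n_frames = len(background) // frame_len
    energies = np.array([np.mean(background[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    baseline = np.median(energies) + 1e-12
    sudden = np.zeros_like(background)
    steady = background.copy()
    doa_deg = None
    for i, energy in enumerate(energies):
        if energy > peak_ratio * baseline:                  # peak detection
            sl = slice(i * frame_len, (i + 1) * frame_len)
            sudden[sl] = background[sl]                     # sudden sound extraction
            steady[sl] = 0.0
            # DoA from the inter-microphone time difference (cross-correlation method)
            xcorr = np.correlate(mic_left[sl], mic_right[sl], mode="full")
            lag = (np.argmax(xcorr) - (frame_len - 1)) / fs
            sin_theta = np.clip(lag * sound_speed / mic_distance, -1.0, 1.0)
            doa_deg = float(np.degrees(np.arcsin(sin_theta)))
    return sudden, steady, doa_deg
```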

 <5.1.2.明瞭度調整部>
 明瞭度調整部132は、背景音のうちの定常音の入力を背景音分離部133から受ける。そして、明瞭度調整部132は、背景音のうちの定常音に対してより聞こえ難くするデグレード処理を実施する。その後、明瞭度調整部132は、明瞭度調整を施した背景音のうちの定常音を出力統合部14へ出力する。
<5.1.2. Clarity adjustment section>
The clarity adjustment unit 132 receives an input of a stationary sound from the background sound separation unit 133. The clarity adjustment unit 132 then performs a degradation process on the stationary sound from the background sound to make it harder to hear. The clarity adjustment unit 132 then outputs the stationary sound from the background sound that has been subjected to clarity adjustment to the output integration unit 14.

 <5.2.出力統合部>
 出力統合部14は、背景音のうちの突発音及び突発音の位置の情報の入力を背景音分離部133から受ける。また、出力統合部14は、背景音のうちの定常音の入力を明瞭度調整部132から受ける。この場合、出力統合部14は、リモート空間に存在する人数分のチャンネル数の個別音に加えて、背景音のうちの突発音と定常音とをそれぞれ1チャンネルのデータとして取得する。すなわち、リモート空間に存在する人数をpとした場合、出力統合部14は、(2+p)チャンネルのデータを取得する。
<5.2. Output Integration Unit>
The output integrating unit 14 receives input of information about sudden sounds and the positions of the sudden sounds in the background sound from the background sound separation unit 133. The output integrating unit 14 also receives input of steady sounds in the background sound from the clarity adjustment unit 132. In this case, the output integrating unit 14 acquires individual sounds of the same number of channels as the number of people present in the remote space, as well as one channel of data for each of the sudden sounds and steady sounds in the background sound. In other words, if the number of people present in the remote space is p, the output integrating unit 14 acquires (2+p) channels of data.

 そして、出力統合部14は、各個別音、背景音のうちの突発音及び背景音のうちの定常音を統合して1つの音声データを生成する。そして、出力統合部14は、各個別音に対応する人別リストにおける発話者の情報を付加する。また、出力統合部14は、突発音と突発音の位置情報とを対応付ける。そして、出力統合部14は、音データ、人別リスト及び突発音の位置情報を送信部15へ出力してローカル側の遠隔コミュニケーション装置1へ送信させる。 The output integration unit 14 then integrates each individual sound, the sudden sound from the background sound, and the steady sound from the background sound to generate one piece of audio data. The output integration unit 14 then adds speaker information in the person list corresponding to each individual sound. The output integration unit 14 also associates the sudden sound with positional information of the sudden sound. The output integration unit 14 then outputs the sound data, the person list, and the positional information of the sudden sound to the transmission unit 15, causing it to be transmitted to the local-side remote communication device 1.

 <5.3.音声出力制御部>
 ローカル側の遠隔コミュニケーション装置1における音声出力制御部22は、各個別音、背景音のうちの突発音及び背景音のうちの定常音を含む音声データの入力を受ける。また、音声出力制御部22は、人別リスト及び突発音の位置情報の入力を受ける。
<5.3. Audio output control unit>
The audio output control unit 22 in the local remote communication device 1 receives input of audio data including each individual sound, a sudden sound from the background sound, and a steady sound from the background sound. The audio output control unit 22 also receives input of the person list and position information of the sudden sound.

 音声出力制御部22は、音データから背景音のうちの突発音及び背景音のうちの定常音をそれぞれ取得する。次に、音声出力制御部22は、背景音のうちの定常音については、振幅を(1/u)^1.66倍してラウドネスが1/uになるように処理する。そして、音声出力制御部22は、スピーカーアレイ41の全てのスピーカーユニット411及びスピーカーアレイ42の全てのスピーカーユニット412からラウドネスを低減した定常音を出音させる。 The audio output control unit 22 acquires a sudden sound from the background sound and a steady sound from the background sound from the sound data. Next, the audio output control unit 22 processes the steady sound from the background sound by multiplying the amplitude by (1/u)^1.66 so that the loudness becomes 1/u. Then, the audio output control unit 22 causes all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42 to output the steady sounds with reduced loudness.

 これに対して、背景音のうちの突発音については、音声出力制御部22は、突発音の位置に最も近いスピーカーユニット411及び412を特定する。そして、音声出力制御部22は、特定したスピーカーユニット411及び412に背景音のうちの突発音をそのまま再生させる。 In contrast, for a sudden sound from the background sound, the audio output control unit 22 identifies the speaker units 411 and 412 that are closest to the position of the sudden sound. Then, the audio output control unit 22 causes the identified speaker units 411 and 412 to reproduce the sudden sound from the background sound as is.
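A minimal sketch of this control, assuming one output channel per pair of speaker units: the steady sound is attenuated by (1/u)^1.66 and copied to every channel, while the sudden sound is added unchanged to the channel of the nearest pair.

```python
import numpy as np

def render_background(steady, sudden, sudden_nearest_idx, u):
    """Illustrative channel assignment for the background sound (assumed layout:
    one row per pair of speaker units 411 and 412).

    steady             : steady background sound (already clarity-adjusted).
    sudden             : sudden background sound, reproduced without blurring.
    sudden_nearest_idx : index of the pair closest to the sudden sound position.
    u                  : number of speaker unit pairs.
    """
    channels = np.tile(steady * (1.0 / u) ** 1.66, (u, 1))  # loudness reduced to about 1/u
    channels[sudden_nearest_idx] += sudden                   # localized sudden sound, as is
    return channels
```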

 ここで、本実施例では、音声出力制御部22は、第1の実施の形態における内野発話音声と同じ処理を突発音に施して出音させたが、この他にも、第1の実施の形態における外野発話音声と同じ処理を突発音に施して出音させてもよい。 In this example, the audio output control unit 22 applies the same processing to the sudden sound as to the infield speech sound in the first embodiment, and outputs the sound; however, it may also apply the same processing to the sudden sound as to the outfield speech sound in the first embodiment, and output the sound.

 <6.第2の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ>
 図20は、第2の実施の形態に係る遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置1の送信ユニット10からローカル側の遠隔コミュニケーション装置1の受信ユニット20へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。
6. Data flow between remote communication devices according to the second embodiment
FIG. 20 is a diagram showing an outline of the data flow between the remote communication devices according to the second embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote-side remote communication device 1 to the receiving unit 20 of the local-side remote communication device 1 will be described. Here, the transmission of audio will be described.

 マイク3は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS301)。ここでは、例えば、マイクアレイ30のデータとしてsチャンネルのデータが存在する。また、カメラ2は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS302)。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S301). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S302).

 カメラ2から入力された映像は、人センシング部11に送られる。人センシング部11は、カメラ2により撮影された映像を用いて、人位置判定処理(ステップS331)及び内野外野判定処理(ステップS332)を含むセンシング処理を実行する(ステップS303)。 The video input from camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S303), including human position determination processing (step S331) and infield/outfield determination processing (step S332).

 マイク3から入力された音声は、音声分離部12へ送られる。音声分離部12は、マイク3から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う(ステップS304)。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S304).

 音声分離部12により抽出された背景音は、音響信号処理部13の背景音分離部133へ送られる。背景音分離部133は、背景音のピークを検出するピーク検出処理を実行する(ステップS305)。 The background sound extracted by the audio separation unit 12 is sent to the background sound separation unit 133 of the audio signal processing unit 13. The background sound separation unit 133 executes peak detection processing to detect peaks in the background sound (step S305).

 次に、背景音分離部133は、突発音のDoA推定を実施して背景音の中から突発音を抽出する突発音抽出処理を実行する(ステップS306)。背景音のうちの突発音及び突発音の位置情報は、出力統合部14へ送られる。 Next, the background sound separation unit 133 executes a sudden sound extraction process to estimate the DoA of the sudden sound and extract the sudden sound from the background sound (step S306). The sudden sound in the background sound and its position information are sent to the output integration unit 14.

 また、背景音分離部133は、背景音の中から定常音を抽出する定常音抽出処理を実行する(ステップS307)。背景音のうちの定常音は、明瞭度調整部132へ送られる。 The background sound separation unit 133 also executes a steady sound extraction process to extract steady sounds from the background sounds (step S307). The steady sounds from the background sounds are sent to the clarity adjustment unit 132.

 明瞭度調整部132は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する(ステップS308)。 The clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S308).

 また、音声分離部12により抽出されたsチャンネルのデータである発話音声及び人別リストが、音響信号処理部13へ送られる。音響信号処理部13は、個別音分離部131による個別音分離処理(ステップS391)及び明瞭度調整部132による明瞭度調整処理(ステップS392)を含む発話音声信号処理を実施する(ステップS309)。 Furthermore, the speech audio, which is s-channel data extracted by the audio separation unit 12, and the person list are sent to the audio signal processing unit 13. The audio signal processing unit 13 performs speech signal processing (step S309), including individual sound separation processing by the individual sound separation unit 131 (step S391) and clarity adjustment processing by the clarity adjustment unit 132 (step S392).

 1チャンネルのデータである定常音、1チャンネルのデータである突発音及び突発音の位置情報は、出力統合部14へ入力される。また、発話音声信号処理が施されたpチャンネルのデータである個別音も、出力統合部14へ入力される。これにより、出力統合部14は、(2+p)チャンネルのデータを取得する。そして、出力統合部14は、すべてのチャンネルの音を統合して1つの音データを生成する出力統合処理を実行する(ステップS310)。 The steady sound, which is one channel of data, the sudden sound, which is one channel of data, and the position information of the sudden sound are input to the output integration unit 14. In addition, the individual sounds, which are p channels of data that have been subjected to speech signal processing, are also input to the output integration unit 14. As a result, the output integration unit 14 obtains (2+p) channels of data. The output integration unit 14 then executes output integration processing, which integrates the sounds of all channels to generate a single piece of sound data (step S310).

 出力統合部14で生成された音データ、人別リスト及び突発音の位置情報は、送信部15によりネットワーク7を介してローカル側の遠隔コミュニケーション装置1へ送られる(ステップS311)。 The sound data, person list, and sudden sound location information generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S311).

 ローカル側の遠隔コミュニケーション装置1は、音データ、人別リスト及び突発音の位置情報の送信を受ける。音データ、人別リスト及び突発音の位置情報は、受信部21を介して音声出力制御部22へ送られる。音声出力制御部22は、人別リスト及び突発音の位置情報を用いて音データに対して複数対向スピーカーバランシング処理を実施する。そして、音声出力制御部22は、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット411及び412を指定して再生させる(ステップS312)。この際、音声出力制御部22は、背景音のうちの定常音を、ラウドネスをスピーカーユニット411及び412の組みの数で除算した大きさにして全てのスピーカーユニット411及び412に再生させる。また、音声出力制御部22は、突発音の位置に最も近いスピーカーユニット411及び412に背景音のうちの突発音をそのまま再生させる。 The local remote communication device 1 receives the sound data, the person list, and the positional information of the sudden sound. The sound data, the person list, and the positional information of the sudden sound are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list and the positional information of the sudden sound. The audio output control unit 22 then specifies the speaker units 411 and 412 to output for each of the background sound, infield speech sound, and outfield speech sound, and plays them (step S312). At this time, the audio output control unit 22 plays the steady sound of the background sound on all speaker units 411 and 412, at a level obtained by dividing the loudness by the number of pairs of speaker units 411 and 412. The audio output control unit 22 also plays back the sudden sound of the background sound as is on the speaker units 411 and 412 closest to the position of the sudden sound.

 スピーカーアレイ41及び42は、音声出力制御部22から指定されたスピーカーユニット411及び412を用いて、背景音のうちの突発音、背景音のうちの定常音、内野発話音声及び外野発話音声を出音する(ステップS313)。 The speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output sudden sounds from the background sound, steady sounds from the background sound, infield speech sounds, and outfield speech sounds (step S313).

 <7.効果>
 以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置1は、背景音を突発音と定常音の2種類に分けて、背景音は位置情報を伝送する。そして、遠隔コミュニケーション装置1は、背景音のうちの突発音の定位感はぼかさずに出音させ、背景音のうちの定常音について定位感をぼかして出音させる。
<7. Effects>
As described above, the remote communication device 1 according to the present embodiment divides background sounds into two types, sudden sounds and steady sounds, and transmits position information for the background sounds. The remote communication device 1 outputs the sudden sounds among the background sounds without blurring their localization, and outputs the steady sounds among the background sounds with blurring their localization.

 これにより、背景音は発生した位置付近から定常音よりも大きな音で再生することができ、突発的な背景音について定位感が無くなり違和感が発生することを抑制できる。対話を自然な状態に近づけることができ、対話の円滑化を促進することができる。 This allows a sudden background sound to be played back from near the position where it occurred, louder than the steady sound, and prevents the loss of localization of sudden background sounds from causing a sense of unnaturalness. This brings the dialogue closer to a natural state and promotes smoother dialogue.

 <8.第3の実施の形態に係る遠隔コミュニケーション装置>
 次に、第3の実施の形態に係る遠隔コミュニケーション装置1について説明する。第3の実施の形態に係る遠隔コミュニケーション装置1は、マイク3により収音された発話音声を機械的な音声に変更して聞き取り易くする。
8. Remote communication device according to the third embodiment
Next, a remote communication device 1 according to a third embodiment will be described. The remote communication device 1 according to the third embodiment converts the speech voice picked up by the microphone 3 into a mechanical voice to make it easier to hear.

 図21は、第3の実施の形態に係る遠隔コミュニケーション装置のブロック図である。なお、図21では図1と同一の各部には同一符号を付し、以下では、図1と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。 FIG. 21 is a block diagram of a remote communication device according to the third embodiment. Note that in FIG. 21, the same components as those in FIG. 1 are designated by the same reference numerals, and the following description will focus on the components that differ from FIG. 1, and may omit a description of the components that are the same as those in FIG. 1.

 <8.1.音響信号処理部>
 本実施の形態に係る音響信号処理部13は、個別音分離部131及び明瞭度調整部132に加えて、音声合成部134を有する。
<8.1. Acoustic signal processing section>
The acoustic signal processing unit 13 according to this embodiment includes a voice synthesis unit 134 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132 .

 <8.1.1.音声合成部>
 音声合成部134は、個別音分離部131により分離された各発話者の個別音の入力を受ける。そして、音声合成部134は、音声認識をして一度発話内容を文字に起こす音声認識処理を実行する。次に、音声合成部134は、文字にした発話内容を音声合成して個別音を再生成する音声合成処理を実行する。その後、音声合成部134は、再生成した各発話者の個別音を明瞭度調整部132へ出力する。
<8.1.1. Speech synthesis unit>
The speech synthesis unit 134 receives input of the individual sounds of each speaker separated by the individual sound separation unit 131. The speech synthesis unit 134 then executes a speech recognition process that recognizes the speech and transcribes the spoken content into text. Next, the speech synthesis unit 134 executes a speech synthesis process that synthesizes speech from the transcribed text to regenerate the individual sounds. The speech synthesis unit 134 then outputs the regenerated individual sounds of each speaker to the clarity adjustment unit 132.

 このように、音声合成部134は、個別音毎に、音声認識を実行して個別音の発話内容を示す文字を生成し、発話内容を示す文字を基に音声合成を実行して、発話者の発話音声を再生成する。この場合、明瞭度調整部132は、音声合成部134により再生成された個別音毎に、発話者が会話に参加しているか否かを基に調整を行う。 In this way, the speech synthesis unit 134 performs speech recognition for each individual sound to generate characters indicating the spoken content of the individual sound, and performs speech synthesis based on the characters indicating the spoken content to regenerate the speaker's speech. In this case, the clarity adjustment unit 132 makes adjustments for each individual sound regenerated by the speech synthesis unit 134 based on whether or not the speaker is participating in the conversation.
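The recognize-then-resynthesize flow of the speech synthesis unit 134 can be sketched as below. The specification does not fix particular speech recognition or speech synthesis engines, so they are passed in as functions here; any off-the-shelf ASR and TTS engines could be substituted.

```python
def regenerate_individual_sound(individual_sound, recognize, synthesize):
    """Sketch of the speech synthesis unit 134: transcribe, then resynthesize.

    individual_sound : one speaker's separated speech signal.
    recognize        : speech-to-text function (assumed interface: audio -> str).
    synthesize       : text-to-speech function (assumed interface: str -> audio).
    """
    text = recognize(individual_sound)   # speech recognition: transcribe the utterance
    return synthesize(text)              # speech synthesis: regenerate a uniform, clear voice
```

Because every speaker's voice is regenerated by the same synthesis engine, differences in voice volume and articulation between speakers are evened out before the clarity adjustment is applied.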

 <9.第3の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ>
 図22は、第3の実施の形態に係る遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置1の送信ユニット10からローカル側の遠隔コミュニケーション装置1の受信ユニット20へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。
9. Data Flow Between Remote Communication Devices According to the Third Embodiment
FIG. 22 is a diagram showing an outline of the data flow between the remote communication devices according to the third embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote-side remote communication device 1 to the receiving unit 20 of the local-side remote communication device 1 will be described. Here, the transmission of audio will be described.

 マイク3は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS401)。ここでは、例えば、マイクアレイ30のデータとしてsチャンネルのデータが存在する。また、カメラ2は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置1へ入力する(ステップS402)。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S401). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S402).

 カメラ2から入力された映像は、人センシング部11に送られる。人センシング部11は、カメラ2により撮影された映像を用いて、人位置判定処理(ステップS431)及び内野外野判定処理(ステップS432)を含むセンシング処理を実行する(ステップS403)。 The video input from camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S403), including human position determination processing (step S431) and infield/outfield determination processing (step S432).

 マイク3から入力された音声は、音声分離部12へ送られる。音声分離部12は、マイク3から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う(ステップS404)。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S404).

 音声分離部12により抽出された背景音は、音響信号処理部13の明瞭度調整部132へ送られる。明瞭度調整部132は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する(ステップS405)。 The background sound extracted by the audio separation unit 12 is sent to the clarity adjustment unit 132 of the audio signal processing unit 13. The clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S405).

 また、音声分離部12により抽出されたsチャンネルのデータである発話音声及び人別リストが、音響信号処理部13へ送られる。発話音声に対しては、音響信号処理部13により人別リストを用いた発話音声信号処理が実施される(ステップS406)。 Furthermore, the speech audio, which is s-channel data extracted by the voice separation unit 12, and the person list are sent to the acoustic signal processing unit 13. The acoustic signal processing unit 13 performs speech signal processing on the speech using the person list (step S406).

 詳しくは、個別音分離部131は、取得した人別リストに登録された要素数であるp個分のスレッドを作成する。さらに、個別音分離部131は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。次に、個別音分離部131は、各発話者までの角度θを求める。そして、個別音分離部131は、音声分離部12による音声分離により抽出された発話音声に対して、各発話者までの角度に応じたビームフォーミングを用いて個別音分離処理を行い各発話者の個別音をスレッド毎に生成する(ステップS461)。 In detail, the individual sound separation unit 131 creates p threads, which is the number of elements registered in the acquired personal list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the personal list as thread-specific values for each thread. Next, the individual sound separation unit 131 calculates the angle θ to each speaker. Then, the individual sound separation unit 131 performs individual sound separation processing on the speech sounds extracted by speech separation by the speech separation unit 12 using beamforming according to the angle to each speaker, and generates individual sounds for each speaker for each thread (step S461).

 個別音分離部131により生成された個別音は、音声合成部134へ送られる。音声合成部134は、個別音について音声認識をして一度発話内容を文字に起こす音声認識処理を実行する(ステップS462)。 The individual sounds generated by the individual sound separation unit 131 are sent to the speech synthesis unit 134. The speech synthesis unit 134 performs speech recognition processing to recognize the individual sounds and transcribe the speech content into text (step S462).

 次に、音声合成部134は、文字にした発話内容を音声合成して個別音を再生成する音声合成処理を実行する(ステップS463)。音声合成部134により再生成された各発話者の個別音は、明瞭度調整部132へ送られる。 Next, the speech synthesis unit 134 executes a speech synthesis process to synthesize the text of the speech and regenerate the individual sounds (step S463). The individual sounds of each speaker regenerated by the speech synthesis unit 134 are sent to the clarity adjustment unit 132.

 明瞭度調整部132は、スレッド毎に個別音に対応する発話者が内野であるか外野であるかを人別リストの内野フラグを用いて判定する。そして、明瞭度調整部132は、内野の発話者の個別音に対してより聞こえやすくするエンハンス処理を実施し、外野の発話者の個別音に対してより聞こえにくくするデグレード処理を実施する明瞭度調整処理を実施する(ステップS464)。 The clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. The clarity adjustment unit 132 then performs clarity adjustment processing, which performs enhancement processing to make the individual sounds of infield speakers easier to hear, and degrade processing to make the individual sounds of outfield speakers harder to hear (step S464).

 1チャンネルのデータである背景音は、出力統合部14へ入力される。また、発話音声信号処理が施されたpチャンネルのデータである個別音も、出力統合部14へ入力される。これにより、出力統合部14は、(1+p)チャンネルのデータを取得する。そして、出力統合部14は、すべてのチャンネルの音を統合して1つの音データを生成する出力統合処理を実行する(ステップS407)。 The background sound, which is one channel of data, is input to the output integration unit 14. The individual sounds, which are p channels of data that have been subjected to speech signal processing, are also input to the output integration unit 14. As a result, the output integration unit 14 obtains (1+p) channels of data. The output integration unit 14 then executes output integration processing, which integrates the sounds of all channels to generate a single piece of sound data (step S407).

 出力統合部14で生成された音データ及び人別リストは、送信部15によりネットワーク7を介してローカル側の遠隔コミュニケーション装置1へ送られる(ステップS408)。 The sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S408).

 ローカル側の遠隔コミュニケーション装置1は、音データ及び人別リストの送信を受ける。音データ及び人別リストは、受信部21を介して音声出力制御部22へ送られる。音声出力制御部22は、人別リストを用いて音データに対して複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット411及び412を指定して再生させる(ステップS409)。 The local remote communication device 1 receives the sound data and the person list. The sound data and the person list are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound, respectively, and plays them (step S409).

 スピーカーアレイ41及び42は、音声出力制御部22から指定されたスピーカーユニット411及び412を用いて、背景音、内野発話音声及び外野発話音声を出音する(ステップS410)。 The speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output the background sound, the infield speech sounds, and the outfield speech sounds (step S410).

 <10.効果>
 以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置1は、音声認識をして一度発話内容を文字に起こし、それを音声合成したものを再生させる発話音声のデータとする。
<10. Effects>
As described above, the remote communication device 1 according to this embodiment performs speech recognition to transcribe the spoken content into text, and then synthesizes the text into speech data to be played back.

 マイク3で収録した発話音声をそのまま再生すると、人による声量や滑舌の違いにより聞き取りづらくなることがある。これに対して、本実施の形態に係る遠隔コミュニケーション装置1は、発話の声量や滑舌のばらつきを抑えて全体的に発話音声を聞き取り易くすることができる。したがって、コミュニケーションの円滑化を図ることができる。 If the speech recorded by the microphone 3 is played back as is, it may be difficult to hear due to differences in voice volume and articulation between people. In contrast, the remote communication device 1 of this embodiment can reduce variations in voice volume and articulation, making the speech easier to hear overall. Communication can therefore be made smoother.

 <11.第4の実施の形態に係る遠隔コミュニケーション装置>
 次に、第4の実施の形態に係る遠隔コミュニケーション装置1について説明する。第4の実施の形態に係る遠隔コミュニケーション装置1は、内野外野判定等の音声識別を頭部方向推定以外の方法でも実行する。本実施例に係る遠隔コミュニケーション装置1も、図1のブロック図で表される。以下では、図1と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。
11. Remote communication device according to the fourth embodiment
Next, a remote communication device 1 according to a fourth embodiment will be described. The remote communication device 1 according to the fourth embodiment performs voice recognition, such as infield/outfield determination, using a method other than head direction estimation. The remote communication device 1 according to this embodiment is also represented by the block diagram of FIG. 1. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.

 <11.1.人センシング部>
 人センシング部11は、映像から各発話者の距離を検出するデプスセンシングを行う。そして、人センシング部11は、カメラ2から遠くの位置にいる人がリモートの人と会話することはないという仮定の下、カメラ2から遠くの位置にいる人を外野の人と判定する。このように、人センシング部11は、映像を基に発話者の距離を判定し、距離を基に発話者が会話に参加しているか否かを判定する。
<11.1. Human sensing unit>
The human sensing unit 11 performs depth sensing to detect the distance of each speaker from the video. Then, under the assumption that people far from the camera 2 will not converse with remote people, the human sensing unit 11 determines that people far from the camera 2 are people in the outfield. In this way, the human sensing unit 11 determines the distance of the speaker based on the video, and determines whether the speaker is participating in the conversation based on the distance.
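A minimal sketch of this distance-based judgment is shown below; the field names and the distance threshold are assumptions for illustration, since the specification states only that people far from the camera 2 are judged to be in the outfield.

```python
def judge_infield_by_distance(person_list, max_infield_distance_m=2.5):
    """Illustrative infield/outfield judgment by depth sensing.

    person_list : list of dicts, each holding at least the person's distance from
                  the camera 2 obtained by depth sensing ('distance_m' is assumed).
    Sets the infield flag to False for people judged too far away to be talking
    with the remote side.
    """
    for person in person_list:
        person["infield"] = person["distance_m"] <= max_infield_distance_m
    return person_list
```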

 このように、カメラ2からの距離を用いて内野外野を判定することで、たまたま後ろを通りかかった人が画面の方を見ながら話しているときに、その人がローカルの人との会話の参加者として判定されることを回避できる。これにより、コミュニケーションの円滑化を図ることができる。 In this way, by determining infield and outfield using the distance from the camera 2, it is possible to prevent a person who happens to pass by behind and talks while looking at the screen from being identified as a participant in the conversation with the local people. This facilitates smooth communication.

 <11.2.音響信号処理部>
 音響信号処理部13は、個別音分離部131により生成された各個別音に対して音声認識を実行して、発話内容を文字に起こす。そして、音響信号処理部13は、文字にした発話内容のコンテキストに基づいて、その発話がローカル側の人に対する発話であるか否かを判定する。音響信号処理部13は、例えば、感情、共感度及び傾聴度をAI分析する等によりこの判定を行うことができる。
<11.2. Acoustic signal processing section>
The acoustic signal processing unit 13 performs speech recognition on each individual sound generated by the individual sound separation unit 131 and transcribes the speech content into text. Then, the acoustic signal processing unit 13 determines whether the speech is directed to a person on the local side based on the context of the transcribed speech content. The acoustic signal processing unit 13 can make this determination, for example, by performing AI analysis of emotions, empathy level, and listening level.

 そして、音響信号処理部13は、その発話がローカル側の人に対する発話であると判定した場合、その発話者を内野の人とする。そして、音響信号処理部13は、人別リストにおける内野の人と判定した発話者の内野フラグをTrueに設定して人別リストを更新する。その後、音響信号処理部13は、更新した人別リストを明瞭度調整部132へ出力する。 If the acoustic signal processing unit 13 determines that the speech is directed at a person on the local side, it determines that the speaker is an infield person. The acoustic signal processing unit 13 then sets the infield flag of the speaker determined to be an infield person in the person list to True, and updates the person list. The acoustic signal processing unit 13 then outputs the updated person list to the clarity adjustment unit 132.

 ここで、例えばスライドを見ながらプレゼンテーションをしている人の声や、メモを取りながら話を聞いている人の相槌も、リモートに対して発せられるメッセージであり、内野として処理されることが好ましい。そこで、コンテキストに基づいて内野外野を判定することで、頭部方向推定では外野として判定されてしまうが実際には会話に参加している発話者の発話が、内野の人の発話としてリモートに伝送して再生される。これにより、リモートの方向に頭が向いていないがリモートに話しかけているような状態でも、内野発話として聞き取りやすくなるように処理される。したがって、コミュニケーションの円滑化を図ることができる。 Here, for example, the voice of a person giving a presentation while looking at slides, or the interjections of a person listening while taking notes, are also messages directed at the remote side, and are preferably processed as infield. By determining infield/outfield based on the context, the speech of a speaker who would be judged to be outfield by head direction estimation but is actually participating in the conversation is transmitted to the remote side and played back as the speech of an infield person. As a result, even when a speaker is not facing the remote side but is speaking to it, the speech is processed so that it is easy to hear as infield speech. This can facilitate smooth communication.

 <11.3.その他の音声識別>
 他にも、ローカル側の人センシング部11は、ローカル側の人がディスプレイ6のスクリーン上の特定の位置を指さした場合に、モーションキャプチャ又は視線推定等を用いてその特定の位置を抽出する。そして、ローカル側の人センシング部11は、送信部15を介してリモート側の遠隔コミュニケーション装置1へ指さされた特定の位置の情報を送信する。
<11.3. Other Voice Recognition>
Additionally, when a person on the local side points to a specific position on the screen of the display 6, the local-side human sensing unit 11 extracts the specific position using motion capture, gaze estimation, or the like. Then, the local-side human sensing unit 11 transmits information about the specific position pointed to via the transmission unit 15 to the remote-side remote communication device 1.

 この場合、例えば、個別音分離部131が、ビームフォーミング等を用いてその特定の位置の音声を個別音として抽出する。明瞭度調整部132により、特定の位置の個別音に対して内野発話に対する処理と同じ処理等の適当な処理を加える。その後、出力統合部14は、他の個別音とともに1つのデータにまとめて、特定の位置の情報とともにローカル側の遠隔コミュニケーション装置1へ送信部15を介して送信する。 In this case, for example, the individual sound separation unit 131 extracts the sound at that specific position as an individual sound using beamforming or the like. The clarity adjustment unit 132 applies appropriate processing to the individual sound at the specific position, such as the same processing as that applied to infield speech. The output integration unit 14 then combines this with the other individual sounds into a single data item, and transmits it, along with information about the specific position, to the local remote communication device 1 via the transmission unit 15.

 ローカル側の遠隔コミュニケーション装置1の音声出力制御部22は、音データから特定の位置の個別音を取り出して、特定の位置の情報を用いて内野発話に対する処理と同じ処理等の適当な複数対向スピーカーバランシング処理を行う。そして、音声出力制御部22は、処理を施した特定の位置の個別音をスピーカーユニット411及び412に再生させる。 The audio output control unit 22 of the local remote communication device 1 extracts individual sounds at specific positions from the sound data and performs appropriate multiple opposing speaker balancing processing, such as the same processing as for infield speech, using the information about the specific positions. The audio output control unit 22 then causes the speaker units 411 and 412 to reproduce the processed individual sounds at the specific positions.

 <11.4.効果>
 このように、指さされた特定の位置の音を背景音と比較して強調して再生することで、発話音声以外にも、ローカルの人が聞き取りたい音を聞き易くすることができ、コミュニケーションの円滑化を図ることができる。
<11.4. Effects>
In this way, by emphasizing and reproducing the sound at the specific location pointed at compared to the background sounds, it is possible to make it easier for local people to hear sounds that they want to hear, in addition to spoken voices, thereby facilitating smooth communication.

 <12.第5の実施の形態に係る遠隔コミュニケーション装置>
 次に、第5の実施の形態に係る遠隔コミュニケーション装置1について説明する。第1の実施の形態では、カメラ2とマイク3とが仮想スクリーンの法線方向に並べられており、人センシング部11が、カメラ2で取得した映像の各発話者の位置を示す正規化座標を基に、各発話者の方向の角度を求めた。これに対して、第5の実施の形態に係る遠隔コミュニケーション装置1は、マイク3により得られた音声により音源方向を特定する。本実施例に係る遠隔コミュニケーション装置1は、図23のブロック図で表される。以下では、図1と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。
12. Remote communication device according to the fifth embodiment
Next, a remote communication device 1 according to a fifth embodiment will be described. In the first embodiment, the camera 2 and microphone 3 are aligned in the normal direction of the virtual screen, and the human sensing unit 11 determines the angle of the direction of each speaker based on normalized coordinates indicating the position of each speaker in the image captured by the camera 2. In contrast, the remote communication device 1 according to the fifth embodiment identifies the direction of a sound source using the sound captured by the microphone 3. The remote communication device 1 according to this example is represented by the block diagram of FIG. 23. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.

 図23は、第5の実施の形態に係る遠隔コミュニケーション装置のブロック図である。また、図24は、第5の実施の形態に係る遠隔コミュニケーション装置による音声信号処理を説明するための図である。 FIG. 23 is a block diagram of a remote communication device according to the fifth embodiment. Also, FIG. 24 is a diagram for explaining audio signal processing by the remote communication device according to the fifth embodiment.

 <12.1.音源方向推定部>
 本実施例に係る遠隔コミュニケーション装置1は、図23に示すように音源方向推定部16を有する。音源方向推定部16は、マイクアレイ30により各方向に順番にビームを向けて得られる音声を取得する。ビームを向ける方向は、マイク3に予め設定しておいても良いし、音源方向推定部16がマイク3に順番に指定してもよい。
<12.1. Sound source direction estimation section>
The remote communication device 1 according to this embodiment has a sound source direction estimating unit 16 as shown in Fig. 23. The sound source direction estimating unit 16 acquires sounds obtained by directing a beam in each direction in turn by the microphone array 30. The direction to direct the beam may be preset in the microphone 3, or the sound source direction estimating unit 16 may specify the direction to direct the beam in turn to the microphone 3.

 ここでは、ビームフォーミングでは発話および突発背景音のみを検出し、定常背景音は収音しないものとする。また、映像上に映っている発話者及びその範囲で発生した突発背景音の他に、音は発生しない、すなわち画角外からは音が鳴らないものとする。 Here, beamforming is assumed to detect only speech and sudden background sounds, and not pick up steady background sounds. Furthermore, it is assumed that no sounds are generated other than the speaker appearing on the screen and any sudden background sounds occurring within that range, i.e., no sound is heard from outside the field of view.

 次に、図24に示すように、音源方向推定部16は、各方向にビームを向けて得られるそれぞれの音声のうち、出力が閾値より大きい音声を選択して、選択した音声を取得したビームの方向に音源があるとして音源方向の推定を行う(ステップS501)。図24のステップS501は、多数の音源方向が推定されたことを示している。そして、音源方向推定部16は、音源方向を発話者が存在する位置と判定する。このように、音源方向推定部16は、リモート空間にあたる所定空間の音声を基に発話者の位置を判定する。 Next, as shown in FIG. 24, the sound source direction estimation unit 16 selects, from the sounds obtained by directing the beam in each direction, a sound whose output is greater than a threshold, and estimates the sound source direction by assuming that the sound source is located in the direction of the beam that obtained the selected sound (step S501). Step S501 in FIG. 24 shows that a large number of sound source directions have been estimated. The sound source direction estimation unit 16 then determines the sound source direction as the position where the speaker is located. In this way, the sound source direction estimation unit 16 determines the position of the speaker based on the sound in a specified space that corresponds to the remote space.
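The selection of beams whose output exceeds a threshold can be sketched as follows; the power measure and the threshold value are assumptions, since the specification only states that a sound source is taken to lie in the direction of any beam whose output is larger than the threshold.

```python
import numpy as np

def estimate_source_directions(beam_outputs, beam_angles_deg, power_threshold):
    """Illustrative sketch of the sound source direction estimation unit 16.

    beam_outputs    : one signal per scanned beam direction of the microphone array 30.
    beam_angles_deg : the directions scanned in turn.
    power_threshold : output power above which a beam is taken to contain a source.
    Returns the directions in which speakers (or sudden sounds) are assumed to exist.
    """
    directions = []
    for angle, signal in zip(beam_angles_deg, beam_outputs):
        power = float(np.mean(np.asarray(signal) ** 2))
        if power > power_threshold:
            directions.append(angle)
    return directions
```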

 その後、音源方向推定部16は、判定した発話者が存在する位置の正規化座標を人センシング部11へ通知して、発話者が存在する位置の正規化座標を基に人別リストを作成する。この場合、人センシング部11は、通知された正規化座標を用いて映像に映っている人の中から発話者を特定し、特定した各発話者の内野外野判定を実行して、人別リストに登録する。 The sound source direction estimation unit 16 then notifies the human sensing unit 11 of the normalized coordinates of the position where the determined speaker is located, so that a person list is created based on those normalized coordinates. In this case, the human sensing unit 11 uses the notified normalized coordinates to identify speakers from among the people shown in the video, performs an infield/outfield determination for each identified speaker, and registers them in the person list.

 <12.2.音響信号処理部>
 音響信号処理部13は、音源方向推定部16により音声から推定された各発話者の位置の正規化座標が登録された人別リストを人センシング部11から取得する。そして、取得した人別リストを用いて発話音声信号処理を実行する(ステップS502)。
<12.2. Acoustic signal processing section>
The acoustic signal processing unit 13 acquires, from the person sensing unit 11, a person list in which normalized coordinates of the position of each speaker estimated from the voice by the sound source direction estimation unit 16 are registered. Then, the acoustic signal processing unit 13 executes speech signal processing using the acquired person list (step S502).

 詳しくは、個別音分離部131が、人別リストに登録された音声から推定された各発話者の人数のスレッドを生成する。そして、個別音分離部131は、スレッド毎に各発話者の位置の正規化座標を用いて個別音分離処理を実行して各発話者の個別音を生成する(ステップS521)。このように、個別音分離部131は、音源方向推定部16により判定された発話者の位置を基に、リモート空間にあたる所定空間の音声から発話者毎の発話音声である個別音を分離する。 In more detail, the individual sound separation unit 131 generates threads for each speaker estimated from the audio registered in the person list. Then, the individual sound separation unit 131 performs individual sound separation processing using the normalized coordinates of the position of each speaker for each thread to generate individual sounds for each speaker (step S521). In this way, the individual sound separation unit 131 separates individual sounds, which are the speech sounds of each speaker, from the audio in a specified space corresponding to the remote space, based on the position of the speaker determined by the sound source direction estimation unit 16.

 個別音分離部131は、マイクアレイ30を用いて検出した音源方向についても、ディスプレイ6に表示された状態での映像の左側の角度から順に個別音分離を行うことで、個別音と人別リストに登録された各発話者の位置を示す正規化座標とを対応させる。これにより、ローカル側に伝送後も画像と音声とを一致させることができる。 The individual sound separation unit 131 also separates the individual sounds from the sound source direction detected using the microphone array 30, starting from the angle on the left side of the image displayed on the display 6, thereby matching the individual sounds with the normalized coordinates indicating the position of each speaker registered in the person list. This allows the image and sound to match even after transmission to the local side.

 明瞭度調整部132は、スレッド毎に、人別リストに登録された各発話者の内野フラグを用いて、音声から推定された発話者の個別音の明瞭度調整処理を実行する(ステップS522)。このように、明瞭度調整部132は、個別音分離部131により分離された個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う。 For each thread, the clarity adjustment unit 132 performs clarity adjustment processing for the individual sounds of the speaker estimated from the audio using the infield flag of each speaker registered in the person list (step S522). In this way, the clarity adjustment unit 132 adjusts the clarity for each individual sound separated by the individual sound separation unit 131 based on whether the speaker is participating in the conversation.

 <12.3.効果>
 これにより、カメラ2とマイク3とを仮想スクリーンの法線方向に並べて設置することが現実的に難しい場合であっても、カメラ2とマイク3との位置関係に制約なく個別音を生成することができる。
<12.3. Effects>
This allows individual sounds to be generated without any restrictions on the positional relationship between the camera 2 and the microphone 3, even if it is practically difficult to install the camera 2 and the microphone 3 side by side in the normal direction of the virtual screen.

 <13.第6の実施の形態に係る遠隔コミュニケーション装置>
 次に、第6の実施の形態に係る遠隔コミュニケーション装置1について説明する。本実施の形態に係る遠隔コミュニケーション装置1は、スピーカーアレイ41及び42の幅がディスプレイ6のスクリーンの幅よりも短い場合に、ディスプレイ6の幅に合わせて音声の再生位置を再マッピングする。本実施例に係る遠隔コミュニケーション装置1も、図1のブロック図で表される。以下では、図1と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。
13. Remote communication device according to the sixth embodiment
Next, a remote communication device 1 according to a sixth embodiment will be described. When the width of the speaker arrays 41 and 42 is shorter than the width of the screen of the display 6, the remote communication device 1 according to this embodiment remaps the audio playback position to match the width of the display 6. The remote communication device 1 according to this example is also represented by the block diagram of Figure 1. The following description will focus on the parts that are different from Figure 1, and descriptions of parts that are the same as those in Figure 1 may be omitted.

<13.1.音声出力制御部>
 図25は、第6の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。本実施の形態では、図25に示すように、スピーカーアレイ41及び42のスピーカーアレイ幅W_spkがディスプレイ6のスクリーンの幅であるディスプレイ幅W_dispよりも短い。
<13.1. Audio output control unit>
FIG. 25 is a diagram illustrating audio playback by the remote communication device according to the sixth embodiment. In this embodiment, as shown in FIG. 25, the speaker array width W_spk of the speaker arrays 41 and 42 is shorter than the display width W_disp, which is the width of the screen of the display 6.

 ここで、第1の実施の形態では、音声出力制御部22は、正規化座標系で画像と音声とを一致させるために、スピーカーアレイ幅W_spk及びスピーカー間距離W_uをディスプレイ幅W_dispで除して処理上の位置を決定した。しかし、この方法の複数対向スピーカーバランシング処理では、スピーカーアレイ幅W_spkがディスプレイ幅W_dispよりも短い場合、ディスプレイ6のスクリーンの端部に映る発話者の音声の再生が困難となる。 In the first embodiment, the audio output control unit 22 determined the processing position by dividing the speaker array width W_spk and the speaker distance W_u by the display width W_disp in order to match the image and audio in the normalized coordinate system. However, with this method of multiple opposed speaker balancing processing, if the speaker array width W_spk is shorter than the display width W_disp, it becomes difficult to reproduce the audio of a speaker who appears on the edge of the screen of the display 6.

 そこで、本実施の形態に係る音声出力制御部22は、スピーカーアレイ幅W_spk及びスピーカー間距離W_uをスピーカーアレイ幅W_spkで除して処理上の位置を決定する。この場合、スピーカーアレイ41及び42の幅は正規化座標における長さが1となり、スピーカーユニット411及び412の組みの間の幅は、スピーカー間距離W_uをスピーカーアレイ幅W_spkで除した値となる。すなわち、音声出力制御部22は、処理上のスピーカーアレイ幅を正規化座標の-0.5~0.5に設定する。そして、音声出力制御部22は、この処理上の距離を用いて複数対向スピーカーバランシング処理を実行して、各音声を出力するスピーカーユニット411及び412の組みを決定する。 Therefore, the audio output control unit 22 according to the present embodiment determines the position in processing by dividing the speaker array width W_spk and the speaker distance W_u by the speaker array width W_spk. In this case, the width of the speaker arrays 41 and 42 has a length of 1 in normalized coordinates, and the width between adjacent pairs of speaker units 411 and 412 is the value obtained by dividing the speaker distance W_u by the speaker array width W_spk. In other words, the audio output control unit 22 sets the speaker array width in processing to the range -0.5 to 0.5 in normalized coordinates. Then, the audio output control unit 22 executes the multiple facing speaker balancing process using these processing positions to determine the pairs of speaker units 411 and 412 that will output each sound.

 このように、音声出力制御部22は、スピーカーアレイ41及び42における発話音声の再生位置が、発話者それぞれのスクリーン上の位置の距離関係を維持するように、発話者の位置を基に再生させるスピーカーユニット411及び412を選択する。 In this way, the audio output control unit 22 selects the speaker units 411 and 412 to be used for playback based on the speakers' positions so that the playback positions of the speech sounds on the speaker arrays 41 and 42 maintain the distance relationships among the speakers' positions on the screen.
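A minimal sketch of this remapping, assuming the units are centered on the array: the unit positions are normalized by the speaker array width W_spk instead of the display width W_disp, so the array spans -0.5 to 0.5 in the processing coordinate system and a talker shown at the screen edge is mapped onto the outermost units.

```python
def remapped_unit_positions(u, unit_spacing_m, array_width_m):
    """Illustrative processing positions of the u speaker unit pairs (sixth embodiment).

    Dividing by the array width W_spk (instead of the display width W_disp) makes the
    array itself span the normalized range -0.5 .. 0.5, matching the normalized
    screen coordinates of the talkers.
    """
    positions = []
    for j in range(u):
        physical = (j - (u - 1) / 2.0) * unit_spacing_m   # offset from the array center
        positions.append(physical / array_width_m)        # normalized by W_spk, not W_disp
    return positions
```

A talker's normalized screen coordinate (already in -0.5 to 0.5) can then be compared directly against these positions when deriving the localization center speaker unit.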

 <13.2.効果>
 このように、本実施の形態に係る遠隔コミュニケーション装置1は、スピーカーアレイ41及び42の幅に合わせて音声の再生位置を再マッピングする。これにより、スクリーンに映されている発話者の声を全て再生することができる。したがって、スピーカーアレイ41及び42の幅がディスプレイ6のスクリーンよりも極端に短い場合であっても、画面の端で話している発話者の声が再生されなくなることを回避でき、映像上の全ての話者をスピーカーアレイ41及び42から再生することができる。
<13.2. Effects>
In this way, the remote communication device 1 according to the present embodiment remaps the audio playback position to match the width of the speaker arrays 41 and 42. This makes it possible to play back the voices of all speakers appearing on the screen. Therefore, even if the width of the speaker arrays 41 and 42 is significantly shorter than the screen of the display 6, it is possible to prevent the voices of speakers speaking at the edges of the screen from being cut off, and all speakers appearing on the screen can be played back from the speaker arrays 41 and 42.

 <14.第7の実施の形態に係る遠隔コミュニケーション装置>
 次に、第7の実施の形態に係る遠隔コミュニケーション装置1について説明する。複数対向スピーカー4は、上下ともモノラルで同じ音声を再生することで映像の縦方向の中央に仮想的に音源を定位させる。ただし、子供等の低身長の発話者や高身長の発話者の発話音声を再生する場合には不自然になる可能性がある。
14. Remote communication device according to the seventh embodiment
Next, a remote communication device 1 according to a seventh embodiment will be described. The multiple opposing speakers 4 virtually localize the sound source in the vertical center of the image by reproducing the same monaural sound from both the top and bottom. However, this may sound unnatural when reproducing the speech of a short speaker such as a child or a tall speaker.

 そこで、本実施の形態に係る遠隔コミュニケーション装置1は、音源の高さに合わせて縦方向の音源位置を変更して音声を再生する。本実施例に係る遠隔コミュニケーション装置1は、図23のブロック図で表される。以下では、図23と異なる部分を中心に説明して、図1と同一の各部については説明を省略する場合がある。ただし、本実施の形態では、音源方向推定部16は、各発話者の正規化座標方向の位置の推定は行わなくてもよい。図26は、第7の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。 Therefore, the remote communication device 1 according to this embodiment plays back audio by changing the vertical position of the sound source according to the height of the sound source. The remote communication device 1 according to this embodiment is represented by the block diagram in Figure 23. The following explanation will focus on the parts that differ from Figure 23, and explanations of parts that are the same as those in Figure 1 may be omitted. However, in this embodiment, the sound source direction estimation unit 16 does not need to estimate the position of each speaker in the normalized coordinate direction. Figure 26 is a diagram for explaining audio playback by a remote communication device according to the seventh embodiment.

 <14.1.人センシング部>
 人センシング部11は、映像に映った各発話者の口の縦方向の位置を取得する。例えば、人センシング部11は、映像解析により口の位置を推定することができる。そして、人センシング部11は、人別リストに各発話者についての縦方向の図26における正規化座標Yを格納する。この場合、人センシング部11は、図26に示すように、スクリーンの縦方向の中央を原点とし、且つ、縦方向の座標を-0.5~0.5として縦方向の正規化座標Yを設定する。
<14.1. Human sensing unit>
The human sensing unit 11 acquires the vertical position of the mouth of each speaker shown in the video. For example, the human sensing unit 11 can estimate the position of the mouth through video analysis. Then, the human sensing unit 11 stores the normalized vertical coordinate Y in FIG. 26 for each speaker in the person list. In this case, as shown in FIG. 26, the human sensing unit 11 sets the vertical normalized coordinate Y with the center of the screen as the origin and the vertical coordinate between -0.5 and 0.5.

 <14.2.個別音分離部>
 また、背景音のうちの突発音等の発話以外の音は必ずしも映像上の中央に映されている位置が音源とは限らない。これについては、第2の実施の形態のように突発音を特定する場合に、背景音分離部133は、突発音の縦方向の位置を特定する。この場合、例えば、背景音分離部133は、個別音分離部131及びオーディオインタフェース5を介してマイクアレイ30に繋がる信号線を用いて信号を送受信する。背景音分離部133は、マイクアレイ30に縦方向にビームを走査させて、縦方向の音声の強さから突発音の音源の縦方向の位置を推定する事ができる。そして、背景音分離部133は、背景音における突発音の位置を正規化座標Yで表して、その情報を出力統合部14及び送信部15を介してリモート側の遠隔コミュニケーション装置1へ送信する。
<14.2. Individual sound separation section>
Furthermore, the source of background sounds other than speech, such as sudden sounds, is not necessarily located at the center of the image. Regarding this, when identifying a sudden sound as in the second embodiment, the background sound separation unit 133 identifies the vertical position of the sudden sound. In this case, for example, the background sound separation unit 133 transmits and receives signals using a signal line connected to the microphone array 30 via the individual sound separation unit 131 and the audio interface 5. The background sound separation unit 133 can estimate the vertical position of the sound source of the sudden sound from the vertical sound intensity by causing the microphone array 30 to scan a beam. The background sound separation unit 133 then represents the position of the sudden sound in the background sound using normalized coordinates Y and transmits this information to the remote communication device 1 via the output integration unit 14 and the transmission unit 15.

 また、ここでは、背景音分離部133がマイクアレイ30にビームを走査させる構成で説明したが、他にも、第2の実施の形態の遠隔コミュニケーション装置1に、図23に示した音源方向推定部16を搭載させてもよい。その場合、音源方向推定部16がマイクアレイ30を用いて推定した音源の縦方向の位置は人センシング部11に送られて、人別リストに登録される。 Furthermore, while the configuration has been described here in which the background sound separation unit 133 causes the microphone array 30 to scan the beam, the remote communication device 1 of the second embodiment may also be equipped with the sound source direction estimation unit 16 shown in FIG. 23. In this case, the vertical position of the sound source estimated by the sound source direction estimation unit 16 using the microphone array 30 is sent to the human sensing unit 11 and registered in the person list.

 <14.3.音声出力制御部>
 音声出力制御部22は、例えば、スピーカーアレイ41をLチャンネルとし、かつ、スピーカーアレイ42をRチャンネルとして音声信号をステレオ化する。そして、音声出力制御部22は、人別リストに登録された縦方向の正規化座標YにしたがってLチャンネル及びRチャンネルを用いてステレオ化した音声信号をパニングして、指定された縦方向の位置で発話音声が再生されるように音源位置を変更する。
<14.3. Audio output control unit>
The audio output control unit 22 converts the audio signal to stereo, for example by using the speaker array 41 as the L channel and the speaker array 42 as the R channel. Then, the audio output control unit 22 pans the stereo audio signal between the L channel and the R channel in accordance with the vertical normalized coordinate Y registered in the person list, and changes the sound source position so that the speech sound is reproduced at the specified vertical position.

 例えば、音声出力制御部22は、図26における発話者P1の発話音声は、縦方向の中央よりも下の位置を音源として再生させる。また、音声出力制御部22は、発話者P2の発話音声は、縦方向の中央よりも上の位置を音源として再生させる。 For example, the audio output control unit 22 plays back the speech of speaker P1 in FIG. 26 using a sound source located below the center in the vertical direction. Furthermore, the audio output control unit 22 plays back the speech of speaker P2 using a sound source located above the center in the vertical direction.

 また、第2の実施の形態のように背景音における突発音を定常音とは別に再生させる場合に、音声出力制御部22は、以下の処理を行ってもよい。すなわち、音声出力制御部22は、突発音の縦方向の位置についても、個別音分離部131から通知された突発音の縦方向の位置の正規化座標Yに合わせて、ステレオ化した音声信号をパニングする。これにより、音声出力制御部22は、指定された縦方向の位置で突発音が再生されるように音源位置を変更する。このように、音声出力制御部22は、リモート空間にあたる所定空間の音声をステレオ化して、スピーカーアレイ41とスピーカーアレイ42との間の音源の位置を調整する。 Furthermore, when a sudden sound in the background sound is reproduced separately from the steady sound as in the second embodiment, the audio output control unit 22 may perform the following process. That is, the audio output control unit 22 also pans the stereo audio signal for the vertical position of the sudden sound, according to the normalized coordinate Y of the vertical position of the sudden sound notified by the individual sound separation unit 131. As a result, the audio output control unit 22 changes the sound source position so that the sudden sound is reproduced at the specified vertical position. In this way, the audio output control unit 22 stereophonizes the sound in a predetermined space that corresponds to the remote space, and adjusts the position of the sound source between the speaker array 41 and the speaker array 42.
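A minimal sketch of the vertical panning, assuming a constant-power (sine/cosine) panning law and that the two arrays sit above and below the screen; the specification only states that the stereo signal is panned according to the normalized vertical coordinate Y in the range -0.5 to 0.5.

```python
import numpy as np

def pan_vertically(signal, y_norm):
    """Illustrative vertical panning between the upper and lower speaker arrays.

    signal : the sound to reproduce (speech or sudden sound).
    y_norm : normalized vertical coordinate Y in [-0.5, 0.5]; 0 is the screen center,
             +0.5 the top edge (assignment of arrays 41/42 to top/bottom is assumed).
    Returns (upper_signal, lower_signal).
    """
    pos = float(np.clip(y_norm + 0.5, 0.0, 1.0))   # 0 = bottom array only, 1 = top array only
    upper_gain = np.sin(pos * np.pi / 2.0)
    lower_gain = np.cos(pos * np.pi / 2.0)
    return signal * upper_gain, signal * lower_gain
```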

 <14.4.効果>
 モノラル再生ではスクリーンの縦方向の中央に仮想的な音源が定位される。これに対して、本実施の形態に係る遠隔コミュニケーション装置1は、発話者の身長に差がある場合や鳴らす音が縦方向の中央から離れている場合に、縦方向についても適切な位置に音源位置を動かすことができる。したがって、より対話を自然な状態に近づけることができ、コミュニケーションの円滑化を図ることができる。
<14.4. Effects>
With monaural playback, the virtual sound source is localized at the vertical center of the screen. In contrast, the remote communication device 1 according to this embodiment can shift the sound source to an appropriate vertical position when the speakers differ in height or when the emitted sound originates far from the vertical center. This brings the dialogue closer to a natural state and facilitates smooth communication.

 以上、本開示の実施の形態について説明したが、本開示の技術的範囲は、上述の実施の形態そのままに限定されるものではなく、本開示の要旨を逸脱しない範囲において種々の変更が可能である。また、異なる実施の形態及び変形例にわたる構成要素を適宜組み合わせてもよい。 Although the embodiments of the present disclosure have been described above, the technical scope of the present disclosure is not limited to the above-described embodiments, and various modifications are possible without departing from the gist of the present disclosure. Furthermore, components from different embodiments and modifications may be combined as appropriate.

 <15.ハードウェア構成>
 図27は、第1の実施の形態~第7の実施の形態に係る情報処理装置である遠隔コミュニケーション装置の演算装置を実現するコンピュータの一例を示すハードウェア構成図である。
<15. Hardware Configuration>
FIG. 27 is a hardware configuration diagram showing an example of a computer that realizes the arithmetic unit of the remote communication device that is the information processing device according to the first to seventh embodiments.

 コンピュータ1000は、CPU1100、RAM1200、ROM(Read Only Memory)1300、HDD(Hard Disk Drive)1400、通信インターフェイス1500、及び入出力インターフェイス1600を有する。コンピュータ1000の各部は、バス1050によって接続される。 Computer 1000 has a CPU 1100, RAM 1200, ROM (Read Only Memory) 1300, HDD (Hard Disk Drive) 1400, communication interface 1500, and input/output interface 1600. Each part of computer 1000 is connected by bus 1050.

 CPU1100は、ROM1300又はHDD1400に格納されたプログラムに基づいて動作し、各部の制御を行う。例えば、CPU1100は、ROM1300又はHDD1400に格納されたプログラムをRAM1200に展開し、各種プログラムに対応した処理を実行する。 The CPU 1100 operates based on programs stored in the ROM 1300 or HDD 1400 and controls each component. For example, the CPU 1100 loads programs stored in the ROM 1300 or HDD 1400 into the RAM 1200 and executes processing corresponding to the various programs.

 ROM1300は、コンピュータ1000の起動時にCPU1100によって実行されるBIOS(Basic Input Output System)等のブートプログラムや、コンピュータ1000のハードウェアに依存するプログラム等を格納する。 ROM 1300 stores boot programs such as the BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts up, as well as programs that depend on the computer 1000's hardware.

 HDD1400は、CPU1100によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を非一時的に記録する、コンピュータが読み取り可能な記録媒体である。具体的には、HDD1400は、プログラムデータ1450の一例である本開示に係るアプリケーションプログラムを記録する記録媒体である。 The HDD 1400 is a computer-readable recording medium that non-transitorily records programs executed by the CPU 1100 and data used by such programs. Specifically, the HDD 1400 is a recording medium that records an application program according to the present disclosure, which is an example of the program data 1450.

 通信インターフェイス1500は、コンピュータ1000が外部ネットワーク1550(例えばインターネット)と接続するためのインターフェイスである。例えば、CPU1100は、通信インターフェイス1500を介して、他の機器からデータを受信したり、CPU1100が生成したデータを他の機器へ送信したりする。 The communication interface 1500 is an interface that allows the computer 1000 to connect to an external network 1550 (e.g., the Internet). For example, the CPU 1100 receives data from other devices and transmits data generated by the CPU 1100 to other devices via the communication interface 1500.

 入出力インターフェイス1600は、入出力デバイス1650とコンピュータ1000とを接続するためのインターフェイスである。例えば、CPU1100は、入出力インターフェイス1600を介して、キーボードやマウス等の入力デバイスからデータを受信する。また、CPU1100は、入出力インターフェイス1600を介して、ディスプレイ6やオーディオインタフェース5やプリンタ等の出力デバイスにデータを送信する。また、入出力インターフェイス1600は、所定の記録媒体(メディア)に記録されたプログラム等を読み取るメディアインターフェイスとして機能してもよい。メディアとは、例えばDVD(Digital Versatile Disc)、PD(Phase change rewritable Disk)等の光学記録媒体、MO(Magneto-Optical disk)等の光磁気記録媒体、テープ媒体、磁気記録媒体、又は半導体メモリ等である。 The input/output interface 1600 is an interface for connecting the input/output device 1650 and the computer 1000. For example, the CPU 1100 receives data from input devices such as a keyboard or mouse via the input/output interface 1600. The CPU 1100 also transmits data to output devices such as the display 6, audio interface 5, or printer via the input/output interface 1600. The input/output interface 1600 may also function as a media interface that reads programs and the like recorded on a specified recording medium. Examples of media include optical recording media such as DVDs (Digital Versatile Discs) and PDs (Phase Change Rewritable Disks), magneto-optical recording media such as MOs (Magneto-Optical Disks), tape media, magnetic recording media, or semiconductor memory.

 なお、CPU1100は、プログラムデータ1450をHDD1400から読み取って実行するが、他の例として、外部ネットワーク1550を介して、他の装置からこれらのプログラムを取得してもよい。 Note that the CPU 1100 reads and executes the program data 1450 from the HDD 1400, but as another example, it may also obtain these programs from another device via the external network 1550.

 以上、添付図面を参照しながら本開示の好適な実施の形態について詳細に説明したが、本開示の技術的範囲はかかる例に限定されない。本開示の技術分野における通常の知識を有する者であれば、請求の範囲に記載された技術的思想の範疇内において、各種の変更例又は修正例に想到し得ることは明らかであり、これらについても、当然に本開示の技術的範囲に属するものと了解される。 The above describes in detail preferred embodiments of the present disclosure with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to such examples. It is clear that a person with ordinary skill in the technical field of the present disclosure would be able to conceive of various modified or revised examples within the scope of the technical ideas set forth in the claims, and it is understood that these also naturally fall within the technical scope of the present disclosure.

 また、本明細書に記載された効果は、あくまで説明的又は例示的なものであって限定的ではない。つまり、本開示に係る技術は、上記の効果とともに、又は上記の効果に代えて、本明細書の記載から当業者には明らかな他の効果を奏しうる。 Furthermore, the effects described in this specification are merely descriptive or exemplary and are not limiting. In other words, the technology disclosed herein may achieve other effects in addition to or in place of the above-mentioned effects that would be apparent to those skilled in the art from the description herein.

 なお、本技術は以下のような構成を取ることもできる。 This technology can also be configured as follows:

(1)
 所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
 マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
 前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部と
 を有する情報処理装置。
(2)
 前記人センシング部は、前記映像を基に発話者それぞれの位置を判定し、
 前記人センシング部により判定された発話者それぞれの前記位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離する個別音分離部とさらに有し、
 前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、各発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 前記(1)に記載の情報処理装置。
(3)
 前記送信部により送信された前記所定空間の音声を受信する受信部と、
 複数のスピーカーユニットが一列に並ぶスピーカーアレイを2つ有するスピーカーにおいて、発話者それぞれの前記位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部と
 をさらに有する前記(2)に記載の情報処理装置。
(4)
 前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に1つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
 前記音声出力制御部は、前記スクリーンに映された発話者毎に、前記スクリーン上の位置の近傍で発話音声が再生されるように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
 前記(3)に記載の情報処理装置。
(5)
 前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に1つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
 前記音声出力制御部は、前記スピーカーアレイにおける発話音声の再生位置が、発話者それぞれの前記スクリーン上の位置の距離関係を維持するように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
 前記(3)に記載の情報処理装置。
(6)
 前記明瞭度調整部は、前記所定空間の音声に含まれる会話に参加していない発話者の発話音声について、前記マイクによる収音状態よりも明瞭度を下げる明瞭度低下調整を行う前記(1)~(5)のいずれか一つに記載の情報処理装置。
(7)
 前記所定空間の音声から背景音を分離する音声分離部をさらに有し、
 前記明瞭度調整部は、前記音声分離部により分離された前記背景音について、他の音声よりも明瞭度を下げる明瞭度低下調整を行う
 前記(1)~(6)のいずれか一つに記載の情報処理装置。
(8)
 前記背景音を突発音と定常音とに分離する背景音分離部を備え、
 前記明瞭度調整部は、前記背景音のうち前記定常音について、前記明瞭度低下調整を行う
 前記(7)に記載の情報処理装置。
(9)
 前記個別音毎に、音声認識を実行して前記個別音の発話内容を示す文字を生成し、前記発話内容を示す文字を基に音声合成を実行して、発話者の発話音声を再生成する音声合成部をさらに有し、
 前記明瞭度調整部は、前記音声合成部により再生成された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 前記(2)~(5)のいずれか一つに記載の情報処理装置。
(10)
 前記人センシング部は、前記映像を基に発話者の距離を判定し、前記距離を基に発話者が会話に参加しているか否かを判定する前記(1)~(9)のいずれか一つに記載の情報処理装置。
(11)
 前記所定空間の音声を基に発話者の位置を判定する音源方向推定部をさらに有し、
 個別音分離部は、前記音源方向推定部により判定された発話者の位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離し、
 前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 前記(1)~(10)のいずれか一つに記載の情報処理装置。
(12)
 前記音声出力制御部は、前記所定空間の音声をステレオ化して、前記スピーカーアレイ間の音源の位置を調整する前記(3)に記載の情報処理装置。
(13)
 情報処理装置が、
 所定空間に存在する複数の発話者を撮影するカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシングステップと、
 前記所定空間の音声を収音するマイクにより収音された前記所定空間の音声に含まれる、前記人センシングステップにおいて判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整ステップと、
 前記明瞭度調整ステップにおいて処理された前記所定空間の音声を送信する送信ステップと
 を実行する情報処理方法。
(14)
 所定空間に存在する複数の発話者を撮影するカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定し
 前記所定空間の音声を収音するマイクにより収音された前記所定空間の音声に含まれる、会話に参加している発話者の発話音声について、明瞭度の調整を行い、
 会話に参加している発話者の発話音声を調整した前記所定空間の音声を送信する
 処理をコンピュータに実行させる情報処理プログラム。
(15)
 送信装置は、
 所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
 マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
 前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部とを有し、
 受信装置は、
 前記送信部により送信された前記所定空間の音声を受信する受信部と、
 複数のスピーカーユニットが一列に並ぶスピーカーアレイを2つ有するスピーカーにおいて、発話者それぞれの前記位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部とを有する
 情報処理システム。
(1)
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
a clarity adjustment unit that adjusts the clarity of the speech of speakers participating in the conversation determined by the human sensing unit, the speech being included in the sound in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space that has been processed by the clarity adjustment unit.
(2)
the human sensing unit determines the position of each speaker based on the video;
an individual sound separation unit that separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of each speaker determined by the human sensing unit;
The information processing device according to (1), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not each speaker is participating in a conversation.
(3)
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
The information processing device described in (2), further comprising an audio output control unit that, in a speaker having two speaker arrays in each of which a plurality of speaker units are arranged in a row, selects, based on the position of each speaker, the speaker unit that plays the speech of each speaker out of the sound in the predetermined space received by the receiving unit, and causes the selected speaker unit to play the speech of each speaker.
(4)
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device described in (3), wherein the audio output control unit selects the speaker unit to be played based on the position of each speaker displayed on the screen so that the spoken audio is played near the position on the screen.
(5)
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device described in (3), wherein the audio output control unit selects the speaker unit to be played back based on the position of the speaker so that the playback position of the spoken audio in the speaker array maintains the distance relationship between the positions of each speaker on the screen.
(6)
The information processing device described in any one of (1) to (5), wherein the clarity adjustment unit performs a clarity reduction adjustment to reduce the clarity of the speech of a speaker who is not participating in the conversation and is included in the audio of the specified space compared to the state of sound collection by the microphone.
(7)
further comprising an audio separation unit that separates background sounds from the audio in the predetermined space;
The information processing device according to any one of (1) to (6), wherein the clarity adjustment unit performs clarity reduction adjustment to reduce clarity of the background sound separated by the sound separation unit compared to other sounds.
(8)
a background sound separation unit that separates the background sound into a sudden sound and a steady sound,
The information processing device according to (7), wherein the clarity adjustment unit performs the clarity reduction adjustment on the stationary sound among the background sounds.
(9)
a speech synthesis unit that performs speech recognition for each of the individual sounds to generate characters indicating the speech content of the individual sounds, and performs speech synthesis based on the characters indicating the speech content to regenerate the speech of the speaker;
The information processing device according to any one of (2) to (5), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds regenerated by the speech synthesis unit based on whether or not a speaker is participating in a conversation.
(10)
The information processing device according to any one of (1) to (9), wherein the human sensing unit determines the distance of the speaker based on the video and determines whether the speaker is participating in the conversation based on the distance.
(11)
a sound source direction estimation unit that determines the position of a speaker based on the sound in the predetermined space;
an individual sound separation unit separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of the speaker determined by the sound source direction estimation unit;
The information processing device according to any one of (1) to (10), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not a speaker is participating in a conversation.
(12)
The information processing device according to (3), wherein the audio output control unit stereophonically converts audio in the predetermined space and adjusts the position of a sound source among the speaker arrays.
(13)
The information processing device
a human sensing step of determining whether each speaker is participating in a conversation based on an image captured by a camera capturing multiple speakers present in a predetermined space;
a clarity adjustment step of adjusting the clarity of the speech of a speaker participating in the conversation determined in the human sensing step, the speech being included in the audio of the predetermined space collected by a microphone that collects audio of the predetermined space;
a transmitting step of transmitting the sound in the predetermined space processed in the clarity adjustment step.
(14)
Based on the video captured by a camera capturing multiple speakers in a predetermined space, it is determined whether each speaker is participating in the conversation, and the clarity of the speech of the speakers participating in the conversation, which is included in the audio of the predetermined space collected by a microphone that collects the audio of the predetermined space, is adjusted.
An information processing program that causes a computer to execute a process including transmitting the sound of the predetermined space in which the speech of the speakers participating in the conversation has been adjusted.
(15)
The transmitting device
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
a clarity adjustment unit that adjusts the clarity of the speech of speakers participating in the conversation determined by the human sensing unit, the speech being included in the sound in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space processed by the clarity adjustment unit,
The receiving device
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
and an audio output control unit that, in a speaker having two speaker arrays in each of which a plurality of speaker units are arranged in a row, selects, based on the position of each speaker, the speaker unit that plays back the speech of each speaker out of the sound in the predetermined space received by the receiving unit, and causes the selected speaker unit to play back the speech of each speaker; an information processing system comprising the transmitting device and the receiving device.

 1 遠隔コミュニケーション装置
 2 カメラ
 3 マイク
 4 複数対向スピーカー
 5 オーディオインタフェース
 6 ディスプレイ
 7 ネットワーク
 10 送信ユニット
 11 人センシング部
 12 音声分離部
 13 音響信号処理部
 14 出力統合部
 15 送信部
 16 音源方向推定部
 20 受信ユニット
 21 受信部
 22 音声出力制御部
 30 マイクアレイ
 41,42 スピーカーアレイ
 131 個別音分離部
 132 明瞭度調整部
 133 背景音分離部
 134 音声合成部
 411,412 スピーカーユニット
REFERENCE SIGNS LIST
1 Remote communication device
2 Camera
3 Microphone
4 Multiple opposing speakers
5 Audio interface
6 Display
7 Network
10 Transmission unit
11 Human sensing unit
12 Voice separation unit
13 Acoustic signal processing unit
14 Output integration unit
15 Transmission unit
16 Sound source direction estimation unit
20 Receiving unit
21 Receiving unit
22 Voice output control unit
30 Microphone array
41, 42 Speaker array
131 Individual sound separation unit
132 Clarity adjustment unit
133 Background sound separation unit
134 Voice synthesis unit
411, 412 Speaker unit

Claims (14)

 所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
 マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
 前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部と
 を有する情報処理装置。
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
a clarity adjustment unit that adjusts the clarity of the speech of speakers participating in the conversation determined by the human sensing unit, the speech being included in the sound in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space that has been processed by the clarity adjustment unit.
 前記人センシング部は、前記映像を基に発話者それぞれの位置を判定し、
 前記人センシング部により判定された発話者それぞれの前記位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離する個別音分離部とさらに有し、
 前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、各発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 請求項1に記載の情報処理装置。
the human sensing unit determines the position of each speaker based on the video;
an individual sound separation unit that separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of each speaker determined by the human sensing unit;
The information processing device according to claim 1 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not each speaker is participating in a conversation.
 前記送信部により送信された前記所定空間の音声を受信する受信部と、
 複数のスピーカーユニットが一列に並ぶスピーカーアレイを2つ有するスピーカーにおいて、発話者それぞれの前記位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部と
 をさらに有する請求項2に記載の情報処理装置。
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
3. The information processing device according to claim 2, further comprising: an audio output control unit that, in a speaker having two speaker arrays in each of which a plurality of speaker units are arranged in a row, selects, based on the position of each speaker, the speaker unit that plays back the speech of each speaker out of the sound in the predetermined space received by the receiving unit, and causes the selected speaker unit to play back the speech of each speaker.
 前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に1つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
 前記音声出力制御部は、前記スクリーンに映された発話者毎に、前記スクリーン上の位置の近傍で発話音声が再生されるように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
 請求項3に記載の情報処理装置。
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device according to claim 3 , wherein the audio output control unit selects the speaker unit to be played back based on the position of each speaker displayed on the screen so that the spoken audio is played back near the position on the screen.
 前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に1つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
 前記音声出力制御部は、前記スピーカーアレイにおける発話音声の再生位置が、発話者それぞれの前記スクリーン上の位置の距離関係を維持するように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
 請求項3に記載の情報処理装置。
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device according to claim 3 , wherein the audio output control unit selects the speaker unit to be played back based on the position of the speaker so that the playback position of the speech voice in the speaker array maintains a distance relationship between the positions of the speakers on the screen.
 前記明瞭度調整部は、前記所定空間の音声に含まれる会話に参加していない発話者の発話音声について、前記マイクによる収音状態よりも明瞭度を下げる明瞭度低下調整を行う請求項1に記載の情報処理装置。
The information processing device according to claim 1, wherein the clarity adjustment unit performs clarity reduction adjustment to reduce the clarity of speech of speakers not participating in the conversation included in the audio from the specified space compared to the state of sound pickup by the microphone.
 前記所定空間の音声から背景音を分離する音声分離部をさらに有し、
 前記明瞭度調整部は、前記音声分離部により分離された前記背景音について、他の音声よりも明瞭度を下げる明瞭度低下調整を行う
 請求項1に記載の情報処理装置。
further comprising an audio separation unit that separates background sounds from the audio in the predetermined space;
The information processing device according to claim 1 , wherein the clarity adjustment unit performs clarity reduction adjustment on the background sound separated by the sound separation unit to reduce clarity more than other sounds.
 前記背景音を突発音と定常音とに分離する背景音分離部を備え、
 前記明瞭度調整部は、前記背景音のうち前記定常音について、前記明瞭度低下調整を行う
 請求項7に記載の情報処理装置。
a background sound separation unit that separates the background sound into a sudden sound and a steady sound,
The information processing device according to claim 7 , wherein the clarity adjustment unit performs the clarity reduction adjustment on the stationary sound of the background sound.
 前記個別音毎に、音声認識を実行して前記個別音の発話内容を示す文字を生成し、前記発話内容を示す文字を基に音声合成を実行して、発話者の発話音声を再生成する音声合成部をさらに有し、
 前記明瞭度調整部は、前記音声合成部により再生成された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 請求項2に記載の情報処理装置。
a speech synthesis unit that performs speech recognition for each of the individual sounds to generate characters indicating the speech content of the individual sounds, and performs speech synthesis based on the characters indicating the speech content to regenerate the speech of the speaker;
The information processing device according to claim 2 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds regenerated by the speech synthesis unit based on whether or not a speaker is participating in a conversation.
 前記人センシング部は、前記映像を基に発話者の距離を判定し、前記距離を基に発話者が会話に参加しているか否かを判定する請求項1に記載の情報処理装置。
The information processing device described in claim 1, wherein the human sensing unit determines the distance of the speaker based on the video and determines whether the speaker is participating in the conversation based on the distance.
 前記所定空間の音声を基に発話者の位置を判定する音源方向推定部をさらに有し、
 個別音分離部は、前記音源方向推定部により判定された発話者の位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離し、
 前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
 請求項1に記載の情報処理装置。
a sound source direction estimation unit that determines the position of a speaker based on the sound in the predetermined space;
an individual sound separation unit separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of the speaker determined by the sound source direction estimation unit;
The information processing device according to claim 1 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not a speaker is participating in a conversation.
 前記音声出力制御部は、前記所定空間の音声をステレオ化して、前記スピーカーアレイ間の音源の位置を調整する請求項3に記載の情報処理装置。
The information processing device described in claim 3, wherein the audio output control unit converts the audio from the specified space into stereo and adjusts the position of the sound source between the speaker arrays.
 情報処理装置が、
 所定空間に存在する複数の発話者を撮影するカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシングステップと、
 前記所定空間の音声を収音するマイクにより収音された前記所定空間の音声に含まれる、前記人センシングステップにおいて判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整ステップと、
 前記明瞭度調整ステップにおいて処理された前記所定空間の音声を送信する送信ステップと
 を実行する情報処理方法。
The information processing device
a human sensing step of determining whether each speaker is participating in a conversation based on an image captured by a camera capturing multiple speakers present in a predetermined space;
a clarity adjustment step of adjusting the clarity of the speech of a speaker participating in the conversation determined in the human sensing step, the speech being included in the audio of the predetermined space collected by a microphone that collects audio of the predetermined space;
a transmitting step of transmitting the sound in the predetermined space processed in the clarity adjustment step.
 送信装置は、
 所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
 マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
 前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部とを有し、
 受信装置は、
 前記送信部により送信された前記所定空間の音声を受信する受信部と、
 複数のスピーカーユニットが一列に並ぶスピーカーアレイを2つ有するスピーカーにおいて、発話者それぞれの位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部とを有する
 情報処理システム。
The transmitting device
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
a clarity adjustment unit that adjusts the clarity of the speech of speakers participating in a conversation determined by the human sensing unit, the speech being included in the sound in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space processed by the clarity adjustment unit,
The receiving device
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
and an audio output control unit that, in a speaker having two speaker arrays in each of which a plurality of speaker units are arranged in a row, selects, based on the position of each speaker, the speaker unit that plays back the speech of each speaker out of the sound in the predetermined space received by the receiving unit, and causes the selected speaker unit to play back the speech of each speaker; an information processing system comprising the transmitting device and the receiving device.
PCT/JP2025/005168 2024-02-27 2025-02-17 Information processing device, information processing method, and information processing system Pending WO2025182639A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2024-027465 2024-02-27
JP2024027465 2024-02-27

Publications (1)

Publication Number Publication Date
WO2025182639A1 true WO2025182639A1 (en) 2025-09-04

Family

ID=96921331

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2025/005168 Pending WO2025182639A1 (en) 2024-02-27 2025-02-17 Information processing device, information processing method, and information processing system

Country Status (1)

Country Link
WO (1) WO2025182639A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113140223A (en) * 2021-03-02 2021-07-20 广州朗国电子科技有限公司 Conference voice data processing method, device and storage medium
JP2023047956A (en) * 2021-09-27 2023-04-06 ソフトバンク株式会社 Information processing device, information processing method and information processing program
WO2023100594A1 (en) * 2021-12-03 2023-06-08 ソニーグループ株式会社 Information processing device, information processing method, and program

Similar Documents

Publication Publication Date Title
US11386903B2 (en) Methods and systems for speech presentation based on simulated binaural audio signals
Donley et al. Easycom: An augmented reality dataset to support algorithms for easy communication in noisy environments
US12294843B2 (en) Audio apparatus and method of audio processing for rendering audio elements of an audio scene
EP2323425B1 (en) Method and device for generating audio signals
JP7354225B2 (en) Audio device, audio distribution system and method of operation thereof
US10998870B2 (en) Information processing apparatus, information processing method, and program
CN112637529B (en) Video processing method and device, storage medium and electronic equipment
EP3506080B1 (en) Audio scene processing
WO2022059362A1 (en) Information processing device, information processing method, and information processing system
US11711652B2 (en) Reproduction device, reproduction system and reproduction method
JP2003032776A (en) Reproduction system
Brandenburg et al. Creating auditory illusions with binaural technology
US12035123B2 (en) Impulse response generation system and method
WO2023286320A1 (en) Information processing device and method, and program
CN115734148A (en) Sound effect adjusting method and related device
WO2025182639A1 (en) Information processing device, information processing method, and information processing system
CN117636928A (en) Pickup device and related audio enhancement method
JP2006338493A (en) Next speaker detection method, apparatus, and program
RU2798414C2 (en) Audio device and audio processing method
WO2023176389A1 (en) Information processing device, information processing method, and recording medium
CN119520873A (en) Video playback method, device, equipment and readable storage medium
WO2024116945A1 (en) Audio signal processing device, audio device, and audio signal processing method
JP2007243604A (en) Terminal equipment, remote conference system, remote conference method, and program