JP2012008391A

JP2012008391A - Device and method for changing voice, and confidential communication system for voice information

Info

Publication number: JP2012008391A
Application number: JP2010145038A
Authority: JP
Inventors: Takayoshi Nakai; 孝芳中井; Fukuji Kawakami; 福司川上
Original assignee: Nippon Sheet Glass Environment Amenity Co Ltd
Current assignee: Nippon Sheet Glass Environment Amenity Co Ltd
Priority date: 2010-06-25
Filing date: 2010-06-25
Publication date: 2012-01-12
Anticipated expiration: 2030-06-25
Also published as: JP5662711B2

Abstract

PROBLEM TO BE SOLVED: To provide a device, a method and a system for concealing the content of a voice while suppressing an increase in noise level. SOLUTION: A confidential communication system for voice information has a microphone Mic that receives the voice being uttered by a customer 6 in a booth 2 and generates a voice signal representing the voice, an SD controller part SD that changes the voice signal generated by the microphone Mic, and a speaker SP that converts the voice signal changed by the SD controller part SD into voice and outputs it to a neighboring booth 2' where the voice being uttered can be heard. The SD controller part SD includes a partial extraction part that extracts a signal of a change-target portion from the voice signal generated by the microphone Mic based on the waveform of the voice signal, a partial change part that changes the signal of the change-target portion, and an output part that outputs the voice signal changed in this way to the speaker SP.

Description

The present invention relates to a voice changing device that changes a voice, a voice changing method, and a confidential communication system for voice information that includes the voice changing device.

With the enforcement of laws such as the Personal Information Protection Law, the need to protect conversations in banks and offices is increasing. Measures proposed so far include sound insulation and soundproofing that physically divide the space, and BGM/masking systems that conceal conversational speech with other noise or music in open-plan offices and the like.

Conventional approaches to concealing voice information include:
(1) a masking system, which conceals the target speech with other stationary noise;
(2) a shading system, which conceals it with the room's background noise or air-conditioning noise;
(3) sound insulation and soundproofing, which partition the target room spatially and separate it acoustically.
Approach (1) tries to (forcibly) erase the very presence of the speech and is classified as energy masking. It is used, for example, in the booths and conference rooms of open-plan offices.

An example of system (1) is reported in Non-Patent Document 1. A dedicated generator and speakers are installed, for example inside the ceiling, and a masking sound is generated to conceal speech. The principle is to generate music or noise (unrelated to the conversation) at a level that does not disturb the conversation, thereby lowering the signal-to-noise ratio (S/N) to conceal the speech content, or reducing clarity and intelligibility until the conversation can no longer be understood. The system includes a control device (signal processing device) that sets the masking sound to an optimum level according to the conversation level, the room's background noise and so on, as well as a power amplifier.

As an application of this technique, masking noise has been radiated from a partition into a booth so that the target space is limited to the booth, in an attempt to keep the noise level of the room as a whole from rising.

An example of system (2) is reported in Non-Patent Document 2, which describes a "Sound Shading System" that uses the room's own background noise or familiar everyday air-conditioning noise as the radiated masking noise. In this system, a speaker is installed on top of a partition of the visual-screening type used to ensure privacy at bank counters and the like, so that the privacy of the conversation is protected as well. A masking sound is reproduced from this speaker to prevent the conversation from leaking to people on the other side of the partition. The reproduced sound is either generated from city crowd noise or is the air-conditioning noise of the room.

Examples of approach (3) include sound insulation by partitioning off a separate room and soundproofing by means of partitions.

Patent Document 1: JP 2008-233671 A

Non-Patent Document 1: KOKUYO press release, "Sound Masking", October 18, 2006
Non-Patent Document 2: Akiko Sugimoto, Takahiro Nakamura, Shiro Ise, "Evaluation of a Sound Shading System that generates a sound field in consideration of ease of conversation and privacy", Proceedings of the 2005 Spring Meeting of the Acoustical Society of Japan, p. 817
Non-Patent Document 3: The Institute of Electronics, Information and Communication Engineers, "Auditory and Speech", 1973, pp. 370-371
Non-Patent Document 4: Kajita, Kobayashi, Takeda, Itakura, "Analysis of speech features contained in human speech-like noise", Journal of the Acoustical Society of Japan, May 1, 1997, 53(5), pp. 337-345

The present inventors recognized the following problems with the masking/shading techniques described above.
(I) Because a new sound unrelated to the original speech is radiated, it feels unnatural, and the masker cannot be said to be always optimal or maximally effective for the original speech.
(II) Noise, that is, the masking sound, can be heard even during so-called "silent periods" when no speech is being produced. It can therefore certainly raise the noise level of the room.
(III) Radiating a different sound (noise or music) unrelated to the conversation can cause considerable discomfort to the speaker, the conversation partner and other occupants of the room.
(IV) Because hearing distinguishes and separately recognizes sounds of different character, concealing speech information with noise or BGM has the fundamental problem that it rarely succeeds (speech waveforms with similar envelopes and spectra are harder to tell apart perceptually).

Regarding (I), experience shows that the relative noise level needed to mask the original speech completely is about 15 dB (see Non-Patent Document 3). From this viewpoint, concealing speech by playing noise or music requires noise or music considerably louder than the original speech, so whether the method is masking or shading, the room noise level can rise substantially.
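To get a feel for what a 15 dB masker margin does to the room, incoherent sound levels can be summed on an energy basis. The sketch below is illustrative only: the 60 dB speech level and the function name `combine_levels_db` are assumptions made for this example, not values taken from the cited documents.

```python
import math

def combine_levels_db(*levels_db):
    """Energy-sum several incoherent sound levels given in dB."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))

speech_db = 60.0                 # assumed speech level at the listener
masker_db = speech_db + 15.0     # masker 15 dB above the speech
total_db = combine_levels_db(speech_db, masker_db)

print(f"speech alone: {speech_db:.1f} dB, with masker: {total_db:.1f} dB")
```

With these assumed numbers the combined level is dominated by the masker, so the room ends up roughly 15 dB louder than with the speech alone.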

Regarding (II), hearing sound when nobody is speaking feels unnatural. Moreover, playing noise or music when there is no speech is wasteful from the standpoint of concealing the conversation. Besides being wasteful, it can raise the room's equivalent noise level (L_Aeq, the A-weighted equivalent sound level: the mean-square sound pressure level of the A-weighted sound signal over a given interval, in other words the average noise level). Even when music, or "HSL noise (human speech-like noise)" created from speech (see Non-Patent Document 4), is played instead of noise, it is difficult to distinguish from ordinary BGM.
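The equivalent level mentioned above is a mean-square average, so a sound that plays continuously counts for the whole interval. A minimal sketch, assuming the pressure samples have already been A-weighted upstream (no weighting filter is implemented here) and using the hypothetical helper name `equivalent_level_db`:

```python
import math

P_REF_PA = 20e-6  # standard reference sound pressure in air (20 micropascals)

def equivalent_level_db(pressures_pa):
    """Mean-square sound pressure level of a block of samples, in dB re 20 uPa.

    Assumes the samples were already A-weighted; no filter is applied here.
    """
    if not pressures_pa:
        raise ValueError("need at least one sample")
    mean_square = sum(p * p for p in pressures_pa) / len(pressures_pa)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / P_REF_PA ** 2)

# Illustrative amplitudes only: a sound that is always on versus one that is
# present for half of the interval.
always_on = equivalent_level_db([0.02] * 1000)
half_silent = equivalent_level_db([0.02] * 500 + [0.0] * 500)
```

In this toy comparison the sound that is silent half of the time contributes about 3 dB less to the equivalent level than the one that is always on, which is the sense in which a permanently running masker raises L_Aeq.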

Approach (3) is quite costly and impairs the sense of openness, so it is not suitable for open-plan offices and the like.

The sound masking system described in Patent Document 1 analyzes the speech rate of the input sound (voice), divides the input into frames of a corresponding length for processing, and synthesizes the processed speech. However, because this system "temporarily stores the input sound (voice) in units of about 2 seconds and performs a series of processes", the processed speech is generated from past speech, not from the speech it is meant to mask. The processed speech is therefore only weakly related to the speech being masked, and the masking effect cannot be said to be sufficient.

The present invention has been made in view of these problems, and its object is to provide a technique for concealing the content of speech while suppressing increases in noise level and listener discomfort.

One aspect of the present invention relates to a voice changing device. The voice changing device includes: a partial extraction unit that extracts a signal of a change-target portion from a voice signal representing a voice being uttered, based on the waveform of the voice signal; a partial change unit that changes the signal of the change-target portion extracted by the partial extraction unit; and an output unit that outputs the signal of the change-target portion changed by the partial change unit to voice output means capable of outputting voice to an area where the voice being uttered is heard.

According to this aspect, the portion of the voice signal to be changed can be determined based on the waveform of the voice signal.

Another aspect of the present invention is a confidential communication system for voice information. The system includes: sound collecting means that receives a voice being uttered and generates a voice signal representing it; a voice changing device that changes the voice signal generated by the sound collecting means; and voice output means that converts the voice signal changed by the voice changing device into voice and outputs it to an area where the voice being uttered is heard. The voice changing device includes: a partial extraction unit that extracts a signal of a change-target portion from the voice signal generated by the sound collecting means, based on the waveform of the voice signal; a partial change unit that changes the signal of the change-target portion extracted by the partial extraction unit; and an output unit that outputs the signal of the change-target portion changed by the partial change unit to the voice output means.

Any combination of the above components, and any conversion of the components or expressions of the present invention among devices, methods, systems, computer programs, recording media storing computer programs and the like, are also effective as aspects of the present invention.

According to the present invention, the content of speech can be concealed while suppressing increases in noise level and listener discomfort.

FIG. 1 is an explanatory diagram showing conventional approaches to masking and the approach according to the embodiment, divided into categories.
FIG. 2 is a perspective view schematically showing a booth provided with the confidential communication system for voice information according to the embodiment.
FIG. 3 is a block diagram schematically showing the functions and configuration of the confidential communication system of FIG. 2.
FIG. 4 is a side view showing the configuration of the IT partition of FIG. 2.
FIG. 5 is a block diagram showing the functions and configuration of the SD controller unit of FIG. 3.
FIG. 6 is a data structure diagram showing the consonant library of FIG. 5.
FIG. 7 is a waveform diagram showing the waveform of a voice signal representing an example of a maskee.
FIG. 8 is a waveform diagram showing the waveform of the voice signal generated by processing the voice signal of FIG. 7 in the consonant-only replacement mode in the SD controller unit of FIG. 5.
FIG. 9 is an explanatory diagram for explaining the criteria by which the first determination unit determines the signal of the change-target portion.
FIG. 10 is a waveform diagram showing the waveforms of voice signals representing the maskee and the time-rotated masker at the listener position.
FIG. 11 is a flowchart showing a series of processes in the confidential communication system of FIG. 2.
FIG. 12 is a block diagram schematically showing the functions and configuration of a confidential communication system according to a first modification.
FIG. 13 is a block diagram schematically showing the functions and configuration of a confidential communication system according to a second modification.

The present invention will now be described based on preferred embodiments with reference to the drawings. Identical or equivalent components, members and processes shown in the drawings are given the same reference numerals, and duplicate descriptions are omitted as appropriate.

Particularly in offices, it is desirable to conceal only the voice information, that is, the content of the speech, without impairing the openness and ease of communication of an open-plan space. However, conventional techniques using BGM or masking basically add a sound that differs in character from the original speech and is created by a separate process, with no connection to the original speech, so they tend to sound unnatural and to raise the room's background noise. The embodiment of the present invention changes the structure of the voice signal itself, collected by a microphone or the like, substantially in real time. It thereby conceals the content of the conversation, ideally only the content, without raising the room's background noise, and realizes a smooth and comfortable environment for confidential conversation.

FIG. 1 is an explanatory diagram showing conventional approaches to masking and the approach according to the embodiment, divided into categories. (a) is SR (sound reinforcement) / PA (public address) using electroacoustics. These are conventional techniques that raise volume and clarity to "make speech easier to hear". (f) is sound insulation, a conventional technique that separates spaces acoustically to "make speech inaudible" as far as possible. In contrast, the approach according to the embodiment is (e) SD (speech deformation): by processing the speaker's own original voice and outputting it in near real time, it acts as a kind of voice-information disturbance (auditory confusion) technique that makes the conversation "unintelligible" rather than deciding whether it is audible. Furthermore, whereas the conventional (b) EM, (c) SS and (d) IM more or less raise the noise level of the room or target space and can increase discomfort and unnaturalness, (e) SD involves almost no rise in noise level.

The embodiment of the present invention rests mainly on the inventors' recognition that the recognition and understanding of language, particularly Japanese, depends heavily on the consonant portions of speech. If the consonants change, for example, "cloud (KUMO)" becomes "RUTO" and can no longer be understood as a word.
In addition, human speech recognition (HSR) depends more strongly on articulation, such as envelope transitions, than on the carrier of the voice signal. Based on this, when roughly "one hump" of the envelope of the original speech is taken as the processing unit and is time-reversed or time-rotated, both the spectrum and the envelope shape remain similar to those of the original speech, so the disturbance of voice information works effectively.
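The two per-segment operations named above can be sketched on a list of samples. Time reversal is unambiguous; "time rotation" is not defined in this passage, so the circular shift below is only one plausible reading and should be treated as an assumption, as should the function names.

```python
def time_reverse(segment):
    """Play the segment backwards."""
    return segment[::-1]

def time_rotate(segment, shift):
    """Circularly shift the segment by `shift` samples.

    Assumption: 'time rotation' is interpreted here as a circular shift
    within one processing unit; the source does not define it at this point.
    """
    if not segment:
        return []
    k = shift % len(segment)
    return segment[k:] + segment[:k]

hump = [0.0, 0.2, 0.7, 1.0, 0.4, 0.1]   # toy samples standing in for one envelope hump
reversed_hump = time_reverse(hump)
rotated_hump = time_rotate(hump, 2)
```

Both operations only reorder samples within the unit, so the energy of the unit is unchanged; this is consistent with the statement that the processed sound stays similar to the original in overall character.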

Focusing on these aspects of speech recognition and understanding, the embodiment of the present invention, in one mode, changes, deletes or replaces the consonant portions of the original speech. Because mainly the consonant portions are processed, the rise in sound pressure level (volume) relative to the original speech is small. To further reduce the overall volume of the original speech (hereinafter, the maskee) plus the processed speech (hereinafter, the masker), the following measures can be used in combination.
(i) When generating the masker, replace the vowel portions with silence and output only the processed consonant portions at their original timing.
(ii) To enhance the masker's information-concealing effect, also use ANC (active noise control) or fixed-parameter PNC (passive noise control).
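Measure (i) can be sketched as follows, assuming the signal has already been split into labelled segments. The `(kind, samples)` representation, the function name and the padding/truncation rule are assumptions made for illustration; the source only states that vowels become silence and processed consonants keep their original timing.

```python
def build_consonant_only_masker(segments, process_consonant):
    """Build a masker in which vowels are silent and consonants are processed.

    segments: list of (kind, samples) with kind "consonant" or "vowel".
    process_consonant: callable that returns the changed consonant samples.
    """
    masker = []
    for kind, samples in segments:
        n = len(samples)
        if kind == "consonant":
            changed = list(process_consonant(samples))
            # Truncate or zero-pad so the segment keeps its original duration.
            changed = changed[:n] + [0.0] * (n - len(changed))
            masker.extend(changed)
        else:
            masker.extend([0.0] * n)   # vowel portion replaced with silence
    return masker

segments = [("consonant", [0.1, 0.2]), ("vowel", [0.5, 0.5, 0.5]), ("consonant", [0.3])]
masker = build_consonant_only_masker(segments, lambda s: s[::-1])
```

Because every segment keeps its length, the masker has the same total duration as the input and each processed consonant starts at the sample index where the original consonant started.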

FIG. 2 is a perspective view schematically showing a booth 2 provided with a confidential communication system 100 for voice information according to the embodiment. FIG. 3 is a block diagram schematically showing the functions and configuration of the confidential communication system 100 of FIG. 2.
The confidential communication system 100 is installed in a booth 2 divided off by simple partitions, such as a consultation counter at a bank. It includes a microphone Mic, an SD controller unit SD, two power amplifiers PA and two speakers SP. The speakers SP and the SD controller unit SD may be built into the IT partitions 4 that visually separate the booths.

The speaker here is a customer 6 who is talking with a consultant. The speaker's maskee H'(t) is picked up by the microphone Mic installed at or near the counter. The maskee H'(t) picked up by the microphone Mic is converted into a voice signal and sent to the SD controller unit SD, which changes, deletes, replaces, or temporally reverses/rotates it. The processed voice signal passes through the power amplifiers PA and is output from the speakers SP to the adjacent booths 2' on the left and right as the masker H(t).

Because the maskee H'(t) travels around through the air into the adjacent booth 2', the voice being uttered by the customer 6 could be heard by a listener 8 (a person other than the customer 6) in the adjacent booth 2'. In this embodiment, however, the maskee H'(t) that leaks around through the air is combined with the masker H(t) before it reaches the listener 8 in the adjacent booth 2'. The disturbance caused by the masker H(t) therefore prevents the listener 8 from understanding the conversation contained in the maskee H'(t).

The speaker SP outputs the masker H(t) toward the adjacent booth 2' next to the booth 2 in which the SD controller unit SD and the microphone Mic are installed. The adjacent booth 2' is the area in which the maskee H'(t) leaking around through the air is heard. That is, the masker H(t) is output from the speaker SP so that the maskee H'(t) and the masker H(t) reach the listener 8 substantially in real time. Either the SD controller unit SD or the speaker SP may be responsible for guaranteeing this real-time behavior; the following describes the case where the SD controller unit SD processes the voice signal with the real-time relation between the maskee H'(t) and the masker H(t) taken into account.
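One way to read the real-time requirement is that the masker should arrive at the listener together with the leaked maskee. The sketch below computes the extra delay that would line the two up. It is an illustrative assumption, not the method of the source: the path lengths, the processing latency, the speed of sound value and the function name are all hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature

def masker_delay_s(leak_path_m, speaker_path_m, processing_latency_s):
    """Extra delay to add to the masker so it arrives with the leaked maskee.

    leak_path_m: acoustic path from the talker around the partition to the listener.
    speaker_path_m: acoustic path from the loudspeaker to the listener.
    processing_latency_s: latency of capture, processing and playback.
    Returns 0.0 when the masker path is already slower than the leak path.
    """
    direct_s = leak_path_m / SPEED_OF_SOUND_M_S
    via_speaker_s = processing_latency_s + speaker_path_m / SPEED_OF_SOUND_M_S
    return max(0.0, direct_s - via_speaker_s)

# Hypothetical geometry: 3.43 m leak path, 0.343 m speaker path, 5 ms latency.
delay = masker_delay_s(3.43, 0.343, 0.005)
```

When the function returns 0.0 the masker is already late relative to the leaked speech, and the remaining option under these assumptions is to cut processing latency rather than to add delay.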

FIG. 4 is a side view showing the configuration of the IT partition 4 of FIG. 2. The IT partition 4 has a laminated structure in which a first sound-absorbing layer 42, a sound-insulating layer 44 and a second sound-absorbing layer 46 are stacked in this order. The first sound-absorbing layer 42 and the second sound-absorbing layer 46 are each a 20 mm thick layer of glass wool. The sound-insulating layer 44 is a 12 mm thick gypsum board.

FIG. 5 is a block diagram showing the functions and configuration of the SD controller unit SD of FIG. 3. Each block shown here can be implemented in hardware by elements such as a computer's CPU (central processing unit) and by mechanical devices, and in software by a computer program or the like; what is drawn here are functional blocks realized by their cooperation. Those skilled in the art who have read this specification will understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The SD controller unit SD includes a storage device 10, an A/D unit 20, a partial extraction unit 30, a partial change unit 90, an output unit 72, a noise generation unit 80, a consonant library update unit 82 and a vowel library update unit 84. The storage device 10 includes a consonant library 12, a vowel library 14 and a common library 16. The partial extraction unit 30 has a phoneme extraction unit 38, an approximate one-hump extraction unit 52 and a random extraction unit 60. The phoneme extraction unit 38 has a voice discrimination unit 36, a consonant extraction unit 32 and a vowel extraction unit 34. The approximate one-hump extraction unit 52 has a squared sound pressure acquisition unit 54, a low-pass filter 56 and a first determination unit 58. The random extraction unit 60 has a signal division unit 62 and a second determination unit 64. The partial change unit 90 has a consonant processing unit 40, a vowel processing unit 50 and a time processing unit 66. The output unit 72 has a delay adjustment unit 68 and a D/A unit 70.
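The chain of units 54, 56 and 58 (squared sound pressure, low-pass filter, determination) suggests an envelope-based segmentation. The sketch below is a guess at that pipeline under stated assumptions: a moving average stands in for the low-pass filter, and cutting at local minima of the envelope stands in for the determination criteria, which are not given in this part of the text. All function names are hypothetical.

```python
def smoothed_energy(samples, window):
    """Square each sample, then smooth with a causal moving average.

    The moving average is a crude stand-in for a low-pass filter.
    """
    squared = [s * s for s in samples]
    envelope = []
    acc = 0.0
    for i, value in enumerate(squared):
        acc += value
        if i >= window:
            acc -= squared[i - window]
        envelope.append(acc / min(i + 1, window))
    return envelope

def split_at_valleys(envelope):
    """Return (start, end) index pairs, cutting at local minima of the envelope."""
    cuts = [0]
    for i in range(1, len(envelope) - 1):
        if envelope[i] < envelope[i - 1] and envelope[i] <= envelope[i + 1]:
            cuts.append(i)
    cuts.append(len(envelope))
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]

toy_envelope = [0.0, 1.0, 2.0, 1.0, 0.5, 1.0, 3.0, 1.0, 0.0]
humps = split_at_valleys(toy_envelope)
```

Each returned index pair would then delimit roughly one hump of the envelope, which is the unit that the time processing operates on in this reading.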

The consonant library 12 stores waveform data for each type of consonant portion. The vowel library 14 stores waveform data for each type of vowel portion. The common library 16 stores predetermined sample waveform data for each type of consonant portion. The sample waveform data of consonant portions stored in the common library 16 are classified by speaker category, such as male, female, child and adult.

The partial extraction unit 30 extracts the signal of the change-target portion from the voice signal A/D-converted by the A/D unit 20, based on the waveform of that voice signal. The partial change unit 90 changes the signal of the change-target portion extracted by the partial extraction unit 30. The output unit 72 D/A-converts the signal of the change-target portion changed by the partial change unit 90 and outputs it to the speaker SP.

The SD controller unit SD has at least three operation modes: a consonant-only replacement mode, a consonant-and-vowel replacement mode, and a real-time mode. The functions of the blocks involved are described below for each operation mode.

(1) Consonant-only replacement mode
The maskee H'(t) picked up by the microphone Mic is converted into a voice signal, which is input to the A/D unit 20 via a microphone amplifier (not shown). The A/D unit 20 converts the analog voice signal into a digital signal. The voice discrimination unit 36 compares the waveform of the voice signal digitized by the A/D unit 20 with past utterance waveforms to discriminate between the consonant portions and the vowel portions of the voice signal. The consonant extraction unit 32 uses the discrimination result to extract the signals of the consonant portions.

The consonant library update unit 82 accumulates the waveform data of the consonant signals extracted by the consonant extraction unit 32 in the consonant library 12, sorted by type. The consonant portions are classified on the basis of their duration, spectrum, statistical processing and the like. Through this sequential processing, the waveform data of consonant signals accumulated in the consonant library 12 are gradually replaced with more accurate data as the conversation proceeds.
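A per-speaker library that fills up during the conversation could be sketched as below. The classification step itself (by duration, spectrum and statistics) is not implemented; the consonant type is passed in as a label. The class name, the per-type capacity and the "keep the most recent entries" policy are assumptions, since the source only says the stored data are gradually replaced by more accurate data.

```python
from collections import defaultdict

class ConsonantLibrary:
    """Waveforms of extracted consonants, grouped by consonant type."""

    def __init__(self, max_per_type=32):
        self._store = defaultdict(list)
        self.max_per_type = max_per_type   # hypothetical capacity limit

    def add(self, kind, waveform):
        """Store a waveform under its type, dropping the oldest when full."""
        bucket = self._store[kind]
        bucket.append(list(waveform))
        if len(bucket) > self.max_per_type:
            bucket.pop(0)

    def entries(self):
        """Return all stored (type, waveform) pairs."""
        return [(kind, w) for kind, waves in self._store.items() for w in waves]

    def __len__(self):
        return sum(len(waves) for waves in self._store.values())

library = ConsonantLibrary(max_per_type=2)
for kind, wave in [("s", [0.1]), ("s", [0.2]), ("s", [0.3]), ("t", [0.9])]:
    library.add(kind, wave)
```

Keeping only recent entries is a simple proxy for "more accurate"; a real system would more likely rank entries by how reliably they were classified.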

Based on the consonant signal extracted by the consonant extraction unit 32, the noise generation unit 80 generates a sound whose spectrum either overlaps with or differs from that of the consonant signal.

The consonant processing unit 40 processes the consonant signals that the consonant extraction unit 32 has extracted from the voice signal. It replaces each extracted consonant signal with the signal of a different consonant portion of approximately the same length, selected from the consonant library 12. When there are several replacement candidates, the consonant processing unit 40 chooses among them at random so that each combination is approximately equally likely. Consonant portions differ in length: for example, the consonant corresponding to "s" lasts relatively long, whereas those corresponding to "t" and "p" are short.
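The selection rule described above (a different consonant of about the same length, chosen uniformly at random) can be sketched as follows. The 20 % length tolerance, the list-of-pairs library format and the function name are assumptions for illustration.

```python
import random

def pick_replacement(extracted, extracted_kind, library, tolerance=0.2, rng=random):
    """Pick a different consonant of similar length, uniformly at random.

    library: list of (kind, waveform) pairs.
    tolerance: allowed relative length difference (hypothetical value).
    Returns None when no candidate qualifies.
    """
    n = len(extracted)
    candidates = [
        wave for kind, wave in library
        if kind != extracted_kind and abs(len(wave) - n) <= tolerance * n
    ]
    if not candidates:
        return None
    return rng.choice(candidates)   # each candidate is equally likely

toy_library = [
    ("s", [0.1] * 100),   # long consonant, excluded by the length test
    ("t", [0.2] * 20),    # same kind as the input, excluded
    ("p", [0.3] * 22),
    ("k", [0.4] * 21),
]
chosen = pick_replacement([0.0] * 20, "t", toy_library, rng=random.Random(0))
```

Returning `None` leaves the caller free to fall back to another strategy, such as noise substitution or deletion, when the library has no suitable entry.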

Instead of replacing the consonant signal using the consonant library 12, the consonant processing unit 40 may replace the consonant signal extracted by the consonant extraction unit 32 with consonant noise generated by the noise generation unit 80. In this case, the randomness of the sound formed by combining the maskee H'(t) and the masker H(t) increases further. Alternatively, instead of replacing the consonant signal using the consonant library 12, the consonant processing unit 40 may delete the consonant signal extracted by the consonant extraction unit 32.

発話開始から数秒〜数十秒程度（以下、発話開始期間と称す）は、子音ライブラリ１２に発話者本人の音声から採取した子音部分が十分に蓄積されていない可能性がある。そこでこの発話開始期間の間は、子音処理部４０は共通ライブラリ１６から対応する子音部分の信号を選出して子音抽出部３２によって抽出された子音部分の信号と置換する。あるいはまた、発話開始期間の間、子音処理部４０は子音抽出部３２によって抽出された子音部分の信号をノイズ生成部８０によって生成された子音ノイズと置換する。あるいはまた、発話開始期間の間、子音処理部４０は子音抽出部３２によって抽出された子音部分の信号を時間方向に反転する。 There is a possibility that the consonant portion collected from the voice of the utterer is not sufficiently accumulated in the consonant library 12 for several seconds to several tens of seconds (hereinafter referred to as an utterance start period) from the start of the utterance. Therefore, during this utterance start period, the consonant processing unit 40 selects a corresponding consonant part signal from the common library 16 and replaces it with the consonant part signal extracted by the consonant extraction unit 32. Alternatively, the consonant processing unit 40 replaces the consonant part signal extracted by the consonant extraction unit 32 with the consonant noise generated by the noise generation unit 80 during the utterance start period. Alternatively, during the utterance start period, the consonant processing unit 40 inverts the signal of the consonant part extracted by the consonant extraction unit 32 in the time direction.

発話開始期間の間に用いられるこれらの子音部分変更アルゴリズムでは、発話者本人の子音ライブラリ１２を使用する場合よりも自然さにおいて劣る。しかしながら発話開始後の短い時間だけなのでそれほど問題とはならない。 These consonant partial modification algorithms used during the utterance start period are less natural than using the consonant library 12 of the speaker himself. However, since it is only a short time after the start of utterance, it does not matter so much.

Ｄ／Ａ部７０は子音処理部４０において処理された音声信号を、スピーカＳＰを駆動するためのアナログの音声信号に変換してパワーアンプＰＡに出力する。Ｄ／Ａ部７０は特に、子音処理部４０によって置換された子音部分の信号と、その子音部分に対応する変更されていない母音部分の信号とを含む音声信号をアナログ信号に変換して出力する。 The D / A unit 70 converts the audio signal processed by the consonant processing unit 40 into an analog audio signal for driving the speaker SP and outputs the analog audio signal to the power amplifier PA. In particular, the D / A unit 70 converts an audio signal including a signal of the consonant part replaced by the consonant processing unit 40 and an unmodified vowel part signal corresponding to the consonant part into an analog signal and outputs the analog signal. .

なお、マスキーH'(t)をマイクロホンＭｉｃで集音してからＳＤコントローラ部ＳＤで処理しスピーカＳＰから対応するマスカーH(t)を出力するまでの時間、つまりＳＤ処理時間Ｔ_ＳＤは、Ｔ＋ｔ以内とされる。ここでＴはマスキーH'(t)が発せられた時点からそれが受聴者８に届くまでの時間であり、ｔはマスキーH'(t)とマスカーH(t)が受聴者８位置において顕著なエコーを発生させないような遅れ時間、もしくは受聴者８に届く合成音声が受聴者８にとって理解不能となる最大の遅れ時間である。ｔの具体的な値は実験により定められるが、代表的には数１００ｍｓ程度である。 The time from when the maskee H'(t) is picked up by the microphone Mic, through processing in the SD controller unit SD, until the corresponding masker H(t) is output from the speaker SP, that is, the SD processing time T_SD, is kept within T + t. Here, T is the time from when the maskee H'(t) is uttered until it reaches the listener 8. The value t is either a delay short enough that the maskee H'(t) and the masker H(t) do not produce a noticeable echo at the position of the listener 8, or the maximum delay at which the combined sound reaching the listener 8 remains unintelligible to the listener 8. The specific value of t is determined experimentally, but is typically on the order of several hundred ms.

マスキーH'(t)とマスカーH(t)とを受聴者８位置で合成して情報隠蔽を行うためには上述の通りＳＤコントローラ部ＳＤでのＳＤ処理を実時間もしくは準実時間で行わなければならない。この時間的な制約の存在、つまりＳＤ処理時間Ｔ_ＳＤを短い時間であるＴ＋ｔ以下としなければならないこと、により、子音部分の信号の抽出及び置換・反転などの処理の精度を犠牲にしなければならない場合もある。しかしながら本実施の形態の目的は音声の明瞭度・了解度の低減にあり、想定／予定した処理自体の正確さが目的ではない。したがって本実施の形態では、マスカーH(t)の重畳によりマスキーH'(t)の意味内容が理解し難くなるという条件が満たされれば処理の精度は大きな問題とはならない。これは「意味内容が理解し難くなるという条件」は無数にあるからである。 To conceal information by combining the maskee H'(t) and the masker H(t) at the position of the listener 8, the SD processing in the SD controller unit SD must be performed in real time or near real time, as described above. Because of this time constraint, namely that the SD processing time T_SD must not exceed the short time T + t, the accuracy of processing such as extraction, replacement, and inversion of consonant-part signals may have to be sacrificed. However, the purpose of this embodiment is to reduce the clarity and intelligibility of speech, not to carry out the intended processing accurately for its own sake. Therefore, in this embodiment, processing accuracy is not a major concern as long as superimposing the masker H(t) makes the semantic content of the maskee H'(t) difficult to understand. This is because there are countless ways to satisfy the condition that the semantic content becomes difficult to understand.

（２）母音置換モード
上述の子音部分の変更に加えて、母音部分も変更するモードである。母音抽出部３４は、子音抽出部３２で子音部分の信号が抽出された音声信号から母音部分の信号を抽出する。 (2) Vowel replacement mode In this mode, the vowel part is also changed in addition to the above-described change of the consonant part. The vowel extraction unit 34 extracts the vowel part signal from the voice signal from which the consonant part signal is extracted by the consonant extraction unit 32.

母音ライブラリ更新部８４は、母音抽出部３４によって抽出された母音部分の信号の波形データをその種類ごとに母音ライブラリ１４に蓄積する。ここで母音部分の分類はその継続時間・スペクトル・統計処理などから行われる。このように母音ライブラリ１４に蓄積される母音部分の信号の波形データは、逐次処理によって会話開始から徐々に精度の高いものに置換されてゆく。 The vowel library updating unit 84 stores the waveform data of the vowel part signal extracted by the vowel extraction unit 34 in the vowel library 14 for each type. Here, the vowel part is classified based on its duration, spectrum, statistical processing, and the like. In this way, the waveform data of the signal of the vowel part stored in the vowel library 14 is gradually replaced with one having higher accuracy from the start of the conversation by sequential processing.

ノイズ生成部８０は、母音抽出部３４で抽出された母音部分の信号を基に、それとスペクトルが類似する母音ノイズを生成する。 The noise generation unit 80 generates vowel noise having a spectrum similar to that of the vowel part signal extracted by the vowel extraction unit 34.

母音処理部５０は、子音処理部４０において子音部分の信号が処理された後の音声信号のうち、母音抽出部３４で抽出された母音部分の信号を処理する。特に騒音レベルの上昇を極力抑える必要がある場合には、母音処理部５０は母音抽出部３４で抽出された母音部分を無音部分に置換する。この場合、Ｄ／Ａ部７０、スピーカＳＰを経て出力されるマスカーH(t)は子音部分と子音部分とに挟まれた無音部分を有する構成となる。つまりマスカーH(t)の子音部分は同期するマスキーH'(t)の母音部分と連結してひとつの音韻を構成することとなる。これにより全体の音量はマスカーH(t)で無音とした母音部分の分だけ低減され、室内の騒音レベルも低減される。 The vowel processing unit 50 processes the vowel part signal extracted by the vowel extraction unit 34 from the audio signal whose consonant parts have already been processed by the consonant processing unit 40. In particular, when an increase in the noise level must be suppressed as far as possible, the vowel processing unit 50 replaces the vowel part extracted by the vowel extraction unit 34 with a silent part. In this case, the masker H(t) output via the D/A unit 70 and the speaker SP has silent parts sandwiched between consonant parts. That is, each consonant part of the masker H(t) joins with the vowel part of the synchronized maskee H'(t) to form one phonological unit. As a result, the overall sound volume is reduced by the amount of the vowel parts silenced in the masker H(t), and the noise level in the room is reduced as well.

なお、母音処理部５０は、母音部分を無音部分で置き換える代わりに、ライブラリベースの置換を行ってもよい。つまり、母音処理部５０は、母音抽出部３４によって抽出された母音部分の信号を母音ライブラリ１４から選出した別の母音部分の信号に置換してもよい。母音処理部５０は、置換の候補が複数ある場合は、ランダムに、かつ各組み合わせが略等確率となるように置換する。発話開始期間における母音部分変更アルゴリズムについては子音部分のそれと同様である。 The vowel processing unit 50 may perform library-based replacement instead of replacing the vowel part with a silent part. In other words, the vowel processing unit 50 may replace the signal of the vowel part extracted by the vowel extraction unit 34 with the signal of another vowel part selected from the vowel library 14. When there are a plurality of replacement candidates, the vowel processing unit 50 performs replacement so that each combination has a substantially equal probability. The vowel part changing algorithm in the utterance start period is the same as that of the consonant part.

または、母音処理部５０は、母音部分を無音部分で置き換える代わりに、母音処理部５０によって抽出された母音部分の信号をノイズ生成部８０によって生成された母音ノイズと置換してもよい。この場合、やはりマスキーH'(t)とマスカーH(t)との合成音声の無作為性がより増大する。 Alternatively, the vowel processing unit 50 may replace the vowel part signal extracted by the vowel processing unit 50 with the vowel noise generated by the noise generation unit 80 instead of replacing the vowel part with a silent part. In this case, the randomness of the synthesized speech of the maskee H ′ (t) and the masker H (t) is further increased.

また、子音母音の処理の順番、つまり子音処理部４０における処理と母音処理部５０における処理の順番を入れ替えてもよい。 Further, the order of processing of consonant vowels, that is, the order of processing in the consonant processing unit 40 and processing in the vowel processing unit 50 may be switched.

図６は、子音ライブラリ１２を示すデータ構造図である。子音ライブラリ１２は、音素としての子音１１２とその子音の波形データ１１４とを対応付けて記憶する。母音ライブラリ１４および共通ライブラリ１６もまた子音ライブラリ１２と同様のデータ構造を有する。 FIG. 6 is a data structure diagram showing the consonant library 12. The consonant library 12 stores a consonant 112 as a phoneme and waveform data 114 of the consonant in association with each other. The vowel library 14 and the common library 16 also have the same data structure as the consonant library 12.
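A minimal sketch of such a library, assuming it is held in memory as a mapping from a phoneme label to waveform data. The class name `PhonemeLibrary`, its method names, and the choice to keep several waveforms per phoneme are hypothetical. The specification only states that a phoneme and its waveform data are stored in association with each other.

```python
from collections import defaultdict


class PhonemeLibrary:
    """Associates a phoneme label with waveform data collected for it."""

    def __init__(self):
        # phoneme label -> list of waveforms (each a list of samples)
        self._waveforms = defaultdict(list)

    def add(self, phoneme, waveform):
        """Store a copy of one waveform under the given phoneme."""
        self._waveforms[phoneme].append(list(waveform))

    def candidates(self, phoneme):
        """Return all waveforms stored for the phoneme (empty if none)."""
        return list(self._waveforms.get(phoneme, []))

    def phonemes(self):
        """Return the phoneme labels that currently have data."""
        return sorted(self._waveforms)
```

The same structure could serve for the consonant library 12, the vowel library 14, and the common library 16, since the text says all three share one data structure.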

図７は、マスキーH'(t)の一例を表す音声信号の波形を示す波形図である。図７の波形は「あの、彼とはそうと（う）長いんだよね、実は（ANO KARETOWA SO-TONAGAINDAYONE ZITSUWA）」という原音声をマイクロホンＭｉｃで音声信号に変換したものである。図７の縦軸は信号強度を任意の単位で表し、横軸は時間を表す。図７において縦の破線で区画された領域ひとつひとつが音素に対応し、対応する音素がローマ字で明示されている。また、「-」は音声休止部を表す。包絡線１０２は実線で示される。ここで包絡線は音声サンプルを自乗音圧領域で数１０ｍｓｅｃの時定数をかけ平方根をとったものである。 FIG. 7 is a waveform diagram showing the waveform of an audio signal representing an example of the maskee H'(t). The waveform in FIG. 7 is the original utterance "ANO KARETOWA SO-TONAGAINDAYONE ZITSUWA" converted into an audio signal by the microphone Mic. The vertical axis in FIG. 7 represents signal intensity in arbitrary units, and the horizontal axis represents time. In FIG. 7, each region delimited by vertical broken lines corresponds to one phoneme, and the corresponding phoneme is indicated in Roman letters. The symbol "-" represents a pause in the speech. The envelope 102 is shown as a solid line. Here, the envelope is obtained by smoothing the speech samples in the squared-sound-pressure domain with a time constant of several tens of msec and then taking the square root.

図７における母音、子音、無音の別を表１に示す。音声開始前のある時刻を時刻の原点（ｔ＝０）として定める。 Table 1 shows vowels, consonants, and silences in FIG. A certain time before the start of voice is defined as the time origin (t = 0).

なお、子音、母音、無音の別は、エネルギやゼロ交差数、ＰＡＲＣＯＲ（PARtial auto-CORrelation）の第１係数（スペクトル傾斜）などにより判別することが可能である。

The distinction between consonants, vowels, and silence can be determined by energy, the number of zero crossings, the first coefficient (spectral slope) of PARCOR (PARtial auto-CORrelation), and the like.
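A rough sketch of such a discrimination using only two of the features named above, frame energy and the number of zero crossings. The function names and both thresholds are assumptions chosen for illustration. The heuristic that a high zero-crossing rate indicates a consonant is a common simplification rather than the method of the specification, and it would misclassify many voiced consonants.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def classify_frame(frame, silence_energy=1e-4, zcr_threshold=0.25):
    """Label one frame as 'silence', 'consonant', or 'vowel'.

    frame: list of at least two samples.
    silence_energy and zcr_threshold are assumed, untuned values.
    """
    energy = sum(x * x for x in frame) / len(frame)
    if energy < silence_energy:
        return "silence"
    # Fraction of sample-to-sample transitions that change sign.
    rate = zero_crossings(frame) / (len(frame) - 1)
    return "consonant" if rate > zcr_threshold else "vowel"
```

A practical system would combine further features, such as the first PARCOR coefficient mentioned above, and tune the thresholds on recorded speech.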

図８は、図７の音声信号をＳＤコントローラ部ＳＤにおいて子音のみ置換モードで処理することで生成される音声信号の波形を示す波形図である。区画１０４で示される子音部が置換された子音部である。これらの置換に際し切り出し時間長や再挿入時レベル(ｄＢ)を調整している。
置換後の包絡線１０６は実線で示される。図７の包絡線１０２と図８の包絡線１０６とを比較するとそれ程変化していないことが分かる。つまり音声のイントネーションや抑揚にそれ程変化はない。しかしながら図８の音声信号がスピーカＳＰで音声に変換され、マスカーH(t)として出力されると、受聴者８サイトではマスキーH'(t)とマスカーH(t)とが合成されて聞こえ、その意味内容は理解されにくくなる。つまり「わからない」となることが多い（他の音に聞こえる場合もある）。 FIG. 8 is a waveform diagram showing a waveform of an audio signal generated by processing the audio signal of FIG. 7 in the consonant only replacement mode in the SD controller unit SD. This is a consonant part in which the consonant part indicated by the section 104 is replaced. For these replacements, the cut-out time length and re-insertion level (dB) are adjusted.
The envelope 106 after replacement is shown as a solid line. Comparing the envelope 102 in FIG. 7 with the envelope 106 in FIG. 8 shows that it has changed little; that is, the intonation and inflection of the speech are largely preserved. However, when the audio signal of FIG. 8 is converted into sound by the speaker SP and output as the masker H(t), the listener 8 hears the maskee H'(t) and the masker H(t) combined, and the semantic content becomes difficult to understand. In most cases the listener cannot make out what is being said (in some cases it is heard as different sounds).

図５に戻る。
（３）実時間モード
マイクロホンＭｉｃにより集音されたマスキーH'(t)は音声信号に変換され、該音声信号はマイクアンプ（不図示）を経てＡ／Ｄ部２０に入力される。Ａ／Ｄ部２０は、アナログ信号である音声信号をデジタルデータに変換する。Ａ／Ｄ部２０でデジタル化された音声信号は、例えば音圧の大きさに応じた電圧値が時刻と対応付けられたデジタルデータである。 Returning to FIG. 5.
(3) Real-time mode The maskee H'(t) picked up by the microphone Mic is converted into an audio signal, and the audio signal is input to the A/D unit 20 via a microphone amplifier (not shown). The A/D unit 20 converts the audio signal, which is an analog signal, into digital data. The audio signal digitized by the A/D unit 20 is, for example, digital data in which a voltage value corresponding to the magnitude of the sound pressure is associated with a time.

部分抽出部３０は、Ａ／Ｄ部２０でデジタル化された音声信号から変更対象部分の信号を抽出する。部分抽出部３０は、変更対象部分の信号として子音部分の信号を抽出してもよい。あるいはまた、部分抽出部３０は、変更対象部分の信号として母音部分の信号を抽出してもよい。子音部分および母音部分の抽出については上述の通りである。 The partial extraction unit 30 extracts the signal of the change target portion from the audio signal digitized by the A / D unit 20. The partial extraction unit 30 may extract a consonant part signal as a change target part signal. Or the partial extraction part 30 may extract the signal of a vowel part as a signal of a change object part. The extraction of the consonant part and the vowel part is as described above.

あるいはまた、部分抽出部３０は、変更対象部分の信号として音声信号の包絡線の形状に基づいて決定されたひとまとまりの信号を抽出してもよい。あるいはまた、部分抽出部３０は、音声信号をランダムな長さを有する期間で分割し、分割後の１区間に対応する信号を変更対象部分の信号として抽出してもよい。 Alternatively, the partial extraction unit 30 may extract a group of signals determined based on the shape of the envelope of the audio signal as the signal of the change target portion. Alternatively, the partial extraction unit 30 may divide the audio signal by a period having a random length and extract a signal corresponding to one section after the division as a signal of the change target part.

部分抽出部３０が変更対象部分の信号として音声信号の包絡線の形状に基づいて決定されたひとまとまりの信号を抽出する場合を説明する。略１山抽出部５２は、音声信号の包絡線を示すデータを取得する。このデータは、例えば包絡線の大きさに応じた電圧値が時刻と対応付けられたデジタルデータである。以下、包絡線を示すデータを単に包絡線と称す。 A case will be described in which the partial extraction unit 30 extracts a group of signals determined based on the shape of the envelope of the audio signal as the signal of the change target portion. The approximately one mountain extracting unit 52 acquires data indicating the envelope of the audio signal. This data is digital data in which a voltage value corresponding to the size of the envelope is associated with time, for example. Hereinafter, data indicating an envelope is simply referred to as an envelope.

自乗音圧取得部５４は、Ａ／Ｄ部２０でデジタル化された音声信号の自乗音圧波形を取得する。自乗音圧取得部５４は、音声信号を自乗し、必要に応じて所定の係数を乗ずることにより自乗音圧波形を得る。 The squared sound pressure acquisition unit 54 acquires the squared sound pressure waveform of the audio signal digitized by the A / D unit 20. The squared sound pressure acquisition unit 54 squares the audio signal and obtains a squared sound pressure waveform by multiplying by a predetermined coefficient as necessary.

ローパスフィルタ５６は、自乗音圧取得部５４によって取得された自乗音圧波形を数ｍｓｅｃから数１００ｍｓｅｃの時定数で平均化する。すなわちローパスフィルタ５６は自乗音圧波形に対してローパスフィルタ処理をする。これにより、自乗音圧波形から時定数程度よりも速い変化が取り除かれ、滑らかな波形が得られる。本実施の形態では、この滑らかな波形が音声信号の包絡線である。なお、他の方法で音声信号の包絡線を求めてもよいことは、本明細書に触れた当業者には理解される。また、本実施の形態において包絡線は、広義には音声信号の平均エネルギ（振幅）の変化を示すデータである。
ローパスフィルタ５６は、必要であればローパスフィルタ処理されたデータの平方根をとる。 The low pass filter 56 averages the squared sound pressure waveform acquired by the squared sound pressure acquisition unit 54 with a time constant of several msec to several hundred msec. That is, the low pass filter 56 performs low pass filter processing on the squared sound pressure waveform. Thereby, a change faster than the time constant is removed from the squared sound pressure waveform, and a smooth waveform is obtained. In this embodiment, this smooth waveform is the envelope of the audio signal. It should be understood by those skilled in the art who have touched this specification that the envelope of the audio signal may be obtained by other methods. In the present embodiment, the envelope is data indicating a change in average energy (amplitude) of the audio signal in a broad sense.
The low pass filter 56 takes the square root of the low pass filtered data if necessary.
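The squaring, low-pass filtering, and square-root steps can be sketched as below. The use of a one-pole recursive filter and the default time constant of 20 msec are assumptions. The specification only requires averaging with a time constant of several msec to several hundred msec and notes that other ways of obtaining the envelope are possible.

```python
import math


def envelope(samples, sample_rate_hz, time_constant_s=0.02):
    """Return the envelope of an audio signal.

    Each sample is squared, smoothed by a one-pole low-pass filter with
    the given time constant, and then the square root is taken.
    """
    # Smoothing coefficient of a one-pole filter for this time constant.
    alpha = 1.0 - math.exp(-1.0 / (sample_rate_hz * time_constant_s))
    smoothed = 0.0
    result = []
    for x in samples:
        # Move the running average of the squared signal toward x * x.
        smoothed += alpha * (x * x - smoothed)
        result.append(math.sqrt(smoothed))
    return result
```

A longer time constant gives a smoother envelope that follows syllable-level changes, while a shorter one tracks individual bursts more closely.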

第１決定部５８は、ローパスフィルタ５６によって得られた音声信号の包絡線のうち、数ｄＢ〜数１０ｄＢ、例えば５ｄＢ以上連続して上昇する上昇部分を検出する。次に第１決定部５８は、上昇部分の後で数ｄＢ〜数１０ｄＢ、例えば５ｄＢ以上連続して下降する下降部分を検出する。第１決定部５８は、上昇部分とそれに対応する下降部分との間の音声信号を変更対象部分の信号として決定する。このようにして決定される変更対象部分の信号の包絡線は略１山状となることが多い。 In the envelope of the audio signal obtained by the low-pass filter 56, the first determination unit 58 detects a rising portion that rises continuously by several dB to several tens of dB, for example by 5 dB or more. Next, the first determination unit 58 detects a falling portion that, after the rising portion, falls continuously by several dB to several tens of dB, for example by 5 dB or more. The first determination unit 58 determines the audio signal between the rising portion and the corresponding falling portion as the signal of the change target portion. The envelope of a signal determined in this way often has the shape of approximately one mountain, that is, a single hump.

図９は、第１決定部５８における変更対象部分の信号の決定基準を説明するための説明図である。図９（ａ）は、第１決定部５８において上昇部分と下降部分の検出に基づいて変更対象部分の信号が決定される場合を説明するための説明図である。図９（ａ）は、例示としての音声信号の波形２１１とその包絡線２０８とを示す。第１決定部５８は、包絡線２０８の変化率に基づき上昇部分２０２を検出する。次に第１決定部５８は上昇部分２０２の後の下降部分２０４を検出する。第１決定部５８は、上昇部分２０２と下降部分２０４とで挟まれる区間２０６（ピーク２０３より前の時刻ｔ１とピーク２０３より後の時刻ｔ２とで挟まれる区間）の音声信号を変更対象部分の信号として決定する。 FIG. 9 is an explanatory diagram for explaining a determination criterion for the signal of the change target portion in the first determination unit 58. FIG. 9A is an explanatory diagram for explaining a case where the signal of the change target portion is determined based on the detection of the rising portion and the falling portion in the first determination unit 58. FIG. 9A shows an exemplary audio signal waveform 211 and its envelope 208. The first determination unit 58 detects the rising portion 202 based on the rate of change of the envelope 208. Next, the first determination unit 58 detects the descending portion 204 after the ascending portion 202. The first determination unit 58 uses the audio signal of the section 206 (the section sandwiched between the time t1 before the peak 203 and the time t2 after the peak 203) sandwiched between the rising portion 202 and the descending portion 204 as the change target portion. Determine as a signal.
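One way to sketch the rise-then-fall detection is shown below. The function names are hypothetical, and taking the section from the start of the rise to the end of the fall is one possible reading of "the section sandwiched between the rising portion and the falling portion"; the specification leaves the exact positions of t1 and t2 open.

```python
import math


def to_db(value, floor=1e-9):
    """Convert an envelope amplitude to decibels, avoiding log(0)."""
    return 20.0 * math.log10(max(value, floor))


def find_humps(envelope, min_change_db=5.0):
    """Return (start, end) index pairs of single-hump sections.

    A hump is a run of non-decreasing envelope values that gains at least
    min_change_db, followed by a run of non-increasing values that loses
    at least min_change_db.
    """
    db = [to_db(v) for v in envelope]
    humps = []
    i, n = 0, len(db)
    while i < n - 1:
        start = i
        while i < n - 1 and db[i + 1] >= db[i]:  # follow the rise
            i += 1
        peak = i
        while i < n - 1 and db[i + 1] <= db[i]:  # follow the fall
            i += 1
        end = i
        rise, fall = db[peak] - db[start], db[peak] - db[end]
        if rise >= min_change_db and fall >= min_change_db:
            humps.append((start, end))
    return humps
```

Small ripples of less than 5 dB are ignored, so only envelope movements of roughly syllable size are reported.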

なお、第１決定部５８は、他の方法で変更対象部分の信号を決定してもよい。例えば、第１決定部５８は、包絡線が膨らんでいる部分を検出し、その部分に対応する音声信号を変更対象部分の信号として決定してもよい。あるいはまた、第１決定部５８は、包絡線のピークを検出し、その前後に所定の長さを有する区間の音声信号を変更対象部分の信号として決定してもよい。あるいはまた、第１決定部５８は、包絡線が所定のレベルを越えている連続的な区間の音声信号を変更対象部分の信号として決定してもよい。 Note that the first determination unit 58 may determine the signal of the change target portion by other methods. For example, the first determination unit 58 may detect a portion where the envelope bulges and determine the audio signal corresponding to that portion as the signal of the change target portion. Alternatively, the first determination unit 58 may detect a peak of the envelope and determine the audio signal in a section of predetermined length before and after the peak as the signal of the change target portion. Alternatively, the first determination unit 58 may determine the audio signal in a continuous section where the envelope exceeds a predetermined level as the signal of the change target portion.

図９（ｂ）は、第１決定部５８においてピークの検出に基づいて変更対象部分の信号が決定される場合を説明するための説明図である。図９（ｂ）は、例示としての音声信号の波形２１２とその包絡線２１４とを示す。第１決定部５８は、包絡線２１４のピーク２１６を検出する。第１決定部５８は、ピーク２１６の前後に所定の長さを有する区間２１８の音声信号を変更対象部分の信号として決定する。 FIG. 9B is an explanatory diagram for explaining a case where the signal of the change target portion is determined based on the detection of the peak in the first determination unit 58. FIG. 9B shows an exemplary sound signal waveform 212 and its envelope 214. The first determination unit 58 detects the peak 216 of the envelope 214. The first determination unit 58 determines the audio signal of the section 218 having a predetermined length before and after the peak 216 as the signal to be changed.

図９（ｃ）は、第１決定部５８において包絡線のレベルに基づいて変更対象部分の信号が決定される場合を説明するための説明図である。図９（ｃ）は、例示としての音声信号の波形２２０とその包絡線２２２とを示す。第１決定部５８は、包絡線２２２が所定のレベル２２４を越えている連続的な区間２２６を検出し、その区間２２６の音声信号を変更対象部分の信号として決定する。この場合、所定のレベルの取り方によっては、変更対象部分の信号が２以上のピークを含む場合がある。 FIG. 9C is an explanatory diagram for explaining a case where the signal of the change target portion is determined based on the envelope level in the first determination unit 58. FIG. 9C shows an exemplary audio signal waveform 220 and its envelope 222. The first determination unit 58 detects a continuous section 226 in which the envelope 222 exceeds a predetermined level 224, and determines the audio signal in the section 226 as a signal to be changed. In this case, depending on how to obtain a predetermined level, the signal of the change target portion may include two or more peaks.
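The level-based criterion of FIG. 9(c) can be sketched as a scan for continuous runs above a threshold. The function name and the inclusive index convention are assumptions made for this illustration.

```python
def sections_above_level(envelope, level):
    """Return (start, end) index pairs, inclusive, of every continuous
    run in which the envelope stays above the given level."""
    sections = []
    start = None
    for i, value in enumerate(envelope):
        if value > level and start is None:
            start = i  # a new section begins
        elif value <= level and start is not None:
            sections.append((start, i - 1))  # the section just ended
            start = None
    if start is not None:
        # The signal ended while still above the level.
        sections.append((start, len(envelope) - 1))
    return sections
```

For the envelope `[0, 2, 3, 2, 3, 0, 0, 4]` and a level of 1, the first section returned spans indices 1 to 4 and contains two peaks, which illustrates the remark above that a section may include two or more peaks depending on the chosen level.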

以上のように変更対象部分の信号の決定手法は種々考えられる。このように選択肢が多いことは、ＳＤによる会話内容の隠蔽をより効果的とするための大きな自由度を提供するという意味で好適である。 As described above, various methods for determining the signal of the change target portion are conceivable. Such a large number of options is preferable in the sense that it provides a great degree of freedom for making the concealment of conversation contents by SD more effective.

また、これら種々の決定手法に通じて言えることは、音声信号の波形に基づいて、特にその統計的な性質に基づいて信号のひとまとまりが判別され、そのように判別されたひとまとまりの信号が変更対象部分の信号として決定されていることである。すなわち、入来する音声信号に応じて適応的に変更対象部分が決定される。この場合、本発明者の当業者としての経験および予備的な実験によると、例えば予め定められた一定の間隔で音声信号を切り出す場合と比べてより会話内容擾乱効果が高いことが見出された。特に、本発明者によって行われた実験によると、包絡線の略１山を変更単位として抽出する場合は、例えば一定周期で切り出す場合や子音や母音を変更単位とする場合と比べて擾乱効果が高いことが見出された。 What these various determination methods have in common is that a coherent group of signals is identified based on the waveform of the audio signal, in particular on its statistical properties, and the group so identified is determined as the signal of the change target portion. That is, the change target portion is determined adaptively according to the incoming audio signal. According to the present inventor's experience as a person skilled in the art and preliminary experiments, this approach was found to disturb the conversation content more effectively than, for example, cutting out the audio signal at predetermined fixed intervals. In particular, experiments conducted by the present inventor found that extracting approximately one mountain of the envelope as the unit of change gives a stronger disturbance effect than, for example, cutting out at a fixed period or using consonants or vowels as the unit of change.

図５に戻る。
第１決定部５８は、音声信号のうち変更対象部分の信号として決定されなかった部分を遅延調整部６８に出力する。 Returning to FIG. 5.
The first determination unit 58 outputs a portion of the audio signal that has not been determined as the signal to be changed to the delay adjustment unit 68.

部分抽出部３０が音声信号をランダムな長さを有する期間で分割し、分割後の１区間に対応する信号を変更対象部分の信号として抽出する場合について説明する。
信号分割部６２は、Ａ／Ｄ部２０でデジタル化された音声信号をランダムな長さを有する期間で分割する。期間の長さは数１０ｍｓｅｃ〜数１００ｍｓｅｃの間で変動する。または期間の長さは一定周期に対して±数１０％〜数１００％の範囲で変動する。例えば、期間の長さは、…、１１ｍｓｅｃ、１０ｍｓｅｃ，１２ｍｓｅｃ、…、と変化する。 A case will be described in which the partial extraction unit 30 divides the audio signal by a period having a random length and extracts a signal corresponding to one section after the division as a signal of the change target part.
The signal dividing unit 62 divides the audio signal digitized by the A/D unit 20 into periods of random length. The length of a period varies between several tens of msec and several hundred msec. Alternatively, the length of a period varies within a range of ± several tens of percent to several hundred percent around a fixed period. For example, the length of the period changes as ..., 11 msec, 10 msec, 12 msec, ....

第２決定部６４は、音声信号のうち信号分割部６２で分割された期間のひとつに対応する信号を変更対象部分の信号として決定する。第２決定部６４は、分割された全ての期間を変更対象部分として選択してもよいし、例えば１つおきに変更対象部分として選択してもよい。後者の場合、第２決定部６４は変更対象部分として選択されなかった期間に対応する部分の音声信号を遅延調整部６８に出力する。
この場合、期間の長さにランダム性が加味されているので、マスカーH(t)の自然性が向上する。 The second determination unit 64 determines, as the signal of the change target portion, the part of the audio signal corresponding to one of the periods produced by the signal dividing unit 62. The second determination unit 64 may select all of the divided periods as change target portions, or may select, for example, every other period. In the latter case, the second determination unit 64 outputs to the delay adjustment unit 68 the audio signal of the parts corresponding to the periods that were not selected as change target portions.
In this case, since the randomness is added to the length of the period, the naturalness of the masker H (t) is improved.
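The random-length division and the "every other period" selection might be sketched as follows. The function names, the 100 msec base period, and the ±30 % jitter are illustrative assumptions within the ranges the text allows.

```python
import random


def split_random(samples, sample_rate_hz, base_s=0.1, jitter=0.3, rng=random):
    """Divide samples into consecutive periods whose lengths vary
    randomly by up to +/- jitter around base_s seconds."""
    segments = []
    pos = 0
    while pos < len(samples):
        duration = base_s * (1.0 + rng.uniform(-jitter, jitter))
        length = max(1, int(duration * sample_rate_hz))
        segments.append(samples[pos:pos + length])
        pos += length
    return segments


def select_every_other(segments):
    """Split the periods into change targets (even positions) and
    untouched periods (odd positions)."""
    return segments[0::2], segments[1::2]
```

Passing a seeded `random.Random` instance as `rng` makes the division reproducible, which can help when comparing masking effects across settings.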

時間処理部６６は、部分抽出部３０によって抽出された変更対象部分の信号を、その時間軸に沿った波形に基づいて処理する。時間処理部６６は、変更対象部分の信号に対して時間反転または時間回転を施す。 The time processing unit 66 processes the signal of the change target portion extracted by the partial extraction unit 30 based on the waveform along the time axis. The time processing unit 66 performs time reversal or time rotation on the signal of the change target portion.

時間反転について、時間処理部６６は、抽出された変更対象部分の信号を時間について反転する。すなわち、時間処理部６６は、変更対象部分の信号から時間を逆行させた信号を生成する。より具体的に説明すると、時間処理部６６は、変更対象部分の信号の時刻ｔ_ｉ（０≦ｉ≦Ｎ、ｔ_０＜ｔ_１＜…＜ｔ_Ｎ、Ｎは自然数、ｔ_０≡０）における電圧値ｆ（ｔ_ｉ）に対して関数ｈ（ｆ（ｔ_ｉ））＝ｆ（ｔ_Ｎ−ｔ_ｉ）を作用させる。その結果、時間処理部６６における時間反転処理を経た変更対象部分の信号の波形は、元の波形をその中心を通り時間軸と垂直な線に対して折り返した形状を有する。 For time reversal, the time processing unit 66 reverses the extracted signal of the change target portion in time. That is, the time processing unit 66 generates, from the signal of the change target portion, a signal in which time runs backward. More specifically, the time processing unit 66 applies the function h(f(t_i)) = f(t_N − t_i) to the voltage value f(t_i) of the signal of the change target portion at time t_i (0 ≤ i ≤ N, t_0 < t_1 < ... < t_N, N a natural number, t_0 ≡ 0). As a result, the waveform of the signal that has undergone time reversal in the time processing unit 66 has the shape of the original waveform mirrored about a line that passes through its center and is perpendicular to the time axis.

時間回転について、時間処理部６６は、抽出された変更対象部分の信号の時間軸に沿った波形を回転させる。より具体的に説明すると、時間処理部６６は、上述の通り変更対象部分の信号に対して時間反転を施す。加えて時間処理部６６は、時間反転が施された変更対象部分の信号の符号を反転する。その結果、時間処理部６６における時間回転処理を経た変更対象部分の信号の波形は、元の波形をその時間軸上の中心に対して１８０度回転した形状を有する。 Regarding the time rotation, the time processing unit 66 rotates the waveform along the time axis of the extracted signal of the change target portion. More specifically, the time processing unit 66 performs time reversal on the signal of the change target portion as described above. In addition, the time processing unit 66 inverts the sign of the signal of the change target portion subjected to the time inversion. As a result, the waveform of the signal of the change target portion that has undergone the time rotation processing in the time processing unit 66 has a shape obtained by rotating the original waveform by 180 degrees with respect to the center on the time axis.
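Both operations are simple to express on a list of samples. The sketch below assumes uniformly spaced samples, so that t_N − t_i corresponds to index N − i; the function names are hypothetical.

```python
def time_reverse(segment):
    """Time reversal: output sample i is input sample N - i."""
    return list(reversed(segment))


def time_rotate(segment):
    """Time rotation: time reversal followed by sign inversion, which
    rotates the waveform 180 degrees about its center on the time axis."""
    return [-x for x in reversed(segment)]
```

Applying either function twice returns the original segment, and neither changes the magnitude of any sample, which is consistent with the observation that the envelope is largely preserved.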

出力部７２は、時間処理部６６からは時間反転または時間回転処理された変更対象部分の信号を、部分抽出部３０からは変更対象部分でない信号を、取得する。出力部７２は、それらをアナログ信号に変換し、パワーアンプＰＡを介してスピーカＳＰに出力する。 The output unit 72 acquires from the time processing unit 66 a signal of the change target portion that has been subjected to time reversal or time rotation processing, and receives from the partial extraction unit 30 a signal that is not the change target portion. The output unit 72 converts them into analog signals and outputs them to the speaker SP via the power amplifier PA.

遅延調整部６８は、時間反転または時間回転処理された変更対象部分の信号と変更対象部分でない信号とをつなぎ合わせて出力すべき出力音声信号を生成する。遅延調整部６８は、出力音声信号が出力部７２から出力されるタイミングを、マスキーH'(t)の伝搬にかかる時間に応じて調整する。特に遅延調整部６８は、出力音声信号に対して所定の遅延を与える。この遅延は、受聴者８位置におけるマスキーH'(t)に対するマスカーH(t)の遅れがマスキーH'(t)とマスカーH(t)とが実質的に実時間と言える程度の範囲内に収まるように設定される。 The delay adjustment unit 68 generates the output audio signal by joining the signal of the change target portion that has undergone time reversal or time rotation with the signal that is not a change target portion. The delay adjustment unit 68 adjusts the timing at which the output audio signal is output from the output unit 72 according to the time the maskee H'(t) takes to propagate. In particular, the delay adjustment unit 68 applies a predetermined delay to the output audio signal. This delay is set so that the lag of the masker H(t) behind the maskee H'(t) at the position of the listener 8 stays within a range in which the maskee H'(t) and the masker H(t) can be regarded as substantially real-time with respect to each other.

マスキーH'(t)とマスカーH(t)とが実質的に実時間であることは、例えばマスキーH'(t)とマスカーH(t)とが隣接ブース２’内で少なくとも部分的に重畳することである。あるいはまた、出力部７２から出力された変更対象部分の信号がスピーカＳＰによって音声に変換され、その変換された音声が、マスキーH'(t)が隣接ブース２’内で受聴されている間に隣接ブース２’に出力されることである。あるいはまた、出力部７２から出力された変更対象部分の信号がスピーカＳＰによって音声に変換され、その変換された音声が、当該変更対象部分の信号に対応するマスキーH'(t)の部分が隣接ブース２’内で受聴されている間に隣接ブース２’に出力されることである。これは言い換えると、変更対象部分の信号に対応するマスキーH'(t)の部分と、当該変更対象部分の信号に対応するマスカーH(t)の部分とが隣接ブース２’内で少なくとも部分的に重畳することである。 The fact that the maskee H '(t) and the masker H (t) are substantially in real time means that, for example, the maskee H' (t) and the masker H (t) are at least partially overlapped in the adjacent booth 2 '. It is to be. Alternatively, the signal of the change target portion output from the output unit 72 is converted into sound by the speaker SP, and the converted sound is received while the maskee H ′ (t) is received in the adjacent booth 2 ′. It is to be output to the adjacent booth 2 ′. Alternatively, the signal of the change target portion output from the output unit 72 is converted into sound by the speaker SP, and the converted sound is adjacent to the portion of the maskee H ′ (t) corresponding to the signal of the change target portion. This is to be output to the adjacent booth 2 ′ while listening in the booth 2 ′. In other words, the portion of the maskee H ′ (t) corresponding to the signal of the change target portion and the portion of the masker H (t) corresponding to the signal of the change target portion are at least partially in the adjacent booth 2 ′. It is to superimpose on.

音声情報秘話システム１００を導入する際、マイクロホンＭｉｃおよびスピーカＳＰの位置は決まり、想定される顧客６の位置および想定される受聴者８の位置もある程度は決まる。また、ＳＤコントローラ部ＳＤにおける処理時間もある程度見積もることができる。したがって、音声情報秘話システム１００の導入時に、顧客６から受聴者８へのマスキーH'(t)の伝搬時間およびマスカーH(t)の伝搬時間をある程度見積もることができる。遅延調整部６８における遅延は、受聴者８位置におけるマスキーH'(t)に対するマスカーH(t)の遅れの所望値から逆算して設定される。 When the voice information secret system 100 is introduced, the positions of the microphone Mic and the speaker SP are determined, and the position of the assumed customer 6 and the assumed position of the listener 8 are also determined to some extent. In addition, the processing time in the SD controller unit SD can be estimated to some extent. Therefore, at the time of introducing the voice information secret talk system 100, the propagation time of the maskee H ′ (t) and the propagation time of the masker H (t) from the customer 6 to the listener 8 can be estimated to some extent. The delay in the delay adjusting unit 68 is set by calculating backward from a desired value of the delay of the masker H (t) with respect to the maskee H ′ (t) at the listener 8 position.
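The back-calculation can be illustrated with a small sketch. The speed of sound of 343 m/s, the straight-line propagation model, and every name and parameter below are assumptions made for illustration; an actual installation would rely on measured values.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def required_delay_s(desired_lag_s, talker_to_listener_m,
                     speaker_to_listener_m, processing_time_s,
                     talker_to_mic_m=0.0):
    """Delay to add so that the masker lags the maskee by desired_lag_s
    at the listener position.

    A negative result means the desired lag cannot be met by adding
    delay; the processing time would have to be shortened instead.
    """
    c = SPEED_OF_SOUND_M_S
    # Arrival time of the maskee (direct path from talker to listener).
    maskee_arrival = talker_to_listener_m / c
    # Arrival time of the masker if no extra delay is added.
    masker_arrival = (talker_to_mic_m / c + processing_time_s
                      + speaker_to_listener_m / c)
    return desired_lag_s - (masker_arrival - maskee_arrival)
```

For example, with a desired lag of 50 msec, a 3.43 m talker-to-listener path, a 0.343 m speaker-to-listener path, and 20 msec of processing, the sketch gives a delay of about 39 msec.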

マスキーH'(t)に対するマスカーH(t)の遅れが大きいと、受聴者８位置においてエコーや残響が生じる虞がある。したがって、遅延調整部６８は、受聴者８位置におけるマスキーH'(t)に対するマスカーH(t)の遅れがそのような違和感を生じさせない程度の値となるような遅延を出力音声信号に対して与える。この遅延は実験により定められるが、代表的には数１００ｍｓｅｃ以下である。 If the masker H (t) has a large delay with respect to the maskee H ′ (t), echoes or reverberations may occur at the listener 8 position. Therefore, the delay adjusting unit 68 sets a delay for the output audio signal such that the delay of the masker H (t) with respect to the maskee H ′ (t) at the listener 8 position is a value that does not cause such a sense of incongruity. give. Although this delay is determined by experiment, it is typically several hundred msec or less.

また、マイクロホンＭｉｃ、スピーカＳＰ、顧客６、受聴者８の位置関係によっては、遅延調整部６８で遅延を付与しないとした場合にマスカーH(t)がマスキーH'(t)よりもかなり遅く受聴者８位置に到達することもある。この場合、マスキーH'(t)とマスカーH(t)とを受聴者８位置で実質的に実時間で合成して情報隠蔽を行うためには、ＳＤコントローラ部ＳＤでのＳＤ処理時間を短縮しなければならない。この時間的な制約の存在、つまりＳＤ処理時間を短縮しなければならないことにより、時間処理の精度を犠牲にしなければならない場合もある。しかしながら本実施の形態の目的は音声の明瞭度・了解度の低減にあり、想定／予定した処理自体の正確さが目的ではない。したがって本実施の形態では、マスカーH(t)の重畳によりマスキーH'(t)の意味内容が理解し難くなるという条件が満たされれば処理の精度は大きな問題とはならない。これは「意味内容が理解し難くなるという条件」は無数にあるからである。 Depending on the positional relationship among the microphone Mic, the speaker SP, the customer 6, and the listener 8, the masker H(t) may reach the position of the listener 8 considerably later than the maskee H'(t) even when the delay adjustment unit 68 adds no delay. In this case, to conceal information by combining the maskee H'(t) and the masker H(t) substantially in real time at the position of the listener 8, the SD processing time in the SD controller unit SD must be shortened. Because of this time constraint, namely that the SD processing time must be shortened, the accuracy of the time processing may have to be sacrificed. However, the purpose of this embodiment is to reduce the clarity and intelligibility of speech, not to carry out the intended processing accurately for its own sake. Therefore, in this embodiment, processing accuracy is not a major concern as long as superimposing the masker H(t) makes the semantic content of the maskee H'(t) difficult to understand. This is because there are countless ways to satisfy the condition that the semantic content becomes difficult to understand.

The D/A unit 70 converts the output voice signal delayed by the delay adjustment unit 68 into an analog voice signal for driving the speaker SP and outputs it to the power amplifier PA.

FIG. 10 is a set of waveform diagrams showing the voice signals that represent the maskee H'(t) and the time-rotated masker H(t) at the position of the listener 8. FIG. 10(a) shows the waveform of the voice signal representing the maskee H'(t), that is, the original speech converted into a voice signal by the microphone Mic. The vertical axis of FIG. 10(a) represents signal intensity in arbitrary units, and the horizontal axis represents time. FIG. 10(b) shows the waveform of the voice signal generated when the SD controller unit SD applies time rotation to the voice signal of FIG. 10(a) in units of approximately one peak. For example, the SD controller unit SD extracts the approximately single-peak voice signal indicated by circle 150 in FIG. 10(a) as the signal of the change target portion, applies time rotation to it, and generates and outputs the voice signal indicated by circle 152 in FIG. 10(b).
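The peak-by-peak processing described for FIG. 10 can be sketched as follows. This is only an illustration under stated assumptions: "time rotation" is taken here to mean a circular shift of the samples within each segment, the shift of half a segment is a placeholder, and the envelope is approximated by a moving average of the absolute value. The function names and the window length are hypothetical and are not taken from the embodiment.

```python
import math


def envelope(signal, win=5):
    """Crude envelope estimate: moving average of |x| (an assumption of this sketch)."""
    half = win // 2
    env = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        env.append(sum(abs(v) for v in signal[lo:hi]) / (hi - lo))
    return env


def split_at_envelope_minima(signal, win=5):
    """Return (start, end) index pairs, cutting where the envelope has a local minimum."""
    env = envelope(signal, win)
    cuts = [0]
    for i in range(1, len(env) - 1):
        if env[i] < env[i - 1] and env[i] <= env[i + 1]:
            cuts.append(i)
    cuts.append(len(signal))
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


def time_rotate(segment, shift=None):
    """Circularly shift a segment; half the segment length is a placeholder default."""
    if not segment:
        return []
    if shift is None:
        shift = len(segment) // 2
    shift %= len(segment)
    return segment[shift:] + segment[:shift]


def make_masker(signal, win=5):
    """Rotate each roughly single-peak segment in place and concatenate the results."""
    out = []
    for a, b in split_at_envelope_minima(signal, win):
        out.extend(time_rotate(signal[a:b]))
    return out


if __name__ == "__main__":
    # Two bursts of a sine wave with a quiet gap, standing in for two "peaks".
    burst = [math.sin(0.9 * n) * math.sin(math.pi * n / 40) for n in range(40)]
    speech = burst + [0.0] * 10 + burst
    masker = make_masker(speech)
    print(len(speech), len(masker))
```

Because each segment is only reordered internally, the masker has the same length and the same samples per segment as the input, which is consistent with the observation that the overall envelope changes little.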

Comparing the envelope in FIG. 10(a) with that in FIG. 10(b) shows that the envelope changes little; that is, the intonation and inflection of the speech are largely preserved. However, when the voice signal of FIG. 10(b) is converted into sound by the speaker SP and output as the masker H(t), the listener 8 hears the maskee H'(t) and the masker H(t) combined, and the meaning becomes hard to understand. In many cases the listener simply cannot make out what is being said.

FIG. 11 is a flowchart showing a series of processes in the voice information confidential communication system 100. The microphone Mic picks up the maskee H'(t) and generates a voice signal (step 302). The A/D unit 20 acquires the voice signal representing the maskee H'(t) from the microphone Mic (step 304). The partial extraction unit 30 extracts the signal of the change target portion from the voice signal acquired and A/D-converted by the A/D unit 20, based on the waveform of that voice signal (step 306). The partial change unit 90 changes the signal of the change target portion extracted by the partial extraction unit 30 (step 308). The output unit 72 outputs the signal of the change target portion changed by the partial change unit 90 to the speaker SP (step 310). The speaker SP converts the received signal into sound as the masker H(t) and outputs the masker H(t) to the adjacent booth 2', where the maskee H'(t) is being heard (step 312).

The operation of the voice information confidential communication system 100 configured as above is now described. Consider a case where a customer 6 sits in booth 2 of a bank and consults a bank adviser about, for example, a loan. Suppose that a listener 8 is in the adjacent booth 2' next to booth 2, applying to open an account. The customer 6 is explaining the circumstances behind the loan application, such as a deterioration in the cash flow of the customer's business. Naturally, such a conversation should not be overheard by the listener 8. In the voice information confidential communication system 100 according to this embodiment, what reaches the listener 8 includes the speech of the customer 6 with its consonant portions converted or with time rotation applied, so the listener 8 cannot understand what the customer 6 is saying. In addition, when the customer 6 is not speaking, there is substantially no output from the speaker SP to the adjacent booth 2', so the noise level in the adjacent booth 2' is not raised unnecessarily.

In the embodiment described above, examples of the storage device 10 are a hard disk and a memory. Those skilled in the art who have read this specification will understand that, based on its description, each block can be implemented by a CPU (not shown), modules of an installed application program, modules of a system program, a memory that temporarily stores data read from a hard disk, and the like.

The voice information confidential communication system 100 according to this embodiment provides the following effects.

(1) The voice information confidential communication system 100 according to this embodiment conceals not the existence of the conversation itself but its content, that is, the information contained in the conversational speech. In this regard, the present inventors recognized the following.
In open-plan offices and at the lobby counters of banks and securities firms, particularly at customer service counters divided by simple partitions, making the content of a conversation unintelligible to anyone other than those taking part is sufficient for the purpose of concealing that content. In other words, the speech itself may be audible as long as its content does not leak. Indeed, where the talker can be seen, it is more natural for the spectrum and envelope of the speech (timbre, intonation, inflection) to be preserved. The voice information confidential communication system 100 according to this embodiment addresses these viewpoints and needs and conceals conversation content in a more natural way.

(2) When the partial extraction unit 30 extracts consonant portions, the masker H(t) is created from the talker's own maskee H'(t) with attention to its consonant portions, and is output from the speaker in parallel with the original speech. Therefore, particularly in the consonant-only replacement mode, the spectrum and envelope of the maskee H'(t) can be preserved in the masker H(t). As a result, the spectrum and intonation of the masker H(t) are almost the same as those of the maskee H'(t), so the listener perceives the masker as natural, with little sense of incongruity.

(3) When the partial extraction unit 30 extracts consonant portions, the masker H(t) is generated from the maskee H'(t) either by replacing only the consonant portions, or by replacing the consonant portions and then replacing the vowel portions with silence or otherwise processing them. The volume (sound pressure level) of the masker H(t), and hence the rise in the room noise level, can therefore be kept as small as possible.

(4) When there is no maskee H'(t) on the time axis, that is, when there is no conversation, no masker H(t) is output either. In other words, the two substantially overlap in time. The rise in room noise level caused by the masker H(t) during silent periods, when no speech is produced, is therefore suppressed.

(5) The system suppresses the unnatural impression caused by masker interruptions and level fluctuations (cutting off or lowering the level when the conversation stops) that can occur with conventional techniques, as well as the discomfort that emitting unrelated sounds (noise or music) causes talkers, conversation partners, and other occupants of the room.

(6) Unlike the physical sound insulation and private rooms of the conventional art, the system requires no spatial separation or relocation, so the sense of openness and communication are less likely to be impaired.

(7) Because the SD controller unit SD and the speaker SP are built into the IT partition 4, installation and mounting of the system are greatly simplified. In some cases the microphone Mic may also be built into the IT partition 4, which simplifies installation further.

(8) The IT partition 4 itself has a sound-absorbing treatment. Sound leakage to the adjacent booth can therefore be reduced while the clarity of conversational speech inside the booth is improved.

(9) Through processing such as replacement, deletion, reversal, and rotation, the masker H(t) becomes a signal whose correlation with the maskee H'(t) (the original speech) as an electrical signal is not very high. Consequently, feedback-related anomalies such as howling are unlikely to occur while the voice information confidential communication system 100 is operating.

(10) In the real-time mode of the SD controller unit SD according to this embodiment, time reversal or time rotation is applied to the signal of the change target portion. Time reversal can generate a masker H(t) that is effective for disturbing information while preserving the signal envelope. With time reversal, however, there may be little audible difference between the maskee H'(t) and the masker H(t). With time rotation, by contrast, experiments by the present inventors have shown that the auditory impression of the masker H(t) differs subtly from that of the maskee H'(t).

For information concealment and auditory confusion, it is a problem if the maskee H'(t) and the masker H(t) sound too similar, but it is also a problem if they sound too different, because hearing tends to separate sounds of different character and recognize them as distinct. Time rotation can therefore provide a masker H(t) that is perceptually neither too close to nor too far from the maskee, and is thus well suited to information concealment.

(11) When the partial extraction unit 30 extracts an approximately single-peak signal as the signal of the change target portion, cutting and pasting take place where the signal level of the maskee H'(t) is low, so click noise caused by the time reversal or rotation processing is reduced. That is, if the maskee H'(t) is continuous in time, the masker H(t) is also nearly continuous. Click noise at the cut points, which can occur when the signal is divided at fixed time intervals, is therefore unlikely, as is the collapse of the envelope shape (and hence of the intonation) caused by windowing applied to reduce such noise.

(12) When the partial extraction unit 30 extracts an approximately single-peak signal as the signal of the change target portion and time rotation is applied to the extracted signal, the spectrum and envelope shape of the masker are largely preserved and resemble those of the maskee. Sound-field information disturbance (concealment of speech content) can therefore work effectively while the rise in room noise level and click noise are kept to a minimum.

The configuration and operation of the voice information confidential communication system 100 according to the embodiment and of the SD controller unit SD it includes have been described above. This embodiment is illustrative; those skilled in the art will understand that various modifications to the combination of its components and processes are possible and that such modifications also fall within the scope of the present invention.

The embodiment describes the case where the masker H(t) is output from one side of the adjacent booth, but the invention is not limited to this. For example, by signal addition, the masker H(t) may be output from both the left and right sides of the adjacent booth. FIG. 12 is a block diagram schematically showing the functions and configuration of a voice information confidential communication system according to a first modification. The system according to the first modification includes a microphone Mic, an SD controller unit SD, four speakers SPa to SPd (SPd not shown), four power amplifiers PAa to PAd (PAd not shown), and four adders 210a to 210d (210d not shown).

The voice signal processed by the SD controller unit SD is input to the adder 210a corresponding to the left speaker SPa of booth 2, the adder 210b corresponding to the right speaker SPb of booth 2, the adder 210c corresponding to the left speaker SPc of the adjacent booth 2' to the left of booth 2, and the adder 210d (not shown) corresponding to the right speaker SPd (not shown) of the adjacent booth to the right of booth 2. The voice signals input to the adders 210a to 210d are output from the speakers SPa to SPd via the corresponding power amplifiers PAa to PAd. Each adder acquires the voice signals processed by the SD controller units SD of the booths on both sides of the booth into which its connected speaker outputs sound, and adds them.
According to this modification, the masker H(t) is output from both the left and right sides of the adjacent booth 2', so the content of the conversation in booth 2 is even less likely to be conveyed to the listener 8.
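The role of the adders can be illustrated with the following simplified sketch. It assumes a single row of booths and one mixed output per booth, which is a simplification of the left and right speakers of FIG. 12; the function name and data layout are hypothetical.

```python
def mix_for_each_booth(maskers):
    """For each booth, add the processed signals from the booths on both sides.

    maskers[i] is the processed (masker) signal from booth i's SD controller unit,
    given as a list of samples. The return value out[i] is the signal to be played
    into booth i. The single-row layout is an assumption of this sketch.
    """
    n = len(maskers)
    length = max((len(m) for m in maskers), default=0)
    out = []
    for i in range(n):
        mixed = [0.0] * length
        for j in (i - 1, i + 1):          # left and right neighbors, if they exist
            if 0 <= j < n:
                for k, v in enumerate(maskers[j]):
                    mixed[k] += v
        out.append(mixed)
    return out


if __name__ == "__main__":
    print(mix_for_each_booth([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]))
```

A booth never receives its own masker in this sketch; it receives only the maskers derived from the conversations in the neighboring booths.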

A PNC (Passive Noise Controller) may also be used in combination to reduce the level of the maskee H'(t). The PNC is intended to run known ANC (Active Noise Control) adaptively during tuning and to use it with the resulting parameters fixed during operation.
FIG. 13 is a block diagram schematically showing the functions and configuration of a voice information confidential communication system according to a second modification. In this modification, the SD controller unit SD of FIG. 12 is replaced by the part enclosed by the broken line in FIG. 13. In this part, the SD controller unit SD and a PNC unit PNC are provided in parallel, and the voice signal from the microphone Mic is input to both. A switch SW1 is provided on the output side of the SD controller unit SD and turns the operation of the SD controller unit SD on and off. The output of the switch SW1 and the output of the PNC unit PNC are added by an adder 406 and output as sound from the speaker SP via the power amplifier PA.

In this modification, a head and torso simulator (HATS) or the like, connected to a sound source 402 via an amplifier 404, is placed at the talker position P to identify the PNC unit PNC. The switch SW1 is opened to disable the SD controller unit SD, a suitable voice signal is emitted from the HATS, and the PNC unit PNC is operated adaptively so that the output of a microphone Mic' placed at the listener position Q in the adjacent booth 2' is minimized, thereby performing system identification.

At this point, the impulse response of the path including the microphone Mic and the speaker SP becomes -h(x), whose absolute value is approximately equal to that of the impulse response h(x) between the talker and the listener. The switch SW1 is then closed, and the PNC unit is operated with the identified parameters fixed. Because the positions P and Q of the talker and the listener and the positions of the microphone Mic and the speaker SP are essentially fixed, the level of the maskee H'(t) is effectively reduced and the masker H(t) becomes dominant. As a result, the information masking effect is strengthened. If the level of the masker H(t) is lowered as needed, the level of the whole system including the maskee H'(t), that is, the room noise level, can be reduced further.
Note that the PNC function described above may be incorporated in the computer in which the SD controller unit SD is incorporated.
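The identify-then-freeze procedure can be illustrated with a plain LMS adaptive filter. The sketch below is deliberately simplified: it assumes white-noise excitation, a short FIR model of the talker-to-listener path, and an ideal microphone-to-speaker path (a practical ANC system would have to account for that secondary path). All names, the step size, and the example coefficients are hypothetical.

```python
import random


def identify_pnc(h, n_samples=5000, mu=0.1, seed=0):
    """Adapt FIR weights w so that the residual at the listener microphone vanishes.

    h is an assumed FIR model of the talker-to-listener path. With an ideal
    microphone/speaker path (an assumption), the weights converge toward -h.
    """
    rng = random.Random(seed)
    taps = len(h)
    w = [0.0] * taps
    history = [0.0] * taps  # most recent excitation sample first
    for _ in range(n_samples):
        history = [rng.uniform(-1.0, 1.0)] + history[:-1]
        # Residual at the listener microphone: direct path plus PNC path.
        e = sum((h[k] + w[k]) * history[k] for k in range(taps))
        # LMS update that drives the residual toward zero.
        for k in range(taps):
            w[k] -= mu * e * history[k]
    return w


def residual_with_fixed_pnc(h, w, x):
    """Residual at the listener with the identified weights held fixed (operation mode)."""
    taps = len(h)
    out = []
    for n in range(len(x)):
        out.append(sum((h[k] + w[k]) * x[n - k] for k in range(taps) if n - k >= 0))
    return out


if __name__ == "__main__":
    h = [0.5, 0.3, -0.2]
    w = identify_pnc(h)
    print([round(v, 6) for v in w])
```

After identification the weights are no longer updated, which corresponds to operating the PNC unit with its parameters fixed.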

ANC/PNC is an existing technology, but it is not suited to controlling a large sound field throughout three dimensions. On the other hand, where the listener's head is at a roughly fixed position in a small space enclosed by counter partitions, it is an effective means of sound reduction even in three dimensions.

When a change target portion such as a consonant portion is replaced or deleted in the embodiment, a time window such as a Hanning window and zero-cross detection may be used together to remove clicks that can arise at the cut points. This further reduces any unnatural impression given to the listener 8 or to other occupants of the room.
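A minimal sketch of these two click-reduction measures follows. The raised-cosine ramp is one half of a Hanning window applied to each end of a segment, and the zero-crossing search moves a cut point to the nearest sign change. The function names and the ramp length are assumptions for illustration only.

```python
import math


def hann_taper(segment, ramp):
    """Fade the first and last `ramp` samples with half of a Hanning window."""
    n = len(segment)
    ramp = min(ramp, n // 2)
    out = list(segment)
    for i in range(ramp):
        gain = 0.5 * (1.0 - math.cos(math.pi * i / ramp))  # 0 at the edge, rising toward 1
        out[i] *= gain
        out[n - 1 - i] *= gain
    return out


def nearest_zero_crossing(signal, idx):
    """Return the index closest to idx where the signal changes sign (or idx if none)."""
    for offset in range(len(signal)):
        for i in (idx - offset, idx + offset):
            if 1 <= i < len(signal) and signal[i - 1] * signal[i] <= 0:
                return i
    return idx


if __name__ == "__main__":
    print(hann_taper([1.0] * 10, 4))
    print(nearest_zero_crossing([1, 2, 3, -1, -2], 1))
```

Tapering only the edges, rather than windowing the whole segment, leaves the middle of the segment untouched and so changes the envelope less.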

The present invention has been described above based on the embodiment, but the embodiment merely illustrates the principles and applications of the invention, and many modifications and changes in arrangement are possible without departing from the idea of the invention as defined in the claims.
For example, emitting a plurality of processed voices superimposed on the original speech is one conceivable technique.

2: booth, 4: IT partition, 6: customer, 8: listener, 10: storage device, 20: A/D unit, 30: partial extraction unit, 72: output unit, 90: partial change unit, 100: voice information confidential communication system, SD: SD controller unit, SP: speaker, Mic: microphone.

Claims

1. A voice changing device comprising:
a partial extraction unit that extracts, from a voice signal representing a voice being uttered, a signal of a change target portion based on a waveform of the voice signal;
a partial change unit that changes the signal of the change target portion extracted by the partial extraction unit; and
an output unit that outputs the signal of the change target portion changed by the partial change unit to voice output means capable of outputting sound to a region where the voice being uttered is heard.

2. The voice changing device according to claim 1, wherein the partial extraction unit determines, as the signal of the change target portion, a substantially single-peak signal in a section between a first time before a peak of an envelope of the waveform of the voice signal and a second time after the peak.

3. The voice changing device according to claim 1, wherein the partial extraction unit extracts a signal of a consonant portion as the signal of the change target portion.

4. The voice changing device according to any one of claims 1 to 3, wherein the partial change unit rotates, along the time axis, the waveform of the signal of the change target portion extracted by the partial extraction unit.

5. The voice changing device according to any one of claims 1 to 4, further comprising a timing adjustment unit that adjusts the timing at which the signal of the change target portion changed by the partial change unit is output from the output unit, according to the time taken for the voice being uttered to propagate.

6. A confidential communication system for voice information, comprising:
sound collecting means for receiving a voice being uttered and generating a voice signal representing the voice;
a voice changing device that changes the voice signal generated by the sound collecting means; and
voice output means for converting the voice signal changed by the voice changing device into sound and outputting the sound to a region where the voice being uttered is heard,
wherein the voice changing device includes:
a partial extraction unit that extracts, from the voice signal generated by the sound collecting means, a signal of a change target portion based on a waveform of the voice signal;
a partial change unit that changes the signal of the change target portion extracted by the partial extraction unit; and
an output unit that outputs the signal of the change target portion changed by the partial change unit to the voice output means.

7. A voice changing method comprising:
extracting, from a voice signal representing a voice being uttered, a signal of a change target portion based on a waveform of the voice signal;
changing the extracted signal of the change target portion; and
converting the changed signal of the change target portion into sound and outputting the sound to a region where the voice being uttered is heard.