JP2009162879A - Speech support method

Info

Publication number: JP2009162879A
Application number: JP2007340381A
Authority: JP
Inventors: Shunsuke Ishimitsu (石光俊介); Hitoshi Nakayama (中山仁史)
Original assignee: Hiroshima Industrial Promotion Organization
Current assignee: Hiroshima Industrial Promotion Organization
Priority date: 2007-12-28
Filing date: 2007-12-28
Publication date: 2009-07-23
Anticipated expiration: 2027-12-28
Also published as: JP5354485B2

Abstract

PROBLEM TO BE SOLVED: To provide a method that reproduces a voice close to the speaker's original voice, and thereby facilitates communication, in environments where communicating with a clear voice is difficult, for example when supporting the esophageal speech of a person with impaired speech function or when assisting communication in a high-noise environment.
SOLUTION: A transfer function is generated for each subword from an input signal, such as the speaker's body-conducted sound, and voice data that serves as the model for the output speech. When the speaker speaks, the input signal is converted in sound quality continuously, subword by subword, with the corresponding transfer function. The converted subwords are then joined by a convolution operation in a digital filter to reproduce continuous output speech, so that speech support is realized with a voice quality close to that of the output speech model.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a speech support method for situations in which communication through normal speech is difficult.

In recent years, the number of patients who undergo total removal of the pharynx because of diseases such as pharyngeal cancer has been increasing. Total pharyngectomy is one of the common treatments for pharyngeal cancer, but it also removes the vocal cords, which makes normal speech impossible.

Many patients whose vocal cords have been removed use esophageal speech as an alternative, but speech produced this way is not always clear. Because its fundamental frequency is lower and its volume is reduced, communication is impaired in high-noise environments such as outdoors.

Difficulty with voice communication in high-noise environments is not limited to people whose vocal cords have been removed. For example, in a high-noise environment such as a ship's engine room, the speech recognition rate drops markedly when the ambient noise is louder than the voice of a speaker with normal speech function.

To address this problem, the present inventor previously studied, in Non-Patent Document 1, a speech recognition apparatus that uses bone conduction, that is, body-conducted sound, by means of a bone conduction system. In Non-Patent Document 2, the same inventor studied a technique for converting a vocal cord vibration signal into a speech signal.

As examples of speech support methods for total pharyngectomy patients that use body-conducted sound, the apparatuses and methods of Non-Patent Document 3, Patent Document 1, and Patent Document 2 have been disclosed. Non-Patent Document 3 is a paper on a method that supports voice communication by extracting non-audible murmur (NAM) and converting it into a whisper. Patent Document 1 is a speech substitute device that detects the movement of the muscles around the mouth from myoelectric potentials and synthesizes speech. Patent Document 2 is a conversation support apparatus that registers target utterance models as sentences and matches what the speaker utters against the registered models. If a model matches, it is output; if none matches, the output is generated from several target utterance models whose utterance features agree, and the result is registered as a new target utterance model.

Patent Document 1: JP-A-7-181888
Patent Document 2: JP 2004-287209 A
Non-Patent Document 1: "Study of bone conduction recognition system for marine engine operation support", Journal of Japan Marine Engineering Society, Vol. 39, No. 4, pp. 35-40
Non-Patent Document 2: "Speech generation from vocal cord vibration signals using vocal tract filter characteristics", Proceedings of the Japan Society of Mechanical Engineers 2003 Annual Meeting, pp. 121-122
Non-Patent Document 3: "Voice communication support system for total pharyngectomy patients based on conversion of body-conducted artificial speech", IEICE Transactions D, Vol. J90-D, No. 3, pp. 780-787

However, none of the inventions described in the above documents can be said to generate clear speech while preserving the speaker's voice quality, and they are of limited practical use.

Non-Patent Document 1, the present inventor's earlier study, succeeded in extracting bone-conducted (body-conducted) sound but did not reach the stage of converting it into clear speech. Non-Patent Document 2 requires the vocal cord signal and the speech signal to be input simultaneously, so it cannot be used where a normal speech signal is unavailable. Furthermore, Non-Patent Document 2 confirmed signal conversion only for a word registered in advance ("Fuchu" (hutyuu) in that document) and did not consider unregistered words or conversation.

Non-Patent Document 3 converts non-audible murmur (NAM) into a whisper. Unlike actual speech, however, a whisper does not take the fundamental frequency into account, so male and female voices cannot be distinguished and the result cannot be said to preserve the speaker's voice quality. Moreover, that method merely converts the NAM input signal as it is. Because no method for recognizing and analyzing the input signal has been examined, it is vulnerable to ambiguity.

Patent Document 1 synthesizes and outputs speech based on myoelectric potentials around the mouth, but voice quality is not determined by muscle movement alone, so it cannot be said to reproduce voice quality.

Patent Document 2 is an apparatus that holds target utterance models in a database in advance, matches what the speaker utters against the models in the database, and converts and outputs the result. The content registered as a target utterance model is in units of sentences. The document states that new content not registered as a target utterance model can be handled by automatic synthesis from registered models, and that a database for new utterances is built automatically to supplement the data. However, automatic synthesis is impossible when the database contains no utterance model similar to the new utterance. Considering that both utterance content and words are unlimited in number, the approach must be said to lack feasibility.

The present invention comprises: a step of creating in advance, using the cross-spectrum method, and storing, for each subword shorter than a word, a transfer function that relates input signal information, namely the speaker's body-conducted sound or voice, to a model of the speech signal to be output; a step of inputting the speaker's body-conducted sound or voice as an input signal; a step of identifying the input signal subword by subword; a step of converting the sound quality of each subword with the transfer function corresponding to that subword; a step of synthesizing the converted subwords into a continuous output speech signal; and a step of outputting that signal. Because the transfer functions relating the speaker's body-conducted sound or voice to the model of the desired output voice are generated in advance, what the speaker actually utters can be output as speech close to that model.

In the present invention, the subword is a phoneme, a demisyllable, or a syllable. Words and the like therefore need not be registered in advance, and free utterances can be handled.

In the present invention, the transfer function converts the input signal, subword by subword, into the sound quality to be output, and is stored in association with each subword. Only as many transfer functions as there are subwords need to be held, which keeps the amount of data small.

In the present invention, the transfer functions are called up successively, subword by subword, as the input signal is input. Continuous utterances are therefore converted continuously, which enables smooth, real-time output.

In the present invention, the step of synthesizing the continuous output speech is a convolution operation by a digital filter. Joining the data converted subword by subword into a smooth sequence yields more natural output speech.

According to the present invention, sound quality conversion is performed subword by subword with the transfer functions in parallel with the speaker's utterance, so the output is speech close to the model of the output speech signal used when the transfer functions were generated. If the speaker's own voice is used as that model, a voice close to the speaker's own can be reproduced.

Furthermore, because the unit of sound quality conversion is the subword, conversion is finer-grained than converting the whole input signal at once, and clear speech can be output.

Furthermore, because the unit of sound quality conversion is the subword, fewer transfer functions need to be held than with word-by-word or sentence-by-sentence conversion, and the amount of data is kept small. For example, when the subword is the syllable, it suffices to prepare transfer functions for the pronunciations of Japanese: the so-called gojuon (the fifty basic sounds), voiced sounds (dakuon), semi-voiced sounds (handakuon), contracted sounds (yoon), the geminate consonant (sokuon), and so on.

Furthermore, because the unit of sound quality conversion is the subword, there is no restriction to word or sentence models, and output speech can be reproduced flexibly.

Furthermore, suppose the input signal is ambiguous, the subword is misidentified, and a transfer function other than the one the speaker intended is called up (for example, the speaker meant to say "hi" but the sound was converted with the transfer function for "shi"). Even then, the original utterance was itself ambiguous and the transfer function called up is similar to the intended one. The output speech is therefore also similar, so the conversion is robust to ambiguity.

Furthermore, after the sound quality of each subword is converted, a convolution operation by a digital filter fills in the joins between the converted subword signals, which makes the output speech smooth.

Furthermore, the present invention places no restriction on the input signal or on the model of the output speech signal used for the transfer functions. The input signal may be body-conducted sound from esophageal speech, body-conducted sound from vocal cord speech, the ordinary voice of a speaker with normal speech function, the sound of esophageal speech, non-audible murmur, and so on. The model of the output speech signal may be the speaker's own current voice, the speaker's own past voice recorded on a medium such as a cassette tape, a relative's voice, a celebrity's voice, the speaker's ideal voice, and so on. Both can be chosen freely according to the purpose when the transfer functions are generated.

As stated above, the present invention restricts neither the input signal nor the output speech. As one preferred embodiment, a speech support method for a person whose speech function is impaired by total pharyngectomy is described below, with reference to Fig. 1 as an example of an apparatus configuration for carrying out the invention.

First, as a preliminary process, the transfer functions are generated. Because this embodiment aims to support the speech of a person with impaired speech function, the input signal is the body-conducted sound produced when the speaker uses esophageal speech. As the model of the output speech signal, the speaker's own voice from before the vocal cords were lost, recorded on a medium such as a cassette tape, is prepared. The subword unit is the syllable.

With a device that picks up body-conducted sound attached to the speaker's body as the signal input unit 11, the recorded voice is played back and fed through the voice model input unit 21 to the transfer function generation unit 22. At the same time, the speaker utters the same content by esophageal speech in step with the playback, and this is also input to the transfer function generation unit 22. That is, if the recorded voice says "asahi" (/a/sa/hi/), the speaker says "asahi" simultaneously with the playback. The recorded voice and the body-conducted sound of the speaker's esophageal speech are thereby associated in the transfer function generation unit 22.

Because the content of the played-back speech is known in advance, the played-back speech and the body-conducted sound can be aligned on the basis of the time axis, vibration characteristics, and the like. The played-back speech is then cut into subwords according to its known content, and the corresponding body-conducted sound is cut into the same subwords.

From the played-back speech and the body-conducted sound extracted for each subword, the transfer function generation unit 22 generates a transfer function. Letting $H_{xx}(f)$ be the auto-spectrum of the body-conducted sound and $H_{xd}(f)$ the cross-spectrum of the played-back speech and the body-conducted sound, the transfer function is given by $H(f) = H_{xd}(f) / H_{xx}(f)$.
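The ratio H_xd / H_xx above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the patent: the function name, the frame length, the hop size, the Hann window, and the small regularizing constant `eps` are all assumptions introduced here, and the two signals are assumed to be already time-aligned and cut to the same subword.

```python
import numpy as np

def estimate_transfer_function(x, d, frame_len=256, hop=128, eps=1e-12):
    """Estimate H(f) = H_xd(f) / H_xx(f) for one subword.

    x: body-conducted sound (input), d: model voice (desired output).
    Both are 1-D arrays of equal length covering the same subword.
    frame_len, hop and eps are illustrative choices, not source values.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    window = np.hanning(frame_len)
    n_bins = frame_len // 2 + 1
    h_xx = np.zeros(n_bins)                  # auto-spectrum of x
    h_xd = np.zeros(n_bins, dtype=complex)   # cross-spectrum of x and d
    # Average the spectra over overlapping frames before taking the ratio.
    for start in range(0, len(x) - frame_len + 1, hop):
        X = np.fft.rfft(window * x[start:start + frame_len])
        D = np.fft.rfft(window * d[start:start + frame_len])
        h_xx += (np.conj(X) * X).real
        h_xd += np.conj(X) * D
    # eps keeps the division finite in bins where x carries no energy.
    return h_xd / (h_xx + eps)
```

Averaging the spectra before dividing, rather than dividing frame by frame, is what makes the estimate tolerant of noise that is uncorrelated with the input.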

As shown in Fig. 3, information may be collected several times for the same subword, and the same subword may be collected from different words, such as the "sa" (/sa/) of "asahi" and the "sa" (/sa/) of "Sapporo". This makes the transfer function more tolerant of variation in the input signal and improves accuracy. In Fig. 3, the upper traces in the "voice" and "body-conducted sound" columns are waveforms, and the shading below them is a spectrogram.
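One way to pool several recordings of the same subword is to accumulate the auto-spectrum and cross-spectrum over all exemplars and divide once at the end. The sketch below assumes this pooling strategy; the source does not say how the multiple samples are combined, and the class name, `n_fft`, and `eps` are hypothetical.

```python
import numpy as np

class SubwordSpectrumAccumulator:
    """Pools exemplars of one subword (e.g. /sa/ from several words)."""

    def __init__(self, n_fft=512):
        self.n_fft = n_fft
        n_bins = n_fft // 2 + 1
        self.h_xx = np.zeros(n_bins)                 # summed auto-spectrum
        self.h_xd = np.zeros(n_bins, dtype=complex)  # summed cross-spectrum
        self.count = 0

    def add(self, x, d):
        """Add one aligned pair: body-conducted x and model voice d."""
        # rfft zero-pads shorter segments to n_fft and truncates longer ones.
        X = np.fft.rfft(x, n=self.n_fft)
        D = np.fft.rfft(d, n=self.n_fft)
        self.h_xx += np.abs(X) ** 2
        self.h_xd += np.conj(X) * D
        self.count += 1

    def transfer_function(self, eps=1e-12):
        """Return H(f) = H_xd / H_xx over everything added so far."""
        return self.h_xd / (self.h_xx + eps)
```

A dictionary keyed by subword label, for example `{"sa": SubwordSpectrumAccumulator()}`, would then hold one accumulator per syllable, fed with the /sa/ segments of both "asahi" and "Sapporo".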

Performing the above process for every subword generates transfer functions for all subwords. Storing them in the transfer function storage unit 13 completes the preliminary process.

The voice model input unit 21 and the transfer function generation unit 22 in Fig. 1 are used only in the process of generating transfer functions, so they may be separated from the other components of the apparatus.

The speech support method used when the speaker actually speaks is described below. Fig. 2 illustrates the method.

In the first step, the speaker speaks, and the body-conducted sound is input through the signal input unit 11 of Fig. 1 as the input signal. The input signal varies with the location from which the body-conducted sound is picked up, and the measurement characteristics of the device also matter, so it is desirable to attach the same device as was used when generating the transfer functions, at the same position.

In the second step, the speech recognition unit 12 of Fig. 1 identifies the input signal subword by subword. Identification uses generally known speech recognition.

In the third step, the transfer function corresponding to each subword is called up from the transfer function storage unit 13 of Fig. 1, and the sound quality conversion unit 14 of Fig. 1 converts the sound quality by correcting the frequency characteristics of each subword.
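The lookup and frequency-characteristic correction of this step might look like the following sketch. It assumes that the recognizer delivers each subword as a `(label, start, end)` sample range and that the stored transfer functions are one-sided spectra in a dictionary; both conventions, and all names, are illustrative rather than taken from the source. Multiplying in the frequency domain this way is a circular convolution, which is acceptable for a sketch but would need overlap-add handling in a real-time system.

```python
import numpy as np

def convert_subword(segment, h):
    """Apply a one-sided transfer function h (length n_fft/2 + 1) to a segment."""
    n_fft = 2 * (len(h) - 1)
    if len(segment) > n_fft:
        raise ValueError("segment is longer than the transfer function's FFT size")
    X = np.fft.rfft(segment, n=n_fft)      # zero-padded spectrum of the input
    y = np.fft.irfft(X * h, n=n_fft)       # corrected spectrum back to time domain
    return y[:len(segment)]

def convert_utterance(signal, segments, bank):
    """Convert each recognized subword with its own stored transfer function.

    segments: list of (label, start, end) sample ranges (hypothetical format).
    bank: dict mapping subword label -> transfer function array.
    """
    return [convert_subword(signal[start:end], bank[label])
            for label, start, end in segments]
```

A label missing from `bank` raises `KeyError` here; the source does not describe a fallback for unknown subwords, so none is invented.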

In the fourth step, the digital filter unit 15 of Fig. 1 synthesizes the converted subwords into a continuous output speech signal.

To synthesize the conversion results, which are fragmented subword by subword, into one continuous output speech signal, a convolution operation is performed with a digital filter based on, for example, an FIR (finite impulse response) filter. As illustrated in the lower part of Fig. 2, a function connecting the latest data (here, the subword most recently converted by the sound quality conversion unit 14 of Fig. 1) with an arbitrary number n of consecutive past data is obtained, and a convolution operation interpolates between the data. Repeating this for each subword compensates for the differences at the joins between the fragmented subwords and smooths the output speech signal. The digital filter is not limited to an FIR filter: any digital filter that interpolates between subwords by a convolution operation, such as an IIR (infinite impulse response) filter, may be used.
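The source does not give the filter coefficients or say exactly which samples are filtered, so the following is only one possible reading: concatenate the converted subwords, convolve with a short symmetric FIR kernel, and keep the filtered samples only in a small window around each join. The function name, the Hann-shaped kernel, and the tap count are assumptions.

```python
import numpy as np

def join_with_fir(segments, n_taps=9):
    """Concatenate converted subwords and smooth each join with an FIR kernel."""
    signal = np.concatenate(segments)
    # Hann-shaped kernel without its zero endpoints, normalized to unit gain.
    kernel = np.hanning(n_taps + 2)[1:-1]
    kernel = kernel / kernel.sum()
    smoothed = np.convolve(signal, kernel, mode="same")
    out = signal.copy()
    boundary = 0
    for seg in segments[:-1]:
        boundary += len(seg)                      # sample index of this join
        lo = max(boundary - n_taps, 0)
        hi = min(boundary + n_taps, len(signal))
        out[lo:hi] = smoothed[lo:hi]              # filter only around the join
    return out
```

Restricting the filter to the neighborhood of each join leaves the interior of every subword untouched, so the high-frequency content restored by the transfer function is not attenuated by the smoothing.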

In the fifth step, the continuous output speech signal is output from the output unit 16 of Fig. 1. The output unit 16 can be chosen freely according to purpose and application: a loudspeaker or a communication device such as a mobile phone for output as sound, or e-mail, a word processor, or the like for output as text via speech recognition software.

As noted above, one feature of the present invention is its robustness to ambiguity. An example follows.

Suppose subword identification places the boundaries wrongly: for example, the speaker says "asahi" (/a/sa/hi/) but it is identified as "asarahii" (/a/sa/ra/hi/i/). The transfer functions for "a", "sa", "ra", "hi", and "i" are then called up. In this case, however, the original input speech has in effect been analyzed in finer detail, so the output speech is still close to the speaker's utterance.

Suppose instead that a sound, rather than a boundary, is misrecognized: for example, as mentioned above, "asahi" (/a/sa/hi/) is taken for "asashi" (/a/sa/shi/). The transfer functions for "a", "sa", and "shi" are then called up. Here "hi" (/hi/) in the input was mistaken for "shi" (/shi/), but a misrecognition implies that the speaker's utterance was ambiguous, that is, that the intended sound and the misrecognized sound resemble each other, in which case their transfer functions are also similar. The output speech therefore differs little from what the speaker intended, and the speaker's meaning is still conveyed.
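The claim that confusable subwords have similar transfer functions could be checked numerically. The source names no similarity measure; the sketch below uses the RMS log-spectral distance in decibels as an assumed metric, with a hypothetical function name.

```python
import numpy as np

def log_spectral_distance(h1, h2, eps=1e-12):
    """RMS difference, in dB, between the magnitudes of two transfer functions."""
    ratio_db = 20.0 * np.log10((np.abs(h1) + eps) / (np.abs(h2) + eps))
    return float(np.sqrt(np.mean(ratio_db ** 2)))
```

Under this metric, the argument above predicts a small distance between the stored functions for /hi/ and /shi/ and a larger one between, say, /hi/ and /a/.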

The embodiment above assumes a speech support method for a person with impaired speech function, so the input signal was described as the body-conducted sound of the speaker's esophageal speech and the output speech as the speaker's own past voice, but the invention is not limited to this. The input signal can be changed as appropriate: body-conducted sound from the vocal cord speech of a person with normal speech function, the actual voice, non-audible murmur, and so on. If no sample of the speaker's own voice exists, another person's voice can be used for the output speech in the same way. As stated above, the output method can also be chosen freely: sound through a loudspeaker, a communication device, text output, and so on.

In short, if the user and the purpose are clear when the transfer functions are created, a suitable input and output method can be selected and transfer functions created to match, which allows flexible adaptation to the purpose.

For example, if the input signal is the body-conducted sound of a speaker capable of vocal cord speech, the output is in the speaker's own voice, and the output device is a communication device such as a mobile phone, the method provides clear communication under high noise with the influence of external noise suppressed.

If an ordinary voice microphone is used for the input signal and the transfer functions are created with the voice of another person, such as a family member or a celebrity, as the output speech, the speaker's voice is converted into the other person's voice. The method can thus also serve as a voice changer that reproduces the speaker's free utterances in another person's voice.

For conversion of the word "asahi", the effectiveness of sound quality conversion was verified using the body-conducted sound of vocal cord speech from a speaker with normal speech function and that speaker's own voice. The body-conducted sound was picked up with an acceleration pickup at the upper left of the upper lip, a position confirmed to be effective in Non-Patent Document 1. The sampling frequency was 16 kHz.

Boundaries between subwords were identified with the freely available software "Julius 3.4.2". The subword unit was the syllable. The acoustic and language models needed to run the software were the speaker-independent model and the web-based 60,000-word bigram supplied with "Dictation Kit Ver. 3.0", which is distributed with the software. The choice of speech recognition software is not restricted; operation was also confirmed with other software.

The speaker's own voice, used as the model of the output speech signal, was recorded with a microphone placed 30 cm from the speaker.

Figs. 4 to 7 show the measurement results. In each figure the upper trace is the waveform and the shading below it is the spectrogram. Fig. 4 shows the waveform and spectrogram of "asahi" recorded with the microphone, and Fig. 5 shows those of the body-conducted sound for the same utterance. Comparing the two shows that the body-conducted sound has lost mainly its high-frequency components.

A transfer function was generated from the voice of Fig. 4 and the body-conducted sound of Fig. 5. Fig. 6 shows the result of converting the body-conducted sound of Fig. 5 with that transfer function and passing it through the FIR filter. Fig. 6 shows that the high-frequency components lost in the body-conducted sound have been recovered. The waveforms and spectrograms of Fig. 4 and Fig. 6 are similar, which indicates that the output speech has a voice quality similar to the voice model and similar content. "/SIL/" in Fig. 6 denotes silence.
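The recovery of high-frequency components is judged visually from the spectrograms in the source. A simple numerical counterpart, offered here only as an illustration, is the share of signal energy above some cutoff frequency. The 4 kHz cutoff is an arbitrary assumption (half the Nyquist frequency at the 16 kHz sampling rate stated above), and the function name is hypothetical.

```python
import numpy as np

def high_band_energy_ratio(signal, fs=16000, cutoff_hz=4000.0):
    """Fraction of the signal's spectral energy at or above cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    total = spectrum.sum()
    if total == 0:
        return 0.0                      # silent input: no energy in any band
    return float(spectrum[freqs >= cutoff_hz].sum() / total)
```

Computed for the microphone voice, the body-conducted sound, and the converted output, this ratio would be expected to be low for the body-conducted sound and closer to the microphone voice after conversion, if the behavior described for Figs. 4 to 6 holds.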

For reference, Fig. 7 shows the result of converting the utterance "a-sa-hi" with the transfer functions for "hi-hi-a", which differ in both vowel and consonant. The waveform differs considerably, but the spectrogram shows that the high-frequency characteristics are still recovered.

The invention can reproduce more natural and clearer speech in environments where communicating with a clear voice is difficult, for example in supporting the esophageal speech of people with impaired speech function or in supporting communication through devices such as mobile phones in high-noise environments. Voice-changer applications, such as a doll that speaks in a mother's voice, are also conceivable.

Fig. 1: Block diagram of an apparatus that implements the speech support method of the present invention.
Fig. 2: Schematic diagram of the speech support method for a person with impaired speech function.
Fig. 3: Illustration of transfer function generation.
Fig. 4: Waveform and spectrogram of the speaker's original voice.
Fig. 5: Waveform and spectrogram of the speaker's body-conducted sound.
Fig. 6: Waveform and spectrogram restored from the body-conducted sound using the correct transfer functions.
Fig. 7: Waveform and spectrogram restored from the body-conducted sound using incorrect transfer functions.

Explanation of symbols

11 Signal input unit
12 Speech recognition unit
13 Transfer function storage unit
14 Sound quality conversion unit
15 Digital filter unit
16 Output unit
21 Voice model input unit
22 Transfer function generation unit

Claims

1. A speech support method comprising:
a step of creating in advance, using the cross-spectrum method, and storing, for each subword shorter than a word, a transfer function that relates input signal information, namely a speaker's body-conducted sound or voice, to a model of the speech signal to be output;
a step of inputting the speaker's body-conducted sound or voice as an input signal;
a step of identifying the input signal subword by subword;
a step of converting the sound quality of each subword with the transfer function corresponding to that subword;
a step of synthesizing the converted subwords into a continuous output speech signal; and
a step of outputting the continuous output speech signal.

2. The speech support method according to claim 1, wherein the subword is a phoneme, a demisyllable, or a syllable.

3. The speech support method according to claim 1, wherein the transfer function is a function that converts the input signal, subword by subword, into the sound quality to be output, and is stored in association with each subword.

4. The speech support method according to claim 1, wherein the transfer functions are called up successively, subword by subword, as the input signal is input.

5. The speech support method according to claim 1, wherein the step of synthesizing the continuous output speech signal is a convolution operation by a digital filter.