JP2002259990A

JP2002259990A - Character input method and apparatus, character input program, and storage medium storing this program

Info

Publication number: JP2002259990A
Application number: JP2001054745A
Authority: JP
Inventors: Kensaku Fujii; 憲作藤井; Jun Shimamura; 潤島村; Tomohiko Arikawa; 知彦有川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2001-02-28
Filing date: 2001-02-28
Publication date: 2002-09-13

Abstract

(57) [Abstract]

[Problem] To provide a character input method and apparatus that use image information to recognize human speech with high accuracy and input it as a character string.

[Solution] First, high-speed image input means 1 inputs a high-speed image, captured by a camera, of a region containing the speaker's lips, or the lips and chin. Next, shape extraction means 2 extracts a shape including the lips from the high-speed image. Shape change extraction means 3 then extracts the shape including the lips and the pattern of its temporal change. Pattern distribution calculation means 4 generates a three-dimensional distribution of the lip shape and the pattern of its temporal change. Finally, spoken word recognition means 5 recognizes the spoken word by matching the three-dimensional distribution of the pattern against collation dictionary 6. During this matching, the time lag between the pattern and the audio signal produced during the utterance is also used to recognize the spoken word.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a character input method and apparatus having an interface that handles speech information, such as an input device for computers such as PCs and portable terminals or for home appliances such as telephones and televisions, or an interface for people with speech or hearing impairments.

[0002]

[Prior Art] The character input interfaces widely used in the above fields process audio signals. These schemes are susceptible to ambient noise, which makes sufficiently accurate recognition difficult. They are also awkward to use as an interface because, to bystanders, the user appears to be talking to himself or herself. To address these problems, several techniques for character input using image information have been developed. For example, the apparatus described in Japanese Unexamined Patent Application Publication No. H11-149296 tracks lip movement in the input image information and recognizes the spoken words.

[0003]

[Problem to Be Solved by the Invention] However, such image information has low temporal resolution, so changes in lip shape cannot be captured at high speed. When it is used for character input, the target is lips that move very quickly, so sufficiently accurate recognition is difficult and the approach is not practical. The method and apparatus described in Japanese Unexamined Patent Application Publication No. H6-12483 address this problem by using myoelectric potential waveforms. As an input interface, however, such an apparatus becomes bulky, and it forfeits the advantage of a camera, namely that image information can be acquired easily from a distance.

[0004] The present invention has been made in view of the above problems of the prior art, and its object is to provide a character input method and apparatus that use image information to recognize human speech with high accuracy and input it as a character string.

[0005]

[Means for Solving the Problem] To solve the above problem, the character input method according to the present invention is a character input method for recognizing human speech and inputting a character string, characterized in that a spoken word is recognized from at least a shape including the speaker's lips and the pattern of temporal change of that shape.

[0006] Alternatively, in the above character input method, in the step of recognizing the spoken word, the shape including the lips and the pattern of temporal change of that shape are extracted from an image of the lips, or of a region including the lips and chin, captured by a camera capable of high-speed imaging.

[0007] Alternatively, in the above character input method, in the step of recognizing the spoken word, the shape including the lips and the pattern of temporal change of that shape are calculated by computing, at high-speed time intervals, the transition amounts of feature points extracted from the image.

[0008] Alternatively, in the above character input method, in the step of recognizing the spoken word, the spoken word is recognized by matching a three-dimensional distribution of the pattern.

[0009] Alternatively, in the above character input method, in the step of matching the three-dimensional distribution of the pattern, the three-dimensional distribution is generated by calculating the shape including the lips, and the pattern of temporal change of that shape, at all feature points contained in the image region.

[0010] Alternatively, in the above character input method, the step of matching the three-dimensional distribution of the pattern uses the three-dimensional distribution of the pattern together with its time lag relative to the audio signal acquired during the utterance.

[0011] The character input apparatus according to the present invention is a character input apparatus for recognizing human speech and inputting a character string, characterized by comprising: high-speed image input means for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; shape extraction means for extracting a shape including the lips from the high-speed image; shape change extraction means for extracting the shape including the lips extracted from the high-speed image and the pattern of temporal change of that shape; pattern distribution calculation means for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of that shape; and spoken word recognition means for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

[0012] Alternatively, in the above character input apparatus, the spoken word recognition means recognizes the spoken word by using, when matching the three-dimensional distribution of the pattern, the time lag relative to the audio signal acquired during the utterance.

[0013] The character input program according to the present invention is a program for executing on a computer a character input method for recognizing human speech and inputting a character string, characterized by comprising: a procedure for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; a procedure for extracting a shape including the lips from the high-speed image; a procedure for extracting the shape including the lips extracted from the high-speed image and the pattern of temporal change of that shape; a pattern distribution calculation procedure for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of that shape; and a procedure for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

[0014] Alternatively, in the above character input program, the procedure for recognizing the spoken word uses, when matching the three-dimensional distribution of the pattern, the time lag relative to the audio signal acquired during the utterance to recognize the spoken word.

[0015] The storage medium storing the character input program according to the present invention is a storage medium storing a program for executing on a computer a character input method for recognizing human speech and inputting a character string, characterized in that a character input program is stored on a computer-readable storage medium so that the computer can execute it, the program comprising: a procedure for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; a procedure for extracting a shape including the lips from the high-speed image; a procedure for extracting the shape including the lips extracted from the high-speed image and the pattern of temporal change of that shape; a pattern distribution calculation procedure for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of that shape; and a procedure for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

[0016] Alternatively, in the above storage medium storing the character input program, the procedure for recognizing the spoken word uses, when matching the three-dimensional distribution of the pattern, the time lag relative to the audio signal acquired during the utterance to recognize the spoken word.

[0017] In the present invention, when human speech is recognized, the transition amounts of feature points extracted from an image of a region including the lips, or the lips and chin, captured for example by a camera capable of high-speed imaging, are calculated at high-speed time intervals. This makes it possible to calculate the shape including the lips and the pattern of its temporal change, and achieves highly accurate recognition of spoken words using images from imaging means such as a camera.

[0018] In addition, spoken words can be matched by using the three-dimensional distribution of the shape including the lips and the pattern of its temporal change, and further the time lag relative to the audio signal acquired during the utterance. This reduces noise in the matching when spoken words are recognized and achieves efficient, very fast processing.

[0019]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0020] FIG. 1 shows an embodiment of an apparatus that implements the character input method according to the present invention. The apparatus of this embodiment comprises: high-speed image input means 1 for inputting a high-speed image of a region including the speaker's lips and chin; shape extraction means 2 for extracting the lip shape from the high-speed image; shape change extraction means 3 for extracting the lip shape and the pattern of its temporal change from the high-speed image; pattern distribution calculation means 4 for generating a three-dimensional distribution of the lip shape and the pattern of its temporal change; and spoken word recognition means 5 for recognizing a spoken word from the three-dimensional distribution of the pattern. The apparatus also comprises collation dictionary 6, which, for the purpose of recognizing spoken words, describes the correspondence between words and the three-dimensional distributions of the lip shape and the pattern of its temporal change.

[0021] First, in high-speed image input means 1, the speaker's lips and chin are input by a high-speed camera or the like as two-dimensional image data that is continuous at high-speed time intervals. The input region extends from just below the eyes to around the throat so that the lips and chin are reliably captured.

[0022] Next, shape extraction means 2 analyzes the obtained image data for feature points indicating, for example, the contours of the lips and chin. There are various methods for this shape analysis; one example is to apply image processing such as edge enhancement and then binarize the image to obtain the shape. Because these methods do not differ from conventional ones, a detailed description is omitted here.
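As an illustration of the "edge enhancement followed by binarization" example mentioned above, the following is a minimal sketch only. The patent does not specify the filter, the threshold, or the data format; the central-difference gradient, the `threshold` parameter, and all function names here are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed format: a grayscale image as a list of rows of intensities.
Image = List[List[float]]


def edge_strength(img: Image) -> Image:
    """Approximate the gradient magnitude with central differences.

    This stands in for "edge enhancement"; border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = img[r][c + 1] - img[r][c - 1]
            gy = img[r + 1][c] - img[r - 1][c]
            out[r][c] = abs(gx) + abs(gy)
    return out


def binarize(img: Image, threshold: float) -> List[List[int]]:
    """Map every pixel to 1 if it reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in img]


def contour_points(img: Image, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) positions of edge pixels, in row-major order."""
    mask = binarize(edge_strength(img), threshold)
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


# Toy 5x5 image with a single bright pixel in the center (hypothetical data).
toy = [[0.0] * 5 for _ in range(5)]
toy[2][2] = 10.0
points = contour_points(toy, threshold=5.0)
```

In a real system the resulting edge pixels would still need to be reduced to a stable set of lip and chin feature points that can be tracked from frame to frame; that step is outside this sketch.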

[0023] Next, for the extracted feature points indicating the contours of the lips and chin obtained in this way, shape change extraction means 3 calculates their transition amounts at high-speed time intervals, yielding the lip shape and the pattern of its temporal change.
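The transition-amount calculation can be sketched as a frame-to-frame difference of feature point positions. This is a hedged illustration, not the patent's implementation: it assumes that the same feature points have already been tracked in every frame and are given as (x, y) pairs, and the names `transition_amounts` and `fps` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # assumed: one (x, y) position per tracked feature point


def transition_amounts(frames: List[Frame], fps: float) -> Tuple[float, List[List[Point]]]:
    """Return the frame interval in seconds and per-interval displacements.

    steps[t][p] is the (dx, dy) movement of feature point p between
    frame t and frame t + 1.
    """
    dt = 1.0 / fps
    steps = []
    for prev, curr in zip(frames, frames[1:]):
        step = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        steps.append(step)
    return dt, steps


# Two feature points tracked over three frames (hypothetical data).
frames = [
    [(0, 0), (10, 5)],
    [(0, 1), (10, 7)],
    [(0, 3), (10, 7)],
]
dt, steps = transition_amounts(frames, fps=500)
```

The sequence `steps[0][p], steps[1][p], ...` for a fixed point `p` is one possible reading of that point's "pattern of temporal change".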

[0024] Next, pattern distribution calculation means 4 stacks the transitions of all feature points contained in the region being processed, generating a three-dimensional distribution of the lip shape and the pattern of its temporal change.
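One way to picture the stacking step is as a surface over two axes, feature point index and time, whose height is how far each point has moved. The patent does not define the three axes explicitly, so the choice below (displacement magnitude from the first frame as the third axis) and the function name `pattern_distribution` are assumptions made purely for illustration.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # assumed: one (x, y) position per tracked feature point


def pattern_distribution(frames: List[Frame]) -> List[List[float]]:
    """Stack the temporal pattern of every feature point into one grid.

    dist[p][t] is the distance of feature point p at frame t from its
    position in the first frame. The axes (p, t, dist[p][t]) together
    form a three-dimensional distribution.
    """
    first = frames[0]
    n_points = len(first)
    dist = []
    for p in range(n_points):
        x0, y0 = first[p]
        row = [hypot(frame[p][0] - x0, frame[p][1] - y0) for frame in frames]
        dist.append(row)
    return dist


# Two feature points over three frames (hypothetical data).
frames = [
    [(0, 0), (4, 0)],
    [(0, 3), (4, 0)],
    [(3, 4), (4, 2)],
]
dist = pattern_distribution(frames)
```

Each row of `dist` corresponds to a single-point pattern; stacking the rows for all feature points gives the full distribution.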

[0025] Finally, spoken word recognition means 5 matches the three-dimensional pattern distribution obtained in this way against the previously stored pattern-distribution collation dictionary 6, and the recognized character string is obtained as output. Preferably, the time lag relative to the audio signal acquired during the utterance may also be treated as a parameter in this matching.
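The matching against the dictionary can be sketched as a nearest-template search. The patent does not state the similarity measure or how the time lag enters the comparison; the sum of squared differences, the `lag_weight` parameter, the `Entry` structure, and the word labels below are all assumptions for illustration, and the grids are assumed to have identical shape.

```python
from typing import Dict, List, NamedTuple, Optional

Grid = List[List[float]]  # assumed: distribution indexed as [feature point][time]


class Entry(NamedTuple):
    distribution: Grid  # reference pattern distribution for a word
    lag_s: float        # hypothetical: reference lip-to-audio lag in seconds


def grid_distance(a: Grid, b: Grid) -> float:
    """Sum of squared differences between two grids of the same shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def recognize(observed: Grid, dictionary: Dict[str, Entry],
              lag_s: Optional[float] = None, lag_weight: float = 1.0) -> str:
    """Return the dictionary word whose entry is closest to the observation.

    If lag_s is given, the squared lag difference is added as an extra term.
    """
    def score(word: str) -> float:
        entry = dictionary[word]
        s = grid_distance(observed, entry.distribution)
        if lag_s is not None:
            s += lag_weight * (lag_s - entry.lag_s) ** 2
        return s

    return min(dictionary, key=score)


# Hypothetical two-word dictionary.
dictionary = {
    "a": Entry([[0.0, 1.0], [0.0, 2.0]], lag_s=0.05),
    "o": Entry([[0.0, 3.0], [0.0, 0.0]], lag_s=0.10),
}
shape_only = recognize([[0.0, 1.2], [0.0, 1.8]], dictionary)
with_lag = recognize([[0.0, 2.0], [0.0, 1.0]], dictionary, lag_s=0.10, lag_weight=100.0)
```

In the second call the observation is equally distant from both templates on shape alone, so the lag term decides the result. This shows how a time lag could act as an additional matching parameter, without claiming that this is how the patent weights it.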

[0026] The character string obtained in this way is passed to character input processing such as a so-called FEP (front-end processor), where the actual character input takes place.

[0027] From here on, the processing described above is illustrated concretely with actual data. Suppose that high-speed image input means 1 receives two-dimensional image data of the speaker's lips and chin from a high-speed camera at 500 frames per second. For example, images such as the one shown in FIG. 2 are input continuously at high-speed time intervals.
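To make the 500 frames per second figure concrete, the short calculation below converts a frame rate into a sampling interval and counts how many frames cover a given motion. Only the 500 frames per second value comes from the text; the 30 frames per second comparison rate and the 100 ms motion duration are assumptions chosen for illustration.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def frames_in(duration_ms: float, fps: float) -> int:
    """Number of whole frame intervals that fit in the given duration."""
    return int(duration_ms * fps / 1000)


high_speed_interval = frame_interval_ms(500)   # rate stated in the text
high_speed_frames = frames_in(100, 500)        # assumed 100 ms lip motion
conventional_frames = frames_in(100, 30)       # assumed conventional video rate
```

Under these assumptions the high-speed camera samples the motion every 2 ms, which is the kind of temporal resolution the description relies on for tracking fast lip movement.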

[0028] Next, shape extraction means 2 calculates feature points such as the contours of the lips and chin. For example, feature points such as those shown in FIG. 3 are calculated from the two-dimensional image data shown in FIG. 2.

[0029] Next, for the feature points obtained in this way, shape change extraction means 3 calculates their transition amounts at high-speed time intervals, yielding the lip shape and the pattern of its temporal change. For example, the pattern of temporal change of the feature point labeled 302 in FIG. 3 is as shown in FIG. 4.

[0030] Next, pattern distribution calculation means 4 stacks such patterns for the transitions of all feature points contained in the region being processed, generating a three-dimensional distribution of the lip shape and the pattern of its temporal change. For example, when this three-dimensional distribution is displayed with a reduced number of feature points for readability, it appears as shown in FIG. 5.

[0031] When the audio signal acquired during the utterance and the pattern of FIG. 4 are superimposed on a common time axis, the result is as shown in FIG. 6. As can be seen, the start and end of the lip movement do not coincide with the timing at which the audio signal is produced; there is a time lag between them. Spoken word recognition means 5 matches this time lag and the calculated three-dimensional distribution of the pattern against the previously stored pattern-distribution collation dictionary 6 to obtain a character string.
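A simple way to estimate such a time lag is to compare the onset times of the lip motion and of the audio signal. The sketch below is illustrative only: the patent does not describe how the lag is measured, and the threshold-based onset detection, the sign convention, the sampling rates, and all names are assumptions.

```python
from typing import List, Optional


def onset_time(samples: List[float], rate_hz: float, threshold: float) -> Optional[float]:
    """Time in seconds of the first sample whose magnitude reaches the threshold."""
    for i, v in enumerate(samples):
        if abs(v) >= threshold:
            return i / rate_hz
    return None


def onset_lag(lip: List[float], lip_fps: float,
              audio: List[float], audio_rate: float,
              lip_threshold: float, audio_threshold: float) -> Optional[float]:
    """Audio onset minus lip-motion onset, in seconds.

    Assumed sign convention: a positive value means the lips started
    moving before the sound was produced.
    """
    t_lip = onset_time(lip, lip_fps, lip_threshold)
    t_audio = onset_time(audio, audio_rate, audio_threshold)
    if t_lip is None or t_audio is None:
        return None
    return t_audio - t_lip


# Hypothetical signals: lip displacement at 500 frames/s, audio at 8000 Hz.
lip_pattern = [0.0, 0.0, 0.5, 1.0, 1.2]
audio_signal = [0.0] * 80 + [0.3, -0.4, 0.5]
lag = onset_lag(lip_pattern, 500, audio_signal, 8000,
                lip_threshold=0.4, audio_threshold=0.2)
```

Both signals must share a common time origin for this comparison to be meaningful, which corresponds to aligning the time axes as described for FIG. 6.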

[0032] It goes without saying that some or all of the processing functions of the units shown in FIG. 1 can be implemented using a computer, and that a computer can be made to execute the processing procedure realized by that configuration. A program for implementing the processing functions of the units on a computer, or a program for causing a computer to execute the processing procedure, can be recorded on a computer-readable storage medium, such as an FD (floppy disk: registered trademark), MO, ROM, memory card, CD, DVD, or removable disk, and stored or provided in that form. It can also be distributed over a communication network such as the Internet.

[0033] The present invention has been described concretely above on the basis of an embodiment, but it is not limited to that embodiment and can of course be modified in various ways without departing from its gist. The present invention may be applied to a system composed of multiple devices or to an apparatus consisting of a single device. It goes without saying that the present invention is also applicable where it is achieved by supplying a program to a system or apparatus.

[0034]

[Effects of the Invention] According to the present invention, when human speech is recognized, the lip shape and the pattern of its temporal change can be calculated, for example by computing at high-speed time intervals the transition amounts of feature points extracted from an image of the lips, or of a region including the lips and chin, captured by a camera capable of high-speed imaging. This provides the effect of highly accurate recognition of spoken words.

[0035] In addition, spoken words can be matched by using the three-dimensional distribution of the lip shape and the pattern of its temporal change, and further the time lag relative to the audio signal acquired during the utterance. This provides the effect of reducing noise in the matching when spoken words are recognized and of achieving efficient, very fast processing.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of an apparatus that implements the character input method according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating an example of a high-speed image in an embodiment of the present invention.

FIG. 3 is a diagram illustrating an example of feature points extracted from the above high-speed image in an embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of the pattern of temporal change of the above feature points in an embodiment of the present invention.

FIG. 5 is a diagram illustrating an example of the three-dimensional distribution of the pattern of temporal change of the above feature points in an embodiment of the present invention.

FIG. 6 is a diagram illustrating an example of the timing offset between the pattern of temporal change of the above feature points and the audio signal in an embodiment of the present invention.

[Explanation of symbols]

1: high-speed image input means
2: shape extraction means
3: shape change extraction means
4: pattern distribution calculation means
5: spoken word recognition means
6: collation dictionary

Continuation of front page

(72) Inventor: Tomohiko Arikawa, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo

F-terms (reference): 5B057 BA02 CA12 CA16 DA06 DB02 DC05 DC16 DC36 5D015 BB01 LL00 5L096 BA20 CA02 FA06 HA01 HA09 JA11

Claims

1. A character input method for recognizing human speech and inputting a character string, characterized in that a spoken word is recognized from at least a shape including the speaker's lips and a pattern of temporal change of the shape.

2. The character input method according to claim 1, characterized in that, in the step of recognizing the spoken word, the shape including the lips and the pattern of temporal change of the shape are extracted from an image of the lips, or of a region including the lips and chin, captured by a camera capable of high-speed imaging.

3. The character input method according to claim 1, characterized in that, in the step of recognizing the spoken word, the shape including the lips and the pattern of temporal change of the shape are calculated by computing, at high-speed time intervals, the transition amounts of feature points extracted from an image.

4. The character input method according to claim 1, characterized in that, in the step of recognizing the spoken word, the spoken word is recognized by matching a three-dimensional distribution of the pattern.

5. The character input method according to claim 4, characterized in that, in the step of matching the three-dimensional distribution of the pattern, the three-dimensional distribution of the pattern is generated by calculating the shape including the lips and the pattern of temporal change of the shape at all feature points contained in the image region.

6. The character input method according to claim 4 or 5, characterized in that the step of matching the three-dimensional distribution of the pattern uses the three-dimensional distribution of the pattern and a time lag relative to an audio signal acquired during the utterance.

7. A character input apparatus for recognizing human speech and inputting a character string, characterized by comprising: high-speed image input means for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; shape extraction means for extracting a shape including the lips from the high-speed image; shape change extraction means for extracting the shape including the lips extracted from the high-speed image and a pattern of temporal change of the shape; pattern distribution calculation means for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of the shape; and spoken word recognition means for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

8. The character input apparatus according to claim 7, characterized in that the spoken word recognition means recognizes the spoken word by using, when matching the three-dimensional distribution of the pattern, a time lag relative to an audio signal acquired during the utterance.

9. A character input program for executing on a computer a character input method for recognizing human speech and inputting a character string, characterized by comprising: a procedure for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; a procedure for extracting a shape including the lips from the high-speed image; a procedure for extracting the shape including the lips extracted from the high-speed image and a pattern of temporal change of the shape; a pattern distribution calculation procedure for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of the shape; and a procedure for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

10. The character input program according to claim 9, characterized in that the procedure for recognizing the spoken word uses, when matching the three-dimensional distribution of the pattern, a time lag relative to an audio signal acquired during the utterance to recognize the spoken word.

11. A storage medium storing a character input program, being a storage medium storing a program for executing on a computer a character input method for recognizing human speech and inputting a character string, characterized in that a character input program is stored on a storage medium readable by the computer so that the computer can execute it, the program comprising: a procedure for inputting a high-speed image of a region including at least the speaker's lips, or the lips and chin; a procedure for extracting a shape including the lips from the high-speed image; a procedure for extracting the shape including the lips extracted from the high-speed image and a pattern of temporal change of the shape; a pattern distribution calculation procedure for generating a three-dimensional distribution of the lip shape and the pattern of temporal change of the shape; and a procedure for recognizing a spoken word by matching the three-dimensional distribution of the pattern.

12. The storage medium storing the character input program according to claim 11, characterized in that the procedure for recognizing the spoken word uses, when matching the three-dimensional distribution of the pattern, a time lag relative to an audio signal acquired during the utterance to recognize the spoken word.