JP2020091559A

JP2020091559A - Expression recognition device, expression recognition method, and program

Info

Publication number: JP2020091559A
Application number: JP2018227019A
Authority: JP
Inventors: 賢志近藤; Kenji Kondo
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2020-06-11

Abstract

To provide an expression recognition device that can recognize expression with high accuracy even during utterance. SOLUTION: An expression recognition device 1 comprises: a face image obtaining unit 10 which obtains a face image; a voice obtaining unit 12 which obtains speech voice; a speech voice recognition unit 13 which obtains phonemes constituting the obtained voice; a lip shape change DB 16 which stores data regarding a lip shape change corresponding to phonemes; a lip shape change calculation unit 15 which reads data corresponding to phonemes obtained by the speech voice recognition unit 13 from the lip shape change DB 16, and generates a face image in which a lip shape change corresponding to the phonemes has been canceled from the face image obtained by the face image obtaining unit 10; and an expression identifier 17 which estimates expression based on a face image obtained by the lip shape change calculation unit 15. SELECTED DRAWING: Figure 1

Description

The present invention relates to a facial expression recognition device, a facial expression recognition method, and a program for recognizing human facial expressions.

Various products that recognize human facial expressions from image information are already in practical use. Most of them extract image feature points from a face image and use a classifier trained in advance on a large number of face images to detect the positions in the image of the corresponding facial landmark points (identification points defined in advance for face recognition, such as the two ends of the eyes, nose, and mouth). They then detect states such as the face orientation, whether the mouth is open or closed, and whether the eyes are open or closed, and estimate the emotion according to which emotion data in a data set of face images prepared as training data (joy, anger, sadness, and so on) the input most closely resembles.

Other applications include systems that use near-infrared camera images instead of ordinary visible-light (RGB) images to gain robustness against changes in ambient light; systems that use multiple camera images (for example, a stereo camera), estimate depth information from the parallax between the images, and add that depth to the feature points extracted from the image data to perform three-dimensional classification; and systems that use a depth camera sensor, which projects an image pattern such as a random dot pattern in infrared light and obtains more accurate depth information from the parallax changes of the pattern reflected on the face. Although the type of input image data differs among these systems, the basic configuration of the subsequent stages does not change much: a classifier recognizes the face position and changes in the landmark points, and similarity or classification is judged against training data that includes emotion labels.

Patent Document 1: JP 2017-120609 A

In facial expression recognition, particularly when expressions are recognized while the system is conversing with the user, changes around the mouth caused by speech reduce recognition accuracy. In particular, the training data used to train the classifier consists of still images annotated with ground-truth emotion labels (joy, anger, sadness, and so on) and contains no data covering changes in mouth shape during speech, so subtle changes around the mouth prevent the expression from being recognized correctly.

To solve this problem, a technique has been proposed in which training images are split into images of the whole face and images that exclude the mouth region, and, when speech is detected, emotion recognition switches to a trained model that used only the upper half of the face as training images (Patent Document 1). However, uniformly discarding lip information means that emotional information expressed around the mouth goes unused.

The present invention therefore provides a facial expression recognition device that can recognize facial expressions with high accuracy even during speech.

The facial expression recognition device of the present invention includes: a face image acquisition unit that acquires a face image; a voice acquisition unit that acquires speech; a speech recognition unit that determines the phonemes constituting the acquired speech; a database that stores data on changes in lip shape corresponding to phonemes; a lip shape change calculation unit that reads from the database the data corresponding to the phonemes determined by the speech recognition unit and generates, from the face image acquired by the face image acquisition unit, a face image in which the lip shape change corresponding to the phonemes has been canceled; and a facial expression classifier that estimates the facial expression based on the face image obtained by the lip shape change calculation unit.

With this configuration, in addition to the image information acquired by the face image acquisition unit, speech is acquired at the same time by the voice acquisition unit. The spoken phonemes are recognized through speech signal processing and matched against face shape change data corresponding to those phonemes, the face image obtained from the input image is processed to suppress the changes and deformations caused by speech, and facial expression recognition is then performed. This allows facial expressions to be recognized accurately even during speech.

The facial expression recognition device of the present invention may include a data pipeline that integrates the time series of the face images acquired by the face image acquisition unit and of the phoneme data determined by the speech recognition unit. With this configuration, the face image and the speech can be synchronized, and the lip movement corresponding to the speech can be canceled appropriately.

In the facial expression recognition device of the present invention, the lip shape change calculation unit may, in a section where no phoneme exists, interpolate the data read from the database and cancel the lip shape change using the interpolated data. The interpolation may be linear interpolation or interpolation using a spline curve. This allows the lip movement corresponding to speech to be canceled even in sections where no phoneme exists.

In the facial expression recognition device of the present invention, the database may store default data on lip shape changes and coefficients to be applied to that data, and the coefficients may be set for each user through an initial calibration. This allows lip movement to be canceled in a way that matches each user's own lip movement.

In the facial expression recognition device of the present invention, the database may store default data on lip shape changes and coefficients to be applied to that data, and the coefficients may be corrected under a policy of keeping the face image in which the lip shape change has been canceled from becoming erroneous. This allows coefficients that match each user's lip movement to be set even if the user does not perform a calibration beforehand.

In the facial expression recognition device of the present invention, the database may store data corresponding to bigrams, which include the data of either the preceding or the following phoneme, or to trigrams, which include the data of both the preceding and the following phonemes. Even when the same phoneme is uttered, lip movement can differ depending on the neighboring phonemes. With this configuration, lip movement can be canceled appropriately with the neighboring phonemes taken into account.

In the facial expression recognition device of the present invention, the voice acquisition unit may be a microphone array having a plurality of microphones. When the face image acquisition unit acquires face images of a plurality of people, the speech uttered by the person in each face image may be identified using the microphone array, and the facial expression identification unit may identify the expression in each face image using the identified speech. With this configuration, lip movement can be canceled even when a plurality of people are speaking.

The facial expression recognition method of the present invention is a method in which a facial expression recognition device recognizes a facial expression from a face image. The method includes: a step in which the device acquires a face image; a step in which the device acquires speech; a step in which the device determines the phonemes constituting the acquired speech; a step in which the device reads the data corresponding to the determined phonemes from a database storing data on lip shape changes corresponding to phonemes and generates, from the face image, a face image in which the lip shape change corresponding to the phonemes has been canceled; and a step in which the device estimates the facial expression based on the face image in which the lip shape change has been canceled.

The program of the present invention is a program for recognizing a facial expression from a face image. It causes a computer to execute: a step of acquiring a face image; a step of acquiring speech; a step of determining the phonemes constituting the acquired speech; a step of reading the data corresponding to the determined phonemes from a database storing data on lip shape changes corresponding to phonemes and generating, from the face image, a face image in which the lip shape change corresponding to the phonemes has been canceled; and a step of estimating the facial expression based on the face image in which the lip shape change has been canceled.

According to the present invention, facial expressions can be recognized appropriately even during speech.

A diagram showing the configuration of the facial expression recognition device of the first embodiment.
(a) A diagram showing the result of unifying the time series of facial feature point data and phoneme data. (b) A diagram showing an example in which facial expression recognition data is requested in a section with no phoneme data.
A diagram showing the data stored in the lip shape change DB of the first embodiment.
A diagram showing the change in lip shape when two phonemes are uttered.
A flowchart showing the operation of the facial expression recognition device of the first embodiment.
A diagram showing the configuration of the facial expression recognition device of the second embodiment.
A diagram showing the data stored in the lip shape change DB of the third embodiment.

Hereinafter, facial expression recognition devices according to embodiments of the present invention will be described with reference to the drawings.
(First embodiment)
FIG. 1 is a diagram showing the configuration of the facial expression recognition device 1. The facial expression recognition device 1 includes a face image capturing unit 10, a facial feature point recognition unit 11, a speech input unit 12, a speech phoneme recognition unit 13, a data pipeline 14, a lip shape change calculation unit 15, a lip shape change database (hereinafter, "lip shape change DB") 16, and a facial expression identification unit 17.

The face image capturing unit 10 is a camera that captures the user whose facial expression is to be recognized. Specifically, it is a visible-light camera, a near-infrared camera, a multi-camera setup that provides depth information, a depth camera, or the like. The face image capturing unit 10 captures a face image and sends the face image data to the facial feature point recognition unit 11.

From the face image data acquired by the face image capturing unit 10, the facial feature point recognition unit 11 estimates and extracts the positions of feature points that indicate predefined landmark positions, such as the positions of the facial organs (eyes, nose, mouth, and so on) and the facial contour. Each facial feature point obtained includes a three-dimensional coordinate position. The method of obtaining the three-dimensional coordinate positions is as follows.

When the points are measured directly as three-dimensional points including depth, as with a depth camera, the three-dimensional positions of the feature points are obtained without any three-dimensional estimation. When, depending on the face image capturing device, landmark detection yields only positions on the two-dimensional image data, the three-dimensional pose is estimated from the positions of the landmark points projected onto the two-dimensional image, and the three-dimensional position of each landmark point, including its depth, is then determined.

When the face is turned sideways, or when part of the face is hidden by an obstruction so that some landmark positions do not appear in the image, the facial feature point recognition unit 11 estimates the positions of the missing points from the recognized points and from the positional relationships between feature points recorded in advance as a face shape, and outputs those estimates. The unit also estimates the orientation of the whole face from the facial contour and the feature point positions of the landmarks, and outputs it together with the feature points.

The speech input unit 12 receives the speech of the user whose facial expression is to be recognized. Specifically, it is, for example, a microphone. The speech input unit 12 sends the audio data to the speech phoneme recognition unit 13.

The speech phoneme recognition unit 13 estimates the sequence of spoken phoneme symbols from the speech signal and outputs that sequence. Methods that can be used to estimate the phoneme sequence from speech include estimation from formants by speech signal processing, estimation based on a probabilistic statistical model (GMM-HMM), and estimation that learns and classifies phoneme features with a deep learning model (DNN-HMM) (Keiichi Tokuda, "Speech Recognition and Speech Synthesis Using Hidden Markov Models", IPSJ Magazine Vol. 45 No. 10, Oct. 2004). The present embodiment is not limited to any particular method of recognizing spoken phonemes. The phoneme definitions used in the recognition result may be of any kind, as long as they correspond to the phoneme data stored in the lip shape change DB 16 described later.

The data pipeline 14 integrates the outputs of the facial feature point recognition unit 11 and the speech phoneme recognition unit 13 along a common time axis and holds the data for a fixed period. Because the two recognition processes take different amounts of time, the facial feature point data obtained from the face image and the phoneme data obtained from the speech signal are not necessarily available at the same moment. The subsequent processing, however, must handle the two recognition results as a pair that corresponds to the same point in real time. The data pipeline 14 therefore records and stores the recognition results for the facial feature points and for the spoken phonemes for a fixed period, so that later stages can retrieve both kinds of data for any specified time.
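The buffering behavior described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name `DataPipeline`, the retention parameter `horizon_s`, and the representation of phonemes as half-open `(start, end, phoneme)` intervals are all assumptions made for illustration. Each result is stored with its capture time rather than the time at which recognition finished, which is what lets the two streams be paired afterwards.

```python
from collections import deque


class DataPipeline:
    """Hypothetical buffer that keeps recent recognition results by capture time."""

    def __init__(self, horizon_s=2.0):
        self.horizon_s = horizon_s   # how long results are retained, in seconds
        self.frames = deque()        # (timestamp, landmarks)
        self.phonemes = deque()      # (start, end, phoneme)

    def add_frame(self, timestamp, landmarks):
        """Store facial feature points tagged with the image capture time."""
        self.frames.append((timestamp, landmarks))
        self._trim(timestamp)

    def add_phoneme(self, start, end, phoneme):
        """Store a recognized phoneme tagged with the interval it was spoken in."""
        self.phonemes.append((start, end, phoneme))

    def _trim(self, now):
        # Drop anything older than the retention horizon.
        while self.frames and self.frames[0][0] < now - self.horizon_s:
            self.frames.popleft()
        while self.phonemes and self.phonemes[0][1] < now - self.horizon_s:
            self.phonemes.popleft()

    def phoneme_at(self, timestamp):
        """Return the phoneme spoken at the given time, or None if there is none."""
        for start, end, phoneme in self.phonemes:
            if start <= timestamp < end:
                return phoneme
        return None
```

A later stage would call `phoneme_at` with the timestamp of the frame it is processing. A `None` result corresponds to a section with no phoneme data.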

FIG. 2(a) is a diagram showing the result of unifying the time series of the facial feature point data and the phoneme data. The shaded portions in FIG. 2(a) are the times at which recognition result data exists. As shown in FIG. 2(a), the data pipeline 14 stores the data with the times at which the recognition results were obtained synchronized.

The lip shape change calculation unit 15 cancels the lip shape change caused by speech in the face image and calculates the face image as it would be if the user were not speaking. Specifically, it corrects the positions of the facial feature points recognized from the face image by the amount of lip movement attributable to the speech. This calculation uses the data stored in the lip shape change DB 16.

FIG. 3 is a schematic diagram showing the data stored in the lip shape change DB 16. The lip shape change DB 16 is a database that stores, for each spoken phoneme, the amount of change in facial feature points such as those of the lips and chin (referred to in this document as "lip shape change data"). The lip shape change data is used to cancel the lip movement corresponding to a phoneme.

For example, the phoneme "a" (あ) is pronounced with the mouth wide open, so the feature points of the lower lip and chin are lower than in the non-speaking state. To cancel the lip movement of "a" and correct the image to the non-speaking face, the feature points of the lower lip and chin are moved upward, as shown in FIG. 3. The phoneme "mu" (む) is pronounced with the upper and lower lips pressed tightly together, so the feature points of the upper and lower lips are closer together than in the non-speaking state. To cancel the lip movement of "mu" and correct the image to the non-speaking face, the upper lip is moved slightly upward and the lower lip slightly downward, as shown in FIG. 3.

The lip shape change calculation unit 15 obtains, from the data stored by the data pipeline 14, the phoneme data for the time corresponding to the facial feature points, reads the lip shape change data corresponding to that phoneme from the lip shape change DB 16, and applies it to the facial feature points to obtain facial feature point data in which the lip movement has been canceled.
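As a rough illustration of this lookup-and-apply step, the following Python sketch stores, for each phoneme, the displacement that speaking causes at each feature point and subtracts it from the observed points. The table contents, the landmark names, and the numeric values are invented for illustration, and the sketch assumes a frontal face at the same scale as the stored data, with the y axis pointing up. The sign convention (store the speech-induced displacement, then subtract) is also an assumption. Storing the opposite-signed correction and adding it would be equivalent.

```python
# Hypothetical lip shape change table: displacement caused by speaking each
# phoneme, per landmark, as (dx, dy, dz) relative to the non-speaking face.
LIP_SHAPE_CHANGE_DB = {
    "a": {"lower_lip": (0.0, -8.0, 0.0), "chin": (0.0, -10.0, 0.0)},
    "mu": {"upper_lip": (0.0, -1.0, 0.0), "lower_lip": (0.0, 1.0, 0.0)},
}


def cancel_lip_change(landmarks, phoneme, db=LIP_SHAPE_CHANGE_DB):
    """Return landmarks with the phoneme's speech-induced displacement removed.

    landmarks maps a landmark name to an (x, y, z) position. Landmarks that the
    phoneme does not move, and unknown phonemes, are returned unchanged.
    """
    offsets = db.get(phoneme, {})
    corrected = {}
    for name, (x, y, z) in landmarks.items():
        dx, dy, dz = offsets.get(name, (0.0, 0.0, 0.0))
        corrected[name] = (x - dx, y - dy, z - dz)
    return corrected


observed = {
    "nose": (0.0, 0.0, 0.0),
    "lower_lip": (0.0, -30.0, 0.0),
    "chin": (0.0, -50.0, 0.0),
}
neutral = cancel_lip_change(observed, "a")
```

With these made-up numbers, canceling "a" moves the lower lip and chin upward and leaves the nose where it was, which matches the direction of correction described for FIG. 3.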

The lip shape change data recorded in advance for each phoneme is for a frontal face, so it cannot be applied as is to face images with a different face orientation or overall face size. Based on the feature points obtained during facial feature recognition, the lip shape change calculation unit 15 calculates the face orientation and, from the front-to-back and left-to-right extents of the face, the individual's face size. It applies the corresponding size and orientation transformation to the lip change amount read from the lip shape change DB 16 and then subtracts the lip shape deformation from the original facial feature points.
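One way to picture this size and orientation transformation is the Python sketch below. The specification does not give formulas, so everything here is an assumption: the face size is reduced to a single uniform scale taken from the left-to-right width, the orientation is reduced to a yaw angle about the vertical axis, and the function and landmark names are hypothetical.

```python
import math


def face_scale(landmarks, reference_width, left="face_left", right="face_right"):
    """Ratio of the observed face width to the width the stored data assumes."""
    return math.dist(landmarks[left], landmarks[right]) / reference_width


def transform_offset(offset, scale, yaw_rad):
    """Scale a frontal-face displacement and rotate it about the vertical (y) axis."""
    dx, dy, dz = offset
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    rotated_x = c * dx + s * dz
    rotated_z = -s * dx + c * dz
    return (scale * rotated_x, scale * dy, scale * rotated_z)


def subtract_offset(point, offset):
    """Remove a transformed displacement from an observed feature point."""
    return tuple(p - o for p, o in zip(point, offset))
```

For example, a stored downward chin displacement of `(0.0, -10.0, 0.0)` applied to a face 1.5 times the reference size becomes `(0.0, -15.0, 0.0)` regardless of yaw, because a vertical displacement is unaffected by rotation about the vertical axis. A full implementation would also need pitch and roll.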

As shown in FIG. 2(a), the phoneme data has sections with no recognized phoneme, which are intermediate states between the preceding and following recognized phonemes. For example, as shown in FIG. 2(b), facial expression recognition data may be requested in a section with no phoneme data. In that case, lip change amount data for the intermediate state is generated from the preceding and following phoneme data. The processing for obtaining the intermediate state is described next.

FIG. 4 is a diagram showing the change in lip shape when two phonemes are uttered. Speech is not produced one phoneme at a time with breaks in between but continuously, so the lip shape between two recognized phonemes is an intermediate state that changes gradually from the lip shape of the preceding phoneme. The lip shape for each time frame is therefore synthesized as an intermediate state between the phonemes recognized before and after it. To generate an intermediate state from two shape states (three-dimensional coordinate positions of the feature points), the change in coordinate position between the preceding and following shape states is divided by the number of frames in the transition time, and the resulting increments are accumulated as needed. For example, when the preceding and following phonemes are "a" and "mu", the lip shape change data for the lip movement in the intermediate state can be calculated by interpolating between the lip shape change data for "a" and that for "mu" shown in FIG. 3.
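The frame-by-frame accumulation described above amounts to linear interpolation between the two sets of displacement data. The Python sketch below shows one way to write it. The function name, the use of timestamps instead of frame counts, the displacement values for "a" and "mu", and the choice to treat a landmark missing from one phoneme's data as zero displacement are all assumptions for illustration.

```python
ZERO = (0.0, 0.0, 0.0)


def interpolate_offsets(prev_offsets, next_offsets, t_prev, t_next, t):
    """Linearly interpolate per-landmark displacements between two phonemes.

    t_prev is when the preceding phoneme's shape applies, t_next is when the
    following phoneme's shape applies, and t is the time being requested.
    """
    if t_prev >= t_next or not (t_prev <= t <= t_next):
        raise ValueError("expected t_prev <= t <= t_next with t_prev < t_next")
    w = (t - t_prev) / (t_next - t_prev)  # 0 at the preceding phoneme, 1 at the next
    result = {}
    for name in set(prev_offsets) | set(next_offsets):
        a = prev_offsets.get(name, ZERO)
        b = next_offsets.get(name, ZERO)
        result[name] = tuple(x + w * (y - x) for x, y in zip(a, b))
    return result


# Made-up displacement data for the two phonemes used in the text.
offsets_a = {"lower_lip": (0.0, -8.0, 0.0), "chin": (0.0, -10.0, 0.0)}
offsets_mu = {"upper_lip": (0.0, -1.0, 0.0), "lower_lip": (0.0, 1.0, 0.0)}

# One frame into a four-frame transition from "a" to "mu".
quarter = interpolate_offsets(offsets_a, offsets_mu, t_prev=0, t_next=4, t=1)
```

With a four-frame transition, `t=1` gives a weight of 0.25, so each displacement has moved a quarter of the way from its "a" value toward its "mu" value.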

There are several ways to interpolate the lip shape change data. Besides simple division into equal linear steps (linear interpolation), the change from the start to the end of the transition can be defined in advance, for each pair of phonemes, by a spline curve or the like. The method can be chosen according to how the data set that defines the lip shape for each phoneme is defined and recorded.
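To show how a curve-based alternative differs from equal linear steps, the Python sketch below replaces the linear weight with a smoothstep weight, which is a single cubic Hermite segment with zero slope at both ends. This is only a stand-in: the specification speaks of spline curves defined in advance for each phoneme transition and does not say what shape they take, so the curve chosen here and the function names are assumptions.

```python
def ease(w):
    """Smoothstep weight: a cubic that starts and ends with zero slope."""
    w = min(1.0, max(0.0, w))
    return w * w * (3.0 - 2.0 * w)


def blend(start_offset, end_offset, w):
    """Blend two (dx, dy, dz) displacements using the eased weight."""
    e = ease(w)
    return tuple(a + e * (b - a) for a, b in zip(start_offset, end_offset))
```

Compared with linear interpolation, the eased weight changes slowly near each phoneme and fastest in the middle of the transition. At a quarter of the way through, `ease(0.25)` is 0.15625 rather than 0.25.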

The facial expression identification unit 17 identifies the facial expression, taking as input the feature point data of the whole face from which the lip shape change calculation unit 15 has subtracted the face shape deformation caused by the lip shape change. Identification methods include those based on statistical techniques and those using a classifier trained by deep learning. In the present embodiment, any method may be used as long as it accepts the facial feature point data as input.

FIG. 5 is a flowchart showing the operation of the facial expression recognition device 1 of the first embodiment. The facial expression recognition device 1 captures a face image of the user to be recognized with the face image capturing unit 10 (S10) and recognizes the feature points of the facial organs from the captured face image (S11). At the same time as the face image is captured, the device picks up and inputs the speech with the speech input unit 12 (S12). The device then recognizes the phonemes contained in the input speech (S13). Using the data pipeline 14, the device stores the facial feature points and the phonemes in synchronization, based on the time at which the facial feature points were obtained and the time at which the phonemes of the speech were obtained (S14).

Next, the facial expression recognition device 1 identifies the data of the phoneme that was being uttered when the face image to be recognized was captured and reads the lip shape change data corresponding to that phoneme from the lip shape change DB 16 (S15). If there is no data for a phoneme uttered at the time the face image was captured (the case shown in FIG. 2(b)), the device reads the data of the preceding and following phonemes and interpolates the lip shape change data corresponding to them to obtain the lip shape change data. The device then applies the lip shape change data to the facial feature points to cancel the effect of the lip movement caused by speech (S16).
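The selection logic of step S15, including the fallback to interpolation, might look like the following Python sketch. It is a simplified illustration rather than the procedure of the specification: the function name, the millisecond timestamps, the table values, the linear interpolation between the end of the preceding phoneme and the start of the following one, and the decision to return no correction when speech exists on only one side are all assumptions.

```python
ZERO = (0.0, 0.0, 0.0)

# Hypothetical per-phoneme displacement data (dx, dy, dz) per landmark.
DB = {
    "a": {"lower_lip": (0.0, -8.0, 0.0), "chin": (0.0, -10.0, 0.0)},
    "mu": {"upper_lip": (0.0, -1.0, 0.0), "lower_lip": (0.0, 1.0, 0.0)},
}


def offsets_for_time(t, intervals, db=DB):
    """Pick the displacement data to cancel for a frame captured at time t.

    intervals is a list of (start, end, phoneme) sorted by start time, with
    half-open, non-overlapping intervals.
    """
    prev_ph = next_ph = None
    for start, end, phoneme in intervals:
        if start <= t < end:
            return dict(db[phoneme])      # a phoneme was being spoken at t
        if end <= t:
            prev_ph = (end, phoneme)      # latest phoneme that ended before t
        elif next_ph is None:
            next_ph = (start, phoneme)    # first phoneme that starts after t
    if prev_ph is None or next_ph is None:
        return {}                         # assumption: nothing to cancel
    (t0, p0), (t1, p1) = prev_ph, next_ph
    w = (t - t0) / (t1 - t0)
    a, b = db[p0], db[p1]
    return {
        name: tuple(x + w * (y - x) for x, y in zip(a.get(name, ZERO), b.get(name, ZERO)))
        for name in set(a) | set(b)
    }


intervals_ms = [(0, 100, "a"), (300, 400, "mu")]
during_a = offsets_for_time(50, intervals_ms)
in_gap = offsets_for_time(200, intervals_ms)
after_speech = offsets_for_time(500, intervals_ms)
```

In this example the frame at 200 ms falls halfway through the gap between "a" and "mu", so it receives the midpoint of the two displacement sets. The result would then be subtracted from the observed feature points in step S16.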

The facial expression recognition device 1 then recognizes the facial expression using the face image in which the effect of the lip movement caused by speech has been canceled (S17).

The configuration of the facial expression recognition device 1 of the present embodiment has been described above. An example of the hardware of the facial expression recognition device 1 is a computer equipped with a CPU, RAM, ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like, together with a camera that captures face images and a microphone that inputs speech. The facial expression recognition device 1 is realized by storing a program having modules that implement the functions described above in the RAM or ROM and executing the program on the CPU. Such a program is also included in the scope of the present invention.

The facial expression recognition device 1 according to the present embodiment obtains the lip shape change from the uttered speech and recognizes the facial expression based on a face image from which the lip shape change data has been subtracted, so it can recognize facial expressions accurately even during utterance. Further, because the face image in which the lip shape change has been canceled is passed to the facial expression identification unit 17, a conventional facial expression recognition method can be used for the facial expression identification unit 17.

(Second Embodiment)
FIG. 6 is a diagram showing the configuration of the facial expression recognition device 2 according to the second embodiment. Lip movement during utterance differs from person to person. The facial expression recognition device 2 of the second embodiment reflects these individual differences and thereby cancels the influence of utterance-induced lip movement more accurately.

The basic configuration of the facial expression recognition device 2 of the second embodiment is the same as that of the first embodiment, except that the facial expression recognition device 2 has a correction coefficient database (hereinafter referred to as the "correction coefficient DB") 18. The correction coefficient DB 18 stores, for each individual user, a coefficient that determines how strongly the lip shape change data stored in the lip shape change DB 16 is applied when canceling the influence of utterance-induced lip movement.

This coefficient can be set during the initial setup of the facial expression recognition device 2 by capturing images of the user uttering phonemes and using those images. If the initial setup is too burdensome, the coefficient may instead be calibrated gradually while the facial expression recognition device 2 is in use. If the correction coefficient is not appropriate, applying the lip shape change data multiplied by the correction coefficient to the user's face image may produce an erroneous face image. A simple example is the case in which moving a lower lip feature point upward places it above an upper lip feature point. When such an error occurs, the correction coefficient is adjusted so as to reduce the degree to which the lower lip feature point is moved upward. In this way, calibration can be performed gradually while facial expression recognition is carried out. The other configurations and operations of the facial expression recognition device 2 of the second embodiment are the same as those of the first embodiment.
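The gradual calibration described above can be sketched as follows. This is only an illustration under stated assumptions: the lips are reduced to one upper and one lower feature point, only the vertical coordinate is considered (image coordinates, y growing downward), the upper lip point is taken as already corrected, and the multiplicative shrink factor and all names are hypothetical choices rather than anything specified in the embodiment.

```python
def corrected_lower_y(lower_y: float, lower_dy: float, coeff: float) -> float:
    """Lower lip y after subtracting the coefficient-scaled displacement."""
    return lower_y - coeff * lower_dy


def calibrate_coefficient(upper_y: float, lower_y: float, lower_dy: float,
                          coeff: float, shrink: float = 0.9,
                          max_iter: int = 50) -> float:
    """Reduce coeff until the corrected lower lip no longer lies above the upper lip.

    In image coordinates "above" means a smaller y value, so the corrected
    face is treated as erroneous while corrected_lower_y < upper_y.
    """
    for _ in range(max_iter):
        if corrected_lower_y(lower_y, lower_dy, coeff) >= upper_y:
            return coeff
        coeff *= shrink  # move the lower lip upward less strongly
    return coeff


# With coeff = 1.0 the lower lip would end at y = 98, above the upper lip at y = 100.
new_coeff = calibrate_coefficient(upper_y=100.0, lower_y=104.0,
                                  lower_dy=6.0, coeff=1.0)
```

The value returned for a user could then be written back to the correction coefficient DB 18 and reused for subsequent frames.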

The facial expression recognition device 2 of the second embodiment obtains the lip shape change of each individual user from the uttered speech and recognizes the facial expression based on a face image from which the lip shape change data has been subtracted, so it can recognize facial expressions accurately even during utterance.

In the present embodiment, individual differences among users are handled by combining a general lip shape change DB 16 with the correction coefficient DB 18; however, a separate lip shape change DB 16 may instead be provided for each user.

(Third Embodiment)
Next, the facial expression recognition device 3 according to the third embodiment will be described. The basic configuration of the facial expression recognition device 3 of the third embodiment is the same as that of the facial expression recognition device 1 of the first embodiment, but the data stored in the lip shape change DB 16 is different.

FIG. 7 is a diagram showing an example of the data stored in the lip shape change DB 16 of the facial expression recognition device 3 according to the third embodiment. In the first embodiment, the lip shape change DB 16 stores lip shape change data corresponding to a single phoneme, whereas in the present embodiment it stores lip shape change data corresponding to the middle phoneme of three consecutive phonemes (a trigram). Even for the same phoneme, lip movement differs depending on the preceding and following phonemes, so holding lip shape change data that takes those neighboring phonemes into account makes it possible to define the lip shape change data for a phoneme more accurately. The other configurations and operations of the facial expression recognition device 3 of the third embodiment are the same as those of the first embodiment.
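One way to picture the trigram-keyed data is a table indexed by (preceding, middle, following) phoneme. The sketch below is hypothetical throughout: the entries, the displacement values, and the fallback to a single-phoneme table when a trigram is missing are illustrative assumptions, not part of the embodiment.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Trigram = Tuple[str, str, str]  # (preceding, middle, following) phoneme

# Hypothetical trigram table: displacement of [upper lip, lower lip] for the middle phoneme.
TRIGRAM_DB: Dict[Trigram, List[Point]] = {
    ("k", "a", "t"): [(0.0, -2.0), (0.0, 6.0)],
    ("m", "a", "m"): [(0.0, -1.0), (0.0, 3.0)],  # lips close on both sides, smaller opening
}

# Hypothetical single-phoneme table, used only as a fallback (an assumption of this sketch).
UNIGRAM_DB: Dict[str, List[Point]] = {
    "a": [(0.0, -1.5), (0.0, 5.0)],
}


def lookup_change(prev: str, middle: str, nxt: str) -> Optional[List[Point]]:
    """Return the context-dependent change for the middle phoneme, if known."""
    return TRIGRAM_DB.get((prev, middle, nxt), UNIGRAM_DB.get(middle))


same_phoneme_different_context = (lookup_change("k", "a", "t"),
                                  lookup_change("m", "a", "m"))
```

The two lookups return different displacements for the same middle phoneme "a", which is the effect the trigram table is meant to capture.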

The facial expression recognition device 3 of the third embodiment obtains the lip shape change of each individual user from the uttered speech and recognizes the facial expression based on a face image from which the lip shape change data has been subtracted, so it can recognize facial expressions accurately even during utterance.

A trigram has been used as the example here. As a modification, the DB may instead hold lip shape change data corresponding to the latter of two consecutive phonemes, or lip shape change data corresponding to the former of two consecutive phonemes.

The facial expression recognition device of the present invention has been described in detail above with reference to embodiments, but the present invention is not limited to those embodiments.

An image captured by the face image capturing unit 10 may contain a plurality of faces. In such a case, the uttered speech of the user to be recognized must be identified so that the facial expression recognition device can cancel the influence of lip movement from that user's face image. The facial expression recognition device of the present invention may use a microphone array consisting of a plurality of microphones as the utterance voice input unit 12. The microphone array makes it possible to determine the direction of a sound source and to separate the uttered speech, so the speech of the user to be recognized can be identified. As in the embodiments described above, lip shape change data is then obtained from the uttered speech and canceled from the face image, which allows accurate facial expression recognition even when there are multiple users.
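How a separated speech stream might be tied to a face is not detailed in the text. The sketch below shows one conceivable approach under strong assumptions: the camera and the microphone array share a reference direction, each face has an azimuth estimated from its position in the image, each separated source has an estimated direction of arrival, and a face is matched to the nearest source within a tolerance. The function name, identifiers, and the 15 degree tolerance are all hypothetical.

```python
from typing import Dict, Optional


def assign_voices(face_azimuths: Dict[str, float],
                  source_azimuths: Dict[str, float],
                  tolerance_deg: float = 15.0) -> Dict[str, Optional[str]]:
    """Match each face to the separated sound source closest in azimuth.

    Returns None for a face when no source lies within tolerance_deg,
    i.e. that person is assumed not to be speaking.
    """
    assignment: Dict[str, Optional[str]] = {}
    for face_id, face_az in face_azimuths.items():
        best = min(source_azimuths.items(),
                   key=lambda item: abs(item[1] - face_az),
                   default=None)
        if best is not None and abs(best[1] - face_az) <= tolerance_deg:
            assignment[face_id] = best[0]
        else:
            assignment[face_id] = None
    return assignment


# Azimuths in degrees, with 0 meaning straight ahead of the camera (assumed convention).
faces = {"A": -20.0, "B": 25.0, "C": 80.0}
sources = {"s0": 23.0, "s1": -18.0}
matched = assign_voices(faces, sources)
```

Each face with a matched source would then go through phoneme recognition and lip shape cancellation using that source's speech, while a face without a match could be passed to the expression classifier unchanged.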

In the embodiments described above, the lip shape change is obtained from the uttered speech. In addition, accuracy may be improved further by generating learning data in which lip changes for uttered phonemes and for intermediate states are added to a still-image data set for facial expression identification, and by using that learning data to build a model that estimates lip changes.

The present invention is useful as a device for recognizing a facial expression from a face image.

1, 2 facial expression recognition device, 10 face image capturing unit, 11 face feature point recognition unit,
12 utterance voice input unit, 13 utterance phoneme recognition unit, 14 data pipeline,
15 lip shape change calculation unit, 16 lip shape change DB, 17 facial expression identification unit,
18 correction coefficient DB.

Claims

A facial expression recognition device comprising:
a face image acquisition unit that acquires a face image;
a voice acquisition unit that acquires uttered speech;
a speech recognition unit that obtains the phonemes constituting the acquired speech;
a database that stores data on lip shape changes corresponding to phonemes;
a lip shape change calculation unit that reads, from the database, the data corresponding to the phonemes obtained by the speech recognition unit and generates a face image in which the change in lip shape corresponding to the phonemes has been canceled from the face image acquired by the face image acquisition unit; and
an expression discriminator that estimates a facial expression based on the face image obtained by the lip shape change calculation unit.

The facial expression recognition device according to claim 1, further comprising a data pipeline that integrates a face image acquired by the face image acquisition unit and a time series of phoneme data obtained by the voice recognition unit.

The facial expression recognition device, wherein the lip shape change calculation unit interpolates the data read from the database for an interval in which no phoneme exists, and cancels the change in lip shape using the interpolated data.

The facial expression recognition device according to claim 3, wherein the lip shape change calculation unit performs linear interpolation or interpolation using a spline curve.

5. The database stores data on changes in default lip shape and coefficients applied to the data, and the coefficients are set for each user by initial calibration. Facial expression recognition device described in.

The facial expression recognition device according to claim 1, wherein the database stores default lip shape change data and coefficients applied to that data, and the coefficients are corrected so that the face image in which the change in lip shape has been canceled does not become erroneous.

The facial expression recognition device according to claim 1, wherein the database stores data corresponding to trigrams that include the preceding and following phonemes.

The facial expression recognition device according to any one of claims 1 to 7, wherein the voice acquisition unit is a microphone array having a plurality of microphones,
when face images of a plurality of people are acquired by the face image acquisition unit, the voice uttered by the person corresponding to each face image is identified by the microphone array, and
the expression discriminator identifies the facial expression of each face image using the identified voice.

A facial expression recognition method for recognizing a facial expression from a face image by a facial expression recognition device, the method comprising:
the facial expression recognition device acquiring a face image;
the facial expression recognition device acquiring uttered speech;
the facial expression recognition device obtaining the phonemes constituting the acquired speech;
the facial expression recognition device reading data corresponding to the obtained phonemes from a database that stores data on lip shape changes corresponding to phonemes, and generating a face image in which the change in lip shape corresponding to the phonemes has been canceled from the face image; and
the facial expression recognition device estimating a facial expression based on the face image in which the change in lip shape has been canceled.

A program for recognizing a facial expression from a face image, the program causing a computer to execute:
a step of acquiring a face image;
a step of acquiring uttered speech;
a step of obtaining the phonemes constituting the acquired speech;
a step of reading data corresponding to the obtained phonemes from a database that stores data on lip shape changes corresponding to phonemes, and generating a face image in which the change in lip shape corresponding to the phonemes has been canceled from the face image; and
a step of estimating a facial expression based on the face image in which the change in lip shape has been canceled.