JP2019075108A

JP2019075108A - Information processing method and device and information detection method and device

Info

Publication number: JP2019075108A
Application number: JP2018188151A
Authority: JP
Inventors: Wei Shen; Rujie Liu
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-10-18
Filing date: 2018-10-03
Publication date: 2019-05-16
Anticipated expiration: 2038-10-03
Also published as: CN109685087B; JP7119865B2; CN109685087A; CN109685087B9

Abstract

[Problem] To provide an information processing method and apparatus, and an information detection method. [Solution] The information processing method extracts, from a training set, three images corresponding to the same semantic feature. The three images include a first image and a second image that have the same semantic feature value for the semantic feature, and a third image whose semantic feature value differs from that of the first and second images. A variational autoencoder (VAE) is used to obtain the distributions of the latent variables corresponding to the semantic feature of the three images, and, for each of the three images, the parameters of the VAE are updated so as to minimize a loss function. The loss function is positively correlated with a first distance between the latent variable distribution of the first image and that of the second image, and negatively correlated with a second distance between the latent variable distribution of the first image and that of the third image. [Selected drawing] FIG. 1

Description

The present invention relates to the field of information processing and, more specifically, to an information processing method and apparatus, and an information detection method and apparatus, capable of extracting discriminative facial semantic features.

Image generation has advanced markedly in recent years. Images are generated with models such as the generative adversarial network (GAN) and the variational autoencoder (VAE). However, a GAN takes random noise as input and cannot encode an image into a latent space. A VAE can encode an image into a latent space, but that latent space carries no semantic meaning. In other words, these models cannot extract discriminative facial semantic features.

The following is a brief summary of the present invention, intended to provide a basic understanding of some of its aspects. This summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description given later.

In view of the above problems, an object of the present invention is to provide an information processing method and apparatus, and an information detection method and apparatus, capable of extracting discriminative facial semantic features.

One aspect of the present invention provides an information processing method including: a step of extracting, from a training set, three images corresponding to the same semantic feature, the three images including a first image and a second image that have the same semantic feature value for the semantic feature, and a third image whose semantic feature value differs from that of the first image and the second image; a step of obtaining, with a variational autoencoder (VAE), the distributions of the latent variables corresponding to the semantic feature of the three images; and a step of updating, for each of the three images, the parameters of the VAE so as to minimize a loss function, the loss function being positively correlated with a first distance between the latent variable distribution of the first image and the latent variable distribution of the second image, and negatively correlated with a second distance between the latent variable distribution of the first image and the latent variable distribution of the third image.

Another aspect of the present invention provides an information processing apparatus including: image extraction means for extracting, from a training set, three images corresponding to the same semantic feature, the three images including a first image and a second image that have the same semantic feature value for the semantic feature, and a third image whose semantic feature value differs from that of the first image and the second image; latent variable distribution acquisition means for obtaining, with a variational autoencoder (VAE), the distributions of the latent variables corresponding to the semantic feature of the three images; and parameter update means for updating, for each of the three images, the parameters of the VAE so as to minimize a loss function, the loss function being positively correlated with a first distance between the latent variable distribution of the first image and the latent variable distribution of the second image, and negatively correlated with a second distance between the latent variable distribution of the first image and the latent variable distribution of the third image.

One aspect of the present invention provides an information detection method including a step of inputting each of a plurality of images into a trained variational autoencoder (VAE), obtaining the distribution of the latent variables corresponding to the semantic feature of each image, and obtaining a reconstructed image of each image. For three images among the plurality of images that have the same semantic feature, the three images including a first image and a second image that have the same semantic feature value for the semantic feature and a third image whose semantic feature value differs from that of the first image and the second image, a first distance between the latent variable distribution of the first image and the latent variable distribution of the second image is smaller than a second distance between the latent variable distribution of the first image and the latent variable distribution of the third image.

Other aspects of the present invention further provide computer program code and a computer program product for implementing the above methods of the present invention, as well as a computer-readable storage medium on which computer program code for implementing the above methods is recorded.

Other aspects of the embodiments of the present invention are described below, and preferred embodiments are described in detail, but the present invention is not limited to these embodiments.

To aid understanding of other features and advantages of the present invention, embodiments of the present invention are described with reference to the drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar components. The drawings described here illustrate preferred embodiments only; they do not cover every possible embodiment and are not intended to limit the scope of the present invention.
The drawings show, in order: a flowchart of an example flow of the information processing method according to an embodiment of the present invention; a diagram of metric learning according to the embodiment; a block diagram of a network for implementing the information processing method; a block diagram of the configuration of the encoding network and the decoding network according to the embodiment; a diagram of the structure of the latent variables according to the embodiment; a block diagram of an example functional configuration of the information processing apparatus according to the embodiment; and a block diagram of an exemplary configuration of a personal computer serving as an information processing apparatus to which the embodiment is applicable.

Exemplary embodiments of the present invention are described in detail below with reference to the drawings. For convenience of explanation, the specification does not show every feature of an actual implementation. In an actual implementation, a particular embodiment may be modified to meet the developer's specific goals, for example in accordance with system-related and business-related constraints. Moreover, although such development work may be very complex and time-consuming, it is merely routine work for those skilled in the art who have the benefit of this disclosure.

To keep the present invention clear, the drawings show only the apparatus configurations and/or processing steps closely related to the embodiments of the present invention; details unrelated to the invention are omitted.

The main purpose of a VAE is to reconstruct its input image: the input of the VAE is the original image and the output is the reconstructed image. More specifically, the VAE encodes the input image and obtains a distribution representation of the latent variables. This representation is a Gaussian distribution described by a mean vector and a standard deviation vector, both of which are one-dimensional vectors. A new vector is obtained by sampling based on the mean vector and the standard deviation vector, and this new vector is used for reconstruction to obtain the final reconstructed image. The objective function (also called the loss function) used to train the VAE has two parts: the reconstruction error (the error between the input image and the reconstructed image), and the KL (Kullback-Leibler) distance between the intermediate latent variable and a Gaussian distribution. A VAE can encode an image into a latent space, but that latent space carries no semantic meaning.
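The two-part VAE objective described above can be illustrated with a minimal sketch. It is not taken from the patent: the squared-error reconstruction term, the standard normal target distribution, the unweighted sum of the two terms, and all function names are assumptions made for illustration.

```python
import math
import random

def sample_latent(mean, std, rng):
    """Draw z = mean + std * eps per dimension, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def kl_to_standard_normal(mean, std):
    """KL distance from N(mean, diag(std^2)) to N(0, I), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mean, std))

def squared_error(x, x_rec):
    """Assumed reconstruction error: squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

def vae_loss(x, x_rec, mean, std):
    """Reconstruction error plus KL term (equal weighting is an assumption)."""
    return squared_error(x, x_rec) + kl_to_standard_normal(mean, std)
```

A latent distribution that already equals the standard normal contributes no KL penalty, so the loss then reduces to the reconstruction error alone.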

The present application provides an information processing method capable of extracting discriminative facial semantic features (for example, identity, pose, age, and gender). The method combines the image generation capability of the VAE model with metric learning.

Embodiments of the present invention are described in detail below with reference to the drawings.

First, an example flow of the information processing method 100 according to an embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a flowchart showing an example flow of the information processing method according to the embodiment. As shown in FIG. 1, the information processing method 100 includes an image extraction step S102, a latent variable distribution acquisition step S104, and a parameter update step S106.

In the image extraction step S102, three images corresponding to the same semantic feature may be extracted from the training set. The three images include a first image and a second image that have the same semantic feature value for the semantic feature, and a third image whose semantic feature value differs from that of the first image and the second image.

In conventional variational autoencoding algorithms, the individual dimensions of the latent variable do not carry specific semantic features. In the information processing method 100 according to the embodiment of the present invention, the latent variable is divided into a plurality of parts, each of which corresponds to one specific semantic feature, such as the pose, age, or gender of a face.

In the image extraction step S102, three images x_i^a, x_i^p, and x_i^n corresponding to the same semantic feature are extracted from the training set. Images x_i^a and x_i^p have the same semantic feature value, and the semantic feature value of image x_i^n differs from that of x_i^a and x_i^p. Taking identity as an example of the semantic feature, all three images have the semantic feature "identity". That x_i^a and x_i^p have the same semantic feature value means that these two images belong to the same person, and that the semantic feature value of x_i^n differs from that of x_i^a and x_i^p means that x_i^n belongs to a different person.
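The triplet selection described above can be sketched as follows. This is an illustrative sketch only: the training set layout (a list of `(image_id, labels)` pairs), the function name `sample_triplet`, and the uniform random choice are assumptions, not details given in the patent.

```python
import random
from collections import defaultdict

def sample_triplet(samples, feature, rng):
    """Pick (anchor, positive, negative) image ids for one semantic feature.

    samples: list of (image_id, labels), where labels maps a feature name
    (for example "identity") to that image's feature value.
    """
    by_value = defaultdict(list)
    for image_id, labels in samples:
        by_value[labels[feature]].append(image_id)
    # Anchor and positive need a feature value shared by at least two images.
    shared = sorted(v for v, ids in by_value.items() if len(ids) >= 2)
    if not shared or len(by_value) < 2:
        raise ValueError("training set cannot form a triplet for this feature")
    value = rng.choice(shared)
    anchor, positive = rng.sample(by_value[value], 2)
    # The negative comes from any other feature value.
    other = rng.choice(sorted(v for v in by_value if v != value))
    negative = rng.choice(by_value[other])
    return anchor, positive, negative
```

With identity as the feature, the anchor and positive are two different images of one person and the negative is an image of someone else.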

In the latent variable distribution acquisition step S104, the variational autoencoder (VAE) may be used to obtain the distributions of the latent variables corresponding to the semantic feature of the three images x_i^a, x_i^p, and x_i^n.

In the parameter update step S106, the parameters of the VAE may be updated, for each of the three images, so as to minimize a loss function. The loss function is positively correlated with the first distance between the latent variable distribution of the first image and that of the second image, and negatively correlated with the second distance between the latent variable distribution of the first image and that of the third image.

Metric learning expresses the distance between different samples by constructing a distance metric. Taking identity information as an example of the semantic feature, metric learning aims for the distance between the latent variable distributions of different images of the same person to become small, ideally converging to 0, and for the distance between the latent variable distributions of images of different people to become large.

FIG. 2 is a diagram showing metric learning according to an embodiment of the present invention. For convenience of explanation, FIG. 2 uses a, p, and n to denote the images x_i^a, x_i^p, and x_i^n, respectively, and the left and right parts of FIG. 2 each show a triplet composed of a, p, and n. Assuming that the semantic feature of the three images is identity information, a and p have the same identity value (that is, a and p correspond to the same person), and the identity value of n differs from that of a and p (that is, n corresponds to a different person). In the triplet on the left of FIG. 2, the distance between the latent variable distributions of a and p is larger than the distance between the latent variable distributions of a and n. As described above, metric learning reduces the distance between the latent variable distributions of a and p and increases the distance between the latent variable distributions of a and n. That is, for different images a and p of the same person, the distance between the latent variable distributions should become small, ideally converging to 0, and for images of different people, the distance between the latent variable distributions should become large.
As shown in the triplet on the right of FIG. 2, after this metric learning has been performed, the distance between the latent variable distributions of a and n is larger than the distance between the latent variable distributions of a and p. FIG. 2 assumes, for convenience of explanation, that the semantic feature is identity information, but this is merely an example and does not limit the present invention; the semantic feature in FIG. 2 may be another semantic feature such as pose or age.

For a triplet composed of the three images x_i^a, x_i^p, and x_i^n, the loss function L_met based on the distance metric may be expressed as follows.

L_met = [ d(q(z|x_i^a), q(z|x_i^p)) - d(q(z|x_i^a), q(z|x_i^n)) + t ]_+    (1)

In equation (1), q(z|x_i^a), q(z|x_i^p), and q(z|x_i^n) denote the latent variable distributions of images x_i^a, x_i^p, and x_i^n, respectively; d(q(z|x_i^a), q(z|x_i^p)) is the first distance, between the latent variable distributions of images x_i^a and x_i^p; and d(q(z|x_i^a), q(z|x_i^n)) is the second distance, between the latent variable distributions of images x_i^a and x_i^n. The subscript + means that when the value inside the brackets is greater than zero, that value is taken as the loss, and when it is less than zero, the loss is zero. t is a predetermined threshold that those skilled in the art may set based on experience; for example, t may be set to 0. As equation (1) shows, the loss function L_met is positively correlated with the first distance between the latent variable distribution of image x_i^a and that of image x_i^p, and negatively correlated with the second distance between the latent variable distribution of image x_i^a and that of image x_i^n. Using this metric learning yields more discriminative facial semantic features.
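The hinge-style metric loss described for equation (1) can be sketched as below. The patent text does not say how the distance between two latent variable distributions is measured, so this sketch assumes the squared Euclidean distance between their mean vectors; that choice and the function names are illustrative assumptions only.

```python
def squared_distance(mean_a, mean_b):
    """Assumed distance between two distributions: squared distance of their means."""
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))

def metric_loss(mean_anchor, mean_positive, mean_negative, t=0.0):
    """[first distance - second distance + t]_+ for one triplet."""
    first = squared_distance(mean_anchor, mean_positive)    # same feature value
    second = squared_distance(mean_anchor, mean_negative)   # different feature value
    return max(first - second + t, 0.0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the threshold t, so triplets that are already well separated stop contributing to the parameter update.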

Preferably, in the information processing method 100 according to the embodiment of the present invention, the loss function may further include a constraint that the difference between the second distance and the first distance be larger than a predetermined threshold. In the distance metric of equation (1), a minimum margin is maintained between the distance separating the latent variable distributions of images x_i^a and x_i^p and the distance separating the latent variable distributions of images x_i^a and x_i^n. For example, the predetermined threshold t in equation (1) may be set to a nonzero value, such as 1.

In the parameter update step S106, the parameters of the VAE may be updated so as to minimize the loss function L_met.

For convenience of explanation, the steps of the information processing method 100 according to the embodiment of the present invention have been described above using three images in the training set as an example; that is, training of the VAE has been described for a single triplet. To train the VAE, all triplets in the training set may be traversed. Alternatively, a number of iterations may be set in advance, and training of the VAE may be terminated once that number is reached.
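The two stopping strategies above (visit every triplet, or stop after a preset number of iterations) can be sketched as follows. The data layout, the function names, and the decision to treat anchor and positive as an unordered pair are assumptions made for this illustration.

```python
from itertools import combinations, islice

def iter_triplets(samples, feature):
    """Yield every (anchor, positive, negative) id triple for one semantic feature.

    samples: list of (image_id, labels) pairs; labels maps feature name to value.
    Anchor/positive pairs are unordered, so each pair is visited once.
    """
    for (id_a, lab_a), (id_p, lab_p) in combinations(samples, 2):
        if lab_a[feature] != lab_p[feature]:
            continue
        for id_n, lab_n in samples:
            if lab_n[feature] != lab_a[feature]:
                yield id_a, id_p, id_n

def training_schedule(samples, feature, max_iterations=None):
    """All triplets when max_iterations is None, otherwise at most that many."""
    return list(islice(iter_triplets(samples, feature), max_iterations))
```

Because the number of triplets grows quickly with the size of the training set, a preset iteration cap is often the more practical of the two options.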

To explain the information processing method according to the embodiment of the present invention more clearly, FIG. 3 shows a network for implementing the information processing method 100.

The network in FIG. 3 includes an encoding network and a decoding network, which are connected through a latent variable layer and a combining layer. The input image is encoded by the encoding network and then input to the latent variable layer. The latent variable layer contains n+1 latent variables in total, z_0, z_1, z_2, ..., z_n, each of which corresponds to one specific semantic feature. The distributions of the latent variables are constrained by metric learning (that is, as described above, by minimizing the loss function). The constrained latent variables are passed to the decoding network, which produces the output image (reconstructed image).
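The partition of the latent variable layer into per-feature parts, and their recombination before decoding, can be sketched as below. The part names, the part sizes, and the use of simple concatenation in the combining layer are assumptions for illustration; the patent does not specify them.

```python
def split_latent(z, part_sizes):
    """Split a flat latent vector into named parts, one per semantic feature.

    part_sizes: dict mapping a feature name to its number of dimensions;
    insertion order defines the layout of the flat vector.
    """
    if sum(part_sizes.values()) != len(z):
        raise ValueError("part sizes must add up to the latent dimension")
    parts, start = {}, 0
    for name, size in part_sizes.items():
        parts[name] = z[start:start + size]
        start += size
    return parts

def join_latent(parts, part_sizes):
    """Assumed combining layer: concatenate the parts back in layout order."""
    return [value for name in part_sizes for value in parts[name]]
```

Keeping the parts addressable by name makes it straightforward to apply the metric loss to one part (for example, identity) while leaving the others untouched.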

FIG. 4 is a block diagram showing the configuration of the encoding network and the decoding network according to the embodiment of the present invention. As shown in FIG. 4, the encoding network and the decoding network each consist of a plurality of hidden layers.

In contrast to the conventional VAE, in the information processing method 100 according to the embodiment of the present invention the latent variable consists of a plurality of parts, each corresponding to one specific semantic feature. In addition, using metric learning yields more discriminative facial semantic features.

Preferably, in the information processing method 100 according to the embodiment of the present invention, the loss function may further include a constraint on a supervision error, where the supervision error is calculated from the label of the semantic feature and the latent variable distribution of the image. In this way, supervision information is added to the process of training the VAE.

Preferably, the step of calculating the supervision error may include a step of mapping the latent variable distribution of the image into a class space with a nonlinear function to obtain a mapping output, and a step of calculating the supervision error from the mapping output and the label of the semantic feature using a classification loss function or a regression loss function.

As an example, when calculating the supervision error, the latent variable distribution of the image may first be mapped into the class space with a nonlinear function to obtain a mapping output; the nonlinear function may be implemented with a multilayer neural network. Let z denote the latent variable of the image, and suppose the class space contains m class subspaces (for example, an identity subspace, a pose subspace, and an age subspace, with each class corresponding to one semantic feature). The nonlinear function fu() may then be used to map the distribution of z into each of the m class subspaces, giving the mapping output in each class subspace (that is, the output in each semantic feature space) fu_i(z), where i = 0, 1, 2, ..., m-1. For instance, fu() may map the distribution of z into the identity subspace to obtain the mapping output in the identity subspace, and map the distribution of z into the pose subspace to obtain the mapping output in the pose subspace. This improves the discriminability of the latent variables in the different class subspaces.
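A mapping function of the kind described above can be sketched as a small two-layer network. The layer sizes, the tanh activation, the use of the mean vector of z as the input, and the name `fu` applied to this particular function are all assumptions for illustration; the patent only states that the nonlinear function may be a multilayer neural network.

```python
import math

def dense(x, weights, bias):
    """One fully connected layer: weights is a list of rows, one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fu(z_mean, hidden_w, hidden_b, out_w, out_b):
    """Map a latent mean vector into one class subspace (hidden layer + output layer)."""
    hidden = [math.tanh(h) for h in dense(z_mean, hidden_w, hidden_b)]
    return dense(hidden, out_w, out_b)
```

In this sketch each class subspace i would get its own set of weights, so that fu_i(z) for identity and fu_i(z) for pose are produced by separate heads over the same latent variable.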

Depending on whether the label values of the semantic feature are discrete or continuous, the supervision error may be calculated with a classification loss function or a regression loss function.

When the label values are discrete, such as the identity information of the face in an image (A, B, C, D), the supervision error is calculated with the classification loss function of equation (2). In equation (2), the probability term is the probability that the mapping output fu_i(z) is predicted as label_i, the label of the semantic feature of the i-th class.
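One common classification loss built from such a predicted probability is the softmax cross-entropy, sketched below. Whether equation (2) has exactly this form cannot be confirmed from the text, so the negative-log form, the softmax normalization, and the function names are assumptions.

```python
import math

def softmax(scores):
    """Turn raw subspace outputs into probabilities (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classification_error(head_output, label_index):
    """Negative log of the probability assigned to the true label."""
    return -math.log(softmax(head_output)[label_index])
```

For identity labels A, B, C, D, `head_output` would hold four scores and `label_index` the position of the true identity; the error shrinks as the probability of the true label approaches 1.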

When the label type is continuous, such as the rotation angle in the face pose information of an image (50 degrees, 49 degrees, 48 degrees, and so on), the supervision error is calculated with the regression loss function of equation (3). In equation (3), label_i is the label of the semantic feature of the i-th class.

For all m classes, the total supervision error is the sum of the supervision errors of the m classes, that is, the sum of the supervision error of the i-th class over i = 0, 1, 2, ..., m-1.
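The regression case and the summation over classes can be sketched as follows. The squared-error form is an assumption (the text does not show the form of equation (3)), as are the function names; only the fact that the total is a plain sum over the m classes comes from the text.

```python
def regression_error(head_output, label_value):
    """Assumed regression loss: squared difference between output and continuous label."""
    return (head_output - label_value) ** 2

def total_supervision_error(per_class_errors):
    """Total supervision error: sum of the errors of all m classes."""
    return sum(per_class_errors)
```

For example, a predicted rotation angle of 48 degrees against a label of 50 degrees gives a pose error of 4.0, which is then added to the errors of the other classes such as identity or age.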

Preferably, the prior distribution of the latent variables of the image may be an arbitrary distribution. In the information processing method 100 according to the embodiment of the present invention, the distribution of the latent variables and the prior distribution of the latent variables are not limited to the Gaussian distribution used in the conventional VAE and may be arbitrary distributions.

Preferably, the step of obtaining the prior distribution of the latent variables of the image includes a step of obtaining, based on the distribution of the latent variables, an intermediate variable that follows a Gaussian distribution, and a step of applying a nonlinear transformation to the intermediate variable to obtain the prior distribution of the latent variables.

FIG. 5 shows the structure of the latent variable according to an embodiment of the present invention. In FIG. 5, the bottom layer is the input vector from the encoding network. Through a hidden layer, the mean vector z_m and the variance vector z_v of the distribution of the latent variable z are generated, and z, the output of this structure, is obtained by sampling based on z_m and z_v. That is, z is the output of the latent variable structure shown in FIG. 5 and is input to the coupling layer in FIG. 3. The higher-level variable (intermediate variable) that follows a Gaussian distribution and is obtained based on the distribution of z is denoted u, and the prior distribution of z is denoted z'. To make the relationship among u, z' and z clear, FIG. 5 also shows u and z'; the mean vector of the distribution of u is u_m and its variance vector is u_v.

As shown in FIG. 5, the intermediate variable u, which follows a Gaussian distribution, is acquired based on the distribution of the latent variable z. The prior distribution z' of z is constructed from the intermediate variable u; that is, u is mapped nonlinearly to obtain z'. The prior distribution of u is a standard Gaussian distribution, but after the nonlinear transformation the distribution of z' may be arbitrary (that is, an arbitrary distribution may be obtained by combining multiple Gaussian distributions). In turn, constraining z and z' to have similar distributions allows z to take on the properties of an arbitrary distribution.
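The construction of z' from u can be sketched as follows. The text does not specify the nonlinear mapping, so the one-layer tanh map, its weights, and the function names here are hypothetical stand-ins chosen only to show the two steps (sample a Gaussian u, then transform it nonlinearly):

```python
import math
import random

def sample_intermediate(u_mean, u_var, rng):
    """Draw u from a diagonal Gaussian with mean vector u_m and variance vector u_v."""
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(u_mean, u_var)]

def nonlinear_map(u, weights, biases):
    """Hypothetical map z' = tanh(W u + b); the source only says 'nonlinear'."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, u)) + b)
        for row, b in zip(weights, biases)
    ]

rng = random.Random(0)                                  # fixed seed for repeatability
u = sample_intermediate([0.0, 0.0], [1.0, 1.0], rng)    # u follows a standard Gaussian
z_prior = nonlinear_map(u, [[2.0, -1.0], [0.5, 3.0]], [0.1, -0.2])
print(z_prior)
```

Because tanh squashes its input, samples of z' are confined to [-1, 1] and are no longer Gaussian, which illustrates how a nonlinear transformation of a Gaussian variable can yield a differently shaped prior.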

Preferably, in the information processing method 100 according to the embodiment of the present invention, the loss function may further include a constraint on the Kullback-Leibler divergence (KL divergence) between the distribution of the latent variable and the prior distribution of the latent variable, and a constraint on the KL divergence between the distribution of the intermediate variable and the standard Gaussian distribution.

The KL divergence (KL distance) is used to evaluate the similarity of two distributions. The smaller the difference between the two distributions, the smaller the KL divergence; the larger the difference, the larger the KL divergence.

Let P(z) denote the distribution of the latent variable z and, correspondingly, let Q(z') denote the prior distribution of the latent variable z. The KL divergence KL(P||Q) between the distribution of the latent variable and the prior distribution of the latent variable may then be expressed as follows.
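The image for equation (4) is not reproduced in this text. For reference only, the standard definition of the KL divergence between two continuous densities, which equation (4) presumably instantiates with P(z) and Q(z'), is:

```latex
\mathrm{KL}(P \,\|\, Q) = \int P(x)\,\log\frac{P(x)}{Q(x)}\,dx
```

This quantity is zero when the two densities coincide and grows as they diverge, which matches the qualitative behavior described above.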

Let S(u) denote the distribution of the intermediate variable u and G(0, 1) denote the standard Gaussian distribution. The KL divergence KL(S||G) between the distribution of the intermediate variable and the standard Gaussian distribution may then be expressed as follows.
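The image for equation (5) is likewise not reproduced. If S(u) is taken to be a diagonal Gaussian with mean vector u_m and variance vector u_v (an assumption consistent with FIG. 5, but not confirmed by the extracted text), the KL divergence to the standard Gaussian has a well-known closed form, sketched here with a hypothetical function name:

```python
import math

def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over the dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # identical distributions
print(kl_to_standard_normal([1.0], [1.0]))            # mean shifted by one
```

The divergence vanishes when S(u) is itself the standard Gaussian and increases as the mean moves away from zero or the variance away from one.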

As described above, the loss function may further include constraints on the KL divergences calculated according to equations (4) and (5).

Preferably, in the information processing method 100 according to the embodiment of the present invention, the loss function may further include a constraint on a reconstruction error, which evaluates the difference between an image input to the VAE and the corresponding image output from the VAE. When an image is reconstructed using the VAE, the image input to the VAE differs from the corresponding image output from the VAE (the reconstructed image); this difference is the reconstruction error. Taking the image triplet (the three images x_i^a, x_i^p and x_i^n) as an example, and for convenience denoting each of the three images by x_i and its corresponding output image by x_i^o, the reconstruction error L_rec of each image may be expressed as follows.

For the three images, the total reconstruction error is the sum of the reconstruction errors of the individual images. For convenience, the total reconstruction error is simply denoted L_rec in the following description.
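The image for equation (6) is not reproduced in this text. As a minimal sketch, assuming the per-image reconstruction error is the squared Euclidean distance between x_i and x_i^o (an assumption; the source only says the error evaluates their difference), the total over a triplet could be computed as follows. Images are represented as flat lists of pixel values and all names are hypothetical:

```python
def reconstruction_error(x, x_out):
    """Assumed squared Euclidean distance between an input image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_out))

def total_reconstruction_error(inputs, outputs):
    """Sum of the per-image reconstruction errors over the images of a triplet."""
    return sum(reconstruction_error(x, xo) for x, xo in zip(inputs, outputs))

# Hypothetical two-pixel images for the anchor, positive and negative samples.
triplet_in = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
triplet_out = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
print(total_reconstruction_error(triplet_in, triplet_out))
```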

In the information processing method 100 according to the embodiment of the present invention, when the loss function includes all of the above constraints, the total loss function L for any triplet of images in the training set may be expressed as follows.

In equation (7), L_rec is the total reconstruction error; the term rendered as an inline image in the original, (外12), is the total supervision error; L_met is the loss function based on the distance metric; KL(P||Q) is the KL divergence between the distribution of the latent variable and the prior distribution of the latent variable; KL(S||G) is the KL divergence between the distribution of the intermediate variable and the standard Gaussian distribution; and α and β are constants whose values lie in the range [0, 1]. The parameters of the VAE may be updated so as to minimize the total loss function.
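Equation (7) itself is an image that is not reproduced in this text, so how α and β enter the sum is not visible here. The sketch below assumes, purely for illustration, that the five terms are added and that α and β weight the two KL terms; the function name and default values are hypothetical:

```python
def total_loss(l_rec, l_sup, l_met, kl_pq, kl_sg, alpha=0.5, beta=0.5):
    """Assumed combination: L = L_rec + L_sup + L_met + alpha*KL(P||Q) + beta*KL(S||G)."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    return l_rec + l_sup + l_met + alpha * kl_pq + beta * kl_sg

print(total_loss(1.0, 2.0, 3.0, 4.0, 6.0, alpha=0.5, beta=0.5))
```

Whatever the exact weighting, the training objective described in the text is to update the VAE parameters so that this total decreases.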

In summary, compared with a conventional VAE, in the information processing method 100 according to the embodiment of the present invention the latent variable is composed of multiple parts, each corresponding to one specific semantic feature. The distribution of the latent variable is not limited to a Gaussian distribution and may be arbitrary. Using metric learning makes it possible to obtain more discriminative facial semantic features.

Corresponding to the embodiment of the information processing method described above, the present invention further provides an embodiment of an information processing apparatus.

FIG. 6 is a block diagram showing an example of the functional configuration of the information processing apparatus 600 according to an embodiment of the present invention.

As shown in FIG. 6, the information processing apparatus 600 according to the embodiment of the present invention includes an image extraction unit 602, a latent variable distribution acquisition unit 604, and a parameter update unit 606. An example of the functional configuration of each of these units is described below.

The image extraction unit 602 may extract, from a training set, three images corresponding to the same semantic feature. The three images include a first image and a second image having the same semantic feature value for that semantic feature, and a third image having a semantic feature value different from that of the first image and the second image.

In a conventional variational autoencoding algorithm, the individual dimensions of the latent variable do not carry specific semantic meaning. In the information processing apparatus 600 according to the embodiment of the present invention, the latent variable is divided into multiple parts, each corresponding to one specific semantic feature, such as face pose, age, or gender.
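The partitioning of the latent variable can be pictured as slicing one flat vector into named segments. The sketch below is illustrative only: the segment sizes and the function name are hypothetical, and the source does not state how many dimensions each semantic feature receives.

```python
def split_latent(z, layout):
    """Split a flat latent vector into named parts; layout maps feature name to size."""
    parts, start = {}, 0
    for name, size in layout.items():
        parts[name] = z[start:start + size]
        start += size
    if start != len(z):
        raise ValueError("layout does not cover the latent vector exactly")
    return parts

# Hypothetical 6-dimensional latent vector shared by three semantic features.
parts = split_latent([1, 2, 3, 4, 5, 6], {"pose": 2, "age": 1, "gender": 3})
print(parts)
```

With such a layout, a distance or supervision term for one semantic feature only needs to read its own segment of z.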

For an example of how the three images corresponding to the same semantic feature are extracted, refer to the corresponding description in the method embodiment above; the description is not repeated here.

The latent variable distribution acquisition unit 604 may acquire, by means of a variational autoencoder (VAE), the distributions of the latent variables corresponding to the semantic feature of the three images.

The parameter update unit 606 may update the parameters of the VAE, for each of the three images, so as to minimize a loss function. The loss function is positively correlated with a first distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the second image, and negatively correlated with a second distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the third image.

Metric learning expresses the distance between different samples by constructing a distance metric. Taking identity information as an example of the semantic feature, metric learning aims for the distance between the latent variable distributions of different images of the same person to become small and converge to 0, and for the distance between the latent variable distributions of images of different people to become large. For examples of metric learning, of the first distance between the latent variable distributions of the first image and the second image, and of the second distance between the latent variable distributions of the first image and the third image, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Preferably, the loss function may further include a constraint that the difference between the second distance and the first distance is greater than a predetermined threshold. For an example, refer to the corresponding description in the method embodiment above; the description is not repeated here.
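A constraint of this kind is commonly realized as a hinge-style triplet margin loss. The sketch below is not taken from the source: it assumes each latent distribution is summarized by its mean vector and that the distance is Euclidean, and the names `distance` and `metric_loss` are hypothetical.

```python
import math

def distance(mean_a, mean_b):
    """Assumed distance between two latent distributions: Euclidean distance of their means."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_a, mean_b)))

def metric_loss(anchor, positive, negative, margin):
    """Hinge loss: grows with the first distance, shrinks with the second,
    and is zero once (second distance - first distance) reaches the margin."""
    d1 = distance(anchor, positive)   # first distance (same feature value)
    d2 = distance(anchor, negative)   # second distance (different feature value)
    return max(0.0, d1 - d2 + margin)

print(metric_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=1.0))  # margin satisfied
print(metric_loss([0.0, 0.0], [0.0, 1.0], [0.0, 2.0], margin=2.0))  # margin violated
```

The loss is positively related to the first distance and negatively related to the second, as the text requires, and stops penalizing a triplet once the threshold is met.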

The training of the VAE has been described above using three images in the training set as an example. To train the VAE, all triplets in the training set may be traversed. Alternatively, a number of iterations may be set in advance, and the training of the VAE may be terminated when that number is reached.
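The two stopping rules (traverse every triplet, or stop at a preset iteration count) can be sketched as follows. The enumeration order, the function names, and the `update_step` callback are hypothetical; the actual parameter update is left abstract.

```python
from itertools import permutations

def triplets(samples):
    """Yield (anchor, positive, negative) ids for one semantic feature.
    samples is a list of (image_id, feature_value) pairs."""
    for (a, va), (p, vp) in permutations(samples, 2):
        if va != vp:
            continue                  # anchor and positive must share the feature value
        for n, vn in samples:
            if vn != va:              # negative must differ in the feature value
                yield a, p, n

def train(samples, update_step, max_iterations=None):
    """Call update_step on each triplet; stop early if max_iterations is reached."""
    count = 0
    for triple in triplets(samples):
        if max_iterations is not None and count >= max_iterations:
            break
        update_step(*triple)
        count += 1
    return count

samples = [("img0", "A"), ("img1", "A"), ("img2", "B")]
print(list(triplets(samples)))
print(train(samples, lambda a, p, n: None, max_iterations=1))
```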

Compared with a conventional VAE, in the information processing apparatus 600 according to the embodiment of the present invention the latent variable is composed of multiple parts, each corresponding to one specific semantic feature. In addition, using metric learning makes it possible to obtain more discriminative facial semantic features.

Preferably, in the information processing apparatus 600 according to the embodiment of the present invention, the loss function may further include a constraint on a supervision error, which is calculated based on the label of the semantic feature and the distribution of the latent variable of the image. Supervision information is added in the process of training the VAE. Accordingly, the loss function in the information processing apparatus 600 may further include a constraint on the supervision error, and the supervision error may be calculated based on the label of the semantic feature and the distribution of the latent variable of the image.

Preferably, the step of calculating the supervision error may include a step of mapping the distribution of the latent variable of the image to a class space using a nonlinear function to obtain a mapping output, and a step of calculating the supervision error based on the mapping output and the label of the semantic feature, using a classification loss function or a regression loss function. For an example of how the supervision error is calculated, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Preferably, the prior distribution of the latent variable of an image may be an arbitrary distribution. In the information processing apparatus 600 according to the embodiment of the present invention, the distribution of the latent variable and the prior distribution of the latent variable are not limited to the Gaussian distribution used in a conventional VAE and may be arbitrary distributions.

Preferably, the step of acquiring the prior distribution of the latent variable of an image includes a step of acquiring, based on the distribution of the latent variable, an intermediate variable that follows a Gaussian distribution, and a step of performing a nonlinear transformation on the intermediate variable to acquire the prior distribution of the latent variable. For an example of how the prior distribution of the latent variable of an image is acquired, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Preferably, in the information processing apparatus 600 according to the embodiment of the present invention, the loss function may further include a constraint on the Kullback-Leibler divergence (KL divergence) between the distribution of the latent variable and the prior distribution of the latent variable, and a constraint on the KL divergence between the distribution of the intermediate variable and the standard Gaussian distribution. For examples of these two KL divergences, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Preferably, in the information processing apparatus 600 according to the embodiment of the present invention, the loss function may further include a constraint on a reconstruction error, which evaluates the difference between an image input to the VAE and the corresponding image output from the VAE. For an example of how the reconstruction error is calculated, refer to the corresponding description in the method embodiment above; the description is not repeated here.

In summary, compared with a conventional VAE, in the information processing apparatus 600 according to the embodiment of the present invention the latent variable is composed of multiple parts, each corresponding to one specific semantic feature. The distribution of the latent variable is not limited to a Gaussian distribution and may be arbitrary. Using metric learning makes it possible to obtain more discriminative facial semantic features.

The functional configuration of the information processing apparatus according to the embodiment of the present invention has been described above, but it is merely illustrative and does not limit the present invention. Those skilled in the art may modify the above embodiments in accordance with the principles of the present invention, for example by adding, deleting, or combining functional modules in the embodiments, and such modifications fall within the scope of the present invention.

Because the apparatus embodiment here corresponds to the method embodiment above, for anything not described in detail in the apparatus embodiment, refer to the corresponding description in the method embodiment; the description is not repeated here.

The machine-executable instructions in the storage medium and the program product of the embodiments of the present invention may execute the information processing method described above. For anything not described in detail here, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Accordingly, the present invention further includes a storage medium on which a program product containing machine-executable instructions is recorded. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disk, a memory card, a memory stick, and the like.

Another aspect of the present invention further provides an information detection method. The information detection method according to an embodiment of the present invention includes a step of inputting each of a plurality of images into a trained variational autoencoder (VAE), acquiring the distribution of the latent variable corresponding to a semantic feature of each image, and acquiring a reconstructed image of each image. For three images among the plurality of images that correspond to the same semantic feature, the three images include a first image and a second image having the same semantic feature value for that semantic feature and a third image having a semantic feature value different from that of the first image and the second image, and the first distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the second image is smaller than the second distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the third image.

As an example, in the information detection method of the embodiment of the present invention, the trained VAE is used to acquire the distribution of the latent variable corresponding to the semantic feature of each input image and to acquire a reconstructed image of each input image. Suppose there are three images x_i^a, x_i^p and x_i^n corresponding to the same semantic feature, where x_i^a and x_i^p have the same semantic feature value and the semantic feature value of x_i^n differs from that of x_i^a and x_i^p. As described with reference to equation (1) for the information processing method according to the embodiment of the present invention, when the VAE is trained, the loss function is positively correlated with the first distance between the latent variable distributions of the first image and the second image and negatively correlated with the second distance between the latent variable distributions of the first image and the third image, and the difference between the second distance and the first distance is greater than a predetermined threshold. Therefore, when images are reconstructed using the trained VAE, for the three images x_i^a, x_i^p and x_i^n, the first distance between the latent variable distribution of x_i^a and that of x_i^p is smaller than the second distance between the latent variable distribution of x_i^a and that of x_i^n.

The information detection method according to the embodiment of the present invention makes it possible to extract discriminative facial semantic features.

Corresponding to the embodiment of the information detection method described above, the present invention further provides the following embodiment of an information detection apparatus. The information detection apparatus according to an embodiment of the present invention includes a reconstructed image acquisition unit that inputs each of a plurality of images into a trained variational autoencoder (VAE), acquires the distribution of the latent variable corresponding to a semantic feature of each image, and acquires a reconstructed image of each image. For three images among the plurality of images that correspond to the same semantic feature, the three images include a first image and a second image having the same semantic feature value for that semantic feature and a third image having a semantic feature value different from that of the first image and the second image, and the first distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the second image is smaller than the second distance between the distribution of the latent variable of the first image and the distribution of the latent variable of the third image.

The information detection apparatus according to the embodiment of the present invention makes it possible to extract discriminative facial semantic features.

The functional configuration of the information detection apparatus according to the embodiment of the present invention has been described above, but it is merely illustrative and does not limit the present invention. Those skilled in the art may modify the above embodiments in accordance with the principles of the present invention, for example by adding, deleting, or combining functional modules in the embodiments, and such modifications fall within the scope of the present invention.

The machine-executable instructions in the storage medium and the program product of the embodiments of the present invention may execute the information detection method described above. For anything not described in detail here, refer to the corresponding description in the method embodiment above; the description is not repeated here.

Another aspect of the present invention further provides a method and an apparatus for reconstructing an input image using a VAE trained by the information processing method described above.

The processes and apparatuses described above may be implemented by software and/or firmware. When they are implemented by software and/or firmware, a program constituting the software for carrying out the above methods may be installed, from a storage medium or a network, on a computer having a dedicated hardware configuration, for example the general-purpose personal computer 700 shown in FIG. 7. When various programs are installed, the computer can execute various functions.

In FIG. 7, a central processing unit (CPU) 701 executes various processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 also stores, as needed, the data required for the CPU 701 to execute the various processes.

The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

An input unit 706 (including a keyboard, a mouse, and the like), an output unit 707 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage unit 708 (including, for example, a hard disk), and a communication unit 709 (including a network interface card such as a LAN card, a modem, and the like) are connected to the input/output interface 705. The communication unit 709 performs communication processing via a network, for example the Internet.

A drive unit 710 may also be connected to the input/output interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive unit 710 as needed, and a computer program read from it is installed in the storage unit 708 as needed.

When the above processes are implemented by software, the program constituting the software is installed from a network, for example the Internet, or from a storage medium, for example the removable medium 711.

These storage media are not limited to the removable medium 711 shown in FIG. 7, which stores the program and is distributed separately from the device to provide the program to the user. The removable medium 711 includes, for example, a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage unit 708, or the like, which stores the program and is provided to the user together with the device that contains it.

以上は図面を参照しながら本発明の好ましい実施例を説明しているが、上記実施例及び例は例示的なものであり、制限的なものではない。当業者は、特許請求の範囲の主旨及び範囲内で本発明に対して各種の修正、改良、均等的なものに変更してもよい。これらの修正、改良又は均等的なものに変更することは本発明の保護範囲に含まれるものである。 While the above describes the preferred embodiments of the present invention with reference to the drawings, the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may make various modifications, improvements and equivalents to the present invention within the spirit and scope of the claims. Modifications to these modifications, improvements or equivalents are included in the protection scope of the present invention.

For example, the functions included in one unit in the above embodiments may be realized by separate devices. Likewise, a plurality of functions realized by a plurality of units in the above embodiments may each be realized by separate devices. Furthermore, one of the above functions may be realized by a plurality of units. Such configurations are within the scope of the present invention.

In addition, the methods of the present invention are not limited to being performed in the temporal order described in this specification; they may also be performed sequentially in another temporal order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
The following supplementary notes are further disclosed with respect to the embodiments, including the examples described above.
(Supplementary Note 1)
An information processing method comprising:
extracting three images corresponding to the same semantic feature from a training set, wherein the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image;
obtaining, by a variational auto-encoder (VAE), a distribution of latent variables corresponding to the semantic feature for each of the three images; and
updating parameters of the VAE so as to minimize a loss function for each of the three images, wherein the loss function has a positive correlation with a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image, and has a negative correlation with a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.
(Supplementary Note 2)
The information processing method according to Supplementary Note 1, wherein the loss function further includes a constraint that the difference between the second distance and the first distance is larger than a predetermined threshold.
(Supplementary Note 3)
The information processing method according to Supplementary Note 2, wherein the loss function further includes a constraint on a teacher error, and
the teacher error is calculated based on a label of the semantic feature and the distribution of the latent variables of an image.
(Supplementary Note 4)
The information processing method according to Supplementary Note 3, wherein calculating the teacher error comprises:
mapping the distribution of the latent variables of an image into a class space using a nonlinear function to obtain a mapping output; and
calculating the teacher error based on the mapping output and the label of the semantic feature, using a classification loss function or a regression loss function.
(Supplementary Note 5)
The information processing method according to Supplementary Note 3, wherein the prior distribution of the latent variables of an image includes an arbitrary distribution.
(Supplementary Note 6)
The information processing method according to Supplementary Note 5, wherein obtaining the prior distribution of the latent variables of an image comprises:
obtaining, based on the distribution of the latent variables, an intermediate variable that follows a Gaussian distribution; and
performing a nonlinear transformation on the intermediate variable to obtain the prior distribution of the latent variables.
(Supplementary Note 7)
The information processing method according to Supplementary Note 6, wherein the loss function further includes a constraint on the Kullback-Leibler divergence (KL divergence) between the distribution of the latent variables and the prior distribution of the latent variables, and a constraint on the KL divergence between the distribution of the intermediate variable and a standard Gaussian distribution.
(Supplementary Note 8)
The information processing method according to Supplementary Note 7, wherein the loss function further includes a constraint on a reconstruction error, and
the reconstruction error is used to evaluate the difference between an image input to the VAE and the corresponding image output from the VAE.
(Supplementary Note 9)
An information processing apparatus comprising:
image extraction means for extracting three images corresponding to the same semantic feature from a training set, wherein the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image;
latent variable distribution obtaining means for obtaining, by a variational auto-encoder (VAE), a distribution of latent variables corresponding to the semantic feature for each of the three images; and
parameter updating means for updating parameters of the VAE so as to minimize a loss function for each of the three images, wherein the loss function has a positive correlation with a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image, and has a negative correlation with a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.
(Supplementary Note 10)
The information processing apparatus according to Supplementary Note 9, wherein the loss function further includes a constraint that the difference between the second distance and the first distance is larger than a predetermined threshold.
(Supplementary Note 11)
The information processing apparatus according to Supplementary Note 10, wherein the loss function further includes a constraint on a teacher error, and
the teacher error is calculated based on a label of the semantic feature and the distribution of the latent variables of an image.
(Supplementary Note 12)
The information processing apparatus according to Supplementary Note 11, wherein calculating the teacher error comprises:
mapping the distribution of the latent variables of an image into a class space using a nonlinear function to obtain a mapping output; and
calculating the teacher error based on the mapping output and the label of the semantic feature, using a classification loss function or a regression loss function.
(Supplementary Note 13)
The information processing apparatus according to Supplementary Note 11, wherein the prior distribution of the latent variables of an image includes an arbitrary distribution.
(Supplementary Note 14)
The information processing apparatus according to Supplementary Note 13, wherein obtaining the prior distribution of the latent variables of an image comprises:
obtaining, based on the distribution of the latent variables, an intermediate variable that follows a Gaussian distribution; and
performing a nonlinear transformation on the intermediate variable to obtain the prior distribution of the latent variables.
(Supplementary Note 15)
The information processing apparatus according to Supplementary Note 14, wherein the loss function further includes a constraint on the Kullback-Leibler divergence (KL divergence) between the distribution of the latent variables and the prior distribution of the latent variables, and a constraint on the KL divergence between the distribution of the intermediate variable and a standard Gaussian distribution.
(Supplementary Note 16)
The information processing apparatus according to Supplementary Note 15, wherein the loss function further includes a constraint on a reconstruction error, and
the reconstruction error is used to evaluate the difference between an image input to the VAE and the corresponding image output from the VAE.
(Supplementary Note 17)
An information detection method comprising:
inputting each of a plurality of images into a trained variational auto-encoder (VAE), obtaining a distribution of latent variables corresponding to a semantic feature of each image, and obtaining a reconstructed image of each image,
wherein, for three images among the plurality of images that correspond to the same semantic feature, the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image, and
a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image is smaller than a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.
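
The distance relationships stated in the supplementary notes can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from this document: each latent distribution is a diagonal Gaussian given as a `(mean, variance)` pair, the distance is a symmetrized KL divergence, and the threshold constraint is written as a hinge term. The names `kl_diag_gauss`, `distance`, `triplet_term`, `satisfies_detection_condition`, and the `margin` parameter are hypothetical.

```python
import numpy as np


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal Gaussian distributions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def distance(a, b):
    """Assumed distance between latent distributions: symmetrized KL.

    a and b are (mean, variance) pairs of NumPy arrays.
    """
    return kl_diag_gauss(*a, *b) + kl_diag_gauss(*b, *a)


def triplet_term(first, second, third, margin=1.0):
    """Hinge-style term (an assumed form of the threshold constraint).

    It grows with the first distance (first vs. second image), shrinks
    with the second distance (first vs. third image), and is zero once
    the second distance exceeds the first by at least `margin`.
    """
    d1 = distance(first, second)
    d2 = distance(first, third)
    return max(0.0, d1 - d2 + margin)


def satisfies_detection_condition(first, second, third):
    """True if the first distance is smaller than the second distance."""
    return bool(distance(first, second) < distance(first, third))


# Toy latent distributions: first and second share a feature value,
# third has a different one (values chosen only for illustration).
first = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
second = (np.array([0.1, 0.0]), np.array([1.0, 1.0]))
third = (np.array([3.0, 0.0]), np.array([1.0, 1.0]))

print(triplet_term(first, second, third))
print(satisfies_detection_condition(first, second, third))
```

In this sketch the hinge term vanishes for a well-separated triplet, so only triplets that violate the margin would contribute to a parameter update. Other distances between distributions could be substituted without changing the structure.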

Claims

An information processing method comprising:
extracting three images corresponding to the same semantic feature from a training set, wherein the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image;
obtaining, by a variational auto-encoder (VAE), a distribution of latent variables corresponding to the semantic feature for each of the three images; and
updating parameters of the VAE so as to minimize a loss function for each of the three images, wherein the loss function has a positive correlation with a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image, and has a negative correlation with a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.

The information processing method according to claim 1, wherein the loss function further includes a constraint that a difference between the second distance and the first distance is larger than a predetermined threshold.

The information processing method according to claim 2, wherein the loss function further includes a constraint on a teacher error, and
the teacher error is calculated based on a label of the semantic feature and the distribution of the latent variables of an image.

The information processing method according to claim 3, wherein calculating the teacher error comprises:
mapping the distribution of the latent variables of an image into a class space using a nonlinear function to obtain a mapping output; and
calculating the teacher error based on the mapping output and the label of the semantic feature, using a classification loss function or a regression loss function.

The information processing method according to claim 3, wherein the prior distribution of the latent variable of the image includes any distribution.

The information processing method according to claim 5, wherein obtaining the prior distribution of the latent variables of an image comprises:
obtaining, based on the distribution of the latent variables, an intermediate variable that follows a Gaussian distribution; and
performing a nonlinear transformation on the intermediate variable to obtain the prior distribution of the latent variables.

The information processing method according to claim 6, wherein the loss function further includes a constraint on the Kullback-Leibler divergence (KL divergence) between the distribution of the latent variables and the prior distribution of the latent variables, and a constraint on the KL divergence between the distribution of the intermediate variable and a standard Gaussian distribution.

The information processing method according to claim 7, wherein the loss function further includes a constraint on a reconstruction error, and
the reconstruction error is used to evaluate the difference between an image input to the VAE and the corresponding image output from the VAE.

An information processing apparatus comprising:
image extraction means for extracting three images corresponding to the same semantic feature from a training set, wherein the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image;
latent variable distribution obtaining means for obtaining, by a variational auto-encoder (VAE), a distribution of latent variables corresponding to the semantic feature for each of the three images; and
parameter updating means for updating parameters of the VAE so as to minimize a loss function for each of the three images, wherein the loss function has a positive correlation with a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image, and has a negative correlation with a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.

An information detection method comprising:
inputting each of a plurality of images into a trained variational auto-encoder (VAE), obtaining a distribution of latent variables corresponding to a semantic feature of each image, and obtaining a reconstructed image of each image,
wherein, for three images among the plurality of images that correspond to the same semantic feature, the three images include a first image and a second image having the same semantic feature value for the semantic feature, and a third image having a semantic feature value different from that of the first image and the second image, and
a first distance between the distribution of latent variables of the first image and the distribution of latent variables of the second image is smaller than a second distance between the distribution of latent variables of the first image and the distribution of latent variables of the third image.