JP2017004350A - Image processing apparatus, image processing method, and program

Publication number: JP2017004350A
Application number: JP2015119147A
Authority: JP
Inventors: Satoshi Hikita (疋田聡)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2015-06-12
Filing date: 2015-06-12
Publication date: 2017-01-05

Abstract

PROBLEM TO BE SOLVED: To support a reduction in the processing time of recognition processing. SOLUTION: An image processing system that recognizes a first region including an object in an image represented by image data, and a category into which the object is classified, has: recognition means for recognizing, using a convolutional neural network, a category into which input image data is classified; and candidate region generation means for generating one or more pieces of candidate region image data showing one or more candidate regions included in the image represented by the image data, on the basis of first output data showing an output result of a prescribed layer of the convolutional neural network in the recognition means. The recognition means recognizes the category into which each of the one or more pieces of candidate region image data generated by the candidate region generation means is classified. SELECTED DRAWING: Figure 2

Description

The present invention relates to an image processing apparatus, an image processing method, and a program.

In devices such as digital cameras and portable information terminals, techniques are known for classifying the category (for example, “person”, “animal”, “car”) to which a subject in a captured image belongs.

Techniques are also known for recognizing, in an image, the area occupied by a subject and the category into which the subject is classified (see, for example, Patent Document 1 and Non-Patent Document 1). Such techniques perform category classification processing on candidate areas, which are candidates for the area occupied by the subject, and thereby recognize the area occupied by the subject and the category into which the subject is classified.

However, with the above conventional techniques, recognizing the area occupied by the subject and the category into which the subject is classified could take a long time. For example, when there are many candidate areas, category classification processing is performed on each candidate area, so the recognition processing may take a long time.

An object of an embodiment of the present invention is to support a reduction in the processing time of recognition processing.

To achieve the above object, an embodiment of the present invention provides an image processing apparatus that recognizes a first area including a target in an image indicated by image data and a category into which the target is classified, the apparatus including: recognition means for recognizing, using a convolutional neural network, a category into which input image data is classified; and candidate area creation means for creating, based on first output data indicating an output result of a predetermined layer of the convolutional neural network in the recognition means, one or more pieces of candidate area image data indicating one or more candidate areas included in the image indicated by the image data, wherein the recognition means recognizes the category into which each of the one or more pieces of candidate area image data created by the candidate area creation means is classified.

According to the embodiment of the present invention, a reduction in the processing time of recognition processing can be supported.

A diagram illustrating an example of the hardware configuration of the image processing apparatus of the present embodiment.
A diagram illustrating an example of the functional configuration of the image processing apparatus of the present embodiment.
A diagram illustrating an example of a flowchart of the recognition processing of the image processing apparatus of the present embodiment.
A diagram illustrating an example of a flowchart of the convolutional neural network processing of the present embodiment.
A diagram illustrating an example of the processing of input image data of the present embodiment.
A diagram illustrating an example of the first-layer convolution processing of the present embodiment.
A diagram illustrating an example of the first-layer network parameters of the present embodiment.
A diagram illustrating an example of the first-layer filters of the present embodiment.
A diagram illustrating an example of the first-layer pooling processing of the present embodiment.
A diagram illustrating an example of the second-layer convolution processing of the present embodiment.
A diagram illustrating an example of the second-layer network parameters of the present embodiment.
A diagram illustrating an example of the second-layer filters of the present embodiment.
A diagram illustrating an example of a flowchart of the candidate area creation processing of the present embodiment.
A diagram illustrating an example of the differentiation processing of the present embodiment.
A diagram illustrating an example of the threshold processing of the present embodiment.
A diagram illustrating an example of the area division of the present embodiment.
A diagram illustrating an example of the minimum rectangle of the present embodiment.
A diagram illustrating an example of a flowchart of the category classification processing of the present embodiment.
A diagram illustrating an example of the third-layer full connection processing of the present embodiment.
A diagram illustrating an example of the third-layer network parameters of the present embodiment.
A diagram illustrating an example of the normalization processing of the present embodiment.

The present embodiment recognizes, in an image indicated by image data, an area including a target that is a subject of the image (for example, a person or an object) and the category into which the target is classified. Here, a category is a type into which targets are classified, such as “person”, “animal”, “car”, “flower”, or “food”.

The following describes the image processing apparatus 10, which executes the above-described recognition (recognition processing) on image data. The image processing apparatus 10 of the present embodiment is, for example, a digital camera, a smartphone, a tablet terminal, a game device, a notebook PC, or a desktop PC.

<Hardware configuration>
First, the hardware configuration of the image processing apparatus 10 of the present embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the hardware configuration of the image processing apparatus of the present embodiment.

The image processing apparatus 10 of the present embodiment includes an input device 11, a display device 12, a CPU (Central Processing Unit) 13, and a ROM (Read Only Memory) 14. The image processing apparatus 10 also includes a RAM (Random Access Memory) 15, an interface device 16, a storage device 17, and an imaging device 18. These pieces of hardware are connected to one another by a bus B.

The input device 11 includes a keyboard, a mouse, a touch panel, various buttons, and the like, and is used to input various signals to the image processing apparatus 10. The display device 12 includes a display and the like, and displays various processing results. In particular, the display device 12 displays the result of the recognition processing of the present embodiment. That is, the display device 12 displays a processing result indicating, in the image indicated by the input image data, the area including a target such as a subject and the category into which the target is classified.

The CPU 13 is an arithmetic device that reads programs and data from, for example, the storage device 17 or the ROM 14 onto the RAM 15 and executes various processes. The ROM 14 is a nonvolatile semiconductor memory that retains data even when the power is turned off. The RAM 15 is a volatile semiconductor memory that temporarily stores programs and data.

The interface device 16 is an interface to external devices. External devices include, for example, recording media such as a CD (Compact Disc), a DVD (Digital Versatile Disc), an SD memory card, and a USB memory (Universal Serial Bus memory). Via the interface device 16, the image processing apparatus 10 can read from a recording medium the image data to be processed by the recognition processing of the present embodiment.

The storage device 17 is a nonvolatile memory, such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), that stores programs and data. The programs and data stored in the storage device 17 include an image processing program 20 that executes the recognition processing of the present embodiment. The storage device 17 may also store the image data to be processed by the recognition processing of the present embodiment.

The imaging device 18 is a camera or the like, and creates the image data to be processed by the recognition processing of the present embodiment.

With the above hardware configuration, the image processing apparatus 10 of the present embodiment can realize the various processes described later.

<Functional configuration>
Next, the functional configuration of the image processing apparatus 10 of the present embodiment will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of the functional configuration of the image processing apparatus of the present embodiment.

The image processing apparatus 10 of the present embodiment includes a CNN processing unit 110, a candidate area creation processing unit 120, a normalization processing unit 130, and an output unit 140. Each of these units is realized by processing that the image processing program 20 installed in the image processing apparatus 10 causes the CPU 13 to execute.

The CNN processing unit 110 performs convolutional neural network (CNN) processing based on the network parameters 1000. In general, where n is any natural number of 3 or more, a convolutional neural network includes first to (n-2)th layers that perform convolution processing and pooling processing, an (n-1)th layer that performs convolution processing, and an nth layer that performs full connection processing.
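The layer structure described above can be sketched as follows. This is only an illustration of the ordering of operations; the function name `layer_plan` and the operation labels are hypothetical and do not appear in the specification.

```python
def layer_plan(n):
    """List the operations performed in each layer of an n-layer network.

    Layers 1 to n-2: convolution and pooling.
    Layer n-1:       convolution only.
    Layer n:         full connection.
    """
    if n < 3:
        raise ValueError("n must be a natural number of 3 or more")
    plan = []
    for layer in range(1, n + 1):
        if layer <= n - 2:
            ops = ["convolution", "pooling"]
        elif layer == n - 1:
            ops = ["convolution"]
        else:
            ops = ["full_connection"]
        plan.append((layer, ops))
    return plan


# Smallest allowed network (n = 3).
for layer, ops in layer_plan(3):
    print(layer, ops)
```

With n = 3, the first layer performs convolution and pooling, the second layer performs convolution only, and the third layer performs full connection processing.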

Here, the network parameters 1000 are data learned in advance for each layer of the convolutional neural network from learning data by a supervised learning method. Backpropagation, for example, may be used as the supervised learning method.

The network parameters 1000 are stored, for example, in the storage device 17 and include bias data 1100 and weight data 1200. Hereinafter, the network parameters 1000 of the nth layer are denoted “network parameters 1000-n”. Accordingly, the bias data 1100 and the weight data 1200 of the nth layer are denoted “bias data 1100-n” and “weight data 1200-n”, respectively. Details of the network parameters 1000 will be described later.

The CNN processing unit 110 performs convolutional neural network processing on image data 510 indicating an input image, and outputs output data 520 indicating the result of the convolution processing in a preset Nth layer.

The present embodiment is described assuming N = 2. When N = 2, the output data 520 can be expressed, for example, as 28 × 28 image data with 64 channels; in other words, as a set of 64 pieces of 28 × 28 image data. The value of N is set in advance by, for example, the designer of the image processing program 20. A value of N of about 2 to 20 is preferable.

The CNN processing unit 110 also performs convolutional neural network processing on candidate area image data 530 created by the candidate area creation processing unit 120 described later, and outputs the result to the normalization processing unit 130.

Furthermore, the CNN processing unit 110 includes a processing unit 111, a convolution processing unit 112, a pooling processing unit 113, and a full connection processing unit 114. The processing unit 111 performs processing on the image data input to the CNN processing unit 110. The convolution processing unit 112 performs convolution processing in each layer of the convolutional neural network. The pooling processing unit 113 performs pooling processing in each layer of the convolutional neural network. The full connection processing unit 114 performs full connection processing.

Here, the CNN processing unit 110 has one full connection processing unit 114 for each category set. A category set is a pair consisting of a category and a category indicating everything other than that category, for example “person” and “non-person”, “car” and “non-car”, or “animal” and “non-animal”. Hereinafter, when the plural full connection processing units 114 are distinguished from one another, they are denoted “full connection processing unit 114-1”, “full connection processing unit 114-2”, and so on.

The candidate area creation processing unit 120 creates one or more pieces of candidate area image data 530 based on the output data 520. The candidate area image data 530 is data indicating a candidate for the area that includes a target in the image indicated by the image data 510. Hereinafter, when plural pieces of candidate area image data 530 are distinguished from one another, they are denoted “candidate area image data 530-1”, “candidate area image data 530-2”, and so on.

The candidate area creation processing unit 120 includes a data determination unit 121, a boundary determination unit 122, a threshold processing unit 123, an area dividing unit 124, and a candidate area creation unit 125.

The data determination unit 121 determines a predetermined number M of pieces of 28 × 28 data from the output data 520, which is expressed, for example, as 64 pieces of 28 × 28 data. The value of M is set in advance by, for example, the designer of the image processing program 20. A value of M of about 3 to 20 is preferable.

The boundary determination unit 122 performs predetermined differentiation processing on each piece of data determined by the data determination unit 121, and determines the boundaries of the areas into which the area dividing unit 124 divides the image.

The threshold processing unit 123 performs threshold processing. Threshold processing deletes data at or below a preset threshold (that is, sets it to “0”). The threshold is set in advance by, for example, the designer of the image processing program 20. A threshold of about 10 to 50 is preferable.
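The threshold processing described above can be sketched as follows, assuming one piece of data is held as a list of rows of numbers. The function name `apply_threshold` and the sample values are hypothetical.

```python
def apply_threshold(channel, threshold):
    """Set every value at or below `threshold` to 0 and keep the rest."""
    return [[v if v > threshold else 0 for v in row] for row in channel]


# A tiny 2 x 2 stand-in for one 28 x 28 piece of data.
channel = [[5, 30],
           [10, 42]]
result = apply_threshold(channel, 10)
print(result)
```

Note that a value exactly equal to the threshold is also set to 0, because the text specifies data "at or below" the threshold.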

The area dividing unit 124 divides the image indicated by the data determined by the data determination unit 121 into a plurality of areas based on the boundaries determined by the boundary determination unit 122.

The candidate area creation unit 125 creates candidate areas based on the plurality of areas produced by the area dividing unit 124, and outputs candidate area image data 530 indicating the created candidate areas.

For example, the candidate area creation unit 125 outputs candidate area image data 530-1 based on one of the plurality of areas produced by the area dividing unit 124. Similarly, the candidate area creation unit 125 outputs candidate area image data 530-2 based on another of those areas.

In this way, the candidate area creation processing unit 120 of the present embodiment creates the candidate area image data 530 based on the output data 520. This allows the present embodiment to reduce the number of candidate areas while preventing a drop in the accuracy of the recognition processing, and therefore to reduce the processing time of the recognition processing.

The normalization processing unit 130 normalizes the processing results of the CNN processing unit 110. This makes it possible to compare the processing results of the individual full connection processing units 114 of the CNN processing unit 110. Hereinafter, the processing result of a full connection processing unit 114 normalized by the normalization processing unit 130 is referred to as the “certainty factor”.

For example, the certainty factor of the full connection processing unit 114 corresponding to the category set “person” and “non-person” is expressed as a pair of a first value indicating the degree to which the image indicated by the image data input to the CNN processing unit 110 is classified into the category “person” and a second value indicating the degree to which it is classified into the category “non-person”.

Similarly, the certainty factor of the full connection processing unit 114 corresponding to the category set “car” and “non-car” is expressed as a pair of a first value indicating the degree to which the image indicated by the image data input to the CNN processing unit 110 is classified into the category “car” and a second value indicating the degree to which it is classified into the category “non-car”.
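As a rough illustration, the certainty factors can be held as one (first value, second value) pair per category set. The normalization method is not specified at this point in the text; the sketch below assumes a two-way softmax purely for illustration, and the names `softmax_pair` and `certainty_factors` as well as the raw output values are hypothetical.

```python
import math


def softmax_pair(a, b):
    """Normalize two raw outputs so that they are positive and sum to 1.

    Assumption: softmax is used here only as a stand-in for the
    normalization processing, which the text describes later.
    """
    m = max(a, b)  # subtract the maximum for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total


def certainty_factors(raw_outputs):
    """Map each category to its normalized (first value, second value) pair."""
    return {cat: softmax_pair(*pair) for cat, pair in raw_outputs.items()}


# Hypothetical raw outputs of two full connection processing units.
raw = {"person": (2.0, 0.5), "car": (-1.0, 1.0)}
for category, (first, second) in certainty_factors(raw).items():
    print(category, round(first, 3), round(second, 3))
```

Because every pair is brought onto the same scale, the first values of different category sets can be compared directly with one another.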

The output unit 140 outputs a recognition result 540. The recognition result 540 includes result image data 541 selected from the candidate area image data 530 and category information 542 indicating the category of the result image data 541. Based on the certainty factors of the candidate area image data 530, the output unit 140 selects the result image data 541 from the candidate area image data 530, determines the category of the result image data 541, and creates the category information 542.

As a result, for the image indicated by the image data 510, an image of the area including the target and the category into which the target is classified are output.

<Details of processing>
Next, details of the recognition processing of the image processing apparatus 10 of the present embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of a flowchart of the recognition processing of the image processing apparatus of the present embodiment.

The image processing apparatus 10 inputs the image data 510 (step S31). The image processing apparatus 10 may input, for example, image data 510 stored in the storage device 17 or image data 510 generated by the imaging device 18. The image processing apparatus 10 may also input, for example, image data 510 downloaded via a network.

Using the CNN processing unit 110, the image processing apparatus 10 performs convolutional neural network processing on the input image data 510 up to the convolution processing of the preset Nth layer (step S32). Details of this convolutional neural network processing will be described later. Here, the description continues on the assumption that the convolutional neural network processing of this step has yielded output data 520 indicating the result of the Nth-layer convolution processing.

As described above, when N = 2, the output data 520 is expressed, for example, as 64 pieces of 28 × 28 data.

Using the candidate area creation processing unit 120, the image processing apparatus 10 inputs the output data 520 and performs candidate area creation processing (step S33). In this processing, the candidate area creation processing unit 120 creates one or more pieces of candidate area image data 530 based on the output data 520. Details of the candidate area creation processing will be described later. Here, the description continues on the assumption that one or more pieces of candidate area image data 530 have been obtained in this step.

Using the CNN processing unit 110 and the normalization processing unit 130, the image processing apparatus 10 inputs one piece of candidate area image data 530 and performs category classification processing that classifies the category of that piece of candidate area image data 530 (step S34). This category classification processing yields the certainty factor of the input piece of candidate area image data 530. Details of the category classification processing will be described later. Here, the description continues on the assumption that the certainty factor of one piece of candidate area image data 530 has been obtained in this step.

The image processing apparatus 10 determines whether the CNN processing unit 110 and the normalization processing unit 130 have obtained the certainty factors of all pieces of candidate area image data 530 (step S35). If there is candidate area image data 530 for which no certainty factor has been obtained (that is, on which category classification processing has not been performed), the processing returns to step S34. In other words, the image processing apparatus 10 obtains the certainty factors of the candidate area image data 530-1, the candidate area image data 530-2, and so on, in order.

On the other hand, when the certainty factors of all pieces of candidate area image data 530 have been obtained, the processing proceeds to step S36.

Using the output unit 140, the image processing apparatus 10 selects result image data 541 from the candidate area image data 530 based on the obtained certainty factors, determines the category of the result image data 541, and creates category information 542 (step S36). That is, the output unit 140 determines the recognition result 540.

The output unit 140 may select all of the candidate area image data 530 as the result image data 541, or may select only some of the candidate area image data 530 as the result image data 541.

Further, when, for example, some of the images indicated by the candidate area image data 530 partially overlap one another, the output unit 140 may select, from among the candidate area image data 530 indicating the overlapping images, the candidate area image data 530 with the highest certainty factor as the result image data 541. More specifically, suppose that a first image indicated by the candidate area image data 530-1, a second image indicated by the candidate area image data 530-2, and a third image indicated by the candidate area image data 530-3 overlap in at least part of their areas. In this case, the first value of the certainty factor of the first image, the first value of the certainty factor of the second image, and the first value of the certainty factor of the third image are compared, and the candidate area image data 530 indicating the image with the highest value is selected as the result image data 541.
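One possible reading of this overlap rule is sketched below. It assumes each candidate is known by a rectangle (left, top, right, bottom) in the coordinates of the input image and by the first value of its certainty factor; the tuple layout, the function names, and the greedy strategy are assumptions for illustration, not something the text specifies.

```python
def overlaps(a, b):
    """True if rectangles (left, top, right, bottom) share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_results(candidates):
    """Keep, among overlapping candidates, the one with the highest first value.

    candidates: list of (name, rectangle, first_value) tuples.
    Greedy approach: visit candidates from highest to lowest first value and
    keep a candidate only if it overlaps nothing that has been kept already.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if not any(overlaps(cand[1], k[1]) for k in kept):
            kept.append(cand)
    return kept


candidates = [
    ("530-1", (0, 0, 10, 10), 0.6),
    ("530-2", (5, 5, 15, 15), 0.9),
    ("530-3", (8, 8, 20, 20), 0.7),
    ("530-4", (30, 30, 40, 40), 0.4),  # overlaps none of the others
]
print([name for name, _, _ in select_results(candidates)])
```

In this example the first three candidates overlap one another, so only 530-2 (the highest first value) survives among them, while the isolated candidate 530-4 is kept as a separate result.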

In step S36, the output unit 140 may determine two or more recognition results 540. That is, the output unit 140 may select two or more pieces of result image data 541 from the candidate area image data 530 and create category information 542 for each of them. Thus, even when, for example, a plurality of targets (for example, a “person” and a “car”) appear in the image indicated by the image data 510, an image of the area including each target and the category into which each target is classified can be determined.

Using the output unit 140, the image processing apparatus 10 outputs the determined recognition result 540 (step S37). At this time, the output unit 140 may output the recognition result 540 to, for example, the display device 12. As a result, for the image indicated by the image data 510, an image of the area including the target and the category into which the target is classified are displayed on the display device 12.
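The overall flow of steps S32 to S36 can be summarized in the following sketch. The four processing stages are passed in as callables so that the control flow is visible without committing to any implementation; the function and parameter names and the dummy stand-ins in the usage example are all hypothetical.

```python
def recognize(image_data, cnn_up_to_layer_n, create_candidates, classify, select):
    """Control flow of the recognition processing (steps S32 to S36)."""
    output_data = cnn_up_to_layer_n(image_data)      # S32: CNN up to layer N
    candidates = create_candidates(output_data)      # S33: candidate areas
    certainties = []
    for candidate in candidates:                     # S34 and S35: loop until
        certainties.append(classify(candidate))      # every candidate is scored
    return select(candidates, certainties)           # S36: pick the result


# Dummy stand-ins that only demonstrate the data flow.
result = recognize(
    "image data 510",
    cnn_up_to_layer_n=lambda img: "output data 520",
    create_candidates=lambda out: ["530-1", "530-2"],
    classify=lambda cand: {"530-1": 0.2, "530-2": 0.8}[cand],
    select=lambda cands, certs: max(zip(certs, cands))[1],
)
print(result)  # S37: output the recognition result
```

The point of the structure is that the expensive classification in step S34 runs once per candidate, so fewer candidates from step S33 directly means fewer classification passes.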

次に、図３のステップＳ３２の畳み込みニューラルネットワーク処理について、図４を参照しながら説明する。図４は、本実施形態の畳み込みニューラルネットワーク処理のフローチャートの一例を示す図である。 Next, the convolutional neural network process in step S32 in FIG. 3 will be described with reference to FIG. FIG. 4 is a diagram illustrating an example of a flowchart of convolutional neural network processing according to the present embodiment.

加工部１１１は、入力された画像データ５１０の加工処理を行う（ステップＳ４１）。この加工処理は、入力された画像データ５１０を、畳み込み処理部１１２が処理可能な形式とするための処理である。 The processing unit 111 performs processing on the input image data 510 (step S41). This processing is processing for converting the input image data 510 into a format that can be processed by the convolution processing unit 112.

Here, the processing will be described with reference to FIG. 5. FIG. 5 is a diagram illustrating an example of the processing of input image data according to the present embodiment. In the following description, it is assumed that the color space of the input image data 510 is the RGB color space (that is, the image data 510 has three color channels). However, the color space of the image data 510 is not limited to the RGB color space, and may be, for example, a CMK color space, an HSV color space, an HLS color space, or the like.

(Step 411) The processing unit 111 reduces the input image data 510 so that it becomes 64 × 64 (pixels). In doing so, the processing unit 111 reduces the image data 510 so that its long side becomes 64 (pixels). Where the short side falls short of 64 (pixels) after the reduction, the processing unit 111 pads the missing portion with the value 0 (that is, each RGB color component is 0) to bring it to 64 (pixels). As the algorithm for reducing the image data 510, the bilinear method may be used, for example.

(Step 412) The processing unit 111 generates image data by subtracting a predetermined value from each pixel value of the 64 × 64 image data obtained in Step 411.

Here, the predetermined value is the average of the pixel values of the image data included in the learning data (hereinafter referred to as "learning image data"). That is, when the average of the pixel values of the learning image data at the pixel position (i, j) is denoted M(i, j), M(i, j) is subtracted from the pixel value at each pixel position (i, j) of the 64 × 64 image data obtained in Step 411 above. Here, i, j = 1, ..., 64.

(Step 413) The processing unit 111 clears to 0 everything other than the central 56 × 56 (pixel) portion of the image data obtained in Step 412. In other words, the outermost 4 pixels on each side of the image data obtained in Step 412 are cleared to 0. In FIG. 5, the shaded portion is the portion cleared to 0.

そして、加工部１１１は、図５のＳｔｅｐ４１３で得られた６４×６４（ピクセル）の画像データ（この画像データを「画像データ５１１」とする。）を畳み込み処理部１１２に出力する。 Then, the processing unit 111 outputs the 64 × 64 (pixel) image data (this image data is referred to as “image data 511”) obtained in Step 413 in FIG. 5 to the convolution processing unit 112.
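Steps 411 to 413 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the bilinear reduction has already made the long side 64 pixels, it places the zero padding at the bottom and right (the text does not state where the padding goes), and the function names and the nested-list layout `img[i][j][c]` are hypothetical.

```python
SIZE = 64     # side length after Step 411
BORDER = 4    # width of the border cleared to 0 in Step 413


def pad_to_square(img):
    """Step 411 (padding part): zero-pad an H x W x 3 image to SIZE x SIZE x 3.

    Assumes the long side is already SIZE pixels; bottom/right padding
    is an assumption made for this sketch.
    """
    out = [[[0.0, 0.0, 0.0] for _ in range(SIZE)] for _ in range(SIZE)]
    for i, row in enumerate(img):
        for j, px in enumerate(row):
            out[i][j] = [float(c) for c in px]
    return out


def subtract_mean(img, mean):
    """Step 412: subtract the per-position learning-data mean M(i, j)."""
    return [[[img[i][j][c] - mean[i][j][c] for c in range(3)]
             for j in range(SIZE)] for i in range(SIZE)]


def clear_border(img):
    """Step 413: clear everything outside the central 56 x 56 block to 0."""
    lo, hi = BORDER, SIZE - BORDER
    return [[img[i][j] if lo <= i < hi and lo <= j < hi else [0.0, 0.0, 0.0]
             for j in range(SIZE)] for i in range(SIZE)]


def preprocess(img, mean):
    """Apply the padding, mean subtraction and border clearing in order."""
    return clear_border(subtract_mean(pad_to_square(img), mean))
```

Note that, because the padding precedes the mean subtraction, padded pixels inside the central block end up holding the negated mean rather than 0.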

ＣＮＮ処理部１１０は、畳み込みニューラルネットワークの層を示す変数ｎを１とする（ステップＳ４２）。 The CNN processing unit 110 sets the variable n indicating the layer of the convolutional neural network to 1 (step S42).

畳み込み処理部１１２は、画像データ５１１を入力して、第１層の畳み込み処理を行う（ステップＳ４３）。 The convolution processing unit 112 receives the image data 511 and performs the first layer convolution processing (step S43).

Here, the first layer convolution processing will be described with reference to FIG. 6. FIG. 6 is a diagram illustrating an example of the first layer convolution processing according to the present embodiment.

(Step 431) The convolution processing unit 112 inputs the image data 511. Here, since the color space of the input image data 511 is the RGB color space, the image data 511 has 64 × 64 × 3 channels.

Ｓｔｅｐ４３２）畳み込み処理部１１２は、重みデータ１２００−１からフィルタを生成し、画像データ５１１の中心の５６×５６（ピクセル）の部分に対して、生成したフィルタを用いてフィルタ処理を行う。ここで、重みデータ１２００−１のデータ構成及び当該重みデータ１２００−１から生成されるフィルタ１３００ｆ_ｊ−１（ｊ＝１，・・・，６４）のデータ構成について説明する。 (Step 432) The convolution processing unit 112 generates a filter from the weight data 1200-1, and performs filter processing on the 56 × 56 (pixel) portion of the center of the image data 511 using the generated filter. Here, the data configuration of the weight data 1200-1 and the data configuration of the filter 1300f _j −1 (j = 1,..., 64) generated from the weight data 1200-1 will be described.

図７（ｂ）は、第１層の重みデータ１２００−１の一例を示す図である。図７（ｂ）に示すように、第１層の重みデータ１２００−１は、７５×６４の行列で表される。なお、重みデータ１２００−１の各値ｗ_１（ｉ，ｊ）は、上述したように、学習データに基づいて予め学習された値である。 FIG. 7B is a diagram illustrating an example of the weight data 1200-1 of the first layer. As shown in FIG. 7B, the first layer weight data 1200-1 is represented by a 75 × 64 matrix. Each value w ₁ (i, j) of the weight data 1200-1 is a value learned in advance based on the learning data as described above.

次に、重みデータ１２００−１から生成されるフィルタ１３００ｆ_ｊ−１（ｊ＝１，・・・，６４）について説明する。図８は、本実施形態の第１層のフィルタの一例を示す図である。 Next, the filter 1300f _j −1 (j = 1,..., 64) generated from the weight data 1200-1 will be described. FIG. 8 is a diagram illustrating an example of the first layer filter of the present embodiment.

図８に示すように、各フィルタ１３００ｆ_ｊ−１（ｊ＝１，・・・，６４）は、５×５の行列の３つの組で表される。換言すれば、各フィルタ１３００ｆ_ｊ−１（ｊ＝１，・・・，６４）は、５×５×３で表される。 As shown in FIG. 8, each filter 1300f _j −1 (j = 1,..., 64) is represented by three sets of 5 × 5 matrices. In other words, each filter 1300f _j −1 (j = 1,..., 64) is represented by 5 × 5 × 3.

Here, the filter 1300f₁-1 is generated from w₁(1, 1) to w₁(25, 1), w₁(26, 1) to w₁(50, 1), and w₁(51, 1) to w₁(75, 1) of the weight data 1200-1. Similarly, the filter 1300f₂-1 is generated from w₁(1, 2) to w₁(25, 2), w₁(26, 2) to w₁(50, 2), and w₁(51, 2) to w₁(75, 2) of the weight data 1200-1. The same applies to j = 3, ..., 64.
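The way one column of the 75 × 64 weight matrix is cut into three 5 × 5 matrices can be sketched as below. The function name `build_filter` and the row-major filling of each 5 × 5 matrix are assumptions made for illustration; the text only states which 25 weights belong to which matrix.

```python
def build_filter(weights, j, channels=3, k=5):
    """Return filter j (1-based) as `channels` matrices of size k x k.

    weights[i - 1][j - 1] plays the role of w(i, j). Filling each k x k
    matrix row by row is an assumption of this sketch.
    """
    column = [weights[i][j - 1] for i in range(channels * k * k)]
    filt = []
    for c in range(channels):
        block = column[c * k * k:(c + 1) * k * k]  # e.g. w(1, j) .. w(25, j)
        filt.append([block[r * k:(r + 1) * k] for r in range(k)])
    return filt
```

Calling the same function with `channels=64` would split a column of a 1600-row weight matrix into 64 matrices of 5 × 5.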

以上のように生成された各フィルタ１３００ｆ_ｊ−１（ｊ＝１，・・・，６４）を用いて、畳み込み処理部１１２は、画像データ５１１に対してフィルタ処理を行う。畳み込み処理部１１２は、例えば以下のようにしてフィルタ処理を行う。 The convolution processing unit 112 performs filter processing on the image data 511 using each filter 1300f _j −1 (j = 1,..., 64) generated as described above. The convolution processing unit 112 performs filter processing as follows, for example.

（１）画像データ５１１の中心５６×５６×３の部分に対してフィルタ１３００ｆ_１−１をかける（すなわち、画像データ５１１とフィルタ１３００ｆ_１−１の対応する値の乗算を行う）。 (1) The filter 1300f ₁ −1 is applied to the center 56 × 56 × 3 portion of the image data 511 (that is, the corresponding values of the image data 511 and the filter 1300f ₁ −1 are multiplied).

This is done, for example, by fixing the R channel and shifting the center of the R channel filter of the filter 1300f₁-1 to the right in steps of 5, starting from the upper left of the 56 × 56 portion of the R channel of the image data 511. When the center of the R channel filter of the filter 1300f₁-1 reaches the right end of the 56 × 56 portion of the R channel of the image data 511, the center of the R channel filter is shifted down by 5 and the same operation is repeated from the left end.

（２）次に、画像データ５１１のＧチャンネルに対しても、上記（１）と同様の方法でフィルタ１３００ｆ_１−１のＧチャンネル用フィルタをかける。画像データ５１１のＢチャンネルに対しても同様である。 (2) Next, the G channel filter of the filter 1300f ₁ −1 is applied to the G channel of the image data 511 in the same manner as in the above (1). The same applies to the B channel of the image data 511.

（３）フィルタ１３００ｆ_２−１〜フィルタ１３００ｆ_６４−１についても、上記と同様に、画像データ５１１のＲＧＢの各チャンネルに対してフィルタ処理を順に行う。 (3) For the filters 1300f ₂ −1 to 1300f ₆₄ −1, the filter processing is sequentially performed on the RGB channels of the image data 511 in the same manner as described above.

以上のフィルタ処理により、画像データ５１１から６４×６４×３×６４チャンネルの画像データが生成される。 Through the above filtering process, 64 × 64 × 3 × 64 channel image data is generated from the image data 511.

Ｓｔｅｐ４３３）畳み込み処理部１１２は、Ｓｔｅｐ４３２で得られた６４×６４×３×６４チャンネルの画像データの各ＲＧＢ成分を加算する。この結果、６４×６４×６４チャンネルの画像データが得られる。 (Step 433) The convolution processing unit 112 adds the RGB components of the image data of 64 × 64 × 3 × 64 channels obtained in Step 432. As a result, 64 × 64 × 64 channel image data is obtained.

Ｓｔｅｐ４３４）畳み込み処理部１１２は、Ｓｔｅｐ４３３で得られた６４×６４×６４チャンネルの画像データの各画素値に対して、バイアスデータ１１００−１を加算する。 (Step 434) The convolution processing unit 112 adds the bias data 1100-1 to each pixel value of the image data of 64 × 64 × 64 channels obtained in Step 433.

ここで、図７（ａ）は、第１層のバイアスデータ１１００−１の一例を示す図である。図７（ａ）に示すように、バイアスデータ１１００−１は、１×６４の行列により表される。そこで、畳み込み処理部１１２は、１つめの６４×６４チャンネルの画像データの各画素値に対してバイアスデータ１１００−１のデータ値ｂ_１（１）を加算する。同様に、２つ目の６４×６４チャンネルの画像データの各画素値に対してバイアスデータ１１００−１のデータ値ｂ_１（２）を加算する。以降、同様に、６４個すべての６４×６４チャンネルの画像データの各画素値に対して、それぞれ、バイアスデータ１１００−１のデータ値を加算する。 FIG. 7A is a diagram illustrating an example of the bias data 1100-1 for the first layer. As shown in FIG. 7A, the bias data 1100-1 is represented by a 1 × 64 matrix. Therefore, the convolution processing unit 112 adds the data value b ₁ (1) of the bias data 1100-1 to each pixel value of the first 64 × 64 channel image data. Similarly, the data value b ₁ (2) of the bias data 1100-1 is added to each pixel value of the second 64 × 64 channel image data. Thereafter, similarly, the data values of the bias data 1100-1 are added to the respective pixel values of all 64 image data of 64 × 64 channels.

Ｓｔｅｐ４３５）畳み込み処理部１１２は、Ｓｔｅｐ４３４で得られた６４×６４×６４チャンネルの画像データに対して、所定の活性化関数を適用して出力画像データを得る。所定の活性化関数としては、例えば、任意の画素値ｘに対して、ｆ（ｘ）＝ｍａｘ（０，ｘ）で定義される関数が挙げられる。 (Step 435) The convolution processing unit 112 obtains output image data by applying a predetermined activation function to the image data of 64 × 64 × 64 channels obtained in Step 434. Examples of the predetermined activation function include a function defined by f (x) = max (0, x) for an arbitrary pixel value x.

Then, after the activation function has been applied to the 64 × 64 × 64 channel image data, the portion that was cleared to 0 in the processing of step S41 is removed, and the central 56 × 56 portion of the image data is output to the pooling processing unit 113. Therefore, in the first layer, the image data that the convolution processing unit 112 outputs to the pooling processing unit 113 has 56 × 56 × 64 channels. The 56 × 56 × 64 channel image data obtained in this way is referred to as "image data 512". Note that the portion cleared to 0 in the processing of step S41 may instead be removed in Step 433 or Step 434.
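Steps 433 to 435 and the removal of the cleared border can be sketched together as follows. The layout `filtered[f][c][i][j]` (response of filter f on input channel c) and the function name are hypothetical; the sketch simply sums over the input channels, adds the bias value of the filter, applies f(x) = max(0, x), and keeps the central block.

```python
def combine_bias_relu(filtered, bias, border=4):
    """Return out[f][i][j] from filtered[f][c][i][j].

    Channel summation (Step 433), bias addition (Step 434), the
    activation max(0, x) (Step 435), and removal of a `border`-pixel
    frame are applied in one pass.
    """
    out = []
    for f, per_channel in enumerate(filtered):
        size = len(per_channel[0])
        plane = []
        for i in range(border, size - border):
            row = []
            for j in range(border, size - border):
                total = sum(ch[i][j] for ch in per_channel) + bias[f]
                row.append(max(0.0, total))
            plane.append(row)
        out.append(plane)
    return out
```

With 64 filters, three 64 × 64 channel responses per filter, and `border=4`, the result has the 56 × 56 × 64 shape described for the image data 512.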

プーリング処理部１１３は、画像データ５１２を入力して、第１層のプーリング処理を行う（ステップＳ４４）。 The pooling processing unit 113 receives the image data 512 and performs the first layer pooling process (step S44).

Here, the first layer pooling processing will be described with reference to FIG. 9. FIG. 9 is a diagram illustrating an example of the first layer pooling processing according to the present embodiment.

Ｓｔｅｐ４４１）プーリング処理部１１３は、５６×５６×６４チャンネルの画像データ５１２を入力する。 (Step 441) The pooling processing unit 113 inputs image data 512 of 56 × 56 × 64 channels.

Ｓｔｅｐ４４２）プーリング処理部１１３は、画像データ５１２の３×３の領域内の最大値を出力する処理を繰り返し行い、２８×２８×６４の画像データ（この画像データを以降「画像データ５１３」とする）を生成する。これは、例えば、以下のようにして行う。 (Step 442) The pooling processing unit 113 repeatedly performs the process of outputting the maximum value in the 3 × 3 area of the image data 512, and the 28 × 28 × 64 image data (this image data is hereinafter referred to as “image data 513”). ) Is generated. This is performed as follows, for example.

(1) For one 56 × 56 image of the image data 512 (the 56 × 56 image data obtained by fixing one channel), the maximum pixel value in the 3 × 3 region centered on the upper-left pixel is obtained. This maximum value is set as the pixel value at the pixel position (1, 1) of the image data 513.

（２）次に、３×３の領域を右に２ずつ移動させながら、それぞれの領域内における画素値の最大値を得て、それぞれ、画像データ５１３の画素位置（１，２）〜（１，２８）の画素値とする。 (2) Next, while moving the 3 × 3 region to the right by two, the maximum pixel value in each region is obtained, and the pixel positions (1, 2) to (1) of the image data 513 are obtained. , 28).

（３）続いて、３×３の領域の中心を下に２移動させ、左端から同様に２ずつ領域の中心を移動させながら、それぞれの領域内における画素値の最大値を得て、それぞれ、画像データ５１３の画素位置（２，１）〜（２，２８）の画素値とする。以降、同様に、（３，１）〜（２８，２８）の画素値を得る。 (3) Subsequently, the center of the 3 × 3 region is moved down by 2 and the center of the region is similarly moved by 2 from the left end to obtain the maximum value of the pixel values in each region, The pixel values at the pixel positions (2, 1) to (2, 28) of the image data 513 are used. Thereafter, similarly, pixel values of (3, 1) to (28, 28) are obtained.

（４）上記の（１）〜（３）を、すべての５６×５６の画像データについて行う。すなわち、上記の（１）〜（３）を、６４個の５６×５６の画像データについて行う。 (4) The above (1) to (3) are performed for all the 56 × 56 image data. That is, the above (1) to (3) are performed on 64 pieces of 56 × 56 image data.
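Procedure (1) to (3) above, applied to a single channel, can be sketched as follows. The function name is hypothetical, and ignoring the parts of a 3 × 3 region that fall outside the image is an assumption; the text does not say how regions centered on edge pixels are handled.

```python
def max_pool(plane, window=3, stride=2):
    """Max pooling on one channel given as a list of rows.

    Regions are centered on rows/columns 0, 2, 4, ...; cells of a region
    lying outside the image are ignored (an assumption of this sketch).
    """
    h, w = len(plane), len(plane[0])
    half = window // 2
    pooled = []
    for ci in range(0, h, stride):
        row = []
        for cj in range(0, w, stride):
            values = [plane[i][j]
                      for i in range(max(0, ci - half), min(h, ci + half + 1))
                      for j in range(max(0, cj - half), min(w, cj + half + 1))]
            row.append(max(values))
        pooled.append(row)
    return pooled
```

Applied to each of the 64 channels of a 56 × 56 input, this yields 28 × 28 outputs, matching the size given for the image data 513.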

Ｓｔｅｐ４４３）プーリング処理部１１３は、画像データ５１３を第２層の畳み込み処理部１１２に出力する。 (Step 443) The pooling processing unit 113 outputs the image data 513 to the convolution processing unit 112 in the second layer.

次に、ＣＮＮ処理部１１０は、畳み込みニューラルネットワークの層を示す変数ｎに１を加算する（ステップＳ４５）。 Next, the CNN processing unit 110 adds 1 to the variable n indicating the layer of the convolutional neural network (step S45).

次に、ＣＮＮ処理部１１０は、変数ｎが、予め設定されたＮと等しいか否かを判定する（ステップＳ４６）。変数ｎがＮと等しい場合、ＣＮＮ処理部１１０は、ステップＳ４７に進む。 Next, the CNN processing unit 110 determines whether or not the variable n is equal to N set in advance (step S46). When the variable n is equal to N, the CNN processing unit 110 proceeds to step S47.

一方、変数ｎがＮと等しくない場合（すなわち、変数ｎがＮより小さい場合）、ＣＮＮ処理部１１０は、ステップＳ４３に戻る。すなわち、この場合、ＣＮＮ処理部１１０は、畳み込みニューラルネットワークの次の層の畳み込み処理及びプーリング処理を行う。 On the other hand, when the variable n is not equal to N (that is, when the variable n is smaller than N), the CNN processing unit 110 returns to step S43. That is, in this case, the CNN processing unit 110 performs a convolution process and a pooling process for the next layer of the convolution neural network.

本実施形態では、Ｎ＝２であるため、ＣＮＮ処理部１１０は、ステップＳ４７に進むものとする。 In this embodiment, since N = 2, the CNN processing unit 110 proceeds to step S47.

畳み込み処理部１１２は、画像データ５１３を入力して、第２層の畳み込み処理を行う（ステップＳ４７）。 The convolution processing unit 112 receives the image data 513 and performs the second layer convolution processing (step S47).

Here, the second layer convolution processing will be described with reference to FIG. 10. FIG. 10 is a diagram illustrating an example of the second layer convolution processing according to the present embodiment. The second layer convolution processing is the same as the first layer convolution processing except that the number of channels of each data differs. More generally, the convolution processing of the n-th layer is the same as that of the other layers except that the number of channels of each data differs.

Ｓｔｅｐ４７１）畳み込み処理部１１２は、画像データ５１３を入力する。ここで、入力した画像データ５１３の色チャンネルは、上述した通り、２８×２８×６４チャンネルである。 (Step 471) The convolution processing unit 112 inputs the image data 513. Here, the color channels of the input image data 513 are 28 × 28 × 64 channels as described above.

Ｓｔｅｐ４７２）畳み込み処理部１１２は、重みデータ１２００−２からフィルタを生成し、画像データ５１３に対して、生成したフィルタを用いてフィルタ処理を行う。ここで、重みデータ１２００−２のデータ構成及び当該重みデータ１２００−２から生成されるフィルタ１３００ｆ_ｊ−２（ｊ＝１，・・・，６４）のデータ構成について説明する。 (Step 472) The convolution processing unit 112 generates a filter from the weight data 1200-2, and performs a filtering process on the image data 513 using the generated filter. Here, the data configuration of the weight data 1200-2 and the data configuration of the filter 1300f _j -2 (j = 1,..., 64) generated from the weight data 1200-2 will be described.

図１１（ｂ）は、第２層の重みデータ１２００−２の一例を示す図である。図１１（ｂ）に示すように、第２層の重みデータ１２００−２は、１６００×６４の行列で表される。なお、重みデータ１２００−２の各値ｗ_２（ｉ，ｊ）は、上述したように、学習データに基づいて予め学習された値である。 FIG. 11B is a diagram illustrating an example of the second layer weight data 1200-2. As shown in FIG. 11B, the second layer weight data 1200-2 is represented by a 1600 × 64 matrix. Each value w ₂ (i, j) of the weight data 1200-2 is a value learned in advance based on the learning data as described above.

次に、重みデータ１２００−２から生成されるフィルタ１３００ｆ_ｊ−２（ｊ＝１，・・・，６４）について説明する。図１２は、本実施形態の第２層のフィルタの一例を示す図である。 Next, the filter 1300f _j -2 (j = 1,..., 64) generated from the weight data 1200-2 will be described. FIG. 12 is a diagram illustrating an example of the second layer filter of the present embodiment.

図１２に示すように、各フィルタ１３００ｆ_ｊ−２（ｊ＝１，・・・，６４）は、５×５の行列の６４個の組で表される。換言すれば、各フィルタ１３００ｆ_ｊ−２（ｊ＝１，・・・，６４）は、５×５×６４で表される。 As shown in FIG. 12, each filter 1300f _j -2 (j = 1,..., 64) is represented by 64 sets of 5 × 5 matrices. In other words, each filter 1300f _j -2 (j = 1,..., 64) is represented by 5 × 5 × 64.

Here, the filter 1300f₁-2 is generated from w₂(1, 1) to w₂(25, 1), ..., and w₂(1576, 1) to w₂(1600, 1) of the weight data 1200-2. Similarly, the filter 1300f₂-2 is generated from w₂(1, 2) to w₂(25, 2), ..., and w₂(1576, 2) to w₂(1600, 2) of the weight data 1200-2. The same applies to j = 3, ..., 64.

以上のように生成された各フィルタ１３００ｆ_ｊ−２（ｊ＝１，・・・，６４）を用いて、畳み込み処理部１１２は、画像データ５１３に対してフィルタ処理を行う。畳み込み処理部１１２は、例えば以下のようにしてフィルタ処理を行う。 The convolution processing unit 112 performs filter processing on the image data 513 using each filter 1300f _j -2 (j = 1,..., 64) generated as described above. The convolution processing unit 112 performs filter processing as follows, for example.

（１）画像データ５１３に対してフィルタ１３００ｆ_１−２をかける（すなわち、画像データ５１３とフィルタ１３００ｆ_１−２の対応する値の乗算を行う）。 (1) The filter 1300f ₁ -2 is applied to the image data 513 (that is, the corresponding values of the image data 513 and the filter 1300f ₁ -2 are multiplied).

これは、例えば、１つのチャンネルを固定し、フィルタ１３００ｆ_１−２の中心を、画像データ５１３の２８×２８の部分の左上から５ずつ右にずらしながら行う。そして、フィルタ１３００ｆ_１−２の中心が画像データ５１３の２８×２８の部分の右端まで辿り着いたら、フィルタ１３００ｆ_１−２の中心を下に５ずらして、再度、左端から行えば良い。 For example, this is performed by fixing one channel and shifting the center of the filter 1300f ₁ -2 to the right by 5 from the upper left of the 28 × 28 portion of the image data 513. Then, when the center of the filter 1300f ₁ -2 reaches the right end of the 28 × 28 portion of the image data 513, the center of the filter 1300f ₁ -2 may be shifted downward by 5 and the processing may be performed again from the left end.

（２）次に、画像データ５１３の他のチャンネルに対しても、上記（１）と同様の方法でフィルタ１３００ｆ_１−２をかける。この処理をすべてのチャンネル１〜６４に対して繰り返す。 (2) Next, filters 1300f ₁ -2 are applied to other channels of the image data 513 in the same manner as in (1) above. This process is repeated for all channels 1 to 64.

(3) For the filters 1300f₂-2 to 1300f₆₄-2 as well, filter processing is performed in order on the 28 × 28 portion of the image data 513 for each of channels 1 to 64, in the same manner as above.

以上のフィルタ処理により、画像データ５１３から２８×２８×６４×６４チャンネルの画像データが生成される。 By the above filter processing, 28 × 28 × 64 × 64 channel image data is generated from the image data 513.

Ｓｔｅｐ４７３）畳み込み処理部１１２は、Ｓｔｅｐ４７２で得られた画像データの２８×２８の部分について、各画素値を１〜６４チャンネルのそれぞれについて加算する。この結果、２８×２８×６４チャンネルの画像データが得られる。 (Step 473) The convolution processing unit 112 adds each pixel value for each of the 1 to 64 channels for the 28 × 28 portion of the image data obtained in Step 472. As a result, 28 × 28 × 64 channel image data is obtained.

Ｓｔｅｐ４７４）畳み込み処理部１１２は、Ｓｔｅｐ４７３で得られた２８×２８×６４チャンネルの画像データの各画素値に対して、バイアスデータ１１００−２を加算する。 (Step 474) The convolution processing unit 112 adds the bias data 1100-2 to each pixel value of the 28 × 28 × 64 channel image data obtained in Step 473.

ここで、図１１（ａ）は、第２層のバイアスデータ１１００−２の一例を示す図である。図１１（ａ）に示すように、バイアスデータ１１００−２は、１×６４の行列により表される。そこで、畳み込み処理部１１２は、１つめの２８×２８チャンネルの画像データの各画素値に対してバイアスデータ１１００−２のデータ値ｂ_２（１）を加算する。同様に、２つ目の２８×２８チャンネルの画像データの各画素値に対してバイアスデータ１１００−２のデータ値ｂ_２（２）を加算する。以降、同様に、６４個すべての２８×２８チャンネルの画像データの各画素値に対して、それぞれ、バイアスデータ１１００−２のデータ値を加算する。 Here, FIG. 11A is a diagram illustrating an example of the bias data 1100-2 for the second layer. As shown in FIG. 11A, the bias data 1100-2 is represented by a 1 × 64 matrix. Therefore, the convolution processing unit 112 adds the data value b ₂ (1) of the bias data 1100-2 to each pixel value of the first 28 × 28 channel image data. Similarly, the data value b ₂ (2) of the bias data 1100-2 is added to each pixel value of the second 28 × 28 channel image data. Thereafter, similarly, the data value of the bias data 1100-2 is added to each of the pixel values of all 64 image data of 28 × 28 channels.

Ｓｔｅｐ４７５）畳み込み処理部１１２は、Ｓｔｅｐ４７４で得られた２８×２８×６４チャンネルの画像データに対して、所定の活性化関数を適用して出力画像データを得る。所定の活性化関数としては、例えば、任意の画素値ｘに対して、ｆ（ｘ）＝ｍａｘ（０，ｘ）で定義される関数が挙げられる。このようにして得られた出力画像データが、出力データ５２０である。このように本実施形態の出力データ５２０は、２８×２８×６４チャンネルの画像データである。 (Step 475) The convolution processing unit 112 applies a predetermined activation function to the 28 × 28 × 64 channel image data obtained in Step 474 to obtain output image data. Examples of the predetermined activation function include a function defined by f (x) = max (0, x) for an arbitrary pixel value x. The output image data obtained in this way is output data 520. As described above, the output data 520 of the present embodiment is 28 × 28 × 64 channel image data.

As the above description shows, the output data 520 can be regarded as a set of 28 × 28 image data (output data), one for each j (j = 1, ..., 64) of the filters 1300f_j-2. That is, the output data 520 includes the 28 × 28 output data 520-1 corresponding to the filter 1300f₁-2, ..., and the 28 × 28 output data 520-64 corresponding to the filter 1300f₆₄-2.

Next, the candidate area creation processing in step S33 of FIG. 3 will be described with reference to FIG. 13. FIG. 13 is a diagram illustrating an example of a flowchart of the candidate area creation processing according to the present embodiment.

候補領域作成処理部１２０のデータ決定部１２１は、出力データ５２０に含まれる出力データ５２０−１，・・・，出力データ５２０−６４のそれぞれについて代表値ａ_１，・・・ａ_６４を決定する（ステップＳ１３１）。 The data determination unit 121 of the candidate area creation processing unit 120 determines the representative values a ₁ ,..., A ₆₄ for the output data 520-1, ..., output data 520-64 included in the output data 520. (Step S131).

Here, each of the representative values a₁, ..., a₆₄ may be the maximum of the data values of the corresponding output data 520-1, ..., 520-64. For example, the maximum data value contained in the output data 520-1 is used as the representative value a₁. The same applies to the other output data 520-2, ..., 520-64. However, the representative values a₁, ..., a₆₄ are not limited to the maximum value; for example, an average value may be used instead.

The data determination unit 121 of the candidate area creation processing unit 120 determines a predetermined number M of data from the output data 520-1, ..., 520-64 based on the representative values a₁, ..., a₆₄ (step S132). Here, the data determination unit 121 may order the representative values a₁, ..., a₆₄ from largest to smallest (that is, in descending order) and determine the output data corresponding to the top M representative values.

以降では、Ｍ＝３として、データ決定部１２１により、出力データ５２０−２、出力データ５２０−４３、及び出力データ５２０−４７が決定されたものとする。 Hereinafter, it is assumed that M = 3 and the data determination unit 121 determines the output data 520-2, the output data 520-43, and the output data 520-47.

なお、Ｍの値を大きくすることで、認識処理の精度を向上させることができるが、処理速度は低下する。一方で、Ｍの値を小さくすることで、認識処理の精度は低下するものの処理速度が向上する。したがって、Ｍは、画像処理プログラム２０の設計者等により、認識対象の画像データ５１０の性質や、認識処理に求められる精度等に応じて適切な値が予め設定される。 Note that by increasing the value of M, the accuracy of the recognition process can be improved, but the processing speed decreases. On the other hand, by reducing the value of M, the processing speed is improved although the accuracy of the recognition process is reduced. Therefore, an appropriate value of M is set in advance by the designer or the like of the image processing program 20 according to the nature of the image data 510 to be recognized, the accuracy required for the recognition process, and the like.
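The selection in steps S131 and S132 can be sketched as follows, using the maximum as the representative value. The function name and the list-of-planes layout `outputs[j - 1]` for the output data 520-j are hypothetical.

```python
def select_top_m(outputs, m):
    """Return the 1-based indices j of the m maps with the largest maxima."""
    # Representative values a_1, ..., a_64: the maximum of each map.
    reps = [max(max(row) for row in plane) for plane in outputs]
    # Order the map indices from largest to smallest representative value.
    order = sorted(range(len(reps)), key=lambda j: reps[j], reverse=True)
    return [j + 1 for j in order[:m]]
```

A larger `m` passes more maps to the later steps, which corresponds to the trade-off between accuracy and processing speed described above.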

The candidate area creation processing unit 120 acquires one of the M pieces of output data 520 determined by the data determination unit 121 (step S133). That is, in this embodiment, the candidate area creation processing unit 120 acquires one of the output data 520-2, the output data 520-43, and the output data 520-47. In the following description, it is assumed that the candidate area creation processing unit 120 has acquired the output data 520-2.

候補領域作成処理部１２０の境界決定部１２２は、取得された出力データ５２０−２について、微分処理を行って、領域分割部１２４により分割される領域の境界を決定する（ステップＳ１３４）。 The boundary determination unit 122 of the candidate region creation processing unit 120 performs differentiation processing on the acquired output data 520-2 to determine the boundary of the region divided by the region division unit 124 (step S134).

Here, the boundaries of the regions determined by the boundary determination unit 122 will be described with reference to FIG. 14. FIG. 14 is a diagram illustrating an example of the differentiation processing according to the present embodiment.

FIG. 14 shows, as an example, a case where the differentiation processing is performed on the output data 520-2. As shown in FIG. 14, the boundary determination unit 122 performs the differentiation processing and detects each portion where the differential value changes from negative to positive as a valley of the output values of the output data 520-2. The boundary determination unit 122 then determines the detected valleys as the boundary D1 and the boundary D2. For the differentiation processing, a Sobel filter may be used, for example.
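The idea of locating a valley where the derivative changes from negative to positive can be illustrated on a one-dimensional profile of output values. This is a simplified sketch: it uses first differences instead of a Sobel filter, it works on a single row or column rather than the 28 × 28 data, and the function name is hypothetical.

```python
def find_valleys(profile):
    """Return indices where the slope of `profile` turns from negative to positive."""
    valleys = []
    prev_sign = 0  # sign of the most recent non-zero difference
    for i in range(1, len(profile)):
        diff = profile[i] - profile[i - 1]
        if diff > 0 and prev_sign < 0:
            valleys.append(i - 1)  # last sample before the values rise again
        if diff != 0:
            prev_sign = 1 if diff > 0 else -1
    return valleys
```

Each returned index would serve as a candidate position for a boundary such as D1 or D2.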

候補領域作成処理部１２０の閾値処理部１２３は、閾値処理を行う（ステップＳ１３５）。すなわち、閾値処理部１２３は、予め設定された閾値（例えば、閾値＝３０）以下のデータを削除する。 The threshold processing unit 123 of the candidate area creation processing unit 120 performs threshold processing (step S135). That is, the threshold processing unit 123 deletes data that is equal to or less than a preset threshold (for example, threshold = 30).

Here, the threshold processing by the threshold processing unit 123 will be described with reference to FIG. 15. FIG. 15 is a diagram illustrating an example of the threshold processing according to the present embodiment. FIG. 15 shows, as an example, a case where the threshold processing is performed on the output data 520-2. As shown in FIG. 15, the threshold processing unit 123 creates the output data 521-2 from the output data 520-2 by performing the threshold processing and deleting the data values equal to or less than the predetermined threshold. In the output data 521-2 shown in FIG. 15, the shaded portion is the portion from which the data values have been deleted.
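The threshold processing of step S135 can be sketched as follows. Representing a deleted data value as `None` is an assumption of this sketch, as is the function name; the text only says that values equal to or less than the threshold are deleted.

```python
def apply_threshold(plane, threshold=30):
    """Replace values equal to or less than `threshold` with None (deleted)."""
    return [[v if v > threshold else None for v in row] for row in plane]
```

For example, `apply_threshold([[10, 30, 31]])` keeps only the value 31, since 30 itself is not greater than the threshold.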

候補領域作成処理部１２０の領域分割部１２４は、境界決定部１２２により決定された境界に基づいて、ステップＳ１３３で取得された一の出力データが示す画像を複数の領域に分割する（ステップＳ１３６）。 The region dividing unit 124 of the candidate region creation processing unit 120 divides the image indicated by the one output data acquired in step S133 into a plurality of regions based on the boundary determined by the boundary determining unit 122 (step S136). .

Here, the regions divided by the region dividing unit 124 will be described with reference to FIG. 16. FIG. 16 is a diagram illustrating an example of the region division according to the present embodiment. FIG. 16 shows an example in which the image indicated by the output data 521-2 is divided based on the boundary D1 and the boundary D2. As shown in FIG. 16, the image indicated by the output data 521-2 is divided into a region S1, a region S2, a region S3, and a region S4 based on the boundary D1 and the boundary D2.

候補領域作成処理部１２０の候補領域作成部１２５は、領域分割部１２４により分割された領域Ｓ１〜Ｓ４について、各領域を含む最小矩形を特定し、当該特定された最小矩形に基づいて候補領域を示す候補領域画像データ５３０を作成する（ステップＳ１３７）。 The candidate area creation unit 125 of the candidate area creation processing unit 120 specifies the minimum rectangle including each area for the areas S1 to S4 divided by the area division unit 124, and selects the candidate area based on the specified minimum rectangle. The candidate area image data 530 shown is created (step S137).

ここで、一例として、領域Ｓ１を囲む最小矩形Ｂ１を図１７に示す。このように最小矩形とは、領域分割部１２４により分割された領域された領域に外接する矩形のことである。したがって、候補領域作成部１２５は、各領域Ｓ１〜Ｓ４について、それぞれ最小矩形を特定する。 Here, as an example, a minimum rectangle B1 surrounding the region S1 is shown in FIG. As described above, the minimum rectangle is a rectangle that circumscribes the area divided by the area dividing unit 124. Therefore, the candidate area creation unit 125 specifies a minimum rectangle for each of the areas S1 to S4.

Then, in the image indicated by the image data 510, the candidate area creation unit 125 takes the area corresponding to the area enclosed by the specified minimum rectangle as a candidate area and creates the candidate area image data 530. In doing so, the candidate area creation unit 125 takes the resolution of the image data 510 into account when determining which area of the image indicated by the image data 510 corresponds to the area enclosed by the minimum rectangle.
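Finding the minimum rectangle of a region and mapping it back to the input image can be sketched as follows. The boolean-mask representation of a region and both function names are hypothetical. The plain proportional scaling in `scale_box` is an assumption: the text only says that the resolution of the image data 510 is taken into account, and the sketch ignores the padding and border clearing applied during the processing of step S41.

```python
def bounding_box(mask):
    """Smallest rectangle (top, left, bottom, right), inclusive, around True cells."""
    cells = [(i, j) for i, row in enumerate(mask) for j, v in enumerate(row) if v]
    if not cells:
        return None  # the region is empty
    rows = [i for i, _ in cells]
    cols = [j for _, j in cells]
    return min(rows), min(cols), max(rows), max(cols)


def scale_box(box, map_size, image_height, image_width):
    """Map a box on a map_size x map_size grid to pixel coordinates.

    Returns (top, left, bottom, right) with bottom/right exclusive.
    Proportional scaling is an assumption of this sketch.
    """
    top, left, bottom, right = box
    sy = image_height / map_size
    sx = image_width / map_size
    return (int(top * sy), int(left * sx),
            int((bottom + 1) * sy), int((right + 1) * sx))
```

The returned pixel rectangle could then be used to crop the corresponding part of the input image as one piece of candidate area image data.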

The candidate area creation processing unit 120 determines whether candidate area image data 530 has been created for all of the output data determined in step S132 (step S138). That is, it determines whether the processing of steps S133 to S138 has been executed for the output data 520-2, the output data 520-43, and the output data 520-47.

If candidate area image data 530 has been created for all of the output data determined in step S132, the candidate area creation processing unit 120 ends the processing. If, on the other hand, some of the output data determined in step S132 does not yet have candidate area image data 530, the candidate area creation processing unit 120 returns to step S133.

As a result, the image processing apparatus 10 of the present embodiment creates candidate area image data 530 representing candidate areas, that is, candidates for the areas of the image represented by the input image data 510 that contain a target. Moreover, the image processing apparatus 10 creates the candidate area image data 530 using the output data 520 of the N-th layer of the convolutional neural network. The image processing apparatus 10 can therefore reduce the number of candidate areas while preventing a loss of recognition accuracy.

Next, the category classification process of step S34 in FIG. 3 will be described with reference to FIG. 18. FIG. 18 is a diagram illustrating an example of a flowchart of the category classification process of the present embodiment.

The CNN processing unit 110 receives one item of candidate area image data 530 from among the one or more items of candidate area image data 530 and performs convolutional neural network processing on it (step S181). That is, the CNN processing unit 110 performs the convolutional neural network processing shown in FIG. 4 on the input candidate area image data 530.

In step S181, the CNN processing unit 110 may perform the convolutional neural network processing up to the preset N-th layer, or up to an L-th layer, where L is an arbitrary natural number greater than N.

The description here assumes that the CNN processing unit 110 performs the convolutional neural network processing up to the N-th layer in step S181. As the result of step S181, the convolution processing unit 112 of the CNN processing unit 110 therefore outputs, to the fully connected processing unit 114, output data 531 of 28 × 28 × 64 channels, which has the same data structure as the output data 520.

Next, the fully connected processing unit 114 of the CNN processing unit 110 receives the output data 531 and performs fully connected processing. As described above, one fully connected processing unit 114 exists for each category set. Each fully connected processing unit 114 therefore receives the output data 531.

For example, when there are three categories, “person”, “animal”, and “car”, there are three fully connected processing units 114: a fully connected processing unit 114-1 corresponding to the category set “person” / “other than person”, a fully connected processing unit 114-2 corresponding to the category set “animal” / “other than animal”, and a fully connected processing unit 114-3 corresponding to the category set “car” / “other than car”.

The fully connected processing will now be described with reference to FIG. 19. FIG. 19 is a diagram illustrating an example of the fully connected processing of the third layer according to the present embodiment.

(Step 1821) The fully connected processing unit 114 receives the output data 531. As described above, the color channels of the received output data 531 are 28 × 28 × 64.

(Step 1822) The fully connected processing unit 114 converts the data values of the output data 531 into a vector. That is, the data values of the 28 × 28 × 64 channel output data 531 are converted into a vector of 50176 rows and 1 column. The components of this vector are denoted x_1, ..., x_{50176}.

(Step 1823) Each fully connected processing unit 114 performs a product-sum operation using the bias data 1100-3 and the weight data 1200-3.

The bias data 1100-3 and the weight data 1200-3 will now be described with reference to FIG. 20. FIG. 20 is a diagram illustrating an example of the network parameters of the third layer according to the present embodiment.

FIG. 20(a) is a diagram illustrating an example of the bias data 1100-3 of the third layer. As shown in FIG. 20(a), the bias data 1100-3 of the third layer includes bias data 1100-3_1, bias data 1100-3_2, ... for the respective categories. The bias data 1100-3_k for each category is a vector of 1 row and 2 columns. As described above, the value b_3(k, j) of each component of the vector is learned in advance from learning data.

Here, k is a numerical value indicating a category. For example, k = 1 indicates the category “person”, k = 2 indicates the category “animal”, and k = 3 indicates the category “car”. Further, j is a numerical value indicating whether or not the data is classified into the category: j = 1 indicates the case where the data is classified into the category, and j = 2 indicates the case where it is not (that is, the case where it is classified into a category other than that category).

FIG. 20(b) is a diagram illustrating an example of the weight data 1200-3 of the third layer. As shown in FIG. 20(b), the weight data 1200-3 of the third layer includes weight data 1200-3_1, weight data 1200-3_2, ... for the respective categories. The weight data 1200-3_k for each category is a matrix of 50176 rows and 2 columns. As described above, the value w_3(i, j, k) of each component of the matrix is learned in advance from learning data.

Returning to the description of FIG. 19, each fully connected processing unit 114 performs the following product-sum operation. That is, for category k, the fully connected processing unit 114-k computes

$$y_j(k) = \sum_{i=1}^{50176} w_3(i, j, k)\, x_i + b_3(k, j) \qquad (j = 1, 2)$$

Here, the meanings of j and k are as described above.

(Step 1824) The fully connected processing unit 114 outputs the 2 × 1 × |k| data obtained in Step 1823 to the normalization processing unit 130, where |k| is the number of categories.

The results of the above product-sum operation are the result for the case where the input candidate area image data 530 is classified into category k (j = 1) and the result for the case where it is classified into a category other than category k (j = 2).
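Steps 1821 to 1824 can be sketched as follows. This is only an illustrative sketch: the random values stand in for the output data 531 and for the learned bias data 1100-3 and weight data 1200-3, and the names `flatten`, `fully_connected`, `params`, and `outputs` are hypothetical. List index 0 corresponds to j = 1 and index 1 to j = 2.

```python
import random

H, W, C = 28, 28, 64      # shape of the output data 531 given in the text
N = H * W * C             # 50176 vector components


def flatten(volume):
    """Step 1822: turn an H x W x C nested list into a flat list of N values."""
    return [v for plane in volume for line in plane for v in line]


def fully_connected(x, weights, bias):
    """Step 1823: product-sum for one category set.
    weights is an N x 2 matrix and bias a 1 x 2 vector; returns [y_1, y_2]."""
    return [
        sum(w_i[j] * x_i for w_i, x_i in zip(weights, x)) + bias[j]
        for j in range(2)
    ]


rng = random.Random(0)

# Placeholder for the output data 531 (random values, not real CNN output).
volume = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
x = flatten(volume)

# One (weights, bias) pair per category set; placeholders for learned values.
categories = ["person", "animal", "car"]
params = {
    k: ([[rng.uniform(-0.01, 0.01) for _ in range(2)] for _ in range(N)], [0.0, 0.0])
    for k in categories
}

# Step 1824: 2 values per category, i.e. 2 x 1 x |k| values in total.
outputs = {k: fully_connected(x, w, b) for k, (w, b) in params.items()}
```

Each entry of `outputs` holds the pair (y_1(k), y_2(k)) for one category set, which is what would be passed on to the normalization processing unit 130.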

This makes it possible to judge numerically whether the candidate area image data 530 is classified into a given category k. For example, if y_1(k) is 0.7 and y_2(k) is 0.3 for a category k, the candidate area image data 530 can be judged likely to be classified into category k. In other words, when y_1(k) is higher than y_2(k) for a category k, the input candidate area image data 530 is likely to be classified into category k.

However, the outputs of the different fully connected processing units 114 may not be directly comparable in this form, so normalization is performed in the next step, S183.

The normalization processing unit 130 receives the 2 × 1 × |k| data output by the fully connected processing units 114 and performs normalization processing (step S183).

The normalization processing will now be described with reference to FIG. 21. FIG. 21 is a diagram illustrating an example of the normalization processing of the present embodiment.

(Step 1831) The normalization processing unit 130 receives the 2 × 1 × |k| data output by the fully connected processing units 114.

(Step 1832) The normalization processing unit 130 normalizes (y_1(k), y_2(k)) for each category by the following formula.

The 2 × 1 × |k| values (z_1(k), z_2(k)) obtained in this way are the certainty factors. Through this normalization, the certainty factors of all categories are normalized to values between 0 and 1 inclusive, which makes it possible to compare the certainty factors of different categories. For example, with k = 1 as the category “person” and k = 2 as the category “animal”, when z_1(1) = 0.8, z_2(1) = 0.2, z_1(2) = 0.6, and z_2(2) = 0.4, the input candidate area image data 530 can be said to be most likely to be classified into the category “person”.

(Step 1833) The normalization processing unit 130 outputs the certainty factor of each category to the output unit 140.
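The normalization formula of Step 1832 is not reproduced in this text. One normalization that is consistent with the description (each value between 0 and 1, and the two values of a category summing to 1 as in the example above) is a softmax over the pair (y_1(k), y_2(k)). The following sketch assumes that choice; the function name `normalize_pair` and the raw values are hypothetical.

```python
import math


def normalize_pair(y1, y2):
    """Assumed normalization: softmax over (y_1(k), y_2(k)).
    Subtracting the maximum first keeps exp() numerically stable."""
    m = max(y1, y2)
    e1, e2 = math.exp(y1 - m), math.exp(y2 - m)
    total = e1 + e2
    return e1 / total, e2 / total


# Hypothetical raw product-sum outputs (y_1(k), y_2(k)) per category.
raw = {"person": (2.0, 0.5), "animal": (0.9, 0.4)}

# Certainty factors (z_1(k), z_2(k)) per category.
confidence = {k: normalize_pair(*y) for k, y in raw.items()}

# After normalization, z_1 can be compared across categories.
best = max(confidence, key=lambda k: confidence[k][0])
```

With these placeholder values, `best` is the category whose z_1 is largest, mirroring the comparison between “person” and “animal” in the example above.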

As described above, the image processing apparatus 10 of the present embodiment creates candidate area image data representing candidates for the areas of the image represented by the input image data that contain a target such as a subject. Moreover, by creating the candidate area image data based on the output result of a preset layer of the convolutional neural network, the image processing apparatus 10 can reduce the number of items of candidate area image data while preventing a loss of recognition accuracy.

The image processing apparatus 10 of the present embodiment can therefore reduce the processing time of the recognition process that identifies, in the image represented by the input image data, the area containing a target and the category into which the target is classified.

The present invention is not limited to the specifically disclosed embodiment described above, and various modifications and changes can be made without departing from the scope of the claims.

DESCRIPTION OF SYMBOLS
10 Image processing apparatus
20 Image processing program
110 CNN processing unit
111 Processing unit
112 Convolution processing unit
113 Pooling processing unit
114 Fully connected processing unit
120 Candidate area creation processing unit
121 Data determination unit
122 Boundary determination unit
123 Threshold processing unit
124 Area division unit
125 Candidate area creation unit
130 Normalization processing unit
140 Output unit

Japanese Patent No. 4322913

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Claims

An image processing apparatus for recognizing a first region including a target in an image indicated by image data and a category into which the target is classified, the apparatus comprising:
recognition means for recognizing, using a convolutional neural network, a category into which input image data is classified; and
candidate area creation means for creating, based on first output data indicating an output result of a predetermined layer of the convolutional neural network in the recognition means, one or more candidate area image data indicating one or more candidate areas included in the image indicated by the image data,
wherein the recognition means recognizes a category into which each of the one or more candidate area image data created by the candidate area creation means is classified.

The first output data includes second output data for each filter specified from network parameters of the predetermined layer of the convolutional neural network,
Determining means for determining a predetermined number of the second output data from the first output data;
The candidate area creating means includes
The image processing apparatus according to claim 1, wherein the one or more candidate area data are created based on the second output data determined by the determination unit.

The determining means includes
The image processing apparatus according to claim 2, wherein the predetermined number of the second output data is determined in ascending order of representative data values of the second output data.

Dividing means for dividing the image indicated by the second output data into one or more second regions;
The candidate area creating means includes
The image processing apparatus according to claim 2, wherein, for each of the one or more second areas divided by the dividing unit, a minimum rectangular area surrounding the second area is set as the candidate area.

The dividing means includes
The image processing apparatus according to claim 4, wherein a boundary of the one or more second regions is detected by differentiation processing and is divided based on the detected boundary.

The dividing means includes
The image processing apparatus according to claim 5, wherein a Sobel filter is used for the differentiation process.

Having a threshold means for deleting data values below a predetermined threshold,
The dividing means includes
The image processing apparatus according to claim 4, wherein an image indicated by the second output data in which a data value equal to or less than a predetermined threshold is deleted by the threshold means is divided into one or more regions.

An image processing method performed by an image processing apparatus for recognizing a first region including a target in an image indicated by image data and a category into which the target is classified, the method comprising:
a recognition procedure of recognizing, using a convolutional neural network, a category into which input image data is classified; and
a candidate area creation procedure of creating, based on first output data indicating an output result of a predetermined layer of the convolutional neural network in the recognition procedure, one or more candidate area image data indicating one or more candidate areas included in the image indicated by the image data,
wherein the recognition procedure recognizes a category into which each of the one or more candidate area image data created by the candidate area creation procedure is classified.

A program for causing an image processing apparatus, which recognizes a first region including a target in an image indicated by image data and a category into which the target is classified, to function as:
recognition means for recognizing, using a convolutional neural network, a category into which input image data is classified; and
candidate area creation means for creating, based on first output data indicating an output result of a predetermined layer of the convolutional neural network in the recognition means, one or more candidate area image data indicating one or more candidate areas included in the image indicated by the image data,
wherein the recognition means recognizes a category into which each of the one or more candidate area image data created by the candidate area creation means is classified.