JP2021131734A - Object detection device, object detection system, and object detection method

Info

Publication number: JP2021131734A
Application number: JP2020026738A
Authority: JP
Inventors: 聡笹谷; So Sasatani; 剛志佐々木; Tsuyoshi Sasaki; 誠也伊藤; Seiya Ito
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2021-09-09
Anticipated expiration: 2040-02-20
Also published as: WO2021166290A1; JP7358269B2

Abstract

[Problem] To support the construction of dictionary data that improves the detection accuracy for subjects within a measurement range. [Solution] An object detection device 1 includes: a storage unit for dictionary data 12D, which holds a set of learning image data separate from the real image data targeted for object detection, together with learning scene data tagged to each item of learning image data; an object detection unit 15 that detects objects in the real image data using the dictionary data 12D; a real scene estimation unit 16 that estimates, from the detection results of the object detection unit 15, the real scene data to be tagged to the real image data; a scene similarity calculation unit 17 that calculates the similarity between the estimated real scene data and the learning scene data of the dictionary data 12D; and an additional learning element output unit 18 that outputs the learning element data needed for additional learning. [Selected drawing] FIG. 1

Description

The present invention relates to an object detection device, an object detection system, and an object detection method.

Analyzing video data acquired by a surveillance camera or similar device with video recognition technology makes it possible to recognize a detection target and the objects around it. A common video recognition technique uses dictionary data created by machine learning, such as a convolutional neural network.
Creating such dictionary data requires collecting learning image data in advance. Explanatory data such as the type, position, and size of the objects shown in each image is then manually attached to the large volume of collected image data as tag data, a task called annotation.

Ideally, a large number of diverse scenes from surveillance cameras installed at many locations would be learned, but realistic man-hours limit the variety of locations and scenes. As a result, depending on the installation environment of the surveillance camera, the object detection accuracy of the constructed dictionary data often degrades. One countermeasure is to update the previously constructed dictionary data through additional learning on images that resemble the on-site camera angle of view and shooting scene.

However, additional learning creates new work such as image annotation, so a method for efficiently constructing highly accurate dictionary data is needed. Patent Document 1 describes a system that applies, to on-site video, dictionary data sets created from image data of the detection target captured from multiple camera directions, and performs additional learning starting from the dictionary data that output the highest likelihood.

Patent Document 1: Japanese Unexamined Patent Publication No. 2016-15045

At an event venue or similar site where multiple surveillance cameras are installed in the same space, image data of the same subject can still differ in how the subject appears, because the installation environment of each camera differs. Likewise, in multiple images taken by the same surveillance camera at different times, the subject can appear differently because its position differs at each time.

Therefore, subject detection accuracy on the image data that actually shows the monitored subject (real image data) improves when the dictionary data used for matching has been learned from a set of image data captured in an environment as close as possible to that of the real image data.
However, conventional techniques have not performed machine learning from the viewpoint of selecting dictionary data adapted to the environment of the real image data. Patent Document 1 merely uses orientation information of the detection target to select the dictionary data on which additional learning is based.

Accordingly, the main object of the present invention is to support the construction of dictionary data for improving the detection accuracy of subjects within a measurement range.

To solve the above problem, the object detection device of the present invention has the following features.
The present invention includes:
a storage unit for dictionary data having a set of learning measurement data and learning scene data tagged to each item of the learning measurement data;
an object detection unit that detects objects, using the dictionary data, in the real measurement data targeted for object detection;
a real scene estimation unit that estimates, from the object detection results of the object detection unit, the real scene data to be tagged to the real measurement data;
a scene similarity calculation unit that calculates the similarity between the estimated real scene data and the learning scene data of the dictionary data; and
an additional learning element output unit that outputs the learning element data needed for additional learning of the dictionary data,
wherein the additional learning element output unit
outputs the learning element data based on the learning scene data used in the similarity calculation when the similarity calculated by the scene similarity calculation unit is higher than a predetermined threshold, and
outputs the learning element data based on the real scene data used in the similarity calculation when the similarity calculated by the scene similarity calculation unit is equal to or less than the predetermined threshold.
Other means are described later.
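
The threshold rule stated above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `LearningElement` and `select_learning_element`, the string labels, and the scene identifiers are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass
class LearningElement:
    """Hypothetical container for the learning element data that is output."""
    basis: str      # "learning_scene" or "real_scene"
    scene_id: str   # identifier of the scene data the element is based on


def select_learning_element(similarity: float, threshold: float,
                            learning_scene_id: str,
                            real_scene_id: str) -> LearningElement:
    """Choose the basis of the learning element data from the similarity.

    Similarity above the threshold: base the output on the learning scene data.
    Similarity at or below the threshold: base it on the real scene data.
    """
    if similarity > threshold:
        return LearningElement("learning_scene", learning_scene_id)
    return LearningElement("real_scene", real_scene_id)
```

Note that a similarity exactly equal to the threshold falls into the second branch, matching the "equal to or less than" wording of the text.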

According to the present invention, it is possible to support the construction of dictionary data for improving the detection accuracy of subjects within a measurement range.

FIG. 1 is a configuration diagram of the object detection device according to Example 1 of the present invention.
FIG. 2 is a configuration diagram of the learning scene acquisition unit according to Example 1 of the present invention.
FIG. 3 is a configuration diagram of the additional learning element output unit according to Example 1 of the present invention.
FIG. 4 shows an example of the learning image data according to Example 1 of the present invention.
FIG. 5 shows the learning scene data attached to the learning image data of FIG. 4 according to Example 1 of the present invention.
FIG. 6 is an explanatory diagram of the annotation data analysis unit according to Example 1 of the present invention.
FIG. 7 shows an example of the real image data according to Example 1 of the present invention.
FIG. 8 shows the real scene data attached to the real image data of FIG. 7 according to Example 1 of the present invention.
FIG. 9 is an explanatory diagram of the real scene estimation unit according to Example 1 of the present invention.
FIG. 10 shows an example of second real image data according to Example 1 of the present invention.
FIG. 11 shows second real scene data according to Example 1 of the present invention.
FIG. 12 is a second explanatory diagram of the real scene estimation unit according to Example 1 of the present invention.
FIG. 13 is a flowchart showing the processing of the learning scene analysis unit according to Example 1 of the present invention.
FIG. 14 is a configuration diagram of the object detection device according to Example 2 of the present invention.

Specific embodiments of the present invention (Examples 1 and 2) are described below with reference to the drawings.

FIG. 1 is a configuration diagram of the object detection device 1.
The object detection device 1 is configured as a computer having a CPU (Central Processing Unit) as an arithmetic unit, memory as a main storage device, and a hard disk as an external storage device.
In this computer, the CPU executes a program loaded into memory (also called an application, or app for short), thereby operating a control unit (control means) made up of the processing units described below.

The camera 2 is a measuring device installed at the measurement site; it outputs the captured real image data (real measurement data) targeted for object detection to the camera information acquisition unit 14. The object detection device 1 may share a housing with the camera 2 or be housed separately. Example 1 describes the case where the measuring device is a monocular camera 2, but the measuring device is not limited to this, and the approach can be applied to other sensors such as a stereo camera or a distance sensor.

The learning image data 3 (see FIG. 4 for details) is a set of learning image data (learning measurement data) that is matched against the real image data for object detection. In other words, image data is classified into real image data, which is the target of object detection, and learning image data, which is the material from which the dictionary data 12D is built by machine learning.
Hereinafter, a single item of image data (a frame) in a set of image data is called a "scene". Scene data is the tag data attached by annotation to the image data of each scene. For example, real scene data is attached to real image data, and learning scene data is attached to learning image data.
The object detection device 1 therefore compares the learning scene data with the real scene data and outputs data that identifies the learning image data needed for additional learning of the dictionary data 12D (hereinafter, "learning element data").

The object detection device 1 includes an annotation unit 11, a dictionary generation unit 12, a storage unit for dictionary data 12D, a learning scene acquisition unit 13, a camera information acquisition unit 14, an object detection unit 15, a real scene estimation unit 16, a scene similarity calculation unit 17, and an additional learning element output unit 18.
The annotation unit 11 tags (annotates) each individual item of learning image data in the learning image data 3 with learning scene data (see FIG. 5 for details).
The dictionary generation unit 12 builds the dictionary data 12D from the combination of the learning scene data generated by the annotation unit 11 and the learning image data 3.
The learning scene acquisition unit 13 selects each scene from the learning image data 3 and acquires the learning scene data for each selected scene from the dictionary data 12D.

FIG. 2 is a configuration diagram of the learning scene acquisition unit 13.
The learning scene acquisition unit 13 includes an annotation data analysis unit 131, a video analysis unit 132, a camera parameter acquisition unit 133, and a learning scene analysis unit 134.
The annotation data analysis unit 131 analyzes the acquired learning scene data to obtain a position distribution map of the detection targets (see FIG. 6 for details).
The video analysis unit 132 analyzes the images in the learning image data 3 and acquires shooting conditions such as image quality as shooting environment information.
The camera parameter acquisition unit 133 acquires the parameters of the camera 2 that captured the images in the learning image data 3.
The learning scene analysis unit 134 analyzes the learning scene based on the position distribution map from the annotation data analysis unit 131, the shooting environment information from the video analysis unit 132, and the camera parameters from the camera parameter acquisition unit 133, and uses the analysis result as the learning scene data (see FIG. 13 for details).

Returning to FIG. 1, an overview of each component of the object detection device 1 follows.
The camera information acquisition unit 14 acquires the real image data captured by the camera 2 and, as real scene data to be attached to that real image data, items such as shooting environment information and camera parameters. The shooting environment information indicates the shooting conditions of the real image data, for example shooting time period information and shooting location information. The real scene data is equivalent in kind to the learning scene data, and its content is not particularly limited.
The object detection unit 15 uses the dictionary data 12D to detect the detection target objects present in the real image data (see FIG. 5 for details).
The real scene estimation unit 16 estimates the real scene data to be attached to the real image data, based on the shooting environment information and camera parameters acquired by the camera information acquisition unit 14 and on the object detection results of the object detection unit 15.
The scene similarity calculation unit 17 calculates the similarity between the two scenes by comparing the learning scene data acquired by the learning scene acquisition unit 13 with the real scene data estimated by the real scene estimation unit 16.

FIG. 3 is a configuration diagram of the additional learning element output unit 18.
The additional learning element output unit 18 uses the similarity calculated by the scene similarity calculation unit 17 to output the learning element data needed for additional learning of the dictionary data 12D. It includes a dictionary fitness acquisition unit 181, an additional learning determination unit 182, and an additional learning element determination unit 183.
The dictionary fitness acquisition unit 181 acquires the fitness of the dictionary data 12D currently in use with respect to the real scene data.
The additional learning determination unit 182 determines whether additional learning is needed according to the fitness obtained by the dictionary fitness acquisition unit 181.
When the additional learning determination unit 182 determines that additional learning is needed, the additional learning element determination unit 183 determines the learning element data required for that learning and outputs it to the user or to another system.

FIG. 4 shows an example of the learning image data 3.
This learning image data 3a shows two detection targets (persons 111 and 121 in this example). The detection frames 112 and 122 are rectangles enclosing the persons 111 and 121, and the start point coordinates 113 and 123 are the coordinates of the start points (upper-left corners) of the detection frames 112 and 122 within the learning image data 3a.

FIG. 5 shows the learning scene data attached to the learning image data of FIG. 4.
For each person 111, 121 in the learning image data of each scene, the annotation unit 11 accepts manual input of the detection frames 112, 122 through a GUI or similar interface. The annotation unit 11 also accepts input of detailed information about each person 111, 121 designated by the detection frames 112, 122. Examples of the detailed information are:
- Class information indicating what the target is (here, "Person"), and a reliability indicating the probability that the target belongs to that class
- The start point coordinates 113, 123 of the detection frame
- The size of the detection frame (width and height)
Each item of input information is attached to the learning image data as learning scene data.
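
For illustration only, one item of learning scene data in the form described for FIG. 5 could be held in a small record type. The following Python sketch is an assumption: the names `DetectionFrame` and `LearningSceneData`, the field names, and all sample values are hypothetical and are not taken from the source figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectionFrame:
    """One annotated detection frame (hypothetical record layout)."""
    class_name: str      # class information, e.g. "Person"
    reliability: float   # probability that the target belongs to the class
    start_x: float       # X coordinate of the upper-left start point
    start_y: float       # Y coordinate of the upper-left start point
    width: float         # width of the detection frame
    height: float        # height of the detection frame


@dataclass
class LearningSceneData:
    """Tag data attached to one item of learning image data (one scene)."""
    image_id: str
    frames: List[DetectionFrame] = field(default_factory=list)


# Sample values are invented for illustration.
scene = LearningSceneData(
    image_id="3a",
    frames=[
        DetectionFrame("Person", 1.0, 20, 40, 60, 200),
        DetectionFrame("Person", 1.0, 300, 60, 70, 210),
    ],
)
```

The same record shape could plausibly hold real scene data, since the text treats the two kinds of scene data as equivalent.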

The dictionary generation unit 12 uses machine learning or a similar technique to generate, from the learning image data 3 and the corresponding tag data (learning scene data), dictionary data 12D capable of detecting a person in an image. In other words, the dictionary generation unit 12 generates as the dictionary data 12D an inference model that takes the learning image data 3 as input data and the learning scene data as the correct labels. The object detection unit 15 can then use this inference model (dictionary data 12D), with real image data as input, to identify the real scene data (the objects to be detected).
In this example, a plurality of events may be generated in advance from the pairs of learning image data 3 and tag data. The algorithm used to generate the dictionary data 12D may be a common one such as a convolutional neural network or AdaBoost, and is not particularly limited.

FIG. 6 is an explanatory diagram of the annotation data analysis unit 131. From the detection frame information of FIG. 5 (start point X, start point Y, width, height), the annotation data analysis unit 131 calculates a position distribution map for each class of detection target, for each item of learning image data. The position distribution map is generated, for example, by the following Procedures 1 to 3.
(Procedure 1) The position distribution 210 of the learning image data 3 is divided into multiple small regions (here, 6 × 6 cells). The detection frame 112 of the person 111 on the left corresponds to region 211, and the detection frame 122 of the person 121 on the right corresponds to region 212.
(Procedure 2) For each cell, the degree of overlap with the detection frame is calculated as a ratio. For example, region 211 overlaps 4 cells vertically by 1 cell horizontally: it overlaps about 90% of the top cell, 100% of the second and third cells from the top, and 40% of the fourth cell from the top.
(Procedure 3) The position distribution map 220 is generated from the overlap ratios of Procedure 2. Here, the cell values of the position distribution map 220 are set so that 100% overlap is "1" and 0% overlap is "0".
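
Procedures 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `position_distribution_map`, the 600 × 600 image size, and the box coordinates in the usage example are assumptions chosen to reproduce the 90% / 100% / 100% / 40% overlaps described for region 211. Summing overlaps from several frames and clamping at 1 is also an assumption, since the text does not say how overlapping frames combine.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (start_x, start_y, width, height)


def position_distribution_map(boxes: List[Box], image_w: float,
                              image_h: float, grid: int = 6) -> List[List[float]]:
    """Return a grid x grid map of overlap ratios between cells and boxes."""
    cell_w = image_w / grid
    cell_h = image_h / grid
    cells = [[0.0] * grid for _ in range(grid)]   # Procedure 1: divide into cells
    for x, y, w, h in boxes:
        for row in range(grid):
            for col in range(grid):
                cx0, cy0 = col * cell_w, row * cell_h
                # Procedure 2: overlap area of this cell with the detection frame
                ox = max(0.0, min(x + w, cx0 + cell_w) - max(x, cx0))
                oy = max(0.0, min(y + h, cy0 + cell_h) - max(y, cy0))
                ratio = (ox * oy) / (cell_w * cell_h)
                # Procedure 3: store the ratio, 1.0 = 100%, 0.0 = 0%
                cells[row][col] = min(1.0, cells[row][col] + ratio)
    return cells


# A frame one cell wide that starts 10 px below the top and is 330 px tall.
example_map = position_distribution_map([(0, 10, 100, 330)], 600, 600)
```

With these assumed values, the first column of `example_map` holds approximately 0.9, 1.0, 1.0, and 0.4 in its top four cells, and every other cell is 0.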

The number of cells is not particularly limited and may be chosen in light of the specifications of the PC that runs the processing. The resolution of the learning image data 3 may also be reduced in advance before dividing it into regions and generating the position distribution map. The values of the position distribution map are not particularly limited as long as they indicate the proportion of each class of object present in a cell. For example, instead of using the overlap with the detection frame, the presence rate of each region may be computed by generating a normal distribution whose value is "1" at the cell containing the center coordinates of the detection frame and "0" at the edges of the rectangle.

The video analysis unit 132 in FIG. 2 extracts shooting environment information by analyzing the video of the learning image data 3. The type of shooting environment information is not particularly limited as long as it affects object detection accuracy and can be obtained by analyzing the video. Examples include installation location information such as outdoors or indoors, obtained with scene recognition technology; shooting time period information such as daytime or nighttime, obtained by analyzing image brightness; and lens blur information estimated by image analysis.
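
As one hedged illustration of deriving a shooting time period from image brightness, the Python sketch below classifies a grayscale frame by its mean pixel value. The function name `classify_time_period`, the 0 to 255 pixel range, and the threshold of 60 are all assumptions; the source does not specify how the brightness analysis is carried out.

```python
from typing import List


def classify_time_period(gray_pixels: List[List[float]],
                         night_threshold: float = 60.0) -> str:
    """Label a grayscale frame "night" or "daytime" from its mean brightness.

    gray_pixels: rows of pixel values, assumed to be in the range 0..255.
    night_threshold: assumed cut-off; a real system would tune this value.
    """
    flat = [p for row in gray_pixels for p in row]
    if not flat:
        raise ValueError("empty image")
    mean_brightness = sum(flat) / len(flat)
    return "night" if mean_brightness < night_threshold else "daytime"
```

A label produced this way could be stored alongside the position distribution map as one item of shooting environment information.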

The camera parameter acquisition unit 133 acquires the camera parameters of the camera that captured the video of the learning image data 3. The parameters include internal parameters such as focal length and lens distortion coefficients, and external parameters such as the camera's depression angle and installation height. Acquiring all of them is preferable, but acquiring only some is acceptable. The camera parameter acquisition unit 133 may also estimate some camera parameters by image analysis, for example by using the image's vanishing point information to estimate the camera's external parameters.

Up to this point, one item of image data has been treated as one scene. However, when the dictionary data 12D contains a large amount of learning image data, the computational cost of comparing real scene data with learning scene data in the scene similarity calculation unit 17 grows with the number of items of learning scene data. The learning scene analysis unit 134 may therefore group multiple learning scenes into a single learning scene to reduce the number of comparisons (see FIG. 13 for details).

The learning scene data has been described above with reference to FIGS. 4 to 6. The real scene data is described below with reference to FIGS. 7 to 12.
Learning image data and real image data differ in how their scene data is obtained: for learning image data, the learning scene data, which serves as the correct answer data, is supplied by an external user through the annotation unit 11, whereas the real scene data of real image data must be obtained by automatic analysis.

FIG. 7 shows an example of real image data. This real image data 2a shows one detection target (person 131 in this example). The detection frame 132 is a rectangle enclosing the person 131.
FIG. 8 shows the real scene data attached to the real image data of FIG. 7. The real scene estimation unit 16 uses the detection frame information for the detection frame 132 of the person 131, detected in the real image data 2a by the object detection unit 15, to obtain information representing the real scene data of the site where the camera 2 is installed.

FIG. 9 is an explanatory diagram of the real scene estimation unit 16.
The process by which the real scene estimation unit 16 obtains the position distribution map 240 that forms part of the real scene data is similar to the process by which the annotation data analysis unit 131 obtains the position distribution map 220 shown in FIG. 6. Specifically, the real scene estimation unit 16 creates a position distribution map 230 containing region 231, which corresponds to the detection frame 132. It then weights (multiplies) each cell value in region 231 of the position distribution map 230 by the reliability "0.8" of FIG. 8, and acquires the resulting reliability-weighted position distribution map 240, containing region 241, as part of the real scene data.
The real scene estimation unit 16 also calculates reliability-weighted position distribution maps for multiple captured images in the camera video by the same process used to generate the map 240, and outputs them together with the shooting environment information and camera parameters acquired by the camera information acquisition unit 14.
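
The reliability weighting described for FIG. 9 amounts to multiplying every cell of a position distribution map by the detection reliability. A minimal Python sketch follows; the function name `weight_by_reliability` and the small 2 × 2 sample map are hypothetical, and only the multiplication by "0.8" comes from the text.

```python
from typing import List


def weight_by_reliability(position_map: List[List[float]],
                          reliability: float) -> List[List[float]]:
    """Multiply each cell value of a position distribution map by a reliability."""
    return [[value * reliability for value in row] for row in position_map]


# Invented 2 x 2 map for illustration; a map in the text has 6 x 6 cells.
sample_map = [[1.0, 0.0],
              [0.5, 0.0]]
weighted_map = weight_by_reliability(sample_map, 0.8)
```

Cells outside the detected region stay at 0, so the weighting only scales the cells that the detection frame overlaps.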

The real image data and real scene data of FIGS. 7 to 9 have been described as an example of data with high scene similarity to the learning image data and learning scene data of FIGS. 4 to 6. That is, the detection frame 132 in the real image data of FIG. 7 substantially matches the detection frame 112 in the learning image data of FIG. 4, and the shooting environment information of the two is also similar, so it is easy to determine whether the person 131 in the real image data and the person 111 in the learning image data are the same person.

In contrast, the real image data and real scene data of FIGS. 10 to 12 are an example of data with low scene similarity.
Although the images are taken by the same camera 2, the person 131 on the left in the real image data 2a of FIG. 7 has moved and appears as the person 141 at the center rear in the real image data 2b of FIG. 10. Consequently, in the real scene data of FIG. 11, which corresponds to the detection frame 142 of the person 141 in FIG. 10, the detection frame 142 is small and its reliability is "0.7", lower than the "0.8" of FIG. 8.
The real scene estimation unit 16 also creates the position distribution map 250 of FIG. 12, containing region 251 corresponding to the detection frame 142 in the detection frame information of FIG. 11, and then creates the reliability-weighted position distribution map 260, containing region 261, by weighting (multiplying) the position distribution map 250 by the reliability "0.7".

Although the actual scene data can be grasped more accurately as more position distribution maps are acquired, a process that reduces the number of acquired position distribution maps may be added to lower the processing cost. Examples of such processes are given below.
- A method that adopts only position distribution maps with high reliability.
- A method of grouping as in the learning scene analysis unit 134. For example, the sum of differences between position distribution maps is calculated; maps whose difference is equal to or less than a predetermined threshold are treated as similar scenes and classified into groups; then, in each group, the position distribution map whose difference takes an intermediate value is used as the representative position distribution map of the actual scene.
- Another method of grouping as in the learning scene analysis unit 134, in which the grouping takes the shooting environment information into account (members whose shooting environment information is similar to one another are bundled together) and only the representative position distribution map of each group is output.
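The first reduction method, keeping only highly reliable maps, can be sketched as below. The function name `select_reliable_maps`, the pairing of each map with a single reliability value, and the optional `max_count` cap are assumptions introduced for illustration; the text above does not specify how the cutoff is chosen.

```python
def select_reliable_maps(candidates, min_reliability, max_count=None):
    """Keep position distribution maps whose reliability is high enough.

    candidates: list of (position_map, reliability) pairs (hypothetical layout).
    """
    kept = [c for c in candidates if c[1] >= min_reliability]
    # Most reliable first, so an optional cap drops the least reliable maps.
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept if max_count is None else kept[:max_count]


# Placeholder strings stand in for real maps in this example.
candidates = [("map_a", 0.8), ("map_b", 0.4), ("map_c", 0.7)]
selected = select_reliable_maps(candidates, min_reliability=0.6)
```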

The scene similarity calculation unit 17 calculates the similarity between scenes from the position distribution maps, shooting conditions, and camera parameter information of the learning scene and the actual scene, using methods such as the following.
- A method that judges the similarity to be high when the sum of differences between the position distribution maps is small.
- A method that judges the similarity to be high when the shooting environment information and camera parameters are close.
- A method that takes the sum of multiple similarities as the final similarity.

Note that the position distribution map 240 of FIG. 9 and the position distribution map 220 of FIG. 6 have similar detection results in the corresponding lower-left part of the image (region 241 and region 221), so the scene similarity is calculated as high.
By contrast, for the position distribution map 260 of FIG. 12, the detection result in the upper center of the image (region 261) does not exist in the position distribution map 220 of FIG. 6 (the cell values of the position distribution map 220 corresponding to the region 261 are "0"), so the scene similarity is calculated as low.
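The comparison of the position distribution maps 220, 240, and 260 can be illustrated with a small numeric sketch. The 4x4 grids and their cell values below are invented stand-ins for the figures, and the mapping from the sum of differences to a similarity score, 1 / (1 + difference sum), is an assumption; the description only states that a smaller sum of differences means a higher similarity.

```python
def diff_sum(map_a, map_b):
    """Sum of absolute cell-wise differences between two equally sized maps."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    )


def scene_similarity(map_a, map_b):
    # Smaller difference sum -> higher similarity, mapped into (0, 1].
    return 1.0 / (1.0 + diff_sum(map_a, map_b))


# Illustrative stand-ins: a learning map with a lower-left region, an actual
# map with a matching region weighted by 0.8, and an actual map whose only
# region sits in the upper center, weighted by 0.7.
learning_220 = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
actual_240 = [[0, 0, 0, 0], [0, 0, 0, 0], [0.8, 0.8, 0, 0], [0.8, 0.8, 0, 0]]
actual_260 = [[0, 0.7, 0.7, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

high = scene_similarity(learning_220, actual_240)
low = scene_similarity(learning_220, actual_260)
```

In this example `high` exceeds `low` because the map standing in for 260 differs from the learning map both where its own region lies and where the learning region lies.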

Note that the "similarity" described for the scene similarity calculation unit 17 is a per-scene index obtained by comparing one piece of actual image data (its actual scene data) with one piece of learning image data (its learning scene data).
By contrast, the "fitness" described below for the additional learning element output unit 18 is an index calculated between one piece of actual image data and N pieces of learning image data (that is, the dictionary data 12D learned from them).
The fitness is used to determine whether the current dictionary data 12D needs additional learning in terms of the accuracy of detecting a person or the like from the actual image data. In other words, if the current dictionary data 12D already detects a person or the like from given actual image data with high accuracy, additional learning is unnecessary.
The similarity, on the other hand, is used after additional learning has been determined to be necessary, to identify what learning element data should be presented to the user so that the user can add learning image data.

The dictionary fitness acquisition unit 181 of FIG. 3 acquires the fitness, to the actual scene, of the dictionary data 12D constructed by the dictionary generation unit 12. One way to acquire the fitness is to prepare actual image data from the camera 2 to which actual scene data has been given (annotated) in advance, and to adopt an object detection accuracy such as the ratio of the number of objects detected with the dictionary data 12D to the total number of detection targets in the actual image data.

The additional learning determination unit 182 determines that additional learning is necessary when the fitness of the dictionary data 12D to the actual scene, acquired from the dictionary fitness acquisition unit 181, is lower than a predetermined threshold. Alternatively, the dictionary fitness acquisition unit 181 may be omitted, and the user may instead judge whether additional learning is necessary by visually checking detection results, such as the detection frames, for the images captured by the camera 2.
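The fitness and the threshold test can be written down as a short sketch. The function names and the example counts are hypothetical; the ratio of detected objects to annotated detection targets is only one of the accuracy measures the description allows.

```python
def fitness(detected_count, target_count):
    """Ratio of objects detected with the dictionary data to annotated targets."""
    if target_count <= 0:
        raise ValueError("target_count must be positive")
    return detected_count / target_count


def needs_additional_learning(fitness_value, threshold):
    # Additional learning is judged necessary only when the fitness is
    # strictly lower than the threshold.
    return fitness_value < threshold


# Example: 7 of 10 annotated persons were detected; the threshold is 0.8.
current_fitness = fitness(7, 10)
retrain = needs_additional_learning(current_fitness, threshold=0.8)
```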

The additional learning element determination unit 183 determines learning element data such as the following, based on the similarity information obtained by the scene similarity calculation unit 17.
- Image type information indicating the type of learning image data required for additional learning.
- Learning method information, such as how to annotate the learning scene data.
- Dictionary type information indicating information on the construction of the dictionary data 12D.

To that end, the additional learning element determination unit 183 determines the learning element data from the similarity information. The methods below determine the learning element data based on the learning scene data used to calculate the similarity, and apply when the similarity information of the position distribution maps is higher than the threshold.
- A method that presents to the user the learning image data of the (relevant) group used to calculate the similarity information, and takes the similar images that the user collects manually as the learning element data.
- A method that refers to the position distribution map of the learning scene data corresponding to the learning image data of the relevant group, and automatically creates learning images by replacing the person region in an image with a person region from a learning image of another group.

On the other hand, the methods below determine the learning element data based on the actual scene data used to calculate the similarity, and apply when the similarity information of the position distribution maps is equal to or less than the threshold.
- A method that shows the user the position distribution map and detection frame information of the actual scene data, and has the user manually collect similar images.
- A method that generates, for images, position distribution maps of the objects detected with the dictionary data 12D, searches for images whose maps have a high similarity to the position distribution map of the actual scene data, and, if such an image is found, instructs the user to annotate it.
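The branch between the two families of methods can be summarized in a few lines. The function name and the returned labels are hypothetical; only the comparison itself (higher than the threshold versus equal to or less than it) comes from the description.

```python
def learning_element_basis(similarity, threshold):
    """Return which scene data the learning element data is based on."""
    if similarity > threshold:
        # Similar learning scenes exist: base the elements on the learning
        # scene data used to calculate the similarity.
        return "learning_scene_data"
    # No sufficiently similar learning scene: base the elements on the
    # actual scene data instead.
    return "actual_scene_data"


basis = learning_element_basis(similarity=0.55, threshold=0.4)
```

Note that a similarity exactly equal to the threshold falls into the actual-scene branch.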

When the dictionary generation unit 12 has generated multiple sets of dictionary data 12D from different learning image data 3, another option is to search for the learning image data 3 with the highest similarity to the position distribution map of the actual scene and to instruct the user to use the dictionary data 12D corresponding to that learning image data 3.
Further, when the learning image data 3 with the highest similarity to the position distribution map of the actual scene has been found but its similarity is lower than the threshold, a method of presenting the corresponding dictionary data 12D and the required learning images to the user may be adopted.
The additional learning element determination unit 183 may also determine the learning element data using any of a position distribution map of a different kind of object, a position distribution map of targets that the object detection unit 15 failed to detect, and a position distribution map of targets that the object detection unit 15 detected erroneously.

FIG. 13 is a flowchart showing the processing of the learning scene analysis unit 134. As described below, the learning scene analysis unit 134 analyzes the learning scene data using the output information from the annotation data analysis unit 131, the video analysis unit 132, and the camera parameter acquisition unit 133.
The flowchart of FIG. 13 uses two counter variables: a variable n for each piece of learning image data in the set of learning image data 3, and a variable m for each group generated as a result of the grouping.

First, as initialization, the learning scene analysis unit 134 extracts one piece of learning image data (n = 1) from the set of learning image data 3 and creates a new group (m = 1) containing it (S101). The group (m = 1) is associated not only with the learning image data (n = 1) but also with its learning scene data (information such as the position distribution map).

In the outer loop of S102 to S121, the learning scene analysis unit 134 selects the learning image data (n = 2, ..., N) in order, acquires the learning scene data (position distribution map, shooting environment information, camera parameters) corresponding to the selected learning image data (S103), and then executes the inner loop.
In the inner loop of S111 to S117, the learning scene analysis unit 134 selects the already created groups (m = 1, ..., M) in order.
The learning scene analysis unit 134 calculates the sum of differences with respect to the position distribution map in the selected group m (S112) and determines whether that sum is less than a threshold (S113). If Yes in S113, the process proceeds to S114; if No, it returns to S102.

When the learning scene analysis unit 134 has finished searching all the groups (m = 1, ..., M) (S114, Yes), the currently selected learning image data n matches none of the existing groups m, so the unit creates a new group (m = M + 1) and assigns the currently selected learning image data n to it (S115). If unsearched groups remain (S114, No), the unit assigns the currently selected learning image data n to the existing group m (S116).
By executing the outer loop of S102 to S121 in this way, each piece of learning image data in the learning image data 3 is assigned to a group of learning scene data.
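The grouping of S101 to S121 can be sketched as a greedy loop. Because the step-by-step branching above can be read in more than one way, this sketch implements one reading of its evident intent: an image joins the first existing group whose position distribution map differs from its own by less than the threshold, and otherwise starts a new group. Using the first member of each group as that group's reference map, and the names `diff_sum` and `group_by_position_map`, are assumptions.

```python
def diff_sum(map_a, map_b):
    """Sum of absolute cell-wise differences between two equally sized maps."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    )


def group_by_position_map(maps, threshold):
    """Return groups as lists of indices into `maps`."""
    groups = []
    for n, pos_map in enumerate(maps):          # outer loop over images n
        for group in groups:                    # inner loop over groups m
            reference = maps[group[0]]          # assumed reference map
            if diff_sum(pos_map, reference) < threshold:
                group.append(n)                 # join existing group m
                break
        else:
            groups.append([n])                  # no match: new group M + 1
    return groups


# Four tiny 1x2 maps: two with a region on the left, two on the right.
maps = [[[1, 0]], [[0.9, 0]], [[0, 1]], [[0, 0.8]]]
groups = group_by_position_map(maps, threshold=0.5)
```

The first image always creates the first group, which corresponds to the initialization in S101.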

Within each group (m = 1, ..., M) of the created learning scenes, the learning scene analysis unit 134 may further classify the learning images in the group according to the shooting environment information and the camera parameters (S122). The classification method is not particularly limited; examples include classifying into indoor and outdoor, and classifying by shooting time period. When camera parameters are used for classification, one example is to classify by camera depression angle into four ranges, such as 0 to 10 degrees, 10 to 45 degrees, 45 to 80 degrees, and 80 to 90 degrees.
The learning scene analysis unit 134 may also fix a maximum number of groups in advance when classifying the learning scenes, or may cluster the learning images by the K-means method using the difference information of the position distribution maps between all learning images; the method is not particularly limited.
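The depression-angle classification mentioned above can be sketched with the standard `bisect` module. The four ranges come from the description, but how the shared boundary values (10, 45, and 80 degrees) are assigned is not stated; this sketch assumes each boundary belongs to the upper range.

```python
import bisect

# Upper edges of the first three ranges; the fourth range ends at 90 degrees.
ANGLE_EDGES = [10, 45, 80]


def depression_angle_class(angle_deg):
    """Map a camera depression angle in degrees to a class index 0..3."""
    if not 0 <= angle_deg <= 90:
        raise ValueError("depression angle must be within 0 to 90 degrees")
    # bisect_right places a boundary value into the upper range (assumption).
    return bisect.bisect_right(ANGLE_EDGES, angle_deg)


classes = [depression_angle_class(a) for a in (5, 30, 60, 85)]
```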

According to the first embodiment described above, in the object detection device 1 that detects objects within a measurement range, the scene similarity calculation unit 17 calculates the similarity between the actual scene data and the learning scene data, which makes it possible to output learning element data for generating dictionary data 12D that recognizes objects in the actual image data with high accuracy.
Although the first embodiment uses position distribution maps to compare the position information of detection targets between the learning scene and the actual scene, any method that can compare the positional relationship of detection targets in the images may be used.

The first embodiment describes the case where the detection target is limited to persons, but the detection target is not limited to this. The detection targets may belong to multiple classes, and the additional learning elements may be determined by comparing the position distribution maps of the classes. Alternatively, the position distribution maps of the classes may be combined and the degree of overlap between classes calculated, which allows a more detailed analysis of the scene data and a more efficient search for images with a high similarity to the actual scene.

Further, although the first embodiment uses only the position distribution map of the detection target when comparing the learning scene with the actual scene, position distribution maps of objects other than the detection target may be created and used for the comparison. For example, even when only persons are the detection target, creating position distribution maps of furniture, obstacles, and the like around a person and comparing them with the learning scene enables a more detailed comparison between the learning scene and the actual scene.
Also, although the first embodiment uses the information of all detection frames when creating the position distribution map, a process that reduces this information may be added. For example, detection frames at or below a certain size may be ignored so that the number of groups does not grow excessively when the learning scenes are classified.
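Ignoring small detection frames before building a position distribution map can be sketched as a simple filter. Measuring size by frame area, the `(x0, y0, x1, y1)` layout, and the name `drop_small_frames` are assumptions; the description only says that frames at or below a certain size may be ignored.

```python
def drop_small_frames(frames, min_area):
    """Keep only detection frames whose area is larger than min_area.

    frames: list of (x0, y0, x1, y1) tuples (hypothetical layout).
    """
    return [
        (x0, y0, x1, y1)
        for x0, y0, x1, y1 in frames
        if (x1 - x0) * (y1 - y0) > min_area
    ]


frames = [(0, 0, 2, 2), (0, 0, 10, 10), (5, 5, 6, 30)]
kept = drop_small_frames(frames, min_area=4)
```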

FIG. 14 is a configuration diagram of the object detection device 10 of the second embodiment.
Compared with the object detection device 1 of FIG. 1, the object detection device 10 omits the additional learning element output unit 18 and adds a test data generation unit 19A and a dictionary data reconstruction unit 19B.

The test data generation unit 19A generates test data from a set of actual image data of the camera 2 output by the camera information acquisition unit 14. In the test data, each piece of actual image data is accompanied by actual scene data for verifying the detection accuracy of the dictionary data 12D (hereinafter, "test scene data"), given by the annotation unit 11. The method of selecting the set of actual image data is not particularly limited; one example is to select images so that the variance of the position distribution maps becomes large.

The dictionary data reconstruction unit 19B determines the content of the reconstruction (modification) of the dictionary data 12D from two inputs: the similarity between the test scene data and the learning scene data, calculated by the scene similarity calculation unit 17, and the object detection accuracy of the dictionary data 12D on the test data, calculated by the dictionary fitness acquisition unit 181.
One way to reconstruct the dictionary data 12D is to present modifications such as swapping the groups (m = 1, ..., M in FIG. 13) used for the dictionary data 12D, and to reconstruct the dictionary data 12D accordingly. To that end, when multiple groups of learning scenes exist according to the position distribution maps and the generated dictionary data 12D was not generated using all of the groups, the dictionary data reconstruction unit 19B searches for a group with a high similarity to the actual scene data (that is, a group expected to improve the detection accuracy).

As a process for modifying the learning scene data (tag data) of the dictionary data 12D, the dictionary data reconstruction unit 19B can reduce man-hours by having only some of the detection targets annotated through the annotation unit 11. To that end, the dictionary data reconstruction unit 19B decides which learning scene data to annotate based on, for example, customer requirements or the ease of visual confirmation.
In this way, the dictionary data reconstruction unit 19B repeatedly updates the dictionary data 12D, by swapping the composition of the dictionary data 12D or modifying its learning scene data, so that a group with a high similarity to the actual scene data can be found. This makes it possible to construct dictionary data 12D with high detection accuracy efficiently while keeping annotation man-hours low.

According to the second embodiment described above, presenting the user with ways to modify the learning scenes so that the object detection accuracy of the dictionary data 12D improves makes it possible to construct dictionary data 12D with high detection accuracy efficiently while keeping annotation work to a minimum.
In the second embodiment, when the detection accuracy of the dictionary data 12D on the test data is obtained, a position distribution map of the detection targets that went undetected may be generated, and the dictionary data reconstruction unit 19B may prompt the user to add annotations to the group of learning scenes whose position distribution map has a high similarity to that undetected position distribution map. This approach allows highly accurate dictionary data 12D to be generated efficiently.

Alternatively, a position distribution map of the detection target class in which false detections occurred in the test data may be generated, and the dictionary data reconstruction unit 19B may present a modification such as swapping the current group with a group of learning scenes whose position distribution map has a high similarity to the false-detection position distribution map.
Further, in the second embodiment, the detection results of the latest dictionary data 12D may be used when modifying the tag data. For example, objects may be detected in the relevant learning image with the latest dictionary data 12D, and the learning image with detection frames may be output to a GUI screen or the like and used as auxiliary information when the user performs annotation work.

Also, in the second embodiment, instead of giving tag data to all of the learning images in the learning image data 3, only some of the learning images may be annotated, with the dictionary data reconstruction unit 19B gradually increasing the number of learning images. Specifically, one approach is to run object detection with the latest dictionary data 12D on learning images that have not been annotated, create position distribution maps from the results, and then prompt the user to add the learning images whose maps are close to the position distribution map of the actual scene. Repeating this approach allows highly accurate dictionary data 12D to be generated efficiently while limiting the number of learning images that require annotation work.

Further, the second embodiment may be configured so that dictionary data 12D can be generated efficiently according to customer requirements. For example, based on information such as the target accuracy and the deadline by which the final dictionary data 12D is needed, the number of repetitions of the reconstruction process of the dictionary data 12D described in the second embodiment is increased while the number of pieces of learning image data 3 is kept to a minimum. Thus, by changing the combination of learning scene groups that the dictionary data reconstruction unit 19B uses to generate the dictionary data 12D, dictionary data 12D that achieves the target accuracy can be constructed efficiently.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above are described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described.
Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of one embodiment.
For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted. Each of the configurations, functions, processing units, processing means, and the like described above may be realized partly or wholly in hardware, for example by designing them as an integrated circuit.
Each of the configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing a program that realizes each function.

Information such as programs, tables, and files that realize each function can be stored in memory, in a recording device such as a hard disk or an SSD (Solid State Drive), or on a recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).
The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. In practice, almost all of the configurations may be considered to be interconnected.
Further, the communication means connecting the devices is not limited to a wireless LAN and may be changed to a wired LAN or another communication means.

1 Object detection device
2 Camera
3 Learning image data (learning measurement data)
10 Object detection device
11 Annotation unit
12 Dictionary generation unit
12D Dictionary data (storage unit)
13 Learning scene acquisition unit
14 Camera information acquisition unit
15 Object detection unit
16 Actual scene estimation unit
17 Scene similarity calculation unit
18 Additional learning element output unit
19A Test data generation unit
19B Dictionary data reconstruction unit
131 Annotation data analysis unit
132 Video analysis unit
133 Camera parameter acquisition unit
134 Learning scene analysis unit
181 Dictionary fitness acquisition unit
182 Additional learning determination unit
183 Additional learning element determination unit

Claims

An object detection device comprising:
a storage unit that stores dictionary data having a set of learning measurement data and learning scene data tagged to each item of the learning measurement data;
an object detection unit that detects an object, using the dictionary data, from actual measurement data that is the target of object detection;
an actual scene estimation unit that estimates, from the object detection result of the object detection unit, actual scene data to be tagged to the actual measurement data;
a scene similarity calculation unit that calculates the similarity between the estimated actual scene data and the learning scene data of the dictionary data; and
an additional learning element output unit that outputs learning element data required for additional learning of the dictionary data,
wherein the additional learning element output unit
outputs the learning element data based on the learning scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is higher than a predetermined threshold value, and
outputs the learning element data based on the actual scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is equal to or less than the predetermined threshold value.
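
As an illustration only, the threshold rule of the additional learning element output unit can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the device: the names `LearningElement` and `select_learning_element`, the dictionary representation of scene data, and the numeric range of the similarity are all hypothetical and do not appear in the original text.

```python
from dataclasses import dataclass


@dataclass
class LearningElement:
    """Hypothetical container for the learning element data that is output."""
    source: str        # "learning_scene" or "actual_scene"
    scene: dict        # the scene data the element is based on
    similarity: float  # similarity that led to this choice


def select_learning_element(similarity, learning_scene, actual_scene, threshold):
    """Choose which scene data the learning element data is based on.

    Similarity strictly higher than the threshold: use the learning scene data.
    Similarity equal to or less than the threshold: use the actual scene data.
    """
    if similarity > threshold:
        return LearningElement("learning_scene", learning_scene, similarity)
    return LearningElement("actual_scene", actual_scene, similarity)


if __name__ == "__main__":
    element = select_learning_element(0.3, {"id": "train"}, {"id": "field"}, 0.5)
    print(element.source)
```

Note that a similarity exactly equal to the threshold falls into the second branch, matching the "equal to or less than" wording of the claim.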

The object detection device according to claim 1, wherein the learning scene data and the actual scene data each have a position distribution map within the image data, extracted from position information of the detection target in the image data and detection frame information indicating its size, and
the scene similarity calculation unit calculates the similarity between the position distribution maps.

The object detection device according to claim 2, wherein the additional learning element output unit determines the learning element data by using any of the position distribution map of a different object, the position distribution map of an object that the object detection unit failed to detect, and the position distribution map of an object that the object detection unit erroneously detected.

The object detection device according to claim 2, further comprising a learning scene acquisition unit that groups, within the set of the learning measurement data of the dictionary data, the learning measurement data whose position distribution maps are similar to one another, and causes the scene similarity calculation unit to calculate the similarity using the learning scene data of each group generated by the grouping.
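
The grouping of learning measurement data by similar position distribution maps could, for example, be approximated by a simple greedy procedure, sketched below. This is an assumption for illustration: the claim does not name a grouping algorithm, and the flattened-vector map representation, the cosine measure, the threshold of 0.8, and the function names are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened maps (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(maps, threshold=0.8):
    """Greedily group maps: each map joins the first group whose first member
    is at least `threshold` similar to it; otherwise it starts a new group.

    Returns a list of groups, each a list of indices into `maps`.
    """
    groups = []
    for index, candidate in enumerate(maps):
        for group in groups:
            representative = maps[group[0]]
            if cosine(representative, candidate) >= threshold:
                group.append(index)
                break
        else:
            # No existing group is similar enough, so open a new one.
            groups.append([index])
    return groups


if __name__ == "__main__":
    maps = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    print(group_by_similarity(maps))
```

The result of a greedy pass depends on the input order; a clustering method such as k-means or hierarchical clustering would be an alternative if order independence matters.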

The object detection device according to claim 1, wherein the additional learning element output unit determines whether the dictionary data used by the object detection unit needs to be modified based on a fitness, which is the accuracy with which the object detection unit detects objects.

The object detection device according to claim 1, further comprising a dictionary data reconstruction unit that determines the content of the reconstruction of the dictionary data used by the object detection unit based on a fitness, which is the accuracy with which the object detection unit detects objects.

The object detection device according to claim 1, wherein the scene similarity calculation unit calculates the similarity between the actual scene data and the learning scene data by using, as the scene data, at least one of shooting time zone information, shooting location information, and camera parameters.

The object detection device according to claim 1, wherein the additional learning element output unit outputs, as the learning element data, at least one of image type information indicating the type of the learning measurement data required for additional learning, learning method information of the learning scene data, and dictionary type information related to the construction of the dictionary data.

An object detection system comprising an object detection device and a measuring device,
wherein the measuring device is any one of a monocular camera, a stereo camera, and a distance sensor, and
the object detection device includes:
a storage unit that stores dictionary data having a set of learning measurement data and learning scene data tagged to each item of the learning measurement data;
an object detection unit that detects an object, using the dictionary data, from actual measurement data measured by the measuring device as the target of object detection;
an actual scene estimation unit that estimates, from the object detection result of the object detection unit, actual scene data to be tagged to the actual measurement data;
a scene similarity calculation unit that calculates the similarity between the estimated actual scene data and the learning scene data of the dictionary data; and
an additional learning element output unit that outputs learning element data required for additional learning of the dictionary data,
wherein the additional learning element output unit
outputs the learning element data based on the learning scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is higher than a predetermined threshold value, and
outputs the learning element data based on the actual scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is equal to or less than the predetermined threshold value.

An object detection method executed by an object detection device having a storage unit that stores dictionary data having a set of learning measurement data and learning scene data tagged to each item of the learning measurement data, an object detection unit, an actual scene estimation unit, a scene similarity calculation unit, and an additional learning element output unit, wherein:
the object detection unit detects an object, using the dictionary data, from actual measurement data that is the target of object detection;
the actual scene estimation unit estimates, from the object detection result of the object detection unit, actual scene data to be tagged to the actual measurement data;
the scene similarity calculation unit calculates the similarity between the estimated actual scene data and the learning scene data of the dictionary data; and
the additional learning element output unit, when outputting learning element data required for additional learning of the dictionary data,
outputs the learning element data based on the learning scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is higher than a predetermined threshold value, and
outputs the learning element data based on the actual scene data used for the calculation of the similarity when the similarity calculated by the scene similarity calculation unit is equal to or less than the predetermined threshold value.