JP7630935B2

JP7630935B2 - Information processing device, information processing method, and program

Info

Publication number: JP7630935B2
Application number: JP2020123119A
Authority: JP
Inventors: 俊太舘; 修平小川; 裕輔御手洗
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2025-02-18
Anticipated expiration: 2040-07-17
Also published as: JP2022019339A; JP2025063335A

Description

The present invention relates to a technology for tracking a subject.

Technologies for tracking a specific subject within an image include methods that use brightness or color information and template matching. In recent years, techniques using Deep Neural Networks (hereinafter abbreviated as DNN) have attracted attention as highly accurate tracking technologies.

Non-Patent Document 1 describes one method for tracking a specific subject in an image. An image showing the tracking target and an image serving as the search range are each input to Convolutional Neural Networks (hereinafter abbreviated as CNN) that share the same weights. The position of the tracking target within the search-range image is identified by calculating the cross-correlation between the features obtained from the CNNs. While this type of tracking method can accurately identify the position of the tracking target, it is prone to mistakenly tracking the wrong target when an object similar to the tracking target overlaps it on the screen.
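The cross-correlation step described for Non-Patent Document 1 can be illustrated with a minimal sketch. It assumes single-channel feature maps given as nested Python lists (real Siamese trackers correlate multi-channel CNN features), and the names `cross_correlate` and `argmax_2d` are hypothetical helpers, not taken from the cited work.

```python
def cross_correlate(search, template):
    """Slide `template` over `search`; return the response map (valid positions only)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            # Inner product between the template and the window at (y, x).
            score = sum(
                search[y + dy][x + dx] * template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2-D list."""
    best = max((v, y, x) for y, row in enumerate(grid) for x, v in enumerate(row))
    return best[1], best[2]
```

When the search image contains two objects with identical features, the response map has two peaks of equal height, which is the ambiguity that leads to tracking the wrong target.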

To avoid this, there are methods, typified by that of Patent Document 1, that create a histogram from the color features and depth information of the detected object's region and examine changes in the histogram to determine whether the object is occluded.

Patent Document 1: U.S. Patent Application No. 10185877 (B2)

Non-Patent Document 1: Bertinetto et al., Fully-Convolutional Siamese Networks for Object Tracking, arXiv 2016

However, the method of Patent Document 1 has the problem that, when objects with similar postures or similar external characteristics overlap on the screen, the histograms of features such as color and texture hardly differ, so the occlusion cannot be determined. For example, in team sports, multiple people within a small area often have the same clothing and posture, and tracking can fail by treating different people as the same person. The present invention has been made in view of this problem, and aims to continue tracking stably even when objects with similar external characteristics or postures come close to each other.

An information processing device according to the present invention that solves the above problem is an information processing device that detects objects from an image, and includes: feature extraction means for extracting image features of the objects detected from the image; estimation means for estimating, for each object detected from the image, occlusion information indicating an occlusion relationship with other objects detected from the image, based on the output of a trained model that outputs a likelihood indicating that an object detected from the image is occluded by another object; and identification means for identifying, for each object detected from the image, a correspondence with an object detected in an image captured at a time different from that of the image, based on at least the image features and the occlusion information.

According to the present invention, tracking can be continued stably even when objects with similar external characteristics or postures come close to each other.

FIG. 1 is a schematic diagram illustrating an example of object detection.
FIG. 2 is a diagram showing an example of the hardware configuration of the information processing device.
FIG. 3 is a block diagram showing an example of the functional configuration of the information processing device.
FIG. 4 is a flowchart showing a processing procedure executed by the information processing device.
FIG. 5 is a diagram showing an example of a result of processing by the information processing device.
FIG. 6 is a diagram showing an example of a result of processing by the information processing device.
FIG. 7 is a flowchart showing a processing procedure executed by the information processing device.
FIG. 8 is a diagram showing examples of derived forms of the information regarding occlusion.
FIG. 9 is a block diagram showing an example of the functional configuration of the information processing device.
FIG. 10 is a diagram showing an example of a result of processing by the information processing device.
FIG. 11 is a flowchart showing a processing procedure executed by the information processing device.
FIG. 12 is a diagram showing an example of a learning process of the information processing device.
FIG. 13 is a diagram showing an example of a result of processing by the information processing device.
FIG. 14 is a diagram showing an example of a result of processing by the information processing device.
FIG. 15 is a diagram showing an example of a result of processing by the information processing device.
FIG. 16 is a diagram showing an example of a learning process of the information processing device.

<Embodiment 1>
An information processing device according to the embodiment will be described with reference to the drawings. Elements with the same reference numerals across the drawings operate in the same way, and repeated descriptions of them are omitted. The components described in this embodiment are merely examples and are not intended to limit the scope of the present invention to them.

This embodiment describes a function for detecting and tracking people in a video or in continuously captured still image frames. The scope of application does not limit the category of object to be detected and tracked, but Embodiment 1 limits the target to people. In this embodiment, people are detected in each temporally consecutive image, and tracking is achieved by associating, between consecutive images, which person in one image is the same as which person in the other. In particular, this embodiment assumes shooting of a sporting event or the like, where people's clothing, direction of movement, and so on are similar and people frequently approach and cross each other. In such cases, simply associating people whose positions in each image or whose external characteristics, such as clothing color, are close is likely to produce incorrect associations. Such a failure is referred to here as mismatching.

This embodiment focuses on information regarding occlusion that is output by a trained model that has learned the patterns of occlusion relationships arising when objects overlap as seen from the photographer. By using the occlusion relationship output by the trained model together with other cues when identifying correspondences between objects, failures in which a person in the foreground is associated with a person in the background are suppressed, and tracking accuracy is improved.

FIG. 1 illustrates this schematically. Images 2100, 2120, and 2140 in FIG. 1(A) are three still frames from a video, shot from above, of two playing cards with the same design crossing each other on a table (arranged in chronological order from left to right). Observing images 2100, 2120, and 2140 alone is not enough to determine how each card moved. FIG. 1(B), in contrast, is an example of the same scene captured at a higher frame rate than (A), that is, a group of images captured at shorter time intervals. By observing the images in FIG. 1(B) in chronological order, one can associate each card with its position in the next image. Overall, it can be estimated relatively easily that the card 2201 on the left passed over the card 2202 on the right and moved to the right. As shown in images 2210 and 2230, by observing the appearance that arises transiently at the moment the objects cross, it becomes possible to determine which card passes in front and which is behind in image 2220.

This determination does not necessarily require a special sensor, such as one producing a 2.5-dimensional depth image, or information that is costly to generate, such as optical flow. It can be solved as a pattern recognition problem relating occlusion relationships to appearance features, that is, what kind of appearance tends to arise when objects overlap in front of and behind each other. The same holds for scenes such as the crossing of people shown in FIG. 1(C) and FIG. 1(D). In FIG. 1(C) and FIG. 1(D), the people's appearance, such as clothing and posture, and their direction of movement are assumed to be the same. Even in such a case, by observing the appearance before and after the objects cross, it is determined that person 2401 is in front and person 2402 is behind in image 2420 of FIG. 1(D). As long as this occlusion relationship is maintained in subsequent images, once person 2401 occludes person 2402, person 2401 in the foreground is tracked, and the occlusion relationship and image features of person 2402 in the background are retained. Then, when the occlusion is resolved, it is possible to detect person 2402, who was in the background, again while continuing to track person 2401. The above is an overview of the principle of this embodiment. Detailed processing will be described later.

FIG. 2 is a hardware configuration diagram of the information processing device 1 that tracks a tracking target by image recognition in this embodiment. The CPU H101 controls the entire device by executing a control program stored in the ROM H102. The RAM H103 temporarily stores various data from each component. Programs are also loaded into it so that the CPU H101 can execute them. The storage unit H104 stores the data to be processed in this embodiment, including the data to be tracked. An HDD, flash memory, various optical media, and the like can be used as the medium of the storage unit H104. The input unit H105 consists of a keyboard, touch panel, dial, or the like, accepts input from the user, and is used, for example, when setting a tracking target. The display unit H106 consists of a liquid crystal display or the like and displays the subject and tracking results to the user. The device can also communicate with other devices, such as an imaging device, via the communication unit H107.

FIG. 3 is a block diagram showing an example of the functional configuration of the information processing device. In FIG. 3, the processes executed by the CPU H101 are shown as functional blocks. The information processing device 1 has an image acquisition unit 201, an object detection unit 202, an occlusion information generation unit 203, an extraction unit 204, and a matching unit 205, and is connected to an external storage unit 206. The storage unit 206 may instead be inside the information processing device 1. Each function is briefly described below. The image acquisition unit 201 acquires images of a video or of continuous still images in which a specific object (a person in this embodiment) is captured by an imaging device. The object detection unit 202 detects, from the images acquired by the image acquisition unit 201, image features indicating a predetermined object set in advance.

For example, the region of a person in an image is detected based on a trained model that has learned in advance, from images of people in various postures, image features indicating a human body (such as the head or torso). The occlusion information generation unit 203 estimates, for each object detected from an image, occlusion information indicating its occlusion relationship with the other objects detected from the image, based on a trained model that has learned image features indicating the occlusion relationship between an occluding object and an occluded object. The occlusion information is a likelihood (occluded/occluding score) representing the possibility that the object of interest is occluded by another object. For example, if an object is likely to be occluded by another object, its occluded/occluding score is brought closer to 1. If an object is likely to be occluding another object, its occluded/occluding score is brought closer to 0. An occluded/occluding score indicating such an occlusion relationship is estimated using the trained model. The extraction unit 204 stores the occlusion information in the storage unit 206 for each object detected in an image. The matching unit 205 associates objects detected across multiple images. That is, based on the occlusion information, it identifies, for each object detected in one image, the correspondence with an object detected in an image captured at a different time. Objects can be tracked by correctly associating the objects detected in images captured at different times. In addition, by assuming that the occlusion relationship between objects is maintained for a certain period, objects can be associated with each other using the occlusion relationship. The storage unit 206 stores the occlusion score of each detected object. Details of the processing of each functional unit are described using the flowchart in FIG. 4.

FIG. 4 is a flowchart showing the flow of processing in this embodiment. In the following description, each process (step) is denoted with a leading S, and the word "step" is omitted. The information processing device does not necessarily have to perform all of the steps described in this flowchart. The processing shown in the flowchart of FIG. 4 is executed by the CPU H101 of FIG. 2, which is a computer, in accordance with a computer program stored in the storage unit H104.

In S301, the information processing device 1 starts a loop that is repeated for each video frame. In S302, the image acquisition unit 201 sequentially acquires image frames of a video or of continuous still images capturing people. The subsequent processing, S301 to S311, is performed on each image in turn. The image acquisition unit 201 may acquire images captured by an imaging device connected to the information processing device, or images stored in the storage unit H104. Video frames 3100, 3110, 3120, 3130, and 3140 in FIG. 5(A) are examples of acquired image frames.

Next, in S303, the object detection unit 202 detects one or more predetermined objects (people in this case) from the acquired image based on the image features of the predetermined object. Known techniques for detecting objects in an image include the method of Liu (Liu, SSD: Single Shot MultiBox Detector. In: ECCV 2016). FIG. 5(A) shows the result of detecting candidate objects in the images. The rectangular frames 3101, 3102, 3103, 3111, 3112, 3113, 3121, 3122, 3131, 3132, 3141, and 3142 in FIG. 5(A) are bounding boxes (hereinafter BB) indicating the detected object regions.

In S304, the occlusion map generating unit 203 generates, for each image, an occlusion map indicating occlusion information, that is, whether each region is occluded (the occlusion relationship). For each image, the occlusion map generating unit 203 estimates the visible region of each occluded object (the occluded object region). Here, it determines whether each person overlaps another person and, if so, whether that person is behind or in front, and outputs the result for each region as an occlusion-state score (likelihood). This is a kind of semantic segmentation task and can be implemented using known methods such as that of Chen et al. (Chen, DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2016).

FIG. 6(A) shows a schematic diagram explaining the occlusion map generation process together with an example result. The neural network 402 determines the occlusion state of each pixel of an input image. When an RGB image 401 is input, the neural network 402 estimates whether a person is present in the image and whether that person is occluded, and outputs the result as an occlusion map 404. In this map, an occlusion score of 0 is output for regions estimated to be an unoccluded person or not a person, and 1 for regions of an occluded person. Darker regions in the occlusion map 404 indicate higher occlusion scores. In other words, black regions were estimated to be regions of an occluded person. The neural network 402 has been trained in advance to produce such output for an input image (training is described later). The occlusion map 404 shown in the figure is an example of an ideal output as an estimation result.

In addition to the RGB image 401, a variation is also conceivable in which a 2.5-dimensional depth image 405 is separately acquired using a dedicated sensor or the like. The depth image 405 is concatenated with the 3-channel RGB image 401, and the resulting 4-channel information is used as the image input for training and recognition instead of the RGB image alone. This can make the information on occluded regions more accurate.
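The 4-channel input can be pictured as appending one depth value to every RGB pixel. The following is a minimal sketch under the assumption that both images are nested lists of the same height and width; the function name `stack_rgbd` is hypothetical, and an actual system would typically concatenate tensors along the channel axis instead.

```python
def stack_rgbd(rgb, depth):
    """Append the depth value to each RGB pixel, giving a 4-channel image."""
    same_shape = len(rgb) == len(depth) and all(
        len(rgb_row) == len(depth_row) for rgb_row, depth_row in zip(rgb, depth)
    )
    if not same_shape:
        raise ValueError("RGB and depth images must have the same height and width")
    return [
        [(*pixel, d) for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]
```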

Next, in S305, the information processing device 1 starts a loop that executes S306 to S307 for each object detected in S303. In S305 to S308, the extraction unit 204 extracts information indicating the occlusion relationship for each detected object from the generated occlusion map and stores it in the storage unit 206. In S306, the extraction unit 204 extracts the information indicating the occlusion relationship for each detected object from the occlusion map. Specifically, the occlusion scores of the occlusion map 404 inside the person detection frame 407 in FIG. 6(A) are averaged. The closer this occlusion score is to 1, the more likely the object is occluded, and the closer it is to 0, the more likely the object is not occluded. To allow for misalignment of the detection frame and noise in the occlusion map 404, the score is obtained as a weighted average that emphasizes the vicinity of the center of the frame, as shown in FIG. 6(B). Operation 412 in the figure denotes the element-wise product (Hadamard product) over each partial region of the image. Map 413 is a two-dimensional Gaussian function with its peak at the center and with values summing to 1 over the image block (its vertical and horizontal sizes are scaled to fit the person detection frame).

Examples of the obtained results are shown in FIG. 6(A) as occlusion score values 409 and 410, labeled with the symbol occ. The left detection frame is judged to have a high occlusion score because the person is behind, and the right detection frame a low occlusion score because the person is in front. Examples of the results of applying the above processing to the input images of FIG. 5 are shown in FIG. 5(B) (occlusion maps) and FIG. 5(C) (estimated occlusion score for each frame). They show that, between the start and end of the crossing, the occlusion score for person 3101, who is located behind, is relatively higher than that for person 3102.
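The center-weighted average of FIG. 6(B) can be sketched as follows. This is an illustrative sketch only: the box is assumed to be given as (left, top, width, height) in pixel indices, the Gaussian spread `sigma_scale` is an arbitrary assumed parameter, and the function names are hypothetical.

```python
import math


def gaussian_weights(height, width, sigma_scale=0.25):
    """Return a height x width Gaussian window, peaked at the center, summing to 1."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = max(height * sigma_scale, 1e-6), max(width * sigma_scale, 1e-6)
    raw = [
        [math.exp(-0.5 * (((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2)) for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def box_occlusion_score(occlusion_map, box):
    """Weighted average of the occlusion map inside box = (left, top, width, height)."""
    left, top, width, height = box
    weights = gaussian_weights(height, width)
    # Element-wise (Hadamard) product of map and weights, then summed.
    return sum(
        occlusion_map[top + y][left + x] * weights[y][x]
        for y in range(height)
        for x in range(width)
    )
```

Because the weights sum to 1, the result stays in the same 0 to 1 range as the occlusion map, while pixels near the box border, where detection misalignment is most likely, contribute less.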

In S307, the storage unit 206 stores the occlusion score of each detected object acquired by the extraction unit 204. At the same time, it also stores the position and size of each detected object and features related to the object's appearance, such as color and texture histograms. Here, these multiple kinds of quantities are collectively called the features of the detected object. Intermediate-layer information of a neural network or the like may also be used as appearance features (for example, Hariharan et al., Hypercolumns for Object Segmentation and Fine-grained Localization, in CVPR 2015).
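One way to picture the per-object record stored in S307 is a small data class holding the box, an appearance feature, and the occlusion score. The sketch below is an assumption for illustration: the class name `DetectionFeatures`, its field names, and the choice of a joint RGB histogram with 4 bins per channel are not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectionFeatures:
    bbox: Tuple[float, float, float, float]  # (center x, center y, width, height)
    appearance: List[float]  # e.g. a normalized color histogram
    occlusion_score: float  # near 1 = likely occluded, near 0 = likely not occluded


def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram of (r, g, b) tuples with values in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
        count += 1
    return [v / count for v in hist] if count else hist
```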

In S308, the information processing device 1 ends the loop repeated for each image and the loop repeated for the people detected in each image. For each image, this loop ends when occlusion information has been acquired for all the people detected in that image. Next, in S309 to S310, the matching unit 205 associates objects between the preceding and following images (this is not done for the first video frame, since there is no past frame). First, in S309, the matching unit 205 acquires the features of past objects stored in the storage unit 206: the occlusion score, the position and size, and the appearance features. Next, in S310, the matching unit 205 associates the objects detected in the past video frame with the objects detected in the frame currently being processed.

FIG. 7 shows the detailed processing flow of the matching unit 205 in S310. In S501, the matching unit 205 first forms all possible pairs between the objects detected in the current frame and the objects detected in the previous frame. If n people and m people were detected in the two frames, n×m pairs are generated in total. Next, in S502, the matching unit 205 calculates a similarity for every object pair. An index based on the difference between the features of the detected objects can be used as the similarity. As an example, the similarity between a past detected object $c_1$ and a current detected object $c_2$ is calculated as follows:
(Formula 1)
$$L(c_1, c_2) = -W_1 \lVert BB_1 - BB_2 \rVert - W_2 \lVert f_1 - f_2 \rVert - W_3 \lVert occ_1 - occ_2 \rVert$$
Here, $BB$ is a vector of four variables (center x coordinate, center y coordinate, width, height) for each object, and $f$ denotes the features of each object. $\lVert x \rVert$ is the $L^p$ norm of $x$. $occ$ is the occlusion score of each object. $W_1$, $W_2$, and $W_3$ are balance coefficients of 0 or more, each set by empirical adjustment or by machine learning. The variation of each feature may be obtained statistically in advance and used to normalize each feature. Even when objects cross, the person on the occluding side continues to be tracked, and when the person on the occluded side is seen again in the image, attempting to associate that person with the person who was on the occluding side immediately before makes the third term of Formula 1 small (strongly negative), so the similarity is calculated to be low. In other words, this processing makes people with different occlusion relationships less likely to be matched, which suppresses mismatching in tracking.
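Formula 1 and the all-pairs computation of S501 to S502 can be sketched as below. This is a minimal sketch under stated assumptions: objects are represented as dictionaries with hypothetical keys `"bb"`, `"f"`, and `"occ"`, the weights default to 1.0, and p = 2 is chosen for the norm; none of these values come from the text.

```python
def lp_norm(a, b, p=2):
    """L^p norm of the difference between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def similarity(past, current, w1=1.0, w2=1.0, w3=1.0, p=2):
    """Formula 1: negative weighted sum of box, appearance, and occlusion distances."""
    return (
        -w1 * lp_norm(past["bb"], current["bb"], p)
        - w2 * lp_norm(past["f"], current["f"], p)
        - w3 * lp_norm([past["occ"]], [current["occ"]], p)
    )


def pairwise_similarity(past_objects, current_objects, **weights):
    """Similarity for every (past, current) pair: an n x m table."""
    return [[similarity(a, b, **weights) for b in current_objects] for a in past_objects]
```

For two people with identical boxes and appearance, only the third term differs, so the candidate whose past occlusion score is closer to the current one receives the higher (less negative) similarity.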

Next, in S503, the matching unit 205 performs association (matching) to identify the correspondence between objects based on the similarity between the past objects and the current objects. Several matching methods exist, for example a method that preferentially matches candidates in descending order of similarity, and a method using the Hungarian algorithm. The former is used here.

In S503, if association has not been completed for all objects in the current frame, the matching unit 205 associates, in S506, pairs as the same person starting from the pair with the highest similarity. Objects in pairs that have been associated are removed from the candidates. If, during this process, the highest similarity among the pairs remaining at that point falls below a predetermined threshold, no similar object pairs remain. In that case, association ends without forcing any further matches (S505).
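The similarity-first matching with a stopping threshold can be sketched as follows. The sketch assumes the similarities are given as a table with one row per past object and one column per current object; the function name `greedy_match` and the threshold value used by a caller are hypothetical.

```python
def greedy_match(similarity_table, threshold):
    """Match past objects (rows) to current objects (columns), best pair first."""
    pairs = sorted(
        (
            (score, i, j)
            for i, row in enumerate(similarity_table)
            for j, score in enumerate(row)
        ),
        key=lambda item: item[0],
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break  # every remaining pair is too dissimilar; do not force a match
        if i in used_rows or j in used_cols:
            continue  # one of the two objects has already been matched
        matches.append((i, j))
        used_rows.add(i)
        used_cols.add(j)
    return matches
```

Objects left unmatched on either side are simply absent from the returned list, which corresponds to ending the association without forcing further pairs.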

The above processing, S301 to S311, is performed for each video frame. As a result, as in the example shown in FIG. 5(D), people are detected in the video and a series of tracking results showing where each object moved is obtained (the tracking results are shown by labeling the same person across frames with the symbols A, B, and C).
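The labels A, B, and C in FIG. 5(D) can be thought of as track identifiers carried from frame to frame. The following is a minimal sketch, assuming matches arrive as (past index, current index) pairs and that unmatched detections start new tracks with fresh integer identifiers. That last rule is an assumption for illustration, since the text above does not specify how new labels are issued, and the function name `propagate_labels` is hypothetical.

```python
def propagate_labels(previous_labels, matches, num_current, next_label):
    """Carry track labels across a frame pair; unmatched detections get fresh labels."""
    labels = [None] * num_current
    for past_index, current_index in matches:
        labels[current_index] = previous_labels[past_index]
    for index in range(num_current):
        if labels[index] is None:
            labels[index] = next_label  # assumed rule: start a new track
            next_label += 1
    return labels, next_label
```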

＜変形例＞
本実施形態では物体ペア同士のマッチングの類似度として差分に基づき、被遮蔽スコアや見えといった各指標の距離を重み付け和した。ここで例えばＫＬダイバージェンスを使うことも考えられる。またメトリック学習を行ってより精度の高い距離指標を求めることも考えられる。また単一の類似度を一度だけ用いるのでなく、まず見えの特徴で類似度を判定し、条件を満たしたものは次に遮蔽状態のスコアの類似度に基づいて判定する、等のルールベースによる方法や段階的な判定方法も考えられる。またさらにニューラルネットやサポートベクトルマシンといった公知の識別器の手法を用い、説明変数を特徴量、目的変数を同一物体か否かの結果、として学習・識別し、この値によってマッチングを判定することも可能である。以上のようにフレーム間の物体間の対応付けは特定の形態に限定されない。 <Modification>
In this embodiment, the matching similarity between object pairs is a weighted sum of difference-based distances for each index, such as the occlusion score and appearance. KL divergence, for example, could be used here instead. Metric learning could also be used to obtain a more accurate distance measure. In addition, instead of using a single similarity only once, a rule-based or stepwise method is possible: for example, similarity is first judged on appearance features, and pairs that satisfy the condition are then judged on the similarity of their occlusion scores. Furthermore, a known classifier such as a neural network or a support vector machine can be trained with the features as explanatory variables and whether the two objects are the same as the objective variable, and matching can be decided from its output. As described above, the matching of objects between frames is not limited to a specific form.

またさらに別の派生形態として、被遮蔽スコアの推定値を安定させるために、下式のように過去のスコアを移動平均した値を用いる工夫も考えられる。
（数式２）
ｏｃｃ^ＥＭＡ _（ｔ）＝（１－α）×ｏｃｃ^ＥＭＡ _{（ｔ－１）}＋ α×ｏｃｃ_（ｔ）
上式は指数移動平均値と呼ばれる値であり、ｏｃｃ^ＥＭＡ _（ｔ）は時刻ｔの被遮蔽スコアの指数移動平均値、ｏｃｃ_（ｔ）は時刻ｔの被遮蔽スコア、αは０＜α≦１の係数である。過去の複数フレームで追尾ができている物体については上式で指数移動平均値を算出しておき、類似度を比較する際には元の被遮蔽スコアではなく、指数移動平均被遮蔽スコアを用いる。これにより、交差時に複数のフレームにまたがって徐々に重畳状態が起こるような場合に、複数フレームの被遮蔽スコアの平均値に基づいてマッチングできるので、より物体間の対応付けが安定する。 As yet another derivative, in order to stabilize the estimated value of the occlusion score, a moving average of past scores may be used as shown in the following formula.
(Formula 2)
$$\mathrm{occ}^{\mathrm{EMA}}_{(t)} = (1-\alpha)\,\mathrm{occ}^{\mathrm{EMA}}_{(t-1)} + \alpha\,\mathrm{occ}_{(t)}$$
The value given by the above formula is called the exponential moving average. Here, $\mathrm{occ}^{\mathrm{EMA}}_{(t)}$ is the exponential moving average of the occlusion score at time t, $\mathrm{occ}_{(t)}$ is the occlusion score at time t, and α is a coefficient satisfying 0 < α ≤ 1. For objects that have been tracked over multiple past frames, the exponential moving average is calculated with the above formula, and when similarities are compared, the exponential moving average occlusion score is used instead of the original occlusion score. When an overlapping state develops gradually across multiple frames as objects cross, matching can then be based on the occlusion scores averaged over those frames, which makes the matching between objects more stable.
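A minimal sketch of the Formula 2 update follows. The function name and the choice to initialize the average with the first observed score are assumptions; the source does not state how the average is initialized.

```python
def update_occ_ema(prev_ema, occ_t, alpha=0.5):
    """One step of Formula 2: EMA_t = (1 - alpha) * EMA_{t-1} + alpha * occ_t."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    if prev_ema is None:
        # Assumption: start the average from the first observed score.
        return occ_t
    return (1.0 - alpha) * prev_ema + alpha * occ_t


ema = None
history = []
for occ in [0.0, 1.0, 1.0]:  # raw per-frame occlusion scores
    ema = update_occ_ema(ema, occ, alpha=0.5)
    history.append(ema)
```

With α = 1 the update reduces to the raw per-frame score, while smaller values of α respond more slowly to a sudden change in the occlusion state.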

またさらに別の派生形態として、マッチングの際に前後フレーム間の類似度だけでなく、ｎステップ前の過去の複数のフレームの特徴量・位置を用いてマッチングを行うような形態も考えられる。この方法を用いることで、一度物体が遮蔽されて追尾できないフレームが発生しても、その後のフレームで遮蔽が解消されれば再び追尾が可能になる。この形態では例えば、ｎフレームまでさかのぼって物体の特徴量の平均値を求め、これに基づいて現フレームから検出された物体との類似度の算出を行う。もしくは、過去のｎフレームの物体と現フレームの物体間でそれぞれ類似度を求め、得られたｎ個の類似度の平均値が最も高い物体に対応付ける。また、過去だけでなく、ｎステップの未来のフレームの結果も使って双方向で判定を行うことも考えられる。この形態は未来のフレームを処理するまで結果が判明しないため処理のリアルタイム性には劣るが、過去のみを見る方法よりも高精度である。 As yet another derived form, matching can be performed using not only the similarity between previous and next frames, but also the feature values and positions of multiple past frames n steps back. By using this method, even if an object is occluded and cannot be tracked in a frame, it can be tracked again if the occlusion is removed in a subsequent frame. In this form, for example, the average feature value of the object is calculated going back n frames, and the similarity with the object detected in the current frame is calculated based on this. Alternatively, the similarity between the object in the past n frames and the object in the current frame is calculated, and the object with the highest average value of the n similarities obtained is associated with the object. It is also possible to perform bidirectional judgment using not only the past but also the results of future frames n steps. This form is inferior in real-time processing because the results are not known until future frames are processed, but it is more accurate than the method of looking only at the past.
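The second variant above (averaging the n similarities against the past n frames) might look like the sketch below. The class name `TrackHistory` and the negated Euclidean distance used as the similarity are assumptions for illustration.

```python
from collections import deque


class TrackHistory:
    """Keeps the features of one tracked object for the last n frames."""

    def __init__(self, n):
        self.features = deque(maxlen=n)  # older entries drop off automatically

    def add(self, feature):
        self.features.append(tuple(feature))

    def mean_similarity(self, candidate, sim_fn):
        """Average similarity between the candidate and each stored frame."""
        if not self.features:
            raise ValueError("no past features stored")
        total = sum(sim_fn(f, candidate) for f in self.features)
        return total / len(self.features)


def neg_l2(a, b):
    """Assumed similarity: negated Euclidean distance."""
    return -(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)


track = TrackHistory(n=2)
for feat in [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]:
    track.add(feat)
score = track.mean_similarity((3.0, 0.0), neg_l2)
```

The candidate with the highest averaged score would then be associated with the track. The bidirectional variant would additionally store features from future frames, at the cost of delaying the result.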

またさらに別の派生形態として、検出の失敗に対応するための形態が考えられる。物体検出・追尾においては物体の姿勢が特殊な形状に変化した、等の理由で物体検出の段階で一時的に失敗するようなことも起こり得る。このような未検出が起こると、フレーム間の対応付けの際に、前のフレームに存在した物体が、現フレームでは対応なしと判定される。すると追尾はそこで途切れることになる。このような失敗を防ぐために、以下のような工夫もありえる。すなわち、マッチングで未対応の人物が発生したら、その情報をリストに記憶しておき、次のフレームのマッチングのときに対応付けの候補に加える（一定時間が経過してもまだ未対応であれば物体自体がもう存在しないと判断し、リストから除去する。ここではこれをタイムアウト処理と呼ぶ）。 Yet another possible variant is one that deals with detection failures. In object detection and tracking, temporary failures in object detection can occur due to reasons such as an object's posture changing to an unusual shape. If such a failure to detect occurs, when matching frames, an object that existed in the previous frame is determined to not match in the current frame. This causes tracking to be interrupted. To prevent such failures, the following can be devised. That is, if an unmatched person is encountered in matching, the information about that person is stored in a list, and the person is added to the list of candidates for matching when matching the next frame (if a certain amount of time has passed and the object is still unmatched, it is determined that the object itself no longer exists and is removed from the list. Here, this is called timeout processing).
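The timeout processing can be sketched as a small bookkeeping structure. The class name, the frame-count notion of "a certain amount of time", and the stored fields are assumptions for illustration.

```python
class UnmatchedList:
    """Holds objects that found no match, until they time out."""

    def __init__(self, max_age):
        self.max_age = max_age  # frames to keep an unmatched object
        self.entries = {}       # track_id -> [info, age_in_frames]

    def add(self, track_id, info):
        self.entries[track_id] = [info, 0]

    def remove(self, track_id):
        """Call when the object has been matched again."""
        self.entries.pop(track_id, None)

    def candidates(self):
        """Objects to add to the matching candidates of the next frame."""
        return {tid: entry[0] for tid, entry in self.entries.items()}

    def step(self):
        """Advance one frame; drop and return the ids that timed out."""
        expired = []
        for tid, entry in list(self.entries.items()):
            entry[1] += 1
            if entry[1] > self.max_age:
                expired.append(tid)
                del self.entries[tid]
        return expired


pending = UnmatchedList(max_age=2)
pending.add("A", {"bb": (10, 10, 4, 8)})
expired_per_frame = [pending.step() for _ in range(3)]
```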

このように動画フレームをまたがる物体の対応付けについては種々のやり方が考えられ、特定の形態に限定されない。 As described above, objects can be matched across video frames in various ways, and the matching is not limited to any particular form.

＜遮蔽情報の形態のバリエーションおよび学習方法＞
本実施形態では、遮蔽マップとして、遮蔽されている物体のうちの見えている領域（被遮蔽物体領域）を推定した。この形態についても様々な派生形態が考えられる。一例を図８に示す。ここでは図８（Ｂ）に示すように、画像８０１のように奥側の物体の見えている領域を推定する以外でもよい。例えば、画像８０２のように奥側の物体の全領域を推定する。また、画像８０３のように、手前側の遮蔽物体の領域を推定する（図の領域４４０のように他物体と重なっていない物体も手前側領域として含めて推定している。ただし別の形態としてこのような単独の物体は手前側の領域に含めないことも考えられる）。また、画像８０１～８０３のように前景領域を推定するのではなく、画像８０４のように物体の中心や重心の位置を推定することも考えられる。画像８０４の場合においては被遮蔽物体の中心付近の領域に大きな正の値を、遮蔽物体の中心付近に小さな負の値を推定するようにする（ここでいう物体の中心領域は図示するようにガウス関数状の領域を推定させるような形態が考えられる）。 <Variations in the form of occlusion information and learning methods>
In this embodiment, the visible area of an occluded object (the occluded object area) is estimated as the occlusion map. Various derived forms are conceivable here as well; examples are shown in FIG. 8. As shown in FIG. 8(B), the estimation target need not be the visible area of the object on the back side as in image 801. For example, the entire area of the object on the back side may be estimated as in image 802. Alternatively, the area of the occluding object on the front side may be estimated as in image 803 (here an object that does not overlap any other object, such as area 440 in the figure, is also included in the front-side area; as another form, such an isolated object may be excluded from the front-side area). Instead of estimating a foreground area as in images 801 to 803, the position of the center or center of gravity of each object may be estimated as in image 804. In the case of image 804, a large positive value is estimated in the area near the center of an occluded object, and a small negative value is estimated near the center of an occluding object (the central area of an object here may, for example, be estimated as a Gaussian-shaped area as shown in the figure).

ここで遮蔽状態の情報の学習方法について図６（Ｄ）を用いて説明する。前述のＣｈｅｎらの手法等で示されるニューラルネット４０２は、入力画像であるＲＧＢ画像４０１に対して遮蔽物体の被遮蔽スコアマップ４０３を出力する。４０３の結果例を４３０に示す。ＣＨＥＮらの手法等は特定カテゴリ物体の前景領域を推定する手法であるが、ここでは遮蔽情報の教師値４３１を与えて、教師値４３１と同じようなマップが推定によって得られるようニューラルネット４０２の学習を行う。具体的には出力結果のマップ４０３と教師値４３１を比較し、交差エントロピーや二乗誤差などの公知の方法で損失値算出４３２を行う。損失値が漸減するように誤差逆伝搬法等でニューラルネット４０２の重みパラメーターを調整する（この処理についてはＣｈｅｎらの手法と同一のため詳細は略す）。入力画像と教師値は十分な量を与える必要がある。重なった物体の領域の教師値を作成するのはコストがかかるため、ＣＧを用いることや、物体画像を切り出して重畳する画像合成の方法を用いて学習データを作成するようなことも考えられる。以上が学習方法になる。 Here, the learning method of the information of the occlusion state will be explained with reference to FIG. 6(D). The neural network 402 shown in the above-mentioned Chen et al. method etc. outputs the occlusion score map 403 of the occluded object for the RGB image 401, which is the input image. An example of the result of 403 is shown in 430. The CHEN et al. method etc. is a method for estimating the foreground region of a specific category object, but here, a teacher value 431 of the occlusion information is given, and the neural network 402 is trained so that a map similar to the teacher value 431 can be obtained by estimation. Specifically, the output result map 403 is compared with the teacher value 431, and a loss value calculation 432 is performed by a known method such as cross entropy or square error. The weight parameters of the neural network 402 are adjusted by the error backpropagation method etc. so that the loss value gradually decreases (this process is the same as the Chen et al. method, so details are omitted). It is necessary to give a sufficient amount of the input image and the teacher value. Creating training values for overlapping object regions is costly, so it is possible to create training data using CG or an image synthesis method that cuts out and overlaps object images. This concludes the learning method.

またさらに、本実施形態では上記で求めた物体の枠の中で取得して被遮蔽スコアと呼ぶ指標を求めた。遮蔽情報の取得の形態の様々な例を図８（Ｃ）に示す。図８（Ｃ１）は本実施形態の形態である。この他に、（Ｃ２）奥側の被遮蔽スコアと手前側の被遮蔽スコアの差分値を物体枠内で取得する、（Ｃ３）物体の中心のスコアを１点だけ参照する、等様々に考えられる。また、枠内で取得する際に、物体の枠内で取得する際に、他の物体枠と重なっている領域についてはどちらの物体の領域か判然としないために取得から省くような方法も考えられる。 Furthermore, in this embodiment, an index called an occlusion score is obtained within the object frame obtained above. Various examples of the form of obtaining occlusion information are shown in FIG. 8 (C). FIG. 8 (C1) shows the form of this embodiment. In addition, various other methods are possible, such as (C2) obtaining the difference value between the occlusion score on the back side and the occlusion score on the front side within the object frame, or (C3) referencing only one point of the score at the center of the object. Also, when obtaining within a frame, a method is conceivable in which areas that overlap with other object frames are omitted from the obtaining because it is unclear which object's area it is.
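The three acquisition forms (C1) to (C3) can be sketched as follows. This is an illustration under assumptions: the score maps are 2D lists indexed as `map[y][x]`, boxes are given as pixel corners `(x0, y0, x1, y1)` with exclusive upper bounds, and the in-box aggregation is taken to be the mean.

```python
def box_mean(score_map, box):
    """(C1) Aggregate the occlusion score inside the object frame (mean assumed)."""
    x0, y0, x1, y1 = box
    values = [score_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)


def box_mean_difference(back_map, front_map, box):
    """(C2) Back-side score minus front-side score inside the object frame."""
    return box_mean(back_map, box) - box_mean(front_map, box)


def center_score(score_map, box):
    """(C3) Read the score at the single point at the center of the frame."""
    x0, y0, x1, y1 = box
    return score_map[(y0 + y1) // 2][(x0 + x1) // 2]


back_map = [[0, 0, 0, 0],
            [0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
front_map = [[0] * 4 for _ in range(4)]
inner_mean = box_mean(back_map, (1, 1, 3, 3))
outer_mean = box_mean(back_map, (0, 0, 4, 4))
```

Excluding pixels that also fall inside another object's frame, as mentioned above, would amount to filtering `values` before averaging.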

またさらに、上述の＜遮蔽状態の推定＞と＜各物体の被遮蔽スコアの取得＞を同時に行う方法も考えられる。例として、Ｌｉｕの手法等で使われている公知な方法であるアンカーと呼ばれる手法があげられる。この手法では物体の候補枠の集合が求められるので、これを利用して各候補枠が遮蔽物体か被遮蔽物体かの被遮蔽スコアを推定し対応付けることが考えられる（この形態の詳細については実施形態３で示すのでここでは説明を略す）。 Furthermore, a method can be considered in which the above-mentioned <estimation of the occlusion state> and <obtainment of the occlusion score for each object> are performed simultaneously. One example is a method called anchor, which is a well-known method used in Liu's method and others. This method obtains a set of candidate frames for objects, and it is possible to use this to estimate and associate each candidate frame with an occlusion score indicating whether it is an occluding object or an occluded object (the details of this form will be shown in embodiment 3, so the explanation will be omitted here).

またさらに、上で示したような複数の形態の遮蔽情報をそれぞれ取得し、これを遮蔽に関する多次元の特徴として後段の物体の対応付けに用いてもよい。もしくは前記の遮蔽に関する多次元の特徴から機械学習によって物体の遮蔽されている面積の割合を推定して用いてもよい。この場合は前記の遮蔽に関する多次元の特徴を説明変数とし、物体が遮蔽されている面積の割合を目標変数とし、ロジスティック回帰等の公知技術で回帰推定を行う等すればよい。 Furthermore, multiple forms of occlusion information as shown above may be obtained and used as multidimensional features related to occlusion in subsequent object matching. Alternatively, the proportion of the object's occluded area may be estimated by machine learning from the multidimensional features related to occlusion. In this case, the multidimensional features related to occlusion may be used as explanatory variables, the proportion of the object's occluded area may be used as a target variable, and regression estimation may be performed using a publicly known technique such as logistic regression.

＜実施形態２＞
本実施形態では実施形態１と同様に人物の検出と追尾を行う。ハードウェア構成は実施形態１の図２と同様である。本実施形態における機能構成例を示すブロック図は図９（Ａ）になる。実施形態１の構成に新たに遮蔽状態判定部３０１が追加されている。実施形態１では追尾中に人物の枠は人物同士の重なりによって、人物の検出ができないことがある。例えば図１０（Ａ）中の動画フレーム４１２０に示すように、人物間で重なった面積が大きいときには、奥側の人物が検出できないことは多い。このような時に遮蔽状態判定部３０１が、人物は存在しているが被遮蔽状態にある、と判定する。 <Embodiment 2>
In this embodiment, people are detected and tracked in the same manner as in the first embodiment. The hardware configuration is the same as in FIG. 2 of the first embodiment. A block diagram showing an example of the functional configuration in this embodiment is shown in FIG. 9(A). An occlusion state determination unit 301 is newly added to the configuration of the first embodiment. In the first embodiment, a person may fail to be detected during tracking because people overlap one another. For example, as shown in video frame 4120 in FIG. 10(A), when the overlapping area between people is large, the person in the back often cannot be detected. In such a case, the occlusion state determination unit 301 determines that the person is present but in an occluded state.

実施形態１で説明したような物体検出部の一時的な検出の失敗による未検出と異なり、人物の集団が同じ方向に同じ速度で移動しているような場合、長時間未検出の状態が続く。さらに被遮蔽状態から脱した画面上の位置が、被遮蔽状態が開始した位置から離れることがある。このため被遮蔽状態であると積極的に判定し、推定した前記状態に応じた処理を行うことで追尾の成功率を高めることが望ましい。 Unlike non-detection caused by a temporary failure of the object detection unit as described in the first embodiment, when a group of people are moving in the same direction at the same speed, the non-detection state continues for a long time. Furthermore, the position on the screen where the object leaves the occluded state may be far from the position where the occluded state began. For this reason, it is desirable to increase the success rate of tracking by proactively determining that the object is in an occluded state and performing processing according to the estimated state.

本実施形態も全体の処理フローは実施形態１の図４と同じであるが、Ｓ３１０の処理の詳細が下記のように異なる。ここでは、実施形態１と異なるＳ３１０の処理についてのみ説明する。図１１を用いて遮蔽状態判定部３０１が行うＳ３１０処理の詳細なフローについて説明する。まずこれまでと同じようにＳ６０１で現フレームと前フレームで物体の対応付けを行う。Ｓ６０２で対応付けられなかった前フレームの物体がある場合、被遮蔽状態に入った可能性がある。そこでＳ６０３で当該物体のそれまでの被遮蔽スコアの高さが閾値以上かを調べる。これは動画フレームのフレームレートが十分に高ければ、遮蔽により未検出になる前後で被遮蔽スコアが高くなることが多いためである。さらにＳ６０４で当該物体の周辺領域で現フレームの物体の検出数の数が減っていないかを調べ、上記の二つの結果が真であれば当該物体は被遮蔽状態に入ったと推定し被遮蔽状態のリストに記憶する（Ｓ６０５）。被遮蔽状態のリストに記憶された物体については前回検出されたときの特徴量と位置も合わせて記憶する。これによって、遮蔽が解消されて再び検出されたときに追尾できる可能性が向上する。 In this embodiment, the overall processing flow is the same as in FIG. 4 of the first embodiment, but the details of the processing in S310 are different as follows. Here, only the processing in S310 that differs from the first embodiment will be described. A detailed flow of the processing in S310 performed by the occlusion state determination unit 301 will be described with reference to FIG. 11. First, in S601, objects are associated with each other in the current frame and the previous frame as in the previous embodiment. If there is an object in the previous frame that was not associated with each other in S602, it is possible that the object has entered an occluded state. Therefore, in S603, it is checked whether the occlusion score of the object up to that point is equal to or higher than a threshold value. This is because if the frame rate of the video frame is sufficiently high, the occlusion score often becomes high before and after the object becomes undetected due to occlusion. Furthermore, in S604, it is checked whether the number of detections of the object in the current frame has decreased in the surrounding area of the object, and if the above two results are true, it is estimated that the object has entered an occluded state and is stored in the list of occluded states (S605). For the objects stored in the list of occluded states, the feature amount and position at the time of the previous detection are also stored. 
This improves the chances of tracking when the object is unobstructed and detected again.
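Steps S602 to S605 can be outlined as below. This is a sketch under assumptions: the dictionary keys, the way neighboring detections are counted, and the threshold value are hypothetical, and the S604 check is read as "the number of detections around the object has decreased".

```python
def entered_occlusion(track, prev_neighbors, curr_neighbors, occ_threshold=0.5):
    """Decide whether an unmatched previous-frame object became occluded."""
    high_score = track["occ"] >= occ_threshold          # S603
    fewer_detections = curr_neighbors < prev_neighbors  # S604
    return high_score and fewer_detections


def update_occluded_list(unmatched_prev, neighbor_counts, occluded,
                         occ_threshold=0.5):
    """S605: remember the last feature and position of newly occluded objects.

    neighbor_counts maps track id -> (count in previous frame, count in
    current frame) of detections in the area surrounding that object.
    """
    for track in unmatched_prev:
        prev_n, curr_n = neighbor_counts[track["id"]]
        if entered_occlusion(track, prev_n, curr_n, occ_threshold):
            occluded[track["id"]] = {
                "feature": track["feature"],
                "position": track["position"],
            }
    return occluded


unmatched = [
    {"id": "A", "occ": 0.8, "feature": (1.0, 0.0), "position": (50, 40)},
    {"id": "B", "occ": 0.1, "feature": (0.0, 1.0), "position": (90, 40)},
]
counts = {"A": (2, 1), "B": (1, 0)}
occluded = update_occluded_list(unmatched, counts, {})
```

Object B is not added because its occlusion score was low before it disappeared, so it is handled as an ordinary missed detection instead.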

Ｓ６０６～Ｓ６１０は被遮蔽状態の物体が再出現したかどうかを判定する処理である。Ｓ６０３で対応付けられなかった現フレームの物体がある場合、被遮蔽状態を脱して再度検出できるようになった可能性がある。そこでＳ６０７で当該物体の被遮蔽スコアの高さが閾値以上かを調べる。さらにＳ６０８で当該物体の周辺領域で現フレームの物体の検出数の数が増えていいないかを調べる。両方の結果が真で、且つ被遮蔽状態のリストに記憶されている物体のいずれかと当該物体が所定閾値以上に類似度が高い場合（Ｓ６０８）、当該物体は被遮蔽状態から脱して再度出現したと推定する。そのとき、対応付けた物体を被遮蔽状態のリストから除去する（Ｓ６０９）被遮蔽状態のリストから除去された物体については、現在の入力画像から検出された特徴量と位置を取得する。 S606 to S610 are processes for judging whether an object in an occluded state has reappeared. If there is an object in the current frame that was not associated in S603, there is a possibility that it has escaped from the occluded state and can be detected again. Therefore, in S607, it is checked whether the occluded score of the object is equal to or higher than a threshold. Furthermore, in S608, it is checked whether the number of detections of the object in the current frame has increased in the surrounding area of the object. If both results are true and the object has a similarity to any of the objects stored in the occluded state list that is higher than a predetermined threshold (S608), it is estimated that the object has escaped from the occluded state and reappeared. At that time, the associated object is removed from the occluded state list (S609). For the object removed from the occluded state list, the feature amount and position detected from the current input image are obtained.

ここで、対応付けの処理の工夫として、例えば、フレーム間の物体のマッチングの際に、被遮蔽状態にある人物とのマッチングは距離による類似度のペナルティを減ずる。再出現を待つタイムアウトの時間を長く取る。遮蔽状態の物体との対応付けの閾値は、通常の物体間のマッチングよりも閾値を低く設定する、等が考えられる。 Here, some ideas for matching processing could be, for example, reducing the distance-based similarity penalty when matching with an occluded person when matching objects between frames; lengthening the timeout period for waiting for a person to reappear; setting a lower threshold for matching with occluded objects than for matching between normal objects; etc.

またさらに、ここでは二人の人物の重なりを想定して説明を行ったが、３人以上の人物の間で重なりが生じることもある。この場合は、遮蔽状態に入ったと判定されれば被遮蔽状態のリストに加えておき、再出現したら前フレームとの対応付けを行い、被遮蔽状態のリストから都度除去する。これにより３人以上についてもある程度の対応が可能である。 Furthermore, although the explanation here assumes two people overlapping, overlaps can also occur between three or more people. In this case, if it is determined that a person has entered an occluded state, they are added to a list of occluded states, and when they reappear, they are associated with the previous frame and removed from the list of occluded states each time. This makes it possible to deal with cases of three or more people to a certain extent.

＜実施形態３＞
本実施形態では、ユーザが指定した単一の物体を追尾する形態について説明する。ここでは追尾対象は人体等の特定カテゴリに限らず、ユーザが指定した不特定の物体を追尾する形態を扱う。例えば、犬などの動物や、車などの乗り物であってもよい。 <Embodiment 3>
In this embodiment, a form of tracking a single object designated by a user will be described. Here, the tracking target is not limited to a specific category such as a human body, but a form of tracking an unspecified object designated by a user will be dealt with. For example, it may be an animal such as a dog, or a vehicle such as a car.

機能ブロックの図は図９（Ｂ）になる。これまでの構成に新たに追尾物体指定部３０２が追加されている。ここで追尾物体指定部３０２と物体検出部２０２の機能は非特許文献１の方法を用いることで容易に実現することができる。追尾物体指定部３０２はユーザが動画フレーム中で追尾対象物体の枠位置を指定する機能部である。これにより追尾すべき物体の特徴が初期化される。物体検出部２０２は各動画中で最も対象物体と一致度の高い画像領域を同定する。同定した結果例を図１２（Ａ）に示す。図１２（Ａ）の動画フレーム５１１０上の枠５１１１がユーザによって指示された追尾物体の枠である。動画フレーム５１２０ではこの物体が画面中で右側に移動しており、物体検出部２０２によって枠５１２１として検出されている。非特許文献１の方法は物体の追尾手法として優れるが、類似物体間で容易に誤スイッチが生じる。そこで本実施形態ではこれまでの実施形態と同様に、追尾物体に対して遮蔽状態に関する情報を推定し、誤スイッチが生じていないかを判定する。 The functional block diagram is shown in FIG. 9(B). A new tracking object designation unit 302 has been added to the previous configuration. The functions of the tracking object designation unit 302 and the object detection unit 202 can be easily realized by using the method of Non-Patent Document 1. The tracking object designation unit 302 is a functional unit in which the user designates the frame position of the tracking target object in the video frame. This initializes the characteristics of the object to be tracked. The object detection unit 202 identifies the image area in each video that is most similar to the target object. An example of the identification result is shown in FIG. 12(A). The frame 5111 on the video frame 5110 in FIG. 12(A) is the frame of the tracking object designated by the user. In the video frame 5120, this object moves to the right side in the screen and is detected as a frame 5121 by the object detection unit 202. The method of Non-Patent Document 1 is an excellent object tracking method, but erroneous switching easily occurs between similar objects. Therefore, in this embodiment, as in the previous embodiments, information regarding the occlusion state of the tracking object is estimated, and it is determined whether or not an erroneous switch has occurred.

このために遮蔽情報生成部２０３として図１３（Ｂ）に示すようなニューラルネット６３００を用いる。これは検出された追尾物体の画像６３０１（ここでは処理の簡単のために正方形の画像に縦横比率を正規化している）を入力すると、画像パターンを見て、遮蔽されている（Ｙｅｓ）かされていない（Ｎｏ）かの分類結果６３０２を出力する分類器である。遮蔽の有無の定義としては、物体の面積が何％以上遮蔽されているか否かとして定義する。この２クラスの値を教師値として与えてニューラルネット６３００を学習させる。この技術は通常の画像分類タスクと同様の広く公知な方法のため詳細を略す。また、教師値（目標変数）を遮蔽の有無の２値ではなく遮蔽面積の割合として与えて回帰学習を行えば、推定結果６３０３のように遮蔽の割合を推定することができる。この回帰学習には学習時に与える損失値として二乗誤差等を用いる。 For this purpose, a neural network 6300 as shown in FIG. 13(B) is used as the occlusion information generation unit 203. This is a classifier that takes an image 6301 of the detected tracking object as input (here, the aspect ratio is normalized to a square image to simplify processing), examines the image pattern, and outputs a classification result 6302 indicating whether the object is occluded (Yes) or not (No). Whether an object counts as occluded is defined by whether at least a given percentage of its area is occluded. These two class values are given as teacher values to train the neural network 6300. This is a widely known technique, the same as an ordinary image classification task, so details are omitted. In addition, if the teacher value (target variable) is given as the proportion of the occluded area rather than as a binary occluded/not-occluded value and regression learning is performed, the occlusion proportion can be estimated as in estimation result 6303. This regression learning uses a loss such as the squared error during training.

遮蔽情報生成部２０３で追尾物体候補の遮蔽度を推定した結果が図１４（Ａ）（Ｂ）である。図１４（Ａ）に示す物体の検出結果に対して、物体検出部２０２が図１４（Ｂ）に符号ｏｃｃを付して示したのが被遮蔽面積の推定値である。同図では被遮蔽スコアの変動幅は所定値（例えば０．３等の値）より小さく、追尾に失敗していないと判定できる（ここで、被遮蔽スコアだけでなく実施形態１で用いたような位置や見えの特徴量の類似度も併用して追尾の成功・失敗を判定してもよい）。 Figures 14(A) and (B) show the results of estimating the occlusion degree of a candidate object to be tracked by the occlusion information generation unit 203. In comparison with the object detection result shown in Figure 14(A), the estimated occluded area by the object detection unit 202 is shown in Figure 14(B) with the symbol occ. In this figure, the fluctuation range of the occlusion score is smaller than a predetermined value (e.g., a value such as 0.3), and it can be determined that tracking has not failed (here, the success or failure of tracking may be determined using not only the occlusion score but also the similarity of the feature amounts of position and appearance as used in embodiment 1).

一方で図１４（Ｃ）では、動画フレーム７２２０から７２３０にかけて物体７２０１が物体７２０２の向こう側を通過しており、その結果、物体検出部２０２が動画フレーム７２３０における物体の位置を枠７２３１として誤って推定している。この場合の遮蔽スコアは図１４（Ｄ）に示すように０．４から０．０へと大きく変動しているため、交差によって誤追尾が発生したと判定することができる。誤追尾が発生したことが分かれば、そこで検出を止めたり、後段で修正する等の工夫を行うことができる。 On the other hand, in FIG. 14(C), object 7201 passes behind object 7202 from video frame 7220 to 7230, and as a result, the object detection unit 202 erroneously estimates the position of the object in video frame 7230 as frame 7231. In this case, the occlusion score fluctuates significantly from 0.4 to 0.0 as shown in FIG. 14(D), so it can be determined that erroneous tracking has occurred due to the intersection. If it is determined that erroneous tracking has occurred, it is possible to take measures such as stopping detection at that point or correcting it at a later stage.
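The check described here can be sketched as a frame-to-frame comparison of the occlusion score. Treating the "fluctuation range" as the absolute change between consecutive frames is an assumption, as is the function name; the limit 0.3 follows the example value mentioned for FIG. 14(B).

```python
def mistrack_frames(occ_scores, max_change=0.3):
    """Return the frame indices where the occlusion score jumps too much.

    occ_scores holds the per-frame occlusion score of the tracked object.
    A change of max_change or more between consecutive frames is treated
    as a sign that the tracker switched to a different object.
    """
    return [
        t
        for t in range(1, len(occ_scores))
        if abs(occ_scores[t] - occ_scores[t - 1]) >= max_change
    ]


stable = mistrack_frames([0.1, 0.15, 0.2])
crossing = mistrack_frames([0.1, 0.2, 0.4, 0.0])  # 0.4 -> 0.0 as in FIG. 14(D)
```

A caller could stop tracking at the first reported index or flag the subsequent frames for later correction.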

以上が本実施形態の説明となる。 This concludes the explanation of this embodiment.

なお、遮蔽情報生成部２０３の学習は図１２（Ａ）５１１０～５１５０に示すように、不特定の物体について遮蔽状態が判定できるように様々な物体の遮蔽状態を推定できるように学習しておくことが望ましい。 It is desirable that the occlusion information generating unit 203 be trained to be able to estimate the occlusion state of various objects so that the occlusion state of unspecified objects can be determined, as shown in 5110 to 5150 in FIG. 12(A).

なお他の派生の形態としては、図１３（Ｂ）では、物体枠で切られた画像６３０１を入力画像として示している。しかし、被遮蔽状態にあるか否かの判定には当該物体だけでなくその周辺を観察することが重要なため、入力画像としてはより広い範囲を入力することも考えられる（その場合、推定時にも同様の範囲を切り取って入力する）。 As another derivative form, in FIG. 13(B), an image 6301 cut by an object frame is shown as the input image. However, since it is important to observe not only the object itself but also its surroundings in order to determine whether it is in an occluded state, it is also possible to input a wider range as the input image (in that case, a similar range is also cut out and input during estimation).

なお他の派生の形態としては、図１３（Ｃ）に示すように、上述のＬｉｕの手法のようなアンカーと言われる候補枠を使って物体の検出と遮蔽度の推定を同時に行う形態も考えられる。アンカー枠は図１３（Ｄ）に示すような複数のサイズ・縦横比率の候補枠の集合である（ここでは３種類のアンカー枠を図示している）。アンカー枠は図１３（Ｃ）の結果画像６４５０に示すように、画像中の各ブロック領域に配置されている。ニューラルネット６４００は画像が入力されたら、各ブロック領域の各アンカーに当該物体があるか否かの被遮蔽スコアマップ６４３０を生成する。被遮蔽スコアマップ６４３０はアンカー枠の種類の３個に対応した３枚のマップである。推定結果の例を図１３（Ｃ）６４５０に示す（以上の手法は広く公知のため詳細は上述のＬｉｕの方法を参照されたい）。 As another derived form, as shown in FIG. 13(C), a form in which object detection and occlusion degree estimation are performed simultaneously using candidate frames called anchors, as in the Liu method described above, is also considered. The anchor frames are a set of candidate frames of multiple sizes and aspect ratios as shown in FIG. 13(D) (three types of anchor frames are illustrated here). The anchor frames are placed in each block region in the image, as shown in the result image 6450 in FIG. 13(C). When an image is input, the neural network 6400 generates an occlusion score map 6430 indicating whether or not the object is present in each anchor in each block region. The occlusion score map 6430 is a set of three maps corresponding to the three types of anchor frames. An example of the estimation result is shown in FIG. 13(C) 6450 (the above method is widely known, so please refer to the Liu method described above for details).

ここで本実施形態の派生の形態として、物体が存在するか否かの推定と同時に、物体の被遮蔽スコアマップ６４４０を生成する。これは各アンカー枠に、もしそこに物体がある場合、その被遮蔽割合がいくつになるかを推定したマップである。同マップもアンカーの種類の数に対応した３枚からなる（学習時には画像の各ブロックにおいて、各アンカー枠に被遮蔽スコアの教師値を与えてニューラルネット６４００を学習すればよい）。結果例を図１３（Ｃ）６４６０に示す。二つの推定マップを最終的に統合した例を統合結果例６４７０として図示する。 As a derived form of this embodiment, an object occlusion score map 6440 is generated at the same time as estimating whether an object exists. This is a map that estimates the occlusion ratio for each anchor frame if an object is present there. This map also consists of three maps corresponding to the number of anchor types (during learning, the neural network 6400 can be trained by providing a teacher value of the occlusion score for each anchor frame in each block of the image). An example result is shown in FIG. 13(C) 6460. An example of the final integration of the two estimated maps is shown as an example integration result 6470.

上記の説明は物体検出の例になるが、非特許文献１の方法もアンカー候補枠ベースの手法であるため、物体を追尾しながら同時にその被遮蔽スコアを推定する派生形態を構成することが可能である。 The above explanation is an example of object detection, but since the method in Non-Patent Document 1 is also an anchor candidate frame-based method, it is possible to construct a derivative form that tracks an object while simultaneously estimating its occlusion score.

＜実施形態４＞
本実施形態では、ユーザが指定した単一の物体を追尾する形態について説明する。機能ブロックの図は実施形態３と同じで図９（Ｂ）である。これまでの実施形態では類似度を比較する際に、直前と直後のフレームで特徴量を比較することや、前後のｎフレームを用いて比較すること等、ルールベースでフレーム間の物体の対応付けを行った。本実施形態では、この部分を機械学習に置き換えることでより精度の高い対応付けを行う。 <Embodiment 4>
In this embodiment, a form in which a single object designated by a user is tracked will be described. The diagram of the functional blocks is the same as that of the third embodiment, and is shown in FIG. 9B. In the previous embodiments, when comparing similarities, object correspondence between frames was performed on a rule basis, such as by comparing features between the immediately preceding and following frames, or by comparing using n frames before and after. In this embodiment, this part is replaced with machine learning to perform more accurate correspondence.

リカレントニューラルネットは時系列データを処理して識別・分類等を行うことができる技術であり、Ｂｙｅｏｎらの方法などで公知なＬｏｎｇｓｈｏｒｔｔｅｒｍｍｅｍｏｒｙネットワーク（以下ＬＳＴＭ）が代表的手法である。（Ｂｙｅｏｎｅｔａｌ．，ＳｃｅｎｅｌａｂｅｌｉｎｇｗｉｔｈＬＳＴＭｒｅｃｕｒｒｅｎｔｎｅｕｒａｌｎｅｔｗｏｒｋｓ，ＣＶＰＲ２０１５）。当該手法で物体の特徴の経時的な変化を判別して物体間の対応付けを行うことができる。本実施形態の構成と結果例の模式図を図１５に示す。ここでは１つの物体９１０２が追尾対象として指定され、Ｂｅｒｔｉｎｅｔｔｏら等の手法で追尾されている（ｔ＝２の動画フレームで誤スイッチが起こっている）。図１５（Ｃ）のＬＳＴＭユニット９５０１～９５０４は、各時刻で追尾している物体の特徴９４０１～９４０４を受け取って、追尾が成功しているか、失敗しているかを判定して出力９７０１～９７０４として出力する。ここでは図示上ＬＳＴＭユニットを複数書いているが、ここでは複数のユニットが存在するのではなく同一のユニットの各時刻の状態を示している。各時刻のＬＳＴＭユニットは次の時刻のＬＳＴＭユニットにリカレント入力９８０２を送る。ＬＳＴＭユニットはその時点の物体の特徴とそれまでの過去の情報を含むリカレント入力９８０２を元に内部状態を必要に応じて変更する。これにより物体のパターンが経時的にどのように変化しているかを踏まえた上で現時点の追尾が成功しているか否かを判断することができる。 A recurrent neural network is a technology that can process time-series data to perform identification, classification, etc., and a representative method is the long short term memory network (hereinafter referred to as LSTM), which is known from the method of Byeon et al. (Byeon et al., Scene labeling with LSTM recurrent neural networks, CVPR 2015). This method can determine changes in the characteristics of objects over time and associate them with each other. A schematic diagram of the configuration of this embodiment and an example of the results is shown in Figure 15. Here, one object 9102 is specified as the tracking target and is tracked using the method of Bertinetto et al. (a false switch occurs in the video frame at t=2). The LSTM units 9501-9504 in FIG. 15(C) receive features 9401-9404 of the object being tracked at each time, determine whether tracking is successful or unsuccessful, and output the results as outputs 9701-9704. Although multiple LSTM units are shown in the figure, this does not indicate the existence of multiple units, but rather the state of the same unit at each time. The LSTM unit at each time sends a recurrent input 9802 to the LSTM unit at the next time. 
The LSTM unit changes its internal state as necessary based on the recurrent input 9802, which includes the features of the object at that time and past information up to that point. This makes it possible to determine whether tracking at the current time is successful or not, taking into account how the object's pattern is changing over time.

ＬＳＴＭユニットへの入力の特徴量は実施形態３で説明したニューラルネットの特徴量などを用いることができる。例えば図１３（Ｂ）の物体の被遮蔽スコアを判定するニューラルネット６３１０の最終層６３２０への入力値を用いる。ここでは前記層の出力値（１値のスカラー）でなく入力値（多次元ベクトル）を用いている。これは遮蔽状態を判断するのに用いたのと同じ多次元特徴を用いることで、遮蔽に関する多種の情報をＬＳＴＭに取り込むためである。これにより様々な遮蔽のパターンを判定できることが期待できる。 The features input to the LSTM unit can be, for example, the features of the neural network described in embodiment 3. For example, the input value to the final layer 6320 of the neural network 6310 that determines the occlusion score of an object in FIG. 13(B) is used. Note that it is the input value of that layer (a multidimensional vector), not its output value (a single scalar), that is used. Using the same multidimensional features that were used to judge the occlusion state lets the LSTM take in many kinds of information about occlusion. This is expected to allow a variety of occlusion patterns to be distinguished.

ＬＳＴＭユニットの学習時には、教師値として各瞬間の追尾が成功しているか失敗しているかを与え、ＬＳＴＭの各重みパラメーターを調整する。また別の形態として図１６（Ｄ）に示すように、追尾の成功・失敗ではなく、教師値として遮蔽状態にあるか否かを与えて学習すれば、被遮蔽状態にあるか否かを判定させることも可能である。 When the LSTM unit is trained, whether tracking at each moment is successful or unsuccessful is given as a teaching value, and each weight parameter of the LSTM is adjusted. As another form, as shown in FIG. 16(D), if the teaching value is given as to whether or not the object is occluded, rather than the success or failure of tracking, it is also possible to train the unit to determine whether or not it is occluded.

また別の形態として、実施形態３で説明した派生の形態と同様に、追尾物体をアンカー枠ベースで検出し、９４０１として図１３（Ｃ）の物体の位置および被遮蔽スコアを同時に判定するニューラルネット６４００の特徴量６４２０を使ってもよい。この形態であれば、物体の追尾や検出と当該物体の被遮蔽スコアを同時・高速に判定することができる。 As another embodiment, similar to the derived embodiment described in the third embodiment, a tracked object may be detected based on an anchor frame, and feature quantity 6420 of neural network 6400 in FIG. 13(C) that simultaneously determines the object's position and occlusion score may be used as 9401. With this embodiment, it is possible to simultaneously and quickly determine the tracking and detection of an object and the occlusion score of the object.

<Embodiment 5>
This embodiment describes a form in which a single object designated by the user is tracked. The basic functional configuration is the same as in Embodiment 1. In this embodiment, relative near-far information between objects is used as the occlusion information of an object.

An example is shown in FIG. 16(A1). Here, an RGB image 801 is prepared as a training image, and a distance image 833 corresponding to the RGB image 801 has been obtained with a laser rangefinder, stereo measurement, or the like. The distance image 833 represents the absolute distance from the camera in grayscale, with whiter pixels indicating closer objects. In this embodiment, the weights of the neural network 402 are learned with the RGB image 801 as the input image and the distance image as the teaching value 831. However, obtaining an output result 830 that exactly matches the absolute-valued distance image 831 is a relatively difficult pattern recognition problem, and the occlusion information used in this embodiment does not require that level of accuracy. This embodiment therefore trains the network to estimate the relative near-far relationship between neighboring objects.

For example, as shown in the output result 830 in the same figure, the distance estimates 8301 and 8302 for persons 8011 and 8012 are incorrect as absolute values. The estimates 8301 and 8303, corresponding to person 8011 and the distant person 8013, also have an incorrect near-far relationship. However, if attention is restricted to the near-far order of the two neighboring persons 8011 and 8012, the result is correct. In this way, the network is trained so that the near-far order between local objects is estimated correctly, and these estimates are aggregated and used as the occlusion information of the objects.

This is achieved by calculating the loss during training as follows. FIG. 16(A2) shows a teaching-value region 831a, an enlargement of the area around the symbol * on the teaching value 831 in FIG. 16(A1), together with the corresponding output-result region 830a. Consider pixels i and j in region 831a; the loss for this pixel pair is determined by whether their near-far relationship is estimated correctly. In region 830a, the near-far relationship between pixels i and j matches the teaching value, so no loss is incurred. If instead the estimate were as in region 830b, the near-far relationship would be wrong and a loss would be counted. This check is made for all pixel pairs within a predetermined distance of each other. Finally, the number of pairs with an incorrect near-far relationship divided by the total number of pairs is taken as the total loss. FIG. 16(B) shows the distance output result 834 estimated by the neural network 802 after training in this way.
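The pair-counting loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the function name, the `radius` parameter (pairs are "nearby" when their row and column offsets are both at most `radius`), and the handling of ties are all hypothetical. Note also that a raw count of misordered pairs has zero gradient almost everywhere, so an actual training procedure would presumably use a differentiable surrogate; the description does not say which.

```python
from itertools import product


def pairwise_order_loss(pred, target, radius=1):
    """Fraction of nearby pixel pairs whose near/far order in `pred`
    disagrees with the order in `target` (the teaching distance image).

    pred, target: 2-D lists of distance values with the same shape.
    radius: assumed neighbourhood size; not specified in the description.
    """
    h, w = len(target), len(target[0])
    wrong = total = 0
    for y, x in product(range(h), range(w)):
        for dy, dx in product(range(-radius, radius + 1), repeat=2):
            yy, xx = y + dy, x + dx
            # Visit each unordered pair once and stay inside the image.
            if (dy, dx) <= (0, 0) or not (0 <= yy < h and 0 <= xx < w):
                continue
            t = target[y][x] - target[yy][xx]
            p = pred[y][x] - pred[yy][xx]
            if t == 0:
                continue  # assumption: ties in the teaching value carry no order
            total += 1
            if t * p <= 0:  # opposite sign, or a predicted tie, counts as wrong
                wrong += 1
    return wrong / total if total else 0.0


# Absolute values are far off, but every local order is preserved: loss is 0.
teacher = [[1.0, 2.0], [3.0, 4.0]]
estimate = [[10.0, 20.0], [30.0, 40.0]]
loss = pairwise_order_loss(estimate, teacher)
```

The example mirrors the point made about output result 830: an estimate can be wrong in absolute terms and still incur no loss, because only the local ordering is scored.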

Next, the relative-distance output result 834 is aggregated to obtain the occlusion likelihood of each object. Here, aggregation is done per detection frame, using person detection frames 8351 and 8352 that have been detected separately. The distance values within each frame are averaged to give $d^{\mathrm{ave}}_1$ and $d^{\mathrm{ave}}_2$. These values are then compared between adjacent object frames and normalized to convert them into an occlusion-likelihood score occ, for example by the following formula.
(Formula 3)
$$\mathrm{occ}_i = \mathrm{Sigmoid}\left(\log\frac{d^{\mathrm{ave}}_i}{d^{\mathrm{ave}}_j}\right) = \frac{1}{1 + d^{\mathrm{ave}}_j / d^{\mathrm{ave}}_i},$$
$$\mathrm{occ}_j = \frac{1}{1 + d^{\mathrm{ave}}_i / d^{\mathrm{ave}}_j},$$
where
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$$
Here, i and j are two adjacent detected-object frames that partly overlap. When three or more objects overlap, the occlusion score $\mathrm{occ}_i$ may be computed with the above formula for each overlapping pair, and the maximum of these values may be taken as the occlusion score of the object.
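A small sketch of the aggregation in Formula 3 follows. The box format (x0, y0, x1, y1, with exclusive upper bounds), the helper names, and the score of 0.0 for a frame that overlaps no other frame are assumptions for illustration; the distance values are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_in_box(depth, box):
    """Average the estimated distances inside one detection frame."""
    x0, y0, x1, y1 = box
    vals = [depth[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def occlusion_score(d_i, d_j):
    """Formula 3: Sigmoid(log(d_i / d_j)), which equals 1 / (1 + d_j / d_i)."""
    return sigmoid(math.log(d_i / d_j))


def boxes_overlap(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def occlusion_scores(depth, boxes):
    """Per-frame score: the maximum of Formula 3 over all overlapping frames."""
    means = [mean_in_box(depth, b) for b in boxes]
    scores = []
    for i, bi in enumerate(boxes):
        cands = [occlusion_score(means[i], means[j])
                 for j, bj in enumerate(boxes)
                 if j != i and boxes_overlap(bi, bj)]
        # Assumption: a frame with no overlapping neighbour gets score 0.0.
        scores.append(max(cands) if cands else 0.0)
    return scores


# A toy 2x4 relative-distance map with two overlapping frames.
depth = [[1.0, 1.0, 3.0, 3.0],
         [1.0, 1.0, 3.0, 3.0]]
scores = occlusion_scores(depth, [(0, 0, 3, 2), (1, 0, 4, 2)])
```

For a single overlapping pair the two scores sum to 1, and the frame with the larger average distance (the one farther from the camera) receives the higher occlusion score.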

The above describes the processing from relative distance estimation to aggregation of the occlusion scores. The tracking process using the occlusion scores is the same as in Embodiment 1 and is omitted here.

The following variations on the training procedure are also conceivable. (1) The loss is aggregated only over pairs whose teaching distance values differ by at least a predetermined threshold Θ. This makes training robust to measurement noise in the distance image. (2) In addition to (1), a margin is set: rather than judging only whether the near-far relationship of a pair is correct, a loss is also incurred when the relationship is correct but the estimated values are not separated by at least the threshold Θ. (3) For pixel pairs whose teaching distance values differ by less than the threshold Θ, an output whose difference is Θ or more is also treated as an error and incurs a loss. This suppresses noise-like output.
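One way to read variations (1) to (3) together is as a per-pair rule, sketched below. The function name and the 0/1 loss values are hypothetical, and combining all three variations in a single function is a choice made here for brevity; the description presents them as separate options.

```python
def pair_loss(t_i, t_j, p_i, p_j, theta):
    """Loss (0 or 1) for one pixel pair under variations (1) to (3).

    t_i, t_j: teaching distance values of the two pixels.
    p_i, p_j: estimated distance values of the two pixels.
    theta:    the threshold (and margin) called Theta in the text.
    """
    t = t_i - t_j
    p = p_i - p_j
    if abs(t) >= theta:
        # Variations (1) and (2): the pair has a reliable order in the
        # teaching data, so require the same order AND a gap of at least theta.
        same_order = t * p > 0
        return 0 if same_order and abs(p) >= theta else 1
    # Variation (3): the teaching values are nearly equal, so an estimated
    # gap of theta or more is treated as a noisy output and penalized.
    return 1 if abs(p) >= theta else 0


# Correct order with enough margin, correct order with too small a margin,
# and a spurious gap where the teaching data shows almost none.
examples = [
    pair_loss(1.0, 3.0, 0.5, 3.0, theta=1.0),
    pair_loss(1.0, 3.0, 2.0, 2.5, theta=1.0),
    pair_loss(1.0, 1.2, 1.0, 3.0, theta=1.0),
]
```

Averaging this value over all nearby pixel pairs would give a total loss analogous to the ratio of misordered pairs used in the basic form.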

As described above, various forms are possible; any form that can learn distance relatively and locally may be used, and the embodiment is not limited to a single form.

The present invention can also be realized by the following processing. Software (a program) that implements the functions of the embodiments described above is supplied to a system or device via a data communication network or various storage media, and a computer (or a CPU, MPU, or the like) of that system or device reads and executes the program. The program may also be provided recorded on a computer-readable recording medium.

REFERENCE SIGNS LIST
1 Information processing device
201 Image acquisition unit
202 Object detection unit
203 Occlusion information generation unit
204 Feature amount acquisition unit
205 Association unit
206 Storage unit

Claims

An information processing device for detecting an object from an image, comprising:
a feature extraction means for extracting image features of an object detected from the image;
an estimation means for estimating, for each object detected from the image, occlusion information indicating an occlusion relationship between the object detected from the image and another object detected from the image, based on an output of a trained model that outputs a likelihood indicating that the object detected from the image is occluded by another object;
and an identification means for identifying, for each object detected from the image, a correspondence relationship with an object detected in an image captured at a different time from the image based on at least the image features and the occlusion information.

The information processing device according to claim 1, characterized in that the estimation means estimates the occlusion information indicating a partial area of an object that is occluded for each object detected from the image based on the output of the trained model.

The information processing device according to claim 1 or 2, characterized in that the identification means identifies a correspondence between the image and an object detected in an image captured at a different time from the image, based on the image features of the object and the occlusion information estimated by the estimation means.

The information processing device according to any one of claims 1 to 3, further comprising a storage means for storing the likelihood in association with the image features of the object.

The information processing device according to any one of claims 1 to 4, characterized in that the occlusion information is information that indicates, for each region of the image, a larger likelihood in the region of the object that is occluded and a smaller likelihood in the other regions.

The information processing device according to any one of claims 1 to 5, characterized in that the identification means identifies, for each object detected from the image, a correspondence relationship with an object detected from an image captured at a different time from the image, based on the position in the image, the image features detected from the image, and the occlusion relationship in the image.

The information processing device according to claim 1, further comprising an acquisition means for acquiring a region of each of the objects from the image,
wherein the estimation means estimates the occlusion information indicating the presence or absence of an occluded object for each of the object regions acquired by the acquisition means.

The information processing device according to claim 1, further comprising:
a determination means for determining whether or not an object is occluded based on a correspondence between each object detected from the image identified by the identification means and an object detected in an image captured at a different time from the image; and
a storage means for storing the object that is occluded.

The information processing device according to claim 8, wherein the determination means determines, as a first object, an object that does not correspond to any object detected in an image captured prior to the image, and
the storage means stores information indicating that the first object was occluded at the time when the image was captured.

the determining means determines, among the objects detected from the image, an object that does not correspond to an object detected in an image captured prior to the image as a second object;
The information processing device according to claim 9, characterized in that the storage means stores information indicating that the first object was not occluded at the time the image was captured when a similarity between the second object detected from the image and the first object determined by the storage means to be occluded in an image captured prior to the image is greater than a predetermined threshold value.

If two objects are detected in a first image and one object is detected in a second image captured after the first image,
The estimation means estimates the occlusion information indicating that an object detected in the second image occludes another object,
The information processing device according to any one of claims 1 to 10, characterized in that the identification means is configured to identify, based on the occlusion information estimated by the estimation means, an object detected in the first image that is occluding other objects, as being the same object as the object detected in the second image.

When two objects are detected from a third image captured after the second image,
the estimation means estimates the occlusion information indicating that an object detected from the third image, which is associated with an image feature of the object detected from the second image, is occluding another object; and
The information processing device according to claim 11, characterized in that the identification means identifies, for an object detected from the first image that is occluded by another object, an object detected from the third image that is different from an object associated with an image feature of an object detected from the second image, as being the same object as the object occluded in the second image.

The information processing device according to any one of claims 1 to 12, characterized in that the trained model is a neural network.

The information processing device according to any one of claims 1 to 13, characterized in that the trained model is a model that has been trained on image features that indicate an occlusion relationship between an occluding object and an occluded object.

A program for causing a computer to function as each of the means possessed by an information processing device according to any one of claims 1 to 14.

An information processing method performed by an information processing device that detects an object from an image, comprising:
a feature extraction step of extracting image features of the object detected from the image;
an estimation step of estimating, for each object detected from the image, occlusion information indicating an occlusion relationship between the object detected from the image and another object detected from the image, based on an output of a trained model that outputs a likelihood indicating that the object detected from the image is occluded by another object;
and an identification step of identifying, for each object detected from the image, a correspondence relationship between the object detected in the image and an object detected in an image captured at a different time from the image based on at least the image features and the occlusion information.