JP2021006956A - Retrieval device, retrieval method and retrieval program

Info

Publication number: JP2021006956A
Application number: JP2019120889A
Authority: JP
Inventors: 永野 秀尚 (Hidenao Nagano); 柏野 邦夫 (Kunio Kashino); 佐藤 真一 (Shinichi Sato)
Original assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Research Organization of Information and Systems
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2021-01-21

Abstract

PROBLEM TO BE SOLVED: To search for a scene with high accuracy. SOLUTION: A feature amount extraction unit 70 extracts, for each of a plurality of query videos, a feature amount relating to an image or an acoustic signal from the query video. A scene feature amount acquisition unit 72 acquires, for each of the plurality of query videos, a scene feature amount of a target scene using the extracted feature amount. A search unit 74 performs a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of stored videos. SELECTED DRAWING: FIG. 4

Description

The disclosed technique relates to a search device, a search method, and a search program.

Instance search is the task of finding, from a large collection of videos (referred to as stored videos), the stored videos that show a query (a specific place or object) given as a query video. Instance search has so far centered on search based on image information: frame images are taken from each of the query video and the stored videos, image features are extracted from those frame images, and the image features are matched against each other to determine whether the query appears in a stored video (see, for example, Non-Patent Document 1).

Y. Peng, J. Zhang, X. Huang, M. Sun, X. He, P. Tang, Y. Zhao, J. Zhao, J. Qi, J. Zhang, “PKU-ICST at TRECVID 2015: Instance Search Task”, in Proceedings of TRECVID 2015, NIST, USA, 2015.

Conventional methods based on image information have the following problem. When searching for stored videos that show, for example, a specific place, an obstacle (such as another person) in the foreground of the image showing the query (here, that place) can occlude the query. Sufficient image information then cannot be extracted, matching fails, and the target stored video may not be found. In addition, even when the query video and a stored video show the same place, the search fails if the two look very different because of differences in camera position or shooting direction.

The disclosed technique has been made in view of the above points, and its object is to provide a search device, a search method, and a search program capable of searching for a scene with high accuracy.

A first aspect of the present disclosure is a search device that searches for a target scene included in a plurality of query videos from a database storing at least stored videos or feature amounts obtained from the stored videos, the search device including: a feature amount extraction unit that extracts, for each of the plurality of query videos, a feature amount relating to an image or an acoustic signal from the query video; a scene feature amount acquisition unit that acquires, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and a search unit that performs a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored videos, wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

A second aspect of the present disclosure is a search method for searching for a target scene included in a plurality of query videos from a database storing at least stored videos or feature amounts obtained from the stored videos, the search method including: extracting, by a feature amount extraction unit, for each of the plurality of query videos, a feature amount relating to an image or an acoustic signal from the query video; acquiring, by a scene feature amount acquisition unit, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and performing, by a search unit, a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored videos, wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

A third aspect of the present disclosure is a search program for searching for a target scene included in a plurality of query videos from a database storing at least stored videos or feature amounts obtained from the stored videos, the search program causing a computer to execute: extracting, for each of the plurality of query videos, a feature amount relating to an image or an acoustic signal from the query video; acquiring, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and performing a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored videos, wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

According to the disclosed technique, a scene can be searched for with high accuracy.

A schematic block diagram of an example of a computer that functions as the learning device and the search device of the first and second embodiments.
A block diagram showing the configuration of the learning device of the first and second embodiments.
A diagram showing an example of the structure of the neural network.
A block diagram showing the configuration of the search device of the first and second embodiments.
A diagram showing an example of the structure of the feature amount extraction model.
A block diagram showing the configuration of the search unit of the search device of the first embodiment.
A flowchart showing the learning processing routine of the learning device of the first and second embodiments.
A flowchart showing the search processing routine of the search device of the first embodiment.
A block diagram showing the configuration of the search unit of the search device of the second embodiment.
A flowchart showing the search processing routine of the search device of the second embodiment.
A diagram showing experimental results.

Hereinafter, an example of an embodiment of the disclosed technique will be described with reference to the drawings. In the drawings, the same or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

<Outline of the present embodiment>
In the present embodiment, instance search also uses the sound information of a video. Image information and sound information are embedded in a single space (referred to as an embedding space), and by using this embedded information, scenes are analyzed and searched on the basis of information about the scene itself that lies behind the sound information and the video information. Specifically, before the search, image information (such as VGG features) and sound information (such as MFCC) are extracted from a large number of videos, and a neural network is trained by supervised learning to output the sound information from the image information. At search time, the image information of the query videos and of the stored videos is input to this neural network, and the output of its hidden layer is used as a scene feature amount for matching and search. This hidden layer corresponds to the embedding space mentioned above: the neural network can be said to map the input video information to the embedding space and then map that result to the sound information.

In video search experiments using this method, scenes were retrieved that could not be retrieved either by a matching method using only the image feature amount of the video (matching with VGG features only) or by a matching method using only the sound feature amount of the video (matching with MFCC features only).

The method of the present embodiment then performs both a search using this scene feature amount and a search using the image feature amount, and takes the result of integrating the search scores obtained with the respective feature amounts as the final search score. In this way, the search lets the information of each modality complement the other.

In instance search, a plurality of query videos may be given for searching for one instance. In that case, one modality may be effective for the search in one query video while a different modality is effective in another. For example, one query video may contain sound that is useful for the search, while another query video is silent so that only its image information is useful. The method of the present embodiment therefore searches with a plurality of feature amounts, including the scene feature amount, obtained from each of the plurality of query videos, integrates the resulting search scores for each feature amount, and then integrates the scores across the feature amounts. This yields a search that uses multiple modalities of multiple query videos.

[First Embodiment]
<Configuration of the learning device according to the first embodiment>
FIG. 1 is a block diagram showing the hardware configuration of the learning device 10 of the present embodiment.

As shown in FIG. 1, the learning device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are communicably connected to one another via a bus 19.

The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls the above components and performs various kinds of arithmetic processing in accordance with the programs stored in the ROM 12 or the storage 14. In the present embodiment, a learning program for training the neural network is stored in the ROM 12 or the storage 14. The learning program may be a single program, or a group of programs made up of a plurality of programs or modules.

The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is composed of an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.

The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used to make various inputs.

The display unit 16 is, for example, a liquid crystal display, and displays various kinds of information. The display unit 16 may be a touch panel and also function as the input unit 15.

The communication interface 17 is an interface for communicating with other devices and uses a standard such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).

Next, the functional configuration of the learning device 10 will be described. FIG. 2 is a block diagram showing an example of the functional configuration of the learning device 10.

Functionally, as shown in FIG. 2, the learning device 10 includes the input unit 15, a calculation unit 30, and an output unit 32.

The input unit 15 receives input of learning videos.

The calculation unit 30 can be represented as a configuration including a learning feature amount extraction unit 20 and a model learning unit 22.

The learning feature amount extraction unit 20 extracts, for each frame of the learning video received by the input unit 15, a pair consisting of the image feature amount of the frame image and the feature amount of the acoustic signal.

For example, VGG features are extracted as the image feature amount, and MFCC features are extracted as the feature amount of the acoustic signal.

The model learning unit 22 learns a model, such as the one shown in FIG. 3, that takes an image feature amount as input and performs a process of acquiring the feature amount of the corresponding acoustic signal.

For example, it trains a neural network that takes VGG features as input and outputs the corresponding MFCC features. This neural network consists of three or more layers, including an input layer, an intermediate layer, and an output layer.

In the present embodiment, for the pairs of image feature amount and acoustic signal feature amount extracted by the learning feature amount extraction unit 20, the parameters of the neural network are learned so that, when the image feature amount is input to the neural network, the paired acoustic signal feature amount is output.

The model learning unit 22 outputs, via the output unit 32, a feature amount extraction model, that is, the part of the trained neural network that runs from the input layer to the intermediate layer and produces the output of the intermediate layer.
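
As an illustration only, the learning described above can be sketched in dependency-free Python. The layer sizes, the tanh activation, the squared-error loss, the plain SGD update, and every function name are assumptions made for this sketch. The specification only requires a network with an input layer, an intermediate layer, and an output layer, trained so that an image feature amount yields the paired acoustic signal feature amount. The toy vectors merely stand in for VGG and MFCC features.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights (n_out rows of n_in values) plus zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(weights, biases, x):
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def hidden_output(model, image_feat):
    """Input layer -> intermediate layer only: the 'feature amount extraction model'."""
    return [math.tanh(a) for a in affine(model["w1"], model["b1"], image_feat)]


def predict_audio(model, image_feat):
    """Full network: image feature -> intermediate layer -> predicted audio feature."""
    return affine(model["w2"], model["b2"], hidden_output(model, image_feat))


def train(pairs, n_hidden=4, epochs=500, lr=0.05, seed=0):
    """Fit the network on (image_feature, audio_feature) pairs by SGD on squared error."""
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w1, b1 = init_layer(n_in, n_hidden, rng)
    w2, b2 = init_layer(n_hidden, n_out, rng)
    model = {"w1": w1, "b1": b1, "w2": w2, "b2": b2}
    for _ in range(epochs):
        for x, y in pairs:
            h = hidden_output(model, x)
            out = affine(w2, b2, h)
            err = [o - t for o, t in zip(out, y)]  # gradient of 0.5 * squared error
            # Back-propagate to the intermediate layer before w2 is updated.
            grad_h = [sum(err[k] * w2[k][j] for k in range(n_out)) * (1 - h[j] ** 2)
                      for j in range(n_hidden)]
            for k in range(n_out):
                for j in range(n_hidden):
                    w2[k][j] -= lr * err[k] * h[j]
                b2[k] -= lr * err[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h[j] * x[i]
                b1[j] -= lr * grad_h[j]
    return model


def mse(model, pairs):
    """Squared error of the full network, averaged over the pairs."""
    total = sum((o - t) ** 2 for x, y in pairs for o, t in zip(predict_audio(model, x), y))
    return total / len(pairs)


# Toy stand-ins for (VGG feature, MFCC feature) pairs; the values are made up.
toy_pairs = [([1.0, 0.0, 0.0], [0.5, -0.5]),
             ([0.0, 1.0, 0.0], [-0.5, 0.5]),
             ([0.0, 0.0, 1.0], [0.2, 0.2])]
untrained = train(toy_pairs, epochs=0)   # same initial weights, no updates
trained = train(toy_pairs)
scene_feature = hidden_output(trained, toy_pairs[0][0])
```

In this sketch only `hidden_output` would be handed to the search side; the output layer is needed for training but is discarded afterwards, which mirrors the truncation of the trained network into the feature amount extraction model.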

<Configuration of the search device according to the first embodiment>
FIG. 1 above is a block diagram showing the hardware configuration of the search device 50 of the present embodiment.

As shown in FIG. 1 above, the search device 50, like the learning device 10, has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. In the present embodiment, a search program for searching videos is stored in the ROM 12 or the storage 14.

Next, the functional configuration of the search device 50 will be described. FIG. 4 is a block diagram showing an example of the functional configuration of the search device 50.

Functionally, as shown in FIG. 4, the search device 50 includes the input unit 15, a calculation unit 54, and an output unit 56.

The input unit 15 receives input of a plurality of query videos. Each query video includes a target scene (for example, a specific object or place) represented by an image component.

The calculation unit 54 can be represented as a configuration including a stored video database 60, a feature amount extraction unit 62, a feature amount extraction model storage unit 64, a scene feature amount acquisition unit 66, a scene feature amount database 68, a feature amount extraction unit 70, a scene feature amount acquisition unit 72, and a search unit 74.

The stored video database 60 stores a plurality of stored videos.

The feature amount extraction unit 62 extracts an image feature amount for each frame of each of the plurality of stored videos stored in the stored video database 60.

The feature amount extraction model storage unit 64 stores the feature amount extraction model output by the learning device 10.

For each of the plurality of stored videos, the scene feature amount acquisition unit 66 takes as input the average of the image feature amounts extracted frame by frame by the feature amount extraction unit 62, and acquires, as the scene feature amount, the intermediate output value of the model that performs the process of acquiring the feature amount of the corresponding acoustic signal. Specifically, the image feature amount is input to the input layer of the feature amount extraction model, and the output of the intermediate layer is acquired as the scene feature amount (see FIG. 5).
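
The acquisition of a scene feature amount from one video can be sketched as follows. This is a minimal illustration, assuming that the feature amount extraction model is a single affine layer followed by tanh; the dictionary layout, the function names, and the toy numbers are hypothetical and are not taken from the specification.

```python
import math


def mean_vector(frame_features):
    """Element-wise average of the per-frame image feature vectors of one video."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]


def scene_feature(extraction_model, frame_features):
    """Feed the averaged image feature to the input layer; return the intermediate-layer output."""
    x = mean_vector(frame_features)
    weights, biases = extraction_model["weights"], extraction_model["biases"]
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical 2-unit intermediate layer over 3-dimensional image features.
toy_model = {"weights": [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], "biases": [0.0, 0.0]}
toy_frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # image features of two frames

averaged = mean_vector(toy_frames)               # [2.0, 2.0, 2.0]
scene = scene_feature(toy_model, toy_frames)     # [tanh(0.0), tanh(3.0)]
```

Averaging before the forward pass gives one scene feature amount per video, so each stored video needs only a single vector in the database for this variant.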

The scene feature amount database 68 stores the acquired image feature amount and scene feature amount for each of the plurality of stored videos.

The feature amount extraction unit 70 extracts an image feature amount for each frame of each of the plurality of query videos.

For each of the plurality of query videos, the scene feature amount acquisition unit 72 inputs the average of the image feature amounts extracted frame by frame by the feature amount extraction unit 70 to the feature amount extraction model and acquires the scene feature amount. Specifically, the image feature amount is input to the input layer of the feature amount extraction model, and the output of the intermediate layer is acquired as the scene feature amount.

The search unit 74 refers to the scene feature amount database 68 and, for each of the plurality of stored videos, calculates a search score using the image feature amount and scene feature amount of each of the plurality of query videos and the feature amount and scene feature amount of that stored video. It then outputs the N stored videos with the highest search scores as the search result via the output unit 56.

Specifically, as shown in FIG. 6, the search unit 74 includes a collation unit 80, an intra-modal score integration unit 82, and an inter-modal score integration unit 84.

For each stored video and for each of the plurality of query videos, the collation unit 80 calculates a feature amount search score by matching the image feature amount of the query video against the image feature amount of the stored video. For example, the cosine similarity between the image feature amounts, or a Z-score obtained from the cosine similarity, is calculated as the feature amount search score. Likewise, for each stored video and for each of the plurality of query videos, the collation unit 80 calculates a scene feature amount search score by matching the scene feature amount of the query video against the scene feature amount of the stored video. For example, the cosine similarity between the scene feature amounts, or a Z-score obtained from the cosine similarity, is calculated as the scene feature amount search score.
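
The two example scores can be sketched as below. The text does not state over which population the Z-score is computed; this sketch assumes that, for one query video, the cosine similarities to all stored videos are standardized with their own mean and population standard deviation. That choice, like the function names, is an assumption made for illustration.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def z_scores(values):
    """Standardize a list of scores; returns zeros when all scores are identical."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def match_scores(query_feat, stored_feats, use_z=False):
    """Score one query feature against the feature of every stored video."""
    sims = [cosine_similarity(query_feat, s) for s in stored_feats]
    return z_scores(sims) if use_z else sims


# Toy feature vectors for three stored videos and one query video.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
sims = match_scores(query, stored)
zs = match_scores(query, stored, use_z=True)
```

The same functions apply unchanged to image feature amounts and to scene feature amounts, since both are plain vectors. Standardizing puts the scores of the two modalities on a comparable scale before they are integrated.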

For each stored video, the intra-modal score integration unit 82 integrates the feature amount search scores calculated for the respective query videos, and also integrates the scene feature amount search scores calculated for the respective query videos. For example, the average, the maximum, or a weighted sum of the feature amount search scores calculated for the respective query videos is used as the integrated feature amount search score. Similarly, the average, the maximum, or a weighted sum of the scene feature amount search scores calculated for the respective query videos is used as the integrated scene feature amount search score.
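
A minimal sketch of this per-modality integration follows. The three options (average, maximum, weighted sum) come from the text; the function names, the `method` strings, and the score layout `per_query_scores[q][v]` are hypothetical.

```python
def integrate(scores, method="mean", weights=None):
    """Combine the scores of several query videos for one stored video."""
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "max":
        return max(scores)
    if method == "weighted_sum":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted_sum needs one weight per query video")
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown method: {method}")


def integrate_within_modality(per_query_scores, method="mean", weights=None):
    """per_query_scores[q][v] is the score of stored video v for query video q.

    Returns one integrated score per stored video for this modality.
    """
    return [integrate(list(column), method, weights)
            for column in zip(*per_query_scores)]


# Two query videos scored against two stored videos (toy numbers).
image_scores = [[0.9, 0.2],
                [0.5, 0.4]]
by_mean = integrate_within_modality(image_scores, "mean")
by_max = integrate_within_modality(image_scores, "max")
by_weight = integrate_within_modality(image_scores, "weighted_sum", weights=[1.0, 0.0])
```

The function would be called once with the feature amount search scores and once with the scene feature amount search scores. Taking the maximum lets a single informative query video dominate, which suits the case where the other query videos are silent or occluded.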

For each stored video, the inter-modal score integration unit 84 integrates the feature amount search score and the scene feature amount search score to obtain the search score, and outputs the N stored videos with the highest search scores as the search result via the output unit 56. For example, the average, the maximum, or a weighted sum of the feature amount search score and the scene feature amount search score is used as the search score.
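
The final fusion and ranking can be sketched as follows. The function names, the default weights, and the toy scores are assumptions for illustration; the text specifies only that the two per-modality scores are combined by an average, a maximum, or a weighted sum and that the top N stored videos are returned.

```python
def fuse_modalities(image_scores, scene_scores, method="mean", w_image=0.5, w_scene=0.5):
    """Combine the per-video image score and scene score into one search score."""
    fused = []
    for img, scn in zip(image_scores, scene_scores):
        if method == "mean":
            fused.append((img + scn) / 2.0)
        elif method == "max":
            fused.append(max(img, scn))
        elif method == "weighted_sum":
            fused.append(w_image * img + w_scene * scn)
        else:
            raise ValueError(f"unknown method: {method}")
    return fused


def top_n(scores, n):
    """Indices of the n stored videos with the highest search scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Integrated per-modality scores for three stored videos (toy numbers).
image_scores = [0.7, 0.3, 0.1]
scene_scores = [0.2, 0.9, 0.1]

ranking_mean = top_n(fuse_modalities(image_scores, scene_scores, "mean"), 2)
ranking_image_heavy = top_n(
    fuse_modalities(image_scores, scene_scores, "weighted_sum", w_image=0.9, w_scene=0.1), 2)
```

With equal treatment of both modalities, stored video 1 ranks first because of its high scene score; weighting the image modality heavily moves stored video 0 to the top.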

The above description uses, as an example, the case where the average of the image feature amounts extracted for the individual frames is input to the feature amount extraction model to acquire the scene feature amount, but the technique is not limited to this. The image feature amount extracted for each frame may instead be input to the feature amount extraction model to acquire a scene feature amount for each frame. In this case, the scene feature amount acquisition unit 72 inputs, for each frame of a query video, the image feature amount extracted by the feature amount extraction unit 70 to the feature amount extraction model and acquires the scene feature amount. The search unit 74 then refers to the scene feature amount database 68 and, for each of the plurality of stored videos, matches the per-frame image feature amounts and scene feature amounts of the stored video against the per-frame image feature amounts and scene feature amounts of the query video as time series to calculate the search score, and outputs the N stored videos with the highest search scores as the search result via the output unit 56.
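
The text does not say how the per-frame time series are matched. One simple possibility, shown here purely as an assumption, is to slide the query's frame sequence along the stored video's frame sequence, average the frame-to-frame cosine similarity at each offset, and keep the best offset. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def sequence_score(query_frames, stored_frames):
    """Best average frame-to-frame similarity over all alignments of the query
    sequence inside the stored sequence (assumed matching rule, not from the text)."""
    q_len, s_len = len(query_frames), len(stored_frames)
    if q_len == 0 or s_len < q_len:
        return 0.0
    best = -1.0
    for offset in range(s_len - q_len + 1):
        total = sum(cosine_similarity(qf, stored_frames[offset + i])
                    for i, qf in enumerate(query_frames))
        best = max(best, total / q_len)
    return best


# Per-frame feature vectors (image or scene feature amounts); toy numbers.
query_frames = [[1.0, 0.0], [0.0, 1.0]]
stored_frames = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = sequence_score(query_frames, stored_frames)  # query matches at offset 1
```

Other alignment schemes, such as dynamic time warping, would fit the same description; the sliding window is used here only because it is short and easy to verify.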

<Operation of the learning device according to the first embodiment>
Next, the operation of the learning device 10 will be described. FIG. 7 is a flowchart showing the flow of the learning process performed by the learning device 10. The learning process is performed by the CPU 11 reading the learning program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. A plurality of learning videos are input to the learning device 10.

In step S100, the CPU 11, acting as the learning feature amount extraction unit 20, extracts from each learning video, for each frame, a pair consisting of the image feature amount of the frame image and the feature amount of the acoustic signal.

In step S102, the CPU 11, acting as the model learning unit 22, learns the parameters of the neural network so that, for each pair of image feature amount and acoustic signal feature amount extracted in step S100, the paired acoustic signal feature amount is output when the image feature amount is input to the neural network. The CPU 11 then outputs, via the output unit 32, a feature amount extraction model, that is, the part of the trained neural network that runs from the input layer to the intermediate layer and produces the output of the intermediate layer.

<Operation of the search device according to the first embodiment>
Next, the operation of the search device 50 will be described.

First, when the feature amount extraction model is input to the search device 50, it is stored in the feature amount extraction model storage unit 64.

When a plurality of stored videos are input to the search device 50, they are stored in the stored video database 60. The feature amount extraction unit 62 then extracts an image feature amount for each frame of each of the plurality of stored videos stored in the stored video database 60, and the scene feature amount acquisition unit 66 inputs, for each of the plurality of stored videos, the average of the image feature amounts extracted frame by frame by the feature amount extraction unit 62 to the feature amount extraction model and acquires the scene feature amount. The acquired scene feature amount is stored in the scene feature amount database 68 for each of the plurality of stored videos.

図８は、検索装置５０による検索処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から検索プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、検索処理が行なわれる。また、検索装置５０に、対象シーンを含む複数のクエリ映像が入力される。 FIG. 8 is a flowchart showing the flow of the search process by the search device 50. The search process is performed by the CPU 11 reading the search program from the ROM 12 or the storage 14, expanding it into the RAM 13 and executing the search program. Further, a plurality of query videos including the target scene are input to the search device 50.

ステップＳ１１０で、ＣＰＵ１１は、特徴量抽出部７０として、複数のクエリ映像の各々について、フレーム毎に、画像特徴量を抽出する。 In step S110, the CPU 11 extracts the image feature amount for each frame of each of the plurality of query videos as the feature amount extraction unit 70.

In step S112, the CPU 11, as the scene feature amount acquisition unit 72, inputs, for each of the plurality of query videos, the average of the image feature amounts extracted for each frame by the feature amount extraction unit 70 into the feature amount extraction model to acquire a scene feature amount.
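As a minimal sketch of step S112, the per-frame image feature amounts of one video can be averaged and passed through the input-to-intermediate part of the learned network. The weights `w1`, the biases `b1`, and the tanh activation are assumptions made for illustration; the description does not fix the layer type or activation.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def scene_feature(frame_feats, w1, b1):
    """Average the per-frame features, then take the intermediate-layer output.

    w1 and b1 stand for the learned input-to-intermediate weights and biases
    (hypothetical names).
    """
    avg = mean_vector(frame_feats)
    return [math.tanh(sum(w * a for w, a in zip(row, avg)) + b)
            for row, b in zip(w1, b1)]
```

The same function would apply to stored videos when the scene feature amount database 68 is built, since both sides use the average of per-frame features as input.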

In step S114, the CPU 11, as the collation unit 80, calculates, for each stored video and for each of the plurality of query videos, a feature amount search score by collating the image feature amount of the query video with the image feature amount of the stored video.

The CPU 11, as the collation unit 80, also calculates, for each stored video and for each of the plurality of query videos, a scene feature amount search score by collating the scene feature amount of the query video with the scene feature amount of the stored video.
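The collation in step S114 can be sketched as follows, assuming cosine similarity as the collation measure (cosine similarity appears in the experimental example, but this step does not itself prescribe a measure). The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_matrix(query_feats, stored_feats):
    """scores[s][q]: score of stored video s against query video q."""
    return [[cosine(q, s) for q in query_feats] for s in stored_feats]
```

The same routine can be called twice, once with image feature amounts and once with scene feature amounts, to obtain the two kinds of search score.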

In step S116, the CPU 11, as the in-modal score integration unit 82, integrates, for each stored video, the feature amount search scores calculated for the respective query videos, and likewise integrates the scene feature amount search scores calculated for the respective query videos.
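One way to picture the in-modal integration of step S116 is to reduce each stored video's per-query scores to a single value. The sketch below assumes the mean and the maximum as integration rules, in line with the variants named in the experimental example; the function name and the `how` parameter are hypothetical.

```python
def integrate_over_queries(scores, how="mean"):
    """Reduce scores[s][q] to one score per stored video s."""
    if how == "mean":
        return [sum(row) / len(row) for row in scores]
    if how == "max":
        return [max(row) for row in scores]
    raise ValueError(f"unknown integration rule: {how}")
```

Taking the maximum favors a stored video that matches any one query strongly, while the mean favors stored videos that match all queries moderately well.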

In step S118, the CPU 11, as the inter-modal score integration unit 84, integrates the feature amount search score and the scene feature amount search score for each stored video to obtain a search score.
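The inter-modal integration of step S118 can be sketched as a weighted sum, which is the rule reported in the experimental example. The weight `w` and the function name are assumptions for illustration; no weight value is given in this description.

```python
def fuse(feature_scores, scene_scores, w=0.5):
    """Weighted sum of the two per-stored-video score lists.

    w weights the feature amount search score and (1 - w) the scene feature
    amount search score; the default of 0.5 is an arbitrary placeholder.
    """
    return [w * f + (1.0 - w) * s for f, s in zip(feature_scores, scene_scores)]
```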

In step S120, the CPU 11 creates a search result representing the N stored videos with the highest search scores, outputs it by the output unit 56, and ends the search processing routine.
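Selecting the top N stored videos in step S120 amounts to sorting by search score. A minimal sketch, assuming the scores are held in a dictionary keyed by a stored-video identifier (a hypothetical representation):

```python
def top_n(search_scores, n):
    """Return the identifiers of the n stored videos with the highest scores."""
    ranked = sorted(search_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [video_id for video_id, _ in ranked[:n]]
```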

[Second Embodiment]
Next, the learning device and the search device according to the second embodiment will be described. Since the learning device according to the second embodiment has the same configuration as that of the first embodiment, the same reference numerals are used and its description is omitted. In the search device according to the second embodiment, parts having the same configuration as in the first embodiment are given the same reference numerals and their description is omitted.

<Configuration of the search device of the second embodiment>
As shown in FIG. 1, the hardware configuration of the search device 250 of the present embodiment is the same as that of the search device 50 of the first embodiment.

Next, the functional configuration of the search device 250 will be described.

As shown in FIG. 4, the calculation unit 54 of the search device 250 can be represented by a configuration including the stored video database 60, the feature amount extraction unit 62, the feature amount extraction model storage unit 64, the scene feature amount acquisition unit 66, the scene feature amount database 68, the feature amount extraction unit 70, the scene feature amount acquisition unit 72, and the search unit 274.

As shown in FIG. 9, the search unit 274 includes an average feature amount calculation unit 280, a collation unit 282, and an inter-modal score integration unit 284.

The average feature amount calculation unit 280 calculates, as an integrated feature amount that integrates the image feature amounts of the plurality of query videos, a feature amount obtained by averaging those image feature amounts. For example, the average vector of the image feature amounts of the plurality of query videos is used as the integrated feature amount.

The average feature amount calculation unit 280 also calculates, as an integrated scene feature amount that integrates the scene feature amounts of the plurality of query videos, a feature amount obtained by averaging those scene feature amounts. For example, the average vector of the scene feature amounts of the plurality of query videos is used as the integrated scene feature amount.

The collation unit 282 calculates, for each stored video, a feature amount search score by collating the integrated feature amount with the image feature amount of the stored video. The collation unit 282 also calculates, for each stored video, a scene feature amount search score by collating the integrated scene feature amount with the scene feature amount of the stored video.
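The difference from the first embodiment is that the query features are averaged first and collated once, instead of being collated per query and integrated afterwards. A minimal sketch, assuming cosine similarity as the collation measure and hypothetical function names:

```python
import math

def mean_vector(vectors):
    """Average vector of the query videos' features (the integrated feature)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def collate_integrated(query_feats, stored_feats):
    """One score per stored video, computed against the averaged query feature."""
    integrated = mean_vector(query_feats)
    return [cosine(integrated, s) for s in stored_feats]
```

With Q query videos and S stored videos, this computes S similarities rather than Q times S, at the cost of collapsing the queries into a single vector before collation.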

Like the inter-modal score integration unit 84, the inter-modal score integration unit 284 integrates the feature amount search score and the scene feature amount search score for each stored video to obtain a search score, and outputs the N stored videos with the highest search scores as the search result by the output unit 56.

<Operation of the learning device according to the second embodiment>
Since the operation of the learning device 10 according to the second embodiment is the same as that of the first embodiment, its description is omitted.

<Operation of the search device according to the second embodiment>
Next, the operation of the search device 250 will be described.

FIG. 10 is a flowchart showing the flow of the search process by the search device 250. The search process is performed by the CPU 11 reading the search program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. A plurality of query videos including the target scene are input to the search device 250.

In step S112, the CPU 11, as the scene feature amount acquisition unit 72, inputs, for each of the plurality of query videos, the average of the image feature amounts extracted for each frame by the feature amount extraction unit 70 into the feature amount extraction model to acquire a scene feature amount.

In step S200, the CPU 11, as the average feature amount calculation unit 280, averages the image feature amounts of the plurality of query videos to calculate the integrated feature amount. The CPU 11, as the average feature amount calculation unit 280, also averages the scene feature amounts of the plurality of query videos to calculate the integrated scene feature amount.

In step S202, the CPU 11, as the collation unit 282, calculates, for each stored video, a feature amount search score by collating the integrated feature amount with the image feature amount of the stored video. The CPU 11, as the collation unit 282, also calculates, for each stored video, a scene feature amount search score by collating the integrated scene feature amount with the scene feature amount of the stored video.

In step S204, the CPU 11, as the inter-modal score integration unit 284, integrates the feature amount search score and the scene feature amount search score for each stored video to obtain a search score, in the same manner as the inter-modal score integration unit 84.

In step S120, the CPU 11 creates a search result representing the N stored videos with the highest search scores, outputs it by the output unit 56, and ends the search processing routine.

<Experimental example>
The results of an experiment conducted to confirm the effectiveness of the method described in the above embodiments will now be described. Here, VGG features were used as the image feature amount, and the sound features of the 8-layer model of soundnet [Non-Patent Document 1], the sound features of the fifth-layer model of soundnet [Reference 1], and mfcc features were used as the feature amount of the acoustic signal. A neural network was learned that converts VGG features into each of the sound features of the soundnet 8-layer model, the sound features of the soundnet fifth-layer model, and the mfcc features.

FIG. 11 shows the results of an experiment in which videos including a specific place were searched for from 100,000 stored videos. In this experiment, a plurality of query videos are given to search for a place. The horizontal axis of the graph is the search accuracy (MAP value).

vgg represents a comparative example in which the search uses only VGG features. im2s_soundnet8, im2s_soundnet5, im2s_vggish, and im2s_mfcc represent scene feature amounts obtained by converting VGG features into sound features; im2s_soundnet8, im2s_soundnet5, and im2s_mfcc correspond, respectively, to conversion of VGG features into the sound features of the soundnet 8-layer model, the sound features of the soundnet fifth-layer model, and mfcc features. im2s_soundnet8 also represents a comparative example in which the search uses only the im2s_soundnet8 scene feature amount.

vgg and im2s_mfcc, vgg and im2s_vggish, vgg and im2s_soundnet5, and vgg and im2s_soundnet8 each integrate the search result based on VGG features with the search result based on the scene feature amount of im2s_mfcc, im2s_vggish, im2s_soundnet5, and im2s_soundnet8, respectively. The search score based on VGG features and the search score based on the scene feature amount were integrated by calculating a weighted sum of the scores.

mean zscore sets the search score for each feature amount to the average of the search scores (Z scores calculated from cosine similarity) obtained with the plurality of query videos. mean feature uses, as each feature amount, the average of the feature amounts from the plurality of query videos. max cosine similarity sets the search score for each feature amount to the maximum of the search scores (cosine similarity) obtained with the plurality of query videos. max zscore sets the search score for each feature amount to the maximum of the search scores (Z scores calculated from cosine similarity) obtained with the plurality of query videos.

The experimental results show that integrating the search results based on VGG features and scene feature amounts improves the search accuracy compared with searching only with VGG features (vgg in FIG. 11) or only with the scene feature amount (im2s_soundnet8 in FIG. 11).
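The score-based variants named in the experiment (mean zscore, max zscore, max cosine similarity) can be sketched as below. The sketch assumes that the Z score of a query is computed by standardizing that query's cosine similarities over all stored videos using the population standard deviation; the description only says the Z score is calculated from the cosine similarity, so this normalization and all function names are assumptions.

```python
import math

def zscores(values):
    """Standardize a list of values (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def per_query_zscores(cos):
    """cos[s][q] -> z[s][q], standardizing each query's scores across stored videos."""
    columns = [zscores(list(col)) for col in zip(*cos)]
    return [list(row) for row in zip(*columns)]

def mean_zscore(cos):
    """Average Z score over the query videos, per stored video."""
    return [sum(row) / len(row) for row in per_query_zscores(cos)]

def max_zscore(cos):
    """Maximum Z score over the query videos, per stored video."""
    return [max(row) for row in per_query_zscores(cos)]

def max_cosine_similarity(cos):
    """Maximum raw cosine similarity over the query videos, per stored video."""
    return [max(row) for row in cos]
```

Standardizing per query puts queries whose raw similarities sit on different scales onto a common footing before they are averaged or compared.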

[Reference 1] Y. Aytar, et al., “SoundNet: Learning Sound Representations from Unlabeled Video”, NIPS 2016.
[Reference 2] S. Hershey, et al., “CNN Architectures for Large-Scale Audio Classification”, ICASSP 2017.

As described above, the search device according to the present embodiment extracts an image feature amount for each of the plurality of query videos, acquires as the scene feature amount the intermediate output value of the process that acquires the feature amount of the corresponding acoustic signal from the extracted image feature amount, and performs a search using the image feature amount and scene feature amount of each of the plurality of query videos and the image feature amount and scene feature amount of the stored videos. As a result, a stored video including the scene of the query videos can be retrieved with high accuracy even when there is an occluding object or when shooting conditions such as the shooting position and shooting direction differ.

<Modification example>
The present invention is not limited to the above-described embodiments, and various modifications and applications are possible without departing from the gist of the present invention.

For example, the feature amount of the acoustic signal may be extracted from the video, and the scene feature amount may be acquired from the feature amount of the acoustic signal using the feature amount extraction model. In this case, the model learning unit 22 learns a neural network that takes the feature amount of the acoustic signal as input and converts it into the corresponding image feature amount, and uses as the feature amount extraction model the model that obtains the output of the intermediate layer by the processing from the input layer to the intermediate layer of the learned neural network. The feature amount extraction unit 62 extracts the feature amount of the acoustic signal for each frame of each of the plurality of stored videos stored in the stored video database 60. The scene feature amount acquisition unit 66 inputs, for each of the plurality of stored videos, the feature amount of the acoustic signal extracted by the feature amount extraction unit 62 into the feature amount extraction model to acquire the scene feature amount. The feature amount extraction unit 70 extracts the feature amount of the acoustic signal for each frame of a query video that includes a specific target represented by acoustic components. The scene feature amount acquisition unit 72 inputs, for the query video, the feature amount of the acoustic signal extracted by the feature amount extraction unit 70 to the input layer of the feature amount extraction model and acquires the output of the intermediate layer as the scene feature amount. The search unit 74 calculates, for each of the plurality of stored videos, a search score using the feature amount of the acoustic signal and the scene feature amount of each of the plurality of query videos and the feature amount of the acoustic signal and the scene feature amount of the stored video, and outputs the N stored videos with the highest search scores as the search result by the output unit 56.

In this way, the feature amount of the acoustic signal is extracted from each of the plurality of query videos, the intermediate output value of the process that acquires the feature amount of the corresponding image from the extracted feature amount of the acoustic signal is acquired as the scene feature amount, and a search is performed using the feature amount of the acoustic signal and the scene feature amount of each of the plurality of query videos and the feature amount of the acoustic signal and the scene feature amount of the stored videos. A stored video including the scene of the plurality of query videos can thereby be retrieved with high accuracy.

The search unit 74 may also calculate, for each of the plurality of stored videos, a search score using the image feature amount, the feature amount of the acoustic signal, and the scene feature amount of each of the plurality of query videos and the image feature amount, the feature amount of the acoustic signal, and the scene feature amount of the stored video, and output the N stored videos with the highest search scores as the search result by the output unit 56. In this case, as in the first embodiment, an image feature amount search score is calculated for each of the plurality of query videos by collating the image feature amount of the query video with the image feature amount of the stored video, and the calculated image feature amount search scores are integrated. Likewise, a sound feature amount search score is calculated for each of the plurality of query videos by collating the feature amount of the acoustic signal of the query video with the feature amount of the acoustic signal of the stored video, and the calculated sound feature amount search scores are integrated. A scene feature amount search score is also calculated for each of the plurality of query videos by collating the scene feature amount of the query video with the scene feature amount of the stored video, and the calculated scene feature amount search scores are integrated. The search may then be performed using the integrated image feature amount search score, sound feature amount search score, and scene feature amount search score.
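How the three integrated scores are combined is not specified here. As one possibility, the weighted sum used in the experimental example can be extended to any number of modalities; the sketch below makes that assumption, and the modality keys, weights, and function name are hypothetical.

```python
def fuse_modalities(scores_by_modality, weights):
    """Normalized weighted sum over modalities, per stored video.

    scores_by_modality maps a modality name to its list of integrated scores
    (one per stored video); weights maps the same names to non-negative weights.
    """
    n = len(next(iter(scores_by_modality.values())))
    total = sum(weights[m] for m in scores_by_modality)
    return [sum(weights[m] * scores_by_modality[m][i] for m in scores_by_modality) / total
            for i in range(n)]
```

For example, `fuse_modalities({"image": img, "sound": snd, "scene": scn}, {"image": 2.0, "sound": 1.0, "scene": 1.0})` would give the image score twice the influence of each of the other two.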

Alternatively, as in the second embodiment, the search may be performed using an integrated image feature amount that integrates the image feature amounts of the plurality of query videos, an integrated sound feature amount that integrates the feature amounts of the acoustic signals of the plurality of query videos, and an integrated scene feature amount that integrates the scene feature amounts of the plurality of query videos, together with the image feature amount, the feature amount of the acoustic signal, and the scene feature amount of the stored video.

The learning device and the search device may be configured as a single device. Although the specification of the present application describes embodiments in which the program is installed in advance, the program may also be provided stored in a computer-readable recording medium.

The various processes that the CPU executes by reading software (a program) in each of the above embodiments may be executed by various processors other than the CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit). The learning process and the search process may each be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.

In each of the above embodiments, the learning program and the search program are described as being stored (installed) in the storage 14 in advance, but the present invention is not limited to this. The programs may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The programs may also be downloaded from an external device via a network.

Regarding the above embodiments, the following appendices are further disclosed.

(Appendix 1)
A search device that searches a database, in which at least stored videos or feature amounts obtained from the stored videos are stored, for a target scene included in a plurality of query videos, the search device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
extract, for each of the plurality of query videos, a feature amount related to an image or an acoustic signal from the query video;
acquire, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and
perform a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored video, and
wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

(Appendix 2)
A non-transitory storage medium storing a program executable by a computer to execute a search process of searching a database, in which at least stored videos or feature amounts obtained from the stored videos are stored, for a target scene included in a plurality of query videos,
wherein the search process comprises:
extracting, for each of the plurality of query videos, a feature amount related to an image or an acoustic signal from the query video;
acquiring, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and
performing a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored video, and
wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

10 Learning device
15 Input unit
20 Learning feature amount extraction unit
22 Model learning unit
30, 54 Calculation unit
32, 56 Output unit
50, 250 Search device
60 Stored video database
62, 70 Feature amount extraction unit
64 Feature amount extraction model storage unit
66, 72 Scene feature amount acquisition unit
68 Scene feature amount database
74, 274 Search unit
80, 282 Collation unit
82 In-modal score integration unit
84, 284 Inter-modal score integration unit
280 Average feature amount calculation unit

Claims

A search device that searches a database, in which at least stored videos or feature amounts obtained from the stored videos are stored, for a target scene included in a plurality of query videos, the search device comprising:
a feature amount extraction unit that extracts, for each of the plurality of query videos, a feature amount related to an image or an acoustic signal from the query video;
a scene feature amount acquisition unit that acquires, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and
a search unit that performs a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored video,
wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

The search device according to claim 1, wherein the search unit:
calculates, for each of the plurality of query videos, a feature amount search score by collating the feature amount of the query video with the feature amount of the stored video, and integrates the calculated feature amount search scores;
calculates, for each of the plurality of query videos, a scene feature amount search score by collating the scene feature amount of the query video with the scene feature amount of the stored video, and integrates the calculated scene feature amount search scores; and
performs the search using the integrated feature amount search score and the integrated scene feature amount search score.

The search device according to claim 1, wherein the search unit performs the search using an integrated feature amount that integrates the feature amounts of the plurality of query videos, an integrated scene feature amount that integrates the scene feature amounts of the plurality of query videos, and the feature amount and the scene feature amount of the stored video.

The search device according to any one of claims 1 to 3, wherein the scene feature amount is an output value of a hidden layer of a neural network that acquires, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or an output value of a hidden layer of a neural network that acquires the feature amount of an image from the feature amount of an acoustic signal.

A search method for searching a database, in which at least stored videos or feature amounts obtained from the stored videos are stored, for a target scene included in a plurality of query videos, the search method comprising:
extracting, by a feature amount extraction unit, for each of the plurality of query videos, a feature amount related to an image or an acoustic signal from the query video;
acquiring, by a scene feature amount acquisition unit, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and
performing, by a search unit, a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored video,
wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.

A search program for searching a database, in which at least stored videos or feature amounts obtained from the stored videos are stored, for a target scene included in a plurality of query videos, the search program causing a computer to execute:
extracting, for each of the plurality of query videos, a feature amount related to an image or an acoustic signal from the query video;
acquiring, for each of the plurality of query videos, a scene feature amount of the target scene using the extracted feature amount; and
performing a search using the feature amount and the scene feature amount of each of the plurality of query videos and the feature amount and the scene feature amount of the stored video,
wherein the scene feature amount is an intermediate output value of a process of acquiring, from the feature amount of an image, the feature amount of the acoustic signal corresponding to the image, or of a process of acquiring the feature amount of an image from the feature amount of an acoustic signal.