
WO2019045101A1 - Image processing device and program - Google Patents

Image processing device and program

Info

Publication number
WO2019045101A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
image data
character
detection means
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2018/032635
Other languages
French (fr)
Japanese (ja)
Inventor
清晴 相澤
小川 徹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Tokyo NUC
Original Assignee
University of Tokyo NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Tokyo NUC filed Critical University of Tokyo NUC
Publication of WO2019045101A1 publication Critical patent/WO2019045101A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/10 — Segmentation; Edge detection
    • G06T 7/11 — Region-based segmentation

Definitions

  • The present invention relates to an image processing apparatus and program.
  • In order to perform such processing, it has conventionally been considered to identify drawn person portions and dialogue portions using the results of color analysis, character recognition processing, and the like (Non-Patent Document 1).
  • The present invention has been made in view of the above circumstances, and an object of the present invention is to provide an image processing apparatus and program that, in the processing of cartoon image data, can use machine learning processing to improve the recognition accuracy of frame, face, body, and character portions as compared with conventional techniques.
  • The present invention for solving the problems of the above conventional example is an image processing apparatus comprising: accepting means for receiving cartoon image data;
  • frame detection means in a machine-learned state for detecting, from the image data, a frame portion of a cartoon drawn in the image data;
  • face detection means in a machine-learned state for detecting, from the image data, a face portion drawn in the image data;
  • body detection means in a machine-learned state for detecting, from the image data, a body portion drawn in the image data;
  • character detection means in a machine-learned state for detecting, from the image data, a character portion included in the image data;
  • and detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means;
  • wherein the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion are subjected to predetermined information processing.
  • According to the present invention, the recognition accuracy when recognizing frame, face, body, and character portions from cartoon image data using machine learning processing can be improved as compared with conventional techniques.
  • FIG. 6 is an explanatory diagram showing a schematic example of processing by another example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention; FIG. 7 is a functional block diagram showing a configuration example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention.
  • As illustrated in FIG. 1, the image processing apparatus 1 includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and an input/output unit 15.
  • The control unit 11 is a program control device such as a CPU, and by executing a program stored in the storage unit 12, it receives cartoon image data and generates, based on the received cartoon image data, information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion.
  • In the process of identifying each of these portions, the control unit 11 according to the present embodiment uses a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data,
  • a face detector in a machine-learned state for detecting, from the image data, a face portion drawn in the image data, a body detector in a machine-learned state for detecting, from the image data, a body portion drawn in the image data,
  • and a character detector in a machine-learned state for detecting, from the image data, a character portion included in the image data.
  • The control unit 11 also executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion.
  • This information processing includes, for example, processing for outputting an image representing each portion, optical character recognition processing for a character string within the range specified by the information identifying a character portion, and division processing for dividing the image data for each frame portion. The operation of the control unit 11 will be described in detail later.
  • The storage unit 12 is a memory device or the like, and holds the program executed by the control unit 11. This program may be provided stored in a computer-readable, non-transitory recording medium and then stored in the storage unit 12.
  • The storage unit 12 also operates as a work memory of the control unit 11.
  • The operation unit 13 is a mouse, a keyboard, or the like, and receives instruction operations from the user and outputs them to the control unit 11.
  • The display unit 14 is, for example, a display or the like, and displays and outputs information based on instructions input from the control unit 11.
  • The input/output unit 15 is, for example, a network interface or the like, and receives data (image data and the like) from the outside and outputs the data to the control unit 11.
  • The input/output unit 15 also sends data to an external device or the like according to instructions input from the control unit 11.
  • By executing the program stored in the storage unit 12, the control unit 11 functionally includes, as illustrated in FIG. 2, a reception unit 21, a detection processing unit 22, a detection information generation unit 23, and an information processing unit 24.
  • The detection processing unit 22 includes a frame detection unit 31, a face detection unit 32, a body detection unit 33, and a character detection unit 34.
  • The reception unit 21 receives cartoon image data and outputs the data to the detection processing unit 22.
  • The cartoon image data is generally image data in which a face portion (F), a body portion (B), and a character portion (C) are drawn overlapping one another (FIG. 3), and includes at least one frame (M).
  • Since the detection processing unit 22 uses a neural network, the reception unit 21 enlarges or reduces the cartoon image data to resize it to a size suitable for the input of the neural network.
  • The frame detection unit 31 of the detection processing unit 22 has a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data.
  • The frame detector included in the frame detection unit 31 can be realized by adopting a neural network configured by various methods, such as R-CNN (Regions with CNN features) (Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014), Fast R-CNN (Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision, 2015), Faster R-CNN (Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems, 2015), YOLO (You Only Look Once) (Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015)), or SSD (Single Shot MultiBox Detector) (Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015)).
  • As schematically shown in FIG. 4, a detector 40 adopting a neural network such as an SSD is configured to include a base network unit 41 and a classifier 42.
  • The base network unit 41 outputs the range of an image that is a candidate for the detection target and a feature amount of the image within that range.
  • The classifier 42 judges, based on the output feature amount, whether or not the detection target (in the case of the frame detection unit 31, a frame line dividing the frames of the cartoon image data) is included in the output image range.
  • The detector 40 adopting such an SSD or the like is machine-learned using samples of image data in which the range to be detected (in the case of the frame detection unit 31, a range of a shape circumscribing the frame lines dividing the frames of the cartoon image data) has been manually specified.
  • Since specific methods of machine learning and of using the detector 40 are widely known, detailed description thereof is omitted here.
  • The face detection unit 32 has a face detector in a machine-learned state for detecting, from the image data, the face portion of a character drawn in the image data.
  • This face detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The face detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the face of a character included in the cartoon image data, has been manually specified.
  • The body detection unit 33 has a body detector in a machine-learned state for detecting, from the image data, the body portion of a character drawn in the image data.
  • This body detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The body detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the body of a character included in the cartoon image data, has been manually specified.
  • The character detection unit 34 has a character detector in a machine-learned state for detecting, from the image data, a character portion drawn in the image data.
  • This character detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The character detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing a character portion included in the cartoon image data, has been manually specified.
  • For the cartoon image data received by the reception unit 21, the detection information generation unit 23 generates information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34.
  • The information processing unit 24 executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion.
  • This information processing includes, for example, processing for performing optical character recognition (OCR) on an image of an identified character portion and outputting the result.
  • The information processing unit 24 may also translate a character string obtained as a result of the optical character recognition into another language by machine translation processing and output it.
  • An example of the present embodiment has the above configuration and operates as follows.
  • In the following description, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the control unit 11 adopt an SSD, and each is assumed to have been machine-learned in advance so as to detect, from image data, the frame portion, face portion, body portion, and character portion of the cartoon drawn in the image data.
  • The image processing apparatus 1 takes as a processing target cartoon image data input by the user (data not included in the machine learning samples), and, in parallel on the image data to be processed,
  • the frame detector, the face detector, the body detector, and the character detector respectively detect the frame portion, the face portion of a character, the body portion, and the character portion, and obtain information identifying the range of each detected image.
  • The image processing apparatus 1 then performs predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion; for example,
  • optical character recognition (OCR) is performed on an image of an identified character portion, and a character string obtained as a result of the optical character recognition is translated into another language by machine translation processing and output.
  • In the description so far, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each include an independent base network and detector, but the present embodiment is not limited to this example.
  • For example, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 may share one base network.
  • That is, in this example, as illustrated in FIG. 5, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 share
  • a base network unit 41' that is in a machine-learned state with respect to the ranges of images that are candidates for the detection target and the feature amounts of the images within those ranges, and that outputs, based on the image data to be processed, the range of an image that is a candidate for the detection target and the feature amount of the image within that range, together with classifiers 42a, 42b, 42c, and 42d provided independently corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively.
  • Also in this example, the base network unit 41' and the classifiers 42a, 42b, 42c, and 42d may be neural networks based on an SSD, but the SSD is modified and used in the following respect: in the output stage of a general SSD, a plurality of candidate areas (anchor boxes) for detecting an object are determined in advance (a set of a plurality of anchor boxes is called an anchor set), and the area containing the object of interest is identified from among the plurality of candidate areas.
  • In this example there is one network before the output stage (the base network unit 41'), but in the output stage the anchor set (each anchor set includes, for example, 8732 anchor boxes) is replicated four times, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, and the copies are used as the classifiers 42a, 42b, 42c, and 42d.
  • The anchor boxes in the first anchor set A1 are machine-learned on the frame portions of the image data; the anchor boxes in the second anchor set A2 are machine-learned on the face portions; the anchor boxes in the third anchor set A3 are machine-learned on the body portions;
  • and the anchor boxes in the fourth anchor set A4 are machine-learned on the character portions of the image data (FIG. 6).
  • Specifically, for the information on the anchor boxes in the first anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the frame line of a frame is estimated when a learning sample is input, updating the parameters of the classifier 42a and the base network unit 41'.
  • Similarly, for the information on the anchor boxes in the second anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the face portion of a character is estimated when a learning sample is input, updating the parameters of the classifier 42b and the base network unit 41'.
  • For the information on the anchor boxes in the third anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the body of a character is estimated when a learning sample is input, updating the parameters of the classifier 42c and the base network unit 41'.
  • For the information on the anchor boxes in the fourth anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing a character portion is estimated when a learning sample is input, updating the parameters of the classifier 42d and the base network unit 41'.
  • Here, g is an integer of 1 or more and G(m) or less,
  • G(m) is the number of correct answers included in the m-th sample,
  • and t(m, g) and B(m, g) represent the class of the g-th correct answer of the m-th sample (information indicating whether it is a frame, a face, a body, or a character) and its circumscribed rectangle.
  • The loss function L(z) is set as the sum of the localization error Lloc(m, z) and the confidence term Lconf(m, z).
  • Here, z represents the output of the neural network,
  • and A(m, pos) is the index set of the anchor boxes to which an object is assigned for the m-th sample.
  • Lloc(m, z) and Lconf(m, z) are defined in the description; A(m, neg) is a set of hard negatives, selected from among the anchor boxes to which no object is assigned as the top k|A(m, pos)| anchor boxes in descending order of loss.
  • Also in the image processing apparatus 1 in which the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 are configured as described above,
  • the classifiers 42a, 42b, 42c, and 42d corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 respectively estimate information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion, and predetermined information processing is executed using these.
  • As the operation of the reception unit 21, the control unit 11 may, instead of enlarging or reducing the entire input cartoon image data and resizing it to a size suitable for the input of the neural network,
  • divide the input cartoon image data into a plurality of divided portions based on a predetermined condition, resize the divided portions obtained by the division (partial cartoon image data, hereinafter referred to as partial image data) to a size suitable for the input of the neural network, and output them to the detection processing unit 22.
  • The predetermined condition may be, for example, a condition of dividing the original cartoon image data (width w, height h) into 2 x 2 pieces (four non-overlapping areas, each of width w/2 and height h/2). This condition may also be, for example, a condition of dividing at portions where white (the background color) continues, based on the content of the cartoon image data. Furthermore, the predetermined condition may be a condition of dividing for each frame.
  • In this case, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the detection processing unit 22 detect the frame portion, face portion, body portion, and character portion from each piece of the partial image data.
  • The machine learning process may also be performed using partial image data obtained by division.
  • The detection information generation unit 23 generates, for each piece of partial image data, information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32,
  • information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34, and puts these together to generate information identifying each of the frame portion, face portion, body portion, and character portion in the original cartoon image data.
  • The information processing unit 24 executes predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion generated by the detection information generation unit 23.
  • The predetermined condition for division into partial image data may be a condition of division for each frame.
  • In this case, the control unit 11 may detect frame portions by the operation of the detection processing unit 22 as the frame detection unit 31, and generate partial image data by dividing the image data at the detected frame portions.
  • Information identifying each frame portion output by the frame detection unit 31 (a frame portion being delimited by a frame line, which may include at least one curve such as a circular arc)
  • is output to the face detection unit 32, the body detection unit 33, and the character detection unit 34.
  • Each of the face detection unit 32, the body detection unit 33, and the character detection unit 34 then treats, for each frame portion identified by the information output from the frame detection unit 31, the image of that frame portion as partial image data,
  • and detects the face portion, the body portion, and the character portion from each piece of the partial image data. Also in this example, the machine learning process for the face detection unit 32, the body detection unit 33, and the character detection unit 34 may be performed using partial image data obtained by division.
  • The detection information generation unit 23 generates, for each frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33,
  • and information identifying the character portion detected by the character detection unit 34, and puts these together to generate information identifying each of the frame portion, face portion, body portion, and character portion in the original cartoon image data.
  • The information processing unit 24 executes predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion generated by the detection information generation unit 23.
  • When dividing the image data to be processed, the control unit 11 may also resize the image data before division to a size suitable for the input of the neural network and perform the operation as the detection processing unit 22 on it. That is, as the operations of the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, the control unit 11 detects the frame portion, face portion, body portion, and character portion from the image data before division.
  • The control unit 11 stores the information identifying the frame portion, face portion, body portion, and character portion detected here, then further divides the image data to be processed into a plurality of divided portions based on a predetermined condition, resizes the partial image data obtained by the division to a size suitable for the input of the neural network, and performs the operation as the detection processing unit 22 on it.
  • Using the information identifying the frame portion, face portion, body portion, and character portion detected from the image data before division and the information identifying the face portion, body portion, and character portion detected from each piece of the partial image data after division, if a face portion, a body portion, or a character portion is detected from at least one of the data before division and the data after division, the detection information generation unit 23 generates and outputs information identifying the face portion, body portion, and character portion detected from that at least one (the detection results of the respective portions are integrated and output).
  • That is, the so-called OR of the face, body, and character portions detected from the image data before division and those detected from the image data after division is taken as the face, body, and character portions detected from the image data to be processed, and information identifying them is output.
  • Since the frame portion can generally be detected with higher accuracy than the face portion, the body portion, or the character portion, it is considered sufficient if it is detected from only one of them (the image data before division, or alternatively the image data after division); however, also for the frame portion, the control unit 11 may output information identifying the frame portion when it is detected from at least one of the image data before division and the image data after division.
  • Although the division mode here is of one type, plural types of partial image data obtained by division in plural types of division modes may be generated.
  • For example, information identifying a frame portion, face portion, body portion, and character portion detected from any one or more pieces of partial image data obtained by division in a plurality of modes, such as partial image data obtained by division for each frame and partial image data obtained by division into 2 x 2 (the image data before division may also be added), may be output. Also in this case, when duplication occurs, the duplicates are excluded before output.
  • According to the present embodiment, independent detectors respectively correspond to the frame portion, the face portion of a character, the body portion, and the character portion, which mutually overlap or are included in one another in the cartoon image data,
  • so that the accuracy of detection can be improved as compared with conventional detection using machine learning.
  • Reference Signs List: 1 image processing apparatus, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 input/output unit, 21 reception unit, 22 detection processing unit, 23 detection information generation unit, 24 information processing unit, 31 frame detection unit, 32 face detection unit, 33 body detection unit, 34 character detection unit, 40 detector, 41, 41' base network unit, 42 classifier.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An image processing apparatus that accepts cartoon image data and, using the results of machine learning for identifying each of a frame portion, a face portion, a body portion, and a character portion based on the cartoon image data, generates information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion, the generated information being subjected to predetermined information processing.

Description

Image processing apparatus and program

The present invention relates to an image processing apparatus and program.

In recent years, techniques have been considered for processing cartoon image data, such as extracting dialogue portions and translating them into other languages, and rearranging the frames so as to suit the screen of a smartphone or the like.

In order to perform such processing, it has conventionally been considered to identify drawn person portions and dialogue portions using the results of color analysis, character recognition processing, and the like (Non-Patent Document 1).

Christophe Rigaud, et al., "Speech balloon and speaker association for comics and manga understanding," Proceedings of the 13th International Conference on Document Analysis and Recognition, pp. 351-355, IEEE, 2015.

Meanwhile, in recent years, techniques for detecting objects in images by machine learning have been developed and widely studied. However, conventional general object detection processing performs detection on the premise that a specific part of an image contains a single object, and does not consider cases in which many detection targets overlap one another.

In cartoon image data, however, it is common for a character's body or face to be drawn inside a frame (or across multiple frames), and for dialogue text to be placed overlapping these parts. Therefore, if object detection processing based on machine learning is applied as it is, parts such as frames, characters' bodies, faces, and dialogue cannot each be detected with sufficient accuracy.

The present invention has been made in view of the above circumstances, and one of its objects is to provide an image processing apparatus and program that, in the processing of cartoon image data, can use machine learning processing to improve the recognition accuracy of frame portions, face portions, body portions, and character portions as compared with conventional techniques.

The present invention for solving the problems of the above conventional example is an image processing apparatus comprising: accepting means for receiving cartoon image data; frame detection means in a machine-learned state for detecting, from the image data, a frame portion of a cartoon drawn in the image data; face detection means in a machine-learned state for detecting, from the image data, a face portion drawn in the image data; body detection means in a machine-learned state for detecting, from the image data, a body portion drawn in the image data; character detection means in a machine-learned state for detecting, from the image data, a character portion included in the image data; and detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means, wherein the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion are subjected to predetermined information processing.

According to the present invention, the recognition accuracy when recognizing frame portions, face portions, body portions, and character portions from cartoon image data using machine learning processing can be improved as compared with conventional techniques.

FIG. 1 is a block diagram showing a configuration example of an image processing apparatus according to an embodiment of the present invention. FIG. 2 is a functional block diagram showing an example of the image processing apparatus according to the embodiment of the present invention. FIG. 3 is an explanatory diagram showing an outline example of cartoon image data to be processed by the image processing apparatus according to the embodiment of the present invention. FIG. 4 is an internal functional block diagram showing an outline example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 5 is an internal functional block diagram showing another example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 6 is an explanatory diagram showing a schematic example of processing by the other example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 7 is a functional block diagram showing a configuration example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention.

Embodiments of the present invention will be described with reference to the drawings. As illustrated in FIG. 1, the image processing apparatus 1 according to the embodiment of the present invention includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and an input/output unit 15.

The control unit 11 is a program control device such as a CPU, and by executing a program stored in the storage unit 12, it receives cartoon image data and generates, based on the received cartoon image data, information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion. In the process of identifying each of these portions, the control unit 11 according to the present embodiment uses a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data; a face detector in a machine-learned state for detecting, from the image data, a face portion drawn in the image data; a body detector in a machine-learned state for detecting, from the image data, a body portion drawn in the image data; and a character detector in a machine-learned state for detecting, from the image data, a character portion included in the image data.

The control unit 11 also executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion. This information processing includes, for example, processing for outputting an image representing each portion, optical character recognition processing for a character string within the range specified by the information identifying a character portion, and division processing for dividing the image data for each frame portion. The operation of the control unit 11 will be described in detail later.

The storage unit 12 is a memory device or the like, and holds the program executed by the control unit 11. This program may be provided stored in a computer-readable, non-transitory recording medium and then stored in the storage unit 12. The storage unit 12 also operates as a work memory of the control unit 11.

The operation unit 13 is a mouse, a keyboard, or the like, and receives instruction operations from the user and outputs them to the control unit 11. The display unit 14 is, for example, a display or the like, and displays and outputs information based on instructions input from the control unit 11.

The input/output unit 15 is, for example, a network interface or the like, and receives data (image data and the like) from the outside and outputs the data to the control unit 11. The input/output unit 15 also sends data to an external device or the like according to instructions input from the control unit 11.

Next, the operation of the control unit 11 will be described. By executing the program stored in the storage unit 12, the control unit 11 according to the present embodiment functionally includes, as illustrated in FIG. 2, a reception unit 21, a detection processing unit 22, a detection information generation unit 23, and an information processing unit 24. The detection processing unit 22 includes a frame detection unit 31, a face detection unit 32, a body detection unit 33, and a character detection unit 34.

The reception unit 21 receives cartoon image data and outputs the data to the detection processing unit 22. Here, the cartoon image data is generally image data in which a face portion (F), a body portion (B), and a character portion (C) are drawn overlapping one another (FIG. 3), and includes at least one frame (M). Since the detection processing unit 22 uses a neural network, the reception unit 21 enlarges or reduces the cartoon image data to resize it to a size suitable for the input of the neural network.
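
As a concrete illustration of this resizing step, the following is a minimal sketch in Python, assuming Pillow is available and assuming a hypothetical 512 x 512 network input size (the patent does not specify a particular size or library).

```python
from PIL import Image  # assumption: Pillow is used for image handling

NETWORK_INPUT_SIZE = (512, 512)  # hypothetical input size; not specified in the patent

def resize_for_network(page_path: str) -> Image.Image:
    """Enlarge or reduce a cartoon page so it fits the neural network input size."""
    page = Image.open(page_path).convert("RGB")
    # Bilinear resampling is one reasonable choice; the patent does not prescribe one.
    return page.resize(NETWORK_INPUT_SIZE, Image.BILINEAR)
```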

The frame detection unit 31 of the detection processing unit 22 has a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data. Specifically, the frame detector included in the frame detection unit 31 can be realized by adopting a neural network configured by various methods, such as R-CNN (Regions with CNN features) (Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014), Fast R-CNN (Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision, 2015), Faster R-CNN (Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems, 2015), YOLO (You Only Look Once) (Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015)), or SSD (Single Shot MultiBox Detector) (Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015)).

As schematically shown in FIG. 4, a detector 40 adopting a neural network such as an SSD is configured to include a base network unit 41 and a classifier 42. Here, the base network unit 41 outputs the range of an image that is a candidate for the detection target and a feature amount of the image within that range. The classifier 42 judges, based on the output feature amount, whether or not the detection target (in the case of the frame detection unit 31, a frame line dividing the frames of the cartoon image data) is included in the output image range.

The detector 40 adopting such an SSD or the like is machine-learned using samples of image data in which the range to be detected (in the case of the frame detection unit 31, a range of a shape circumscribing the frame lines dividing the frames of the cartoon image data) has been manually specified. Since specific methods of machine learning and of using the detector 40 are widely known, detailed description thereof is omitted here.
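
To make the shape of such manually annotated training samples concrete, the following is a minimal sketch of one possible annotation record, assuming axis-aligned circumscribing rectangles stored as (x, y, width, height); the field names and file layout are illustrative assumptions, not the patent's own format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical class labels for the four detection targets described in the text.
CLASSES = ("frame", "face", "body", "text")

@dataclass
class Annotation:
    """One manually specified circumscribing rectangle on a cartoon page."""
    label: str                      # one of CLASSES
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels

@dataclass
class TrainingSample:
    """One annotated page used for machine learning of the detectors."""
    image_path: str
    annotations: List[Annotation]

sample = TrainingSample(
    image_path="pages/page_001.png",  # hypothetical path
    annotations=[
        Annotation("frame", (10, 12, 480, 300)),
        Annotation("face", (60, 40, 90, 90)),
        Annotation("body", (50, 40, 160, 250)),
        Annotation("text", (300, 30, 120, 160)),
    ],
)
```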

The face detection unit 32 has a face detector in a machine-learned state for detecting, from the image data, the face portion of a character drawn in the image data. This face detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The face detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the face of a character included in the cartoon image data, has been manually specified.

The body detection unit 33 has a body detector in a machine-learned state for detecting, from the image data, the body portion of a character drawn in the image data. This body detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The body detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the body of a character included in the cartoon image data, has been manually specified.

The character detection unit 34 has a character detector in a machine-learned state for detecting, from the image data, a character portion drawn in the image data. This character detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The character detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing a character portion included in the cartoon image data, has been manually specified.

For the cartoon image data received by the reception unit 21, the detection information generation unit 23 generates information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34.

The information processing unit 24 executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion. This information processing includes, for example, processing for performing optical character recognition (OCR) on an image of an identified character portion and outputting the result. The information processing unit 24 may also translate a character string obtained as a result of the optical character recognition into another language by machine translation processing and output it.
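
As one possible realization of this OCR step, the following is a minimal sketch assuming pytesseract (a Tesseract OCR wrapper) is installed with Japanese language data, and assuming a hypothetical translate() helper standing in for the machine translation processing; the patent does not name a specific OCR engine or translation service.

```python
from PIL import Image
import pytesseract  # assumption: Tesseract OCR with Japanese data is installed

def read_text_region(page: Image.Image, box: tuple) -> str:
    """Crop one detected character region and run OCR on it."""
    x, y, w, h = box                        # circumscribing rectangle of the text portion
    region = page.crop((x, y, x + w, y + h))
    return pytesseract.image_to_string(region, lang="jpn").strip()

def translate(text: str, target_lang: str = "en") -> str:
    """Placeholder for the machine translation processing; any MT backend could be used."""
    raise NotImplementedError("plug in a machine translation backend here")
```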

An example of the present embodiment has the above configuration and operates as follows. In the following description, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the control unit 11 adopt an SSD, and each is assumed to have been machine-learned in advance so as to detect, from image data, the frame portion, face portion, body portion, and character portion of the cartoon drawn in the image data.

The image processing apparatus 1 takes as a processing target cartoon image data input by the user (data not included in the machine learning samples), and, in parallel on the image data to be processed, the frame detector, the face detector, the body detector, and the character detector respectively detect the frame portion, the face portion of a character, the body portion, and the character portion, and obtain information identifying the range of each detected image.

The image processing apparatus 1 then performs predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion; for example, optical character recognition (OCR) is performed on an image of an identified character portion, and a character string obtained as a result of the optical character recognition is translated into another language by machine translation processing and output.

[Example of sharing a base network]
In the description so far, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each include an independent base network and detector, but the present embodiment is not limited to this example. For example, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 may share one base network.

That is, in this example, as illustrated in FIG. 5, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 share a base network unit 41' that is in a machine-learned state with respect to the ranges of images that are candidates for the detection target and the feature amounts of the images within those ranges, and that outputs, based on the image data to be processed, the range of an image that is a candidate for the detection target and the feature amount of the image within that range; in addition, classifiers 42a, 42b, 42c, and 42d are provided independently, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively.

Also in this example, the base network unit 41' and the classifiers 42a, 42b, 42c, and 42d may be neural networks based on an SSD, but the SSD is modified and used in the following respect. That is, in the output stage of a general SSD, a plurality of candidate areas (anchor boxes) for detecting an object are determined in advance (a set of a plurality of anchor boxes is called an anchor set), and the area containing the object of interest is identified from among the plurality of candidate areas.

In this example of the present embodiment, there is one network before the output stage (the base network unit 41'), but in the output stage the anchor set (each anchor set includes, for example, 8732 anchor boxes) is replicated four times, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, and the copies are used as the classifiers 42a, 42b, 42c, and 42d. That is, in the SSD of this example, for each anchor box, a total of five-dimensional information is output, consisting of a localization term with respect to the region of the object detected in that anchor box (four-dimensional information comprising the upper-left coordinates and the width and height) and a confidence that an object may be included (here normalized by a sigmoid function). Of the anchor boxes in the four replicated anchor sets, the anchor boxes in the first anchor set A1 are machine-learned on the frame portions of the image data, the anchor boxes in the second anchor set A2 are machine-learned on the face portions, the anchor boxes in the third anchor set A3 are machine-learned on the body portions, and the anchor boxes in the fourth anchor set A4 are machine-learned on the character portions (FIG. 6).
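
The following is a minimal PyTorch-style sketch of this shared structure, assuming a single feature map and a hypothetical number of anchors per location; the real SSD uses several feature maps and 8732 anchor boxes in total, and the layer sizes here are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SharedBaseDetector(nn.Module):
    """One shared base network with four per-class heads (frame, face, body, text).

    Each head plays the role of one classifier 42a-42d: for every anchor box it
    predicts 4 box offsets plus 1 confidence value (5 values per anchor).
    """

    def __init__(self, anchors_per_location: int = 6):
        super().__init__()
        self.base = nn.Sequential(            # stand-in for the base network unit 41'
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        out = anchors_per_location * 5        # (dx, dy, dw, dh, confidence) per anchor
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(256, out, 3, padding=1)
            for name in ("frame", "face", "body", "text")
        })

    def forward(self, images: torch.Tensor) -> dict:
        features = self.base(images)          # shared features for all four targets
        return {name: head(features) for name, head in self.heads.items()}
```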

Specifically, for the information on the anchor boxes in the first anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the frame line of a frame is estimated when a learning sample is input, updating the parameters of the classifier 42a and the base network unit 41'.

Similarly, for the information on the anchor boxes in the second anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the face portion of a character is estimated when a learning sample is input, updating the parameters of the classifier 42b and the base network unit 41'. For the information on the anchor boxes in the third anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the body of a character is estimated when a learning sample is input, updating the parameters of the classifier 42c and the base network unit 41'. Furthermore, for the information on the anchor boxes in the fourth anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing a character portion is estimated when a learning sample is input, updating the parameters of the classifier 42d and the base network unit 41'.
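
In practice, back-propagating the error from each of the four anchor sets into the shared base network can be done by summing one loss per head before calling backward, as in the hedged sketch below; the per-head loss function is only sketched later in this description, and the optimizer settings are assumptions.

```python
import torch

def train_step(model, optimizer, images, targets, head_loss):
    """One update: a loss is computed per head (frame/face/body/text) and their sum
    is back-propagated, so both the heads and the shared base network are updated."""
    optimizer.zero_grad()
    outputs = model(images)                       # dict: head name -> raw predictions
    total_loss = sum(
        head_loss(outputs[name], targets[name])   # targets[name]: ground truth for that head
        for name in ("frame", "face", "body", "text")
    )
    total_loss.backward()                          # errors flow into heads and base network
    optimizer.step()
    return float(total_loss)
```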

For the a-th anchor box (a = 1, 2, ..., 8732) in the i-th (i = 1, 2, 3, 4) anchor set and the m-th sample (assuming mini-batch learning with a mini-batch size of M, m = 1, 2, ..., M), the assignment s(m, i, a) and its overlap J(m, i, a) are defined as follows.

[Equation 1 (image JPOXMLDOC01-appb-M000001, not reproduced)]

Here, g is an integer of 1 or more and G(m) or less, G(m) is the number of correct answers included in the m-th sample, and t(m, g) and B(m, g) represent the class of the g-th correct answer of the m-th sample (information indicating whether it is a frame, a face, a body, or a character) and its circumscribed rectangle.
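
The exact definitions of s(m, i, a) and J(m, i, a) are given by Equation 1, which is only available as an image in this text; the sketch below shows a common IoU-based anchor assignment of the same general kind, offered as an assumption rather than the patent's precise formula.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) rectangles."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2, bx2, by2 = ax1 + aw, ay1 + ah, bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, return the index of the best-overlapping ground-truth box
    (or -1 if the best overlap is below the threshold) and that overlap value."""
    assignment, overlap = [], []
    for anchor in anchors:
        ious = [iou(anchor, gt) for gt in gt_boxes] or [0.0]
        best = max(range(len(ious)), key=lambda g: ious[g])
        overlap.append(ious[best])
        assignment.append(best if ious[best] >= threshold else -1)
    return assignment, overlap
```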

 The loss function L(z) is then set as the sum of the localization error Lloc(m, z) and the confidence term Lconf(m, z), as follows.

[Math 2]

Here, z denotes the output of the neural network, and A(m, pos) is the index set of the anchor boxes to which an object has been assigned for the m-th sample; specifically, it may be set as

[Math 3]

or the like.
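
 Again the formulas themselves appear only as images ([Math 2] and [Math 3]); under the usual SSD formulation they would plausibly take a form such as the following, where the positive-matching threshold of 0.5 is an assumption rather than a value stated in the publication:

```latex
L(z) = \frac{1}{M} \sum_{m=1}^{M} \Bigl( L_{\mathrm{conf}}(m,z) + L_{\mathrm{loc}}(m,z) \Bigr),
\qquad
A(m,\mathrm{pos}) = \bigl\{ (i,a) \;:\; J(m,i,a) \ge 0.5 \bigr\}
```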

 Lloc(m, z) and Lconf(m, z) are defined as follows.

[Math 4]

Here, A(m, neg) is the set of hard negatives, obtained by selecting, from among the anchor boxes to which no object has been assigned, the top k|A(m, pos)| boxes in descending order of l(m, i, a, z) (this is the method known as hard negative mining, so a detailed description is omitted here). Also, huber() is the Huber function; since this function is likewise widely known, a detailed description is omitted here.
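
 The definitions behind [Math 4] are likewise available only as an image; consistent with the description above (a Huber penalty on the positive matches for localization, and a cross-entropy-style confidence term over positives and mined hard negatives with sigmoid-normalized confidences c), one plausible reconstruction is:

```latex
L_{\mathrm{loc}}(m,z) = \sum_{(i,a)\in A(m,\mathrm{pos})} \mathrm{huber}\bigl( z_{\mathrm{loc}}(m,i,a) - \hat g(m,i,a) \bigr),
\qquad
L_{\mathrm{conf}}(m,z) = \sum_{(i,a)\in A(m,\mathrm{pos})} -\log c(m,i,a)
 \;+\; \sum_{(i,a)\in A(m,\mathrm{neg})} l(m,i,a,z),
\quad \text{with } l(m,i,a,z) = -\log\bigl(1 - c(m,i,a)\bigr)
```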

 Also in the image processing apparatus 1 of the present embodiment, in which the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 are configured as described above, the classifiers 42a, 42b, 42c, and 42d corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively, estimate information identifying the frame portions, information identifying the face portions, information identifying the body portions, and information identifying the character portions, and predetermined information processing is executed using these pieces of information.

[Example of dividing image data]
 As the operation of the receiving unit 21, the control unit 11 of the present embodiment may, instead of enlarging or reducing the entire input cartoon image data to resize it to a size suitable for input to the neural network, divide the input cartoon image data into a plurality of divided portions based on a predetermined condition, resize the divided portions obtained by the division (partial cartoon image data, hereinafter referred to as partial image data) to a size suitable for input to the neural network, and output them to the detection processing unit 22.

 Here, the predetermined condition may be, for example, a condition that the original cartoon image data (width w, height h) is divided into 2 × 2 parts (four non-overlapping regions, each of width w/2 and height h/2). The condition may also be based on the content of the cartoon image data, for example, a condition that the data is divided at portions where white (the background color) continues. Further, the predetermined condition may be a condition that the data is divided panel by panel (frame by frame).
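
 As an illustration of the 2 × 2 division described above, a minimal sketch assuming the page is held as a NumPy-style array (the function name and interface are illustrative) could be:

```python
def split_into_tiles(page, rows=2, cols=2):
    """Split a page image (H x W [x C] array) into non-overlapping tiles.

    Returns a list of (tile, (x_offset, y_offset)) pairs so that boxes
    detected in a tile can later be mapped back to page coordinates.
    """
    h, w = page.shape[:2]
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            tiles.append((page[y0:y1, x0:x1], (x0, y0)))
    return tiles
```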

 In this example of the present embodiment, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the detection processing unit 22 detect frame portions (when the division is not panel by panel), face portions, body portions, and character portions from each piece of partial image data. In this example, the machine learning process may also be performed using the partial image data obtained by the division.

 The detection information generation unit 23 then generates, for each piece of partial image data, information identifying the frame portions detected by the frame detection unit 31, information identifying the face portions detected by the face detection unit 32, information identifying the body portions detected by the body detection unit 33, and information identifying the character portions detected by the character detection unit 34, and combines these to generate information identifying each of the frame portions, face portions, body portions, and character portions in the original cartoon image data.
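
 A sketch of how the per-tile results could be collected back into the coordinate system of the original page, assuming a detector callable that returns a dictionary of (x, y, w, h, score) boxes per class (this interface is an assumption made only for illustration), follows:

```python
def detect_on_tiles(detector, tiles):
    """Run the detectors on each tile and collect the results in page
    coordinates.  `detector(tile)` is assumed to return a dict mapping a
    class name ("frame", "face", "body", "character") to a list of
    (x, y, w, h, score) boxes in tile coordinates."""
    merged = {}
    for tile, (ox, oy) in tiles:
        for cls, boxes in detector(tile).items():
            merged.setdefault(cls, []).extend(
                (x + ox, y + oy, w, h, score) for (x, y, w, h, score) in boxes
            )
    return merged
```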

 The information processing unit 24 executes predetermined information processing using the information identifying the frame portions, the information identifying the face portions, the information identifying the body portions, and the information identifying the character portions generated by the detection information generation unit 23.

 As described above, the predetermined condition for dividing the data into partial image data may be a condition that the data is divided panel by panel. In this case, the control unit 11 may detect the frame portions through its operation as the frame detection unit 31 of the detection processing unit 22 and generate the partial image data by dividing the data for each detected frame portion.

 That is, in this example of the present embodiment, as illustrated in FIG. 7, the frame detection unit 31 outputs information identifying a polygon (which may at least partly include a curve such as a circular arc) circumscribing each frame portion (the border lines delimiting the panels) to the face detection unit 32, the body detection unit 33, and the character detection unit 34. Then, for each frame portion identified by the information output from the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each treat that frame portion as partial image data and detect face portions, body portions, and character portions from each piece of partial image data. In this example as well, the machine learning process for the face detection unit 32, the body detection unit 33, and the character detection unit 34 may be performed using the partial image data obtained by the division.

 In this example as well, the detection information generation unit 23 generates, for each frame portion detected by the frame detection unit 31, information identifying the face portions detected by the face detection unit 32, information identifying the body portions detected by the body detection unit 33, and information identifying the character portions detected by the character detection unit 34, and combines these to generate information identifying each of the frame portions, face portions, body portions, and character portions in the original cartoon image data.
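
 Similarly, when the panels themselves are used as the divided portions, detections from each panel crop can be shifted back into page coordinates. A sketch under the same assumed detector interface, and assuming the frame detections are axis-aligned (x, y, w, h, score) rectangles, is:

```python
def detect_per_panel(page, frame_boxes, detector):
    """Crop each detected frame (panel) and run the face/body/character
    detectors on the crop, shifting the results back into page coordinates."""
    results = {}
    for (px, py, pw, ph, _score) in frame_boxes:
        x0, y0 = int(px), int(py)
        crop = page[y0:y0 + int(ph), x0:x0 + int(pw)]
        for cls, boxes in detector(crop).items():
            results.setdefault(cls, []).extend(
                (x + x0, y + y0, w, h, s) for (x, y, w, h, s) in boxes
            )
    return results
```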

 The information processing unit 24 executes predetermined information processing using the information identifying the frame portions, the information identifying the face portions, the information identifying the body portions, and the information identifying the character portions generated by the detection information generation unit 23.

[Combining detection results]
 When dividing the image data to be processed, the control unit 11 may also resize the image data before division to a size suitable for input to the neural network and perform the operation of the detection processing unit 22 on it. That is, as the operations of the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, the control unit 11 detects frame portions, face portions, body portions, and character portions from the image data before division.

 The control unit 11 then stores the information identifying the frame portions, face portions, body portions, and character portions detected here, further divides the image data to be processed into a plurality of divided portions based on a predetermined condition, resizes each piece of partial image data obtained by the division to a size suitable for input to the neural network, and performs the operation of the detection processing unit 22 on it.

 According to this example, information identifying the frame portions, face portions, body portions, and character portions detected from the image data before division, and information identifying the face portions, body portions, and character portions detected from each piece of partial image data obtained after division, are both obtained.

 Then, using the information identifying the frame portions, face portions, body portions, and character portions detected from the image data before division and the information identifying the face portions, body portions, and character portions detected from each piece of partial image data after division, if a face portion, body portion, or character portion is detected from at least one of the data before division and the data after division, the detection information generation unit 23 generates and outputs information identifying the face portion, body portion, or character portion detected from that at least one (that is, the detection results for each kind of portion are integrated and output).

 In this example, in effect, the logical OR of the face portions, body portions, and character portions detected from the image data before division and those detected from the image data after division is output, as information identifying the face portions, body portions, and character portions detected from the image data to be processed.

 Since frame portions can generally be detected with higher accuracy than face portions, body portions, or character portions, it is considered sufficient to detect them from only one of the image data before division (or the image data after division). However, the control unit 11 may also output information identifying a frame portion when it is detected from at least one of the image data before division and the image data after division.

 When outputting information on the portions detected from either of the two in this way (frame portions, face portions, body portions, and character portions), overlapping detections of the same portion are removed before output.

 Here, information identifying the frame portions, face portions, body portions, and character portions detected from either the image data before division or the image data after division is output. That is, for example, when the image data to be processed is the image data of one page of a cartoon, the OR (logical sum) of what is detected over the entire page and what is detected in each divided portion is taken. However, this example of the present embodiment is not limited to this, and information identifying the frame portions, face portions, body portions, and character portions detected in common from both the image data before division and the image data after division may be output (that is, for example, when the image data to be processed is the image data of one page of a cartoon, the AND (logical product) of what is detected over the entire page and what is detected in each divided portion may be taken).
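
 A sketch of this integration step, assuming axis-aligned (x, y, w, h, ...) boxes and treating two boxes whose intersection over union exceeds a threshold as duplicates of the same detection (the threshold value of 0.5 is an assumption), is:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h, ...) boxes."""
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def merge_detections(full_page, tiled, mode="or", thr=0.5):
    """Combine detections from the whole page and from the divided portions.

    mode="or":  keep a box found in either pass, dropping duplicates
                (boxes whose IoU exceeds thr count as the same detection).
    mode="and": keep only boxes found, up to IoU thr, in both passes.
    """
    if mode == "or":
        merged = list(full_page)
        for box in tiled:
            if all(iou(box, kept) < thr for kept in merged):
                merged.append(box)
        return merged
    return [box for box in full_page
            if any(iou(box, other) >= thr for other in tiled)]
```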

 Furthermore, although a single manner of division is used here, plural kinds of partial image data obtained by dividing in plural manners may be generated. For example, information identifying the frame portions, face portions, body portions, and character portions detected from at least one of (or in common from each of) the pieces of partial image data obtained by dividing in plural manners, such as partial image data obtained by dividing panel by panel and partial image data obtained by a 2 × 2 four-way division (the image data before division may also be included), may be output. In this case as well, if duplicates occur, they are removed before output.

[Effects of the embodiment]
 As described above, according to the present embodiment, the frame portions, character face portions, body portions, and character (text) portions, which overlap one another or stand in inclusion relations within cartoon image data, are each detected by an independent detector (or classifier) corresponding to that kind of portion, so the detection accuracy can be improved compared with conventional detection using machine learning.

Reference Signs List
 1 image processing apparatus, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 input/output unit, 21 receiving unit, 22 detection processing unit, 23 detection information generation unit, 24 information processing unit, 31 frame detection unit, 32 face detection unit, 33 body detection unit, 34 character detection unit, 40 detector, 41, 41' base network unit, 42 classifier.

Claims (7)

1.  An image processing apparatus comprising:
 receiving means for receiving cartoon image data;
 frame detection means in a state of having been machine-learned to detect, from image data, a frame portion of a cartoon drawn in the image data;
 face detection means in a state of having been machine-learned to detect, from image data, a face portion drawn in the image data;
 body detection means in a state of having been machine-learned to detect, from image data, a body portion drawn in the image data;
 character detection means in a state of having been machine-learned to detect, from image data, a character portion included in the image data; and
 detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means,
 wherein the generated information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion are provided for predetermined information processing.
2.  The image processing apparatus according to claim 1, wherein the frame detection means, the face detection means, the body detection means, and the character detection means include:
 a base network unit, common to all of them, that is in a state of having machine-learned ranges of images that are candidates for detection and feature quantities of the images within those ranges, and that outputs, based on the image data to be processed, ranges of images that are candidates for detection and feature quantities of the images within those ranges; and
 classifiers provided in correspondence with the frame detection means, the face detection means, the body detection means, and the character detection means, respectively, each of which classifies, based on the feature quantities of the images, whether an image included in the corresponding image range is a frame portion, a face portion, a body portion, or a character portion, respectively.
3.  The image processing apparatus according to claim 1 or 2, wherein the detection information generation means generates, for each of divided portions obtained by dividing the received cartoon image data into a plurality of divided portions based on a predetermined condition, information identifying a face portion, information identifying a body portion, and information identifying a character portion from each of the divided portions by means of the face detection means, the body detection means, and the character detection means.
4.  The image processing apparatus according to claim 3, wherein the detection information generation means uses the information identifying the frame portions detected by the frame detection means to treat each of the identified frame portions as one of the divided portions, and generates, for each frame portion serving as a divided portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion from each of the divided portions by means of the face detection means, the body detection means, and the character detection means.
5.  The image processing apparatus according to claim 3 or 4, wherein the detection information generation means generates information identifying a face portion, information identifying a body portion, and information identifying a character portion detected by the face detection means, the body detection means, and the character detection means from the received cartoon image data before division, and
 integrates and outputs the generated information together with the information identifying a face portion, the information identifying a body portion, and the information identifying a character portion detected from each of the divided portions.
6.  The image processing apparatus according to any one of claims 1 to 5, wherein the frame detection means, the face detection means, the body detection means, and the character detection means are configured using a single shot multibox detector (SSD).
7.  A program for causing a computer to function as:
 receiving means for receiving cartoon image data;
 frame detection means in a state of having been machine-learned to detect, from image data, a frame portion of a cartoon drawn in the image data;
 face detection means in a state of having been machine-learned to detect, from image data, a face portion drawn in the image data;
 body detection means in a state of having been machine-learned to detect, from image data, a body portion drawn in the image data;
 character detection means in a state of having been machine-learned to detect, from image data, a character portion included in the image data; and
 detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means,
 wherein the generated information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion are provided for predetermined information processing.
PCT/JP2018/032635 2017-09-04 2018-09-03 Image processing device and program Ceased WO2019045101A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-169632 2017-09-04
JP2017169632A JP2019046253A (en) 2017-09-04 2017-09-04 Information processing apparatus and program

Publications (1)

Publication Number Publication Date
WO2019045101A1

Family

ID=65527562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/032635 Ceased WO2019045101A1 (en) 2017-09-04 2018-09-03 Image processing device and program

Country Status (2)

Country Link
JP (1) JP2019046253A (en)
WO (1) WO2019045101A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7626384B2 (en) * 2020-04-15 2025-02-04 ネットスター株式会社 Trained model, site determination program, and site determination system
JP7324475B1 (en) 2022-10-20 2023-08-10 株式会社hotarubi Information processing device, information processing method and information processing program
JP7802902B1 (en) * 2024-12-12 2026-01-20 株式会社Nttドコモ Information processing device and translation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002229766A (en) * 2000-11-29 2002-08-16 Eastman Kodak Co Method for sending image to low-display function terminal
JP2011238043A (en) * 2010-05-11 2011-11-24 Kddi Corp Summarized comic image generation device, program and method for generating summary of comic content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FUJIMOTO, AZUMA ET AL.: "Creation and Analysis of Academic Manga Image Dataset with Annotations", IEICE TECHNICAL REPORT, vol. 116, no. 64, 15 March 2017 (2017-03-15), pages 35 - 40 *
FUKUI, HIROSHI ET AL.: "Research Trends in Pedestrian Detection Using Deep Learning", IEICE TECHNICAL REPORT, vol. 116, no. 366, 17 January 2017 (2017-01-17), pages 37 - 46 *
YANAGISAWA, HIDEAKI ET AL.: "Structural Analysis of Comic Images using Faster R-CNN", PCSJ/IMPS 2016, 30 November 2016 (2016-11-30), pages 80 - 81 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network

Also Published As

Publication number Publication date
JP2019046253A (en) 2019-03-22

Legal Events

Code 121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18852176; Country of ref document: EP; Kind code of ref document: A1)

Code NENP: Non-entry into the national phase (Ref country code: DE)

Code 122: Ep: pct application non-entry in european phase (Ref document number: 18852176; Country of ref document: EP; Kind code of ref document: A1)