
WO2019045101A1 - Image processing device and program - Google Patents

Image processing device and program

Info

Publication number
WO2019045101A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
image data
character
detection means
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2018/032635
Other languages
French (fr)
Japanese (ja)
Inventor
清晴 相澤
小川 徹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Tokyo NUC
Original Assignee
University of Tokyo NUC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Tokyo NUC filed Critical University of Tokyo NUC
Publication of WO2019045101A1 publication Critical patent/WO2019045101A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/10 — Segmentation; Edge detection
    • G06T 7/11 — Region-based segmentation

Definitions

  • The present invention relates to an image processing apparatus and program.
  • In order to perform such processing, it has conventionally been considered to identify drawn person portions and dialogue portions using the results of color analysis, character recognition processing, and the like (Non-Patent Document 1).
  • The present invention has been made in view of the above circumstances, and an object of the present invention is to provide an image processing apparatus and program that, in the processing of cartoon image data, can use machine learning processing to improve the recognition accuracy of frame, face, body, and character portions as compared with conventional techniques.
  • The present invention for solving the problems of the above conventional example is an image processing apparatus comprising: accepting means for receiving cartoon image data;
  • frame detection means in a machine-learned state for detecting, from the image data, a frame portion of a cartoon drawn in the image data;
  • face detection means in a machine-learned state for detecting, from the image data, a face portion drawn in the image data;
  • body detection means in a machine-learned state for detecting, from the image data, a body portion drawn in the image data;
  • character detection means in a machine-learned state for detecting, from the image data, a character portion included in the image data;
  • and detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means;
  • wherein the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion are subjected to predetermined information processing.
  • According to the present invention, the recognition accuracy when recognizing frame, face, body, and character portions from cartoon image data using machine learning processing can be improved as compared with conventional techniques.
  • FIG. 6 is an explanatory diagram showing a schematic example of processing by another example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention; FIG. 7 is a functional block diagram showing a configuration example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention.
  • As illustrated in FIG. 1, the image processing apparatus 1 includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and an input/output unit 15.
  • The control unit 11 is a program control device such as a CPU, and by executing a program stored in the storage unit 12, it receives cartoon image data and generates, based on the received cartoon image data, information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion.
  • In the process of identifying each of these portions, the control unit 11 according to the present embodiment uses a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data,
  • a face detector in a machine-learned state for detecting, from the image data, a face portion drawn in the image data, a body detector in a machine-learned state for detecting, from the image data, a body portion drawn in the image data,
  • and a character detector in a machine-learned state for detecting, from the image data, a character portion included in the image data.
  • The control unit 11 also executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion.
  • This information processing includes, for example, processing for outputting an image representing each portion, optical character recognition processing for a character string within the range specified by the information identifying a character portion, and division processing for dividing the image data for each frame portion. The operation of the control unit 11 will be described in detail later.
  • The storage unit 12 is a memory device or the like, and holds the program executed by the control unit 11. This program may be provided stored in a computer-readable, non-transitory recording medium and then stored in the storage unit 12.
  • The storage unit 12 also operates as a work memory of the control unit 11.
  • The operation unit 13 is a mouse, a keyboard, or the like, and receives instruction operations from the user and outputs them to the control unit 11.
  • The display unit 14 is, for example, a display or the like, and displays and outputs information based on instructions input from the control unit 11.
  • The input/output unit 15 is, for example, a network interface or the like, and receives data (image data and the like) from the outside and outputs the data to the control unit 11.
  • The input/output unit 15 also sends data to an external device or the like according to instructions input from the control unit 11.
  • By executing the program stored in the storage unit 12, the control unit 11 functionally includes, as illustrated in FIG. 2, a reception unit 21, a detection processing unit 22, a detection information generation unit 23, and an information processing unit 24.
  • The detection processing unit 22 includes a frame detection unit 31, a face detection unit 32, a body detection unit 33, and a character detection unit 34.
  • The reception unit 21 receives cartoon image data and outputs the data to the detection processing unit 22.
  • The cartoon image data is generally image data in which a face portion (F), a body portion (B), and a character portion (C) are drawn overlapping one another (FIG. 3), and includes at least one frame (M).
  • Since the detection processing unit 22 uses a neural network, the reception unit 21 enlarges or reduces the cartoon image data to resize it to a size suitable for the input of the neural network.
  • The frame detection unit 31 of the detection processing unit 22 has a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data.
  • The frame detector included in the frame detection unit 31 can be realized by adopting a neural network configured by various methods, such as R-CNN (Regions with CNN features) (Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014), Fast R-CNN (Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision, 2015), Faster R-CNN (Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems, 2015), YOLO (You Only Look Once) (Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015)), or SSD (Single Shot MultiBox Detector) (Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015)).
  • As schematically shown in FIG. 4, a detector 40 adopting a neural network such as an SSD is configured to include a base network unit 41 and a classifier 42.
  • The base network unit 41 outputs the range of an image that is a candidate for the detection target and a feature amount of the image within that range.
  • The classifier 42 judges, based on the output feature amount, whether or not the detection target (in the case of the frame detection unit 31, a frame line dividing the frames of the cartoon image data) is included in the output image range.
  • The detector 40 adopting such an SSD or the like is machine-learned using samples of image data in which the range to be detected (in the case of the frame detection unit 31, a range of a shape circumscribing the frame lines dividing the frames of the cartoon image data) has been manually specified.
  • Since specific methods of machine learning and of using the detector 40 are widely known, detailed description thereof is omitted here.
  • The face detection unit 32 has a face detector in a machine-learned state for detecting, from the image data, the face portion of a character drawn in the image data.
  • This face detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The face detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the face of a character included in the cartoon image data, has been manually specified.
  • The body detection unit 33 has a body detector in a machine-learned state for detecting, from the image data, the body portion of a character drawn in the image data.
  • This body detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The body detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the body of a character included in the cartoon image data, has been manually specified.
  • The character detection unit 34 has a character detector in a machine-learned state for detecting, from the image data, a character portion drawn in the image data.
  • This character detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD.
  • The character detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing a character portion included in the cartoon image data, has been manually specified.
  • For the cartoon image data received by the reception unit 21, the detection information generation unit 23 generates information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34.
  • The information processing unit 24 executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion.
  • This information processing includes, for example, processing for performing optical character recognition (OCR) on an image of an identified character portion and outputting the result.
  • The information processing unit 24 may also translate a character string obtained as a result of the optical character recognition into another language by machine translation processing and output it.
  • An example of the present embodiment has the above configuration and operates as follows.
  • In the following description, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the control unit 11 adopt an SSD, and each is assumed to have been machine-learned in advance so as to detect, from image data, the frame portion, face portion, body portion, and character portion of the cartoon drawn in the image data.
  • The image processing apparatus 1 takes as a processing target cartoon image data input by the user (data not included in the machine learning samples), and, in parallel on the image data to be processed,
  • the frame detector, the face detector, the body detector, and the character detector respectively detect the frame portion, the face portion of a character, the body portion, and the character portion, and obtain information identifying the range of each detected image.
  • The image processing apparatus 1 then performs predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion; for example,
  • optical character recognition (OCR) is performed on an image of an identified character portion, and a character string obtained as a result of the optical character recognition is translated into another language by machine translation processing and output.
  • In the description so far, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each include an independent base network and detector, but the present embodiment is not limited to this example.
  • For example, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 may share one base network.
  • That is, in this example, as illustrated in FIG. 5, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 share
  • a base network unit 41' that is in a machine-learned state with respect to the ranges of images that are candidates for the detection target and the feature amounts of the images within those ranges, and that outputs, based on the image data to be processed, the range of an image that is a candidate for the detection target and the feature amount of the image within that range, together with classifiers 42a, 42b, 42c, and 42d provided independently corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively.
  • Also in this example, the base network unit 41' and the classifiers 42a, 42b, 42c, and 42d may be neural networks based on an SSD, but the SSD is modified and used in the following respect: in the output stage of a general SSD, a plurality of candidate areas (anchor boxes) for detecting an object are determined in advance (a set of a plurality of anchor boxes is called an anchor set), and the area containing the object of interest is identified from among the plurality of candidate areas.
  • In this example there is one network before the output stage (the base network unit 41'), but in the output stage the anchor set (each anchor set includes, for example, 8732 anchor boxes) is replicated four times, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, and the copies are used as the classifiers 42a, 42b, 42c, and 42d.
  • The anchor boxes in the first anchor set A1 are machine-learned on the frame portions of the image data; the anchor boxes in the second anchor set A2 are machine-learned on the face portions; the anchor boxes in the third anchor set A3 are machine-learned on the body portions;
  • and the anchor boxes in the fourth anchor set A4 are machine-learned on the character portions of the image data (FIG. 6).
  • Specifically, for the information on the anchor boxes in the first anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the frame line of a frame is estimated when a learning sample is input, updating the parameters of the classifier 42a and the base network unit 41'.
  • Similarly, for the information on the anchor boxes in the second anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the face portion of a character is estimated when a learning sample is input, updating the parameters of the classifier 42b and the base network unit 41'.
  • For the information on the anchor boxes in the third anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the body of a character is estimated when a learning sample is input, updating the parameters of the classifier 42c and the base network unit 41'.
  • For the information on the anchor boxes in the fourth anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing a character portion is estimated when a learning sample is input, updating the parameters of the classifier 42d and the base network unit 41'.
  • Here, g is an integer of 1 or more and G(m) or less,
  • G(m) is the number of correct answers included in the m-th sample,
  • and t(m, g) and B(m, g) represent the class of the g-th correct answer of the m-th sample (information indicating whether it is a frame, a face, a body, or a character) and its circumscribed rectangle.
  • The loss function L(z) is set as the sum of the localization error Lloc(m, z) and the confidence term Lconf(m, z).
  • Here, z represents the output of the neural network,
  • and A(m, pos) is the index set of the anchor boxes to which an object is assigned for the m-th sample.
  • Lloc(m, z) and Lconf(m, z) are defined in the description; A(m, neg) is a set of hard negatives, selected from among the anchor boxes to which no object is assigned as the top k|A(m, pos)| anchor boxes in descending order of loss.
  • Also in the image processing apparatus 1 in which the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 are configured as described above,
  • the classifiers 42a, 42b, 42c, and 42d corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 respectively estimate information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion, and predetermined information processing is executed using these.
  • As the operation of the reception unit 21, the control unit 11 may, instead of enlarging or reducing the entire input cartoon image data and resizing it to a size suitable for the input of the neural network,
  • divide the input cartoon image data into a plurality of divided portions based on a predetermined condition, resize the divided portions obtained by the division (partial cartoon image data, hereinafter referred to as partial image data) to a size suitable for the input of the neural network, and output them to the detection processing unit 22.
  • The predetermined condition may be, for example, a condition of dividing the original cartoon image data (width w, height h) into 2 x 2 pieces (four non-overlapping areas, each of width w/2 and height h/2). This condition may also be, for example, a condition of dividing at portions where white (the background color) continues, based on the content of the cartoon image data. Furthermore, the predetermined condition may be a condition of dividing for each frame.
  • In this case, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the detection processing unit 22 detect the frame portion, face portion, body portion, and character portion from each piece of the partial image data.
  • The machine learning process may also be performed using partial image data obtained by division.
  • The detection information generation unit 23 generates, for each piece of partial image data, information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32,
  • information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34, and puts these together to generate information identifying each of the frame portion, face portion, body portion, and character portion in the original cartoon image data.
  • The information processing unit 24 executes predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion generated by the detection information generation unit 23.
  • The predetermined condition for division into partial image data may be a condition of division for each frame.
  • In this case, the control unit 11 may detect frame portions by the operation of the detection processing unit 22 as the frame detection unit 31, and generate partial image data by dividing the image data at the detected frame portions.
  • Information identifying each frame portion output by the frame detection unit 31 (a frame portion being delimited by a frame line, which may include at least one curve such as a circular arc)
  • is output to the face detection unit 32, the body detection unit 33, and the character detection unit 34.
  • Each of the face detection unit 32, the body detection unit 33, and the character detection unit 34 then treats, for each frame portion identified by the information output from the frame detection unit 31, the image of that frame portion as partial image data,
  • and detects the face portion, the body portion, and the character portion from each piece of the partial image data. Also in this example, the machine learning process for the face detection unit 32, the body detection unit 33, and the character detection unit 34 may be performed using partial image data obtained by division.
  • The detection information generation unit 23 generates, for each frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33,
  • and information identifying the character portion detected by the character detection unit 34, and puts these together to generate information identifying each of the frame portion, face portion, body portion, and character portion in the original cartoon image data.
  • The information processing unit 24 executes predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion generated by the detection information generation unit 23.
  • When dividing the image data to be processed, the control unit 11 may also resize the image data before division to a size suitable for the input of the neural network and perform the operation as the detection processing unit 22 on it. That is, as the operations of the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, the control unit 11 detects the frame portion, face portion, body portion, and character portion from the image data before division.
  • The control unit 11 stores the information identifying the frame portion, face portion, body portion, and character portion detected here, then further divides the image data to be processed into a plurality of divided portions based on a predetermined condition, resizes the partial image data obtained by the division to a size suitable for the input of the neural network, and performs the operation as the detection processing unit 22 on it.
  • Using the information identifying the frame portion, face portion, body portion, and character portion detected from the image data before division and the information identifying the face portion, body portion, and character portion detected from each piece of the partial image data after division, if a face portion, a body portion, or a character portion is detected from at least one of the data before division and the data after division, the detection information generation unit 23 generates and outputs information identifying the face portion, body portion, and character portion detected from that at least one (the detection results of the respective portions are integrated and output).
  • That is, the so-called OR of the face, body, and character portions detected from the image data before division and those detected from the image data after division is taken as the face, body, and character portions detected from the image data to be processed, and information identifying them is output.
  • Since the frame portion can generally be detected with higher accuracy than the face portion, the body portion, or the character portion, it is considered sufficient if it is detected from only one of them (the image data before division, or alternatively the image data after division); however, also for the frame portion, the control unit 11 may output information identifying the frame portion when it is detected from at least one of the image data before division and the image data after division.
  • Although the division mode here is of one type, plural types of partial image data obtained by division in plural types of division modes may be generated.
  • For example, information identifying a frame portion, face portion, body portion, and character portion detected from any one or more pieces of partial image data obtained by division in a plurality of modes, such as partial image data obtained by division for each frame and partial image data obtained by division into 2 x 2 (the image data before division may also be added), may be output. Also in this case, when duplication occurs, the duplicates are excluded before output.
  • According to the present embodiment, independent detectors respectively correspond to the frame portion, the face portion of a character, the body portion, and the character portion, which mutually overlap or are included in one another in the cartoon image data,
  • so that the accuracy of detection can be improved as compared with conventional detection using machine learning.
  • Reference Signs List: 1 image processing apparatus, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 input/output unit, 21 reception unit, 22 detection processing unit, 23 detection information generation unit, 24 information processing unit, 31 frame detection unit, 32 face detection unit, 33 body detection unit, 34 character detection unit, 40 detector, 41, 41' base network unit, 42 classifier.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An image processing apparatus that accepts cartoon image data and, using the results of machine learning for identifying each of a frame portion, a face portion, a body portion, and a character portion based on the cartoon image data, generates information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion, the generated information being subjected to predetermined information processing.

Description

Image processing apparatus and program

The present invention relates to an image processing apparatus and program.

In recent years, techniques have been considered for processing cartoon image data, such as extracting dialogue portions and translating them into other languages, and rearranging the frames so as to suit the screen of a smartphone or the like.

In order to perform such processing, it has conventionally been considered to identify drawn person portions and dialogue portions using the results of color analysis, character recognition processing, and the like (Non-Patent Document 1).

Christophe Rigaud, et al., "Speech balloon and speaker association for comics and manga understanding," Proceedings of the 13th International Conference on Document Analysis and Recognition, pp. 351-355, IEEE, 2015.

Meanwhile, in recent years, techniques for detecting objects in images by machine learning have been developed and widely studied. However, conventional general object detection processing performs detection on the premise that a specific part of an image contains a single object, and does not consider cases in which many detection targets overlap one another.

In cartoon image data, however, it is common for a character's body or face to be drawn inside a frame (or across multiple frames), and for dialogue text to be placed overlapping these parts. Therefore, if object detection processing based on machine learning is applied as it is, parts such as frames, characters' bodies, faces, and dialogue cannot each be detected with sufficient accuracy.

The present invention has been made in view of the above circumstances, and one of its objects is to provide an image processing apparatus and program that, in the processing of cartoon image data, can use machine learning processing to improve the recognition accuracy of frame portions, face portions, body portions, and character portions as compared with conventional techniques.

The present invention for solving the problems of the above conventional example is an image processing apparatus comprising: accepting means for receiving cartoon image data; frame detection means in a machine-learned state for detecting, from the image data, a frame portion of a cartoon drawn in the image data; face detection means in a machine-learned state for detecting, from the image data, a face portion drawn in the image data; body detection means in a machine-learned state for detecting, from the image data, a body portion drawn in the image data; character detection means in a machine-learned state for detecting, from the image data, a character portion included in the image data; and detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means, wherein the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion are subjected to predetermined information processing.

According to the present invention, the recognition accuracy when recognizing frame portions, face portions, body portions, and character portions from cartoon image data using machine learning processing can be improved as compared with conventional techniques.

FIG. 1 is a block diagram showing a configuration example of an image processing apparatus according to an embodiment of the present invention. FIG. 2 is a functional block diagram showing an example of the image processing apparatus according to the embodiment of the present invention. FIG. 3 is an explanatory diagram showing an outline example of cartoon image data to be processed by the image processing apparatus according to the embodiment of the present invention. FIG. 4 is an internal functional block diagram showing an outline example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 5 is an internal functional block diagram showing another example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 6 is an explanatory diagram showing a schematic example of processing by the other example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention. FIG. 7 is a functional block diagram showing a configuration example of the detection processing unit of the image processing apparatus according to the embodiment of the present invention.

Embodiments of the present invention will be described with reference to the drawings. As illustrated in FIG. 1, the image processing apparatus 1 according to the embodiment of the present invention includes a control unit 11, a storage unit 12, an operation unit 13, a display unit 14, and an input/output unit 15.

The control unit 11 is a program control device such as a CPU, and by executing a program stored in the storage unit 12, it receives cartoon image data and generates, based on the received cartoon image data, information identifying a frame portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion. In the process of identifying each of these portions, the control unit 11 according to the present embodiment uses a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data; a face detector in a machine-learned state for detecting, from the image data, a face portion drawn in the image data; a body detector in a machine-learned state for detecting, from the image data, a body portion drawn in the image data; and a character detector in a machine-learned state for detecting, from the image data, a character portion included in the image data.

The control unit 11 also executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion. This information processing includes, for example, processing for outputting an image representing each portion, optical character recognition processing for a character string within the range specified by the information identifying a character portion, and division processing for dividing the image data for each frame portion. The operation of the control unit 11 will be described in detail later.

The storage unit 12 is a memory device or the like, and holds the program executed by the control unit 11. This program may be provided stored in a computer-readable, non-transitory recording medium and then stored in the storage unit 12. The storage unit 12 also operates as a work memory of the control unit 11.

The operation unit 13 is a mouse, a keyboard, or the like, and receives instruction operations from the user and outputs them to the control unit 11. The display unit 14 is, for example, a display or the like, and displays and outputs information based on instructions input from the control unit 11.

The input/output unit 15 is, for example, a network interface or the like, and receives data (image data and the like) from the outside and outputs the data to the control unit 11. The input/output unit 15 also sends data to an external device or the like according to instructions input from the control unit 11.

Next, the operation of the control unit 11 will be described. By executing the program stored in the storage unit 12, the control unit 11 according to the present embodiment functionally includes, as illustrated in FIG. 2, a reception unit 21, a detection processing unit 22, a detection information generation unit 23, and an information processing unit 24. The detection processing unit 22 includes a frame detection unit 31, a face detection unit 32, a body detection unit 33, and a character detection unit 34.

The reception unit 21 receives cartoon image data and outputs the data to the detection processing unit 22. Here, the cartoon image data is generally image data in which a face portion (F), a body portion (B), and a character portion (C) are drawn overlapping one another (FIG. 3), and includes at least one frame (M). Since the detection processing unit 22 uses a neural network, the reception unit 21 enlarges or reduces the cartoon image data to resize it to a size suitable for the input of the neural network.
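
As a concrete illustration of this resizing step, the following is a minimal sketch in Python, assuming Pillow is available and assuming a hypothetical 512 x 512 network input size (the patent does not specify a particular size or library).

```python
from PIL import Image  # assumption: Pillow is used for image handling

NETWORK_INPUT_SIZE = (512, 512)  # hypothetical input size; not specified in the patent

def resize_for_network(page_path: str) -> Image.Image:
    """Enlarge or reduce a cartoon page so it fits the neural network input size."""
    page = Image.open(page_path).convert("RGB")
    # Bilinear resampling is one reasonable choice; the patent does not prescribe one.
    return page.resize(NETWORK_INPUT_SIZE, Image.BILINEAR)
```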

The frame detection unit 31 of the detection processing unit 22 has a frame detector in a machine-learned state for detecting, from the image data, a frame portion of the cartoon drawn in the image data. Specifically, the frame detector included in the frame detection unit 31 can be realized by adopting a neural network configured by various methods, such as R-CNN (Regions with CNN features) (Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014), Fast R-CNN (Girshick, Ross. "Fast R-CNN." Proceedings of the IEEE International Conference on Computer Vision, 2015), Faster R-CNN (Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems, 2015), YOLO (You Only Look Once) (Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015)), or SSD (Single Shot MultiBox Detector) (Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015)).

As schematically shown in FIG. 4, a detector 40 adopting a neural network such as an SSD is configured to include a base network unit 41 and a classifier 42. Here, the base network unit 41 outputs the range of an image that is a candidate for the detection target and a feature amount of the image within that range. The classifier 42 judges, based on the output feature amount, whether or not the detection target (in the case of the frame detection unit 31, a frame line dividing the frames of the cartoon image data) is included in the output image range.

The detector 40 adopting such an SSD or the like is machine-learned using samples of image data in which the range to be detected (in the case of the frame detection unit 31, a range of a shape circumscribing the frame lines dividing the frames of the cartoon image data) has been manually specified. Since specific methods of machine learning and of using the detector 40 are widely known, detailed description thereof is omitted here.
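
To make the shape of such manually annotated training samples concrete, the following is a minimal sketch of one possible annotation record, assuming axis-aligned circumscribing rectangles stored as (x, y, width, height); the field names and file layout are illustrative assumptions, not the patent's own format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical class labels for the four detection targets described in the text.
CLASSES = ("frame", "face", "body", "text")

@dataclass
class Annotation:
    """One manually specified circumscribing rectangle on a cartoon page."""
    label: str                      # one of CLASSES
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels

@dataclass
class TrainingSample:
    """One annotated page used for machine learning of the detectors."""
    image_path: str
    annotations: List[Annotation]

sample = TrainingSample(
    image_path="pages/page_001.png",  # hypothetical path
    annotations=[
        Annotation("frame", (10, 12, 480, 300)),
        Annotation("face", (60, 40, 90, 90)),
        Annotation("body", (50, 40, 160, 250)),
        Annotation("text", (300, 30, 120, 160)),
    ],
)
```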

The face detection unit 32 has a face detector in a machine-learned state for detecting, from the image data, the face portion of a character drawn in the image data. This face detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The face detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the face of a character included in the cartoon image data, has been manually specified.

The body detection unit 33 has a body detector in a machine-learned state for detecting, from the image data, the body portion of a character drawn in the image data. This body detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The body detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing the body of a character included in the cartoon image data, has been manually specified.

The character detection unit 34 has a character detector in a machine-learned state for detecting, from the image data, a character portion drawn in the image data. This character detector, like the frame detector included in the frame detection unit 31, can also be realized by adopting a neural network configured by various methods such as an SSD. The character detector is machine-learned using samples of image data in which the range to be detected, namely a range of a predetermined shape circumscribing a character portion included in the cartoon image data, has been manually specified.

For the cartoon image data received by the reception unit 21, the detection information generation unit 23 generates information identifying the frame portion detected by the frame detection unit 31, information identifying the face portion detected by the face detection unit 32, information identifying the body portion detected by the body detection unit 33, and information identifying the character portion detected by the character detection unit 34.

The information processing unit 24 executes predetermined information processing using the generated information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion. This information processing includes, for example, processing for performing optical character recognition (OCR) on an image of an identified character portion and outputting the result. The information processing unit 24 may also translate a character string obtained as a result of the optical character recognition into another language by machine translation processing and output it.
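
As one possible realization of this OCR step, the following is a minimal sketch assuming pytesseract (a Tesseract OCR wrapper) is installed with Japanese language data, and assuming a hypothetical translate() helper standing in for the machine translation processing; the patent does not name a specific OCR engine or translation service.

```python
from PIL import Image
import pytesseract  # assumption: Tesseract OCR with Japanese data is installed

def read_text_region(page: Image.Image, box: tuple) -> str:
    """Crop one detected character region and run OCR on it."""
    x, y, w, h = box                        # circumscribing rectangle of the text portion
    region = page.crop((x, y, x + w, y + h))
    return pytesseract.image_to_string(region, lang="jpn").strip()

def translate(text: str, target_lang: str = "en") -> str:
    """Placeholder for the machine translation processing; any MT backend could be used."""
    raise NotImplementedError("plug in a machine translation backend here")
```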

An example of the present embodiment has the above configuration and operates as follows. In the following description, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the control unit 11 adopt an SSD, and each is assumed to have been machine-learned in advance so as to detect, from image data, the frame portion, face portion, body portion, and character portion of the cartoon drawn in the image data.

The image processing apparatus 1 takes as a processing target cartoon image data input by the user (data not included in the machine learning samples), and, in parallel on the image data to be processed, the frame detector, the face detector, the body detector, and the character detector respectively detect the frame portion, the face portion of a character, the body portion, and the character portion, and obtain information identifying the range of each detected image.

The image processing apparatus 1 then performs predetermined information processing using the information identifying the frame portion, the information identifying the face portion, the information identifying the body portion, and the information identifying the character portion; for example, optical character recognition (OCR) is performed on an image of an identified character portion, and a character string obtained as a result of the optical character recognition is translated into another language by machine translation processing and output.

[Example of sharing a base network]
In the description so far, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each include an independent base network and detector, but the present embodiment is not limited to this example. For example, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 may share one base network.

That is, in this example, as illustrated in FIG. 5, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 share a base network unit 41' that is in a machine-learned state with respect to the ranges of images that are candidates for the detection target and the feature amounts of the images within those ranges, and that outputs, based on the image data to be processed, the range of an image that is a candidate for the detection target and the feature amount of the image within that range; in addition, classifiers 42a, 42b, 42c, and 42d are provided independently, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively.

Also in this example, the base network unit 41' and the classifiers 42a, 42b, 42c, and 42d may be neural networks based on an SSD, but the SSD is modified and used in the following respect. That is, in the output stage of a general SSD, a plurality of candidate areas (anchor boxes) for detecting an object are determined in advance (a set of a plurality of anchor boxes is called an anchor set), and the area containing the object of interest is identified from among the plurality of candidate areas.

In this example of the present embodiment, there is one network before the output stage (the base network unit 41'), but in the output stage the anchor set (each anchor set includes, for example, 8732 anchor boxes) is replicated four times, corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, and the copies are used as the classifiers 42a, 42b, 42c, and 42d. That is, in the SSD of this example, for each anchor box, a total of five-dimensional information is output, consisting of a localization term with respect to the region of the object detected in that anchor box (four-dimensional information comprising the upper-left coordinates and the width and height) and a confidence that an object may be included (here normalized by a sigmoid function). Of the anchor boxes in the four replicated anchor sets, the anchor boxes in the first anchor set A1 are machine-learned on the frame portions of the image data, the anchor boxes in the second anchor set A2 are machine-learned on the face portions, the anchor boxes in the third anchor set A3 are machine-learned on the body portions, and the anchor boxes in the fourth anchor set A4 are machine-learned on the character portions (FIG. 6).
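
The following is a minimal PyTorch-style sketch of this shared structure, assuming a single feature map and a hypothetical number of anchors per location; the real SSD uses several feature maps and 8732 anchor boxes in total, and the layer sizes here are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SharedBaseDetector(nn.Module):
    """One shared base network with four per-class heads (frame, face, body, text).

    Each head plays the role of one classifier 42a-42d: for every anchor box it
    predicts 4 box offsets plus 1 confidence value (5 values per anchor).
    """

    def __init__(self, anchors_per_location: int = 6):
        super().__init__()
        self.base = nn.Sequential(            # stand-in for the base network unit 41'
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        out = anchors_per_location * 5        # (dx, dy, dw, dh, confidence) per anchor
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(256, out, 3, padding=1)
            for name in ("frame", "face", "body", "text")
        })

    def forward(self, images: torch.Tensor) -> dict:
        features = self.base(images)          # shared features for all four targets
        return {name: head(features) for name, head in self.heads.items()}
```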

Specifically, for the information on the anchor boxes in the first anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the frame line of a frame is estimated when a learning sample is input, updating the parameters of the classifier 42a and the base network unit 41'.

Similarly, for the information on the anchor boxes in the second anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the face portion of a character is estimated when a learning sample is input, updating the parameters of the classifier 42b and the base network unit 41'. For the information on the anchor boxes in the third anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing the body of a character is estimated when a learning sample is input, updating the parameters of the classifier 42c and the base network unit 41'. Furthermore, for the information on the anchor boxes in the fourth anchor set output by the output stage, the error is back-propagated from the output stage so that a rectangle circumscribing a character portion is estimated when a learning sample is input, updating the parameters of the classifier 42d and the base network unit 41'.
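
In practice, back-propagating the error from each of the four anchor sets into the shared base network can be done by summing one loss per head before calling backward, as in the hedged sketch below; the per-head loss function is only sketched later in this description, and the optimizer settings are assumptions.

```python
import torch

def train_step(model, optimizer, images, targets, head_loss):
    """One update: a loss is computed per head (frame/face/body/text) and their sum
    is back-propagated, so both the heads and the shared base network are updated."""
    optimizer.zero_grad()
    outputs = model(images)                       # dict: head name -> raw predictions
    total_loss = sum(
        head_loss(outputs[name], targets[name])   # targets[name]: ground truth for that head
        for name in ("frame", "face", "body", "text")
    )
    total_loss.backward()                          # errors flow into heads and base network
    optimizer.step()
    return float(total_loss)
```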

For the a-th anchor box (a = 1, 2, ..., 8732) in the i-th (i = 1, 2, 3, 4) anchor set and the m-th sample (assuming mini-batch learning with a mini-batch size of M, m = 1, 2, ..., M), the assignment s(m, i, a) and its overlap J(m, i, a) are defined as follows.

[Equation 1 (image JPOXMLDOC01-appb-M000001, not reproduced)]

Here, g is an integer of 1 or more and G(m) or less, G(m) is the number of correct answers included in the m-th sample, and t(m, g) and B(m, g) represent the class of the g-th correct answer of the m-th sample (information indicating whether it is a frame, a face, a body, or a character) and its circumscribed rectangle.
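
The exact definitions of s(m, i, a) and J(m, i, a) are given by Equation 1, which is only available as an image in this text; the sketch below shows a common IoU-based anchor assignment of the same general kind, offered as an assumption rather than the patent's precise formula.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) rectangles."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2, bx2, by2 = ax1 + aw, ay1 + ah, bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, return the index of the best-overlapping ground-truth box
    (or -1 if the best overlap is below the threshold) and that overlap value."""
    assignment, overlap = [], []
    for anchor in anchors:
        ious = [iou(anchor, gt) for gt in gt_boxes] or [0.0]
        best = max(range(len(ious)), key=lambda g: ious[g])
        overlap.append(ious[best])
        assignment.append(best if ious[best] >= threshold else -1)
    return assignment, overlap
```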

 The loss function L(z) is then set as the sum of the localization error Lloc(m, z) and the confidence term Lconf(m, z), as follows.

[Math 2]

Here, z denotes the output of the neural network, and A(m, pos) is the index set of the anchor boxes to which an object has been assigned for the m-th sample; specifically, it may be set as

[Math 3]

or the like.
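
 Again the formulas themselves appear only as images ([Math 2] and [Math 3]); under the usual SSD formulation they would plausibly take a form such as the following, where the positive-matching threshold of 0.5 is an assumption rather than a value stated in the publication:

```latex
L(z) = \frac{1}{M} \sum_{m=1}^{M} \Bigl( L_{\mathrm{conf}}(m,z) + L_{\mathrm{loc}}(m,z) \Bigr),
\qquad
A(m,\mathrm{pos}) = \bigl\{ (i,a) \;:\; J(m,i,a) \ge 0.5 \bigr\}
```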

 Lloc(m, z) and Lconf(m, z) are defined as follows.

[Math 4]

Here, A(m, neg) is the set of hard negatives, obtained by selecting, from among the anchor boxes to which no object has been assigned, the top k|A(m, pos)| boxes in descending order of l(m, i, a, z) (this is the method known as hard negative mining, so a detailed description is omitted here). Also, huber() is the Huber function; since this function is likewise widely known, a detailed description is omitted here.
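
 The definitions behind [Math 4] are likewise available only as an image; consistent with the description above (a Huber penalty on the positive matches for localization, and a cross-entropy-style confidence term over positives and mined hard negatives with sigmoid-normalized confidences c), one plausible reconstruction is:

```latex
L_{\mathrm{loc}}(m,z) = \sum_{(i,a)\in A(m,\mathrm{pos})} \mathrm{huber}\bigl( z_{\mathrm{loc}}(m,i,a) - \hat g(m,i,a) \bigr),
\qquad
L_{\mathrm{conf}}(m,z) = \sum_{(i,a)\in A(m,\mathrm{pos})} -\log c(m,i,a)
 \;+\; \sum_{(i,a)\in A(m,\mathrm{neg})} l(m,i,a,z),
\quad \text{with } l(m,i,a,z) = -\log\bigl(1 - c(m,i,a)\bigr)
```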

 Also in the image processing apparatus 1 of the present embodiment, in which the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 are configured as described above, the classifiers 42a, 42b, 42c, and 42d corresponding to the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, respectively, estimate information identifying the frame portions, information identifying the face portions, information identifying the body portions, and information identifying the character portions, and predetermined information processing is executed using these pieces of information.

[Example of dividing image data]
 As the operation of the receiving unit 21, the control unit 11 of the present embodiment may, instead of enlarging or reducing the entire input cartoon image data to resize it to a size suitable for input to the neural network, divide the input cartoon image data into a plurality of divided portions based on a predetermined condition, resize the divided portions obtained by the division (partial cartoon image data, hereinafter referred to as partial image data) to a size suitable for input to the neural network, and output them to the detection processing unit 22.

 Here, the predetermined condition may be, for example, a condition that the original cartoon image data (width w, height h) is divided into 2 × 2 parts (four non-overlapping regions, each of width w/2 and height h/2). The condition may also be based on the content of the cartoon image data, for example, a condition that the data is divided at portions where white (the background color) continues. Further, the predetermined condition may be a condition that the data is divided panel by panel (frame by frame).
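
 As an illustration of the 2 × 2 division described above, a minimal sketch assuming the page is held as a NumPy-style array (the function name and interface are illustrative) could be:

```python
def split_into_tiles(page, rows=2, cols=2):
    """Split a page image (H x W [x C] array) into non-overlapping tiles.

    Returns a list of (tile, (x_offset, y_offset)) pairs so that boxes
    detected in a tile can later be mapped back to page coordinates.
    """
    h, w = page.shape[:2]
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            tiles.append((page[y0:y1, x0:x1], (x0, y0)))
    return tiles
```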

 In this example of the present embodiment, the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 of the detection processing unit 22 detect frame portions (when the division is not panel by panel), face portions, body portions, and character portions from each piece of partial image data. In this example, the machine learning process may also be performed using the partial image data obtained by the division.

 The detection information generation unit 23 then generates, for each piece of partial image data, information identifying the frame portions detected by the frame detection unit 31, information identifying the face portions detected by the face detection unit 32, information identifying the body portions detected by the body detection unit 33, and information identifying the character portions detected by the character detection unit 34, and combines these to generate information identifying each of the frame portions, face portions, body portions, and character portions in the original cartoon image data.
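
 A sketch of how the per-tile results could be collected back into the coordinate system of the original page, assuming a detector callable that returns a dictionary of (x, y, w, h, score) boxes per class (this interface is an assumption made only for illustration), follows:

```python
def detect_on_tiles(detector, tiles):
    """Run the detectors on each tile and collect the results in page
    coordinates.  `detector(tile)` is assumed to return a dict mapping a
    class name ("frame", "face", "body", "character") to a list of
    (x, y, w, h, score) boxes in tile coordinates."""
    merged = {}
    for tile, (ox, oy) in tiles:
        for cls, boxes in detector(tile).items():
            merged.setdefault(cls, []).extend(
                (x + ox, y + oy, w, h, score) for (x, y, w, h, score) in boxes
            )
    return merged
```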

 The information processing unit 24 executes predetermined information processing using the information identifying the frame portions, the information identifying the face portions, the information identifying the body portions, and the information identifying the character portions generated by the detection information generation unit 23.

 As described above, the predetermined condition for dividing the data into partial image data may be a condition that the data is divided panel by panel. In this case, the control unit 11 may detect the frame portions through its operation as the frame detection unit 31 of the detection processing unit 22 and generate the partial image data by dividing the data for each detected frame portion.

 That is, in this example of the present embodiment, as illustrated in FIG. 7, the frame detection unit 31 outputs information identifying a polygon (which may at least partly include a curve such as a circular arc) circumscribing each frame portion (the border lines delimiting the panels) to the face detection unit 32, the body detection unit 33, and the character detection unit 34. Then, for each frame portion identified by the information output from the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34 each treat that frame portion as partial image data and detect face portions, body portions, and character portions from each piece of partial image data. In this example as well, the machine learning process for the face detection unit 32, the body detection unit 33, and the character detection unit 34 may be performed using the partial image data obtained by the division.

 In this example as well, the detection information generation unit 23 generates, for each frame portion detected by the frame detection unit 31, information identifying the face portions detected by the face detection unit 32, information identifying the body portions detected by the body detection unit 33, and information identifying the character portions detected by the character detection unit 34, and combines these to generate information identifying each of the frame portions, face portions, body portions, and character portions in the original cartoon image data.
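
 Similarly, when the panels themselves are used as the divided portions, detections from each panel crop can be shifted back into page coordinates. A sketch under the same assumed detector interface, and assuming the frame detections are axis-aligned (x, y, w, h, score) rectangles, is:

```python
def detect_per_panel(page, frame_boxes, detector):
    """Crop each detected frame (panel) and run the face/body/character
    detectors on the crop, shifting the results back into page coordinates."""
    results = {}
    for (px, py, pw, ph, _score) in frame_boxes:
        x0, y0 = int(px), int(py)
        crop = page[y0:y0 + int(ph), x0:x0 + int(pw)]
        for cls, boxes in detector(crop).items():
            results.setdefault(cls, []).extend(
                (x + x0, y + y0, w, h, s) for (x, y, w, h, s) in boxes
            )
    return results
```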

 The information processing unit 24 executes predetermined information processing using the information identifying the frame portions, the information identifying the face portions, the information identifying the body portions, and the information identifying the character portions generated by the detection information generation unit 23.

[Combining detection results]
 When dividing the image data to be processed, the control unit 11 may also resize the image data before division to a size suitable for input to the neural network and perform the operation of the detection processing unit 22 on it. That is, as the operations of the frame detection unit 31, the face detection unit 32, the body detection unit 33, and the character detection unit 34, the control unit 11 detects frame portions, face portions, body portions, and character portions from the image data before division.

 The control unit 11 then stores the information identifying the frame portions, face portions, body portions, and character portions detected here, further divides the image data to be processed into a plurality of divided portions based on a predetermined condition, resizes each piece of partial image data obtained by the division to a size suitable for input to the neural network, and performs the operation of the detection processing unit 22 on it.

 According to this example, information identifying the frame portions, face portions, body portions, and character portions detected from the image data before division, and information identifying the face portions, body portions, and character portions detected from each piece of partial image data obtained after division, are both obtained.

 Then, using the information identifying the frame portions, face portions, body portions, and character portions detected from the image data before division and the information identifying the face portions, body portions, and character portions detected from each piece of partial image data after division, if a face portion, body portion, or character portion is detected from at least one of the data before division and the data after division, the detection information generation unit 23 generates and outputs information identifying the face portion, body portion, or character portion detected from that at least one (that is, the detection results for each kind of portion are integrated and output).

 In this example, in effect, the logical OR of the face portions, body portions, and character portions detected from the image data before division and those detected from the image data after division is output, as information identifying the face portions, body portions, and character portions detected from the image data to be processed.

 Since frame portions can generally be detected with higher accuracy than face portions, body portions, or character portions, it is considered sufficient to detect them from only one of the image data before division (or the image data after division). However, the control unit 11 may also output information identifying a frame portion when it is detected from at least one of the image data before division and the image data after division.

 When outputting information on the portions detected from either of the two in this way (frame portions, face portions, body portions, and character portions), overlapping detections of the same portion are removed before output.

 Here, information identifying the frame portions, face portions, body portions, and character portions detected from either the image data before division or the image data after division is output. That is, for example, when the image data to be processed is the image data of one page of a cartoon, the OR (logical sum) of what is detected over the entire page and what is detected in each divided portion is taken. However, this example of the present embodiment is not limited to this, and information identifying the frame portions, face portions, body portions, and character portions detected in common from both the image data before division and the image data after division may be output (that is, for example, when the image data to be processed is the image data of one page of a cartoon, the AND (logical product) of what is detected over the entire page and what is detected in each divided portion may be taken).
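
 A sketch of this integration step, assuming axis-aligned (x, y, w, h, ...) boxes and treating two boxes whose intersection over union exceeds a threshold as duplicates of the same detection (the threshold value of 0.5 is an assumption), is:

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h, ...) boxes."""
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def merge_detections(full_page, tiled, mode="or", thr=0.5):
    """Combine detections from the whole page and from the divided portions.

    mode="or":  keep a box found in either pass, dropping duplicates
                (boxes whose IoU exceeds thr count as the same detection).
    mode="and": keep only boxes found, up to IoU thr, in both passes.
    """
    if mode == "or":
        merged = list(full_page)
        for box in tiled:
            if all(iou(box, kept) < thr for kept in merged):
                merged.append(box)
        return merged
    return [box for box in full_page
            if any(iou(box, other) >= thr for other in tiled)]
```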

 Furthermore, although a single manner of division is used here, plural kinds of partial image data obtained by dividing in plural manners may be generated. For example, information identifying the frame portions, face portions, body portions, and character portions detected from at least one of (or in common from each of) the pieces of partial image data obtained by dividing in plural manners, such as partial image data obtained by dividing panel by panel and partial image data obtained by a 2 × 2 four-way division (the image data before division may also be included), may be output. In this case as well, if duplicates occur, they are removed before output.

[Effects of the embodiment]
 As described above, according to the present embodiment, the frame portions, character face portions, body portions, and character (text) portions, which overlap one another or stand in inclusion relations within cartoon image data, are each detected by an independent detector (or classifier) corresponding to that kind of portion, so the detection accuracy can be improved compared with conventional detection using machine learning.

Reference Signs List
 1 image processing apparatus, 11 control unit, 12 storage unit, 13 operation unit, 14 display unit, 15 input/output unit, 21 receiving unit, 22 detection processing unit, 23 detection information generation unit, 24 information processing unit, 31 frame detection unit, 32 face detection unit, 33 body detection unit, 34 character detection unit, 40 detector, 41, 41' base network unit, 42 classifier.

Claims (7)

1.  An image processing apparatus comprising:
 receiving means for receiving cartoon image data;
 frame detection means in a state of having been machine-learned to detect, from image data, a frame portion of a cartoon drawn in the image data;
 face detection means in a state of having been machine-learned to detect, from image data, a face portion drawn in the image data;
 body detection means in a state of having been machine-learned to detect, from image data, a body portion drawn in the image data;
 character detection means in a state of having been machine-learned to detect, from image data, a character portion included in the image data; and
 detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means,
 wherein the generated information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion are provided for predetermined information processing.
2.  The image processing apparatus according to claim 1, wherein the frame detection means, the face detection means, the body detection means, and the character detection means include:
 a base network unit, common to all of them, that is in a state of having machine-learned ranges of images that are candidates for detection and feature quantities of the images within those ranges, and that outputs, based on the image data to be processed, ranges of images that are candidates for detection and feature quantities of the images within those ranges; and
 classifiers provided in correspondence with the frame detection means, the face detection means, the body detection means, and the character detection means, respectively, each of which classifies, based on the feature quantities of the images, whether an image included in the corresponding image range is a frame portion, a face portion, a body portion, or a character portion, respectively.
3.  The image processing apparatus according to claim 1 or 2, wherein the detection information generation means generates, for each of divided portions obtained by dividing the received cartoon image data into a plurality of divided portions based on a predetermined condition, information identifying a face portion, information identifying a body portion, and information identifying a character portion from each of the divided portions by means of the face detection means, the body detection means, and the character detection means.
4.  The image processing apparatus according to claim 3, wherein the detection information generation means uses the information identifying the frame portions detected by the frame detection means to treat each of the identified frame portions as one of the divided portions, and generates, for each frame portion serving as a divided portion, information identifying a face portion, information identifying a body portion, and information identifying a character portion from each of the divided portions by means of the face detection means, the body detection means, and the character detection means.
5.  The image processing apparatus according to claim 3 or 4, wherein the detection information generation means generates information identifying a face portion, information identifying a body portion, and information identifying a character portion detected by the face detection means, the body detection means, and the character detection means from the received cartoon image data before division, and
 integrates and outputs the generated information together with the information identifying a face portion, the information identifying a body portion, and the information identifying a character portion detected from each of the divided portions.
6.  The image processing apparatus according to any one of claims 1 to 5, wherein the frame detection means, the face detection means, the body detection means, and the character detection means are configured using a single shot multibox detector (SSD).
7.  A program for causing a computer to function as:
 receiving means for receiving cartoon image data;
 frame detection means in a state of having been machine-learned to detect, from image data, a frame portion of a cartoon drawn in the image data;
 face detection means in a state of having been machine-learned to detect, from image data, a face portion drawn in the image data;
 body detection means in a state of having been machine-learned to detect, from image data, a body portion drawn in the image data;
 character detection means in a state of having been machine-learned to detect, from image data, a character portion included in the image data; and
 detection information generation means for generating, based on the received cartoon image data, information identifying the frame portion detected by the frame detection means, information identifying the face portion detected by the face detection means, information identifying the body portion detected by the body detection means, and information identifying the character portion detected by the character detection means,
 wherein the generated information identifying the frame portion, information identifying the face portion, information identifying the body portion, and information identifying the character portion are provided for predetermined information processing.
PCT/JP2018/032635 2017-09-04 2018-09-03 Image processing device and program Ceased WO2019045101A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-169632 2017-09-04
JP2017169632A JP2019046253A (en) 2017-09-04 2017-09-04 Information processing apparatus and program

Publications (1)

Publication Number Publication Date
WO2019045101A1

Family

ID=65527562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/032635 Ceased WO2019045101A1 (en) 2017-09-04 2018-09-03 Image processing device and program

Country Status (2)

Country Link
JP (1) JP2019046253A (en)
WO (1) WO2019045101A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7626384B2 (en) * 2020-04-15 2025-02-04 ネットスター株式会社 Trained model, site determination program, and site determination system
JP7324475B1 (en) 2022-10-20 2023-08-10 株式会社hotarubi Information processing device, information processing method and information processing program
JP7802902B1 (en) * 2024-12-12 2026-01-20 株式会社Nttドコモ Information processing device and translation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002229766A (en) * 2000-11-29 2002-08-16 Eastman Kodak Co Method for sending image to low-display function terminal
JP2011238043A (en) * 2010-05-11 2011-11-24 Kddi Corp Summarized comic image generation device, program and method for generating summary of comic content

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FUJIMOTO, AZUMA ET AL.: "Creation and Analysis of Academic Manga Image Dataset with Annotations", IEICE TECHNICAL REPORT, vol. 116, no. 64, 15 March 2017 (2017-03-15), pages 35 - 40 *
FUKUI, HIROSHI ET AL.: "Research Trends in Pedestrian Detection Using Deep Learning", IEICE TECHNICAL REPORT, vol. 116, no. 366, 17 January 2017 (2017-01-17), pages 37 - 46 *
YANAGISAWA, HIDEAKI ET AL.: "Structural Analysis of Comic Images using Faster R-CNN", PCSJ/IMPS 2016, 30 November 2016 (2016-11-30), pages 80 - 81 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN112070082B (en) * 2020-08-24 2023-04-07 西安理工大学 Curve character positioning method based on instance perception component merging network

Also Published As

Publication number Publication date
JP2019046253A (en) 2019-03-22

Legal Events

Code 121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18852176; Country of ref document: EP; Kind code of ref document: A1)

Code NENP: Non-entry into the national phase (Ref country code: DE)

Code 122: Ep: pct application non-entry in european phase (Ref document number: 18852176; Country of ref document: EP; Kind code of ref document: A1)