
CN117809221A - Object detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117809221A
Authority
CN
China
Prior art keywords
target
scene
video
same
target image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311864670.3A
Other languages
Chinese (zh)
Inventor
谢煊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202311864670.3A
Publication of CN117809221A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an object detection method and apparatus, an electronic device, and a storage medium. Object detection is performed on a target video to obtain, for each target image group, target detection results such as the position information of the same second object contained in the consecutive multi-frame target images of the group and category prediction information indicating that the second object belongs to an object category predefined for implanting a first object. After the target image groups belonging to each scene video into which the target video is divided are determined, the target image groups corresponding to the same second object can be obtained from the target image groups and target detection results of each scene video. An accurate and complete detection result of the second object in the target video is then obtained from the target detection results and the target image groups corresponding to the same second object, improving the efficiency with which the first object is implanted based on the detection result and avoiding continuity errors in which the first object is implanted in only some of the frames where it should appear.

Description

Object detection method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of multimedia applications, and in particular, to an object detection method and apparatus, an electronic device, and a storage medium.
Background
In recent years, embedding advertisements and other objects in films and TV dramas has become a new form of promotion and is widely used in productions at home and abroad. So as not to disturb the audience's viewing, advertisements are currently embedded naturally into the scenes of the production, for example by replacing advertisement slots such as billboards or display screens with the corresponding advertisements to be embedded.
To determine the advertisement slots in videos such as films and TV dramas, a worker usually has to review every image frame of the video repeatedly. The time cost is too high, and human oversight easily leaves slots detected incompletely, producing placement continuity errors (the advertisement appearing in some frames of a slot but missing in others), reducing the recallable duration of the advertisement slots and hurting placement revenue.
Disclosure of Invention
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
the application provides an object detection method, which comprises the following steps:
obtaining a target video; the target video is a video which needs to be implanted into a first object;
performing object detection on the target video to obtain a target detection result of each target image group; wherein each target image group comprises a plurality of continuous target images in the target video, the target images comprise the same second object, the target detection result at least comprises position information and category prediction information of the second object in each frame of the target images, the category prediction information is a result of predicting that the second object belongs to a target category, and the target category is a predefined object category used for implanting the first object;
dividing the target video into a plurality of scene videos, and determining each target image group belonging to the same scene video; one scene video contains consecutive multi-frame images in the target video;
obtaining each target image group corresponding to the same second object from a plurality of second objects contained in each target image group according to each target image group belonging to each scene video and the target detection result;
and obtaining the detection result of the same second object in the target video according to the target detection result and each target image group corresponding to the same second object, and implanting the first object of the corresponding category according to the detection result.
Optionally, the performing object detection on the target video to obtain a target detection result of each target image group includes:
inputting the target video into a target detection network to obtain the position information of the second object contained in each frame of target image in each target image group, and the category prediction information of the second object belonging to the target category;
obtaining the frame numbers of the multi-frame target images contained in each target image group;
and forming the target detection result of each target image group from the frame numbers, the position information and the category prediction information corresponding to that target image group.
Optionally, the dividing the target video into a plurality of scene videos, determining each of the target image groups belonging to the same scene video includes:
performing scene segmentation on the target video to obtain corresponding scene segmentation information; the scene segmentation information characterizes continuous multi-frame images contained in each of a plurality of scene videos into which the target video is segmented;
and determining each target image group belonging to the same scene video based on the scene segmentation information and the frame sequence numbers corresponding to each target image group.
Optionally, the obtaining, according to each target image group belonging to each scene video and the target detection result, each target image group corresponding to the same second object from the second objects contained in each target image group includes:
obtaining the frame sequence numbers corresponding to the target image groups belonging to the same scene video;
according to the position information of the second object and the frame sequence number corresponding to the same scene video, similar object clustering is carried out on multiple frames of target images belonging to the same scene video, and an object clustering result belonging to the same second object in the corresponding scene video is obtained;
according to the object clustering results of the plurality of scene videos and the category prediction information corresponding to the second object, performing scene similar clustering on the target images of the plurality of scene videos containing second objects of the same target category, to obtain scene clustering results of the plurality of scene videos containing the same second object;
and determining each target image group corresponding to the same second object in the target video according to the scene clustering result.
Optionally, the clustering of similar objects is performed on the multiple frames of the target images belonging to the same scene video according to the position information of the second object and the frame number corresponding to the same scene video, so as to obtain an object clustering result belonging to the same second object in the corresponding scene video, including:
obtaining images to be processed with the same preset frame number from each target image group according to the frame numbers corresponding to each target image group belonging to the same scene video;
extracting a second object region image corresponding to the to-be-processed image by using the position information of the second object in each to-be-processed image;
according to the category prediction information of the second object in the images to be processed, performing similarity calculation on the second object region images corresponding to second objects of the same target category extracted from the same scene video, so as to obtain the corresponding object similarity;
and obtaining an object clustering result belonging to the same second object in the corresponding scene video according to the object similarity and the target detection result.
Optionally, the performing similar scene clustering on the target images of the plurality of scene videos including the same target class second object according to the respective object clustering results of the plurality of scene videos and the class prediction information corresponding to the second object, to obtain a scene clustering result of the plurality of scene videos including the same second object, includes:
obtaining a frame of image to be detected corresponding to the maximum category prediction information of the second object from each target image group aiming at the same second object in the corresponding scene video according to the object clustering result of each scene video and the category prediction information of the second object;
performing similarity calculation on the images to be detected of frames corresponding to the same target class second object in the plurality of scene videos to obtain corresponding scene similarity;
and clustering the same second object in the plurality of scene videos according to the scene similarity, to obtain a scene clustering result of the same second object in the target video.
Optionally, the performing similarity calculation on the to-be-detected images of frames corresponding to the same target class second object in the plurality of scene videos to obtain corresponding scene similarity includes:
inputting the obtained images to be detected of each frame into a scene similarity detection network, and carrying out similarity calculation on any two frames of the images to be detected containing the second object of the same target class to obtain corresponding scene similarity;
clustering the same second object in the plurality of scene videos according to the scene similarity to obtain a scene clustering result of the same second object in the target video, wherein the method comprises the following steps:
determining each scene video with the scene similarity larger than a scene similarity threshold as a group of scene videos to be clustered;
and aggregating the second objects of the same target class in each group of the scene videos to be clustered into the same second object to obtain scene clustering results of the same second object in the plurality of scene videos.
The application also proposes an object detection device comprising: the target video acquisition module is used for acquiring a target video; the target video is a video which needs to be implanted into a first object;
the target detection result obtaining module is used for carrying out object detection on the target video to obtain a target detection result of each target image group; wherein each target image group comprises a plurality of continuous target images in the target video, the target images comprise the same second object, the target detection result at least comprises position information and category prediction information of the second object in each frame of the target images, the category prediction information is a result of predicting that the second object belongs to a target category, and the target category is a predefined object category used for implanting the first object;
the scene video segmentation module is used for segmenting the target video into a plurality of scene videos and determining each target image group belonging to the same scene video; one of the scene videos contains successive multi-frame images in the target video;
the target image processing module is used for obtaining each target image group corresponding to the same second object from a plurality of second objects contained in each target image group according to each target image group belonging to each scene video and the target detection result;
the detection result obtaining module is used for obtaining the detection result of the same second object in the target video according to the target detection result and each target image group corresponding to the same second object, for implanting the first object of the corresponding category according to the detection result.
The application also proposes an electronic device comprising: at least one memory and at least one processor, wherein:
the processor is configured to load and execute the computer instructions stored in the memory, so as to implement the object detection method.
The present application also proposes a computer readable storage medium having stored thereon a computer program, which is loaded and executed by a processor, implementing an object detection method as described above.
Therefore, by performing object detection on the target video, the target detection results of the plurality of target image groups contained in the target video are obtained quickly and accurately. Each target image group contains consecutive multi-frame target images in which the same second object appears, and the corresponding target detection result can include the position information of the second object in each frame of target image and the category prediction information that the second object belongs to an object category used for implanting the first object (i.e., a target category). No manual review of each frame is needed, which greatly improves object detection efficiency. The target image groups corresponding to the same second object can then be obtained from the target image groups belonging to each scene video and the target detection results, so that an accurate and complete detection result of the second object in the target video is obtained from the target detection results and the target image groups corresponding to the same second object. This shortens the time required to recall the second object, improves the efficiency of implanting the first object based on the detection result, and avoids continuity errors in which a first object such as an advertisement is implanted in only part of the frames where the second object appears.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flowchart of an alternative embodiment of an object detection method according to the present application;
FIG. 2 is a flowchart of a second alternative embodiment of the object detection method according to the present application;
FIG. 3 is a flowchart of a third alternative embodiment of the object detection method proposed in the present application;
FIG. 4 is a schematic diagram of an object clustering result of any scene video segmented from a target video in the object detection method provided by the present application;
FIG. 5 is a schematic diagram of a clustering process of the same second object in different scene videos in a target video in the object detection method provided by the present application;
FIG. 6 is a schematic diagram of a scene clustering result of the same second object in different scene videos in a target video in the object detection method provided by the present application;
FIG. 7 is a flowchart of a fourth alternative embodiment of the object detection method proposed in the present application;
FIG. 8 is a schematic structural diagram of an alternative embodiment of an object detection device according to the present application;
FIG. 9 is a schematic structural diagram of a second alternative embodiment of the object detection device proposed in the present application;
FIG. 10 is a schematic diagram of the hardware structure of an alternative embodiment of an electronic device suitable for the object detection method proposed in the present application.
Detailed Description
Analyzing the background art above, and taking the implantation of an advertisement (a first object) into a film or TV drama (i.e., a video) as an example, the layout of the plot content means that one advertisement slot (a second object such as an office desk or a tea table) may appear many times throughout the production. If some of the slot's appearances are missed and the advertisement is implanted only for the appearances found by manual review, the placement will suffer continuity errors and the advertising revenue will be affected. To improve on this, the present application aims to automatically detect all advertisement slots in the production in place of manual review, improving the detection efficiency and completeness of the slots, prolonging their recallable duration, and helping to improve the efficiency and reliability of advertisement implantation.
On this basis, research shows that each second object (such as an advertisement slot) in a video such as a film or TV drama is usually tied to the video content: scene transitions arise from changes in that content, the objects present usually change between scene videos, and the same object may appear in non-adjacent image frames of one scene video, or even in different scene videos. The frames containing the same second object can therefore be aggregated both within one scene video and across scene videos, so that every image frame in which the same second object appears in the whole video is determined accurately and completely. This avoids the continuity errors caused by incomplete detection when the same second object appears intermittently in a scene video.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be appreciated that "system," "apparatus," "unit" and/or "module" as used in this application is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the word can be replaced by other expressions.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units. And references to "one" or "a plurality" in this disclosure are intended to be illustrative and not limiting, as those skilled in the art will appreciate that "one or more" is to be understood, unless the context clearly indicates otherwise.
Referring to fig. 1, a flowchart of an alternative embodiment of an object detection method according to the present application is shown. The method is applicable to an electronic device such as a server or a terminal device with data processing capability, where the server may be one or more physical servers or a cloud server supporting cloud services, and the terminal device may be, but is not limited to, a notebook computer, a desktop computer, a smartphone, or a service terminal in a professional field. As shown in fig. 1, the object detection method proposed in this embodiment may include:
step S101, obtaining a target video; the target video is a video which needs to be implanted into a first object;
The target video may come from any video source; the present application does not limit its content or source. It is, for example, a film or TV drama into which a first object such as an advertisement or a trademark needs to be implanted.
Step S102, object detection is carried out on the target video, and a target detection result of each target image group is obtained; each target image group comprises a plurality of continuous target images in a target video, the target images comprise the same second object, and the target detection result comprises the position information and the category prediction information of the second object in each target image;
When a first object needs to be implanted into the target video, for example by directly replacing an original, similar second object in the video with the first object, or by implanting the first object into the display area of a second object, the present application can predefine, according to the categories of the first objects to be implanted and their implantation modes, at least one object category (denoted the target category) used for implanting the first object: for example, the predefined set of objects into which an advertisement can be implanted and which can therefore serve as advertisement slots. The method for determining the target categories is not limited here and may be chosen as the case requires.
On this basis, after obtaining the target video, the electronic device can perform object detection on each frame of image contained in the target video to determine whether any object in the frame is a second object belonging to one of the target categories; if so, each frame of image containing the second object can be recorded as a target image.
During this detection of objects for information implantation in the target video, an object into which the first object can be implanted is a second object, and all the consecutive multi-frame target images spanning from its appearance to its disappearance are determined to be one group of image frames, recorded as a target image group. To reduce the computation needed to locate the target image groups in which a second object appears and to improve detection efficiency, the frame number or playing time of each frame of target image contained in a target image group can be recorded, so that the frames contained in different target image groups can be distinguished, as in the sketch below.
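A minimal sketch of this grouping rule in Python (our reading of the paragraph above, not code from the patent): the frame numbers in which one detected object appears are split into target image groups wherever the numbering breaks.

```python
# A minimal sketch, assuming the grouping rule described above: frames in
# which a detected object appears form one target image group as long as
# their frame numbers are consecutive; a gap starts a new group.
from typing import List

def split_into_groups(frame_numbers: List[int]) -> List[List[int]]:
    groups: List[List[int]] = []
    for n in sorted(frame_numbers):
        if groups and n == groups[-1][-1] + 1:
            groups[-1].append(n)  # still the same consecutive run
        else:
            groups.append([n])    # gap: a new target image group begins
    return groups

# An object detected in frames 10-12, 40-41 and 90 yields three groups.
print(split_into_groups([10, 11, 12, 40, 41, 90]))
# [[10, 11, 12], [40, 41], [90]]
```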
As for detecting each second object present in the target video, a suitable object detection algorithm, such as the YOLO algorithm or another deep-learning neural network, may be applied to each frame of target image in each target image group to obtain the position information of every object in the frame, for example the coordinates of the region each detected object occupies (which may be given by a bounding box), together with a class prediction value (such as a prediction probability or score) for each possible class. The larger the class prediction value, the more likely the object belongs to the corresponding class, and the higher the confidence of the classification result for that object.
Then, among the class prediction values of an object in a frame of target image, the class corresponding to the maximum class prediction value is selected as the class of that object, which completes class detection for every object contained in the frame. An object belonging to a predefined target category is determined to be a second object, the coordinates of the region it occupies in the frame are taken as its position information, and the maximum class prediction value is taken as its category prediction information in that frame. The category prediction information is thus the result of predicting that the second object belongs to a target category, that is, to one of the object categories used for implanting the first object. A sketch of this selection follows.
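The following sketch post-processes hypothetical per-frame detector output of (box, per-class scores) pairs; the class names in TARGET_CLASSES are made-up placeholders for the predefined target categories, not values from the patent.

```python
# A sketch under stated assumptions: detections are (box, per-class scores)
# pairs from some detector; TARGET_CLASSES stands in for the predefined
# target categories.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
TARGET_CLASSES = {"billboard", "screen", "table"}  # hypothetical ad-slot classes

def filter_second_objects(detections: List[Tuple[Box, Dict[str, float]]]) -> List[dict]:
    """Keep detections whose argmax class is one of the target categories."""
    second_objects = []
    for box, class_scores in detections:
        best_class = max(class_scores, key=class_scores.get)  # maximum class prediction value
        if best_class in TARGET_CLASSES:
            second_objects.append({
                "box": box,                         # position information (x1, y1, x2, y2)
                "category": best_class,             # detected target category
                "score": class_scores[best_class],  # category prediction information
            })
    return second_objects

frame_detections = [
    ((10.0, 20.0, 110.0, 220.0), {"billboard": 0.92, "tree": 0.05}),
    ((300.0, 40.0, 360.0, 90.0), {"car": 0.80, "billboard": 0.15}),
]
print(filter_second_objects(frame_detections))  # only the billboard survives
```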
Step S103, dividing the target video into a plurality of scene videos, and determining each target image group belonging to the same scene video; one scene video contains continuous multi-frame images in a target video;
step S104, according to each target image group belonging to each scene video and a target detection result, each target image group corresponding to the same second object is obtained from a plurality of second objects contained in each target image group;
As described above, because the number of image frames in the whole target video is often very large, the target video can be divided into different scenes according to the video content, such as a party scene, a conference scene, or an entertainment scene. The same second object in one scene may appear intermittently: for example, as the camera pans around a scene, a table may appear, disappear after several consecutive frames, and then appear again several frames later. In that case, the object detection of step S102 would treat the two intermittent appearances of the same table as two different tables belonging to different target image groups.
Moreover, object clustering aggregates poorly when conditions such as shooting angle and lighting change greatly, and the same second object may be detected in two scene videos separated by a large time span. To detect the same second object in the target video accurately and completely, scene clustering is performed on top of the object clustering results of the different scene videos, achieving cross-scene aggregation of the target images of the same second object appearing in different scene videos.
On this basis, a target video that contains multiple scenes and has a long playing time can be divided into multiple sub-videos, recorded as scene videos; each scene video contains consecutive multi-frame images of the target video, but the number of image frames in one scene video is not limited. Then, for each detected target image group, the scene video it belongs to can be determined from the frame numbers, playing times, or image content of the frames of target image it contains; the implementation is not limited here.
After the target image groups contained in each scene video are determined, the target detection results of the frames of target image in each group, such as the position information and category prediction information of the second object, can be further analyzed to decide whether the second objects contained in different target image groups, within the same scene video and across different scene videos, are the same object, thereby determining the target image groups that correspond to the same second object.
Optionally, deciding whether the second objects in different target images are the same object can, once they are known to belong to the same category of objects, be done with a further similarity detection algorithm. The similarity detection algorithm may combine one or more of a clustering algorithm, nearest-neighbor classification, Euclidean distance, cosine similarity, hash algorithms, and mutual-information or structural-similarity metrics, chosen according to actual requirements; the types of similarity detection algorithms and their operating principles are not described in detail here. A sketch of one such measure follows.
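For instance, cosine similarity, one of the measures listed above, could be applied to feature vectors assumed to have been extracted from two object region images by some embedding network; a minimal sketch (the vectors are made up):

```python
# Cosine similarity between two feature vectors; a sketch assuming the
# features were already extracted from object region images elsewhere.
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.5, 1.0]))  # ~0.943
```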
Step S105, according to the target detection result and each target image group corresponding to the same second object, obtaining the detection result of the same second object in the target video, so as to implant the first object of the corresponding class according to the detection result.
Following the analysis above, every frame of target image containing a second object usable for implanting the first object can be determined accurately and completely from the whole target video. Because the target detection result includes at least the category of the second object and its position information in each frame of target image, it can be determined in which image frames, and at which positions within them, each second object usable for implanting the first object appears in the target video. When a particular first object, such as an advertisement for some item, needs to be implanted, a corresponding second object can be recalled from the target video according to the detection result and the first object implanted into each frame of target image accordingly, for example by replacing the corresponding second object in each frame with the first object.
Thus, in this embodiment of the application, after the object categories suitable for implanting a first object such as an advertisement, trademark, or notice are predefined and denoted target categories, a target video into which a first object is to be implanted is obtained. Object detection on the target video quickly, accurately, and completely yields the second objects present in each run of consecutive multi-frame target images, together with the position information and category prediction information of the second object in each frame. Compared with manually reviewing every frame to check for second objects, this object detection approach ensures that all frames of target image containing a second object of any target category are detected and that the position of the second object within each frame is determined accurately, improving detection efficiency and reliability and laying a reliable foundation for recalling second objects.
After dividing the target video into a plurality of scene videos and determining the target image groups contained in each, the application uses the target detection results to analyze whether the second objects of different target image groups in the same scene video are the same object, and whether the second objects of target image groups in different scene videos are the same. Every frame of target image corresponding to the same second object in the whole target video is thereby determined accurately and completely, and the detection result of the second object is output according to the target detection results; that is, the same second object contained in multi-frame target images is recalled from the target video quickly and accurately, and the implantation of the first object of the corresponding category is performed on that basis. This avoids omitting the first object from some frames, which would create continuity errors throughout the target video and hurt the implantation revenue of the first object.
Referring to fig. 2, a flowchart of a second alternative embodiment of the object detection method according to the present application, this embodiment describes an alternative refined implementation of the object detection method. As shown in fig. 2, the method may include:
step S201, obtaining a target video; the target video is a video which needs to be implanted into a first object;
step S202, inputting a target video into a target detection network to obtain position information of a second object contained in each frame of target image in each target image group and category prediction information of the second object belonging to a target category; each target image group comprises a plurality of continuous frames of target images;
In line with the description in the corresponding part of the embodiment above, the target detection network may be a model trained with an object detection algorithm, such as a trained YOLOX network or another deep-learning neural network; the network structure of the target detection network and its training method are not limited here. The trained target detection network can be invoked directly to detect whether a second object of a target category exists in an input image: after data enhancement of an input frame of image, features of the enhanced image are extracted to obtain a corresponding feature map, detection boxes (bboxes) for regions of different sizes are obtained through feature fusion and analysis, the coordinates of each detection box are determined (such as its corner coordinates), and the class corresponding to the maximum class prediction value is taken as the class of the object. In this way the detection box coordinates and the category prediction information of each second object belonging to a target category are determined (the category prediction information being the probability or score that the second object belongs to the target category; the larger it is, the more reliable the prediction).
Preferably, this embodiment uses a YOLOX network for the object detection of each frame of image in the target video. Compared with other network structures, this reduces the amount of computation and the number of candidate detection boxes generated when predicting second objects of target categories, so that the position information and the category of the second object in the input image (determined via the category prediction information) are detected quickly and accurately; the process is not described in detail in this application.
Step S203, obtaining the frame number of the multi-frame target image contained in each target image group;
step S204, forming a target detection result of each target image group by the corresponding frame number, position information and category prediction information of the target image group;
In this embodiment, in order to determine which scene of the target video each target image group belongs to, and to make it easy to identify the frames of target image containing the same second object, frame numbers are used (the sequence number of each frame image in the whole target video, which may be determined by the output order of frames while the target video plays). The frame number of each frame of target image contained in each target image group is therefore determined and combined with content such as the position information and category prediction information of the second object in that frame to form the target detection result of the corresponding target image group, as sketched below.
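A sketch of one possible shape for the per-group target detection result assembled in steps S202 to S204; the field names are our own, not the patent's.

```python
# A sketch, not the patent's data structure: one record per target image
# group, with frame numbers, boxes, and class scores aligned index by index.
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class TargetImageGroup:
    group_id: int                 # identification id distinguishing groups
    category: str                 # predicted target category of the second object
    frame_numbers: List[int] = field(default_factory=list)
    boxes: List[Box] = field(default_factory=list)     # one box per frame
    scores: List[float] = field(default_factory=list)  # one score per frame

group = TargetImageGroup(group_id=1, category="billboard")
group.frame_numbers.append(120)
group.boxes.append((35.0, 60.0, 180.0, 240.0))
group.scores.append(0.91)
print(group)
```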
It should be noted that the content of the target detection result of each target image group is not limited to what is listed in this embodiment and can be adapted to actual requirements; for example, an id may be configured for each target image group to distinguish different groups, or to distinguish the second objects contained in different groups.
Step S205, performing scene segmentation on the target video to obtain corresponding scene segmentation information; the scene segmentation information characterizes continuous multi-frame images contained in each of a plurality of scene videos into which a target video is segmented;
In this embodiment of the application, the scene changes of the target video can be obtained with a scene segmentation algorithm. The application can therefore use a scene segmentation algorithm to perform scene recognition on all the frame images contained in the target video and obtain the division points between different scenes, thereby splitting the whole target video into multiple sub-videos. Each sub-video is determined to be a scene video, and the scene segmentation information of the target video is obtained from the scene division points between adjacent sub-videos. The type of scene segmentation algorithm and its operating principle are not limited here.
Alternatively, scene segmentation training may be performed in advance on sample videos based on the scene segmentation algorithm, yielding a scene segmentation network that identifies the scenes a video divides into. After the target video is obtained, it can then be input into the scene segmentation network, which outputs the scene segmentation information described above; the implementation method is not limited to this.
In some embodiments, the scene division point mentioned above may be a frame number between two adjacent scene videos, for example the frame number of the last frame image of the previous scene video and the frame number of the first frame image of the adjacent next scene video. This yields the start frame number and end frame number of the consecutive multi-frame images (i.e., each sub-video) contained in each scene video, which may be denoted the start-stop frame numbers.
In other embodiments, the scene division point may instead be a system time (i.e., a playing time of the target video), with playing time points distinguishing the sub-videos of different scene videos. The scene playing time of each scene video within the target video (usually a time period) is determined, for example by recording the start playing time and end playing time of each scene video. The representation is not limited to the image frame numbers and/or system playing times described in this application, provided the playing spans of different scenes can be distinguished. In this case, the system playing time of each frame of target image may be recorded in the target detection result obtained during object detection.
As can be seen, the scene segmentation information obtained in this application may include, but is not limited to, the start-stop frame numbers or scene playing time of each scene video contained in the target video. This representation can be kept consistent with how the display duration of each detected second object is represented in the target detection result, such as the frame numbers or system playing times of each group of consecutive multi-frame target images, so that the scene video to which each frame containing a detected second object belongs can be determined.
In practical applications, each scene video (i.e., each sub-video) split from the target video may be configured with a corresponding scene identifier, such as a scene number or scene id, so that different scene videos can later be distinguished by scene identifier and the start-stop frame numbers or scene playing time of the corresponding scene video read out, as in the sketch below.
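A minimal sketch of scene segmentation information under these assumptions; the type name and the concrete frame numbers are illustrative, not from the patent.

```python
# One record per scene video: a scene id and start-stop frame numbers.
from typing import NamedTuple

class SceneSpan(NamedTuple):
    scene_id: int
    start_frame: int
    end_frame: int  # inclusive

scene_info = [SceneSpan(0, 0, 499), SceneSpan(1, 500, 1203)]
```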
Step S206, determining each target image group belonging to the same scene video based on the scene segmentation information and the corresponding frame number of each target image group;
Following the analysis above, the scene segmentation information determines the consecutive multi-frame images contained in each scene video, each frame having a corresponding frame number. With the frame number of every frame of target image in each target image group known, comparing frame numbers determines which scene video each target image group belongs to, that is, the target image groups contained in each scene video, as in the sketch below.
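A sketch of this frame-number comparison, assuming scene spans are (scene_id, start_frame, end_frame) tuples (compatible with the SceneSpan records sketched earlier) and that a group's first frame decides its scene, which is our simplification:

```python
# Assign each target image group to the scene whose start-stop frame range
# contains it; inputs are plain tuples and dicts for brevity.
from typing import Dict, List, Tuple

def assign_groups_to_scenes(
    scenes: List[Tuple[int, int, int]],   # (scene_id, start_frame, end_frame)
    groups: Dict[int, List[int]],         # group_id -> frame numbers
) -> Dict[int, List[int]]:
    """Return scene_id -> list of group_ids belonging to that scene."""
    by_scene: Dict[int, List[int]] = {sid: [] for sid, _, _ in scenes}
    for gid, frames in groups.items():
        first = min(frames)
        for sid, start, end in scenes:
            if start <= first <= end:
                by_scene[sid].append(gid)
                break
    return by_scene

scenes = [(0, 0, 499), (1, 500, 1203)]
groups = {1: [120, 121, 122], 3: [510, 511], 5: [700, 701, 702]}
print(assign_groups_to_scenes(scenes, groups))  # {0: [1], 1: [3, 5]}
```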
Step S207, obtaining the corresponding frame numbers of each target image group belonging to the same scene video;
step S208, clustering similar objects of the multi-frame target images belonging to the same scene video according to the position information of the second object and the frame number corresponding to the same scene video to obtain an object clustering result belonging to the same second object in the corresponding scene video;
When the same second object in one scene video appears intermittently several times in the target video, the object detection of step S202 yields multiple target detection results for multiple target image groups and cannot tell which of the second objects contained in those target image groups are actually the same object, which would hurt the recall efficiency and accuracy of the second object. The similar-object clustering of step S208 within each scene video resolves this.
It should be noted that in the similarity-graph clustering above, a second object of a given target category in one scene video may come from one or more target image groups. If it comes from multiple continuous or discontinuous target image groups, those groups can be aggregated as corresponding to that one second object, giving the object clustering result for it. The target image groups aggregated for a second object may be represented by their identification ids, or by the frame numbers of the frames of target image they contain; the content of the object clustering result for the same second object in one scene video, and its representation, are not limited here and depend on the case.
Step S209, performing scene similar clustering on target images of the plurality of scene videos containing the second objects of the same target class according to respective object clustering results of the plurality of scene videos and class prediction information corresponding to the second objects, and obtaining scene clustering results of the plurality of scene videos containing the same second objects;
step S210, determining each target image group corresponding to the same second object in the target video according to the scene clustering result;
Step S211, according to the target detection result and each target image group corresponding to the same second object, obtaining the detection result of the same second object in the target video, and implanting the first object of the corresponding category according to the detection result.
To accurately detect the same second object contained in two scene videos separated by a large time span, a scene clustering algorithm can be used to aggregate second objects across scenes. Using the category prediction information, the target image groups whose second objects belong to the same target category in different scene videos are determined, and similarity calculation is performed on the frames of target image of those second objects to decide whether they are the same object. From the resulting scene clustering result, every frame of target image in which the same second object appears across different scene videos is determined. On that basis, all second objects can be recalled from the target video quickly and accurately and first objects of the corresponding categories implanted into the target video, avoiding continuity errors in the implantation of the first object, increasing the recallable duration of the first object in the target video, and safeguarding the implantation revenue of the first object.
Referring to fig. 3, a schematic flowchart of a third alternative embodiment of the object detection method according to the present application, this embodiment describes an alternative refined implementation of the second alternative embodiment. As shown in fig. 3, the method may include:
step S301, obtaining a target video; the target video is a video which needs to be implanted into a first object;
step S302, inputting a target video into a target detection network to obtain position information of a second object contained in each frame of target image in each target image group and category prediction information of the second object belonging to a target category; each target image group comprises a plurality of continuous frames of target images;
step S303, dividing the target video into a plurality of scene videos, and determining each target image group belonging to the same scene video;
step S304, obtaining the corresponding frame numbers of each target image group belonging to the same scene video;
regarding the implementation procedure of step S301 to step S304, reference may be made to the descriptions of the corresponding parts of the above embodiments, such as the descriptions of step S101 and step S202 to step S207, which are not described herein.
For example, when an advertisement (a first object) is to be embedded in a target video, each advertisement slot (i.e., each object of a target category) suitable for the advertisement must be determined. After the target video is input into the target detection network as described above, according to the predefined target categories suitable for advertisement placement, the network detects the frame numbers of the target images in which an advertisement slot of any target category appears consecutively, the position information of its detection box in each of those frames (such as four vertices or two diagonal vertex coordinates), and the prediction score for the target category (a value between 0 and 1, or a prediction probability). The target detection result for each advertisement slot can be output in a format such as Table 1 below.
In Table 1 below, each run of consecutive multi-frame target images in which the same second object appears is one target image group, and the frame numbers of the frames of target image it contains are recorded. In the position information column, each element [x1, y1, x2, y2] of a target image group gives the upper-left and lower-right corner coordinates of the detection box of the advertisement slot in one frame of target image. The frame number, position information, and prediction score sequences belonging to the same target image group correspond one to one from the same end; for example, the position information of the detection box of a screen-type advertisement slot in the (i+1)-th frame of target image is the corresponding element of the group's position information sequence, and the prediction score that this detection box is a screen is the corresponding element of its score sequence. To make it easy to distinguish different target image groups, each may be configured with a one-to-one identification id, as shown in the identification id column of Table 1.
TABLE 1
In line with the description of scene segmentation above, inputting the target video into the scene segmentation network can output the scene segmentation information shown in Table 2 below. It may be an array whose elements are the scene identification id and start-stop frame numbers of each scene video (formed from the frame numbers, or system playing times, of the first and last frames of the consecutive multi-frame images in that scene video; frame numbers are used as the example here for distinguishing frames), although the representation is not limited to an array:
TABLE 2

Scene identification id | Start frame number | Terminating frame number
0                       | 0                  | t0
1                       | t0+1               | t1
On this basis, given the contents recorded in Tables 1 and 2, the frame numbers of the frames of target image in each target image group (i.e., the elements of each sequence in the frame number column of Table 1) are compared with the start-stop frame numbers of each scene video to determine which scene video each target image group belongs to: a target image whose frame number falls within the start-stop frame range of a scene video belongs to that scene video. The implementation of step S303 is not limited in this application.
After the target video is obtained, the execution order of the object detection process and the scene segmentation process is not limited; they may be executed sequentially as described above, or simultaneously, as the case requires.
Step S305, clustering similar objects of the multi-frame target images belonging to the same scene video according to the position information of the second object and the frame number corresponding to the same scene video to obtain an object clustering result belonging to the same second object in the corresponding scene video;
To accurately determine the same second object that appears intermittently in one scene video and to reduce the detection workload, object clustering based on similarity-graph clustering is applied to the frames of target image of the target image groups contained in the same scene video. The position information of the second object in each frame of target image of the scene video can be used to extract the object region image (i.e., the image inside the detection box) of the region where the second object lies in the corresponding frame, as in the sketch below; similarity can then be computed directly between the object region images of the second objects of the same scene video to decide whether the second objects in the corresponding frames are one object.
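A minimal sketch of extracting an object region image, assuming a frame stored as an H x W x 3 array and a detection box given in (x1, y1, x2, y2) pixel coordinates:

```python
# Crop the region inside a detection box out of a frame.
import numpy as np

def crop_object_region(frame: np.ndarray, box) -> np.ndarray:
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    return frame[y1:y2, x1:x2]  # rows are y, columns are x

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # dummy 320x240 frame
region = crop_object_region(frame, (35.0, 60.0, 180.0, 220.0))
print(region.shape)  # (160, 145, 3)
```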
Illustratively, the multi-frame images contained in scene K (e.g., any one scene video of the target video) shown in fig. 4 display second objects of several target classes, such as four classes including building, billboard, table and screen, that is, four classes of objects suitable for advertisement placement. After the above similarity graph clustering, the second objects of one target class may come from one or more target image groups, and the second objects across those target image groups are clustered. As shown in fig. 4, among the target image groups (represented by their ids) containing billboard second objects, the billboards with ids 1 and 3 (each id corresponding to a target image group of continuous multi-frame target images) belong to the same billboard and are clustered into id (1, 3), while the billboards with ids 5, 6 and 7 belong to the same billboard and are clustered into id (5, 6, 7). The object clustering result is not limited to this representation; it may also be represented by the frame numbers of the target images contained in each target image group.
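A minimal sketch of this id-level aggregation follows, with the same-object judgment stubbed out; the union-find merging is an implementation choice of this sketch, not a structure prescribed by the text.

    # Hypothetical sketch: merge target image groups judged to contain the
    # same second object into clusters such as (1, 3) and (5, 6, 7).
    def same_object(gid_a, gid_b):
        # Stub for the similarity-based judgment described above; a real
        # system would compare the object region images of the two groups.
        return tuple(sorted((gid_a, gid_b))) in {(1, 3), (5, 6), (5, 7), (6, 7)}

    def cluster_groups(group_ids):
        parent = {g: g for g in group_ids}
        def find(g):                      # find the root of g's cluster
            while parent[g] != g:
                parent[g] = parent[parent[g]]
                g = parent[g]
            return g
        for i, a in enumerate(group_ids):
            for b in group_ids[i + 1:]:
                if same_object(a, b):
                    parent[find(a)] = find(b)
        clusters = {}
        for g in group_ids:
            clusters.setdefault(find(g), []).append(g)
        return [tuple(sorted(c)) for c in clusters.values()]

    print(cluster_groups([1, 3, 5, 6, 7]))  # e.g. [(1, 3), (5, 6, 7)]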
Step S306, according to the object clustering result of each scene video and the category prediction information of the second object, obtaining, for the same second object in the corresponding scene video, the one frame of image to be detected corresponding to the maximum category prediction information of that second object from the associated target image groups;
For the frames of target images containing the same second object within the same scene video (that is, the frames of the at least one target image group corresponding to that second object), the greater the detected category prediction information of the second object (here, the prediction probability or score that the second object belongs to the corresponding target class), the higher the reliability that the second object indeed belongs to the predicted target class. Therefore, to improve the accuracy and efficiency of scene clustering, this application can screen, from the object clustering result of each scene video, the one frame of target image with the highest category prediction information as the image to be detected; that is, among the multi-frame target images of the same second object detected in the scene video, the frame with the highest category prediction information is determined as the image to be detected of that second object and can represent the scene video.
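A minimal sketch of this screening, reusing the hypothetical group records from the Table 1 illustration, is as follows; the returned triple layout is an assumption of this sketch.

    # Hypothetical sketch: for one clustered second object, pick the frame
    # with the highest category prediction score as its image to be detected.
    def best_frame(cluster_group_ids, groups):
        best = None  # (score, frame number, group id)
        for gid in cluster_group_ids:
            g = groups[gid]
            for frame, score in zip(g["frames"], g["scores"]):
                if best is None or score > best[0]:
                    best = (score, frame, gid)
        return best

    groups = {
        1: {"frames": [120, 121], "scores": [0.91, 0.95]},
        3: {"frames": [900, 901], "scores": [0.89, 0.93]},
    }
    print(best_frame((1, 3), groups))  # (0.95, 121, 1): frame 121 represents the object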
It should be noted that, in this application, several frames of target images ranked highest by category prediction information may also be selected as images to be detected according to the above method, so as to implement the subsequent scene clustering.
Step S307, performing similarity calculation on each frame of to-be-detected image corresponding to the second object of the same target class in the plurality of scene videos to obtain corresponding scene similarity;
step S308, clustering the same second object in a plurality of scene videos according to the scene similarity to obtain a scene clustering result of the same second object in the target video;
In some embodiments, this application may first train a scene similarity detection network, for example on multi-frame sample images from different scenes using a suitable similarity algorithm; the structure of this network and the way it is trained are not limited here. Thus, after obtaining, according to the method described above, the images to be detected of second objects of the same type (i.e., second objects predicted to belong to the same target class) in each scene video, each obtained frame of image to be detected may be input into the scene similarity detection network, similarity calculation may be performed between any two frames of images to be detected containing second objects of the same target class (for example, the one frame of image to be detected corresponding to each aggregation id in the object clustering result diagram of scene K shown in fig. 4), and the corresponding scene similarity is output.
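The pairwise comparison can be sketched as below, with the scene similarity detection network stubbed out as an assumed embedding-plus-cosine scorer; the network call and the feature vectors are placeholders, not the actual architecture.

    # Hypothetical sketch: pairwise scene similarity between images to be
    # detected that contain second objects of the same target class.
    import itertools
    import numpy as np

    def embed(image):
        # Stub for the scene similarity detection network's feature extractor;
        # here each "image" is already a small feature vector for illustration.
        v = np.asarray(image, dtype=float)
        return v / np.linalg.norm(v)

    def scene_similarity(img_a, img_b):
        # Cosine similarity clipped to [0, 1]; this scoring choice is assumed.
        return float(np.clip(embed(img_a) @ embed(img_b), 0.0, 1.0))

    to_detect = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c-2": [0.88, 0.15]}
    for (na, ia), (nb, ib) in itertools.combinations(to_detect.items(), 2):
        print(na, nb, round(scene_similarity(ia, ib), 3))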
For example, the application scenario shown in fig. 5 takes scene clustering of the same second object across three segmented scene videos, scene a, scene b and scene c, as an example. How the object clustering result of each of the three scene videos is obtained with the similarity clustering algorithm is not detailed again in this embodiment; the object clustering results shown for scene a, scene b and scene c in fig. 5 can be understood by reference to the description of the object clustering results in fig. 4. This embodiment illustrates only the scene clustering of the table-type second object (such as an advertisement pit) contained in the three scene videos; the scene clustering of second objects of other target categories is similar and is not described in detail herein.
Based on this, from the respective target images of scene a, scene b and scene c, the corresponding frame of image to be detected is selected for each table among the table-type second objects. As shown in fig. 5, scene a and scene b each contain one table, and for each the one frame of image to be detected with the highest category prediction information for that table is selected from the corresponding scene video. Scene c contains two different tables; the sub-scene videos (each consisting of one or more target image groups) containing the target images of these two tables can be recorded as scene c-1 and scene c-2 respectively, and, according to the above method, the one frame of image to be detected with the highest category prediction information for the corresponding table is selected from the multi-frame target images of the corresponding sub-scene video.
Then, similarity calculation is performed between any two frames of the screened images to be detected, that is, between any two of the scene videos (any two of scene a, scene b, scene c-1 and scene c-2), to obtain the corresponding scene similarity (for example, a value between 0 and 1, where a larger value means a higher similarity between the second objects in the two scene videos, that is, a higher probability that the tables in the two images to be detected are the same table). For convenience of subsequent analysis, the obtained scene similarities can be recorded as the scene similarity matrix shown in fig. 6. In this way, the second objects in the images to be detected whose scene similarity in the same row or column of the scene similarity matrix reaches the scene similarity threshold can be determined to be one and the same second object, and the target image groups corresponding to the same second object in different scene videos can be aggregated to obtain the scene clustering result of that second object in the target video.
That is, the scene videos whose pairwise scene similarity is greater than the scene similarity threshold can be determined as one group of scene videos to be clustered, and the second objects of the same target class in each group of scene videos to be clustered are aggregated into the same second object, thereby obtaining the scene clustering result of the same second object across the plurality of scene videos.
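A minimal sketch of this thresholding over the scene similarity matrix follows; the 0.8 threshold, the illustrative similarity values, and the transitive merging of above-threshold pairs are assumptions of this sketch.

    # Hypothetical sketch: group scene videos whose pairwise scene similarity
    # reaches a threshold into sets of scene videos to be clustered.
    names = ["a", "b", "c-1", "c-2"]
    sim = {  # symmetric scene similarity entries (illustrative values)
        ("a", "b"): 0.31, ("a", "c-1"): 0.22, ("a", "c-2"): 0.92,
        ("b", "c-1"): 0.18, ("b", "c-2"): 0.27, ("c-1", "c-2"): 0.25,
    }
    THRESHOLD = 0.8  # assumed scene similarity threshold

    clusters = [{n} for n in names]
    for (x, y), s in sim.items():
        if s >= THRESHOLD:
            cx = next(c for c in clusters if x in c)
            cy = next(c for c in clusters if y in c)
            if cx is not cy:       # merge the two clusters transitively
                cx |= cy
                clusters.remove(cy)
    print(clusters)  # e.g. [{'a', 'c-2'}, {'b'}, {'c-1'}]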
The scene similarity threshold represents the similarity above which the second objects in the images to be detected of two scene videos are judged to be the same object. Its value is not limited and, to improve detection accuracy, may be set relatively high, such as 0.8 or 0.9, as the case requires. Based on the scene similarity matrix shown in fig. 6, the scene similarity between the images to be detected of scene a and scene c-2 reaches the scene similarity threshold, so the tables in those two frames are determined to actually be the same table, and their corresponding target image groups can be aggregated across scenes; that is, cross-scene clustering can be performed on second objects in different scene videos. For example, the second objects in the target image groups with ids 7, 8 and 27 are one second object and can be aggregated together, and the resulting new clustered scene video can be configured with a scene identification id as shown in fig. 6; the representation of the scene clustering result is not limited in this application.
For scene a (or scene c-2), scene b and scene c-1, the scene similarities between their respective images to be detected do not reach the scene similarity threshold, so the second objects they contain belong to different objects; each such scene video constitutes one clustered scene video by itself, and its object clustering result is taken as its scene clustering result.
Step S309, determining each target image group corresponding to the same second object in the target video according to the scene clustering result;
step S310, according to the target detection result and each target image group corresponding to the same second object, obtaining the detection result of the same second object in the target video, and implanting the first object of the corresponding category according to the detection result.
Based on the scene clustering result, the frames of target images contained in each target image group corresponding to the same second object in the target video can be aggregated, and, combined with the target detection result of each frame of target image, the detection result of that second object can be obtained. Based on this detection result, and according to the actual detection requirement (such as an advertisement pit recall recommendation requirement), all second objects of the corresponding category can be recalled quickly and accurately from the target video, including the frame numbers and position information of the target images in which each second object appears, and the first object of the corresponding category can be implanted accordingly. This enables complete and efficient implantation of the corresponding first object, reduces missed implantation positions of the first object in the target video, increases the recall duration of the first object, and thereby secures the implantation benefit of the first object.
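The assembly of the final per-object detection result can be sketched as below, continuing the hypothetical structures of the earlier sketches; the record layout is an assumption.

    # Hypothetical sketch: collect, for one cross-scene object cluster, every
    # frame number and detection box in which the second object appears.
    def recall_object(cluster_group_ids, groups):
        detections = []  # (frame number, [x1, y1, x2, y2]) pairs
        for gid in cluster_group_ids:
            g = groups[gid]
            detections.extend(zip(g["frames"], g["boxes"]))
        return sorted(detections)  # frame-ordered, ready for implantation

    groups = {
        7:  {"frames": [10, 11], "boxes": [[5, 5, 50, 50], [6, 5, 51, 50]]},
        27: {"frames": [400],    "boxes": [[7, 6, 52, 51]]},
    }
    for frame, box in recall_object((7, 27), groups):
        print(frame, box)  # every position where the first object could be implanted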
Referring to fig. 7, a flowchart of a fourth alternative embodiment of the object detection method provided in this application, which can implement the method of the third alternative embodiment: in combination with the object detection method described in the foregoing embodiments, after the scene video to which each frame of target image belongs has been determined, the object clustering result of each scene video can be obtained with reference to the method shown in fig. 7:
step S701, obtaining the images to be processed with the same preset frame number from each target image group according to the corresponding frame number of each target image group belonging to the same scene video;
In the above analysis, where a scene video contains multiple target image groups, each target image group contains continuous multi-frame target images, and those frames contain the same second object. To reduce the subsequent clustering computation, a preset number of frames of target images can be randomly extracted from each target image group as the images to be processed, for example 5 frames; the preset number can be determined empirically, and its value is not limited in this application.
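A minimal sketch of this sampling follows; the value 5 is the example from the text, and keeping short groups whole is an assumption of this sketch.

    # Hypothetical sketch: randomly sample a preset number of frames from a
    # target image group as its images to be processed.
    import random

    PRESET_FRAMES = 5  # example value from the text; determined empirically

    def sample_frames(group_frames, k=PRESET_FRAMES, seed=0):
        rng = random.Random(seed)          # fixed seed for a repeatable sketch
        if len(group_frames) <= k:         # assumption: short groups kept whole
            return sorted(group_frames)
        return sorted(rng.sample(group_frames, k))

    print(sample_frames(list(range(120, 140))))  # 5 frame numbers out of 20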
Step S702, extracting a second object region image corresponding to the to-be-processed image by using the position information of the second object in each to-be-processed image;
Step S703, according to the category prediction information of the second object in the image to be processed, performing similarity calculation on each second object region image corresponding to the second object of the same target category extracted from the video of the same scene to obtain a corresponding object similarity;
In combination with the above description of the target detection result of each second object, the target detection network can output the coordinate position, that is, the position information, of the detection frame of the second object in each frame of target image. Therefore, for each frame of image to be processed, the region where the detection frame is located can be extracted from that frame according to the coordinate position of the corresponding detection frame and recorded as the second object region image; that is, the region image covered by the detection frame is cropped out of the frame of image to be processed. Then, the object similarity between any two second object region images can be obtained with a deep-learning-based similarity detection network, to judge whether the second objects in the two corresponding frames of images to be processed are the same object. The implementation of the object similarity in step S703 is not limited, and the network structure of the similarity detection network can be determined according to actual requirements, which is not detailed in this embodiment.
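The cropping step can be sketched as below, assuming each frame is a NumPy image array indexed (row, column, channel) and each box is given as [x1, y1, x2, y2] in pixel coordinates.

    # Hypothetical sketch: crop the second object region image (the area
    # inside the detection frame) from one frame of image to be processed.
    import numpy as np

    def crop_region(frame, box):
        """frame: HxWxC image array; box: [x1, y1, x2, y2] pixel coordinates."""
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        return frame[y1:y2, x1:x2]  # rows index y, columns index x

    frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # dummy 720p frame
    region = crop_region(frame, [100, 50, 300, 200])
    print(region.shape)  # (150, 200, 3): the detection-frame region image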
Step S704, according to the object similarity and the target detection result, aggregating the target image groups corresponding to the same second object in the corresponding scene video to obtain the object clustering result of the second object.
In this embodiment, similarity calculation can be performed between the second objects in the images to be processed of any two target image groups to determine whether the two correspond to the same object: if the object similarity reaches the object similarity threshold, it is determined that the second objects in the two frames of images to be processed belong to the same object, that is, the second objects contained in the two corresponding target image groups are the same object. The resulting judgments are then combined to determine which of the second objects across all target image groups in the same scene video are the same object, and the target image groups corresponding to the same second object are aggregated, for example by establishing a correspondence between the same second object and its target image groups, so as to support implantation of the first object of the corresponding category.
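The pairwise same-object judgment can be sketched as below, with the similarity detection network stubbed out; comparing every sampled region image pair and taking the maximum similarity is an assumption of this sketch, as is the 0.85 threshold.

    # Hypothetical sketch: decide whether two target image groups contain the
    # same second object by comparing their sampled second object region images.
    OBJECT_SIM_THRESHOLD = 0.85  # assumed object similarity threshold

    def region_similarity(region_a, region_b):
        # Stub for the deep-learning-based similarity detection network.
        return 0.9  # placeholder score

    def same_second_object(regions_a, regions_b):
        best = max(region_similarity(ra, rb)
                   for ra in regions_a for rb in regions_b)
        return best >= OBJECT_SIM_THRESHOLD

    # regions_a / regions_b would be the cropped region images produced in
    # step S702 for the sampled images to be processed of the two groups.
    print(same_second_object(["r1", "r2"], ["r3"]))  # True with the stub score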
Then, in combination with the description of the third embodiment, the same second object across different scene videos can be further clustered, so that every second object contained in the target video, and every frame of image containing each second object, can be determined rapidly, accurately and completely, avoiding missed implantation of the first object and increasing the recall duration of the first object in the target video.
Referring to fig. 8, which is a schematic structural diagram of a first alternative embodiment of the object detection device proposed in this application; as shown in fig. 8, the detection device may include:
a target video obtaining module 81 for obtaining a target video; the target video is a video which needs to be implanted into a first object;
a target detection result obtaining module 82, configured to perform object detection on the target video, to obtain a target detection result of each target image group; wherein each target image group comprises a plurality of continuous target images in the target video, the target images comprise the same second object, the target detection result at least comprises position information and category prediction information of the second object in each frame of the target images, the category prediction information is a result of predicting that the second object belongs to a target category, and the target category is a predefined object category used for implanting the first object;
a scene video segmentation module 83, configured to segment the target video into a plurality of scene videos, and determine each of the target image groups belonging to the same scene video; one of the scene videos contains successive multi-frame images in the target video;
A target image processing module 84, configured to obtain, according to each of the target image groups belonging to each of the scene videos and the target detection result, each of the target image groups corresponding to the same second object from a plurality of second objects included in each of the target image groups;
the detection result obtaining module 85 is configured to obtain a detection result of the same second object in the target video according to the target detection result and each target image group corresponding to the same second object, and implant the first object of the corresponding class according to the detection result.
Optionally, the target detection result obtaining module 82 may include:
the first obtaining unit is used for inputting the target video into a target detection network, and obtaining position information of a second object contained in each frame of target image in each target image group and category prediction information of the second object belonging to the target category;
a second obtaining unit configured to obtain a frame number of a multi-frame target image included in each of the target image groups;
and the target detection result forming unit is used for forming the target detection result of the corresponding target image group from the frame numbers, the position information and the category prediction information corresponding to each target image group.
In some embodiments, the scene video segmentation module 83 may include:
the scene segmentation unit is used for carrying out scene segmentation on the target video to obtain corresponding scene segmentation information; the scene segmentation information characterizes continuous multi-frame images contained in each of a plurality of scene videos into which the target video is segmented;
and a first determining unit configured to determine each of the target image groups belonging to the same scene video based on the scene segmentation information and the frame numbers corresponding to each of the target image groups.
Alternatively, as shown in fig. 9, the target image processing module 84 may include:
a third obtaining unit 841, configured to obtain the frame numbers corresponding to the target image groups belonging to the same scene video;
an object clustering unit 842, configured to perform similar object clustering on multiple frames of the target images belonging to the same scene video according to the position information of the second object and the frame number corresponding to the same scene video, so as to obtain an object clustering result belonging to the same second object in the corresponding scene video;
a scene clustering unit 843, configured to perform scene similar clustering on target images of second objects in the same target class included in the plurality of scene videos according to the object clustering results of each of the plurality of scene videos and class prediction information corresponding to the second objects, so as to obtain scene clustering results of the same second objects included in the plurality of scene videos;
And the second determining unit 844 is configured to determine each of the target image groups corresponding to the same second object in the target video according to the scene clustering result.
Optionally, the object clustering unit 842 may include:
a to-be-processed image obtaining unit, used for obtaining images to be processed with the same preset frame number from each target image group according to the frame numbers corresponding to each target image group belonging to the same scene video;
a second object region image extracting unit, configured to extract a second object region image corresponding to the image to be processed using the position information of the second object in each of the images to be processed;
the object similarity obtaining unit is used for carrying out similarity calculation on each second object region image corresponding to the second object of the same target class extracted from the video of the same scene according to the class prediction information of the second object in the image to be processed to obtain corresponding object similarity;
and the object clustering result obtaining unit is used for obtaining object clustering results belonging to the same second object in the corresponding scene video according to the object similarity and the target detection result.
In still other embodiments, the scene clustering unit 843 described above may include:
the to-be-detected image obtaining unit is used for obtaining a frame of to-be-detected image corresponding to the maximum category prediction information of the second object from each target image group aiming at the same second object in the corresponding scene video according to the object clustering result of each scene video and the category prediction information of the second object;
the scene similarity obtaining unit is used for carrying out similarity calculation on the images to be detected of frames corresponding to the same target class second object in a plurality of scene videos to obtain corresponding scene similarity;
and a fourth obtaining unit, configured to cluster the same second object in the multiple scene videos according to the scene similarity, and obtain a scene clustering result of the same second object in the target video.
Optionally, the scene similarity obtaining unit may include:
a scene similarity calculating unit, used for inputting each obtained frame of image to be detected into a scene similarity detection network, carrying out similarity calculation on any two frames of images to be detected containing second objects of the same target class, and outputting the corresponding scene similarity;
Based on this, the fourth obtaining unit may include:
the to-be-clustered scene video determining unit is used for determining each scene video with the scene similarity larger than a scene similarity threshold value as a group of scene videos to be clustered;
and the scene clustering result obtaining unit is used for aggregating the second objects of the same target class in each group of the scene videos to be clustered into the same second object, to obtain scene clustering results of the same second object in the plurality of scene videos.
It should be noted that the various modules and units in the foregoing apparatus embodiments may be stored as program modules in the memory of the corresponding electronic device, with the processor of the electronic device executing the program modules stored in the memory to implement the corresponding functions, or the functions may be implemented by a combination of program modules and hardware. For the functions implemented by each program module and their combinations, and the technical effects achieved, reference may be made to the description of the corresponding parts of the foregoing method embodiments, which is not repeated herein.
The embodiments of this application further provide a computer readable storage medium on which a computer program is stored; when the computer program is loaded and executed by a processor, the steps of the object detection method described in the foregoing embodiments are implemented. For the specific implementation process, reference may be made to the description of the corresponding parts of the foregoing embodiments, which is not repeated in this embodiment.
Referring to fig. 10, which is a schematic diagram of the hardware structure of an alternative embodiment of an electronic device suitable for the object detection method proposed in this application (taking a server as the example electronic device); as shown in fig. 10, the electronic device may include: at least one memory 101 and at least one processor 102, wherein:
the memory 101 may be used to store computer instructions implementing the object detection methods described in the method embodiments above; the processor 102 may load and execute the computer instructions stored in the memory to implement the steps of the object detection method described in the foregoing corresponding method embodiments, and the specific implementation process may refer to the description of the corresponding parts of the foregoing embodiments, which is not repeated.
In practical applications, the memory 101 and the processor 102 may be connected to a communication bus, through which they exchange data with each other and with other structural components of the electronic device; the specific arrangement may be determined according to practical requirements and is not described in detail in this application.
In embodiments of the present application, memory 101 may comprise high-speed random access memory, and may also comprise non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device. The processor 102 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or other programmable logic device, etc. The structures and models of the memory 101 and the processor 102 are not limited and can be flexibly adjusted according to actual requirements.
It should be understood that the structure shown in fig. 10 does not limit the electronic device in the embodiments of this application. In practical applications, the electronic device may include more components than shown in fig. 10, or combine some components, such as a data transmission circuit for receiving target data from other devices. Where the electronic device is a terminal device, it may further include at least one input component, such as a touch sensing unit that senses touch events on a touch display panel, a keyboard, a mouse, a camera, or a microphone; at least one output component, such as a display, speaker, vibration mechanism, or light; an antenna; a sensor module; a power module; and the like. The hardware structure can be determined according to the type of the terminal device and its functional requirements, which are not specifically enumerated here.
Finally, it is pointed out that, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
While several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example embodiments formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.

Claims (10)

1. An object detection method, characterized in that the object detection method comprises:
obtaining a target video; the target video is a video which needs to be implanted into a first object;
Performing object detection on the target video to obtain a target detection result of each target image group; wherein each target image group comprises a plurality of continuous target images in the target video, the target images comprise the same second object, the target detection result at least comprises position information and category prediction information of the second object in each frame of the target images, the category prediction information is a result of predicting that the second object belongs to a target category, and the target category is a predefined object category used for implanting the first object;
dividing the target video into a plurality of scene videos, and determining each target image group belonging to the same scene video; one of the scene videos contains successive multi-frame images in the target video;
obtaining each target image group corresponding to the same second object from a plurality of second objects contained in each target image group according to each target image group belonging to each scene video and the target detection result;
and obtaining the detection result of the same second object in the target video according to the target detection result and each target image group corresponding to the same second object, and implanting the first object of the corresponding category according to the detection result.
2. The object detection method according to claim 1, wherein the object detecting the object video to obtain the object detection result of each object image group includes:
inputting the target video into a target detection network to obtain the position information of a second object contained in each frame of target image in each target image group and the category prediction information of the second object belonging to the target category;
obtaining a frame number of a multi-frame target image contained in each target image group;
and forming a target detection result of the target image group by the frame number, the position information and the category prediction information corresponding to each target image group.
3. The object detection method according to claim 2, wherein the dividing the target video into a plurality of scene videos, determining each of the target image groups belonging to the same scene video, comprises:
performing scene segmentation on the target video to obtain corresponding scene segmentation information; the scene segmentation information characterizes continuous multi-frame images contained in each of a plurality of scene videos into which the target video is segmented;
and determining each target image group belonging to the same scene video based on the scene segmentation information and the frame sequence numbers corresponding to each target image group.
4. The object detection method according to claim 2, wherein the obtaining, from the second objects included in each of the target image groups, each of the target image groups corresponding to the same second object, based on each of the target image groups belonging to each of the scene videos and the target detection result, includes:
obtaining the frame sequence numbers corresponding to the target image groups belonging to the same scene video;
according to the position information of the second object and the frame sequence number corresponding to the same scene video, similar object clustering is carried out on multiple frames of target images belonging to the same scene video, and an object clustering result belonging to the same second object in the corresponding scene video is obtained;
according to the object clustering results of the plurality of scene videos and the category prediction information corresponding to the second object, performing scene similar clustering on the target images of the plurality of scene videos containing the same target category second object to obtain scene clustering results of the plurality of scene videos containing the same second object;
and determining each target image group corresponding to the same second object in the target video according to the scene clustering result.
5. The method for detecting objects according to claim 4, wherein the step of clustering similar objects of the multiple frames of the target images belonging to the same scene video according to the position information of the second object and the frame number corresponding to the same scene video to obtain the object clustering result of the same second object in the corresponding scene video includes:
obtaining images to be processed with the same preset frame number from each target image group according to the frame numbers corresponding to each target image group belonging to the same scene video;
extracting a second object region image corresponding to the to-be-processed image by using the position information of the second object in each to-be-processed image;
according to the category prediction information of the second object in the image to be processed, similarity calculation is carried out on each second object region image corresponding to the second object of the same target category extracted from the same scene video, so that corresponding object similarity is obtained;
and obtaining an object clustering result belonging to the same second object in the corresponding scene video according to the object similarity and the target detection result.
6. The method according to claim 4, wherein the step of performing similar scene clustering on the target images of the plurality of scene videos including the same target class second object according to the respective object clustering results of the plurality of scene videos and the class prediction information corresponding to the second object to obtain scene clustering results of the plurality of scene videos including the same second object includes:
obtaining a frame of image to be detected corresponding to the maximum category prediction information of the second object from each target image group aiming at the same second object in the corresponding scene video according to the object clustering result of each scene video and the category prediction information of the second object;
performing similarity calculation on the images to be detected of frames corresponding to the same target class second object in the plurality of scene videos to obtain corresponding scene similarity;
and clustering the same second object in the plurality of scene videos according to the scene similarity to obtain a scene clustering result of the same second object in the target video.
7. The method for detecting an object according to claim 6, wherein the performing similarity calculation on the to-be-detected image of each frame corresponding to the second object of the same target class in the plurality of scene videos to obtain a corresponding scene similarity includes:
Inputting the obtained images to be detected of each frame into a scene similarity detection network, and carrying out similarity calculation on any two frames of the images to be detected containing the second object of the same target class to obtain corresponding scene similarity;
clustering the same second object in the plurality of scene videos according to the scene similarity to obtain a scene clustering result of the same second object in the target video, wherein the method comprises the following steps:
determining each scene video with the scene similarity larger than a scene similarity threshold as a group of scene videos to be clustered;
and aggregating the second objects of the same target class in each group of the scene videos to be clustered into the same second object to obtain scene clustering results of the same second object in the plurality of scene videos.
8. An object detection apparatus, characterized in that the object detection apparatus comprises:
the target video acquisition module is used for acquiring a target video; the target video is a video which needs to be implanted into a first object;
the target detection result obtaining module is used for carrying out object detection on the target video to obtain a target detection result of each target image group; wherein each target image group comprises a plurality of continuous target images in the target video, the target images comprise the same second object, the target detection result at least comprises position information and category prediction information of the second object in each frame of the target images, the category prediction information is a result of predicting that the second object belongs to a target category, and the target category is a predefined object category used for implanting the first object;
The scene video segmentation module is used for segmenting the target video into a plurality of scene videos and determining each target image group belonging to the same scene video; one of the scene videos contains successive multi-frame images in the target video;
the target image processing module is used for obtaining each target image group corresponding to the same second object from a plurality of second objects contained in each target image group according to each target image group belonging to each scene video and the target detection result;
the detection result obtaining module is used for obtaining the detection result of the same second object in the target video according to the target detection result and each target image group corresponding to the same second object, and is used for implanting the first object of the corresponding category according to the detection result.
9. An electronic device, the electronic device comprising: at least one memory and at least one processor, wherein:
the processor is configured to load and execute computer instructions stored in the memory to implement the object detection method according to any one of claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program is loaded and executed by a processor, implementing the object detection method according to any of claims 1-7.
CN202311864670.3A 2023-12-29 2023-12-29 Object detection method and device, electronic equipment and storage medium Pending CN117809221A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311864670.3A CN117809221A (en) 2023-12-29 2023-12-29 Object detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311864670.3A CN117809221A (en) 2023-12-29 2023-12-29 Object detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117809221A true CN117809221A (en) 2024-04-02

Family

ID=90429464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311864670.3A Pending CN117809221A (en) 2023-12-29 2023-12-29 Object detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117809221A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119323526A (en) * 2024-12-16 2025-01-17 厦门真景科技有限公司 Training method for removing lasting object model
CN119323526B (en) * 2024-12-16 2025-03-14 厦门真景科技有限公司 A training method for removing blooper object models

Similar Documents

Publication Publication Date Title
US12073621B2 (en) Method and apparatus for detecting information insertion region, electronic device, and storage medium
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN105744292B (en) A kind of processing method and processing device of video data
US8805123B2 (en) System and method for video recognition based on visual image matching
CN111314732A (en) Method for determining video label, server and storage medium
US9418297B2 (en) Detecting video copies
Srinivas et al. An improved algorithm for video summarization–a rank based approach
CN111191591B (en) Watermark detection and video processing method and related equipment
CN112925905B (en) Method, device, electronic equipment and storage medium for extracting video subtitles
CN111836118B (en) Video processing method, device, server and storage medium
CN113516609B (en) Split-screen video detection method and device, computer equipment and storage medium
TW201907736A (en) Method and device for generating video summary
CN113435438B (en) Image and subtitle fused video screen plate extraction and video segmentation method
CN117809221A (en) Object detection method and device, electronic equipment and storage medium
US20150356353A1 (en) Method for identifying objects in an audiovisual document and corresponding device
Ma et al. Lecture video segmentation and indexing
CN112995666B (en) Video horizontal and vertical screen conversion method and device combined with scene switching detection
CN110019951B (en) Method and equipment for generating video thumbnail
CN101339662B (en) Method and device for creating video frequency feature data
CN113468928B (en) Rotating background video recognition method, device, computer equipment and storage medium
CN111479168B (en) Method, device, server and medium for marking multimedia content hot spot
CN116074582B (en) Implant position determining method and device, electronic equipment and storage medium
CN114140798B (en) Text region segmentation method and device, electronic equipment and storage medium
CN112019923B (en) Video cutting processing method
CN114282057B (en) Video character classification method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination