CN106354816B - Video image processing method and device - Google Patents
- Publication number
- CN106354816B (grant publication); application CN201610765659.5A
- Authority
- CN
- China
- Prior art keywords: image, video, target object, information, feature
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7837—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Library & Information Science (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The application provides a video image processing method and device. The method comprises the following steps: acquiring a video image sequence; identifying a target object from video image frames in the sequence; tracking the target object and determining its motion trajectory; acquiring video structured information based on the target object and its motion trajectory; and performing target object retrieval and/or video condensation on the video image sequence based on the video structured information. The method and device can detect a target quickly, i.e., they increase the speed of target detection and thereby the speed of case investigation.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a video image processing method and apparatus.
Background
With the improvement of video surveillance systems, video image investigation has become the fourth investigation and case-solving technology of public security organs, after criminal investigation technology, action technology, and network investigation technology. Conventional video image investigation mainly relies on manpower: a large number of investigators must search for the target in every video image frame, consuming substantial labor and time. In short, the conventional approach is time-consuming, labor-intensive, and yields poor investigation results.
Disclosure of Invention
In view of the above, the present invention provides a video image processing method and apparatus, so as to solve the problem that prior-art video image investigation is time-consuming and labor-intensive, which slows case investigation. The technical scheme is as follows:
A method of video image processing, the method comprising:
acquiring a video image sequence;
identifying a target object from video image frames in the sequence of video images;
tracking the target object and determining a motion trajectory of the target object;
acquiring video structured information based on the target object and the motion trajectory of the target object;
and performing target object retrieval and/or video condensation on the video image sequence based on the video structured information.
wherein the identifying a target object from video image frames in the sequence of video images comprises:
identifying the target object from each video image frame in the sequence based on a deep convolutional neural network.
wherein the tracking the target object comprises:
tracking the target object with a Lucas-Kanade optical flow tracking algorithm, based on optical flow points extracted from the target object in the video image frames.
wherein the video structured information comprises: text information and/or image feature information of the target object, the text information comprising attribute information and motion information of the target object;
the performing target object retrieval based on the video structured information comprises:
when a retrieval instruction carrying text information to be retrieved is received, retrieving in the text information of the target object based on the text information to be retrieved; or, when a retrieval instruction carrying an image to be retrieved is received, retrieving in the image feature information of the target object based on the image to be retrieved; or, when a retrieval instruction carrying event information to be retrieved is received, retrieving in the text information based on the event to be retrieved and a pre-established event model, to obtain a retrieval result;
and outputting the target object information associated with the retrieval result.
wherein the image feature information of the target object comprises deep convolution features and local features;
the retrieving in the image feature information of the target object based on the image to be retrieved to obtain a retrieval result comprises:
matching the deep convolution feature of the image to be retrieved against the image feature information of the target object according to a first matching rule to obtain a candidate feature set;
and matching the deep convolution feature and the local feature of the image to be retrieved against the candidate feature set according to a second matching rule to obtain a target image feature as the retrieval result.
Wherein the matching against the image feature information of the target object according to the first matching rule to obtain a candidate feature set comprises:
acquiring the deep convolution feature and the local feature of the image to be retrieved, and binary-coding the deep convolution feature of the image to be retrieved to obtain a binary coding feature of the image to be retrieved;
matching the binary coding feature of the image to be retrieved against the binary coding feature corresponding to each deep convolution feature in the image feature information of the target object, determining those binary coding features whose matching degree with the binary coding feature of the image to be retrieved is greater than a first preset value as target binary coding features, and taking the target deep convolution features and target local features corresponding to the target binary coding features as the candidate feature set;
the matching against the candidate feature set according to the second matching rule to obtain a target image feature as the retrieval result comprises:
matching the deep convolution feature of the image to be retrieved with each deep convolution feature in the candidate feature set, matching the local feature of the image to be retrieved with each local feature in the candidate feature set, and taking the image features whose comprehensive matching degree of deep convolution feature and corresponding local feature is greater than a second preset value as the retrieval result;
the outputting the target object information associated with the retrieval result specifically comprises:
outputting the target object image associated with the image features whose comprehensive matching degree is greater than the second preset value, wherein the target object image is an image of the target object extracted in advance from the video image frame in which the target object is located.
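Purely as an illustration of the coarse-to-fine matching described above (not part of the claims), the following sketch assumes the binary code is obtained by sign-thresholding the deep convolution feature, coarse matching degree is Hamming similarity of binary codes, and the "comprehensive matching degree" is a weighted sum of cosine similarities of deep and local features; the thresholds and weights are invented for illustration.

```python
import numpy as np

def binarize(feat):
    """Binary-code a real-valued deep convolution feature by sign thresholding."""
    return (feat > 0).astype(np.uint8)

def hamming_similarity(a, b):
    """Fraction of matching bits between two binary codes."""
    return 1.0 - np.count_nonzero(a != b) / a.size

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_deep, query_local, gallery,
             coarse_thresh=0.6, fine_thresh=0.8, w_deep=0.7):
    """Coarse-to-fine retrieval sketch.

    gallery: list of (object_id, deep_feature, local_feature).
    Stage 1 (coarse): Hamming similarity of binary codes > coarse_thresh
    yields the candidate feature set.
    Stage 2 (fine): weighted cosine similarity of deep + local features
    > fine_thresh yields the retrieval result, best match first.
    """
    q_code = binarize(query_deep)
    candidates = [(oid, d, l) for oid, d, l in gallery
                  if hamming_similarity(q_code, binarize(d)) > coarse_thresh]
    results = []
    for oid, d, l in candidates:
        score = w_deep * cosine(query_deep, d) + (1 - w_deep) * cosine(query_local, l)
        if score > fine_thresh:
            results.append((oid, score))
    return sorted(results, key=lambda r: -r[1])
```

The coarse binary stage is cheap (bit operations), so it can prune a large gallery before the more expensive real-valued comparisons run.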
Wherein the video structured information comprises: an image density of each video image frame in the sequence of video images, the image density characterizing the presence of target objects in the video image frame;
the acquiring video structured information based on the target object and the motion trajectory of the target object comprises:
determining structure information of a monitored area in the video image sequence according to the position of the target object in the video image frames and the motion trajectory of the target object, wherein the structure information of the monitored area comprises information on the regions of the monitored area in which the target object appears;
determining the image density of each video image frame in the sequence based on the structure information of the monitored area.
Wherein the performing video condensation on the sequence of video images based on the video structured information comprises:
segmenting the video image sequence based on the image density of each video image frame, and determining video segments to be condensed from the segments;
and performing video condensation on the video segments to be condensed, and merging the condensed video segments with the other, non-condensed video segments to obtain a condensed video image sequence.
Wherein the segmenting the sequence of video images based on the image density and determining video segments to be condensed from the segments comprises:
dividing the video image sequence into a plurality of video segments using a preset image density threshold according to the image density of each video image frame, wherein each video segment comprises a plurality of continuous video image frames;
and determining the video segments in which the image density of every video image frame is greater than the image density threshold as the video segments to be condensed.
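The density-based segmentation above can be sketched as follows; this is a minimal illustration under the assumption that consecutive frames on the same side of the threshold form one segment, and runs above the threshold become segments to be condensed.

```python
def split_by_density(densities, threshold):
    """Split a frame sequence into runs of consecutive frames on the same
    side of the image density threshold.

    densities: per-frame image density values, in frame order.
    Returns a list of (to_condense, frame_indices) pairs, where to_condense
    is True for runs whose every frame exceeds the threshold.
    """
    segments = []
    start = 0
    for i in range(1, len(densities) + 1):
        end_of_run = (i == len(densities)
                      or (densities[i] > threshold) != (densities[start] > threshold))
        if end_of_run:
            segments.append((densities[start] > threshold, list(range(start, i))))
            start = i
    return segments
```

The segments flagged `True` would be condensed, while the low-density segments pass through unchanged before the merge step.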
Wherein the performing video condensation on the video segment to be condensed comprises:
determining, through a spatio-temporal condensation model, an optimal movement strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension;
and performing image fusion based on the optimal movement strategy to obtain the condensed video segment.
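The patent does not detail the spatio-temporal condensation model. A common simplification, used here purely as illustration, treats each tracked object as a "tube" of per-frame bounding boxes and greedily shifts each tube earlier in time as long as it does not collide with already-placed tubes; real systems would optimize a cost over both time and space shifts.

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test for boxes (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def condense(tubes):
    """Greedy time-shift condensation sketch.

    tubes: list of dicts mapping frame offset (relative to the tube's own
    start) -> bounding box. Returns the chosen output start time per tube.
    Each tube is moved to the earliest start at which none of its boxes
    collides with a box of an already-placed tube in the same output frame.
    """
    placed = []   # (start, tube) pairs already laid out
    starts = []
    for tube in tubes:
        start = 0
        while any(
            boxes_overlap(box, p_tube[t + start - p_start])
            for p_start, p_tube in placed
            for t, box in tube.items()
            if (t + start - p_start) in p_tube
        ):
            start += 1
        placed.append((start, tube))
        starts.append(start)
    return starts
```

Two objects that originally appeared minutes apart can thus be played side by side (or back to back) in the condensed output, which is what shortens the video.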
A video image processing apparatus, the apparatus comprising: a video acquisition module, a target identification module, a target tracking module, a video structured information acquisition module, and a processing module;
The video acquisition module is used for acquiring a video image sequence;
The target identification module is used for identifying a target object from video image frames in the video image sequence acquired by the video acquisition module;
The target tracking module is used for tracking the target object identified by the target identification module and determining the motion trajectory of the target object;
the video structured information acquisition module is used for acquiring video structured information based on the target object identified by the target identification module and the motion trajectory of the target object determined by the target tracking module;
the processing module is used for performing target object retrieval and/or video condensation on the video image sequence based on the video structured information acquired by the video structured information acquisition module.
The target identification module is specifically configured to identify a target object from each video image frame in the video image sequence based on a deep convolutional neural network.
The target tracking module is specifically configured to track the target object using a Lucas-Kanade optical flow tracking algorithm, based on optical flow points extracted from the target object in the video image frames.
wherein the video structured information comprises: text information and/or image feature information of the target object, the text information comprising attribute information and motion information of the target object;
the processing module comprises: a retrieval module and an output module;
the retrieval module is used for retrieving in the text information of the target object based on text information to be retrieved when a retrieval instruction carrying the text information to be retrieved is received; or retrieving in the image feature information of the target object based on an image to be retrieved when a retrieval instruction carrying the image to be retrieved is received; or retrieving in the text information based on an event to be retrieved and a pre-established event model when a retrieval instruction carrying the event information to be retrieved is received; so as to obtain a retrieval result;
the output module is used for outputting the target object information associated with the retrieval result of the retrieval module.
wherein the image feature information of the target object comprises deep convolution features and local features;
the retrieval module comprises: a rough matching module and a precise matching module;
the rough matching module is used for matching against the image feature information of the target object according to a first matching rule, based on the deep convolution feature of the image to be retrieved, to obtain a candidate feature set;
and the precise matching module is used for matching against the candidate feature set according to a second matching rule, based on the deep convolution feature and the local feature of the image to be retrieved, to obtain a target image feature as the retrieval result.
Wherein the rough matching module comprises: a feature acquisition and processing submodule and a rough matching submodule;
the feature acquisition and processing submodule is used for acquiring the deep convolution feature and the local feature of the image to be retrieved, binary-coding the deep convolution feature of the image to be retrieved to obtain its binary coding feature, and binary-coding each deep convolution feature in the image feature information of the target object to obtain the binary coding feature corresponding to each deep convolution feature;
the rough matching submodule is used for matching the binary coding feature of the image to be retrieved against the binary coding feature corresponding to each deep convolution feature in the image feature information of the target object, determining those binary coding features whose matching degree with the binary coding feature of the image to be retrieved is greater than a first preset value as target binary coding features, and taking the target deep convolution features and target local features corresponding to the target binary coding features as the candidate feature set;
the precise matching module is specifically configured to match the deep convolution feature of the image to be retrieved with each deep convolution feature in the candidate feature set, match the local feature of the image to be retrieved with each local feature in the candidate feature set, and take the image features whose comprehensive matching degree of deep convolution feature and corresponding local feature is greater than a second preset value as the retrieval result;
the output module is specifically configured to output the target object image associated with the image features whose comprehensive matching degree is greater than the second preset value, wherein the target object image is an image of the target object extracted in advance from the video image frame in which the target object is located.
wherein the video structured information comprises: an image density of each video image frame in the sequence of video images, the image density characterizing the presence of target objects in the video image frame;
the video structured information acquisition module comprises: a monitored area structure determining submodule and an image density determining submodule;
the monitored area structure determining submodule is used for determining structure information of a monitored area in the video image sequence according to the position of the target object in the video image frames and the motion trajectory of the target object, wherein the structure information comprises information on the regions of the monitored area in which the target object appears;
the image density determining submodule is used for determining the image density of each video image frame in the sequence based on the structure information determined by the monitored area structure determining submodule.
wherein the processing module comprises: a video preprocessing module and a video condensation module;
the video preprocessing module is used for segmenting the video image sequence based on the image density and determining video segments to be condensed from the segments;
the video condensation module is used for condensing the video segments to be condensed and merging the condensed video segments with the other, non-condensed video segments to obtain a condensed video image sequence.
Wherein the video preprocessing module comprises: a video segmentation submodule and a to-be-condensed video segment determining submodule;
the video segmentation submodule is used for dividing the video image sequence into a plurality of video segments using a preset image density threshold according to the image density of each video image frame;
the to-be-condensed video segment determining submodule is used for determining the video segments in which the image density of every video image frame is greater than the image density threshold as the video segments to be condensed.
Wherein the video condensation module comprises: an optimal condensation strategy determining submodule and an image fusion submodule;
the optimal condensation strategy determining submodule is used for determining, through a spatio-temporal condensation model, an optimal movement strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension;
and the image fusion submodule is used for performing image fusion based on the optimal movement strategy to obtain a condensed video segment.
The technical scheme has the following beneficial effects:
The video image processing method and device provided by the invention can identify and track a target object in a video image sequence, acquire video structured information based on the target object and its motion trajectory, and then retrieve based on the video structured information, which enables fast target detection. That is, if the user knows some information about the target object in advance, that information can be used directly for retrieval so as to detect the target quickly; if the user does not know any information about the target object, the user can directly browse the condensed video so as to detect the target quickly. The method and device thus speed up target detection and, in turn, case investigation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a video image processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of target detection generating a series of target candidate boxes in a video image frame, in the video image processing method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of optical flow points extracted from a target object in a video image frame, in the video image processing method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a specific implementation of retrieving in the image feature information of a target object based on an image to be retrieved to obtain a retrieval result, in the video image processing method according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of acquiring video structured information based on a target object and the motion trajectory of the target object, in the video image processing method according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart of segmenting a video image sequence based on image density and determining video segments to be condensed from the segments, in the video image processing method according to an embodiment of the present invention;
Fig. 7 is a schematic flowchart of performing video condensation on a video segment to be condensed, in the video image processing method according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a video image processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a video image processing method. Referring to Fig. 1, which shows a flowchart of the method, the method may include:
Step S101: a sequence of video images is acquired.
Step S102: a target object is identified from video image frames in the sequence of video images.
Step S103: the target object is tracked and its motion trajectory is determined.
Step S104: video structured information is acquired based on the target object and its motion trajectory.
Step S105: target object retrieval and/or video condensation of the video image sequence is performed based on the video structured information.
The video image processing method provided by the invention can identify and track a target object in a video image sequence, acquire video structured information based on the target object and its motion trajectory, and then either retrieve based on that information for fast target detection or perform video condensation based on it. That is, if the user knows specific information about the target object in advance, that information can be used directly for retrieval so as to detect the target object quickly; if the user does not know any such information, the user can directly browse the condensed video. The method thus detects the target object quickly from video, improving the speed of target detection and hence of case investigation, with a better user experience.
Conventional target identification methods mostly adopt background modeling: the background of the image is modeled first, the image is then compared with the background model, and the foreground target is determined from the comparison result. However, this method adapts poorly to low-contrast and changing-light environments: it often produces false identifications for moving targets and missed detections for stationary targets. In view of these problems with existing identification methods, and to improve the accuracy of subsequent retrieval, the invention identifies the target object in video image frames based on a deep convolutional neural network.
The process of identifying the target object from a video image frame based on the deep convolutional neural network comprises: first, target detection is performed with a target detection model based on a deep convolutional neural network, generating a series of target candidate boxes in the video image frame, as shown in Fig. 2; then, target identification is performed with a target classification model based on a deep convolutional neural network, and the target candidate boxes are corrected.
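The patent does not specify how the candidate boxes are corrected. One standard post-processing step in such detection pipelines (shown here only as a plausible illustration, with an invented IoU threshold) is non-maximum suppression, which keeps the highest-scoring box and removes candidates that overlap it heavily:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the best-scoring candidate boxes, suppressing heavy overlaps.

    Returns the indices of the boxes kept, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In a pipeline like the one described, the classification model would then score each surviving box, and low-confidence boxes would be discarded.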
It should be noted that the deep convolutional neural network needs to be trained before it is used for identification. In one possible implementation, given the characteristics of the public-security monitoring environment and its targets (vehicles/people), an AlexNet network structure can be selected: the network is pre-trained on the ImageNet 2012 data set and then fine-tuned with public-security monitoring samples. Because different vehicle types differ greatly, the samples are divided into 6 categories during training: cars, buses, trucks, tricycles, non-motor vehicles (motorcycles/electric bicycles/bicycles), and pedestrians.
In addition, to improve subsequent identification speed, the target detection network model and the target classification network model can share convolution features.
After the target object is identified, it is tracked. Existing optical flow tracking algorithms usually extract optical flow points from the whole image; however, target tracking is concerned only with the target object, and optical flow points in other, irrelevant areas interfere with tracking it. To improve tracking speed and accuracy, an improved Lucas-Kanade optical flow tracking algorithm is adopted: the target object is tracked based on the optical flow points extracted from the target object (the foreground image) in the video image frame.
Specifically, optical flow points are first extracted from the whole video image frame, then optical flow points are extracted from the foreground image (i.e., the target object in the video image frame), and finally the optical flow points in the whole frame other than those in the foreground image are filtered out, as shown in Fig. 3. The foreground image is obtained by frame difference between two adjacent video images; in the second image in Fig. 3, the white part is the foreground region and the black part is the background region.
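The filtering step above can be sketched with NumPy. This is an illustrative simplification: the foreground mask comes from thresholding the absolute frame difference (the threshold value is invented), and an optical flow point is kept only if it falls on a foreground pixel.

```python
import numpy as np

def foreground_mask(prev_frame, cur_frame, thresh=15):
    """Boolean foreground mask by frame difference of two grayscale frames."""
    diff = np.abs(cur_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > thresh

def filter_flow_points(points, mask):
    """Keep only the optical flow points (x, y) that lie on foreground pixels."""
    return [(x, y) for x, y in points if mask[y, x]]
```

In practice the points would come from a corner detector or the Lucas-Kanade tracker itself; the point of the filter is that only points on the moving target survive, so background motion no longer perturbs the track.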
In the above embodiment, the video structured information obtained based on the target object and its motion trajectory may include: text information and/or image feature information of the target object. The text information of the target object may include attribute information and motion information of the target object.
For example, when the target object is a vehicle, its attribute information may include the vehicle's category, color, license plate number, brand and model, and so on, and its motion information may include the vehicle's direction of motion, its position in the video image frame, and so on.
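As an illustration of what such a structured record might look like (the field names and sample values are invented; the patent only prescribes the categories of information):

```python
from dataclasses import dataclass, field

@dataclass
class VehicleRecord:
    """Hypothetical text-information record for one tracked vehicle."""
    category: str          # attribute information
    color: str
    plate: str
    brand_model: str
    direction: str         # motion information
    positions: list = field(default_factory=list)  # (frame, x, y) samples

record = VehicleRecord(
    category="car", color="white", plate="ZhejiangA12345",
    brand_model="example-brand", direction="eastbound",
    positions=[(101, 320, 240), (102, 332, 241)],
)
```

Records of this shape, one per tracked object, form the text information base that the retrieval step searches.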
After the video structured information is acquired, retrieval can be performed based on it, so that the target object can be detected quickly. Target object retrieval based on the video structured information can be implemented in several ways.
In one possible implementation, retrieval may be text-based: when a retrieval instruction carrying text information to be retrieved is received, the text information of the target objects is searched based on the text information to be retrieved, a retrieval result is obtained, and the target object information associated with the retrieval result is output. It should be noted that, in this embodiment, the text information of all target objects may be combined into a text information base, and text retrieval is performed in that base.
Preferably, the target object information is a target object image: an image of the target object extracted in advance from the video image frame in which it appears, associated with the text information and/or image feature information in the video structured information.
Specifically, the text information to be retrieved input by the user is acquired, target text information matching it is searched for in the text information of the target objects, and the target object image associated with the target text information is output.
For example, a user inputs a vehicle license plate number on the retrieval interface for retrieval; if the text information of a target object contains that license plate number, the image of the vehicle associated with the license plate number can be displayed directly, so that the target is quickly found.
In another possible implementation manner, the retrieval may be performed based on rules: when a retrieval instruction of event information to be retrieved is received, retrieval is performed in the text information of the target object based on the event to be retrieved and a pre-established event model to obtain a retrieval result, and the target object information associated with the retrieval result is output.
Specifically, event information to be retrieved input by a user is acquired, text information related to the event information to be retrieved is searched in the text information based on the event information to be retrieved, and target text information is determined from the text information related to the event information to be retrieved, wherein the event information corresponding to the target text information is the event information to be retrieved; the target object image associated with the target text information is then output.
The event information may be area intrusion, tripwire, loitering and the like, the text information related to the event information to be retrieved may be position information of the target object in the video image frame or a motion track of the target object, which event occurs in the target object may be determined based on a change in the position or the motion track of the target object, and if an event identical to the event to be retrieved occurs in a target object, the target object image is output.
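As a toy illustration of such a rule, the sketch below decides whether a trajectory triggers an "area intrusion" event by crossing from outside a monitored rectangle to inside it. The region format and function names are assumptions for illustration; the embodiment does not specify its event model at this level of detail.

```python
def inside(point, region):
    """region is an assumed axis-aligned rectangle (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1

def area_intrusion(trajectory, region):
    """True if any consecutive step moves from outside the region to inside."""
    return any(not inside(a, region) and inside(b, region)
               for a, b in zip(trajectory, trajectory[1:]))

region = (10, 10, 20, 20)
intruder  = [(0, 0), (5, 5), (12, 15)]  # ends inside the region
passer_by = [(0, 0), (5, 5), (8, 30)]   # never enters it
print(area_intrusion(intruder, region))   # True
print(area_intrusion(passer_by, region))  # False
```

Tripwire or loitering rules would be analogous predicates over the same trajectory data.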
In another possible implementation manner, the retrieval may be performed based on an image: when a retrieval instruction of an image to be retrieved is received, retrieval is performed in the image feature information base of the target object based on the image to be retrieved to obtain a retrieval result, and the target object information associated with the retrieval result is output. It should be noted that, in this embodiment, the image feature information of all the target objects may be combined into an image feature information base, and when performing image retrieval, the retrieval is performed in the image feature information base based on the features of the image to be retrieved.
Specifically, an image to be retrieved input by a user is obtained, image features of the image to be retrieved are extracted as image features to be retrieved, target image features matched with the image features to be retrieved are then searched in the image feature information based on the image features to be retrieved, and the target object image associated with the target image features is output. For the user, when searching based on an image, it suffices to input the image to be retrieved in the retrieval interface to obtain the target object images matching it.
In a preferred implementation manner, the image features include depth convolution features and local features. In this embodiment, matching may first be performed in the image feature information of the target object according to a first matching rule based on the depth convolution features of the image to be retrieved to obtain a candidate feature set; the candidate feature set is then matched against the depth convolution features and the local features of the image to be retrieved according to a second matching rule to obtain the target image features as the retrieval result. That is, the final output is the information of the target object associated with the target image features.
Referring to fig. 4, a schematic flow chart of a specific implementation manner for obtaining a retrieval result by retrieving in image feature information of a target object based on an image to be retrieved is shown, and may include:
Step S401: acquiring the depth convolution features and the local features of the image to be retrieved, and performing binary coding on the depth convolution features of the image to be retrieved to obtain the binary coding features of the image to be retrieved.
It should be noted that the depth convolution features are extracted from the higher layers of a deep convolutional neural network (CNN), while the local features capture local attributes of the image and can serve as an aid and supplement to the depth convolution features. Here, the local features are preferably SURF features, which are fast and highly robust, and are denoted F_SURF.
In a preferred implementation, a PCA method may be used to reduce the dimensionality of the depth convolution features and remove redundant features; the depth convolution features with redundancy removed are used as the final depth convolution features for subsequent matching and are denoted F_CNN+PCA. The depth convolution features are binary-coded using an LSH (locality-sensitive hashing) method to generate the binary coding features, denoted F_CNNH.
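The binary coding step can be sketched with random-hyperplane LSH, one common locality-sensitive hashing scheme for real-valued vectors: bit i is the sign of the feature's dot product with random hyperplane i, so nearby features tend to share bits. The bit width, dimensionality, and seed below are illustrative assumptions; the patent does not fix a particular LSH variant, and in this scheme the input would be the PCA-reduced feature F_CNN+PCA.

```python
import random

def random_hyperplanes(n_bits, dim, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(feature, hyperplanes):
    """Binary code of a feature vector: one sign bit per random hyperplane."""
    return [1 if sum(f * h for f, h in zip(feature, plane)) > 0 else 0
            for plane in hyperplanes]

planes = random_hyperplanes(n_bits=16, dim=8, seed=42)
a = [0.9, 0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 0.4]
c = [-v for v in a]  # exact opposite of a: every sign bit flips

hamming = lambda u, v: sum(x != y for x, y in zip(u, v))
print(hamming(lsh_code(a, planes), lsh_code(a, planes)))  # 0
print(hamming(lsh_code(a, planes), lsh_code(c, planes)))  # 16
```

Because the hyperplanes are random, vectors with a small angle between them receive codes with a small Hamming distance, which is what makes the coarse matching of the next step fast but approximate.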
Step S402: matching the binary coding features of the image to be retrieved against the binary coding features corresponding to each depth convolution feature in the image feature information of the target object, determining those whose matching degree with the binary coding features of the image to be retrieved is greater than a first preset value as target binary coding features, and taking the target depth convolution features and local features corresponding to the target binary coding features as the candidate feature set.
When the binary coding features are matched, the matching degree can be represented by similarity, and the similarity can be obtained by calculating the Hamming distance between the two binary coding features.
Since each image feature is associated with a target image, determining the candidate feature set is equivalent to determining a candidate image set {O_1, O_2, …, O_N}, where O_i denotes the i-th image to be matched, S_i denotes the similarity between the image to be retrieved and the i-th image to be matched, and θ_H is the similarity threshold.
The process of matching the binary coding features is a coarse matching process, and after the coarse matching is completed, the precise matching is further performed based on the depth convolution features and the local features.
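The coarse matching over binary codes can be sketched as follows: compute the Hamming distance between the query code and each stored code, convert it to a similarity, and keep the images whose similarity reaches the threshold θ_H. Modeling the codes as Python ints, the 8-bit width, and the threshold value are illustrative assumptions.

```python
N_BITS = 8

def hamming(a, b):
    """Hamming distance between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def similarity(a, b):
    return 1.0 - hamming(a, b) / N_BITS

def coarse_match(query_code, stored, theta_h=0.75):
    """stored: {image_id: binary code}. Returns the candidate image set."""
    return [img for img, code in stored.items()
            if similarity(query_code, code) >= theta_h]

stored = {"O1": 0b10110100, "O2": 0b10110111, "O3": 0b01001011}
candidates = coarse_match(0b10110101, stored)
print(candidates)  # ['O1', 'O2'] (O3 differs in 7 of 8 bits)
```

Only the candidates that survive this cheap filter go on to the exact matching of step S403.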
Step S403: and matching the depth convolution features of the image to be retrieved with each depth convolution feature in the candidate feature set, matching the local features of the image to be retrieved with each local feature in the candidate feature set, and taking the image features of which the comprehensive matching degree with the corresponding local features is greater than a second preset value as retrieval results.
In the process of accurate matching, Euclidean distance is adopted for similarity calculation, and the similarity is specifically calculated according to the following formula:
S(k) = α × S_CNN+PCA(k) + β × S_SURF(k)
where α and β denote the weights of the similarities computed from the depth convolution feature and the local feature respectively, S_CNN+PCA(k) denotes the similarity computed from the depth convolution feature, and S_SURF(k) denotes the similarity computed from the local feature.
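A minimal numeric sketch of this fusion step, assuming one common way of turning a Euclidean distance into a similarity (1/(1+d)) and illustrative weights α = 0.7, β = 0.3; none of these specific values are prescribed by the embodiment.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def to_similarity(dist):
    return 1.0 / (1.0 + dist)  # assumed mapping; smaller distance, higher S

def fused_score(s_cnn, s_surf, alpha=0.7, beta=0.3):
    """S(k) = alpha * S_CNN+PCA(k) + beta * S_SURF(k)"""
    return alpha * s_cnn + beta * s_surf

s_cnn  = to_similarity(euclidean([1.0, 2.0], [1.0, 2.0]))  # identical: 1.0
s_surf = to_similarity(euclidean([0.0, 0.0], [3.0, 4.0]))  # distance 5: 1/6
print(fused_score(s_cnn, s_surf))
```

The weights let the CNN feature dominate while the SURF feature breaks ties among visually close candidates.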
Outputting the target object information associated with the retrieval result is, specifically: outputting the target object image associated with the image features whose comprehensive matching degree between the depth convolution features and the corresponding local features is greater than the second preset value.
In outputting the target object images, if there are a plurality of target object images satisfying the condition, the respective target object images may be displayed in order of high to low similarity.
The above procedure gives an implementation that increases the speed of image-based investigation, i.e., retrieving the target based on known target information. The precondition for obtaining the target in this way is that the keywords used for retrieval are known in advance, i.e., the user already knows part of the information about the object under investigation. In some cases, however, the investigator may know nothing about the object, i.e., there are no keywords to retrieve with, and the video image sequence can then only be browsed frame by frame. Considering that a video image sequence usually contains a large number of video image frames, in order to increase the detection speed, the embodiment of the invention performs video condensation on the video image frame sequence based on the video structured information, so that the condensed video has fewer frames while still containing a large amount of information.
In this embodiment, the video structured information may include: the image density of each video image frame in the video image sequence, where the image density is used to characterize the condition of the target object in the video image frame.
Referring to fig. 5, a schematic flowchart illustrating a process of obtaining the video structured information based on the target object and the motion trajectory of the target object in the foregoing embodiment may include:
Step S501: and determining the structural information of the monitored area in the video image sequence according to the position of the target object in the video image frame and the motion track of the target object.
The structure information of the monitored area comprises area information of the target object appearing in the monitored area.
Step S502: the image density of each video image frame in the video image sequence is determined based on the structural information of the monitored area.
After the video structuring information is obtained, video condensation can be performed on the video image sequence based on the video structuring information. In order to increase the video concentration speed, in the embodiment of the present invention, a target video image sequence is first segmented based on the image density of each video image frame, a video segment to be concentrated is determined from each segment, then the video segment to be concentrated is subjected to video concentration, and the video segment subjected to video concentration is merged with other video segments not subjected to video concentration, so as to obtain a concentrated video image sequence.
Further, referring to fig. 6, a flowchart illustrating an implementation process of segmenting the video image sequence based on image density and determining the video segments to be condensed from the segments is shown; the implementation process may include:
Step S601: the video image sequence is divided into a plurality of video segments by the image density of each video image frame by using a preset image density threshold value.
Step S602: and determining the video segments of which the image density of each video image frame is less than the image density threshold value as the video segments to be condensed.
Illustratively, suppose the video image sequence contains 100 video image frames: the image density of each video image frame in the first 30 frames is less than the set image density threshold, the image density of each frame in the 31st to 70th frames is greater than the threshold, and the image density of each frame in the 71st to 100th frames is again less than the threshold. The video image sequence can then be divided into 3 video segments: frames 1-30 form the 1st video segment, frames 31-70 the 2nd, and frames 71-100 the 3rd. Since the image densities of frames 1-30 and frames 71-100 are less than the set image density threshold, the two video segments of frames 1-30 and frames 71-100 are determined as the video segments to be condensed.
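Steps S601-S602 and the 100-frame example can be sketched as a run-length split over the per-frame densities; the density values and threshold below are illustrative.

```python
from itertools import groupby

def segment_by_density(densities, threshold):
    """Split the sequence into runs on the same side of the threshold.
    Returns (start, end, to_condense) with 1-based inclusive frame indices;
    to_condense is True for runs whose density is below the threshold."""
    segments, start = [], 1
    for below, run in groupby(densities, key=lambda d: d < threshold):
        n = len(list(run))
        segments.append((start, start + n - 1, below))
        start += n
    return segments

# 100 frames: low density for 1-30, high for 31-70, low again for 71-100
densities = [1] * 30 + [9] * 40 + [1] * 30
print(segment_by_density(densities, threshold=5))
# [(1, 30, True), (31, 70, False), (71, 100, True)]
```

Only the low-density runs are handed to the condensation step; the high-density run is merged back unchanged afterwards.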
After determining the video segment to be condensed, performing video condensation on the video segment to be condensed, referring to fig. 7, a flowchart illustrating an implementation process of performing video condensation on the video segment to be condensed is shown, and the implementation process may include:
Step S701: determining, through a spatio-temporal condensation model, an optimal movement strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension.
Step S702: and carrying out image fusion based on the optimal movement strategy to obtain the concentrated video segment.
The spatio-temporal condensation model provided by the embodiment of the invention can perform maximum video condensation in the two dimensions of time and space while preserving the original temporal order of the targets and losing none of them; the condensed video is free of collisions and flicker and has a good visual effect.
Specifically, the energy function of the spatio-temporal condensation model is characterized as follows:
E(M) = min{ Σ E_a(b) + α Σ E_c(b, b') + β Σ E_t(b, b') }, (b, b' ∈ B)
where b is the image sequence of a first target object and b' is the image sequence of a second target object. Σ E_a(b) is the activity energy loss function: if a target with a large area is not mapped into the condensed video, the penalty represented by this term is larger, and vice versa; in other words, targets with large areas should be retained in the condensed video. E_c(b, b') is the collision penalty term, the inner product over the period in which the two trajectories conflict. In the condensed video the original targets are shifted along the time axis and in spatial distribution, so crossings, collisions, occlusions, and similar situations between trajectories inevitably arise; if two target sequences share a time interval and their trajectories cross, the penalty term is the inner product over the corresponding overlapping region. E_t(b, b') is the timing penalty term, which expresses that the order of the activity events in the original video should be kept as far as possible; for example, if two people walk one behind the other or talk side by side in the original video, this relative relationship is reasonably preserved in the condensed video. E_t(b, b') = exp(-d(b, b')/ω), where d(b, b') denotes the Euclidean distance between the center pixels of the two trajectories over their shared time interval and ω is a user-defined parameter that adjusts the event timing. The optimal movement strategy described above is the strategy by which the corresponding target objects move in time and space when the energy function of the spatio-temporal condensation model attains its minimum value.
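The three energy terms can be given a rough numeric shape as follows. Trajectories are modeled as dicts mapping frame index to (x, y, area); the overlap inner product is approximated by counting shared frames where the targets come closer than an assumed radius, and ω is a user-chosen scale. All helper names and simplifications are assumptions for illustration only, not the embodiment's exact model.

```python
import math

def activity_cost(traj, kept):
    """E_a: dropping a large-area target from the condensed video costs more."""
    return 0.0 if kept else sum(area for (_, _, area) in traj.values())

def collision_cost(traj_a, traj_b, radius=1.0):
    """E_c: proxy for the inner product over the conflict period; counts
    shared frames in which the two targets come closer than `radius`."""
    shared = traj_a.keys() & traj_b.keys()
    return sum(1.0 for t in shared
               if math.dist(traj_a[t][:2], traj_b[t][:2]) < radius)

def timing_cost(traj_a, traj_b, omega=10.0):
    """E_t = exp(-d/omega), d being the distance between the trajectories'
    centers over their shared time interval."""
    shared = sorted(traj_a.keys() & traj_b.keys())
    if not shared:
        return 0.0
    mid = shared[len(shared) // 2]
    d = math.dist(traj_a[mid][:2], traj_b[mid][:2])
    return math.exp(-d / omega)

a = {1: (0.0, 0.0, 4.0), 2: (1.0, 0.0, 4.0)}
b = {2: (1.2, 0.0, 2.0), 3: (2.0, 0.0, 2.0)}
print(collision_cost(a, b), timing_cost(a, b))
```

An optimizer would search over time/space shifts of each trajectory to minimize the weighted sum of these terms.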
Corresponding to the above method, an embodiment of the present invention further provides a video image processing apparatus; referring to fig. 8, which shows a schematic structural diagram of the apparatus, the apparatus may include: a video acquisition module 801, a target identification module 802, a target tracking module 803, an information acquisition module 804, and a processing module 805.
A video acquisition module 801, configured to acquire a video image sequence.
A target identification module 802, configured to identify a target object from video image frames in the video image sequence acquired by the video acquisition module 801.
And a target tracking module 803, configured to track the target object identified by the target identification module 802 and determine a motion trajectory of the target object.
An information obtaining module 804, configured to obtain video structured information based on the target object identified by the target identification module 802 and the motion trajectory of the target object determined by the target tracking module 803.
A processing module 805, configured to perform target object retrieval and/or perform video condensation on the video image sequence based on the video structural information acquired by the video structural information acquiring module 804.
The video image processing device provided by the invention can identify and track the target object in a video image sequence and further acquire video structured information based on the target object and its motion trajectory. After the video structured information is acquired, retrieval can be performed based on it, and the target can be quickly found in this way. That is, if the user knows some information about the target object in advance, that information can be used directly for retrieval so as to quickly find the target object; if the user does not know any information about the target object, the user can directly browse the condensed video to quickly find it. The video image processing device provided by the embodiment of the invention can thus quickly detect the target object in a video; that is, the embodiment of the invention improves the speed of target detection, further improves the speed of case investigation, and provides a better user experience.
In the video image processing apparatus provided in the above embodiment, the target identification module 802 is specifically configured to identify a target object from each video image frame in a video image sequence based on a deep convolutional neural network.
In the video image processing apparatus provided in the above embodiment, the target tracking module 803 is specifically configured to track the target object by using a Lucas-Kanade optical flow tracking algorithm based on the optical flow points extracted from the target object in the video image frame.
In the video image processing apparatus provided in the foregoing embodiment, the video structured information obtained by the video structured information obtaining module 804 includes: text information and/or image feature information of the target object, the text information of the target object including attribute information and motion information of the target object.
The processing module 805 includes: the device comprises a retrieval module and an output module.
The retrieval module is used for retrieving in the text information of the target object based on the text information to be retrieved when a retrieval instruction of the text information to be retrieved is received, or retrieving in the image feature information of the target object based on the image to be retrieved when a retrieval instruction of the image to be retrieved is received, or retrieving in the text information based on the event to be retrieved and a pre-established event model when a retrieval instruction of the event information to be retrieved is received, so as to obtain a retrieval result.
And the output module is used for outputting the target object information related to the retrieval result of the retrieval module.
In the above-described embodiment, the image feature information of the target object includes the depth convolution features and the local features associated with the depth convolution features.
The retrieval module may include: a coarse matching module and a precise matching module.
The coarse matching module is used for matching in the image feature information of the target object according to a first matching rule based on the depth convolution features of the image to be retrieved to obtain a candidate feature set.
The precise matching module is used for matching the depth convolution features and the local features of the image to be retrieved in the candidate feature set according to a second matching rule to obtain the target image features as the retrieval result.
Further, the coarse matching module comprises: a feature acquisition and processing sub-module and a coarse matching sub-module.
And the feature obtaining and processing submodule is used for obtaining the depth convolution features and the local features of the image to be retrieved, performing binary coding on the depth convolution features of the image to be retrieved to obtain binary coding features of the image to be retrieved, and also used for performing binary coding on each depth convolution feature in the image feature information of the target object to obtain binary coding features corresponding to each depth convolution feature in the image feature information of the target object.
The coarse matching submodule is used for respectively matching the binary coding features of the image to be retrieved with the binary coding features corresponding to each depth convolution feature in the image feature information of the target object, determining those whose matching degree with the binary coding features of the image to be retrieved is greater than a first preset value as target binary coding features, and taking the target depth convolution features and the target local features corresponding to the target binary coding features as the candidate feature set.
The precise matching module is specifically used for matching the depth convolution features of the image to be retrieved with each depth convolution feature in the candidate feature set, matching the local features of the image to be retrieved with each local feature in the candidate feature set, and taking the image features whose comprehensive matching degree over the depth convolution and corresponding local features is greater than a second preset value as the retrieval result.
The output module is specifically configured to output the target object image associated with the image features whose comprehensive matching degree between the depth convolution features and the corresponding local features is greater than the second preset value, where the target object image is an image of the target object extracted in advance from the video image frame where the target object is located.
In the video image processing apparatus provided in the foregoing embodiment, the video structured information obtained by the video structured information obtaining module 804 includes: the image density of each video image frame in the video image sequence, where the image density is used for representing the condition of the target object in the video image frame.
The video structured information obtaining module comprises: a monitoring area structure determining submodule and an image density determining submodule.
And the monitoring area structure determining submodule is used for determining the structure information of a monitoring area in a video image sequence according to the position of a target object in a video image frame and the motion track of the target object, wherein the structure information of the monitoring area comprises the area information of the target object appearing in the monitoring area.
The image density determining submodule is used for determining the image density of each video image frame in the video image sequence based on the structural information of the monitoring area determined by the monitoring area structure determining submodule.
In the video image processing apparatus provided in the foregoing embodiment, the processing module includes: the device comprises a video preprocessing module and a video concentration module.
And the video preprocessing module is used for segmenting the video image sequence based on the image density and determining a video segment to be condensed from each segment.
And the video concentration module is used for carrying out video concentration on the video segment to be concentrated and merging the video segment subjected to the video concentration with other video segments which are not subjected to the video concentration to obtain a concentrated video image sequence.
Further, the video preprocessing module comprises: a video segmentation submodule and a to-be-condensed video segment determining submodule.
The video segmentation sub-module is used for dividing the video image sequence into a plurality of video segments by utilizing a preset image density threshold value according to the image density of each video image frame;
And the to-be-condensed video segment determining submodule is used for determining the video segments in which the image densities of all the video image frames are less than the image density threshold value as the to-be-condensed video segments.
Further, the video concentration module comprises: an optimal concentration strategy determining submodule and an image fusion submodule.
And the optimal concentration strategy determining submodule is used for determining an optimal movement strategy for moving at least one target object in the video segment to be concentrated on a time dimension and a space dimension through a space-time concentration model.
The image fusion submodule is used for performing image fusion based on the optimal movement strategy to obtain the concentrated video segment.
The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division of the units is only one logical division, and other divisions may be realized in practice: for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (20)
1. A method for video image processing, the method comprising:
Acquiring a video image sequence;
Identifying a target object from image frames in the sequence of video images;
tracking the target object and determining a motion track of the target object;
acquiring video structured information based on the target object and the motion trail of the target object;
Performing target object retrieval and/or performing video condensation on the video image sequence based on the video structural information, wherein the video structural information comprises image feature information of the target object and/or image densities of video image frames in the video image sequence, the image feature information of the target object is used for target object retrieval, the image densities of the video image frames in the video image sequence are used for video condensation, the image feature information comprises convolution characteristics and local characteristics, and the image densities are used for representing the situation of the target object in the video image frames;
The target object retrieval is carried out based on the image characteristics of the target object, and the method comprises the following steps:
acquiring a binarization coding feature of an image to be retrieved, wherein the binarization coding feature is obtained by performing binarization coding on a convolution feature of the image to be retrieved; acquiring a candidate feature set from the image feature information of the target object based on the binarization coding features of the image to be retrieved, wherein the candidate feature set comprises at least one convolution feature and a local feature corresponding to each convolution feature; and determining a target image feature from the candidate feature set based on the convolution feature and the local feature of the image to be retrieved, and outputting a target object image associated with the target image feature.
2. The method of claim 1, wherein the identifying a target object from each frame of video image in the sequence of video images comprises:
Identifying a target object from each video image frame in the sequence of video images based on a deep convolutional neural network.
3. The video image processing method of claim 1, wherein the tracking the target object comprises:
And tracking the target object by adopting a Lucas-Kanade optical flow method tracking algorithm based on the optical flow points extracted from the target object in the video image frame.
4. The video image processing method of claim 1, wherein the video structuring information further comprises: text information of a target object, wherein the text information of the target object comprises attribute information and motion information of the target object;
Then, performing target object retrieval based on the text information of the target object, including:
when a retrieval instruction of text information to be retrieved is received, retrieving in the text information of the target object based on the text information to be retrieved, or when the retrieval instruction of event information to be retrieved is received, retrieving in the text information based on the event to be retrieved and a pre-established event model to obtain a retrieval result;
And outputting the target object information associated with the retrieval result.
5. The video image processing method according to claim 4, wherein the obtaining a candidate feature set from the image feature information of the target object based on the binarization coding feature of the image to be retrieved comprises:
matching, based on the binarization coding feature of the image to be retrieved, in the image feature information of the target object according to a first matching rule, to obtain the candidate feature set;
and the determining a target image feature from the candidate feature set based on the convolution feature and the local feature of the image to be retrieved comprises:
matching, based on the deep convolution feature and the local feature of the image to be retrieved, in the candidate feature set according to a second matching rule, to obtain the target image feature.
6. The video image processing method according to claim 5, wherein the matching, based on the binarization coding feature of the image to be retrieved, in the image feature information of the target object according to the first matching rule to obtain the candidate feature set comprises:
matching the binarization coding feature of the image to be retrieved against the binarization coding feature corresponding to each deep convolution feature in the image feature information of the target object, determining each binarization coding feature whose matching degree with the binarization coding feature of the image to be retrieved is greater than a first preset value as a target binarization coding feature, and taking the target deep convolution features and the target local features corresponding to the target binarization coding features as the candidate feature set;
the matching, based on the deep convolution feature and the local feature of the image to be retrieved, in the candidate feature set according to the second matching rule to obtain the target image feature comprises:
matching the deep convolution feature of the image to be retrieved with each deep convolution feature in the candidate feature set, matching the local feature of the image to be retrieved with each local feature in the candidate feature set, and taking each image feature whose comprehensive matching degree of the deep convolution feature and the corresponding local feature is greater than a second preset value as a retrieval result;
and the outputting a target object image associated with the target image feature comprises:
outputting a target object image associated with each image feature whose comprehensive matching degree is greater than the second preset value, wherein the target object image is an image of the target object extracted in advance from the video image frame in which the target object is located.
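The two-stage matching in claims 5 and 6 — a cheap binarized coarse filter followed by a comprehensive score over deep convolution and local features — can be sketched as below. The cosine similarity, the equal weighting of the two feature types, and the default thresholds are illustrative assumptions; the claims only require "a first matching rule" and "a second matching rule":

```python
def cosine(a, b):
    # Cosine similarity between two float feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def coarse_to_fine(query, database, t1=0.8, t2=0.8, w=0.5):
    # query/database entries: dicts with 'bits' (binarized code),
    # 'conv' (deep convolution feature) and 'local' (local feature).
    def ham(a, b):
        return sum(1 for x, y in zip(a, b) if x == y) / len(a)
    # Stage 1 (first matching rule): keep entries whose bit-code
    # similarity exceeds the first preset value t1.
    candidates = [d for d in database if ham(query['bits'], d['bits']) > t1]
    # Stage 2 (second matching rule): comprehensive matching degree as a
    # weighted sum of conv and local similarities, thresholded at t2.
    results = []
    for d in candidates:
        score = (w * cosine(query['conv'], d['conv'])
                 + (1 - w) * cosine(query['local'], d['local']))
        if score > t2:
            results.append((d['id'], score))
    return sorted(results, key=lambda r: -r[1])
```

The design point is that the expensive float comparisons in stage 2 run only over the small candidate set surviving stage 1.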
7. The video image processing method according to claim 1, wherein the obtaining the image density of each video image frame in the video image sequence based on the target object and the motion track of the target object comprises:
determining structure information of a monitored area in the video image sequence according to the position of the target object in each video image frame and the motion track of the target object, wherein the structure information of the monitored area comprises area information of the target object appearing in the monitored area;
and determining the image density of each video image frame in the video image sequence based on the structure information of the monitored area.
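The claims leave "image density" abstract beyond representing the situation of target objects in a frame. One plausible, deliberately simple reading is a per-frame count of detected objects whose centre lies inside the monitored area — a hypothetical sketch, not the patented formula:

```python
def image_density(frame_boxes, region):
    # frame_boxes: (x1, y1, x2, y2) detections in one video image frame.
    # region: the monitored area as (x1, y1, x2, y2).
    # Returns the number of target objects centred inside the region.
    x1, y1, x2, y2 = region
    count = 0
    for bx1, by1, bx2, by2 in frame_boxes:
        cx, cy = (bx1 + bx2) / 2, (by1 + by2) / 2
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            count += 1
    return count
```

A density sequence computed this way per frame is exactly the input the segmentation step below needs.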
8. The video image processing method according to claim 7, wherein the performing video condensation on the video image sequence based on the video structured information comprises:
segmenting the video image sequence based on the image density of each video image frame, and determining a video segment to be condensed from each segment;
and performing video condensation on the video segment to be condensed, and merging the condensed video segment with the other video segments not condensed, to obtain a condensed video image sequence.
9. The video image processing method according to claim 8, wherein the segmenting the video image sequence based on the image density and determining a video segment to be condensed from each segment comprises:
dividing the video image sequence into a plurality of video segments by using a preset image density threshold according to the image density of each video image frame, wherein each video segment comprises a plurality of continuous video image frames;
and determining a video segment in which the image density of each video image frame is greater than the image density threshold as the video segment to be condensed.
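The threshold-based segmentation of claim 9 groups consecutive frames by whether their image density exceeds the preset threshold; the segments on the high side are the ones selected for condensation. A sketch returning, per segment, its first and last frame index and a "condense?" flag:

```python
def split_by_density(densities, threshold):
    # densities: image density of each video image frame, in frame order.
    # Consecutive frames staying on the same side of the threshold form
    # one segment; (start, end, above_threshold) per segment.
    segments, start = [], 0
    for i in range(1, len(densities) + 1):
        if i == len(densities) or \
           (densities[i] > threshold) != (densities[start] > threshold):
            segments.append((start, i - 1, densities[start] > threshold))
            start = i
    return segments
```

Segments flagged `True` feed the condensation step; the `False` segments are later merged back unchanged, as claim 8 describes.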
10. The video image processing method according to claim 8, wherein the performing video condensation on the video segment to be condensed comprises:
determining, through a space-time condensation model, an optimal movement strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension;
and performing image fusion based on the optimal movement strategy to obtain the condensed video segment.
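The space-time condensation model of claim 10 searches for a strategy for moving object tracks in time and space. The toy greedy stand-in below only illustrates the idea: each track takes the earliest start frame that avoids spatio-temporal collisions with already-placed tracks. Real synopsis systems instead minimize a joint energy over collision, chronology, and total length:

```python
def boxes_overlap(a, b):
    # Axis-aligned overlap test for (x1, y1, x2, y2) boxes.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def condense(tracks):
    # tracks: list of (duration_in_frames, bounding_box) per target object.
    # Greedily assign each track the earliest start such that no two
    # tracks with overlapping boxes ever share a frame; returns the new
    # start frame of each track in input order.
    placed, out = [], []
    for dur, box in tracks:
        start, moved = 0, True
        while moved:
            moved = False
            for s, d, b in placed:
                in_time = start < s + d and s < start + dur
                if in_time and boxes_overlap(box, b):
                    start = s + d   # push past the conflicting track
                    moved = True
        placed.append((start, dur, box))
        out.append(start)
    return out
```

Tracks with disjoint spatial footprints get stacked at frame 0, which is precisely how condensation shortens a segment; image fusion then composites the shifted tracks onto a common background.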
11. A video image processing apparatus, characterized in that the apparatus comprises: the system comprises a video acquisition module, a target identification module, a target tracking module, a video structured information acquisition module and a processing module;
The video acquisition module is used for acquiring a video image sequence;
the target identification module is used for identifying a target object from video image frames in the video image sequence acquired by the video acquisition module;
the target tracking module is used for tracking the target object identified by the target identification module and determining the motion track of the target object;
the video structured information acquisition module is used for acquiring video structured information based on the target object identified by the target identification module and the motion track of the target object determined by the target tracking module;
the processing module is used for performing target object retrieval and/or video condensation on the video image sequence based on the video structured information acquired by the video structured information acquisition module; wherein the video structured information comprises image feature information of the target object and/or the image density of each video image frame in the video image sequence, the image feature information of the target object is used for target object retrieval, the image density of each video image frame in the video image sequence is used for video condensation, the image feature information comprises convolution features and local features, and the image density is used for representing the situation of the target object in the video image frames;
Wherein the processing module comprises a retrieval module and an output module;
the retrieval module is used for acquiring the binarization coding feature of an image to be retrieved, acquiring a candidate feature set from the image feature information of the target object based on the binarization coding feature of the image to be retrieved, and determining a target image feature from the candidate feature set based on the convolution feature and the local feature of the image to be retrieved; the binarization coding feature is obtained by performing binarization coding on the convolution feature of the image to be retrieved; and the candidate feature set comprises at least one convolution feature and a local feature corresponding to each convolution feature;
The output module is used for outputting the target object image associated with the target image characteristic.
12. the video image processing apparatus according to claim 11, wherein the target identification module is specifically configured to identify a target object from each video image frame in the video image sequence based on a deep convolutional neural network.
13. the video image processing apparatus according to claim 11, wherein the target tracking module is specifically configured to track the target object by using a Lucas-Kanade optical flow tracking algorithm based on optical flow points extracted from the target object in the video image frame.
14. The video image processing apparatus according to claim 11, wherein the video structured information further comprises text information of the target object, and the text information of the target object comprises attribute information and motion information of the target object;
The retrieval module is further used for retrieving in the text information of the target object based on the text information to be retrieved when a retrieval instruction of the text information to be retrieved is received, or retrieving in the text information based on the event to be retrieved and a pre-established event model when a retrieval instruction of the event information to be retrieved is received, so as to obtain a retrieval result;
The output module is further used for outputting the target object information associated with the retrieval result.
15. The video image processing apparatus according to claim 14, wherein the retrieval module comprises: a rough matching module and a precise matching module;
the rough matching module is used for acquiring the binarization coding feature of the image to be retrieved, and matching, based on the binarization coding feature of the image to be retrieved, in the image feature information of the target object according to a first matching rule, to obtain a candidate feature set;
and the precise matching module is used for matching, based on the deep convolution feature and the local feature of the image to be retrieved, in the candidate feature set according to a second matching rule, to obtain a target image feature as the retrieval result.
16. The video image processing apparatus according to claim 15, wherein the rough matching module comprises: a feature acquisition and processing submodule and a rough matching submodule;
the feature acquisition and processing submodule is used for acquiring the deep convolution feature and the local feature of the image to be retrieved, performing binarization coding on the deep convolution feature of the image to be retrieved to obtain the binarization coding feature of the image to be retrieved, and performing binarization coding on each deep convolution feature in the image feature information of the target object to obtain the binarization coding feature corresponding to each deep convolution feature in the image feature information of the target object;
the rough matching submodule is used for matching the binarization coding feature of the image to be retrieved against the binarization coding feature corresponding to each deep convolution feature in the image feature information of the target object, determining each binarization coding feature whose matching degree with the binarization coding feature of the image to be retrieved is greater than a first preset value as a target binarization coding feature, and taking the target deep convolution features and the target local features corresponding to the target binarization coding features as the candidate feature set;
the precise matching module is specifically configured to match the deep convolution feature of the image to be retrieved with each deep convolution feature in the candidate feature set, match the local feature of the image to be retrieved with each local feature in the candidate feature set, and take each image feature whose comprehensive matching degree of the deep convolution feature and the corresponding local feature is greater than a second preset value as the retrieval result;
and the output module is specifically configured to output a target object image associated with each image feature whose comprehensive matching degree is greater than the second preset value, wherein the target object image is an image of the target object extracted in advance from the video image frame in which the target object is located.
17. The video image processing apparatus according to claim 11, wherein the video structured information acquisition module comprises: a monitoring area structure determining submodule and an image density determining submodule;
the monitoring area structure determining submodule is used for determining the structure information of a monitoring area in the video image sequence according to the position of the target object in a video image frame and the motion track of the target object, wherein the structure information of the monitoring area comprises area information of the target object appearing in the monitoring area;
The image density determining submodule is used for determining the image density of each video image frame in the video image sequence based on the structural information of the monitoring area determined by the monitoring area structure determining submodule.
18. The video image processing apparatus according to claim 17, wherein the processing module comprises: a video preprocessing module and a video condensation module;
the video preprocessing module is used for segmenting the video image sequence based on the image density and determining a video segment to be condensed from each segment;
and the video condensation module is used for performing video condensation on the video segment to be condensed, and merging the condensed video segment with the other video segments not condensed, to obtain a condensed video image sequence.
19. The video image processing apparatus according to claim 18, wherein the video preprocessing module comprises: a video segmentation submodule and a to-be-condensed video segment determining submodule;
the video segmentation submodule is used for dividing the video image sequence into a plurality of video segments by using a preset image density threshold according to the image density of each video image frame;
and the to-be-condensed video segment determining submodule is used for determining a video segment in which the image density of each video image frame is greater than the image density threshold as the video segment to be condensed.
20. The video image processing apparatus according to claim 18, wherein the video condensation module comprises: an optimal condensation strategy determining submodule and an image fusion submodule;
the optimal condensation strategy determining submodule is used for determining, through a space-time condensation model, an optimal movement strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension;
and the image fusion submodule is used for performing image fusion based on the optimal movement strategy to obtain the condensed video segment.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610765659.5A CN106354816B (en) | 2016-08-30 | 2016-08-30 | video image processing method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610765659.5A CN106354816B (en) | 2016-08-30 | 2016-08-30 | video image processing method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106354816A CN106354816A (en) | 2017-01-25 |
| CN106354816B true CN106354816B (en) | 2019-12-13 |
Family
ID=57856028
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201610765659.5A Active CN106354816B (en) | 2016-08-30 | 2016-08-30 | video image processing method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN106354816B (en) |
Families Citing this family (39)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108664844A (en) * | 2017-03-28 | 2018-10-16 | 爱唯秀股份有限公司 | Image target semantic recognition and tracking of convolutional deep neural network |
| CN107038713A (en) * | 2017-04-12 | 2017-08-11 | 南京航空航天大学 | A kind of moving target method for catching for merging optical flow method and neutral net |
| CN107346415A (en) * | 2017-06-08 | 2017-11-14 | 小草数语(北京)科技有限公司 | Method of video image processing, device and monitoring device |
| CN109215055A (en) | 2017-06-30 | 2019-01-15 | 杭州海康威视数字技术股份有限公司 | A kind of target's feature-extraction method, apparatus and application system |
| CN107506370A (en) * | 2017-07-07 | 2017-12-22 | 大圣科技股份有限公司 | Multi-medium data depth method for digging, storage medium and electronic equipment |
| CN107633480A (en) * | 2017-09-14 | 2018-01-26 | 光锐恒宇(北京)科技有限公司 | A kind of image processing method and device |
| CN107730560A (en) * | 2017-10-17 | 2018-02-23 | 张家港全智电子科技有限公司 | A kind of target trajectory extracting method based on sequence of video images |
| CN107705324A (en) * | 2017-10-20 | 2018-02-16 | 中山大学 | A kind of video object detection method based on machine learning |
| CN109803067A (en) * | 2017-11-16 | 2019-05-24 | 富士通株式会社 | Video concentration method, video enrichment facility and electronic equipment |
| CN108875517B (en) * | 2017-12-15 | 2022-07-08 | 北京旷视科技有限公司 | Video processing method, device and system and storage medium |
| CN109993032B (en) * | 2017-12-29 | 2021-09-17 | 杭州海康威视数字技术股份有限公司 | Shared bicycle target identification method and device and camera |
| CN108304808B (en) * | 2018-02-06 | 2021-08-17 | 广东顺德西安交通大学研究院 | Monitoring video object detection method based on temporal-spatial information and deep network |
| CN110659384B (en) * | 2018-06-13 | 2022-10-04 | 杭州海康威视数字技术股份有限公司 | Video structured analysis method and device |
| CN110166851B (en) | 2018-08-21 | 2022-01-04 | 腾讯科技(深圳)有限公司 | Video abstract generation method and device and storage medium |
| CN109508408B (en) * | 2018-10-25 | 2021-07-30 | 北京陌上花科技有限公司 | A frame density-based video retrieval method and computer-readable storage medium |
| CN109657546B (en) * | 2018-11-12 | 2024-08-02 | 平安科技(深圳)有限公司 | Video behavior recognition method based on neural network and terminal equipment |
| CN111435370A (en) * | 2019-01-11 | 2020-07-21 | 富士通株式会社 | Information processing apparatus, method, and machine-readable storage medium |
| CN109977816B (en) * | 2019-03-13 | 2021-05-18 | 联想(北京)有限公司 | Information processing method, device, terminal and storage medium |
| CN110008859A (en) * | 2019-03-20 | 2019-07-12 | 北京迈格威科技有限公司 | The dog of view-based access control model only recognition methods and device again |
| CN110175263A (en) * | 2019-04-08 | 2019-08-27 | 浙江大华技术股份有限公司 | A kind of method of positioning video frame, the method and terminal device for saving video |
| CN110188617A (en) * | 2019-05-05 | 2019-08-30 | 深圳供电局有限公司 | Intelligent monitoring method and system for machine room |
| CN110264496A (en) * | 2019-06-03 | 2019-09-20 | 深圳市恩钛控股有限公司 | Video structural processing system and method |
| CN110225310A (en) * | 2019-06-24 | 2019-09-10 | 浙江大华技术股份有限公司 | Computer readable storage medium, the display methods of video and device |
| CN110363171A (en) * | 2019-07-22 | 2019-10-22 | 北京百度网讯科技有限公司 | Method for training sky region prediction model and method for identifying sky region |
| CN110751065B (en) * | 2019-09-30 | 2023-04-28 | 北京旷视科技有限公司 | Training data acquisition method and device |
| CN113469200A (en) * | 2020-03-30 | 2021-10-01 | 阿里巴巴集团控股有限公司 | Data processing method and system, storage medium and computing device |
| CN111696136B (en) * | 2020-06-09 | 2023-06-16 | 电子科技大学 | Target tracking method based on coding and decoding structure |
| CN111898416A (en) * | 2020-06-17 | 2020-11-06 | 绍兴埃瓦科技有限公司 | Video stream processing method and device, computer equipment and storage medium |
| CN112036306B (en) * | 2020-08-31 | 2024-06-28 | 公安部第三研究所 | System and method for realizing target tracking based on monitoring video analysis |
| CN112422898B (en) * | 2020-10-27 | 2022-06-17 | 中电鸿信信息科技有限公司 | Video concentration method introducing deep behavior understanding |
| CN113515649B (en) * | 2020-11-19 | 2024-03-01 | 阿里巴巴集团控股有限公司 | Data structuring method, system, device, equipment and storage medium |
| CN112579811B (en) * | 2020-12-11 | 2024-06-25 | 公安部第三研究所 | Target image retrieval and identification system, method, device, processor and computer readable storage medium for video investigation |
| CN114679564B (en) * | 2020-12-24 | 2025-08-22 | 浙江宇视科技有限公司 | Video summary processing method, device, electronic device and storage medium |
| CN113365104B (en) * | 2021-06-04 | 2022-09-09 | 中国建设银行股份有限公司 | Video concentration method and device |
| CN113641852A (en) * | 2021-07-13 | 2021-11-12 | 彩虹无人机科技有限公司 | Retrieval method, electronic device and medium for photoelectric video target of unmanned aerial vehicle |
| CN113949823A (en) * | 2021-09-30 | 2022-01-18 | 广西中科曙光云计算有限公司 | Video concentration method and device |
| CN113792150B (en) * | 2021-11-15 | 2022-02-11 | 湖南科德信息咨询集团有限公司 | Man-machine cooperative intelligent demand identification method and system |
| CN114547374B (en) * | 2022-02-18 | 2025-11-21 | 武汉烽火凯卓科技有限公司 | Method and system for extracting and storing real-time video structured information |
| CN114998810B (en) * | 2022-07-11 | 2023-07-18 | 北京烽火万家科技有限公司 | A Neural Network-Based AI Video Deep Learning System |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8582807B2 (en) * | 2010-03-15 | 2013-11-12 | Nec Laboratories America, Inc. | Systems and methods for determining personal characteristics |
| CN102930061B (en) * | 2012-11-28 | 2016-01-06 | 安徽水天信息科技有限公司 | A kind of video summarization method based on moving object detection |
| CN105512684B (en) * | 2015-12-09 | 2018-08-28 | 江苏航天大为科技股份有限公司 | Logo automatic identifying method based on principal component analysis convolutional neural networks |
2016
- 2016-08-30 CN CN201610765659.5A patent/CN106354816B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN106354816A (en) | 2017-01-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106354816B (en) | video image processing method and device | |
| Elharrouss et al. | A combined multiple action recognition and summarization for surveillance video sequences | |
| Xu et al. | Segment as points for efficient online multi-object tracking and segmentation | |
| Lin et al. | Helmet use detection of tracked motorcycles using cnn-based multi-task learning | |
| Gawande et al. | Pedestrian detection and tracking in video surveillance system: issues, comprehensive review, and challenges | |
| Wang et al. | An effective method for plate number recognition | |
| US20130170696A1 (en) | Clustering-based object classification | |
| Yang et al. | Traffic-informed multi-camera sensing (TIMS) system based on vehicle re-identification | |
| Balia et al. | A deep learning solution for integrated traffic control through automatic license plate recognition | |
| CN113177518A (en) | Vehicle weight identification method recommended by weak supervision area | |
| Lin et al. | Visual-attention-based background modeling for detecting infrequently moving objects | |
| CN114549867B (en) | Gate machine fare evasion detection method, device, computer equipment and storage medium | |
| Qasim et al. | Abandoned object detection and classification using deep embedded vision | |
| Sharif et al. | Deep crowd anomaly detection: state-of-the-art, challenges, and future research directions | |
| CN119810752B (en) | Passenger flow statistics analysis system and method based on reid assistance | |
| Zhang et al. | Appearance-based loop closure detection via locality-driven accurate motion field learning | |
| Mishra et al. | A Study on Classification for Static and Moving Object in Video Surveillance System. | |
| Gu et al. | Embedded and real-time vehicle detection system for challenging on-road scenes | |
| Yang et al. | Video anomaly detection for surveillance based on effective frame area | |
| Abed et al. | Deep learning-based few-shot person re-identification from top-view rgb and depth images | |
| Krithika et al. | MAFONN-EP: a minimal angular feature oriented neural network based emotion prediction system in image processing | |
| Kumar et al. | Traffic surveillance and speed limit violation detection system | |
| Chen et al. | Object tracking over a multiple-camera network | |
| Frontoni et al. | People counting in crowded environment and re-identification | |
| Leon et al. | Car detection in sequences of images of urban environments using mixture of deformable part models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||