
CN117998039A - Video data processing method, device, equipment and storage medium

Info

Publication number: CN117998039A
Application number: CN202410013837.3A
Authority: CN (China)
Prior art keywords: target, video, frame, video stream, stream
Legal status: Pending (the status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 陈伟杰
Current assignee: Zhejiang Dahua Technology Co Ltd
Original assignee: Zhejiang Dahua Technology Co Ltd

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00: Details of television systems
    • H04N5/76: Television signal recording
    • H04N5/91: Television signal processing therefor
    • H04N5/915: Television signal processing therefor for field- or frame-skip recording or reproducing
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/10: Segmentation; Edge detection
    • G06T7/194: Segmentation; Edge detection involving foreground-background segmentation
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An embodiment of the present application provides a video data processing method, apparatus, device, and storage medium, relating to the technical field of data processing. The method includes: performing background separation processing on a video stream to be stored to obtain one or more background images corresponding to the video stream and a foreground image corresponding to each video frame in the video stream; obtaining an object feature set for each video frame according to the target objects present in each foreground image, and performing text processing on the object feature sets to obtain a text frame sequence corresponding to the video stream; and generating a storage data stream for the video stream from the one or more background images and the text frame sequence, and storing the storage data stream to a preset storage location. In this way, video stream data that would otherwise occupy a large amount of the storage device's space is converted into storage data consisting of background images and a text frame sequence, which reduces the amount of video data the device needs to store and improves the storage efficiency of video data.

Description

Video data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing, and in particular, to a method, apparatus, device, and storage medium for processing video data.
Background
At present, video monitoring schemes in fields such as public security, commercial security, traffic management, and industrial control regularly store historical video data covering a certain time period on storage devices, so that relevant personnel can later collect evidence and trace events through video playback.
However, video data occupies a large amount of storage space, while the data storage capacity of a storage device is limited. As a result, the amount of video data the storage device can continuously store is small, and the storage efficiency of video data is low.
For example, in actual scenarios such as frequent video recording or long recording durations, the amount of video data the storage device must hold grows quickly, and a storage device with limited capacity cannot retain much historical video data. Old video data is thus easily and quickly overwritten by new data, and key video records are lost, which directly affects subsequent video playback.
Therefore, how to improve the storage efficiency of video data is a problem to be solved.
Disclosure of Invention
Embodiments of the present application aim to provide a video data processing method and apparatus for improving the storage efficiency of video data.
In one aspect, an embodiment of the present application provides a video data processing method, including:
obtaining a target video stream to be stored, and performing background separation processing on the target video stream to obtain at least one target background image corresponding to the target video stream and a foreground image corresponding to each video frame in the target video stream;
obtaining, based on the foreground image corresponding to each video frame, an object feature set corresponding to each video frame, wherein the object features in each object feature set correspond one-to-one to the target objects in the corresponding foreground image;
performing text processing on each obtained object feature set to obtain a target text frame sequence of the target video stream, wherein each target text frame in the target text frame sequence corresponds one-to-one to a video frame, and each target text frame includes object description information of each target object in the corresponding video frame;
and generating a target storage data stream corresponding to the target video stream based on the target text frame sequence and the at least one target background image, and storing the target storage data stream to a preset storage location.
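For illustration only, the four steps above can be read as one storage pipeline. In the following minimal Python sketch, every function name is a hypothetical stand-in for a processing stage detailed later in this description, not an interface defined by the embodiment:

    # Sketch of the four-step storage pipeline; all helpers are hypothetical.
    def store_video_stream(target_video_stream, storage_path):
        # Step 1: background separation -> background image(s) + per-frame foregrounds
        background_images, foreground_images = separate_background(target_video_stream)

        # Step 2: one object feature set per video frame
        feature_sets = [extract_object_features(fg) for fg in foreground_images]

        # Step 3: text processing -> one text frame per video frame, in time order
        text_frames = [to_text_frame(fs) for fs in feature_sets]

        # Step 4: bundle background(s) + text frame sequence into the storage stream
        storage_stream = build_storage_stream(background_images, text_frames)
        write_to(storage_stream, storage_path)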
In one aspect, an embodiment of the present application provides a video data processing apparatus, including:
a background separation unit, configured to obtain a target video stream to be stored, perform background separation processing on the target video stream, and obtain at least one target background image corresponding to the target video stream and a foreground image corresponding to each video frame in the target video stream;
an object recognition unit, configured to obtain, based on the foreground image corresponding to each video frame, an object feature set corresponding to each video frame, where each object feature in the object feature set corresponds one-to-one to a target object in the foreground image;
a text framing unit, configured to perform text processing on each obtained object feature set to obtain a target text frame sequence of the target video stream, where each target text frame in the target text frame sequence corresponds one-to-one to a video frame, and each target text frame includes object description information of each target object in the corresponding video frame;
and a data storage unit, configured to generate a target storage data stream corresponding to the target video stream based on the target text frame sequence and the at least one target background image, and store the target storage data stream to a preset storage location.
Optionally, the object recognition unit is specifically configured to:
perform object recognition processing on the foreground image of each video frame based on a preset object recognition strategy, and determine at least one target object contained in each foreground image;
for each target object in the at least one target object, respectively perform the following operations:
for a given target object, obtain at least one attribute feature of the target object based on at least one preset object attribute dimension, where each attribute feature represents the feature information of the target object under the corresponding object attribute dimension;
perform feature fusion processing on the at least one attribute feature to obtain the object feature corresponding to the target object;
and obtain the object feature set of the video frame based on the respective object features of the target objects.
Optionally, the text framing unit is specifically configured to:
perform text processing on each object feature set respectively to obtain the object description information corresponding to each video frame;
and generate the target text frame sequence corresponding to the video stream based on the object description information of each video frame, where the target text frames in the target text frame sequence are arranged in the temporal order of their corresponding video frames.
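The embodiment leaves the concrete text format open. A minimal sketch, assuming JSON Lines as the text form, of how per-frame object feature sets could be serialized into an ordered text frame sequence:

    import json

    def build_text_frame_sequence(feature_sets):
        # Serialize per-frame object feature sets into an ordered text frame
        # sequence. The JSON structure is an assumption for illustration; the
        # embodiment only requires that each text frame describe the target
        # objects of its corresponding video frame, in video-frame order.
        text_frames = []
        for frame_index, feature_set in enumerate(feature_sets):
            text_frames.append(json.dumps({
                "frame": frame_index,    # preserves the original time order
                "objects": feature_set,  # list of per-object descriptions
            }, ensure_ascii=False))
        return text_frames

    # Example: two frames, each with one detected target object
    frames = build_text_frame_sequence([
        [{"type": "human", "subtype": "man", "position": [120, 80, 60, 160]}],
        [{"type": "motor vehicle", "subtype": "car", "color": "red"}],
    ])
    print("\n".join(frames))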
Optionally, the background separation unit is specifically configured to:
perform background separation processing on each video frame in the target video stream respectively to obtain a candidate background image corresponding to each video frame;
sequentially determine, in the video frame order of the target video stream, whether the candidate background image of each video frame exhibits a background change relative to a detection image;
and when it is determined that a target candidate background image exhibits a background change, take the target candidate background image as a target background image of the target video stream.
Optionally, the background separation unit is specifically configured to:
take the first video frame of the target video stream as the detection image, and compare the similarity between the detection image and the target candidate background image of the next video frame;
and when the similarity between the target candidate background image of the next video frame and the detection image is smaller than a preset threshold, determine that the target candidate background image of the next video frame exhibits a background change, and take that target candidate background image as the next detection image.
Optionally, the apparatus further comprises a picture playback unit for:
obtain the target storage data stream in response to a video playback instruction triggered for the target video stream;
perform object parsing processing on the target text frame sequence in the target storage data stream to obtain the object description information corresponding to each target text frame, where each piece of object description information characterizes the target objects in the corresponding target video frame;
perform object rendering processing on the corresponding target background image based on the object description information corresponding to each target video frame, to obtain a playback image corresponding to each target video frame;
obtain a playback image sequence corresponding to the target video stream based on the playback images corresponding to the target video frames;
and perform dynamic display processing on the playback image sequence based on a preset picture refreshing strategy, so as to display the video playback picture corresponding to the target video stream.
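A minimal sketch of this playback flow, assuming OpenCV for rendering and the JSON text frame layout sketched earlier; the box-plus-label rendering and the fixed frame rate stand in for the preset picture refreshing strategy, which the embodiment does not pin down:

    import json
    import cv2

    def play_back(background_image, text_frames, fps=25):
        # Parse each text frame, render its objects onto a copy of the target
        # background image, and show the playback images at a preset rate.
        delay_ms = int(1000 / fps)  # assumed refresh interval
        for text_frame in text_frames:
            playback_image = background_image.copy()
            for obj in json.loads(text_frame)["objects"]:
                x, y, w, h = obj["position"]  # assumed [x, y, width, height] layout
                cv2.rectangle(playback_image, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(playback_image, obj.get("subtype", obj["type"]),
                            (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.imshow("playback", playback_image)
            if cv2.waitKey(delay_ms) & 0xFF == ord('q'):
                break
        cv2.destroyAllWindows()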
Optionally, the device further includes a data uploading unit, configured to:
determine the target video stream in response to an event early warning instruction, based on the event occurrence time indicated by the event early warning instruction, where the video duration of the target video stream is determined from the event occurrence time and a preset video duration;
and obtain the storage location information of the target storage data stream based on the target storage data stream corresponding to the target video stream, and send the storage location information and the event occurrence time to the corresponding storage server.
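A minimal sketch of the upload step, assuming a JSON payload over HTTP; the endpoint, field names, and wire format are illustrative assumptions, since the embodiment specifies only what is sent, not how:

    import json
    import urllib.request

    def report_to_storage_server(storage_location, event_time, server_url):
        # Send the storage location info and event occurrence time to the
        # storage server, which can later fetch the storage data stream itself.
        payload = json.dumps({
            "storage_location": storage_location,  # e.g. a device-local path
            "event_occurrence_time": event_time,   # e.g. an ISO-8601 timestamp
        }).encode("utf-8")
        req = urllib.request.Request(server_url, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return resp.status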
In one aspect, an embodiment of the present application provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the video data processing method described above when executing the program.
In one aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to perform the steps of the video data processing method described above.
In one aspect, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium; the computer program comprises program instructions which, when executed by a computer device, cause the computer device to perform the steps of the video data processing method described above.
In the embodiments of the present application, background separation processing is performed on a video stream to be stored to obtain one or more background images corresponding to the video stream and a foreground image corresponding to each video frame in the video stream. An object feature set is obtained for each video frame according to the target objects present in each foreground image, and the object feature sets are subjected to text processing to obtain a text frame sequence corresponding to the video stream. A storage data stream for the video stream is then generated from the one or more background images and the text frame sequence, and stored to a preset storage location. In this way, video stream data that would otherwise occupy a large amount of the storage device's space is converted into storage data consisting of background images and a text frame sequence, which optimizes the storage structure of the video data, reduces the amount of data the device needs to store, and improves storage efficiency.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present application;
Fig. 2 is a flowchart of a video data processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a target video stream acquisition process according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the background separation processing of a video frame according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a video frame data processing procedure according to an embodiment of the present application;
Fig. 6 is a schematic diagram of object description information of a video frame according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a storage data stream according to an embodiment of the present application;
Fig. 8 is a schematic diagram of an object rendering process according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a picture superposition process according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the present application more apparent, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily provided there is no conflict. Also, while a logical order is depicted in the flowcharts, in some cases the steps shown or described may be performed in a different order than presented.
The terms "first" and "second" in the description, claims, and drawings of the present application are used to distinguish between different objects, not to describe a particular order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not listed or inherent to such process, method, article, or apparatus. The term "plurality" in the present application means at least two, for example two, three, or more, and the embodiments of the present application are not limited in this respect.
The term "and/or" in the embodiments of the present application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the technical solution of the present application, the collection, transmission, and use of data all comply with the requirements of relevant national laws and regulations. It should be understood that wherever the following detailed description refers to the collection of user-related data, and where the embodiments of the application are employed in particular products or technologies, the relevant permissions or consents need to be obtained, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
Network camera (Internet Protocol Camera, IPC): a video camera based on internet protocols, commonly used for monitoring and video transmission, whose video data can be transmitted over a network. In this scheme, a network camera may be used to capture real-time video streams of a monitored scene.
Video stream: video data consisting of a series of consecutive video frames, typically acquired in real time by a camera or similar device, containing the visual information of a particular scene.
Video frame: a single image in a video stream, representing a still picture of a particular scene at a certain instant.
Text frame: in this scheme, a text frame is the result of performing text processing on a video frame; it contains textual description information about the specific video content of that frame.
Foreground image: the main objects in an image or video, as well as other dynamic or interesting elements, typically the moving, changing, or specially attended parts. Compared with the background image, which provides a static background environment, the foreground image is usually the part that needs to be detected, analyzed, or tracked.
Background image: the image content of the background portion in video or image processing. The background image is usually relatively static compared with the foreground objects of interest, contains no salient objects or motion, and mainly provides the environmental background. In video surveillance and image analysis, the background image provides a relatively constant background that helps identify and track foreground objects.
Visual target detection algorithm (You Only Look Once, YOLO): a deep learning algorithm commonly used for object detection, capable of detecting multiple object classes simultaneously. In this scheme, the YOLO algorithm may be used to identify and tag different target objects in a video frame, such as people and vehicles.
Picture superposition: commonly used in video or image processing to superimpose the content of multiple images in one image or video by stacking different image layers together. The layers may include text, graphics, other images, or video. Picture superposition techniques are commonly used to enhance visual information, mark specific objects, regions, or events, and provide more information.
The following briefly describes the design concept of the embodiment of the present application:
At present, video monitoring schemes in fields such as public security, commercial security, traffic management, and industrial control regularly store historical video data covering a certain time period on storage devices, so that relevant personnel can later collect evidence and trace events through video playback.
For example, in a video surveillance scenario, a surveillance camera typically monitors a fixed viewing angle continuously, such as a residential perimeter or a road entrance, and the surveillance video it captures is stored by an associated storage device. Once a specific event, such as a traffic accident, occurs in the monitored scene, relevant personnel can retrieve the surveillance video stored on the storage device and play it back. The surveillance video provides corroborating evidence of the event, so that the relevant personnel can track its occurrence, analyze its cause, and take appropriate action.
However, video data occupies a large amount of the storage space of the storage device, while the data storage capacity of the storage device is limited. As a result, the amount of video data the storage device can continuously store is small, and the storage efficiency of video data is low.
For example, in a video monitoring scene, the amount of surveillance video data the storage device must hold grows quickly due to practical factors such as frequent recording and long monitoring durations, so a storage device with limited capacity cannot store much surveillance video data, and the storage efficiency of video data is low. Worse, old surveillance video is easily and quickly overwritten by new video data, and key monitoring records are lost, directly affecting subsequent video playback and evidence tracing. Therefore, how to improve the storage efficiency of video data is a problem to be solved.
In view of the above technical problems, an embodiment of the present application provides a video data processing method, which performs background separation processing on a video stream to be stored to obtain one or more background images corresponding to the video stream and a foreground image corresponding to each video frame in the video stream. An object feature set is obtained for each video frame according to the target objects present in each foreground image, and the object feature sets are subjected to text processing to obtain a text frame sequence corresponding to the video stream. A storage data stream for the video stream is then generated from the one or more background images and the text frame sequence and stored to a preset storage location. In this way, video stream data that would otherwise occupy a large amount of the storage device's space is converted into storage data consisting of background images and a text frame sequence, which optimizes the storage structure of the video data, reduces the amount of video data the device needs to store, and improves the storage efficiency of video data.
To further reduce the amount of stored video data and improve storage efficiency, the embodiments of the present application consider that the background image of a monitored scene mostly remains unchanged and does not need to be stored repeatedly. Therefore, during the background separation processing of the target video stream, background separation processing can be performed on each video frame separately to obtain a candidate background image for each video frame, and whether the candidate background image of each video frame exhibits a background change relative to a detection image is determined sequentially in the video frame order of the target video stream. When a target candidate background image is determined to exhibit a background change, it is taken as a target background image of the target video stream. In this way, only the background images of the video frames where the background actually changes are stored, repeated storage of identical background images is avoided, data redundancy is reduced, storage space and cost are saved, and the storage efficiency of video data is improved.
To further reduce the amount of stored video data, the embodiments of the present application consider that whether the changes of the objects in the foreground image trigger an event early warning condition is the focus of attention in a video monitoring scene. Therefore, object recognition processing is performed on the foreground image of each video frame through an object recognition strategy to accurately identify the possible target objects in each foreground image. The corresponding attribute features of each target object are then acquired along preset object attribute dimensions to obtain the feature information of each target object under different attribute dimensions, and feature fusion processing is performed on the attribute features of each target object to obtain the object feature corresponding to that target object and, in turn, the object feature set corresponding to the video frame. In this way, the foreground images of the video stream, which would otherwise occupy a large storage space, are converted into object feature sets containing the characteristic information of each object in the foreground images, which compresses the foreground images and the video stream, greatly reduces the amount of stored video data, and significantly lowers storage cost. Meanwhile, the multidimensional attribute features capture more comprehensive feature information about an object, and fusing them into a single comprehensive object feature further reduces feature redundancy, allows the object feature to represent the target object more accurately, and improves recognition accuracy. As a result, during subsequent object rendering and picture playback it can be accurately judged whether an object change in the foreground image triggers an event early warning condition, the false alarm rate is reduced, and intelligent analysis such as event reproduction, event tracing, and evidence collection can be better realized.
To further reduce the amount of stored video data and save storage space, the embodiments of the present application perform text processing on the object feature set of each video frame to obtain the object description information of each video frame; converting abstract object features into readable object description information makes it convenient to understand and analyze the events recorded in the video stream. A text frame sequence corresponding to the video stream is then generated from the object description information of each video frame, and the serialized text frames keep the same temporal order as the original video stream, which facilitates subsequent picture playback, event reproduction, and timeline analysis. By textualizing and serializing the object feature sets of the video frames, the original video frame data is stored in a more compact and structured text form, which reduces the storage space and cost required for the video data and improves the storage efficiency of video data.
To achieve a video playback effect from the stored storage data stream so that relevant personnel can conveniently collect evidence and trace events, the embodiments of the present application obtain, in response to a video playback instruction triggered for the target video stream, the target storage data stream kept on the storage device or storage server, and perform object parsing processing on the target text frame sequence in the target storage data stream to obtain, for each target text frame, the object description information characterizing each target object in the corresponding target video frame. Object rendering processing is performed on the corresponding target background image using the object description information to obtain the playback image corresponding to each video frame and the playback image sequence corresponding to the target video stream, so that the characteristic information of the target objects is visually presented in the playback images, making the target objects easier for relevant personnel to identify and locate and improving the display effect of video playback. The playback image sequence is then dynamically displayed according to a preset picture refreshing strategy to present the video playback picture corresponding to the original video stream, so that relevant personnel can understand the events recorded in the video stream through an equivalent video playback effect and perform event evidence collection and tracing more comprehensively and deeply.
To further improve the transmission and storage efficiency of video data, the embodiments of the present application can also, in response to an event early warning instruction, determine a target video stream of a certain video duration to be stored according to the event occurrence time indicated by the instruction and a preset video duration. The embodiments thus combine video data storage with event early warning: the occurrence time of the relevant early warning event is determined from the event early warning instruction, and only the video data within a preset duration before and after the event is stored. Storing only the video content actually needed in the monitored scene is enough to record the circumstances of the early warning event; it helps locate the event quickly, reduces the time and workload of later evidence tracing, removes the need for continuous long-term recording, lowers the probability of storing useless video data, and saves the storage resources of the storage device. Meanwhile, since the required storage capacity is greatly reduced, more economical storage devices can be used, lowering the cost of video data storage. In video data transmission scenarios such as remote monitoring and cloud storage services, only the video data within the preset duration around the early warning event needs to be transmitted, which greatly reduces the amount of transmitted data and improves transmission efficiency. Furthermore, after the target video stream is converted into the corresponding target storage data stream and stored, the storage location information and event occurrence time of the target storage data stream can be sent to the corresponding storage server, so that the storage server can fetch and keep the target storage data stream long-term through the storage location information. This reduces the storage and processing burden of the storage device and provides data backup and redundancy: when the storage device fails or data is lost, the storage server can be accessed to recover the data, improving the availability and security of the storage data stream. The storage server also centrally manages and organizes the storage data streams of the early warning events, which improves data management efficiency and facilitates subsequent event evidence collection and tracing by relevant personnel.
Having introduced the design concept of the embodiments of the present application, the application scenarios to which the technical solutions of the embodiments are applicable are briefly described below. It should be noted that these application scenarios are only used to illustrate the embodiments of the present application and are not limiting; in specific implementation, the technical solutions provided by the embodiments of the present application can be applied flexibly according to actual needs.
The solution provided by the embodiments of the present application can be applied to any business scenario involving video data storage, including but not limited to video monitoring and video playback. As shown in fig. 1, a schematic diagram of an application scenario provided in an embodiment of the present application, the scenario may include a video capturing device 101, a video processing device 102, and a network 103.
The video capturing device 101 is a capture device that can record video data and transmit the recorded video data to the video processing device, including but not limited to one or more of an IPC, a digital video recorder (Digital Video Recorder, DVR), a network video recorder (Network Video Recorder, NVR), a speed dome camera, a bullet camera, a pan-tilt camera, a dome camera, and a telephoto, mid-focus, or wide-angle camera.
The video processing device 102 is a computer device with a certain processing capability that can implement video processing and data storage functions; it may be, for example, a tablet computer (PAD), a notebook computer, a personal computer (Personal Computer, PC), or a server, and may be configured to execute any of the methods provided in the embodiments of the present application, which are not enumerated here. The video processing device 102 can obtain recorded video stream data from the video capturing device and, according to the video data processing method provided by the embodiments of the present application, convert the video stream to be stored into a storage data stream and store it in a storage space such as the memory of the video processing device.
The video processing device 102 and the video capturing device 101 may be connected through a network 103, where the network 103 may be a wired network or a wireless network. For example, the wireless network may be a mobile cellular network, such as a fourth-generation mobile communication (4G) network, a fifth-generation mobile communication (5G) network, or a New Radio (NR) network; it may also be a Wireless Fidelity (Wi-Fi) network, or other possible networks, which are not limited in the embodiments of the present application.
In a possible implementation, the video processing device 102 may be the same device as the video capturing device 101; that is, the video processing device 102 may itself have a video capturing component such as a camera, capture video data in real time through it, and, based on the video data processing method provided by the embodiments of the present application, convert the captured video stream data into a storage data stream and store it in the device memory. Further, to improve the security and availability of the storage data stream, the video processing device 102 may also be connected to a storage server: the video processing device sends the storage location information of the storage data stream and the event occurrence time to the corresponding storage server, so that the storage server fetches and keeps the storage data stream long-term through the storage location information. This reduces the storage and processing burden of the video processing device and provides data backup and redundancy, so that when the video processing device fails or loses data, the storage server can be accessed to recover the data, improving the availability and security of the storage data stream. Alternatively, the video processing device 102 may be a server connected to the video capturing device 101: the video capturing device 101 transmits video stream data to the server over the network, and the server receives the video stream data, converts the video stream into a storage data stream based on the video data processing method provided by the embodiments of the present application, and stores the data long-term.
It should be noted that the numbers of video processing devices 102 and video capturing devices 101 are not limited in practice and are not specifically limited in the embodiments of the present application; fig. 1 is only an example. Moreover, the components and structures shown in fig. 1 are only exemplary, not limiting, and other components and structures may be provided as needed in an actual scenario.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
Referring to fig. 2, fig. 2 is a flowchart of a possible video data processing method according to an embodiment of the present application, where an execution subject of the method may be the video processing apparatus shown in fig. 1, and a specific implementation flow of the method is as follows:
Step 201: obtain a target video stream to be stored.
In the embodiments of the present application, a target video stream of a certain video duration can be obtained from a video acquisition device, such as a monitoring camera or a sensor, or from other electronic equipment, where the video duration refers to the continuous recording time range covered by the obtained target video stream.
In one possible implementation, to further reduce the amount of stored video data, the embodiments of the present application do not directly store the video stream data continuously acquired from the video acquisition device. Instead, combined with an event early warning function, the target video stream to be acquired and stored is determined in response to a corresponding event early warning instruction issued when a specific change appears in the video picture or a specific condition is triggered. The event early warning instruction is a signal instruction generated, through intelligent event analysis of the video stream in a video monitoring scene, when the video stream satisfies the triggering condition of a specific early warning event. The instruction can carry information about the early warning event, such as its type and occurrence time, providing reference information for subsequently acquiring a target video stream of a certain video duration. Specifically, to further reduce the amount of stored video stream data, a preset video duration can be configured to indicate the length of the video stream obtained each time, and the video stream data within the preset video duration before and after the early warning event is acquired according to the event occurrence time indicated by the event early warning instruction; that is, the video duration of the target video stream is determined from the event occurrence time and the preset video duration. For example, in video surveillance applications, the video duration may refer to the period from the moment video data starts to be recorded or acquired to the moment recording or acquisition stops, typically covering video data before and after the event for event playback and analysis. In this way, the video stream does not need to be acquired and stored continuously for a long time, the probability of storing useless video data is reduced, and the storage resources of the video processing device are saved. Meanwhile, since the required storage capacity and computing power are greatly reduced, more economical video processing devices can be used, lowering the cost of video data storage.
Specifically, referring to fig. 3, a schematic diagram of the target video stream acquisition process provided by an embodiment of the present application, take as an example a video duration of X seconds for each acquired video stream, where the target video stream may be a real-time video stream from an IPC. In an actual video monitoring scene, front-end IPCs can be deployed in different monitored areas to capture video pictures from different viewing angles in real time, and the IPCs can be connected to the video processing device of the embodiments of the present application through a network. When the video processing device receives an event early warning instruction, it determines the corresponding front-end IPC according to the event occurrence time indicated by the instruction and acquires the video stream from X/2 seconds before the event occurrence time to X/2 seconds after it, that is, the X seconds of video stream data around the event occurrence time, and then executes the subsequent video data processing flow on this video stream to realize functions such as compressed storage and video playback of the video data.
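A minimal sketch of this acquisition window computation; the even X/2 split follows the Fig. 3 example above, and the 30-second default is an illustrative assumption:

    from datetime import datetime, timedelta

    def acquisition_window(event_time: datetime, preset_duration_s: float = 30.0):
        # Recording window of X seconds (X = preset_duration_s), split evenly
        # around the event occurrence time.
        half = timedelta(seconds=preset_duration_s / 2)
        return event_time - half, event_time + half

    start, end = acquisition_window(datetime(2024, 1, 5, 14, 30, 0))
    print(start, "->", end)  # 14:29:45 -> 14:30:15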
Specifically, the early warning events may include, but are not limited to, various event types such as tripwire intrusion, vehicle overspeed, moving object detection, and fire alarm. The real-time video stream data can be intelligently analyzed, and when any of these early warning events is detected, the event occurrence time is recorded and a corresponding event early warning instruction is generated. For example, tripwire intrusion detection is commonly used in security monitoring scenarios to protect a specific area from unauthorized access or intrusion. Cameras or sensors are typically deployed around fences, borders, or forbidden areas to monitor whether a person or object crosses the boundary line of the area, and an event early warning instruction for tripwire intrusion is triggered when a crossing of the preset boundary line is detected. Vehicle overspeed events typically arise in traffic monitoring, where the speed of vehicles on a road is monitored through the real-time video stream; when the speed of a vehicle is detected to exceed the set speed limit, a vehicle overspeed event early warning instruction is triggered. Moving object detection is commonly used in monitoring and security applications to watch for the movement of unauthorized persons or vehicles in a monitored area, so that measures can be taken in time to effectively prevent theft, vandalism, and other criminal activities; when an unauthorized moving object is detected in the real-time video stream, a moving object detection event early warning instruction is triggered. Fire alarms are commonly used for fire monitoring in buildings, factories, warehouses, and similar scenes: once fire or smoke is detected in the video stream, a fire alarm event early warning instruction is generated and an emergency alarm is issued to notify the relevant departments and personnel to take emergency measures.
Step 202: perform background separation processing on the target video stream to obtain at least one target background image corresponding to the target video stream and the foreground image corresponding to each video frame.
In the embodiments of the present application, once the target video stream is obtained, the foreground and background of each video frame in the stream can be separated to obtain the foreground image of every video frame contained in the target video stream and the one or more background images corresponding to the target video stream, for use in the subsequent video data processing flow.
Specifically, refer to fig. 4, a schematic diagram of the background separation processing of a video frame provided by an embodiment of the present application. In the illustrated video surveillance scenario, the background image of a video frame typically contains static background objects such as the monitored road surface, square, road gate, and trees, whereas the foreground image is made up of moving objects such as pedestrians and vehicles, or other dynamic or interesting elements, and is usually the part that needs to be detected, analyzed, or tracked, in contrast to the background image that provides a static backdrop. The background image and foreground image can therefore be extracted from each video frame by foreground-background separation algorithms such as background subtraction, the optical flow method, the differential image method, and deep learning methods. Background subtraction identifies the foreground image by comparing the difference between each video frame and a background model, using common algorithms such as the frame difference method and the Gaussian mixture model; the preset background model may be a single-frame image or a dynamically updated model. The optical flow method uses pixel displacement information between adjacent video frames to analyze the dynamic foreground image, helping detect and track object motion. The differential image method detects changed foreground objects by computing differential images of adjacent video frames.
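As one concrete instance of the background-subtraction family above, the following Python sketch uses OpenCV's Gaussian-mixture subtractor; the parameter values are illustrative assumptions, not values fixed by the embodiment:

    import cv2

    def separate_background(video_path):
        # Foreground/background separation with a Gaussian-mixture background
        # subtractor; returns per-frame foreground masks and the learned
        # background image.
        subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                        varThreshold=16,
                                                        detectShadows=True)
        capture = cv2.VideoCapture(video_path)
        foreground_masks = []
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            foreground_masks.append(subtractor.apply(frame))  # 0 = background
        capture.release()
        return foreground_masks, subtractor.getBackgroundImage()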
In one possible implementation, it is considered that in a video monitoring scenario the background image of each video frame in the acquired video stream is usually a scene such as the monitored road surface, square, or road gate; that is, the background image of each video frame is usually composed of static background objects and rarely changes frequently. Thus, referring to fig. 5, a schematic diagram of the data processing procedure for each video frame provided by an embodiment of the present application: to further reduce the amount of stored video data and improve storage efficiency, when performing background separation processing on each video frame of the target video stream to obtain the candidate background image of each video frame, the embodiments of the present application further determine sequentially, in the video frame order of the target video stream, whether the candidate background image of each video frame exhibits a background change relative to a detection image. When candidate background images with background changes are found, the one or more changed candidate background images are taken as the target background images of the target video stream. In this way, only the background images of the video frames where the background actually changes are stored, repeated storage of identical background images is avoided, data redundancy is reduced, the storage space and cost of the video processing device are saved, and the storage efficiency of video data is improved.
Specifically, a frame difference method may be used to detect background changes in each background image: for example, a subtraction operation is performed between the candidate background image of a video frame and the detection image to obtain a differential image representing the difference between the two. A gray-level difference is computed for each pixel of the differential image (or a difference between the color channels of each pixel, or a pixel difference on the grayscale image) and compared with a preset threshold. If the difference exceeds the preset threshold, the background image is considered to exhibit a background change relative to the detection image; it is stored as a target background image of the target video stream, and the detected target background images are numbered in the order of the video frames in the video stream.
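A minimal sketch of this frame-difference check, assuming OpenCV and BGR input images; both threshold values are illustrative assumptions rather than values fixed by the embodiment:

    import cv2
    import numpy as np

    def background_changed(candidate_bg, detection_img,
                           pixel_threshold=30, changed_ratio=0.01):
        # Subtract the candidate background image from the detection image,
        # threshold the per-pixel gray-level difference, and flag a background
        # change when enough pixels differ.
        a = cv2.cvtColor(candidate_bg, cv2.COLOR_BGR2GRAY)
        b = cv2.cvtColor(detection_img, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(a, b)  # differential image
        changed = np.count_nonzero(diff > pixel_threshold)
        return changed / diff.size > changed_ratio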
In one possible implementation, the first frame of the target video stream may be used as the detection image; since the first frame generally represents the static background, the similarity comparison results are relatively stable and fine background changes are more easily detected, improving the accuracy and sensitivity of background change detection. The similarity between the detection image and the target candidate background image of the next video frame is compared, and when that similarity is smaller than a preset threshold, the target candidate background image of the next video frame is determined to exhibit a background change and is taken as the next detection image. In this way, the detection image can be updated dynamically according to actual background changes, improving the accuracy and efficiency of background change detection.
Specifically, the similarity comparison may be measured using image difference metrics such as the mean squared error (Mean Squared Error, MSE), the structural similarity index (Structural Similarity Index, SSIM), or other difference measures.
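The following sketch walks the candidate background images in frame order and updates the detection image as described; MSE is used as the difference measure (SSIM would slot in the same way), and the threshold value is an illustrative assumption. Note that with MSE a larger value means lower similarity, so "similarity below the preset threshold" corresponds to the MSE exceeding a difference threshold:

    import numpy as np

    def mse(a, b):
        # Mean squared error between equally sized images (lower = more similar).
        return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

    def select_target_backgrounds(candidate_backgrounds, mse_threshold=500.0):
        # First-frame background serves as the initial detection image; a
        # candidate is stored as a target background (and becomes the new
        # detection image) whenever it differs enough from the current one.
        detection = candidate_backgrounds[0]
        targets = [detection]  # keeping the first background is an assumption
        for candidate in candidate_backgrounds[1:]:
            if mse(candidate, detection) > mse_threshold:  # similarity too low
                targets.append(candidate)
                detection = candidate  # dynamically updated detection image
        return targets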
Step 203: obtain the object feature set corresponding to each video frame based on the foreground image of each video frame.
In the embodiments of the present application, it is considered that, in an actual video monitoring scene, whether the behavior of the target objects in the foreground image of each video frame triggers the event early warning condition is the focus of attention. Therefore, to further reduce the amount of stored video data, after the foreground image of each video frame is obtained, the corresponding object features are extracted for the target objects in each foreground image, so that, without affecting video monitoring and playback, the foreground images that would otherwise occupy a large storage space are converted into object feature sets containing the characteristic information of each target object. This compresses the foreground images and the video stream, greatly reduces the amount of stored video data, and significantly lowers storage cost.
In one possible implementation, object recognition processing can be performed on the foreground image of each video frame according to a preset object recognition strategy, to accurately identify target objects such as pedestrians and vehicles in each foreground image. For each target object in each foreground image, the corresponding attribute features are acquired along at least one preset object attribute dimension to obtain the feature information of the target object under different attribute dimensions, yielding more comprehensive feature information about the object. Feature fusion processing is then performed on the multidimensional attribute features of each target object to obtain a comprehensive object feature for that object, and the object features of all target objects in the video frame are combined into the object feature set corresponding to the frame. This further reduces feature redundancy, lets the object feature set represent the characteristic information of each target object in the foreground image more accurately, and improves the recognition accuracy of the target objects.
Specifically, the object recognition strategies may include, but are not limited to, deep learning methods such as YOLO, region-based convolutional neural networks (R-CNN, Fast R-CNN), and the Single Shot MultiBox Detector (SSD), as well as conventional object recognition methods such as the Histogram of Oriented Gradients (HOG) and the Haar cascade classifier, and any other method that can perform object detection on the foreground image. The specific choice depends on factors such as the actual application scenario, performance requirements, and resource constraints, and the embodiments of the present application are not specifically limited. Taking the YOLO algorithm as an example, the foreground image can be input into a YOLO model to obtain the bounding box position and label of each target object output by the model. The YOLO algorithm can accurately identify target objects such as persons and vehicles in the foreground image and distinguish them from other noise. After the target objects in the foreground image are determined, the object features of each target object in preset attribute dimensions such as type, subtype, position, color, size, and shape can be extracted, corresponding feature labels assigned, and the object features associated with the target objects so as to track the movement and behavior of different target objects, which helps accurately identify their motion paths and interactions, such as tracking vehicles in traffic monitoring or pedestrians in safety monitoring.
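A minimal detection sketch, assuming the third-party ultralytics package and a pretrained YOLOv8 checkpoint (both assumptions; the embodiment does not prescribe a specific YOLO implementation):

    from ultralytics import YOLO  # assumed third-party package

    model = YOLO("yolov8n.pt")    # small pretrained detector, for illustration

    def detect_objects(foreground_image):
        # Run YOLO on a foreground image and collect, per target object, the
        # bounding box, class label, and confidence that later attribute
        # extraction builds on.
        results = model(foreground_image)[0]
        detections = []
        for box in results.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()        # bounding-box corners
            detections.append({
                "label": results.names[int(box.cls[0])], # e.g. "person", "car"
                "confidence": float(box.conf[0]),
                "position": [x1, y1, x2 - x1, y2 - y1],  # x, y, width, height
            })
        return detections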
In a possible implementation manner, the object attribute dimensions in the embodiment of the present application may be multiple attribute dimensions such as the type, subtype, position, color, size and shape of the target object, so that the feature information of the target object is fully represented from multiple aspects and different target objects can be accurately distinguished (a minimal data-structure sketch follows the list below). Wherein,
(1) Category (type) attribute: may represent the general class of the target object. For example, in target object recognition for traffic safety, where the target objects involved are mainly pedestrians and vehicles on roads, the types of target objects may include "human", "motor vehicle" and "non-motor vehicle".
(2) Subtype (subtype) attribute: may represent different types of target objects under the same category, and is used to more finely divide target objects having the same category attribute. For example, for the category "human", the subtypes may include "man" and "woman"; for the category "motor vehicle", the subtypes may include "truck", "passenger car", sport utility vehicle (Sport Utility Vehicle, SUV) and "car"; for the category "non-motor vehicle", the subtypes may include "bicycle", "tricycle", "battery car", etc.
(3) Position (position) attribute: represents the specific position information of the target object in the foreground image, and can be used for tracking and positioning the target object, calculating the distance between target objects, and the like. Coordinates or regions may generally be used to represent the position attribute of the target object, such as the center point coordinates of the target object, bounding box coordinates, or polygon boundary information.
(4) Color (color) attribute: may represent the apparent color of the target object, particularly in multi-target scenes, to aid in identifying and describing the appearance features of different target objects. For example, the color attribute may be used to describe the color of a vehicle, such as red, blue or white; for a pedestrian, the color attribute may include the color of the pedestrian's clothing, such as black clothes and blue trousers.
(5) Size (size) attribute: may represent the size and scale of the target object in scenarios where it is necessary to determine whether the target object meets certain criteria. For example, for a vehicle, the size attribute may represent the length, width and height of the vehicle; for a pedestrian, the size attribute may represent the height and volume of the pedestrian.
(6) Shape (shape) attribute: may be used to describe the appearance characteristics of the target object and distinguish the shapes of different target objects, improving the accuracy of target object identification and classification. For example, the shape of a vehicle may be a rectangle or an ellipse, while pedestrians may be more diverse in shape: a "humanoid" may be used to describe the general outline of a person, or more detailed parts such as "head", "torso", "arms" and "legs" may be used. In an object recognition scene, in order to simplify image processing and reduce computational complexity, the shape attribute of a pedestrian may be directly described by a square, where the square refers to a bounding box surrounding the pedestrian object, so as to efficiently recognize the object.
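Before feature fusion, the per-dimension attributes of a single target object can be organized in a simple record. The following sketch shows one possible layout; the class and field names are illustrative assumptions that mirror the six dimensions listed above.

```python
# A minimal sketch of one way to organize the attribute features of a single
# target object before feature fusion; names are illustrative assumptions.
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ObjectFeature:
    object_id: str        # unique identifier of the target object
    category: str         # (1) e.g. "human", "motor vehicle", "non-motor vehicle"
    subtype: str          # (2) e.g. "man", "truck", "bicycle"
    position: List[int]   # (3) e.g. center-point or bounding-box coordinates
    color: str            # (4) apparent color, e.g. "white"
    size: List[int]       # (5) e.g. width and height in pixels
    shape: str            # (6) simplified outline, e.g. "rectangle"

# Example: the fused object feature of a white truck.
feature_b = ObjectFeature("B", "motor vehicle", "truck",
                          [780, 253], "white", [80, 40], "rectangle")
print(asdict(feature_b))
```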
Step 204: and carrying out text processing on each obtained object feature set to obtain a target text frame sequence of the target video stream.
In the embodiment of the application, the obtained object feature sets of each video frame are textualized to generate the target text frame sequence corresponding to the target video stream. In this way, through textualization and serialization, the foreground images which originally occupy a larger storage space are converted into a more structured text frame form containing the key object information for storage, further reducing the storage space and storage cost required by the video data and improving the storage efficiency of the video data.
In one possible implementation manner, the object feature sets of each video frame may be respectively subjected to textualization processing to generate the object description information corresponding to each video frame. In this way, the abstract feature information is converted into an intuitive text form that expresses the feature information of the target objects. A target text frame sequence corresponding to the video stream is then generated according to the obtained object description information of each video frame and the time sequence of the corresponding video frames. Because the serialized target text frames follow the same temporal order as the video frames of the original video stream, the text frame sequence can be used for picture playback, realizing functions such as event reproduction and time analysis.
Specifically, the obtained object feature set corresponding to the foreground image of each video frame can be converted into corresponding fields and parameter values according to the attribute information of each object feature contained in the set, and the object description information corresponding to each video frame is generated by describing the object feature set in text form; the object description information of each video frame is then serialized, according to the time sequence of the video frames in the video stream, into the text frame sequence corresponding to the video stream. For example, the object feature set may be represented in a structured form such as JavaScript Object Notation (JSON), in which the object feature set of each video frame may include the frame number of that video frame and each identified target object. Each target object has a unique identifier (object_id) and attribute information such as category, subtype, color, shape and position coordinates, so that the structured data is more easily converted into text form and serialized into a sequence of text frames.
Specifically, fig. 6 is a schematic diagram of object description information of a video frame according to an embodiment of the present application, where the foreground image includes 3 target objects A, B and C, so that an object feature set formed by the object features a, b and c of the 3 target objects is obtained. The object feature a of the target object A is obtained by feature fusion of five attribute features: type "human", subtype "man", color "black", shape "square" and position coordinates [240,235]; the object feature b of the target object B is composed of the type "motor vehicle", subtype "truck", color "white", shape "rectangle" and position coordinates [780,253]; the object feature c of the target object C is composed of the category "non-motor vehicle", subtype "bicycle", color "black", shape "rectangle" and position coordinates [890,353].
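The following sketch shows one way the fig. 6 feature set could be textualized, serializing the object feature set of a single video frame into a JSON text frame; the exact schema (field names, frame numbering) is an illustrative assumption rather than a format fixed by the application.

```python
# A minimal sketch of the textualization step: the fig. 6 object feature set
# of one video frame is serialized into a JSON text frame. The schema is an
# illustrative assumption.
import json

text_frame = {
    "frame_number": 1,
    "objects": [
        {"object_id": "A", "category": "human", "subtype": "man",
         "color": "black", "shape": "square", "position": [240, 235]},
        {"object_id": "B", "category": "motor vehicle", "subtype": "truck",
         "color": "white", "shape": "rectangle", "position": [780, 253]},
        {"object_id": "C", "category": "non-motor vehicle", "subtype": "bicycle",
         "color": "black", "shape": "rectangle", "position": [890, 353]},
    ],
}

# One serialized text frame; repeating this per video frame, in frame order,
# yields the target text frame sequence.
serialized = json.dumps(text_frame, ensure_ascii=False)
```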
Step 205: and generating a target storage data stream corresponding to the target video stream based on the target text frame sequence and at least one target background image, and storing the target storage data stream in a preset storage position.
In the embodiment of the present application, referring to fig. 7, which is a schematic diagram of a stored data stream provided by the embodiment of the present application, the target video stream is converted into a corresponding target text frame sequence and one or more target background images (background image 1 and background image 2 in the figure), that is, the target storage data stream, so that video stream data that originally occupies a larger storage space is converted into a storage data stream composed of background images and a text frame sequence. Compared with video data, image and text data occupy much less storage space, so the storage structure of the video data is optimized, the amount of data that the video processing device needs to store is reduced, and the storage efficiency of the video data is improved.
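One possible in-memory layout of such a storage data stream is sketched below; the file names and field names are illustrative assumptions.

```python
# A minimal sketch of one possible layout of the target storage data stream:
# the numbered target background images plus the target text frame sequence.
storage_data_stream = {
    "background_images": ["bg_001.jpg", "bg_002.jpg"],  # background image 1 and 2
    "text_frame_sequence": [
        {"frame_number": 1, "objects": []},  # text frames in video-frame order
        {"frame_number": 2, "objects": []},
    ],
}
```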
In one possible implementation, the target storage data stream may be directly stored in the memory of the video processing device or another terminal device with a local storage function, so as to implement local storage of the target storage data stream. A storage server can also acquire and store the target storage data stream for the long term through the storage position information, thereby reducing the storage and processing burden of the video processing device and realizing data backup and redundancy. When local storage devices such as the video processing device fail or data is lost, the data can be recovered by accessing the storage server, improving the availability and safety of the stored data stream. Moreover, the storage data streams of the early warning events are centrally managed and organized by the storage server, which improves the efficiency of data management and facilitates subsequent event evidence collection and tracing by related personnel.
Specifically, taking the video processing device as an example: upon receiving an event early warning instruction, according to the event occurrence time 12:00 indicated by the instruction, the video stream to be stored covering X seconds before and after the event occurrence time 12:00 is obtained, and the corresponding target storage data stream is generated. The video processing device numbers each target background image in the target storage data stream, and stores each target background image and the target text frame sequence in a local memory to complete local persistent storage of the data. According to the local storage position information of the target storage data stream, the video processing device reports corresponding event early warning information to the storage server, where the event early warning information carries the occurrence time of the event, the uniform resource locator (Uniform Resource Locator, URL) corresponding to each numbered target background image, and the URL corresponding to the text frame sequence, so that the storage server downloads the corresponding target storage data stream according to the URL addresses. A sketch of such a report is given below.
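A minimal sketch of the event early warning information follows; the host name, paths and field names are illustrative assumptions, not values defined by the application.

```python
# A minimal sketch of the event early warning report sent to the storage
# server; URLs and field names are illustrative assumptions.
event_report = {
    "event_time": "12:00:00",
    "background_images": [
        {"index": 1, "url": "http://device.local/storage/bg_001.jpg"},
        {"index": 2, "url": "http://device.local/storage/bg_002.jpg"},
    ],
    "text_frames_url": "http://device.local/storage/text_frames.json",
}
# The storage server can then download each resource via its URL and keep the
# storage data stream for long-term archiving.
```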
In one possible implementation manner, after performing the above data compression and conversion procedure on the target video stream and storing the obtained storage data stream, the embodiment of the present application may further, in response to a video playback instruction triggered for the target video stream, acquire the stored target storage data stream, and perform object parsing processing on the target text frame sequence to obtain the object description information representing each target object in each video frame. Object rendering processing is then performed on the target background image corresponding to each video frame according to the object description information, generating a playback image corresponding to each target video frame and thereby a playback image sequence corresponding to the video stream. In this way, the target objects are presented visually in the playback images, so that related personnel can more easily identify and locate them. Because only the re-rendered target objects are displayed in the playback image instead of the complete foreground image, redundant information is reduced, information overload of the viewer is relieved, and related personnel can concentrate on key details, so that event evidence collection and tracing are realized more comprehensively and deeply.
Specifically, object parsing processing can be performed on each text frame in the target text frame sequence by using parsing methods such as semantic analysis, keyword extraction and machine learning, alone or in combination, to obtain the object description information of each target object contained in each text frame. For example, text description information about the target objects in each text frame can be extracted using semantic analysis methods such as word segmentation, entity recognition and syntax analysis. Alternatively, preset keywords and phrases about the target objects can be extracted from each text frame by keyword extraction to obtain the attribute information of the target objects. A model can also be trained with deep learning techniques such as convolutional neural networks (Convolutional Neural Network, CNN) and recurrent neural networks (Recurrent Neural Network, RNN) to automatically parse the text frames into object description information.
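For structured JSON text frames of the kind sketched earlier, object parsing can be very simple; the following sketch assumes that schema.

```python
# A minimal sketch of object parsing for structured (JSON) text frames,
# recovering per-object description information for playback.
import json

def parse_text_frame(raw_text: str):
    """Return the frame number and the list of object descriptions."""
    frame = json.loads(raw_text)
    return frame["frame_number"], frame["objects"]

raw = ('{"frame_number": 1, "objects": [{"object_id": "B", '
       '"category": "motor vehicle", "subtype": "truck", "color": "white", '
       '"shape": "rectangle", "position": [780, 253]}]}')
frame_no, objects = parse_text_frame(raw)
for obj in objects:
    print(frame_no, obj["object_id"], obj["category"], obj["position"])
```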
Specifically, referring to fig. 8, which is a schematic diagram of object rendering processing according to an embodiment of the present application, when performing object rendering processing on the corresponding background image according to the object description information extracted for the foreground image, the corresponding target object may be drawn through attribute parameters such as the type, subtype, color and size of the target object in the object description information. In order to improve the visibility of the target objects, target objects such as pedestrians and vehicles may be drawn as marker images with simple outlines. For example, pedestrians are drawn as stick figures in the form of simple lines, and cars, buses and trucks are drawn as simple trapezoidal, square and polygonal tiles.
Specifically, referring to fig. 9, which is a schematic diagram of picture stacking processing provided by an embodiment of the present application, a picture stacking technique may be used in combination with the position parameter in the object description information, so that the simple image of the target object, newly drawn according to the object description information of the foreground image of the video frame, is embedded at the target position indicated by the position parameter in the background image of the video frame. This implements the object rendering processing of the target object and yields the playback image corresponding to the video frame, as shown in fig. 8.
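The following sketch combines object rendering and picture stacking with OpenCV: marker images with simple outlines are drawn onto a copy of the background image at the positions given in the object descriptions. The marker geometry, sizes and BGR color table are illustrative assumptions.

```python
# A minimal sketch of object rendering and picture stacking with OpenCV;
# `background` is assumed to be a BGR numpy image, `objects` a parsed list
# of object descriptions. Marker shapes and sizes are assumptions.
import cv2

COLOR_BGR = {"black": (0, 0, 0), "white": (255, 255, 255),
             "red": (0, 0, 255), "blue": (255, 0, 0)}

def render_playback_image(background, objects):
    playback = background.copy()
    for obj in objects:
        x, y = obj["position"]
        color = COLOR_BGR.get(obj["color"], (0, 255, 0))
        if obj["category"] == "human":
            # stick-figure stand-in: a head circle plus a body line
            cv2.circle(playback, (x, y - 20), 8, color, 2)
            cv2.line(playback, (x, y - 12), (x, y + 25), color, 2)
        else:
            # vehicles and other objects as simple rectangles around the position
            cv2.rectangle(playback, (x - 40, y - 20), (x + 40, y + 20), color, 2)
    return playback
```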
In one possible implementation manner, after the playback image sequence corresponding to the video stream is obtained, the playback image sequence may be dynamically displayed by a preset picture refreshing policy, so as to display a video playback picture of the original video stream. Because the video playback picture is generated by continuously refreshing the background picture and the foreground text frame sequence, video playback can be performed at a custom play speed or picture interval, improving the query efficiency for key events and behaviors in the video stream.
Specifically, the user can control the dynamic display speed of the playback image sequence and the forward or backward movement of the images through interactive user interface elements such as a time axis and play buttons, and can pause, fast forward, rewind, click a specific area to view detailed information, and zoom and pan the playback images. By adjusting the number of playback images displayed per second, the display frequency of the playback images can be increased or decreased: a higher frame rate produces a smooth playback effect, while a lower frame rate can be used for slow-motion playback. Moreover, in order to reduce visual jumps and picture discontinuities, the transitions between playback images can be smoothed by interpolation techniques that insert intermediate frames between two consecutive playback images. These technical means can be combined according to different practical application scenarios and user requirements, so as to present an attractive and interactive playback picture, improve the user experience, and enable viewers to understand and explore the picture content more easily.
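A minimal sketch of such dynamic display follows: the playback image sequence is shown at an adjustable frame rate, and a blended intermediate frame is inserted between consecutive images as a simple interpolation. The window name, default frame rate and blending weights are illustrative assumptions.

```python
# A minimal sketch of frame-rate-controlled playback with simple interpolation.
import cv2

def play_sequence(playback_images, fps: float = 10.0, interpolate: bool = True):
    delay_ms = max(1, int(1000 / fps))
    for prev, curr in zip(playback_images, playback_images[1:]):
        cv2.imshow("playback", prev)
        cv2.waitKey(delay_ms)
        if interpolate:
            # linear blend of two consecutive images as the intermediate frame
            mid = cv2.addWeighted(prev, 0.5, curr, 0.5, 0)
            cv2.imshow("playback", mid)
            cv2.waitKey(max(1, delay_ms // 2))
    if playback_images:
        cv2.imshow("playback", playback_images[-1])
        cv2.waitKey(0)
    cv2.destroyAllWindows()
```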
Referring to fig. 10, based on the same inventive concept, an embodiment of the present application further provides a video data processing apparatus 100, including:
The background separation unit 1001 is configured to obtain a target video stream to be stored, and perform background separation processing on the target video stream to obtain at least one target background image corresponding to the target video stream, and foreground images corresponding to respective video frames in the target video stream;
The object recognition unit 1002 is configured to obtain, based on the foreground images corresponding to the video frames, object feature sets corresponding to the video frames, where each object feature in the object feature sets corresponds to each target object in the foreground images one by one;
A text framing unit 1003, configured to perform a text processing on each obtained object feature set, to obtain a target text frame sequence of the target video stream; each target text frame in the target text frame sequence corresponds to each video frame one by one, and each target text frame comprises object description information of each target object in the corresponding video frame;
The data storage unit 1004 is configured to generate a target storage data stream corresponding to the target video stream based on the target text frame sequence and at least one target background image, and store the target storage data stream to a preset storage location.
Optionally, the object identifying unit 1002 is specifically configured to:
performing object recognition processing on foreground images of each video frame based on a preset object recognition strategy, and determining at least one target object contained in each foreground image;
For each target object in the at least one target object, the following operations are respectively executed:
aiming at a target object, at least one attribute feature of the target object is obtained based on at least one preset object attribute dimension; each attribute characteristic represents characteristic information of the target object under the attribute dimension of the corresponding object;
Performing feature fusion processing on at least one attribute feature to obtain an object feature corresponding to the target object;
And obtaining an object feature set of the video frame based on the respective object features of each target object.
Optionally, the text framing unit 1003 is specifically configured to:
Respectively carrying out text processing on each object feature set to obtain object description information corresponding to each video frame;
and generating a target text frame sequence corresponding to the video stream based on the object description information of each video frame, wherein each target text frame in the target text frame sequence is arranged according to the time sequence of the corresponding video frame.
Optionally, the background separation unit 1001 is specifically configured to:
respectively carrying out background separation processing on each video frame in the target video stream to obtain candidate background images corresponding to each video frame;
sequentially judging whether the candidate background images of each video frame have background change relative to the detection image according to the video frame sequence of the target video stream;
And when the background change of the target candidate background image is determined, taking the target candidate background image as the target background image of the target video stream.
Optionally, the background separation unit 1001 is specifically configured to:
taking the first video frame of the target video stream as the detection image, and comparing the similarity between the detection image and the target candidate background image of the next video frame;
when the similarity between the target candidate background image of the next video frame and the detection image is smaller than a preset threshold value, determining that the target candidate background image of the next video frame has a background change, and taking that target candidate background image as the next detection image, as sketched below.
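A minimal sketch of this similarity check follows; grayscale-histogram correlation and the threshold value are illustrative assumptions, and any image-similarity metric could be substituted.

```python
# A minimal sketch of the background-change check: a candidate background
# image is compared against the current detection image, and a change is
# declared when similarity drops below a preset threshold.
import cv2

def background_changed(detection_image, candidate, threshold: float = 0.9) -> bool:
    g1 = cv2.cvtColor(detection_image, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(candidate, cv2.COLOR_BGR2GRAY)
    h1 = cv2.calcHist([g1], [0], None, [256], [0, 256])
    h2 = cv2.calcHist([g2], [0], None, [256], [0, 256])
    similarity = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
    return similarity < threshold  # below threshold: background has changed
```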
Optionally, the apparatus further comprises a picture playback unit 1005 for:
Responding to a video playback instruction triggered for the target video stream, and acquiring the target storage data stream;
Performing object analysis processing on a target text frame sequence in a target storage data stream to obtain object description information corresponding to each target text frame; each object description information characterizes each target object in the corresponding target video frame;
Performing object rendering processing on the corresponding target background images based on the object description information corresponding to each target video frame to obtain playback images corresponding to each target video frame;
Based on the playback images corresponding to the target video frames, obtaining a playback image sequence corresponding to the target video stream;
and carrying out dynamic display processing on the playback image sequence based on a preset picture refreshing strategy so as to show a video playback picture corresponding to the target video stream.
Optionally, the apparatus further includes a data uploading unit 1006 configured to:
Responding to the event early warning instruction, and determining a target video stream based on event occurrence time indicated by the event early warning instruction; the video duration of the target video stream is determined based on the event occurrence time and a preset video duration;
And obtaining storage position information of the target storage data stream based on the target storage data stream corresponding to the target video stream, and sending the storage position information and the event occurrence time to a corresponding storage server.
For convenience of description, the above parts are divided into unit modules (or modules) by function and described separately. Of course, when implementing the present application, the functions of each unit (or module) may be implemented in one or more pieces of software or hardware. The apparatus may be used to perform the methods shown in the embodiments of the present application; therefore, for the functions that can be implemented by each functional module of the apparatus, reference may be made to the description of the foregoing embodiments, which is not repeated.
Referring to fig. 11, the embodiment of the application further provides a computer device based on the same technical concept. In one embodiment, the computer device may include a memory 1101, a communication module 1103, and one or more processors 1102 as shown.
Memory 1101 for storing computer programs executed by processor 1102. The memory 1101 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system; the storage data area may store various sets of operation instructions, etc.
The memory 1101 may be a volatile memory, such as a random-access memory (RAM); the memory 1101 may also be a non-volatile memory, such as a read-only memory, a flash memory, a hard disk drive (Hard Disk Drive, HDD) or a solid-state drive (Solid-State Drive, SSD); or the memory 1101 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1101 may also be a combination of the above memories.
The processor 1102 may include one or more central processing units (central processing unit, CPUs) or digital processing units, or the like. A processor 1102 for implementing the video data processing method described above when invoking a computer program stored in the memory 1101.
The communication module 1103 is configured to communicate with a video capture device, a storage server, or other network device.
The specific connection medium between the memory 1101, the communication module 1103 and the processor 1102 is not limited in the embodiment of the present application. In fig. 11, the memory 1101 and the processor 1102 are connected by a bus 1104, which is depicted by a bold line; the connection manner between other components is merely illustrative and not limiting. The bus 1104 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one bold line is depicted in fig. 11, but this does not mean that there is only one bus or one type of bus.
The memory 1101 stores a computer storage medium in which computer-executable instructions are stored, and the computer-executable instructions are used to implement the video data processing method of the embodiment of the present application. The processor 1102 is configured to perform the video data processing method of each of the above embodiments.
Based on the same inventive concept, the embodiments of the present application also provide a storage medium having stored thereon a computer program which, when executed on a computer, causes a computer processor to perform the steps in the video data processing method according to the various embodiments of the present application described above in the present specification.
In some possible implementations, aspects of the video data processing method provided by the present application may also be implemented in the form of a program product, which includes program code; when the program product is run on a computer device, the program code causes the computer device to perform the steps of the video data processing method according to the various exemplary embodiments of the application described above in this specification, for example the steps of the various embodiments.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in the present application, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's equipment, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of video data processing, the method comprising:
Obtaining a target video stream to be stored, and carrying out background separation processing on the target video stream to obtain at least one target background image corresponding to the target video stream and foreground images corresponding to each video frame in the target video stream;
Based on the foreground images corresponding to the video frames, respectively obtaining object feature sets corresponding to the video frames, wherein object features in the object feature sets correspond to target objects in the foreground images one by one;
Carrying out text processing on each obtained object feature set to obtain a target text frame sequence of the target video stream; each target text frame in the target text frame sequence corresponds to each video frame one by one, and each target text frame comprises object description information of each target object in the corresponding video frame;
And generating a target storage data stream corresponding to the target video stream based on the target text frame sequence and the at least one target background image, and storing the target storage data stream to a preset storage position.
2. The method of claim 1, wherein the obtaining the respective object feature sets for the respective video frames based on the respective foreground images for the respective video frames, respectively, comprises:
Performing object recognition processing on the foreground images of each video frame based on a preset object recognition strategy, and determining at least one target object contained in each foreground image;
for each target object in the at least one target object, respectively executing the following operations:
Aiming at a target object, at least one attribute feature of the target object is obtained based on at least one preset object attribute dimension; each attribute characteristic represents characteristic information of the target object under the attribute dimension of the corresponding object;
Performing feature fusion processing on the at least one attribute feature to obtain an object feature corresponding to the target object;
And obtaining an object feature set of the video frame based on the respective object features of the respective target objects.
3. The method of claim 1, wherein the textualizing the obtained respective object feature sets to obtain a target text frame sequence of the target video stream comprises:
respectively carrying out text processing on each object feature set to obtain object description information corresponding to each video frame;
And generating a target text frame sequence corresponding to the video stream based on the object description information of each video frame, wherein each target text frame in the target text frame sequence is arranged according to the time sequence of the corresponding video frame.
4. The method of claim 1, wherein the performing the background separation on the target video stream to obtain at least one target background image corresponding to the target video stream comprises:
Respectively carrying out background separation processing on each video frame in the target video stream to obtain candidate background images corresponding to each video frame;
Sequentially judging whether the candidate background images of each video frame have background change relative to the detection image according to the video frame sequence of the target video stream;
and when determining that the target candidate background image has background change, taking the target candidate background image as the target background image of the target video stream.
5. The method of claim 4, wherein sequentially determining whether the candidate background images of the respective video frames have a background change with respect to the detected image in the video frame order of the target video stream comprises:
taking the first video frame of the target video stream as a detection image, and comparing the similarity between the detection image and a target candidate background image of the next video frame;
And when the similarity between the target candidate background image of the next video frame and the detection image is smaller than a preset threshold value, determining that the target candidate background image of the next video frame has a background change, and taking the target candidate background image as the next detection image.
6. The method of any of claims 1-5, wherein after generating a target storage data stream corresponding to the target video stream based on the target text frame sequence and the at least one target background image and saving the target storage data stream to a preset storage location, the method further comprises:
Responding to a video playback instruction triggered for the target video stream, and acquiring the target storage data stream;
performing object analysis processing on the target text frame sequence in the target storage data stream to obtain object description information corresponding to each target text frame; each object description information characterizes each target object in the corresponding target video frame;
Performing object rendering processing on the corresponding target background images based on the object description information corresponding to each target video frame to obtain playback images corresponding to each target video frame;
Based on the playback images corresponding to the target video frames, obtaining a playback image sequence corresponding to the target video stream;
and carrying out dynamic display processing on the playback image sequence based on a preset picture refreshing strategy so as to display a video playback picture corresponding to the target video stream.
7. The method of any one of claims 1-5, wherein the method further comprises:
responding to an event early warning instruction, and determining the target video stream based on event occurrence time indicated by the event early warning instruction; the video duration of the target video stream is determined based on the event occurrence time and a preset video duration;
and obtaining storage position information of the target storage data stream based on the target storage data stream corresponding to the target video stream, and sending the storage position information and the event occurrence time to a corresponding storage server.
8. A video data processing apparatus, the apparatus comprising:
The separation unit is used for acquiring a target video stream to be stored, carrying out background separation processing on the target video stream, and acquiring at least one target background image corresponding to the target video stream and foreground images corresponding to each video frame in the target video stream;
The identification unit is used for respectively obtaining object feature sets corresponding to the video frames based on the foreground images corresponding to the video frames, and each object feature in the object feature sets corresponds to each target object in the foreground images one by one;
The processing unit is used for carrying out text processing on each obtained object feature set to obtain a target text frame sequence of the target video stream; each target text frame in the target text frame sequence corresponds to each video frame one by one, and each target text frame comprises object description information of each target object in the corresponding video frame;
And the generating unit is used for generating a target storage data stream corresponding to the target video stream based on the target text frame sequence and the at least one target background image, and storing the target storage data stream to a preset storage position.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
The processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 7.
10. A computer storage medium having stored thereon computer program instructions, characterized in that,
Which computer program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 7.
CN202410013837.3A 2024-01-02 2024-01-02 Video data processing method, device, equipment and storage medium Pending CN117998039A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410013837.3A CN117998039A (en) 2024-01-02 2024-01-02 Video data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410013837.3A CN117998039A (en) 2024-01-02 2024-01-02 Video data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117998039A true CN117998039A (en) 2024-05-07

Family

ID=90898495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410013837.3A Pending CN117998039A (en) 2024-01-02 2024-01-02 Video data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117998039A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118467457A (en) * 2024-07-10 2024-08-09 深圳天海宸光科技有限公司 Picture transmission method, device, medium and equipment based on RDMA
CN118784626A (en) * 2024-06-11 2024-10-15 北京积加科技有限公司 Security monitoring video transmission method, device, equipment and computer readable medium
CN118784626B (en) * 2024-06-11 2025-01-28 北京积加科技有限公司 Security monitoring video transmission method, device, equipment and computer readable medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination