CN114648774B - Subtitle extraction method, device, computer equipment, readable storage medium and product - Google Patents
- Publication number
- CN114648774B (application CN202210266640.1A / CN202210266640A)
- Authority
- CN
- China
- Prior art keywords
- frame
- text detection
- text
- video
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The embodiment of the application discloses a subtitle extraction method, a subtitle extraction device, a computer device, a readable storage medium and a product, which can be applied to various scenes such as cloud technology, artificial intelligence, intelligent transportation and assisted driving. The subtitle extraction device acquires a video to be processed, performs framing processing on the video to be processed to obtain a multi-frame picture sequence, performs text detection processing on each frame of picture to identify the text detection frames in each frame of picture, performs temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed, determines a reference subtitle region according to the text detection frame following track set of the video to be processed, and extracts the subtitles of the video to be processed based on the reference subtitle region. The subtitle extraction accuracy is thereby improved.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a subtitle extraction method, apparatus, computer device, readable storage medium, and product.
Background
Many existing video subtitle extraction methods do not automatically extract subtitles from an input video; instead, they take as input pictures of the subtitle region cropped from the video and extract the subtitles from those subtitle pictures.
Many existing video subtitle extraction methods use traditional computer vision techniques (such as edge detection and image filtering), and the video text detection and recognition steps use traditional OCR. Traditional OCR technology cannot handle text in internet videos with complex backgrounds, and previous subtitle detection and subtitle following methods target simple videos, that is, videos containing no text, or only little text, other than the subtitles. With the growing number of internet videos, the variety of text appearing in a video is also large: the subtitles are only one part of the text in the video, and many other non-subtitle text regions may appear. For such complex videos containing a large amount of text, previous methods cannot accurately extract and follow the subtitles.
Disclosure of Invention
The embodiment of the application provides a subtitle extraction method, a subtitle extraction device, computer equipment, a readable storage medium and a product, which can improve the subtitle extraction efficiency.
A subtitle extraction method, comprising:
acquiring a video to be processed;
Performing framing processing on the video to be processed to obtain a multi-frame picture sequence;
Performing text detection processing on each frame of picture to identify a text detection frame in each frame of picture;
Performing temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed;
Determining a reference subtitle region according to the text detection frame following track set of the video to be processed;
and extracting the caption of the video to be processed based on the reference caption area.
Accordingly, an embodiment of the present application provides a subtitle extracting apparatus, including:
The acquisition unit is used for acquiring the video to be processed;
the framing unit is used for framing the video to be processed to obtain a multi-frame picture sequence;
The detecting unit is used for carrying out text detection processing on each frame of picture so as to identify a text detection frame in each frame of picture;
the track following unit is used for performing temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed;
The determining unit is used for determining a reference subtitle region according to the text detection frame following track set of the video to be processed;
And the extraction unit is used for extracting the subtitle of the video to be processed based on the reference subtitle region.
Optionally, in some embodiments, the track following unit may be specifically configured to perform a temporal track following on a text detection frame in each frame of picture, so as to obtain a following track set of the text detection frame, and generate, according to the following track set of the text detection frame, the text detection frame following track set of the video to be processed.
Optionally, in some embodiments, the track following unit may be specifically configured to obtain a text editing distance of a text detection frame in each adjacent frame picture and coordinate information of a corresponding text detection frame in each frame picture in the adjacent frame pictures, and generate a following track set of the text detection frame according to the coordinate information and the text editing distance.
Optionally, in some embodiments, the track following unit may be specifically configured to determine, according to the coordinate information, area information of the text detection frames in the adjacent frame pictures, and if it is determined, according to the area information, that the text detection frames in the adjacent frame pictures meet a first matching condition, and it is determined, according to the text editing distance, that the text detection frames in the adjacent frame pictures meet a second matching condition, add the text detection frames of the next frame pictures in the adjacent frame pictures to a track following set corresponding to the text detection frames of the previous frame pictures.
Optionally, in some embodiments, the track following unit may be specifically configured to obtain an intersection area and a union area of the text detection frames in the adjacent frame pictures, calculate an intersection-over-union ratio of the text detection frames in the adjacent frame pictures according to the intersection area and the union area, and determine that the text detection frames satisfy the first matching condition if the intersection-over-union ratio of the text detection frames is greater than or equal to a preset intersection-over-union ratio.
Optionally, in some embodiments, the track following unit may be specifically configured to determine that the text detection frame meets a second matching condition if the text editing distance is less than or equal to a preset editing distance.
Optionally, in some embodiments, the track following unit may specifically be configured to identify text content in a text detection frame in an adjacent frame picture by using a preset text identification algorithm, and calculate a text editing distance between text content in the text detection frame in the adjacent frame picture according to the text content.
Optionally, in some embodiments, the track following unit may be specifically configured to initialize a track following set of text detection frames according to the text detection frames in a later frame picture in the adjacent frame picture if it is determined that the text detection frames in the adjacent frame picture do not meet the first matching condition according to the area information and/or it is determined that the text detection frames in the adjacent frame picture do not meet the second matching condition according to the text editing distance.
Optionally, in some embodiments, the track following unit may be specifically configured to, if it is determined according to the area information that a plurality of text detection frames in the adjacent frame pictures satisfy the first matching condition and also satisfy the second matching condition, determine a target text detection frame from the plurality of text detection frames according to the intersection-over-union ratios corresponding to the plurality of text detection frames, and add the target text detection frame to the following track set corresponding to the text detection frame in the previous frame picture of the adjacent frame pictures.
Optionally, in some embodiments, the determining unit may be specifically configured to sequentially select coordinate information corresponding to a reference text detection frame from each text detection frame following track in the text detection frame following track set of the video to be processed, determine, according to a preset algorithm and the coordinate information corresponding to the reference text detection frame, a candidate value of the picture region corresponding to the reference text detection frame, and select the picture region corresponding to the maximum value among the candidate values as the reference subtitle region.
Optionally, in some embodiments, the extracting unit may be specifically configured to obtain coordinate information of the reference caption area and coordinate information of all text detection frames in the text detection frame following track set of the video to be processed, and extract, according to the coordinate information of the reference caption area and the coordinate information of all text detection frames in the text detection frame following track set of the video to be processed, the captions of the video to be processed, where the difference between the coordinate information of the caption area of the video to be processed and the coordinate information of the reference caption area is smaller than a preset value.
In addition, the embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein the memory stores an application program, and the processor is used for running the application program in the memory to realize the subtitle extraction method provided by the embodiment of the application.
In addition, the embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any subtitle extraction method provided by the embodiment of the application.
In addition, an embodiment of the present application further provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements any one of the subtitle extraction methods provided in the embodiments of the present application.
After the electronic device acquires the video to be processed, it performs framing processing on the video to be processed to obtain a multi-frame picture sequence, performs text detection processing on each frame of picture to identify the text detection frames in each frame of picture, performs temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed, determines a reference caption area according to the text detection frame following track set of the video to be processed, and extracts the captions of the video to be processed based on the reference caption area. All text in the video to be processed is detected, the tracks of the text detection frames are followed, and the subtitle region of the video to be processed is calculated according to the following track of each text detection frame, so that subtitles are distinguished from other text in the video to be processed; the subtitles are then extracted, and the accuracy of subtitle extraction in video is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a subtitle extraction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of two frames of a video to be processed according to an embodiment of the present application;
FIG. 3 is a schematic diagram of coordinates of a text detection frame according to an embodiment of the present application;
fig. 4 is a schematic diagram of a picture obtained by merging two adjacent frames of pictures according to an embodiment of the present application;
fig. 5 is another schematic flow chart of subtitle extraction according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a subtitle extracting apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a subtitle extraction method, a subtitle extraction device and a computer readable storage medium. The subtitle extraction device may be integrated in an electronic device, which may be a server or a terminal.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, network acceleration services (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligent platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
For example, referring to fig. 1, taking an example that a subtitle extracting device is integrated in an electronic device, when the electronic device obtains a video to be processed, the electronic device performs framing processing on the video to be processed to obtain a multi-frame picture sequence, then performs text detection processing on each frame of picture to identify a text detection frame in each frame of picture, then performs temporal track following on the text detection frame in each frame of picture to obtain a text detection frame following track set of the video to be processed, then determines a reference subtitle region according to the text detection frame following track set of the video to be processed, and extracts a subtitle of the video to be processed based on the reference subtitle region.
The text detection frame may be a circumscribed rectangular frame of the text, and the following track set of the text detection frames includes coordinate information of a plurality of text detection frames.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
The description will be made from the perspective of a subtitle extraction apparatus, which may be integrated in an electronic device, where the electronic device may be a server or a terminal, and the terminal may include a mobile phone, a computer, a personal computer (PC, personal Computer), an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, a wearable device, a virtual reality device, or other devices that may acquire data. The embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
As shown in fig. 1, the specific flow of the subtitle extraction method is as follows:
101. and acquiring the video to be processed.
For example, the electronic device obtains a video to be processed from a locally stored video, or the electronic device obtains a video to be processed from other devices, or the electronic device obtains a video to be processed from a server, or the electronic device receives a video to be processed input by a user.
102. And carrying out framing treatment on the video to be treated so as to obtain a multi-frame picture sequence.
The video to be processed may be an internet video or a television video, and the video to be processed may contain subtitle text and non-subtitle text, where the non-subtitle text may be a watermark, a barrage, a station logo or an advertisement word in the video, and in order to extract the subtitle in the video to be processed, the video to be processed needs to be first divided into a multi-frame picture sequence, so as to extract the subtitle in each frame picture.
Optionally, the framing of the video to be processed to obtain the multi-frame picture sequence includes obtaining a preset frame rate for the video to be processed, and dividing the video to be processed into the multi-frame picture sequence arranged according to time sequence according to the preset frame rate. The preset frame rate may be an original frame rate of the video to be processed, or a sampling frame rate lower than the original frame rate, for example, the original frame rate of the video to be processed is 24FPS, and the sampling frame rate when the video to be processed is subjected to framing processing may be 10FPS. For example, if the video to be processed is 10s and the preset frame rate is 10FPS, the video to be processed may be divided into 100 frames of pictures.
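As an illustration only, the framing step described above might be sketched with OpenCV roughly as follows; the function name and the 10 FPS sampling rate are assumptions of this example rather than part of the embodiment.

import cv2

def split_into_frames(video_path: str, sample_fps: float = 10.0):
    """Split a video into a time-ordered list of frames at a given sampling frame rate."""
    cap = cv2.VideoCapture(video_path)
    original_fps = cap.get(cv2.CAP_PROP_FPS) or sample_fps
    # Keep roughly one frame every (original_fps / sample_fps) frames.
    step = max(int(round(original_fps / sample_fps)), 1)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames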
103. And carrying out text detection processing on each frame of picture so as to identify a text detection frame in each frame of picture.
The number of the text detection frames in each frame of picture can be one or more, and in order to accurately identify the text detection frames in the pictures, a text detection algorithm can be adopted to process each frame of picture so as to extract the text detection frames. The method for identifying the text detection frames in each frame of picture comprises the steps of obtaining a preset text detection algorithm, carrying out text detection processing on each frame of picture according to the text detection algorithm to extract feature information corresponding to the text in each frame of picture, and determining the text detection frames in each frame of picture according to the feature information. The preset text detection algorithm may be an end-to-end image segmentation algorithm Pixellink based on deep learning, the feature information may be text position information and text visual feature information, the text visual feature information may include information with distinct features in a visual image, such as edges, corner points, circle or ellipse centers, shape features of the image, and the like, and determining a text detection frame in each frame of picture according to the feature information may include inputting the text position information and the visual feature information into a recurrent neural network model to obtain the text detection frame.
Considering that the subtitles in the video to be processed are generally horizontal text, the text detection frame output by the text detection algorithm may be an axis-aligned rectangular frame expressed by the coordinates (x, y, w, h), where x and y are respectively the horizontal and vertical coordinates of the top-left vertex of the text detection frame, and w and h are respectively the width and height of the text detection frame. For convenience of description, a text detection frame in the t-th frame picture may be denoted by the symbol B_t; for example, B_t(x1, y1, w1, h1) is one text detection frame in the t-th frame picture, and B_t(x2, y2, w2, h2) is another text detection frame in the t-th frame picture.
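For illustration, the (x, y, w, h) representation of a text detection frame described above could be held in a small structure such as the following sketch; the class and field names are assumptions of this example and are reused in the later sketches.

from dataclasses import dataclass

@dataclass
class TextBox:
    """A text detection frame B_t: axis-aligned rectangle with recognized text."""
    x: float        # abscissa of the top-left vertex
    y: float        # ordinate of the top-left vertex
    w: float        # width of the detection frame
    h: float        # height of the detection frame
    text: str = ""  # text content recognized inside the frame (filled by OCR)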
104. And carrying out track following on the text detection frames in each frame of picture in time to obtain a text detection frame following track set of the video to be processed.
The text detection frames are followed based on the text detection results, so that the same text across different frame pictures of the video to be processed is connected to form a temporal following track of the text detection frame. Since the track of a text detection frame is a set of text detection frames over a plurality of frames, the last text detection frame in the set can be used to represent the track; for example, the track of one text detection frame can be represented by T_t = {B_s, B_{s+1}, …, B_t}, where B_t denotes the text detection frame in the t-th frame picture and, similarly, B_s denotes the text detection frame in the s-th frame. Here the text detection frame first appears in the s-th frame of the video to be processed and last appears in the t-th frame, so s < t must be satisfied. Text detection frame following is an online algorithm, that is, the text detection frame following result of the s-th frame is generated based on the text detection frame following result of the (s-1)-th frame and the text detection result of the s-th frame. In this embodiment, the text detection frames in each frame of picture need to be followed along a temporal track, so as to obtain a following track set of all the text detection frames in the video to be processed.
Optionally, performing temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed may include performing temporal track following on the text detection frames in each frame of picture to obtain a following track set of the text detection frames, and generating the text detection frame following track set of the video to be processed according to the following track set of the text detection frames.
The following track set of the text detection frames of the video to be processed is H = {T_1, …, T_i, …, T_n}, where H includes the following tracks of n text detection frames, that is, the algorithm considers that n different pieces of text exist in the video to be processed. The following track set of the text detection frames is obtained by performing track following on the text detection frames in each frame of picture, which may specifically include the following steps:
S1, acquiring the text editing distance of a text detection frame in each adjacent frame picture and the coordinate information of the corresponding text detection frame of each frame picture in the adjacent frame picture.
Each adjacent frame picture refers to a pair of adjacent frames, starting from the first frame, in the time-ordered multi-frame picture sequence obtained by framing the video to be processed. For example, if the video to be processed is divided into t frames of pictures, the text editing distance between the text detection frames in the first frame picture and the second frame picture is obtained first, then the text editing distance between the text detection frames in the second frame picture and the third frame picture, and so on in sequence, until the text editing distance between the text detection frames in the (t-1)-th frame picture and the t-th frame picture is obtained.
The text editing distance can measure the similarity of the text contents in two text detection frames: the larger the text editing distance, the lower the similarity of the text contents in the two text detection frames; the smaller the text editing distance, the higher the similarity, that is, the more likely the two text detection frames are the same text detection frame. There may be one or more text detection frames in each frame of picture and, correspondingly, one or more pieces of coordinate information. For example, the coordinate information of one text detection frame in the previous frame of the adjacent frame pictures, i.e. the (t-2)-th frame, may be represented as B_{t-2}(x1, y1, w1, h1) and that of another text detection frame as B_{t-2}(x2, y2, w2, h2); the coordinate information of one text detection frame in the later frame, i.e. the (t-1)-th frame, may be represented as B_{t-1}(x1, y1, w1, h1) and that of another as B_{t-1}(x2, y2, w2, h2).
The obtaining the text editing distance of the text detection frame in each adjacent frame of picture may include:
recognizing the text content in a text detection frame in the picture of the adjacent frame by adopting a preset text recognition algorithm;
and calculating the text editing distance of the text content in the text detection frame in the adjacent frame picture according to the text content.
The preset text recognition algorithm may be an end-to-end text recognition algorithm based on deep learning, for example a convolutional recurrent neural network (CRNN) text recognition model. By calculating the text editing distance of the text detection frames in the adjacent frame pictures, it can be compared whether the content of the text detection frames in the adjacent frame pictures is the same.
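As a sketch of the text editing distance mentioned above, assuming the standard Levenshtein distance between the two recognized strings:

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + cost)    # substitution
            prev = cur
    return dp[n]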
S2, generating a following track set of the text detection frame according to the coordinate information and the text editing distance.
Alternatively, S2 may include:
(1) If it is determined according to the area information that the text detection frames in the adjacent frame pictures meet the first matching condition, and it is determined according to the text editing distance that the text detection frames in the adjacent frame pictures meet the second matching condition, the text detection frame of the next frame picture is added to the following track set corresponding to the text detection frame of the previous frame picture.
The area information of the text detection frames in the adjacent frame pictures comprises intersection areas and union areas of the corresponding areas of the text detection frames in the adjacent frame pictures. The first matching condition is that the ratio of the intersection area and the union area of the corresponding areas of the text detection frames in the adjacent frame pictures is larger than or equal to a preset intersection ratio, and the second matching condition is that the text editing distance of the text detection frames in the adjacent frame pictures is smaller than or equal to a preset editing distance.
The preset intersection ratio may be preset according to an actual requirement, for example, the preset intersection ratio may be set to 0.8, and the preset editing distance may be preset according to an actual requirement, for example, the preset editing distance may be set to 2.
That is, if the ratio of the intersection area and the union area of the corresponding areas of the text detection frames in the adjacent frame pictures is greater than or equal to the preset intersection ratio, and the text editing distance of the text detection frames is smaller than or equal to the preset editing distance, the text detection frames of the next frame picture in the adjacent frame pictures are added to the following track set corresponding to the text detection frames in the previous frame picture.
For example, if the text detection frame B_{t-1}(x1, y1, w1, h1) in the later frame, namely the (t-1)-th frame picture, of the two adjacent frame pictures formed by the (t-2)-th and (t-1)-th frames of the video to be processed, and the text detection frame B_{t-2}(x1, y1, w1, h1) in the previous frame, namely the (t-2)-th frame picture, satisfy the first matching condition and the second matching condition simultaneously, then B_{t-1}(x1, y1, w1, h1) is added to the following track set T_{t-2} = {B_1, …, B_{t-2}(x1, y1, w1, h1)} corresponding to B_{t-2}(x1, y1, w1, h1), so as to obtain T_{t-1} = {B_1, …, B_{t-2}(x1, y1, w1, h1), B_{t-1}(x1, y1, w1, h1)}.
Optionally, in an embodiment, if it is determined according to the area information that the text detection frame in the adjacent frame picture meets the first matching condition and it is determined according to the text editing distance that the text detection frame in the adjacent frame picture meets the second matching condition, adding the text detection frame of the next frame picture to the following track set corresponding to the text detection frame of the previous frame picture may include:
If it is determined according to the area information that a plurality of text detection frames in the adjacent frame pictures satisfy the first matching condition and also satisfy the second matching condition, a target text detection frame is determined from the plurality of text detection frames according to the intersection-over-union ratios corresponding to the plurality of text detection frames, and the target text detection frame is added to the following track set corresponding to the text detection frame in the previous frame picture of the adjacent frame pictures.
For example, if the text detection frame B_{t-1}(x1, y1, w1, h1) of the (t-1)-th frame and the text detection frame B_{t-2}(x1, y1, w1, h1) of the (t-2)-th frame in two adjacent frame pictures satisfy the first matching condition and the second matching condition simultaneously, and the text detection frame B_{t-1}(x2, y2, w2, h2) of the (t-1)-th frame and the text detection frame B_{t-2}(x1, y1, w1, h1) of the (t-2)-th frame also satisfy the first matching condition and the second matching condition simultaneously, that is, two text detection frames in the (t-1)-th frame picture both match one and the same text detection frame in the (t-2)-th frame picture, then a target text detection frame is determined from the two text detection frames in the (t-1)-th frame picture as the one that is correctly matched and added to the following track of the text detection frame B_{t-2}(x1, y1, w1, h1).
Optionally, determining the target text detection frame from the plurality of text detection frames may include comparing the intersection-over-union ratios of the plurality of text detection frames with the text detection frame in the previous frame picture, and selecting, from the plurality of text detection frames, the one having the largest intersection-over-union ratio with the text detection frame in the previous frame picture as the target text detection frame.
For example, if the intersection-over-union ratio of text detection frame B_{t-1}(x1, y1, w1, h1) of the (t-1)-th frame and text detection frame B_{t-2}(x1, y1, w1, h1) of the (t-2)-th frame is 0.83, and the intersection-over-union ratio of text detection frame B_{t-1}(x2, y2, w2, h2) of the (t-1)-th frame and text detection frame B_{t-2}(x1, y1, w1, h1) of the (t-2)-th frame is 0.86, then text detection frame B_{t-1}(x2, y2, w2, h2) is selected as the target text detection frame.
Correspondingly, adding the target text detection frame to the following track set corresponding to the text detection frame in the previous frame picture of the adjacent frame pictures may specifically include adding the target text detection frame B_{t-1}(x2, y2, w2, h2) to the following track set T_{t-2} = {B_1, …, B_{t-2}(x1, y1, w1, h1)} corresponding to the text detection frame B_{t-2}(x1, y1, w1, h1), so as to obtain T_{t-1} = {B_1, …, B_{t-2}(x1, y1, w1, h1), B_{t-1}(x2, y2, w2, h2)}.
Optionally, determining that the text detection frames in the adjacent frame pictures meet the first matching condition according to the area information may include obtaining an intersection area and a union area of the text detection frames in the adjacent frame pictures, calculating an intersection-over-union ratio of the text detection frames in the adjacent frame pictures according to the intersection area and the union area, and determining that the text detection frames satisfy the first matching condition if the intersection-over-union ratio of the text detection frames is greater than or equal to a preset intersection-over-union ratio.
The intersection-over-union ratio of the text detection frames in the adjacent frame pictures is the ratio of the intersection area of the two detection frames to their union area. For example, if the intersection area of two text detection frames in the adjacent frame pictures is M and their union area is N, the intersection-over-union ratio of the two text detection frames is M/N. If there are a plurality of text detection frames in the adjacent frame pictures, the intersection-over-union ratios of the text detection frames in the next frame picture with the text detection frames in the previous frame picture are calculated and compared in sequence, so as to obtain a plurality of intersection-over-union ratios.
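A minimal sketch of the intersection-over-union computation described above, assuming the (x, y, w, h) representation of text detection frames used in this document:

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x, y, w, h) text detection frames."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Coordinates of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0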
Optionally, determining that the text detection frame in the adjacent frame picture meets the second matching condition according to the text editing distance may include determining that the text detection frame meets the second matching condition if the text editing distance is less than or equal to the preset editing distance.
It can be understood that, in practical application, due to reasons such as text being blocked in the pictures, interference from the picture background, or errors of the text recognition algorithm, the text recognition results along the following track of text detection frames belonging to the same text are not necessarily identical. Therefore, in this embodiment, the text editing distance of the text detection frames in the adjacent frame pictures is calculated, and if the editing distance is greater than the preset editing distance, it indicates that the text detection frames in the adjacent frame pictures do not match and cannot be followed as the same text detection frame.
(2) If the text detection frames in the adjacent frame pictures do not meet the first matching condition according to the area information and/or the text detection frames in the adjacent frame pictures do not meet the second matching condition according to the text editing distance, initializing a following track set of the text detection frames according to the text detection frames in the next frame picture in the adjacent frame pictures.
Specifically, if the text detection frames in the adjacent frame pictures do not meet the first matching condition and do not meet the second matching condition, initializing a following track set of the text detection frames according to the text detection frames in the next frame picture in the adjacent frame pictures. That is, the intersection ratio of the text detection frames in the adjacent frame pictures is smaller than the preset intersection ratio, and the text editing distance between the text detection frames is larger than the preset editing distance, so that a new text detection frame following track is created for the text detection frames of the next frame picture in the adjacent frame pictures.
Or if the text detection frames in the adjacent frame pictures do not meet the first matching condition or do not meet the second matching condition, initializing a following track set of the text detection frames according to the text detection frames in the next frame picture in the adjacent frame pictures. That is, if the intersection ratio of the text detection frames in the adjacent frame pictures is smaller than the preset intersection ratio or the text editing distance between the text detection frames is larger than the preset editing distance, initializing the text detection frame of the next frame picture in the adjacent frame pictures as a new text detection frame following track.
It will be appreciated that if a text detection frame in the next frame picture matches none of the text detection frames in the previous frame picture, it indicates that this text detection frame did not occur in the previous frame, that is, it is a new text detection frame, so the following track needs to be initialized with this text detection frame in the next frame picture to create a new text detection frame following track.
For example, referring to fig. 2, fig. 2 is a schematic diagram of two frame pictures of a video to be processed in the subtitle extraction process. The station logo in each frame picture of the video to be processed, that is, the text detection frame of the region where "XXTV" is located in the two adjacent frame pictures in fig. 2, is identical, and can be followed as the same text detection frame track. The subtitles in the video to be processed, however, may change in real time, that is, the subtitle of the next frame picture differs from the subtitle of the previous frame picture. As in fig. 2, the intersection-over-union ratio of the text detection frame of the subtitle "raising my head, I gaze at the bright moon; lowering my head, I think of my hometown" in the next frame picture and the text detection frame of the subtitle "I suspect it is frost on the ground" in the previous frame picture is smaller than the preset intersection-over-union ratio, and their text editing distance is greater than the preset editing distance, that is, the text detection frames in the two frame pictures do not match. At this time, a new following track needs to be created for the text detection frame in the region where the subtitle of the next frame picture is located, that is, the text detection frame of "raising my head, I gaze at the bright moon; lowering my head, I think of my hometown" in the next frame is a new text detection frame and needs to be initialized as a new text detection frame following track.
Therefore, if new subtitles appear in the video to be processed, the text detection frames corresponding to the new subtitles are not necessarily matched with the text detection frames in the previous frame of pictures, and the text detection frames corresponding to the subtitles are initialized to be a new text detection frame following track, that is, each subtitle has a corresponding text detection frame following track, so that the text detection frames corresponding to each subtitle can be accurately detected, and the accuracy of extracting the text detection frames corresponding to the subtitles can be improved.
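Putting the two matching conditions together, one frame of the online following step described above might look like the following sketch, reusing the iou and edit_distance helpers and the TextBox structure sketched earlier; the thresholds are the example values mentioned above (0.8 for the preset intersection-over-union ratio, 2 for the preset editing distance), and the data structures and function names are assumptions of this illustration rather than the embodiment itself.

IOU_THRESHOLD = 0.8   # preset intersection-over-union ratio (first matching condition)
EDIT_THRESHOLD = 2    # preset text editing distance (second matching condition)

def update_tracks(tracks, current_boxes):
    """One online step of text detection frame following: for each existing
    track, look for a matching detection frame in the current frame; frames
    that match no track start a new following track."""
    unmatched = list(current_boxes)
    for track in tracks:
        last = track[-1]  # a track is represented by its last detection frame
        candidates = []
        for box in unmatched:
            score = iou((last.x, last.y, last.w, last.h), (box.x, box.y, box.w, box.h))
            if score >= IOU_THRESHOLD and edit_distance(last.text, box.text) <= EDIT_THRESHOLD:
                candidates.append((score, box))
        if candidates:
            # Several frames may satisfy both conditions; keep the one with the largest IoU.
            _, best = max(candidates, key=lambda c: c[0])
            track.append(best)
            unmatched.remove(best)
    for box in unmatched:
        tracks.append([box])  # new text: initialize a new following track
    return tracks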
105. And determining a reference subtitle region according to the text detection frame following track set of the video to be processed.
Optionally, determining the reference caption area according to the text detection frame following track set of the video to be processed may include sequentially selecting coordinate information corresponding to one reference text detection frame from each text detection frame following track in the text detection frame following track set of the video to be processed, determining, according to a preset algorithm and the coordinate information corresponding to the reference text detection frame, a candidate value of the picture region corresponding to the reference text detection frame, and selecting the picture region corresponding to the maximum value among the candidate values as the reference caption area.
It should be noted that, assuming that the set of the text detection frame following track of the video to be processed is H, H may be expressed as
H = {T_1, …, T_i, …, T_n}
H includes n text detection frame following tracks, that is, the subtitle extraction device considers that n different text regions exist in the video. For each text detection frame following track T_i of the video to be processed, a reference text detection frame B_t ∈ T_i is selected in sequence.
Referring to fig. 3, fig. 3 is a schematic diagram of the coordinates of a text detection frame, where B_t adopts the coordinates
B_t = (x_t, y_t, w_t, h_t)
where x_t and y_t are the abscissa and ordinate of the top-left vertex of the text detection frame, and w_t and h_t are respectively the width and height of the text detection frame.
The preset algorithm for calculating the candidate value R (x, y) of the picture area corresponding to the reference text detection frame is as follows:
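The formula itself is not reproduced here; based on the surrounding description and the example of fig. 4, one plausible reading is that each text detection frame following track contributes one count, through its reference text detection frame, to every position it covers:

R(x, y) = Σ_{i=1}^{n} 1[(x, y) ∈ B^{(i)}]

where B^{(i)} ∈ T_i denotes the reference text detection frame selected from the following track T_i, and 1[·] is the indicator function.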
That is to say, each time the following track of a text detection frame in a frame picture of the video to be processed is updated, the candidate value of the picture region corresponding to that text detection frame is increased by one. It can be understood that, in general, the text in the region where the subtitles are located in the video to be processed changes most frequently, that is, the region where the subtitles are located accumulates the most text detection frame following tracks, so the candidate value of the region corresponding to the subtitle text detection frames is the largest.
Referring to fig. 4, fig. 4 is a schematic diagram of a picture obtained by merging two adjacent frame pictures of a video to be processed. Assume that the video to be processed includes picture frame 10 and picture frame 20, where picture frame 10 is the first frame of the video to be processed. Since the text detection frame 22 of the station logo "XXTV" in picture frame 20 and the text detection frame 12 of the station logo "XXTV" in picture frame 10 satisfy the first matching condition and the second matching condition, they belong to the same text detection frame following track. The text detection frames corresponding to the subtitles in the two adjacent frame pictures, however, are not identical, that is, the text detection frame 21 in picture frame 20 does not match the text detection frame 11 in picture frame 10; in other words, the text detection frame 21 is a newly appearing text detection frame, and the text detection frame 11 and the text detection frame 21 form two separate text detection frame following tracks. At this time, the text detection frame 12 in picture frame 10 and the text detection frame 22 in picture frame 20 belong to the same following track, so the R value of the picture region corresponding to the station logo is 1, whereas the text detection frame 11 and the text detection frame 21 are two following tracks whose regions overlap, so the R value of the picture region where they overlap is 2. The candidate value of the region where the text detection frames 11 and 21 corresponding to the subtitles in the two frame pictures 10 and 20 repeatedly appear is therefore the largest, that is, the candidate value of the gray region in fig. 4 is the largest. This shows that the more text following tracks cover a region of the picture, the larger the candidate value of that region; the text of the subtitle region in the video to be processed changes the most, i.e. the subtitle region has the most text detection frame following tracks, so the candidate value of the subtitle region is the largest.
Therefore, the position information of the caption area can be accurately distinguished according to the candidate value, and the accuracy of caption extraction is improved.
Assuming that the maximum value among the candidate values R(x, y) is R_max, a binary matrix M is calculated in which the positions whose candidate value reaches R_max are set to 1 and all other positions are set to 0,
and the axis-aligned rectangular box of the largest connected region with M value 1 is selected as the reference subtitle region in the video to be processed.
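A rough sketch of this reference-region step under the reading above: accumulate one count per following track, binarize at the maximum candidate value, and take the axis-aligned bounding rectangle of the largest connected region. The use of scipy for connected-component labelling and the function name are assumptions of the example.

import numpy as np
from scipy import ndimage

def reference_subtitle_region(tracks, frame_h, frame_w):
    """Estimate the reference subtitle region (x, y, w, h) from the set of
    text detection frame following tracks."""
    if not tracks:
        return None
    R = np.zeros((frame_h, frame_w), dtype=np.int32)
    for track in tracks:
        b = track[-1]  # reference text detection frame of this track
        x, y, w, h = int(b.x), int(b.y), int(b.w), int(b.h)
        R[y:y + h, x:x + w] += 1  # each track adds one count to the region it covers
    M = (R == R.max()).astype(np.uint8)   # binary matrix of maximal candidate values
    labels, num = ndimage.label(M)        # connected regions with M value 1
    if num == 0:
        return None
    # Keep the largest connected region and return its axis-aligned bounding rectangle.
    sizes = ndimage.sum(M, labels, range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labels == largest)
    x0, y0 = xs.min(), ys.min()
    return int(x0), int(y0), int(xs.max() - x0 + 1), int(ys.max() - y0 + 1)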
In this way, the electronic device determines the reference subtitle region of the video to be processed from the text detection frame following track set of the video to be processed.
106. And extracting subtitles of the video to be processed based on the reference subtitle region.
Optionally, extracting the captions of the video to be processed based on the reference caption area may include obtaining coordinate information of the reference caption area and coordinate information of all text detection frames in the text detection frame following track set of the video to be processed, and extracting the captions of the video to be processed according to the coordinate information of the reference caption area and the coordinate information of all text detection frames in the text detection frame following track set of the video to be processed, where the difference between the coordinate information of the caption area of the video to be processed and the coordinate information of the reference caption area is smaller than a preset value.
The reference subtitle region may be represented by the coordinates (x_s, y_s, w_s, h_s).
It can be understood that, since the reference caption area is the region where the text detection frames corresponding to all captions in the video to be processed overlap the most, the coordinates of the text detection frames corresponding to the captions in the video to be processed are relatively close to the coordinates of the reference caption area; in particular, the vertical coordinates and heights of the circumscribed rectangular frames of the text, i.e. the text detection frames, are relatively close. Therefore, in order to accurately acquire all the subtitles of the video to be processed, the text detection frames corresponding to all subtitles need to be determined from all the text detection frames of the video to be processed based on the reference subtitle region; that is, from all the following track sets of the video to be processed, the following tracks whose text detection frames have vertical coordinates equal to, or differing only slightly from, the vertical coordinate of the reference subtitle region are selected and regarded as subtitles, and the subtitles of the video to be processed are extracted from the text detection frames corresponding to these subtitles.
Optionally, extracting the caption of the video to be processed according to the coordinate information of the reference caption area and the coordinate information of all the text detection frames in the text detection frame follow track set of the video to be processed comprises extracting the text in the text detection frame as the caption of the video to be processed if the difference between the ordinate of the text detection frame of the video to be processed and the ordinate of the reference caption area is smaller than a first preset value and the difference between the height of the text detection frame of the video to be processed and the height of the reference caption area is smaller than a second preset value.
The first preset value and the second preset value are both small; the first preset value may be δy and the second preset value may be δh. For example, for the text detection frame following track T_i, one text detection frame B_t = (x_t, y_t, w_t, h_t) is selected, and when the text detection frame B_t satisfies |y_t - y_s| < δy and |h_t - h_s| < δh, it is determined that the text detection frame following track T_i is a subtitle.
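A sketch of this filtering step, reusing the earlier helpers; the concrete δy and δh values below are illustrative assumptions, not values given by the embodiment:

DELTA_Y = 10   # first preset value δy (illustrative)
DELTA_H = 8    # second preset value δh (illustrative)

def extract_subtitle_tracks(tracks, ref_region):
    """Keep only the following tracks whose detection frames lie at roughly the
    same vertical position and height as the reference subtitle region."""
    _, ys, _, hs = ref_region
    subtitle_tracks = []
    for track in tracks:
        b = track[-1]  # any detection frame of the track can be compared
        if abs(b.y - ys) < DELTA_Y and abs(b.h - hs) < DELTA_H:
            subtitle_tracks.append(track)
    # The subtitle text of the video is then read from the detection frames of these tracks.
    return subtitle_tracks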
In an actual application scene, some internet videos contain a great deal of text content, and the video to be processed may contain much other text that does not belong to the subtitles, such as barrages or advertisement text. In order to extract the subtitles in the video to be processed more accurately, the following tracks of the text detection frames corresponding to the obtained subtitles can be post-processed. The rules followed by the post-processing are, for example: (1) the same subtitle is located in different frame pictures of the video to be processed, and even if the detection frames output by the text detection algorithm for the same subtitle in different frames deviate from each other, the deviation should be within a certain error range; (2) horizontal text is selected as the subtitle, and the vertical coordinates of the horizontal text corresponding to different subtitles are the same, or differ only slightly, throughout the video to be processed.
As can be seen from the above, in the embodiment of the present invention, after a video to be processed is acquired by the subtitle extraction device, the video to be processed is subjected to framing processing to obtain a multi-frame image sequence, then each frame of image is subjected to text detection processing to identify text detection frames in each frame of image, and the text detection frames in each frame of image are subjected to temporal track following to obtain a text detection frame following track set of the video to be processed, a reference subtitle region is determined according to the text detection frame following track set of the video to be processed, and subtitles of the video to be processed are extracted based on the reference subtitle region. The method comprises the steps of detecting all characters in the video to be processed, following the track of the character detection frame, and calculating the subtitle region of the video to be processed according to the following track of the character detection frame, so that the subtitle and other characters in the video to be processed are distinguished, further, the subtitle is extracted, and the accuracy of extracting the subtitle in the video is improved.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, description will be given taking an example in which the subtitle extracting apparatus is specifically integrated in an electronic device. As shown in fig. 5, fig. 5 is another flow chart of subtitle extraction according to an embodiment of the present application.
A subtitle extraction method specifically comprises the following steps:
201. the electronic equipment acquires a video to be processed;
For example, the electronic device acquires a video to be processed input by a user, or the electronic device automatically detects the video to be processed, wherein the video to be processed can be an internet video or a television video, and the video to be processed can contain subtitle characters and non-subtitle characters, and the non-subtitle characters can be watermarks, barrages, station marks or advertisement words in the video.
202. The electronic equipment carries out framing treatment on the video to be treated so as to obtain a multi-frame picture sequence;
For example, the electronic device performs frame division processing on the acquired video to be processed according to the original frame rate or sampling frame rate of the video to be processed, so as to obtain a multi-frame picture sequence arranged according to the time sequence.
203. The electronic equipment performs text detection processing on each frame of picture so as to identify a text detection frame in each frame of picture;
For example, the electronic device may perform text detection processing on each frame of picture by using an end-to-end image segmentation algorithm based on deep learning, so as to output the text detection frames of each frame of picture. A text detection frame is an axis-aligned rectangular frame and may be represented by the coordinates B_t(x, y, w, h), where B_t denotes a text detection frame in the t-th frame picture, x and y are respectively the abscissa and ordinate of the top-left vertex of the text detection frame, and w and h are respectively the width and height of the text detection frame.
204. The electronic equipment acquires the text editing distance of the text detection frame in each adjacent frame picture and the coordinate information of the corresponding text detection frame of each frame picture in the adjacent frame picture;
The electronic device obtains the coordinate information of the text detection frames in each adjacent frame picture; for example, the coordinate information of one text detection frame of the previous frame in the adjacent frame pictures, i.e. the (t-2)-th frame, may be denoted as B_{t-2}(x1, y1, w1, h1) and that of another text detection frame as B_{t-2}(x2, y2, w2, h2), while the coordinate information of one text detection frame of the later frame, i.e. the (t-1)-th frame, may be denoted as B_{t-1}(x1, y1, w1, h1) and that of another as B_{t-1}(x2, y2, w2, h2).
The electronic device uses a convolutional recurrent neural network (CRNN) text recognition model to recognize the text content in the text detection frames in each adjacent frame picture, and calculates the text editing distances of the text detection frames in the adjacent frame pictures according to the text content; for example, it calculates the text editing distance of the text detection frames of the first frame picture and the second frame picture, and the text editing distance of the text detection frames between the second frame picture and the third frame picture.
205. And the electronic equipment generates a text detection frame following track set of the video to be processed according to the coordinate information and the text editing distance.
Track following of text detection frames means connecting the same text across different frame pictures of the video to be processed in time order, forming the temporal track of a text detection frame. One text following track may be denoted by T:

T = {B_s, B_{s+1}, …, B_t}

where B_t denotes the text detection frame in the t-th frame and, similarly, B_s denotes the text detection frame in the s-th frame; the text first appears in the s-th frame of the video and last appears in the t-th frame, so s < t must hold.
Track following of text detection frames is an online algorithm: the text detection frame following result of the s-th frame is generated from the following result of the (s-1)-th frame together with the text detection frames of the s-th frame. For a text detection frame detected in the first frame, s = 1, and its following track is initialized as T = {B_1}.
Specifically, the electronic device may generate the text detection frame following track set of the video to be processed according to the coordinate information and the text edit distance as follows:
(1) Calculate the intersection-over-union (IoU) and the text edit distance of the text detection frames in adjacent frame pictures.
The IoU of two text detection frames is the ratio of their intersection area to their union area. For example, if the intersection area of the text detection frame B1 in the first frame picture and the text detection frame B2 in the second frame picture is A and their union area is B, the IoU of B1 and B2 is A/B. Since a text detection frame following track T is a set of text detection frames, the last text detection frame in the set is used to represent the following track.
(2) Determine whether the IoU of the text detection frames in the adjacent frame pictures meets the first matching condition.
The first matching condition is that the ratio of the intersection area to the union area of the regions of the text detection frames in the adjacent frame pictures is greater than or equal to a preset IoU threshold. If the IoU of the text detection frames in the adjacent frame pictures is greater than or equal to the preset IoU threshold, proceed to the next step and determine whether the edit distance of the two text detection frames meets the second matching condition.
(3) Determine whether the text edit distance of the text detection frames in the adjacent frame pictures meets the second matching condition.
The second matching condition is that the text edit distance of the text detection frames in the adjacent frame pictures is less than or equal to a preset edit distance. If the edit distance of the text detection frames in the adjacent frame pictures is less than or equal to the preset edit distance, the two text detection frames are determined to be successfully matched, that is, they contain the same text.
(4) If the IoU of the text detection frames in the adjacent frame pictures meets the first matching condition and the text edit distance meets the second matching condition, the two text detection frames are determined to be successfully matched, and the text detection frame in the later frame picture is added to the following track of the text detection frame in the earlier frame picture. For example, if the text following track up to the previous frame is T_{t-1} and the detection frame matched to that track is B_t, the updated track T_t relates to T_{t-1} as follows:

T_{t-1} = {B_s, B_{s+1}, …, B_{t-1}},  T_t = {B_s, B_{s+1}, …, B_{t-1}, B_t}.
(5) If several text detection frames in the adjacent frame pictures meet both the first matching condition on IoU and the second matching condition on text edit distance, the text detection frame with the largest IoU among them is selected as the correctly matched text detection frame.
(6) If the IoU of a text detection frame with the text detection frames in the adjacent frame pictures does not meet the first matching condition and/or the text edit distance does not meet the second matching condition, that is, if the text detection frame fails to match all text detection frames of the previous frames, the text detection frame is considered to contain a new text; in that case, it is used to initialize a new text detection frame following track. A sketch of this matching procedure is given below.
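Taken together, sub-steps (1) to (6) amount to a simple online matcher: each track is represented by its most recent detection frame, a new detection is attached to the track whose last frame has a sufficient IoU and a small enough edit distance, ties are broken by the largest IoU, and unmatched detections start new tracks. A sketch under those assumptions, reusing the `TextBox` and `edit_distance` helpers sketched above (the two threshold values are illustrative, not taken from the disclosure):

```python
from typing import List

IOU_THRESHOLD = 0.5          # preset IoU threshold, illustrative value
EDIT_DISTANCE_THRESHOLD = 2  # preset edit distance, illustrative value

def iou(a: TextBox, b: TextBox) -> float:
    """Intersection-over-union of two axis-aligned text detection frames."""
    left, top = max(a.x, b.x), max(a.y, b.y)
    right, bottom = min(a.x + a.w, b.x + b.w), min(a.y + a.h, b.y + b.h)
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0

def update_tracks(tracks: List[List[TextBox]], detections: List[TextBox]) -> None:
    """Attach each new detection to the best-matching track or start a new track."""
    for box in detections:
        best_track, best_iou = None, 0.0
        for track in tracks:
            last = track[-1]  # a track is represented by its last detection frame
            overlap = iou(last, box)
            if (overlap >= IOU_THRESHOLD                                           # condition (1)
                    and edit_distance(last.text, box.text) <= EDIT_DISTANCE_THRESHOLD  # condition (2)
                    and overlap > best_iou):                                       # sub-step (5)
                best_track, best_iou = track, overlap
        if best_track is not None:
            best_track.append(box)   # sub-step (4): extend the matched track
        else:
            tracks.append([box])     # sub-step (6): initialize a new track
```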
206. The electronic device selects, in turn, the coordinate information of one reference text detection frame from each text detection frame following track in the text detection frame following track set of the video to be processed;
It should be noted that, assuming the height and width of a picture of the video to be processed are O and P, an all-zero matrix R of dimension O×P is first initialized, and the values of R are then computed based on the text detection frame following track set. Assuming the text detection frame following track set of the video to be processed is H, H is expressed as follows:
H = {T_1, …, T_i, …, T_n}
H comprises n text detection frame following tracks, that is, the electronic device has determined that n different texts exist in the video to be processed. For each following track T_i in the video in turn, a reference text detection frame B_t ∈ T_i is selected and represented by four coordinates:
B_t = (x_t, y_t, w_t, h_t)
where x_t and y_t are the abscissa and ordinate of the top-left vertex of the reference text detection frame, and w_t and h_t are its width and height, respectively.
207. The electronic device determines, by a preset algorithm, the candidate values of the picture areas corresponding to the reference text detection frames according to the coordinate information corresponding to the reference text detection frames;
The candidate value, namely the R value, of the picture region corresponding to the reference text detection frame B_t is calculated according to a preset algorithm.
208. The electronic device selects the picture area corresponding to the maximum value among the candidate values as the reference subtitle area.
Assuming that the maximum value in the matrix R is r, a binary matrix M is calculated, and the axis-aligned bounding rectangle of the largest connected region in which M equals 1 is selected as the reference subtitle area in the video, expressed by the coordinates (x_s, y_s, w_s, h_s).
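The exact formulas for R and M are not reproduced above. One plausible reading, used here only as an assumption, is that R counts for every pixel how many per-track reference text detection frames cover it, M marks the pixels where R reaches its maximum, and the reference subtitle area is the bounding rectangle of the largest connected region of M. A NumPy/SciPy sketch under that assumption:

```python
from typing import List, Tuple
import numpy as np
from scipy import ndimage

def reference_subtitle_area(frame_height: int, frame_width: int,
                            reference_boxes: List[TextBox]) -> Tuple[int, int, int, int]:
    """Pick the picture area covered by the most per-track reference frames.

    reference_boxes holds one reference TextBox per following track
    (one per distinct text). Returns (x_s, y_s, w_s, h_s).
    """
    R = np.zeros((frame_height, frame_width), dtype=np.int32)
    for box in reference_boxes:
        x, y, w, h = int(box.x), int(box.y), int(box.w), int(box.h)
        R[y:y + h, x:x + w] += 1              # assumed accumulation rule

    M = (R == R.max()).astype(np.uint8)       # binary matrix of maximal R value
    labels, count = ndimage.label(M)          # connected regions where M == 1
    sizes = ndimage.sum(M, labels, index=range(1, count + 1))
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labels == largest)
    x_s, y_s = int(xs.min()), int(ys.min())
    return x_s, y_s, int(xs.max()) - x_s + 1, int(ys.max()) - y_s + 1
```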
209. The electronic device extracts subtitles of the video to be processed based on the reference subtitle region.
Based on the ordinate y_s and the height h_s of the reference subtitle area, the text detection frame following tracks whose ordinate and height are close to y_s and h_s are selected from the following track set H of the whole video to be processed and regarded as subtitles. For example, if the difference between the ordinate of a text detection frame of the video to be processed and the ordinate of the reference subtitle area is smaller than a first preset value, and the difference between the height of that text detection frame and the height of the reference subtitle area is smaller than a second preset value, the text in that text detection frame is extracted as a subtitle of the video to be processed.
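Step 209 can thus be read as a filter over all following tracks: a track is kept as subtitle text when the ordinate and height of its frames stay close to those of the reference subtitle area. A sketch under that reading, reusing the structures above (the two tolerance values stand in for the first and second preset values and are illustrative assumptions):

```python
from typing import List

def extract_subtitles(tracks: List[List[TextBox]], y_s: float, h_s: float,
                      y_tolerance: float = 10.0, h_tolerance: float = 6.0) -> List[str]:
    """Keep tracks whose frames lie in the reference subtitle area and return their text."""
    subtitles = []
    for track in tracks:
        box = track[-1]  # each track is represented by its last detection frame
        if abs(box.y - y_s) < y_tolerance and abs(box.h - h_s) < h_tolerance:
            subtitles.append(box.text)   # one subtitle line per following track
    return subtitles
```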
As can be seen from the above, in the embodiment of the present invention, after the subtitle extraction device acquires a video to be processed, it performs framing processing on the video to obtain a multi-frame picture sequence, performs text detection processing on each frame picture to identify the text detection frames in each frame picture, performs temporal track following on the text detection frames in each frame picture to obtain a text detection frame following track set of the video to be processed, determines a reference subtitle area according to that track set, and extracts the subtitles of the video to be processed based on the reference subtitle area. By detecting all text in the video to be processed, following the tracks of the text detection frames, and calculating the subtitle region of the video from those following tracks, subtitles are distinguished from other text in the video and then extracted, which improves the accuracy of subtitle extraction in the video.
In order to better implement the above method, an embodiment of the present invention further provides a subtitle extraction device, which may be integrated in an electronic device such as a server or a terminal, where the terminal may include a tablet computer, a smart television, a mobile phone, a notebook computer, and/or a personal computer.
For example, as shown in fig. 6, the subtitle extraction apparatus may include an acquiring unit 301, a framing unit 302, a detecting unit 303, a track following unit 304, a determining unit 305, and an extracting unit 306, as follows:
an acquiring unit 301, configured to acquire a video to be processed;
The video to be processed may be an internet video or a television video and may contain subtitle text and non-subtitle text, where the non-subtitle text may be a watermark, a station logo, or advertising text in the video. In order to extract the subtitles in the video to be processed, the video first needs to be divided into a multi-frame picture sequence so that the subtitles in each frame picture can be extracted.
The framing unit 302 is configured to perform framing processing on the video to be processed to obtain a multi-frame picture sequence;
For example, the video to be processed is subjected to framing processing according to a preset frame rate to obtain a multi-frame picture sequence arranged according to a time sequence.
A detecting unit 303, configured to perform a text detection process on each frame of picture, so as to identify a text detection frame in each frame of picture;
For example, a text detection algorithm may be used to process each frame picture to extract the text detection frames: a preset text detection algorithm is acquired, text detection processing is performed on each frame picture according to that algorithm to extract feature information corresponding to the text in each frame picture, and the text detection frames in each frame picture are determined according to the feature information.
The track following unit 304 is configured to perform temporal track following on the text detection frames in each frame picture, so as to obtain a text detection frame following track set of the video to be processed;
For example, track following is performed on the text detection frames obtained by text detection, so that the same text across different frame pictures of the video to be processed is connected to form the temporal track followed by the text detection frame.
A determining unit 305, configured to determine a reference caption area according to the text detection frame following track set of the video to be processed;
For example, the coordinate information of one reference text detection frame is selected in turn from each text detection frame following track in the text detection frame following track set of the video to be processed, the candidate values of the picture areas corresponding to the reference text detection frames are determined by a preset algorithm according to the coordinate information corresponding to the reference text detection frames, and the picture area corresponding to the maximum value among the candidate values is selected as the reference subtitle area.
An extracting unit 306, configured to extract subtitles of the video to be processed based on the reference subtitle region.
For example, since the reference caption area is the area where the text detection frames corresponding to all captions in the video to be processed overlap most, the coordinates of the text detection frames corresponding to the captions are close to the coordinates of the reference caption area; in particular, the ordinates and heights of the circumscribed rectangular frames of the text, i.e., the text detection frames, are close. Therefore, in order to accurately acquire all captions of the video to be processed, the text detection frames corresponding to all captions need to be determined from all text detection frames of the video based on the reference caption area: from the set of all following tracks of the video to be processed, the text detection frame following tracks whose ordinate is equal to, or differs only slightly from, the ordinate of the reference caption area are selected and regarded as captions, and the captions of the video to be processed are extracted from the corresponding text detection frames.
In specific implementations, the above units may be implemented as independent entities, or combined arbitrarily and implemented as one or several entities; for the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated herein.
As can be seen from the above, in this embodiment the acquiring unit 301 acquires a video to be processed, the framing unit 302 performs framing processing on the video to obtain a multi-frame picture sequence, the detecting unit 303 performs text detection processing on each frame picture to identify the text detection frames in each frame picture, the track following unit 304 performs temporal track following on the text detection frames in each frame picture to obtain a text detection frame following track set of the video to be processed, the determining unit 305 determines a reference caption area according to that track set, and the extracting unit 306 extracts the captions of the video to be processed based on the reference caption area. By detecting all text in the video to be processed, following the tracks of the text detection frames, and calculating the subtitle region of the video from those following tracks, subtitles are distinguished from other text in the video and then extracted, which improves the accuracy of subtitle extraction in the video.
The embodiment of the application also provides a computer device, as shown in fig. 7, which shows a schematic structural diagram of the computer device according to the embodiment of the application, specifically:
The computer device may include a processor 401 with one or more processing cores, a memory 402 of one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the computer device structure shown in FIG. 7 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
The processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, the processor 401 may include one or more processing cores, and preferably the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function and an image playing function), and the like, and the data storage area may store data created according to the use of the computer device, and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 403 may also include one or more direct-current or alternating-current power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
The method comprises the steps of obtaining a video to be processed, carrying out frame division processing on the video to be processed to obtain a multi-frame picture sequence, carrying out text detection processing on each frame of picture to identify text detection frames in each frame of picture, carrying out temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed, determining a reference caption area according to the text detection frame following track set of the video to be processed, and extracting captions of the video to be processed based on the reference caption area.
For example, the computer device divides the video to be processed into a multi-frame picture sequence according to a preset frame rate, performs text detection processing on each frame picture according to a preset text detection algorithm to identify the text detection frames in each frame picture, performs temporal track following on the text detection frames in each frame picture to obtain the following track set of all text detection frames in the video to be processed, and calculates the caption area of the video to be processed according to the text detection frame following track set, so as to distinguish the captions in the video from other text and then extract the captions.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, after acquiring a video to be processed, the computer device in the embodiment of the present application performs framing processing on the video to obtain a multi-frame picture sequence, performs text detection processing on each frame picture to identify the text detection frames in each frame picture, performs temporal track following on the text detection frames in each frame picture to obtain a text detection frame following track set of the video to be processed, determines a reference caption area according to that track set, and extracts the captions of the video to be processed based on the reference caption area. By detecting all text in the video to be processed, following the tracks of the text detection frames, and calculating the subtitle region of the video from those following tracks, the subtitles are extracted, and the accuracy of subtitle extraction in the video is improved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any one of the subtitle extraction methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:
The method comprises the steps of obtaining a video to be processed, carrying out frame division processing on the video to be processed to obtain a multi-frame picture sequence, carrying out text detection processing on each frame of picture to identify text detection frames in each frame of picture, carrying out temporal track following on the text detection frames in each frame of picture to obtain a text detection frame following track set of the video to be processed, determining a reference caption area according to the text detection frame following track set of the video to be processed, and extracting captions of the video to be processed based on the reference caption area.
For example, the computer device divides the video to be processed into a multi-frame picture sequence according to a preset frame rate, performs text detection processing on each frame picture according to a preset text detection algorithm to identify the text detection frames in each frame picture, performs temporal track following on the text detection frames in each frame picture to obtain the following track set of all text detection frames in the video to be processed, and calculates the caption area of the video to be processed according to the text detection frame following track set, so as to distinguish the captions in the video from other text and then extract the captions.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer-readable storage medium may include, among others, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any subtitle extraction method provided by the embodiments of the present application, they can achieve the beneficial effects of any subtitle extraction method provided by the embodiments of the present application; for details, refer to the previous embodiments, which are not repeated herein.
According to an aspect of the present application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various optional implementations of the subtitle extraction aspect described above.
The foregoing describes in detail a subtitle extraction method, apparatus, and computer-readable storage medium according to embodiments of the present application. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210266640.1A CN114648774B (en) | 2022-03-17 | 2022-03-17 | Subtitle extraction method, device, computer equipment, readable storage medium and product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114648774A CN114648774A (en) | 2022-06-21 |
CN114648774B true CN114648774B (en) | 2024-12-24 |
Family
ID=81995568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210266640.1A Active CN114648774B (en) | 2022-03-17 | 2022-03-17 | Subtitle extraction method, device, computer equipment, readable storage medium and product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114648774B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116055766B (en) * | 2023-01-31 | 2025-04-29 | 北京达佳互联信息技术有限公司 | Bullet screen anti-blocking method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113052169A (en) * | 2021-03-15 | 2021-06-29 | 北京小米移动软件有限公司 | Video subtitle recognition method, device, medium, and electronic device |
CN113392689A (en) * | 2020-12-25 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Video character tracking method, video processing method, device, equipment and medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101865099B1 (en) * | 2017-06-07 | 2018-06-07 | 이상국 | Apparatus and method for retrieving video image |
CN108052941B (en) * | 2017-12-19 | 2021-06-01 | 北京奇艺世纪科技有限公司 | News subtitle tracking method and device |
CN111723790B (en) * | 2020-06-11 | 2024-08-09 | 腾讯科技(深圳)有限公司 | Video subtitle screening method, device, equipment and storage medium |
CN112149646B (en) * | 2020-11-24 | 2021-03-09 | 北京易真学思教育科技有限公司 | Layout analysis method, device, equipment and storage medium |
CN113343986B (en) * | 2021-06-29 | 2023-08-25 | 北京奇艺世纪科技有限公司 | Subtitle time interval determining method and device, electronic equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114648774A (en) | 2022-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||