CN113934888B - Video tag processing method and device
- Publication number: CN113934888B
- Application number: CN202111153268.5A
- Authority: CN (China)
- Prior art keywords: video, alternative, tag, target, labels
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/74—Browsing; Visualisation therefor
- G06F16/75—Clustering; Classification
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
Abstract
The embodiment of the application provides a video tag processing method and device. The method comprises: acquiring a target video; inputting the target video into a plurality of different video classification models respectively and obtaining the alternative video tags output by the video classification models; correcting the alternative video tags according to co-occurrence relations between alternative video tags output by two different video classification models; and displaying the target video and the corrected alternative video tags in an interface. In the embodiment of the application, the whole process depends little on manual work: the recognition of alternative video tags is realized through machine learning rather than subjective human judgment, which improves the accuracy of the tagging process.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for processing a video tag, an electronic device, and a machine readable medium.
Background
With the rise of short video content, short video consumption has grown explosively, and the demand for short video production tools keeps increasing.
In the related art, a video producer needs to process a large amount of short video material. To manage the short videos, the producer watches the video content manually and marks each short video with corresponding video tags, so that the short videos can later be sorted, recommended, and so on, by means of those tags.
However, manually marking video tags in this way is time-consuming and labor-intensive, and it is affected by subjective factors, so its accuracy is low.
Disclosure of Invention
The embodiment of the application provides a video tag processing method, aiming to solve the problems that manually marking video tags in the related art is time-consuming and labor-intensive and that its accuracy is low because of subjective factors.
Correspondingly, the embodiment of the application also provides a video tag processing device, an electronic device and a storage medium to ensure the implementation and application of the above method.
In order to solve the above problems, an embodiment of the present application discloses a video tag processing method, which includes:
acquiring a target video;
inputting the target video into a plurality of different video classification models respectively, and obtaining alternative video tags output by the video classification models;
correcting the alternative video tags according to co-occurrence relations between the alternative video tags output by two different video classification models;
and displaying the target video and displaying the error-corrected alternative video tags in an interface.
The embodiment of the application discloses a video tag processing device, which comprises:
an acquisition module, used for acquiring a target video;
a generation module, used for inputting the target video into a plurality of different video classification models respectively and obtaining alternative video tags output by the video classification models;
an error correction module, used for correcting the alternative video tags according to the co-occurrence relations between the alternative video tags output by two different video classification models;
and a display module, used for displaying the target video and displaying the error-corrected alternative video tags in an interface.
The embodiment of the application also discloses an electronic device, which comprises a processor and a memory storing executable code that, when executed, causes the processor to perform the method described in one or more of the embodiments of the application.
Embodiments of the application also disclose one or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the method described in one or more of the embodiments of the application.
Compared with the prior art, the embodiment of the application has the following advantages:
According to the embodiment of the application, the multi-dimensional video classification models can accurately extract, under multiple dimensions, the alternative video tags that match the short video. The co-occurrence relations between the alternative video tags output by two different video classification models are then used to analyze the correlation between the tags recognized under different classification dimensions and to correct the alternative video tags, which improves the precision of the output alternative video tags. Finally, the short video and the error-corrected alternative video tags are displayed through an interface, which improves the intuitiveness and convenience of the video tagging process. The whole process depends little on manual work: the recognition of alternative video tags is realized based on machine learning rather than subjective human judgment, so the accuracy of the tagging process is improved.
Drawings
Fig. 1 is a schematic diagram of a video tag processing method according to an embodiment of the present application;
FIG. 2 is an interface diagram of a tag processing method for a sports short video according to an embodiment of the present application;
FIG. 3 is an interface diagram of a tag processing method for a video identifying express packages according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of a video tag processing method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating steps of a method for processing video tags according to an embodiment of the present application;
FIG. 6 is a bipartite graph of an embodiment of the application;
FIG. 7 is an interface diagram of another tag processing method for a sports short video according to an embodiment of the present application;
FIG. 8 is an interface diagram of another tag processing method for a sports short video according to an embodiment of the present application;
Fig. 9 is a block diagram showing a configuration of a video tag processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application become more readily apparent, the application is described in more detail below with reference to the accompanying drawings and the detailed description.
For a better understanding of the present application, the concepts involved in the present application are first explained for those skilled in the art:
A video classification model is a network model built on deep-learning classification techniques. It has a deep artificial neural network structure, can be used to classify multi-modal content such as images, videos, speech and text, and can take a video as input and output the classification alternative video tags corresponding to the video together with the confidence of each alternative video tag.
A video tag can be a classification tag, specifically a text description, used to describe characteristics of the video content. For example, for a video of someone playing football, the corresponding alternative video tags can include football, sports, and the like.
Confidence: in statistics, the confidence interval of a probability sample is an interval estimate of some population parameter of the sample, and the confidence level indicates the degree to which the true value of the parameter falls, with a certain probability, around the measurement result. In the embodiment of the application, the confidence of an alternative video tag reflects the degree to which the alternative video tag matches the video; the greater the confidence, the better the alternative video tag matches the video.
A co-occurrence relation represents the situation in which several feature items appear together. In the embodiment of the application, the feature items can be alternative video tags in text form, and the co-occurrence relations between alternative video tags represent the relevance between them: a co-occurrence relation between two alternative video tags indicates that they are related, while the absence of a co-occurrence relation indicates that they are not.
Referring to fig. 1, an architecture diagram of a video tag processing method provided by an embodiment of the present application includes an alternative video tag generating module, an alternative video tag post-processing module, and a display module.
The architecture of the video tag processing method of the embodiment of the application can be applied to tagging short videos. Specifically, a short video can be input into the video classification models corresponding to a plurality of classification dimensions in the alternative video tag generation module, and the multi-dimensional video classification models output the alternative video tags under each classification dimension. The alternative video tag post-processing module then corrects the alternative video tags by using the co-occurrence relations between the alternative video tags output by two different video classification models, and the display module displays the short video and the corrected alternative video tags in an interface. A video author can conveniently preview the short video in the interface, select among the alternative video tags shown in the interface, and bind the selected alternative video tags with the short video. In other words, the embodiment of the application first uses the multi-dimensional video classification models to accurately extract the alternative video tags matching the short video under multiple dimensions, then uses the co-occurrence relations between the alternative video tags output by two different video classification models to increase the confidence of alternative video tags that are strongly correlated across dimensions and decrease the confidence of alternative video tags that are weakly correlated across dimensions, thereby correcting the alternative video tags, and finally displays the short video and the error-corrected alternative video tags through the interface in combination with the author's operations, which improves the intuitiveness and convenience of the tag binding process. In addition, the selected alternative video tags can be displayed while the short video is played, and a user can use a video tag as a search word to quickly retrieve short videos related to that word.
The architecture of the video tag processing method of the embodiment of the application can also be applied to live streaming. Specifically, a target video can be captured every preset time period during a live broadcast and input into the video classification models corresponding to the classification dimensions in the alternative video tag generation module, which outputs the alternative video tags under each classification dimension. The alternative video tag post-processing module then corrects the alternative video tags, and the display module displays the video clip and the error-corrected alternative video tags in an interface. A streamer can conveniently preview the live video clips in a live management interface and bind the selected alternative video tags with a live video clip, so that the selected alternative video tags can be displayed in the live picture in real time. In addition, the selected alternative video tags can be displayed prominently in the live picture, which improves the display of key information during the live broadcast.
Specifically, the alternative video tag generation module can first determine, according to the actual requirements of the video production application scenario, the classification dimensions covered by the alternative video tag system and the specific alternative video tag sets. For example, the classification dimensions can be divided into action, commodity, content and so on: the alternative video tag set of the action dimension can include alternative video tags such as football, basketball, running and gymnastics; the alternative video tag set of the commodity dimension can include alternative video tags such as football, basketball, breakfast machine and television; and the alternative video tag set of the content dimension can include alternative video tags such as selling goods, singing, dancing, explanation and online lessons.
Further, after the classification dimensions are divided, the embodiment of the application can design the video classification model corresponding to each classification dimension according to the characteristics of that dimension, so that the video classification model can process the features of its own classification dimension and output high-precision alternative video tags. For example, the content dimension has a corresponding content-tag video classification model, the commodity dimension has a corresponding commodity-tag video classification model, and the action dimension has a corresponding action-tag video classification model. A video classification model can convert the picture frames of the input video into discriminative feature vectors, use a retrieval technique based on distance measurement between feature vectors to search the alternative video tag set for alternative video tags similar to these feature vectors, and determine the confidence of each alternative video tag according to the similarity. The video classification model under each dimension is improved accordingly in view of the data characteristics of that dimension, the actual scenario requirements and other factors.
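As a rough illustration of this retrieval-style matching (not the patent's actual implementation; the tag set, vector dimension and confidence mapping below are assumptions), a per-dimension model can compare a frame-level feature vector against reference vectors of the alternative video tag set and turn similarity into a confidence score:

```python
import numpy as np

# Hypothetical per-dimension tag set with one reference feature vector per tag;
# in practice these vectors would come from the trained classification model.
TAG_VECTORS = {
    "football": np.random.rand(128),
    "basketball": np.random.rand(128),
    "breakfast machine": np.random.rand(128),
}

def match_candidate_tags(frame_feature: np.ndarray, top_k: int = 3):
    """Return illustrative (tag, confidence) pairs by cosine similarity to the tag set."""
    scores = []
    for tag, vec in TAG_VECTORS.items():
        sim = float(np.dot(frame_feature, vec) /
                    (np.linalg.norm(frame_feature) * np.linalg.norm(vec) + 1e-8))
        # Map similarity in [-1, 1] to a 0-100 confidence score (an assumed convention).
        scores.append((tag, round((sim + 1) / 2 * 100, 1)))
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
```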
In a scenario of selecting video tags for sports videos, the actions of people in an action video are usually shown over several consecutive frames. Therefore, when the action-tag video classification model in the action dimension extracts the features of the picture frames of an action video, it encodes and combines the features based on the time sequence of the corresponding picture frames (for example, the features of the multiple frames that make up one jumping action can be encoded and combined into a whole in time order). When alternative video tags are then matched against these time-sequence features, a more accurate action-tag matching result can be obtained. In addition, the commodity-tag video classification model can identify the trade names of the sports equipment appearing in the sports video, and the content-tag video classification model can identify whether the video takes the form of a sports video, a travel shoot, an action movie or another content mode.
Referring to fig. 2, which shows an interface diagram of a tag processing method for a sports short video according to an embodiment of the present application, for a short video of playing football, football and football-clothing tags are identified in the commodity dimension, playing-football and running tags in the action dimension, and sports-video and landscape tags in the content dimension. In addition to the short video and the video name displayed in the interface, the alternative video tags under each dimension can be displayed. A user can drag a selected alternative video tag into area 10 to bind it with the short video, can delete or modify unsuitable alternative video tags, and can also newly add an alternative video tag in the corresponding field.
Another scenario concerns video tags for identifying express packages, which is similar to the live scenario: a target video can be captured every preset time period, and based on the video tag processing method of the embodiment of the application, information such as the trade name, the receiving address and the delivery address corresponding to the express barcode of each express package appearing in the target video is identified. One of the core requirements of this scenario is identifying the trade name of the express package. The express barcode containing the trade name is attached to the surface of the package but occupies only a small area, so if feature recognition of the trade name were performed directly on the whole video frame, the recognition accuracy would be low because of interference from the area outside the express barcode. The embodiment of the application therefore designs the commodity-tag video classification model under the commodity dimension to first determine the express barcode area in the video frame through feature recognition, then extract the image features of the express barcode area, and match the alternative video tags based on those image features; because the external interference is reduced, a more accurate matching result can be obtained. In addition, the action-tag video classification model can identify the movement-track tag of the express package, and the content-tag video classification model can identify the video as having the content form of identifying express packages.
Referring to fig. 3, which shows an interface diagram of a tag processing method for a video identifying express packages, a video clip is captured from a live video identifying express packages, and the sorting and moving process involves package 1 and package 2. In the commodity dimension, the identified tags are that package 1 is a television and package 2 is a mobile phone; in the action dimension, package 1 is moving forward and package 2 is moving left; and in the content dimension, a package-identification tag is identified. Besides the video clip displayed in the interface, the alternative video tags under each dimension can be displayed. A user can drag a selected alternative video tag into area 21 corresponding to package 1 or area 22 corresponding to package 2 to bind it with the video clip, and the bound alternative video tags are displayed in the live video. The user can delete or modify unsuitable alternative video tags, and can also newly add an alternative video tag in the corresponding field.
In addition, the video classification model for the content classification dimension can identify tags based on a multi-modal fusion recognition technique, for example by extracting text, sound, image and other features from the target video and fusing them into a multi-modal feature. When tag matching is then performed on the multi-modal feature, richer information from more modalities can be used, which improves the accuracy of tag matching.
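A minimal sketch of such multi-modal fusion is given below, assuming a simple late-fusion scheme of L2-normalised feature concatenation; the patent only states that text, sound and image features are fused, so the concrete operator here is an illustrative assumption:

```python
import numpy as np

def fuse_multimodal_features(image_feat, text_feat, audio_feat):
    """Naive late fusion: L2-normalise each modality and concatenate.
    The fusion operator is an assumption for illustration only."""
    def l2norm(v):
        return v / (np.linalg.norm(v) + 1e-8)
    return np.concatenate([l2norm(image_feat), l2norm(text_feat), l2norm(audio_feat)])
```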
Furthermore, after the system of multi-dimensional video classification models is built, the alternative video tag generation module can take the target video as input and the tag sets output by the video classification models as the module output, and each tag set can then be input into the tag post-processing module for error correction. In the video tagging scenario of the embodiment of the application, although the tag sets corresponding to the classification dimensions are separate, a certain semantic relevance exists between tags of different tag sets. For example, if the breakfast-machine tag appears in the tag set of the commodity dimension, the probability that the nutritional-breakfast tag appears in the tag set of the content dimension is relatively large, while the probability that a tag from the sports tag set appears is relatively small.
Suppose the tag set of the commodity dimension contains the breakfast-machine tag (confidence 80), the tag set of the content dimension contains the nutritional-breakfast tag (confidence 70), and the sports tag set contains the running tag (confidence 50). According to the above definition, the breakfast-machine tag and the nutritional-breakfast tag have a co-occurrence relation, while the running tag has no co-occurrence relation with the other two tags. Therefore, according to the co-occurrence relations between the alternative video tags output by two different video classification models, the confidence of the breakfast-machine tag and the nutritional-breakfast tag can be increased, which achieves error correction of the alternative video tags.
In addition to providing an interface for displaying the target video, the display module can further display the error-corrected alternative video tags and provide corresponding interaction modes, so that the user can select alternative video tags to bind to the target video and can add, delete or modify the alternative video tags.
According to the embodiment of the application, the multi-dimensional video classification models can accurately extract, under multiple dimensions, the alternative video tags that match the short video. The co-occurrence relations between the alternative video tags output by two different video classification models are then used to analyze the correlation between the tags recognized under different classification dimensions and to correct the alternative video tags, which improves the precision of the output alternative video tags. Finally, the short video and the error-corrected alternative video tags are displayed through an interface, which improves the intuitiveness and convenience of the video tagging process. The whole process depends little on manual work: the recognition of alternative video tags is realized based on machine learning rather than subjective human judgment, so the accuracy of the tagging process is improved.
Referring to fig. 4, a flowchart of steps of a video tag processing method according to an embodiment of the present application is shown, including:
Step 101, acquiring a target video.
In the embodiment of the present application, the target video may be a short video, a long video, a video segment cut from a continuous video stream, and the like, which is not limited in the embodiment of the present application.
Step 102, inputting the target video into a plurality of different video classification models respectively, and obtaining alternative video tags output by the video classification models.
In the embodiment of the application, the classification dimensions covered by the alternative video tag system and the specific alternative video tag sets can be determined according to the actual requirements of the video production application scenario. For example, the classification dimensions can be divided into action, commodity, content and so on: the alternative video tag set of the action dimension can include alternative video tags such as football, basketball, running and gymnastics; the alternative video tag set of the commodity dimension can include alternative video tags such as football, basketball, breakfast machine and television; and the alternative video tag set of the content dimension can include alternative video tags such as selling goods, singing, dancing, explanation and online lessons.
After the classification dimensions are divided, the embodiment of the application can design the video classification model corresponding to each classification dimension according to the characteristics of that dimension, so that the video classification model can process the features of its own classification dimension and output high-precision alternative video tags. The video classification model under each dimension is improved accordingly in view of the data characteristics of that dimension, the actual scenario requirements and other factors. Further, after the system of multi-dimensional video classification models is built, the target video can be used as the input of each video classification model, and the set of alternative video tags output by each video classification model can be used as the output.
Step 103, correcting the alternative video tags according to co-occurrence relations between the alternative video tags output by two different video classification models.
In the video tagging scenario of the embodiment of the application, although the tag sets corresponding to the classification dimensions are separate, a certain semantic relevance exists between tags of different tag sets, and this semantic relevance can be expressed through the co-occurrence relations between alternative video tags.
Suppose the alternative video tag generation module outputs the following tag sets: the commodity-dimension tag set contains the breakfast-machine tag, the content-dimension tag set contains the nutritional-breakfast tag, and the action-type tag set contains the running tag. According to the above definition, the breakfast-machine tag and the nutritional-breakfast tag have a co-occurrence relation, while the running tag has no co-occurrence relation with the other two tags. The error correction operation can therefore include an implementation that increases the confidence of each of a plurality of alternative video tags having a co-occurrence relation and decreases the confidence of each of a plurality of alternative video tags having no co-occurrence relation. In another implementation, the alternative video tags having co-occurrence relations are retained, and the alternative video tags having no co-occurrence relation are deleted. Another implementation preferentially displays the alternative video tags that have co-occurrence relations.
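The following sketch illustrates the first correction strategy (boosting co-occurring tags and penalising non-co-occurring ones) on the breakfast-machine / nutritional-breakfast / running example; the co-occurrence table and the boost and penalty factors are illustrative assumptions rather than values from the patent:

```python
# Hypothetical co-occurrence pairs between tags from two different dimensions.
CO_OCCURRENCE = {("breakfast machine", "nutritional breakfast")}

def correct_tags(tags_a, tags_b, boost=1.2, penalty=0.8):
    """tags_a / tags_b: dicts mapping tag -> confidence from two models.
    Boost tags that co-occur across dimensions, penalise those that do not."""
    corrected_a = {}
    for tag, conf in tags_a.items():
        co_occurs = any((tag, other) in CO_OCCURRENCE or (other, tag) in CO_OCCURRENCE
                        for other in tags_b)
        corrected_a[tag] = conf * (boost if co_occurs else penalty)
    return corrected_a

# Example confidences taken from the text: breakfast machine (80), nutritional breakfast (70), running (50).
print(correct_tags({"breakfast machine": 80, "running": 50},
                   {"nutritional breakfast": 70}))
# {'breakfast machine': 96.0, 'running': 40.0}
```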
Step 104, displaying the target video and displaying the error-corrected alternative video tags in an interface.
In this step, besides displaying the target video through the interface, the error-corrected alternative video tags can further be displayed in the interface, and corresponding interaction modes are provided so that the user can select alternative video tags to bind to the target video and can add, delete or modify the alternative video tags.
After the video tags are bound to the target video, the bound video tags can be displayed during subsequent playback of the target video, and a user can use a video tag as a search word to quickly retrieve videos related to that word. Of course, the video platform can also manage the videos by category according to their bound video tags, and can use the bound video tags in the video recommendation stage to realize classified recommendation of videos and so on.
In summary, in the embodiment of the application, the multi-dimensional video classification models can accurately extract, under multiple dimensions, the alternative video tags that match the short video. The co-occurrence relations between the alternative video tags output by two different video classification models are then used to analyze the correlation between the tags recognized under different classification dimensions and to correct the alternative video tags, which improves the precision of the output alternative video tags. Finally, the short video and the error-corrected alternative video tags are displayed through an interface, which improves the intuitiveness and convenience of the video tagging process. The whole process depends little on manual work: the recognition of alternative video tags is realized based on machine learning rather than subjective human judgment, so the accuracy of the tagging process is improved.
Referring to fig. 5, a flowchart of the steps of another embodiment of a video tag processing method of the present application is shown, comprising the following steps:
Step 201, acquiring a target video.
This step may refer to step 101, and will not be described herein.
Step 202, inputting the target video into a plurality of different video classification models respectively, and obtaining alternative video tags output by the video classification models.
This step may refer to step 102, and will not be described herein.
Optionally, in the case that the video classification model is a video classification model in the action dimension, the video classification model includes a time series model, and step 202 includes:
Sub-step 2021, inputting the target video into the time series model, and extracting image features of video frames of the target video.
Sub-step 2022, encoding the image feature of the video frame according to the corresponding time of the video frame on the playing time axis of the target video, to obtain the encoding feature.
Sub-step 2023, obtaining, from said coding features, alternative video tags for characterizing categories of actions in said target video.
In the embodiment of the application, the actions of people in an action video are usually shown over several consecutive frames, so the video classification model in the action dimension is designed as a time series model. The time series model can learn the changes between frames, so its predictions are more stable across adjacent frames. When the time series model extracts the features of the picture frames of an action video, it encodes and combines the features based on the time sequence of the corresponding picture frames (for example, the features of the multiple frames that make up one jumping action can be combined into a whole in time order) to obtain the encoding features. When alternative video tags are then matched against these time-sequence encoding features, a more accurate action-tag matching result can be obtained.
In the action dimension, a 3-dimensional convolutional neural network (3D CNN, 3D Convolutional Neural Networks) can be selected for feature encoding; the convolution operations and multi-layer neural network improve the precision and efficiency of the encoding. The construction of the time series model includes: 1. building the model based on a 3D CNN network and expressing it by a set of learnable parameters with a specific structure; 2. constructing the training set corresponding to the model, which includes a tag set and several sample videos corresponding to each tag; 3. training the model by iteratively updating its parameters through back propagation and gradient descent based on the training set and the classification objective, so that the model can accurately classify the sample videos in the training set, and stopping after a certain number of iterations to obtain the parameters of the final model.
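A minimal PyTorch-style sketch of such a 3D-CNN time series classifier is given below; the layer sizes and structure are illustrative assumptions, not the architecture claimed in the patent:

```python
import torch.nn as nn

class ActionTagClassifier(nn.Module):
    """Minimal 3D-CNN sketch for the action-dimension model; layer sizes are
    illustrative assumptions, not the patent's actual architecture."""
    def __init__(self, num_tags: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # joint spatio-temporal convolution
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                       # pool over time and space
        )
        self.head = nn.Linear(32, num_tags)

    def forward(self, clip):  # clip: (batch, 3, frames, height, width)
        feat = self.backbone(clip).flatten(1)
        return self.head(feat)  # per-tag scores; a softmax would yield confidences
```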
In one implementation, the time series model can be a time series model based on optical flow analysis. The optical flow method uses the change of pixels in an image sequence over time and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames. A time series model based on optical flow analysis handles dynamic action videos well, accurately extracting the action-related features and identifying the category of the action feature.
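For context, a minimal example of extracting frame-to-frame motion information with OpenCV's Farneback dense optical flow is shown below (a common choice for this kind of analysis); reducing the flow field to a mean-magnitude descriptor is an illustrative simplification, not the patent's method:

```python
import cv2
import numpy as np

def motion_features(frames):
    """Dense optical flow (Farneback) between consecutive grey-scale frames;
    the per-frame mean flow magnitude is used here as a toy motion descriptor."""
    feats = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        cur = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)   # per-pixel motion strength
        feats.append(float(magnitude.mean()))      # a real model would learn on the full flow field
        prev = cur
    return feats
```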
Optionally, in the case that the video classification model is a video classification model in the commodity information dimension, the video classification model includes a commodity information detection model comprising a detector and a classifier, and step 202 includes:
Sub-step 2024, inputting the target video into the commodity information detection model, and extracting, by the detector, the commodity information areas in the video frames of the target video.
Sub-step 2025, extracting, by the classifier, the area features of the commodity information areas, and obtaining, according to the area features, alternative video tags for characterizing the commodity information in the target video.
In the embodiment of the application, in a scenario where commodity information needs to be identified, the commodity information occupies only a small area of the whole picture. If feature recognition of the commodity information (such as the trade name) were performed directly on the video frame, the recognition accuracy would be low because of interference from the area outside the commodity information in the frame. The embodiment of the application therefore designs the commodity-dimension video classification model to include a commodity information detection model comprising a detector and a classifier: the detector first performs feature recognition to determine the region of interest (ROI) where the commodity information is located in the video frame, the image features of the region of interest are then extracted, and the classifier matches alternative video tags based on the image features of the region of interest. Because the external interference is reduced, a more accurate matching result can be obtained, i.e. the video features used to characterize the commodity information are identified.
For example, in the express package identification scenario, the express barcode carrying the trade name of the express package is attached to the surface of the package but occupies only a small area. If feature recognition of the trade name were performed directly on the video frame, the recognition accuracy would be low because of interference from the area outside the express barcode. The embodiment of the application can therefore design the commodity-tag video classification model under the commodity dimension to first determine the express barcode area in the video frame through feature recognition, then extract the image features of the express barcode area, and match the alternative video tags based on those image features.
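A schematic two-stage pipeline along these lines might look as follows; the detector and classifier bodies are placeholders (assumptions) that stand in for the trained models described above, and the frame is assumed to be a NumPy image array:

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: int
    y: int
    w: int
    h: int

def detect_commodity_regions(frame):
    """Placeholder detector: in the patent this locates the regions of interest
    (e.g. express barcodes) in the frame; the implementation here is assumed."""
    return [Region(120, 80, 60, 40)]  # hypothetical barcode bounding box

def classify_region(frame, region):
    """Placeholder classifier over the cropped region; returns (tag, confidence)."""
    crop = frame[region.y:region.y + region.h, region.x:region.x + region.w]
    # Feature extraction and tag matching would run on `crop` only, so pixels
    # outside the commodity-information area no longer interfere.
    return ("television", 85.0)

def commodity_tags_for_frame(frame):
    return [classify_region(frame, r) for r in detect_commodity_regions(frame)]
```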
Step 203, updating the confidence of the alternative video tags according to the co-occurrence relations between the alternative video tags output by two different video classification models.
Here, each video classification model also outputs the confidence corresponding to its alternative video tags.
In the embodiment of the application, the confidence of an alternative video tag reflects the degree to which the alternative video tag matches the video: the greater the confidence, the better the match. Because the co-occurrence relations between alternative video tags of different dimensions are directly related to the confidence of those tags, in one implementation the confidence of alternative video tags that have a co-occurrence relation is increased and the confidence of alternative video tags that have no co-occurrence relation is decreased, thereby updating the confidence of the alternative video tags by using the co-occurrence relations.
Optionally, step 203 may specifically include:
Sub-step 2031, obtaining the joint probabilities between a first alternative video tag and all the second alternative video tags according to the confidence of the first alternative video tag, the confidences of the second alternative video tags and the co-occurrence relation values marked for the correspondence between the first alternative video tag and each second alternative video tag, wherein the first alternative video tag and the second alternative video tags are alternative video tags output by two different video classification models, respectively.
In the embodiment of the application, the confidences of the first alternative video tags output by one video classification model can be denoted {f_j}, and the confidences of the second alternative video tags output by the other video classification model can be denoted {g_i}. Whether a first alternative video tag and a second alternative video tag have a co-occurrence relation can be determined by marking in advance: when two tags are semantically related, their correspondence is marked as having a co-occurrence relation; otherwise it is marked as not having one. The co-occurrence relation can also be quantified as a co-occurrence relation value: if a co-occurrence relation exists between two tags, the co-occurrence relation value marked for their correspondence is 1; if not, it is 0.
From the above definition it can be seen that a co-occurrence constraint exists between tags of different dimensions, and this constraint can be embodied as a bipartite graph. Referring to fig. 6, which shows a bipartite graph provided by an embodiment of the present application, the graph contains the confidences {f_j} of the first alternative video tags output by one video classification model and the confidences {g_i} of the second alternative video tags output by another video classification model. In the existing data, when a data item simultaneously carries the first alternative video tag corresponding to f_j and the second alternative video tag corresponding to g_i, the connecting edge w_{i,j} = 1 (the dotted connecting line in the figure, i.e. the co-occurrence relation value) indicates that the two have a co-occurrence relation, while w_{i,j} = 0 indicates that they have none.
Further, the joint probability between a first alternative video tag and a second alternative video tag can first be obtained according to the confidence f_j of the first alternative video tag, the confidence g_i of the second alternative video tag and the co-occurrence relation value w_{i,j} marked for their correspondence; the joint probability characterizes the probability of the two tags occurring together, and Z denotes a normalization coefficient.
Sub-step 2032, taking the average of the joint probabilities between the first alternative video tag and all the second alternative video tags as the updated confidence of the first alternative video tag.
In this step, the updated confidence of the first alternative video tag is expressed as the average of the joint probabilities between the first alternative video tag and all the second alternative video tags.
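The text does not reproduce the formulas themselves, but one reading consistent with sub-steps 2031 and 2032 (stated here as an assumption, not as the patent's literal formulas) is a normalised product followed by averaging:

```latex
% Assumed reconstruction of sub-steps 2031-2032, not the patent's literal formulas.
p_{i,j} = \frac{1}{Z}\, w_{i,j}\, g_i\, f_j,
\qquad
Z = \sum_{i}\sum_{j} w_{i,j}\, g_i\, f_j,
\qquad
f_j' = \frac{1}{N}\sum_{i=1}^{N} p_{i,j},
```

where N is the number of second alternative video tags; under this reading, a first alternative video tag that co-occurs with many confident tags of the other dimension ends up with a higher updated confidence f_j'.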
Optionally, the method may further include:
Step A1, acquiring a bipartite graph preset for the first alternative video tags and the second alternative video tags.
In this step, in the embodiment of the present application there is a co-occurrence constraint between tags of different dimensions, and this constraint can be embodied as the bipartite graph shown in fig. 6. The bipartite graph contains the confidences {f_j} of the first alternative video tags output by one video classification model and the confidences {g_i} of the second alternative video tags output by another video classification model, and a connecting line indicates that a co-occurrence relation exists between two alternative video tags.
Step A2, determining the co-occurrence relation value corresponding to a first alternative video tag and a second alternative video tag as a first value when a co-occurrence connecting line is marked between them in the bipartite graph.
Step A3, determining the co-occurrence relation value corresponding to a first alternative video tag and a second alternative video tag as a second value when no co-occurrence connecting line is marked between them in the bipartite graph, the first value and the second value being different values.
In the embodiment of the present application, referring to fig. 6, with the preset bipartite graph, when a data item simultaneously carries the first alternative video tag corresponding to f_j and the second alternative video tag corresponding to g_i, the connecting edge w_{i,j} = 1 (the dotted connecting line in the figure, i.e. the co-occurrence relation value) indicates that the two have a co-occurrence relation, while w_{i,j} = 0 indicates that they have none.
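A small sketch of looking up w_{i,j} from such a marked bipartite graph is given below; the marked edge set and the default values 1 and 0 are illustrative assumptions consistent with steps A1 to A3:

```python
# Hypothetical marked bipartite graph: each listed pair of tags from the two
# dimensions has a connecting line, i.e. a co-occurrence relation.
MARKED_EDGES = {("nutritional breakfast", "breakfast machine")}

def co_occurrence_value(second_tag, first_tag, first_value=1.0, second_value=0.0):
    """w_{i,j}: the first value if a connecting line is marked between the two
    tags in the bipartite graph, the second value otherwise."""
    return first_value if (second_tag, first_tag) in MARKED_EDGES else second_value

print(co_occurrence_value("nutritional breakfast", "breakfast machine"))  # 1.0
print(co_occurrence_value("running", "breakfast machine"))                # 0.0
```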
Step 204, correcting the alternative video tags according to the confidence.
Optionally, in one implementation of the embodiment of the present application, error correction of the alternative video tags using the confidence can be accomplished by removing alternative video tags whose confidence is less than a confidence threshold. For example, among the alternative video tags with updated confidence, those whose confidence is less than 50 can be removed, so that the alternative video tags with poor semantic relevance to the video content are removed. Alternatively, marking information can be provided for the alternative video tags whose confidence is less than the confidence threshold, and this marking information is used to mark such tags when they are displayed in the interface; for example, a bold mark, a highlight mark or a warning mark can be attached so that these tags are highlighted when displayed and the user is reminded that they have low confidence. The user can then modify these alternative video tags, confirm their deletion or confirm keeping them.
Alternatively, the alternative video tags whose confidence is less than the confidence threshold can be displayed in a preset area of the interface. In practical applications, an additional area can be provided outside the area displaying the alternative video tags whose confidence is greater than or equal to the confidence threshold, and the alternative video tags whose confidence is less than the threshold are displayed there. The user can then modify these alternative video tags, confirm their deletion or confirm keeping them.
In addition, referring to fig. 2, assuming that after error correction the confidence of the alternative video tag "character" and the alternative video tag "story" is less than 50, the embodiment of the application can also display these alternative video tags, whose confidence is less than the confidence threshold, in the preset area 30 of the interface and indicate that they are low-confidence tags, so that the user can modify and edit them according to the prompt, for example by deleting them or modifying their content.
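A small sketch of this threshold-based split follows; the threshold value 50 matches the example above, while the function, data structure and numeric confidences are illustrative assumptions:

```python
def split_by_confidence(tags, threshold=50):
    """Partition corrected tags into those to show normally and those to flag
    as low-confidence (shown in a separate area or with a warning mark)."""
    confident = {t: c for t, c in tags.items() if c >= threshold}
    low_confidence = {t: c for t, c in tags.items() if c < threshold}
    return confident, low_confidence

# Values continue the earlier illustrative example; not taken from the patent.
confident, flagged = split_by_confidence(
    {"breakfast machine": 96.0, "nutritional breakfast": 84.0, "running": 40.0})
# flagged == {'running': 40.0} -> displayed in the preset low-confidence area
```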
Step 205, displaying the target video and displaying the error-corrected alternative video tags in an interface.
This step may refer to step 104, and is not described again here.
Optionally, step 205 may specifically include:
Sub-step 2051, displaying the alternative video tags under each classification dimension in the input box corresponding to that classification dimension in the interface.
Referring to fig. 2, in a scenario of processing sports short video tags, assume that for a short video of playing football, football and football-clothing tags are identified in the commodity dimension, playing-football and running tags in the action dimension, and sports-video and landscape tags in the content dimension. The short video and the video name are then displayed in the interface, and the alternative video tags under each classification dimension can be displayed in the interface, i.e. the alternative video tags are displayed in the input boxes corresponding to their respective classification dimensions. The alternative video tags in the interface are thus distinguished by dimension, which improves the intuitiveness of the classification of the alternative video tags in the interface.
In addition, the alternative video tags can be sorted in the interface in descending order of confidence, so that the alternative video tags with high confidence are displayed first.
Optionally, after step 205, the method may further include:
Step 206, binding at least one selected alternative video tag with the target video in response to a selection operation on the alternative video tags.
In this step, referring to fig. 2, after the alternative video tags are displayed in the interface, a selection operation can be performed on an alternative video tag in the interface. The selection operation can be dragging a draggable alternative video tag into area 10 to bind it with the short video, or directly clicking the alternative video tag to complete the selection. After an alternative video tag is bound with the target video, the content of the alternative video tag can be used as a description attribute of the target video so that it can be seen by users who later watch the target video.
Step 207, performing one or more of deleting, adding and editing operations on the alternative video tags in response to a triggering operation on the alternative video tags.
In the embodiment of the application, the user can perform one or more of deleting, adding and editing operations on the alternative video tags according to actual requirements by performing the corresponding triggering operations on the alternative video tags in the interface. For example, referring to fig. 2, the corresponding alternative video tag can be deleted by clicking a delete control in the alternative video tag; an edit menu can be displayed by right-clicking the alternative video tag, and an edit item can be selected in the edit menu to edit the content of the alternative video tag, or an add item can be selected in the edit menu to add a new alternative video tag.
Optionally, the video classification model also outputs the playing progress of the alternative video tag in the target video, and step 205 may specifically further include:
Sub-step 2052, playing the target video and displaying, in the interface, the alternative video tag corresponding to the playing progress according to the correspondence between the alternative video tags and the playing progress.
In the embodiment of the application, when outputting an alternative video tag, the video classification model also outputs the playing progress corresponding to that tag in the target video. For example, if the video classification model detects at the moment 00:20 of the target video that the current picture contains a car, it outputs the alternative video tag "car" and the corresponding time 00:20.
Specifically, in the interface, the alternative video tag corresponding to the playing progress can be displayed while the target video is played. For example, when the target video is played to the target playing progress, the alternative video tag corresponding to the target playing progress is highlighted in the tag area, or the alternative video tag corresponding to the target playing progress is displayed in the target video. This reveals the close association between the video tag and the target video in an intuitive way.
Further, the playing progress corresponding to an alternative video tag can be a moment, for example the moment corresponding to the x-th frame in which the alternative video tag is identified in the target video, or a time period, for example the time period corresponding to the video segment formed by the x-th to y-th frames in which the alternative video tag is identified in the target video. When the user drags the playing bar of the target video, the displayed alternative video tags change accordingly; for example, when dragging to segment a, only the alternative video tags corresponding to segment a are displayed, and when dragging to another segment b, only the alternative video tags corresponding to segment b are displayed, so that the user can determine in an intuitive way whether the binding of the tags is appropriate.
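One way to back such behaviour (an illustrative assumption, not the patent's data model) is to keep, for each alternative video tag, the moment or segment at which it was recognised and to filter by the current playback position:

```python
# Hypothetical tag timeline for the football short video described in the text:
# each tag is bound either to a single moment or to a (start, end) segment, in seconds.
TAG_TIMELINE = [
    {"tag": "football clothing", "start": 20,  "end": 20},   # 00:20
    {"tag": "playing football",  "start": 20,  "end": 20},   # 00:20
    {"tag": "sports video",      "start": 100, "end": 100},  # 01:40
    {"tag": "running",           "start": 140, "end": 140},  # 02:20
]

def tags_at_progress(timeline, position_seconds):
    """Return the tags whose moment or segment covers the current position,
    e.g. while the user drags the progress bar."""
    return [e["tag"] for e in timeline
            if e["start"] <= position_seconds <= e["end"]]

print(tags_at_progress(TAG_TIMELINE, 20))   # ['football clothing', 'playing football']
```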
Optionally, the substep 2052 may specifically further include:
Step B1, when the target video in the interface is played to the target playing progress, displaying the alternative video tag corresponding to the target playing progress according to the correspondence.
In one implementation of the embodiment of the application, when the target video in the interface is played to the target playing progress, the alternative video tags corresponding to the target playing progress can be displayed according to the correspondence, so that the user can accurately locate the playing progress corresponding to each alternative video tag while the target video is playing. This makes it convenient to analyze whether an alternative video tag is accurate at that playing progress and improves the tagging efficiency. In this way, the alternative video tags are displayed according to the playing progress, and alternative video tags that do not correspond to the current playing progress are not displayed at that moment.
Referring to fig. 7, an interface diagram of another tag processing method for a sports short video according to an embodiment of the present application: for a short video of playing football, a football-clothing tag and a playing-football tag are identified at the moment 00:20, a running tag in the action dimension at the moment 02:20, and sports-video and landscape tags in the content dimension at the moment 01:40. When the video classification models output these alternative video tags, the corresponding playing progress is also output, and the alternative video tags are displayed at the corresponding playing progress on the playing progress bar 40.
Further, when the short video in the interface is currently played to 00:20, the football-clothing tag and the playing-football tag corresponding to the playing progress 00:20 can be displayed in the short video according to the correspondence between the alternative video tags and the playing progress; in addition, in the classified display area of the tags, the tags corresponding to the playing progress 00:20 can be highlighted, which makes it convenient for the user to analyze and correct the tags. If the user then drags the progress bar to 02:20, the running tag of the action dimension corresponding to the playing progress 02:20 can be displayed in the short video, so that the displayed tags change as the progress bar is dragged.
Optionally, the substep 2052 may specifically further include:
And B2, playing the target video and the alternative video labels in the interface, and highlighting the alternative video labels corresponding to the target playing progress according to the corresponding relation when the target video is played to the target playing progress.
In another implementation manner of the embodiment of the application, all the candidate video tags can be displayed while the target video is played in the interface; when the target video is played to the target playing progress, the candidate video tags corresponding to the target playing progress are highlighted according to the corresponding relation, which makes it convenient to analyze whether the candidate video tags are accurate at that playing progress and improves the labeling efficiency.
Referring to fig. 8, which is an interface diagram of another tag processing method for a sports short video provided by an embodiment of the present application, for a short video of playing football, a football clothing tag and a playing football tag in the action dimension are identified at 00:20, a running tag in the action dimension is identified at 02:20, and a sports video tag and a landscape tag in the content dimension are identified at 01:40. When the video classification model outputs these alternative video tags, it also outputs the playing progress corresponding to each alternative video tag, and the alternative video tags are displayed at the corresponding playing progress on the playing progress bar 40.
Further, all the candidate video tags can be displayed while the video is played in the interface. When the video is played to 00:20, the football clothing label and the playing football label in the action dimension corresponding to the playing progress 00:20 can be highlighted in the short video and in the tag classification display area according to the corresponding relation between the candidate video tags and the playing progress, which facilitates tag analysis and error correction by the user. Assuming the user subsequently drags the progress bar to 02:20, the sports video label and the landscape label of the content dimension corresponding to the playing progress 02:20 can be highlighted in the short video and in the tag classification display area, so that the displayed labels change as the progress bar is dragged.
Highlighting manners include, but are not limited to, bolding the label frame, highlighting the label, and adding a highlighted background to the label.
In summary, in the embodiment of the application, multi-dimensional video classification models can be used to accurately extract the candidate video tags matched with the short video in multiple dimensions. The co-occurrence relation between the candidate video tags output by two different video classification models is then used to analyze the correlation between tags identified in different classification dimensions and to correct the candidate video tags, which improves the precision of the output candidate video tags. Finally, the short video and the corrected candidate video tags are displayed through the interface, which makes the video labeling process more intuitive and convenient. The whole process depends little on manual work: the identification of the candidate video tags is realized based on machine learning rather than on subjective human judgment, which improves the precision of the labeling process.
Referring to fig. 9, a block diagram of a video tag processing apparatus according to an embodiment of the present application is provided, including:
An acquisition module 301, configured to acquire a target video;
the generating module 302 is configured to input a target video into a plurality of different video classification models respectively, and obtain an alternative video tag output by the video classification model;
an error correction module 303, configured to correct the candidate video tags according to co-occurrence relationships between the candidate video tags output by two different video classification models;
and the display module 304 is used for displaying the target video and displaying the error-corrected alternative video labels in an interface.
Optionally, the video classification model further outputs a confidence level corresponding to the candidate video tag, and the error correction module 303 includes:
an updating sub-module, configured to update the confidence of the candidate video tags according to co-occurrence relationships between candidate video tags output by two different video classification models;
And the error correction sub-module is used for correcting the error of the alternative video label according to the confidence level.
Optionally, the updating sub-module includes:
The first computing unit is used for obtaining joint probabilities between the first alternative video tag and all the second alternative video tags according to the confidence level of the first alternative video tag, the confidence level of the second alternative video tag and the co-occurrence relation value marked for the corresponding relation between the first alternative video tag and the second alternative video tag;
And the second calculation unit is used for taking the average value of joint probabilities between the first alternative video tag and all the second alternative video tags as the updated confidence of the first alternative video tag.
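As a rough illustration of the two calculation units above, the sketch below computes a joint probability for each (first tag, second tag) pair and averages them. The embodiment does not fix the exact combination rule, so the product-style formula, the function names, and the numeric values here are assumptions.

```python
def joint_probability(first_conf: float, second_conf: float, co_occurrence: float) -> float:
    # Assumed combination rule: the embodiment only states that the joint probability
    # depends on both confidences and on the labeled co-occurrence relation value.
    return first_conf * second_conf * co_occurrence

def updated_confidence(first_conf: float, second_tags: list) -> float:
    """second_tags: list of (confidence of a second tag, co-occurrence value with the first tag)."""
    joints = [joint_probability(first_conf, conf, co) for conf, co in second_tags]
    return sum(joints) / len(joints)   # average over all second candidate tags

# A first-dimension tag that co-occurs with the second-dimension tags keeps a high score,
# while the same tag with no labeled co-occurrence is pulled down.
print(updated_confidence(0.9, [(0.8, 1.0), (0.7, 1.0)]))   # 0.675
print(updated_confidence(0.9, [(0.8, 0.1), (0.7, 0.1)]))   # 0.0675
```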
Optionally, the apparatus further includes:
the bipartite graph module is used for acquiring bipartite graphs preset for the first alternative video tag and the second alternative video tag;
the first computing module is used for determining the co-occurrence relation value corresponding to the first candidate video tag and the second candidate video tag as a first numerical value under the condition that a co-occurrence relation connecting line is marked between the first candidate video tag and the second candidate video tag in the bipartite graph;
The second calculation module is used for determining the co-occurrence relation value corresponding to the first candidate video label and the second candidate video label as a second numerical value under the condition that a co-occurrence relation connecting line is not marked between the first candidate video label and the second candidate video label in the bipartite graph, and the first numerical value and the second numerical value are different numerical values.
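A minimal sketch of how the bipartite graph modules above might be realized, assuming the graph is stored as a set of labeled co-occurrence edges between first and second candidate tags; the example edge set and the two numeric values are illustrative assumptions.

```python
# Hypothetical bipartite graph: each edge connects a first candidate tag (e.g. action dimension)
# to a second candidate tag (e.g. content dimension) that is labeled as co-occurring with it.
CO_OCCURRENCE_EDGES = {
    ("playing football", "sports video"),
    ("running", "sports video"),
}

FIRST_VALUE = 1.0    # assumed value when a co-occurrence line is labeled in the bipartite graph
SECOND_VALUE = 0.1   # assumed, different value when no co-occurrence line is labeled

def co_occurrence_value(first_tag: str, second_tag: str) -> float:
    return FIRST_VALUE if (first_tag, second_tag) in CO_OCCURRENCE_EDGES else SECOND_VALUE

print(co_occurrence_value("playing football", "sports video"))  # 1.0
print(co_occurrence_value("playing football", "landscape"))     # 0.1
```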
Optionally, the error correction sub-module includes:
And the screening unit is used for removing the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value or displaying the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value in a preset area of the interface.
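The screening unit can be read as a simple split on the updated confidence. In the sketch below, the threshold value and the function name are assumptions, and whether low-confidence tags are removed or merely moved to a preset interface area is a display choice.

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed value; the embodiment does not specify a number

def split_by_threshold(tags):
    """tags: list of (tag name, updated confidence). Tags below the threshold are either
    removed or shown in a preset area of the interface instead of the main tag area."""
    kept = [name for name, conf in tags if conf >= CONFIDENCE_THRESHOLD]
    low = [name for name, conf in tags if conf < CONFIDENCE_THRESHOLD]
    return kept, low

print(split_by_threshold([("playing football", 0.675), ("landscape", 0.0675)]))
# (['playing football'], ['landscape'])
```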
Optionally, different video classification models correspond to different classification dimensions, and the display module 304 includes:
And the display sub-module is used for displaying the alternative video labels under the classification dimension in the input box corresponding to the classification dimension in the interface.
Optionally, the apparatus further includes:
and the first response module is used for responding to the selection operation of the alternative video labels and binding at least one selected alternative video label with the target video.
Optionally, the apparatus further includes:
and the second response module is used for responding to the triggering operation of the alternative video label and executing one or more of deleting, adding and editing operations of the alternative video label.
Optionally, in the case that the video classification model is a video classification model in an action dimension, the video classification model includes a time series model;
The generating module 302 includes:
the first input sub-module is used for inputting the target video into the time sequence model and extracting image characteristics of video frames of the target video;
the encoding submodule is used for encoding the image characteristics of the video frame according to the corresponding moment of the video frame on the playing time axis of the target video to obtain encoding characteristics;
And the first matching sub-module is used for acquiring alternative video tags used for representing the categories of actions in the target video according to the coding characteristics.
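For the action-dimension model, one plausible reading of the time series model is a per-frame image feature extractor followed by an encoder over the playing time axis and a tag classifier. The PyTorch sketch below is only an illustrative stand-in: the layer sizes, the choice of a GRU as the temporal encoder, and all names are assumptions rather than the patented model.

```python
import torch
import torch.nn as nn

class ActionTagModel(nn.Module):
    """Minimal sketch: per-frame image features are extracted, encoded along the playing
    time axis, and mapped to action tag scores."""
    def __init__(self, num_action_tags: int, feat_dim: int = 256):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # image features of each video frame
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.temporal_encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_action_tags)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width), ordered by playback time
        b, t = frames.shape[:2]
        per_frame = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        encoded, _ = self.temporal_encoder(per_frame)   # encode along the playing time axis
        return self.classifier(encoded[:, -1])          # one score per action tag

scores = ActionTagModel(num_action_tags=5)(torch.randn(1, 16, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 5])
```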
Optionally, in the case that the video classification model is a video classification model in the commodity information dimension, the video classification model includes a commodity information detection model, wherein the commodity information detection model includes a detector and a classifier;
The generating module 302 includes:
The second input sub-module is used for respectively inputting the target videos into the commodity information detection model, and extracting commodity information areas in video frames of the target videos through the detector;
and the second matching sub-module is used for extracting the regional characteristics of the commodity information region through the classifier and acquiring an alternative video tag used for representing commodity information in the target video according to the regional characteristics.
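Similarly, the commodity-information model can be pictured as a detector that proposes commodity information regions in a video frame, followed by a classifier over each region's features. The sketch below uses a fixed placeholder region instead of a real detector; the network shapes and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from typing import List, Tuple

class CommodityTagPipeline(nn.Module):
    """Sketch of the two-stage commodity-information model: propose regions, then classify
    each region's features into commodity tag scores."""
    def __init__(self, num_commodity_tags: int):
        super().__init__()
        self.region_classifier = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_commodity_tags),
        )

    def detect_regions(self, frame: torch.Tensor) -> List[Tuple[int, int, int, int]]:
        # A real detector would return bounding boxes of commodity information areas;
        # a fixed central crop is returned here purely as a placeholder.
        h, w = frame.shape[-2:]
        return [(h // 4, w // 4, 3 * h // 4, 3 * w // 4)]

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        scores = []
        for top, left, bottom, right in self.detect_regions(frame):
            region = frame[..., top:bottom, left:right]        # commodity information area
            scores.append(self.region_classifier(region))      # region features -> tag scores
        return torch.stack(scores, dim=1)                      # (batch, regions, tags)

out = CommodityTagPipeline(num_commodity_tags=4)(torch.randn(1, 3, 128, 128))
print(out.shape)  # torch.Size([1, 1, 4])
```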
Optionally, the video classification model further outputs a playing progress of the candidate video tag corresponding to the target video;
The display module 304 includes:
And the display sub-module is used for playing the target video and displaying, on the interface, the alternative video label corresponding to the playing progress according to the corresponding relation between the alternative video label and the playing progress.
Optionally, the display sub-module includes:
The first display unit is used for displaying the alternative video labels corresponding to the target playing progress according to the corresponding relation when the target video in the interface is played to the target playing progress.
Optionally, the display sub-module includes:
and the second display unit is used for playing the target video and the alternative video labels in the interface, and for highlighting, according to the corresponding relation, the alternative video labels corresponding to the target playing progress when the target video is played to the target playing progress.
In summary, in the embodiment of the application, multi-dimensional video classification models can be used to accurately extract the candidate video tags matched with the short video in multiple dimensions. The co-occurrence relation between the candidate video tags output by two different video classification models is then used to analyze the correlation between tags identified in different classification dimensions and to correct the candidate video tags, which improves the precision of the output candidate video tags. Finally, the short video and the corrected candidate video tags are displayed through the interface, which makes the video labeling process more intuitive and convenient. The whole process depends little on manual work: the identification of the candidate video tags is realized based on machine learning rather than on subjective human judgment, which improves the precision of the labeling process.
The embodiment of the application also provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device may be caused to execute the instructions of the method steps in the embodiments of the application.
Embodiments of the application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform a method as described in one or more of the above embodiments. In the embodiment of the application, the electronic equipment comprises various types of equipment such as terminal equipment, a server (cluster) and the like.
Embodiments of the present disclosure may be implemented, using any suitable hardware, firmware, software, or any combination thereof, as an apparatus with the desired configuration, which may include electronic devices such as terminal devices and servers (clusters). Fig. 10 schematically illustrates an exemplary apparatus 1000 that may be used to implement various embodiments described in the embodiments of the present application.
For one embodiment, fig. 10 illustrates an example apparatus 1000 having one or more processors 1002, a control module (chipset) 1004 coupled to at least one of the processor(s) 1002, a memory 1006 coupled to the control module 1004, a non-volatile memory (NVM)/storage 1008 coupled to the control module 1004, one or more input/output devices 1010 coupled to the control module 1004, and a network interface 1012 coupled to the control module 1004.
The processor 1002 may include one or more single-core or multi-core processors, and the processor 1002 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1000 may be used as a terminal device, a server (a cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 1000 can include one or more computer-readable media (e.g., memory 1006 or NVM/storage 1008) having instructions 1014 and one or more processors 1002 in combination with the one or more computer-readable media configured to execute the instructions 1014 to implement the modules to perform the actions described in this disclosure.
For one embodiment, the control module 1004 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 1002 and/or any suitable device or component in communication with the control module 1004.
The control module 1004 may include a memory controller module to provide an interface to the memory 1006. The memory controller modules may be hardware modules, software modules, and/or firmware modules.
Memory 1006 may be used to load and store data and/or instructions 1014 for device 1000, for example. For one embodiment, the memory 1006 may include any suitable volatile memory, such as a suitable DRAM. In some embodiments, the memory 1006 may comprise a double data rate type four synchronous dynamic random access memory (DDR 4 SDRAM).
For one embodiment, the control module 1004 may include one or more input/output controllers to provide an interface to the NVM/storage 1008 and the input/output device(s) 1010.
For example, NVM/storage 1008 may be used to store data and/or instructions 1014. NVM/storage 1008 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 1008 may include storage resources that are physically part of the device on which apparatus 1000 is installed, or may be accessible by the device without necessarily being part of the device. For example, NVM/storage 1008 may be accessed over a network via input/output device(s) 1010.
Input/output device(s) 1010 may provide an interface for apparatus 1000 to communicate with any other suitable device; the input/output device 1010 may include communication components, audio components, sensor components, and the like. Network interface 1012 may provide an interface for device 1000 to communicate over one or more networks, and device 1000 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic of one or more controllers (e.g., memory controller modules) of the control module 1004. For one embodiment, at least one of the processor(s) 1002 may be packaged together with logic of one or more controllers of the control module 1004 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1002 may be integrated on the same mold as logic of one or more controllers of the control module 1004. For one embodiment, at least one of the processor(s) 1002 may be integrated on the same die with logic of one or more controllers of the control module 1004 to form a system on chip (SoC).
In various embodiments, apparatus 1000 may be, but is not limited to being, a terminal device such as a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, device 1000 may have more or fewer components and/or different architectures. For example, in some embodiments, the apparatus 1000 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and a speaker.
The detection device may adopt a main control chip as the processor or control module; sensor data, position information, and the like may be stored in the memory or the NVM/storage device; a sensor group may serve as the input/output device; and the communication interface may comprise the network interface.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device comprising the element.
The foregoing describes the principles and embodiments of the present application in detail using specific examples; the description of the above embodiments is intended only to facilitate understanding of the method and core idea of the present application. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope; in summary, the content of this specification should not be construed as limiting the present application.
Claims (13)
1. A method of video tag processing, the method comprising:
Acquiring a target video;
respectively inputting a target video into a plurality of different video classification models, and obtaining alternative video labels output by the video classification models;
Correcting errors of the alternative video labels according to co-occurrence relations among the alternative video labels output by the two different video classification models;
displaying the target video and displaying the error-corrected alternative video label in an interface;
The video classification model also outputs a confidence level corresponding to the alternative video tag;
the error correction of the alternative video labels according to the co-occurrence relation between the alternative video labels output by the two different video classification models comprises the following steps:
increasing the confidence coefficient of a plurality of alternative video labels having the co-occurrence relationship, and reducing the confidence coefficient of a plurality of alternative video labels not having the co-occurrence relationship;
and correcting the error of the alternative video label according to the updated confidence.
2. The method of claim 1, wherein updating the confidence level of the candidate video tags based on co-occurrence relationships between the candidate video tags output by the different two video classification models comprises:
Acquiring joint probabilities between a first alternative video tag and all second alternative video tags according to the confidence level of the first alternative video tag, the confidence level of the second alternative video tag and the co-occurrence relation value marked according to the corresponding relation between the first alternative video tag and the second alternative video tag;
and taking the average value of joint probabilities between the first alternative video tag and all the second alternative video tags as the confidence after the first alternative video tag is updated.
3. The method of claim 1, wherein said error correcting the alternative video tag based on the confidence level comprises:
removing the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value;
Or providing marking information for the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value, wherein the marking information is used for marking the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value when the alternative video labels with the confidence coefficient smaller than the confidence coefficient threshold value are displayed in an interface.
4. The method of claim 1, wherein different video classification models correspond to different classification dimensions, the presenting error corrected candidate video tags, comprising:
and displaying the alternative video labels under the classification dimension in an input box corresponding to the classification dimension in the interface.
5. The method of claim 1, wherein the video classification model further outputs a corresponding progress of playback of the candidate video tag in the target video;
the displaying the target video and displaying the error-corrected alternative video label in the interface comprises the following steps:
And playing the target video and displaying, on the interface, the alternative video label corresponding to the playing progress according to the corresponding relation between the alternative video label and the playing progress.
6. The method according to claim 5, wherein the playing the target video and displaying, on the interface, the alternative video label corresponding to the playing progress according to the corresponding relation between the alternative video label and the playing progress comprises:
And under the condition that the target video in the interface is played to the target playing progress, displaying the alternative video label corresponding to the target playing progress according to the corresponding relation.
7. The method according to claim 5, wherein the playing the target video and displaying, on the interface, the alternative video label corresponding to the playing progress according to the corresponding relation between the alternative video label and the playing progress comprises:
And playing the target video and the alternative video labels in the interface, and highlighting the alternative video labels corresponding to the target playing progress according to the corresponding relation when the target video is played to the target playing progress.
8. The method of claim 1, wherein after the presenting the error corrected alternative video label, the method further comprises:
And in response to the selection operation of the alternative video tags, binding the selected at least one alternative video tag with the target video.
9. The method of claim 1, wherein in the case where the video classification model is a video classification model in an action dimension, the video classification model comprises a time series model;
The step of inputting the target video into a plurality of different video classification models respectively to obtain the alternative video labels output by the video classification models comprises the following steps:
Inputting the target video into the time sequence model, and extracting image features of video frames of the target video;
Coding the image characteristics of the video frame according to the corresponding moment of the video frame on the playing time axis of the target video to obtain coding characteristics;
and according to the coding characteristics, obtaining alternative video labels used for representing the categories of actions in the target video.
10. The method of claim 1, wherein in the case where the video classification model is a video classification model in a commodity information dimension, the video classification model comprises a commodity information detection model, the commodity information detection model comprises a detector and a classifier;
The step of inputting the target video into a plurality of different video classification models respectively to obtain the alternative video labels output by the video classification models comprises the following steps:
respectively inputting the target videos into the commodity information detection model, and extracting commodity information areas in video frames of the target videos through the detector;
And extracting the regional characteristics of the commodity information region by the classifier, and acquiring an alternative video tag for representing commodity information in the target video according to the regional characteristics.
11. A video tag processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target video;
the generation module is used for respectively inputting the target video into a plurality of different video classification models and obtaining alternative video tags output by the video classification models;
The error correction module is used for correcting errors of the alternative video labels according to the co-occurrence relation among the alternative video labels output by the two different video classification models;
the display module is used for displaying the target video and displaying the error-corrected alternative video labels in an interface;
the error correction module is specifically configured to improve the confidence coefficient of a plurality of candidate video tags having a co-occurrence relationship, reduce the confidence coefficient of a plurality of candidate video tags not having a co-occurrence relationship, and correct the error of the candidate video tags according to the updated confidence coefficient.
12. An electronic device is characterized by comprising a processor; and
A memory having executable code stored thereon that, when executed, causes the processor to perform the method of any of claims 1-10.
13. One or more machine readable media having executable code stored thereon that, when executed, causes a processor to perform the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111153268.5A CN113934888B (en) | 2021-09-29 | 2021-09-29 | Video tag processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113934888A CN113934888A (en) | 2022-01-14 |
CN113934888B true CN113934888B (en) | 2025-04-29 |
Family
ID=79277414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111153268.5A Active CN113934888B (en) | 2021-09-29 | 2021-09-29 | Video tag processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113934888B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114245206B (en) * | 2022-02-23 | 2022-07-15 | 阿里巴巴达摩院(杭州)科技有限公司 | Video processing method and device |
CN115545847B (en) * | 2022-10-21 | 2023-05-26 | 深圳市凯盛浩科技有限公司 | Commodity identification searching method and system based on video stream |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291688A (en) * | 2020-02-12 | 2020-06-16 | 咪咕文化科技有限公司 | Method and device for acquiring video tag |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11276176B2 (en) * | 2019-09-04 | 2022-03-15 | International Business Machines Corporation | Intelligent boundary delineation of regions of interest of an organism from multispectral video streams using perfusion models |
CN113033682B (en) * | 2021-03-31 | 2024-04-30 | 北京有竹居网络技术有限公司 | Video classification method, device, readable medium, and electronic device |
2021-09-29 CN CN202111153268.5A patent/CN113934888B/en active Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |