CN113919446B - Model training and similarity determining method and device for multimedia resources - Google Patents
- Publication number
- CN113919446B (application CN202111339230.7A)
- Authority
- CN
- China
- Prior art keywords
- multimedia resource
- multimedia
- feature
- positive sample
- sample pair
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Biology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of artificial intelligence and provides a model training and similarity determining method and device for multimedia resources, aiming to solve the problem that training a convolutional neural network model in the related art is complex. In the application, in order to better construct positive sample pairs from the similar multimedia resources corresponding to similar query requests, a time window is adopted to screen the multimedia resources corresponding to those requests. The time window can be used because two access operations with close access times under the same or similar query requests tend to be directed at very similar multimedia resources. Positive sample pairs screened based on the time window therefore have high practicability and accuracy.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for model training and similarity determination of multimedia resources.
Background
Video resources in the network are abundant, and the number of videos grows enormous with the accumulation of time and the increase of users. Video-based applications typically rely on a feature representation of the video; one such application is video recommendation, which computes the similarity between videos.
Computing the similarity between videos generally requires converting each video into a feature expression, such as an embedding feature. The distance between the feature expressions of two videos is then calculated as their similarity.
In the related art, a convolutional neural network (CNN) is mostly adopted to extract the feature expression of a video. However, training a convolutional neural network requires positive and negative sample pairs, and labeling such pairs is complex, difficult, and low in yield. Training of the convolutional neural network model in the related art is therefore complicated.
Disclosure of Invention
The embodiment of the application provides a method and a device for model training and similarity determination of multimedia resources, which are used for solving the problem that the training work of a convolutional neural network model in the related technology is complex.
In a first aspect, the present application provides a training method for a feature extraction model of a multimedia resource, the method comprising:
constructing a positive sample pair using multimedia resources in a multimedia resource set that have been accessed by the same account object, wherein the access times of the two samples in the positive sample pair fall within the same time window; and
constructing a negative sample pair using non-accessed multimedia resources and accessed multimedia resources in the multimedia resource set, wherein the multimedia resources matched with similar query requests form the multimedia resource set, and query requests whose similarity is higher than a similarity threshold constitute the similar query requests;
and training a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting the feature expressions from which the similarity between multimedia resources is determined.
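The sample-pair construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the access-record layout (account, resource, timestamp) and the 60-second window length are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical access log entries: (account_id, resource_id, access_time_in_seconds).
WINDOW = 60  # assumed time-window length

def build_positive_pairs(access_records, window=WINDOW):
    """Pair resources accessed by the same account object within one time window."""
    by_account = defaultdict(list)
    for account, resource, t in access_records:
        by_account[account].append((t, resource))
    pairs = set()
    for visits in by_account.values():
        visits.sort()
        for (t1, r1), (t2, r2) in combinations(visits, 2):
            if r1 != r2 and t2 - t1 <= window:  # same window -> positive pair
                pairs.add(tuple(sorted((r1, r2))))
    return pairs

def build_negative_pairs(resource_set, accessed):
    """Pair each non-accessed resource in the set with each accessed one."""
    return {(n, a) for n in resource_set - accessed for a in accessed}
```

For example, two accesses by one account 30 seconds apart yield a positive pair, while an access 170 seconds later does not; every resource in the set that was never accessed is paired against each accessed one to form negative pairs.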
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
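The double-tower structure can be sketched as follows. This is an illustrative stand-in only: the layer sizes, the random weights, and the use of a plain linear map with `tanh` in place of a real convolutional neural network are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    def __init__(self, in_dim, feat_dim, out_dim):
        self.cnn_w = rng.standard_normal((in_dim, feat_dim))      # stand-in for the CNN
        self.fc_w = rng.standard_normal((feat_dim + 1, out_dim))  # first full-connection layer

    def first_feature(self, content):
        """'First feature' extracted from the resource content."""
        return np.tanh(content @ self.cnn_w)

    def feature_expression(self, content, interest):
        # The full-connection layer consumes the first feature together with
        # the second feature (degree-of-interest parameter) to yield the
        # feature expression of the resource.
        first = self.first_feature(content)
        return np.concatenate([first, [interest]]) @ self.fc_w

tower_a, tower_b = Tower(8, 4, 3), Tower(8, 4, 3)
va = tower_a.feature_expression(rng.standard_normal(8), 0.7)
vb = tower_b.feature_expression(rng.standard_normal(8), 0.2)
# Similarity between the two feature expressions (cosine, as one choice).
sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Each tower maps one resource independently; only the final feature expressions interact, which is the characteristic property of a two-tower design.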
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the training the feature extraction model using the positive sample pair and the negative sample pair includes:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
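One training step over a labeled sample pair might look like the following sketch. The dimensions, learning rate, and use of binary cross-entropy are assumptions; only the structure (second full-connection layer, classification into positive/negative pair, loss-driven parameter update) follows the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
fc2 = rng.standard_normal(6) * 0.1  # second full-connection layer (6 = 3 + 3)

def step(v1, v2, label, lr=0.1):
    """One update: classify the pair's feature expressions, then adjust fc2."""
    global fc2
    x = np.concatenate([v1, v2])          # classification feature input
    p = 1.0 / (1.0 + np.exp(-(fc2 @ x)))  # predicted probability of "positive pair"
    # Cross-entropy between prediction category and labeled category.
    loss = -(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))
    fc2 -= lr * (p - label) * x           # gradient of the loss w.r.t. fc2
    return float(loss)
```

Repeated steps on the same positive pair should drive the loss down, which is the behaviour the model-parameter adjustment aims for.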
Optionally, the method further comprises:
Acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
the method for constructing the positive sample pair by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set specifically comprises the following steps:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
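The candidate-set screening by specified operation record can be sketched as below; the record field names and the play-duration threshold are illustrative assumptions, not the patent's data model.

```python
def positive_sample_set(candidate_set, records, min_play_seconds=10):
    """Keep accessed resources carrying a specified operation record:
    played longer than the appointed duration, or liked / forwarded /
    collected / followed / commented."""
    kept = set()
    for rid in candidate_set:
        rec = records.get(rid, {})
        if rec.get("play_seconds", 0) > min_play_seconds or any(
            rec.get(op) for op in ("like", "forward", "collect", "follow", "comment")
        ):
            kept.add(rid)
    return kept
```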
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
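A sketch of such a loss, under the assumption that the shared expression is the squared Euclidean distance between feature expressions: the positive-pair term and the negative-pair term are then identical up to sign, so minimizing the total pulls positive pairs together and pushes negative pairs apart.

```python
import numpy as np

def pair_loss(positive_pairs, negative_pairs):
    """positive_pairs / negative_pairs: iterables of (Vl, Vc) feature arrays.
    Returns sum of positive-pair distances minus sum of negative-pair distances."""
    dp = sum(float(np.sum((vl - vc) ** 2)) for vl, vc in positive_pairs)
    dn = sum(float(np.sum((vl - vc) ** 2)) for vl, vc in negative_pairs)
    return dp - dn
```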
Optionally, the method further comprises:
the set of multimedia resources is constructed based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
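The resource-set construction above can be sketched as follows. `SequenceMatcher` stands in for the patent's (unspecified) query-similarity measure, and the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def build_resource_set(queries, results, threshold=0.6):
    """Group queries whose pairwise similarity exceeds the threshold, then
    pool the multimedia resources matched by each query in the group."""
    similar = set()
    for i, qa in enumerate(queries):
        for qb in queries[i + 1:]:
            if SequenceMatcher(None, qa, qb).ratio() > threshold:
                similar.update((qa, qb))
    resource_set = set()
    for q in similar:
        resource_set.update(results.get(q, ()))
    return resource_set
```

For example, two near-duplicate queries pool their matched resources into one set, while an unrelated query contributes nothing.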
Optionally, the method further comprises:
For each query request in the query request set, if the most recently accessed multimedia resource associated with the query request has a specified operation record, that multimedia resource and any multimedia resource in the multimedia resource set accessed by the same account object are labeled as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a second aspect, the present application further provides a method for determining similarity of multimedia resources, where the method includes:
Acquiring a first multimedia resource and a second multimedia resource;
Respectively extracting the feature expressions of the first multimedia resource and the second multimedia resource by adopting a feature extraction model;
determining the similarity of the first multimedia resource and the second multimedia resource based on the respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
the query requests with the similarity higher than the similarity threshold value form the similar query requests, and the multimedia resources matched with the similar query requests form the multimedia resource set;
The positive sample pair is constructed by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative-sample pair is constructed using the non-accessed multimedia assets and the accessed multimedia assets in the set of multimedia assets.
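The similarity determination flow can be sketched as below. `extract` is a placeholder for the trained feature extraction model (here a deterministic random projection keyed on the resource, purely so the example runs); cosine similarity is one common choice of distance.

```python
import numpy as np

def extract(resource_content):
    """Placeholder feature expression for a resource; stands in for the
    trained feature extraction model."""
    rng = np.random.default_rng(abs(hash(resource_content)) % (2**32))
    return rng.standard_normal(16)

def similarity(res_a, res_b):
    """Extract both feature expressions, then score the pair by cosine."""
    va, vb = extract(res_a), extract(res_b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

A resource compared with itself scores 1.0, and any pair scores within [-1, 1], matching the usual reading of a similarity built from feature-expression distance.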
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, training a feature extraction model using the positive sample pair and the negative sample pair, including:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model includes a term for the feature similarity of the two samples in the positive sample pair and a term for the degree of difference between the two samples in the negative sample pair, and the two terms have the same expression but opposite signs.
Optionally, the method further comprises:
the set of multimedia resources is constructed based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the method further comprises:
For each query request in the query request set, if the most recently accessed multimedia resource associated with the query request has a specified operation record, that multimedia resource and any multimedia resource in the multimedia resource set accessed by the same account object are labeled as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the method further comprises:
Acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
The method for constructing the positive sample pair by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set specifically comprises the following steps:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a third aspect, the present application further provides a training device for a feature extraction model of a multimedia resource, where the device includes:
a positive sample pair construction module configured to construct a positive sample pair using multimedia resources in a multimedia resource set that have been accessed by the same account object, wherein the access times of the two samples in the positive sample pair fall within the same time window; a negative sample pair construction module configured to construct a negative sample pair using non-accessed multimedia resources and accessed multimedia resources in the multimedia resource set, wherein the multimedia resources matched with similar query requests form the multimedia resource set, and query requests whose similarity is higher than a similarity threshold constitute the similar query requests;
And a training module configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used to extract similarities between multimedia resources.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, executing the training feature extraction model using the positive sample pair and the negative sample pair, the training module being specifically configured to:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
performing the constructing of positive sample pairs using the multimedia assets in the multimedia asset set accessed by the same account object, the sample pair constructing module being specifically configured to:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function adopted for training the feature extraction model includes a term for the feature similarity of the two samples in the positive sample pair and a term for the degree of difference between the two samples in the negative sample pair, and the two terms have the same expression but opposite signs.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the sample pair construction module is further configured to, for each query request in the query request set, label the most recently accessed multimedia resource associated with the query request and any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, if that most recently accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a fourth aspect, the present application also provides a device for determining similarity of multimedia resources, where the device includes:
An acquisition module configured to acquire a first multimedia resource and a second multimedia resource;
A feature expression extraction module configured to extract feature expressions of the first multimedia resource and the second multimedia resource, respectively, using a feature extraction model;
A similarity determination module configured to determine a similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
query requests whose similarity is higher than the similarity threshold form the similar query requests, and the multimedia resources matched by the similar query requests form the multimedia resource set;
the positive sample pair is constructed using the multimedia resources accessed by the same account object in the multimedia resource set, and the access times of the two samples in the positive sample pair fall within the same time window;
the negative sample pair is constructed using the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises a training module configured to train a feature extraction model with the positive sample pair and the negative sample pair based on the following method:
acquiring the respective feature expressions of the first multimedia resource and the second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource using a second full-connection layer to obtain a classification feature;
classifying the classification feature to obtain a prediction category of the sample pair, wherein the classification categories include positive sample pair and negative sample pair;
and determining a loss value using the prediction category and the labeled category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference of the two samples in the negative sample pair, where the two terms have the same expression but opposite signs.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the apparatus further includes:
And a sample pair construction module configured to, for each query request in the query request set, if the last accessed multimedia resource associated with the query request has a specified operation record, label that multimedia resource together with any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is smaller than the first duration and query requests of the same account object, the similarity threshold adopted for the former is smaller than the similarity threshold adopted for the latter.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen out multimedia resources with specified operation records from the candidate resource set to obtain a positive sample resource set;
The sample pair construction module is used for constructing positive sample pairs by adopting the multimedia resources accessed by the same account object in the multimedia resource set based on the following method:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes that sample l and sample c form a positive sample pair, Dn denotes that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
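As a concrete illustration of such a loss, the sketch below assumes squared Euclidean distance between feature expressions; since the patent's formula image is not reproduced in the text, the distance measure, function names, and data layout are illustrative assumptions. One shared distance expression applies to both pair types with the sign flipped for negative pairs, so minimizing the total pulls positive pairs together and pushes negative pairs apart.

```python
import numpy as np

def pair_loss(v_l, v_c, is_positive):
    """Squared-distance term for one sample pair; the sign flips for negative
    pairs, mirroring the claim that both terms share one expression with
    opposite signs (distance choice assumed)."""
    d = float(np.sum((np.asarray(v_l) - np.asarray(v_c)) ** 2))
    return d if is_positive else -d

def total_loss(positive_pairs, negative_pairs):
    """Sum over Dp (positive pairs) minus sum over Dn (negative pairs)."""
    return (sum(pair_loss(l, c, True) for l, c in positive_pairs)
            + sum(pair_loss(l, c, False) for l, c in negative_pairs))
```

With one positive pair at distance 1 and one negative pair at distance 4, the total is negative, showing that well-separated negative pairs lower the loss.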
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than a specified time length;
like, forward, favorite, follow, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a fifth aspect, the present application also provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement any of the methods as provided in the first and/or second aspects of the application.
In a sixth aspect, an embodiment of the application also provides a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform any of the methods as provided in the first and/or second aspects of the application.
In a seventh aspect, an embodiment of the application provides a computer program product comprising a computer program which, when executed by a processor, implements any of the methods as provided in the first and/or second aspects of the application.
The technical scheme provided by the embodiments of the application has the following beneficial effects. The same or similar multimedia resources can be found through different query requests, so there is some similarity between query requests for the same or similar multimedia resources; in other words, similar query requests correspond to similar multimedia resources. To better construct positive sample pairs, the embodiments of the application screen the multimedia resources corresponding to similar query requests using a time window. A time window can be used because two access operations with similar access times under the same or similar query requests tend to be directed at very similar multimedia resources. Therefore, positive sample pairs screened based on the time window have high practicability and accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a sample construction and training method of a feature extraction model of a multimedia resource according to an embodiment of the present application;
FIG. 3 is a second flowchart of a training method of a feature extraction model of a multimedia resource according to an embodiment of the application;
FIG. 4 is a schematic structural diagram of a feature extraction model of a multimedia resource according to an embodiment of the present application;
FIG. 5 is a third flow chart of a training method of a feature extraction model of a multimedia resource according to an embodiment of the application;
Fig. 6 is a flowchart illustrating a method for determining similarity of multimedia resources according to an embodiment of the present application;
FIG. 7 is a block diagram of a training device for feature extraction model of multimedia resources according to an embodiment of the present application;
Fig. 8 is a block diagram of a device for determining similarity of multimedia resources according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a structure of an electronic device according to an exemplary embodiment.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
In the following, some terms in the embodiments of the present application are explained for easy understanding by those skilled in the art.
(1) The term "plurality" in embodiments of the present application means two or more, and other adjectives are similar.
(2) "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
(3) The server serves the terminal, for example by providing resources to the terminal and storing terminal data; the server corresponds to an application program installed on the terminal and operates in cooperation with the application program on the terminal.
(4) The terminal device may refer to a software application (APP) or a client. It has a visual display interface and can interact with a user; it corresponds to a server and provides local services for the client. Apart from some applications that run only locally, software applications are typically installed on a common client terminal and need to run in conjunction with a server. With the development of the internet, commonly used applications include, for example, short-video applications, email clients, and instant-messaging clients. Such applications require a corresponding server and service program in the network to provide services such as database services and configuration-parameter services, so a specific communication connection needs to be established between the client terminal and the server terminal to ensure the normal operation of the application.
(5) In the embodiment of the application, the multimedia resources refer to various resources which can be accessed in the network, such as audio resources, video resources, webpage resources and the like.
(6) Feature expression, which refers to information describing features of a multimedia resource, can extract high-level features from the multimedia resource for use by subsequent applications.
In view of the problems in the related art that the labeling work is complex and difficult, and that the low yield of positive and negative sample pairs makes the training of the convolutional neural network model complicated, an embodiment of the present application provides a solution.
The method provided by the embodiment of the application is not only suitable for video, but also suitable for the feature extraction of any multimedia resource such as audio resources, webpage resources and the like.
The inventive concept of the embodiment of the present application may be summarized as follows: based on users' access records for multimedia resources, an access request set is constructed by collecting the same or similar access requests; positive sample pairs are then constructed by selecting, from the multimedia resource set corresponding to the access request set, multimedia resources accessed within the same time window; and negative sample pairs are constructed from non-accessed multimedia resources together with accessed multimedia resources. In this way, the same or similar multimedia resources can be screened based on the same or similar access requests, and screening resources within the same time window based on users' access behaviors ensures the accuracy of positive sample pair construction. In addition, automatic mining and labeling of negative sample pairs is achieved based on non-accessed and accessed multimedia resources. Thus, the present application simplifies the labeling of positive and negative sample pairs.
After the design idea of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limiting. In the specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs. It should be noted that the user information related to the present application is obtained based on the user authorization.
Referring to fig. 1, a schematic diagram of an application scenario of the model training and similarity determining method for multimedia resources according to an embodiment of the present application is shown. The application scenario includes a plurality of terminal devices 101 (terminal device 101-1, terminal device 101-2, ..., terminal device 101-n) and a server 102. The terminal device 101 and the server 102 are connected through a wireless or wired network, and the terminal device 101 includes, but is not limited to, electronic devices such as a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, an intelligent wearable device, and an intelligent television. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
The user may send a query request to the server 102 through the terminal device 101, and the server 102 may search for related multimedia resources based on the query request and push them to the terminal device 101 for presentation. The user can then select, from the displayed results, the multimedia resources that meet the search intention and access them. Thus, based on the user's access operations, the server 102 can mine the samples that conform to the user's search intention for a given query request, while the non-accessed multimedia resources are samples that do not conform to that search intention. By analyzing the same or similar search intentions, positive sample pairs can be constructed by mining samples of the same or similar search intentions, and negative sample pairs can be constructed from samples that do not belong to the search intention together with samples that do, thereby realizing automatic mining and labeling of positive and negative sample pairs. The constructed positive and negative sample pairs conform to users' search intentions and can be adapted to different query requests without being limited by the cognition of labeling personnel. Hence the categories of the constructed positive and negative sample pairs can be richer, the training of the feature extraction model is more complete, and the model can learn more features and extract features that express deep semantics.
In order to further explain the technical solution provided by the embodiments of the present application, the following details are described with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on routine or non-inventive effort. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to that provided by the embodiments of the present application.
For ease of understanding, the training method of the feature extraction model for multimedia resources and the method for determining the similarity between multimedia resources provided by the present application are described below.
1. Training of feature extraction models
(1) Labeling of positive and negative sample pairs
The same or similar multimedia resources can be found through different query requests. As such, there is some similarity between query requests for the same or similar multimedia resources; in other words, similar query requests correspond to similar multimedia resources. To better construct positive sample pairs, the embodiment of the present application screens the multimedia resources corresponding to similar query requests using a time window. A time window can be used because two access operations with similar access times under the same or similar query requests tend to be directed at very similar multimedia resources. Therefore, positive sample pairs screened based on the time window have high practicability and accuracy.
In view of this, in the embodiment of the present application, the query requests with similarity higher than the similarity threshold form the similar query requests, and the matched multimedia resources of the similar query requests form the multimedia resource set. In practice, the set of multimedia resources may be constructed based on the following method, as shown in fig. 2, comprising the steps of:
in step 201, a similarity of each of a plurality of query requests is determined.
In the embodiment of the application, the feature extraction can be carried out on the query request to obtain the semantic features of the query request, and then the similarity between the query requests is determined based on the distance between the semantic features.
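As an illustrative sketch of this step, the snippet below compares two hypothetical query-request embeddings with cosine similarity; the actual embedding model and similarity measure are not specified in this section, so both are assumptions here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two semantic feature vectors (cosine assumed)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical semantic features extracted from two query requests
q1 = [0.8, 0.1, 0.6]
q2 = [0.7, 0.2, 0.6]
sim = cosine_similarity(q1, q2)  # close to 1.0 for near-identical intents
```

Query requests whose pairwise similarity exceeds the applicable threshold would then be grouped together in step 202.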
The query request may be triggered based on a search keyword, and the query request for extracting semantic features may include the search keyword, and may further include a query category. The query class is user specified. For example, a user queries specific categories under a broad category of television dramas, such as ancient costume dramas, suspense dramas, year of production, actors, etc.
In step 202, obtaining at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
in practice, in order to be able to better mine positive sample pairs, a plurality of query requests for building a set of query requests need to satisfy at least one of the following conditions:
And 1, the inquiry time interval is smaller than the first duration.
Query request sets constructed from query requests satisfying the condition and having a similarity above a similarity threshold enable short term interest modeling of the user.
For example, if the embedding features of two query requests are highly similar and the query time interval between them is less than 30 minutes, the two queries are combined into one query request set (session).
Condition 2: query requests of the same account object. This condition is used to model the long-term interest of the user and to obtain positive sample pairs accordingly.
In practice, the similarity threshold used for short-term interest modeling in condition 1 and the similarity threshold used for long-term interest modeling in condition 2 may be different. For example, the conditions for short-term interest modeling may be relaxed with a similarity threshold that is lower than that employed for long-term interest modeling.
In implementation, for example, long-term interest modeling under condition 2 requires higher similarity between the query requests in the same query request set; the higher similarity threshold it adopts ensures that the grouped requests reflect essentially the same query of the same user.
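The session-grouping logic described by condition 1 can be sketched as follows; the `are_similar` predicate stands in for the embedding-similarity check, and the greedy single-pass grouping is an illustrative simplification, not the patent's prescribed algorithm.

```python
from datetime import datetime, timedelta

def build_sessions(queries, are_similar, max_gap=timedelta(minutes=30)):
    """Greedy grouping: a query joins the current session when it is similar
    to the session's last query and arrives within the time-gap limit."""
    sessions = []
    for q in sorted(queries, key=lambda q: q["time"]):
        s = sessions[-1] if sessions else None
        if s and are_similar(s[-1], q) and q["time"] - s[-1]["time"] < max_gap:
            s.append(q)
        else:
            sessions.append([q])
    return sessions
```

In practice `are_similar` would compare embedding features against the (lower) short-term similarity threshold; here an exact-text match serves as a stand-in.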
In step 203, the multimedia resource set is constructed by using the multimedia resources corresponding to each query request in the query request set.
For example, the multimedia resources corresponding to the query request 1 include a and b, the multimedia resources corresponding to the query request 2 include b and c, and the multimedia resources corresponding to the query request 3 include d and e. The multimedia asset set corresponding to query requests 1-3 includes a, b, c, d, e.
On the basis of obtaining the multimedia resource set, positive and negative samples can be constructed for complete model training, and the method comprises the following steps as shown in fig. 2:
in step 204, a positive sample pair is constructed using the multimedia assets in the multimedia asset set that are accessed by the same account object, wherein the access times of the two samples in the positive sample pair are taken from the same time window.
For example, both long-term interest modeling and short-term interest modeling can yield a corresponding query request set. Each query request in the set obtains a corresponding query result, that is, the multimedia resources corresponding to that query request; the query results of all requests in the same query request set form a multimedia resource set, and which multimedia resources in each query result were accessed is determinable information. Therefore, for the accessed (that is, similar) multimedia resources in the multimedia resource set, positive sample pairs are constructed from the multimedia resources accessed within the same time window in order to ensure higher similarity.
For example, a sequence of the multimedia resources is obtained by sorting according to access time, and any two accessed multimedia resources falling within the same time window, determined by the time window size, are taken to construct a positive sample pair.
For example, suppose the query time interval between query request A and query request B is smaller than the first duration and their similarity is higher than the similarity threshold. The multimedia resources searched and presented based on query request A and query request B are shown in Table 1:
The sequence (a1, a2, b1, a3, b2) is obtained by sorting in order of access time. With a time window size of 2, the positive sample pairs formed for a1 include [a1, a2] and [a1, b1]; for a2, the pairs include [a1, a2], [a2, b1] and [a2, a3]; and so on.
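One plausible reading of this worked example, in which each accessed resource is paired with the next `window` resources in access-time order, can be sketched as follows; the function name and the forward-only pairing convention are assumptions for illustration.

```python
def positive_pairs(access_sequence, window=2):
    """Pair each accessed resource with the next `window` resources in the
    access-time-ordered sequence, matching the worked example in the text."""
    pairs = []
    for i, a in enumerate(access_sequence):
        for b in access_sequence[i + 1 : i + 1 + window]:
            pairs.append((a, b))
    return pairs
```

Applied to the sequence (a1, a2, b1, a3, b2) with window size 2, this yields [a1, a2] and [a1, b1] for a1, and [a2, b1] and [a2, a3] looking forward from a2, consistent with the example above.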
In some embodiments, in order to further ensure the accuracy of the constructed positive sample pair, in the embodiments of the present application, a multimedia resource set corresponding to the query request set may be screened, and the effectively accessed multimedia resource may be screened to construct the positive sample pair. Can be implemented as shown in fig. 3 comprising the steps of:
in step 301, the accessed multimedia resources in the multimedia resource set are acquired, and a candidate resource set is obtained.
For example, if the multimedia resources corresponding to query request A are a1, a2, a3 and a4, of which the first three are accessed, then a1, a2 and a3 are used to construct the candidate resource set, and a4 is treated as a non-accessed multimedia resource.
In step 302, multimedia resources with specified operation records are screened from the candidate resource set, and a positive sample resource set is obtained.
The specified operation records are used to screen out valid accesses to multimedia resources, which may also be called valid clicks. The specified operation record may include at least one of: a playing time length longer than the specified time length, or an operation attribute such as like, forward, favorite, follow, or comment.
For example, if short video a1 is accessed and its playing time length is longer than the specified time length, a1 has a specified operation record. Likewise, if a1 receives any one of a like, forward, favorite, follow, or comment after being accessed, a1 also has a specified operation record; in either case, a1 is a validly accessed multimedia resource.
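The valid-access screening of steps 301 and 302 might look like the following sketch; the field names, the interaction list, and the 10-second playback threshold are illustrative assumptions.

```python
def is_valid_access(record, min_play_seconds=10):
    """A click counts as a valid access if playback exceeded the specified
    duration or the user performed any of the listed interactions."""
    interactions = ("like", "forward", "favorite", "follow", "comment")
    return (record.get("play_seconds", 0) > min_play_seconds
            or any(record.get(op) for op in interactions))

def positive_sample_resources(candidates):
    """Step 302: keep only candidates with a specified operation record."""
    return [r for r in candidates if is_valid_access(r)]
```

A resource watched for 30 seconds passes on playback time alone; a resource watched briefly but liked still passes; a resource with neither is filtered out as noise.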
The validly accessed multimedia resources in the candidate resource set are screened out to obtain a positive sample resource set; then, in step 303, positive sample pairs are constructed for the positive sample resource set using the multimedia resources accessed by the same account object.
In this way, by screening positive samples with specified operation records, noise data from invalid accesses can be filtered out, improving the accuracy of the constructed positive sample pairs.
In another embodiment, to increase the diversity and number of positive sample pairs, satisfactorily accessed multimedia resources are defined in an embodiment of the present application. A satisfactorily accessed multimedia resource is the last multimedia resource accessed under one query request, provided that it has a specified operation record. Positive sample pairs are then constructed from both the satisfactorily accessed multimedia resources and the multimedia resources accessed (or validly accessed) by the same account object.
In step 205, a negative sample pair is constructed with the non-accessed multimedia asset and the accessed multimedia asset in the set of multimedia assets.
It should be noted that the execution order of constructing the positive sample pairs in step 204 and constructing the negative sample pairs in step 205 is not limited: step 204 may be executed before step 205, step 205 before step 204, or the two steps may be executed simultaneously; all of these are suitable for the embodiment of the present application.
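Negative-pair construction in step 205 can be sketched as a cross-product of accessed and non-accessed resources from the same multimedia resource set; pairing every accessed resource with every non-accessed one is an illustrative choice, since the text does not fix a sampling strategy.

```python
def negative_pairs(accessed, not_accessed):
    """Each (accessed, non-accessed) combination from the same multimedia
    resource set yields an automatically labelled negative pair."""
    return [(a, n) for a in accessed for n in not_accessed]
```

For the earlier example where a1, a2, a3 were accessed and a4 was not, this produces the pairs (a1, a4), (a2, a4) and (a3, a4).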
(2) Parameter training process of feature extraction model
After the positive and negative sample pairs are obtained, a feature extraction model may be trained using the positive and negative sample pairs in step 206.
The structure of the feature extraction model to be trained is shown in fig. 4. The model is a double-tower network structure in which each tower includes a convolutional neural network and a first full-connection layer; a second full-connection layer is connected after the first full-connection layers of the towers, followed by a logistic regression layer that predicts whether a sample pair is a positive sample pair or a negative sample pair. Wherein:
The convolutional neural network of a first tower network structure in the double tower network structure is used for extracting the first characteristic of the first multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
The convolutional neural network is used to perform feature extraction either on the content of the multimedia resource together with its attention-degree parameter Ctr, or on the content of the multimedia resource alone, to obtain the first feature. The attention-degree parameter, hereinafter also referred to as the second feature, may include at least one of: a click-volume index parameter, such as the click volume or click rate; an attention-volume index parameter, such as the attention volume or attention rate; a praise-volume index parameter, such as the praise volume or praise rate; a comment-volume index parameter, such as the total comment volume or comment rate, where the comment rate can be understood as the ratio between the users who posted a comment and the users who accessed the resource; and a forwarding-volume index parameter, such as the forwarding volume or forwarding rate, where the forwarding rate can be understood as the ratio between the number of forwarding users and the number of accessing users.
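The attention-degree parameters above can be derived from raw counters. In this sketch the counter names (`access_users`, `comment_users`, and so on) are hypothetical; only the rate definitions follow the text:

```python
def attention_degree(stats):
    """Compute the rate-style attention-degree parameters described
    above from hypothetical raw counters for one multimedia resource.
    Per the text, e.g. the comment rate is the ratio between users who
    posted a comment and users who accessed the resource."""
    accesses = max(stats["access_users"], 1)  # guard against division by zero
    return {
        "click_rate":   stats["clicks"] / max(stats["impressions"], 1),
        "like_rate":    stats["likes"] / accesses,
        "comment_rate": stats["comment_users"] / accesses,
        "forward_rate": stats["forward_users"] / accesses,
    }
```

Either these rates, the raw volumes, or both could serve as the second feature fed to the network.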
The first full-connection layer is used for extracting the first characteristics and the attention degree parameters to obtain the characteristic expression of the multimedia resource.
The feature extraction model to be trained, with its two-tower structure, is shown in fig. 4. In fig. 4, the first fully-connected layer and the second fully-connected layer may each include one or more fully-connected layers. In connection with fig. 4, the training process includes the steps shown in fig. 5:
In step 501, respective feature expressions of the first multimedia resource and the second multimedia resource in any pair of samples, i.e. feature expressions of the first full connection layer output, are acquired.
In step 502, a second full connection layer is used to perform feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource, so as to obtain classification features, that is, features input to a logistic regression layer softmax.
In step 503, the classification feature is subjected to classification processing, so as to obtain a predicted class of any sample pair, where the classification classes for classification include a positive sample pair and a negative sample pair.
Softmax performs classification processing based on the classification features to determine whether the two input videos form a positive sample pair or a negative sample pair. The predicted category can then be compared with the annotated category: in step 504, a loss value is determined using the predicted category and the annotated category, and the model parameters of the feature extraction model are adjusted based on the loss value.
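Steps 503 and 504 can be illustrated with a minimal sketch. The two-class logit layout and the cross-entropy form of the per-pair loss are assumptions consistent with softmax classification, not details taken from the embodiment:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pair_loss(logits, label):
    """Steps 503-504 in miniature: classify a sample pair as positive
    (label 1) or negative (label 0) via softmax over two-class logits,
    then score the prediction against the labeled category with
    cross-entropy."""
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    loss = -math.log(probs[label])
    return predicted, loss
```

A confident, correct prediction yields a small loss; the same logits scored against the opposite label yield a large one, which is the gradient signal used to adjust the model parameters.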
In implementation, the loss function adopted for training the feature extraction model includes the feature similarity of the two samples in a positive sample pair and the degree of difference of the two samples in a negative sample pair, where the two terms share the same expression but have opposite signs. A simple formula can therefore be used to determine the loss value, and the losses of positive and negative sample pairs are taken into account simultaneously. The loss function is designed to minimize the distance between the two samples in a positive pair and maximize the degree of difference between the two samples in a negative pair. The model is thus trained to group similar multimedia resources together and to separate dissimilar multimedia resources as far as possible, improving the accuracy of the features the model extracts from multimedia resources.
In practice, the loss function used is as shown in equation (1), and the training goal is to make the result of equation (1) as small as possible.
Wherein Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc represents the feature expression of sample c, and Vl represents the feature expression of sample l.
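The image of equation (1) is not reproduced in this text, so the concrete expression below is an assumption. It is chosen only to satisfy the stated properties: the positive-pair term and the negative-pair term share the same expression with opposite signs, and minimizing the total pulls positive pairs together while pushing negative pairs apart:

```python
def pairwise_loss(Vl, Vc, is_positive):
    """Squared distance between the two feature expressions, taken with
    a positive sign for positive pairs (Dp) and a negative sign for
    negative pairs (Dn) -- same expression, different signs, as the
    text describes. The choice of squared distance is an assumption;
    equation (1) itself is not reproduced here."""
    d = sum((x - y) ** 2 for x, y in zip(Vl, Vc))
    return d if is_positive else -d

def total_loss(pairs):
    # pairs: iterable of (Vl, Vc, is_positive); the training goal is to
    # make this sum as small as possible.
    return sum(pairwise_loss(*p) for p in pairs)
```

Minimizing the sum drives positive-pair distances toward zero and negative-pair distances as large as possible, matching the stated training goal.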
Taking a video as an example, the video may be sampled and the resulting image sequence input into the feature extraction model, or single frame images of the video may be input. When an image sequence is input, the final feature expression is that of the entire image sequence. When single frame images are input, the feature expression of each frame of the video can be obtained and then spliced into the feature expression of the image sequence; alternatively, classification prediction is performed separately for each sampled frame of the first video against each sampled frame of the other video, and the predicted category of each sampled frame is compared with the corresponding labeled category to obtain the loss value.
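The single-frame route with splicing can be sketched as follows; `frame_encoder` stands in for the per-frame convolutional path of the model and, like the sampling stride, is a hypothetical placeholder:

```python
def sequence_expression(video_frames, frame_encoder, stride=5):
    """Sample every `stride`-th frame of the video, obtain each sampled
    frame's feature expression with the hypothetical `frame_encoder`,
    then splice (concatenate) them into the feature expression of the
    image sequence, as described above."""
    sampled = video_frames[::stride]
    expression = []
    for frame in sampled:
        expression.extend(frame_encoder(frame))  # splice per-frame expressions
    return expression
```

The alternative route, per-frame classification prediction with per-frame losses, would instead feed each sampled frame pair through the classifier separately.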
2. Determining similarity based on feature extraction model
After the feature extraction model is trained, the similarity of multimedia resources can be determined based on the model; the feature extraction model is trained once and can then be reused repeatedly for similarity determination. Fig. 6 is a schematic flow chart of a method for determining similarity using the feature extraction model according to the embodiment of the application, which includes the following steps:
In step 601, a first multimedia resource and a second multimedia resource are acquired;
In step 602, extracting feature expressions of the first multimedia resource and the second multimedia resource respectively by using a feature extraction model;
In step 603, a similarity of the first multimedia asset and the second multimedia asset is determined based on the respective feature expressions of the first multimedia asset and the second multimedia asset.
In implementation, the cosine distance of the two embedding features is used to obtain the similarity of the two multimedia resources.
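Step 603 can be made concrete with a plain cosine computation over the two embedding feature expressions (the text's "cosine distance" is taken here as the cosine similarity, with 1 meaning identical direction):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding feature expressions: the cosine of
    the angle between them, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Two multimedia resources whose feature expressions point in nearly the same direction score close to 1 and are judged similar.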
Based on the same inventive concept, the present application provides a training device for a feature extraction model of a multimedia resource, as shown in fig. 7, the device 700 includes:
A sample pair construction module 701 configured to construct a positive sample pair using multimedia resources in the multimedia resource set accessed by the same account object, and to construct a negative sample pair using the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set, wherein the access times of the two samples in the positive sample pair are taken from the same time window, the multimedia resource set is formed by the multimedia resources matched by similar query requests, and similar query requests are query requests whose similarity is higher than a similarity threshold;
A training module 702 configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used to extract the similarity between multimedia resources.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, in training the feature extraction model using the positive sample pair and the negative sample pair, the training module is specifically configured to:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
In constructing positive sample pairs using the multimedia resources in the multimedia resource set accessed by the same account object, the sample pair construction module is specifically configured to:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
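The three set-construction steps above can be sketched as follows. The query-similarity function `similar(q1, q2)` and the `results` mapping are hypothetical placeholders; the embodiment does not fix how query similarity is computed:

```python
def build_resource_set(queries, results, similar, threshold):
    """Keep the query requests whose similarity to at least one other
    query exceeds the threshold (the 'similar query requests'), then
    pool the multimedia resources matched by each of them into the
    multimedia resource set.

    results[q] lists the multimedia resources matched by query q.
    """
    kept = {q1 for q1 in queries
            for q2 in queries
            if q1 != q2 and similar(q1, q2) > threshold}
    resource_set = set()
    for q in kept:
        resource_set |= set(results[q])
    return kept, resource_set
```

Sample pairs are then drawn from this pooled resource set, as described in the modules above.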
Optionally, the sample pair construction module is further configured to, for each query request in the query request set, label the last accessed multimedia resource associated with the query request together with any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, provided that the last accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
The embodiment of the present application further provides a device for determining similarity of multimedia resources based on the same inventive concept, as shown in fig. 8, the device 800 includes:
An acquisition module 801 configured to acquire a first multimedia resource and a second multimedia resource;
A feature expression extraction module 802 configured to extract feature expressions of the first multimedia asset and the second multimedia asset, respectively, using a feature extraction model;
a similarity determination module 803 configured to determine a similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
similar query requests are query requests whose similarity is higher than the similarity threshold, and the multimedia resources matched by the similar query requests form the multimedia resource set;
The positive sample pair is constructed by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative-sample pair is constructed using the non-accessed multimedia assets and the accessed multimedia assets in the set of multimedia assets.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises a training module configured to train a feature extraction model with the positive sample pair and the negative sample pair based on the following method:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the apparatus further includes:
And the sample pair construction module is configured to, for each query request in the query request set, label the last accessed multimedia resource associated with the query request together with any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, provided that the last accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
The sample pair construction module is used for constructing positive sample pairs by adopting the multimedia resources accessed by the same account object in the multimedia resource set based on the following method:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
Having described the model training method and processing method of multimedia resources and related apparatuses of exemplary embodiments of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the application may comprise at least one processor and at least one memory. The memory stores program code that, when executed by the processor, causes the processor to perform the methods according to the various exemplary embodiments of the present application described above in the present specification. For example, the processor may perform the steps of the model training method and the similarity determination method for multimedia resources.
An electronic device 130 according to this embodiment of the application is described below with reference to fig. 9. The electronic device 130 shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the electronic device 130 is embodied in the form of a general-purpose electronic device. The components of the electronic device 130 may include, but are not limited to, the at least one processor 131, the at least one memory 132, and a bus 133 connecting the various system components, including the memory 132 and the processor 131.
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
Memory 132 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the electronic device 130, and/or any device (e.g., router, modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur through an input/output (I/O) interface 135. Also, electronic device 130 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 130, including, but not limited to, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory 132 comprising instructions executable by the processor 131 of the apparatus 700 or the processor 131 of the apparatus 800 to perform the model training method and the similarity determination method for multimedia resources. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by the processor 131, implements any one of the model training method and the similarity determination method of the multimedia resource as provided by the present application.
In an exemplary embodiment, aspects of the model training method and the similarity determination method for a multimedia resource provided by the present application may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the model training method and the similarity determination method for a multimedia resource according to the various exemplary embodiments of the present application described above when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the model training method and the similarity determination method for multimedia resources according to embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) and include program code and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device, partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
While preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (55)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111339230.7A CN113919446B (en) | 2021-11-12 | 2021-11-12 | Model training and similarity determining method and device for multimedia resources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113919446A CN113919446A (en) | 2022-01-11 |
CN113919446B true CN113919446B (en) | 2025-04-08 |
Family
ID=79246167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111339230.7A Active CN113919446B (en) | 2021-11-12 | 2021-11-12 | Model training and similarity determining method and device for multimedia resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113919446B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114579869B (en) * | 2022-05-05 | 2022-07-22 | 腾讯科技(深圳)有限公司 | Model training method and related product |
CN116821674A (en) * | 2023-05-22 | 2023-09-29 | 北京达佳互联信息技术有限公司 | Network training and feature representation methods, devices, media and equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113032589A (en) * | 2021-03-29 | 2021-06-25 | 北京奇艺世纪科技有限公司 | Multimedia file recommendation method and device, electronic equipment and readable storage medium |
CN113051368A (en) * | 2021-03-24 | 2021-06-29 | 北京百度网讯科技有限公司 | Double-tower model training method, double-tower model searching device and electronic equipment |
CN113469298A (en) * | 2021-09-03 | 2021-10-01 | 北京达佳互联信息技术有限公司 | Model training method and resource recommendation method |
CN114782719A (en) * | 2022-04-26 | 2022-07-22 | 北京百度网讯科技有限公司 | A training method, object retrieval method and device for feature extraction model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111708964B (en) * | 2020-05-27 | 2023-06-20 | 北京百度网讯科技有限公司 | Recommendation method and device for multimedia resources, electronic equipment and storage medium |
CN112258285A (en) * | 2020-10-26 | 2021-01-22 | 北京沃东天骏信息技术有限公司 | Content recommendation method and device, equipment and storage medium |
2021-11-12: application CN202111339230.7A filed; granted as patent CN113919446B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11790933B2 (en) | Systems and methods for manipulating electronic content based on speech recognition | |
JP7688458B2 (en) | Cognitive Video and Voice Search Aggregation | |
Pohl et al. | Online indexing and clustering of social media data for emergency management | |
US11126682B1 (en) | Hyperlink based multimedia processing | |
US9489626B2 (en) | Systems and methods for identifying and notifying users of electronic content based on biometric recognition | |
CN112148881B (en) | Methods and devices for outputting information | |
US10878020B2 (en) | Automated extraction tools and their use in social content tagging systems | |
CN114328947A (en) | Knowledge graph-based question and answer method and device | |
CN113806588B (en) | Method and device for searching videos | |
JP2015162244A (en) | Methods, programs and computation processing systems for ranking spoken words | |
CN111625715A (en) | Information extraction method and device, electronic equipment and storage medium | |
CN112015928A (en) | Information extraction method and device of multimedia resource, electronic equipment and storage medium | |
CN115033739A (en) | Search method, model training method, device, electronic equipment and medium | |
Ahmed et al. | Sentiment analysis for smart cities: state of the art and opportunities | |
CN113919446B (en) | Model training and similarity determining method and device for multimedia resources | |
US20250200428A1 (en) | Cluster-based few-shot sampling to support data processing and inferences in imperfect labeled data environments | |
CN115114395A (en) | Content retrieval and model training method and device, electronic equipment and storage medium | |
CN114443904A (en) | Video query method, video query device, computer equipment and computer readable storage medium | |
CN111639234B (en) | Method and apparatus for mining core entity concerns | |
CN115630170B (en) | Document recommendation method, system, terminal and storage medium | |
CN114117239A (en) | A method, device and device for pushing a house | |
US20220171808A1 (en) | Heuristic video searching | |
KR102046224B1 (en) | Apparatus for providing personalized contents | |
CN120123970B (en) | Omnimedia fusion method and system based on complementary fusion | |
Sidiropoulos et al. | Framework of a collaborative audio analysis and visualization tool for data journalists |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||