CN113919446B - Model training and similarity determining method and device for multimedia resources - Google Patents
- Publication number
- CN113919446B (application CN202111339230.7A)
- Authority
- CN
- China
- Prior art keywords
- multimedia resource
- multimedia
- feature
- positive sample
- sample pair
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/43—Querying
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/45—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G06F16/48—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Biology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of artificial intelligence and provides a model training and similarity determining method and device for multimedia resources, aiming to solve the problem that training a convolutional neural network model in the related art is complex. In the application, in order to better construct positive sample pairs from the similar multimedia resources corresponding to similar query requests, a time window is adopted to screen the multimedia resources corresponding to those requests. The time window can be used because two access operations with close access times under the same or similar query requests tend to be directed at very similar multimedia resources. Positive sample pairs screened based on the time window therefore have high practicability and accuracy.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for model training and similarity determination of multimedia resources.
Background
Video resources in the network are abundant, and the number of videos grows enormous with the accumulation of time and the increase of users. Video-based applications typically rely on a feature representation of the video; one such application is video recommendation, which computes the similarity between videos.
Computing the similarity between videos generally requires converting each video into a feature expression, such as an embedding feature. The distance between the feature expressions of two videos is then calculated as their similarity.
In the related art, a convolutional neural network (CNN) is mostly adopted to extract the feature expression of a video. However, training a convolutional neural network requires positive and negative sample pairs, and labeling such pairs is complex, difficult, and low in yield. Training of the convolutional neural network model in the related art is therefore complicated.
Disclosure of Invention
The embodiment of the application provides a method and a device for model training and similarity determination of multimedia resources, which are used for solving the problem that the training work of a convolutional neural network model in the related technology is complex.
In a first aspect, the present application provides a training method for a feature extraction model of a multimedia resource, the method comprising:
constructing a positive sample pair using multimedia resources in a multimedia resource set that have been accessed by the same account object, wherein the access times of the two samples in the positive sample pair fall within the same time window; and
constructing a negative sample pair using non-accessed multimedia resources and accessed multimedia resources in the multimedia resource set, wherein the multimedia resources matched with similar query requests form the multimedia resource set, and query requests whose similarity is higher than a similarity threshold constitute the similar query requests;
and training a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used for extracting the feature expressions from which the similarity between multimedia resources is determined.
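The sample-pair construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the access-record layout (account, resource, timestamp) and the 60-second window length are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical access log entries: (account_id, resource_id, access_time_in_seconds).
WINDOW = 60  # assumed time-window length

def build_positive_pairs(access_records, window=WINDOW):
    """Pair resources accessed by the same account object within one time window."""
    by_account = defaultdict(list)
    for account, resource, t in access_records:
        by_account[account].append((t, resource))
    pairs = set()
    for visits in by_account.values():
        visits.sort()
        for (t1, r1), (t2, r2) in combinations(visits, 2):
            if r1 != r2 and t2 - t1 <= window:  # same window -> positive pair
                pairs.add(tuple(sorted((r1, r2))))
    return pairs

def build_negative_pairs(resource_set, accessed):
    """Pair each non-accessed resource in the set with each accessed one."""
    return {(n, a) for n in resource_set - accessed for a in accessed}
```

For example, two accesses by one account 30 seconds apart yield a positive pair, while an access 170 seconds later does not; every resource in the set that was never accessed is paired against each accessed one to form negative pairs.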
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
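The double-tower structure can be sketched as follows. This is an illustrative stand-in only: the layer sizes, the random weights, and the use of a plain linear map with `tanh` in place of a real convolutional neural network are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    def __init__(self, in_dim, feat_dim, out_dim):
        self.cnn_w = rng.standard_normal((in_dim, feat_dim))      # stand-in for the CNN
        self.fc_w = rng.standard_normal((feat_dim + 1, out_dim))  # first full-connection layer

    def first_feature(self, content):
        """'First feature' extracted from the resource content."""
        return np.tanh(content @ self.cnn_w)

    def feature_expression(self, content, interest):
        # The full-connection layer consumes the first feature together with
        # the second feature (degree-of-interest parameter) to yield the
        # feature expression of the resource.
        first = self.first_feature(content)
        return np.concatenate([first, [interest]]) @ self.fc_w

tower_a, tower_b = Tower(8, 4, 3), Tower(8, 4, 3)
va = tower_a.feature_expression(rng.standard_normal(8), 0.7)
vb = tower_b.feature_expression(rng.standard_normal(8), 0.2)
# Similarity between the two feature expressions (cosine, as one choice).
sim = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Each tower maps one resource independently; only the final feature expressions interact, which is the characteristic property of a two-tower design.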
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the training the feature extraction model using the positive sample pair and the negative sample pair includes:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
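One training step over a labeled sample pair might look like the following sketch. The dimensions, learning rate, and use of binary cross-entropy are assumptions; only the structure (second full-connection layer, classification into positive/negative pair, loss-driven parameter update) follows the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
fc2 = rng.standard_normal(6) * 0.1  # second full-connection layer (6 = 3 + 3)

def step(v1, v2, label, lr=0.1):
    """One update: classify the pair's feature expressions, then adjust fc2."""
    global fc2
    x = np.concatenate([v1, v2])          # classification feature input
    p = 1.0 / (1.0 + np.exp(-(fc2 @ x)))  # predicted probability of "positive pair"
    # Cross-entropy between prediction category and labeled category.
    loss = -(label * np.log(p + 1e-9) + (1 - label) * np.log(1 - p + 1e-9))
    fc2 -= lr * (p - label) * x           # gradient of the loss w.r.t. fc2
    return float(loss)
```

Repeated steps on the same positive pair should drive the loss down, which is the behaviour the model-parameter adjustment aims for.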
Optionally, the method further comprises:
Acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
the method for constructing the positive sample pair by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set specifically comprises the following steps:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
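The candidate-set screening by specified operation record can be sketched as below; the record field names and the play-duration threshold are illustrative assumptions, not the patent's data model.

```python
def positive_sample_set(candidate_set, records, min_play_seconds=10):
    """Keep accessed resources carrying a specified operation record:
    played longer than the appointed duration, or liked / forwarded /
    collected / followed / commented."""
    kept = set()
    for rid in candidate_set:
        rec = records.get(rid, {})
        if rec.get("play_seconds", 0) > min_play_seconds or any(
            rec.get(op) for op in ("like", "forward", "collect", "follow", "comment")
        ):
            kept.add(rid)
    return kept
```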
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
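A sketch of such a loss, under the assumption that the shared expression is the squared Euclidean distance between feature expressions: the positive-pair term and the negative-pair term are then identical up to sign, so minimizing the total pulls positive pairs together and pushes negative pairs apart.

```python
import numpy as np

def pair_loss(positive_pairs, negative_pairs):
    """positive_pairs / negative_pairs: iterables of (Vl, Vc) feature arrays.
    Returns sum of positive-pair distances minus sum of negative-pair distances."""
    dp = sum(float(np.sum((vl - vc) ** 2)) for vl, vc in positive_pairs)
    dn = sum(float(np.sum((vl - vc) ** 2)) for vl, vc in negative_pairs)
    return dp - dn
```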
Optionally, the method further comprises:
the set of multimedia resources is constructed based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
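The resource-set construction above can be sketched as follows. `SequenceMatcher` stands in for the patent's (unspecified) query-similarity measure, and the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def build_resource_set(queries, results, threshold=0.6):
    """Group queries whose pairwise similarity exceeds the threshold, then
    pool the multimedia resources matched by each query in the group."""
    similar = set()
    for i, qa in enumerate(queries):
        for qb in queries[i + 1:]:
            if SequenceMatcher(None, qa, qb).ratio() > threshold:
                similar.update((qa, qb))
    resource_set = set()
    for q in similar:
        resource_set.update(results.get(q, ()))
    return resource_set
```

For example, two near-duplicate queries pool their matched resources into one set, while an unrelated query contributes nothing.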
Optionally, the method further comprises:
For each query request in the query request set, if the most recently accessed multimedia resource associated with the query request has a specified operation record, that multimedia resource and any multimedia resource in the multimedia resource set accessed by the same account object are labeled as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a second aspect, the present application further provides a method for determining similarity of multimedia resources, where the method includes:
Acquiring a first multimedia resource and a second multimedia resource;
Respectively extracting the feature expressions of the first multimedia resource and the second multimedia resource by adopting a feature extraction model;
determining the similarity of the first multimedia resource and the second multimedia resource based on the respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
the query requests with the similarity higher than the similarity threshold value form the similar query requests, and the multimedia resources matched with the similar query requests form the multimedia resource set;
The positive sample pair is constructed by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative-sample pair is constructed using the non-accessed multimedia assets and the accessed multimedia assets in the set of multimedia assets.
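The similarity determination flow can be sketched as below. `extract` is a placeholder for the trained feature extraction model (here a deterministic random projection keyed on the resource, purely so the example runs); cosine similarity is one common choice of distance.

```python
import numpy as np

def extract(resource_content):
    """Placeholder feature expression for a resource; stands in for the
    trained feature extraction model."""
    rng = np.random.default_rng(abs(hash(resource_content)) % (2**32))
    return rng.standard_normal(16)

def similarity(res_a, res_b):
    """Extract both feature expressions, then score the pair by cosine."""
    va, vb = extract(res_a), extract(res_b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

A resource compared with itself scores 1.0, and any pair scores within [-1, 1], matching the usual reading of a similarity built from feature-expression distance.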
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, training a feature extraction model using the positive sample pair and the negative sample pair, including:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model includes a term for the feature similarity of the two samples in the positive sample pair and a term for the degree of difference between the two samples in the negative sample pair, and the two terms have the same expression but opposite signs.
Optionally, the method further comprises:
the set of multimedia resources is constructed based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the method further comprises:
For each query request in the query request set, if the most recently accessed multimedia resource associated with the query request has a specified operation record, that multimedia resource and any multimedia resource in the multimedia resource set accessed by the same account object are labeled as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the method further comprises:
Acquiring the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
screening multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
The method for constructing the positive sample pair by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set specifically comprises the following steps:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a third aspect, the present application further provides a training device for a feature extraction model of a multimedia resource, where the device includes:
a positive sample pair construction module configured to construct a positive sample pair using multimedia resources in a multimedia resource set that have been accessed by the same account object, wherein the access times of the two samples in the positive sample pair fall within the same time window; a negative sample pair construction module configured to construct a negative sample pair using non-accessed multimedia resources and accessed multimedia resources in the multimedia resource set, wherein the multimedia resources matched with similar query requests form the multimedia resource set, and query requests whose similarity is higher than a similarity threshold constitute the similar query requests;
And a training module configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used to extract similarities between multimedia resources.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, executing the training feature extraction model using the positive sample pair and the negative sample pair, the training module being specifically configured to:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
performing the constructing of positive sample pairs using the multimedia assets in the multimedia asset set accessed by the same account object, the sample pair constructing module being specifically configured to:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function adopted for training the feature extraction model includes a term for the feature similarity of the two samples in the positive sample pair and a term for the degree of difference between the two samples in the negative sample pair, and the two terms have the same expression but opposite signs.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes the set of positive sample pairs (l, c), Dn denotes the set of negative sample pairs (l, c), Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the sample pair construction module is further configured to, for each query request in the query request set, label the most recently accessed multimedia resource associated with the query request and any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, if that most recently accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests may satisfy either condition, the similarity threshold adopted for query requests whose query time interval is smaller than the first duration is smaller than the similarity threshold adopted for query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a fourth aspect, the present application also provides a device for determining similarity of multimedia resources, where the device includes:
An acquisition module configured to acquire a first multimedia resource and a second multimedia resource;
A feature expression extraction module configured to extract feature expressions of the first multimedia resource and the second multimedia resource, respectively, using a feature extraction model;
A similarity determination module configured to determine a similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
query requests whose similarity is higher than the similarity threshold form the similar query requests, and the multimedia resources matched by the similar query requests form the multimedia resource set;
the positive sample pair is constructed using the multimedia resources accessed by the same account object in the multimedia resource set, and the access times of the two samples in the positive sample pair fall within the same time window;
the negative sample pair is constructed using the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises a training module configured to train a feature extraction model with the positive sample pair and the negative sample pair based on the following method:
acquiring the respective feature expressions of the first multimedia resource and the second multimedia resource in any sample pair;
performing feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource using a second full-connection layer to obtain a classification feature;
classifying the classification feature to obtain a prediction category of the sample pair, wherein the classification categories include positive sample pair and negative sample pair;
and determining a loss value using the prediction category and the labeled category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted for training the feature extraction model includes the feature similarity of the two samples in the positive sample pair and the degree of difference of the two samples in the negative sample pair, where the two terms have the same expression but opposite signs.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the apparatus further includes:
And a sample pair construction module configured to, for each query request in the query request set, if the last accessed multimedia resource associated with the query request has a specified operation record, label that multimedia resource together with any multimedia resource accessed by the same account object in the multimedia resource set as a positive sample pair.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is smaller than the first duration and query requests of the same account object, the similarity threshold adopted for the former is smaller than the similarity threshold adopted for the latter.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen out multimedia resources with specified operation records from the candidate resource set to obtain a positive sample resource set;
The sample pair construction module is used for constructing positive sample pairs by adopting the multimedia resources accessed by the same account object in the multimedia resource set based on the following method:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp denotes that sample l and sample c form a positive sample pair, Dn denotes that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
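As a concrete illustration of such a loss, the sketch below assumes squared Euclidean distance between feature expressions; since the patent's formula image is not reproduced in the text, the distance measure, function names, and data layout are illustrative assumptions. One shared distance expression applies to both pair types with the sign flipped for negative pairs, so minimizing the total pulls positive pairs together and pushes negative pairs apart.

```python
import numpy as np

def pair_loss(v_l, v_c, is_positive):
    """Squared-distance term for one sample pair; the sign flips for negative
    pairs, mirroring the claim that both terms share one expression with
    opposite signs (distance choice assumed)."""
    d = float(np.sum((np.asarray(v_l) - np.asarray(v_c)) ** 2))
    return d if is_positive else -d

def total_loss(positive_pairs, negative_pairs):
    """Sum over Dp (positive pairs) minus sum over Dn (negative pairs)."""
    return (sum(pair_loss(l, c, True) for l, c in positive_pairs)
            + sum(pair_loss(l, c, False) for l, c in negative_pairs))
```

With one positive pair at distance 1 and one negative pair at distance 4, the total is negative, showing that well-separated negative pairs lower the loss.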
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than a specified time length;
like, forward, favorite, follow, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
In a fifth aspect, the present application also provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement any of the methods as provided in the first and/or second aspects of the application.
In a sixth aspect, an embodiment of the application also provides a computer readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform any of the methods as provided in the first and/or second aspects of the application.
In a seventh aspect, an embodiment of the application provides a computer program product comprising a computer program which, when executed by a processor, implements any of the methods as provided in the first and/or second aspects of the application.
The technical scheme provided by the embodiments of the application has the following beneficial effects. The same or similar multimedia resources can be found through different query requests, so there is some similarity between query requests for the same or similar multimedia resources; in other words, similar query requests correspond to similar multimedia resources. To better construct positive sample pairs, the embodiments of the application screen the multimedia resources corresponding to similar query requests using a time window. A time window can be used because two access operations with similar access times under the same or similar query requests tend to be directed at very similar multimedia resources. Therefore, positive sample pairs screened based on the time window have high practicability and accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a sample construction and training method of a feature extraction model of a multimedia resource according to an embodiment of the present application;
FIG. 3 is a second flowchart of a training method of a feature extraction model of a multimedia resource according to an embodiment of the application;
FIG. 4 is a schematic structural diagram of a feature extraction model of a multimedia resource according to an embodiment of the present application;
FIG. 5 is a third flow chart of a training method of a feature extraction model of a multimedia resource according to an embodiment of the application;
Fig. 6 is a flowchart illustrating a method for determining similarity of multimedia resources according to an embodiment of the present application;
FIG. 7 is a block diagram of a training device for feature extraction model of multimedia resources according to an embodiment of the present application;
Fig. 8 is a block diagram of a device for determining similarity of multimedia resources according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a structure of an electronic device according to an exemplary embodiment.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions of the present application, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
In the following, some terms in the embodiments of the present application are explained for easy understanding by those skilled in the art.
(1) The term "plurality" in embodiments of the present application means two or more, and other adjectives are similar.
(2) "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
(3) The server serves the terminal, for example by providing resources to the terminal and storing terminal data; the server corresponds to an application program installed on the terminal and operates in cooperation with the application program on the terminal.
(4) The terminal device may refer to a software application (APP) or a client. It has a visual display interface and can interact with a user; it corresponds to a server and provides local services for the client. Apart from some applications that run only locally, software applications are typically installed on a common client terminal and need to run in conjunction with a server. With the development of the internet, commonly used applications include, for example, short-video applications, email clients, and instant-messaging clients. Such applications require a corresponding server and service program in the network to provide services such as database services and configuration-parameter services, so a specific communication connection needs to be established between the client terminal and the server terminal to ensure the normal operation of the application.
(5) In the embodiment of the application, the multimedia resources refer to various resources which can be accessed in the network, such as audio resources, video resources, webpage resources and the like.
(6) Feature expression, which refers to information describing features of a multimedia resource, can extract high-level features from the multimedia resource for use by subsequent applications.
In view of the problems in the related art that the labeling work is complex and difficult, and that the low yield of positive and negative sample pairs makes the training of the convolutional neural network model complicated, an embodiment of the present application provides a solution.
The method provided by the embodiment of the application is not only suitable for video, but also suitable for the feature extraction of any multimedia resource such as audio resources, webpage resources and the like.
The inventive concept of the embodiment of the present application may be summarized as follows: based on users' access records for multimedia resources, an access request set is constructed by collecting the same or similar access requests; positive sample pairs are then constructed by selecting, from the multimedia resource set corresponding to the access request set, multimedia resources accessed within the same time window; and negative sample pairs are constructed from non-accessed multimedia resources together with accessed multimedia resources. In this way, the same or similar multimedia resources can be screened based on the same or similar access requests, and screening resources within the same time window based on users' access behaviors ensures the accuracy of positive sample pair construction. In addition, automatic mining and labeling of negative sample pairs is achieved based on non-accessed and accessed multimedia resources. Thus, the present application simplifies the labeling of positive and negative sample pairs.
After the design idea of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limiting. In the specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs. It should be noted that the user information related to the present application is obtained based on the user authorization.
Referring to fig. 1, a schematic diagram of an application scenario of the model training and similarity determining method for multimedia resources according to an embodiment of the present application is shown. The application scenario includes a plurality of terminal devices 101 (terminal device 101-1, terminal device 101-2, ..., terminal device 101-n) and a server 102. The terminal device 101 and the server 102 are connected through a wireless or wired network, and the terminal device 101 includes, but is not limited to, electronic devices such as a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, an intelligent wearable device, and an intelligent television. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
Of course, the method provided by the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
The user may send a query request to the server 102 through the terminal device 101, and the server 102 may search for related multimedia resources based on the query request and push them to the terminal device 101 for presentation. The user can then select, from the displayed results, the multimedia resources that meet the search intention and access them. Thus, based on the user's access operations, the server 102 can mine the samples that conform to the user's search intention for a given query request, while the non-accessed multimedia resources are samples that do not conform to that search intention. By analyzing the same or similar search intentions, positive sample pairs can be constructed by mining samples of the same or similar search intentions, and negative sample pairs can be constructed from samples that do not belong to the search intention together with samples that do, thereby realizing automatic mining and labeling of positive and negative sample pairs. The constructed positive and negative sample pairs conform to users' search intentions and can be adapted to different query requests without being limited by the cognition of labeling personnel. Hence the categories of the constructed positive and negative sample pairs can be richer, the training of the feature extraction model is more complete, and the model can learn more features and extract features that express deep semantics.
In order to further explain the technical solution provided by the embodiments of the present application, the following details are described with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on routine or non-inventive effort. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to that provided by the embodiments of the present application.
For ease of understanding, the training method of the feature extraction model for multimedia resources and the method for determining the similarity between multimedia resources provided by the present application are described below.
1. Training of feature extraction models
(1) Labeling of positive and negative sample pairs
The same or similar multimedia resources can be found through different query requests. As such, there is some similarity between query requests for the same or similar multimedia resources; in other words, similar query requests correspond to similar multimedia resources. To better construct positive sample pairs, the embodiment of the present application screens the multimedia resources corresponding to similar query requests using a time window. A time window can be used because two access operations with similar access times under the same or similar query requests tend to be directed at very similar multimedia resources. Therefore, positive sample pairs screened based on the time window have high practicability and accuracy.
In view of this, in the embodiment of the present application, the query requests with similarity higher than the similarity threshold form the similar query requests, and the matched multimedia resources of the similar query requests form the multimedia resource set. In practice, the set of multimedia resources may be constructed based on the following method, as shown in fig. 2, comprising the steps of:
in step 201, a similarity of each of a plurality of query requests is determined.
In the embodiment of the application, the feature extraction can be carried out on the query request to obtain the semantic features of the query request, and then the similarity between the query requests is determined based on the distance between the semantic features.
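As an illustrative sketch of this step, the snippet below compares two hypothetical query-request embeddings with cosine similarity; the actual embedding model and similarity measure are not specified in this section, so both are assumptions here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two semantic feature vectors (cosine assumed)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical semantic features extracted from two query requests
q1 = [0.8, 0.1, 0.6]
q2 = [0.7, 0.2, 0.6]
sim = cosine_similarity(q1, q2)  # close to 1.0 for near-identical intents
```

Query requests whose pairwise similarity exceeds the applicable threshold would then be grouped together in step 202.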
The query request may be triggered based on a search keyword, and the query request for extracting semantic features may include the search keyword, and may further include a query category. The query class is user specified. For example, a user queries specific categories under a broad category of television dramas, such as ancient costume dramas, suspense dramas, year of production, actors, etc.
In step 202, obtaining at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
in practice, in order to be able to better mine positive sample pairs, a plurality of query requests for building a set of query requests need to satisfy at least one of the following conditions:
And 1, the inquiry time interval is smaller than the first duration.
Query request sets constructed from query requests satisfying the condition and having a similarity above a similarity threshold enable short term interest modeling of the user.
For example, if the embedding features of two query requests are highly similar and the query time interval between them is less than 30 minutes, the two queries are combined into one query request set (session).
Condition 2: query requests of the same account object. This condition is used to model the long-term interest of the user and to obtain positive sample pairs accordingly.
In practice, the similarity threshold used for short-term interest modeling in condition 1 and the similarity threshold used for long-term interest modeling in condition 2 may be different. For example, the conditions for short-term interest modeling may be relaxed with a similarity threshold that is lower than that employed for long-term interest modeling.
In implementation, for example, long-term interest modeling under condition 2 requires higher similarity between the query requests in the same query request set; the higher similarity threshold it adopts ensures that the grouped requests reflect essentially the same query of the same user.
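The session-grouping logic described by condition 1 can be sketched as follows; the `are_similar` predicate stands in for the embedding-similarity check, and the greedy single-pass grouping is an illustrative simplification, not the patent's prescribed algorithm.

```python
from datetime import datetime, timedelta

def build_sessions(queries, are_similar, max_gap=timedelta(minutes=30)):
    """Greedy grouping: a query joins the current session when it is similar
    to the session's last query and arrives within the time-gap limit."""
    sessions = []
    for q in sorted(queries, key=lambda q: q["time"]):
        s = sessions[-1] if sessions else None
        if s and are_similar(s[-1], q) and q["time"] - s[-1]["time"] < max_gap:
            s.append(q)
        else:
            sessions.append([q])
    return sessions
```

In practice `are_similar` would compare embedding features against the (lower) short-term similarity threshold; here an exact-text match serves as a stand-in.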
In step 203, the multimedia resource set is constructed by using the multimedia resources corresponding to each query request in the query request set.
For example, the multimedia resources corresponding to the query request 1 include a and b, the multimedia resources corresponding to the query request 2 include b and c, and the multimedia resources corresponding to the query request 3 include d and e. The multimedia asset set corresponding to query requests 1-3 includes a, b, c, d, e.
On the basis of obtaining the multimedia resource set, positive and negative samples can be constructed for complete model training, and the method comprises the following steps as shown in fig. 2:
in step 204, a positive sample pair is constructed using the multimedia assets in the multimedia asset set that are accessed by the same account object, wherein the access times of the two samples in the positive sample pair are taken from the same time window.
For example, both long-term interest modeling and short-term interest modeling can yield a corresponding query request set. Each query request in the set obtains a corresponding query result, that is, the multimedia resources corresponding to that query request; the query results of all requests in the same query request set form a multimedia resource set, and which multimedia resources in each query result were accessed is determinable information. Therefore, for the accessed (that is, similar) multimedia resources in the multimedia resource set, positive sample pairs are constructed from the multimedia resources accessed within the same time window in order to ensure higher similarity.
For example, a sequence of the multimedia resources is obtained by sorting according to access time, and any two accessed multimedia resources falling within the same time window, determined by the time window size, are taken to construct a positive sample pair.
For example, suppose the query time interval between query request A and query request B is smaller than the first duration and their similarity is higher than the similarity threshold. The multimedia resources searched and presented based on query request A and query request B are shown in Table 1:
The sequence (a1, a2, b1, a3, b2) is obtained by sorting in order of access time. With a time window size of 2, the positive sample pairs formed for a1 include [a1, a2] and [a1, b1]; for a2, the pairs include [a1, a2], [a2, b1] and [a2, a3]; and so on.
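One plausible reading of this worked example, in which each accessed resource is paired with the next `window` resources in access-time order, can be sketched as follows; the function name and the forward-only pairing convention are assumptions for illustration.

```python
def positive_pairs(access_sequence, window=2):
    """Pair each accessed resource with the next `window` resources in the
    access-time-ordered sequence, matching the worked example in the text."""
    pairs = []
    for i, a in enumerate(access_sequence):
        for b in access_sequence[i + 1 : i + 1 + window]:
            pairs.append((a, b))
    return pairs
```

Applied to the sequence (a1, a2, b1, a3, b2) with window size 2, this yields [a1, a2] and [a1, b1] for a1, and [a2, b1] and [a2, a3] looking forward from a2, consistent with the example above.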
In some embodiments, in order to further ensure the accuracy of the constructed positive sample pair, in the embodiments of the present application, a multimedia resource set corresponding to the query request set may be screened, and the effectively accessed multimedia resource may be screened to construct the positive sample pair. Can be implemented as shown in fig. 3 comprising the steps of:
in step 301, the accessed multimedia resources in the multimedia resource set are acquired, and a candidate resource set is obtained.
For example, if the multimedia resources corresponding to query request A are a1, a2, a3 and a4, of which the first three are accessed, then a1, a2 and a3 are used to construct the candidate resource set, and a4 is treated as a non-accessed multimedia resource.
In step 302, multimedia resources with specified operation records are screened from the candidate resource set, and a positive sample resource set is obtained.
The specified operation records are used to screen out valid accesses to multimedia resources, which may also be called valid clicks. The specified operation record may include at least one of: a playing time length longer than the specified time length, or an operation attribute such as like, forward, favorite, follow, or comment.
For example, if short video a1 is accessed and its playing time length is longer than the specified time length, a1 has a specified operation record. Likewise, if a1 receives any one of a like, forward, favorite, follow, or comment after being accessed, a1 also has a specified operation record; in either case, a1 is a validly accessed multimedia resource.
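The valid-access screening of steps 301 and 302 might look like the following sketch; the field names, the interaction list, and the 10-second playback threshold are illustrative assumptions.

```python
def is_valid_access(record, min_play_seconds=10):
    """A click counts as a valid access if playback exceeded the specified
    duration or the user performed any of the listed interactions."""
    interactions = ("like", "forward", "favorite", "follow", "comment")
    return (record.get("play_seconds", 0) > min_play_seconds
            or any(record.get(op) for op in interactions))

def positive_sample_resources(candidates):
    """Step 302: keep only candidates with a specified operation record."""
    return [r for r in candidates if is_valid_access(r)]
```

A resource watched for 30 seconds passes on playback time alone; a resource watched briefly but liked still passes; a resource with neither is filtered out as noise.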
The validly accessed multimedia resources in the candidate resource set are screened out to obtain a positive sample resource set; then, in step 303, positive sample pairs are constructed for the positive sample resource set using the multimedia resources accessed by the same account object.
In this way, by screening positive samples with specified operation records, noise data from invalid accesses can be filtered out, improving the accuracy of the constructed positive sample pairs.
In another embodiment, to increase the diversity and number of positive sample pairs, satisfactorily accessed multimedia resources are defined in an embodiment of the present application. A satisfactorily accessed multimedia resource is the last multimedia resource accessed under one query request, provided that it has a specified operation record. Positive sample pairs are then constructed from both the satisfactorily accessed multimedia resources and the multimedia resources accessed (or validly accessed) by the same account object.
In step 205, a negative sample pair is constructed with the non-accessed multimedia asset and the accessed multimedia asset in the set of multimedia assets.
It should be noted that the execution order of constructing the positive sample pairs in step 204 and constructing the negative sample pairs in step 205 is not limited: step 204 may be executed before step 205, step 205 before step 204, or the two steps may be executed simultaneously; all of these are suitable for the embodiment of the present application.
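Negative-pair construction in step 205 can be sketched as a cross-product of accessed and non-accessed resources from the same multimedia resource set; pairing every accessed resource with every non-accessed one is an illustrative choice, since the text does not fix a sampling strategy.

```python
def negative_pairs(accessed, not_accessed):
    """Each (accessed, non-accessed) combination from the same multimedia
    resource set yields an automatically labelled negative pair."""
    return [(a, n) for a in accessed for n in not_accessed]
```

For the earlier example where a1, a2, a3 were accessed and a4 was not, this produces the pairs (a1, a4), (a2, a4) and (a3, a4).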
(2) Parameter training process of feature extraction model
After the positive and negative sample pairs are obtained, a feature extraction model may be trained using the positive and negative sample pairs in step 206.
The structure of the feature extraction model to be trained is shown in fig. 4. The model is a double-tower network structure in which each tower includes a convolutional neural network and a first full-connection layer; a second full-connection layer is connected after the first full-connection layers of the towers, followed by a logistic regression layer that predicts whether a sample pair is a positive sample pair or a negative sample pair. Wherein:
The convolutional neural network of a first tower network structure in the double tower network structure is used for extracting the first characteristic of the first multimedia resource;
the first full-connection layer of the first tower network structure is used for performing feature extraction on the first feature of the first multimedia resource and the second feature of the first multimedia resource to obtain the feature expression of the first multimedia resource;
the first full-connection layer of the second tower network structure is used for performing feature extraction on the first feature of the second multimedia resource and the second feature of the second multimedia resource to obtain the feature expression of the second multimedia resource;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
The convolutional neural network is used to perform feature extraction either on the content of the multimedia resource together with its attention-degree parameter Ctr, or on the content of the multimedia resource alone, to obtain the first feature. The attention-degree parameter, hereinafter also referred to as the second feature, may include at least one of: a click-volume index parameter, such as the click volume or click rate; an attention-volume index parameter, such as the attention volume or attention rate; a praise-volume index parameter, such as the praise volume or praise rate; a comment-volume index parameter, such as the total comment volume or comment rate, where the comment rate can be understood as the ratio between the users who posted a comment and the users who accessed the resource; and a forwarding-volume index parameter, such as the forwarding volume or forwarding rate, where the forwarding rate can be understood as the ratio between the number of forwarding users and the number of accessing users.
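The attention-degree parameters above can be derived from raw counters. In this sketch the counter names (`access_users`, `comment_users`, and so on) are hypothetical; only the rate definitions follow the text:

```python
def attention_degree(stats):
    """Compute the rate-style attention-degree parameters described
    above from hypothetical raw counters for one multimedia resource.
    Per the text, e.g. the comment rate is the ratio between users who
    posted a comment and users who accessed the resource."""
    accesses = max(stats["access_users"], 1)  # guard against division by zero
    return {
        "click_rate":   stats["clicks"] / max(stats["impressions"], 1),
        "like_rate":    stats["likes"] / accesses,
        "comment_rate": stats["comment_users"] / accesses,
        "forward_rate": stats["forward_users"] / accesses,
    }
```

Either these rates, the raw volumes, or both could serve as the second feature fed to the network.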
The first full-connection layer is used for extracting the first characteristics and the attention degree parameters to obtain the characteristic expression of the multimedia resource.
The feature extraction model to be trained, with its two-tower structure, is shown in fig. 4. In fig. 4, the first fully-connected layer and the second fully-connected layer may each include one or more fully-connected layers. In connection with fig. 4, the training process includes the steps shown in fig. 5:
In step 501, respective feature expressions of the first multimedia resource and the second multimedia resource in any pair of samples, i.e. feature expressions of the first full connection layer output, are acquired.
In step 502, a second full connection layer is used to perform feature extraction on the feature expressions of the first multimedia resource and the second multimedia resource, so as to obtain classification features, that is, features input to a logistic regression layer softmax.
In step 503, the classification feature is subjected to classification processing, so as to obtain a predicted class of any sample pair, where the classification classes for classification include a positive sample pair and a negative sample pair.
Softmax performs classification processing based on the classification features to determine whether the two input videos form a positive sample pair or a negative sample pair. The predicted category can then be compared with the annotated category: in step 504, a loss value is determined using the predicted category and the annotated category, and the model parameters of the feature extraction model are adjusted based on the loss value.
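Steps 503 and 504 can be illustrated with a minimal sketch. The two-class logit layout and the cross-entropy form of the per-pair loss are assumptions consistent with softmax classification, not details taken from the embodiment:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pair_loss(logits, label):
    """Steps 503-504 in miniature: classify a sample pair as positive
    (label 1) or negative (label 0) via softmax over two-class logits,
    then score the prediction against the labeled category with
    cross-entropy."""
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    loss = -math.log(probs[label])
    return predicted, loss
```

A confident, correct prediction yields a small loss; the same logits scored against the opposite label yield a large one, which is the gradient signal used to adjust the model parameters.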
In implementation, the loss function adopted for training the feature extraction model includes the feature similarity of the two samples in a positive sample pair and the degree of difference of the two samples in a negative sample pair, where the two terms share the same expression but have opposite signs. A simple formula can therefore be used to determine the loss value, and the losses of positive and negative sample pairs are taken into account simultaneously. The loss function is designed to minimize the distance between the two samples in a positive pair and maximize the degree of difference between the two samples in a negative pair. The model is thus trained to group similar multimedia resources together and to separate dissimilar multimedia resources as far as possible, improving the accuracy of the features the model extracts from multimedia resources.
In practice, the loss function used is as shown in equation (1), and the training goal is to make the result of equation (1) as small as possible.
Wherein Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc represents the feature expression of sample c, and Vl represents the feature expression of sample l.
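The image of equation (1) is not reproduced in this text, so the concrete expression below is an assumption. It is chosen only to satisfy the stated properties: the positive-pair term and the negative-pair term share the same expression with opposite signs, and minimizing the total pulls positive pairs together while pushing negative pairs apart:

```python
def pairwise_loss(Vl, Vc, is_positive):
    """Squared distance between the two feature expressions, taken with
    a positive sign for positive pairs (Dp) and a negative sign for
    negative pairs (Dn) -- same expression, different signs, as the
    text describes. The choice of squared distance is an assumption;
    equation (1) itself is not reproduced here."""
    d = sum((x - y) ** 2 for x, y in zip(Vl, Vc))
    return d if is_positive else -d

def total_loss(pairs):
    # pairs: iterable of (Vl, Vc, is_positive); the training goal is to
    # make this sum as small as possible.
    return sum(pairwise_loss(*p) for p in pairs)
```

Minimizing the sum drives positive-pair distances toward zero and negative-pair distances as large as possible, matching the stated training goal.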
Taking a video as an example, the video may be sampled and the resulting image sequence input into the feature extraction model, or single frame images of the video may be input. When an image sequence is input, the final feature expression is that of the entire image sequence. When single frame images are input, the feature expression of each frame of the video can be obtained and then spliced into the feature expression of the image sequence; alternatively, classification prediction is performed separately for each sampled frame of the first video against each sampled frame of the other video, and the predicted category of each sampled frame is compared with the corresponding labeled category to obtain the loss value.
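The single-frame route with splicing can be sketched as follows; `frame_encoder` stands in for the per-frame convolutional path of the model and, like the sampling stride, is a hypothetical placeholder:

```python
def sequence_expression(video_frames, frame_encoder, stride=5):
    """Sample every `stride`-th frame of the video, obtain each sampled
    frame's feature expression with the hypothetical `frame_encoder`,
    then splice (concatenate) them into the feature expression of the
    image sequence, as described above."""
    sampled = video_frames[::stride]
    expression = []
    for frame in sampled:
        expression.extend(frame_encoder(frame))  # splice per-frame expressions
    return expression
```

The alternative route, per-frame classification prediction with per-frame losses, would instead feed each sampled frame pair through the classifier separately.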
2. Determining similarity based on feature extraction model
After the feature extraction model is trained, the similarity of multimedia resources can be determined based on the model; the feature extraction model is trained once and can then be reused repeatedly for similarity determination. Fig. 6 is a schematic flow chart of a method for determining similarity using the feature extraction model according to the embodiment of the application, which includes the following steps:
In step 601, a first multimedia resource and a second multimedia resource are acquired;
In step 602, extracting feature expressions of the first multimedia resource and the second multimedia resource respectively by using a feature extraction model;
In step 603, a similarity of the first multimedia asset and the second multimedia asset is determined based on the respective feature expressions of the first multimedia asset and the second multimedia asset.
In implementation, the cosine distance of the two embedding features is used to obtain the similarity of the two multimedia resources.
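Step 603 can be made concrete with a plain cosine computation over the two embedding feature expressions (the text's "cosine distance" is taken here as the cosine similarity, with 1 meaning identical direction):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding feature expressions: the cosine of
    the angle between them, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Two multimedia resources whose feature expressions point in nearly the same direction score close to 1 and are judged similar.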
Based on the same inventive concept, the present application provides a training device for a feature extraction model of a multimedia resource, as shown in fig. 7, the device 700 includes:
A sample pair construction module 701 configured to construct a positive sample pair using multimedia resources in the multimedia resource set accessed by the same account object, and to construct a negative sample pair using the non-accessed multimedia resources and the accessed multimedia resources in the multimedia resource set, wherein the access times of the two samples in the positive sample pair are taken from the same time window, the multimedia resource set is formed by the multimedia resources matched by similar query requests, and similar query requests are query requests whose similarity is higher than a similarity threshold;
A training module 702 configured to train a feature extraction model using the positive sample pair and the negative sample pair, wherein the feature extraction model is used to extract the similarity between multimedia resources.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, in training the feature extraction model using the positive sample pair and the negative sample pair, the training module is specifically configured to:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
In constructing positive sample pairs using the multimedia resources in the multimedia resource set accessed by the same account object, the sample pair construction module is specifically configured to:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
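The three set-construction steps above can be sketched as follows. The query-similarity function `similar(q1, q2)` and the `results` mapping are hypothetical placeholders; the embodiment does not fix how query similarity is computed:

```python
def build_resource_set(queries, results, similar, threshold):
    """Keep the query requests whose similarity to at least one other
    query exceeds the threshold (the 'similar query requests'), then
    pool the multimedia resources matched by each of them into the
    multimedia resource set.

    results[q] lists the multimedia resources matched by query q.
    """
    kept = {q1 for q1 in queries
            for q2 in queries
            if q1 != q2 and similar(q1, q2) > threshold}
    resource_set = set()
    for q in kept:
        resource_set |= set(results[q])
    return kept, resource_set
```

Sample pairs are then drawn from this pooled resource set, as described in the modules above.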
Optionally, the sample pair construction module is further configured to, for each query request in the query request set, label the last accessed multimedia resource associated with the query request together with any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, provided that the last accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
The embodiment of the present application further provides a device for determining similarity of multimedia resources based on the same inventive concept, as shown in fig. 8, the device 800 includes:
An acquisition module 801 configured to acquire a first multimedia resource and a second multimedia resource;
A feature expression extraction module 802 configured to extract feature expressions of the first multimedia asset and the second multimedia asset, respectively, using a feature extraction model;
a similarity determination module 803 configured to determine a similarity of the first multimedia resource and the second multimedia resource based on respective feature expressions of the first multimedia resource and the second multimedia resource;
the feature extraction model is trained in advance based on positive sample pairs and negative sample pairs, wherein:
similar query requests are query requests whose similarity is higher than the similarity threshold, and the multimedia resources matched by the similar query requests form the multimedia resource set;
The positive sample pair is constructed by adopting the multimedia resources which are accessed by the same account object in the multimedia resource set, and the access time of two samples in the positive sample pair is taken from the same time window;
the negative-sample pair is constructed using the non-accessed multimedia assets and the accessed multimedia assets in the set of multimedia assets.
Optionally, the feature extraction model is of a double-tower network structure, wherein each tower comprises a convolutional neural network and a first full-connection layer;
The convolutional neural network of a first tower network structure in the double-tower network structure is used for extracting a first characteristic of a first multimedia resource; the convolutional neural network of the second tower network structure in the double tower network structure is used for extracting the first characteristic of the second multimedia resource;
the first full-connection layer of the first tower network structure is used for extracting the first characteristics of the first multimedia resources and the second characteristics of the first multimedia resources to obtain the characteristic expression of the first multimedia resources;
the first full-connection layer of the second tower network structure is used for extracting the first characteristics of the second multimedia resources and the second characteristics of the second multimedia resources to obtain the characteristic expression of the second multimedia resources;
The feature representation of the first multimedia asset and the feature representation of the second multimedia asset are used to determine a similarity between the first multimedia asset and the second multimedia asset.
Optionally, the second feature is a degree of interest parameter of a multimedia resource, and the convolutional neural network of the first tower network structure is specifically configured to perform feature extraction on the content of the first multimedia resource and the degree of interest parameter of the first multimedia resource to obtain the first feature of the first multimedia resource, or perform feature extraction on the content of the first multimedia resource to obtain the first feature of the first multimedia resource;
The convolutional neural network of the second tower network structure is specifically configured to perform feature extraction on the content of the second multimedia resource and a parameter of attention degree of the second multimedia resource to obtain the first feature of the second multimedia resource, or perform feature extraction on the content of the second multimedia resource to obtain the first feature of the second multimedia resource.
Optionally, the apparatus further comprises a training module configured to train a feature extraction model with the positive sample pair and the negative sample pair based on the following method:
Acquiring respective characteristic expressions of a first multimedia resource and a second multimedia resource in any sample pair;
extracting the characteristics of the characteristic expressions of the first multimedia resource and the second multimedia resource by adopting a second full-connection layer to obtain classification characteristics;
classifying the classification features to obtain prediction categories of any sample pair, wherein the classification categories for division comprise positive sample pairs and negative sample pairs;
and determining a loss value by adopting a prediction category and a labeling category, and adjusting model parameters of the feature extraction model based on the loss value.
Optionally, the loss function adopted by the training feature extraction model includes feature similarity of two samples in the positive sample pair and difference degree of the two samples in the negative sample pair, and the expression of the feature similarity of the two samples in the positive sample pair is the same as the expression of the difference degree of the two samples in the negative sample pair, and the sign is different.
Optionally, the apparatus further includes:
A multimedia resource set construction module configured to construct the multimedia resource set based on the following method:
determining the similarity of each query request in the plurality of query requests;
Acquiring at least two query requests with similarity higher than a corresponding similarity threshold value, and obtaining a query request set;
And constructing the multimedia resource set by adopting the multimedia resources corresponding to each query request in the query request set.
Optionally, the apparatus further includes:
And the sample pair construction module is configured to, for each query request in the query request set, label the last accessed multimedia resource associated with the query request together with any multimedia resource in the multimedia resource set accessed by the same account object as a positive sample pair, provided that the last accessed multimedia resource has a specified operation record.
Optionally, the plurality of query requests satisfy at least one of the following conditions:
The query time interval is less than the first duration;
query requests for the same account object.
Optionally, when the plurality of query requests include both query requests whose query time interval is less than the first duration and query requests of the same account object, the similarity threshold adopted for the query requests whose query time interval is less than the first duration is smaller than the similarity threshold adopted for the query requests of the same account object.
Optionally, the apparatus further includes:
the candidate resource determining module is configured to acquire the accessed multimedia resources in the multimedia resource set to obtain a candidate resource set;
the positive sample resource set determining module is configured to screen multimedia resources with appointed operation records from the candidate resource set to obtain a positive sample resource set;
The sample pair construction module is used for constructing positive sample pairs by adopting the multimedia resources accessed by the same account object in the multimedia resource set based on the following method:
And constructing positive sample pairs for the positive sample resource set by adopting the multimedia resources accessed by the same account object.
Optionally, the loss function is used to minimize the distance between two samples in the positive pair and maximize the degree of difference between two samples in the negative pair.
Optionally, the loss function is configured to minimize a value of the following formula:
where Dp indicates that sample l and sample c form a positive sample pair, Dn indicates that sample l and sample c form a negative sample pair, Vc is the feature expression of sample c, and Vl is the feature expression of sample l.
Optionally, the specified operation record includes at least one of the following:
the playing time length is longer than the appointed time length;
praise, forward, collect, focus, comment.
Optionally, the degree of interest parameter includes at least one of the following parameters:
click volume index parameter, attention volume index parameter, praise volume index parameter, comment volume index parameter, and forwarding volume index parameter.
Having described the model training method and processing method of multimedia resources and related apparatuses of exemplary embodiments of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the application may comprise at least one processor and at least one memory. The memory stores program code that, when executed by the processor, causes the processor to perform the methods according to the various exemplary embodiments of the present application described above in the present specification. For example, the processor may perform the steps of the model training method and the similarity determination method for multimedia resources.
An electronic device 130 according to this embodiment of the application is described below with reference to fig. 9. The electronic device 130 shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the electronic device 130 is embodied in the form of a general-purpose electronic device. The components of the electronic device 130 may include, but are not limited to, the at least one processor 131, the at least one memory 132, and a bus 133 connecting the various system components, including the memory 132 and the processor 131.
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
Memory 132 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324, such program modules 1324 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the electronic device 130, and/or any device (e.g., router, modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur through an input/output (I/O) interface 135. Also, electronic device 130 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 130, including, but not limited to, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory 132 comprising instructions executable by the processor 131 of the apparatus 700 or the processor 131 of the apparatus 800 to perform the model training method and the similarity determination method for multimedia resources. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by the processor 131, implements any one of the model training method and the similarity determination method of the multimedia resource as provided by the present application.
In an exemplary embodiment, aspects of the model training method and the similarity determination method for a multimedia resource provided by the present application may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the model training method and the similarity determination method for a multimedia resource according to the various exemplary embodiments of the present application described above when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the model training method and the similarity determination method for multimedia resources according to embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) and include program code and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device, partly on the remote electronic device, or entirely on the remote electronic device or server. In the case of remote electronic devices, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (e.g., connected through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
While preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (55)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111339230.7A CN113919446B (en) | 2021-11-12 | 2021-11-12 | Model training and similarity determining method and device for multimedia resources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113919446A CN113919446A (en) | 2022-01-11 |
CN113919446B true CN113919446B (en) | 2025-04-08 |
Family
ID=79246167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111339230.7A Active CN113919446B (en) | 2021-11-12 | 2021-11-12 | Model training and similarity determining method and device for multimedia resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113919446B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114579869B (en) * | 2022-05-05 | 2022-07-22 | 腾讯科技(深圳)有限公司 | Model training method and related product |
CN116821674A (en) * | 2023-05-22 | 2023-09-29 | 北京达佳互联信息技术有限公司 | Network training and feature representation methods, devices, media and equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113032589A (en) * | 2021-03-29 | 2021-06-25 | 北京奇艺世纪科技有限公司 | Multimedia file recommendation method and device, electronic equipment and readable storage medium |
CN113051368A (en) * | 2021-03-24 | 2021-06-29 | 北京百度网讯科技有限公司 | Double-tower model training method, double-tower model searching device and electronic equipment |
CN113469298A (en) * | 2021-09-03 | 2021-10-01 | 北京达佳互联信息技术有限公司 | Model training method and resource recommendation method |
CN114782719A (en) * | 2022-04-26 | 2022-07-22 | 北京百度网讯科技有限公司 | A training method, object retrieval method and device for feature extraction model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111708964B (en) * | 2020-05-27 | 2023-06-20 | 北京百度网讯科技有限公司 | Recommendation method and device for multimedia resources, electronic equipment and storage medium |
CN112258285A (en) * | 2020-10-26 | 2021-01-22 | 北京沃东天骏信息技术有限公司 | Content recommendation method and device, equipment and storage medium |
2021-11-12: application CN202111339230.7A filed; granted as patent CN113919446B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11790933B2 (en) | Systems and methods for manipulating electronic content based on speech recognition | |
JP7688458B2 (en) | Cognitive Video and Voice Search Aggregation | |
Pohl et al. | Online indexing and clustering of social media data for emergency management | |
US11126682B1 (en) | Hyperlink based multimedia processing | |
US9489626B2 (en) | Systems and methods for identifying and notifying users of electronic content based on biometric recognition | |
CN112148881B (en) | Methods and devices for outputting information | |
US10878020B2 (en) | Automated extraction tools and their use in social content tagging systems | |
CN114328947A (en) | Knowledge graph-based question and answer method and device | |
CN113806588B (en) | Method and device for searching videos | |
JP2015162244A (en) | Methods, programs and computation processing systems for ranking spoken words | |
CN111625715A (en) | Information extraction method and device, electronic equipment and storage medium | |
CN112015928A (en) | Information extraction method and device of multimedia resource, electronic equipment and storage medium | |
CN115033739A (en) | Search method, model training method, device, electronic equipment and medium | |
Ahmed et al. | Sentiment analysis for smart cities: state of the art and opportunities | |
CN113919446B (en) | Model training and similarity determining method and device for multimedia resources | |
US20250200428A1 (en) | Cluster-based few-shot sampling to support data processing and inferences in imperfect labeled data environments | |
CN115114395A (en) | Content retrieval and model training method and device, electronic equipment and storage medium | |
CN114443904A (en) | Video query method, video query device, computer equipment and computer readable storage medium | |
CN111639234B (en) | Method and apparatus for mining core entity concerns | |
CN115630170B (en) | Document recommendation method, system, terminal and storage medium | |
CN114117239A (en) | A method, device and device for pushing a house | |
US20220171808A1 (en) | Heuristic video searching | |
KR102046224B1 (en) | Apparatus for providing personalized contents | |
CN120123970B (en) | Omnimedia fusion method and system based on complementary fusion | |
Sidiropoulos et al. | Framework of a collaborative audio analysis and visualization tool for data journalists |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||