Self-supervised multi-modal fusion music recommendation method
Technical Field
The invention belongs to the field of music recommendation, and particularly relates to a self-supervised multi-modal fusion music recommendation method.
Background
The balanced extraction, processing, and matching of multi-modal information is complex work. Existing multi-modal music recommendation algorithms cannot balance recommendation accuracy against multi-modal balance, and cannot make accurate recommendations for the e-commerce market.
Based on the recommendation information used, multi-modal music recommendation algorithms can be divided into video soundtrack retrieval algorithms based on emotion tags and those based on the semantic content of the data. The former uses emotion tags as an intermediate bridge between the audio and visual modality data and recommends music for a video according to the degree of matching between the emotion tags of the two modalities. The latter extracts feature representations of each modality's data and then learns cross-modal data associations based on those features to complete the multi-modal music recommendation.
Previous researchers have developed multi-modal music recommendation algorithms by building larger-scale datasets covering more modalities, or by iterating toward better data processing and machine learning algorithms.
However, current multi-modal music recommendation algorithm research has mainly the following three problems:
First, at the data level, there is currently no large-scale multi-modal dataset in the industry. Many video-music retrieval studies rely on manual labeling, which is time-consuming and labor-intensive at large scale and may introduce subjective bias.
Second, existing datasets contain no e-commerce or commodity information, so existing algorithms are not suited to multi-modal music recommendation tasks in e-commerce scenarios. In audio-visual matching methods, the video modality is highly complex, comprising picture features, audio features, and so on, while the music modality comprises rich audio features; the large semantic gap between the two modalities makes it harder for a model to learn the association between video and music.
Third, most existing research focuses on the picture features of videos and the audio features of music and cannot effectively handle fine-grained features. In terms of application, most current work trains models on MV (music video) data from the Internet, but the video footage in MV data is often deliberately shot to fit the audio, so the resulting models struggle to adapt to real video-scoring scenarios.
The process of producing an e-commerce advertisement includes shooting and editing video and then selecting the most suitable background music from a mass of music in different styles, which is a laborious and time-consuming task. It is therefore necessary to design a multi-modal music recommendation algorithm for matching music to e-commerce advertisement videos, helping merchants produce advertisements efficiently and quickly.
Disclosure of Invention
The invention discloses a self-supervised multi-modal fusion music recommendation method that recommends background music whose modal features are similar to those of a video, and can be used to match music to e-commerce advertisement videos.
A self-supervised multi-modal fusion music recommendation method comprises the following steps:
(1) Collecting multi-modal data, including video, text, audio, and emotion information;
(2) Extracting features from the collected multi-modal data, obtaining video features, text features, audio features, and emotion features respectively;
(3) Using a feature weighted fusion module, fusing the video features with the text features and the audio features with the emotion features, respectively, to obtain a video-text fusion vector and a corresponding audio-emotion fusion vector;
(4) Projecting the video-text fusion vector and the corresponding audio-emotion fusion vector into a common projection space to obtain a video-text projection and the corresponding audio-emotion projection as a positive sample pair, and constructing negative sample pairs by random sampling;
(5) Performing matching training on the music recommendation model with the positive and negative sample pairs, using a multi-modal contrastive learning strategy. During training, the model computes similarity with a distance function, so that the distance between a video-text projection and its matched music-emotion projection in the common projection space keeps shrinking, while its distance to unmatched music-emotion projections gradually grows;
(6) Extracting the video features and text features of the video data for which music is to be recommended, and weighting them with the weighted fusion module to obtain the fusion vector of the video and text to be matched, which serves as the retrieval query;
(7) Projecting the audio-emotion fusion vectors of the existing music data and the retrieval query into the common projection space, computing a similarity matrix with the music recommendation model, and ranking the results; the music with the highest similarity is recommended as the video background music.
The method has good application scenarios in the e-commerce field; in step (1), the collected multi-modal data come from an e-commerce platform.
Short videos on an e-commerce platform usually contain rich picture features, background-music features, and text features, so multi-modal signal feature processing and cross-modal retrieval techniques are required to consider the multi-modal data comprehensively and recommend background music accordingly.
In step (2), the audio features are extracted as follows:
first, frequency-domain features of the music are extracted from the music pieces of the multi-modal data with the torchaudio tool, and these are input into an AST network to obtain high-order audio features.
The emotion features are extracted as follows:
low-order features related to music emotion are extracted from the music pieces of the multi-modal data with the OpenSmile tool and used as the emotion features.
The video features are extracted as follows:
video frame pictures are sampled from the video at a set sampling rate, the sampled frames are input into a pre-trained Inception network model to obtain a feature vector for each frame, and temporal global average pooling is applied to obtain a video-level feature vector representation as the video features.
The text features are extracted as follows:
the category text features of the commodity are extracted with the pre-trained natural language processing model Bert-wwm, finally yielding the text features.
In step (3), the feature weighted fusion module adds a modality-balancing module, Weighted Fusion, at the feature fusion stage; the fusion process is given by the following formulas:
g(X, Y) = σ([X, Y] W_g + b_g)
[X′, Y′] = [X, Y] ⊙ g(X, Y)
F = [X′, Y′] W_F + b_F
where X and Y are the input data of the two modalities, and concatenating the two modal features gives [X, Y]; g(X, Y) is the modal weight coefficient, and σ denotes the sigmoid activation function. The concatenated modal feature [X, Y] is multiplied element-wise by the modal weight coefficient g(X, Y) to obtain the weighted concatenated feature [X′, Y′]; W_g, b_g, W_F, and b_F are all learnable parameters, and F is the final two-modality fusion output.
In step (5), when the positive and negative sample pairs are used to train the music recommendation model, a triplet margin loss is applied for cross-matching in both directions: retrieving audio-emotion projections from video-text projections, and retrieving video-text projections from audio-emotion projections. The formulas are as follows:
L_cross_match = β · L_{v→a} + L_{a→v}
L_{v→a} = max(0, α + D(z_vt, z_ae⁺) − D(z_vt, z_ae⁻))
L_{a→v} = max(0, α + D(z_ae, z_vt⁺) − D(z_ae, z_vt⁻))
where L_{v→a} denotes the loss function for retrieving an audio-emotion projection from a video-text projection; D denotes the cosine distance function; z_vt denotes a video-text projection; z_ae⁺ denotes the music-emotion projection matched with the video-text projection z_vt; z_ae⁻ denotes a music-emotion projection obtained by random sampling; and α is a settable margin parameter. L_{a→v} denotes the loss function for retrieving a video-text projection from an audio-emotion projection; z_ae denotes a music-emotion projection; z_vt⁺ denotes the video-text projection matched with the music-emotion projection z_ae; and z_vt⁻ denotes a randomly sampled video-text projection. L_cross_match denotes the triplet margin loss, and β is the weight parameter controlling the video→music retrieval loss.
Compared with the prior art, the invention has the following beneficial effects:
1. The disclosed method, built on feature extraction, feature weighted fusion, and music recommendation, comprehensively considers multi-modal information such as video features, music features, text features, and emotion features, improving both the efficiency and the accuracy of music recommendation.
2. The method has good application scenarios in the e-commerce field; applying the multi-modal music recommendation algorithm to music matching for e-commerce advertisement videos can greatly improve the efficiency of advertisement production.
Drawings
Fig. 1 is an overall architecture diagram of the self-supervised multi-modal fusion music recommendation method of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples; it should be noted that the examples described below are intended to facilitate understanding of the invention and are not intended to limit it in any way.
As shown in fig. 1, a self-supervised multi-modal fusion music recommendation method mainly comprises three steps, namely multi-modal feature extraction, feature weighted fusion, and music recommendation, and specifically comprises the following:
1. Collect the video data for which music is to be recommended and extract the modal features of the video.
2. Weight the visual features of the video and the text features of the video content with the weighted fusion module to form the retrieval query.
3. Project the audio-emotion weighted results of the existing music data and the retrieval query into the multi-modal common space, compute a similarity matrix with the music recommendation model, rank the results, and recommend the music with the highest similarity as the video background music.
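As a minimal sketch of the ranking step above, assuming the projections are already computed and that cosine similarity is the similarity measure, the recommendation reduces to a normalized dot product and a top-k sort (the function name and tensor shapes are illustrative, not from the patent):

```python
import torch
import torch.nn.functional as F

def recommend_music(query_proj, music_projs, top_k=5):
    """Rank candidate music by cosine similarity to the video-text
    query projection in the common space; highest similarity wins."""
    q = F.normalize(query_proj, dim=-1)      # (d,)  unit-normalise the query
    m = F.normalize(music_projs, dim=-1)     # (n, d) unit-normalise candidates
    sims = m @ q                             # cosine similarities, (n,)
    scores, indices = sims.topk(min(top_k, m.size(0)))
    return indices.tolist(), scores.tolist()
```

In practice the candidates' audio-emotion projections can be precomputed and cached, so each query costs one matrix-vector product.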
The three components adopted in the invention, namely the multi-modal feature extraction method, the feature weighted fusion module, and the multi-modal contrastive learning module, are described in detail below.
The multi-modal feature extraction method is as follows:
For the audio features, extraction is performed with the audio feature extraction backbone AST. AST is pre-trained on the large-scale audio dataset AudioSet and achieves leading results on audio classification tasks. First, frequency-domain features (Fbank) of the music are extracted from the music pieces with the torchaudio tool; these are input into the AST network to obtain high-order audio features. Considering the importance of music emotion style to soundtrack recommendation shown in previous work, low-order features related to music emotion are extracted from the music pieces with the OpenSmile tool, yielding the emotion features. Compared with the high-order audio features extracted by AST, the low-order audio features extracted with OpenSmile more directly reflect the emotional style of the background music.
Video features are extracted with the Inception network model and temporal global average pooling. Video frame pictures are sampled from the video at a fixed sampling rate, and the extracted frames are input into a pre-trained Inception model to obtain a feature vector for each frame. Finally, temporal global average pooling (Temporal global average pooling) is used to obtain a video-level feature vector representation.
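The frame-level encoding and temporal pooling described above can be sketched as follows; here `backbone` stands in for the pre-trained Inception model (any per-frame encoder with the same interface works), and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

def temporal_global_average_pooling(frame_features):
    """Average per-frame feature vectors (T, d) over the time axis to
    obtain a single video-level representation (d,)."""
    return frame_features.mean(dim=0)

def video_representation(frames, backbone):
    """frames: (T, C, H, W), sampled at a fixed rate (1 fps in this
    document); backbone: a frozen per-frame encoder returning (T, d)."""
    with torch.no_grad():
        per_frame = backbone(frames)   # (T, d) frame-level features
    return temporal_global_average_pooling(per_frame)
```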
For text features, extraction is performed with the natural language processing model Bert-wwm. The data in this work come from a Chinese e-commerce platform, and the language environment is mainly Chinese; therefore, a Bert-wwm model pre-trained on the Chinese Wikipedia is used to extract the category text features of the commodities, finally yielding the text feature vectors.
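A small sketch of the text-feature step: the patent does not specify how the Bert-wwm output is pooled into a single vector, so taking the [CLS] position is shown here as one common choice, and the HuggingFace model id in the comment is an assumption.

```python
import torch

def cls_text_feature(last_hidden_state):
    """Pool a BERT-style output (batch, seq_len, hidden) down to
    sentence-level features (batch, hidden) by taking the [CLS]
    position (index 0); one common choice, not stated in the patent."""
    return last_hidden_state[:, 0, :]

# Hypothetical usage with HuggingFace Transformers (not run here):
#   from transformers import BertTokenizer, BertModel
#   tok = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
#   bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")
#   out = bert(**tok(["category text"], return_tensors="pt"))
#   text_feat = cls_text_feature(out.last_hidden_state)
```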
Feature weighted fusion module:
the invention introduces a more flexible modal balancing module (Weighted Fusion) to replace the traditional Fusion method in the feature Fusion stage. Given two different modality data, the calculation is performed as follows
g(X, Y) = σ([X, Y] W_g + b_g)
[X′, Y′] = [X, Y] ⊙ g(X, Y)
F = [X′, Y′] W_F + b_F
In the formulas, X and Y are the input data of the two modalities, and concatenating the two modal features gives [X, Y]; g(X, Y) is the modal weight coefficient, and σ denotes the sigmoid activation function. The concatenated modal feature [X, Y] is multiplied element-wise by the modal weight coefficient g(X, Y) to obtain the weighted concatenated feature [X′, Y′]; W_g, b_g, W_F, and b_F are all learnable parameters, and F is the final two-modality fusion output.
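The three formulas above map directly onto a small PyTorch module; the dimensions are illustrative, and implementing the gate as a single linear layer is the straightforward reading of W_g and b_g:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Modality-balancing fusion: gate the concatenated features
    [X, Y] with a learned sigmoid weight g(X, Y), then linearly
    project to the fused output F."""
    def __init__(self, dim_x, dim_y, dim_out):
        super().__init__()
        d = dim_x + dim_y
        self.gate = nn.Linear(d, d)        # W_g, b_g
        self.proj = nn.Linear(d, dim_out)  # W_F, b_F

    def forward(self, x, y):
        xy = torch.cat([x, y], dim=-1)     # [X, Y]
        g = torch.sigmoid(self.gate(xy))   # g(X, Y)
        gated = xy * g                     # [X', Y'] = [X, Y] ⊙ g(X, Y)
        return self.proj(gated)            # F = [X', Y'] W_F + b_F
```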
By introducing the Weighted Fusion module, the model weights the two modalities during training and learns to extract the parameters of high importance to the soundtrack recommendation task from the two modalities being fused, completing the video-music retrieval task better and more flexibly.
Multi-modal contrastive learning module:
the invention adopts a multi-mode contrast learning strategy to give a batch of video-text fusion vectors andprojection z of audio-emotion fusion vector in public space vt Z ae The data negative sample pairs are then constructed by means of Random sampling (Random sampling).
The invention adopts a triplet margin loss (Triplet margin loss) to reduce the distance between data and positive samples and increase the distance between data and negative samples. The triplet margin loss cross-matches in both directions, video retrieving music and music retrieving video; it is computed as follows:
L_cross_match = β · L_{v→a} + L_{a→v}
L_{v→a} = max(0, α + D(z_vt, z_ae⁺) − D(z_vt, z_ae⁻))
L_{a→v} = max(0, α + D(z_ae, z_vt⁺) − D(z_ae, z_vt⁻))
where L_{v→a} denotes the loss function for retrieving an audio-emotion projection from a video-text projection; D denotes the cosine distance function; z_vt denotes a video-text projection; z_ae⁺ denotes the music-emotion projection matched with the video-text projection z_vt; z_ae⁻ denotes a music-emotion projection obtained by random sampling; and α is a settable margin parameter. L_{a→v} denotes the loss function for retrieving a video-text projection from an audio-emotion projection; z_ae denotes a music-emotion projection; z_vt⁺ denotes the video-text projection matched with the music-emotion projection z_ae; and z_vt⁻ denotes a randomly sampled video-text projection. L_cross_match denotes the triplet margin loss, and β is the weight parameter controlling the video→music retrieval loss.
Through continuous optimization, the distance in the common space between a video-text projection z_vt and its matched music-emotion projection z_ae⁺ keeps shrinking during training, while the distance to the unmatched projections z_ae⁻ gradually grows.
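A hedged sketch of the bidirectional triplet margin loss: the exact per-triplet formula is not fully legible in the source, so the standard max(0, α + D(anchor, positive) − D(anchor, negative)) form is assumed, with D the cosine distance and the defaults α = 0.1 and β = 3.0 taken from the stated settings.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    """D(a, b) = 1 - cos(a, b), a cosine distance."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def cross_match_loss(z_vt, z_ae_pos, z_ae_neg, z_vt_neg, alpha=0.1, beta=3.0):
    """Bidirectional triplet margin loss, a sketch of
    L_cross_match = beta * L_{v->a} + L_{a->v}. z_ae_neg and z_vt_neg
    are randomly sampled in-batch negatives; alpha is the margin."""
    l_v2a = F.relu(alpha + cosine_distance(z_vt, z_ae_pos)
                         - cosine_distance(z_vt, z_ae_neg)).mean()
    l_a2v = F.relu(alpha + cosine_distance(z_ae_pos, z_vt)
                         - cosine_distance(z_ae_pos, z_vt_neg)).mean()
    return beta * l_v2a + l_a2v
```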
The model details and settings of the invention are as follows:
The network model is implemented in PyTorch. The data embedding dimension is set to 1024; for the two-layer common-space projection, the first-layer dimension is set to 512 and the second-layer dimension to 256; in L_cross_match, α is set to 0.1 and β to 3.0. The model is trained on an RTX 3090 with the Adam optimizer, with the batch size (Batch size) set to 1024 and the learning rate set to 0.0003. Training runs for 30 epochs on the training set; after each epoch, the model is validated on the validation set and the validation loss recorded, preventing overfitting. Video frames are sampled from the start of the video at 1 frame/sec, up to at most 300 frames; the pictures are resized to 299x299 and data-normalized so that values lie in [0, 1], then input into the Inception network, and a video-level representation is obtained with temporal global average pooling.
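The stated projection dimensions (1024-d embeddings, then 512 and 256 in the common space) suggest a two-layer projection head of the following shape; the ReLU between the layers is an assumption, as the patent does not name the activation:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer projection into the common space using the stated
    dimensions (1024 -> 512 -> 256); the intermediate ReLU is an
    assumed choice, not taken from the patent."""
    def __init__(self, dim_in=1024, dim_hidden=512, dim_out=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.ReLU(),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.net(x)
```

One head of this form would be applied to each fusion vector (video-text and audio-emotion) so that both land in the same 256-d retrieval space.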
The foregoing embodiments have described in detail the technical solution and the advantages of the present invention, it should be understood that the foregoing embodiments are merely illustrative of the present invention and are not intended to limit the invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the invention.