
CN116127126A - Self-supervision multi-mode fusion music recommendation method - Google Patents


Info

Publication number
CN116127126A
Authority
CN
China
Prior art keywords
video
music
emotion
features
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211560638.1A
Other languages
Chinese (zh)
Inventor
张克俊
唐睿源
马玏
吴鑫达
张铁耀
仲崇珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211560638.1A priority Critical patent/CN116127126A/en
Publication of CN116127126A publication Critical patent/CN116127126A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval of audio data
    • G06F16/64: Browsing; Visualisation therefor
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686: Retrieval using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract



The invention discloses a self-supervised multimodal fusion music recommendation method, comprising: collecting multimodal data; performing feature extraction and weighted feature fusion on the multimodal data to obtain video-text fusion vectors and audio-emotion fusion vectors; projecting each video-text fusion vector and its corresponding audio-emotion fusion vector into a common space to obtain positive sample pairs, and constructing negative sample pairs by random sampling; performing matching training with a multimodal contrastive learning strategy; for video data requiring a music recommendation, computing the video-text fusion vector to be matched; projecting the audio-emotion weighted results of existing music data together with the video-text fusion vector to be matched into the common space, computing and sorting a similarity matrix, and recommending the music with the highest similarity as the video background music. The invention can recommend background music with modal features similar to those of a given video and can be applied to the video scoring of e-commerce advertisements.


Description

Self-supervision multi-mode fusion music recommendation method
Technical Field
The invention belongs to the field of music recommendation, and particularly relates to a self-supervision multi-mode fusion music recommendation method.
Background
The balanced extraction, processing, and matching of multimodal information is complex work. Existing multimodal music recommendation algorithms cannot simultaneously achieve recommendation accuracy and balance across modalities, and cannot make accurate recommendations for the e-commerce scenario.
Based on the recommendation information used, multimodal music recommendation algorithms can be divided into video-scoring retrieval algorithms based on emotion tags and those based on the semantic content of the data. The former uses emotion tags as an intermediate bridge between the audio and visual modality data, and recommends music for a video according to the matching degree of the emotion tags of the two modalities. The latter extracts feature representations of all modality data and then learns data correlations based on those features to finally complete the multimodal music recommendation.
Previous researchers have advanced multimodal music recommendation algorithms by building larger-scale datasets with more modalities, or by iterating better data processing and machine learning algorithms.
However, current research on multimodal music recommendation algorithms has mainly the following three problems:
First, at the data level, there is currently no large-scale multimodal dataset in the industry. Many video-music retrieval studies rely on manual labeling, which is time-consuming and labor-intensive on large-scale datasets and may introduce a certain subjective bias.
Second, existing datasets contain no e-commerce or commodity information, so existing algorithms cannot be applied to multimodal music recommendation tasks in the e-commerce scenario. Moreover, in audio-visual matching, the video modality is highly complex, comprising picture features, audio features, and so on, while the music modality contains rich audio features; a large semantic gap exists between the two modalities, which makes it harder for a model to learn the association between video and music.
Third, most existing research focuses on the picture features of videos and the audio features of music and cannot effectively handle fine-grained features. In terms of application, current work mostly trains models on MV (music video) data from the Internet, but the video picture features in MV data are often deliberately shot to fit the audio itself, so the resulting models struggle to adapt to real video-scoring application scenarios.
Producing an e-commerce advertisement involves shooting and editing video and then selecting the most suitable background music from a mass of music in different styles, which is a very laborious and time-consuming task. It is therefore necessary to design a multimodal music recommendation algorithm for scoring e-commerce advertisement videos, helping merchants produce e-commerce advertisements efficiently.
Disclosure of Invention
The invention discloses a self-supervised multimodal fusion music recommendation method that can recommend background music with modal features similar to those of a given video, and can be used for the video scoring work of e-commerce advertisements.
A self-supervision multi-mode fusion music recommendation method comprises the following steps:
(1) Collecting multi-modal data including video, text, audio, emotion information;
(2) Extracting features of the collected multi-mode data, and respectively extracting video features, text features, audio features and emotion features;
(3) Respectively fusing the video features with the text features, the audio features and the emotion features by utilizing a feature weighted fusion module to obtain a video-text fusion vector and a corresponding audio-emotion fusion vector;
(4) Projecting the video-text fusion vector and the corresponding audio-emotion fusion vector into a public projection space to obtain video-text projection and the corresponding audio-emotion projection as positive sample pairs, and constructing negative sample pairs in a random sampling mode;
(5) Carrying out matching training on the music recommendation model with the positive and negative sample pairs by adopting a multimodal contrastive learning strategy; in the training process, the model uses a distance function to calculate similarity, so that the distance between a video-text projection and its matched music-emotion projection in the common projection space is continuously reduced, while the distance between the video-text projection and unmatched music-emotion projections is gradually enlarged;
(6) Extracting video features and text features of video data to be subjected to music recommendation, and obtaining fusion vectors of videos and texts to be matched after weighting by a weighting fusion module, wherein the fusion vectors are used as search query keywords;
(7) And projecting the audio-emotion fusion vector of the existing music data and the search query keywords into a public projection space, calculating a similarity matrix according to a music recommendation model, and sequencing, wherein the music with the highest similarity is recommended as video background music.
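The retrieval in steps (6) and (7) reduces to a cosine-similarity ranking between projections in the common space. The following NumPy sketch illustrates the idea with hypothetical toy dimensions (the function names and the 4-dimensional vectors are illustrative, not the patent's PyTorch projections):

```python
import numpy as np

def cosine_sim_matrix(queries, music):
    """Pairwise cosine similarity between query projections and music projections."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    m = music / np.linalg.norm(music, axis=1, keepdims=True)
    return q @ m.T

def recommend(query_projs, music_projs, top_k=1):
    """Return, for each query row, the indices of the top_k most similar tracks."""
    sims = cosine_sim_matrix(query_projs, music_projs)
    return np.argsort(-sims, axis=1)[:, :top_k]  # sort descending by similarity

# toy example: 2 query videos, 3 candidate tracks, 4-dim projections
queries = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
music = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.0, 1.0, 0.1, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
print(recommend(queries, music))  # each row holds the best-matching track index
```

In the patent's setting the query rows would be video-text projections and the candidate rows the audio-emotion projections of the music library.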
The method has good application scenarios in the e-commerce field; in the step (1), the collected multimodal data come from an e-commerce platform.
Short videos in an e-commerce platform usually contain rich picture features, background music features and text features, so that multi-modal signal feature processing and cross-modal retrieval technologies are required to comprehensively consider multi-modal data, and background music recommendation is performed according to the multi-modal data.
In the step (2), the manner of extracting the audio features is as follows:
firstly, extracting frequency domain features of music from a music piece of multi-mode data by means of a torchaudio tool, and inputting the frequency domain features into an AST network to obtain high-order audio features.
The way of extracting emotion features is as follows:
and extracting low-order features related to music emotion from the music pieces of the multi-mode data by using an OpenSmile tool as emotion features.
The way to extract video features is as follows:
and acquiring video frame pictures from the video at a set sampling rate, inputting the acquired video frames into a pre-trained Inception network model to obtain a feature vector corresponding to each frame picture, and obtaining a video-level feature vector representation as the video features by using temporal global average pooling.
The text features are extracted as follows:
extracting the category text features of the commodity by using a pre-trained natural language processing model Bert-wwm, and finally obtaining the text features.
In the step (3), the feature weighted fusion module adds a modality balancing module (Weighted Fusion) in the feature fusion stage, and the specific fusion process is given by:
g(X, Y) = σ([X, Y]W_g + b_g)
[X', Y'] = [X, Y] ⊙ g(X, Y)
F = [X', Y']W_F + b_F
where X and Y are input data of different modalities, and concatenating the two modality features gives [X, Y]; g(X, Y) is the modal weight coefficient, and σ denotes the sigmoid activation function. The element-wise product of the modal concatenation feature [X, Y] and the modal weight coefficient g(X, Y) gives the weighted modal concatenation feature [X', Y']; W_g, b_g, W_F, and b_F are all learnable parameters, and F is the output of the final two-modality fusion.
In the step (5), when the positive and negative sample pairs are used for matching training of the music recommendation model, the triplet margin loss function performs cross-matching in two directions, retrieving audio-emotion projections from video-text projections and retrieving video-text projections from audio-emotion projections, as follows:
L_v→a = max(0, D(z_vt, z_ae^+) - D(z_vt, z_ae^-) + α)
L_a→v = max(0, D(z_ae, z_vt^+) - D(z_ae, z_vt^-) + α)
L_cross_match = β ⊙ L_v→a + L_a→v
where L_v→a denotes the loss function for retrieving audio-emotion projections from video-text projections, D denotes the cosine distance function, z_vt denotes a video-text projection, z_ae^+ denotes the music-emotion projection matched with z_vt, z_ae^- denotes a music-emotion projection obtained by random sampling, and α denotes a settable margin parameter; L_a→v denotes the loss function for retrieving video-text projections from audio-emotion projections, z_ae denotes a music-emotion projection, z_vt^+ denotes the video-text projection matched with z_ae, and z_vt^- denotes a randomly sampled video-text projection; L_cross_match denotes the triplet margin loss function, and β denotes the weight parameter used to control the video-music retrieval loss.
Compared with the prior art, the invention has the following beneficial effects:
1. The method disclosed by the invention, based on feature extraction, weighted feature fusion, and music recommendation, comprehensively considers multimodal information such as video features, music features, text features, and emotion features, improving the efficiency and accuracy of music recommendation.
2. The method has good application scenarios in the field of e-commerce; applying the multimodal music recommendation algorithm to the video-scoring work of e-commerce advertisements can greatly improve the efficiency of e-commerce advertisement production.
Drawings
Fig. 1 is an overall architecture diagram of the self-supervised multi-modal fusion music recommendation method of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples, it being noted that the examples described below are intended to facilitate the understanding of the invention and are not intended to limit the invention in any way.
As shown in fig. 1, the self-supervised multi-modal fusion music recommendation method mainly comprises three parts: multimodal feature extraction, weighted feature fusion, and music recommendation. The specific recommendation steps are as follows:
1. Collect the video data for which music is to be recommended, and extract the modal features of the video.
2. Weight the visual features of the video and the text features of the video content through the weighted fusion module to serve as the search query keywords.
3. Project the audio-emotion weighted results of the existing music data and the search query keywords into the multimodal common space, calculate the similarity matrix with the music recommendation model, sort, and recommend the music with the highest similarity as the video background music.
The multimodal feature extraction method, the feature weighted fusion module, and the multimodal contrastive learning module adopted in the invention are described in detail below.
The multi-mode feature extraction method comprises the following steps:
For the audio features, extraction is performed using the audio feature extraction backbone network AST. AST is pre-trained on the large-scale audio dataset AudioSet and achieves leading results on audio classification tasks. The frequency-domain features (Fbank) of the music are first extracted from the music clips with the torchaudio tool and input into the AST network to finally obtain high-order audio features. Considering the importance of music emotion style to soundtrack recommendation shown in previous research, low-order features related to music emotion are extracted from the music clips with the OpenSmile tool to finally obtain the emotion features. Compared with the high-order audio features extracted by AST, the low-order audio features extracted with OpenSmile reflect the emotion style of the background music more directly.
Video features are extracted using the Inception network model and temporal global average pooling. Video frame pictures are acquired from the video at a certain sampling rate, and the extracted video frames are then input into the pre-trained Inception network model to obtain a feature vector corresponding to each frame picture. Finally, temporal global average pooling (Temporal global average pooling) is used to obtain a video-level feature vector representation.
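Temporal global average pooling simply averages the per-frame feature vectors along the time axis. A minimal NumPy sketch, with an illustrative frame count and feature dimension rather than the real Inception output size:

```python
import numpy as np

def temporal_global_average_pooling(frame_features):
    """Collapse (num_frames, feat_dim) per-frame features into one video-level vector."""
    return frame_features.mean(axis=0)

# toy example: 3 sampled frames, each with a 4-dim feature vector
frames = np.array([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 2.0, 1.0, 0.0],
                   [2.0, 2.0, 2.0, 2.0]])
video_vec = temporal_global_average_pooling(frames)
print(video_vec)  # → [2. 2. 2. 2.]
```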
For the text features, extraction is performed using the natural language processing model Bert-wwm. The data of this research come from a Chinese e-commerce platform, and the language environment is mainly Chinese. Therefore, a Bert-wwm model pre-trained on the Chinese Wikipedia is used to extract the category text features of the commodity, finally obtaining the text feature vector.
And a feature weighted fusion module:
The invention introduces a more flexible modality balancing module (Weighted Fusion) to replace the traditional fusion method in the feature fusion stage. Given data of two different modalities, the calculation proceeds as follows:
g(X, Y) = σ([X, Y]W_g + b_g)
[X', Y'] = [X, Y] ⊙ g(X, Y)
F = [X', Y']W_F + b_F
where X and Y are input data of different modalities, and concatenating the two modality features gives [X, Y]; g(X, Y) is the modal weight coefficient, and σ denotes the sigmoid activation function. The element-wise product of the modal concatenation feature [X, Y] and the modal weight coefficient g(X, Y) gives the weighted modal concatenation feature [X', Y']; W_g, b_g, W_F, and b_F are all learnable parameters, and F is the output of the final two-modality fusion.
By introducing the Weighted Fusion module, the algorithm model weights the two modalities during training and learns to extract the parameters of the two fused modalities that are most important for the video soundtrack recommendation task, so that the video-music retrieval task can be completed better and more flexibly.
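The Weighted Fusion computation can be sketched framework-agnostically as follows; the dimensions and the random parameter values are illustrative stand-ins (in the actual model W_g, b_g, W_F, and b_F are learned in PyTorch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_fusion(X, Y, Wg, bg, WF, bF):
    """g(X,Y) = sigmoid([X,Y]Wg + bg); F = ([X,Y] * g(X,Y))WF + bF."""
    XY = np.concatenate([X, Y], axis=-1)   # modal concatenation [X, Y]
    g = sigmoid(XY @ Wg + bg)              # modal weight coefficients
    weighted = XY * g                      # element-wise reweighting (⊙)
    return weighted @ WF + bF              # final two-modality fusion output F

# toy dimensions: two 3-dim modality vectors fused into a 2-dim output
rng = np.random.default_rng(0)
X, Y = rng.standard_normal(3), rng.standard_normal(3)
Wg, bg = rng.standard_normal((6, 6)), np.zeros(6)
WF, bF = rng.standard_normal((6, 2)), np.zeros(2)
F = weighted_fusion(X, Y, Wg, bg, WF, bF)
print(F.shape)  # (2,)
```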
Multimodal contrastive learning module:
The invention adopts a multimodal contrastive learning strategy: given a batch of video-text fusion vectors and audio-emotion fusion vectors, their projections into the common space are z_vt and z_ae, and matched pairs form positive sample pairs; negative sample pairs are then constructed by means of random sampling (Random sampling).
The invention adopts a triplet margin loss (Triplet margin loss) to reduce the distance between data and positive samples and increase the distance between data and negative samples. The triplet margin loss cross-matches in two directions, video retrieving music and music retrieving video, computed as follows:
L_v→a = max(0, D(z_vt, z_ae^+) - D(z_vt, z_ae^-) + α)
L_a→v = max(0, D(z_ae, z_vt^+) - D(z_ae, z_vt^-) + α)
L_cross_match = β ⊙ L_v→a + L_a→v
where L_v→a denotes the loss function for retrieving audio-emotion projections from video-text projections, D denotes the cosine distance function, z_vt denotes a video-text projection, z_ae^+ denotes the music-emotion projection matched with z_vt, z_ae^- denotes a music-emotion projection obtained by random sampling, and α denotes a settable margin parameter; L_a→v denotes the loss function for retrieving video-text projections from audio-emotion projections, z_ae denotes a music-emotion projection, z_vt^+ denotes the video-text projection matched with z_ae, and z_vt^- denotes a randomly sampled video-text projection; L_cross_match denotes the triplet margin loss function, and β denotes the weight parameter used to control the video-music retrieval loss.
Through continuous optimization, the distance in the common space between a video-text projection z_vt and its matched music-emotion projection z_ae^+ is continuously reduced during training, while the distance to the randomly sampled music-emotion projections z_ae^- gradually expands.
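Assuming the standard triplet margin loss with cosine distance (the equation images are not reproduced in this text, so the exact form is a reconstruction from the description), the bidirectional cross-matching loss can be sketched as:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance D(a, b) = 1 - cos(a, b)."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_margin_loss(anchor, positive, negative, alpha=0.1):
    """max(0, D(anchor, positive) - D(anchor, negative) + alpha)."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + alpha)

def cross_match_loss(z_vt, z_ae_pos, z_ae_neg, z_ae, z_vt_pos, z_vt_neg,
                     alpha=0.1, beta=3.0):
    """L_cross_match = beta * L_{v->a} + L_{a->v}."""
    l_va = triplet_margin_loss(z_vt, z_ae_pos, z_ae_neg, alpha)  # video -> music
    l_av = triplet_margin_loss(z_ae, z_vt_pos, z_vt_neg, alpha)  # music -> video
    return beta * l_va + l_av

# toy example: matched projections are close, the random negative points away
z_vt = np.array([1.0, 0.0])
z_ae_pos = np.array([0.9, 0.1])   # matched music-emotion projection
z_ae_neg = np.array([-1.0, 0.0])  # randomly sampled negative
loss = cross_match_loss(z_vt, z_ae_pos, z_ae_neg, z_ae_pos, z_vt, z_ae_neg)
print(loss)  # 0.0: both margins are already satisfied
```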
The model details and settings of the invention are as follows:
The network model is implemented in PyTorch. The data embedding dimension is set to 1024, and the common-space projection has two layers, with the first-layer dimension set to 512 and the second-layer dimension set to 256; in L_cross_match, α is set to 0.1 and β is set to 3.0. The model is trained on an RTX 3090 with the Adam optimizer; during training the batch size (Batch size) is set to 1024 and the learning rate to 0.0003. A total of 30 epochs are trained on the training set, and after each epoch the model is validated on the validation set and the validation loss recorded, preventing the model from overfitting. Video is sampled from its beginning at a rate of 1 frame/sec to obtain at most 300 frames; the pictures are resized to 299x299 and normalized so that the data values lie within [0, 1], then input into the Inception network, and a video-level representation is obtained using the temporal global average pooling method.
The foregoing embodiments have described in detail the technical solution and the advantages of the present invention, it should be understood that the foregoing embodiments are merely illustrative of the present invention and are not intended to limit the invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the invention.

Claims (8)

1. A self-supervision multi-mode fusion music recommendation method is characterized by comprising the following steps:
(1) Collecting multi-modal data including video, text, audio, emotion information;
(2) Extracting features of the collected multi-mode data, and respectively extracting video features, text features, audio features and emotion features;
(3) Respectively fusing the video features with the text features, the audio features and the emotion features by utilizing a feature weighted fusion module to obtain a video-text fusion vector and a corresponding audio-emotion fusion vector;
(4) Projecting the video-text fusion vector and the corresponding audio-emotion fusion vector into a public projection space to obtain video-text projection and the corresponding audio-emotion projection as positive sample pairs, and constructing negative sample pairs in a random sampling mode;
(5) Carrying out matching training on the music recommendation model with the positive and negative sample pairs by adopting a multimodal contrastive learning strategy; in the training process, the model uses a distance function to calculate similarity, so that the distance between a video-text projection and its matched music-emotion projection in the common projection space is continuously reduced, while the distance between the video-text projection and unmatched music-emotion projections is gradually enlarged;
(6) Extracting video features and text features of video data to be subjected to music recommendation, and obtaining fusion vectors of videos and texts to be matched after weighting by a weighting fusion module, wherein the fusion vectors are used as search query keywords;
(7) And projecting the audio-emotion fusion vector of the existing music data and the search query keywords into a public projection space, calculating a similarity matrix according to a music recommendation model, and sequencing, wherein the music with the highest similarity is recommended as video background music.
2. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (1), the collected multimodal data is from an e-commerce platform.
3. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (2), the manner of extracting audio features is as follows:
firstly, extracting frequency domain features of music from a music piece of multi-mode data by means of a torchaudio tool, and inputting the frequency domain features into an AST network to obtain high-order audio features.
4. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (2), the emotion feature is extracted as follows:
and extracting low-order features related to music emotion from the music pieces of the multi-mode data by using an OpenSmile tool as emotion features.
5. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (2), the manner of extracting video features is as follows:
and acquiring video frame pictures from the video at a set sampling rate, inputting the acquired video frames into a pre-trained Inception network model to obtain a feature vector corresponding to each frame picture, and obtaining a video-level feature vector representation as the video features by using temporal global average pooling.
6. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (2), the text features are extracted as follows:
extracting the category text features of the commodity by using a pre-trained natural language processing model Bert-wwm, and finally obtaining the text features.
7. The self-supervising multi-modal fusion music recommendation method according to claim 1, wherein in the step (3), the feature weighted fusion module adds a modality balancing module in the feature fusion stage, and the specific fusion process is given by:
g(X, Y) = σ([X, Y]W_g + b_g)
[X', Y'] = [X, Y] ⊙ g(X, Y)
F = [X', Y']W_F + b_F
where X and Y are input data of different modalities, and concatenating the two modality features gives [X, Y]; g(X, Y) is the modal weight coefficient, and σ denotes the sigmoid activation function; the element-wise product of the modal concatenation feature [X, Y] and the modal weight coefficient g(X, Y) gives the weighted modal concatenation feature [X', Y']; W_g, b_g, W_F, and b_F are all learnable parameters, and F is the output of the final two-modality fusion.
8. The self-supervising multimodal fusion music recommendation method of claim 1, wherein in step (5), when matching training is performed on the music recommendation model using positive and negative sample pairs, the triplet margin loss function performs cross-matching in two directions, retrieving audio-emotion projections from video-text projections and retrieving video-text projections from audio-emotion projections, as follows:
L_v→a = max(0, D(z_vt, z_ae^+) - D(z_vt, z_ae^-) + α)
L_a→v = max(0, D(z_ae, z_vt^+) - D(z_ae, z_vt^-) + α)
L_cross_match = β ⊙ L_v→a + L_a→v
where L_v→a denotes the loss function for retrieving audio-emotion projections from video-text projections, D denotes the cosine distance function, z_vt denotes a video-text projection, z_ae^+ denotes the music-emotion projection matched with z_vt, z_ae^- denotes a music-emotion projection obtained by random sampling, and α denotes a settable margin parameter; L_a→v denotes the loss function for retrieving video-text projections from audio-emotion projections, z_ae denotes a music-emotion projection, z_vt^+ denotes the video-text projection matched with z_ae, and z_vt^- denotes a randomly sampled video-text projection; L_cross_match denotes the triplet margin loss function, and β denotes the weight parameter used to control the video-music retrieval loss.
CN202211560638.1A 2022-12-07 2022-12-07 Self-supervision multi-mode fusion music recommendation method Pending CN116127126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211560638.1A CN116127126A (en) 2022-12-07 2022-12-07 Self-supervision multi-mode fusion music recommendation method

Publications (1)

Publication Number Publication Date
CN116127126A true CN116127126A (en) 2023-05-16

Family

ID=86303523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211560638.1A Pending CN116127126A (en) 2022-12-07 2022-12-07 Self-supervision multi-mode fusion music recommendation method

Country Status (1)

Country Link
CN (1) CN116127126A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118551074A (en) * 2024-07-30 2024-08-27 浙江大学 Cross-modal music generation method and device for video soundtrack
CN118585668A (en) * 2024-06-11 2024-09-03 腾讯音乐娱乐科技(深圳)有限公司 Song recommendation model training method, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111314771A (en) * 2020-03-13 2020-06-19 腾讯科技(深圳)有限公司 Video playing method and related equipment
CN111918094A (en) * 2020-06-29 2020-11-10 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN112560830A (en) * 2021-02-26 2021-03-26 中国科学院自动化研究所 Multi-mode dimension emotion recognition method
CN115169472A (en) * 2022-07-19 2022-10-11 腾讯科技(深圳)有限公司 Music matching method and device for multimedia data and computer equipment
CN115329127A (en) * 2022-07-22 2022-11-11 华中科技大学 Multi-mode short video tag recommendation method integrating emotional information




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination