CN117407486A - Multi-modal dialogue emotion recognition method based on multi-model voting - Google Patents

Multi-modal dialogue emotion recognition method based on multi-model voting

Info

Publication number
CN117407486A
Authority
CN
China
Prior art keywords
emotion
audio
modal
picture
data
Prior art date
Legal status
Granted
Application number
CN202311245990.0A
Other languages
Chinese (zh)
Other versions
CN117407486B (en)
Inventor
牟昊
黄于晏
何宇轩
徐亚波
李旭日
Current Assignee
Guangzhou Datastory Information Technology Co ltd
Original Assignee
Guangzhou Datastory Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Datastory Information Technology Co ltd
Priority to CN202311245990.0A
Publication of CN117407486A
Application granted
Publication of CN117407486B
Legal status: Active

Classifications

    • G06F 16/3331 Query processing (information retrieval of unstructured textual data)
    • G06F 16/35 Clustering; classification (information retrieval of unstructured textual data)
    • G06F 18/24 Classification techniques (pattern recognition)
    • G06F 18/259 Fusion by voting (pattern recognition; fusion techniques)
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks (neural network architectures)
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks (neural network architectures)
    • G06N 3/08 Learning methods (neural networks)
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-modal dialogue emotion recognition method based on multi-model voting. The method first obtains multi-modal data produced by the speech of at least one speaker, constructs a separate emotion classification task for each of the three modalities (text data, audio data, and picture data), and performs a first emotion classification. It then fuses the multi-modal data with a multi-head attention mechanism and performs a second emotion classification, and performs a third emotion classification after fusing the multi-modal emotion feature vector with temporal context information. Finally, hard voting is applied to the results of the three rounds of emotion classification, and the emotion category that receives the most votes for each speaker is taken as that speaker's final classification result, completing multi-modal dialogue emotion recognition. By optimizing the way the modalities interact, the method avoids emotional interference; at the same time it models the historical dialogue and the interaction between speakers and mines the emotional features contained in each modality at a finer granularity, which strengthens the accuracy and robustness of emotion classification.

Description

A multi-modal dialogue emotion recognition method based on multi-model voting

Technical Field

The present invention relates to the technical fields of deep learning and dialogue emotion classification, and more specifically to a multi-modal dialogue emotion recognition method based on multi-model voting.

Background Art

Emotion recognition in conversation (ERC) has long been a research direction of great interest in multi-modal classification and natural language processing (NLP). In everyday human communication, identifying and tracking the emotional states of conversation participants is crucial to progress in fields such as human-computer interaction, dialogue analysis, and video understanding, and has broad potential applications. With the growth of streaming media services, multi-modal dialogue emotion recognition has shown wide applicability and significance in intelligent customer service, social media analysis, emotion-driven content recommendation, sentiment analysis research, and emotion-driven human-computer interaction. Taking an emotion-driven human-computer interaction system as an example, the technology can identify and analyze the emotional states a user expresses during a conversation, such as anger, satisfaction, or frustration. The system can then adaptively adjust its interaction style, tone, and feedback to the user's different emotional needs, interacting with the user emotionally and providing more personalized, human-centered service.

Traditional multi-modal emotion recognition methods have several limitations: they cannot effectively handle cases where different modalities express opposite emotions, where a single modality expresses multiple emotions, or where the modalities of the interlocutors influence one another, all of which degrade emotion discrimination.

Emotion recognition for multi-modal dialogue typically faces the following difficulties. First, in a multi-modal dialogue scenario, unlike traditional single-utterance multi-modal emotion recognition, many factors influence a speaker's emotional state, including the multi-modal context, stimulation from the interlocutor, the speaker's own emotional inertia, the dialogue scene, and personality traits; emotion therefore needs to be modeled and predicted from different modalities and angles. Second, the emotions expressed by different modalities may conflict, for example a person may smile while expressing sadness, so this situation requires special handling during modal interaction. Third, a single modality can also carry multiple emotions, which calls for a finer-grained emotion classification scheme. Fourth, the participants in a dialogue can influence one another: even when an interlocutor is not speaking, the speaker's emotion may be stimulated by the interlocutor's facial expression. During modeling, the interaction information therefore cannot be limited to the speaker's own video, audio, and text features; the corresponding video features of the interlocutor must also be considered. Fifth, classifying the current speaker's emotion requires combining the historical dialogue content: even if the current utterance shows no obvious emotional tendency, the emotional trend can be analyzed from the historical multi-modal dialogue data to support a better judgment.

The prior art discloses a multi-modal classification method and system based on dynamic context representation and modal fusion, which addresses the problems that the features of each modality are not fully analyzed and are not processed in a targeted way according to their characteristics. In that method, the features of each modality are given a global context representation, a local context representation, and a direct mapping representation, which are fused via a dynamic path selection method to obtain the initial fusion features of each modality; the initial fusion features of all modalities then undergo full fusion, partial fusion, and biased fusion, and the three results are fused again via dynamic path selection to obtain the final multi-modal fusion features used for classification, improving the accuracy of the final recognition task. Although this prior-art method solves the problem of an overly simple modal interaction scheme, and reduces the noise introduced by low-information modalities during fusion by distinguishing and handling the differing amounts of information across modalities, a dialogue usually contains multiple emotions: the method cannot mine the emotional features of the different modalities at a finer granularity, and when the emotions expressed by different modalities are inconsistent, modal fusion is disturbed.

Summary of the Invention

To overcome the defects of the prior art in dialogue emotion recognition, namely the lack of fine-grained mining of each modality, modal interaction that is easily disturbed, and low recognition accuracy, the present invention provides a multi-modal dialogue emotion recognition method based on multi-model voting. By optimizing the multi-modal interaction scheme it avoids emotional interference, and by modeling the historical dialogue and the interaction between speakers it mines the emotional features contained in each modality in a more detailed way, strengthening the accuracy and robustness of emotion classification.

To solve the above technical problems, the technical solution of the present invention is as follows:

A multi-modal dialogue emotion recognition method based on multi-model voting, comprising the following steps:

S1: Obtain multi-modal data produced by the dialogue of at least one speaker; the multi-modal data include text data, audio data, and picture data; the data of each modality include at least one emotion category to be recognized;

S2: Input the text data, audio data, and picture data into a preset text encoder, audio encoder, and picture encoder, respectively, for feature extraction, obtaining text features, audio features, and picture features;

S3: Input the text features, audio features, and picture features into a preset text emotion classifier, audio emotion classifier, and picture emotion classifier, respectively, for a first emotion classification, obtaining a text emotion classification result, an audio emotion classification result, and a picture emotion classification result;

S4: Compute the penalty factor of each modality from the text, audio, and picture emotion classification results; multiply the text features, audio features, and picture features by the penalty factor of the corresponding modality, obtaining a down-weighted text vector, a down-weighted audio vector, and a down-weighted picture vector;

S5: Input the down-weighted text, audio, and picture vectors together into a preset multi-head attention layer for fusion and interaction of the multi-modal features, obtaining a multi-modal emotion feature vector;

S6: Input the multi-modal emotion feature vector into a preset multi-modal emotion classifier for a second emotion classification, obtaining a multi-modal fusion emotion classification result;

S7: Decompose the multi-modal emotion feature vector into several multi-modal emotion feature sub-vectors, and re-splice all the sub-vectors in temporal order, obtaining an emotion feature vector fused with temporal features;

S8: Input the emotion feature vector fused with temporal features into a trained bidirectional RNN classifier for a third emotion classification, obtaining a temporal-context interaction emotion classification result;

S9: Conduct hard voting over the text emotion classification result, the audio emotion classification result, the picture emotion classification result, the multi-modal fusion emotion classification result, and the temporal-context interaction emotion classification result, and take the emotion category with the most votes for each speaker as that speaker's final emotion classification result, completing multi-modal dialogue emotion recognition.
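
Step S9 amounts to a plain majority (hard) vote over the five classification results listed above. A minimal Python sketch, with illustrative label names; ties fall to whichever label was seen first, a detail the patent does not specify:

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: the emotion labels produced for one speaker by the
    five classifiers (text, audio, picture, multi-modal fusion,
    temporal context)."""
    votes = Counter(predictions)
    # most_common sorts by count; equal counts keep first-seen order.
    return votes.most_common(1)[0][0]

# Example: three of the five classifiers vote "happy".
print(hard_vote(["happy", "sad", "happy", "happy", "calm"]))  # happy
```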

Preferably, the multi-modal data in step S1 are obtained by extracting text data, audio data, and picture data from preset dialogue video data of at least one speaker.

Preferably, step S1 also includes preprocessing the acquired audio data and picture data, specifically:

Adjust the sampling rate of the audio data, denoise it, and adjust the audio gain in sequence, then extract audio segments with a sliding window, completing the preprocessing of the audio data;
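
A minimal sketch of the sliding-window segment extraction, assuming a mono waveform that has already been resampled and denoised; the window and hop durations are illustrative, since the patent leaves them as preset parameters:

```python
import numpy as np

def sliding_windows(audio, sr, win_s=2.0, hop_s=1.0):
    """Split a waveform into fixed-length segments; hop_s < win_s makes
    adjacent windows overlap."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]

# Example: 10 s of silence at 16 kHz yields 9 two-second segments.
segments = sliding_windows(np.zeros(160_000), sr=16_000)
print(len(segments))  # 9
```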

Perform a deduplication operation on the picture data, specifically:

S1.1: Run an optical flow algorithm on the two pictures of adjacent frames of the dialogue video data to obtain an optical flow result;

S1.2: Compute the change magnitude between the two pictures of the adjacent frames from the optical flow result and check whether it exceeds a preset threshold; if so, keep the two pictures in the deduplication result set and go to step S1.3; otherwise go directly to step S1.3;

S1.3: Starting from the first frame of the dialogue video data, repeat steps S1.1-S1.2 to update the deduplication result set until the last frame of the dialogue video data has been traversed, then save the last updated deduplication result set as the deduplicated picture data, completing the preprocessing of the picture data.

Preferably, the optical flow algorithm in step S1.1 is any one of the Lucas-Kanade algorithm, the Farneback algorithm, and the FlowNet algorithm.

Preferably, the change magnitude in step S1.2 is the displacement vector magnitude of each pixel in the two pictures of the adjacent frames, or a similarity measure; the similarity measure includes Euclidean distance and angle change.

Preferably, in step S2, the preset text encoder is a Transformer encoder, specifically any one of the BERT, MacBERT, RoBERTa, and ERNIE models;

The preset audio encoder is the HuBERT model;

The preset picture encoder is the ViT model.

Preferably, the text emotion classifier, audio emotion classifier, and picture emotion classifier preset in step S3, and the multi-modal emotion classifier preset in step S6, are all CRF classifiers.

Preferably, in step S4, the penalty factor of each modality is computed as follows:

Obtain the number $n$ of emotion categories from the text, audio, and picture emotion classification results, count the total number of occurrences of each emotion category, and compute the proportion of the $i$-th emotion category as

$p_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$

where $c_i$ is the number of occurrences of the $i$-th emotion category;

For the classification result of each modality, the proportions of all the emotion categories it contains are added up, and the sum is taken as the penalty factor of the corresponding modality.
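
Reading the formula above as $p_i = c_i / \sum_j c_j$, a minimal sketch of the penalty-factor computation; pooling the counts across the three modality results is an assumption about how the statistics are gathered:

```python
from collections import Counter

def penalty_factors(text_labels, audio_labels, picture_labels):
    """Each modality's penalty factor is the sum of the global
    proportions p_i of the emotion categories it predicted."""
    counts = Counter(text_labels + audio_labels + picture_labels)
    total = sum(counts.values())
    proportions = {cat: k / total for cat, k in counts.items()}
    return [sum(proportions[cat] for cat in set(labels))
            for labels in (text_labels, audio_labels, picture_labels)]

# Example: the text result contains two categories, the others one.
print(penalty_factors(["pos", "neg"], ["pos"], ["pos"]))
# [1.0, 0.75, 0.75]
```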

Preferably, the specific method of step S7 is:

Decompose the multi-modal emotion feature vector by speaker into several multi-modal emotion feature sub-vectors, each sub-vector corresponding one-to-one to a speaker;

The multi-modal emotion feature sub-vectors include a down-weighted text sub-vector, a down-weighted audio sub-vector, and a down-weighted picture sub-vector;

Fuse the down-weighted picture sub-vector of each speaker with the multi-modal emotion feature sub-vectors of all other speakers, obtaining the fused multi-modal emotion feature sub-vector of each speaker;

Re-splice the fused multi-modal emotion feature sub-vectors of all speakers according to the order in which the speakers spoke, obtaining the emotion feature vector fused with temporal features.
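
A rough PyTorch sketch of one reading of step S7: the other speakers' down-weighted picture sub-vectors are fused into each speaker's sub-vector (averaging is an assumed fusion operator), and the results are concatenated in speaking order:

```python
import torch

def temporal_resplice(sub, pic, order):
    """sub: speaker -> multi-modal emotion feature sub-vector;
    pic: speaker -> down-weighted picture sub-vector;
    order: speaker IDs in the order they spoke."""
    fused = {}
    for spk, vec in sub.items():
        others = [pic[s] for s in sub if s != spk]
        fused[spk] = torch.stack([vec] + others).mean(0) if others else vec
    return torch.cat([fused[s] for s in order], dim=-1)

# Example: two speakers with 8-dim sub-vectors, A speaks before B.
sub = {"A": torch.randn(8), "B": torch.randn(8)}
pic = {"A": torch.randn(8), "B": torch.randn(8)}
print(temporal_resplice(sub, pic, ["A", "B"]).shape)  # torch.Size([16])
```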

Preferably, the bidirectional RNN classifier in step S8 is specified as follows:

The bidirectional RNN classifier includes a forward RNN classifier and a backward RNN classifier of identical structure arranged in parallel;

Both the forward RNN classifier and the backward RNN classifier include several sequentially connected RNN classification layers;

All RNN classification layers have the same structure, specifically any one of an LSTM network, a GRU network, and a PQRNN network.
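
A minimal PyTorch sketch of such a classifier using LSTM layers (one of the three options above); nn.LSTM with bidirectional=True runs the forward and backward branches in parallel, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiRNNClassifier(nn.Module):
    """Bidirectional recurrent classifier with a linear emotion head."""
    def __init__(self, feat_dim, hidden, n_classes, layers=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):          # x: (batch, time, feat_dim)
        out, _ = self.rnn(x)       # (batch, time, 2 * hidden)
        return self.head(out)      # per-time-step emotion logits

logits = BiRNNClassifier(128, 64, 6)(torch.randn(2, 10, 128))
print(logits.shape)  # torch.Size([2, 10, 6])
```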

Compared with the prior art, the beneficial effects of the technical solution of the present invention are as follows:

The invention provides a multi-modal dialogue emotion recognition method based on multi-model voting. The method first obtains multi-modal data produced by the dialogue of at least one speaker; inputs the text data, audio data, and picture data into a preset text encoder, audio encoder, and picture encoder for feature extraction, obtaining text, audio, and picture features; inputs these features into a preset text emotion classifier, audio emotion classifier, and picture emotion classifier for a first emotion classification, obtaining the text, audio, and picture emotion classification results; computes the penalty factor of each modality from these results and multiplies the text, audio, and picture features by the corresponding penalty factors, obtaining down-weighted text, audio, and picture vectors; inputs the three down-weighted vectors together into a preset multi-head attention layer for fusion and interaction of the multi-modal features, obtaining a multi-modal emotion feature vector; inputs that vector into a preset multi-modal emotion classifier for a second emotion classification, obtaining the multi-modal fusion emotion classification result; decomposes the multi-modal emotion feature vector into several sub-vectors and re-splices them in temporal order, obtaining an emotion feature vector fused with temporal features; inputs this vector into a trained bidirectional RNN classifier for a third emotion classification, obtaining the temporal-context interaction emotion classification result; and finally conducts hard voting over the text, audio, picture, multi-modal fusion, and temporal-context interaction emotion classification results, taking the emotion category with the most votes for each speaker as that speaker's final result, completing multi-modal dialogue emotion recognition.

By constructing a fine-grained emotion classification task for each modality, the invention solves the prior art's inability to recognize the multiple different emotions expressed by different modalities; this emotion-segment extraction task, which is harder to train than ordinary emotion classification, increases the classification model's sensitivity to where the modal emotional features appear and thereby improves its judgment accuracy. Second, the invention uses the fine-grained emotion classification task of each modality to guide a deliberately biased fusion interaction among the modalities, preventing errors and noise in the modal emotion information from affecting the interaction result. In addition, the invention provides a temporal-context interaction method for multi-modal data that combines contextual information with the mutual influence between different speakers in the dialogue to better understand its emotional content; when one speaker is listening while another speaks, the invention also considers the influence of the listener's expression and picture features on the current speaker's emotion, so it recognizes the emotional content of the dialogue more accurately, effectively strengthening the accuracy and robustness of emotion classification.

Description of the Drawings

Figure 1 is a flow chart of the multi-modal dialogue emotion recognition method based on multi-model voting provided in Embodiment 1.

Figure 2 is a schematic diagram of the text data emotion segments and emotion labels provided in Embodiment 2.

Figure 3 is a schematic diagram of the text encoder and text emotion classifier provided in Embodiment 2.

Figure 4 is a schematic diagram of the audio encoder and audio emotion classifier provided in Embodiment 2.

Figure 5 is a schematic diagram of the picture data deduplication operation provided in Embodiment 2.

Figure 6 is a schematic diagram of the third emotion classification provided in Embodiment 2.

Detailed Description

The drawings are for illustration only and should not be construed as limiting the present application;

To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of the actual product;

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

Embodiment 1

As shown in Figure 1, this embodiment provides a multi-modal dialogue emotion recognition method based on multi-model voting, comprising the following steps:

S1: Obtain multi-modal data produced by the dialogue of at least one speaker; the multi-modal data include text data, audio data, and picture data; the data of each modality include at least one emotion category to be recognized;

S2: Input the text data, audio data, and picture data into a preset text encoder, audio encoder, and picture encoder, respectively, for feature extraction, obtaining text features, audio features, and picture features;

S3: Input the text features, audio features, and picture features into a preset text emotion classifier, audio emotion classifier, and picture emotion classifier, respectively, for a first emotion classification, obtaining a text emotion classification result, an audio emotion classification result, and a picture emotion classification result;

S4: Compute the penalty factor of each modality from the text, audio, and picture emotion classification results; multiply the text features, audio features, and picture features by the penalty factor of the corresponding modality, obtaining a down-weighted text vector, a down-weighted audio vector, and a down-weighted picture vector;

S5: Input the down-weighted text, audio, and picture vectors together into a preset multi-head attention layer for fusion and interaction of the multi-modal features, obtaining a multi-modal emotion feature vector;

S6: Input the multi-modal emotion feature vector into a preset multi-modal emotion classifier for a second emotion classification, obtaining a multi-modal fusion emotion classification result;

S7: Decompose the multi-modal emotion feature vector into several multi-modal emotion feature sub-vectors, and re-splice all the sub-vectors in temporal order, obtaining an emotion feature vector fused with temporal features;

S8: Input the emotion feature vector fused with temporal features into a trained bidirectional RNN classifier for a third emotion classification, obtaining a temporal-context interaction emotion classification result;

S9: Conduct hard voting over the text emotion classification result, the audio emotion classification result, the picture emotion classification result, the multi-modal fusion emotion classification result, and the temporal-context interaction emotion classification result, and take the emotion category with the most votes for each speaker as that speaker's final emotion classification result, completing multi-modal dialogue emotion recognition.

In the specific implementation process, multi-modal data produced by the dialogue of at least one speaker are first obtained, and the text data, audio data, and picture data are input into the preset text encoder, audio encoder, and picture encoder, respectively, for feature extraction, obtaining text features, audio features, and picture features;

The text, audio, and picture features are input into the preset text emotion classifier, audio emotion classifier, and picture emotion classifier, respectively, for the first emotion classification, obtaining the text, audio, and picture emotion classification results;

The penalty factor of each modality is computed from the text, audio, and picture emotion classification results, and the text, audio, and picture features are multiplied by the penalty factor of the corresponding modality, obtaining the down-weighted text, audio, and picture vectors;

The down-weighted text, audio, and picture vectors are input together into the preset multi-head attention layer for fusion and interaction of the multi-modal features, obtaining the multi-modal emotion feature vector;
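
A minimal sketch of this fusion step with PyTorch's nn.MultiheadAttention, treating the three down-weighted vectors as a length-3 sequence; the embedding size, head count, and mean pooling are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

text_v = torch.randn(1, 1, embed_dim)   # down-weighted text vector
audio_v = torch.randn(1, 1, embed_dim)  # down-weighted audio vector
pic_v = torch.randn(1, 1, embed_dim)    # down-weighted picture vector

modalities = torch.cat([text_v, audio_v, pic_v], dim=1)  # (1, 3, d)
fused, _ = attn(modalities, modalities, modalities)      # self-attention
emotion_vec = fused.mean(dim=1)  # pooled multi-modal emotion feature
print(emotion_vec.shape)  # torch.Size([1, 256])
```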

The multi-modal emotion feature vector is input into the preset multi-modal emotion classifier for the second emotion classification, obtaining the multi-modal fusion emotion classification result;

The multi-modal emotion feature vector is decomposed into several multi-modal emotion feature sub-vectors, which are re-spliced in temporal order, obtaining the emotion feature vector fused with temporal features;

The emotion feature vector fused with temporal features is input into the trained bidirectional RNN classifier for the third emotion classification, obtaining the temporal-context interaction emotion classification result;

Finally, hard voting is conducted over the text, audio, picture, multi-modal fusion, and temporal-context interaction emotion classification results, and the emotion category with the most votes for each speaker is taken as that speaker's final emotion classification result, completing multi-modal dialogue emotion recognition;

By optimizing the multi-modal interaction scheme, the method avoids emotional interference; at the same time it models the historical dialogue and the interaction between speakers and mines the emotional features contained in each modality in a more detailed way, strengthening the accuracy and robustness of emotion classification.

Embodiment 2

This embodiment provides a multi-modal dialogue emotion recognition method based on multi-model voting, comprising the following steps:

S1: Obtain multi-modal data produced by the dialogue of at least one speaker; the multi-modal data include text data, audio data, and picture data; the data of each modality include at least one emotion category to be recognized;

S2: Input the text data, audio data, and picture data into a preset text encoder, audio encoder, and picture encoder, respectively, for feature extraction, obtaining text features, audio features, and picture features;

S3: Input the text features, audio features, and picture features into a preset text emotion classifier, audio emotion classifier, and picture emotion classifier, respectively, for a first emotion classification, obtaining a text emotion classification result, an audio emotion classification result, and a picture emotion classification result;

S4: Compute the penalty factor of each modality from the text, audio, and picture emotion classification results; multiply the text features, audio features, and picture features by the penalty factor of the corresponding modality, obtaining a down-weighted text vector, a down-weighted audio vector, and a down-weighted picture vector;

S5: Input the down-weighted text, audio, and picture vectors together into a preset multi-head attention layer for fusion and interaction of the multi-modal features, obtaining a multi-modal emotion feature vector;

S6: Input the multi-modal emotion feature vector into a preset multi-modal emotion classifier for a second emotion classification, obtaining a multi-modal fusion emotion classification result;

S7: Decompose the multi-modal emotion feature vector into several multi-modal emotion feature sub-vectors, and re-splice all the sub-vectors in temporal order, obtaining an emotion feature vector fused with temporal features;

S8: Input the emotion feature vector fused with temporal features into a trained bidirectional RNN classifier for a third emotion classification, obtaining a temporal-context interaction emotion classification result;

S9: Conduct hard voting over the text emotion classification result, the audio emotion classification result, the picture emotion classification result, the multi-modal fusion emotion classification result, and the temporal-context interaction emotion classification result, and take the emotion category with the most votes for each speaker as that speaker's final emotion classification result, completing multi-modal dialogue emotion recognition;

The multi-modal data in step S1 are obtained by extracting text data, audio data, and picture data from the preset dialogue video data of at least one speaker;

Step S1 also includes preprocessing the acquired audio data and picture data, specifically:

Adjust the sampling rate of the audio data, denoise it, and adjust the audio gain in sequence, then extract audio segments with a sliding window, completing the preprocessing of the audio data;

Perform a deduplication operation on the picture data, specifically:

S1.1: Run an optical flow algorithm on the two pictures of adjacent frames of the dialogue video data to obtain an optical flow result;

S1.2: Compute the change magnitude between the two pictures of the adjacent frames from the optical flow result and check whether it exceeds a preset threshold; if so, keep the two pictures in the deduplication result set and go to step S1.3; otherwise go directly to step S1.3;

S1.3: Starting from the first frame of the dialogue video data, repeat steps S1.1-S1.2 to update the deduplication result set until the last frame of the dialogue video data has been traversed, then save the last updated deduplication result set as the deduplicated picture data, completing the preprocessing of the picture data;

The optical flow algorithm in step S1.1 is any one of the Lucas-Kanade algorithm, the Farneback algorithm, and the FlowNet algorithm;

The change magnitude in step S1.2 is the displacement vector magnitude of each pixel in the two pictures of the adjacent frames, or a similarity measure; the similarity measure includes Euclidean distance and angle change;

In step S2, the preset text encoder is a Transformer encoder, specifically any one of the BERT, MacBERT, RoBERTa, and ERNIE models;

The preset audio encoder is the HuBERT model;

The preset picture encoder is the ViT model;

The text emotion classifier, audio emotion classifier, and picture emotion classifier preset in step S3, and the multi-modal emotion classifier preset in step S6, are all CRF classifiers;

In step S4, the penalty factor of each modality is computed as follows:

Obtain the number $n$ of emotion categories from the text, audio, and picture emotion classification results, count the total number of occurrences of each emotion category, and compute the proportion of the $i$-th emotion category as

$p_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$

where $c_i$ is the number of occurrences of the $i$-th emotion category;

For the classification result of each modality, the proportions of all the emotion categories it contains are added up, and the sum is taken as the penalty factor of the corresponding modality;

The specific method of step S7 is:

Decompose the multi-modal emotion feature vector by speaker into several multi-modal emotion feature sub-vectors, each sub-vector corresponding one-to-one to a speaker;

The multi-modal emotion feature sub-vectors include a down-weighted text sub-vector, a down-weighted audio sub-vector, and a down-weighted picture sub-vector;

Fuse the down-weighted picture sub-vector of each speaker with the multi-modal emotion feature sub-vectors of all other speakers, obtaining the fused multi-modal emotion feature sub-vector of each speaker;

Re-splice the fused multi-modal emotion feature sub-vectors of all speakers according to the order in which the speakers spoke, obtaining the emotion feature vector fused with temporal features;

The bidirectional RNN classifier in step S8 is specified as follows:

The bidirectional RNN classifier includes a forward RNN classifier and a backward RNN classifier of identical structure arranged in parallel;

Both the forward RNN classifier and the backward RNN classifier include several sequentially connected RNN classification layers;

All RNN classification layers have the same structure, specifically any one of an LSTM network, a GRU network, and a PQRNN network.

In the specific implementation process, text data, audio data, and picture data are first extracted from the preset dialogue video data of at least one speaker to obtain the multi-modal data. The data of each modality include one or more emotion categories such as happy, frustrated, calm, joyful, surprised, sad, disgusted, angry, and afraid; emotion categories can also be divided into positive and negative;

To solve the prior art's inability to recognize the multiple different emotions expressed by different modalities, this embodiment first designs a fine-grained emotion classification task for the data of each of the three modalities and performs the first emotion classification, specifically:

1) For the text data, the text is first extracted on its own, and the emotional text segments and emotion labels are annotated manually. For example, as shown in Figure 2, the text "我觉得挺开心，就算你一直让我流泪" ("I feel quite happy, even though you keep making me cry") contains two emotional segments, corresponding to the emotion labels "happy (pos)" and "frustrated (neg)";

As shown in Figure 3, a Transformer encoder is then used to extract text features, and a CRF classifier is used to learn the dependencies between labels and perform emotion classification;

The emotion classification of the text involves two levels of judgment: one is whether each character belongs to the beginning (B), inside (I), or end (E) of an emotional text segment, or to no emotional segment at all (O); the other is which emotion type (pos or neg) an emotional segment belongs to;
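
For illustration, one possible BIOES-style labeling of the Figure 2 example sentence; the exact span boundaries are assumed here:

```python
# "我觉得挺开心，就算你一直让我流泪"
# ("I feel quite happy, even though you keep making me cry"):
# one "pos" span and one "neg" span, "O" elsewhere.
tokens = ["我", "觉", "得", "挺", "开", "心", "就", "算",
          "你", "一", "直", "让", "我", "流", "泪"]
tags   = ["O", "O", "O", "B-pos", "I-pos", "E-pos", "O", "O",
          "O", "O", "O", "B-neg", "I-neg", "I-neg", "E-neg"]
assert len(tokens) == len(tags)
```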

In this embodiment, the Transformer encoder is any one of the BERT, MacBERT, RoBERTa, and ERNIE models;

Training the model on the text dataset requires defining a loss function and an optimizer and updating the model parameters with the backpropagation algorithm; batch gradient descent or other optimization techniques are typically used during training. Hyperparameters such as the learning rate, batch size, number of Transformer layers, and number of hidden units are adjusted according to the model's performance on a validation set; cross-validation, grid search, or similar methods can be used for hyperparameter tuning;

The trained model is then used to perform the first emotion classification on the input text data, obtaining the text emotion classification result, namely the text segments containing emotional states and their corresponding emotion categories;

Through the text emotion segment recognition and classification task, the model can focus on identifying the segments of the dialogue text that carry emotional information, and it classifies more accurately when the dialogue text contains multiple emotions;

2) For the audio data, a dataset containing audio and the corresponding emotion labels is collected, and the audio signal to be recognized is preprocessed, for example by sampling rate adjustment, denoising, and audio gain adjustment. The size and step of a sliding window are then set and the audio signal is processed with it: starting from the beginning of the audio, the window is moved repeatedly by the set step size, and the corresponding audio segment is extracted within each window;

The size of the sliding window determines the duration of each audio segment, while the step size determines the degree of overlap between adjacent windows;

The extracted audio segments are annotated with the corresponding emotion labels so that every segment has an emotion label associated with it: the start of each emotional audio segment is marked "B", its interior "I", and its end "E", while non-emotional audio segments are marked "O". In addition, the segments are labeled by emotion category, for example "pos" for positive emotions and "neg" for negative emotions;

As shown in Figure 4, the preprocessed audio dataset is augmented and fed into a HuBERT model with a CRF classifier for training. The HuBERT model extracts the contextual information of the input audio sequence, while the CRF classifier learns the dependencies between labels; the output features of the HuBERT model are fed into the CRF classifier, which outputs the emotion type of each audio segment. Training uses a supervised learning algorithm: the model parameters are optimized by minimizing a loss function (such as the cross-entropy loss), with gradients computed and parameters updated by backpropagation;

This embodiment also sets an emotion probability threshold parameter: for each audio segment, it is checked whether the predicted emotion category or emotion score exceeds the threshold, and if so, that time step is judged to belong to an emotional segment;

In addition, time steps judged to contain emotion are further analyzed over consecutive frames: this embodiment sets a consecutive-frame threshold, i.e. the number of consecutive time steps that must be judged emotional before the audio segment is finally confirmed as an emotional segment;
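
A minimal sketch of this confirmation step: the per-time-step threshold flags are grouped into runs, and only runs at least as long as the consecutive-frame threshold are kept (the threshold value of 5 is illustrative):

```python
def confirm_emotional_segments(flags, min_run=5):
    """flags: per-time-step booleans, True where the emotion score
    exceeded the probability threshold. Returns (start, end) index
    pairs of the confirmed emotional segments."""
    segments, start = [], None
    for i, f in enumerate(flags + [False]):  # sentinel closes a final run
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= min_run:
                segments.append((start, i))
            start = None
    return segments

print(confirm_emotional_segments([True] * 6 + [False] * 3 + [True] * 2))
# [(0, 6)] - the trailing 2-step run is too short to be confirmed
```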

The trained HuBERT model and CRF classifier are then used to perform the first emotion classification on the input audio data, obtaining the audio emotion classification result;

3) For the picture data, the dialogue video data are first split into frames so that the feature information expressing emotion, which is concentrated mainly in the face, can be extracted;

The picture data are then deduplicated, specifically:

S1.1: Run an optical flow algorithm on the two pictures of adjacent frames of the dialogue video data to obtain an optical flow result; the optical flow algorithm estimates the pixel displacement between adjacent frames to capture inter-frame motion information, and is specifically any one of the Lucas-Kanade algorithm, the Farneback algorithm, and the FlowNet algorithm;

S1.2: Compute the change magnitude between the two pictures of the adjacent frames from the optical flow result; the change magnitude is the displacement vector magnitude of each pixel or a similarity measure (such as Euclidean distance or angle change);

Check whether the change magnitude exceeds the preset threshold; if so, keep the two pictures of the adjacent frames in the deduplication result set and go to step S1.3; otherwise go directly to step S1.3;

S1.3: Starting from the first frame of the dialogue video data, repeat steps S1.1-S1.2 to update the deduplication result set until the last frame of the dialogue video data has been traversed, then save the last updated deduplication result set as the deduplicated picture data, completing the preprocessing of the picture data;

Frames with large changes are kept while frames with small changes are discarded. As shown in Figure 5, the frames cover the speaker's transition from angry to happy, and after deduplication the pictures representing the "happy" and "angry" emotion categories are retained;

The preprocessed picture data are then used to train the model, with a ViT model and a CRF classifier extracting picture features and classifying them. The ViT model splits each picture into patches and rearranges them into a sequence for processing. Commonly used training losses include the smooth L1 loss for bounding-box regression and the cross-entropy loss for classification; the model parameters are optimized by minimizing the loss function, and stochastic gradient descent (SGD) or other optimization algorithms can be used for the parameter updates;
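
As a sketch of the patch-splitting that prepares the ViT input sequence; patch size 16 and the 224x224 input are the common defaults, assumed here rather than taken from the patent:

```python
import torch

def to_patches(images, patch=16):
    """Split a batch of images (batch, channels, H, W) into
    non-overlapping patch tokens of shape (batch, N, c * patch^2)."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)
    # (b, c, H/p, W/p, p, p) -> (b, num_patches, c * p * p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)

tokens = to_patches(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```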

Finally, the trained ViT model + CRF classifier performs the first emotion recognition on the picture data, obtaining the picture emotion classification result, i.e., the emotion category corresponding to the face region;

After the first emotion classification, in order to avoid the influence of noise in multi-modal emotion classification, this embodiment further performs a second emotion classification, specifically as follows:

The penalty factor corresponding to each modality is calculated from the text emotion classification result, the audio emotion classification result, and the picture emotion classification result; the specific method is:

The number n of emotion categories is obtained from the text emotion classification result, the audio emotion classification result, and the picture emotion classification result, the total number of occurrences of each emotion category is counted, and the proportion $p_i$ of the i-th emotion category is calculated as

$$p_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$$

where $c_i$ is the number of occurrences of the i-th emotion category;

For the classification result of each modality, the proportions of all emotion categories contained in it are summed, and the sum is used as the penalty factor of the corresponding modality;
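A minimal sketch of the proportion and penalty-factor computation follows, assuming each modality's first-pass result is a list of predicted emotion labels; reading "the categories contained in it" as the set of distinct labels a modality predicted is an interpretation, not stated explicitly in the text.

```python
from collections import Counter

def penalty_factors(text_preds, audio_preds, image_preds):
    counts = Counter(text_preds + audio_preds + image_preds)
    total = sum(counts.values())
    # p_i = c_i / sum_j c_j: proportion of the i-th emotion category
    p = {cat: c / total for cat, c in counts.items()}

    def factor(preds):
        # sum the proportions of every distinct category the modality predicted
        return sum(p[cat] for cat in set(preds))

    return factor(text_preds), factor(audio_preds), factor(image_preds)

# e.g. penalty_factors(["happy", "sad"], ["happy"], ["angry"])
```

A modality whose predictions fall into frequently agreed-upon categories thus receives a larger factor, so rarely supported (likely noisy) modalities are weakened in the subsequent multiplication.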

The text features, audio features, and picture features are multiplied by the penalty factors of their corresponding modalities to obtain the text down-weighting vector, the audio down-weighting vector, and the picture down-weighting vector, respectively;

The text down-weighting vector, the audio down-weighting vector, and the picture down-weighting vector are jointly input into a preset multi-head attention layer for multi-modal feature fusion and interaction, obtaining a multi-modal emotion feature vector; through the multi-head attention mechanism, the model can automatically attend to the parts of the different modalities that most strongly affect emotion classification;

The multi-modal emotion feature vector is input into a preset multi-modal emotion classifier for the second emotion classification, obtaining the multi-modal fusion emotion classification result;
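As a hedged PyTorch sketch, the fusion below treats the three down-weighting vectors as a length-3 sequence passed through one multi-head attention layer, with a linear head standing in for the preset multi-modal emotion classifier; all dimensions are assumed placeholders.

```python
import torch
import torch.nn as nn

d_model, num_emotions = 768, 7   # assumed sizes
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
head = nn.Linear(d_model, num_emotions)

def fuse_and_classify(text_vec, audio_vec, image_vec):
    # each input: (batch, d_model) down-weighting vector
    x = torch.stack([text_vec, audio_vec, image_vec], dim=1)  # (B, 3, d)
    fused, _ = attn(x, x, x)       # cross-modal interaction via attention
    pooled = fused.mean(dim=1)     # multi-modal emotion feature vector
    return head(pooled)            # second-pass emotion logits
```

Treating each modality as one token keeps the attention sequence short while still letting every modality query the other two.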

After the second emotion classification, this embodiment also takes into account the mutual influence between different speakers in a dialogue, and performs a third emotion classification, specifically as follows:

First, the multi-modal emotion feature vector is decomposed by speaker into several multi-modal emotion feature sub-vectors, each sub-vector corresponding one-to-one to a speaker;

Each multi-modal emotion feature sub-vector comprises a text down-weighting sub-vector, an audio down-weighting sub-vector, and a picture down-weighting sub-vector;

As shown in Figure 6, assume there are 2 speakers; speakers A and B hold the following dialogue in chronological order:

A: (Neutral) Dialogue 1;

B: (Neutral) Dialogue 2;

A: (Surprise) Dialogue 3;

B: (Sad) Dialogue 4;

B: (Fear, Sad) Dialogue 5;

A: (Anger) Dialogue 6;

The 6 groups of multi-modal vectors in Figure 6 correspond to the above 6 utterances. While speaker A is speaking in the current round, speaker B is not speaking but still shows facial expressions, so this situation must be taken into account: at each time step, f_a is the multi-modal emotion feature sub-vector of speaker A, while the audio and text parts of the f_b vector concatenated with it are empty, leaving only the down-weighting sub-vector of the picture modality; the picture modality here uses the picture of speaker B from the previous round (when the speaker is B, the same processing is applied);

In Figure 6, a sliding-window method adds each speaker's feature sequence to a fixed-length feature window and concatenates the features within the window into a single feature vector, adding temporal-context interaction information to each speaker's feature sequence. To judge the emotion of the current utterance, the features of the previous n utterances are fused into the features of the current utterance through an RNN model so that they jointly determine the emotion; the features of the previous n rounds serve the emotion classification of the current utterance. The fused features of interlocutors A and B are repeatedly added at each time step to reflect the progress of the dialogue and the mutual influence between the interlocutors;
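A sketch of the fixed-length window construction is given below; the window size n and zero-padding at the start of the dialogue are assumptions used for illustration.

```python
import torch

def context_windows(utterance_feats, n):
    """utterance_feats: list of (d,) tensors in speaking order.
    Returns, per utterance, an (n + 1, d) tensor holding the
    previous n turns plus the current one, zero-padded at the
    start of the dialogue where fewer than n turns exist."""
    d = utterance_feats[0].shape[0]
    padded = [torch.zeros(d)] * n + list(utterance_feats)
    return [torch.stack(padded[i:i + n + 1])
            for i in range(len(utterance_feats))]
```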

Also, in Figure 6, speaker A and speaker B may be the same person, in which case the dialogue can be regarded as a single speaker talking to themselves;

The picture down-weighting sub-vector of each speaker is fused with the multi-modal emotion feature sub-vectors of all other speakers to obtain the fused multi-modal emotion feature sub-vector of each speaker;

The fused multi-modal emotion feature sub-vectors of all speakers are re-concatenated in the speaking order of the different speakers to obtain an emotion feature vector fused with temporal features;

The emotion feature vector fused with temporal features is input into the trained bidirectional RNN classifier for the third emotion classification, obtaining the temporal-context interaction emotion classification result; the final output emotion is the emotion category of each speaker's last utterance;

As shown in Figure 6, the bidirectional RNN classifier comprises a forward RNN classifier and a backward RNN classifier that have the same structure and are arranged in parallel; the forward RNN classifier and the backward RNN classifier each comprise several sequentially connected RNN classification layers; all RNN classification layers have the same structure, here an LSTM network;

At each time step, the forward RNN processes the input sequence from beginning to end while the backward RNN processes it from end to beginning; the hidden states of the two directions are concatenated at each time step to form a more comprehensive representation that contains the temporal-context interaction information;
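A minimal sketch of this bidirectional stage, assuming LSTM layers and a linear head; all sizes are placeholders.

```python
import torch
import torch.nn as nn

class BiRnnClassifier(nn.Module):
    def __init__(self, d_in, d_hidden, num_emotions, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(d_in, d_hidden, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        # forward and backward hidden states are concatenated per step
        self.head = nn.Linear(2 * d_hidden, num_emotions)

    def forward(self, x):               # x: (B, T, d_in) window features
        out, _ = self.rnn(x)            # (B, T, 2 * d_hidden)
        # reading off the last step is a common simplification; one can
        # also join the forward state at step T with the backward state
        # at step 1 for a fuller summary of the sequence
        return self.head(out[:, -1])    # emotion of the last utterance
```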

Finally, the three rounds of emotion classification are combined by hard voting under the majority rule: the text emotion classification result, the audio emotion classification result, the picture emotion classification result, the multi-modal fusion emotion classification result, and the temporal-context interaction emotion classification result vote together, and for each speaker the emotion category receiving the most votes is taken as that speaker's final emotion classification result, completing multi-modal dialogue emotion recognition; if several emotion categories tie for the most votes, any one of them may be chosen as the final result;
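This five-way hard vote reduces to a simple majority count per speaker; a minimal sketch, with ties falling to the first label encountered, which the text permits since any tied label may be chosen:

```python
from collections import Counter

def hard_vote(text, audio, image, fused, temporal):
    """Each argument is one predicted emotion label; the most
    frequent label wins (Counter orders ties by first occurrence)."""
    return Counter([text, audio, image, fused, temporal]).most_common(1)[0][0]
```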

By optimizing the multi-modal interaction, this method avoids emotional interference while modeling historical dialogue and the interactions between speakers, mining the emotional features contained in each modality in a more fine-grained way, thereby enhancing the accuracy and robustness of emotion classification.

The same or similar reference numerals correspond to the same or similar parts;

Terms describing positional relationships in the drawings are used for illustration only and are not to be construed as limiting this application;

Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit its implementation. A person of ordinary skill in the art may make changes or modifications in other forms on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A multimodal dialogue emotion recognition method based on multimodal voting, characterized by comprising the following steps:
S1: acquiring multi-modal data generated by the speech of at least 1 speaker; the multi-modal data includes text data, audio data, and picture data; the data of each modality comprises at least 1 emotion type to be identified;
S2: respectively inputting the text data, audio data, and picture data into a preset text encoder, audio encoder, and picture encoder for feature extraction, respectively obtaining text features, audio features, and picture features;
S3: respectively inputting the text features, audio features, and picture features into a preset text emotion classifier, audio emotion classifier, and picture emotion classifier for the first emotion classification, respectively obtaining a text emotion classification result, an audio emotion classification result, and a picture emotion classification result;
S4: respectively calculating the penalty factor corresponding to each modality according to the text emotion classification result, the audio emotion classification result, and the picture emotion classification result; multiplying the text features, audio features, and picture features by the penalty factors of their corresponding modalities to respectively obtain a text down-weighting vector, an audio down-weighting vector, and a picture down-weighting vector;
S5: jointly inputting the text down-weighting vector, the audio down-weighting vector, and the picture down-weighting vector into a preset multi-head attention layer for multi-modal feature fusion and interaction, obtaining a multi-modal emotion feature vector;
S6: inputting the multi-modal emotion feature vector into a preset multi-modal emotion classifier for the second emotion classification, obtaining a multi-modal fusion emotion classification result;
S7: decomposing the multi-modal emotion feature vector into a plurality of multi-modal emotion feature sub-vectors, and re-concatenating all the multi-modal emotion feature sub-vectors in time order to obtain an emotion feature vector fused with temporal features;
S8: inputting the emotion feature vector fused with temporal features into a trained bidirectional RNN classifier for the third emotion classification, obtaining a temporal-context interaction emotion classification result;
S9: hard-voting the text emotion classification result, the audio emotion classification result, the picture emotion classification result, the multi-modal fusion emotion classification result, and the temporal-context interaction emotion classification result, and taking, for each speaker, the emotion type with the largest number of votes as the final emotion classification result, completing multi-modal dialogue emotion recognition.
2. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 1, wherein the multi-modal data in step S1 is obtained as follows: extracting text data, audio data, and picture data respectively from preset dialogue video data of at least 1 speaker to obtain the multi-modal data.
3. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 2, wherein step S1 further comprises preprocessing the obtained audio data and picture data, the specific method being:
sequentially performing sampling-rate adjustment, denoising, and audio gain adjustment on the audio data, and extracting audio segments with a sliding window, completing the preprocessing of the audio data;
and performing a de-duplication operation on the picture data, specifically:
S1.1: performing optical flow calculation on the 2 pictures of adjacent frames of the dialogue video data with an optical flow algorithm to obtain an optical flow calculation result;
S1.2: calculating the change amplitude of the 2 pictures of the adjacent frames according to the optical flow calculation result, and judging whether the change amplitude is greater than a preset threshold; if so, retaining the 2 pictures of the adjacent frames in a de-duplication result set and executing step S1.3; otherwise, directly executing step S1.3;
S1.3: starting from the first frame of the dialogue video data, repeating steps S1.1–S1.2 to update the de-duplication result set until the last frame of the dialogue video data has been traversed, and saving the de-duplication result set obtained in the last update as the de-duplicated picture data, completing the preprocessing of the picture data.
4. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 3, wherein the optical flow algorithm in step S1.1 is specifically any one of the Lucas-Kanade algorithm, the Farneback algorithm, and the FlowNet algorithm.
5. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 4, wherein the change amplitude in step S1.2 is specifically the displacement vector magnitude of each pixel in the 2 pictures of the adjacent frames or a similarity measure; the similarity measure includes Euclidean distance and angle change.
6. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 1 or 5, wherein in step S2 the preset text encoder is specifically a Transformer encoder, the Transformer encoder being specifically any one of a BERT model, a MacBERT model, a RoBERTa model, and an ERNIE model;
the preset audio encoder is specifically a HuBERT model;
the preset picture encoder is specifically a ViT model.
7. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 6, wherein the text emotion classifier, the audio emotion classifier, and the picture emotion classifier preset in step S3 are all CRF classifiers.
8. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 7, wherein in step S4 the specific method for calculating the penalty factor corresponding to each modality is:
acquiring the number n of emotion categories according to the text emotion classification result, the audio emotion classification result, and the picture emotion classification result, counting the total number of occurrences of each emotion category, and calculating the proportion $p_i$ of the i-th emotion category as

$$p_i = \frac{c_i}{\sum_{j=1}^{n} c_j}$$

wherein $c_i$ is the number of occurrences of the i-th emotion category;
and for the classification result of each modality, summing the proportions of all emotion categories contained therein, the sum being taken as the penalty factor of the corresponding modality.
9. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 8, wherein the specific method of step S7 is:
decomposing the multi-modal emotion feature vector into a plurality of multi-modal emotion feature sub-vectors by speaker, each multi-modal emotion feature sub-vector corresponding one-to-one to a speaker;
the multi-modal emotion feature sub-vector comprising a text down-weighting sub-vector, an audio down-weighting sub-vector, and a picture down-weighting sub-vector;
fusing the picture down-weighting sub-vector of each speaker with the multi-modal emotion feature sub-vectors of all other speakers to obtain the fused multi-modal emotion feature sub-vector of each speaker;
and re-concatenating the fused multi-modal emotion feature sub-vectors of all speakers in the speaking order of the different speakers to obtain the emotion feature vector fused with temporal features.
10. The multimodal dialogue emotion recognition method based on multimodal voting according to claim 9, wherein the bidirectional RNN classifier in step S8 is specifically as follows:
the bidirectional RNN classifier comprises a forward RNN classifier and a backward RNN classifier that have the same structure and are arranged in parallel;
the forward RNN classifier and the backward RNN classifier each comprise a plurality of sequentially connected RNN classification layers;
all RNN classification layers have the same structure, the structure being any one of an LSTM network, a GRU network, and a PQRNN network.
CN202311245990.0A 2023-09-26 2023-09-26 Multimodal dialogue emotion recognition method based on multimodal voting Active CN117407486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311245990.0A CN117407486B (en) 2023-09-26 2023-09-26 Multimodal dialogue emotion recognition method based on multimodal voting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311245990.0A CN117407486B (en) 2023-09-26 2023-09-26 Multimodal dialogue emotion recognition method based on multimodal voting

Publications (2)

Publication Number Publication Date
CN117407486A true CN117407486A (en) 2024-01-16
CN117407486B CN117407486B (en) 2025-01-10

Family

ID=89497079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311245990.0A Active CN117407486B (en) 2023-09-26 2023-09-26 Multimodal dialogue emotion recognition method based on multimodal voting

Country Status (1)

Country Link
CN (1) CN117407486B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112560830A (en) * 2021-02-26 2021-03-26 中国科学院自动化研究所 Multi-mode dimension emotion recognition method
CN113255755A (en) * 2021-05-18 2021-08-13 北京理工大学 Multi-modal emotion classification method based on heterogeneous fusion network
US11194972B1 (en) * 2021-02-19 2021-12-07 Institute Of Automation, Chinese Academy Of Sciences Semantic sentiment analysis method fusing in-depth features and time sequence models
CN115329779A (en) * 2022-08-10 2022-11-11 天津大学 Multi-person conversation emotion recognition method
CN116229225A (en) * 2023-02-14 2023-06-06 合肥工业大学 A Multimodal Emotion Recognition Method Based on Graph Convolutional Network
CN116775873A (en) * 2023-06-20 2023-09-19 哈尔滨理工大学 Multi-mode dialogue emotion recognition method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
US11194972B1 (en) * 2021-02-19 2021-12-07 Institute Of Automation, Chinese Academy Of Sciences Semantic sentiment analysis method fusing in-depth features and time sequence models
CN112560830A (en) * 2021-02-26 2021-03-26 中国科学院自动化研究所 Multi-mode dimension emotion recognition method
US11281945B1 (en) * 2021-02-26 2022-03-22 Institute Of Automation, Chinese Academy Of Sciences Multimodal dimensional emotion recognition method
CN113255755A (en) * 2021-05-18 2021-08-13 北京理工大学 Multi-modal emotion classification method based on heterogeneous fusion network
CN115329779A (en) * 2022-08-10 2022-11-11 天津大学 Multi-person conversation emotion recognition method
CN116229225A (en) * 2023-02-14 2023-06-06 合肥工业大学 A Multimodal Emotion Recognition Method Based on Graph Convolutional Network
CN116775873A (en) * 2023-06-20 2023-09-19 哈尔滨理工大学 Multi-mode dialogue emotion recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Huang Huan et al.: "Attention-based multimodal sentiment analysis of short videos", Journal of Graphics (图学学报), 28 February 2021 (2021-02-28) *

Also Published As

Publication number Publication date
CN117407486B (en) 2025-01-10

Similar Documents

Publication Publication Date Title
Khan et al. MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
CN110826466B (en) Emotion recognition method, device and storage medium based on LSTM audio and video fusion
CN115329779B (en) Multi-person dialogue emotion recognition method
CN112348075A (en) Multi-mode emotion recognition method based on contextual attention neural network
Wang et al. Sentiment analysis from Customer-generated online videos on product review using topic modeling and Multi-attention BLSTM
WO2023050708A1 (en) Emotion recognition method and apparatus, device, and readable storage medium
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN112083806A (en) Self-learning emotion interaction method based on multi-modal recognition
Shen et al. WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition.
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN114549946A (en) Cross-modal attention mechanism-based multi-modal personality identification method and system
CN114882522B (en) Behavior attribute identification method and device based on multi-mode fusion and storage medium
Zhang et al. Multi-modal emotion recognition based on deep learning in speech, video and text
Wu et al. Estimating the uncertainty in emotion class labels with utterance-specific dirichlet priors
Maji et al. Multimodal emotion recognition based on deep temporal features using cross-modal transformer and self-attention
Du et al. Multimodal emotion recognition based on feature fusion and residual connection
Aggarwal et al. Modelling visual semantics via image captioning to extract enhanced multi-level cross-modal semantic incongruity representation with attention for multimodal sarcasm detection
CN120523332A (en) Intention prediction method, device, equipment and medium for intelligent human-computer interaction
CN120430294A (en) Project text optimization processing method and device
CN118820844A (en) A multimodal conversation dynamic emotion recognition method based on relational subgraph interaction
CN118860152A (en) A virtual environment interaction system based on multimodal emotion recognition
CN117407486A (en) Multimodal dialogue emotion recognition method based on multimodal voting
CN118823645A (en) A multimodal emotion recognition method and system for scenarios where text modality is missing
CN118520091A (en) Multi-mode intelligent question-answering robot and construction method thereof
Sravani et al. Multimodal sentimental classification using long-short term memory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant