
CN111164601B - Emotion recognition method, smart device and computer-readable storage medium - Google Patents

Emotion recognition method, smart device and computer-readable storage medium

Info

Publication number
CN111164601B
CN111164601B CN201980003314.8A
Authority
CN
China
Prior art keywords
semantic feature
data
sequence
emotion recognition
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201980003314.8A
Other languages
Chinese (zh)
Other versions
CN111164601A (en)
Inventor
丁万
黄东延
李柏
邵池
熊友军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ubtech Technology Co ltd
Original Assignee
Shenzhen Ubtech Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ubtech Technology Co ltd filed Critical Shenzhen Ubtech Technology Co ltd
Publication of CN111164601A
Application granted
Publication of CN111164601B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present invention discloses an emotion recognition method. The emotion recognition method includes: acquiring a multimodal data group to be recognized that includes at least two of video data, audio data and/or text data; extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence from the text data; aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence; fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence along the time dimension to generate a multimodal semantic feature sequence; and inputting the multimodal semantic feature sequence into a pre-trained emotion recognition neural network, taking the output of the emotion recognition neural network as the target emotion corresponding to the multimodal data group to be recognized. The invention also discloses a smart device and a computer-readable storage medium. The present invention can effectively improve the accuracy of emotion recognition.

Description

Emotion recognition method, intelligent device and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an emotion recognition method, an intelligent device, and a computer-readable storage medium.
Background
The emotion of a person in a natural state produces responses in multiple modalities (such as facial movements, speaking tone, language, and heartbeat). Traditional multimodal fusion emotion recognition methods are based on low-level feature alignment fusion or decision-level fusion. The limitations of the two methods are: (a) low-level feature fusion is inconsistent with the human brain, which processes the low-level information of different modalities (physical characteristics such as the brightness of pixels, the frequency spectrum of sound waves, and the spelling of words) independently; (b) decision-level fusion ignores the spatiotemporal relationship between the multimodal semantic features. Different spatiotemporal distributions of multimodal semantic features may correspond to different emotional information. For example, in case A a smiling face and the utterance "good" appear simultaneously, while in case B the smiling face appears after "good" is said. A and B differ in the temporal order of the two semantic features (the smiling face and saying "good"), which leads to a difference in emotional expression; for example, B is more likely to reflect a response to praise or a feeling of resentment.
Disclosure of Invention
Based on this, it is necessary to propose an emotion recognition method, an intelligent device, and a computer-readable storage medium in order to solve the above-described problems.
A method of emotion recognition, the method comprising: acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data; extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence in the text data; aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence; fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence; inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, and taking an output result of the emotion recognition neural network as a target emotion corresponding to the multi-modal data set to be recognized.
An intelligent device, comprising: the multi-mode data acquisition module is used for acquiring a multi-mode data set to be identified, wherein the multi-mode data set to be identified comprises video data, audio data and text data; the extraction module is used for extracting video semantic feature sequences of the video data, extracting audio semantic feature sequences of the audio data and extracting text semantic feature sequences in the text data; the alignment module is used for aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence; the serial module is used for connecting the video semantic feature sequence, the audio semantic feature sequence and the text semantic time sequence in series according to the time dimension to generate a multi-mode semantic feature sequence; and the emotion module is used for inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, and taking an output result of the emotion recognition neural network as a target emotion corresponding to the multi-modal data set to be recognized.
An intelligent device, comprising: acquisition circuitry, a memory, and a processor coupled to the memory and the acquisition circuitry, the memory having stored therein a computer program, the processor executing the computer program to implement the method as described above.
A computer readable storage medium storing a computer program executable by a processor to implement a method as described above.
The embodiment of the invention has the following beneficial effects:
after the multimodal data set to be identified is obtained, a video semantic feature sequence of the video data is extracted, an audio semantic feature sequence of the audio data is extracted, and/or a text semantic feature sequence in the text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, and the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence. Because the acquired features are semantic features rather than low-level features, the emotion features of the multi-mode data set to be identified can be represented more accurately, and the feature alignment and fusion preserve the multi-mode space-time relationship, so the target emotion acquired from the multi-mode semantic feature sequence is more accurate and the accuracy of emotion identification is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
FIG. 1 is a diagram of an emotion recognition method application environment in one embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of an emotion recognition method provided by the present invention;
FIG. 3 is a schematic flow chart of a second embodiment of an emotion recognition method provided by the present invention;
FIG. 4 is a schematic flow chart of a third embodiment of an emotion recognition method provided by the present invention;
FIG. 5 is a schematic structural diagram of a first embodiment of the smart device provided by the present invention;
FIG. 6 is a schematic structural diagram of a second embodiment of the smart device provided by the present invention;
fig. 7 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, the decision layer fusion ignores the space-time relationship among the multi-mode semantic features. Because different time-space distributions of the multi-mode semantic features correspond to different emotion information, ignoring the space-time relationship can cause low emotion recognition accuracy.
In this embodiment, in order to solve the above-mentioned problem, an emotion recognition method is provided, which can effectively improve the accuracy of emotion recognition.
Referring to fig. 1, fig. 1 is a diagram illustrating an application environment of an emotion recognition method according to an embodiment of the present invention. The emotion recognition method is applied to an emotion recognition system. The emotion recognition system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network, and the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. The terminal 110 is configured to obtain a multimodal data set to be identified, where the multimodal data set to be identified includes at least two of video data, audio data, and/or text data, and the server 120 is configured to extract a video semantic feature sequence of the video data, extract an audio semantic feature sequence of the audio data, and/or extract a text semantic feature sequence in the text data; align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence; fuse the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence; and input the multi-mode semantic feature sequence into a pre-trained emotion recognition neural network to obtain the target emotion corresponding to the multi-mode data set to be recognized.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of an emotion recognition method according to the present invention. The emotion recognition method provided by the invention comprises the following steps:
s101: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
In one particular implementation scenario, a multimodal data set to be identified is obtained, the multimodal data set to be identified comprising at least two of video data, audio data, and/or text data. In this implementation scenario, the multimodal data set to be identified includes video data, audio data, and text data. The multimodal data sets to be identified may be provided by the user, or may be obtained from a database, or may be generated by live recording. The video data, the audio data and the text data correspond to the same speaker in the same time period.
S102: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
In the implementation scene, a video semantic feature sequence of video data is extracted, an audio semantic feature sequence of audio data is extracted, and a text semantic feature sequence in text data is extracted. The video semantic feature sequence, the audio semantic feature sequence of the audio data and the text semantic feature sequence can be obtained by inputting the multimodal data set to be identified into a pre-trained feature extraction neural network. In other implementation scenarios, the video data may be input into a pre-trained video feature extraction neural network to obtain a video semantic feature sequence, the audio data may be input into a pre-trained audio feature extraction neural network to obtain an audio semantic feature sequence, and the text data may be input into a pre-trained text feature extraction neural network to obtain a text semantic feature sequence.
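For illustration only, the following sketch (not the patent's implementation) shows step S102 with three separately pre-trained extractors; the GRU modules merely stand in for the CNN-RNN video network, the audio network and the text network described below, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in encoders (assumptions): each maps per-frame/per-word input features
# to a 2-dimensional semantic feature at every time step.
video_net = nn.GRU(input_size=512, hidden_size=2, batch_first=True)
audio_net = nn.GRU(input_size=128, hidden_size=2, batch_first=True)
text_net = nn.GRU(input_size=300, hidden_size=2, batch_first=True)

video_frames = torch.randn(1, 40, 512)   # (batch, video frames, per-frame feature)
audio_frames = torch.randn(1, 100, 128)  # (batch, audio frames, spectral feature)
word_vectors = torch.randn(1, 12, 300)   # (batch, words, word2vec feature)

with torch.no_grad():
    video_seq, _ = video_net(video_frames)  # video semantic feature sequence, (1, 40, 2)
    audio_seq, _ = audio_net(audio_frames)  # audio semantic feature sequence, (1, 100, 2)
    text_seq, _ = text_net(word_vectors)    # text semantic feature sequence, (1, 12, 2)
```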
Specifically, before the video data is input into the pre-trained video feature extraction neural network to acquire the video semantic feature sequence, the video feature extraction neural network needs to be trained. Face video data is prepared, and the face action units in the face video data are labeled. Before training, the structure of the video feature extraction network is defined as a CNN-RNN structure, the iteration initial value is defined as epoch = 0, and a loss function is defined. The face video data and the corresponding face action units are input into the video feature extraction neural network to obtain a training result; the training data is randomly batched, the loss function is calculated, and the weights of the CNN-RNN are updated with a backpropagation gradient algorithm according to the calculated loss value. After all training samples are traversed, epoch is incremented by 1, and iteration continues until epoch = 2000, at which point training ends.
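As a hedged illustration of this training procedure (the dataset of labeled face videos, the loss choice and the optimizer are assumptions, not details given by the patent), a PyTorch-style loop could look like this:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train_video_extractor(model: nn.Module, dataset, max_epoch: int = 2000):
    """Train a CNN-RNN video feature extractor on clips labeled with face action units."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)  # random batching
    criterion = nn.CrossEntropyLoss()                          # assumed loss over action-unit labels
    optimizer = optim.SGD(model.parameters(), lr=1e-3)
    epoch = 0                                                  # iteration initial value epoch = 0
    while epoch < max_epoch:                                   # terminate at epoch == 2000
        for clips, action_units in loader:                     # traverse all training samples
            optimizer.zero_grad()
            loss = criterion(model(clips), action_units)
            loss.backward()                                    # backpropagate the loss
            optimizer.step()                                   # update the CNN-RNN weights
        epoch += 1
    return model
```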
Before the text data is input into the pre-trained text feature extraction neural network to acquire the text semantic feature sequence, the text feature extraction neural network needs to be trained. Training text data is prepared, positive/negative emotion labels are annotated for the training text data, the word frequencies of the training text data are counted, and the text data is segmented into words based on the maximum word frequency. A conditional probability function is trained based on the word2vec method, and word features in the text data are extracted. The text feature extraction neural network is defined as a Transformer+Attention+RNN structure, a loss function is defined, the word features of the text data and the positive/negative emotion labels of the text data are input into the text feature extraction neural network for training, and training terminates when the loss function meets a preset condition.
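The Transformer+Attention+RNN structure is not detailed further; the following sketch is only one plausible reading, with assumed layer sizes, that takes pre-computed word2vec vectors as the input word features.

```python
import torch
from torch import nn

class TextFeatureNet(nn.Module):
    """Illustrative Transformer + Attention + RNN text feature extractor (assumed sizes)."""
    def __init__(self, word_dim=300, hidden=64):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=word_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.attention = nn.MultiheadAttention(word_dim, num_heads=4, batch_first=True)
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, word_vecs):          # word_vecs: (batch, words, word_dim) word2vec features
        h = self.transformer(word_vecs)
        h, _ = self.attention(h, h, h)     # self-attention over the word sequence
        out, _ = self.rnn(h)               # per-word text semantic features
        return out

text_seq = TextFeatureNet()(torch.randn(1, 12, 300))   # (1, 12, 64)
```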
S103: and aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence.
In this implementation scenario, both audio data and video data have a time dimension, while text data do not have a time dimension, so both audio semantic feature sequences and video semantic feature sequences have a time dimension, while text semantic feature sequences do not have a time dimension. And aligning the text semantic feature sequence to the time dimension of the audio data. In other implementation scenarios, the text semantic feature sequences may also be aligned to the time dimension of the video data.
In the implementation scenario, each pronunciation phoneme in the audio data can be acquired through a voice recognition method, text semantic feature data corresponding to the pronunciation phoneme is found in the text semantic feature sequence, and each text semantic feature data in the text semantic feature sequence is aligned with the time dimension of the pronunciation phoneme to generate a text semantic time sequence.
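A toy sketch of this alignment follows; the function name, phoneme timings, frame rate and feature values are made-up examples, and real ASR output would supply the phoneme time positions.

```python
def align_text_to_audio(phonemes, text_features, num_audio_frames, frame_rate=100):
    """phonemes: list of (word_index, start_seconds); text_features: per-word feature vectors."""
    dim = len(text_features[0])
    # empty text semantic time sequence over the audio time dimension
    timeline = [[0.0] * dim for _ in range(num_audio_frames)]
    for word_idx, start_s in phonemes:
        frame = min(int(start_s * frame_rate), num_audio_frames - 1)
        timeline[frame] = list(text_features[word_idx])  # place the text feature at the phoneme's time
    return timeline

# e.g. a phoneme pronounced at 1 minute 32 seconds (92 s) maps to frame 9200 at 100 frames/s
timeline = align_text_to_audio([(0, 92.0)], [[0.1, 0.2]], num_audio_frames=10000)
```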
S104: and fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence.
In the implementation scenario, the time dimension of the video semantic feature sequence is aligned with the time dimension of the audio semantic feature sequence based on the time dimension of the audio semantic feature sequence, and the text semantic time sequence is aligned with the audio semantic feature sequence in the time dimension.
Video semantic feature data, audio semantic feature data and text semantic feature data at each moment are obtained, and the video semantic feature data, audio semantic feature data and text semantic feature data at the same moment are concatenated into semantic feature units. The semantic feature units at each moment are arranged in time sequence to generate the multi-mode semantic feature sequence.
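In code form, the per-moment concatenation could be sketched as follows; the 2-dimensional features echo the example given in the second embodiment, and real dimensions would differ.

```python
import torch

T = 100                        # number of aligned time steps (audio time dimension)
video_seq = torch.randn(T, 2)  # video semantic feature data per moment
audio_seq = torch.randn(T, 2)  # audio semantic feature data per moment
text_seq = torch.randn(T, 2)   # text semantic time sequence per moment

# three 2-dimensional vectors per moment -> one 6-dimensional semantic feature unit,
# arranged along the time dimension as the multimodal semantic feature sequence
multimodal_seq = torch.cat([video_seq, audio_seq, text_seq], dim=-1)  # (T, 6)
```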
S105: inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network, and taking the output of the emotion recognition neural network as a target emotion corresponding to the multi-mode data set to be recognized.
In the implementation scene, the multi-mode semantic feature sequence is input into a pre-trained emotion recognition neural network, and the output of the emotion recognition neural network is used as a target emotion corresponding to the multi-mode data set to be recognized.
In this implementation scenario, training of the emotion recognition neural network is required. Before training, preparing a plurality of training multi-mode semantic feature sequences, labeling emotion data for each training multi-mode semantic feature sequence, defining a network structure of the emotion recognition neural network, and defining the layer number, for example, 19 layers, of the emotion recognition neural network. It is also possible to define the type of emotion recognition neural network, such as convolutional neural network, or fully connected neural network, etc. A loss function of the emotion recognition neural network is defined, and conditions for termination of training of the emotion recognition neural network, such as stopping after 2000 times of training, are defined. After training is successful, the multi-mode semantic feature sequences are input into an emotion recognition neural network, and the emotion recognition neural network outputs target emotion corresponding to the multi-mode semantic feature sequences.
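A hedged sketch of this preparation and training step is shown below; the 19-layer fully connected structure and the 2000-step termination follow the example values above, while the loss, optimizer, emotion count and the time pooling are assumptions.

```python
import torch
from torch import nn, optim

def build_emotion_net(in_dim=6, hidden=32, num_emotions=7, num_layers=19):
    """Define the network structure, e.g. a 19-layer fully connected network (assumed sizes)."""
    layers, d = [], in_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, num_emotions))
    return nn.Sequential(*layers)

def train_emotion_net(net, sequences, labels, max_step=2000):
    """Train on labeled multimodal semantic feature sequences; stop after 2000 steps."""
    criterion = nn.CrossEntropyLoss()          # assumed loss function
    opt = optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(max_step):
        opt.zero_grad()
        logits = net(sequences.mean(dim=1))    # simple pooling over time (an assumption)
        loss = criterion(logits, labels)
        loss.backward()
        opt.step()
    return net

# e.g. net = train_emotion_net(build_emotion_net(), torch.randn(8, 100, 6), torch.randint(0, 7, (8,)))
```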
As can be seen from the above description, in this embodiment, after the multimodal data set to be identified is acquired, a video semantic feature sequence of video data is extracted, an audio semantic feature sequence of audio data is extracted, and/or a text semantic feature sequence in text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Referring to fig. 3, fig. 3 is a schematic flow chart of a second embodiment of the emotion recognition method provided by the present invention. The emotion recognition method provided by the invention comprises the following steps:
s201: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
S202: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
In a specific implementation scenario, steps S201 to S202 are substantially identical to steps S101 to S102 of the first embodiment of the emotion recognition method provided in the present invention, and are not described herein in detail.
S203: and acquiring at least one pronunciation phoneme of the audio data, and acquiring text semantic feature data in a text semantic feature sequence corresponding to each pronunciation phoneme.
In this implementation scenario, at least one pronunciation phoneme of the audio data is obtained through ASR (Automatic Speech Recognition) technology, and the text semantic feature data corresponding to each pronunciation phoneme is found in the text semantic feature sequence.
S204: and acquiring the time position of each pronunciation phoneme, and aligning the text semantic feature data with the time position of the corresponding pronunciation phoneme.
In the present implementation scenario, a time position of each pronunciation phoneme is obtained, and text semantic feature data in the text semantic feature sequence is aligned with a time position of a corresponding pronunciation phoneme. For example, when the time position of the pronunciation phoneme of o is 1 minute and 32 seconds, the text semantic feature data corresponding to o in the text semantic feature sequence is aligned with the time position of 1 minute and 32 seconds.
S205: and respectively acquiring video semantic feature data, audio semantic feature data and text semantic feature data of each moment of the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence.
In the implementation scene, the video semantic feature sequence also has a time dimension, and video semantic feature data at each moment can be acquired. Similarly, the audio semantic feature data at each moment can be obtained, and the text semantic feature data in the text semantic time sequence can be obtained after being aligned with the time dimension of the audio data in step S204.
S206: and concatenating the video semantic feature data, the audio semantic feature data and/or the text semantic feature data at the same moment into semantic feature units.
In this implementation scenario, the video semantic feature data, the audio semantic feature data and the text semantic feature data are vectors, and the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same moment are connected in series to form a semantic feature unit, i.e., three vectors are concatenated into one vector. For example, if the video semantic feature data, the audio semantic feature data and the text semantic feature data are all 2-dimensional vectors, the semantic feature unit generated after concatenation is a 6-dimensional vector.
S207: and arranging semantic feature units at each moment according to a time sequence to generate a multi-mode semantic feature sequence.
In the present implementation scenario, the semantic feature units at each moment are arranged in time sequence to generate a multi-modal semantic feature sequence. The time sequence is the time dimension of the audio semantic feature sequence.
S208: inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network, and taking the output of the emotion recognition neural network as a target emotion corresponding to the multi-mode data set to be recognized.
In a specific implementation scenario, step S208 is substantially identical to step S105 of the first embodiment of the emotion recognition method provided in the present invention, and will not be described herein.
As can be seen from the above description, in this embodiment, the text semantic feature data in the text semantic feature sequence corresponding to each pronunciation phoneme of the audio data is acquired, the time corresponding to the text semantic feature data is acquired, the video semantic feature data, audio semantic feature data and text semantic feature data at the same moment are concatenated to form semantic feature units, and the semantic feature units at each moment are arranged in time sequence to generate a multi-modal semantic feature sequence. This retains the feature alignment and fusion of the multi-modal space-time relationships, so the target emotion acquired from the multi-modal semantic feature sequence is more accurate, thereby effectively improving emotion recognition accuracy.
Referring to fig. 4, fig. 4 is a schematic flow chart of a third embodiment of an emotion recognition method according to the present invention. The emotion recognition method provided by the invention comprises the following steps:
s301: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
S302: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
S303: and aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence.
S304: and fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence.
In a specific implementation scenario, steps S301 to S304 are substantially identical to steps S101 to S104 of the first embodiment of the emotion recognition method provided in the present invention, and are not described herein.
S305: and respectively inputting semantic feature units at each moment into a pre-trained unit recognition neural network, and taking the output result of the unit recognition neural network as the emotion recognition result at each moment.
In the present implementation scenario, the semantic feature unit at each time is input into the pre-trained unit recognition neural network, and the output result of the unit recognition neural network is used as the emotion recognition result at each time.
In this implementation scenario, the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory neural network layer. Centered on the current element, the convolutional neural network defines a perception window of width 2k+1 and performs a fully connected computation on the input elements within the window. Taking one-dimensional data as an example, let the input be x = (x_1, x_2, \ldots, x_n); the model of the convolutional neural network is:

h_i = f\left( \sum_{j=-k}^{k} w_j x_{i+j} + b \right)

where f is a nonlinear activation function and the weights w_j are shared, i.e., the positions i differ but the weights applied to the corresponding inputs are equal.

CNNs are often used together with pooling layers, whose role is to make the features spatially invariant. Common forms are:

Max-pooling: y_i = \max_j x_{i \cdot s + j}

Average-pooling: y_i = \frac{1}{m} \sum_{j=1}^{m} x_{i \cdot s + j}

A Long Short-Term Memory network (LSTM) is a sequence labeling model: the output h_t at the current moment t is a function of the current input x_t and the output h_{t-1} of the previous moment. One implementation of the LSTM is as follows. Let x_t be the current input vector, h_{t-1} the output vector of the previous moment, c_{t-1} the cell state vector of the previous moment, and h_t the output vector of the current moment; h_t is computed as:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where W and U denote different weight matrices and tanh is the nonlinear activation function:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
in other implementations, the cell identification neural network may also include only one layer of neural network, such as an LSTM.
S306: and sequencing the emotion recognition results at each moment according to time to generate an emotion recognition sequence.
In the present embodiment, emotion recognition results at each time are sorted in time, and an emotion recognition sequence is generated. A plurality of unit recognition neural networks can be arranged, emotion recognition results at each moment can be output at the same time, a unit recognition neural network can be also arranged, semantic feature units at each moment are sequentially input, and emotion recognition results at each moment are sequentially output.
S307: the method comprises the steps of obtaining weights of emotion recognition results at each moment, performing dot multiplication operation on the emotion recognition results at each moment and the weights corresponding to the emotion recognition results, inputting emotion recognition sequences after the dot multiplication operation into a pre-trained emotion recognition neural network, and taking output of the emotion recognition neural network as target emotion corresponding to a multi-mode data set to be recognized.
In the present implementation scenario, the weight of the emotion recognition result at each moment in the emotion recognition sequence is obtained, and the emotion recognition result at each moment is dot-multiplied by its corresponding weight. Because the emotion recognition results at each moment in the emotion recognition sequence influence one another (for example, some emotion recognition results are subconscious reactions while others carry stronger emotion), different emotion recognition results have different degrees of influence on the target emotion corresponding to the emotion recognition sequence.
In this embodiment, attention calculation is performed on the emotion recognition sequence, and the weight of the emotion recognition result at each time is obtained.
\alpha = \mathrm{softmax}(W H)

where \alpha denotes the weights of the emotion recognition results at each moment, H denotes the emotion recognition sequence, W denotes a weight matrix, and the softmax function is computed as:

\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
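A minimal sketch of this weighting step follows, assuming a simple learnable scoring vector (the exact scoring form is an assumption, not spelled out by the patent):

```python
import torch

def attention_weight(emotion_seq: torch.Tensor, w: torch.Tensor):
    """emotion_seq: (T, D) emotion recognition results; w: (D,) assumed learnable score vector."""
    scores = emotion_seq @ w                  # one score per moment
    alpha = torch.softmax(scores, dim=0)      # weight of each moment's recognition result
    return emotion_seq * alpha.unsqueeze(-1)  # dot-multiply each result by its weight

weighted_seq = attention_weight(torch.randn(100, 7), torch.randn(7))
```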
in this implementation scenario, the emotion recognition neural network is a fully connected neural network. The fully-connected neural network defaults to establish weight connection between all inputs and outputs, taking one-dimensional data as an example:
let the input beThe model of the fully connected network is:
wherein the method comprises the steps ofFor network parameters +.>As non-linear activation functions, common e.g. Sigmoid functions
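For illustration, a small fully connected network of this kind, applied to the weighted emotion recognition sequence; all sizes and the pooling over moments are assumptions.

```python
import torch
from torch import nn

# Assumed sizes: 7-dimensional per-moment emotion results, 7 candidate target emotions.
fc_net = nn.Sequential(
    nn.Linear(7, 32), nn.Sigmoid(),   # y = sigmoid(W x + b), as in the formula above
    nn.Linear(32, 7),
)
weighted_seq = torch.randn(100, 7)                          # weighted emotion recognition sequence
target_emotion = fc_net(weighted_seq).mean(dim=0).argmax()  # pooled over moments (an assumption)
```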
As can be seen from the above description, in this embodiment, the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same moment are connected in series to form semantic feature units, the semantic feature unit at each moment is input into a neural network for identifying emotion at each moment, so as to obtain an emotion identification result at each moment, and the neural network for identifying emotion includes a convolutional neural network layer and a two-way long and short memory neural network layer, so that the accuracy of the emotion identification result can be improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a first embodiment of an intelligent device according to the present invention. The intelligent device 10 comprises an acquisition module 11, an extraction module 12, an alignment module 13, a concatenation module 14 and an emotion module 15. The acquisition module 11 acquires a multimodal data group to be identified, which includes video data, audio data, and text data. The extraction module 12 is used for extracting video semantic feature sequences of video data, extracting audio semantic feature sequences of audio data and extracting text semantic feature sequences in text data. The alignment module 13 is configured to align the text semantic feature sequence to a time dimension of the audio data, and generate a text semantic time sequence. The concatenation module 14 is configured to concatenate the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence according to a time dimension to generate a multi-modal semantic feature sequence. The emotion module 15 is configured to input the multimodal semantic feature sequence into a pre-trained emotion recognition neural network, and obtain an emotion included in the multimodal data set to be recognized.
As can be seen from the above description, in this embodiment, after the intelligent device obtains the multimodal data set to be identified, the video semantic feature sequence of the video data is extracted, the audio semantic feature sequence of the audio data is extracted, and/or the text semantic feature sequence in the text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, so that the feature alignment and fusion of the multi-mode space-time relationship can be kept, the accuracy of the target emotion obtained according to the multi-mode semantic feature sequence is higher, and the emotion recognition accuracy is effectively improved.
Please continue to refer to fig. 5. The alignment module 13 comprises a first acquisition sub-module 131 and an alignment sub-module 132. The first obtaining sub-module 131 is configured to obtain at least one pronunciation phoneme of the audio data, and obtain text semantic feature data corresponding to each pronunciation phoneme. The alignment sub-module 132 is configured to obtain a time position of each pronunciation phoneme, and align text semantic feature data with the time position of the corresponding pronunciation phoneme.
The tandem module 14 includes a second acquisition sub-module 141 and a tandem sub-module 142. The second obtaining sub-module 141 is configured to obtain video semantic feature data, audio semantic feature data, and text semantic feature data at each moment in the video semantic feature sequence, the audio semantic feature sequence, and the text semantic time sequence, respectively. The concatenation sub-module 142 is configured to concatenate video semantic feature data, audio semantic feature data, and text semantic feature data at the same time into a semantic feature unit.
Emotion module 15 includes emotion recognition sub-module 151, arrangement sub-module 152, and emotion sub-module 153. The emotion recognition sub-module 151 is configured to input semantic feature units at each moment into a pre-trained unit recognition neural network, and obtain emotion recognition data at each moment. The arrangement sub-module 152 is configured to sort the emotion recognition data at each time according to time, and generate an emotion recognition sequence. The emotion sub-module 153 is configured to input the emotion recognition sequence into a pre-trained emotion recognition neural network, and obtain the emotion included in the multimodal data set to be recognized.
Emotion submodule 153 includes a weight element 1531. The weight unit 1531 is configured to obtain weights of the emotion recognition data at each moment, perform a dot product operation on the emotion recognition data at each moment and the weights corresponding to the emotion recognition data, and input the calculated emotion recognition sequence into the pre-trained emotion recognition neural network.
The weight unit 1531 is configured to perform attention operation on the emotion recognition sequence, and obtain weights of emotion recognition data at each moment.
The unit identification neural network comprises a convolutional neural network layer and a two-way long and short memory network layer.
Wherein, emotion recognition neural network is the full-connection neural network.
The smart device 10 further includes a training module 16, the training module 16 being configured to train the emotion recognition neural network.
Training module 16 includes a preparation sub-module 161, a definition sub-module 162, and an input sub-module 163.
The preparation sub-module 161 is configured to prepare a plurality of training multi-modal feature sequences and label the target emotion of each training multi-modal feature sequence. Definition submodule 162 is used to define the structure, loss function, and termination conditions of the trained emotion recognition neural network. The input submodule 163 is used for training the input emotion recognition neural network by taking the multiple multi-modal feature sequences and the corresponding target emotion thereof.
As can be seen from the above description, in this embodiment, the semantic feature units at each moment are arranged in time sequence to generate a multi-modal semantic feature sequence. Because semantic features, rather than low-level features, are acquired, the emotion features of the multi-modal data set to be identified can be represented more accurately, and the feature alignment and fusion preserve the multi-modal space-time relationships, so the target emotion acquired from the multi-modal semantic feature sequence is more accurate and the accuracy of emotion identification is effectively improved. The video semantic feature data, audio semantic feature data and text semantic feature data at the same moment are connected in series to form semantic feature units, and the semantic feature unit at each moment is input into a unit recognition neural network to acquire an emotion recognition result at each moment; since the unit recognition neural network comprises a convolutional neural network layer and a bidirectional long short-term memory neural network layer, the accuracy of the emotion recognition result can be improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a second embodiment of the smart device according to the present invention. The smart device 20 includes a processor 21, a memory 22, and a fetch circuit 23. The processor 21 is coupled to the memory 22 and the acquisition circuit 23. The memory 22 has stored therein a computer program which is executed by the processor 21 in operation to implement the method as shown in fig. 2-4. The detailed method can be referred to above, and will not be described here.
As can be seen from the above description, in this embodiment, after the intelligent device obtains the multimodal data set to be identified, the video semantic feature sequence of the video data is extracted, the audio semantic feature sequence of the audio data is extracted, and/or the text semantic feature sequence in the text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present invention. The computer readable storage medium 30 stores at least one computer program 31, and the computer program 31 is executed by a processor to implement the methods shown in figs. 2-4; the detailed methods are described above and will not be repeated here. In one embodiment, the computer readable storage medium 30 may be a memory chip, a hard disk or a removable hard disk in a terminal, or other readable and writable storage means such as a flash disk, an optical disk, etc., and may also be a server, etc.
As can be seen from the above description, the computer program stored in the storage medium in this embodiment may be used to extract a video semantic feature sequence of video data, extract an audio semantic feature sequence of audio data, and/or extract a text semantic feature sequence in text data after acquiring the multimodal data set to be identified. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Compared with the prior art, the method has the advantages that semantic features are obtained instead of low-level features, the emotion features of the multi-mode data set to be identified can be more accurately represented, the alignment and fusion of the features of the multi-mode space-time relationship are reserved, and the accuracy of the target emotion obtained according to the multi-mode semantic feature sequence is higher, so that the accuracy of emotion identification is effectively improved.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (8)

1. An emotion recognition method, comprising:
acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data;
extracting a video semantic feature sequence of the video data, extracting an audio semantic feature sequence of the audio data, and/or extracting a text semantic feature sequence in the text data;
aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence;
fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence;
inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, and taking an output result of the emotion recognition neural network as a target emotion corresponding to the multi-modal data set to be recognized;
the step of inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network comprises the following steps:
respectively acquiring video semantic feature data, audio semantic feature data and/or text semantic feature data of each moment of the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence;
the video semantic feature data, the audio semantic feature data and/or the text semantic feature data at the same moment are connected in series to form semantic feature units;
the semantic feature units at each moment are arranged according to a time sequence to generate a multi-mode semantic feature sequence;
inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network;
the step of inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network to obtain the emotion included in the multi-modal data set to be recognized comprises the following steps:
inputting the semantic feature units at each moment into a pre-trained unit recognition neural network respectively, and taking the output result of the unit recognition neural network as the emotion recognition result at each moment;
the emotion recognition results at each moment are sequenced according to time, and an emotion recognition sequence is generated;
inputting the emotion recognition sequence into a pre-trained emotion recognition neural network to obtain the emotion included in the multi-modal data set to be recognized.
2. The emotion recognition method according to claim 1, wherein the step of aligning the text semantic feature sequence to the time dimension of the audio data includes:
acquiring at least one pronunciation phoneme of audio data, and acquiring text semantic feature data in a text semantic feature sequence corresponding to each pronunciation phoneme;
and acquiring the time position of each pronunciation phoneme, and aligning the text semantic feature data with the time position of the corresponding pronunciation phoneme.
3. The emotion recognition method of claim 1, wherein the step of inputting the emotion recognition sequence into a pre-trained emotion recognition neural network comprises:
and obtaining the weight of the emotion recognition result at each moment, performing dot multiplication operation on the emotion recognition result at each moment and the weight corresponding to the emotion recognition result, and inputting the emotion recognition sequence after dot multiplication operation into a pre-trained emotion recognition neural network.
4. The emotion recognition method of claim 3, wherein,
the step of obtaining the weight of the emotion recognition result at each moment comprises the following steps:
and performing attention operation on the emotion recognition sequence to acquire the weight of the emotion recognition result at each moment.
5. The emotion recognition method of claim 1, wherein before the step of inputting the multimodal semantic feature sequence into a pre-trained emotion recognition neural network, comprising:
training the emotion recognition neural network;
the step of training the emotion recognition neural network comprises the following steps:
preparing a plurality of training multi-modal feature sequences, and labeling target emotion of each training multi-modal feature sequence;
defining the structure, the loss function and the termination condition of the trained emotion recognition neural network;
and inputting the multi-modal feature sequences and the corresponding target emotion thereof into the emotion recognition neural network for training.
6. An intelligent device, characterized by comprising:
the multi-mode data acquisition module is used for acquiring a multi-mode data set to be identified, wherein the multi-mode data set to be identified comprises video data, audio data and text data;
the extraction module is used for extracting video semantic feature sequences of the video data, extracting audio semantic feature sequences of the audio data and extracting text semantic feature sequences in the text data;
the alignment module is used for aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence;
the serial module is used for connecting the video semantic feature sequence, the audio semantic feature sequence and the text semantic time sequence in series according to the time dimension to generate a multi-mode semantic feature sequence; respectively acquiring video semantic feature data, audio semantic feature data and/or text semantic feature data of each moment of the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence; the video semantic feature data, the audio semantic feature data and/or the text semantic feature data at the same moment are connected in series to form semantic feature units; the semantic feature units at each moment are arranged according to a time sequence to generate a multi-mode semantic feature sequence; inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network;
the emotion module is used for inputting the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network, and taking an output result of the emotion recognition neural network as a target emotion corresponding to the multi-modal data set to be recognized; inputting the semantic feature units at each moment into a pre-trained unit recognition neural network respectively, and taking the output result of the unit recognition neural network as the emotion recognition result at each moment; the emotion recognition results at each moment are sequenced according to time, and an emotion recognition sequence is generated; inputting the emotion recognition sequence into a pre-trained emotion recognition neural network to obtain the emotion included in the multi-modal data set to be recognized.
7. An intelligent device, characterized by comprising: acquisition circuitry, a processor, a memory, the processor being coupled to the memory and the acquisition circuitry, the memory having stored therein a computer program, the processor executing the computer program to implement the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that a computer program is stored, which computer program is executable by a processor to implement the method of any one of claims 1-5.
CN201980003314.8A 2019-12-30 2019-12-30 Emotion recognition method, smart device and computer-readable storage medium Active CN111164601B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130065 WO2021134277A1 (en) 2019-12-30 2019-12-30 Emotion recognition method, intelligent device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN111164601A CN111164601A (en) 2020-05-15
CN111164601B true CN111164601B (en) 2023-07-18

Family

ID=70562368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980003314.8A Active CN111164601B (en) 2019-12-30 2019-12-30 Emotion recognition method, smart device and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111164601B (en)
WO (1) WO2021134277A1 (en)

Families Citing this family (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753549B (en) * 2020-05-22 2023-07-21 江苏大学 A multi-modal emotional feature learning and recognition method based on attention mechanism
CN111832317B (en) * 2020-07-09 2023-08-18 广州市炎华网络科技有限公司 Intelligent information diversion method, device, computer equipment and readable storage medium
CN111898670B (en) * 2020-07-24 2024-04-05 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium
CN111723783B (en) * 2020-07-29 2023-12-08 腾讯科技(深圳)有限公司 Content identification method and related device
CN112101097A (en) * 2020-08-02 2020-12-18 华南理工大学 Depression and suicide tendency identification method integrating body language, micro expression and language
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112418034B (en) * 2020-11-12 2024-08-20 上海元梦智能科技有限公司 Multimodal emotion recognition method, device, electronic device and storage medium
CN112489635B (en) * 2020-12-03 2022-11-11 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112560622B (en) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 Virtual object motion control method, device and electronic equipment
CN112584062B (en) * 2020-12-10 2023-08-08 上海幻电信息科技有限公司 Background audio construction method and device
CN112735404A (en) * 2020-12-18 2021-04-30 平安科技(深圳)有限公司 Ironic detection method, system, terminal device and storage medium
CN112579745B (en) * 2021-02-22 2021-06-08 中国科学院自动化研究所 Dialogue emotion error correction system based on graph neural network
CN113470787B (en) * 2021-07-09 2024-01-30 福州大学 Emotion recognition and desensitization training effect evaluation method based on neural network
CN113536009B (en) * 2021-07-14 2024-11-29 Oppo广东移动通信有限公司 Data description method and device, computer readable medium and electronic equipment
CN113408503B (en) * 2021-08-19 2021-12-21 明品云(北京)数据科技有限公司 Emotion recognition method and device, computer readable storage medium and equipment
CN113743267B (en) * 2021-08-25 2023-06-16 中国科学院软件研究所 Multi-mode video emotion visualization method and device based on spiral and text
CN113688745B (en) * 2021-08-27 2024-04-05 大连海事大学 A gait recognition method based on automatic mining of related nodes and statistical information
CN113704504B (en) * 2021-08-30 2023-09-19 平安银行股份有限公司 Emotion recognition method, device, equipment and storage medium based on chat record
CN113704552B (en) * 2021-08-31 2024-09-24 哈尔滨工业大学 A sentiment analysis method, system and device based on cross-modal automatic alignment and pre-trained language model
CN113903327B (en) * 2021-09-13 2024-06-28 北京卷心菜科技有限公司 Voice environment atmosphere recognition method based on deep neural network
CN113837072A (en) * 2021-09-24 2021-12-24 厦门大学 Method for sensing emotion of speaker by fusing multidimensional information
CN114022668B (en) * 2021-10-29 2023-09-22 北京有竹居网络技术有限公司 A method, device, equipment and medium for text-aligned speech
CN114005446B (en) * 2021-11-01 2024-12-13 科大讯飞股份有限公司 Sentiment analysis method, related device and readable storage medium
CN114067241B (en) * 2021-11-03 2025-05-27 Oppo广东移动通信有限公司 Video emotion prediction method, device, equipment and readable storage medium
WO2023084348A1 (en) * 2021-11-12 2023-05-19 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
US12333794B2 (en) 2021-11-12 2025-06-17 Sony Group Corporation Emotion recognition in multimedia videos using multi-modal fusion-based deep neural network
CN114255433B (en) * 2022-02-24 2022-05-31 首都师范大学 Depression identification method and device based on facial video and storage medium
CN114581570B (en) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114821558A (en) * 2022-03-10 2022-07-29 电子科技大学 A Multi-Orientation Text Detection Method Based on Text Feature Alignment
CN115101032B (en) * 2022-06-17 2024-06-28 北京有竹居网络技术有限公司 Method, apparatus, electronic device and medium for generating a soundtrack for text
CN114913590B (en) * 2022-07-15 2022-12-27 山东海量信息技术研究院 Data emotion recognition method, device and equipment and readable storage medium
CN115393927A (en) * 2022-08-05 2022-11-25 北京理工大学 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network
CN115359398A (en) * 2022-08-19 2022-11-18 浙江理工大学 Voice video positioning model and construction method, device and application thereof
CN115526228A (en) * 2022-08-19 2022-12-27 科大讯飞股份有限公司 Identification method, identification device, electronic equipment and storage medium
CN115512104A (en) * 2022-09-02 2022-12-23 华为技术有限公司 A data processing method and related equipment
CN115690875A (en) * 2022-10-19 2023-02-03 桂林电子科技大学 Emotion recognition method, device and system and storage medium
CN115641533A (en) * 2022-10-21 2023-01-24 湖南大学 Target object emotion recognition method, device and computer equipment
CN116364066A (en) * 2023-03-16 2023-06-30 北京有竹居网络技术有限公司 Classification model generation method, audio classification method, device, medium and equipment
CN116522962A (en) * 2023-03-29 2023-08-01 北京有竹居网络技术有限公司 Method, apparatus, electronic device and medium for video translation
CN116467416B (en) * 2023-04-21 2025-05-13 四川省人工智能研究院(宜宾) A multimodal dialogue emotion recognition method and system based on graph neural network
CN116245102B (en) * 2023-05-11 2023-07-04 广州数说故事信息科技有限公司 Multi-mode emotion recognition method based on multi-head attention and graph neural network
CN116561634B (en) * 2023-05-12 2025-08-26 北京理工大学 Multimodal physiological signal semantic alignment method and system for emotion recognition
CN116501902A (en) * 2023-05-19 2023-07-28 平安科技(深圳)有限公司 Multimodal movie emotion recognition method, device, device, and storage medium
CN116612543B (en) * 2023-06-01 2025-08-01 科大讯飞股份有限公司 Emotion recognition method, emotion recognition device, storage medium and equipment
CN117058405B (en) * 2023-07-04 2024-05-17 首都医科大学附属北京朝阳医院 Image-based emotion recognition method, system, storage medium and terminal
CN117033637B (en) * 2023-08-22 2024-03-22 镁佳(北京)科技有限公司 Invalid conversation refusing model training method, invalid conversation refusing method and device
CN117197719A (en) * 2023-09-26 2023-12-08 深圳技术大学 Multimodal emotion recognition methods, devices, equipment, computer storage media
CN117546796B (en) * 2023-12-26 2024-06-28 深圳天喆科技有限公司 Dog training control method and system based on dog behavior recognition technology
CN117611845B (en) * 2024-01-24 2024-04-26 浪潮通信信息系统有限公司 Multi-mode data association identification method, device, equipment and storage medium
CN117893948A (en) * 2024-01-30 2024-04-16 桂林电子科技大学 Multimodal sentiment analysis method based on multi-granularity feature comparison and fusion framework
CN117933269B (en) * 2024-03-22 2024-06-18 合肥工业大学 A method and system for constructing a multimodal deep model based on emotion distribution
CN118228194B (en) * 2024-04-02 2024-11-08 北京科技大学 A multimodal personality prediction method and system integrating spatiotemporal graph attention network
CN118861977A (en) * 2024-07-04 2024-10-29 南通大学 A multimodal sentiment analysis system and method
CN118841014B (en) * 2024-09-20 2024-12-20 卓世智星(青田)元宇宙科技有限公司 Digital human interaction method and device based on emotion and electronic equipment
CN119918010B (en) * 2025-01-21 2025-09-30 广东工业大学 Multi-mode emotion analysis method and system based on multidimensional sensing
CN120543710B (en) * 2025-04-25 2026-02-03 中建材信息技术股份有限公司 Digital person generation method based on multi-mode large model
CN121144858B (en) * 2025-11-19 2026-01-30 厦门身份宝网络科技有限公司 Multi-mode data multi-model combined training method and system

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572B (en) * 2017-08-15 2021-04-02 中国科学院自动化研究所 Multimodal emotion recognition method and system based on neural network and transfer learning
WO2019132459A1 (en) * 2017-12-28 2019-07-04 주식회사 써로마인드로보틱스 Multimodal information coupling method for recognizing user's emotional behavior, and device therefor
JP7199451B2 (en) * 2018-01-26 2023-01-05 インスティテュート オブ ソフトウェア チャイニーズ アカデミー オブ サイエンシズ Emotional interaction system, device and method based on emotional computing user interface
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108877801B (en) * 2018-06-14 2020-10-02 南京云思创智信息科技有限公司 Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108805089B (en) * 2018-06-14 2021-06-29 南京云思创智信息科技有限公司 Multi-modal-based emotion recognition method
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A multimodal emotion recognition method based on attention feature fusion
CN109472232B (en) * 2018-10-31 2020-09-29 山东师范大学 Video semantic representation method, system and medium based on multi-mode fusion mechanism
CN110033029A * 2019-03-22 2019-07-19 五邑大学 An emotion recognition method and device based on a multi-modal emotion model
CN110147548B (en) * 2019-04-15 2023-01-31 浙江工业大学 Emotion recognition method based on bidirectional gated recurrent unit network and novel network initialization
CN110188343B (en) * 2019-04-22 2023-01-31 浙江工业大学 Multimodal Emotion Recognition Method Based on Fusion Attention Network
CN110083716A * 2019-05-07 2019-08-02 青海大学 Multi-modal affective computing method and system based on the Tibetan language

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN109460737A * 2018-11-13 2019-03-12 四川大学 A multi-modal speech emotion recognition method based on an enhanced residual neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speech Emotion Recognition Based on Long Short-Term Memory and Convolutional Neural Networks; Lu Guanming et al.; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition); Vol. 38, No. 05; pp. 63-69 *

Also Published As

Publication number Publication date
WO2021134277A1 (en) 2021-07-08
CN111164601A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN111164601B (en) Emotion recognition method, smart device and computer-readable storage medium
CN111695352B (en) Scoring method, device, terminal equipment and storage medium based on semantic analysis
EP3617946B1 (en) Context acquisition method and device based on voice interaction
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN112100337B (en) Emotion recognition method and device in interactive dialogue
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN108829662A (en) A kind of conversation activity recognition methods and system based on condition random field structuring attention network
CN112910761B (en) Instant messaging method, device, equipment, storage medium and program product
CN114822558A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN116049446B (en) Event extraction method, device, equipment and computer readable storage medium
CN113870863B (en) Voiceprint recognition method and device, storage medium and electronic equipment
CN117593608B (en) Training method, device, equipment and storage medium for graphic recognition large model
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN113177112A (en) KR product fusion multi-mode information-based neural network visual dialogue model and method
CN114333786B (en) Speech emotion recognition method and related device, electronic device and storage medium
CN116312512A (en) Audio-visual fusion wake-up word recognition method and device for multi-person scenes
CN115512693B (en) Audio recognition method, acoustic model training method, device and storage medium
CN111563161A (en) Sentence recognition method, sentence recognition device and intelligent equipment
CN111462762A (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN117174177B (en) Training method and device for protein sequence generation model and electronic equipment
CN117541960A (en) Target object identification method, device, computer equipment and storage medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN112580669A (en) Training method and device for voice information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant