Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
In the prior art, decision-layer fusion ignores the spatio-temporal relationship among multi-modal semantic features. Because different spatio-temporal distributions of multi-modal semantic features correspond to different emotion information, ignoring this relationship leads to low emotion recognition accuracy.
In this embodiment, in order to solve the above-mentioned problem, an emotion recognition method is provided, which can effectively improve the accuracy of emotion recognition.
Referring to fig. 1, fig. 1 is a diagram illustrating an application environment of an emotion recognition method according to an embodiment of the present invention. The emotion recognition method is applied to an emotion recognition system. The emotion recognition system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network; the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. The terminal 110 is configured to obtain a multimodal data set to be identified, where the multimodal data set to be identified includes at least two of video data, audio data, and/or text data. The server 120 is configured to extract a video semantic feature sequence of the video data, extract an audio semantic feature sequence of the audio data, and/or extract a text semantic feature sequence in the text data; align the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence; fuse the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-modal semantic feature sequence; and input the multi-modal semantic feature sequence into a pre-trained emotion recognition neural network to obtain a target emotion corresponding to the multi-modal data set to be recognized.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of an emotion recognition method according to the present invention. The emotion recognition method provided by the invention comprises the following steps:
s101: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
In one particular implementation scenario, a multimodal data set to be identified is obtained, the multimodal data set to be identified comprising at least two of video data, audio data, and/or text data. In this implementation scenario, the multimodal data set to be identified includes video data, audio data, and text data. The multimodal data sets to be identified may be provided by the user, or may be obtained from a database, or may be generated by live recording. The video data, the audio data and the text data correspond to the same speaker in the same time period.
S102: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
In the implementation scene, a video semantic feature sequence of video data is extracted, an audio semantic feature sequence of audio data is extracted, and a text semantic feature sequence in text data is extracted. The video semantic feature sequence, the audio semantic feature sequence of the audio data and the text semantic feature sequence can be obtained by inputting the multimodal data set to be identified into a pre-trained feature extraction neural network. In other implementation scenarios, the video data may be input into a pre-trained video feature extraction neural network to obtain a video semantic feature sequence, the audio data may be input into a pre-trained audio feature extraction neural network to obtain an audio semantic feature sequence, and the text data may be input into a pre-trained text feature extraction neural network to obtain a text semantic feature sequence.
Specifically, before the video data is input into the pre-trained video feature extraction neural network to acquire the video semantic feature sequence, the video feature extraction neural network needs to be trained. Face video data is prepared, and face action units in the face video data are labeled. Before training, the structure of the video feature extraction network is defined as a CNN-RNN structure, the iteration initial value is defined as epoch=0, and a loss function is defined. The face video data and the corresponding face action units are input into the video feature extraction neural network to obtain training results, the training results are randomly batched, the loss function is calculated, the weights of the CNN-RNN are updated by a backpropagation algorithm according to the calculated loss value, and after all training samples are traversed, epoch is incremented by 1; training ends when epoch=2000.
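The training loop described above can be sketched as follows. This is a minimal illustration only: the real network is a CNN-RNN trained on labeled face action units, but here a hypothetical one-parameter linear model stands in so that the loop structure (random batching, loss computation, backpropagation-style weight update, fixed epoch cutoff) is visible.

```python
import random

# Minimal sketch of the described training procedure. The one-parameter
# linear model and the toy data are stand-ins (assumptions), not the
# actual CNN-RNN; only the loop structure mirrors the text.
def train(samples, lr=0.01, max_epoch=2000):
    w = 0.0                                  # stand-in for the CNN-RNN weights
    for epoch in range(max_epoch):           # iterate until epoch == max_epoch
        random.shuffle(samples)              # "randomly batching" the samples
        for x, y in samples:
            loss_grad = 2 * (w * x - y) * x  # gradient of the squared loss
            w -= lr * loss_grad              # backpropagation-style update
    return w

# Toy data whose true relation is y = 3x; training should recover w close to 3.
data = [(x, 3 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train(data, lr=0.05, max_epoch=200)
```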
The text data is input into a pre-trained text feature extraction neural network, and before the text semantic feature sequence is acquired, the text feature extraction neural network needs to be trained. Training text data is prepared and labeled with positive/negative emotion labels, word frequencies of the training text data are counted, and the text data is segmented into words based on the highest word frequencies. A conditional probability function is trained based on the word2vec method to extract word features from the text data. The text feature extraction neural network is defined as a Transformer+Attention+RNN structure, a loss function is defined, the word features of the text data and the positive/negative emotion labels of the text data are input into the text feature extraction neural network for training, and training terminates when the loss function meets a preset condition.
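The frequency-based word segmentation step can be sketched as follows. This is one plausible reading of the description (greedy segmentation that prefers the known word with the highest counted frequency at each position); the vocabulary and counts are toy stand-ins, not the actual training corpus.

```python
from collections import Counter

# Hedged sketch: segment a character sequence by greedily choosing, at each
# position, the known word with the highest training-corpus frequency.
# The frequency table below is a hypothetical stand-in for the counted
# word frequencies described in the text.
def segment(text, freq):
    words, i = [], 0
    while i < len(text):
        # candidate known words starting at position i (longest first)
        cands = [text[i:j] for j in range(len(text), i, -1) if text[i:j] in freq]
        best = max(cands, key=lambda w: freq[w]) if cands else text[i]
        words.append(best)
        i += len(best)
    return words

freq = Counter({"emotion": 9, "recognition": 7, "emotionrecognition": 2})
print(segment("emotionrecognition", freq))  # the two higher-frequency words win
```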
S103: and aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence.
In this implementation scenario, both the audio data and the video data have a time dimension, while the text data does not; therefore, both the audio semantic feature sequence and the video semantic feature sequence have a time dimension, while the text semantic feature sequence does not. The text semantic feature sequence is aligned to the time dimension of the audio data. In other implementation scenarios, the text semantic feature sequence may also be aligned to the time dimension of the video data.
In the implementation scenario, each pronunciation phoneme in the audio data can be acquired through a voice recognition method, text semantic feature data corresponding to the pronunciation phoneme is found in the text semantic feature sequence, and each text semantic feature data in the text semantic feature sequence is aligned with the time dimension of the pronunciation phoneme to generate a text semantic time sequence.
S104: and fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence.
In the implementation scenario, the time dimension of the video semantic feature sequence is aligned with the time dimension of the audio semantic feature sequence based on the time dimension of the audio semantic feature sequence, and the text semantic time sequence is aligned with the audio semantic feature sequence in the time dimension.
The method comprises the steps of obtaining video semantic feature data, audio semantic feature data and text semantic feature data at each moment, and connecting the video semantic feature data, the audio semantic feature data and the text semantic feature data at each moment in series to form semantic feature units. And generating a multi-mode semantic feature sequence according to the semantic feature units at each moment by time sequence arrangement.
S105: inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network, and taking the output of the emotion recognition neural network as a target emotion corresponding to the multi-mode data set to be recognized.
In the implementation scene, the multi-mode semantic feature sequence is input into a pre-trained emotion recognition neural network, and the output of the emotion recognition neural network is used as a target emotion corresponding to the multi-mode data set to be recognized.
In this implementation scenario, training of the emotion recognition neural network is required. Before training, preparing a plurality of training multi-mode semantic feature sequences, labeling emotion data for each training multi-mode semantic feature sequence, defining a network structure of the emotion recognition neural network, and defining the layer number, for example, 19 layers, of the emotion recognition neural network. It is also possible to define the type of emotion recognition neural network, such as convolutional neural network, or fully connected neural network, etc. A loss function of the emotion recognition neural network is defined, and conditions for termination of training of the emotion recognition neural network, such as stopping after 2000 times of training, are defined. After training is successful, the multi-mode semantic feature sequences are input into an emotion recognition neural network, and the emotion recognition neural network outputs target emotion corresponding to the multi-mode semantic feature sequences.
As can be seen from the above description, in this embodiment, after the multimodal data set to be identified is acquired, a video semantic feature sequence of video data is extracted, an audio semantic feature sequence of audio data is extracted, and/or a text semantic feature sequence in text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Referring to fig. 3, fig. 3 is a schematic flow chart of a second embodiment of the emotion recognition method provided by the present invention. The emotion recognition method provided by the invention comprises the following steps:
s201: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
S202: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
In a specific implementation scenario, steps S201 to S202 are substantially identical to steps S101 to S102 of the first embodiment of the emotion recognition method provided in the present invention, and are not described herein in detail.
S203: and acquiring at least one pronunciation phoneme of the audio data, and acquiring text semantic feature data in a text semantic feature sequence corresponding to each pronunciation phoneme.
In this implementation scenario, at least one pronunciation phoneme of the audio data is obtained through ASR (Automatic Speech Recognition) technology, and text semantic feature data corresponding to each pronunciation phoneme is found in the text semantic feature sequence.
S204: and acquiring the time position of each pronunciation phoneme, and aligning the text semantic feature data with the time position of the corresponding pronunciation phoneme.
In the present implementation scenario, a time position of each pronunciation phoneme is obtained, and text semantic feature data in the text semantic feature sequence is aligned with a time position of a corresponding pronunciation phoneme. For example, when the time position of the pronunciation phoneme of o is 1 minute and 32 seconds, the text semantic feature data corresponding to o in the text semantic feature sequence is aligned with the time position of 1 minute and 32 seconds.
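The alignment in step S204 can be sketched as follows. The phonemes, time positions, and feature vectors are toy stand-ins (in practice the times would come from an ASR system); the example mirrors the "o at 1 minute 32 seconds" case above.

```python
# Hedged sketch of step S204: each pronunciation phoneme carries a time
# position, and each text semantic feature is attached to the time of its
# corresponding phoneme, yielding a time-ordered text semantic sequence.
def align_to_phonemes(phoneme_times, text_features):
    # phoneme_times: {phoneme: time in seconds}; text_features: {phoneme: vector}
    return sorted(
        (phoneme_times[p], feat)
        for p, feat in text_features.items()
        if p in phoneme_times
    )

phoneme_times = {"o": 92.0, "a": 93.5}      # "o" occurs at 1 min 32 s = 92 s
text_features = {"o": [0.1, 0.4], "a": [0.7, 0.2]}
timeline = align_to_phonemes(phoneme_times, text_features)
```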
S205: and respectively acquiring video semantic feature data, audio semantic feature data and text semantic feature data of each moment of the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence.
In the implementation scene, the video semantic feature sequence also has a time dimension, and video semantic feature data at each moment can be acquired. Similarly, the audio semantic feature data at each moment can be obtained, and the text semantic feature data in the text semantic time sequence can be obtained after being aligned with the time dimension of the audio data in step S204.
S206: and concatenating the video semantic feature data, the audio semantic feature data and/or the text semantic feature data at the same moment into semantic feature units.
In the implementation scene, the video semantic feature data, the audio semantic feature data and the text semantic feature data are vectors, and the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same moment are connected in series to form a semantic feature unit, namely three vectors are concatenated into one vector. For example, if the video semantic feature data, the audio semantic feature data and the text semantic feature data are all 2-dimensional vectors, the semantic feature unit generated after concatenation is a 6-dimensional vector.
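The concatenation in step S206 reduces to simple vector concatenation, as in this sketch of the 2-dimensional example above (the feature values are arbitrary stand-ins):

```python
# Hedged sketch of step S206: at one moment, the three modality feature
# vectors (each 2-dimensional here) are concatenated into a single
# 6-dimensional semantic feature unit.
def make_unit(video_feat, audio_feat, text_feat):
    return video_feat + audio_feat + text_feat   # list concatenation

unit = make_unit([0.1, 0.2], [0.3, 0.4], [0.5, 0.6])
print(len(unit))  # 6
```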
S207: and arranging semantic feature units at each moment according to a time sequence to generate a multi-mode semantic feature sequence.
In the present implementation scenario, the semantic feature units at each moment are arranged in time sequence to generate a multi-modal semantic feature sequence. The time sequence is the time dimension of the audio semantic feature sequence.
S208: inputting the multi-mode semantic feature sequences into a pre-trained emotion recognition neural network, and taking the output of the emotion recognition neural network as a target emotion corresponding to the multi-mode data set to be recognized.
In a specific implementation scenario, step S208 is substantially identical to step S105 of the first embodiment of the emotion recognition method provided in the present invention, and will not be described herein.
As can be seen from the above description, in this embodiment, text semantic feature data corresponding to each pronunciation phoneme of the audio data is acquired in the text semantic feature sequence, the time corresponding to the text semantic feature data is acquired, the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same time are concatenated into semantic feature units, and the semantic feature units at each time are arranged according to a time sequence to generate a multi-modal semantic feature sequence. This retains the feature alignment and fusion of the multi-modal space-time relationship, so that the target emotion acquired according to the multi-modal semantic feature sequence has higher accuracy, thereby effectively improving emotion recognition accuracy.
Referring to fig. 4, fig. 4 is a schematic flow chart of a third embodiment of an emotion recognition method according to the present invention. The emotion recognition method provided by the invention comprises the following steps:
s301: and acquiring a multi-modal data set to be identified, wherein the multi-modal data set to be identified comprises at least two of video data, audio data and/or text data.
S302: extracting a video semantic feature sequence of video data, extracting an audio semantic feature sequence of audio data, and/or extracting a text semantic feature sequence in text data.
S303: and aligning the text semantic feature sequence to the time dimension of the audio data to generate a text semantic time sequence.
S304: and fusing the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence according to the time dimension to generate a multi-mode semantic feature sequence.
In a specific implementation scenario, steps S301 to S304 are substantially identical to steps S101 to S104 of the first embodiment of the emotion recognition method provided in the present invention, and are not described herein.
S305: and respectively inputting semantic feature units at each moment into a pre-trained unit recognition neural network, and taking the output result of the unit recognition neural network as the emotion recognition result at each moment.
In the present implementation scenario, the semantic feature unit at each time is input into the pre-trained unit recognition neural network, and the output result of the unit recognition neural network is used as the emotion recognition result at each time.
In this implementation scenario, the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory (BiLSTM) neural network layer. The convolutional neural network defines a perception window of width $k$ centered on the current element and performs a fully connected network calculation on the input elements within the window. Taking one-dimensional data as an example, let the input be $x = (x_1, x_2, \dots, x_n)$; the model of the convolutional neural network is:

$$h_i = f\Big(\sum_{j=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} w_j x_{i+j} + b\Big)$$

where $f$ is a nonlinear activation function and $w_j$ denotes shared weights, i.e., for window positions $i_1 \neq i_2$, the weights applied to inputs at the same relative position within the window are equal.
A CNN is often used together with a pooling layer, whose function is to provide spatial invariance of features. Common forms are:

Max-pooling: $p_i = \max_{j \in W_i} h_j$

Average-pooling: $p_i = \frac{1}{|W_i|} \sum_{j \in W_i} h_j$

where $W_i$ denotes the $i$-th pooling window over the feature map $h$.
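The one-dimensional convolution and the two pooling operations can be sketched as follows; the window sizes and weights are toy values chosen for illustration, not trained parameters.

```python
# Hedged sketch of 1-D convolution with shared weights, plus max- and
# average-pooling over non-overlapping windows of the feature map.
def conv1d(x, w, b=0.0):
    k = len(w)
    return [sum(wj * xj for wj, xj in zip(w, x[i:i + k])) + b
            for i in range(len(x) - k + 1)]

def max_pool(h, size):
    return [max(h[i:i + size]) for i in range(0, len(h), size)]

def avg_pool(h, size):
    return [sum(h[i:i + size]) / len(h[i:i + size]) for i in range(0, len(h), size)]

h = conv1d([1, 2, 3, 4, 5], w=[1, 0, -1])    # h = [-2, -2, -2]
```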
A Long Short-Term Memory network (LSTM) is a sequence annotation model; its output $h_t$ at the current moment $t$ is a function of the current input $x_t$ and the output $h_{t-1}$ at the previous moment. The following demonstrates an implementation of the LSTM:
Let $x_t$ be the current input vector, $h_{t-1}$ the output vector at the previous moment, $c_{t-1}$ the cell state vector at the previous moment, and $h_t$ the output vector at the current moment. $h_t$ is calculated as follows:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where the $W$ and $U$ terms represent different weight matrices, $\sigma$ is the Sigmoid function, and $\tanh$ is the nonlinear activation function:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
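A single LSTM step following the standard gate equations above can be sketched as follows. Scalar state is used for readability, and all weights are toy stand-ins for the $W$ and $U$ matrices.

```python
import math

# Hedged sketch of one LSTM step with scalar state; the weight values
# are arbitrary stand-ins, not trained parameters.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b: dicts of scalar weights for the f/i/o gates and candidate c
    f = sigmoid(W["f"] * x_t + U["f"] * h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] * x_t + U["i"] * h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] * x_t + U["o"] * h_prev + b["o"])  # output gate
    c_tilde = math.tanh(W["c"] * x_t + U["c"] * h_prev + b["c"])
    c_t = f * c_prev + i * c_tilde        # new cell state
    h_t = o * math.tanh(c_t)              # output at the current moment
    return h_t, c_t

W = {"f": 0.5, "i": 0.5, "o": 0.5, "c": 0.5}
U = {"f": 0.5, "i": 0.5, "o": 0.5, "c": 0.5}
b = {"f": 0.0, "i": 0.0, "o": 0.0, "c": 0.0}
h, c = lstm_step(1.0, 0.0, 0.0, W, U, b)
```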
In other implementations, the unit recognition neural network may also include only one layer of neural network, such as an LSTM.
S306: and sequencing the emotion recognition results at each moment according to time to generate an emotion recognition sequence.
In the present embodiment, the emotion recognition results at each moment are sorted in time order to generate an emotion recognition sequence. A plurality of unit recognition neural networks may be provided to output the emotion recognition results at each moment simultaneously, or a single unit recognition neural network may be provided, into which the semantic feature units at each moment are sequentially input and from which the emotion recognition results at each moment are sequentially output.
S307: the method comprises the steps of obtaining weights of emotion recognition results at each moment, performing dot multiplication operation on the emotion recognition results at each moment and the weights corresponding to the emotion recognition results, inputting emotion recognition sequences after the dot multiplication operation into a pre-trained emotion recognition neural network, and taking output of the emotion recognition neural network as target emotion corresponding to a multi-mode data set to be recognized.
In the present implementation scenario, the weight of the emotion recognition result at each moment in the emotion recognition sequence is obtained, and the emotion recognition result at each moment is dot-multiplied by its corresponding weight. The emotion recognition results at each moment in the emotion recognition sequence influence one another; for example, some emotion recognition results are subconscious reactions while others carry stronger emotion, so different emotion recognition results influence the target emotion corresponding to the emotion recognition sequence to different degrees.
In this embodiment, attention calculation is performed on the emotion recognition sequence, and the weight of the emotion recognition result at each time is obtained.
$$\alpha = \mathrm{softmax}(S)$$

where $\alpha$ contains the weight of the emotion recognition result at each moment and $S$ is the emotion recognition sequence. The $\mathrm{softmax}$ function is calculated as follows:

$$\mathrm{softmax}(S)_t = \frac{e^{S_t}}{\sum_j e^{S_j}}$$
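The attention weighting can be sketched as follows: a softmax over per-moment scores yields weights that sum to 1, and each moment's result is then scaled by its weight. The scores are toy stand-ins for the emotion recognition sequence.

```python
import math

# Hedged sketch of the attention calculation: softmax over the sequence
# produces per-moment weights, which are dot-multiplied with the results.
def softmax(scores):
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]                         # toy per-moment scores
weights = softmax(scores)
weighted = [w * s for w, s in zip(weights, scores)]
```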
in this implementation scenario, the emotion recognition neural network is a fully connected neural network. The fully-connected neural network defaults to establish weight connection between all inputs and outputs, taking one-dimensional data as an example:
Let the input be $x = (x_1, x_2, \dots, x_n)$; the model of the fully connected network is:

$$y = f(Wx + b)$$

where $W$ and $b$ are network parameters and $f$ is a nonlinear activation function, commonly e.g. the Sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
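A fully connected layer with Sigmoid activation can be sketched as follows; the weight matrix and bias are arbitrary stand-ins for the trained network parameters.

```python
import math

# Hedged sketch of y = f(Wx + b) with a Sigmoid activation: every output
# is connected to every input, as the text describes.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b):
    # W: list of rows, one per output unit
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

y = dense([1.0, -1.0], W=[[0.5, 0.5], [1.0, -1.0]], b=[0.0, 0.0])
```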
As can be seen from the above description, in this embodiment, the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same moment are concatenated into semantic feature units, and the semantic feature unit at each moment is input into the unit recognition neural network to obtain the emotion recognition result at each moment. The unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory neural network layer, which can improve the accuracy of the emotion recognition result.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a first embodiment of an intelligent device according to the present invention. The intelligent device 10 comprises an acquisition module 11, an extraction module 12, an alignment module 13, a concatenation module 14 and an emotion module 15. The acquisition module 11 acquires a multimodal data group to be identified, which includes video data, audio data, and text data. The extraction module 12 is used for extracting video semantic feature sequences of video data, extracting audio semantic feature sequences of audio data and extracting text semantic feature sequences in text data. The alignment module 13 is configured to align the text semantic feature sequence to a time dimension of the audio data, and generate a text semantic time sequence. The concatenation module 14 is configured to concatenate the video semantic feature sequence, the audio semantic feature sequence, and the text semantic temporal sequence according to a time dimension to generate a multi-modal semantic feature sequence. The emotion module 15 is configured to input the multimodal semantic feature sequence into a pre-trained emotion recognition neural network, and obtain an emotion included in the multimodal data set to be recognized.
As can be seen from the above description, in this embodiment, after the intelligent device obtains the multimodal data set to be identified, the video semantic feature sequence of the video data is extracted, the audio semantic feature sequence of the audio data is extracted, and/or the text semantic feature sequence in the text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, so that the feature alignment and fusion of the multi-mode space-time relationship can be kept, the accuracy of the target emotion obtained according to the multi-mode semantic feature sequence is higher, and the emotion recognition accuracy is effectively improved.
Please continue to refer to fig. 5. The alignment module 13 comprises a first acquisition sub-module 131 and an alignment sub-module 132. The first obtaining sub-module 131 is configured to obtain at least one pronunciation phoneme of the audio data, and obtain text semantic feature data corresponding to each pronunciation phoneme. The alignment sub-module 132 is configured to obtain a time position of each pronunciation phoneme, and align text semantic feature data with the time position of the corresponding pronunciation phoneme.
The concatenation module 14 includes a second acquisition sub-module 141 and a concatenation sub-module 142. The second acquisition sub-module 141 is configured to acquire video semantic feature data, audio semantic feature data, and text semantic feature data at each moment in the video semantic feature sequence, the audio semantic feature sequence, and the text semantic time sequence, respectively. The concatenation sub-module 142 is configured to concatenate the video semantic feature data, the audio semantic feature data, and the text semantic feature data at the same moment into a semantic feature unit.
Emotion module 15 includes emotion recognition sub-module 151, arrangement sub-module 152, and emotion sub-module 153. The emotion recognition sub-module 151 is configured to input semantic feature units at each moment into a pre-trained unit recognition neural network, and obtain emotion recognition data at each moment. The arrangement sub-module 152 is configured to sort the emotion recognition data at each time according to time, and generate an emotion recognition sequence. The emotion sub-module 153 is configured to input the emotion recognition sequence into a pre-trained emotion recognition neural network, and obtain the emotion included in the multimodal data set to be recognized.
Emotion submodule 153 includes a weight element 1531. The weight unit 1531 is configured to obtain weights of the emotion recognition data at each moment, perform a dot product operation on the emotion recognition data at each moment and the weights corresponding to the emotion recognition data, and input the calculated emotion recognition sequence into the pre-trained emotion recognition neural network.
The weight unit 1531 is configured to perform attention operation on the emotion recognition sequence, and obtain weights of emotion recognition data at each moment.
The unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory network layer.
The emotion recognition neural network is a fully connected neural network.
The smart device 10 further includes a training module 16, the training module 16 being configured to train the emotion recognition neural network.
Training module 16 includes a preparation sub-module 161, a definition sub-module 162, and an input sub-module 163.
The preparation sub-module 161 is configured to prepare a plurality of training multi-modal feature sequences and label the target emotion of each training multi-modal feature sequence. The definition sub-module 162 is configured to define the structure, loss function, and training termination conditions of the emotion recognition neural network. The input sub-module 163 is configured to input the plurality of training multi-modal feature sequences and their corresponding target emotions into the emotion recognition neural network for training.
As can be seen from the above description, in this embodiment, the semantic feature units at each moment are arranged according to a time sequence to generate a multi-modal semantic feature sequence, so that semantic features, rather than low-level features, are acquired, the emotion features of the multi-modal data set to be identified can be represented more accurately, and the feature alignment and fusion of the multi-modal space-time relationship are retained; the target emotion acquired according to the multi-modal semantic feature sequence therefore has higher accuracy, effectively improving the accuracy of emotion recognition. Further, the video semantic feature data, the audio semantic feature data and the text semantic feature data at the same moment are concatenated into semantic feature units, and the semantic feature unit at each moment is input into the unit recognition neural network to acquire the emotion recognition result at each moment; the unit recognition neural network includes a convolutional neural network layer and a bidirectional long short-term memory neural network layer, which can improve the accuracy of the emotion recognition result.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a second embodiment of the smart device according to the present invention. The smart device 20 includes a processor 21, a memory 22, and an acquisition circuit 23. The processor 21 is coupled to the memory 22 and the acquisition circuit 23. The memory 22 stores a computer program which is executed by the processor 21 in operation to implement the methods shown in fig. 2-4. The detailed method is described above and will not be repeated here.
As can be seen from the above description, in this embodiment, after the intelligent device obtains the multimodal data set to be identified, the video semantic feature sequence of the video data is extracted, the audio semantic feature sequence of the audio data is extracted, and/or the text semantic feature sequence in the text data is extracted. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present invention. The computer readable storage medium 30 stores at least one computer program 31, and the computer program 31 is configured to be executed by a processor to implement the methods shown in fig. 2-4; the detailed method is described above and will not be repeated here. In one embodiment, the computer readable storage medium 30 may be a memory chip in a terminal, a hard disk, or other readable and writable storage means such as a removable hard disk, a flash drive, an optical disk, etc., and may also be a server, etc.
As can be seen from the above description, the computer program stored in the storage medium in this embodiment may be used to extract a video semantic feature sequence of video data, extract an audio semantic feature sequence of audio data, and/or extract a text semantic feature sequence in text data after acquiring the multimodal data set to be identified. The text semantic feature sequence is aligned to the time dimension of the audio data to generate a text semantic time sequence, the video semantic feature sequence, the audio semantic feature sequence and/or the text semantic time sequence are fused according to the time dimension to generate a multi-mode semantic feature sequence, semantic features instead of low-level features are acquired, emotion features of a multi-mode data set to be identified can be more accurately represented, feature alignment and fusion of multi-mode space-time relations are reserved, and accuracy of target emotion acquired according to the multi-mode semantic feature sequence is higher, so that accuracy of emotion identification is effectively improved.
Compared with the prior art, the method has the advantages that semantic features are obtained instead of low-level features, the emotion features of the multi-mode data set to be identified can be more accurately represented, the alignment and fusion of the features of the multi-mode space-time relationship are reserved, and the accuracy of the target emotion obtained according to the multi-mode semantic feature sequence is higher, so that the accuracy of emotion identification is effectively improved.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.