CN115620356A - A Framework and Method for Addressee Detection Based on Audio and Facial Input - Google Patents
- Publication number
- CN115620356A (application CN202211019716.7A)
- Authority
- CN
- China
- Prior art keywords
- audio
- attention
- visual
- cross
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Theoretical Computer Science (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Social Psychology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Databases & Information Systems (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical fields of audio-visual processing and machine learning, and discloses a framework and method for addressee detection based on audio and facial input. The front end comprises an audio stream encoder and a video stream encoder; the back end comprises a cross-attention module, a bilinear fusion module and a self-attention module. The framework takes variable-length audio and facial-region information as input and predicts the addressee in each frame by jointly analyzing audio and facial features. It uses a dataset recorded in mixed human-human and human-robot settings; the framework can therefore be applied and adapted to a robot to determine whether the robot is the addressee. This gives the robot intelligent audio-visual perception and improves its degree of intelligence.
Description
Technical Field
The invention belongs to the technical fields of audio-visual processing and machine learning, and in particular relates to a framework and method for addressee detection based on audio and facial input.
Background Art
A fundamental challenge for humanoid robots is to possess an intelligent audio-visual perception system that supports natural interaction and cooperation with humans. One way to enrich such a system is to let the robot recognize whether it is the addressee, which helps the robot decide whether to respond to a human utterance. The main applications include guide robots, companion assistants, robot butlers, robot lifeguards and mobile care robots. Despite a small amount of prior work, however, this area has not been widely explored with state-of-the-art methods that exploit effective communication cues in realistic environments. Although research on addressee detection (AD), both in China and abroad, has made important progress in recent years, no study has yet combined audio and video (facial) features for AD. Previous work has not benefited much from the available audio and video information over both long and short time spans: most studies focus on segment-level (single-image) information of 0.2 s to 0.6 s, and it is difficult to predict conversational activity from a single image or a 0.2 s video clip. In reality, however, people consider an entire sentence spanning hundreds of video frames when judging whether one person is speaking to another; for example, a 5-second video contains 15 words on average, and a 0.2-second clip cannot even cover a complete word. In addition, existing frameworks use datasets recorded in meeting rooms with fixed participants in human-to-human or human-to-robot settings, which are unsuitable for human-robot interaction. Moreover, existing addressee-detection work widely adopts statistical and rule-based methods that only apply to specific tasks and do not transfer to other situations, such as different actions and communicative expressions or different numbers of participants.
Summary of the Invention
The purpose of the present invention is to provide a framework and method for addressee detection based on audio and facial input, so as to solve the above technical problems.
To solve the above technical problems, the specific technical scheme of the framework and method for addressee detection based on audio and facial input of the present invention is as follows:
A framework for addressee detection based on audio and facial input, comprising a two-stream end-to-end framework ADNet. ADNet takes cropped face regions of variable temporal length and the corresponding audio segments as input, and predicts whether the human is speaking to the robot or to another person. ADNet comprises a front end and a back end; the front end includes an audio stream encoder and a video stream encoder, and the back end includes a cross-attention module, a bilinear fusion module and a self-attention module.
The video stream encoder takes N consecutive face regions as input and learns a long-term representation of facial-region motion.
The audio stream encoder learns audio feature representations from temporal dynamics.
The cross-attention module dynamically associates video and audio content.
The bilinear fusion module fuses the video and audio modalities.
The self-attention module monitors addressee activity from the context at the utterance level.
Further, the video stream encoder comprises two sub-modules, a visual front-end network module and a visual temporal convolution module, which encode the video stream into a sequence of visual embeddings E_v with the same temporal resolution.
Further, the framework includes a fully connected layer, which projects the output of the self-attention network onto the AD label sequence through a softmax operation.
Further, the visual front-end network module adopts a 3D-ResNet: it starts with a spatio-temporal convolution, i.e. a 3D convolutional layer, and then gradually reduces the spatial dimension through an 18-layer residual network (ResNet18), learning the spatial information of each video frame and encoding the stream of video frames into a sequence of frame-based embeddings. The visual temporal convolution module V-TCN represents the temporal content of the long-term visual spatio-temporal stream; V-TCN comprises five residual blocks of rectified linear units (ReLU), batch normalization (BN) and depthwise separable convolution layers (DSConv1D). Finally, a Conv1D layer is added to reduce the feature dimension to 128.
Further, the audio stream encoder adopts a ResNet-34 network containing squeeze-and-excitation (SE) modules. The audio stream encoder uses Mel-frequency cepstral coefficients (MFCC) with 13 Mel-frequency bands per time step. The ResNet-34 network takes the sequence of audio frames as input to generate a sequence of audio embeddings E_a, and the feature-dimension output of the audio stream encoder is set to (1, 128). The ResNet-34 is designed with dilated convolutions so that the temporal resolution of the audio embeddings E_a matches that of the visual embeddings E_v, which facilitates the cross-attention module. MFCC features are extracted with a 25 ms analysis window and a 10 ms stride, yielding 100 audio frames per second.
Further, the core of the cross-attention network is the attention layer. Its inputs are the query (Q_a, Q_v), key (K_a, K_v) and value (V_a, V_v) vectors of the audio and visual embeddings, each projected by a linear layer; its outputs are the audio attention feature, audio cross-attention (ACA), and the visual attention feature, visual cross-attention (VCA):
ACA = softmax(Q_v K_a^T / √d) V_a, VCA = softmax(Q_a K_v^T / √d) V_v,
where d denotes the dimension of Q, K and V. ACA is learned by generating the query from the target sequence of the video stream and the keys and values from the source sequence of the audio stream, producing new interactive audio features, and vice versa; new visual features are generated in the analogous way. Finally, a feed-forward layer, residual connections and layer normalization are added before the two cross-attention outputs are fed into the fusion layer, yielding the final cross-attention network.
Further, the 128-dimensional audio and 128-dimensional visual per-frame attention features generated by the cross-attention of the video and audio streams are combined by bilinear fusion, BLF = f_blp(ACA_ij, VCA_ij), and then summed over positions to connect the features along the temporal direction.
The resulting features capture the multiplicative interactions at the corresponding spatial positions. The BLF is added after the audio-visual cross-attention layer, which fuses the audio and visual attention features to generate the fused feature E_av.
Further, the self-attention module takes the fused feature E_av from the BLF as input to model the temporal information at the audio-visual utterance level.
Further, AD is treated as a frame-level classification task, and the predicted label sequence is compared with the ground-truth label sequence through a cross-entropy loss, L = -(1/N) Σ_{j=1}^{N} [ y_j log p_j + (1 - y_j) log(1 - p_j) ], where p_j and y_j are the predicted and ground-truth AD labels of the j-th video frame, j ∈ [1, N], and N is the number of video frames.
The invention also discloses a deep-learning method using the framework for addressee detection based on audio and facial input, comprising the following steps:
Step 1: Construct a dataset recorded in mixed human-human and human-robot settings, either by extending the existing MuMMER dataset with spatio-temporal annotations of conversational activity, or by using a custom dataset recorded in scenes where human-human and human-robot conversations take place. The dataset is annotated by human annotators who generate addressee labels: each speaking face in the frame is annotated with a bounding-box region (x, y), where x denotes the width and y the height. An activity timeline displays the audio waveform so that the start and end timestamps of speech can be marked. Speech segments with start and end timestamps ([(t_s0, t_e0), (t_s1, t_e1), ..., (t_sn, t_en)]) are selected manually, and the activity of each face is labeled according to the activity within the boundaries of the selected speech segment, where n is the last speech segment in the video. In the dataset, speaking to the robot is labeled 0 (the subject talks to the robot, and the robot is the addressee), and speaking to a subject is labeled 1 (the subject talks to another subject). The dataset is split randomly, with 60% used for training, 20% for validation and 20% for testing.
Step 2: Construct a two-stream deep learning framework ADNet, which uses long-term and short-term audio-visual features to predict the addressee. The framework includes audio and video stream encoders for extracting and embedding audio and video information, an audio-visual cross-attention method for cross-modal interaction, bilinear fusion for combining the two modalities, and a self-attention method for capturing long-term speech activity.
Step 3: Train the end-to-end model with the specified E-MuMMER dataset or a custom dataset.
The framework and method for addressee detection based on audio and facial input of the present invention have the following advantages: the framework takes variable-length audio and facial-region information as input and predicts the addressee in each frame by jointly analyzing audio and facial features. It uses a dataset recorded in mixed human-human and human-robot settings; the framework can therefore be applied and adapted to a robot to determine whether the robot is the addressee. This gives the robot intelligent audio-visual perception and improves its degree of intelligence.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the addressee-detection framework based on audio and facial input of the present invention;
Fig. 2 is a schematic diagram of the video stream encoder of the present invention;
Fig. 3 is a schematic diagram of the audio stream encoder of the present invention;
Fig. 4(a) is a schematic diagram of the cross-attention module of the present invention;
Fig. 4(b) is a schematic diagram of the self-attention module of the present invention;
Fig. 5 is a schematic diagram of the annotation interface of E-MuMMER of the present invention.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, the framework and method for addressee detection based on audio and facial input of the present invention are described in further detail below with reference to the accompanying drawings.
A framework for addressee detection based on audio and facial input takes variable-length audio and facial-region information as input and predicts the addressee in each frame by jointly analyzing audio and facial features. It uses a dataset recorded in mixed human-human and human-robot settings; the framework can therefore be applied and adapted to a robot to determine whether the robot is the addressee. The implementation steps of the framework are as follows.
Step 1: Construct a dataset recorded in mixed human-human and human-robot settings by extending the existing MuMMER dataset with spatio-temporal annotations of conversational activity. A custom dataset may also be used, but it should be recorded in scenes where human-human and human-robot conversations take place. The dataset is annotated by human annotators who generate addressee labels using the interface shown in Fig. 5. Each speaking face in the frame is annotated with a bounding-box region (x, y), where x denotes the width and y the height. The activity timeline below the window displays the audio waveform so that the start and end timestamps of speech can be marked. Speech segments with start and end timestamps ([(t_s0, t_e0), (t_s1, t_e1), ..., (t_sn, t_en)]) are selected manually, and the activity of each face is labeled according to the activity within the boundaries of the selected speech segment, where n is the last speech segment in the video. E-MuMMER contains two speech-activity types: "speaking to the robot" and "speaking to a subject". In the dataset, speaking to the robot is labeled 0 (the subject talks to the robot, and the robot is the addressee), and speaking to a subject is labeled 1 (the subject talks to another subject). The dataset is split randomly, with 60% used for training, 20% for validation and 20% for testing. When there is no speech (no highlighted waveform, see Fig. 5) and the robot is speaking, no annotation is shown.
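As a concrete illustration of this annotation scheme, the sketch below converts segment-level timestamps into per-frame addressee labels; the 25 fps frame rate, the -1 "no speech" value and the function names are illustrative assumptions, not part of the dataset specification.

```python
# Sketch: derive per-frame AD labels from manually selected speech segments.
# Labels: 0 = robot is the addressee, 1 = another subject is the addressee,
# -1 = no speech activity (not annotated).

FPS = 25  # assumed video frame rate

def frame_labels(num_frames, segments):
    """segments: list of (t_start, t_end, label) tuples for one face track, times in seconds."""
    labels = [-1] * num_frames                  # default: no speech activity
    for t_start, t_end, label in segments:
        first = int(round(t_start * FPS))
        last = min(int(round(t_end * FPS)), num_frames - 1)
        for j in range(first, last + 1):
            labels[j] = label                   # mark all frames inside the segment
    return labels

# Example: a 10 s clip where the subject addresses the robot from 1.2-4.8 s
# and another subject from 6.0-9.5 s.
print(frame_labels(250, [(1.2, 4.8, 0), (6.0, 9.5, 1)])[25:35])
```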
Step 2: Construct a two-stream deep learning framework (called ADNet) that uses long-term and short-term audio-visual features to predict the addressee. The framework includes audio and video stream encoders for extracting and embedding audio and video information, an audio-visual cross-attention method for cross-modal interaction, bilinear fusion for combining the two modalities, and a self-attention method for capturing long-term speech activity.
Step 3: Train the end-to-end model with the specified E-MuMMER dataset or a custom dataset.
As shown in Fig. 1, the front-end network consists of a video stream encoder (VSE) and an audio stream encoder (ASE). The back-end network consists of a cross-attention (CA) module on each stream, a bilinear fusion (BLF) module, a self-attention (SA) module and an addressee detector (AD) module. ADNet is a two-stream end-to-end framework that takes cropped face regions of variable temporal length (segments) and the corresponding audio segments as input, and predicts whether the human is speaking to the robot or to another person. The front end of ADNet consists of the audio stream encoder (ASE) and the video stream encoder (VSE); these two streams take frame-based audio and video signals and encode them into audio and video embeddings that represent the temporal context. The back-end network consists of three modules: 1) the cross-attention module, which dynamically associates video and audio content; 2) the bilinear fusion module, which fuses the two modalities; and 3) the self-attention module, which monitors addressee activity from the context at the utterance level.
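The data flow of Fig. 1 can be summarized in the PyTorch-style skeleton below. It is a sketch only: the lazy linear stand-ins for the encoders, the 8-head attention blocks, the element-wise product standing in for the bilinear fusion and the two-class head are illustrative assumptions, not the reference implementation; the individual modules are sketched in more detail after the corresponding paragraphs below.

```python
import torch
import torch.nn as nn

class ADNetSkeleton(nn.Module):
    """Wiring sketch of ADNet: VSE/ASE -> cross-attention -> fusion -> self-attention -> frame-level AD labels."""
    def __init__(self, d=128, num_classes=2):
        super().__init__()
        # Placeholders: the real VSE is a 3D-ResNet + V-TCN, the real ASE an SE-ResNet-34.
        self.vse = nn.LazyLinear(d)
        self.ase = nn.LazyLinear(d)
        self.cross_av = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.cross_va = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.self_att = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.head = nn.Linear(d, num_classes)   # projected to AD labels (softmax applied in the loss)

    def forward(self, face_feats, audio_feats):
        e_v = self.vse(face_feats)                   # (B, N, d) visual embeddings E_v
        e_a = self.ase(audio_feats)                  # (B, N, d) audio embeddings E_a (matched resolution)
        aca, _ = self.cross_av(e_v, e_a, e_a)        # query: video, key/value: audio -> ACA
        vca, _ = self.cross_va(e_a, e_v, e_v)        # query: audio, key/value: video -> VCA
        e_av = aca * vca                             # crude stand-in for the bilinear fusion (BLF)
        h, _ = self.self_att(e_av, e_av, e_av)       # utterance-level self-attention
        return self.head(h)                          # (B, N, 2) frame-level AD logits

logits = ADNetSkeleton()(torch.randn(2, 50, 512), torch.randn(2, 50, 13))
print(logits.shape)  # torch.Size([2, 50, 2])
```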
As shown in Fig. 2, the video stream encoder takes a sequence of N consecutive 112×112 grayscale cropped face regions as input. This encoder is designed to learn a long-term representation of facial-region motion. It comprises two sub-modules, a visual front end and a visual temporal network, whose purpose is to encode the video stream into a sequence of visual embeddings E_v with the same temporal resolution. For the visual front-end network, a 3D-ResNet is adopted: it starts with a spatio-temporal convolution, i.e. a 3D convolutional layer (3DConv), and then gradually reduces the spatial dimension through an 18-layer residual network (ResNet18). The goal is to learn the spatial information of each video frame and to encode the stream of video frames into a sequence of frame-based embeddings. The visual temporal convolution module (V-TCN) represents the temporal content of the long-term visual spatio-temporal stream; it consists of five residual blocks of rectified linear units (ReLU), batch normalization (BN) and depthwise separable convolution layers (DSConv1D). Finally, a Conv1D layer is added to reduce the feature dimension to 128.
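A minimal sketch of the V-TCN stage follows, assuming the per-frame embeddings from the 3D-ResNet front end are already computed; the 512-dimensional input width and the kernel size are assumptions, while the block structure (five residual ReLU-BN-DSConv1D blocks followed by a Conv1D down to 128) follows the description above.

```python
import torch
import torch.nn as nn

class DSConv1d(nn.Module):
    """Depthwise-separable 1-D convolution (DSConv1D)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class VTCN(nn.Module):
    """Five residual ReLU-BN-DSConv1D blocks, then a Conv1D that reduces the feature dimension to 128."""
    def __init__(self, in_channels=512, out_channels=128, num_blocks=5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.ReLU(), nn.BatchNorm1d(in_channels), DSConv1d(in_channels))
            for _ in range(num_blocks)
        )
        self.proj = nn.Conv1d(in_channels, out_channels, 1)

    def forward(self, x):              # x: (B, C, N) frame-based embeddings from the visual front end
        for block in self.blocks:
            x = x + block(x)           # residual connection
        return self.proj(x)            # (B, 128, N) visual embeddings E_v

e_v = VTCN()(torch.randn(2, 512, 50))
print(e_v.shape)  # torch.Size([2, 128, 50])
```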
The audio stream encoder learns audio feature representations from temporal dynamics. As shown in Fig. 3, the audio stream encoder adopts a ResNet-34 network containing squeeze-and-excitation (SE) modules. It uses Mel-frequency cepstral coefficients (MFCC) with 13 Mel-frequency bands per time step. The ResNet-34 network takes the sequence of audio frames as input to generate a sequence of audio embeddings E_a, and the feature-dimension output of the encoder is set to (1, 128). The ResNet-34 is designed with dilated convolutions so that the temporal resolution of the audio embeddings E_a matches that of the visual embeddings E_v, which facilitates the attention modules below. MFCC features are extracted with a 25 ms analysis window and a 10 ms stride, yielding 100 audio frames per second.
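One possible way to compute the stated audio front end (13 MFCCs, 25 ms window, 10 ms stride, 100 frames per second) is sketched below with the python_speech_features package; the 16 kHz sampling rate is an assumption, and any equivalent MFCC implementation could be substituted.

```python
import numpy as np
from python_speech_features import mfcc  # pip install python_speech_features

SAMPLE_RATE = 16000                        # assumed sampling rate
signal = np.random.randn(SAMPLE_RATE * 3)  # stand-in for a 3 s waveform

# 25 ms analysis window, 10 ms stride, 13 cepstral coefficients
feats = mfcc(signal, samplerate=SAMPLE_RATE,
             winlen=0.025, winstep=0.01, numcep=13)
print(feats.shape)  # roughly (298, 13): about 100 audio frames per second

# If the video runs at, e.g., 25 fps (an assumption), four MFCC frames
# correspond to one video frame; the dilated ResNet-34 is what brings the
# temporal resolution of E_a down to match E_v.
```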
As shown in Fig. 4(a), audio-visual cross-attention is added after the audio-visual feature embedding to handle dynamic audio-visual interaction along the temporal dimension. The embedded features E_v and E_a are intended to distinguish events relevant to audio and visual speech activity. The core of the cross-attention network is the attention layer. Its inputs are the query (Q_a, Q_v), key (K_a, K_v) and value (V_a, V_v) vectors of the audio and visual embeddings, each projected by a linear layer. As shown in Equations 1 and 2, the outputs are the audio attention feature (audio cross-attention, ACA) and the visual attention feature (visual cross-attention, VCA):
ACA = softmax(Q_v K_a^T / √d) V_a    (1)
VCA = softmax(Q_a K_v^T / √d) V_v    (2)
where d denotes the dimension of Q, K and V. As Equations 1 and 2 show, ACA is learned by generating the query from the target sequence of the video stream and the keys and values from the source sequence of the audio stream, producing new interactive audio features, and vice versa; new visual features are generated in an analogous way. Finally, a feed-forward layer, residual connections and layer normalization are added before the two cross-attention outputs are fed into the fusion layer, yielding the final cross-attention network.
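A minimal PyTorch sketch of this cross-attention block follows. The single-head formulation and the feed-forward width are assumptions; the query/key/value assignment (video queries attend to audio keys and values for ACA, and vice versa for VCA) and the residual, feed-forward and layer-normalization steps follow the description above.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """One direction of the audio-visual cross-attention (e.g. ACA: Q from video, K/V from audio)."""
    def __init__(self, d=128, ff_dim=512):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d))

    def forward(self, target, source):
        # target: (B, N, d) stream providing the query; source: (B, N, d) stream providing keys/values
        q, k, v = self.q_proj(target), self.k_proj(source), self.v_proj(source)
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        x = self.norm1(target + att @ v)       # residual connection + layer normalization
        return self.norm2(x + self.ff(x))      # feed-forward + residual + normalization

e_v, e_a = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
aca = CrossAttention()(e_v, e_a)   # audio cross-attention: video queries, audio keys/values
vca = CrossAttention()(e_a, e_v)   # visual cross-attention: audio queries, video keys/values
print(aca.shape, vca.shape)        # torch.Size([2, 50, 128]) twice
```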
The 128-dimensional audio and 128-dimensional visual per-frame attention features generated by the cross-attention of the video and audio streams are combined by bilinear fusion. Here, the bilinear fusion BLF = f_blp(ACA_ij, VCA_ij) computes the outer matrix product of the two cross-modal attention features at each pixel position and then sums over positions, connecting the features along the temporal direction.
The resulting features capture the multiplicative interactions at the corresponding spatial positions. The BLF is added after the audio-visual cross-attention layer, which fuses the audio and visual attention features to generate the fused feature (E_av).
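The bilinear fusion can be sketched as a per-frame outer product of the two 128-dimensional attention features followed by pooling; the linear projection back to a 128-dimensional fused feature E_av is an assumption added so that the output matches the dimension expected by the self-attention module.

```python
import torch
import torch.nn as nn

class BilinearFusion(nn.Module):
    """Per-frame outer product of ACA and VCA features, pooled and projected to a fused embedding E_av."""
    def __init__(self, d=128):
        super().__init__()
        self.proj = nn.Linear(d * d, d)   # assumed projection back to d dimensions

    def forward(self, aca, vca):
        # aca, vca: (B, N, d) per-frame cross-attention features
        outer = torch.einsum('bni,bnj->bnij', aca, vca)   # (B, N, d, d) multiplicative interactions
        outer = outer.flatten(start_dim=2)                # (B, N, d*d)
        return self.proj(outer)                           # (B, N, d) fused features E_av

aca, vca = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
e_av = BilinearFusion()(aca, vca)
print(e_av.shape)  # torch.Size([2, 50, 128])
```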
Referring to Fig. 4(b), the self-attention module takes the fused feature (E_av) from the BLF as input to model the temporal information at the audio-visual utterance level. Apart from deriving the query (Q_av), key (K_av) and value (V_av) from the joint audio-visual feature, the self-attention architecture is identical to the cross-attention network, as shown in Fig. 4(b). This module is intended to distinguish frames of conversation with the robot from frames of conversation with other subjects.
Finally, a fully connected layer is added, and the output of the self-attention network is projected onto the AD label sequence through a softmax operation. AD is treated as a frame-level classification task, and the predicted label sequence is compared with the ground-truth label sequence through a cross-entropy loss. The loss function is shown in Equation 5:
L = -(1/N) Σ_{j=1}^{N} [ y_j log p_j + (1 - y_j) log(1 - p_j) ]    (5)
where p_j and y_j are the predicted and ground-truth AD labels of the j-th video frame, j ∈ [1, N], and N is the number of video frames.
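The classification head and frame-level loss can be sketched as follows; the optimizer choice and learning rate are assumptions, while the fully connected layer, the projection onto the two AD labels and the cross-entropy comparison against the ground-truth sequence follow the description. Note that PyTorch's CrossEntropyLoss applies the softmax internally, so the explicit softmax of the description is folded into the loss.

```python
import torch
import torch.nn as nn

d, num_classes = 128, 2                      # label 0: robot is addressee, 1: another subject
head = nn.Linear(d, num_classes)             # fully connected layer over self-attention outputs
criterion = nn.CrossEntropyLoss()            # log-softmax + negative log-likelihood per frame
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)  # assumed optimizer and learning rate

h = torch.randn(2, 50, d)                         # (B, N, d) output of the self-attention module
labels = torch.randint(0, num_classes, (2, 50))   # (B, N) ground-truth AD labels per frame

logits = head(h)                                       # (B, N, 2) frame-level scores
loss = criterion(logits.reshape(-1, num_classes),      # flatten to (B*N, 2)
                 labels.reshape(-1))                   # and (B*N,) for frame-level cross-entropy
loss.backward()
optimizer.step()
print(float(loss))
```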
It can be understood that the present invention has been described by way of some embodiments. Those skilled in the art will appreciate that various changes or equivalent substitutions can be made to these features and embodiments without departing from the spirit and scope of the present invention. In addition, under the teaching of the present invention, these features and embodiments may be modified to adapt to a particular situation and material without departing from the spirit and scope of the present invention. Therefore, the present invention is not limited by the specific embodiments disclosed herein; all embodiments falling within the scope of the claims of the present application fall within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211019716.7A CN115620356A (en) | 2022-08-24 | 2022-08-24 | A Framework and Method for Addressee Detection Based on Audio and Facial Input |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211019716.7A CN115620356A (en) | 2022-08-24 | 2022-08-24 | A Framework and Method for Addressee Detection Based on Audio and Facial Input |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115620356A true CN115620356A (en) | 2023-01-17 |
Family
ID=84856323
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211019716.7A Pending CN115620356A (en) | 2022-08-24 | 2022-08-24 | A Framework and Method for Addressee Detection Based on Audio and Facial Input |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115620356A (en) |
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180005483A1 (en) * | 2016-06-29 | 2018-01-04 | Synergy Blue, Llc | Dynamic placement of in-game ads, in-game product placement, and in-game promotions in wager-based game environments |
| US20190318725A1 (en) * | 2018-04-13 | 2019-10-17 | Mitsubishi Electric Research Laboratories, Inc. | Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers |
| US20200135209A1 (en) * | 2018-10-26 | 2020-04-30 | Apple Inc. | Low-latency multi-speaker speech recognition |
| CN112580416A (en) * | 2019-09-27 | 2021-03-30 | 英特尔公司 | Video tracking based on deep Siam network and Bayesian optimization |
| US20210097995A1 (en) * | 2019-09-27 | 2021-04-01 | Tata Consultancy Services Limited | Attention shifting of a robot in a group conversation using audio-visual perception based speaker localization |
| CN114495973A (en) * | 2022-01-25 | 2022-05-13 | 中山大学 | Special person voice separation method based on double-path self-attention mechanism |
| CN114445462A (en) * | 2022-01-26 | 2022-05-06 | 安徽大学 | Cross-modal visual tracking method and device based on adaptive convolution |
Non-Patent Citations (2)
| Title |
|---|
| ROTH, J. et al.: "AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION", 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2 March 2021 (2021-03-02), pages 4492 - 4496 * |
| TANG, YIMING; LIU, YUFEI; HUANG, HONG: "A Survey of Visual Single-Object Tracking Algorithms", Measurement & Control Technology, no. 08, 18 August 2020 (2020-08-18) * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118136232A (en) * | 2024-01-18 | 2024-06-04 | 山东大学 | Early detection method and system of Parkinson's disease based on multimodal deep learning |
| CN119049468A (en) * | 2024-07-29 | 2024-11-29 | 淮阴工学院 | Digital person construction method based on voice driving |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Tao et al. | End-to-end audiovisual speech recognition system with multitask learning | |
| CN114974215A (en) | Audio and video dual-mode-based voice recognition method and system | |
| WO2024032159A1 (en) | Speaking object detection in multi-human-machine interaction scenario | |
| Beyan et al. | RealVAD: A real-world dataset and a method for voice activity detection by body motion analysis | |
| CN119293740B (en) | A multimodal conversation emotion recognition method | |
| CN115620356A (en) | A Framework and Method for Addressee Detection Based on Audio and Facial Input | |
| CN119295994B (en) | A multimodal sentiment analysis method based on cross-modal attention | |
| Vayadande et al. | Lipreadnet: A deep learning approach to lip reading | |
| CN117909885A (en) | A method and system for audio-visual multimodal emotion recognition based on cross-modal attention mechanism | |
| Kumar et al. | Towards robust speech recognition model using deep learning | |
| CN117112778A (en) | A knowledge-based method for generating multi-modal conference abstracts | |
| CN114743129B (en) | Method and system for predicting emotion of old people in real time based on posture recognition | |
| Tesema et al. | Addressee detection using facial and audio features in mixed human–human and human–robot settings: A deep learning framework | |
| Korzun et al. | ReCell: replicating recurrent cell for auto-regressive pose generation | |
| JP7426917B2 (en) | Program, device and method for interacting with a user according to multimodal information around the user | |
| CN114283493A (en) | Artificial intelligence-based identification system | |
| CN120279176A (en) | Interactive multi-round dialogue digital person modeling system and method | |
| Robi et al. | Active speaker detection using audio, visual and depth modalities: A survey | |
| CN116597353B (en) | Video sentiment analysis method based on multi-scale feature extraction and multi-task learning | |
| Fatan et al. | 3M-Transformer: a multi-stage multi-stream multimodal transformer for embodied turn-taking prediction | |
| CN118823645A (en) | A multimodal emotion recognition method and system for scenarios where text modality is missing | |
| CN113571060B (en) | Multi-person dialogue ordering method and system based on audio-visual sense fusion | |
| CN118260679A (en) | Emotion recognition method, system, terminal and medium based on multimodal feature fusion | |
| Khekare et al. | A Deep Dive into Existing Lip Reading Technologies | |
| Al-Hames et al. | Automatic multi-modal meeting camera selection for video-conferences and meeting browsers |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |