
CN117711409A - Multifunctional voice interaction device based on deep learning - Google Patents

Multifunctional voice interaction device based on deep learning

Info

Publication number
CN117711409A
Authority
CN
China
Prior art keywords
speech
voice
recognition
emotion
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311721463.2A
Other languages
Chinese (zh)
Inventor
庞中华
商鹏飞
高胜男
翟维枫
于铧仁
郭海彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology
Priority to CN202311721463.2A
Publication of CN117711409A
Legal status: Pending

Classifications

    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/063 Training
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multifunctional voice interaction device based on deep learning, belonging to the field of voice interaction. The device includes: a human-computer interaction module, which provides a human-computer interaction interface for the user, receives the voice and instructions input by the user, displays the recognized speech text and emotion category, and plays multi-style speech; a speech recognition module, which performs text recognition and emotion recognition on the voice input by the user based on deep learning models, to obtain the speech text and emotion category; and a speech synthesis module, which generates multi-style speech from the voice input by the user based on a speech synthesis model. The invention improves the flexibility of voice interaction and adds emotion recognition to speech recognition, so that text and emotion can be recognized simultaneously from the voice, making voice interaction closer to real communication scenarios.

Description

A multifunctional voice interaction device based on deep learning

Technical Field

The present invention relates to the field of voice interaction, and in particular to a multifunctional voice interaction device based on deep learning.

Background Art

Parks are an important part of the urban green space system, and their importance in cities has become increasingly prominent. With the rapid development of new-generation information technologies such as 5G, the Internet of Things and big data, the management and service needs of parks have diversified. Traditional parks find it increasingly difficult to meet these new needs and have exposed many problems, so the "smart" park has become an inevitable development trend. Driven by new-generation information and communication technology, parks are being digitized in areas such as park service management and intelligent interaction with citizens. For a city park that serves its citizens, a people-oriented approach that improves visitors' enjoyment and sense of participation is the foundation and top priority of smart park development, and intelligent interactive products that visitors can use directly fit this people-oriented core concept of the smart park.

Regarding intelligent interaction scenarios for citizens in smart parks, research shows that current interactive products have many shortcomings, such as insufficient intelligence, insufficient main-control computing power and poor user interaction. It is therefore necessary to provide an intelligent interactive device equipped with cutting-edge technology for citizens' entertainment and science popularization.

Summary of the Invention

The purpose of the present invention is to provide a multifunctional voice interaction device based on deep learning, which can improve the flexibility of voice interaction and make voice interaction closer to real communication scenarios.

To achieve the above purpose, the present invention provides a multifunctional voice interaction device based on deep learning, comprising:

a human-computer interaction module, configured to provide a human-computer interaction interface for the user, receive the voice and instructions input by the user, display the speech text and emotion category, and play multi-style speech;

a speech recognition module, connected to the human-computer interaction module and configured to perform text recognition and emotion recognition on the speech input by the user based on deep learning models, to obtain the speech text and emotion category;

a speech synthesis module, connected to the human-computer interaction module and configured to generate multi-style speech from the speech input by the user based on a speech synthesis model.

Optionally, the human-computer interaction module, the speech recognition module and the speech synthesis module are all built on the Jetson Nano SOC main control chip.

Optionally, the human-computer interaction module includes an HMI configuration screen, a microphone and a speaker.

Optionally, the speech recognition module includes:

a speech-to-text recognition sub-module, configured to perform text recognition on the speech input by the user based on a text recognition model; the text recognition model is obtained by training the Efficient Conformer model in advance on a personal computer with a speech-text dataset; the speech-text dataset includes multiple speech samples and the text corresponding to each speech sample;

a speech emotion recognition sub-module, configured to perform emotion recognition on the speech input by the user based on an emotion recognition model, so as to determine the user's emotion category; the emotion recognition model is obtained by training a voiceprint recognition model in advance on a personal computer with a speech emotion dataset; the speech emotion dataset includes multiple speech samples and the emotion category corresponding to each speech sample; the voiceprint recognition model includes a first fully connected layer, a reshaping layer, a bidirectional LSTM layer, a Tanh activation layer, a dropout layer, a second fully connected layer, a ReLU activation layer and a third fully connected layer connected in sequence.

Optionally, the text recognition model includes a CNN front end, multiple LAC encoders and multiple LAC decoders connected in sequence;

the CNN front end includes a first convolutional neural network, a second convolutional neural network and an embedding neural network connected in sequence;

each LAC encoder includes a first low-rank feedforward neural network, a first multi-head attention layer, a third convolutional neural network and a second low-rank feedforward neural network connected in sequence;

each LAC decoder includes a masked multi-head attention layer, a first normalization layer, a second multi-head attention layer, a second normalization layer, a third low-rank feedforward neural network and a third normalization layer connected in sequence.

Optionally, the speech-text dataset is obtained by preprocessing the Chinese female voice database AISHELL-1 with linear energy maps and applying noise augmentation, speech-rate augmentation, volume augmentation, offset augmentation and spectrum augmentation with the ffmpeg tool.

Optionally, the speech synthesis model is obtained by training the fastspeech2 model in advance with a multi-speaker dataset and then converting it into a static graph model using the Paddle tools.

According to the specific embodiments provided herein, the present invention discloses the following technical effects: the multifunctional voice interaction device based on deep learning provided by the present invention integrates human-computer interaction, speech recognition and speech synthesis functions, which improves the flexibility of voice interaction; moreover, by adding emotion recognition to speech recognition, text and emotion can be recognized simultaneously from the voice, making voice interaction closer to real communication scenarios.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Figure 1 is a schematic diagram of the functional modules of the multifunctional voice interaction device based on deep learning provided by the present invention;

Figure 2 is a schematic diagram of the hardware composition of the multifunctional voice interaction device based on deep learning provided by the present invention;

Figure 3 is a schematic diagram of the HMI host computer configuration interface;

Figure 4 is a schematic diagram of the text recognition model;

Figure 5 is a flow chart of the training of the Efficient Conformer model;

Figure 6 is a flow chart of linear energy map processing;

Figure 7 shows the training results of the Efficient Conformer model;

Figure 8 is a schematic diagram of the voiceprint recognition model;

Figure 9 is a schematic diagram of the training-set loss of the voiceprint recognition model;

Figure 10 shows the change in accuracy, the general performance indicator of the classification function, on the training set of the voiceprint recognition model;

Figure 11 shows the change in accuracy on the test set of the voiceprint recognition model;

Figure 12 shows the confusion matrix of the voiceprint recognition model;

Figure 13 is a flow chart of the PaddleSpeech project implementation;

Figure 14 is the Mel spectrogram of a real voice from the AISHELL-3 dataset;

Figure 15 is the Mel spectrogram of a voice synthesized by fastspeech2.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

The purpose of the present invention is to provide a multifunctional voice interaction device based on deep learning, which adds deep learning technology on top of traditional voice interaction and provides functions such as speech-to-text recognition, emotion recognition and multi-style speech synthesis.

In order to make the above objects, features and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Figure 1, the multifunctional voice interaction device based on deep learning provided by the present invention includes: a human-computer interaction module, a speech recognition module and a speech synthesis module.

Specifically, the human-computer interaction module, the speech recognition module and the speech synthesis module are all built on the Jetson Nano SOC main control chip. As shown in Figure 2, the software of the present invention runs on the Jetson Nano SOC main control chip with the Ubuntu 18.04 system installed. The hardware mainly includes the human-computer interaction interface, the Jetson Nano SOC main control chip, a microphone array, external speakers and other sensors. The human-computer interaction interface is connected to the Jetson Nano SOC main control chip via UART communication, while the microphone array, external speakers and other sensors are connected to the Jetson Nano SOC main control chip via USB (Universal Serial Bus).

The human-computer interaction module is used to provide the user with a human-computer interaction interface, receive the voice and instructions input by the user, display the speech text and emotion category, and play multi-style speech.

Specifically, the human-computer interaction module includes an HMI configuration screen, a microphone and a speaker. The interaction module is the bridge that connects the various functions. The present invention uses serial communication together with the HMI configuration screen to realize human-computer interaction between the user and the device. The human-computer interaction module is divided into two parts: HMI interface design and UART communication. Serial-port interaction is the basis for realizing voice interaction. The present invention uses a serial-port HMI as the host computer, designs and builds the configuration interface on it, and interacts with the Jetson Nano via UART, finally achieving intelligent voice interaction. Figure 3 shows the HMI host computer configuration interface.

(1) HMI interface design. The HMI interface covers the placement of the function buttons, the layout of the displayed information, and so on. It includes a main interface, an information display interface and a virtual drawing interface. Each interface contains the display of basic information and the various function buttons. The main interface is mainly used to switch between the functional interfaces. The information display interface shows the current temperature and humidity, the current time, prompt messages and function buttons. The current time is obtained by an RTC (Real-Time Clock) acquisition function; the HMI automatically obtains the current time and displays it on the screen. After a function button is clicked, the interface switches accordingly.

(2) Communication is the basis of the human-computer interaction module and a necessary function for multiple devices to interact. UART communication is simple to implement: only two wires are needed to transmit data, and the transmission is stable. Writing a communication protocol is the basis of data transmission between two devices. The communication function is implemented on two sides: the HMI side and the Jetson Nano side.

Specifically, a function variable funcTX is first defined on the HMI side and associated with a button, so that pressing the button triggers funcTX. A custom prints function sends the data to the Jetson Nano over the serial port. After the HMI has sent the data, the Jetson Nano needs to receive and process it. On the Jetson Nano, the serial library is used to receive and parse the serial data, and the subprocess library is used to run the corresponding command. First, the serial library is imported, the Jetson Nano serial port is opened, and parameters such as baud rate and stop bits are set so that the port waits to receive data; then a function class is created, and the operation corresponding to each protocol is defined in the class. For example, a MUSIC target-detection class MUSIC_010000 is created, whose name consists of the function and the communication protocol.

Shell script commands are written according to the functional requirements, and the subprocess library is used to control the Linux system. For example, after receiving the command to open the music player, the MUSIC.sh file is run; this script switches the system to the MUSIC directory and runs the corresponding program. In addition, a dictionary is created and each operation is added to it; after the serial port receives data, the matching instruction is looked up in the dictionary and the corresponding function is launched.
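The following is a minimal sketch of this receive-and-dispatch loop on the Jetson Nano side. It is illustrative only: the port name, baud rate, protocol strings and script names are assumptions, not the device's actual configuration.

# Hypothetical sketch of the UART receive-and-dispatch loop on the Jetson Nano.
import serial
import subprocess

# Open the serial port wired to the HMI screen (port name and baud rate are assumptions).
port = serial.Serial("/dev/ttyTHS1", baudrate=9600, timeout=1)

# Map protocol strings received from the HMI to shell scripts that start the matching function.
dispatch = {
    "MUSIC_010000": "bash MUSIC.sh",   # open the music player
    "ASR_020000": "bash ASR.sh",       # start speech recognition
    "TTS_030000": "bash TTS.sh",       # start speech synthesis
}

while True:
    frame = port.readline().decode("utf-8", errors="ignore").strip()
    if not frame:
        continue
    command = dispatch.get(frame)
    if command is None:
        print("unknown protocol:", frame)
        continue
    # Run the shell script for this protocol; it switches to the function's
    # directory and launches the corresponding program.
    subprocess.run(command, shell=True, check=False)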

The speech recognition module is connected to the human-computer interaction module and is used to perform text recognition and emotion recognition on the speech input by the user based on deep learning models, to obtain the speech text and emotion category.

In the present invention, the speech recognition module includes a speech-to-text recognition sub-module and a speech emotion recognition sub-module. Speech-to-text recognition is the basis of voice interaction: only when the text is recognized accurately can there be a good voice interaction experience. Traditional speech recognition only provides text recognition and cannot deliver a personalized or realistic voice interaction experience. The speech recognition module provided by the present invention adds speech emotion recognition on top of traditional speech-to-text recognition, so it can recognize not only the spoken text but also the speaker's emotion. Both functions use Baidu's open-source Paddle framework: speech-to-text recognition uses the Efficient Conformer model, and speech emotion recognition uses the SER-TDNN model (the voiceprint recognition model) developed on the basis of ECAPA-TDNN.

(1) The speech-to-text recognition sub-module performs text recognition on the speech input by the user based on the text recognition model.

The text recognition model is obtained by training the Efficient Conformer model in advance on a personal computer with a speech-text dataset. The speech-text dataset includes multiple speech samples and the text corresponding to each speech sample. The speech-text dataset is obtained by preprocessing the Chinese female voice database AISHELL-1 with linear energy maps and applying noise, speech-rate, volume, offset and spectrum augmentation with the ffmpeg tool.

Specifically, as shown in Figure 4, the text recognition model includes a CNN front end, E LAC encoders and D LAC decoders connected in sequence.

The CNN front end includes a first convolutional neural network, a second convolutional neural network and an embedding neural network connected in sequence. The CNN front end extracts features from the input audio, including the number of speech frames, frequency and other features, and, after absolute positional encoding, converts the input acoustic features into a feature matrix of size T × de, where T is the sequence length after downsampling and de is the dimension of the embedding layer.

Each LAC encoder includes a first low-rank feedforward neural network, a first multi-head attention layer, a third convolutional neural network and a second low-rank feedforward neural network connected in sequence.

The LAC encoder introduces a linear attention mechanism on top of the Conformer model, so that the Efficient Conformer can still maintain good recognition accuracy while the model complexity is reduced.

The residual processing of the input data by the i-th LAC encoder can be summarized as:

X̃_i = X_i + LFFN(X_i)

X′_i = X̃_i + MHLSA(X̃_i)

X″_i = X′_i + Conv(X′_i)

Y_i = Layernorm(X″_i + LFFN(X″_i))

where X_i is the input data of the i-th LAC encoder, LFFN(·) is the low-rank feedforward neural network, X̃_i is the output of the first low-rank feedforward neural network, MHLSA(·) is the multi-head attention mechanism, X′_i is the output of the first multi-head attention layer, Conv(·) is the convolution, X″_i is the output of the third convolutional neural network, Layernorm(·) is layer normalization, and Y_i is the output of the i-th LAC encoder.
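As a rough illustration of this structure, the following Paddle sketch implements one LAC encoder block. It is a simplified, assumption-based example: the low-rank feedforward network is approximated by two linear layers with a small bottleneck, standard multi-head attention stands in for the linear-attention variant, the convolution module is reduced to a single 1-D convolution, and all dimensions are illustrative.

# Hypothetical sketch of one LAC encoder block following the residual structure above.
import paddle
import paddle.nn as nn


class LowRankFFN(nn.Layer):
    def __init__(self, dim, rank):
        super().__init__()
        self.down = nn.Linear(dim, rank)   # low-rank bottleneck
        self.up = nn.Linear(rank, dim)
        self.act = nn.Swish()

    def forward(self, x):
        return self.up(self.act(self.down(x)))


class LACEncoderBlock(nn.Layer):
    def __init__(self, dim=256, heads=4, rank=64, kernel=15):
        super().__init__()
        self.ffn1 = LowRankFFN(dim, rank)
        self.attn = nn.MultiHeadAttention(dim, heads)
        self.conv = nn.Conv1D(dim, dim, kernel, padding=kernel // 2)
        self.ffn2 = LowRankFFN(dim, rank)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: [batch, T, dim]
        x = x + self.ffn1(x)                   # first low-rank feedforward, residual
        x = x + self.attn(x, x, x)             # multi-head attention, residual
        x = x + self.conv(x.transpose([0, 2, 1])).transpose([0, 2, 1])  # conv module, residual
        return self.norm(x + self.ffn2(x))     # second feedforward and layer normalization

Stacking E such blocks after the CNN front end gives the encoder side of the model in Figure 4.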

Each LAC decoder includes a masked multi-head attention layer, a first normalization layer, a second multi-head attention layer, a second normalization layer, a third low-rank feedforward neural network and a third normalization layer connected in sequence. The third low-rank feedforward neural network in the LAC decoder is the same as the first and second low-rank feedforward neural networks in the LAC encoder. Through the D LAC decoders, the speech representation learned by the model is mapped to the final text result, and the features extracted from the speech signal are converted into the corresponding text output.

The present invention uses the PPASR project and ports the Efficient Conformer model; model training and optimization are first performed on a PC (personal computer). The training process is shown in Figure 5. The PaddlePaddle deep learning framework is used; the main Python libraries installed include Paddleaudio-1.0.2, scikit-learn-1.2.2, etc., and CUDA (Compute Unified Device Architecture) is installed to enable GPU (Graphics Processing Unit) acceleration.

The speech-text dataset uses the open-source Chinese female voice database AISHELL-1 and speech-enhancement noise data. First, the data are classified, aggregated and renamed according to predetermined requirements, and linear energy maps are used for data preprocessing. The linear energy map is an audio preprocessing method provided by PaddleSpeech: it applies a real-valued fast Fourier transform to the audio to obtain the spectrogram of the real signal, and after scaling and other operations yields a data list that the model can process directly. The processing flow is shown in Figure 6: the audio is pre-emphasized and windowed, a real fast Fourier transform is applied, the absolute value is squared, the result is compressed, the fast Fourier frequencies are computed and the logarithm is taken. After the data have been preprocessed with linear energy maps, matrices of different lengths are obtained, so the lengths still need to be unified. The procedure is as follows: after reading a batch of data, the data are sorted to find the longest audio and label, a zero tensor is created according to this max length, and all the original data are copied into this tensor, giving uniform-length data padded with zeros at the end. In order to improve the accuracy of the text recognition model with limited data, the present invention introduces the ffmpeg tool into the original project and adds five speech data augmentation methods: noise augmentation, speech-rate augmentation, volume augmentation, offset augmentation and spectrum augmentation. The corresponding data augmentation method can be enabled simply by modifying the corresponding parameters in the configuration file.
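The following is a minimal sketch of the linear-energy-map preprocessing and zero-padding described above; the frame length, hop size and pre-emphasis coefficient are illustrative assumptions rather than the exact PaddleSpeech parameters.

# Hypothetical sketch of linear-energy-map preprocessing and batch padding.
import numpy as np


def linear_energy_map(wave, frame_len=400, hop=160, pre_emph=0.97):
    # Pre-emphasis: boost high frequencies before framing.
    wave = np.append(wave[0], wave[1:] - pre_emph * wave[:-1])
    window = np.hanning(frame_len)
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)          # real FFT per frame
    energy = np.abs(spectrum) ** 2                  # squared magnitude
    return np.log(energy + 1e-10)                   # log compression


def pad_batch(features):
    # Pad a batch of [T_i, F] feature matrices with zeros to the longest T_i.
    max_len = max(f.shape[0] for f in features)
    batch = np.zeros((len(features), max_len, features[0].shape[1]), dtype=np.float32)
    for i, f in enumerate(features):
        batch[i, :f.shape[0]] = f
    return batch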

The text recognition model was trained on an Intel i9-12900K processor with an NVIDIA GeForce RTX 3090 graphics card with 24 GB of memory. Training used batch_size = 128 and num_workers = 24, ran for 186 epochs, and took about 75 hours. The training results are shown in Figure 7. The text recognition model uses a joint CTC and LSM loss function, with a weight of 0.5 each. To prevent overfitting, the text recognition model uses a warm-up learning strategy with a warm-up learning rate of 0.00005, 25,000 warm-up steps and an initial learning rate of 0.001. The character error rate (CER) is the general performance metric of speech recognition: CER = (insertions + deletions + substitutions) / total number of characters, where insertions, deletions and substitutions are the edits needed to turn the predicted text into the correct text. Over 186 epochs the best character error rate is 0.052, i.e. a recognition accuracy of about 95%, which is sufficient for the device.
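The CER can be computed as the edit distance between the predicted and reference character sequences divided by the reference length; a minimal sketch follows.

# Minimal sketch of the CER metric described above.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = minimum edits to turn h[:j] into r[:i]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)


print(cer("今天天气很好", "今天天汽很好"))  # one substitution in six characters, about 0.167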

The text recognition model was trained on a Windows system with an x86 architecture, whereas the main controller used in the present invention is based on the ARM64 architecture, so deep learning frameworks such as PaddlePaddle built for the Windows/x86 architecture first need to be replaced with ARM builds. In addition, speech-processing deep learning models turn the data into large numbers of time-based discrete matrices after preprocessing, so they place extremely high demands on the computing power of the device.

Specifically, the paddlepaddle framework is replaced with Paddle Inference for ARM devices, the scikit-learn source code is downloaded from GitHub and compiled for ARM with the make tool, and the remaining environment is configured. Running the text recognition model directly then reports an out-of-memory error, because the main controller of the device has limited computing power and cannot perform large-scale matrix operations. After the beam search algorithm used in the original project is replaced with a greedy search algorithm that uses less memory, the text recognition model runs successfully and takes about 3 s to recognize roughly 30 characters of speech, which meets the requirements.
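A minimal sketch of such a greedy (best-path) CTC decode is given below; the vocabulary and the blank index are assumptions.

# Hypothetical sketch of greedy CTC decoding: take the most probable token at
# every frame, then collapse repeats and remove the blank symbol.
import numpy as np


def ctc_greedy_decode(log_probs: np.ndarray, vocab: list, blank: int = 0) -> str:
    # log_probs: [T, V] per-frame log-probabilities from the acoustic model.
    best_path = np.argmax(log_probs, axis=1)
    tokens, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:   # collapse repeats, drop blanks
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)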

(2) The speech emotion recognition sub-module performs emotion recognition on the speech input by the user based on the emotion recognition model, so as to determine the user's emotion category. Specifically, the emotion categories include: angry, fearful, happy, nervous, sad and surprised.

The emotion recognition model is obtained by training the voiceprint recognition model in advance on a personal computer with a speech emotion dataset. The speech emotion dataset includes multiple speech samples and the emotion category corresponding to each speech sample.

The present invention improves on ECAPA-TDNN to obtain the voiceprint recognition model, and implements the basic speech emotion recognition function based on PaddlePaddle.

As shown in Figure 8, the voiceprint recognition model includes a first fully connected layer, a reshaping layer, a bidirectional LSTM layer, a Tanh activation layer, a dropout layer, a second fully connected layer, a ReLU activation layer and a third fully connected layer connected in sequence.

The voiceprint recognition model mainly performs feature extraction and classification on the input data. After the data are input, they pass through the first fully connected layer for a linear transformation; the transformed data are reshaped into a three-dimensional shape by the reshaping layer to fit the input requirements of the bidirectional LSTM layer; the reshaped data are fed into the bidirectional LSTM layer for temporal processing, and the bidirectional LSTM improves the fitting ability of the voiceprint recognition model; the output of the bidirectional LSTM layer is activated by the Tanh activation layer; after the Tanh activation layer, the output is randomly dropped by the dropout layer; the output of the dropout layer enters the second fully connected layer for a linear transformation; the transformed data are processed nonlinearly by the ReLU activation layer; after the ReLU activation layer, the output passes through the third fully connected layer for the final linear transformation; finally, the voiceprint recognition model outputs a prediction score for each emotion category.
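The following Paddle sketch illustrates this classification head (Linear, reshape, BiLSTM, Tanh, Dropout, Linear, ReLU, Linear). The feature dimensions, the number of time steps and the dropout rate are illustrative assumptions, not the trained model's actual hyperparameters.

# Hypothetical sketch of the SER-TDNN classification head described above.
import paddle
import paddle.nn as nn


class EmotionHead(nn.Layer):
    def __init__(self, in_dim=192, steps=8, hidden=128, num_emotions=6):
        super().__init__()
        self.steps = steps
        self.fc1 = nn.Linear(in_dim, steps * hidden)               # first fully connected layer
        self.lstm = nn.LSTM(hidden, hidden, direction="bidirect")  # bidirectional LSTM layer
        self.tanh = nn.Tanh()
        self.drop = nn.Dropout(0.3)
        self.fc2 = nn.Linear(2 * hidden, hidden)                   # second fully connected layer
        self.relu = nn.ReLU()
        self.fc3 = nn.Linear(hidden, num_emotions)                 # third fully connected layer

    def forward(self, x):                            # x: [batch, in_dim] utterance embedding
        h = self.fc1(x)
        h = h.reshape([x.shape[0], self.steps, -1])  # reshape to a 3-D sequence
        h, _ = self.lstm(h)
        h = self.drop(self.tanh(h[:, -1]))           # last time step, Tanh, dropout
        return self.fc3(self.relu(self.fc2(h)))      # prediction score for each emotion category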

Because the quality of open-source speech emotion datasets is poor, the present invention classifies and augments three open-source speech emotion datasets and uses them as the speech emotion dataset of the voiceprint recognition module. Both the quantity and the quality of the augmented speech emotion dataset are greatly improved and meet the training requirements of the voiceprint recognition model.

Specifically, data preprocessing is performed first: the speech emotion dataset is divided into a training set and a test set according to a certain proportion, and the files are traversed to generate label information corresponding to each path and file name; a normalization model is generated for training. Training uses batch_size = 128 and runs for 200 epochs. A log is generated during training, and the Paddle log-viewing tool can be used to monitor training in real time. Figure 9 shows the training-set loss; with the cross-entropy loss function and 200 epochs of training, the best loss is 0.22. Figures 10 and 11 show the change in accuracy, the general performance indicator of the classification function; accuracy is the ratio of the number of samples correctly predicted by the voiceprint recognition model to the total number of samples and measures how well the model classifies: accuracy = (number of correctly predicted samples) / (total number of samples). After 200 epochs, the accuracy on the training set reaches 0.87 and the accuracy on the test set is 0.90. Figure 12 is the confusion matrix of the voiceprint recognition model, which shows the correspondence between the model's predictions on the test set and the true labels, as well as the correct and incorrect classifications. As can be seen from Figures 9 to 12, the accuracy of all emotions except fear exceeds 90%; because the fear emotion has relatively little data, its accuracy is 43%. The training results meet the requirements of the invention.
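As a small illustration of the metrics discussed above, accuracy and the confusion matrix can be computed with scikit-learn, which the project already installs; the label arrays below are made-up examples.

# Small sketch of the evaluation metrics; labels are illustrative only.
from sklearn.metrics import accuracy_score, confusion_matrix

emotions = ["angry", "fear", "happy", "nervous", "sad", "surprise"]
y_true = ["happy", "sad", "fear", "angry", "happy", "surprise"]
y_pred = ["happy", "sad", "happy", "angry", "happy", "surprise"]

# accuracy = correctly predicted samples / total samples
print(accuracy_score(y_true, y_pred))                     # 5 of 6 correct, about 0.833
print(confusion_matrix(y_true, y_pred, labels=emotions))  # rows: true labels, columns: predictions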

In the present invention, speech emotion recognition is also based on Paddle, so the corresponding Python environment is simply added on top of the speech-to-text recognition environment. After the emotion recognition model was deployed, testing showed that it successfully recognizes the emotion in speech in about 2 s, which meets the requirements of the present invention.

The speech synthesis module is connected to the human-computer interaction module and is used to generate multi-style speech from the speech input by the user based on the speech synthesis model. Specifically, the speech synthesis model is obtained by training the fastspeech2 model in advance with a multi-speaker dataset and then converting it into a static graph model using the Paddle tools.

In the present invention, multi-style speech synthesis uses the fastspeech2 model based on the Paddle framework, the vocoder uses the Parallel WaveGAN pre-trained model, and multi-style speech synthesis is realized after training with the AISHELL-3 multi-speaker speech database. FastSpeech 2 is an end-to-end speech synthesis model published in 2020. Its core idea is to use a non-autoregressive Transformer model to generate acoustic features and then convert these features into audio waveforms with a vocoder; it is currently a mainstream TTS (text-to-speech) model. The present invention uses the PaddleSpeech project to implement the fastspeech2 function. The fastspeech2-based speech synthesis model provides multiple speaking styles, and the voice style can be selected as needed.

The present invention implements the PaddleSpeech project according to the process shown in Figure 13 and uses the AISHELL-3 multi-speaker dataset for training. Because the dataset is large and training from scratch consumes a lot of resources, fine-tuning is performed on the basis of the pre-trained model fastspeech2_aishell_ckpt_1.1.0. The vocoder uses the Parallel WaveGAN pre-trained model pwg_aishell3_ckpt_0.5. During inference, the speaker id needs to be specified according to the AISHELL-3 dataset description to achieve multi-speaker speech synthesis. Figure 14 is the Mel spectrogram of a real voice from the AISHELL-3 dataset and Figure 15 is the Mel spectrogram of a voice synthesized by fastspeech2; the timbre of the synthesized voice is fairly realistic. In addition, the speech synthesis model provides more than 100 speech styles, which meets the requirement of multi-style speech synthesis.
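A minimal sketch of multi-speaker synthesis with the PaddleSpeech TTS executor is shown below; the model tags, speaker id, input text and output path are assumptions based on the public AISHELL-3 recipes, not the device's actual configuration.

# Hypothetical sketch of multi-speaker synthesis with PaddleSpeech.
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(
    text="欢迎来到智慧公园",
    am="fastspeech2_aishell3",   # acoustic model trained on AISHELL-3
    voc="pwgan_aishell3",        # Parallel WaveGAN vocoder
    lang="zh",
    spk_id=10,                   # choose a speaker/style from the AISHELL-3 speaker list
    output="welcome.wav",
)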

The deployment of the speech synthesis model is the same as that of the text recognition model and the emotion recognition model: it requires local compilation for the ARM environment and model optimization. In the speech synthesis module, the original fastspeech2 is a dynamic graph model. The advantage of a dynamic graph is that the values of variables can be inspected while the network is being built, which makes debugging easy; the disadvantage is that, because the next operation is not known in advance, the forward computation is hard to optimize, and since a new computation graph is rebuilt on every run, memory usage is also higher. A static graph, by contrast, allows the graph structure to be optimized before running, for example with constant folding and operator fusion, giving faster forward computation. Therefore, the present invention uses the Paddle toolset to convert the project's dynamic graph model into a static graph model. After optimization, the speech synthesis model runs successfully and takes about 5 s to synthesize roughly 10 s of speech.
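A minimal sketch of such a dynamic-to-static export with the Paddle JIT tools follows; the stand-in network, input specification and save path are illustrative assumptions.

# Hypothetical sketch of exporting a dynamic graph Paddle model to a static graph.
import paddle
import paddle.nn as nn
from paddle.static import InputSpec


class TinyAcousticModel(nn.Layer):
    # Stand-in for the trained fastspeech2 network, just to make the sketch runnable.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 80)

    def forward(self, phone_ids):
        return self.embed(phone_ids)          # placeholder for mel-spectrogram frames


model = TinyAcousticModel()
static_model = paddle.jit.to_static(
    model,
    input_spec=[InputSpec(shape=[1, None], dtype="int64", name="phone_ids")],
)
paddle.jit.save(static_model, "inference/fastspeech2")  # writes *.pdmodel / *.pdiparams files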

To address the insufficient intelligence of traditional interactive devices, the present invention introduces deep learning and uses an SOC main controller to realize intelligent voice interaction. Adding emotion recognition to speech recognition allows text and emotion to be recognized from speech, making voice interaction closer to real communication scenarios. An HMI host computer is built with a serial-port screen, and the UART function is used so that the host computer controls the device, making the interaction more intuitive and convenient and easy to apply and popularize in practical engineering.

Specific examples are used herein to illustrate the principles and implementations of the present invention. The description of the above embodiments is only intended to help understand the method of the present invention and its core idea; at the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (7)

1. A multifunctional voice interaction device based on deep learning, characterized in that the multifunctional voice interaction device based on deep learning comprises:

a human-computer interaction module, configured to provide a human-computer interaction interface for the user, receive the voice and instructions input by the user, display the speech text and emotion category, and play multi-style speech;

a speech recognition module, connected to the human-computer interaction module and configured to perform text recognition and emotion recognition on the speech input by the user based on deep learning models, to obtain the speech text and emotion category;

a speech synthesis module, connected to the human-computer interaction module and configured to generate multi-style speech from the speech input by the user based on a speech synthesis model.

2. The multifunctional voice interaction device based on deep learning according to claim 1, characterized in that the human-computer interaction module, the speech recognition module and the speech synthesis module are all built on the Jetson Nano SOC main control chip.

3. The multifunctional voice interaction device based on deep learning according to claim 1, characterized in that the human-computer interaction module includes an HMI configuration screen, a microphone and a speaker.

4. The multifunctional voice interaction device based on deep learning according to claim 1, characterized in that the speech recognition module includes:

a speech-to-text recognition sub-module, configured to perform text recognition on the speech input by the user based on a text recognition model; the text recognition model is obtained by training the Efficient Conformer model in advance on a personal computer with a speech-text dataset; the speech-text dataset includes multiple speech samples and the text corresponding to each speech sample;

a speech emotion recognition sub-module, configured to perform emotion recognition on the speech input by the user based on an emotion recognition model, so as to determine the user's emotion category; the emotion recognition model is obtained by training a voiceprint recognition model in advance on a personal computer with a speech emotion dataset; the speech emotion dataset includes multiple speech samples and the emotion category corresponding to each speech sample; the voiceprint recognition model includes a first fully connected layer, a reshaping layer, a bidirectional LSTM layer, a Tanh activation layer, a dropout layer, a second fully connected layer, a ReLU activation layer and a third fully connected layer connected in sequence.

5. The multifunctional voice interaction device based on deep learning according to claim 4, characterized in that the text recognition model includes a CNN front end, multiple LAC encoders and multiple LAC decoders connected in sequence;

the CNN front end includes a first convolutional neural network, a second convolutional neural network and an embedding neural network connected in sequence;

each LAC encoder includes a first low-rank feedforward neural network, a first multi-head attention layer, a third convolutional neural network and a second low-rank feedforward neural network connected in sequence;

each LAC decoder includes a masked multi-head attention layer, a first normalization layer, a second multi-head attention layer, a second normalization layer, a third low-rank feedforward neural network and a third normalization layer connected in sequence.

6. The multifunctional voice interaction device based on deep learning according to claim 4, characterized in that the speech-text dataset is obtained by preprocessing the Chinese female voice database AISHELL-1 with linear energy maps and applying noise augmentation, speech-rate augmentation, volume augmentation, offset augmentation and spectrum augmentation with the ffmpeg tool.

7. The multifunctional voice interaction device based on deep learning according to claim 1, characterized in that the speech synthesis model is obtained by training the fastspeech2 model in advance with a multi-speaker dataset and then converting it into a static graph model using the Paddle tools.
CN202311721463.2A 2023-12-14 2023-12-14 Multifunctional voice interaction device based on deep learning Pending CN117711409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311721463.2A CN117711409A (en) 2023-12-14 2023-12-14 Multifunctional voice interaction device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311721463.2A CN117711409A (en) 2023-12-14 2023-12-14 Multifunctional voice interaction device based on deep learning

Publications (1)

Publication Number Publication Date
CN117711409A true CN117711409A (en) 2024-03-15

Family

ID=90161869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311721463.2A Pending CN117711409A (en) 2023-12-14 2023-12-14 Multifunctional voice interaction device based on deep learning

Country Status (1)

Country Link
CN (1) CN117711409A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200406144A1 (en) * 2019-06-28 2020-12-31 Activision Publishing, Inc. Systems and Methods for Dynamically Generating and Modulating Music Based on Gaming Events, Player Profiles and/or Player Reactions
CN110609620A (en) * 2019-09-05 2019-12-24 深圳追一科技有限公司 Human-computer interaction method and device based on virtual image and electronic equipment
CN110956953A (en) * 2019-11-29 2020-04-03 中山大学 Quarrel recognition method based on audio analysis and deep learning
CN111508500A (en) * 2020-04-17 2020-08-07 五邑大学 Voice emotion recognition method, system, device and storage medium
CN114549860A (en) * 2022-02-24 2022-05-27 西安科技大学 Method for converting sign language into voice in real time
CN115329057A (en) * 2022-08-04 2022-11-11 北京声智科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN115424606A (en) * 2022-09-01 2022-12-02 北京捷通华声科技股份有限公司 Voice interaction method, voice interaction device and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHENGQIANG LI et al.: "Efficient conformer-based speech recognition with linear attention", 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 3 February 2022, pages 448-450 *
商鹏飞: "Design and implementation of an intelligent interaction device based on speech and video processing", China Master's Theses Full-text Database, 30 June 2024 *
张悦: "Speech emotion recognition based on deep learning", Wanfang Data knowledge service platform, 16 August 2022, pages 31-37 *

Similar Documents

Publication Publication Date Title
Li et al. Controllable emotion transfer for end-to-end speech synthesis
CN112802450B (en) Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN114783403B (en) Method, apparatus, device, storage medium and program product for generating audio reading material
CN115762536A (en) Small sample optimization bird sound recognition method based on bridge transform
CN118116364B (en) Speech synthesis model training method, speech synthesis method, electronic device, and storage medium
CN114694633A (en) Speech synthesis method, device, equipment and storage medium
Zhao et al. Applications of deep learning to audio generation
CN111193834A (en) Human-computer interaction method, device and electronic device based on user voice feature analysis
CN118656784A (en) A method, system, device and medium for emotion recognition based on multimodal fusion
CN114283820A (en) Interaction method, electronic device and storage medium for multi-role voice
CN119068863A (en) A speech synthesis method, device, computer equipment and storage medium
CN119360887A (en) A voice authentication method and related equipment
CN112382269B (en) Audio synthesis method, device, equipment and storage medium
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
CN117219046A (en) Interactive voice emotion control method and system
CN116798403A (en) Speech synthesis model method capable of synthesizing multi-emotion audio
CN115171660A (en) Voiceprint information processing method and device, electronic equipment and storage medium
CN120319206A (en) Automatic music melody generation system based on artificial intelligence
CN117711409A (en) Multifunctional voice interaction device based on deep learning
CN117975932B (en) Speech recognition method, system and medium based on network collection and speech synthesis
CN119541451A (en) Speech synthesis method, device, equipment and computer medium
CN117590949A (en) Man-machine interaction method and device for virtual character
CN118782082A (en) A method, device and storage medium for lip shape generation based on deep learning
CN112242134A (en) Speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination