
CN111898420A - A lip language recognition system - Google Patents


Info

Publication number
CN111898420A
CN111898420A
Authority
CN
China
Prior art keywords
video
lip
lip language
human
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010556817.2A
Other languages
Chinese (zh)
Inventor
鲁远耀
李宏波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202010556817.2A priority Critical patent/CN111898420A/en
Publication of CN111898420A publication Critical patent/CN111898420A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language recognition system, which comprises a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal-slot mechanism; the human-computer interaction interface is used for acquiring the video to be recognized; the algorithm module is used for performing lip language recognition on the video to be recognized to obtain a lip language recognition result; and the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process in chronological order. The lip language recognition system designed by the invention makes it possible to observe and analyze model performance and every stage of the process from the original video to the final recognition result, so that the model and algorithm can be improved and optimized.

Description

A lip language recognition system

Technical Field

The present invention relates to the field of lip language recognition, and in particular to a lip language recognition system.

Background Art

Lip language recognition enables a computer to understand the linguistic content expressed by a speaker from the dynamic shape changes of the speaker's lips. Lip recognition technology can recognize what a speaker is saying in noisy environments, or even in scenarios without any audio capture device, which makes it applicable to many scenarios across different fields: automatic lip reading can be widely used in virtual reality systems, information security, speech recognition, and driver-assistance systems. With the development of the Internet of Things and 5G technology, lip recognition will also be applied in fields such as smart homes, intelligent driving, and intelligent communication.

Traditional lip language recognition technology is based on image processing and pattern recognition to complete the process from a sequence of images to a recognition result, so the recognition task can be divided into: lip region localization, feature extraction from the lip region of interest, and lip language content recognition. Among these, feature extraction and classifier design are the key and difficult parts of the recognition process. The mainstream transformation methods generally used are: Principal Component Analysis (PCA), Discrete Cosine Transformation (DCT), Singular Value Decomposition (SVD), and Independent Component Analysis (ICA) from signal processing. The content recognition of lip language is mainly carried out by manually designed classifiers, which can generally be divided into: Hidden Markov Models (HMM), Artificial Neural Networks (ANN), Template Matching Algorithms (TMA), and Support Vector Machines (SVM). Current research on lip language recognition focuses mainly on how to realize lip language recognition and how to improve the existing recognition process to obtain more accurate results. During recognition, researchers cannot intuitively identify from the process itself what causes large recognition errors; they usually need a large number of experiments to first locate the error-producing stage by elimination and then address its cause. This process is not only inefficient, but also fails to yield an intuitive view of the lip language recognition process.

SUMMARY OF THE INVENTION

In order to remedy the above deficiencies in the prior art, the present invention provides a lip language recognition system, comprising: a human-computer interaction interface and an algorithm module; the human-computer interaction interface and the algorithm module are connected through a signal-slot mechanism;

the human-computer interaction interface is used to acquire the video to be recognized;

the algorithm module is used to perform lip language recognition on the video to be recognized to obtain a lip language recognition result;

the human-computer interaction interface is further used to display the lip language recognition result and to display the lip language recognition process in chronological order.

Preferably, the algorithm module comprises:

a fixed frame extraction sub-module, used to extract the video frames to be processed from the video to be recognized based on a semi-random fixed frame extraction strategy;

a segmentation sub-module, used to segment lip images from the video frames being processed to obtain a lip data set;

a recognition sub-module, used to recognize each lip image in the lip data set based on a designed model to obtain a lip language recognition result.

Preferably, the recognition sub-module comprises:

a feature extraction unit, used to perform feature extraction on the lip images to obtain image features, and to perform slicing operations on the lip images and image features of each convolutional layer to obtain visualized lip images and high-dimensional image features for each convolutional layer;

a temporal feature extraction unit, used to perform temporal feature extraction on the image features to obtain sequence features, and to perform a slicing operation on the sequence features to obtain visualized sequence features;

a classification unit, used to classify the extracted temporal features to obtain a lip language recognition result.

Preferably, the feature extraction unit is a convolutional neural network (CNN).

Preferably, the temporal feature extraction unit is a recurrent neural network (RNN).

Preferably, the classification unit is a softmax classifier.

Preferably, the human-computer interaction interface comprises:

a select-video option, used to acquire the video to be recognized when triggered;

a recognize-video option, used to perform lip language recognition on the video to be recognized when triggered, to obtain a recognition result;

a visualization option, used to display the lip language recognition process and the lip language recognition result when triggered, based on a configuration file set according to the visualization requirements;

wherein the displayed content includes the video frames to be recognized, the lip images segmented from the video frames, the visualized lip images of each convolutional layer, the visualized high-dimensional image features, the visualized sequence features, and/or at least one recognition result corresponding to the video to be recognized.

Preferably, the fixed frame extraction sub-module is specifically used to:

determine the fixed number of frames to be extracted based on prior conditions;

divide the video to be recognized into a plurality of region blocks according to the total number of video frames;

wherein the range of each region block is averaged as evenly as possible.

Preferably, the video to be recognized comprises:

video of the same target object collected by at least one capture device.

Preferably, the human-computer interaction interface is designed and built with the PyQt5 framework.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The lip language recognition system provided by the present invention includes a human-computer interaction interface and an algorithm module; the human-computer interaction interface and the algorithm module are connected through a signal-slot mechanism; the human-computer interaction interface is used to acquire the video to be recognized; the algorithm module is used to perform lip language recognition on the video to be recognized to obtain a lip language recognition result; the human-computer interaction interface is also used to display the lip language recognition result and to display the recognition process in chronological order. This makes it possible to observe and analyze the performance of the algorithm module and every stage of the process from the video to be recognized to the final recognition result, so that the algorithm module can be improved and optimized.

Description of the Drawings

Fig. 1 is a schematic diagram of the lip language recognition system provided by the present invention;

Fig. 2 is a schematic diagram of the lip language recognition system architecture and algorithm flow in the present invention;

Fig. 3 is a schematic diagram of the functions of the lip language recognition system in the present invention;

Fig. 4 is a schematic diagram of the model architecture of the lip reading recognition system in the present invention;

Fig. 5 is a structural diagram of the basic LSTM unit in the present invention;

Fig. 6 is a schematic diagram of the probabilities obtained by the softmax classifier in the present invention;

Fig. 7 shows the human-computer interaction interface of the video lip language recognition system in the present invention;

Fig. 8 is a schematic diagram of selecting a video in the human-computer interaction interface provided by the present invention;

Fig. 9 is a schematic diagram of the display of the system's video recognition result in the present invention;

Fig. 10 is a schematic diagram of visualization results displayed on the human-computer interaction interface in the present invention;

Fig. 11 is a schematic diagram of the loss function curves in different periods in the present invention;

Fig. 12 is a schematic diagram of the accuracy curves in different periods in the present invention;

Fig. 13 is a schematic diagram of the recall for each isolated digit pronunciation video clip in the present invention.

Detailed Description of the Embodiments

In order to better understand the present invention, the content of the present invention is further described below with reference to the accompanying drawings and examples.

Lip language recognition algorithms have broad application scenarios, such as model comparison in scientific research, assisted pronunciation training, complementing the functional needs of auditory speech recognition, and the further development of biometric and XR technologies. At present, however, no researcher in the field of lip language recognition has completed the engineering design of a lip language feature recognition system, and there is no analytical tool for feature visualization. To overcome the current lack of research tools that can analyze visual features and video temporal features, the lip language recognition system provided by the present invention offers computer vision researchers convenient feature-engineering visualization: it allows in-depth observation of the features learned during the supervised convolution process as well as the changes in the temporal features of the recurrent neural network, which greatly facilitates subsequent research on feature-engineering design and more effective feature extraction.

As shown in Fig. 1, the lip language recognition system provided by the present invention includes: a human-computer interaction interface and an algorithm module; the human-computer interaction interface and the algorithm module are connected through a signal-slot mechanism;

the human-computer interaction interface is used to acquire the video to be recognized;

the algorithm module is used to perform lip language recognition on the video to be recognized to obtain a lip language recognition result;

the human-computer interaction interface is further used to display the lip language recognition result and to display the lip language recognition process in chronological order.

The lip language recognition system uses the PyQt5 framework to design and build the human-computer interaction interface, and the algorithm module uses Qt multithreading to handle information processing and computation. The requirements of this lip language recognition system are defined as operations for displaying the recognition process, such as visual feature visualization and temporal feature visualization; in addition, a Top-3 accuracy analysis is added to identify which words are similar to one another, so that the data and scheme can be optimized and adjusted to avoid risk points that could affect the progress and results of lip language recognition research. In this embodiment, for situations such as crashes and exceptions that affect the normal operation of the lip language recognition system, an exception-catching mechanism is also designed to throw and record the cause and details of the exception, which helps to quickly locate the cause and improve repair efficiency.
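
As an illustration of how such a signal-slot connection between the interface and a worker thread might look, the following minimal PyQt5 sketch emits a recognition result from a background thread back to the interface. The class names, signal name, and the dummy result are illustrative assumptions and are not taken from the actual system:

```python
# Hypothetical sketch: connecting a PyQt5 interface to an algorithm worker
# thread via signals and slots. Names are illustrative, not from the patent.
import sys
from PyQt5.QtCore import QThread, pyqtSignal
from PyQt5.QtWidgets import QApplication, QWidget, QPushButton, QLabel, QVBoxLayout


class RecognitionWorker(QThread):
    # Signal carrying the recognition result (the predicted word) back to the UI.
    result_ready = pyqtSignal(str)

    def __init__(self, video_path):
        super().__init__()
        self.video_path = video_path

    def run(self):
        # Placeholder for the real pipeline: frame extraction, lip segmentation,
        # CNN + LSTM inference, softmax classification.
        predicted_word = "Six"  # dummy result for illustration
        self.result_ready.emit(predicted_word)


class MainWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.button = QPushButton("Recognize video")
        self.label = QLabel("Result: -")
        layout = QVBoxLayout(self)
        layout.addWidget(self.button)
        layout.addWidget(self.label)
        self.button.clicked.connect(self.start_recognition)

    def start_recognition(self):
        self.worker = RecognitionWorker("test/six.avi")
        # Slot on the UI side receives the result emitted by the worker thread.
        self.worker.result_ready.connect(self.show_result)
        self.worker.start()

    def show_result(self, word):
        self.label.setText("Result: " + word)


if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    sys.exit(app.exec_())
```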

This technical solution builds a highly cohesive, loosely coupled, and efficient lip language recognition system in the form of modular interface code. The lip language recognition system runs on multiple platforms such as Ubuntu, Mac, and Windows, and can recognize lip language in video, with particularly good recognition of isolated digit pronunciations. The lip language recognition system designed by the present invention facilitates and deepens future research on lip language recognition and on feature-engineering visualization.

Fig. 2 shows the overall recognition flow of the system. The front-end human-computer interaction interface and the back-end algorithm module are connected through a signal-slot mechanism, and interface calls are used to load the video to be recognized and to obtain the content to be displayed. The video to be recognized is processed by the decomposition algorithm and then by the designed algorithm model, and the final result is fed back through signals and slots to each front-end component for display. Fig. 7 shows the lip language recognition system's human-computer interaction interface in detail. As shown in Fig. 7, the interface includes the select-video, visualization, and recognize-video options, a display box for the lip language recognition result, a display box for the Top-3 accuracy, a facial feature display area, a mouth feature display area, display areas for convolutional layer 1 and convolutional layer 2, an image feature display area, and a sequence feature display area. Fig. 10 shows the visualization results for a specific embodiment.

All feature processes displayed on the front-end human-computer interaction interface can be set through configuration files, satisfying a highly available and highly customizable general system design. The lip language recognition system provided by the present invention can both help beginners understand the convolution computation process and provide researchers with new ideas and methods for handling feature engineering in lip language recognition.

The lip language recognition system provided by the present invention is highly cohesive, loosely coupled, highly available, and highly customizable. The front end and back end are designed in the form of modular interfaces, which yields high cohesion and low coupling; a partial error will not paralyze the whole system or require it to be rebuilt, and the system's model interface has the advantage of being easy to update and iterate. As research on lip language recognition technology deepens, the system can continue to be used simply by replacing the model parameters and the inference procedure; in addition, the back-end model only provides an interface that returns results to the front end, so the system is highly decoupled and highly available. Both the visual feature display and the temporal feature display are highly customizable and can show any intermediate result of interest. Fig. 3 shows the content displayed on the human-computer interaction interface: the system can show intermediate feature extraction, lip localization, and other functions, and can also analyze which similar pronunciations are easily confused, which lays a good foundation for subsequent research.

The lip language recognition system provided by the present invention is divided into six parts: the human-computer interaction interface, lip segmentation, CNN feature extraction, RNN temporal feature extraction, fully connected classification, and result display, where the human-computer interaction interface is also called the user interface.

The user interface is based on the Python version of the Qt5 framework (PyQt5); the variables produced by algorithm inference are passed between the front end and the back end through signals and slots. Lip segmentation uses the 68-point face landmark model of the Dlib library, which can reliably detect the lip position; the segmentation is then performed using the following formulas (1) and (2).

[Formulas (1) and (2) are given as images in the original document.]

Here, the lip center position is computed from the lip coordinate points and denoted (x_0, y_0); let w and h denote the width and height of the mouth image, respectively, and let L_1 and L_2 denote the left-right and upper-lower dividing lines enclosing the mouth, respectively. The bounding box of the mouth is computed according to the above formulas.
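
The following sketch illustrates one way this lip localization could be implemented with Dlib's 68-point landmarks. The exact margins around the mouth are determined by formulas (1) and (2), which appear only as images in the original document, so the margin used here is an assumption; the landmark indices 48-67 for the mouth are standard for the 68-point model:

```python
# Hypothetical sketch of lip localization with dlib's 68-point landmarks.
# The exact margins around the mouth come from formulas (1)-(2), which appear
# only as images in the original document; the values here are assumptions.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The standard 68-point predictor file must be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def segment_mouth(frame, margin=0.6):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model outline the mouth.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    x0, y0 = pts.mean(axis=0)                 # lip center (x0, y0)
    w = pts[:, 0].max() - pts[:, 0].min()     # mouth width
    h = pts[:, 1].max() - pts[:, 1].min()     # mouth height
    # L1 / L2: left-right and top-bottom dividing lines enclosing the mouth,
    # here taken as the center plus a margin-scaled half width / height.
    half_w = (0.5 + margin) * w
    half_h = (0.5 + margin) * h
    x1, x2 = int(x0 - half_w), int(x0 + half_w)
    y1, y2 = int(y0 - half_h), int(y0 + half_h)
    crop = frame[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(crop, (224, 224))       # standard CNN input size
```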

For lip reading the entire video, the approach adopted is to extract a fixed number of frames; the frame-extraction algorithm is given by formulas (3) and (4):

[Formulas (3) and (4) are given as images in the original document.]

where x denotes the number of blocks into which the video v is divided, A denotes taking i frames in order within block_n, and F denotes the index of the frame finally taken from each block.

In this embodiment, a Semi-random Fixed Frame Extraction Strategy (SFFES) is used to extract a fixed number of frames from the video. A large number of experimental studies show that the SFFES designed by the present invention is flexible, has low time complexity, and is robust to interference. The idea of the strategy is to first partition the video into region blocks according to the total number of video frames. Specifically, given that the fixed number of frames to be extracted is known a priori to be n, the range of each region block is made as equal as possible; the number of remaining frames x is not greater than the number of region blocks n, and the range of each of the first x region blocks is then enlarged by 1. At this point, the range of each region block has been semi-randomly assigned.

This computation is robust, computationally efficient, and produces consistent feature vectors, and it removes redundant information well.
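
Formulas (3) and (4) appear only as images in the original document, so the following sketch only captures the block-partition idea described above; in particular, sampling one frame uniformly at random within each block is an assumption that matches the "semi-random" description rather than the patent's exact rule:

```python
# Hypothetical sketch of the semi-random fixed-frame extraction idea (SFFES).
# The in-block sampling rule used here (uniform random within each block) is an
# assumption; formulas (3)-(4) give the exact rule in the original document.
import random


def sffes_indices(total_frames, n_fixed=10, seed=None):
    """Pick n_fixed frame indices from a clip with total_frames frames."""
    rng = random.Random(seed)
    base = total_frames // n_fixed          # minimum block size
    remainder = total_frames % n_fixed      # x remaining frames (x <= n_fixed)
    indices, start = [], 0
    for block in range(n_fixed):
        # The first `remainder` blocks are enlarged by one frame.
        size = base + (1 if block < remainder else 0)
        indices.append(rng.randrange(start, start + size))
        start += size
    return indices


# Example: a 1-second clip at 25 fps reduced to 10 representative frames.
print(sffes_indices(25, n_fixed=10, seed=0))
```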

After mouth segmentation is completed, the resulting raw lip data set is processed into standard 224x224-pixel images. The overall structure of the model is shown in Fig. 4. The model designed in this embodiment is only one example that can perform the basic lip reading recognition function; in practical applications, the lip language recognition model can be designed as required. In the model architecture designed in this embodiment, features are extracted by VGG16, generating a 4096x1 feature vector at each time step; the sequence length of the model is 10, so the feature vector for the whole action has size 4096x10. The features are then fed into an LSTM for temporal feature extraction, and finally, based on the output of the last LSTM unit, classification is performed by two fully connected layers and a softmax. The Top-3 accuracy is obtained by taking the three largest probability values of the final softmax output, and the intermediate results and visualization images of feature extraction are obtained by slicing the intermediate-layer outputs of the model.
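
A minimal PyTorch sketch of the described pipeline (per-frame VGG16 features of size 4096, a sequence length of 10, an LSTM, two fully connected layers, and a softmax) is shown below. The LSTM hidden size and intermediate layer width are assumptions, and the attention mechanism discussed later in the embodiment is omitted:

```python
# Minimal PyTorch sketch of the described pipeline: per-frame VGG16 features
# (4096-d, 10 time steps) -> LSTM -> two fully connected layers -> softmax.
# Hidden sizes and the omission of the attention mechanism are assumptions.
import torch
import torch.nn as nn
from torchvision import models


class LipReadingNet(nn.Module):
    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        vgg = models.vgg16(weights=None)
        self.features = vgg.features                     # convolutional backbone
        self.avgpool = vgg.avgpool
        # Drop the final classification layer so each frame yields a 4096-d vector.
        self.fc_frame = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, clips):                             # clips: (B, T=10, 3, 224, 224)
        b, t = clips.shape[:2]
        x = clips.view(b * t, *clips.shape[2:])
        x = self.avgpool(self.features(x)).flatten(1)
        x = self.fc_frame(x).view(b, t, 4096)             # (B, 10, 4096)
        out, _ = self.lstm(x)
        logits = self.classifier(out[:, -1])               # output of the last LSTM step
        return torch.softmax(logits, dim=1)


probs = LipReadingNet()(torch.randn(1, 10, 3, 224, 224))   # (1, 10) class probabilities
```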

Because the traditional RNN has a long-term dependency problem, the amount of information gradually decays over time during iteration, so outputs far back in time have very little influence on the input to the hidden layer at the current moment. The traditional RNN is therefore only suitable for processing short time series; when the time interval of the sequence is long, the RNN has difficulty expressing the implicit correlations within the sequence.

To address the problem that the hidden layer of an RNN is prone to gradient vanishing and gradient explosion when passing information through long sequences, a Long Short-Term Memory (LSTM) structure is adopted, which is specifically designed to handle the information loss that occurs with long-term dependent sequences. The LSTM stores historical information by introducing a memory cell, and controls the addition and removal of information in the network by introducing three gate structures: the input gate, the forget gate, and the output gate. It remembers the relevant information that needs to be remembered in long sequences and forgets useless information, so as to better discover and exploit long-term dependencies in sequence data such as video, audio, and text. Fig. 5 shows the operations performed inside a single LSTM unit, where x_t denotes the input vector of the network node at time t, h_t denotes the output vector of the network node at time t, and i_t, f_t, o_t, and c_t denote the input gate, forget gate, output gate, and memory cell at time t, respectively.

The input gate, forget gate, memory cell, and output gate inside the LSTM unit are described below:

(1) Input gate: this gate controls the information entering the node. The input information consists of two parts: first, the sigmoid activation function decides which new information needs to be input, and then the tanh function selects the new information to be stored in the cell. The output i_t of the input gate and the candidate information g_t are expressed mathematically as:

i_t = σ(U_i x_t + W_i h_{t-1} + b_i)    (5)

g_t = tanh(U_g x_t + W_g h_{t-1} + b_g)    (6)

where U_i, W_i, and b_i denote the weights and bias of the input gate, U_g, W_g, and b_g denote the weights and bias of the candidate state, σ denotes the sigmoid activation function, and tanh is the hyperbolic tangent activation function.

(2) Forget gate: this gate controls which information the current LSTM unit discards. The sigmoid activation function outputs a value between 0 and 1; the closer the value is to 1, the more useful information the node contains at the current moment and the more information is retained for the next moment; the closer the value is to 0, the less useful information the node contains and the more information is discarded before the next moment. The forget gate f_t is expressed mathematically as:

f_t = σ(U_f x_t + W_f h_{t-1} + b_f)    (7)

where U_f, W_f, and b_f denote the weights and bias of the forget gate, and σ denotes the sigmoid activation function.

(3) Memory cell: this cell stores the state information and performs the state update. The memory cell c_t is expressed mathematically as:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (8)

where ⊙ denotes the Hadamard (element-wise) product.

(4) Output gate: this gate controls the information output by the node. First, the sigmoid function determines which information needs to be output, giving the initial output value o_t; then the tanh function maps c_t into the interval (-1, 1); finally, this is multiplied element-wise by the initial output value o_t to obtain the output h_t of the LSTM unit. Thus h_t is jointly determined by o_t and the memory cell c_t; the mathematical expressions for o_t and h_t are:

o_t = σ(U_o x_t + W_o h_{t-1} + b_o)    (9)

h_t = o_t ⊙ tanh(c_t)    (10)

where U_o, W_o, and b_o denote the weights and bias of the output gate.
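
To make the gate computations concrete, the following numpy sketch evaluates equations (5) to (10) for a single time step; it is purely illustrative and is not the implementation used in the system:

```python
# Illustrative numpy evaluation of equations (5)-(10) for one LSTM step;
# not the implementation used in the system, just the math spelled out.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    U_i, W_i, b_i, U_g, W_g, b_g, U_f, W_f, b_f, U_o, W_o, b_o = params
    i_t = sigmoid(U_i @ x_t + W_i @ h_prev + b_i)      # (5) input gate
    g_t = np.tanh(U_g @ x_t + W_g @ h_prev + b_g)      # (6) candidate information
    f_t = sigmoid(U_f @ x_t + W_f @ h_prev + b_f)      # (7) forget gate
    c_t = f_t * c_prev + i_t * g_t                     # (8) memory cell (Hadamard products)
    o_t = sigmoid(U_o @ x_t + W_o @ h_prev + b_o)      # (9) output gate
    h_t = o_t * np.tanh(c_t)                           # (10) unit output
    return h_t, c_t


# Tiny example with a 4-d input and a 3-d hidden state.
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) for s in [(3, 4), (3, 3), (3,)] * 4]
h, c = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), params)
```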

Logistic regression and SVMs generally solve binary classification problems, and multi-class classification can also be built from multiple binary classifiers. From a mathematical point of view, softmax is preferable for mutually exclusive classes, whereas combinations of classifiers such as logistic regression or SVMs are used for non-mutually-exclusive classes. In softmax, the probability that sample x belongs to class j can be expressed as:

[Formula (11) is given as an image in the original document.]

where j = 1, 2, ..., K; the softmax loss function can then be expressed as:

[Formula (12) is given as an image in the original document.]

where K is the number of classes, p ∈ {0, 1}^K, and w denotes the network weights.

In general, after the input features pass through the feature processing layers, the softmax classifier produces a probability distribution. As shown in Fig. 6, the probabilities of the three classes are [0.88, 0.10, 0.02]; their sum is 1, showing that the classes are treated as mutually exclusive events.
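
The following short numpy example illustrates how class probabilities and a Top-3 list such as the one in Fig. 9 can be derived from raw scores; the numbers are made up for illustration:

```python
# Illustrative softmax + Top-3 extraction over 10 digit classes; the logits
# and resulting probabilities here are made-up numbers, not system output.
import numpy as np

classes = ["Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
logits = np.array([2.1, -0.5, 0.3, -1.2, 0.0, 1.0, 6.2, -0.8, 0.1, -0.3])

probs = np.exp(logits - logits.max())
probs /= probs.sum()          # probabilities over mutually exclusive classes, summing to 1

top3 = np.argsort(probs)[::-1][:3]
for idx in top3:
    print(f"{classes[idx]}: {probs[idx] * 100:.2f}%")
```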

As shown in Fig. 7, the human-computer interaction interface of the lip language recognition system first provides a Top-3 recognition result display and probability statistics; second, it displays the video frames to be recognized; it then displays the feature extraction and temporal feature extraction processes during lip reading recognition, i.e. it provides feature visualization for both the CNN and the RNN, through which the changes of the feature vectors during feature extraction can be observed; finally, it shows the lip segmentation process for the video to be recognized. With these functions, the system serves well as an analysis tool for problems encountered during model validation. In this embodiment, the CNN feature visualization includes the lip images of each convolutional layer and the high-dimensional image features, and the RNN feature visualization includes the visualized sequence features.

In this embodiment, different camera capture devices record video of the same target object from the same angle; the lip language recognition system provided by this application recognizes the videos captured by the different devices and displays, on the human-computer interaction interface, the captured video, the lip segmentation process, and the recognition result. From this it can be determined whether the lip regions segmented from the different devices' videos are inaccurate or offset, and by comparing the recognition results, the devices can be ranked by the accuracy of the videos they capture, providing researchers with a basis for selecting camera capture devices.

This embodiment also provides for recording video of the same target object from different angles with the same camera capture device. The lip language recognition system provided by this application recognizes the videos captured at the different angles and displays, on the human-computer interaction interface, the captured video, the lip segmentation process, and the recognition result. By comparing the displayed results for videos shot by the same device at different angles, it can be determined that the shooting angle affects the accuracy of the recognition result, and the lip language recognition system can thus determine the shooting angle most favorable to recognition. The system therefore serves well as an analysis tool for problems encountered during model validation, analyzing from the data input source the various problems the prediction model may encounter.

The present invention provides a specific embodiment in which the lip language recognition system is used to obtain test results and the test results are analyzed, specifically including:

The data set in this embodiment is built on an audio-visual database comprising 10 isolated English digit pronunciations (0 to 9) spoken by 6 different target subjects (3 male and 3 female). Each subject pronounces each word up to 100 times. Video is collected from the frontal view of each subject, who sits naturally without any other movement. The original resolution of each frame is 1920x1080, at about 25 frames per second. To accurately locate the beginning and end of each pronunciation unit, the audio is used as an aid to separate each pronounced word, with each word lasting about 1 second. Each isolated word video is then further reduced to a fixed length of 10 frames, and after each video frame is processed, a 224x224-pixel image is obtained as the standard input to the CNN model.

The purpose of the lip language recognition system is to recognize the lip language content of a video. Its main functions include the functional layout of the human-computer interaction interface, displaying the Top-3 accuracy of the prediction result, and visualizing the algorithm inference process. During prediction and recognition, the Top-3 accuracy can be obtained and it can be observed which pronunciation videos in the inference results are similar and easily confused. During visualization, the feature changes and learning behavior of the CNN and RNN at each stage can be observed intuitively. The system thus allows the intermediate stages of the model to be observed more clearly during inference and prediction, facilitating subsequent in-depth research and optimization of the algorithm model.

As shown in Fig. 7, the human-computer interaction interface appears once the lip language recognition system is running. The interface has three function buttons: select video, visualize, and recognize video. To the right of the function buttons is the final result of the system's recognition of the video, and immediately to its right is the Top-3 accuracy and result box, which displays the prediction result in an LCD-style display. The lower part of the interface displays the intermediate stages of the algorithm inference, including the extracted fixed-length video frames, the located and segmented mouth position, the CNN visualization, and the dynamic visualization of the RNN at each time step.

As shown in Fig. 8, after entering the human-computer interaction interface, clicking the "Select video" button opens a folder-selection dialog for choosing the folder containing the video to be recognized; the default is the test folder under the current project, or the current folder if that does not exist. After the video to be recognized is selected and confirmed, the dialog closes automatically, the fixed-length frames are extracted, and the back-end window shows the background progress.

Taking the video of the digit pronunciation "Six" as an example: after the video of "Six" has been selected and loaded, the prediction and inference process can be visualized; clicking "Recognize video" starts the recognition and prediction. As shown in Fig. 9, the recognition result of the video is the digit pronunciation "Six"; in the Top-3 results, the probability of "Six" is 95.96%, followed by "Zero" at 2.13% and "Five" at 0.63%. The Top-3 display confirms that the recognition result is very accurate and remains stable over repeated recognitions. The reason the Top-3 probabilities do not sum to 1 is that the classification is over mutually exclusive classes, so every class, not just the top three, receives a probability value indicating how likely it is.

As shown in Fig. 10, triggering the "Visualize" option of the human-computer interaction interface displays the results after about 1 second. Through the visualization process, the outputs of the intermediate convolutional feature-extraction layers can be observed, as well as the changes in the image features and temporal features, so that the causes of poor recognition within the fused neural network can be analyzed.

The video frames, the lip images of each convolutional layer, the high-dimensional image features, and/or the visualized sequence features displayed on the human-computer interaction interface form a continuous, time-ordered process. For example, as shown in Fig. 10, the first row shows the fixed video frames extracted from the video to be recognized; the second row shows the lip features segmented from the video frames in the first row; the third, fourth, and fifth rows show the intermediate stages of feature extraction on the lip features, with the third row showing the lip images of convolutional layer 1, the fourth row the lip images of convolutional layer 2, and the fifth row the high-dimensional lip image features; and the sixth row shows the sequence features produced during temporal feature extraction. Through this visual display, the lip language recognition process inside the model can be observed; by displaying the recognition processes of different models, it can be determined which model has higher recognition accuracy and efficiency, and by displaying the recognition results of the same model under different parameters, model parameters can be determined more quickly.
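
One possible way to obtain such intermediate convolutional feature slices for display is to register forward hooks on selected layers, as in the following sketch; the layers chosen here (indices 0 and 5 of the VGG16 feature stack) and keeping a single channel per layer are assumptions for illustration:

```python
# Hypothetical sketch of collecting intermediate convolutional feature maps for
# display using PyTorch forward hooks; the layers chosen ("0" and "5" of the
# VGG16 feature stack) and keeping one channel per layer are assumptions.
import torch
from torchvision import models


def register_visualization_hooks(model, layer_names):
    captured = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # Keep one channel of the first sample as a 2-D slice for display.
            captured[name] = output[0, 0].detach().cpu().numpy()
        return hook

    modules = dict(model.named_modules())
    handles = [modules[n].register_forward_hook(make_hook(n)) for n in layer_names]
    return captured, handles


backbone = models.vgg16(weights=None).features
captured, handles = register_visualization_hooks(backbone, ["0", "5"])
backbone(torch.randn(1, 3, 224, 224))
print({name: fmap.shape for name, fmap in captured.items()})  # e.g. (224, 224) and (112, 112)
for h in handles:
    h.remove()
```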

From the visualization of the image features and sequence features, the effect of the activation function can be clearly seen: it reduces the feature dimensionality for subsequent computation and lessens the dependence on computing power. At the same time, for the RNN's temporal feature inference, it can be seen that the features evolve from being indistinct at the beginning to polarized at the end, indicating that the model used by the system fits well and meets the requirements for recognizing visual lip-read pronunciation. The lip language recognition system covers the display of the intermediate processes of both the CNN and RNN models and completes the lip-reading segmentation work, providing good background and support for in-depth study of deep learning theory in lip language recognition.

The specific content displayed on the human-computer interaction interface, and how it is displayed, can be set in a configuration file according to the visualization requirements. In addition, the interface supports switching between Chinese and English, giving the system better human-computer interaction and bringing convenience to more users.

After the interactive interface and system functions are completed, the first concern is whether the model converges during training. When the hyperparameter values are set poorly or the model structure is not reasonable, the model has difficulty converging and therefore fails to work. Accordingly, the loss curves of the training set and test set in different periods were recorded during training; from these curves it can be determined whether the algorithm model proposed herein can learn the characteristics of the data set, and thus its convergence can be evaluated.
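
The following sketch illustrates how the per-epoch training and validation losses used for curves like Fig. 11 could be recorded; the optimizer, loss function, and data loaders are assumptions, and the model is assumed to return raw class logits:

```python
# Illustrative sketch of recording training / validation loss per epoch, as
# used to produce curves like Fig. 11; the optimizer, loss and loaders are
# assumptions, and `model`, `train_loader`, `val_loader` are placeholders.
import torch
import torch.nn as nn


def train_and_record(model, train_loader, val_loader, epochs=70, lr=1e-4):
    criterion = nn.CrossEntropyLoss()          # assumes the model returns raw class logits
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = {"train_loss": [], "val_loss": []}
    for epoch in range(epochs):
        model.train()
        train_losses = []
        for clips, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
            train_losses.append(loss.item())
        model.eval()
        with torch.no_grad():
            val_losses = [criterion(model(c), y).item() for c, y in val_loader]
        history["train_loss"].append(sum(train_losses) / len(train_losses))
        history["val_loss"].append(sum(val_losses) / len(val_losses))
    return history
```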

To determine the trend of the loss, the experiment was recorded over 70 epochs, with the result recorded once per iteration. Fig. 11 shows the loss curves in different periods, where one epoch is one pass over the entire data set; in general, most data sets converge after about 10 epochs of training, and continuing training beyond that may lead to overfitting.

As can be seen from Fig. 11, the loss stabilizes at about 15 epochs; at this point the model has reached its optimum, and it then oscillates during subsequent training, indicating that the limit of model learning has been reached. Because the parameters are updated iteratively on the training set, the training loss is somewhat smaller than the validation loss, and both converge gradually with the number of training iterations, indicating that there is no anomaly in the data set and that the model performs well on lip language recognition. This verifies that both the data set and the model work.

After verifying that the model has converged, the attention-based CNN-LSTM model proposed in this embodiment is further tested on the test set. Fig. 12 shows the recognition accuracy curves, recorded once per iteration, with the vertical axis being the recognition accuracy (%). To measure the performance improvement, the experiment is compared against a plain CNN-LSTM network model, so that the improvement brought by introducing the attention mechanism can be isolated by controlling the variables. The overall training trend matches the loss curves and stabilizes at about 15 epochs, indicating that the parameter updates during training steadily approach the model's optimum, which is reached by that point. At the same time, the accuracy of the baseline CNN-LSTM model is considerably worse than that of the present network, indicating that the attention mechanism can effectively capture the important key frames of the video and attenuate the noise in the image sequence. It can therefore be concluded that the attention mechanism substantially improves model performance and robustness.

It can be seen from Fig. 12 that the overall performance of the attention-based method of this embodiment is better than that of the general fused neural network, and that the accuracy rises clearly as training continues. Because the method of this embodiment needs to compute the attention weight distribution, the accuracy fluctuates noticeably in the early stage of learning, indicating that the attention weights have not yet been learned and the model parameters require further training; at about 15 epochs the method of this embodiment has essentially completed training, and its overall performance is better than that of the general fused neural network.

The experimental results show that the attention-based CNN-LSTM model outperforms the plain CNN-LSTM model on the recognition of each digit pronunciation. Fig. 13 compares the results for each isolated pronunciation: the performance on "Two", "Four", and "Nine" improves markedly, indicating that noise is more likely to appear in videos of monosyllabic words and leads to poor recognition; when the video noise is attenuated, the model's performance improves greatly. For the pronunciations "Five" and "One", which involve complex lip movements, the improvement is less pronounced, which also indicates that video noise and spatio-temporal features are difficult to learn. The lip movements for "Zero" change little, and tongue movement is the key factor during its pronunciation, so this digit is difficult for the model to predict.

The lip language recognition system provided by the present invention allows the pros and cons of model performance, and every stage of the process from the original video to the final recognition, to be observed and analyzed more clearly, so that the model and algorithm can be improved and optimized. The implemented lip language recognition system therefore has important practical significance. In addition, the present invention approaches the development of lip language recognition technology from the perspective of lightweight models: by reducing the convolutional and fully connected structures, the computational cost of the model is greatly reduced, the dependence on the GPU is reduced to a certain extent, and cost and hardware requirements are lowered.

Through the above specific embodiments, test experiments on the test data set verify the effectiveness of the algorithm model. The experimental results show that the video lip language recognition system designed by the present invention is highly usable and feasible, that the proposed fused neural network model based on a CNN and an attention-based RNN is efficient and feasible, and that the algorithm proposed by the present invention achieves higher accuracy in prediction and recognition compared with other algorithm models.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, a special purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are only embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (10)

1. A lip language identification system, comprising: a human-computer interaction interface and an algorithm module; the human-computer interaction interface is connected with the algorithm module through a signal slot;
the human-computer interaction interface is used for acquiring a video to be identified;
the algorithm module is used for carrying out lip language identification on the video to be identified to obtain a lip language identification result;
and the human-computer interaction interface is also used for displaying the lip language recognition result and displaying the lip language recognition process in time sequence.
2. The system of claim 1, wherein the algorithm module comprises:
the fixed frame extraction submodule is used for extracting a video frame to be processed from a video to be identified based on a semi-random fixed frame extraction strategy;
the segmentation submodule is used for segmenting lip images from the video frames to be processed to obtain a lip data set;
and the recognition submodule is used for recognizing each lip image in the lip data set based on the designed model to obtain a lip language recognition result.
3. The system of claim 2, wherein the identification submodule comprises:
the feature extraction unit is used for performing feature extraction on the lip images to obtain image features, and for performing a slicing operation on the lip images and on the convolutional-layer image features to obtain visualized lip images and visualized high-dimensional convolutional-layer image features;
the time sequence feature extraction unit is used for extracting time sequence features from the image features to obtain sequence features, and for performing a slicing operation on the sequence features to obtain visualized sequence features;
and the classification unit is used for classifying the extracted time sequence features to obtain a lip language recognition result.
4. The system of claim 3, wherein the feature extraction unit is a Convolutional Neural Network (CNN).
5. The system of claim 3, wherein the time sequence feature extraction unit is a Recurrent Neural Network (RNN).
6. The system of claim 3, wherein the classification unit is a softmax classifier.
7. The system of claim 3, wherein the human-machine interface comprises:
a video selection option, used for acquiring the video to be identified when triggered;
a video recognition option, used for performing lip language recognition on the video to be identified when triggered, so as to obtain a recognition result;
and a visualization option, used for displaying the lip language recognition process and the lip language recognition result, when triggered, according to a configuration file set based on visualization requirements;
wherein the displayed content comprises the video frames to be recognized, the lip images segmented from the video frames, the visualized convolutional-layer lip images, the visualized high-dimensional image features, the visualized sequence features, and/or at least one recognition result corresponding to the video to be identified.
8. The system of claim 2, wherein the fixed frame extraction submodule is specifically configured for:
determining a fixed number of frames to be extracted based on a prior condition; and
dividing the video to be identified into a plurality of area blocks according to the total number of video frames;
wherein the area coverage within each area block is maximized on average.
9. The system of claim 2, wherein the video to be identified comprises:
a video to be identified that is acquired for the same target object by at least one acquisition device.
10. The system of claim 1, wherein the human-machine interface is designed and constructed through a PyQt5 framework.
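Purely as an illustrative sketch and not as part of the claims above, the semi-random fixed frame extraction strategy recited in claim 8 could be realized along the following lines; the function name, the seeding behaviour, and the choice of one frame per area block are assumptions made here for clarity, not details taken from the disclosure.

import random

def semi_random_fixed_frames(total_frames, num_fixed, seed=None):
    # Divide the video into num_fixed equal area blocks covering all frames,
    # then sample one frame at random inside each block, so the selected frames
    # cover the whole video evenly on average while retaining some randomness.
    rng = random.Random(seed)
    block = total_frames / num_fixed
    picks = []
    for i in range(num_fixed):
        start = int(i * block)
        end = max(start + 1, int((i + 1) * block))
        picks.append(rng.randrange(start, end))
    return picks

# Example: semi_random_fixed_frames(75, 10) returns 10 frame indices, one per block.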
CN202010556817.2A 2020-06-17 2020-06-17 A lip language recognition system Pending CN111898420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556817.2A CN111898420A (en) 2020-06-17 2020-06-17 A lip language recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556817.2A CN111898420A (en) 2020-06-17 2020-06-17 A lip language recognition system

Publications (1)

Publication Number Publication Date
CN111898420A true CN111898420A (en) 2020-11-06

Family

ID=73206793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556817.2A Pending CN111898420A (en) 2020-06-17 2020-06-17 A lip language recognition system

Country Status (1)

Country Link
CN (1) CN111898420A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004549A (en) * 2010-11-22 2011-04-06 北京理工大学 Automatic lip language identification system suitable for Chinese language
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
US20190318754A1 (en) * 2018-04-16 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
CN110427809A (en) * 2019-06-21 2019-11-08 平安科技(深圳)有限公司 Lip reading recognition methods, device, electronic equipment and medium based on deep learning
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004549A (en) * 2010-11-22 2011-04-06 北京理工大学 Automatic lip language identification system suitable for Chinese language
US20190318754A1 (en) * 2018-04-16 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN110427809A (en) * 2019-06-21 2019-11-08 平安科技(深圳)有限公司 Lip reading recognition methods, device, electronic equipment and medium based on deep learning
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUANYAO LU et al., "Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory", Applied Sciences *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633208A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Lip language identification method, service equipment and storage medium
CN112861791A (en) * 2021-03-11 2021-05-28 河北工业大学 Lip language identification method combining graph neural network and multi-feature fusion
CN115050092A (en) * 2022-05-20 2022-09-13 宁波明家智能科技有限公司 Lip reading algorithm and system for intelligent driving

Similar Documents

Publication Publication Date Title
Asadi-Aghbolaghi et al. A survey on deep learning based approaches for action and gesture recognition in image sequences
CN103268495B (en) Human body behavior modeling recognition methods based on priori knowledge cluster in computer system
CN112307995A (en) A semi-supervised person re-identification method based on feature decoupling learning
US20210264300A1 (en) Systems and methods for labeling data
CN109033953A (en) Training method, equipment and the storage medium of multi-task learning depth network
Zhu et al. Efficient action detection in untrimmed videos via multi-task learning
CN109086660A (en) Training method, equipment and the storage medium of multi-task learning depth network
CN108171133A (en) A kind of dynamic gesture identification method of feature based covariance matrix
CN111898420A (en) A lip language recognition system
CN109101869A (en) Test method, equipment and the storage medium of multi-task learning depth network
Koumparoulis et al. Exploring ROI size in deep learning based lipreading.
Zhang et al. Multi-modal emotion recognition based on deep learning in speech, video and text
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
Verma A two stream convolutional neural network with bi-directional GRU model to classify dynamic hand gesture
Bhanja et al. Deep learning-based integrated stacked model for the stock market prediction
Chen et al. Temporal hierarchical dictionary guided decoding for online gesture segmentation and recognition
Kalash et al. Relative saliency and ranking: Models, metrics, data and benchmarks
CN119598340A (en) Drawing projection talent selection testing system and method based on combination of layered quantum evolution and particle swarm optimization
Geng Transforming Scene Text Detection and Recognition: A Multi-Scale End-to-End Approach With Transformer Framework
CN117218156B (en) Single-target tracking method based on fusion feature decoding structure
Zhang et al. Improvement of dynamic hand gesture recognition based on HMM algorithm
Zhao et al. Iot-based approach to multimodal music emotion recognition
Peng et al. Online gesture spotting from visual hull data
Butt et al. Leveraging Transfer Learning for Spatio-Temporal Human Activity Recognition from Video Sequences.
Coppola et al. Applying a 3d qualitative trajectory calculus to human action recognition using depth cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201106