
CN112287893B - A recognition method of sow lactation behavior based on audio and video information fusion - Google Patents

A recognition method of sow lactation behavior based on audio and video information fusion

Info

Publication number
CN112287893B
CN112287893B (application CN202011336361.5A)
Authority
CN
China
Prior art keywords
audio
behavior
network
auditory
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011336361.5A
Other languages
Chinese (zh)
Other versions
CN112287893A (en)
Inventor
杨阿庆
薛月菊
赵慧民
林智勇
刘晓勇
陈荣军
黄华盛
张磊
韩娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202011336361.5A priority Critical patent/CN112287893B/en
Publication of CN112287893A publication Critical patent/CN112287893A/en
Application granted granted Critical
Publication of CN112287893B publication Critical patent/CN112287893B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P60/00Technologies relating to agriculture, livestock or agroalimentary industries
    • Y02P60/80Food processing, e.g. use of renewable energies or variable speed drives in handling, conveying or stacking
    • Y02P60/87Re-use of by-products of food processing for fodder production

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for recognizing sow lactation behavior based on audio-video information fusion, comprising the following steps: acquiring top-view video and audio data of lactating sows; separating the audio track and video frames from the raw recording; denoising and framing the audio to obtain a sequence of audio waveform images; and extracting the corresponding optical-flow image sequence from the raw video frames. The optical-flow sequence and video frames are input into a preset appearance-motion two-stream network to extract visual features, and the waveform image sequence is input into a preset auditory feature extraction network to extract auditory features. The visual and auditory features are concatenated and fed into a preset long short-term memory network to extract temporal audio-visual features, which are finally passed through a fully connected layer and a softmax classifier for behavior classification, outputting the lactation behavior class. By exploiting both the visual and auditory information that accompanies lactation, the invention identifies sow lactation behavior in a pig-farm environment, making it possible to detect abnormal lactation in time, take effective measures, and improve the economic benefit of the farm.

Description

A method for recognizing sow lactation behavior based on audio-video information fusion

Technical Field

The invention relates to the technical fields of smart animal husbandry, multimodal information fusion, and interactive behavior recognition, and in particular to a method for recognizing sow lactation behavior based on audio-video information fusion.

Background Art

The lactation behavior of sows is not only an important indicator of a sow's health and maternal ability, but also a major factor affecting the survival and growth of suckling piglets. Observing the state of sow lactation behavior and detecting abnormalities in time helps farm managers make effective intervention decisions promptly, improving the health of sows and piglets and thereby the economic benefit of the farm.

At present, sow lactation behavior is monitored mainly by manual observation, on site or via surveillance video, and by vision-based behavior monitoring methods. Manual observation requires a person to watch sow behavior for long periods, on site or through video, recording the time, duration, and frequency of lactation events and judging abnormality from experience; this is time-consuming, labor-intensive, highly subjective, and cannot scale with modern intensive farming. Sow lactation is an interactive behavior between the sow and her litter, and vision-based recognition of it has rarely been reported. Existing approaches rely mainly on visual cues shown during lactation, such as the sow lying on her side exposing the udder and the piglets lying at the udder performing suckling movements, and obtain lactation information automatically through algorithmic analysis. However, changing illumination in the pig house, crowding of the litter, and occlusion can make part of the visual information unobtainable, degrading the judgment and analysis of lactation behavior.

In summary, existing monitoring methods are unsuitable for, or have difficulty with, recognizing sow lactation behavior, so an automatic recognition method is urgently needed to monitor it accurately. Since lactation is accompanied by regular nursing sounds, the present invention provides a method for recognizing sow lactation behavior based on audio-video information fusion, in which the video and audio signals assist each other to achieve accurate monitoring of sow lactation behavior.

Summary of the Invention

The purpose of the present invention is to overcome the shortcoming of the prior art that sow lactation behavior is difficult to identify accurately, by proposing a recognition method based on audio-video information fusion that provides data support for timely and effective management decisions, improving pig health and farm economics.

To achieve the above object, the technical solution provided by the present invention is a method for recognizing sow lactation behavior based on audio-video information fusion, comprising the following steps:

1) Collect audio and video data of lactating sows.

2) Data preprocessing: first separate the audio and video data; then denoise the audio, divide it into frames, and obtain a sequence of audio waveform images; finally, extract optical flow from the video data to obtain an optical-flow image sequence.

3) Input the video frames and the optical-flow image sequence into a preset appearance-motion two-stream network for feature extraction to obtain visual features, and input the audio waveform image sequence into a preset auditory feature extraction network to obtain auditory features.

4) Input the visual and auditory features into a long short-term memory network for further feature fusion and extraction, obtaining temporal audio-visual features.

5) Feed the temporal audio-visual features into a fully connected layer and a softmax classifier for behavior classification, realizing automatic recognition of sow lactation behavior.

In step 1), a camera with a recording function is installed directly above the pig pen to collect top-view video and audio data of the lactating sow.

Step 2) comprises the following sub-steps:

2.1) Separate the audio and video data from the recorded material.

2.2) Process the raw audio signal with a band-pass filter to obtain the corresponding denoised audio signal.

2.3) Divide the denoised audio signal into frames of 30 ms with a 10 ms overlap between frames, and convert the audio signal into a sequence of audio waveform images.

2.4) Use an optical-flow method to obtain the optical-flow image sequence of the lactating sow to be monitored from its raw image sequence.

Step 3) comprises the following two branches:

a. Input the video frames and the optical-flow image sequence into the preset appearance-motion two-stream network; through its convolutional, down-sampling, and fully connected layers, the corresponding appearance-motion features are extracted from the video and a one-dimensional visual feature vector is output. Before the video frames and optical-flow sequence are input, the preset appearance-motion two-stream network must first be trained, as follows:

Obtain raw video frames and optical-flow image sequences labeled with lactation behavior; input them into the appearance-motion two-stream network for training to obtain its optimal network parameters.

b. Input the audio waveform image sequence into the preset auditory feature extraction network; through its convolutional, down-sampling, and fully connected layers, a one-dimensional auditory feature vector is output. Before the waveform image sequence is input, the preset auditory feature extraction network must first be trained, as follows:

Obtain raw audio data labeled with lactation behavior; denoise the raw audio signal with a band-pass filter; divide the denoised signal into 30 ms frames with 10 ms overlap; convert the framed signal into the waveform image sequence corresponding to the raw audio; input the labeled waveform image sequence into the preset auditory feature extraction network for training to obtain its optimal network parameters.

Step 4) comprises the following sub-steps:

4.1) Stack and concatenate the one-dimensional visual feature vector and the one-dimensional auditory feature vector to obtain audio-visual features.

4.2) Feed the audio-visual features into the preset long short-term memory network for feature extraction, outputting the temporal audio-visual features.

Before the audio-visual features are fed into the preset long short-term memory network, the network must first be trained, as follows:

Obtain raw video and audio sequence samples labeled with behavior; obtain the corresponding optical-flow image sequence samples from the labeled video sequences, and obtain the denoised audio waveform image sequence samples from the labeled audio sequences.

Input the labeled raw video frames and optical-flow sequences into the preset appearance-motion two-stream network to extract visual features, and input the waveform image sequence samples into the preset auditory feature extraction network to extract auditory features.

Stack and concatenate the auditory and visual features, and input them into the preset long short-term memory network for training to obtain the optimal network parameters.

Step 5) comprises the following sub-steps:

5.1) Input the temporal audio-visual features into a fully connected layer for further feature extraction and integration, obtaining two feature values corresponding to lactation behavior and non-lactation behavior respectively.

5.2) Input the two feature values into the softmax classifier to compute the probabilities of lactation and non-lactation behavior, and take the behavior class with the larger probability as the recognition result, thereby realizing recognition of sow lactation behavior.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The invention provides a method for automatically recognizing sow lactation behavior, from which lactation information can be obtained and lactation ability further analyzed.

2. The invention fuses video and audio information, which assist each other, so the extracted features are rich and accurate.

3. The invention uses video frame sequences, optical-flow sequences, and audio sequences, analyzing spatial distribution, temporal motion, and sound respectively, so the recognition accuracy is high.

4. The invention uses an appearance-motion two-stream network to fuse optical-flow and video convolutional features, strengthening the expressive power of the visual features.

5. The invention uses a long short-term memory network to fuse visual and auditory features, strengthening the temporal expression of behavioral features.

6. In potential applications, the occurrence time, duration, and frequency of lactation behavior within a fixed period can be counted to study the pattern of sow lactation. This lactation information can be used to predict the health and welfare of sows and piglets, provide a data reference for judging maternal ability, and further support decisions on breeding-stock selection and feeding management, improving farm economics.

7. For behaviors accompanied by characteristic sounds, fusing multimodal video and audio data is broadly applicable and can be used to study the behavior of other animals.

Brief Description of the Drawings

Fig. 1 is an overall flow diagram of the method of the present invention.

Fig. 2 is an architecture diagram of the method of the present invention.

Fig. 3 is a schematic diagram of the structure of the appearance-motion two-stream network.

Detailed Description of the Embodiments

The present invention is further described below in conjunction with specific embodiments.

As shown in Figs. 1 and 2, the method for recognizing sow lactation behavior based on audio-video information fusion provided by the present invention recognizes the lactation behavior of commercially housed sows from multimodal audio-video data, providing a reference for real-time monitoring and analysis of sow lactation behavior and health. It comprises the following steps:

1) Collect audio and video data of lactating sows.

Specifically, a camera with a recording function is installed directly above a commercial farrowing pen to continuously record video and audio of the daily behavior of lactating sows. In this embodiment, a Hikvision DS-2CD3T46FWD V2-I3 high-definition audio camera is used to acquire daily-behavior video and audio data that include sow lactation behavior.

2) Data preprocessing: first separate the audio and video data; then denoise the audio, divide it into frames, and obtain the waveform image sequence; finally, extract optical flow from the video data to obtain the optical-flow image sequence. This comprises the following sub-steps:

2.1) Use an audio-video editing and conversion tool to separate the acquired recording into its audio track and video frames.
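The embodiment does not name a specific tool; as one concrete possibility, the separation can be scripted with ffmpeg. A minimal sketch, in which the file names and the 16 kHz mono sample rate are illustrative assumptions:

```python
import subprocess

def demux(av_path: str, wav_out: str, mp4_out: str) -> None:
    """Split a recorded file into an audio track (WAV) and a video track.

    ffmpeg flags: -vn drops video, -an drops audio, -c:v copy keeps the
    video stream unchanged.
    """
    subprocess.run(["ffmpeg", "-y", "-i", av_path, "-vn", "-ac", "1",
                    "-ar", "16000", "-acodec", "pcm_s16le", wav_out], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", av_path, "-an",
                    "-c:v", "copy", mp4_out], check=True)
```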

2.2) Since the actual pig house is noisy with piglet squeals, calls from pigs in other pens, and machinery, the invention first denoises the acquired raw audio to reduce the interference of these noises with the detection of nursing sounds. Before nursing, a sow emits rhythmic grunts that call the piglets to the udder, and changes in the grunting rate guide the piglets to switch between massaging and sucking. Based on this property, this embodiment filters the raw audio with a band-pass filter suited to sow nursing sounds, reducing the influence of other noise on the nursing sounds.
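A minimal sketch of this denoising step with SciPy. The patent does not specify the pass band, so the cut-off frequencies below are illustrative assumptions chosen to bracket low-frequency nursing grunts:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def bandpass_denoise(wav_path, low_hz=100.0, high_hz=2000.0, order=4):
    """Zero-phase Butterworth band-pass filtering of the raw audio track."""
    rate, audio = wavfile.read(wav_path)
    audio = audio.astype(np.float32)
    if audio.ndim > 1:                         # fold stereo to mono
        audio = audio.mean(axis=1)
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=rate, output="sos")
    return rate, sosfiltfilt(sos, audio)
```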

2.3) To suit the two-dimensional image input of the convolutional networks in this embodiment, the denoised time-domain audio signal is divided into frames of 30 ms with a 10 ms overlap between frames, and each frame is converted into an audio waveform image, yielding a waveform image sequence.
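The framing arithmetic: with 30 ms frames and 10 ms overlap, the hop between frame starts is 20 ms. A sketch of the framing and rendering, assuming Matplotlib is used and the images are an arbitrary 224 by 224 pixels:

```python
import matplotlib
matplotlib.use("Agg")                           # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

def frames_to_waveform_images(audio, rate, frame_ms=30, hop_ms=20, px=224):
    """30 ms frames with 10 ms overlap (hop = 30 - 10 = 20 ms), each rendered
    as a 2-D waveform image for the convolutional auditory network."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    images = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        fig, ax = plt.subplots(figsize=(px / 100, px / 100), dpi=100)
        ax.plot(audio[start:start + frame_len], linewidth=0.5, color="black")
        ax.axis("off")
        fig.canvas.draw()
        img = np.asarray(fig.canvas.buffer_rgba())[..., :3]   # drop alpha
        plt.close(fig)
        images.append(img)
    return np.stack(images)                     # (n_frames, px, px, 3)
```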

2.4) Use an optical-flow method to obtain the optical-flow image sequence of the sow to be monitored from its raw image sequence; each pixel of an optical-flow image records the motion intensity and direction at that pixel.
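The embodiment does not name a specific optical-flow algorithm; the sketch below uses OpenCV's dense Farneback flow as one stand-in, encoding direction as hue and magnitude as brightness:

```python
import cv2
import numpy as np

def optical_flow_sequence(video_path):
    """Dense optical flow between consecutive frames; each pixel stores
    motion intensity and direction, visualized as an HSV-coded image."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flows = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros_like(frame)
        hsv[..., 0] = ang * 180 / np.pi / 2     # direction -> hue
        hsv[..., 1] = 255
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # intensity
        flows.append(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
        prev_gray = gray
    cap.release()
    return flows
```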

3) Input the video frames and the optical-flow image sequence into the preset appearance-motion two-stream network for feature extraction to obtain visual features, and input the waveform image sequence into the preset auditory feature extraction network to obtain auditory features. The details are as follows:

a. Input the video frames and the optical-flow image sequence into the preset appearance-motion two-stream network; through its convolutional, down-sampling, and fully connected layers, the corresponding appearance-motion features are extracted from the video and a one-dimensional visual feature vector is output.

In this embodiment, the preset appearance-motion two-stream network is structured as shown in Fig. 3. It consists of an appearance stream and a motion stream, which take video frames and optical-flow sequences as input respectively; after five convolutional layers and five down-sampling layers, each stream outputs a feature map of the same dimensions. The two feature maps are fused by concatenation and sent into two convolutional layers for further feature extraction and fusion; the fused feature map is then fed into two consecutive fully connected layers, which output a one-dimensional visual feature representing appearance and motion.
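A PyTorch sketch of the Fig. 3 topology, assuming 224 by 224 inputs. The patent fixes only the layer counts, so the channel widths and the 512-dimensional output are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_stream(in_ch):
    """Five conv + five down-sampling (max-pool) stages, as in Fig. 3."""
    layers, ch = [], in_ch
    for out_ch in (64, 128, 256, 512, 512):
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True),
                   nn.MaxPool2d(2)]
        ch = out_ch
    return nn.Sequential(*layers)

class AppearanceMotionNet(nn.Module):
    """Two streams are concatenated, fused by two conv layers, then two FC
    layers output a one-dimensional visual feature vector."""
    def __init__(self, feat_dim=512, img_size=224):
        super().__init__()
        self.appearance = conv_stream(3)        # RGB video frame
        self.motion = conv_stream(3)            # optical-flow image
        self.fuse = nn.Sequential(
            nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True))
        s = img_size // 32                      # five 2x poolings
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * s * s, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim))

    def forward(self, frame, flow):
        fused = torch.cat([self.appearance(frame), self.motion(flow)], dim=1)
        return self.fc(self.fuse(fused))        # (batch, feat_dim)
```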

Before video frames and optical-flow sequences are input into the appearance-motion two-stream network, the network must be trained. Since the network itself only performs feature extraction, one additional fully connected layer and a softmax layer are appended for behavior classification during training in order to obtain the optimal network parameters. During training, the number of output behavior classes is set to 2, representing lactation and non-lactation behavior; raw video frames labeled with lactation behavior and the corresponding optical-flow image sequences are input into the two-stream network for forward-propagation and back-propagation training, yielding the optimal network parameters. Raw video frames and optical-flow sequences then pass through the convolutional, down-sampling, and fully connected layers of the trained network, which output the one-dimensional visual features.
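A sketch of this pre-training stage: a temporary two-class head is attached, trained with cross-entropy loss (which applies the softmax internally), and discarded afterwards. The optimizer, learning rate, and epoch count are assumptions; the auditory network described below can be pre-trained the same way:

```python
import torch
import torch.nn as nn

def pretrain_two_stream(model, loader, epochs=10, lr=1e-4, device="cuda"):
    """Forward/backward training of AppearanceMotionNet with a temporary
    nursing / non-nursing classification head (2 classes)."""
    head = nn.Linear(512, 2).to(device)         # appended FC layer
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()),
                           lr=lr)
    loss_fn = nn.CrossEntropyLoss()             # log-softmax + NLL
    model.to(device).train()
    for _ in range(epochs):
        for frame, flow, label in loader:       # labeled frame/flow pairs
            logits = head(model(frame.to(device), flow.to(device)))
            loss = loss_fn(logits, label.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model                                # the head is discarded
```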

b. Input the waveform image sequence into the preset auditory feature extraction network; through its convolutional, down-sampling, and fully connected layers, a one-dimensional auditory feature vector is output.

In this embodiment, the preset auditory feature extraction network consists of five convolutional layers, five down-sampling layers, and two fully connected layers.
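Continuing the PyTorch sketch, the auditory branch mirrors a single stream of the visual network (reusing the conv_stream helper defined above); the widths remain assumptions:

```python
import torch.nn as nn

class AuditoryNet(nn.Module):
    """Auditory feature extractor: five conv + five down-sampling stages
    followed by two FC layers over the waveform images."""
    def __init__(self, feat_dim=512, img_size=224):
        super().__init__()
        self.conv = conv_stream(3)              # waveform image input
        s = img_size // 32
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * s * s, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim))

    def forward(self, wave_img):
        return self.fc(self.conv(wave_img))     # (batch, feat_dim)
```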

The auditory feature extraction network must be trained before the waveform image sequence is input. Since the network only performs feature extraction, one additional fully connected layer and a softmax layer are appended for behavior classification before training, with the number of output behavior classes set to 2, representing lactation and non-lactation behavior.

Raw audio data labeled with lactation behavior are obtained; the raw audio signal is denoised with a band-pass filter; the denoised signal is divided into 30 ms frames with 10 ms overlap and converted into the waveform image sequence corresponding to the raw audio. The labeled waveform image sequence is input into the preset auditory feature extraction network for forward-propagation and back-propagation training to obtain its optimal network parameters. The waveform image sequence then passes through the convolutional, down-sampling, and fully connected layers of the trained network, which output the one-dimensional auditory features.

4) Input the visual and auditory features into the long short-term memory network for further feature fusion and extraction, obtaining temporal audio-visual features. This comprises the following sub-steps:

4.1) Stack and concatenate the one-dimensional visual and auditory features to obtain the audio-visual features.

4.2) Feed the audio-visual features into the preset long short-term memory network for feature extraction, outputting one-dimensional temporal features; the preset long short-term memory network is configured and trained in advance. A long short-term memory network (LSTM) is a recurrent neural network suited to processing and predicting time-series events. An LSTM consists of a cell and three gates, the input gate, forget gate, and output gate, which protect and control the cell state. In this embodiment the LSTM is designed for temporal feature extraction; its configuration can be chosen according to actual needs and is not specifically limited here.
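A sketch of the fusion step, assuming both extractors emit 512-dimensional vectors per time step and an LSTM hidden size of 256 (the patent leaves the LSTM configuration open):

```python
import torch
import torch.nn as nn

class AudioVisualLSTM(nn.Module):
    """Concatenate per-step visual and auditory vectors and run an LSTM to
    obtain the temporal audio-visual feature."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden,
                            batch_first=True)

    def forward(self, visual_seq, audio_seq):
        # visual_seq, audio_seq: (batch, time, feat_dim)
        fused = torch.cat([visual_seq, audio_seq], dim=-1)
        out, _ = self.lstm(fused)
        return out[:, -1]                       # feature of the last time step
```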

When the LSTM is trained, it can be combined with the subsequent fully connected layer and softmax classifier to form a classification network for training its parameters. The specific training steps are:

Obtain raw video and audio sequence samples labeled with behavior; obtain the labeled optical-flow image sequence samples from the labeled video sequences, and the labeled denoised waveform image sequence samples from the labeled audio sequences. Input the labeled video and optical-flow sequences into the preset appearance-motion two-stream network to extract visual features, and the waveform image sequence samples into the preset auditory feature extraction network to extract auditory features. Stack and concatenate the auditory and visual features, and input them into the preset long short-term memory network for forward-propagation and back-propagation training to obtain the optimal network parameters.

5) Feed the temporal audio-visual features into the fully connected layer and softmax classifier for behavior classification, realizing automatic recognition of sow lactation behavior. This comprises the following sub-steps:

5.1) Input the temporal audio-visual features into a fully connected layer for further feature extraction and integration; the layer has two neurons and outputs two feature values, corresponding to lactation behavior and non-lactation behavior respectively.

5.2) Input the two feature values into the softmax classifier to compute the probabilities of lactation and non-lactation behavior, and take the behavior class with the larger probability as the recognition result, thereby realizing recognition of sow lactation behavior.
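A sketch of this final head, assuming the 256-dimensional temporal feature from the LSTM sketch above; the class coding (index 1 = nursing) is an assumption:

```python
import torch
import torch.nn as nn

class LactationHead(nn.Module):
    """Two-neuron FC layer followed by softmax; the class with the larger
    probability is taken as the recognition result."""
    def __init__(self, hidden=256):
        super().__init__()
        self.fc = nn.Linear(hidden, 2)

    def forward(self, temporal_feat):
        probs = torch.softmax(self.fc(temporal_feat), dim=-1)
        return probs, probs.argmax(dim=-1)      # probabilities, predicted class
```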

In summary, the sow lactation behavior recognition method disclosed by the present invention comprises: collecting top-view video and audio data of lactating sows; extracting the audio track and video frames from the raw recording; denoising and framing the audio to obtain the waveform image sequence; extracting the corresponding optical-flow sequence from the raw video frames; inputting the optical-flow sequence and video frames into the preset appearance-motion two-stream network to extract visual features, and the waveform image sequence into the preset auditory feature extraction network to extract auditory features; concatenating and fusing the visual and auditory features and inputting them into the preset long short-term memory network to extract temporal audio-visual features; and finally feeding them into the fully connected layer and softmax for behavior classification, outputting the lactation behavior class. Based on the fusion of multimodal video and audio information, this work provides farm managers with reliable lactation information, guides timely and effective management decisions, improves pig health and welfare, and also provides data support for investigating the patterns of sow maternal behavior; it is worthy of promotion.

The embodiments described above are only preferred embodiments of the present invention and do not limit its scope of implementation; any change made according to the shape and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A method for recognizing sow lactation behavior based on audio-video information fusion, characterized by comprising the following steps:
1) collecting audio and video data of lactating sows;
2) data preprocessing: first separating the audio and video data, then denoising the audio, dividing it into frames and obtaining a sequence of audio waveform images, and finally extracting optical flow from the video data to obtain an optical-flow image sequence;
3) inputting the video frames and the optical-flow image sequence into a preset appearance-motion two-stream network for feature extraction to obtain visual features, and inputting the audio waveform image sequence into a preset auditory feature extraction network to obtain auditory features; wherein the appearance-motion two-stream network consists of an appearance stream and a motion stream that take video frames and optical-flow sequences as input respectively and, after five convolutional layers and five down-sampling layers, output feature maps of the same dimensions; the two feature maps are fused by concatenation and sent into two convolutional layers for feature extraction and fusion, and the fused feature map is fed into two consecutive fully connected layers, which output a one-dimensional visual feature representing appearance and motion;
4) inputting the visual features and the auditory features into a long short-term memory network for further feature fusion and extraction to obtain temporal audio-visual features;
5) feeding the temporal audio-visual features into a fully connected layer and a softmax classifier for behavior classification, realizing automatic recognition of sow lactation behavior.

2. The method for recognizing sow lactation behavior based on audio-video information fusion according to claim 1, characterized in that in step 1), a camera with a recording function is installed directly above the pig pen to collect top-view video and audio data of the lactating sow.

3. The method for recognizing sow lactation behavior based on audio-video information fusion according to claim 1, characterized in that step 2) comprises the following steps:
2.1) separating the audio and video data from the recorded material;
2.2) processing the raw audio signal with a band-pass filter to obtain the corresponding denoised audio signal;
2.3) dividing the denoised audio signal into frames of 30 ms with a 10 ms overlap between frames, and converting the audio signal into a sequence of audio waveform images;
2.4) using an optical-flow method to obtain the optical-flow image sequence of the lactating sow to be monitored from its raw image sequence.

4. The method for recognizing sow lactation behavior based on audio-video information fusion according to claim 1, characterized in that step 3) comprises the following two branches:
a. inputting the video frames and the optical-flow image sequence into the preset appearance-motion two-stream network, which extracts the corresponding appearance-motion features through its convolutional, down-sampling, and fully connected layers and outputs a one-dimensional visual feature vector; wherein, before the video frames and optical-flow image sequence are input, the preset appearance-motion two-stream network is trained as follows:
obtaining raw video frames and optical-flow image sequences labeled with lactation behavior, and inputting them into the appearance-motion two-stream network for training to obtain its optimal network parameters;
b. inputting the audio waveform image sequence into the preset auditory feature extraction network, which outputs a one-dimensional auditory feature vector through its convolutional, down-sampling, and fully connected layers; wherein, before the waveform image sequence is input, the preset auditory feature extraction network is trained as follows:
obtaining raw audio data labeled with lactation behavior; denoising the raw audio signal with a band-pass filter; dividing the denoised signal into 30 ms frames with 10 ms overlap; converting the framed signal into the waveform image sequence corresponding to the raw audio; and inputting the labeled waveform image sequence into the preset auditory feature extraction network for training to obtain its optimal network parameters;
and that step 4) comprises the following steps:
4.1) stacking and concatenating the one-dimensional visual feature vector and the one-dimensional auditory feature vector to obtain audio-visual features;
4.2) feeding the audio-visual features into the preset long short-term memory network for feature extraction and outputting the temporal audio-visual features; wherein, before the audio-visual features are fed in, the preset network is trained as follows:
obtaining raw video and audio sequence samples labeled with behavior; obtaining the corresponding optical-flow image sequence samples from the labeled video sequences and the denoised audio waveform image sequence samples from the labeled audio sequences; inputting the labeled raw video frames and optical-flow sequences into the preset appearance-motion two-stream network to extract visual features and the waveform image sequence samples into the preset auditory feature extraction network to extract auditory features; and stacking and concatenating the auditory and visual features and inputting them into the preset long short-term memory network for training to obtain the optimal network parameters.

5. The method for recognizing sow lactation behavior based on audio-video information fusion according to claim 1, characterized in that step 5) comprises the following steps:
5.1) inputting the temporal audio-visual features into a fully connected layer for further feature extraction and integration, obtaining two feature values corresponding to lactation behavior and non-lactation behavior respectively;
5.2) inputting the feature values of lactation and non-lactation behavior into the softmax classifier to compute their corresponding probabilities, and taking the behavior class with the larger probability as the behavior recognition result, thereby realizing recognition of sow lactation behavior.
CN202011336361.5A 2020-11-25 2020-11-25 A recognition method of sow lactation behavior based on audio and video information fusion Active CN112287893B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011336361.5A CN112287893B (en) 2020-11-25 2020-11-25 A recognition method of sow lactation behavior based on audio and video information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011336361.5A CN112287893B (en) 2020-11-25 2020-11-25 A recognition method of sow lactation behavior based on audio and video information fusion

Publications (2)

Publication Number Publication Date
CN112287893A CN112287893A (en) 2021-01-29
CN112287893B true CN112287893B (en) 2023-07-18

Family

ID=74425526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011336361.5A Active CN112287893B (en) 2020-11-25 2020-11-25 A recognition method of sow lactation behavior based on audio and video information fusion

Country Status (1)

Country Link
CN (1) CN112287893B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299551A (en) * 2022-03-07 2022-04-08 深圳市海清视讯科技有限公司 Model training method, animal behavior identification method, device and equipment
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN108805087A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Semantic temporal fusion association based on multi-modal Emotion identification system judges subsystem
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN109492535A (en) * 2018-10-12 2019-03-19 华南农业大学 A kind of sow Breast feeding behaviour recognition methods of computer vision
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN110598658A (en) * 2019-09-18 2019-12-20 华南农业大学 Convolutional network identification method for sow lactation behaviors
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 A method for human action recognition in video based on two-stream convolutional network
CN111428789A (en) * 2020-03-25 2020-07-17 广东技术师范大学 Network traffic anomaly detection method based on deep learning
CN111461235A (en) * 2020-03-31 2020-07-28 合肥工业大学 Audio and video data processing method, system, electronic device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8135221B2 (en) * 2009-10-07 2012-03-13 Eastman Kodak Company Video concept classification using audio-visual atoms
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
US10685236B2 (en) * 2018-07-05 2020-06-16 Adobe Inc. Multi-model techniques to generate video metadata

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN108805087A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Semantic temporal fusion association based on multi-modal Emotion identification system judges subsystem
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN109492535A (en) * 2018-10-12 2019-03-19 华南农业大学 A kind of sow Breast feeding behaviour recognition methods of computer vision
CN110598658A (en) * 2019-09-18 2019-12-20 华南农业大学 Convolutional network identification method for sow lactation behaviors
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 A method for human action recognition in video based on two-stream convolutional network
CN111428789A (en) * 2020-03-25 2020-07-17 广东技术师范大学 Network traffic anomaly detection method based on deep learning
CN111461235A (en) * 2020-03-31 2020-07-28 合肥工业大学 Audio and video data processing method, system, electronic device and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A violence video detection method based on three-dimensional convolutional networks; Song Wei, Zhang Dongliang, Qi Zhenguo, Zheng Nan; Information Network Security, No. 12; 60-66 *
Recognition of sow estrus behavior based on MFO-LSTM; Wang Kai, Liu Chunhong, Duan Qingling; Transactions of the Chinese Society of Agricultural Engineering, No. 14; 219-227 *
Speech emotion recognition and personality analysis based on deep neural networks; Hong Zhaojin, Wei Chenyang, Zhuang Yuan, Wang Ying, Wang Yiting, Zhao Li; Informatization Research, No. 1; 52-57 *
Special video classification based on multimodal feature fusion and multi-task learning; Wu Xiaoyu, Gu Chaonan, Wang Shengjin; Optics and Precision Engineering, No. 5; 186-195 *

Also Published As

Publication number Publication date
CN112287893A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
Pegoraro et al. Automated video monitoring of insect pollinators in the field
CN106847262A (en) A kind of porcine respiratory disease automatic identification alarm method
CN110598658B (en) A Convolutional Network Recognition Method for Sow Lactation Behavior
CN112287893B (en) A recognition method of sow lactation behavior based on audio and video information fusion
CN108491807B (en) Real-time monitoring method and system for oestrus of dairy cows
CN106778555B (en) Cow rumination chewing and swallowing frequency statistical method based on machine vision
CN111832398A (en) An image detection method for broken strands of distribution line poles and tower conductors based on unmanned aerial vehicle images
CN109040693A (en) Intelligent warning system and method
CN117648668A (en) Identification method based on multi-mode data fusion and application of identification method in poultry farming
CN113785783A (en) Livestock grouping system and method
CN119168425B (en) Method and system for predicting feed intake of periparturient dairy cows in pasture breeding scenarios
CN114596448A (en) A kind of meat duck health management method and management system
CN107330403A (en) A kind of yak method of counting based on video data
CN113989538A (en) Depth image-based chicken flock uniformity estimation method, device, system and medium
CN106781397A (en) A kind of cattle and sheep supervisory systems
CN112906510A (en) Fishery resource statistical method and system
CN115641466A (en) Sick cattle screening method based on video data
CN113837087B (en) Animal target detection system and method based on YOLOv3
CN111863230B (en) Infant sucking remote assessment and breast feeding guidance method
CN206270728U (en) A Grape Disease Identification System
CN109993076A (en) A classification method of rat behavior based on deep learning
CN104899787A (en) Acquiring method and system for disease diagnosis results of aquatic animals
CN116168814A (en) Health care resource scheduling method and system based on real-time monitoring
CN114677764A (en) Quantitative analysis method, system and storage medium for arthritis pain of mouse
CN114569129A (en) Livestock and poultry emotion monitoring method and livestock and poultry emotion monitoring device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant