Disclosure of Invention
In order to solve the above technical problems, the invention provides a bird group identification method and system based on ultra-high-definition video, which improve the accuracy and robustness of bird group identification through effective fusion of multi-modal information and dynamic adjustment strategies.
The invention provides a bird group identification method based on ultra-high definition video, which comprises the following steps:
synchronously acquiring a video stream and an audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on spectral features extracted from the audio stream;
extracting spatio-temporal features of the dynamically adjusted video stream, and outputting spatial coordinates and a visual confidence of the bird target;
performing time-frequency feature extraction and sound source localization on the audio stream, and outputting a sound source azimuth angle and an acoustic confidence;
performing spatial consistency matching based on the spatial coordinates and the sound source azimuth angle, and determining a valid candidate region when the spatial distance between them is smaller than a preset threshold;
and performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting a recognition result indicating that birds are present in the target area when the fusion confidence obtained by the weighted calculation exceeds a judgment threshold.
Preferably, the synchronously acquiring the video stream and the audio stream of the target area and dynamically adjusting the acquisition parameters of the video stream based on the spectral features extracted from the audio stream includes:
extracting the energy distribution ratios of a preset high frequency band and a preset low frequency band in the audio stream;
when the energy distribution ratio of the high frequency band exceeds a first threshold, increasing the resolution and the frame rate of the video stream according to a preset proportionality coefficient;
when the energy distribution ratio of the low frequency band exceeds a second threshold, reducing the resolution and the frame rate of the video stream according to a preset proportionality coefficient;
and if neither the high-band nor the low-band energy distribution ratio exceeds its corresponding threshold, keeping the current acquisition parameters unchanged.
Preferably, the extracting the spatio-temporal features of the dynamically adjusted video stream and outputting the spatial coordinates and the visual confidence of the bird target includes:
performing spatio-temporal joint modeling on at least three adjacent frames of the video stream, and extracting fused features comprising temporal motion features and spatial texture features;
generating candidate regions containing spatial coordinates and initial confidences through a pre-trained target detection network based on the fused features;
and performing non-maximum suppression on the candidate regions, and outputting the spatial coordinates and the corresponding visual confidence of the final bird target.
Preferably, the performing time-frequency feature extraction and sound source localization on the audio stream and outputting the sound source azimuth angle and the acoustic confidence includes:
performing a short-time Fourier transform on the audio stream to obtain a time-frequency spectrogram;
detecting bird song features on the time-frequency spectrogram, and marking potential sound sources;
determining the sound source azimuth angle of a potential sound source by analyzing the time difference of the sound signals received by at least two microphones;
and calculating and outputting the acoustic confidence based on the similarity between the time-frequency spectrogram and a preset bird sound template, in combination with the sound source azimuth angle.
Preferably, the performing spatial consistency matching based on the spatial coordinates and the sound source azimuth angle, and determining a valid candidate region when the spatial distance between the spatial coordinates and the sound source azimuth angle is smaller than a preset threshold, includes:
mapping the spatial coordinates to a two-dimensional plane coordinate system of the video picture, and converting the sound source azimuth angle into projection coordinates in the same two-dimensional plane coordinate system;
calculating the Euclidean distance between the spatial coordinates and the projection coordinates, and determining a spatial consistency match when the Euclidean distance is smaller than or equal to the preset threshold;
performing confidence correction on the successfully matched valid candidate regions, wherein:
when the Euclidean distance is smaller than 50% of the preset threshold, boosting the visual confidence and the acoustic confidence according to a preset confidence boost ratio;
and when the Euclidean distance is between 50% and 100% of the preset threshold, dynamically attenuating the confidences according to the proportional relation between the Euclidean distance and the preset threshold.
Preferably, the dynamic attenuation includes:
calculating the normalized ratio of the Euclidean distance to the preset threshold, and recording it as an attenuation coefficient;
performing attenuation calculations on the visual confidence and the acoustic confidence of the valid candidate region, respectively;
and eliminating the valid candidate region when both the attenuated visual confidence and the attenuated acoustic confidence fall below a preset lower confidence limit.
Preferably, the performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting the recognition result indicating that birds are present in the target area when the fusion confidence obtained by the weighted calculation exceeds the judgment threshold, includes:
calculating a dynamic weight distribution coefficient based on the ratio of the Euclidean distance to the preset threshold, wherein the weight coefficient of the visual confidence is in a linear relation with the reciprocal of the ratio;
and performing weighted fusion of the dynamic weight distribution coefficients, the corrected visual confidence, and the corrected acoustic confidence, specifically expressed as:
C_fused = (1 - d/D) · C_v + (d/D) · C_a
wherein C_fused is the fusion confidence, d is the Euclidean distance, D is the preset threshold, and C_v and C_a are the corrected visual confidence and acoustic confidence, respectively;
and when the fusion confidence is greater than or equal to the judgment threshold, determining that a bird target exists in the valid candidate region and outputting the recognition result; otherwise, determining that no bird target exists in the valid candidate region.
Preferably, the judgment threshold is dynamically adjusted according to the acquisition parameters of the video stream, which specifically includes:
determining the judgment threshold by table lookup from a pre-established mapping table between video stream acquisition parameters and judgment thresholds, wherein the mapping table contains the optimal judgment threshold for each combination of resolution and frame rate.
The invention further provides a bird group identification system based on ultra-high definition video, which is used for executing the above bird group identification method based on ultra-high definition video, and comprises:
a parameter dynamic adjustment module, used for synchronously acquiring the video stream and the audio stream of the target area, and dynamically adjusting the acquisition parameters of the video stream based on the spectral features extracted from the audio stream;
a video stream processing module, used for extracting spatio-temporal features of the dynamically adjusted video stream, and outputting the spatial coordinates and the visual confidence of the bird target;
an audio stream processing module, used for performing time-frequency feature extraction and sound source localization on the audio stream, and outputting the sound source azimuth angle and the acoustic confidence;
a region determination module, used for performing spatial consistency matching based on the spatial coordinates and the sound source azimuth angle, and determining a valid candidate region when the spatial distance between the spatial coordinates and the sound source azimuth angle is smaller than the preset threshold;
and a result output module, used for performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting the recognition result indicating that birds are present in the target area when the fusion confidence obtained by the weighted calculation exceeds the judgment threshold.
Compared with the related art, the method and the system for identifying the bird group based on the ultra-high definition video have the following beneficial effects:
According to the invention, the video stream and the audio stream of the target area are synchronously acquired, and the acquisition parameters of the video stream are dynamically adjusted based on the spectral features extracted from the audio stream, so that identification requirements under different environmental conditions are met. Meanwhile, spatial consistency matching and confidence-weighted calculation are realized by combining the spatio-temporal feature extraction of the video stream with the time-frequency feature extraction and sound source localization of the audio stream, and finally the recognition result of birds in the target area is output. The invention thereby improves the accuracy and robustness of bird group identification through effective fusion of multi-modal information and dynamic adjustment strategies, reduces computational complexity, and meets practical application requirements.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example 1
The invention provides a bird group identification method based on ultra-high definition video, which is shown by referring to fig. 1, and comprises the following steps:
and S1, synchronously acquiring video stream and audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on the frequency spectrum characteristics extracted from the audio stream.
Specifically, step S1 includes the steps of:
s11, extracting the energy distribution ratio of a preset high frequency band and a preset low frequency band in the audio stream.
In this embodiment, the continuously input audio signal is first divided into frames and windowed with a Hann window, with a frame length of 50 ms and a frame shift of 25 ms. A band-pass filter bank separates the preset high frequency band (8 kHz to 16 kHz) and the preset low frequency band (100 Hz to 500 Hz), and the energy of each band is calculated over the time dimension; the high-band energy ratio is the proportion of the high-band energy to the total energy of the audio signal, and the low-band energy ratio is calculated in the same way. All energy values are expressed in decibels, and the dynamic proportion distribution of the two bands is finally output.
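The band-energy-ratio computation described above can be sketched as follows; this is a minimal illustration that uses FFT-based band energies rather than an explicit band-pass filter bank, with the frame length, frame shift, and band edges taken from this embodiment and the function name chosen for illustration only.

```python
import numpy as np

def band_energy_ratios(signal, sr=48000, frame_ms=50, hop_ms=25,
                       low_band=(100, 500), high_band=(8000, 16000)):
    """Return per-frame (high_ratio, low_ratio) band-energy ratios."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    low_mask = (freqs >= low_band[0]) & (freqs <= low_band[1])
    high_mask = (freqs >= high_band[0]) & (freqs <= high_band[1])

    ratios = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum
        total = spectrum.sum() + 1e-12                   # avoid division by zero
        ratios.append((spectrum[high_mask].sum() / total,
                       spectrum[low_mask].sum() / total))
    return np.array(ratios)
```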
And S12, when the energy distribution ratio of the high frequency band exceeds a first threshold, increasing the resolution and the frame rate of the video stream according to a preset proportionality coefficient.
In this embodiment, when the high-band energy ratio exceeds a preset first threshold (e.g., 30%), the adjustment mechanism for the video acquisition parameters is automatically triggered. Specific adjustments include increasing the video resolution by a preset magnification factor (e.g., 1.5 times), for example from 1920 x 1080 pixels to 3840 x 2160 pixels (i.e., 4K resolution), while increasing the video frame rate from 30 frames per second to 60 frames per second. This adjustment strategy is based on the strong correlation between high-band energy and specific bird activities (such as wing flapping and calling), enhancing the capture of bird activity details by raising the video acquisition parameters. During parameter adjustment, a smooth transition algorithm is adopted to avoid abrupt changes in picture quality: resolution switching is realized by bilinear interpolation, and frame rate adjustment uses frame-skip compensation to keep video playback smooth.
And S13, when the energy distribution proportion of the low frequency band exceeds a second threshold value, reducing the resolution and the frame rate of the video stream according to a preset proportionality coefficient.
In this embodiment, when the low-band energy ratio exceeds a preset second threshold (e.g., 40%), it is determined that low-frequency interference (such as wind noise or mechanical noise) exists in the environment, and the video acquisition parameters are then reduced according to a preset reduction coefficient (e.g., 0.7 times). Specific operations include reducing the resolution from 4K to 1080p and the frame rate from 60 frames per second to 30 frames per second. This parameter reduction strategy lowers the occupation of system computing resources by reducing the video stream data volume, while motion-adaptive filtering is employed to suppress the impact of low-frequency noise on video quality. In the implementation, the resolution parameter is adjusted first, followed by the frame rate parameter, and the interval between two parameter adjustments is not less than 2 seconds, so as to prevent frequent fluctuation of the parameter settings.
And S14, if the energy distribution ratio of the high frequency band and the low frequency band does not exceed the corresponding threshold value, maintaining the current acquisition parameters unchanged.
In this embodiment, when the high-band energy ratio is not more than 30% and the low-band energy ratio is not more than 40%, the current video acquisition parameters are kept unchanged. In this state, the fluctuation of the two band energy ratios is continuously monitored: if the fluctuation amplitude is smaller than 5% for 5 consecutive seconds, the environment is judged to be stable and a low-power operation mode is entered; if the fluctuation amplitude exceeds 5%, the threshold judgment process is restarted. To prevent the system from frequently switching parameters around the thresholds, hysteresis bands are set (for example, ±3% around the high-frequency threshold and ±5% around the low-frequency threshold), and parameter adjustment is performed only when the energy ratio moves outside these hysteresis bands.
In addition, the first threshold and the second threshold are obtained through experimental calibration; extensive real-scene testing and performance optimization ensure that these thresholds achieve the best recognition effect under different environmental conditions.
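The threshold-and-hysteresis decision logic of steps S12 to S14 can be sketched as below; the class and method names are hypothetical, the 30%/40% thresholds, ±3%/±5% hysteresis bands, and the 4K/1080p target modes repeat the example values of this embodiment, and the one-sided hysteresis handling is a simplification.

```python
from dataclasses import dataclass

@dataclass
class CaptureParams:
    width: int = 1920
    height: int = 1080
    fps: int = 30

class ParamController:
    """Adjusts capture parameters from band-energy ratios with hysteresis."""
    HIGH_THR, HIGH_HYST = 0.30, 0.03   # high-band threshold and hysteresis band
    LOW_THR, LOW_HYST = 0.40, 0.05     # low-band threshold and hysteresis band

    def __init__(self):
        self.params = CaptureParams()

    def update(self, high_ratio: float, low_ratio: float) -> CaptureParams:
        if high_ratio > self.HIGH_THR + self.HIGH_HYST:
            # Strong high-band energy: raise resolution and frame rate (4K@60fps).
            self.params = CaptureParams(3840, 2160, 60)
        elif low_ratio > self.LOW_THR + self.LOW_HYST:
            # Dominant low-frequency noise: fall back to 1080p@30fps.
            self.params = CaptureParams(1920, 1080, 30)
        # Inside the hysteresis bands: keep current parameters unchanged.
        return self.params
```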
And S2, extracting space-time characteristics of the video stream after dynamic adjustment, and outputting the space coordinates and visual confidence of the bird target.
Specifically, step S2 includes the steps of:
And S21, carrying out space-time joint modeling on at least three adjacent frames of the video stream, and extracting fusion features comprising time motion features and space texture features.
In this embodiment, during spatio-temporal feature extraction of the video stream, the dynamically adjusted video sequence is first preprocessed, and three consecutive frames are selected to form a spatio-temporal analysis unit. The video sequence is jointly modeled with a three-dimensional convolutional neural network (3D CNN), in which the first-layer convolution kernel size is set to 3 x 3 x 3 (time x height x width) with a stride of 1 x 1 x 1, and a total of 64 filters are used to extract spatio-temporal features. In the time dimension, bird motion features are captured by computing the optical flow field between adjacent frames, using the Farneback dense optical flow algorithm to obtain pixel-level displacement vectors; in the spatial dimension, a modified ResNet is used to extract multi-scale texture features, including static features such as feather texture and beak shape. The temporal motion features and spatial texture features are concatenated at the feature level to form a 1280-dimensional fused feature vector.
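A minimal PyTorch sketch of the first spatio-temporal convolution stage described above (3 x 3 x 3 kernel, stride 1, 64 filters over a three-frame unit); it is an illustrative fragment only, not the full detection network, and the input size in the example is arbitrary.

```python
import torch
import torch.nn as nn

class SpatioTemporalStem(nn.Module):
    """First 3D-convolution stage over a (batch, channels, time, H, W) clip."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(in_channels=3, out_channels=64,
                              kernel_size=(3, 3, 3), stride=1, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, clip):
        # clip: (batch, 3 color channels, 3 frames, height, width)
        return self.act(self.conv(clip))

# Example: a three-frame analysis unit at an arbitrary working resolution.
clip = torch.randn(1, 3, 3, 224, 224)
features = SpatioTemporalStem()(clip)   # -> (1, 64, 3, 224, 224)
```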
And S22, generating a candidate region comprising the space coordinates and the initial confidence coefficient through a pre-trained target detection network based on the fusion characteristics.
In this embodiment, bird target detection is performed by feeding the extracted fused feature vector into a pre-trained Faster R-CNN target detection network. The network comprises two parts, a Region Proposal Network (RPN) and a detection network: the RPN generates about 2000 candidate regions, and each candidate region outputs spatial coordinates (center point coordinates plus width and height) and an initial confidence score. The detection network performs second-stage classification and regression on the candidate regions, computes the probability of belonging to a bird class with a softmax function as the initial confidence, and sets the confidence threshold to 0.7 to filter out low-quality proposal boxes. The network is trained with a joint objective of cross-entropy loss and smooth L1 loss on a dataset containing 50 common bird species until convergence.
And S23, performing non-maximum suppression on the candidate regions, and outputting the spatial coordinates and the corresponding visual confidence of the final bird target.
In this embodiment, non-maximum suppression (NMS) is applied to the candidate regions output by the detection network, with the overlap (IoU) threshold set to 0.5. The specific procedure is as follows: all candidate regions are first sorted in descending order of confidence; the highest-scoring candidate box is selected as the reference; the intersection-over-union of every other candidate box with the reference box is computed; lower-scoring boxes whose IoU exceeds 0.5 are deleted; and the process is repeated until all candidate boxes have been handled. The finally output bird targets comprise precise spatial coordinates and normalized visual confidences, where the confidences are calibrated with a sigmoid function to ensure comparability between targets of different scales. For the case where multiple birds appear simultaneously, the detection results with the top 10 confidences are retained to satisfy the real-time requirement.
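The non-maximum suppression step can be sketched as follows, using the IoU threshold of 0.5 stated above and assuming boxes in (x1, y1, x2, y2) format.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above threshold."""
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_threshold]   # drop heavily overlapping boxes
    return keep
```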
And S3, extracting time-frequency characteristics and positioning sound sources of the audio stream, and outputting azimuth angles and acoustic confidence degrees of the sound sources.
Specifically, step S3 includes the following steps:
and S31, short-time Fourier transform is carried out on the audio stream, and a time-frequency spectrogram is acquired.
In this embodiment, the collected audio signal is first preprocessed during audio stream handling, using a 16-bit PCM encoding format with a sampling rate of 48 kHz. A short-time Fourier transform (STFT) is performed on the continuous audio stream, with framing by a Hamming window of 1024 sample points (about 21.3 ms) and a hop of 512 sample points. The spectral components of each frame are computed by FFT to generate a time-frequency spectrogram with a frequency resolution of 46.9 Hz and a time resolution of 10.7 ms. The spectrogram is then converted to the Mel scale, the spectral energy is nonlinearly compressed with a bank of 40 Mel filters, and a time-frequency-energy three-dimensional feature spectrogram is finally output.
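A sketch of the STFT and Mel-spectrogram computation under the parameters given above (48 kHz sampling, 1024-sample Hamming window, 512-sample hop, 40 Mel bands), assuming the librosa library is available; the dB compression stands in for the nonlinear compression mentioned in the text.

```python
import numpy as np
import librosa

def time_frequency_features(audio, sr=48000):
    """Return a 40-band log-Mel spectrogram of a mono audio signal."""
    # Short-time Fourier transform: 1024-sample Hamming window, 512-sample hop.
    stft = librosa.stft(audio, n_fft=1024, hop_length=512, window="hamming")
    power = np.abs(stft) ** 2
    # 40-band Mel filter bank applied to the power spectrogram.
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=40)
    return librosa.power_to_db(mel)   # nonlinear (dB) compression
```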
And S32, detecting bird song features on the time-frequency spectrogram, and marking potential sound sources.
In this embodiment, bird song feature detection is performed on the time-frequency spectrogram; time-frequency regions with bird song characteristics are first located by spectral centroid analysis and spectral flatness calculation. Potential sound sources are marked with a GMM-HMM-based acoustic model, trained using a bird sound database containing 5000 sound samples of 50 common bird species (more than 100 hours in total). For each detected sound source event, the following feature parameters are extracted: the fundamental frequency contour (50-8000 Hz), the harmonic structure (at least 3 harmonics), and the temporal modulation characteristics (amplitude modulation of 10-300 Hz), and the start-stop time and frequency range are recorded. By calculating the degree of match between these features and typical bird song features, potential sound sources with a confidence greater than 0.6 are preliminarily screened out.
And S33, determining the sound source azimuth angle of the potential sound source by analyzing the time difference of sound signals received by at least two microphones.
In this embodiment, sound source localization is performed with an array of at least two microphones arranged in space. The time difference of arrival (TDOA) of the same sound signal at different microphones is computed, and time-delay estimation is carried out with the generalized cross-correlation with phase transform (GCC-PHAT) method, with a time resolution of 0.1 ms. According to the geometric configuration of the microphone array (minimum spacing 0.5 m), the sound source azimuth angle is calculated through a spherical intersection algorithm, with a resolution of 2 degrees in the horizontal direction and 5 degrees in the vertical direction; the localization accuracy can reach ±0.1 m within a range of 3 m. Kalman filtering is applied to smooth the trajectories of sound sources that remain stable for more than 10 consecutive frames.
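A minimal sketch of GCC-PHAT time-delay estimation for one microphone pair, with the azimuth recovered from the delay and the 0.5 m spacing mentioned above under a far-field assumption; function names are illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat(sig, ref, fs=48000):
    """Estimate the time delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # phase transform weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def azimuth_from_tdoa(delay, mic_distance=0.5):
    """Far-field azimuth (degrees) from the TDOA between two microphones."""
    ratio = np.clip(SPEED_OF_SOUND * delay / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))
```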
S34, calculating and outputting acoustic confidence based on the similarity between the time-frequency spectrogram and a preset bird sound template and combining the azimuth angle of the sound source.
In this embodiment, the detected sound source features are matched against a preset bird sound template library, which contains 39-dimensional MFCC features, prosodic features, and spectral envelope features for each bird species. The similarity between the test sample and the template is computed with a dynamic time warping algorithm, and the acoustic confidence is calculated by combining this with the stability of the sound source azimuth angle (an angular change of less than 5° over 5 consecutive frames). The confidence formula is: acoustic confidence = 0.7 x spectral similarity + 0.3 x azimuth stability, where the spectral similarity is normalized to the range 0-1 by a softmax. Finally, the azimuth angle (0-360 degrees) and the acoustic confidence (0-1) of each sound source are output, with an update rate of 10 Hz.
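An illustrative sketch of the acoustic-confidence combination above; the spectral similarities (e.g., from DTW template matching) and the per-frame azimuth track are assumed to be computed elsewhere, and the linear decay used for the azimuth-stability score is an assumption not fixed by the text.

```python
import numpy as np

def acoustic_confidence(spectral_similarities, azimuth_track_deg, angle_tol=5.0):
    """Combine template similarity and azimuth stability into one confidence."""
    # Softmax-normalize the similarity scores over all templates, keep the best.
    sims = np.asarray(spectral_similarities, dtype=float)
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    spectral_score = float(probs.max())
    # Azimuth stability over the last 5 frames: full score when the spread is
    # below `angle_tol` degrees, decaying linearly afterwards (assumed rule).
    spread = float(np.ptp(np.asarray(azimuth_track_deg[-5:], dtype=float)))
    stability = 1.0 if spread < angle_tol else max(0.0, 1.0 - (spread - angle_tol) / angle_tol)
    # Weighted combination as stated: 0.7 * spectral similarity + 0.3 * stability.
    return 0.7 * spectral_score + 0.3 * stability
```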
And S4, carrying out space consistency matching based on the space coordinates and the azimuth angle of the sound source, and judging that the space distance between the space coordinates and the azimuth angle of the sound source is a valid candidate area when the space distance between the space coordinates and the azimuth angle of the sound source is smaller than a preset threshold value.
Specifically, step S4 includes the steps of:
and S41, mapping the space coordinates to a two-dimensional plane coordinate system of a video picture, and converting the azimuth angle of the sound source into projection coordinates under the two-dimensional plane coordinate system.
In the spatial consistency matching process, the spatial coordinates obtained from video detection are first converted from the pixel coordinate system to the world coordinate system. A three-dimensional coordinate system with the camera optical center as the origin is established, and the two-dimensional image coordinates are converted into three-dimensional ground coordinates using the camera calibration parameters (including focal length, principal point coordinates, and distortion coefficients) and the known installation height (e.g., 3 meters), with the Z-axis coordinate fixed to 0 (assuming that birds move near the ground). The coordinate conversion is realized with a perspective transformation matrix, and the conversion error is kept within ±0.1 meter.
The sound source azimuth information is then converted into projection coordinates in the same world coordinate system. Based on the mounting position of the microphone array (at a fixed distance of 1 meter from the camera) and the azimuth data (horizontal angle θ and pitch angle φ), the three-dimensional coordinates of the sound source are calculated through the spherical coordinate conversion formula
x = r·cosφ·cosθ, y = r·cosφ·sinθ, z = r·sinφ
where r is a preset sound source distance estimate (3 meters by default). To account for sound source localization error, Gaussian smoothing (σ set to 0.2 meters) is applied to the coordinates, finally yielding the projection coordinates of the sound source.
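The coordinate conversion of step S41 can be sketched as below, using the spherical-to-Cartesian formula above with the default 3-meter distance estimate; the microphone offset direction and the Gaussian-weighted track smoothing are illustrative simplifications.

```python
import numpy as np

def sound_source_projection(azimuth_deg, pitch_deg, distance_m=3.0,
                            mic_offset=np.array([1.0, 0.0, 0.0])):
    """Project a sound-source direction into the camera world coordinate system.

    `mic_offset` is the microphone-array position relative to the camera origin
    (a fixed 1 m offset is assumed; its direction here is illustrative).
    """
    theta = np.radians(azimuth_deg)   # horizontal angle
    phi = np.radians(pitch_deg)       # pitch angle
    # Spherical-to-Cartesian conversion at the preset distance estimate.
    local = distance_m * np.array([np.cos(phi) * np.cos(theta),
                                   np.cos(phi) * np.sin(theta),
                                   np.sin(phi)])
    return mic_offset + local

def smooth_track(points, sigma_m=0.2):
    """Gaussian-weighted smoothing of a short coordinate track (illustrative)."""
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points - points[-1], axis=1)
    weights = np.exp(-0.5 * (d / sigma_m) ** 2)
    return (weights[:, None] * points).sum(axis=0) / weights.sum()
```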
And S42, calculating the Euclidean distance between the space coordinates and the projection coordinates, and judging that the space consistency is matched when the Euclidean distance is smaller than or equal to a preset threshold value.
In this embodiment, the Euclidean distance d between the video detection coordinates and the sound source projection coordinates is calculated. Since the Z coordinate is fixed to 0, the preset threshold is set to 30% of the diagonal length of the video detection box (typically 0.5-1.5 meters), and when d is less than or equal to the preset threshold, a spatial consistency match is determined. To improve computational efficiency, a KD-tree data structure is used for fast neighborhood search over the detection targets, reaching a processing speed of 1000 matches per second.
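A sketch of the KD-tree-based neighborhood search, assuming that video detections and sound-source projections have both been reduced to two-dimensional ground-plane coordinates (Z = 0).

```python
from scipy.spatial import cKDTree

def match_detections(video_xy, source_xy, threshold_m):
    """Pair each sound-source projection with the closest video detection.

    Returns (source_index, detection_index, distance) tuples for pairs whose
    Euclidean distance is within the preset threshold.
    """
    tree = cKDTree(video_xy)                 # KD-tree over video detections
    dists, idxs = tree.query(source_xy)      # nearest detection per sound source
    return [(s, int(i), float(d))
            for s, (d, i) in enumerate(zip(dists, idxs))
            if d <= threshold_m]
```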
S43, carrying out confidence correction on the effective candidate area successfully matched, wherein:
when the Euclidean distance is smaller than 50% of the preset threshold, boosting the visual confidence and the acoustic confidence according to a preset confidence boost ratio;
and when the Euclidean distance is between 50% and 100% of the preset threshold, dynamically attenuating the confidences according to the proportional relation between the Euclidean distance and the preset threshold.
Wherein the dynamic attenuation comprises:
First, the normalized ratio of the Euclidean distance to the preset threshold is calculated and recorded as the attenuation coefficient.
Second, attenuation calculations are performed on the visual confidence and the acoustic confidence of the valid candidate region, respectively.
Finally, the valid candidate region is eliminated when both the attenuated visual confidence and the attenuated acoustic confidence fall below the preset lower confidence limit.
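A hedged sketch of the confidence correction and dynamic attenuation of step S43; the boost ratio, the lower confidence limit, and the exact attenuation rule are illustrative choices, since the embodiment fixes only the 50% split and the use of the normalized ratio as the attenuation coefficient.

```python
def correct_confidences(vis_conf, ac_conf, distance, threshold,
                        boost=1.2, conf_floor=0.2):
    """Boost or attenuate the two confidences based on the matching distance.

    Returns a (vis_conf, ac_conf) pair, or None when the attenuated values fall
    below the lower confidence limit and the candidate region is discarded.
    `boost`, `conf_floor`, and the attenuation rule are illustrative assumptions.
    """
    k = distance / threshold                 # normalized ratio (attenuation coefficient)
    if k < 0.5:
        # Tight spatial agreement: boost both confidences (capped at 1.0).
        return min(1.0, vis_conf * boost), min(1.0, ac_conf * boost)
    # Looser agreement: attenuate both confidences using the coefficient k
    # (one plausible rule; the text does not fix the exact formula).
    vis_conf *= (1.0 - k)
    ac_conf *= (1.0 - k)
    if vis_conf < conf_floor and ac_conf < conf_floor:
        return None                          # eliminate this candidate region
    return vis_conf, ac_conf
```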
And S5, carrying out weighted calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region, and outputting the recognition result of birds in the target region when the fusion confidence coefficient obtained by the weighted calculation exceeds a judgment threshold value.
Specifically, step S5 includes the steps of:
And S51, calculating a dynamic weight distribution coefficient based on the ratio of the Euclidean distance to the preset threshold value, wherein the weight coefficient of the visual confidence coefficient is in a linear relation with the reciprocal of the ratio.
In the fusion confidence calculation stage, weights are first dynamically assigned according to the spatial consistency matching result. For each valid candidate region, the weight ratio of the visual confidence and the acoustic confidence is determined from the ratio of its Euclidean distance to the preset threshold (denoted d/D). The weight coefficient of the visual confidence is set to (1 - d/D) and the weight coefficient of the acoustic confidence to d/D. The closer the distance is to the threshold, the higher the weight of the acoustic confidence; conversely, the visual confidence dominates. The weight assignment uses linear interpolation to ensure a smooth transition of the weight coefficients over the range from 0 to D.
S52, performing weighted fusion of the dynamic weight distribution coefficients, the corrected visual confidence, and the corrected acoustic confidence, specifically:
C_fused = (1 - d/D) · C_v + (d/D) · C_a
wherein C_fused is the fusion confidence, d is the Euclidean distance, D is the preset threshold, and C_v and C_a are the corrected visual confidence and acoustic confidence, respectively.
In this embodiment, the dynamic weights are used to fuse the corrected visual confidence and acoustic confidence. In the concrete implementation, each candidate region is evaluated with the above formula. For the case of conflicting multi-modal data (such as a high visual confidence paired with an extremely low acoustic confidence), the system provides a conflict detection mechanism: when the difference between the two confidences exceeds 0.5, a manual review flag is triggered.
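The weighted fusion and conflict check of steps S51 and S52 can be sketched as follows, directly following the fusion formula above; the manual-review flag corresponds to the conflict detection mechanism.

```python
def fuse_confidences(vis_conf, ac_conf, distance, threshold, conflict_gap=0.5):
    """Distance-weighted fusion of visual and acoustic confidences.

    Returns (fused_confidence, needs_manual_review).
    """
    ratio = min(distance / threshold, 1.0)        # d / D, clipped to [0, 1]
    fused = (1.0 - ratio) * vis_conf + ratio * ac_conf
    # Flag modality conflicts (e.g. high visual but very low acoustic score).
    needs_review = abs(vis_conf - ac_conf) > conflict_gap
    return fused, needs_review
```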
And S53, when the fusion confidence coefficient is larger than or equal to a judging threshold value, judging that the effective candidate area has the bird target, and outputting a recognition result, otherwise, judging that the effective candidate area does not have the bird target.
In this embodiment, dynamic adjustment of the judgment threshold is implemented with a pre-established parameter mapping table. The table stores experimentally calibrated optimal thresholds indexed by video resolution and frame rate; for example, the threshold is 0.65 in 4K@60fps mode and 0.75 in 1080p@30fps mode. The current video stream acquisition parameters are obtained in real time, and the corresponding threshold is quickly retrieved through a hash table. For parameter combinations not covered by the table, the threshold is computed by nearest-neighbour interpolation; for example, the threshold for 3840×1600@45fps takes a weighted average of the 4K@60fps and 1080p@30fps thresholds.
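A small sketch of the threshold lookup with a fallback for uncovered parameter combinations; the two table entries repeat the example values in the text, and the distance-weighted fallback is an illustrative stand-in for the nearest-neighbour interpolation described above.

```python
# Mapping from (width, height, fps) to the calibrated decision threshold.
THRESHOLD_TABLE = {
    (3840, 2160, 60): 0.65,   # 4K @ 60 fps
    (1920, 1080, 30): 0.75,   # 1080p @ 30 fps
}

def decision_threshold(width, height, fps):
    """Look up the decision threshold; interpolate from known entries otherwise."""
    key = (width, height, fps)
    if key in THRESHOLD_TABLE:
        return THRESHOLD_TABLE[key]
    # Fallback: distance-weighted average over the known parameter combinations.
    weight_sum, value_sum = 0.0, 0.0
    for (w, h, f), thr in THRESHOLD_TABLE.items():
        d = abs(w * h - width * height) / (w * h) + abs(f - fps) / f
        wgt = 1.0 / (d + 1e-6)
        weight_sum += wgt
        value_sum += wgt * thr
    return value_sum / weight_sum
```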
When the fusion confidence reaches or exceeds the retrieved judgment threshold, it is determined that a bird target exists in the region. The output recognition result comprises the spatial coordinates, the confidence value, and a timestamp, and the raw data are recorded for subsequent model optimization. Valid candidate regions whose fusion confidence is below the judgment threshold undergo two-stage filtering: invalid regions with a fusion confidence below 0.3 are discarded first, the remaining regions are kept in a buffer for 5 seconds, and a region is reactivated if its confidence later rises above the threshold. Finally, the output results are packaged in JSON format, containing structured data such as the target ID, coordinate set, and confidence curve.
Example 2
The invention provides a bird group identification system based on ultra-high definition video, which is used for executing a bird group identification method based on ultra-high definition video, and is shown by referring to fig. 2, the system comprises:
And the parameter dynamic adjustment module 100 is used for synchronously acquiring the video stream and the audio stream of the target area and dynamically adjusting the acquisition parameters of the video stream based on the frequency spectrum characteristics extracted from the audio stream.
The video stream processing module 200 is configured to extract space-time characteristics of the video stream after dynamic adjustment, and output spatial coordinates and visual confidence of the bird target.
And the audio stream processing module 300 is used for carrying out time-frequency characteristic extraction and sound source positioning on the audio stream and outputting a sound source azimuth angle and an acoustic confidence.
The region determining module 400 is configured to perform spatial consistency matching based on the spatial coordinates and the azimuth of the sound source, and determine that the region is a valid candidate region when the spatial distance between the spatial coordinates and the azimuth of the sound source is less than a preset threshold.
And the result output module 500 is used for performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting the recognition result indicating that birds are present in the target area when the fusion confidence obtained by the weighted calculation exceeds the judgment threshold.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be implemented by a program that instructs associated hardware, the program being stored in a computer-readable storage medium including Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other medium capable of carrying or storing computer-readable data.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.