
CN120012031A - A bird flock recognition method and system based on ultra-high-definition video

A bird flock recognition method and system based on ultra-high-definition video

Info

Publication number
CN120012031A
CN120012031A
Authority
CN
China
Prior art keywords
confidence
bird
video stream
visual
sound source
Prior art date
Legal status
Granted
Application number
CN202510502649.1A
Other languages
Chinese (zh)
Other versions
CN120012031B (en)
Inventor
郑慧明
宋小民
刘征
吴成志
虞建
余佳豪
陆志豪
黄菊
许哲
吴脊
赵周丽
李新宇
姜春桐
吴万馨
王正雄
邓义斌
Current Assignee
Sichuan Guochuang Innovation Vision Ultra HD Video Technology Co., Ltd.
Original Assignee
Sichuan Guochuang Innovation Vision Ultra HD Video Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Sichuan Guochuang Innovation Vision Ultra HD Video Technology Co., Ltd.
Priority to CN202510502649.1A
Publication of CN120012031A
Application granted
Publication of CN120012031B
Legal status: Active
Anticipated expiration


Classifications

    • G10L17/26 - Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/776 - Validation; Performance evaluation
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract


The present invention provides a bird flock recognition method and system based on ultra-high-definition video, which relates to the technical field of bird recognition. The method and system include synchronously acquiring a video stream and an audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on frequency spectrum features extracted from the audio stream; extracting spatiotemporal features of the dynamically adjusted video stream, and outputting spatial coordinates and visual confidence of the bird target; extracting time-frequency features and locating sound sources of the audio stream, and outputting sound source azimuth and acoustic confidence; performing spatial consistency matching based on the spatial coordinates and the sound source azimuth, and determining the area to be a valid candidate when the spatial distance between the two is less than a preset threshold; performing weighted calculation on the visual confidence and acoustic confidence of the valid candidate area, and outputting a recognition result that there are birds in the target area when the fusion confidence obtained by the weighted calculation exceeds the judgment threshold. The present invention improves the accuracy and robustness of bird flock recognition through the effective fusion of multimodal information and a dynamic adjustment strategy.

Description

Bird group identification method and system based on ultra-high definition video
Technical Field
The invention relates to the technical field of bird identification, in particular to a bird group identification method and system based on ultra-high definition video.
Background
With the growing demands of ecological protection and environmental monitoring, the precise identification and monitoring of bird activity in natural environments has become an important research direction. Against the background of the rapid development of ultra-high definition video technology in particular, using ultra-high definition video data to achieve efficient and accurate bird group identification is of great significance in fields such as ecological protection, wildlife management, and environmental monitoring. The invention relates to a bird group identification method and system based on ultra-high definition video, belongs to the cross-disciplinary field of computer vision and audio processing, and aims to improve the accuracy and robustness of bird group identification by combining the multimodal information of video and audio.
Conventional bird group identification methods rely primarily on a single modality, either video or audio. Although video-based recognition can intuitively capture the visual characteristics of birds, its accuracy degrades markedly in complex environments (such as illumination changes and occlusion). Audio-based recognition can capture the acoustic features of birds, but its effectiveness likewise drops sharply in the presence of background noise or multi-source interference.
Therefore, it is necessary to provide a method and a system for identifying a bird group based on ultra-high definition video to solve the above technical problems.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a bird group identification method and system based on ultra-high definition video, which improve the accuracy and robustness of bird group identification through effective multimodal information fusion and a dynamic adjustment strategy.
The invention provides a bird group identification method based on ultra-high definition video, which comprises the following steps:
Synchronously acquiring a video stream and an audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on frequency spectrum features extracted from the audio stream;
Extracting space-time characteristics of the video stream after dynamic adjustment, and outputting space coordinates and visual confidence of the bird target;
Extracting time-frequency characteristics and positioning sound sources of the audio stream, and outputting azimuth angles of the sound sources and acoustic confidence coefficients;
Performing spatial consistency matching based on the spatial coordinates and the sound source azimuth, and determining a valid candidate region when the spatial distance between the two is smaller than a preset threshold;
And performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting a recognition result that birds are present in the target region when the fusion confidence obtained by the weighted calculation exceeds a judgment threshold.
Preferably, the step of synchronously acquiring the video stream and the audio stream of the target area and dynamically adjusting the acquisition parameters of the video stream based on the spectral features extracted from the audio stream includes:
extracting the energy distribution ratio of a preset high frequency band and a preset low frequency band in the audio stream;
when the energy distribution proportion of the high frequency band exceeds a first threshold, the resolution and the frame rate of the video stream are improved according to a preset proportion coefficient;
when the energy distribution proportion of the low frequency band exceeds a second threshold value, reducing the resolution and the frame rate of the video stream according to a preset proportion coefficient;
and if the energy distribution ratio of the high frequency band and the low frequency band does not exceed the corresponding threshold value, maintaining the current acquisition parameters unchanged.
Preferably, the extracting the space-time characteristics of the video stream after the dynamic adjustment, and outputting the space coordinates and the visual confidence of the bird target, includes:
performing space-time joint modeling on at least three adjacent frames of the video stream, and extracting fusion features comprising time motion features and space texture features;
Generating a candidate region containing space coordinates and initial confidence coefficient through a pre-trained target detection network based on the fusion characteristics;
And performing non-maximum suppression on the candidate regions, and outputting the spatial coordinates and corresponding visual confidence of the final bird target.
Preferably, the performing time-frequency feature extraction and sound source localization on the audio stream, outputting a sound source azimuth angle and an acoustic confidence, includes:
Performing short-time Fourier transform on the audio stream to obtain a time-frequency spectrogram;
detecting bird sound features on the time-frequency spectrogram, and marking potential sound sources;
determining a sound source azimuth of the potential sound source by analyzing a time difference of sound signals received by at least two microphones;
And calculating and outputting acoustic confidence by combining the azimuth of the sound source based on the similarity between the time-frequency spectrogram and a preset bird sound template.
Preferably, the performing spatial consistency matching based on the spatial coordinates and the sound source azimuth, and determining a valid candidate region when the spatial distance between the two is smaller than a preset threshold, includes:
Mapping the space coordinates to a two-dimensional plane coordinate system of a video picture, and converting the azimuth angle of the sound source into projection coordinates under the two-dimensional plane coordinate system;
calculating the Euclidean distance between the space coordinate and the projection coordinate, and judging that the space consistency is matched when the Euclidean distance is smaller than or equal to a preset threshold value;
confidence correction is carried out on the effective candidate areas successfully matched, wherein:
when the Euclidean distance is smaller than 50% of the preset threshold value, the visual confidence coefficient and the acoustic confidence coefficient are enhanced according to a preset confidence coefficient lifting proportion;
And when the Euclidean distance is between 50% and 100% of the preset threshold value, dynamically attenuating the confidence according to the proportional relation between the Euclidean distance and the preset threshold value.
Preferably, the dynamic attenuation includes:
Calculating a normalized ratio value of the Euclidean distance to a preset threshold value, and recording the normalized ratio value as an attenuation coefficient;
Respectively carrying out attenuation calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region;
and when the attenuated visual confidence coefficient and the attenuated acoustic confidence coefficient are lower than the preset confidence coefficient lower limit, eliminating the effective candidate region.
Preferably, the performing weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting the recognition result that birds are present in the target region when the fusion confidence obtained by the weighted calculation exceeds a judgment threshold, includes:
Calculating a dynamic weight distribution coefficient based on the ratio of the Euclidean distance to the preset threshold value, wherein the weight coefficient of the visual confidence coefficient is in a linear relation with the reciprocal of the ratio;
and carrying out weighted fusion of the dynamic weight distribution coefficient with the corrected visual confidence and acoustic confidence, specifically expressed as:

C_f = (1 - d/D) · C_v + (d/D) · C_a

wherein C_f is the fusion confidence, d is the Euclidean distance, D is the preset threshold, and C_v and C_a are the corrected visual confidence and acoustic confidence, respectively;
and when the fusion confidence coefficient is greater than or equal to a judging threshold value, judging that the effective candidate area has the bird target, and outputting a recognition result, otherwise, judging that the effective candidate area does not have the bird target.
Preferably, the determining threshold is dynamically adjusted according to the acquisition parameters of the video stream, and specifically includes:
Determining the judgment threshold value from a pre-established mapping relation table of video stream acquisition parameters and the judgment threshold value in a table look-up mode, wherein the mapping relation table comprises optimal judgment threshold values corresponding to different resolution ratios and frame rate combinations.
The invention provides a bird group identification system based on ultra-high definition video, which is used for executing a bird group identification method based on ultra-high definition video, and comprises the following steps:
The parameter dynamic adjustment module is used for synchronously acquiring video stream and audio stream of a target area and dynamically adjusting acquisition parameters of the video stream based on frequency spectrum features extracted from the audio stream;
the video stream processing module is used for extracting space-time characteristics of the video stream after dynamic adjustment and outputting the space coordinates and visual confidence of the bird target;
the audio stream processing module is used for extracting time-frequency characteristics of the audio stream and positioning an acoustic source and outputting an azimuth angle of the acoustic source and an acoustic confidence;
the region judging module is used for carrying out space consistency matching based on the space coordinates and the sound source azimuth angle, and judging the space distance between the space coordinates and the sound source azimuth angle to be a valid candidate region when the space distance between the space coordinates and the sound source azimuth angle is smaller than a preset threshold value;
And the result output module is used for carrying out weighted calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region, and outputting the recognition result of birds in the target region when the fusion confidence coefficient obtained by the weighted calculation exceeds a judgment threshold value.
Compared with the related art, the method and the system for identifying the bird group based on the ultra-high definition video have the following beneficial effects:
According to the invention, the video stream and audio stream of the target area are synchronously acquired, and the acquisition parameters of the video stream are dynamically adjusted based on the spectral features extracted from the audio stream, so as to meet recognition requirements under different environmental conditions. Meanwhile, spatial consistency matching and confidence weighting are realized by combining spatiotemporal feature extraction from the video stream with time-frequency feature extraction and sound source localization from the audio stream, and the recognition result of birds in the target area is finally output. The invention improves the accuracy and robustness of bird group identification through effective multimodal information fusion and a dynamic adjustment strategy, reduces computational complexity, and meets practical application demands.
Drawings
FIG. 1 is a flow chart of a method for identifying a bird group based on ultra-high definition video;
fig. 2 is a block diagram of a system for identifying a bird group based on ultra-high definition video.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Before discussing the exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently, or at the same time. Furthermore, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example 1
The invention provides a bird group identification method based on ultra-high definition video; referring to fig. 1, the method comprises the following steps:
and S1, synchronously acquiring video stream and audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on the frequency spectrum characteristics extracted from the audio stream.
Specifically, step S1 includes the steps of:
s11, extracting the energy distribution ratio of a preset high frequency band and a preset low frequency band in the audio stream.
In this embodiment, during audio stream processing, the continuously input audio signal is first divided into frames and windowed with a Hanning window function, with a frame length of 50 ms and a frame shift of 25 ms. A band-pass filter bank separates a preset high frequency band (8 kHz to 16 kHz) and a preset low frequency band (100 Hz to 500 Hz), and the energy values of the two bands are calculated along the time dimension. The high-band energy ratio is the proportion of the high-band energy to the total energy of the audio signal; the low-band energy ratio is computed in the same way. All energy values are calibrated in decibels, and the dynamic proportion distribution of the two bands is finally output.
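The band-energy computation described here can be sketched compactly. The following Python fragment is a minimal illustration under the stated parameters (50 ms frames, 25 ms shift, Hanning window, 8-16 kHz and 100-500 Hz bands), with an FFT-based band split standing in for the band-pass filter bank; function and variable names are illustrative only.

```python
import numpy as np

def band_energy_ratios(audio, sr=48000, frame_ms=50, hop_ms=25):
    """Per-frame (high_band_ratio, low_band_ratio) of total spectral energy."""
    frame_len = int(sr * frame_ms / 1000)   # 50 ms -> 2400 samples at 48 kHz
    hop_len = int(sr * hop_ms / 1000)       # 25 ms frame shift
    window = np.hanning(frame_len)
    ratios = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        total = power.sum() + 1e-12
        high = power[(freqs >= 8000) & (freqs <= 16000)].sum()   # preset high band
        low = power[(freqs >= 100) & (freqs <= 500)].sum()       # preset low band
        ratios.append((high / total, low / total))
    return np.asarray(ratios)
```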
And S12, when the energy distribution proportion of the high frequency band exceeds a first threshold, the resolution and the frame rate of the video stream are improved according to a preset proportionality coefficient.
In this embodiment, when the high-band energy ratio exceeds a preset first threshold (e.g., 30%), an adjustment mechanism for the video acquisition parameters is automatically triggered. Specific adjustments include increasing the video resolution by a preset magnification factor (e.g., 1.5 times), for example from 1920 x 1080 pixels to 3840 x 2160 pixels (i.e., 4K resolution), while raising the video frame rate from 30 to 60 frames per second. This adjustment strategy is based on the strong correlation between high-band energy and bird-specific activities (such as wing flapping and calls): raising the acquisition parameters enhances the capture of fine details of bird activity. During parameter adjustment, a smooth transition algorithm avoids abrupt changes in picture quality: resolution switching is realized by bilinear interpolation, and frame rate adjustment uses frame-skip compensation to keep video playback smooth.
And S13, when the energy distribution proportion of the low frequency band exceeds a second threshold value, reducing the resolution and the frame rate of the video stream according to a preset proportionality coefficient.
In this embodiment, when the low-frequency energy duty ratio exceeds a preset second threshold (e.g., 40%), it is determined that low-frequency interference (such as wind noise, mechanical noise, etc.) exists in the environment, and then the video acquisition parameters are reduced according to a preset reduction coefficient (e.g., 0.7 times). Specific operations include reducing the resolution from 4K to 1080p, and the frame rate from 60 frames per second to 30 frames per second. This parameter reduction strategy reduces the occupation of system computing resources by reducing the amount of video stream data, while employing motion adaptive filtering techniques to suppress the impact of low frequency noise on video quality. In the implementation process, the resolution parameter is preferentially adjusted, the frame rate parameter is then adjusted, and the time interval between two parameter adjustments is not less than 2 seconds, so as to prevent frequent fluctuation of parameter setting.
And S14, if the energy distribution ratio of the high frequency band and the low frequency band does not exceed the corresponding threshold value, maintaining the current acquisition parameters unchanged.
In this embodiment, when the high-band energy ratio is at most 30% and the low-band energy ratio is at most 40%, the current video acquisition parameters are kept unchanged. In this state, the fluctuation of the two band-energy ratios is continuously monitored: if the fluctuation amplitude stays below 5% for 5 consecutive seconds, the environment is judged to be stable and the system enters a low-power operation mode; if the fluctuation exceeds 5%, the threshold decision flow is restarted. To prevent the system from frequently switching parameters around the thresholds, hysteresis intervals are set (for example, ±3% around the high-frequency threshold and ±5% around the low-frequency threshold), and parameter adjustment is performed only when an energy ratio moves outside its hysteresis interval.
In addition, the first threshold and the second threshold are obtained through experimental calibration; extensive real-scene testing and performance optimization ensure that these thresholds achieve the best recognition results under different environmental conditions.
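Taken together, steps S12 to S14 amount to a three-way decision with hysteresis. The sketch below captures that logic under the example values from the text (30%/40% thresholds, ±3%/±5% hysteresis); the action labels and the return convention are illustrative assumptions, not the patent's interface.

```python
def decide_adjustment(high_ratio, low_ratio,
                      high_thr=0.30, low_thr=0.40,
                      high_hyst=0.03, low_hyst=0.05):
    """Map the two band-energy ratios to an acquisition-parameter action."""
    if high_ratio > high_thr + high_hyst:
        return "raise"   # e.g. 1080p/30fps -> 4K/60fps (preset scale factor)
    if low_ratio > low_thr + low_hyst:
        return "lower"   # e.g. 4K/60fps -> 1080p/30fps (e.g. 0.7x coefficient)
    return "hold"        # both ratios inside their hysteresis intervals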
And S2, extracting space-time characteristics of the video stream after dynamic adjustment, and outputting the space coordinates and visual confidence of the bird target.
Specifically, step S2 includes the steps of:
And S21, carrying out space-time joint modeling on at least three adjacent frames of the video stream, and extracting fusion features comprising time motion features and space texture features.
In this embodiment, during spatiotemporal feature extraction from the video stream, the dynamically adjusted video sequence is first preprocessed, and three consecutive images are selected to form a spatiotemporal analysis unit. The video sequence is jointly modeled using a three-dimensional convolutional neural network (3D CNN), in which the first-layer convolution kernel size is set to 3 x 3 x 3 (time x height x width) with a stride of 1 x 1 x 1, and a total of 64 filters are used to extract spatiotemporal features. In the temporal dimension, bird motion features are captured by computing the optical flow field between adjacent frames, using the Farneback dense optical flow algorithm to obtain pixel-level displacement vectors; in the spatial dimension, a modified ResNet network extracts multi-scale texture features, including static features such as feather texture and beak shape. The temporal motion features and spatial texture features are cascaded at the feature layer to form a 1280-dimensional fusion feature vector.
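As a rough illustration of this two-branch design, the fragment below pairs a first 3D-convolution layer (3 x 3 x 3 kernel, 64 filters) with a Farneback optical-flow call, assuming PyTorch and OpenCV; the ResNet texture branch and the cascade into a 1280-dimensional vector are omitted, and all names are illustrative.

```python
import cv2
import torch
import torch.nn as nn

# First 3D-conv layer as described: 64 filters, 3x3x3 kernel over (T, H, W).
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

def spatiotemporal_features(frames_rgb, frames_gray):
    """frames_rgb: float tensor (3, 3, H, W) = (channels, time, H, W);
    frames_gray: list of three uint8 grayscale frames."""
    # Temporal branch: dense pixel-level displacement between adjacent frames.
    flow = cv2.calcOpticalFlowFarneback(frames_gray[0], frames_gray[1], None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Spatial-temporal branch: 3D convolution over the three-frame volume.
    volume = frames_rgb.unsqueeze(0)   # add batch dim -> (1, 3, 3, H, W)
    feat = conv3d(volume)              # -> (1, 64, 3, H, W)
    return feat, flow
```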
And S22, generating a candidate region comprising the space coordinates and the initial confidence coefficient through a pre-trained target detection network based on the fusion characteristics.
In this embodiment, bird target detection is performed by feeding the extracted fusion feature vector into a pre-trained Faster R-CNN target detection network. The network comprises two parts, a Region Proposal Network (RPN) and a detection network: the RPN generates about 2000 candidate regions, each with spatial coordinates (center point coordinates plus width and height) and an initial confidence score. The detection network performs secondary classification and regression on the candidate regions, computes the probability of each bird class with a softmax function as the initial confidence, and sets the confidence threshold to 0.7 to filter low-quality proposal boxes. During training, a cross-entropy loss and a smooth L1 loss are jointly optimized on a data set covering 50 common bird species until convergence.
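The proposal-scoring step (softmax over class logits, 0.7 confidence cutoff) can be written out directly; the following numpy sketch assumes class 0 is background and takes the box layout as given, with all names hypothetical.

```python
import numpy as np

def score_proposals(logits, boxes, thr=0.7):
    """logits: (N, num_classes) with background at index 0; boxes: (N, 4)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    conf = probs[:, 1:].max(axis=1)   # best bird-class probability
    keep = conf >= thr                # filter low-quality proposals
    return boxes[keep], conf[keep]
```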
And S23, performing non-maximum suppression on the candidate regions, and outputting the spatial coordinates and corresponding visual confidence of the final bird target.
In the present embodiment, non-maximum suppression (NMS) is applied to the candidate regions output by the detection network, with the IoU overlap threshold set to 0.5. The procedure is as follows: all candidate regions are first sorted in descending order of confidence; the highest-scoring candidate box is selected as the reference; the IoU of every other candidate box with the reference box is computed; lower-scoring boxes with IoU greater than 0.5 are deleted; and the process is iterated until all candidate boxes have been handled. The final bird targets comprise accurate spatial coordinates and normalized visual confidences, where the confidence is calibrated by a sigmoid function to ensure comparability across targets of different scales. When multiple birds appear simultaneously, only the detections ranked in the top 10 by confidence are retained to meet real-time requirements.
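This NMS procedure is standard and easy to reproduce. The sketch below follows the description (descending confidence sort, IoU threshold 0.5, top-10 retention), assuming corner-format boxes rather than the center/width-height form used upstream.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5, top_k=10):
    """boxes: (N, 4) as x1, y1, x2, y2; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # sort by descending confidence
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # IoU of the reference box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-12)
        order = order[1:][iou <= iou_thr]   # drop overlapping lower scores
    return keep
```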
And S3, extracting time-frequency characteristics and positioning sound sources of the audio stream, and outputting azimuth angles and acoustic confidence degrees of the sound sources.
Specifically, step S3 includes the following steps:
And S31, performing a short-time Fourier transform on the audio stream to obtain a time-frequency spectrogram.
In this embodiment, during audio stream processing, the collected audio signal is first preprocessed into 16-bit PCM at a sampling rate of 48 kHz. A short-time Fourier transform (STFT) is performed on the continuous audio stream, with framing by a Hamming window of 1024 sample points (about 21.3 ms) and a frame shift of 512 sample points. The spectral components of each frame are computed by FFT, producing a time-frequency spectrogram with a frequency resolution of 46.9 Hz and a time resolution of 10.7 ms. The spectrogram is then converted to the Mel scale, with 40 Mel filter banks applying nonlinear compression to the spectral energy, and a three-dimensional time-frequency-energy spectrogram is finally output.
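With the stated parameters, this STFT front end maps directly onto scipy; the fragment below is a minimal sketch that stops at a dB-scaled spectrogram, using log compression as a stand-in for the 40-filter Mel bank.

```python
import numpy as np
from scipy.signal import stft

def time_frequency_map(audio, sr=48000):
    """48 kHz input, 1024-sample Hamming window (~21.3 ms), 512-sample shift."""
    freqs, times, Z = stft(audio, fs=sr, window='hamming',
                           nperseg=1024, noverlap=512)
    power = np.abs(Z) ** 2
    log_power = 10 * np.log10(power + 1e-12)   # dB scale
    return freqs, times, log_power             # ~46.9 Hz x ~10.7 ms grid
```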
And S32, detecting bird song features on the time-frequency spectrogram, and marking potential sound sources.
In this embodiment, bird call feature detection is performed on the time-frequency spectrogram: time-frequency regions with bird call characteristics are first located by spectral centroid analysis and spectral flatness calculation. Potential sound sources are then marked with a GMM-HMM acoustic model, trained for more than 100 hours on a bird sound database containing 5000 sound samples of 50 common bird species. For each detected sound source event, the following characteristic parameters are extracted: fundamental frequency contour (50-8000 Hz), harmonic structure (at least 3 harmonics), and temporal modulation characteristics (amplitude modulation of 10-300 Hz), with the start-stop times and frequency range recorded. Potential sound sources whose features match typical bird call features with confidence greater than 0.6 are initially screened out.
And S33, determining the sound source azimuth angle of the potential sound source by analyzing the time difference of sound signals received by at least two microphones.
In the present embodiment, sound source localization is performed using an array of at least two microphones arranged in space. The time difference of arrival (TDOA) of the same sound signal at different microphones is calculated, and time delay estimation is performed with the generalized cross-correlation with phase transform (GCC-PHAT) method at a time resolution of 0.1 ms. According to the geometric configuration of the microphone array (minimum spacing 0.5 m), the sound source azimuth is calculated through a spherical intersection algorithm, with a resolution of 2° in the horizontal direction and 5° in the vertical direction. Within a 3 m range, the localization accuracy reaches ±0.1 m. Kalman filtering is applied for trajectory smoothing of sound sources that remain stable for more than 10 consecutive frames.
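GCC-PHAT itself is a well-known construction; a minimal two-microphone sketch is given below, returning the delay in seconds. The array geometry and the spherical intersection step that turn delays into an azimuth are not reproduced here.

```python
import numpy as np

def gcc_phat(sig, ref, sr=48000, max_tau=None):
    """Estimate the time difference of arrival between two microphone signals."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else int(sr * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / float(sr)
    return tau                              # TDOA in seconds
```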
S34, calculating and outputting acoustic confidence based on the similarity between the time-frequency spectrogram and a preset bird sound template and combining the azimuth angle of the sound source.
In this embodiment, the detected sound source features are matched against a preset bird sound template library, which contains MFCC features (39 dimensions), prosodic features, and spectral envelope features for each bird species. A dynamic time warping algorithm computes the similarity between the test sample and the template, and the acoustic confidence is calculated in combination with the stability of the sound source azimuth (angular change of less than 5° over 5 consecutive frames). The confidence formula is: acoustic confidence = 0.7 x spectral similarity + 0.3 x azimuth stability, where the spectral similarity is normalized to the 0-1 range by softmax. Finally, the azimuth (0-360°) and acoustic confidence (0-1) of each sound source are output at an update rate of 10 Hz.
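The stated confidence rule is a fixed convex combination, which reduces to a one-liner; both inputs are assumed to be pre-normalized to [0, 1] as the text describes.

```python
def acoustic_confidence(spectral_similarity, azimuth_stability):
    """0.7 x softmax-normalized spectral similarity + 0.3 x azimuth stability."""
    return 0.7 * spectral_similarity + 0.3 * azimuth_stability
```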
And S4, performing spatial consistency matching based on the spatial coordinates and the sound source azimuth, and determining a valid candidate region when the spatial distance between the two is smaller than a preset threshold.
Specifically, step S4 includes the steps of:
And S41, mapping the spatial coordinates to a two-dimensional plane coordinate system of the video frame, and converting the sound source azimuth into projection coordinates in that coordinate system.
In the spatial consistency matching process, first, spatial coordinates obtained by video detection are converted from a pixel coordinate system to a world coordinate system. A three-dimensional coordinate system with the optical center of the camera as the origin is established, and the two-dimensional image coordinate is converted into a three-dimensional ground coordinate by camera calibration parameters (including focal length, principal point coordinate and distortion coefficient) and a known installation height (such as 3 meters), wherein the Z-axis coordinate is fixed to 0 (assuming that birds are moving near the ground). The coordinate conversion is realized by adopting a perspective transformation matrix, and the conversion error is controlled within a range of +/-0.1 meter.
The sound source azimuth information is then converted into projection coordinates in the same world coordinate system. Based on the mounting position of the microphone array (a fixed 1 meter from the camera) and the azimuth data (horizontal angle θ, pitch angle φ), the three-dimensional coordinates of the sound source are computed through the spherical coordinate conversion x = r·cosφ·cosθ, y = r·cosφ·sinθ, z = r·sinφ, where r is a preset sound source distance estimate (3 meters by default). To account for sound source localization error, Gaussian smoothing is applied to the coordinates (σ set to 0.2 meters), finally yielding the projection coordinates of the sound source.
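A sketch of this projection step follows. The angle convention (pitch measured from the horizontal plane) is an assumption recovered from the garbled original, as is the exact form of the Gaussian smoothing, here a distance-weighted average over recent estimates with sigma = 0.2 m.

```python
import numpy as np

def source_projection(theta_deg, phi_deg, r=3.0):
    """Azimuth theta and pitch phi (degrees) to world (x, y); z assumed 0."""
    theta, phi = np.radians(theta_deg), np.radians(phi_deg)
    x = r * np.cos(phi) * np.cos(theta)
    y = r * np.cos(phi) * np.sin(theta)
    return np.array([x, y])

def gaussian_smooth(history, sigma=0.2):
    """Gaussian-weighted average of recent (x, y) estimates around the latest."""
    history = np.asarray(history)
    d = np.linalg.norm(history - history[-1], axis=1)
    w = np.exp(-0.5 * (d / sigma) ** 2)
    return (history * w[:, None]).sum(axis=0) / w.sum()
```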
And S42, calculating the Euclidean distance between the space coordinates and the projection coordinates, and judging that the space consistency is matched when the Euclidean distance is smaller than or equal to a preset threshold value.
In the present embodiment, the Euclidean distance d between the video detection coordinates and the sound source projection coordinates is calculated. Since the Z coordinate is fixed at 0, the preset threshold is set to 30% of the diagonal length of the video detection frame (typically 0.5-1.5 meters), and spatial consistency is deemed matched when d is less than or equal to the preset threshold. To improve computational efficiency, a KD-tree data structure is used for fast neighborhood search over detection targets, reaching a processing speed of about 1000 matches per second.
S43, carrying out confidence correction on the effective candidate area successfully matched, wherein:
when the Euclidean distance is smaller than 50% of the preset threshold value, the visual confidence coefficient and the acoustic confidence coefficient are enhanced according to a preset confidence coefficient lifting proportion;
And when the Euclidean distance is between 50% and 100% of the preset threshold value, dynamically attenuating the confidence according to the proportional relation between the Euclidean distance and the preset threshold value.
Wherein the dynamic attenuation comprises:
firstly, calculating a normalized ratio value of the Euclidean distance to a preset threshold value, and recording the normalized ratio value as an attenuation coefficient.
And secondly, respectively carrying out attenuation calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region.
And finally, eliminating the effective candidate region when the attenuated visual confidence coefficient and the attenuated acoustic confidence coefficient are lower than the preset confidence coefficient lower limit.
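The text fixes the inputs of the dynamic attenuation (attenuation coefficient d/D, a lower confidence limit) but not the attenuation function itself. In the sketch below, the linear schedule, which leaves confidence unchanged at d/D = 0.5 and halves it at d/D = 1, and the 0.3 floor are both assumptions.

```python
def attenuate_region(d, D, vis_conf, ac_conf, floor=0.3):
    """Dynamic attenuation for 0.5 < d/D <= 1; returns None if region is culled."""
    alpha = d / D                    # normalized ratio = attenuation coefficient
    scale = 1.5 - alpha              # assumed linear decay schedule
    vis, ac = vis_conf * scale, ac_conf * scale
    if vis < floor and ac < floor:   # both below the preset lower limit
        return None                  # valid candidate region eliminated
    return vis, ac
```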
And S5, carrying out weighted calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region, and outputting the recognition result of birds in the target region when the fusion confidence coefficient obtained by the weighted calculation exceeds a judgment threshold value.
Specifically, step S5 includes the steps of:
And S51, calculating a dynamic weight distribution coefficient based on the ratio of the Euclidean distance to the preset threshold value, wherein the weight coefficient of the visual confidence coefficient is in a linear relation with the reciprocal of the ratio.
In the fusion confidence calculation stage, weights are first dynamically allocated according to the spatial consistency matching result. For each valid candidate region, the weight ratio of visual confidence to acoustic confidence is determined by the ratio of its Euclidean distance to the preset threshold (denoted d/D). The weight coefficient of the visual confidence is set to (1 - d/D), and the weight coefficient of the acoustic confidence to d/D. The closer the distance is to the threshold, the higher the weight of the acoustic confidence; conversely, the visual confidence dominates. The weight allocation uses linear interpolation to ensure a smooth transition of the weight coefficients over the range from 0 to D.
S52, performing weighted fusion of the dynamic weight allocation coefficient with the corrected visual confidence and acoustic confidence, specifically expressed as:

C_f = (1 - d/D) · C_v + (d/D) · C_a

where C_f is the fusion confidence, d is the Euclidean distance, D is the preset threshold, and C_v and C_a are the corrected visual confidence and acoustic confidence, respectively.
In this embodiment, the dynamic weights are fused with the corrected visual confidence and acoustic confidence by the weighted sum above; in implementation, the fusion confidence of each candidate region is computed according to this formula. For multimodal data conflicts (such as a high visual confidence paired with an extremely low acoustic confidence), the system provides a conflict detection mechanism: when the difference between the two confidences exceeds 0.5, a manual review flag is triggered.
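The fusion rule is a two-term weighted sum, shown here as written, together with the conflict flag described above; the 0.5 conflict margin follows the text.

```python
def fuse(d, D, vis_conf, ac_conf):
    """Fusion confidence C_f = (1 - d/D) * C_v + (d/D) * C_a."""
    w_a = d / D                                    # acoustic weight grows with distance
    fused = (1.0 - w_a) * vis_conf + w_a * ac_conf
    needs_review = abs(vis_conf - ac_conf) > 0.5   # multimodal conflict flag
    return fused, needs_review
```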
And S53, when the fusion confidence coefficient is larger than or equal to a judging threshold value, judging that the effective candidate area has the bird target, and outputting a recognition result, otherwise, judging that the effective candidate area does not have the bird target.
In this embodiment, the dynamic adjustment of the decision threshold is implemented by a pre-established parameter mapping table. The mapping table stores the experimentally calibrated optimal threshold value with the video resolution and frame rate as indexes. For example, the threshold is 0.65 in the 4K@60fps mode and 0.75 in the 1080p@30fps mode. And acquiring current video stream acquisition parameters in real time, and quickly retrieving corresponding thresholds through a hash table. For uncovered parameter combinations, a nearest-neighbor interpolation is used to calculate the threshold, e.g., a threshold of 3840×1600@45fps takes a weighted average of the 4k@60fps and 1080p@30fps thresholds.
When the fusion confidence reaches or exceeds the retrieved judgment threshold, a bird target is judged to exist in the region. The output recognition result includes the spatial coordinates, confidence value, and timestamp, and the raw data are recorded for subsequent model optimization. Valid candidate regions whose fusion confidence falls below the judgment threshold undergo two-stage filtering: invalid regions with fusion confidence below 0.3 are discarded first, the remaining regions are kept in a buffer for 5 seconds, and a region is reactivated if its confidence rises above the threshold. Finally, the output is packaged in JSON format, including structured data such as target ID, coordinate set, and confidence curve.
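The threshold lookup reduces to a small map plus interpolation for uncovered combinations. In this sketch the two calibrated entries come from the text, while the inverse-distance weighting over (pixel count, frame rate) is an assumed stand-in for the nearest-neighbor interpolation mentioned above.

```python
THRESHOLDS = {(3840 * 2160, 60): 0.65,    # 4K@60fps
              (1920 * 1080, 30): 0.75}    # 1080p@30fps

def decision_threshold(width, height, fps):
    """Look up the decision threshold, interpolating uncovered combinations."""
    key = (width * height, fps)
    if key in THRESHOLDS:
        return THRESHOLDS[key]
    est, total = 0.0, 0.0
    for (pixels, f), thr in THRESHOLDS.items():
        # Normalized distance over pixel count and frame rate.
        dist = abs(pixels - key[0]) / (3840 * 2160) + abs(f - fps) / 60
        w = 1.0 / (dist + 1e-9)
        est += w * thr
        total += w
    return est / total    # weighted average of calibrated thresholds
```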
Example two
The invention provides a bird group identification system based on ultra-high definition video, which is used to execute the bird group identification method based on ultra-high definition video; referring to fig. 2, the system comprises:
And the parameter dynamic adjustment module 100 is used for synchronously acquiring the video stream and the audio stream of the target area and dynamically adjusting the acquisition parameters of the video stream based on the frequency spectrum characteristics extracted from the audio stream.
The video stream processing module 200 is configured to extract space-time characteristics of the video stream after dynamic adjustment, and output spatial coordinates and visual confidence of the bird target.
And the audio stream processing module 300 is used for carrying out time-frequency characteristic extraction and sound source positioning on the audio stream and outputting a sound source azimuth angle and an acoustic confidence.
The region determining module 400 is configured to perform spatial consistency matching based on the spatial coordinates and the azimuth of the sound source, and determine that the region is a valid candidate region when the spatial distance between the spatial coordinates and the azimuth of the sound source is less than a preset threshold.
And the result output module 500 is used for carrying out weighted calculation on the visual confidence coefficient and the acoustic confidence coefficient of the effective candidate region, and outputting the recognition result of birds in the target region when the fusion confidence coefficient obtained by the weighted calculation exceeds a judgment threshold value.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flowchart and/or block of the flowchart illustrations and/or block diagrams, and combinations of flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be implemented by a program that instructs associated hardware, the program may be stored in a computer readable storage medium including Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), programmable Read-Only Memory (Programmable Read-Only Memory, PROM), erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), one-time programmable Read-Only Memory (OTPROM), electrically erasable programmable Read-Only Memory (EEPROM), compact disc Read-Only Memory (CD-ROM), or other optical disc Memory, magnetic disk Memory, tape Memory, or any other medium capable of being used for computer readable carrying or storing data.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (9)

1.一种基于超高清视频的鸟群识别方法,其特征在于,所述方法包括以下步骤:1. A method for identifying bird flocks based on ultra-high-definition video, characterized in that the method comprises the following steps: 同步获取目标区域的视频流和音频流,并基于所述音频流中提取的频谱特征动态调整所述视频流的采集参数;Synchronously acquiring a video stream and an audio stream of a target area, and dynamically adjusting acquisition parameters of the video stream based on spectral features extracted from the audio stream; 对动态调整后的所述视频流进行时空特征提取,输出鸟类目标的空间坐标和视觉置信度;Extracting spatiotemporal features from the dynamically adjusted video stream, and outputting spatial coordinates and visual confidence of the bird target; 对所述音频流进行时频特征提取和声源定位,输出声源方位角和声学置信度;Extracting time-frequency features and locating sound sources from the audio stream, and outputting sound source azimuth and acoustic confidence; 基于所述空间坐标和所述声源方位角进行空间一致性匹配,当两者空间距离小于预设阈值时判定为有效候选区域;Performing spatial consistency matching based on the spatial coordinates and the sound source azimuth, and determining the region as a valid candidate region when the spatial distance between the two is less than a preset threshold; 对所述有效候选区域的视觉置信度和声学置信度进行加权计算,当加权计算得到的融合置信度超过判定阈值时,输出目标区域存在鸟类的识别结果。A weighted calculation is performed on the visual confidence and the acoustic confidence of the valid candidate area, and when the fusion confidence obtained by the weighted calculation exceeds a determination threshold, an identification result indicating that there is a bird in the target area is output. 2.根据权利要求1所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述同步获取目标区域的视频流和音频流,并基于所述音频流中提取的频谱特征动态调整所述视频流的采集参数,包括:2. The method for identifying bird flocks based on ultra-high-definition video according to claim 1, characterized in that the synchronous acquisition of the video stream and the audio stream of the target area and the dynamic adjustment of the acquisition parameters of the video stream based on the spectrum features extracted from the audio stream include: 提取所述音频流中预设高频段和预设低频段的能量分布比例;Extracting the energy distribution ratio of a preset high frequency band and a preset low frequency band in the audio stream; 当所述高频段的能量分布比例超过第一阈值时,按照预设的比例系数提升所述视频流的分辨率和帧率;When the energy distribution ratio of the high frequency band exceeds a first threshold, the resolution and frame rate of the video stream are increased according to a preset proportionality coefficient; 当所述低频段的能量分布比例超过第二阈值时,按照预设的比例系数降低所述视频流的分辨率和帧率;When the energy distribution ratio of the low frequency band exceeds a second threshold, reducing the resolution and frame rate of the video stream according to a preset proportionality coefficient; 若所述高频段与低频段的能量分布比例均未超过对应阈值,则维持当前的采集参数不变。If the energy distribution ratios of the high frequency band and the low frequency band do not exceed the corresponding thresholds, the current acquisition parameters are maintained unchanged. 3.根据权利要求2所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述对动态调整后的所述视频流进行时空特征提取,输出鸟类目标的空间坐标和视觉置信度,包括:3. 
The method for identifying bird flocks based on ultra-high-definition video according to claim 2 is characterized in that the step of extracting spatiotemporal features from the dynamically adjusted video stream and outputting spatial coordinates and visual confidence of bird targets comprises: 对所述视频流的相邻的至少三帧进行时空联合建模,提取包含时间运动特征和空间纹理特征的融合特征;Performing spatiotemporal joint modeling on at least three adjacent frames of the video stream to extract fusion features including temporal motion features and spatial texture features; 基于所述融合特征,通过预训练的目标检测网络生成包含空间坐标及初始置信度的候选区域;Based on the fused features, a candidate region including spatial coordinates and initial confidence is generated through a pre-trained target detection network; 对所述候选区域进行非极大值抑制处理,输出最终鸟类目标的空间坐标及对应的视觉置信度。The candidate region is subjected to non-maximum suppression processing, and the spatial coordinates of the final bird target and the corresponding visual confidence are output. 4.根据权利要求2所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述对所述音频流进行时频特征提取和声源定位,输出声源方位角和声学置信度,包括:4. The method for bird flock recognition based on ultra-high-definition video according to claim 2, characterized in that the step of extracting time-frequency features and locating sound sources from the audio stream and outputting sound source azimuth and acoustic confidence comprises: 对所述音频流执行短时傅里叶变换,获取时频谱图;Performing short-time Fourier transform on the audio stream to obtain a time-frequency spectrum diagram; 在所述时频谱图上检测鸟类叫声特征,标记潜在声源;detecting bird call features on the time-frequency spectrum diagram and marking potential sound sources; 通过分析至少两个麦克风接收到的声音信号的时间差,确定所述潜在声源的声源方位角;Determine the sound source azimuth of the potential sound source by analyzing the time difference of the sound signals received by at least two microphones; 基于所述时频谱图与预设的鸟类声音模板之间的相似度,并结合所述声源方位角计算并输出声学置信度。Based on the similarity between the time-frequency spectrum diagram and a preset bird sound template, and in combination with the sound source azimuth, the acoustic confidence is calculated and output. 5.根据权利要求4所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述基于所述空间坐标和所述声源方位角进行空间一致性匹配,当两者空间距离小于预设阈值时判定为有效候选区域,包括:5. 
The method for bird flock recognition based on ultra-high-definition video according to claim 4 is characterized in that the spatial consistency matching is performed based on the spatial coordinates and the sound source azimuth, and when the spatial distance between the two is less than a preset threshold, it is determined as a valid candidate area, including: 将所述空间坐标映射到视频画面的二维平面坐标系,并将所述声源方位角转换为所述二维平面坐标系下的投影坐标;Mapping the spatial coordinates to a two-dimensional plane coordinate system of a video screen, and converting the sound source azimuth angle into a projection coordinate in the two-dimensional plane coordinate system; 计算所述空间坐标与投影坐标之间的欧氏距离,当所述欧氏距离小于或等于预设阈值时判定为空间一致性匹配;Calculating the Euclidean distance between the spatial coordinates and the projection coordinates, and determining that the spatial consistency match is achieved when the Euclidean distance is less than or equal to a preset threshold; 对匹配成功的有效候选区域进行置信度修正,其中:The confidence of the successfully matched valid candidate area is corrected, where: 当所述欧氏距离小于所述预设阈值的50%时,按预设的置信度提升比例增强视觉置信度和声学置信度;When the Euclidean distance is less than 50% of the preset threshold, enhancing the visual confidence and the acoustic confidence according to a preset confidence enhancement ratio; 当所述欧氏距离处于所述预设阈值的50%至100%时,根据所述欧氏距离与所述预设阈值的比例关系对置信度进行动态衰减。When the Euclidean distance is between 50% and 100% of the preset threshold, the confidence is dynamically attenuated according to a proportional relationship between the Euclidean distance and the preset threshold. 6.根据权利要求5所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述动态衰减包括:6. The method for bird flock recognition based on ultra-high definition video according to claim 5, wherein the dynamic attenuation comprises: 计算所述欧氏距离与预设阈值的归一化比例值,记为衰减系数;Calculate the normalized ratio of the Euclidean distance to a preset threshold value, and record it as an attenuation coefficient; 对所述有效候选区域的视觉置信度和声学置信度分别进行衰减计算;Performing attenuation calculation on the visual confidence and the acoustic confidence of the valid candidate area respectively; 当衰减后的视觉置信度和声学置信度均低于预设的置信度下限时,剔除所述有效候选区域。When the attenuated visual confidence and acoustic confidence are both lower than the preset confidence lower limit, the valid candidate area is eliminated. 7.根据权利要求6所述的一种基于超高清视频的鸟群识别方法,其特征在于,所述对所述有效候选区域的视觉置信度和声学置信度进行加权计算,当加权计算得到的融合置信度超过判定阈值时,输出目标区域存在鸟类的识别结果,包括:7. 
7. The bird flock recognition method based on ultra-high-definition video according to claim 6, characterized in that the step of performing a weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and outputting a recognition result that birds are present in the target region when the fusion confidence obtained by the weighted calculation exceeds the determination threshold, comprises:
calculating a dynamic weight allocation coefficient based on the ratio of the Euclidean distance to the preset threshold, wherein the weight coefficient of the visual confidence is linearly related to the reciprocal of the ratio;
performing a weighted fusion of the dynamic weight allocation coefficient with the corrected visual confidence and acoustic confidence, expressed in terms of the fusion confidence Cf, the Euclidean distance d, the preset threshold T, and the corrected visual confidence Cv and corrected acoustic confidence Ca;
determining that a bird target is present in the valid candidate region and outputting the recognition result when the fusion confidence is greater than or equal to the determination threshold; otherwise, determining that no bird target is present in the valid candidate region.
8. The bird flock recognition method based on ultra-high-definition video according to claim 7, characterized in that the determination threshold is dynamically adjusted according to the acquisition parameters of the video stream, specifically comprising:
determining the determination threshold by table lookup from a pre-established mapping table between video stream acquisition parameters and determination thresholds, wherein the mapping table contains the optimal determination thresholds corresponding to different combinations of resolution and frame rate.
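The fusion formula in claim 7 appears as an image in the published document and does not survive text extraction, so the sketch below uses one plausible instantiation consistent with the stated constraint that the visual weight is linear in the reciprocal of d/T; the lookup table of claim 8 is likewise populated with made-up example values.

```python
# Claim 8: assumed example mapping from (resolution, frame rate) to the
# optimal determination threshold; the real table is built empirically.
THRESHOLD_TABLE = {
    ((3840, 2160), 30): 0.68,
    ((3840, 2160), 60): 0.72,
    ((7680, 4320), 30): 0.75,
}

def fuse_confidence(d, T, vis_conf, ac_conf, a=0.5):
    """Claim 7: weighted fusion of the corrected confidences. The visual
    weight w_v = a * (T / d), clamped to [0, 1], is one assumed reading of
    'linearly related to the reciprocal of the ratio d / T'; the published
    formula itself is not recoverable from the text."""
    w_v = min(1.0, a * T / max(d, 1e-6))
    return w_v * vis_conf + (1.0 - w_v) * ac_conf

def birds_present(d, T, vis_conf, ac_conf, resolution, fps):
    """Claims 7 and 8: compare the fused confidence against a threshold
    looked up from the acquisition parameters (0.7 is an assumed default
    for combinations missing from the table)."""
    threshold = THRESHOLD_TABLE.get((resolution, fps), 0.7)
    return fuse_confidence(d, T, vis_conf, ac_conf) >= threshold
```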
9. A bird flock recognition system based on ultra-high-definition video, configured to execute the bird flock recognition method based on ultra-high-definition video according to any one of claims 1 to 8, characterized in that the system comprises:
a parameter dynamic adjustment module, configured to synchronously acquire the video stream and the audio stream of the target region, and to dynamically adjust the acquisition parameters of the video stream based on spectral features extracted from the audio stream;
a video stream processing module, configured to perform spatiotemporal feature extraction on the dynamically adjusted video stream and output the spatial coordinates and visual confidence of bird targets;
an audio stream processing module, configured to perform time-frequency feature extraction and sound source localization on the audio stream and output the sound source azimuth and acoustic confidence;
a region determination module, configured to perform spatial consistency matching based on the spatial coordinates and the sound source azimuth, and to determine a valid candidate region when the spatial distance between the two is less than the preset threshold;
a result output module, configured to perform a weighted calculation on the visual confidence and the acoustic confidence of the valid candidate region, and to output a recognition result that birds are present in the target region when the fusion confidence obtained by the weighted calculation exceeds the determination threshold.
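Read as software architecture, claim 9 describes a five-stage pipeline. The following sketch wires the stages together end to end, reusing the illustrative helpers from the sketches above and stubbing the detector; it shows only the data flow the claim prescribes.

```python
import math

def detect_bird_targets(frames):
    """Hypothetical stand-in for the video stream processing module
    (claim 3: spatiotemporal modeling, detection network, NMS). Returns a
    fixed toy detection: centre-of-frame coordinates and a confidence."""
    return (1920.0, 1080.0), 0.8

def recognize_birds(frames, mic_left, mic_right, fs,
                    resolution=(3840, 2160), fps=30, T=400.0):
    """One pass through the five modules of claim 9, reusing the helpers
    sketched after the earlier claims (azimuth_from_tdoa,
    time_frequency_spectrogram, acoustic_confidence, project_azimuth,
    match_and_correct, birds_present). Only the inter-module data flow is
    taken from the claim; all defaults are assumptions, and the parameter
    dynamic adjustment module is omitted because it acts on the capture
    hardware rather than on this data path."""
    # Video stream processing module -> spatial coordinates, visual confidence.
    det_xy, vis_conf = detect_bird_targets(frames)
    # Audio stream processing module -> azimuth, acoustic confidence.
    azimuth = azimuth_from_tdoa(mic_left, mic_right, fs)
    _, _, spec = time_frequency_spectrogram(mic_left, fs)
    ac_conf = acoustic_confidence(spec, templates=[])
    # Region determination module -> spatial consistency matching.
    src_xy = project_azimuth(azimuth, frame_w=resolution[0], frame_h=resolution[1])
    corrected = match_and_correct(det_xy, src_xy, vis_conf, ac_conf, T)
    if corrected is None:
        return False, None             # no valid candidate region
    vis_c, ac_c = corrected
    # Result output module -> fused confidence vs. dynamically chosen threshold.
    d = math.dist(det_xy, src_xy)
    return birds_present(d, T, vis_c, ac_c, resolution, fps), det_xy
```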
CN202510502649.1A 2025-04-22 2025-04-22 Bird group identification method and system based on ultra-high definition video Active CN120012031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510502649.1A CN120012031B (en) 2025-04-22 2025-04-22 Bird group identification method and system based on ultra-high definition video

Publications (2)

Publication Number Publication Date
CN120012031A 2025-05-16
CN120012031B CN120012031B (en) 2025-07-18

Family

ID=95676669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510502649.1A Active CN120012031B (en) 2025-04-22 2025-04-22 Bird group identification method and system based on ultra-high definition video

Country Status (1)

Country Link
CN (1) CN120012031B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6567775B1 (en) * 2000-04-26 2003-05-20 International Business Machines Corporation Fusion of audio and video based speaker identification for multimedia information access
US20200137491A1 (en) * 2017-08-30 2020-04-30 Panasonic Intellectual Property Management Co., Ltd. Sound pickup device, sound pickup method, and program
CN110033787A (en) * 2018-01-12 2019-07-19 英特尔公司 Trigger the audio event of video analysis
CN110097568A (en) * 2019-05-13 2019-08-06 中国石油大学(华东) A Video Object Detection and Segmentation Method Based on Spatiotemporal Dual Branch Network
CN116684548A (en) * 2023-04-07 2023-09-01 中国农业银行股份有限公司 Monitoring method, monitoring device, electronic equipment and storage medium
CN118861987A (en) * 2024-08-02 2024-10-29 滨州魏桥国科高等技术研究院 Method and device for identifying bird species, and electronic equipment
CN119724232A (en) * 2024-12-30 2025-03-28 上海安勤智行汽车电子有限公司 Automobile ambient sound enhancement method, device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Xueying: "Research on Speaker Tracking Methods Based on Audio-Video Fusion", China Master's Theses Full-text Database, Information Science and Technology, no. 1, 15 January 2019 (2019-01-15), pages 138-2628 *

Also Published As

Publication number Publication date
CN120012031B (en) 2025-07-18

Similar Documents

Publication Publication Date Title
US9495591B2 (en) Object recognition using multi-modal matching scheme
US20200075012A1 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals
US10582117B1 (en) Automatic camera control in a video conference system
CN112560822B (en) Road sound signal classification method based on convolutional neural network
CN111601074A (en) Security monitoring method, device, robot and storage medium
WO2013157254A1 (en) Sound detecting apparatus, sound detecting method, sound feature value detecting apparatus, sound feature value detecting method, sound section detecting apparatus, sound section detecting method, and program
CN105554443B (en) The localization method and device in abnormal sound source in video image
CN108109617A (en) A kind of remote pickup method
CN102214298A (en) Method for detecting and identifying airport target by using remote sensing image based on selective visual attention mechanism
JP2021527853A (en) Wearable system utterance processing
CN114417908B (en) A UAV detection system and method based on multimodal fusion
WO2025035975A1 (en) Training method for speech enhancement network, speech enhancement method, and electronic device
CN112489674A (en) Speech enhancement method, device, equipment and computer readable storage medium
CN111401169A (en) Power supply business hall service personnel behavior identification method based on monitoring video information
CN115508821A (en) Multisource fuses unmanned aerial vehicle intelligent detection system
CN117762372A (en) Multi-mode man-machine interaction system
CN113093106A (en) Sound source positioning method and system
CN119919499A (en) UAV positioning method, equipment and medium based on multimodal fusion
CN120012031B (en) Bird group identification method and system based on ultra-high definition video
CN115174816A (en) Environmental noise sound source directional snapshot method and device based on microphone array
CN117768786A (en) Dangerous word sound source positioning camera monitoring system based on AI training
Kechichian et al. Model-based speech enhancement using a bone-conducted signal
WO2022227916A1 (en) Image processing method, image processor, electronic device, and storage medium
CN115767040B (en) 360-degree panoramic monitoring automatic cruising method based on interactive continuous learning
CN120598946B (en) Foreign body intrusion detection method and system for railway perimeter protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant