
CN120581021A - Audio zooming method, electronic device, storage medium and computer program product - Google Patents

Audio zooming method, electronic device, storage medium and computer program product

Info

Publication number
CN120581021A
CN120581021A (application CN202511086201.2A)
Authority
CN
China
Prior art keywords
signal
microphone
microphone array
audio
gain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511086201.2A
Other languages
Chinese (zh)
Inventor
袁斌
刘兵兵
侯天峰
蔡正辉
辛龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Inc
Original Assignee
Goertek Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Inc filed Critical Goertek Inc
Priority to CN202511086201.2A priority Critical patent/CN120581021A/en
Publication of CN120581021A publication Critical patent/CN120581021A/en
Pending legal-status Critical Current


Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract


The present application discloses an audio zoom method, electronic device, storage medium, and computer program product, relating to the field of signal processing technology. The method comprises: performing beamforming on a microphone array signal with a preset direction as the speech-enhancement direction to obtain an enhanced signal, wherein the preset direction is determined according to the shooting direction of the camera device; fusing the enhanced signal with at least one channel of the microphone array signal according to the video zoom multiple of the camera device, and obtaining a target output signal from the fused signal, wherein the larger the video zoom multiple, the larger the fusion proportion of the enhanced signal. By adopting simple beamforming and signal-fusion techniques, the present application requires few computing and memory resources and offers high algorithm robustness, making it possible to deploy an audio zoom algorithm in resource-constrained devices.

Description

Audio zooming method, electronic device, storage medium and computer program product
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to an audio zooming method, an electronic device, a storage medium, and a computer program product.
Background
With the development of audio processing technology, a new requirement has emerged for synchronized audio and video zooming: when a user shoots video with an electronic device, the audio should zoom in step with the video. The current mainstream approach uses a deep neural network (DNN) to implement the audio zooming function, adjusting the audio signal in step with changes in the video focal length through model processing. However, a DNN-based scheme occupies substantial computing and memory resources and relies on large-scale training data to ensure model robustness, making it difficult to deploy on resource-constrained devices.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the present application is to provide an audio zooming method, an electronic device, a storage medium, and a computer program product, aiming to solve the technical problem that current schemes realize synchronized audio and video zooming with a deep neural network, which occupies high computing and memory resources and requires large-scale training data.
To achieve the above object, the present application proposes an audio zooming method including:
carrying out beam forming processing on the microphone array signals by taking a preset direction as a voice enhancement direction to obtain enhancement signals, wherein the preset direction is determined according to the shooting direction of the camera device;
and fusing the enhancement signal with at least one channel of the microphone array signal according to the video zoom multiple of the camera device, and obtaining a target output signal according to the fused signal, wherein the larger the video zoom multiple, the larger the fusion proportion corresponding to the enhancement signal.
Optionally, the step of obtaining the target output signal according to the fused signal includes:
calculating the normalized phase difference of two signals of the microphone array signals;
determining a first gain coefficient from the normalized phase difference;
and performing gain control on the fused signal using the first gain coefficient to obtain a first gain signal, and obtaining a target output signal according to the first gain signal.
Optionally, the step of obtaining the target output signal according to the first gain signal includes:
determining a second gain coefficient according to the video zoom multiple, wherein the larger the video zoom multiple, the larger the second gain coefficient;
and performing gain control on the first gain signal using the second gain coefficient to obtain a second gain signal, and obtaining a target output signal according to the second gain signal.
Optionally, the step of calculating the normalized phase difference of two signals of the microphone array signals includes:
calculating a normalized phase difference between a first signal and a second signal, wherein the first signal is the signal corresponding to a first microphone in the microphone array signal, the second signal is the signal corresponding to a second microphone in the microphone array signal, and the first and second microphones are two microphones in the microphone array whose spacing along the preset direction is greater than a preset threshold.
Optionally, the step of fusing the enhancement signal with at least one signal in the microphone array signal according to the video zoom multiple of the image capturing device, and obtaining the target output signal according to the fused signal includes:
fusing the enhanced signal with a third signal according to the video zoom multiple, and obtaining a target output signal of a left channel according to the fused signal, wherein the third signal is a signal corresponding to a third microphone in the microphone array signal;
and fusing the enhancement signal with a fourth signal according to the video zoom multiple, and obtaining a target output signal of a right channel according to the fused signal, wherein the fourth signal is a signal corresponding to a fourth microphone in the microphone array signal, and the distance between the third microphone and a left channel loudspeaker is shorter than the distance between the fourth microphone and the left channel loudspeaker.
Optionally, the step of fusing the enhancement signal with at least one signal of the microphone array signal according to a video zoom multiple of the image capturing device includes:
acquiring an adjusting factor;
determining a fusion proportion corresponding to the enhancement signal according to the adjustment factor and the video zoom multiple, wherein, for an unchanged video zoom multiple, the smaller the adjustment factor, the smaller the fusion proportion;
and fusing the enhancement signal with at least one path of signal in the microphone array signal according to the fusion proportion.
Optionally, the step of obtaining the adjustment factor includes:
receiving a user adjustment instruction, and determining the adjustment factor according to the user adjustment instruction.
In addition, to achieve the above object, the present application also proposes an audio zooming apparatus including:
The signal enhancement module is used for carrying out beam forming processing on the microphone array signals by taking a preset direction as a voice enhancement direction to obtain enhancement signals, wherein the preset direction is determined according to the shooting direction of the shooting device;
and the signal fusion module is used for fusing the enhancement signal with at least one channel of the microphone array signal according to the video zoom multiple of the camera device, and obtaining a target output signal according to the fused signal, wherein the larger the video zoom multiple, the larger the fusion proportion corresponding to the enhancement signal.
In addition, in order to achieve the above object, the application also proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the audio zooming method as described above.
In addition, to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the audio zooming method as described above.
Furthermore, to achieve the above object, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the audio zooming method as described above.
One or more technical schemes provided by the application have at least the following technical effects:
Further, the enhancement signal is fused with at least one channel of the microphone array signal according to the video zoom multiple of the camera device, and the target output signal — an audio signal zoomed in sync with the video — is obtained from the fused signal. Because the target output signal, compared with the original microphone array signal, mixes in an enhancement signal that strengthens the shooting direction, the user perceives the shooting-direction audio as more prominent after audio zooming than without it. Since a larger video zoom multiple is set to yield a larger fusion proportion of the enhancement signal, the shooting-direction signal follows the video zoom multiple with 'zoom in' and 'zoom out' effects: when the video zooms in, the audio 'zooms in' as well, and when the video zooms out, the audio 'zooms out', achieving synchronized audio-video zooming. Meanwhile, compared with schemes that realize synchronized audio-video zooming with a deep neural network, the present scheme adopts simple beamforming and signal-fusion techniques, occupies few computing and memory resources, and has high algorithm robustness, making it possible to deploy the audio zoom algorithm on resource-constrained devices.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a first embodiment of an audio zooming method according to the present application;
fig. 2 is a schematic view of a shooting direction of an image capturing apparatus according to a first embodiment of the present application;
fig. 3 is a schematic diagram of a microphone arrangement of a binaural device according to a first embodiment of the application;
fig. 4 is a general frame diagram of an audio zooming method according to a first embodiment of the present application;
FIG. 5 is a schematic block diagram of an audio zooming device according to the present application;
Fig. 6 is a schematic device structure diagram of a hardware operating environment related to an audio zooming method according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
The current mainstream approach uses a deep neural network (DNN) to implement the audio zooming function, adjusting the audio signal in step with changes in the video focal length through model processing. However, a DNN-based scheme occupies substantial computing and memory resources and relies on large-scale training data to ensure model robustness, making it difficult to deploy on resource-constrained devices.
The embodiment of the present application provides a solution in which a preset direction, determined according to the shooting direction of the camera device, serves as the speech-enhancement direction: the microphone array signal undergoes beamforming to obtain an enhanced signal, so that the sound source in the video shooting direction is enhanced. Because the target output signal, compared with the original microphone array signal, mixes in an enhancement signal that strengthens the shooting direction, the user perceives the shooting-direction audio as more prominent after audio zooming than without it. Since a larger video zoom multiple is set to yield a larger fusion proportion of the enhancement signal, the shooting-direction signal follows the video zoom multiple with 'zoom in' and 'zoom out' effects: when the video zooms in, the audio 'zooms in' as well, and when the video zooms out, the audio 'zooms out', achieving synchronized audio-video zooming. Meanwhile, compared with schemes that realize synchronized audio-video zooming with a deep neural network, the present scheme adopts simple beamforming and signal-fusion techniques, occupies few computing and memory resources, and has high algorithm robustness, making it possible to deploy the audio zoom algorithm on resource-constrained devices.
It should be noted that, the execution body of each embodiment of the audio zooming method of the present application may be an electronic device with functions of data processing, network communication and program running, such as a tablet computer, a personal computer, a mobile phone, etc., where the electronic device has an internal or external camera device and a microphone array.
Referring to fig. 1, fig. 1 is a flowchart illustrating a first embodiment of an audio zooming method according to the present application.
In this embodiment, the audio zooming method includes steps S10 to S20:
step S10, carrying out beam forming processing on the microphone array signals by taking a preset direction as a voice enhancement direction to obtain enhancement signals, wherein the preset direction is determined according to the shooting direction of the shooting device;
The microphone array is a set of multiple microphones in a certain arrangement, used to capture sound signals from different directions simultaneously. The sound signal captured by each microphone is the waveform of a sound wave varying over time, represented as a time-domain signal; the multichannel speech signals collected by the microphones at the same moment can be read separately to form the microphone array signal. The enhanced signal obtained after beamforming is a single-channel speech signal (or multi-channel, but usually focused into one main output channel), with a higher signal-to-noise ratio (the ratio between the useful signal and background noise) and higher speech quality in the preset direction.
Beamforming refers to spatial filtering of the microphone array signal toward the speech-enhancement direction (the preset direction); it can operate on frequency-domain or time-domain signals. For time-domain spatial filtering, time delays can be applied directly to the signals of different microphones so that signals from the preset direction are time-aligned and summed, while signals from other directions are partially cancelled by the delays, yielding the enhanced signal. For frequency-domain spatial filtering, a short-time Fourier transform (STFT) converts the microphone array signal into the frequency domain; then, for each frequency bin, a frequency-dependent filter performs a weighted sum over the microphone signals, enhancing the target direction and suppressing the others, yielding the enhanced signal.
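As an illustration of the time-domain variant described above, the following sketch implements delay-and-sum beamforming toward a preset azimuth for a linear microphone array. It is illustrative code under stated assumptions (uniform linear array, far-field source, integer-sample delays), not the patent's implementation; all function and parameter names are assumptions:

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, azimuth_rad, fs, c=343.0):
    """Time-domain delay-and-sum beamforming toward a preset direction.

    mic_signals:   (M, N) array, one row per microphone channel
    mic_positions: (M,) microphone coordinates in meters along the array axis
    azimuth_rad:   speech-enhancement direction (0 = broadside)
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    M, N = mic_signals.shape
    # Per-microphone delay so that a wavefront arriving from the preset
    # direction becomes time-aligned across all channels.
    delays = mic_positions * np.sin(azimuth_rad) / c   # seconds
    delays -= delays.min()                             # keep all delays >= 0
    out = np.zeros(N)
    for m in range(M):
        shift = int(round(delays[m] * fs))             # integer-sample delay
        out[: N - shift] += mic_signals[m, shift:]     # align and sum
    return out / M                                     # enhanced single channel
```

A frequency-domain variant would instead apply per-bin complex weights to the STFT of each channel, which allows fractional delays and frequency-dependent suppression.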
The preset direction refers to the direction on which speech-signal enhancement is focused during microphone array signal processing, usually expressed as an azimuth angle, pitch angle, and the like in a spatial coordinate system. It can be prestored in the electronic device or determined dynamically according to the shooting direction of the imaging device. In addition, the preset direction may also be determined by combining the shooting direction with image analysis. For example, in a multi-person scene, although the imaging device captures a wide scene, the person who is speaking can be identified through face detection and lip-movement detection; the direction of that person relative to the imaging device is determined and used preferentially as the preset direction for beamforming, so that the signal from that direction is more prominent, realizing a 'zoom in' effect.
The imaging device refers to imaging equipment comprising an optical lens and an image sensor. The shooting direction refers to the direction of the optical axis of the imaging device in space, determined by the rotation angles of the imaging device (including a yaw angle and a pitch angle) and its position in a device or world coordinate system: the yaw angle represents the rotation angle in the horizontal plane and corresponds to the azimuth angle of the preset direction, while the pitch angle represents the tilt angle in the vertical plane and corresponds to the pitch angle of the preset direction. In practical applications, the shooting direction of the camera device can be obtained through a mechanical structure of the camera device, such as a gimbal angle sensor.
Optionally, the electronic device may automatically set a specific target direction (preset direction) for enhancing the voice signal of the microphone array in real time according to the shooting direction of the image pickup device, and then the electronic device performs beam forming processing on the microphone array signal by taking the specific target direction as the voice enhancing direction to obtain an enhanced signal in which the voice is significantly enhanced in the specific target direction and the interference noise in other directions is effectively suppressed.
Alternatively, when the image pickup apparatus and the microphone array are in the same device, the acquired shooting direction of the image pickup apparatus can be used directly as the preset direction of the beamforming process. Because the camera device and the microphone array remain relatively static, the relative pose (position and orientation) between their two coordinate systems is fixed; after a unified device coordinate system is established, rotating the camera only changes its orientation without affecting its positional relationship with the microphone array, so the preset direction of the microphone array is not disturbed by the rotation.
Optionally, in the case that the image capturing device and the microphone array are not in the same device, the position relationship between the image capturing device and the microphone array may be further combined, and the capturing direction may be converted from the coordinate system of the image capturing device itself to the coordinate system of the microphone array, so as to obtain the preset direction of the beam forming process.
For example, referring to fig. 2, fig. 2 provides a schematic view of the shooting direction of an image capturing device. Assume a microphone array is built into an electronic device while the image pickup device can move independently directly above it; a spatial coordinate system is established based on the electronic device, with the plane perpendicular to the direction of gravity taken as the xOy plane. As the user gradually shifts from shooting speaker A to shooting speaker B at an unchanged shooting height, the operator usually needs to adjust the orientation of the image pickup device to keep the shooting target at the center of the frame or at a specific position, i.e., change the yaw angle. The pitch angle of the image pickup device thus remains unchanged while the yaw angle gradually changes from alpha to beta; the image pickup device synchronizes this yaw-angle change to the microphone array, which can then determine that the azimuth of the preset direction changes from alpha to beta while the pitch angle is unchanged.
It can be understood that performing beamforming on the microphone array signal with the preset direction as the speech-enhancement direction strengthens the speech energy in the preset direction and makes signals from that direction more prominent. Meanwhile, the preset direction (speech-enhancement direction) adjusts automatically with the shooting direction of the camera device; that is, the 'hearing' focus follows the camera's 'line of sight' automatically, without requiring the user to manually adjust a microphone or designate a speaker, providing a seamless speech-enhancement experience in dynamic scenes.
And step S20, fusing the enhancement signal with at least one path of signal in the microphone array signal according to the video zooming multiple of the image pickup device, and obtaining a target output signal according to the fused signal, wherein the larger the video zooming multiple is, the larger the corresponding fusion proportion of the enhancement signal is.
Optionally, the video zoom multiple characterizes the magnification of the camera relative to the currently captured scene, directly reflecting how far the current camera lens is 'pulled in'; it is typically expressed as a multiple of the 1x base wide-angle view, such as 2x or 5x.
Optionally, the target output signal refers to an audio signal after audio zooming.
Optionally, after the original microphone array signal and the beamformed enhancement signal are obtained, the video zoom multiple of the image pickup device is introduced as a key control parameter to dynamically fuse the enhancement signal with at least one channel of the microphone array signal. The larger the video zoom multiple (the more the lens is 'pulled in'), the larger the proportion (fusion proportion) occupied by the enhancement signal; conversely, the smaller the video zoom multiple, the smaller the relative proportion of the enhancement signal. Finally, a target output signal that zooms in sync with the video shot by the camera device is generated from the fused signal.
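The zoom-controlled fusion described above can be sketched as follows. The linear mapping from zoom multiple to fusion proportion is an assumed example, since the embodiment only requires that the proportion grow with the zoom multiple (and, in the adjustment-factor variant, shrink with a smaller adjustment factor); `zoom_max` and all names are illustrative assumptions:

```python
import numpy as np

def fuse(enhanced, mic_signal, zoom, zoom_max=10.0, adjust=1.0):
    """Fuse the beamformed enhanced signal with one raw microphone channel.

    enhanced:   enhanced (beamformed) signal, same shape as mic_signal
    mic_signal: one channel of the microphone array signal
    zoom:       current video zoom multiple (e.g. 1.0, 2.0, 5.0)
    adjust:     user adjustment factor in [0, 1]; smaller -> smaller ratio
    """
    # Fusion proportion grows monotonically with the zoom multiple.
    ratio = adjust * min(zoom, zoom_max) / zoom_max
    return ratio * enhanced + (1.0 - ratio) * mic_signal
```

At `zoom == zoom_max` and `adjust == 1.0` the output is the enhanced signal alone; at small zoom multiples the raw microphone channel dominates, so the original sound field is largely preserved.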
Optionally, after the fused signal is obtained, gain control may further be applied, for example via automatic gain control (Automatic Gain Control, AGC), to reduce volume inconsistency, so that the amplitude of the generated target output signal stays consistent across different environments and devices, achieving stable output. In addition, further optimization may be applied, such as adjusting the signal amplitude or encoding the signal, to ensure that the generated target output signal meets playback or transmission standards.
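A minimal sketch of the gain-control step mentioned above, using a simple RMS-tracking automatic gain control. The smoothing constant and target level are illustrative assumptions, not values from the patent:

```python
import numpy as np

def simple_agc(signal, target_rms=0.1, alpha=0.99, eps=1e-8):
    """Sample-by-sample AGC sketch: track a smoothed mean-square estimate
    of the input and scale each sample toward a target RMS level."""
    out = np.empty(len(signal), dtype=float)
    rms2 = target_rms ** 2                    # running mean-square estimate
    for n, x in enumerate(signal):
        rms2 = alpha * rms2 + (1 - alpha) * x * x
        out[n] = x * target_rms / (np.sqrt(rms2) + eps)
    return out
```

A production implementation would typically operate frame-wise with attack/release time constants and a limiter, but the principle — divide by a smoothed level estimate — is the same.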
It can be understood that controlling the audio fusion proportion via the video zoom multiple means that when the lens zooms in (the zoom multiple increases) and the target becomes a visual close-up, increasing the fusion proportion of the enhancement signal makes the target speech sound more prominent and closer to the ear, realizing synchronized audio-video zooming and further improving audio-visual perceptual consistency.
In one possible embodiment, step S20 includes:
Step S21, fusing the enhancement signal with a third signal according to the video zoom multiple, and obtaining a target output signal of a left channel according to the fused signal, wherein the third signal is a signal corresponding to a third microphone in the microphone array signal;
step S22, fusing the enhancement signal with the fourth signal according to the video zoom multiple, and obtaining the target output signal of the right channel according to the fused signal, wherein the fourth signal is the signal corresponding to a fourth microphone in the microphone array signal, and the distance between the third microphone and the left channel loudspeaker is shorter than the distance between the fourth microphone and the left channel loudspeaker.
The third microphone and the fourth microphone are any two different microphones in the microphone array, where the third microphone is physically closer to the left channel speaker, while the fourth microphone is farther from the left channel speaker and closer to the right channel speaker than the third microphone. Correspondingly, the third signal and the fourth signal are each one channel of the microphone array signal: the third signal is the single-channel audio signal collected and output by the microphone closer to the left channel speaker, and the fourth signal is the single-channel audio signal collected and output by the microphone closer to the right channel speaker.
Optionally, after the third microphone and the fourth microphone collect the original signals, the original signals may be directly output as the corresponding third signals and fourth signals, or the collected original signals may be subjected to preprocessing operations such as noise reduction and filtering, and then the corresponding third signals and fourth signals are output.
The left channel speaker is the speaker unit, built into or paired with the device, that is specifically responsible for playing the left channel audio signal (the target output signal of the left channel) in a stereo playback device (such as headphones or a sound system), and the right channel speaker is the corresponding speaker unit specifically responsible for playing the right channel audio signal (the target output signal of the right channel). Left and right channel speakers are typically placed on the left and right sides of the device or oriented toward the user's position.
Referring to fig. 3, which provides an exemplary microphone arrangement of a two-channel device: the bottom of the device includes a left channel speaker and a right channel speaker, and each circle in fig. 3 represents a single microphone channel. Since microphone a is closer to the left channel speaker than microphone b is, microphone a can be selected as the third microphone and microphone b as the fourth microphone. The enhancement signal is then fused with the signal corresponding to microphone a according to the video zoom multiple, and the target output signal of the left channel is obtained from the fused signal and played through the left channel speaker; meanwhile, the enhancement signal is fused with the signal corresponding to microphone b according to the video zoom multiple, and the target output signal of the right channel is obtained from the fused signal and played through the right channel speaker.
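Steps S21 and S22 above can be sketched as follows, with `third_sig` and `fourth_sig` standing for the signals of microphones a and b in fig. 3. The linear zoom-to-proportion mapping and all names are illustrative assumptions:

```python
import numpy as np

def fuse_stereo(enhanced, third_sig, fourth_sig, zoom, zoom_max=10.0):
    """Produce left/right target output signals from the shared enhanced
    signal and the raw microphones nearest each speaker.

    third_sig:  channel of the microphone nearest the left channel speaker
    fourth_sig: channel of the microphone nearest the right channel speaker
    """
    ratio = min(zoom, zoom_max) / zoom_max     # fusion proportion of enhanced
    left = ratio * enhanced + (1.0 - ratio) * third_sig
    right = ratio * enhanced + (1.0 - ratio) * fourth_sig
    return left, right
```

Because each output channel keeps a share of the raw microphone nearest the corresponding speaker, the left/right spatial impression survives the fusion, while the shared enhanced component grows with the zoom multiple.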
It can be understood that the spatial orientation sense of the sound scene can be restored by independently using the fusion signals of the microphone signals close to the corresponding side speakers in the left and right channels, thereby improving the auditory sense when the audio is played.
In this embodiment, the audio signal in the shooting direction is enhanced so that signals from the shooting direction are more prominent, and the enhancement signal is fused with at least one channel of the microphone array according to the video zoom multiple. The larger the video zoom multiple, the larger the fusion proportion of the enhancement signal and the more prominent the enhancement signal in the fused signal, realizing an audio 'zoom in' effect and thus synchronized audio-video zooming. Meanwhile, compared with schemes that realize synchronized audio-video zooming with a deep neural network, the present scheme adopts simple beamforming and signal-fusion techniques, occupies few computing and memory resources, and has high algorithm robustness, making it possible to deploy the audio zoom algorithm on resource-constrained devices.
Based on the above-described first embodiment, a second embodiment of the audio zooming method of the present application is proposed. In this embodiment, the same or similar contents as those of the first embodiment may be referred to the description above, and will not be repeated. In this embodiment, the step of obtaining the target output signal according to the fused signal in step S20 includes:
Step A10, calculating the normalized phase difference of two paths of signals in the microphone array signals;
In a possible embodiment, phase extraction, difference calculation, and normalization are performed on any two signals in the microphone array to obtain a phase-difference value within the range [-1, 1], namely the normalized phase difference.
The phase difference is the difference between the phase angles of the two signals at the same frequency bin, and the normalized phase difference is the result of normalizing the raw phase difference so that its value range is independent of the specific frequency f and the distance d between the microphones to which the two signals respectively belong.
Optionally, the raw phase difference can be determined from the difference between the phase angles of any two signals in the microphone array, and the raw phase difference can then be normalized to obtain the normalized phase difference.
Taking as an example the two signals from the microphones numbered x and y in the microphone array signal, the normalized phase difference can be calculated by the following formulas:

$$\Delta\phi_{xy}(i,k) = \theta_x(i,k) - \theta_y(i,k)$$

$$\overline{\Delta\phi}_{xy}(i,k) = \frac{c}{2\pi f d}\,\Delta\phi_{xy}(i,k)$$

wherein x and y represent the numbers of the microphones, i is the frame index, and k is the frequency index; $\Delta\phi_{xy}(i,k)$ represents the phase difference of the two microphone signals, and $\overline{\Delta\phi}_{xy}(i,k)$ represents their normalized phase difference; $\theta_x(i,k)$ indicates a phase angle, e.g. the phase angle of frequency bin k in the i-th frame of the signal corresponding to the microphone numbered x; c is the speed of sound, f is the frequency of bin k, and d is the distance between the two microphones numbered x and y.
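The normalized phase difference above can be computed per time-frequency unit as in the following sketch (the function name, the (frames, bins) array layout, and the 343 m/s sound speed are our assumptions):

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def normalized_phase_diff(X_x, X_y, freqs, d):
    """Normalized phase difference of two STFT signals of shape
    (frames, bins); `freqs` holds the bin frequencies in Hz and `d`
    the spacing in metres between microphones x and y."""
    raw = np.angle(X_x) - np.angle(X_y)        # phase difference per bin
    raw = np.angle(np.exp(1j * raw))           # wrap into (-pi, pi]
    denom = 2.0 * np.pi * freqs * d / C        # largest physical phase lag
    denom = np.where(denom == 0.0, np.finfo(float).eps, denom)
    return raw / denom                          # roughly within [-1, 1]

# A tone whose wavefront hits mic x exactly d/c seconds before mic y
# (end-fire incidence) should yield a normalized phase difference of 1.
f, d = 1000.0, 0.1
phi = 2.0 * np.pi * f * d / C
X_x = np.array([[np.exp(1j * 0.0)]])
X_y = np.array([[np.exp(-1j * phi)]])
npd = normalized_phase_diff(X_x, X_y, np.array([f]), d)
```

Because the normalization divides by the largest physically possible phase lag 2πfd/c, the result is comparable across frequencies and microphone spacings.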
In one possible embodiment, step a10 includes:
And step A11, calculating a normalized phase difference of a first signal and a second signal, wherein the first signal is a signal corresponding to a first microphone in the microphone array signals, the second signal is a signal corresponding to a second microphone in the microphone array signals, and the first microphone and the second microphone are two microphones with a distance greater than a preset threshold value in a preset direction in the microphone array.
The distance between the first microphone and the second microphone in the preset direction refers to the projection length of the connecting line segment of the first microphone and the second microphone in the preset direction.
Alternatively, the first microphone and the second microphone may be any two microphones in the microphone array, wherein "first" and "second" merely distinguish different microphones and are not fixed microphone numbers. The electronic device may select any two microphones whose distance in the preset direction is greater than the preset threshold and determine them as the first microphone and the second microphone, respectively.
Alternatively, the first microphone and the second microphone may be two microphones that are manually defined, and the distance between the first microphone and the second microphone in the preset direction is greater than a preset threshold.
Optionally, the preset threshold is used to screen microphone pairs whose distance in the preset direction is greater than the threshold; the threshold may be 0 or may be determined according to the distribution of the microphone array, and this embodiment is not specifically limited thereto.
Optionally, in order to distinguish signals corresponding to different microphones, a signal corresponding to a first microphone in the microphone array signals is referred to as a first signal, and a signal corresponding to a second microphone in the microphone array signals is referred to as a second signal.
Optionally, the signals (including the first signal and the second signal) corresponding to the microphone may be directly collected by the microphone array, or may be obtained by preprocessing an original signal collected by the microphone by filtering, noise reduction, and the like.
In this embodiment, reference may be made to the above embodiment of step a10 for the embodiment of the normalized phase difference of the two signals, which is not described herein.
Step A20, determining a first gain coefficient according to the normalized phase difference;
Optionally, the gain coefficient refers to a scale factor for adjusting the signal amplitude, and the value of the gain coefficient may be preset or determined according to parameters such as the signal amplitude, the phase difference, the signal-to-noise ratio, and the like. In the following, for illustration, a gain coefficient determined from the normalized phase difference is referred to as a first gain coefficient.
Alternatively, the first gain factor may be determined from the normalized phase difference according to a gain mapping function between a preset normalized phase difference and the gain factor.
Illustratively, the gain mapping function between the normalized phase difference and the gain coefficient can be expressed as:

$$G_1(i,k) = \begin{cases} 1, & \left|\overline{\Delta\phi}_{xy}(i,k)\right| < \tau \\ 0.2, & \text{otherwise} \end{cases}$$

wherein $G_1(i,k)$ represents the first gain coefficient corresponding to frequency bin k in the i-th frame of the audio signal, $\overline{\Delta\phi}_{xy}(i,k)$ represents the normalized phase difference of the two microphone signals, and $\tau$ denotes the decision threshold: when the normalized phase difference is smaller than the threshold, the first gain coefficient is determined to be 1; in other cases, the first gain coefficient is determined to be 0.2. A normalized phase difference smaller than the threshold reflects a small actual time difference of the sound wave reaching the two microphones, so a larger gain coefficient can be set to enhance the signal; the values of the first gain coefficient under different conditions can be determined as required, and this embodiment is not specifically limited thereto.
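A sketch of this soft-mask gain under assumed values (the threshold 0.2 and all names are illustrative; the extracted text does not fix the threshold):

```python
import numpy as np

def first_gain(npd, tau=0.2, g_keep=1.0, g_att=0.2):
    """Per time-frequency soft-mask gain: keep bins whose normalized
    phase difference magnitude is below the threshold, attenuate the rest."""
    return np.where(np.abs(npd) < tau, g_keep, g_att)

npd = np.array([[0.05, 0.5, -0.1, 0.9]])   # one frame, four bins
fused = np.ones((1, 4))                     # fused spectrum (unit magnitudes)
masked = first_gain(npd) * fused            # the first gain signal
```

Each bin is gated independently, so speech-dominated bins at one time instant survive while out-of-direction bins at the same instant are attenuated.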
And step A30, performing gain control on the fused signals by adopting a first gain coefficient to obtain first gain signals, and obtaining target output signals according to the first gain signals.
Alternatively, gain control refers to multiplying each frame of an audio signal by a gain coefficient to achieve amplitude adjustment of the audio signal, and a signal subjected to gain control processing is referred to as a gain signal. In the following, the gain-controlled signal having undergone the first gain factor will be referred to as a first gain signal for the sake of illustration.
Optionally, the first gain signal can be used directly as the target output signal, or automatic gain control can be applied to the first gain signal according to the video zoom multiple to obtain the target output signal.
It can be understood that, by calculating the normalized phase difference for each independent time-frequency unit and determining a corresponding gain coefficient for each unit, audio components of different frequencies at the same time instant can receive different gain coefficients. Applying these gain coefficients to the fused signal suppresses high-frequency noise while enhancing the speech fundamental frequency at a given time instant, realizing soft-mask (Soft-Mask) gain control.
In a possible implementation manner, the step of obtaining the target output signal according to the first gain signal in the step a30 includes:
step A31, determining a second gain coefficient according to the video zooming multiple, wherein the larger the video zooming multiple is, the larger the second gain coefficient is;
and step A32, performing gain control on the first gain signal by adopting a second gain coefficient to obtain a second gain signal, and obtaining a target output signal according to the second gain signal.
Optionally, in order to distinguish from the first gain coefficient determined according to the normalized phase difference, the gain coefficient determined according to the video zoom factor is referred to as a second gain coefficient, and the larger the video zoom factor is, the larger the corresponding second gain coefficient is. Correspondingly, a gain signal obtained by performing gain control on the first gain signal by using the second gain coefficient is referred to as a second gain signal.
Alternatively, the above-described process of gain-controlling the fused signal with the first gain coefficient is performed in the frequency domain, while gain-controlling the first gain signal with the second gain coefficient is performed in the time domain; therefore, after the first gain signal is obtained, an inverse short-time Fourier transform (ISTFT) is required to convert it from a frequency-domain signal back to a time-domain signal.
Optionally, multiplying each frame signal of the first gain signal by a second gain coefficient corresponding to the frame to obtain a second gain signal, wherein the second gain coefficient is determined according to the video zoom multiple corresponding to the frame.
In this embodiment, the gain coefficient is dynamically adjusted according to the video zoom multiple, so that near and far sounds are appropriately scaled, realizing automatic gain control (AGC). Meanwhile, the larger the video zoom multiple, the larger the gain coefficient and the louder the corresponding overall audio, producing an overall auditory "zoom-in" effect and thereby synchronized zooming of the audio and video signals.
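The frame-wise automatic gain control can be sketched as follows (the linear zoom-to-gain mapping and its constants are assumptions for illustration only):

```python
import numpy as np

def second_gain(zoom, base=1.0, step=0.05):
    """Monotone mapping from the video zoom multiple to the second gain
    coefficient: the larger the zoom, the larger the gain."""
    return base + step * zoom

def apply_agc(frames, zooms):
    """Multiply each time-domain frame of the first gain signal by the
    second gain coefficient derived from that frame's zoom multiple."""
    return [second_gain(z) * f for f, z in zip(frames, zooms)]

frames = [np.ones(4), np.ones(4)]           # two frames of the first gain signal
out = apply_agc(frames, zooms=[1.0, 10.0])  # zoom changes between frames
```

Because the gain is recomputed per frame, the audible level tracks the zoom multiple as the user zooms in or out during capture.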
An exemplary audio zooming method is shown in fig. 4. The microphone array signal is first obtained and input to an STFT module, which converts it from time-domain signals to frequency-domain signals for beamforming and gain control. The converted microphone array signal is then input to a beamforming module, which beamforms it with the preset direction, determined according to the shooting direction of the camera device, as the speech enhancement direction, and outputs the enhanced signal to a fusion module. The fusion module fuses the enhanced signal with at least one signal of the microphone array according to the video zoom multiple of the camera device and outputs the fused signal to a soft-mask module, which performs the soft-mask gain control described above according to the normalized phase difference of two of the microphone array signals and outputs the first gain signal to an ISTFT module. The ISTFT module converts the first gain signal from the frequency domain back to the time domain through the inverse Fourier transform, and the converted first gain signal is finally input to an AGC module, which performs gain control with the second gain coefficient determined according to the video zoom multiple and outputs the resulting second gain signal as the target output signal, realizing synchronized audio-video zooming.
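The fig. 4 chain can be sketched end to end under stated assumptions (channel averaging stands in for the beamformer, the zoom-to-ratio and zoom-to-gain mappings are linear guesses, and the 0.2 threshold is illustrative):

```python
import numpy as np
from scipy.signal import stft, istft

C = 343.0  # assumed speed of sound, m/s

def audio_zoom(x1, x2, zoom, d=0.1, fs=16000, nperseg=256, tau=0.2):
    """Toy fig.-4 pipeline: STFT -> beamform -> zoom-weighted fusion ->
    phase-difference soft mask -> ISTFT -> AGC."""
    _, _, Z1 = stft(x1, fs=fs, nperseg=nperseg)
    f, _, Z2 = stft(x2, fs=fs, nperseg=nperseg)
    enhanced = 0.5 * (Z1 + Z2)                      # stand-in beamformer
    ratio = min(zoom / 10.0, 1.0)                   # fusion module
    fused = ratio * enhanced + (1.0 - ratio) * Z1
    npd = np.angle(np.exp(1j * (np.angle(Z1) - np.angle(Z2))))
    denom = np.maximum(2.0 * np.pi * f[:, None] * d / C, np.finfo(float).eps)
    mask = np.where(np.abs(npd / denom) < tau, 1.0, 0.2)  # soft-mask module
    _, y = istft(mask * fused, fs=fs, nperseg=nperseg)    # back to time domain
    return (1.0 + 0.05 * zoom) * y                  # AGC: louder at high zoom

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
out = audio_zoom(x, x, zoom=5.0)
```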
Alternatively, since beamforming may also be performed on time-domain signals, in the case where the beamforming module performs speech enhancement by spatial filtering in the time domain, the STFT module may instead be placed before the soft-mask module; this embodiment is not specifically limited in this respect.
In this embodiment, after the fused signal is obtained, soft-mask gain control is performed to identify and retain the effective components of the speech signal while suppressing noise and unimportant frequency components, improving speech clarity; automatic gain control is then applied to the resulting first gain signal to keep the output level of the speech signal stable, so that the sound maintains a relatively consistent volume across different scenes.
Based on the above-described first and/or second embodiments, a third embodiment of the audio zooming method of the present application is presented. In this embodiment, the same or similar contents as those of the first and second embodiments described above may be referred to the description above, and will not be repeated. In this embodiment, the audio zooming method is applied to a mobile terminal, and the step of fusing the enhancement signal with at least one signal in the microphone array signal according to the video zooming multiple of the image capturing device in step S20 includes:
Step B10, obtaining an adjusting factor;
optionally, the adjustment factor is a parameter for adjusting a fusion ratio corresponding to the enhancement signal, reflects the applicability of the current system or the user to the enhancement signal, and may be preset in the electronic device or may be determined in real time according to user input.
In one possible embodiment, step B10 includes:
and step B11, receiving a user adjusting instruction, and determining an adjusting factor according to the user adjusting instruction.
Optionally, the user adjustment instruction refers to a control signal generated by the user through an input device (such as a touch screen, keys, etc.), the specific form of which depends on the interaction mode.
For example, in the case that the input device is a rotary knob, the electronic device may determine the adjustment factor according to the rotation angle of the knob; in the case that the input device is a touch screen, it may display a slider on the video capture interface to receive the user adjustment instruction and determine the adjustment factor from the slider position; it may also interpret the current user's voice input, such as "clearer", as a voice instruction to increase the adjustment factor, and execute it accordingly.
Step B20, determining a fusion proportion corresponding to the enhancement signal according to the adjustment factor and the video zoom multiple, wherein the smaller the adjustment factor is, the smaller the fusion proportion is under the condition that the video zoom multiple is unchanged;
Optionally, the video zoom factor and the adjustment factor may be mapped into a fusion ratio according to a preset mapping rule (such as linear mapping, nonlinear mapping, table lookup method, etc.), where the smaller the adjustment factor, the smaller the fusion ratio under the condition that the video zoom factor is unchanged.
Optionally, a multiplicative coupled function mapping rule may be used, and the fusion proportion corresponding to the enhancement signal is determined according to the adjustment factor and the video zoom multiple.
For example, consider the formula Y = A × B × X, wherein A is the adjustment factor with a value range of 0-1, B is a calculation parameter fixed at 0.1, and X is the video zoom multiple with a value range of 0-10; the smaller the adjustment factor, the smaller the calculated fusion ratio. For example, when A is 1, the fusion ratio is Y = 0.1 × X with Y ranging over 0-1; when A is 0.1, the fusion ratio is Y = 0.01 × X with Y ranging over 0-0.1. It can be seen that the smaller the adjustment factor, the smaller the influence of the video zoom multiple on the fusion ratio, and the less obvious the audio zooming effect perceived by the user when adjusting the video zoom multiple. Accordingly, the user can control the degree to which audio zooming is affected by the video zoom multiple by controlling the adjustment factor, thereby obtaining an audio zooming effect suited to the individual user.
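The multiplicative mapping in this example can be written directly (a sketch; the clamp to [0, 1] is implied by the stated value ranges):

```python
def fusion_ratio(a, x, b=0.1):
    """Y = A * B * X from the example above: A is the user adjustment
    factor (0-1), X the video zoom multiple (0-10), B fixed at 0.1;
    the result is the fusion proportion of the enhanced signal."""
    return max(0.0, min(1.0, a * b * x))

full = fusion_ratio(1.0, 10.0)    # adjustment factor 1 -> ratio spans 0-1
damped = fusion_ratio(0.1, 10.0)  # small factor -> zoom barely felt
```

A small adjustment factor flattens the whole zoom-to-ratio curve, which is exactly why the perceived audio zooming effect becomes less pronounced.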
And step B30, fusing the enhancement signal with at least one path of signal in the microphone array signal according to the fusion proportion.
In this embodiment, by introducing the adjustment factor, the fusion proportion corresponding to the enhancement signal can be adjusted in real time according to the user instruction, so as to adjust the obvious degree of the audio zooming effect, and facilitate the user to dynamically adjust the audio zooming effect according to the actual requirement.
An embodiment of the present application further provides an audio zooming device, referring to fig. 5, including:
the signal enhancement module 10 is configured to perform beam forming processing on the microphone array signal with a preset direction as a voice enhancement direction to obtain an enhanced signal, where the preset direction is determined according to a shooting direction of the image pickup device;
The signal fusion module 20 is configured to fuse the enhancement signal with at least one signal in the microphone array signal according to a video zoom multiple of the image capturing device, and obtain a target output signal according to the fused signal, where the larger the video zoom multiple is, the larger the fusion ratio corresponding to the enhancement signal is.
Compared with the prior art, the audio zooming device provided by the embodiment of the application has the same beneficial effects as the audio zooming method provided by the embodiment, and other technical features in the audio zooming device are the same as the features disclosed by the method of the embodiment, and are not repeated herein.
The embodiment of the application provides electronic equipment, which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the audio zooming method in the first embodiment.
Referring now to fig. 6, a schematic diagram of an electronic device suitable for implementing embodiments of the present application is shown. The electronic device in the embodiments of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a tablet computer (PAD), a PMP (Portable Media Player), and an in-vehicle terminal (e.g., an in-vehicle navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 6 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
As shown in fig. 6, the electronic device may include a processing device 1001 (e.g., a central processing unit or a graphics processor) which can perform various appropriate actions and processes according to a program stored in a read-only memory 1002 or a program loaded from a storage device 1003 into a random access memory 1004. The random access memory 1004 also stores various programs and data necessary for the operation of the electronic device. The processing device 1001, the read-only memory 1002, and the random access memory 1004 are connected to each other by a bus 1005. An input/output interface 1006 is also connected to the bus. In general, the following may be connected to the input/output interface 1006: an input device 1007 such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, or gyroscope; an output device 1008 such as a liquid crystal display (LCD), speaker, or vibrator; the storage device 1003 such as a magnetic tape or hard disk; and a communication device 1009. The communication device 1009 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While an electronic device having various systems is shown in the figure, it should be understood that not all of the illustrated systems are required to be implemented or provided; more or fewer systems may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device, or installed from the storage device 1003, or installed from the read only memory 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.
Compared with the prior art, the electronic device provided by the embodiment of the application has the same beneficial effects as the audio zooming method provided by the embodiment, and other technical features in the electronic device are the same as the features disclosed by the method of the previous embodiment, and are not repeated here.
Embodiments of the present application provide a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the audio zooming method of the above-described embodiments.
The computer-readable storage medium provided by the embodiments of the present application may be, for example, a USB disk, and may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, the computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber-optic cable, radio frequency (RF), and the like, or any suitable combination of the foregoing.
The computer readable storage medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
The computer-readable storage medium carries one or more programs which, when executed by an electronic device, cause the electronic device to perform the functions described above as defined in the methods of the disclosed embodiments of the application.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation of the module itself.
The readable storage medium provided by the embodiment of the present application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (i.e., a computer program) for executing the above-mentioned audio zooming method.
The embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of an audio zooming method as described above.
Compared with the prior art, the beneficial effects of the computer program product provided by the embodiment of the application are the same as those of the audio zooming method provided by the embodiment, and are not described in detail herein.
The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.

Claims (10)

1.一种音频变焦方法,其特征在于,所述音频变焦方法包括:1. An audio zoom method, characterized in that the audio zoom method comprises: 以预设方向为语音增强方向对麦克风阵列信号进行波束形成处理得到增强信号,其中,所述预设方向根据摄像装置的拍摄方向确定;Performing beamforming processing on the microphone array signal with a preset direction as the speech enhancement direction to obtain an enhanced signal, wherein the preset direction is determined according to the shooting direction of the camera device; 根据所述摄像装置的视频变焦倍数将所述增强信号与所述麦克风阵列信号中至少一路信号进行融合,并根据融合后的信号得到目标输出信号,其中,所述视频变焦倍数越大,所述增强信号对应的融合比例越大。The enhanced signal is fused with at least one signal of the microphone array signal according to the video zoom ratio of the camera device, and a target output signal is obtained based on the fused signal, wherein the greater the video zoom ratio, the greater the fusion ratio corresponding to the enhanced signal. 2.如权利要求1所述的音频变焦方法,其特征在于,所述根据融合后的信号得到目标输出信号的步骤包括:2. The audio zoom method according to claim 1, wherein the step of obtaining the target output signal based on the fused signal comprises: 计算所述麦克风阵列信号其中两路信号的归一化相位差;Calculating a normalized phase difference between two signals of the microphone array signal; 根据所述归一化相位差确定第一增益系数;determining a first gain coefficient according to the normalized phase difference; 采用所述第一增益系数对融合后的信号进行增益控制,得到第一增益信号,并根据所述第一增益信号得到目标输出信号。The first gain coefficient is used to perform gain control on the fused signal to obtain a first gain signal, and a target output signal is obtained according to the first gain signal. 3.如权利要求2所述的音频变焦方法,其特征在于,所述根据所述第一增益信号得到目标输出信号的步骤包括:3. 
The audio zoom method according to claim 2, wherein the step of obtaining the target output signal according to the first gain signal comprises: 根据所述视频变焦倍数确定第二增益系数,其中,所述视频变焦倍数越大,所述第二增益系数越大;Determining a second gain coefficient according to the video zoom factor, wherein the larger the video zoom factor is, the larger the second gain coefficient is; 采用所述第二增益系数对第一增益信号进行增益控制,得到第二增益信号,并根据所述第二增益信号得到目标输出信号。The second gain coefficient is used to perform gain control on the first gain signal to obtain a second gain signal, and a target output signal is obtained according to the second gain signal. 4.如权利要求2所述的音频变焦方法,其特征在于,所述计算所述麦克风阵列信号其中两路信号的归一化相位差的步骤包括:4. The audio zoom method according to claim 2, wherein the step of calculating the normalized phase difference between two signals of the microphone array signal comprises: 计算第一信号和第二信号的归一化相位差,其中,所述第一信号是所述麦克风阵列信号中第一麦克风对应的信号,所述第二信号是所述麦克风阵列信号中第二麦克风对应的信号,所述第一麦克风和所述第二麦克风是麦克风阵列中在所述预设方向上距离大于预设阈值的两个麦克风。Calculate a normalized phase difference between a first signal and a second signal, wherein the first signal is a signal corresponding to a first microphone in the microphone array signal, the second signal is a signal corresponding to a second microphone in the microphone array signal, and the first microphone and the second microphone are two microphones in the microphone array whose distance in the preset direction is greater than a preset threshold. 5.如权利要求1所述的音频变焦方法,其特征在于,所述根据所述摄像装置的视频变焦倍数将所述增强信号与所述麦克风阵列信号中至少一路信号进行融合,并根据融合后的信号得到目标输出信号的步骤包括:5. 
The audio zoom method according to claim 1, wherein the step of fusing the enhanced signal with at least one of the microphone array signals according to the video zoom factor of the camera device and obtaining a target output signal based on the fused signal comprises: fusing the enhanced signal with a third signal according to the video zoom factor, and obtaining a target output signal of the left channel based on the fused signal, wherein the third signal is the signal corresponding to a third microphone in the microphone array signal; and fusing the enhanced signal with a fourth signal according to the video zoom factor, and obtaining a target output signal of the right channel based on the fused signal, wherein the fourth signal is the signal corresponding to a fourth microphone in the microphone array signal, and the third microphone is closer to the left-channel speaker than the fourth microphone is.
6. The audio zoom method according to any one of claims 1 to 5, wherein the step of fusing the enhanced signal with at least one of the microphone array signals according to the video zoom factor of the camera device comprises: obtaining an adjustment factor; determining a fusion ratio corresponding to the enhanced signal according to the adjustment factor and the video zoom factor, wherein, for a given video zoom factor, a smaller adjustment factor yields a smaller fusion ratio; and fusing the enhanced signal with at least one signal of the microphone array signal according to the fusion ratio.
7. The audio zoom method according to claim 6, wherein the step of obtaining the adjustment factor comprises: receiving a user adjustment instruction, and determining the adjustment factor according to the user adjustment instruction.
8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program is configured to implement the steps of the audio zoom method according to any one of claims 1 to 7.
9. A storage medium, wherein the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the audio zoom method according to any one of claims 1 to 7.
10. A computer program product, comprising a computer program which, when executed by a processor, implements the steps of the audio zoom method according to any one of claims 1 to 7.
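The fusion the claims above describe can be sketched as follows: the enhanced (beamformed) signal is blended with raw microphone signals using a fusion ratio that grows with the video zoom factor and shrinks with a smaller adjustment factor. The linear zoom-to-ratio mapping, the `zoom_max` bound, and all names below are illustrative assumptions, not details taken from the patent:

```python
def fuse_audio_zoom(enhanced, mic_left, mic_right, zoom,
                    zoom_max=10.0, adjustment=1.0):
    """Blend a beamformed signal with per-channel microphone signals.

    Illustrative sketch only: the patent specifies that a larger zoom
    factor gives the enhanced signal a larger fusion ratio, and that a
    smaller adjustment factor gives a smaller ratio at the same zoom;
    the linear mapping chosen here is just one plausible realization.
    """
    # Map zoom in [1, zoom_max] onto [0, 1], scale by the user-controlled
    # adjustment factor, then clamp to keep the ratio valid.
    ratio = adjustment * (zoom - 1.0) / (zoom_max - 1.0)
    ratio = max(0.0, min(1.0, ratio))
    # Per-sample crossfade: more enhanced signal at higher zoom.
    left = [ratio * e + (1.0 - ratio) * m for e, m in zip(enhanced, mic_left)]
    right = [ratio * e + (1.0 - ratio) * m for e, m in zip(enhanced, mic_right)]
    return left, right, ratio
```

At `zoom = 1` the output is the raw microphone signal; as zoom approaches `zoom_max` the output converges to the enhanced signal, matching the claimed monotonic relationship.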
CN202511086201.2A 2025-08-05 2025-08-05 Audio zooming method, electronic device, storage medium and computer program product Pending CN120581021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511086201.2A CN120581021A (en) 2025-08-05 2025-08-05 Audio zooming method, electronic device, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511086201.2A CN120581021A (en) 2025-08-05 2025-08-05 Audio zooming method, electronic device, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN120581021A true CN120581021A (en) 2025-09-02

Family

ID=96860436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511086201.2A Pending CN120581021A (en) 2025-08-05 2025-08-05 Audio zooming method, electronic device, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN120581021A (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209988A (en) * 2008-09-11 2011-10-05 弗劳恩霍夫应用研究促进协会 Device, method and computer program for providing a set of spatial cues based on a microphone signal and device for providing a binaural audio signal and a set of spatial cues
CN102447993A (en) * 2010-09-30 2012-05-09 Nxp股份有限公司 Sound scene manipulation
CN103516894A (en) * 2012-06-25 2014-01-15 Lg电子株式会社 Mobile terminal and audio zooming method thereof
CN106716526A (en) * 2014-09-05 2017-05-24 汤姆逊许可公司 Method and apparatus for enhancing sound sources
CN107852543A (en) * 2015-08-13 2018-03-27 华为技术有限公司 A kind of audio signal processor and a kind of sound radiating devices
CN108781310A (en) * 2016-04-15 2018-11-09 英特尔公司 The audio stream for the video to be enhanced is selected using the image of video
US20190082257A1 (en) * 2017-09-14 2019-03-14 Fujitsu Limited Device and method for determining a sound source direction
CN112956209A (en) * 2018-09-03 2021-06-11 斯纳普公司 Acoustic zoom
US20220394382A1 (en) * 2020-02-18 2022-12-08 Kddi Corporation Apparatus, method and computer-readable storage medium for mixing collected sound signals of microphones
CN114384471A (en) * 2020-10-21 2022-04-22 通用汽车环球科技运作有限责任公司 System for detecting the direction of a vehicle whistle and determining the position of the whistle vehicle
US20220120895A1 (en) * 2020-10-21 2022-04-21 GM Global Technology Operations LLC System for detecting direction of a vehicle honk and determining location of honking vehicle
CN115942108A (en) * 2021-08-12 2023-04-07 北京荣耀终端有限公司 Video processing method and electronic equipment
CN114363512A (en) * 2021-09-30 2022-04-15 荣耀终端有限公司 Video processing method and related electronic equipment
CN115134499A (en) * 2022-06-28 2022-09-30 世邦通信股份有限公司 Audio and video monitoring method and system
CN119547463A (en) * 2022-07-25 2025-02-28 高通股份有限公司 Audio signal enhancement
CN115620727A (en) * 2022-11-14 2023-01-17 北京探境科技有限公司 Audio processing method and device, storage medium and intelligent glasses
CN219087308U (en) * 2022-12-15 2023-05-26 深圳睿克微电子有限公司 Micro-electromechanical microphone structure for noise reduction processing and serial microphone structure
CN116156327A (en) * 2023-01-16 2023-05-23 杭州萤石软件有限公司 A method for improving video intercom effect of network camera, network camera
CN116600242A (en) * 2023-07-19 2023-08-15 荣耀终端有限公司 Audio sound image optimization method, device, electronic equipment and storage medium
CN116705047A (en) * 2023-07-31 2023-09-05 北京小米移动软件有限公司 Audio collection method, device and storage medium

Similar Documents

Publication Publication Date Title
JP6367258B2 (en) Audio processing device
CN106157986B (en) An information processing method and device, and electronic equipment
KR101431934B1 (en) An apparatus and a method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
US12267591B2 (en) Video processing method and related electronic device
JP2015019371A5 (en)
US11284211B2 (en) Determination of targeted spatial audio parameters and associated spatial audio playback
JP7210602B2 (en) Method and apparatus for processing audio signals
CN109547848B (en) Loudness adjustment method and device, electronic equipment and storage medium
JP2008263498A (en) Wind noise reducing device, sound signal recorder and imaging apparatus
CN107087208B (en) Panoramic video playing method, system and storage device
CN115942108B (en) Video processing method and electronic equipment
CN112927718B (en) Method, device, terminal and storage medium for sensing surrounding environment
CN120581021A (en) Audio zooming method, electronic device, storage medium and computer program product
CN117636928A (en) Pickup device and related audio enhancement method
CN110475197B (en) Sound field playback method and device
CN120676113A (en) Audio recording method, head-mounted device and storage medium
CN111145793B (en) Audio processing method and device
JP2009005157A (en) Sound signal correction device
CN119996914A (en) Audio signal processing method, device, hearing aid and storage medium
HK40072529A (en) Video processing method and electronic device related thereto
CN116709154A (en) A sound field calibration method and related device
CN114630240A (en) Directional filter generation method, audio processing method, device and storage medium
CN120128851A (en) Audio signal processing method, device, electronic device, chip and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination