Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The audio and video processing method provided by the embodiments of the application is used for processing audio and video data collected by a device to form spatial audio. The audio and video data may be collected by a single device, for example a device capable of capturing images, such as a mobile phone, a camera, or glasses, which records the video while its own microphone records the sound. Alternatively, the audio and video data may be collected by two or more devices, for example a device carrying a camera, such as a mobile phone or glasses, records the video while a wearable device (such as a headset or glasses) records the sound.
The audio and video processing method can be executed by a server or a terminal. After a device collects the audio and video data, it sends the data to a server or a terminal, and the server or terminal processes the data to form spatial audio. The terminal may be the device that collects the audio and video data, or may be another device. The terminal can be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, an internet-of-things device, or a portable wearable device; the internet-of-things device can be a smart speaker, a smart television, smart vehicle-mounted equipment, and the like; the portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 1, an audio and video processing method is provided. The method is described here as applied to a terminal by way of illustration, and includes the following steps:
Step 102: obtain audio and video data.
The audio and video data includes sound data and video data. The sound data is the sound recorded while the video is recorded.
After the device collects the audio and video data, it sends the data to a server or a terminal, and the server or terminal acquires the audio and video data. The sound data may be a mono audio file or a two-channel audio file. As noted above, the audio and video data may be collected by a single device, for example a device capable of capturing images, such as a mobile phone, a camera, or glasses, which records the video while its own microphone records the sound; or by two or more devices, for example a device carrying a camera, such as a mobile phone or glasses, records the video while a wearable device (such as a headset or glasses) records the sound.
Step 104: perform target recognition on the video data and determine the image categories.
Target recognition is a technique in the field of computer vision, mainly used for automatically extracting and identifying objects of interest from digital image or video data.
Target recognition of the video data mainly comprises denoising, smoothing, and image enhancement of the images in the video data, so as to reduce image interference and improve target visibility. Features relevant to the target, such as color, shape, texture, and edges, are then extracted from the images and typically described using mathematical methods. Known template or sample features are then matched against the target features to be identified, so as to determine the image categories in the video data. For example, the identified image categories may include three categories: boys, girls, and dogs. In a concert scenario, target recognition may be understood as recognizing the image categories on the live concert stage as various instruments and voices, such as the lead vocal, guitar, piano, and violin.
Step 106: perform audio separation on the sound data based on the image categories to obtain the audio data corresponding to each image category.
Audio separation is a method of decomposing sound data in which multiple sources are mixed together into individual audio data.
After the image categories in the video data are determined, audio separation is performed on the sound data in combination with the image categories; that is, the sound corresponding to each image category is separated out, yielding the audio data corresponding to each image category. The image category here refers to an image category that emits sound. For example, the mixed audio of a boy, a girl, and a dog is separated into individual tracks by an audio separation technique; in a concert scenario, the recorded mix of the concert performance is separated into lead vocal, guitar, piano, and violin by an instrument separation technique.
Step 108: obtain the spatial audio according to the spatial position information of each image category and the corresponding audio data.
The spatial position information of each image category may be represented by (L, α, β), where L is the distance between the image category and the user, α is the horizontal angle between the image category and the user, and β is the vertical angle between the image category and the user. A schematic diagram of the spatial position information is shown in fig. 2.
The spatial position information of each image category may be derived from the position of each image category in the video data; analyzing the video data yields the spatial position information of each image category in the video data. After the spatial position information and the corresponding audio data of each image category are obtained, the spatial audio of each image category can be obtained, and thus all the spatial audio is obtained.
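As a rough illustration of how (L, α, β) might be derived from an object's position in a frame, the following Python sketch maps a detected bounding-box centre to horizontal and vertical angles under a simple pinhole-camera assumption; the function name, the field-of-view values, and the externally supplied distance estimate are illustrative assumptions, not part of the described method.

```python
def pixel_to_spatial_position(cx, cy, width, height,
                              hfov_deg=90.0, vfov_deg=60.0, distance_m=3.0):
    """Map a bounding-box centre (cx, cy) in a frame to (L, alpha, beta).

    Assumes a simple pinhole camera with known fields of view; the
    distance L must come from elsewhere (e.g. a depth estimate).
    """
    nx = cx / width - 0.5        # horizontal offset; positive = right of centre
    ny = 0.5 - cy / height       # vertical offset (image y grows downwards)
    alpha = nx * hfov_deg        # horizontal angle relative to the viewer
    beta = ny * vfov_deg         # vertical angle relative to the viewer
    return distance_m, alpha, beta

# Example: an object detected at pixel (1600, 540) in a 1920x1080 frame.
L, alpha, beta = pixel_to_spatial_position(1600, 540, 1920, 1080)
```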
After the spatial audio is obtained, it is sent to a wearable device for playback, for example to an earphone, which better restores the sound. The spatial audio and the video data may also be sent together to a device for playback, for example to a terminal device such as a mobile phone, restoring the real scene of the sound at the time of video recording, thereby realizing scene restoration and sound restoration.
In this embodiment, after the sound data and the video data are obtained, target recognition is performed on the video data to determine the image categories, audio separation is performed on the sound data based on the image categories to determine the audio data corresponding to each image category, and the spatial audio is obtained according to the spatial position information and the corresponding audio data of each image category, where the spatial audio is used for playing in combination with the video data or playing independently. The spatial audio obtained by combining the image categories and the spatial position information can thus accurately correspond to images of different categories and restore the directions of the sounds at recording time, so that scene restoration and sound restoration are achieved when the spatial audio is played, improving the reliability of spatial audio playback.
In one embodiment, as shown in FIG. 3, step 108 includes steps 302 through 306.
Step 302: obtain the spatial position information of each image category.
The spatial position information of each image category may be derived from the position of each image category in the video data; analyzing the video data yields the spatial position information of each image category. As noted above, the spatial position information may be represented by (L, α, β), where L is the distance between the image category and the user, α is the horizontal angle, and β is the vertical angle; a schematic diagram is shown in fig. 2.
Step 304: input the spatial position information of each image category into a head-related transfer function as a key parameter to obtain a head-related impulse response.
The head-related transfer function (HRTF) is a signal processing function for simulating the listening space. It is determined by the influence of the head, ears, and body on sound transmission and reception, including the reflection and refraction of sound waves by the head, the front and back of the ears, and body parts such as the hair, shoulders, and chest. An HRTF can describe how sound from different directions and distances reaches a listener, and can be used to calculate the filter response from a sound source to the listener's ears.
After the spatial position information of each image category is input into the head-related transfer function as a key parameter, the function simulates how the human ear receives sound source signals from different directions and positions, yielding a head-related impulse response. The head-related impulse response is the representation of the head-related transfer function in the time domain. The head-related transfer function covers multiple angles (horizontal and vertical) and multiple distances (L), so it can be regarded as three-dimensional data. Once an image category and its spatial position information are determined, the two angles (horizontal and vertical) and the distance (between the image category and the user) corresponding to that image category are fixed, and the corresponding output curve is the head-related impulse response.
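A minimal sketch of this lookup, assuming the head-related impulse responses are stored as a list of measurement-grid entries (the data layout is an assumption; real datasets use their own formats):

```python
import numpy as np

def select_hrir(hrir_db, alpha, beta, L):
    """Pick the stored left/right HRIR pair measured closest to (alpha, beta, L).

    hrir_db: iterable of (alpha, beta, L, hrir_left, hrir_right) entries,
    an assumed layout for a pre-measured HRIR dataset.
    """
    def grid_distance(entry):
        a, b, l, _, _ = entry
        return (a - alpha) ** 2 + (b - beta) ** 2 + (l - L) ** 2

    _, _, _, h_left, h_right = min(hrir_db, key=grid_distance)
    return np.asarray(h_left), np.asarray(h_right)
```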
Step 306: obtain the spatial audio according to the head-related impulse responses and the audio data corresponding to each image category.
Spatial audio can then be obtained from the head-related impulse responses, which encode the spatial position information of each image category, and the audio data corresponding to each image category. One way to do this is to convolve each head-related impulse response with the audio data of the corresponding image category; this yields the spatial audio, simulates the positions of the sound sources in three-dimensional space, and produces a stereo or surround effect, thereby providing a realistic stereo impression for the listener.
In this embodiment, after the spatial position information of each image category is obtained, the spatial audio is obtained using the spatial position information, the head-related impulse responses, and the audio data corresponding to each image category, so that the positions of sound sources in three-dimensional space can be simulated and a realistic stereo effect provided for the listener.
The head-related transfer function may be constructed in advance and stored, for example in a memory of the terminal. When the head-related transfer function needs to be used, features of parts such as the user's auricle can be identified, and the head-related transfer function matching the identified feature parameters can be selected from the stored functions. Alternatively, a head-related transfer function may be assigned to the user according to a default setting. As shown in fig. 4, the construction process of the head-related transfer function includes step 402 and step 404.
Step 402: measure a designated user under preset environmental conditions to obtain measurement data.
The preset environmental conditions refer to environmental conditions whose environmental parameters are known, for example an anechoic room or a concert hall.
The designated user refers to a user determined according to requirements. Measuring the designated user means measuring the influence of each part of the designated user's body on sound to obtain the measurement data; for example, the reflection and refraction of sound waves by the head, the front and back of the ears, and body parts such as the hair, shoulders, and chest are measured.
Step 404: obtain the head-related transfer function from the measurement data.
After the measurement data is obtained, it is analyzed and processed to obtain the head-related transfer function.
Specifically, a data acquisition apparatus is used for measurement and recording; such apparatus typically includes a model of a human head and an array of microphones. Under the preset environmental conditions, sound is played toward the array from different directions and distances, and the signal captured by each microphone is recorded as measurement data. From these data, the head-related transfer function in each direction can be calculated.
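One common way to turn such a measurement into a transfer function is frequency-domain deconvolution of the recorded signal by the known excitation; the sketch below assumes a single loudspeaker-to-microphone measurement and is illustrative only:

```python
import numpy as np

def estimate_impulse_response(excitation, recorded, eps=1e-8):
    """Estimate an impulse response as H(f) = Y(f) / X(f), back-transformed.

    excitation: the known test signal played from one direction.
    recorded:   the signal captured at the ear/microphone position.
    eps:        small regularisation term to avoid division by zero.
    """
    n = len(excitation) + len(recorded) - 1
    X = np.fft.rfft(excitation, n)
    Y = np.fft.rfft(recorded, n)
    H = Y / (X + eps)               # frequency-domain deconvolution
    return np.fft.irfft(H, n)       # time-domain impulse response
```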
In this embodiment, the head-related transfer function is constructed from measurement data obtained by measuring the designated user under preset environmental conditions, so that the function better fits the characteristics of the designated user.
It will be appreciated that when determining the head-related transfer function, public data may be used to match the user's head-related transfer function. Alternatively, the HRTFs in a database may first be clustered, and then the best-matching class of HRTFs may be selected from the public database using the user's personalized auricle features as the finally determined head-related transfer function. Any of the above modes can be selected according to actual requirements, without limitation here.
In one embodiment, the head-related impulse response comprises a head-related impulse response to the left ear and a head-related impulse response to the right ear, and the spatial audio comprises an output audio signal for the left ear and an output audio signal for the right ear. As shown in fig. 5, step 306 comprises steps 502 and 504.
Step 502: convolve the audio data corresponding to each image category with the respective head-related impulse response to the left ear, then apply a down-mixing algorithm to obtain the output audio signal of the left ear.
The head-related impulse response to the left ear refers to the transfer function of the sound of a given image category as transmitted to the left ear. For example, if a person and an animal are both making sound, the head-related impulse responses to the left ear include the transfer function of the person's sound to the left ear and the transfer function of the animal's sound to the left ear.
Step 504: convolve the audio data corresponding to each image category with the respective head-related impulse response to the right ear, then apply a down-mixing algorithm to obtain the output audio signal of the right ear.
The head-related impulse response to the right ear refers to the transfer function of the sound of a given image category as transmitted to the right ear. For example, if a person and an animal are both making sound, the head-related impulse responses to the right ear include the transfer function of the person's sound to the right ear and the transfer function of the animal's sound to the right ear.
Some audio devices have only two channels; headphones, for example, have only left and right channels. Therefore, when the spatial audio is formed, the output audio signal of the left ear is obtained from the head-related impulse responses to the left ear and the audio data corresponding to each image category and is played on the left channel, and the output audio signal of the right ear is obtained from the head-related impulse responses to the right ear and the audio data corresponding to each image category and is played on the right channel, so as to restore the spatial realism of the sound.
Illustratively, suppose there are three image categories (three target sound sources of interest), such as a boy, a girl, and a dog, and playback is over headphones, so a down-mixing algorithm is needed. Let the sounds of the boy, the girl, and the dog be x1, x2, and x3; let hll and hlr denote the head-related impulse responses of x1 to the left ear and to the right ear, hcl and hcr those of x2, and hrl and hrr those of x3. Then the left-ear and right-ear sound signals yl and yr heard by the listener are:
yl=x1*hll+x2*hcl+x3*hrl;
yr=x1*hlr+x2*hcr+x3*hrr;
Through the above processing, the user hears, at the earphone end, spatial audio that matches what was recorded.
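The two formulas translate directly into code. Below is a minimal numpy sketch, assuming all sources share one length and all HRIRs another; the function name is illustrative and * in the formulas denotes convolution:

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_downmix(sources, hrirs_left, hrirs_right):
    """Compute yl = x1*hll + x2*hcl + x3*hrl and the right-ear analogue,
    where * denotes convolution.

    sources:     mono signals [x1, x2, x3], all of equal length.
    hrirs_left:  left-ear HRIRs [hll, hcl, hrl], all of equal length.
    hrirs_right: right-ear HRIRs [hlr, hcr, hrr].
    """
    yl = sum(fftconvolve(x, h) for x, h in zip(sources, hrirs_left))
    yr = sum(fftconvolve(x, h) for x, h in zip(sources, hrirs_right))
    return np.stack([yl, yr])   # shape (2, n): the two earphone channels
```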
In one embodiment, as shown in FIG. 6, step 104 includes step 602.
Step 602: input the video data into an image recognition model to obtain the image categories of the images in the video data.
The image recognition model is a model trained in advance on image samples of various categories; it encodes the features of the different image categories.
The training process of the image recognition model comprises acquiring a database for image recognition and video target detection, dividing it into training, test, and validation sets, and training the model using methods such as deep learning. The trained model can then be deployed on a mobile phone or on hardware devices such as headphones or glasses. Models for video target recognition include computer vision deep learning models such as MANet (multi-scale aggregation network), Association LSTM (association long short-term memory network), and T-CNN (spatio-temporal convolutional network). Specifically, MANet is a neural network architecture that combines multi-scale features in a single model to improve the accuracy of target detection; it uses pyramid pooling to build a multi-scale feature map and a lightweight network to aggregate information across scales. Association LSTM is a recurrent neural network architecture that learns to associate objects by using time-dependent encoding between object features in a video sequence; it can track multiple objects simultaneously and shows good results on several tracking benchmarks. T-CNN is a three-dimensional convolutional neural network architecture that can process spatio-temporal data such as video; it uses temporal convolutions to capture the dependence between frames and spatial convolutions to capture spatial features, achieving state-of-the-art action recognition performance on reference datasets.
After the video data is input into the image recognition model, the features of the images in the video data are matched against the features of the different image categories stored in the model, yielding the image categories of the images in the video data.
In this embodiment, the video data is input into the image recognition model, so that the image category of each image in the video data can be obtained quickly and accurately.
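The description does not fix a particular model, naming MANet, Association LSTM, and T-CNN as options. Purely as a stand-in to show the inference step, the sketch below runs an off-the-shelf torchvision detector over one frame; the model choice and the 0.7 score threshold are assumptions, not part of the described method.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]   # COCO labels: person, dog, ...

def recognize_frame(frame, score_threshold=0.7):
    """Return the image categories detected in one video frame.

    frame: float tensor of shape (3, H, W), values scaled to [0, 1].
    """
    with torch.no_grad():
        detections = model([frame])[0]
    return [categories[label] for label, score
            in zip(detections["labels"], detections["scores"])
            if score > score_threshold]
```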
In one embodiment, as shown in FIG. 6, step 106 includes step 604.
Step 604: input the identified image categories and the sound data into a sound separation model to obtain the audio data corresponding to each image category.
The sound separation model is a model trained in advance on image samples of various categories and the corresponding sounds; it encodes the sound corresponding to each category of image sample.
The training process of the sound separation model comprises obtaining a database for audio separation, i.e., a database of sounds and labels for various categories such as boys, girls, dogs, pianos, and violins, dividing it into training, test, and validation sets, and training the model using methods such as deep learning. After training is completed, the trained model can be deployed on hardware such as mobile phones, glasses, or earphones. Models for audio separation include deep learning models used in the audio separation field such as Conv-TasNet, Demucs, and D3Net. Specifically, Conv-TasNet is an end-to-end deep learning model based on a time-domain audio separation network that uses convolutional layers and a dedicated loss function to process the audio signal; it can separate a mixture into the corresponding individual audio signals and performs well on many audio separation tasks. Demucs is a deep-neural-network-based audio separation model that uses an encoder and decoder with a bidirectional long short-term memory network (BLSTM) and downsampling and upsampling operations implemented with temporal convolutional networks. D3Net is a deep learning model aimed at blind source separation of multichannel audio signals; it combines standard convolution layers with dilated convolution layers to improve the quality and speed of audio separation.
After the identified image categories and the sound data are input into the sound separation model, the feature parameters of the sound data are matched against the stored sound feature parameters of the different image categories, yielding the audio data corresponding to each image category.
In this embodiment, the identified image categories and the sound data are input into the sound separation model, so that the audio data corresponding to each image category can be obtained quickly and accurately.
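For illustration only, the sketch below runs a pretrained Conv-TasNet through torchaudio's pipeline API. This public checkpoint separates two speech sources, whereas the model described here would be trained on category-labelled data (boys, girls, dogs, instruments), so treat the bundle name and tensor shapes as assumptions.

```python
import torch
import torchaudio

# Conv-TasNet trained on Libri2Mix (2 speakers) -- a stand-in for the
# category-trained separation model described above.
bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
separator = bundle.get_model().eval()

def separate_sources(mono_waveform):
    """Split one mixed mono recording into per-source tracks.

    mono_waveform: tensor of shape (1, time) at bundle.sample_rate.
    Returns a tensor of shape (num_sources, time).
    """
    with torch.no_grad():
        # Model expects (batch, 1, time); returns (batch, num_sources, time).
        separated = separator(mono_waveform.unsqueeze(0))
    return separated.squeeze(0)
```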
In one embodiment, as shown in FIG. 6, step 108 includes step 606 and step 608.
Step 606: obtain intermediate audio data according to the spatial position information of each image category and the corresponding audio data.
The spatial position information of each image category may be derived from the position of each image category in the video data. After the spatial position information and the corresponding audio data of each image category are obtained, the spatial audio of each image category can be obtained as the intermediate audio data.
Step 608: add reverberation parameters to the intermediate audio data to obtain the spatial audio.
Reverberation parameters are then added to the intermediate audio data to obtain the spatial audio. This simulates the real listening effect under specific environmental conditions, making the resulting spatial audio closer to reality.
Obtaining the intermediate audio data from the spatial position information and corresponding audio data of each image category generally requires a head-related transfer function. Most available head-related transfer functions, however, are laboratory functions measured in anechoic chambers, and therefore lose much spatial information. By adding reverberation parameters to the intermediate audio data through a reverberation algorithm, the early and late reflections of a specific environment can be reintroduced, achieving a realistic simulation of listening in that environment.
Reverberation may be added using a room impulse response (RIR), artificial reverberation, or the like. A room impulse response is the sound signal received by a microphone when an impulsive sound source is played in a room. By recording the room impulse responses at different positions and convolving them with the original sound signal, a reverberation effect with the corresponding room characteristics can be obtained. This method requires first measuring the room to obtain the impulse response data for each position, and the impulse response differs from room to room. Artificial reverberation is a simulated reverberation method based on digital signal processing; it generates the reverberation effect of a virtual room by simulating the propagation, reflection, and attenuation of sound waves in space. Artificial reverberation typically uses digital filters to simulate the reflection and attenuation of sound waves in different frequency ranges, with preset reverberation-time and decay-time parameters controlling the effect.
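A minimal sketch of the RIR approach, assuming a measured two-channel room impulse response is available; the wet/dry balance is an arbitrary choice:

```python
import numpy as np
from scipy.signal import fftconvolve

def add_room_reverberation(dry, rir, wet_gain=0.5):
    """Convolve a binaural signal with a room impulse response.

    dry: array of shape (2, n) -- the intermediate binaural audio.
    rir: array of shape (2, m) -- left/right room impulse responses
         measured at the listening position.
    """
    wet = np.stack([fftconvolve(dry[ch], rir[ch]) for ch in range(2)])
    wet *= wet_gain
    wet[:, :dry.shape[1]] += dry          # mix the dry signal back in
    return wet / np.max(np.abs(wet))      # normalise to avoid clipping
```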
In this embodiment, reverberation parameters are added to the intermediate audio data obtained from the spatial position information and corresponding audio data of each image category, so that the real listening effect under specific environmental conditions can be simulated and the resulting spatial audio is closer to reality.
In one embodiment, as shown in fig. 7, after step 108, the audio/video processing method further includes step 702.
Step 702: adjust the angles in the spatial audio according to the user's head deviation angle, so that the playback position of the adjusted spatial audio does not change with the deviation of the user's head.
Specifically, a gyroscope provided in the wearable device may be used to track the user's head deviation angle. After the terminal acquires the head deviation angle, it adjusts the angles in the spatial audio accordingly to counteract the influence of the head deviation, so that the playback position of the adjusted spatial audio does not change with the deviation of the user's head. For example, when audio and video data are recorded at a concert, the sound originates from the stage directly in front. After the spatial audio is obtained and is played together with the video data, the user's head deviation angle is acquired and the spatial audio is adjusted accordingly, so that no matter how the user's head turns, the playback position of the spatial audio remains directly in front. That is, if the user wore the earphone while recording and the recorded sound was directly in front, then after the user turns left, the playback position of the spatial audio heard by the user is to the right; after the user turns right, it is to the left. In this way, no matter how the user's head turns, the playback position of the adjusted spatial audio does not follow the head deviation but stays fixed, yielding head-tracked spatial audio.
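The compensation itself reduces to subtracting the tracked head yaw from each source's horizontal angle before re-selecting the HRIR; the sketch below uses an assumed sign convention (positive angle and positive yaw both meaning "to the right"):

```python
def compensate_head_rotation(alpha_source, head_yaw_deg):
    """Keep a source fixed in the room while the head turns.

    alpha_source: the source's horizontal angle at recording time (degrees).
    head_yaw_deg: gyroscope yaw; positive means the head turned right
                  (an assumed convention).
    """
    # A source straight ahead must be rendered to the left after a right turn.
    adjusted = (alpha_source - head_yaw_deg + 180.0) % 360.0 - 180.0
    return adjusted   # wrapped to (-180, 180], used to re-select the HRIR
```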
In one embodiment, as shown in fig. 7, after step 108, the audio/video processing method further includes step 704.
Step 704: share the spatial audio.
After the spatial audio is obtained, it can be played independently, without the video data, and the user can share the spatial audio for other users to use. Alternatively, the spatial audio can be played together with the video data, and both can be shared with other users for their use.
The manner of sharing the spatial audio and/or video data is not unique; it may, for example, use LE Audio technology. LE Audio refers to the next-generation audio transmission standard based on Bluetooth Low Energy, a new technology developed by the Bluetooth SIG (Bluetooth Special Interest Group) that provides a better audio experience and more efficient audio transmission.
In this embodiment, after the spatial audio is obtained, it can be shared for other users to use.
In one embodiment, the spatial position information is the position information of each image category in the video data, or the spatial position information is user-defined position information.
Specifically, analyzing the video data yields the spatial position information of each image category in the video data. Alternatively, the spatial position information may be user-defined position information set by the user as needed.
In this embodiment, when the spatial position information is the position information of each image category in the video data, the spatial audio obtained from it can restore the real scene at the time the sound and video were recorded. When the spatial position information is user-defined, the resulting spatial audio can meet the user's personalized requirements. For example, in a concert scene where video and sound are recorded simultaneously, the sound can be decomposed by a music source separation algorithm into multiple sources such as the singer, piano, guitar, and drums. With user-defined spatial position information, the user can freely place the decomposed sources as desired and generate spatial audio without being bound by the concert-hall setup (for example, placing the singer in the middle, the piano at the back, the drums on the left, and the guitar on top), so the application range is wide.
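A toy example of such a user-defined layout, expressing each separated source's placement as an (L, α, β) triple in the convention of fig. 2; all names and values here are hypothetical:

```python
# Hypothetical placement of separated concert sources as (L, alpha, beta):
# distances in metres, angles in degrees, per the convention of fig. 2
# (negative alpha = to the listener's left under this sketch's convention).
custom_layout = {
    "singer": (2.0, 0.0, 0.0),     # singer in the middle
    "piano":  (5.0, 0.0, 0.0),     # piano further back
    "drums":  (3.0, -45.0, 0.0),   # drums on the left
    "guitar": (3.0, 0.0, 30.0),    # guitar above
}
# Each source is then binauralised with the HRIR selected for its triple.
```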
For a better understanding of the above embodiments, a detailed explanation is provided below with reference to a specific embodiment. In one embodiment, as shown in fig. 8, when the audio and video processing method is implemented, video is recorded with a device having a camera, such as a mobile phone or glasses, and sound is recorded with a wearable device (such as a headset or glasses). During recording, the categories in the picture are identified by image recognition. After the sound is recorded, it is separated by category, and the separated sounds are used to make spatial audio according to the direction and distance of each category in the picture. The spatial audio is played back on the earphone, achieving restoration of the real scene and the sound for the user. The earphone can also be combined with a gyroscope to achieve a head-tracked spatial sound effect. The audio and video processing method specifically comprises the following aspects:
At the recording end, a device capable of shooting, such as a mobile phone, camera, or glasses, records the picture, and the device's own microphone records the sound; if the device has no microphone, hardware carrying a microphone, such as a headset, records the sound. The recorded sound may be a mono audio file or a two-channel audio file.
Target recognition is then performed on the recorded video, and the image categories are determined. Target recognition is a form of computer vision used to identify objects in a picture or video. For example, three categories, namely boys, girls, and dogs, are identified in the recorded video; in a concert scene, target recognition can be understood as identifying the various instruments and voices on the live concert stage, such as the singer, guitar, piano, and violin.
Target recognition of the video data can use deep learning: a database for image recognition and video target detection is acquired and divided into training, test, and validation sets, and a model is trained using deep learning methods. Models for video target recognition include MANet, Association LSTM, T-CNN, and the like. The trained model is then deployed on a mobile phone or on hardware such as earphones or glasses.
Audio separation is then performed on the recorded sound data. Based on the image categories identified by target recognition, the recorded mono or two-channel audio is separated; that is, the mixed audio of the boy, the girl, and the dog is separated by an audio separation technique, and in a concert scene, the recorded mix of the concert performance is separated into the lead vocal, guitar, piano, and violin.
The steps for audio separation using deep learning are as follows:
1. Obtain a database for audio separation, comprising sounds and labels for various categories such as boys, girls, dogs, pianos, and violins, and divide it into training, test, and validation sets;
2. Train a model using methods such as deep learning; common models for audio separation include Conv-TasNet, Demucs, D3Net, and the like;
3. Deploy the trained model on hardware such as mobile phones, glasses, or earphones.
After the audio is separated, spatial audio processing is performed. The head-related transfer function provides spatial localization capability (including direction and distance) over the horizontal plane and the whole sphere centered on the listener, and can match the spatial sound field; reverberation algorithms for room or spatial acoustics can be added to construct a more realistic sense of space.
Specifically, the audio files separated in the previous step are used as a multi-channel sound source, and binaural processing is then performed. Binauralization is the core of the audio and video processing: given a multi-channel sound source, an earphone can physically play only two channels. With HRTF technology, the sound sources can be fixed in space, so that wearing the earphone sounds like listening over loudspeakers: a stereo source simulates the sensation of two speakers placed in front of the listener, and a multi-channel source simulates the sensation of multiple speakers placed around the listener.
There are various methods for determining the head-related transfer function: matching the user's HRTF from public data; first clustering the HRTFs in a database and then selecting the best-matching class from the public database using the user's personalized auricle features as the final function; measuring the user's HRTF at various positions in an anechoic chamber; or measuring the user's HRTF at various positions in a concert hall.
After binaural rendering, a reverberation algorithm is applied. This part mainly adds the sense of space to the sound. In general, the available HRTF functions are published laboratory functions; because they are mostly measured in anechoic rooms, much spatial information is lost. A reverberation algorithm can add the early and late reflections of the room, thereby simulating a realistic in-room listening effect. Common ways to add reverberation include room impulse responses (RIR) and artificial reverberation.
The sound effect function representing the sense of sound direction above is HRTF(L, α, β), where L is the distance between the sound source and the user, α is the horizontal angle, and β is the vertical angle; the angles are defined as shown in fig. 2. The L, α, β of each image category can be determined from the position of each object in the video data. Alternatively, the user can place the separated audio signals as desired, i.e., define L, α, β for each image category. Once defined, the corresponding HRTFs can be selected from the HRTF library (for example, for the boy, hll and hlr respectively represent the transfer functions of his sound to the left ear and the right ear), and each HRTF is then convolved with the corresponding audio signal to obtain the spatial audio. If desired, the user may also add different reverberations to enhance the spatial impression.
As shown in fig. 9, there are three target audio sources, and since headphones have only two channels, a down-mixing algorithm is needed for headphone playback. Let the sounds of the boy, the girl, and the dog be x1, x2, x3; let hll and hlr denote the transfer functions of x1 to the left ear and to the right ear, hcl and hcr those of x2, and hrl and hrr those of x3. The left-ear and right-ear sound signals yl and yr heard by the listener are:
yl=x1*hll+x2*hcl+x3*hrl;
yr=x1*hlr+x2*hcr+x3*hrr;
Through the above processing, the user hears, at the earphone end, spatial audio that matches what was recorded.
The user's head rotation angle can be determined with various sensors such as G-sensors and gyroscopes. The head deviation angle can be tracked in real time by the gyroscope built into the earphone; however the user rotates, the sound sources keep their original positions, and the direction of the sound does not change with the head, yielding head-tracked spatial audio.
In addition, the user can change the spatial audio by customizing L, α, β. For example, in a concert scene, while recording video and sound, the original stereo sound can be decomposed by audio separation into multiple sources such as the singer, piano, guitar, and drums; the user can freely place the decomposed sources as desired and generate spatial audio without being bound by the concert-hall setup (for example, placing the singer in the middle, the piano at the back, the drums on the left, and the guitar on top).
The generated spatial audio can be played independently, without the video. If the user wants to share a DIY work with friends, LE Audio technology can be used to share the audio multitrack file. Alternatively, the generated spatial audio can be played in combination with the video; for example, if the user wants to share the spatial audio placed in the concert scene with friends, LE Audio technology can be used to share the video file together with the audio multitrack file.
According to the audio and video processing method provided by the embodiments of the application, a multi-track target audio source can be constructed from an ordinary recording through audio separation, forming spatial audio that restores the scene and its realism; combined audio-video presentation can reproduce the scene to a great extent, achieving an immersive, realistic effect.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, an embodiment of the application also provides an audio and video processing apparatus for implementing the audio and video processing method described above. The implementation of the solution provided by the apparatus is similar to that described for the method above; therefore, for the specific limitations of the one or more audio and video processing apparatus embodiments below, reference may be made to the limitations of the audio and video processing method above, which are not repeated here.
In one embodiment, as shown in fig. 10, there is provided an audio/video processing apparatus, including an audio/video acquisition module 1002, an image recognition module 1004, a sound separation module 1006, and a spatial audio reconstruction module 1008, wherein:
The audio/video acquisition module 1002 is configured to acquire audio/video data, where the audio/video data includes sound data and video data.
The image recognition module 1004 is configured to perform object recognition on the video data, and determine an image category.
The sound separation module 1006 is configured to perform audio separation on the sound data based on the image categories, and determine audio data corresponding to each image category.
The spatial audio reconstruction module 1008 is configured to obtain spatial audio according to spatial position information of each image category and corresponding audio data, where the spatial audio is used for playing in combination with video data or playing separately.
In one embodiment, the spatial audio reconstruction module is further configured to obtain spatial location information of each image category, input the spatial location information of each image category as a key parameter into the head transfer function, obtain a head related impulse response, and obtain spatial audio according to the head related impulse response and audio data corresponding to each image category.
In one embodiment, the head related impulse response includes a head related impulse response to a left ear and a head related impulse response to a right ear, the spatial audio includes an output audio signal of the left ear and an output audio signal of the right ear, the spatial audio reconstruction module is further configured to convolve audio data corresponding to each image class with the head related impulse response to the left ear, and then perform a down-mixing algorithm to obtain an output audio signal of the left ear, and convolve audio data corresponding to each image class with the head related impulse response to the right ear, and then perform a down-mixing algorithm to obtain an output audio signal of the right ear.
In one embodiment, the image recognition module is further configured to input the video data into an image recognition model to obtain image categories of each image in the video data, where the image recognition model is obtained by training based on image samples of each category.
In one embodiment, the sound separation module is further configured to input the identified image types and the sound data into a sound separation model to obtain audio data corresponding to each image type, where the sound separation model is obtained based on the image samples of each type and the corresponding sound training.
In one embodiment, the spatial audio reconstruction module is further configured to obtain intermediate audio data according to the spatial position information of each image category and the corresponding audio data, and to add reverberation parameters to the intermediate audio data to obtain the spatial audio.
In one embodiment, the audio-video processing device further includes a head tracking module, and the head tracking module is configured to adjust an angle in the spatial audio according to the head deviation angle of the user after the spatial audio reconstruction module obtains the spatial audio according to the spatial position information of each image category and the corresponding audio data, so that the play position of the adjusted spatial audio does not change along with the head deviation of the user.
In an embodiment, the audio/video processing device further includes a sharing module, where the sharing module is configured to share the spatial audio after the spatial audio reconstruction module obtains the spatial audio according to the spatial location information of each image category and the corresponding audio data.
In one embodiment, the spatial location information is the location information of each image category in the video data, or the spatial location information is user-defined location information.
The various modules in the audio and video processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in a computer device in the form of hardware, or stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, an audio and video device is provided. The device comprises a recorder and a processor connected to the recorder; the recorder is used to record audio and video data and send it to the processor, and the processor performs audio and video processing according to the method of any of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to recorded audio and video data) and the data (including but not limited to data for analysis, stored data, displayed data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing embodiments illustrate only a few implementations of the application; their description is specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the protection scope of the application. Accordingly, the scope of the application should be determined by the appended claims.