CN119694334B - Method, device, equipment and storage medium for labeling audio track data
- Publication number: CN119694334B (application CN202411809979.7A)
- Authority: CN (China)
- Prior art keywords: track, audio, predicted, signal, processed
- Legal status: Active (an assumption by Google Patents, not a legal conclusion)
- Classification: Auxiliary Devices For Music (AREA)
Abstract
The application discloses a method, an apparatus, a device, and a storage medium for labeling audio track data, belonging to the field of audio processing. The method comprises: calling an audio track-dividing model to separate, from audio to be processed, predicted track signals respectively corresponding to at least one audio track; determining the audio to be processed as valid audio when the energy of the at least one predicted track signal meets a threshold condition; acquiring a track labeling result of the audio to be processed when the audio to be processed is determined to be valid based on the energy of the predicted track signals corresponding to the at least one audio track; and training the audio track-dividing model based on the audio to be processed and the track labeling result. The method can improve the labeling efficiency of audio track data.
Description
Technical Field
The present application relates to the field of audio processing, and in particular, to a method, apparatus, device, and storage medium for labeling audio track data.
Background
Audio typically includes track signals of multiple audio tracks. For example, song audio contains vocal audio and accompaniment audio, and the accompaniment audio in turn contains the track audio of different instruments, such as drum audio, bass audio, and piano audio.
In the related art, part of the track signals in audio must be labeled manually so that the labeled track signals can be used to perform the corresponding functions. For example, in a song-recording scenario, the accompaniment track audio in song audio must be played on its own, so that recorded audio can be generated from the recorded user vocal audio and the accompaniment track audio.
However, relying on manual listening to label track signals is labor-intensive, and the labeling efficiency is low.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for marking audio track data, which can improve the marking efficiency of the audio track data. The technical scheme is as follows:
according to an aspect of the present application, there is provided a method of labeling audio track data, the method comprising:
calling an audio track-dividing model to separate, from audio to be processed, predicted track signals respectively corresponding to at least one audio track;
determining the audio to be processed as valid audio in case the energy of the at least one predicted track signal meets a threshold condition;
when the audio to be processed is determined to be valid based on the energy of the predicted track signal corresponding to the at least one audio track, acquiring a track labeling result of the audio to be processed;
And training the audio track dividing model based on the audio to be processed and the track labeling result.
According to another aspect of the present application, there is provided an apparatus for labeling audio track data, the apparatus comprising:
The separation module is used for calling the audio track dividing model to separate at least one predicted track signal corresponding to the audio track from the audio to be processed;
a decision module for determining the audio to be processed as valid audio in case the energy of the at least one predicted track signal meets a threshold condition;
The labeling module is used for acquiring a track labeling result of the audio to be processed when the audio to be processed is determined to be valid based on the energy of the predicted track signal corresponding to the at least one audio track;
And the training module is used for training the audio track dividing model based on the audio to be processed and the track marking result.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by the processor to implement the method of annotating audio track data as described in the above aspect.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes or a set of instructions, the at least one instruction, the at least one program, the set of codes or the set of instructions being loaded and executed by a processor to implement the method of labeling audio track data as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of labeling audio track data provided in various alternative implementations of the above aspects.
The technical scheme provided by the application has at least the following beneficial effects:
The audio track-dividing model performs track separation on the audio to be processed to obtain predicted track signals respectively corresponding to at least one audio track. The energy value of each predicted track signal is then calculated, and whether the audio to be processed contains the track signal of the audio track to be labeled is judged according to the threshold condition corresponding to that audio track. If it does, the audio to be processed is determined to be valid data, and the original track signals of the valid data are manually auditioned for track labeling. If it does not, the audio to be processed does not contain the track signal of the audio track to be labeled, and its original track signals need not be manually auditioned, which improves track labeling efficiency. After the track labeling result is obtained, it can further be used to train the audio track-dividing model; the labeling result improves the separation accuracy of the model, so valid data are screened more accurately, forming a virtuous circle of track data labeling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for labeling audio track data provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method for labeling audio track data provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for labeling audio track data provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a method for labeling audio track data according to an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a device for labeling audio track data according to an exemplary embodiment of the present application;
Fig. 12 is a schematic diagram of a computer device according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a computer system provided by an exemplary embodiment of the application, which may include a terminal device 101 and a server 103.
The method for labeling audio track data according to the embodiment of the present application may be applied to a terminal device, where the terminal device 101 runs an application 102 having a labeling requirement for audio track data. The terminal device may include a mobile phone, a tablet computer, a notebook computer, a laptop computer, a desktop computer, an all-in-one computer, an internet-of-things device, an intelligent robot workstation, a television, a set-top box, smart glasses, a smart watch, a digital camera, an MP4 playback device, an MP5 playback device, a learning machine, a point-to-read machine, an electronic paper book, an electronic dictionary, a vehicle-mounted device, a Virtual Reality (VR) playback device, an Augmented Reality (AR) playback device, and the like.
The method for labeling the audio track data provided by the application can be executed by a client in the terminal equipment. The client is a client of an application having labeling requirements for audio track data. For example, the applications may include at least one of an audio player, an audio application, a video application, a social application, a shopping application, a live application, a car audio playback application, an information application, a browser, a game application, a recording application.
For example, an accompaniment function is provided in an audio player, and the accompaniment function needs to separate human voice audio from accompaniment audio for original audio and annotate the accompaniment audio therein. The audio player can use the method for labeling the audio track data provided by the application to separate and label the accompaniment audio in the original audio for the accompaniment function.
The terminal device 101 comprises a first memory and a first processor. The first memory stores a program for labeling audio track data, which is called and executed by the first processor to implement the method for labeling audio track data. The first memory may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
The first processor may be one or more integrated circuit chips. Alternatively, the first processor may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or a network processor (Network Processor, NP). Optionally, the first processor may implement the method for labeling audio track data provided by the present application by running a program or code.
In an alternative embodiment, the terminal device 101 and the server 103 may be connected to each other through a wired or wireless network.
The method for labeling the audio track data can be executed by a server. The server 103 is configured to provide a background service for a target use case system of the terminal device 101. Optionally, the server 103 performs primary computing, the terminal device 101 performs secondary computing, or the server 103 performs secondary computing, the terminal device 101 performs primary computing, or a distributed computing architecture is used between the server 103 and the terminal device 101 for collaborative computing.
The server 103 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center.
Optionally, the server 103 comprises a second memory and a second processor. The second memory stores the marking program of the audio track data, and the marking program of the audio track data is called by the second processor to realize the marking method of the audio track data. Alternatively, the second memory may include, but is not limited to, RAM, ROM, PROM, EPROM, EEPROM. Alternatively, the second processor may be a general purpose processor, such as a CPU or NP.
Fig. 2 is a flowchart of a method for labeling audio track data according to an exemplary embodiment of the present application. The method may be used for a terminal device or a server as shown in fig. 1. The method comprises the following steps.
Step 210, calling an audio track-dividing model to separate, from the audio to be processed, predicted track signals respectively corresponding to at least one audio track.
Illustratively, the audio to be processed is composed of at least one original track signal, and the audio track to which each original track signal belongs is unknown. The method provided by the embodiment of the application can be used to label the original track signal of the target audio track to be labeled.
For example, suppose the requirement is to label the accompaniment track audio in all the audio to be processed, but each piece of audio to be processed may contain any of vocal track audio, accompaniment track audio, and noise track audio. With the method in the related art, every original track audio of every piece of audio to be processed must be manually auditioned to identify the accompaniment track audio, so the labeling efficiency is too low. With the method provided by the embodiment of the application, the audio track-dividing model can separate predicted accompaniment track audio from each piece of audio to be processed. The predicted accompaniment track audio is then coarsely screened using the energy threshold of accompaniment track audio, by checking whether its energy reaches that threshold. If it does, the audio to be processed most likely contains accompaniment track audio and is marked as valid data; if it does not, the audio to be processed most likely does not contain accompaniment track audio and is marked as invalid data. Afterwards, only the original track audio of the valid data is auditioned to label the accompaniment track audio, and the original track audio of the invalid data is never auditioned, which greatly improves the labeling efficiency of audio track data.
The audio track-dividing model is a neural network model used to separate, from the input audio to be processed, the predicted track signals respectively corresponding to at least one audio track. The at least one audio track is the audio track that needs to be labeled (i.e., the target audio track). For example, if the audio track-dividing model is trained to separate the predicted track signals of a first audio track, a second audio track, and a third audio track, then after the audio to be processed is input into the model, a first predicted track signal corresponding to the first audio track, a second predicted track signal corresponding to the second audio track, and a third predicted track signal corresponding to the third audio track can be obtained.
For example, the audio track-dividing model may be pre-trained using a small number of training samples, e.g., from the open-source dataset MUSDB-HQ. Each training sample comprises a sample audio signal (the mixture) and the original track signals of the four audio tracks it contains: vocals (human voice), drums (drumbeat), bass, and other. During training, the sample audio signal is input into the audio track-dividing model to obtain the predicted track signals of the four audio tracks, and the model is then trained according to the loss between the predicted track signals and the original track signals in the training sample.
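As an illustration, one pre-training step on such a sample might look like the following sketch (PyTorch is assumed; the model interface, tensor shapes, and the plain L1 loss are assumptions, not the patent's reference implementation — the full time/frequency-domain loss appears in a later embodiment):

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, mix, stems):
    """One training step on a MUSDB-HQ-style sample.

    mix:   (batch, samples) mixture audio, the sample audio signal.
    stems: (batch, 4, samples) original track signals of the four
           audio tracks vocals / drums / bass / other.
    """
    optimizer.zero_grad()
    predicted = model(mix)               # (batch, 4, samples) predicted track signals
    loss = F.l1_loss(predicted, stems)   # stand-in for the full time/frequency loss
    loss.backward()
    optimizer.step()
    return loss.item()
```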
Step 220, determining the audio to be processed as valid audio when the energy of the at least one predicted track signal satisfies a threshold condition.
For example, the track signal of each audio track corresponds to a respective energy value threshold, and if the predicted track signal reaches the energy value threshold, the predicted track signal may be the track signal of the audio track. Thus, it is possible to coarsely screen whether the predicted track signal is a track signal of the audio track based on the energy of the predicted track signal.
If the at least one predicted track signal contains the track signal of the audio track to be labeled, the audio to be processed is determined to be valid audio, so that the original track audio of the valid audio can be manually auditioned and the track signal of the audio track to be labeled can be labeled.
The at least one predicted track signal may include a first predicted track signal corresponding to a first audio track separated by the audio track-splitting model. The valid data may be determined in one of the following ways:
1) Determining the audio to be processed as valid data when the energy of the first predicted track signal meets a first threshold condition corresponding to the first audio track.
That is, when the energy of any one of the at least one predicted track signal satisfies its threshold condition, the audio to be processed may be marked as valid data. The threshold in the threshold condition corresponding to an audio track may be set according to the average energy value of historical audio signals of that track, according to the lowest energy value of those historical audio signals, or by subtracting a preset error from the average energy value of the historical audio signals.
2) Determining the audio to be processed as valid data when the energy of each of the at least one predicted track signal meets the threshold condition corresponding to its respective audio track.
That is, the audio to be processed is marked as valid data only if the energies of all the predicted track signals satisfy their respective threshold conditions; if the energy of any one predicted track signal fails the threshold condition of its corresponding audio track, the audio to be processed is marked as invalid data.
The valid data is the audio to be processed that requires manual listening. The valid data may contain the original track signal of the target audio track to be marked. Correspondingly, the invalid data is the audio to be processed which does not need manual listening. The original track signal of the target audio track to be marked may not be included in the invalid data.
The audio signals of different audio tracks have different energies (the energy of an audio signal is proportional to the square of its amplitude, and different tracks have different amplitudes). For example, the audio energy of a suona track is typically above a first threshold, the audio energy of a human voice track is typically above a second threshold, and the first threshold is higher than the second (the amplitude of suona audio is typically higher than that of human voice). The audio to be processed can therefore be preliminarily screened against the threshold corresponding to each audio track: predicted track signals that plausibly belong to the track are kept, and low-energy predicted track signals that most likely do not belong to it are screened out. Manual audition and labeling are then performed only on audio to be processed that contains a predicted track signal of the target audio track, which avoids wasting labor on manually auditioning large amounts of invalid data and improves the labeling efficiency of audio track data.
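As a rough sketch of this coarse screening (NumPy is assumed; the function and threshold names are illustrative, and energy is computed as the sum of squared sample amplitudes, as described above):

```python
import numpy as np

def screen_audio(predicted_tracks, energy_thresholds, require_all=False):
    """Mark audio to be processed as valid or invalid data.

    predicted_tracks:  dict mapping track name -> predicted track signal (1-D array).
    energy_thresholds: dict mapping track name -> minimum energy value.
    require_all:       False -> variant 1) above (one passing track suffices);
                       True  -> variant 2) above (every track must pass).
    """
    energies = {name: float(np.sum(sig ** 2)) for name, sig in predicted_tracks.items()}
    passed = [energies[name] >= energy_thresholds[name] for name in predicted_tracks]
    return all(passed) if require_all else any(passed)
```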
Step 230, acquiring a track labeling result of the audio to be processed when the audio to be processed is determined to be valid based on the energy of the predicted track signal corresponding to the at least one audio track.
Track labeling is performed on the valid data to obtain a track labeling result, which comprises the audio track type to which each original track signal belongs.
For example, the valid data can be manually auditioned for track labeling. After the valid data is obtained, its original track signals can be listened to manually; with high probability, they contain the track signal of the target audio track to be labeled. If the track signal of the target audio track is not heard during manual audition, the valid data does not contain the track signal of the target audio track, and no labeling is needed.
Or a track labeling model (neural network model) can be used for labeling the audio tracks of each original track signal in the effective data to obtain a track labeling result. The track annotation model is trained to output an audio track to which the original track signal belongs according to the input original track signal.
The track labeling result comprises the audio track to which each of the at least one original track signal belongs. Alternatively, the track labeling result comprises the original track signal corresponding to the target audio track. For example, the at least one original track signal includes a first original track signal, a second original track signal, and a third original track signal; the track labeling result may record that the first original track signal corresponds to a first audio track, the second original track signal corresponds to a second audio track, and the third original track signal corresponds to a third audio track. For another example, with the same three original track signals and a requirement to label the track signal of the second audio track, the track labeling result may record that the second original track signal corresponds to the second audio track, while the first and third original track signals need not be labeled.
Step 240, training the audio track-dividing model based on the audio to be processed and the track labeling result.
For example, when the audio to be processed is valid data and the audio to be processed corresponds to the track labeling result, the audio to be processed is used as sample input data, the track labeling result is used as a sample label, and the audio track dividing model is trained. For example, inputting the audio to be processed into an audio track-dividing model to obtain a predicted track signal corresponding to the target audio track, then obtaining an original track signal corresponding to the target audio track in the track labeling result, calculating the loss between the predicted track signal and the original track signal, and training the audio track-dividing model based on the loss. For example, the valid data includes a first original track signal, and the track labeling result includes that the first original track signal belongs to the first audio track. And inputting the audio to be processed into the audio track-dividing model to obtain a first predicted track signal corresponding to the first audio track, then calculating first losses of the first predicted track signal and the first original track signal, and training the audio track-dividing model according to the first losses.
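A minimal sketch of this fine-tuning step (same PyTorch assumptions as the earlier pre-training sketch; the track-index argument and the L1 loss are illustrative):

```python
import torch
import torch.nn.functional as F

def finetune_step(model, optimizer, mix, original_stem, track_index):
    """Train on one piece of valid data whose track labeling result assigns
    `original_stem` (the first original track signal) to track `track_index`."""
    optimizer.zero_grad()
    predicted = model(mix)                       # (batch, tracks, samples)
    predicted_stem = predicted[:, track_index]   # predicted signal of the target track
    loss = F.l1_loss(predicted_stem, original_stem)  # loss between the two signals
    loss.backward()
    optimizer.step()
    return loss.item()
```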
The audio track-dividing model comprises a first network corresponding to a first audio track, the first network is used for separating and obtaining a predicted track signal of the first audio track from input audio to be processed, and the track labeling result comprises that the first original track signal corresponds to the first audio track. And training an audio track dividing model by taking the audio to be processed as sample input data of the first network and the first original track signal as a sample label of the first network.
The network structure of the audio track-dividing model is not limited in this embodiment, and any neural network model for executing the audio track-dividing task can be applied to the method for labeling audio track-dividing data provided in this embodiment.
In summary, in the method provided in this embodiment, the audio track-dividing model performs track separation on the audio to be processed to obtain predicted track signals respectively corresponding to at least one audio track. The energy value of each predicted track signal is then calculated, and whether the audio to be processed contains the track signal of the audio track to be labeled is judged according to the threshold condition corresponding to that audio track. If it does, the audio to be processed is determined to be valid data, and the original track signals of the valid data are manually auditioned for track labeling. If it does not, the audio to be processed does not contain the track signal of the audio track to be labeled, and its original track signals need not be manually auditioned, which improves track labeling efficiency. After the track labeling result is obtained, it can further be used to train the audio track-dividing model; the labeling result improves the separation accuracy of the model, so valid data are screened more accurately, forming a virtuous circle of track data labeling.
An exemplary embodiment of determining valid data is presented.
Fig. 3 is a flowchart of a method for labeling audio track data according to an exemplary embodiment of the present application. The method may be used for a terminal device or a server as shown in fig. 1. Step 220 includes step 221 based on the embodiment shown in fig. 2.
Step 210, calling an audio track dividing model to separate and obtain a predicted track signal corresponding to at least one audio track from audio to be processed, wherein the audio to be processed consists of at least one original track signal, and the audio track corresponding to the at least one original track signal is unknown.
The model structure of the audio track-dividing model may be arbitrary. Fig. 4 shows an audio track-dividing model provided in this embodiment, where a sub-network corresponding to a target audio track in the audio track-dividing model includes a sub-band splitting module, a time-frequency modeling module, and a masking value prediction module. When the audio track-splitting model is used to split the predicted track signals of a plurality of audio tracks, the audio track-splitting model may comprise a plurality of sub-networks to which the audio tracks respectively correspond, i.e. the audio track-splitting model comprises a plurality of sub-networks as shown in fig. 4.
For example, as shown in fig. 4, the audio X to be processed is input to the subband splitting module to obtain a subband splitting result Z, the subband splitting result Z is input to the time-frequency modeling module to obtain a modeling result Q, and the modeling result is input to the masking value prediction module to obtain a masking value M corresponding to the target audio track. Multiplying the masking value M with the audio X to be processed to obtain a predicted track signal S corresponding to the target audio track.
As shown in fig. 5, this embodiment provides a sub-band splitting module. The sub-band splitting module includes a splitting network, normalization and convolution networks 501, and a merging network. The audio X to be processed is input into the splitting network, which performs a short-time Fourier transform on it to obtain a frequency-domain signal and splits the frequency bins of that signal into at least one sub-band, yielding the frequency-domain signal of each sub-band. The frequency-domain signal of each sub-band is then input into the normalization and convolution network 501 corresponding to that sub-band to obtain the convolution result of each sub-band; each normalization and convolution network 501 includes a normalization layer and at least one convolution layer. Finally, the convolution results of all sub-bands are input into the merging network and spliced in the frequency domain, merging them into the sub-band splitting result Z.
As shown in fig. 6, this embodiment provides a time-frequency modeling module, which may be implemented using RNNs (Recurrent Neural Networks). For example, the time-frequency modeling module may include a time-domain RNN network and a frequency-domain RNN network. The sub-band splitting result Z is input into the time-domain RNN network in time-domain order to obtain an intermediate result Z'; the time-domain RNN network includes a normalization layer, a BLSTM (Bidirectional Long Short-Term Memory) network, and a convolution network. The intermediate result Z' is then input into the frequency-domain RNN network in frequency-domain order to obtain the modeling result Q; the frequency-domain RNN network likewise includes a normalization layer, a BLSTM, and a convolution network.
As shown in fig. 7, this embodiment provides a masking value prediction module. The masking value prediction module includes a splitting network, normalization and MLP (Multilayer Perceptron) networks 502, and a merging network. The modeling result Q is input into the splitting network, which splits it into at least one sub-band to obtain the modeling result of each sub-band. The modeling result of each sub-band is then input into the normalization and MLP network 502 corresponding to that sub-band to obtain the prediction result of each sub-band; each normalization and MLP network 502 includes a normalization layer and an MLP network. The prediction results of all sub-bands are then input into the merging network and spliced in the frequency domain, merging them into the masking value M. Finally, the masking value M is multiplied by the audio X to be processed to obtain the predicted track signal of the target audio track (for the first audio track, the first predicted track signal).
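To make the data flow concrete, here is a compact, self-contained sketch of one such sub-network (PyTorch is assumed; the uniform band split, band count, feature size, and FFT size are illustrative assumptions — the patent fixes none of them):

```python
import torch
import torch.nn as nn

class TrackSubNetworkSketch(nn.Module):
    """Sketch of one sub-network of fig. 4: sub-band splitting -> time/frequency
    BLSTM modeling -> per-band masking value prediction -> mask * spectrogram."""

    def __init__(self, n_fft=2048, n_bands=8, feat=64):
        super().__init__()
        self.n_fft = n_fft
        bins = n_fft // 2 + 1
        # Frequency-bin boundaries of each sub-band (uniform split for simplicity).
        self.edges = [round(i * bins / n_bands) for i in range(n_bands + 1)]
        widths = [self.edges[i + 1] - self.edges[i] for i in range(n_bands)]
        # Sub-band splitting module: per-band normalization + projection.
        self.split = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(2 * w), nn.Linear(2 * w, feat)) for w in widths)
        # Time-frequency modeling module: BLSTM along time, then along bands.
        self.time_rnn = nn.LSTM(feat, feat // 2, bidirectional=True, batch_first=True)
        self.band_rnn = nn.LSTM(feat, feat // 2, bidirectional=True, batch_first=True)
        # Masking value prediction module: per-band normalization + MLP.
        self.mask = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(feat), nn.Linear(feat, 2 * w)) for w in widths)

    def forward(self, x):                                   # x: (batch, samples)
        spec = torch.stft(x, self.n_fft, window=torch.hann_window(self.n_fft),
                          return_complex=True)              # (B, bins, T)
        ri = torch.view_as_real(spec)                       # (B, bins, T, 2)
        B, _, T, _ = ri.shape
        # Sub-band splitting result Z: (B, bands, T, feat).
        z = torch.stack([
            proj(ri[:, lo:hi].permute(0, 2, 1, 3).reshape(B, T, -1))
            for proj, lo, hi in zip(self.split, self.edges, self.edges[1:])], dim=1)
        # Modeling result Q: run the BLSTMs over the time axis, then the band axis.
        q = self.time_rnn(z.reshape(-1, T, z.shape[-1]))[0].reshape(z.shape)
        q = q.transpose(1, 2)                               # (B, T, bands, feat)
        q = self.band_rnn(q.reshape(-1, q.shape[2], q.shape[-1]))[0].reshape(q.shape)
        q = q.transpose(1, 2)                               # (B, bands, T, feat)
        # Masking value M, applied band by band to the mixture spectrogram.
        out = []
        for i, (mlp, lo, hi) in enumerate(zip(self.mask, self.edges, self.edges[1:])):
            m = mlp(q[:, i]).reshape(B, T, hi - lo, 2).permute(0, 2, 1, 3)
            out.append(torch.view_as_complex(m.contiguous()) * spec[:, lo:hi])
        s = torch.cat(out, dim=1)                           # masked spectrogram
        # Predicted track signal S of the target audio track.
        return torch.istft(s, self.n_fft, window=torch.hann_window(self.n_fft),
                           length=x.shape[-1])
```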
Step 221, determining the audio to be processed as valid data when the energy of each of the at least one predicted track signal meets the threshold condition corresponding to its respective audio track.
Taking the first predicted track signal corresponding to the first audio track as an example: calculate a first energy value of the first predicted track signal; calculate the sum of the energy values of the at least one predicted track signal; calculate the ratio of the first energy value to the sum of energy values; and when the ratio is not less than the first threshold corresponding to the first audio track, determine that the energy of the first predicted track signal meets the threshold condition corresponding to the first audio track.
For example, the predicted track signals cover four audio tracks: vocals (human voice), drums (drumbeat), bass, and other. The threshold conditions of the four audio tracks may be:

Human voice: $\frac{\sum_t y_v(t)^2}{\sum_t \left( y_v(t)^2 + y_d(t)^2 + y_b(t)^2 + y_o(t)^2 \right)} \geq \alpha_v$

Drumbeat: $\frac{\sum_t y_d(t)^2}{\sum_t \left( y_v(t)^2 + y_d(t)^2 + y_b(t)^2 + y_o(t)^2 \right)} \geq \alpha_d$

Bass: $\frac{\sum_t y_b(t)^2}{\sum_t \left( y_v(t)^2 + y_d(t)^2 + y_b(t)^2 + y_o(t)^2 \right)} \geq \alpha_b$

Other: $\frac{\sum_t y_o(t)^2}{\sum_t \left( y_v(t)^2 + y_d(t)^2 + y_b(t)^2 + y_o(t)^2 \right)} \geq \alpha_o$

wherein $y_v(t)$ is the predicted track signal of the human voice track, $y_d(t)$ is the predicted track signal of the drumbeat track, $y_b(t)$ is the predicted track signal of the bass track, and $y_o(t)$ is the predicted track signal of the other track; $\alpha_v$, $\alpha_d$, $\alpha_b$, and $\alpha_o$ are the thresholds of the human voice, drumbeat, bass, and other tracks, respectively (the energy of a signal is the sum of its squared amplitudes, per step 221).
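A direct sketch of evaluating these four conditions (NumPy is assumed; the ratio form is reconstructed from step 221 above, and the threshold values are assumptions to be tuned per deployment):

```python
import numpy as np

def threshold_results(y_v, y_d, y_b, y_o, alpha):
    """Evaluate the four per-track threshold conditions.

    y_*:   predicted track signals (1-D arrays) for vocals/drums/bass/other.
    alpha: dict with keys "vocals", "drums", "bass", "other" -> thresholds.
    """
    energies = {"vocals": np.sum(y_v ** 2), "drums": np.sum(y_d ** 2),
                "bass": np.sum(y_b ** 2), "other": np.sum(y_o ** 2)}
    total = sum(energies.values()) or 1.0   # guard against all-silent output
    return {name: energies[name] / total >= alpha[name] for name in energies}
```

Under variant 2) of step 220, the audio to be processed is marked as valid data only if every entry of the returned dict is True.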
Step 230, when the audio to be processed is determined to be valid based on the energy of the predicted track signals corresponding to the at least one audio track, acquiring a track labeling result of the audio to be processed, the track labeling result comprising the audio track corresponding to the at least one original track signal.
Step 240, training the audio track-dividing model based on the audio to be processed and the track labeling result.
Illustratively, as shown in fig. 8, the audio 301 to be processed is input into the audio track-dividing model 302 to obtain four predicted track signals on four audio tracks; energy determination 304 is then performed on the four predicted track signals, and when the energies of all four predicted track signals meet the thresholds of their respective audio tracks, the audio 301 to be processed is determined to be valid data. The valid data is provided for manual labeling. The manually labeled results may be used, together with the audio 301 to be processed, to train the audio track-dividing model 302.
In summary, in the method provided in this embodiment, the audio track-dividing model performs track separation on the audio to be processed to obtain predicted track signals respectively corresponding to at least one audio track. Then, for the predicted track signal of one audio track, the ratio of its energy value to the total energy value of all the predicted track signals is calculated; when the ratio exceeds the energy threshold corresponding to that audio track, the predicted track signal of that audio track is determined to meet the threshold condition, and when the at least one predicted track signal meets the threshold condition, the audio to be processed is marked as valid data. Invalid data that does not contain the track signal of the target audio track is screened out, and track labeling is performed only on the valid data, which improves track labeling efficiency.
An exemplary embodiment of training an audio track-splitting model is given below.
Fig. 9 is a flowchart of a method for labeling audio track data according to an exemplary embodiment of the present application. The method may be used for a terminal device or a server as shown in fig. 1. The method step 240 includes steps 241 to 245 based on the embodiment shown in fig. 2.
Step 241, inputting the audio to be processed into the audio track-dividing model to obtain a first number of predicted track signals.
The first number is the number of audio tracks that the audio track-splitting model is able to separate. For example, the audio track splitting model is used to split the track signal of three audio tracks, the first number is 3.
For another example, the audio to be processed is input into the audio track-dividing model to obtain four predicted track signals on the vocals (human voice), drums (drumbeat), bass, and other audio tracks.
Then, the loss between the predicted track signals and the original track signals is calculated.
For example, the audio to be processed is input into the audio track-dividing model to obtain a first number of predicted track signals, and the losses between the first number of predicted track signals and the first number of original track signals are calculated according to the manual labeling result. The loss is calculated as follows.
The loss function is:

$$L_{loss} = \frac{1}{J}\left(\mathcal{L}_t + \alpha \mathcal{L}_f\right)$$

wherein $L_{loss}$ is the loss, $\mathcal{L}_t$ is the time-domain loss of the audio signals of the $J$ audio tracks (i.e., $J$ is the first number), and $\mathcal{L}_f$ is the frequency-domain loss of the audio signals of the $J$ audio tracks. $J$ is the number of audio tracks. $\alpha$ is the scale factor between the time-domain loss and the frequency-domain loss, and is a preset value.
The time-domain loss function is:

$$\mathcal{L}_t = \sum_{j=1}^{J} \left\| y_j(t) - \hat{y}_j(t) \right\|_1$$

where $y_j(t)$ is the original track signal (label data, i.e., real data) of the $j$-th audio track, $\hat{y}_j(t)$ is the predicted track signal of the $j$-th audio track, and $\|\cdot\|_1$ is the first-order norm.
The frequency-domain loss function is:

$$\mathcal{L}_f = \sum_{j=1}^{J} \sum_{i=1}^{M} \left( \mathcal{L}_{mag}^{(i,j)} + \mathcal{L}_{log}^{(i,j)} \right), \qquad \mathcal{L}_{mag} = \frac{\left\| \, |STFT(y(t))| - |STFT(\hat{y}(t))| \, \right\|_F}{\left\| STFT(y(t)) \right\|_F}, \qquad \mathcal{L}_{log} = \left\| \log|STFT(y(t))| - \log|STFT(\hat{y}(t))| \right\|_1$$

wherein $\mathcal{L}_{mag}^{(i,j)} + \mathcal{L}_{log}^{(i,j)}$ is the resolution loss at the $i$-th resolution, and $M$ is the total number of resolutions. $\mathcal{L}_{mag}$ is the amplitude loss of the $j$-th audio track and $\mathcal{L}_{log}$ is the logarithmic loss of the $j$-th audio track. $STFT(y(t))$ is the frequency-domain signal obtained by short-time Fourier transform of the original track signal, and $STFT(\hat{y}(t))$ is the predicted frequency-domain signal obtained by short-time Fourier transform of the predicted track signal. $\|\cdot\|_F$ is the second-order (Frobenius) norm.
According to the above calculation formula of the loss function, the calculation method for obtaining the loss includes the following steps 242 to 244.
Step 242, calculating the time-domain loss $\mathcal{L}_t$ between the first number of predicted track signals $\hat{y}_j(t)$ and the original track signals $y_j(t)$ of the corresponding audio tracks.
Illustratively, for one audio track of the first number of audio tracks, the first-order norm of the amplitude difference between the predicted track signal and the original track signal in the time domain, $\|y_j(t) - \hat{y}_j(t)\|_1$, is calculated and summed to obtain the per-track time-domain loss of that audio track. The per-track time-domain losses of the first number of audio tracks are then summed to obtain the time-domain loss $\mathcal{L}_t$.
Step 243, calculating the frequency-domain loss $\mathcal{L}_f$ between the first number of predicted track signals $\hat{y}_j(t)$ and the original track signals $y_j(t)$ of the corresponding audio tracks.
Illustratively, for one audio track of the first number of audio tracks, the amplitude loss $\mathcal{L}_{mag}$ and the logarithmic loss $\mathcal{L}_{log}$ between the predicted track signal and the original track signal in the frequency domain are calculated, and their sum is taken as the per-track frequency-domain loss of that audio track. The per-track frequency-domain losses of the first number of audio tracks are then summed to obtain the frequency-domain loss $\mathcal{L}_f$.
The per-track frequency-domain loss of one audio track may be calculated as follows: perform short-time Fourier transform on the predicted track signal according to at least one group of resolutions to obtain at least one group of predicted frequency-domain signals, where one group of resolutions comprises a window length and an offset step; perform short-time Fourier transform on the original track signal according to the at least one group of resolutions to obtain at least one group of frequency-domain signals; for one group of resolutions, calculate the amplitude loss and the logarithmic loss between the predicted frequency-domain signal and the frequency-domain signal, and determine their sum as the resolution loss corresponding to that group of resolutions; finally, sum the resolution losses over the at least one group of resolutions to obtain the per-track frequency-domain loss of the audio track.
The amplitude loss may be calculated as follows: calculate the difference between the amplitude absolute values of the frequency-domain signal and the predicted frequency-domain signal to obtain the amplitude difference $|STFT(y(t))| - |STFT(\hat{y}(t))|$; calculate the second-order norm of the amplitude difference to obtain a first second-order norm; calculate the second-order norm of the amplitude of the frequency-domain signal to obtain a second second-order norm $\|STFT(y(t))\|_F$; and calculate the ratio of the first second-order norm to the second second-order norm to obtain the amplitude loss $\mathcal{L}_{mag}$.
The logarithmic loss may be calculated as follows: calculate the common logarithm of the amplitude absolute value of the frequency-domain signal to obtain a first logarithm $\log|STFT(y(t))|$; calculate the common logarithm of the amplitude absolute value of the predicted frequency-domain signal to obtain a second logarithm $\log|STFT(\hat{y}(t))|$; and calculate the first-order norm of the difference between the first logarithm and the second logarithm to obtain the logarithmic loss $\mathcal{L}_{log}$.
Step 244, calculating the weighted sum $\mathcal{L}_t + \alpha\mathcal{L}_f$ of the time-domain loss and the frequency-domain loss, and determining the ratio of the weighted sum to the first number $J$ as the loss value $L_{loss}$.
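The loss of steps 242 to 244 can be sketched as follows (PyTorch is assumed; the resolution set and the default scale factor α are illustrative assumptions, since the patent leaves them as preset values):

```python
import torch

def track_split_loss(pred, target, alpha=1.0,
                     resolutions=((2048, 512), (1024, 256), (512, 128))):
    """pred, target: (batch, J, samples) predicted / original track signals.
    resolutions: (window length, offset step) pairs, one per STFT resolution."""
    J = pred.shape[1]
    eps = 1e-8
    # Step 242: time-domain loss, i.e. the summed first-order norms over the J tracks.
    l_time = torch.sum(torch.abs(target - pred))
    # Step 243: multi-resolution frequency-domain loss over the J tracks.
    l_freq = 0.0
    for j in range(J):
        for n_fft, hop in resolutions:
            win = torch.hann_window(n_fft)
            Y = torch.stft(target[:, j], n_fft, hop, window=win, return_complex=True)
            Y_hat = torch.stft(pred[:, j], n_fft, hop, window=win, return_complex=True)
            mag, mag_hat = Y.abs(), Y_hat.abs()
            # Amplitude loss: ratio of Frobenius norms of magnitude difference.
            l_mag = torch.linalg.norm(mag - mag_hat) / (torch.linalg.norm(mag) + eps)
            # Logarithmic loss: first-order norm of common-log magnitude difference.
            l_log = torch.sum(torch.abs(torch.log10(mag + eps) - torch.log10(mag_hat + eps)))
            l_freq = l_freq + l_mag + l_log
    # Step 244: weighted sum divided by the first number J.
    return (l_time + alpha * l_freq) / J
```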
Step 245, training the audio track-dividing model based on the loss value.
For example, as shown in fig. 10, after the audio 301 to be processed is input into the audio track-dividing model 302, four predicted track signals of four audio tracks are obtained; loss values are then calculated against the labels, and the audio track-dividing model is trained based on these loss values.
In summary, in the method provided in this embodiment, the audio track-dividing model performs track separation on the audio to be processed to obtain predicted track signals respectively corresponding to at least one audio track. Whether the audio to be processed is valid data is determined according to the energy of the predicted track signals, and the valid data is labeled manually. According to the manual labeling result, the audio to be processed is used as a training sample to train the audio track-dividing model, which improves the separation accuracy of the model, screens valid data more accurately, and forms a virtuous circle of track data labeling.
It should be noted that the order of the steps of the method provided in the embodiments of the present application may be appropriately adjusted, and steps may be added or removed as appropriate. Any variation readily conceivable by a person skilled in the art within the technical scope of the present disclosure shall fall within the protection scope of the present disclosure, and is therefore not described further.
Fig. 11 is a schematic structural diagram of an audio track data labeling apparatus according to an exemplary embodiment of the present application. The device comprises:
the separation module 601 is configured to invoke an audio track separation model to separate a predicted track signal corresponding to at least one audio track from audio to be processed, where the audio to be processed is composed of at least one original track signal, and the audio track corresponding to the at least one original track signal is unknown;
A decision module 602, configured to determine the audio to be processed as valid audio if the energy of the at least one predicted track signal meets a threshold condition;
The labeling module 603 is configured to obtain a track labeling result of the audio to be processed when it is determined that the audio to be processed is valid based on the energy of the predicted track signal corresponding to the at least one audio track, where the track labeling result includes an audio track corresponding to the at least one original track signal;
And the training module 604 is used for training the audio track dividing model based on the audio to be processed and the track labeling result.
In an alternative embodiment, the at least one predicted track signal includes a first predicted track signal corresponding to a first audio track separated by the audio track-splitting model;
the determining module 602 is configured to perform one of the following:
Determining the audio to be processed as the valid data under the condition that the energy of the first predicted track signal meets a first threshold condition corresponding to the first audio track;
And under the condition that the energy of the at least one predicted track signal meets the threshold condition corresponding to the respective audio track, determining the audio to be processed as the effective data.
In an alternative embodiment, the decision module 602 is configured to calculate a first energy value of the first predicted track signal and calculate a sum of energy values of the at least one predicted track signal;
The determining module 602 is configured to calculate a ratio of the first energy value to a sum of the energy values;
The determining module 602 is configured to determine that the energy of the first predicted track signal meets the threshold condition corresponding to the first audio track if the ratio is not less than a first threshold corresponding to the first audio track.
In an alternative embodiment, the audio track-dividing model includes a first network corresponding to a first audio track, where the first network is configured to separate a predicted track signal of the first audio track from the input audio to be processed;
The track labeling result comprises that a first original track signal in the audio to be processed corresponds to the first audio track;
the training module 604 is configured to train the audio track-dividing model by using the audio to be processed as sample input data of the first network and the first original track signal as a sample tag of the first network.
In an alternative embodiment, the training module 604 is configured to input the audio to be processed into the audio track-dividing model to obtain a first number of predicted track signals;
The training module 604 is configured to calculate a time domain loss of the first number of predicted track signals and the original track signals of the corresponding audio tracks, and to calculate a frequency domain loss of the first number of predicted track signals and the original track signals of the corresponding audio tracks;
The training module 604 is configured to calculate a weighted sum of the time domain loss and the frequency domain loss, and determine the ratio of the weighted sum to the first number as a loss value;
The training module 604 is configured to train the audio track-dividing model based on the loss value.
In an alternative embodiment, the training module 604 is configured to calculate, for one audio track of the first number of audio tracks, the sum of the first-order norms of the amplitude differences between the predicted track signal and the original track signal in the time domain, to obtain the per-track time domain loss of that audio track;
The training module 604 is configured to calculate the sum of the per-track time domain losses of the first number of audio tracks, to obtain the time domain loss.
In an alternative embodiment, the training module 604 is configured to calculate, for one audio track of the first number of audio tracks, the amplitude loss and the logarithmic loss between the predicted track signal and the original track signal in the frequency domain, and calculate the sum of the amplitude loss and the logarithmic loss to obtain the per-track frequency domain loss of that audio track;
The training module 604 is configured to calculate the sum of the per-track frequency domain losses of the first number of audio tracks, to obtain the frequency domain loss.
In an alternative embodiment, the training module 604 is configured to perform short-time Fourier transform on the predicted track signal according to at least one set of resolutions, to obtain at least one set of predicted frequency domain signals, where a set of resolutions in the at least one set of resolutions comprises a window length and an offset step;
the training module 604 is configured to perform short-time fourier transform on the original track signal according to the at least one set of resolutions, to obtain at least one set of frequency domain signals;
the training module 604 is configured to calculate, for a set of resolutions of the at least one set of resolutions, an amplitude loss and a log loss of the predicted frequency domain signal and the frequency domain signal, and determine a sum of the amplitude loss and the log loss as a resolution loss corresponding to the set of resolutions;
the training module 604 is configured to calculate the sum of the resolution losses of the at least one set of resolutions, to obtain the per-track frequency domain loss of one audio track.
In an alternative embodiment, the training module 604 is configured to calculate a difference between the absolute values of the magnitudes of the frequency-domain signal and the predicted frequency-domain signal, to obtain a magnitude difference;
The training module 604 is configured to calculate the second-order norm of the amplitude difference to obtain a first second-order norm, and to calculate the second-order norm of the amplitude absolute value of the frequency domain signal to obtain a second second-order norm;
the training module 604 is configured to calculate the ratio of the first second-order norm to the second second-order norm, to obtain the amplitude loss.
In an alternative embodiment, the training module 604 is configured to calculate a common logarithm of the absolute value of the amplitude of the frequency-domain signal to obtain a first logarithm, and calculate a common logarithm of the absolute value of the amplitude of the predicted frequency-domain signal to obtain a second logarithm;
the training module 604 is configured to calculate a first-order norm of a difference between the first logarithm and the second logarithm, so as to obtain the logarithm loss.
It should be noted that, the audio track data labeling device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the labeling device for audio track data provided in the above embodiment and the labeling method embodiment for audio track data belong to the same concept, and detailed implementation processes of the labeling device for audio track data are detailed in the method embodiment, and are not repeated here.
The embodiment of the application also provides computer equipment, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the method for marking the audio track data provided by the method embodiments. The computer device may be implemented as a terminal device.
Illustratively, fig. 12 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
In general, computer device 1700 includes a processor 1701 and a memory 1702.
The processor 1701 may include one or more processing cores, such as a 4-core or an 8-core processor. The processor 1701 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1701 may also include a main processor and a coprocessor; the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in the wake-up state, and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1701 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1701 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1702 may include one or more computer-readable storage media, which may be non-transitory. Memory 1702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1702 is used to store at least one instruction for execution by processor 1701 to implement the method of labeling audio track data provided by an embodiment of the method of the present application.
In some embodiments, computer device 1700 also optionally includes a peripheral interface 1703 and at least one peripheral device. The processor 1701, memory 1702, and peripheral interface 1703 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 1703 by buses, signal lines or a circuit board. Specifically, the peripheral devices include at least one of radio frequency circuitry 1704, a display screen 1705, a camera assembly 1706, audio circuitry 1707, and a power source 1708.
The peripheral interface 1703 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 1701 and the memory 1702. In some embodiments, the processor 1701, the memory 1702, and the peripheral interface 1703 are integrated on the same chip or circuit board, and in some other embodiments, either or both of the processor 1701, the memory 1702, and the peripheral interface 1703 may be implemented on separate chips or circuit boards, as the embodiments of the present application are not limited in this respect.
The radio frequency circuit 1704 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1704 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1704 may communicate with other computer devices through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1704 may also include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 1705 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1705 is a touch display, it also has the ability to collect touch signals on or above its surface; a touch signal may be input to the processor 1701 as a control signal for processing, and the display screen 1705 may then also provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1705, arranged on the front panel of the computer device 1700; in other embodiments, there may be at least two display screens 1705, arranged on different surfaces of the computer device 1700 or in a folded design; in still other embodiments, the display screen 1705 may be a flexible display arranged on a curved or folded surface of the computer device 1700. The display screen 1705 may even be arranged in a non-rectangular irregular pattern, i.e., a shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1706 is used to capture images or video. Optionally, the camera assembly 1706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the computer device 1700 and the rear camera is disposed on the back of the computer device. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera can be fused to realize a background blurring function, or the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1707 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input the electrical signals to the processor 1701 for processing, or to the radio frequency circuit 1704 for voice communication. For purposes of stereo acquisition or noise reduction, there may be multiple microphones, each disposed at a different location of the computer device 1700. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1701 or the radio frequency circuit 1704 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, the electrical signal can be converted not only into sound waves audible to humans, but also into sound waves inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1707 may also include a headphone jack.
The power supply 1708 is used to power the various components in the computer device 1700. The power supply 1708 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1708 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, the computer device 1700 also includes one or more sensors 1709. The one or more sensors 1709 include, but are not limited to, an acceleration sensor 1710, a gyroscope sensor 1711, a pressure sensor 1712, an optical sensor 1713, and a proximity sensor 1714.
The acceleration sensor 1710 can detect the magnitudes of acceleration on the three coordinate axes of a coordinate system established with respect to the computer device 1700. For example, the acceleration sensor 1710 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 1701 may control the touch display screen 1705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1710. The acceleration sensor 1710 may also be used to collect game or user motion data.
The gyroscope sensor 1711 may detect the body orientation and rotation angle of the computer device 1700, and may cooperate with the acceleration sensor 1710 to collect the user's 3D actions on the computer device 1700. Based on the data collected by the gyroscope sensor 1711, the processor 1701 can realize functions such as motion sensing (e.g., changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1712 may be disposed on a side frame of the computer device 1700 and/or on an underlying layer of the touch display screen 1705. When the pressure sensor 1712 is disposed on a side frame of the computer device 1700, a user's grip signal on the computer device 1700 may be detected, and the processor 1701 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1712. When the pressure sensor 1712 is disposed on the underlying layer of the touch display screen 1705, the processor 1701 controls the operability controls on the UI according to the user's pressure operation on the touch display screen 1705. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1713 is used to collect ambient light intensity. In one embodiment, the processor 1701 may control the display brightness of the touch display 1705 based on the ambient light intensity collected by the optical sensor 1713. Specifically, the display brightness of the touch display screen 1705 is turned up when the ambient light intensity is high, and the display brightness of the touch display screen 1705 is turned down when the ambient light intensity is low. In another embodiment, the processor 1701 may also dynamically adjust the shooting parameters of the camera assembly 1706 based on the ambient light intensity collected by the optical sensor 1713.
A proximity sensor 1714, also referred to as a distance sensor, is typically provided on the front panel of the computer device 1700. The proximity sensor 1714 is used to collect the distance between the user and the front of the computer device 1700. In one embodiment, the processor 1701 controls the touch display 1705 to switch from the on-screen state to the off-screen state when the proximity sensor 1714 detects a gradual decrease in the distance between the user and the front of the computer device 1700, and the processor 1701 controls the touch display 1705 to switch from the off-screen state to the on-screen state when the proximity sensor 1714 detects a gradual increase in the distance between the user and the front of the computer device 1700.
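As a hedged illustration of the proximity-driven screen-state logic described above, a minimal Python sketch follows; the callback shape and the set_screen helper are hypothetical conveniences for the example, not part of any platform API named in this document.

```python
def on_proximity_change(prev_distance: float, new_distance: float, set_screen) -> None:
    """Toggle the display as the user approaches or moves away from the device front."""
    if new_distance < prev_distance:
        set_screen(on=False)  # distance shrinking: switch to the screen-off state
    elif new_distance > prev_distance:
        set_screen(on=True)   # distance growing: switch back to the screen-on state
```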
Those skilled in the art will appreciate that the structure shown in FIG. 12 does not constitute a limitation on the computer device 1700, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
The embodiment of the application also provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or an instruction set is stored. When the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor of a computer device, the method for labeling audio track data provided by the above method embodiments is implemented.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method for labeling the audio track data provided by the above method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above readable storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description covers only preferred embodiments of the application and is not intended to limit the application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall be included within the scope of the application.
Claims (13)
1. A method for labeling audio track data, the method comprising:
calling an audio track dividing model to separate at least one predicted track signal corresponding to an audio track from audio to be processed;
determining the audio to be processed as valid audio in the case that the energy of the at least one predicted track signal meets a threshold condition;
when the audio to be processed is determined to be valid based on the energy of the predicted track signal corresponding to the at least one audio track, acquiring a track labeling result of the audio to be processed;
training the audio track dividing model based on the audio to be processed and the track labeling result;
the at least one predicted track signal comprises a first predicted track signal corresponding to a first audio track separated by the audio track dividing model; and
the determining the audio to be processed as valid audio in the case that the energy of the at least one predicted track signal meets a threshold condition comprises one of:
determining the audio to be processed as the valid audio in the case that the energy of the first predicted track signal meets a first threshold condition corresponding to the first audio track; and
determining the audio to be processed as the valid audio in the case that the energy of each of the at least one predicted track signal meets the threshold condition corresponding to its respective audio track.
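As an illustrative, non-authoritative sketch of the two decision variants in claim 1 (Python with NumPy; the dict layout, the meets_threshold callable, and the reliance on insertion order to pick the "first" track are assumptions for the example, not taken from the claims):

```python
import numpy as np

def signal_energy(signal: np.ndarray) -> float:
    """Energy of a track signal: the sum of squared samples."""
    return float(np.sum(signal.astype(np.float64) ** 2))

def is_valid_audio(predicted_tracks: dict, meets_threshold, check_all: bool) -> bool:
    """predicted_tracks maps track name -> 1-D signal from the track dividing model.
    meets_threshold(name, predicted_tracks) tests one track's threshold condition.
    check_all=False implements the first variant (only the first audio track is
    tested); check_all=True implements the second (every track must pass)."""
    names = list(predicted_tracks)
    if check_all:
        return all(meets_threshold(n, predicted_tracks) for n in names)
    return meets_threshold(names[0], predicted_tracks)
```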
2. The method of claim 1, wherein determining that the energy of the first predicted track signal meets the first threshold condition corresponding to the first audio track comprises:
calculating a first energy value of the first predicted track signal, and calculating a sum of the energy values of the at least one predicted track signal;
calculating a ratio of the first energy value to the sum of the energy values; and
determining, in the case that the ratio is not smaller than a first threshold corresponding to the first audio track, that the energy of the first predicted track signal meets the first threshold condition corresponding to the first audio track.
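A matching sketch of claim 2's ratio test, reusing signal_energy from the previous snippet; the 0.1 default threshold is an illustrative value only, as the claim does not fix the per-track thresholds:

```python
def meets_ratio_threshold(name: str, predicted_tracks: dict, threshold: float = 0.1) -> bool:
    """True if the track's energy share of the total is not smaller than its threshold."""
    first_energy = signal_energy(predicted_tracks[name])  # first energy value
    energy_sum = sum(signal_energy(s) for s in predicted_tracks.values())
    if energy_sum == 0.0:
        return False  # silent audio cannot meet any energy ratio threshold
    return first_energy / energy_sum >= threshold
```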
3. The method according to claim 1 or 2, wherein the audio track dividing model comprises a first network corresponding to the first audio track, the first network being configured to separate the predicted track signal of the first audio track from the input audio to be processed;
the track labeling result comprises an indication that a first original track signal in the audio to be processed corresponds to the first audio track; and
the training of the audio track dividing model based on the audio to be processed and the track labeling result comprises:
taking the audio to be processed as sample input data of the first network, and taking the first original track signal as a sample label of the first network, to train the audio track dividing model.
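Claim 3 turns each validated audio/label pair into supervised training data. A minimal sketch under the assumption of a PyTorch-style network; the L1 placeholder loss here stands in for the full loss defined in claims 4 to 9:

```python
import torch
import torch.nn.functional as F

def training_step(first_network, audio_to_process, first_original_track, optimizer):
    """One supervised step: the mixture is the sample input, and the labeled
    first original track signal is the sample label."""
    optimizer.zero_grad()
    predicted_track = first_network(audio_to_process)  # separate the first audio track
    loss = F.l1_loss(predicted_track, first_original_track)
    loss.backward()
    optimizer.step()
    return loss.item()
```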
4. The method according to claim 1 or 2, wherein the training of the audio track dividing model based on the audio to be processed and the track labeling result comprises:
inputting the audio to be processed into the audio track dividing model to obtain a first number of predicted track signals;
calculating a time domain loss and a frequency domain loss of the first number of predicted track signals with respect to the original track signals of the corresponding audio tracks;
calculating a weighted sum of the time domain loss and the frequency domain loss, and determining a ratio of the weighted sum to the first number as a loss value; and
training the audio track dividing model based on the loss value.
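Read together, claims 4 to 9 yield one loss value per training example. A sketch of the top-level combination, assuming PyTorch tensors and deferring to the helpers sketched after claims 5 to 9 below; the freq_weight coefficient is an assumption, since the claim says only "weighted sum" without fixing the weights:

```python
def combined_loss(predicted, originals, freq_weight: float = 1.0):
    """Claim 4: loss value = weighted sum of time- and frequency-domain losses,
    divided by the first number N of audio tracks."""
    n = len(predicted)
    t_loss = time_domain_loss(predicted, originals)       # claim 5
    f_loss = frequency_domain_loss(predicted, originals)  # claims 6-9
    return (t_loss + freq_weight * f_loss) / n
```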
5. The method of claim 4, wherein said calculating a time domain loss of the first number of predicted track signals with respect to the original track signals of the corresponding audio tracks comprises:
calculating, for one audio track of the first number of audio tracks, the sum of the first-order norms of the amplitude differences between the predicted track signal and the original track signal in the time domain, to obtain the per-track time domain loss of the one audio track; and
calculating the sum of the per-track time domain losses of the first number of audio tracks to obtain the time domain loss.
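A sketch of claim 5's time-domain term (PyTorch; the signals are assumed to be 1-D tensors of equal length, one per track):

```python
def time_domain_loss(predicted, originals):
    """Sum over tracks of the first-order (L1) norm of the time-domain
    amplitude difference between predicted and original track signals."""
    return sum(torch.sum(torch.abs(p - o)) for p, o in zip(predicted, originals))
```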
6. The method of claim 4, wherein said calculating a frequency domain loss of the first number of predicted track signals with respect to the original track signals of the corresponding audio tracks comprises:
calculating, for one audio track of the first number of audio tracks, the amplitude loss and the logarithmic loss of the predicted track signal and the original track signal in the frequency domain, and calculating the sum of the amplitude loss and the logarithmic loss, to obtain the per-track frequency domain loss of the one audio track; and
calculating the sum of the per-track frequency domain losses of the first number of audio tracks to obtain the frequency domain loss.
7. The method of claim 6, wherein said calculating the amplitude loss and the logarithmic loss of the predicted track signal and the original track signal in the frequency domain, and calculating the sum of the amplitude loss and the logarithmic loss, to obtain the per-track frequency domain loss of one audio track, comprises:
performing a short-time Fourier transform on the predicted track signal according to at least one group of resolutions to obtain at least one group of predicted frequency domain signals, wherein one group of resolutions of the at least one group of resolutions comprises a window length and an offset step length;
performing a short-time Fourier transform on the original track signal according to the at least one group of resolutions to obtain at least one group of frequency domain signals;
calculating, for one group of resolutions of the at least one group of resolutions, the amplitude loss and the logarithmic loss of the predicted frequency domain signal and the frequency domain signal, and determining the sum of the amplitude loss and the logarithmic loss as the resolution loss corresponding to the one group of resolutions; and
calculating the sum of the resolution losses of the at least one group of resolutions to obtain the per-track frequency domain loss of the one audio track.
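A sketch of claims 6 and 7 together: per-track losses are computed at several STFT resolutions and summed. The (window length, offset step) pairs and the Hann window are assumptions for the example; the claims leave the resolutions open. amplitude_loss and log_loss are sketched after claims 8 and 9 below:

```python
def per_track_frequency_loss(pred, orig,
                             resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Claim 7: sum over resolutions of amplitude loss + logarithmic loss."""
    total = 0.0
    for win, hop in resolutions:
        window = torch.hann_window(win)
        p_mag = torch.stft(pred, n_fft=win, hop_length=hop, window=window,
                           return_complex=True).abs()  # predicted frequency domain signal
        o_mag = torch.stft(orig, n_fft=win, hop_length=hop, window=window,
                           return_complex=True).abs()  # original frequency domain signal
        total = total + amplitude_loss(p_mag, o_mag) + log_loss(p_mag, o_mag)
    return total

def frequency_domain_loss(predicted, originals):
    """Claim 6: sum of the per-track frequency domain losses."""
    return sum(per_track_frequency_loss(p, o) for p, o in zip(predicted, originals))
```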
8. The method of claim 7, wherein said calculating the amplitude loss of the predicted frequency domain signal and the frequency domain signal comprises:
calculating the difference between the amplitude absolute values of the frequency domain signal and the predicted frequency domain signal to obtain an amplitude difference value;
calculating a second-order norm of the amplitude difference value to obtain a first norm;
calculating a second-order norm of the amplitude absolute value of the frequency domain signal to obtain a second norm; and
calculating the ratio of the first norm to the second norm to obtain the amplitude loss.
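A sketch of claim 8's amplitude loss, which has the shape of a spectral-convergence term; the epsilon guard is an assumption added for numerical safety and is not part of the claim:

```python
def amplitude_loss(p_mag, o_mag, eps: float = 1e-8):
    """Claim 8: second-order norm of the magnitude difference (first norm)
    divided by the second-order norm of the original magnitudes (second norm)."""
    return torch.linalg.norm(o_mag - p_mag) / (torch.linalg.norm(o_mag) + eps)
```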
9. The method of claim 7, wherein said calculating the logarithmic loss of the predicted frequency domain signal and the frequency domain signal comprises:
calculating the common logarithm of the amplitude absolute value of the predicted frequency domain signal to obtain a first logarithm;
calculating the common logarithm of the amplitude absolute value of the frequency domain signal to obtain a second logarithm; and
calculating a first-order norm of the difference between the first logarithm and the second logarithm to obtain the logarithmic loss.
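A sketch of claim 9's logarithmic loss; "common logarithm" is read as base 10, and the epsilon guard is again an assumption added for numerical safety:

```python
def log_loss(p_mag, o_mag, eps: float = 1e-8):
    """Claim 9: first-order (L1) norm of the difference between the base-10
    log magnitudes of the predicted and original frequency domain signals."""
    return torch.sum(torch.abs(torch.log10(p_mag + eps) - torch.log10(o_mag + eps)))
```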
10. An apparatus for labeling audio track data, the apparatus comprising:
The separation module is used for calling the audio track dividing model to separate at least one predicted track signal corresponding to the audio track from the audio to be processed;
the decision module is used for determining the audio to be processed as valid audio in the case that the energy of the at least one predicted track signal meets a threshold condition;
the labeling module is used for acquiring a track labeling result of the audio to be processed when the audio to be processed is determined to be valid based on the energy of the predicted track signal corresponding to the at least one audio track;
the training module is used for training the audio track dividing model based on the audio to be processed and the track labeling result;
the at least one predicted track signal comprises a first predicted track signal corresponding to a first audio track separated by the audio track dividing model; and
the determining the audio to be processed as valid audio in the case that the energy of the at least one predicted track signal meets a threshold condition comprises one of:
determining the audio to be processed as the valid audio in the case that the energy of the first predicted track signal meets a first threshold condition corresponding to the first audio track; and
determining the audio to be processed as the valid audio in the case that the energy of each of the at least one predicted track signal meets the threshold condition corresponding to its respective audio track.
11. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the method for labeling audio track data according to any one of claims 1 to 9.
12. A computer readable storage medium, wherein at least one program is stored in the readable storage medium, and the at least one program is loaded and executed by a processor to implement the method for labeling audio track data according to any one of claims 1 to 9.
13. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the method of labeling audio track data according to any of claims 1 to 9.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411809979.7A CN119694334B (en) | 2024-12-10 | 2024-12-10 | Method, device, equipment and storage medium for labeling audio track data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119694334A CN119694334A (en) | 2025-03-25 |
| CN119694334B (en) | 2025-10-03 |
Family
ID: 95028790
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411809979.7A Active CN119694334B (en) | 2024-12-10 | 2024-12-10 | Method, device, equipment and storage medium for labeling audio track data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119694334B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111739542A (en) * | 2020-05-13 | 2020-10-02 | 深圳市微纳感知计算技术有限公司 | Method, device and equipment for detecting characteristic sound |
| CN112863490A (en) * | 2021-01-07 | 2021-05-28 | 广州欢城文化传媒有限公司 | Corpus acquisition method and apparatus |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8767975B2 (en) * | 2007-06-21 | 2014-07-01 | Bose Corporation | Sound discrimination method and apparatus |
| US9972322B2 (en) * | 2016-03-29 | 2018-05-15 | Intel Corporation | Speaker recognition using adaptive thresholding |
| KR102658473B1 (en) * | 2021-03-17 | 2024-04-18 | 한국전자통신연구원 | METHOD AND APPARATUS FOR Label encoding in polyphonic sound event intervals |
| CN117012223A (en) * | 2022-04-29 | 2023-11-07 | 哲库科技(上海)有限公司 | Audio separation method, training method, device, equipment, storage medium and product |
| CN118942464A (en) * | 2023-05-09 | 2024-11-12 | 安克创新科技股份有限公司 | Audio recognition model training method and audio equipment |
| CN117995140B (en) * | 2023-12-29 | 2025-03-18 | 北京建筑大学 | A method and device for automatic notation based on sound source separation |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119694334A (en) | 2025-03-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111986691B (en) | Audio processing method, device, computer equipment and storage medium | |
| CN110544272B (en) | Face tracking method, device, computer equipment and storage medium | |
| CN109994127B (en) | Audio detection method and device, electronic equipment and storage medium | |
| CN111179962A (en) | Training method of speech separation model, speech separation method and device | |
| CN108538311A (en) | Audio frequency classification method, device and computer readable storage medium | |
| CN113763933B (en) | Speech recognition method, training method, device and equipment of speech recognition model | |
| CN113744736B (en) | Command word recognition method and device, electronic equipment and storage medium | |
| CN110555102A (en) | media title recognition method, device and storage medium | |
| CN108922531B (en) | Slot position identification method and device, electronic equipment and storage medium | |
| CN109961802B (en) | Sound quality comparison method, device, electronic equipment and storage medium | |
| CN115206305A (en) | Semantic text generation method and device, electronic equipment and storage medium | |
| CN112133319B (en) | Audio generation method, device, equipment and storage medium | |
| CN114299306B (en) | Method for obtaining image retrieval model, image retrieval method, device and equipment | |
| CN113343709B (en) | Method for training intention recognition model, method, device and equipment for intention recognition | |
| CN111125424B (en) | Method, device, equipment and storage medium for extracting core lyrics of song | |
| CN119694334B (en) | Method, device, equipment and storage medium for labeling audio track data | |
| CN113763927B (en) | Speech recognition method, device, computer equipment and readable storage medium | |
| CN117591746A (en) | Resource recommendation method, device, computer equipment and storage medium | |
| CN113361376B (en) | Method and device for acquiring video cover, computer equipment and readable storage medium | |
| CN112487162B (en) | Method, device, equipment and storage medium for determining text semantic information | |
| CN111414496B (en) | Artificial intelligence-based multimedia file detection method and device | |
| CN119694303B (en) | Cover song recognition model training method and cover song recognition method, equipment, and medium | |
| CN112750449A (en) | Echo cancellation method, device, terminal, server and storage medium | |
| CN115019834B (en) | Voice endpoint detection method, device, electronic device, storage medium and product | |
| CN114691920B (en) | Intent information determination method, device, equipment, storage medium and program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |