CN115547308B - Audio recognition model training method, audio recognition method, device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115547308B (application number CN202211067740.8A)
- Authority
- CN
- China
- Prior art keywords
- audio
- network
- feature information
- target
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to an audio recognition model training method, an audio recognition method, an apparatus, an electronic device and a storage medium. The method comprises the following steps: determining target audio feature information; performing first data enhancement processing and second data enhancement processing on the target audio feature information respectively to obtain first audio feature information and second audio feature information; performing audio recognition training on a first original network and a second original network based on the first audio feature information and the second audio feature information to obtain a first target network and a second target network; and determining an audio recognition model based on a first coding layer in the first target network or a second coding layer in the second target network. Because the first original network and the second original network are trained with the first audio feature information and the second audio feature information obtained through data enhancement processing, no label data is needed, which reduces training cost.
Description
Technical Field
The disclosure relates to the technical field of the internet, and in particular to an audio recognition model training method, an audio recognition method, an apparatus, an electronic device and a storage medium.
Background
Sound carries a vast amount of information that plays an important role in our daily lives. In everyday life we constantly receive various sounds and use them to determine where we are (a subway, a street, etc.) and what is happening (an alarm, a barking dog, etc.).
With the rapid development of artificial intelligence, computers can now make some judgments even more accurately than humans, and computer hearing (machine hearing) has become a popular and promising research area. In fields such as the Internet of Things and mobile navigation devices, audio classification and audio event detection can support perceptual computing and provide better responses to users when visual information is ambiguous. However, most existing work focuses on supervised or semi-supervised learning, which requires label information for the data; labeled data, especially in the audio field, has a high annotation cost, making it difficult to acquire and thereby increasing the overall cost of such work.
Disclosure of Invention
The disclosure provides an audio recognition model training method, an audio recognition method, an apparatus, an electronic device and a storage medium. The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an audio recognition model training method, including:
Determining target audio feature information;
Respectively carrying out first data enhancement processing and second data enhancement processing on the target audio feature information to obtain first audio feature information and second audio feature information;
Respectively performing audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference;
An audio recognition model is determined based on the first encoding layer in the first target network or the second encoding layer in the second target network.
In some possible embodiments, performing audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information to obtain a first target network and a second target network, respectively, including:
performing audio identification processing on the first audio feature information through a first original network to obtain first audio output data;
Performing audio identification processing on the second audio characteristic information through a second original network to obtain second audio output data; wherein the data dimensions of the first audio output data and the second audio output data are the same;
determining audio similarity data based on the first audio output data and the second audio output data;
Training the first original network and the second original network based on the audio similarity data;
and under the condition that the iteration termination condition is met, obtaining a first target network and a second target network.
In some possible embodiments, the first original network and the second original network are trained based on audio similarity data; under the condition that the iteration termination condition is met, a first target network and a second target network are obtained, and the method comprises the following steps:
updating the first network parameters of the first original network based on the audio similarity data to obtain updated first network parameters and updated first original network;
Updating the second network parameters of the second original network based on the updated first network parameters to obtain updated second network parameters and updated second original network;
Training the first original network and the second original network in a circulating way until the iteration termination condition is met;
and determining the trained first original network as a first target network, and determining the trained second original network as a second target network.
In some possible embodiments, updating the second network parameter of the second original network based on the updated first network parameter, resulting in an updated second network parameter and an updated second original network, comprising:
Acquiring a second network parameter and a moving average parameter of a second original network;
Determining an updated second network parameter based on the updated first network parameter, the second network parameter, and the moving average parameter;
And updating the second original network based on the updated second network parameters to obtain an updated second original network.
In some possible embodiments, determining the target audio feature information includes:
Acquiring original audio;
cutting out the original audio to obtain target audio;
and carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on the target audio to obtain target audio feature information.
In some possible embodiments, performing a first data enhancement process and a second data enhancement process on the target audio feature information to obtain first audio feature information and second audio feature information, including:
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain first audio feature information;
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain second audio feature information; the first audio feature information and the second audio feature information are not identical.
In some possible embodiments, after determining the audio recognition model based on the first coding layer in the first target network, further comprising:
acquiring an audio style dataset; the audio style data set comprises first audio clips corresponding to N audio styles; wherein N is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each first audio fragment in the audio style data set to obtain third audio feature information corresponding to each first audio fragment;
inputting third audio characteristic information corresponding to each first audio fragment into an audio recognition model to obtain first coding characteristic information corresponding to each first audio fragment;
classifying the audio style data set into a plurality of first audio fragment sets based on the first coding feature information corresponding to each first audio fragment; each first audio clip set includes at least one first audio clip in the audio style data set;
when the number of the first audio fragment sets satisfies N, it is determined that the audio recognition model verification is successful.
In some possible embodiments, after determining the audio recognition model based on the first coding layer in the first target network, further comprising:
acquiring an audio scene data set; the audio scene data set comprises second audio clips corresponding to the M audio scenes; wherein M is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each second audio fragment in the audio scene data set to obtain fourth audio feature information corresponding to each second audio fragment;
inputting fourth audio characteristic information corresponding to each second audio fragment into an audio recognition model to obtain second coding characteristic information corresponding to each second audio fragment;
Classifying the audio scene data set into a plurality of second audio fragment sets based on the second coding feature information corresponding to each second audio fragment; each second set of audio segments comprises at least one second audio segment in the acoustic scene data set;
when the number of the second audio fragment sets satisfies M, it is determined that the audio recognition model verification is successful.
In some possible embodiments, the first encoding layer and the second encoding layer are each a 38-layer residual network.
According to a second aspect of embodiments of the present disclosure, there is provided an audio recognition method, including:
Acquiring audio to be identified;
Inputting the audio to be identified into an audio identification model obtained by training according to an audio identification model training method to obtain coding characteristic information of the audio to be identified;
and determining style information and/or scene information of the audio to be identified based on the coding feature information of the audio to be identified.
According to a third aspect of embodiments of the present disclosure, there is provided an audio recognition model training apparatus, including:
A first information determination module configured to perform determination of target audio feature information;
the second information determining module is configured to execute first data enhancement processing and second data enhancement processing on the target audio feature information respectively to obtain first audio feature information and second audio feature information;
The network training module is configured to execute audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference;
The recognition model determination module is configured to perform determining an audio recognition model based on the first encoding layer in the first target network or the second encoding layer in the second target network.
In some possible embodiments, the network training module is configured to perform:
performing audio identification processing on the first audio feature information through a first original network to obtain first audio output data;
Performing audio identification processing on the second audio characteristic information through a second original network to obtain second audio output data; wherein the data dimensions of the first audio output data and the second audio output data are the same;
determining audio similarity data based on the first audio output data and the second audio output data;
Training the first original network and the second original network based on the audio similarity data;
and under the condition that the iteration termination condition is met, obtaining a first target network and a second target network.
In some possible embodiments, the network training module is configured to perform:
updating the first network parameters of the first original network based on the audio similarity data to obtain updated first network parameters and updated first original network;
Updating the second network parameters of the second original network based on the updated first network parameters to obtain updated second network parameters and updated second original network;
Training the first original network and the second original network in a circulating way until the iteration termination condition is met;
and determining the trained first original network as a first target network, and determining the trained second original network as a second target network.
In some possible embodiments, the network training module is configured to perform:
Acquiring a second network parameter and a moving average parameter of a second original network;
Determining an updated second network parameter based on the updated first network parameter, the second network parameter, and the moving average parameter;
And updating the second original network based on the updated second network parameters to obtain an updated second original network.
In some possible embodiments, the first information determination module is configured to perform:
Acquiring original audio;
cutting out the original audio to obtain target audio;
and carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on the target audio to obtain target audio feature information.
In some possible embodiments, the second information determination module is configured to perform:
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain first audio feature information;
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain second audio feature information; the first audio feature information and the second audio feature information are not identical.
In some possible embodiments, the apparatus further comprises a first authentication module configured to perform:
acquiring an audio style dataset; the audio style data set comprises first audio clips corresponding to N audio styles; wherein N is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each first audio fragment in the audio style data set to obtain third audio feature information corresponding to each first audio fragment;
inputting third audio characteristic information corresponding to each first audio fragment into an audio recognition model to obtain first coding characteristic information corresponding to each first audio fragment;
classifying the audio style data set into a plurality of first audio fragment sets based on the first coding feature information corresponding to each first audio fragment; each first audio clip set includes at least one first audio clip in the audio style data set;
when the number of the first audio fragment sets satisfies N, it is determined that the audio recognition model verification is successful.
In some possible embodiments, the apparatus further comprises a second authentication module configured to perform:
acquiring an audio scene data set; the audio scene data set comprises second audio clips corresponding to the M audio scenes; wherein M is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each second audio fragment in the audio scene data set to obtain fourth audio feature information corresponding to each second audio fragment;
inputting fourth audio characteristic information corresponding to each second audio fragment into an audio recognition model to obtain second coding characteristic information corresponding to each second audio fragment;
Classifying the audio scene data set into a plurality of second audio fragment sets based on the second coding feature information corresponding to each second audio fragment; each second set of audio segments comprises at least one second audio segment in the acoustic scene data set;
when the number of the second audio fragment sets satisfies M, it is determined that the audio recognition model verification is successful.
In some possible embodiments, the first encoding layer and the second encoding layer are each a 38-layer residual network.
According to a fourth aspect of embodiments of the present disclosure, there is provided an audio recognition apparatus, comprising:
an audio acquisition module configured to perform acquisition of audio to be recognized;
The coding information determining module is configured to input the audio to be recognized into an audio recognition model obtained by training according to the audio recognition model training method, to obtain coding characteristic information of the audio to be recognized;
And the style scene determining module is configured to determine style information and/or scene information of the audio to be identified based on the coding characteristic information of the audio to be identified.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the method as in any of the first or second aspects above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of the first or second aspects of embodiments of the present disclosure.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, the computer program being read from the readable storage medium by at least one processor of the computer device and executed such that the computer device performs the method of any one of the first or second aspects of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Target audio feature information is determined; first data enhancement processing and second data enhancement processing are performed on the target audio feature information respectively to obtain first audio feature information and second audio feature information; audio recognition training is performed on a first original network and a second original network based on the first audio feature information and the second audio feature information respectively to obtain a first target network and a second target network, where the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference; and an audio recognition model is determined based on a first coding layer in the first target network or a second coding layer in the second target network. Because the first original network and the second original network are trained with the first audio feature information and the second audio feature information obtained through data enhancement processing, no label data is needed, which reduces training cost.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment shown in accordance with an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of training an audio recognition model, according to an example embodiment;
FIG. 3 is a flowchart illustrating a method of determining target audio feature information according to an exemplary embodiment;
FIG. 4 is a schematic diagram of the structure of a first original network and a second original network, according to an example embodiment;
FIG. 5 is a flow chart illustrating a network training according to an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating the structure of a first encoding layer according to an exemplary embodiment;
FIG. 7 is a schematic diagram of a residual layer structure including four sub-graphs of (a), (b), (c), and (d), according to an example embodiment;
FIG. 8 is a flowchart illustrating a method of audio recognition, according to an example embodiment;
FIG. 9 is a block diagram of an audio recognition model training apparatus, according to an example embodiment;
FIG. 10 is a block diagram of an audio recognition device, according to an exemplary embodiment;
FIG. 11 is a block diagram of an electronic device for audio recognition model training or audio recognition, according to an example embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the disclosure described herein can be implemented in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment of an audio recognition model training method according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a server 01 and a client 02.
In some possible embodiments, the server 01 may be a stand-alone physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud audio recognition model training, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. Operating systems running on the server may include, but are not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the client 02 described above may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an augmented reality (AR)/virtual reality (VR) device, a smart wearable device, and the like. It may also be software running on such a client, such as an application or applet. Optionally, the operating system running on the client may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the server 01 or the client 02 may determine target audio feature information, perform first data enhancement processing and second data enhancement processing on the target audio feature information to obtain first audio feature information and second audio feature information, perform audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information, respectively, to obtain a first target network and a second target network, where a difference between first audio output data of the first target network and second audio output data of the second target network is less than or equal to a preset difference, and determine an audio recognition model based on a first coding layer in the first target network or a second coding layer in the second target network.
In some possible embodiments, the client 02 and the server 01 may be connected through a wired link, or may be connected through a wireless link.
In an exemplary embodiment, the client, the server and the database corresponding to the server may be node devices in a blockchain system, and may share acquired and generated information with other node devices in the blockchain system, so as to implement information sharing among multiple node devices. The multiple node devices in the blockchain system may be configured with the same blockchain, which consists of a plurality of blocks; adjacent blocks are linked to each other, so that tampering with the data in any block can be detected through the next block, preventing the data in the blockchain from being tampered with and ensuring the security and reliability of the data in the blockchain.
FIG. 2 is a flowchart of an audio recognition model training method according to an exemplary embodiment, and as shown in FIG. 2, the audio recognition model training method may be applied to a server, and may also be applied to other node devices, such as clients, and the method is described below by taking the server as an example, and includes the following steps:
In step S201, target audio feature information is determined.
In the embodiment of the application, the server can determine the target audio characteristics. An embodiment of determining target audio feature information is described below. FIG. 3 is a flowchart illustrating a method of determining target audio feature information, as shown in FIG. 3, according to an exemplary embodiment, including:
In step S301, the original audio is acquired.
In the embodiment of the application, the original audio may be obtained from a real environment, such as a beach environment. Alternatively, the original audio may be synthesized based on a certain scene, for example a beach scene, so that the synthesized original audio has distinct beach scene characteristics.
In an alternative embodiment, the server may obtain the original audio from the client, e.g., may obtain the original audio from an online audio library. Alternatively, the original audio acquired from the client may be acquired by shooting or recording by the client.
In some alternative embodiments, the number of original audio clips may be one or more, for example 1000. Multiple original audio clips may belong to the same scene or to different scenes, where different scenes may include beach scenes, train driving scenes, indoor scenes, traffic jam scenes, and the like.
In step S303, the original audio is cut out to obtain the target audio.
In an alternative embodiment, since the duration of the original audio may vary (for example, it may be 60 seconds or 2 minutes), in order to save computing power as much as possible during model training, the server may detect the duration of the original audio and perform segment interception on any original audio exceeding a preset duration, obtaining target audio whose duration does not exceed the preset duration. Optionally, the preset duration may be 60 seconds.
In another alternative embodiment, the server may directly treat the original audio as the target audio without performing clip interception on the original audio.
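As a rough illustration of the interception step, the following sketch cuts any original audio longer than a preset duration down to a randomly positioned segment of that duration; the random-start policy and the 60-second preset follow the surrounding text, while the sample-rate handling and array layout are assumptions.

```python
import numpy as np

def intercept_segment(original_audio: np.ndarray, sr: int, preset_seconds: float = 60.0) -> np.ndarray:
    """Cut original audio exceeding the preset duration into a target segment."""
    max_samples = int(preset_seconds * sr)
    if original_audio.shape[0] <= max_samples:
        return original_audio                              # already within the preset duration
    start = np.random.randint(0, original_audio.shape[0] - max_samples + 1)
    return original_audio[start:start + max_samples]       # randomly intercepted target audio
```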
In step S305, logarithmic mel feature extraction or mel cepstrum coefficient feature extraction is performed on the target audio to obtain target audio feature information.
In the embodiment of the present application, the target audio at this point is one-dimensional data along a time axis. In order for the target audio to carry more information, the server may perform logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on the target audio to obtain the target audio feature information.
Optionally, the process of extracting mel-frequency cepstral coefficient features of the target audio by the server includes:
in a first step, the server pre-emphasizes the target audio, i.e. passes the target audio through a high-pass filter. The purpose of pre-emphasis is to boost the high-frequency part and flatten the spectrum of the target audio, so that the spectrum can be computed with a comparable signal-to-noise ratio over the whole band from low to high frequencies. It also compensates for the high-frequency components of the speech signal suppressed by the vocal tract, counteracting the effect of the vocal cords and lips during phonation and highlighting the high-frequency formants.
And secondly, the server may frame the filtered target audio to obtain framed target audio. The specific framing process is as follows: the filtered target audio is first divided into observation units, i.e. frames, each of which may consist of a set of L (e.g., 256 or 512) sampling points. To avoid excessive change between two adjacent frames, an overlapping region may exist between them, so the filtered target audio divided into partially overlapping frames is taken as the framed target audio.
And thirdly, the server can carry out windowing processing on the target audio after framing to obtain the windowed target audio. Specifically, the server may multiply each frame of the framed target audio by a hamming window to increase the continuity of the frames to the left and right.
And fourthly, the server performs a fast Fourier transform on the windowed target audio to obtain the spectrum of the target audio. Since the characteristics of a signal are usually difficult to observe in the time domain, the signal is converted into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different audio. The server therefore performs a fast Fourier transform on each frame of the windowed target audio signal to obtain the spectrum of each frame, and takes the squared magnitude of each frame's spectrum to obtain the power spectrum of the target audio.
Fifth, the server passes the power spectrum of the target audio through a Mel-scale triangular filter bank. Specifically, the server may define a filter bank with M filters, where the filter used is a triangular filter, and the filter bank with M filters may smooth the spectrum and eliminate the effect of harmonics, so as to highlight formants of the target audio.
And sixthly, the server computes the logarithmic energy output by each of the M filters in the filter bank, applies a discrete cosine transform to the log energies to obtain L-order mel-frequency cepstral coefficients, and takes the mel-frequency cepstral coefficients as the target audio feature information.
However, discrete cosine transforming the log energy in the sixth step removes the correlation of the target audio, so that the obtained target audio feature information lacks correlation features, and thus the server may perform log mel feature extraction on the target audio to obtain target audio feature information including correlation features.
Optionally, the process of extracting the log mel feature of the target audio by the server includes:
in a first step, the server pre-emphasizes the target audio, i.e. passes the target audio through a high-pass filter. The purpose of pre-emphasis is to boost the high-frequency part and flatten the spectrum of the target audio, so that the spectrum can be computed with a comparable signal-to-noise ratio over the whole band from low to high frequencies. It also compensates for the high-frequency components of the speech signal suppressed by the vocal tract, counteracting the effect of the vocal cords and lips during phonation and highlighting the high-frequency formants.
And secondly, the server may frame the filtered target audio to obtain framed target audio. The specific framing process is as follows: the filtered target audio is first divided into observation units, i.e. frames, each of which may consist of a set of L (e.g., 256 or 512) sampling points. To avoid excessive change between two adjacent frames, an overlapping region may exist between them, so the filtered target audio divided into partially overlapping frames is taken as the framed target audio.
And thirdly, the server can carry out windowing processing on the target audio after framing to obtain the windowed target audio. Specifically, the server may multiply each frame of the framed target audio by a hamming window to increase the continuity of the frames to the left and right.
And fourthly, the server performs a fast Fourier transform on the windowed target audio to obtain the spectrum of the target audio. Since the characteristics of a signal are usually difficult to observe in the time domain, the signal is converted into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different audio. The server therefore performs a fast Fourier transform on each frame of the windowed target audio signal to obtain the spectrum of each frame, and takes the squared magnitude of each frame's spectrum to obtain the power spectrum of the target audio.
Fifth, the server passes the power spectrum of the target audio through a Mel-scale triangular filter bank. Specifically, the server may define a filter bank with M filters, where the filter used is a triangular filter, and the filter bank with M filters may smooth the spectrum and eliminate the effect of harmonics, so as to highlight formants of the target audio.
And sixthly, the server computes the logarithmic energy output by each of the M filters in the filter bank and determines the log energies as the target audio feature information.
Thus, the application can determine the target audio characteristic information of the target audio, wherein the target audio characteristic information can comprise time domain information and frequency domain information, and has more characteristic information compared with the one-dimensional target audio.
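The two feature extraction options described above can be sketched with librosa as follows; this is only an illustrative sketch, and the sample rate, FFT size, hop length, number of mel bands and number of cepstral coefficients are assumed values not specified in the text.

```python
import librosa
import numpy as np

def extract_features(target_audio: np.ndarray, sr: int = 16000, use_mfcc: bool = False) -> np.ndarray:
    """Return log-mel features, or mel cepstral coefficients if use_mfcc is True."""
    # Framing, windowing, FFT and the mel-scale triangular filter bank are handled
    # internally by librosa when computing the mel power spectrogram.
    mel_power = librosa.feature.melspectrogram(
        y=target_audio, sr=sr, n_fft=512, hop_length=256, n_mels=64)
    log_mel = librosa.power_to_db(mel_power)             # logarithmic mel features
    if not use_mfcc:
        return log_mel                                   # shape: (n_mels, n_frames)
    # The discrete cosine transform of the log energies yields the cepstral coefficients.
    return librosa.feature.mfcc(S=log_mel, n_mfcc=20)    # shape: (n_mfcc, n_frames)
```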
In step S203, a first data enhancement process and a second data enhancement process are performed on the target audio feature information, respectively, to obtain first audio feature information and second audio feature information.
In the embodiment of the application, the server can perform first data enhancement processing on the target audio feature information to obtain the first audio feature information, and can perform second data enhancement processing on the target audio feature information to obtain the second audio feature information.
In the embodiment of the present application, since the first original network, into which the first audio feature information is input, and the second original network, into which the second audio feature information is input, have different structures, the first data enhancement process and the second data enhancement process may be the same data enhancement process. However, in order to train the first original network and the second original network better, the first data enhancement process and the second data enhancement process may also be different data enhancement processes.
Optionally, the first data enhancement process may include one of audio data expansion processing (SpecAugment), audio data fusion processing (Mixup), audio data time shift processing (Time Shift Augmentation), and audio data pitch change processing (Pitch Shift Augmentation), or a joint combination of several of these processes.
Alternatively, the second data enhancement process may be no processing, one of the audio data expansion process, the audio data fusion process, the audio data time shift process, and the audio data pitch change process, or a joint combination of several of these processes.
Optionally, the audio data expansion process refers to adding masks of length t and f to the spectrogram of the target audio feature information along the time axis and the frequency axis, respectively.
Alternatively, the audio data fusion process may be referred to as the same category enhancement. The server can intercept the original audio to obtain target audio, extract logarithmic mel characteristic or mel cepstrum coefficient characteristic of the target audio to obtain target audio characteristic information, intercept the original audio again to obtain analog audio different from the target audio, extract logarithmic mel characteristic or mel cepstrum coefficient characteristic of the analog audio to obtain analog audio characteristic information.
Then, the server may perform fusion processing of the target audio feature information and the analog audio feature information based on the following formula (1), to obtain first audio feature information, where formula (1) includes:
x̃ = λx_i + (1 − λ)x_j …… formula (1)
Wherein x̃ characterizes the first audio feature information; x_i characterizes the target audio feature information; x_j characterizes the analog audio feature information; and λ characterizes the fusion parameter, whose value lies between 0 and 1.
Optionally, the audio data time shift process refers to randomly rolling the target audio feature information along the time axis to obtain, for example, the first audio feature information.
Alternatively, the audio data pitch change process refers to randomly rolling the target audio feature information along the frequency axis within a preset range to obtain, for example, the first audio feature information.
In this way, the application performs data enhancement processing on the target audio feature information with one, two, three, or four of the four independent data enhancement modes to obtain different first audio feature information and second audio feature information. Because the target audio from which the first audio feature information and the second audio feature information are derived is randomly intercepted from the original audio, and the data enhancement modes are also randomly selected, the information contained in the first audio feature information and the second audio feature information is richer and broader, laying a good foundation for the robustness and generality of the features learned by the subsequent networks.
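A compact sketch of the four augmentation options applied to a log-mel feature matrix (frequency bins × time frames) follows; the mask lengths, shift ranges and fusion parameter are illustrative assumptions rather than values from the text.

```python
import numpy as np

def mixup(x_i: np.ndarray, x_j: np.ndarray, lam: float = 0.7) -> np.ndarray:
    """Audio data fusion, formula (1): x~ = lam * x_i + (1 - lam) * x_j."""
    return lam * x_i + (1.0 - lam) * x_j

def spec_augment(x: np.ndarray, t: int = 20, f: int = 8) -> np.ndarray:
    """Audio data expansion: mask a band of length t (time axis) and f (frequency axis)."""
    x = x.copy()
    t0 = np.random.randint(0, max(1, x.shape[1] - t))
    f0 = np.random.randint(0, max(1, x.shape[0] - f))
    x[:, t0:t0 + t] = 0.0
    x[f0:f0 + f, :] = 0.0
    return x

def time_shift(x: np.ndarray) -> np.ndarray:
    """Audio data time shift: roll the features randomly along the time axis."""
    return np.roll(x, np.random.randint(x.shape[1]), axis=1)

def pitch_change(x: np.ndarray, max_bins: int = 4) -> np.ndarray:
    """Audio data pitch change: roll within a preset range along the frequency axis."""
    return np.roll(x, np.random.randint(-max_bins, max_bins + 1), axis=0)
```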
In step S205, performing audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information, to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference.
In the embodiment of the application, the server can train the first original network and the second original network based on the first audio feature information and the second audio feature information to obtain the first target network and the second target network.
In an alternative embodiment, fig. 4 is a schematic structural diagram of a first original network and a second original network, where the first original network includes a first coding layer, a first projection layer, and a prediction layer connected in sequence, and the second original network includes a second coding layer and a second projection layer, as shown in fig. 4, according to an exemplary embodiment. Fig. 5 is a flowchart illustrating a network training process according to an exemplary embodiment, and the network training process is described below in conjunction with fig. 5, as shown in fig. 5, including:
In step S501, audio recognition processing is performed on the first audio feature information through the first original network, so as to obtain first audio output data.
Optionally, the server may input the first audio feature information into the first coding layer, output the coding feature information corresponding to the first audio feature information, then input the coding feature information into the first projection layer, output projection representation information corresponding to the first audio feature information, and input the projection representation information into the prediction layer to obtain the first audio output data.
Fig. 6 is a schematic diagram illustrating the structure of a first coding layer according to an exemplary embodiment. Alternatively, the first coding layer may consist of a 38-layer residual network comprising an input convolution layer, residual layers, and an output convolution layer. The input convolution layer consists of two convolution layers, each with 64 convolution kernels of size (3, 3). Optionally, the output convolution layer also consists of two convolution layers, each with 128 convolution kernels of size (3, 3).
Alternatively, the residual layers sit between the input convolution layer and the output convolution layer, and fig. 7 is a schematic diagram illustrating the structure of a residual layer according to an exemplary embodiment. As shown in fig. 7, the numbers of convolution kernels in the four sub-modules (a), (b), (c) and (d) are [64, 128, 256, 512] in sequence; modules with different numbers of convolution kernels are called residual sub-modules, and the residual sub-modules are repeated [3, 4, 6, 3] times in sequence.
Each residual sub-module contains a residual connection: its input is fed into a basic block and a supplementary block respectively, the two outputs are added, and the sum passes through a ReLU activation function to produce the output of the residual sub-module. The basic block consists of average pooling (absent in the first residual sub-module (a)), two (3, 3) convolution layers, a batch normalization (BN) layer, and a ReLU activation function; the supplementary block consists of average pooling (absent in the first residual sub-module (a)), one (1, 1) convolution layer, and a batch normalization (BN) layer.
Wherein the residual sub-module is represented by the following formula (2):
F(x) = ReLU(F1(x) + F2(x)) …… formula (2)
Wherein F is the representation function of the residual sub-module, F1 is the representation function of the basic block, and F2 is the representation function of the supplementary block.
Alternatively, the first projection layer and the prediction layer may be composed of fully connected layers with different numbers of neurons. For example, the number of neurons in the first projection layer is 128, and the number of neurons in the prediction layer is 512.
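A simplified PyTorch sketch of the first original network described above (encoder, first projection layer, prediction layer) is given below. The stated details are followed where the text gives them (per-stage kernel counts [64, 128, 256, 512], two input and two output (3, 3) convolution layers, a 128-neuron projection layer, a 512-neuron prediction layer); the per-stage repeat counts [3, 4, 6, 3] are omitted for brevity, and all remaining choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualSubModule(nn.Module):
    """Formula (2): F(x) = ReLU(F1(x) + F2(x))."""
    def __init__(self, in_ch: int, out_ch: int, pool: bool = True):
        super().__init__()
        basic = [nn.AvgPool2d(2)] if pool else []          # pooling absent in sub-module (a)
        basic += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
                  nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch)]
        self.basic = nn.Sequential(*basic)                  # basic block F1
        supp = [nn.AvgPool2d(2)] if pool else []
        supp += [nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch)]
        self.supplementary = nn.Sequential(*supp)           # supplementary block F2
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.basic(x) + self.supplementary(x))

class OnlineNetwork(nn.Module):
    """First original network: encoder + first projection layer + prediction layer."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1),       # input conv layers
            ResidualSubModule(64, 64, pool=False), ResidualSubModule(64, 128),
            ResidualSubModule(128, 256), ResidualSubModule(256, 512),              # residual layers
            nn.Conv2d(512, 128, 3, padding=1), nn.Conv2d(128, 128, 3, padding=1),  # output conv layers
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.projection = nn.Linear(128, 128)   # first projection layer (128 neurons)
        self.prediction = nn.Linear(128, 512)   # prediction layer (512 neurons)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, 1, mel_bins, frames)
        h = self.encoder(x)                                # coding feature information
        z = self.projection(h)                             # projection representation information
        return self.prediction(z)                          # first audio output data
```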
In step S503, performing audio recognition processing on the second audio feature information through the second original network to obtain second audio output data; wherein the data dimensions of the first audio output data and the second audio output data are the same.
Optionally, the server may input the second audio feature information into the second coding layer, output the coding feature information corresponding to the second audio feature information, and then input the coding feature information into the second projection layer, to obtain second audio output data.
Alternatively, the structure of the second coding layer may refer to the structure of the first coding layer, and is composed of a 38-layer residual network including an input convolution layer, a residual layer, and an output convolution layer. The second projection layer may be determined based on the prediction layer, with a fully connected layer of 512 neurons. Therefore, the second projection layer can output second audio output data with the same output dimension as the prediction layer, and subsequent similarity calculation is facilitated.
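Continuing the sketch above, the second original network can be drafted by pairing the same encoder structure with a 512-neuron projection layer so that its output dimension matches that of the prediction layer; reusing the OnlineNetwork encoder class from the previous sketch is an illustrative shortcut, not something the text prescribes.

```python
import torch
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Second original network: second coding layer + second projection layer."""
    def __init__(self):
        super().__init__()
        self.encoder = OnlineNetwork().encoder   # same 38-layer residual structure (see sketch above)
        self.projection = nn.Linear(128, 512)    # second projection layer (512 neurons)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection(self.encoder(x))  # second audio output data, same dimension as the prediction layer
```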
In step S505, audio similarity data is determined based on the first audio output data and the second audio output data.
In the embodiment of the application, because the first audio feature information and the second audio feature information are derived from the same original audio, the events and features contained in the first audio feature information and the second audio feature information are similar, and the data dimensions of the first audio output data and the second audio output data are the same. Based on this, a similarity matrix may be formed using the first audio output data and the second audio output data, and the audio similarity data may be calculated. Wherein, formula (3) of the audio similarity data is:
Wherein L characterizes the audio similarity data; q_θ(z_θ) characterizes the first audio output data; and z'_ξ characterizes the second audio output data.
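The surrounding text defines the symbols of formula (3) but does not reproduce the formula itself, so the sketch below simply assumes a cosine-similarity-based consistency loss between the two L2-normalized outputs, a common choice for this kind of two-network training; it is an assumption, not the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def similarity_loss(q_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Assumed consistency loss between the first and second audio output data."""
    q = F.normalize(q_online, dim=-1)               # q_theta(z_theta): first audio output data
    z = F.normalize(z_target.detach(), dim=-1)      # z'_xi: second audio output data (no gradient)
    return (2.0 - 2.0 * (q * z).sum(dim=-1)).mean() # equivalent to a normalized mean squared error
```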
In step S507, the first original network and the second original network are trained based on the audio similarity data.
In this manner, the server may train the first original network and the second original network with the audio similarity data described above.
Optionally, the server may update the first network parameter of the first original network based on the audio similarity data to obtain an updated first network parameter and an updated first original network, and update the second network parameter of the second original network based on the updated first network parameter to obtain an updated second network parameter and an updated second original network.
Specifically, the server may obtain the second network parameter and the moving average parameter of the second original network, determine the updated second network parameter based on the updated first network parameter, the second network parameter and the moving average parameter, update the second original network based on the updated second network parameter, and obtain the updated second original network.
Specifically, the server may update the second network parameter and the second original network by a moving average, as represented by the following formula (4):
ζ' = τζ + (1 − τ)θ …… formula (4)
Wherein ζ represents a second network parameter prior to update; ζ' represents the updated second network parameter; θ characterizes the updated first network parameter; τ characterizes the running average parameter.
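A minimal sketch of the moving-average update in formula (4), applied parameter by parameter; wrapping it in torch.no_grad() and the value of τ are implementation assumptions.

```python
import torch

@torch.no_grad()
def update_target_network(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.99) -> None:
    """Formula (4): zeta' = tau * zeta + (1 - tau) * theta for every second-network parameter."""
    for zeta, theta in zip(target_net.parameters(), online_net.parameters()):
        zeta.mul_(tau).add_((1.0 - tau) * theta)
```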
In step S509, in the case where the iteration termination condition is satisfied, a first target network and a second target network are obtained.
In the embodiment of the application, the server can train the first original network and the second original network in a circulating manner until the iteration termination condition is met, the trained first original network is determined to be the first target network, and the trained second original network is determined to be the second target network.
Optionally, the server may input the first audio feature information into the updated first original network to obtain the first audio output data of the current cycle, and input the second audio feature information into the updated second original network to obtain the second audio output data of the current cycle. Audio similarity data is then determined based on the first audio output data and the second audio output data of the current cycle, and the updated first original network and second original network are updated again with the new audio similarity data in the manner described above, so that the server completes the second round of training of the first original network and the second original network. The server may then complete the third, fourth, fifth … round of training, until the iteration termination condition is satisfied, at which point the trained first original network is determined to be the first target network and the trained second original network is determined to be the second target network.
In some possible embodiments, the iteration termination condition may be a preset number of loops, for example, if the current number of loops satisfies the preset number of loops (for example, 100 times), then the iteration termination condition is satisfied.
In other possible embodiments, the iteration termination condition may be preset similarity data, and if the audio similarity data is less than or equal to the preset similarity data, the iteration termination condition is satisfied.
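The following sketch ties the above steps into a cyclic training procedure that supports both kinds of iteration termination conditions (a preset number of loops and preset similarity data). It reuses the two sketches above; all names, the optimizer choice, and the hyperparameter values are assumptions.

```python
import torch

def train_networks(first_net, second_net, loader,
                   max_loops=100, target_similarity=None, lr=1e-3, tau=0.99):
    """Minimal training-loop sketch: only the first original network is updated by
    gradient descent; the second original network follows via formula (4)."""
    optimizer = torch.optim.Adam(first_net.parameters(), lr=lr)
    loss = None
    for loop in range(max_loops):                      # termination: preset number of loops
        for feat_first, feat_second in loader:         # first / second audio feature information
            loss = audio_similarity_loss(first_net(feat_first), second_net(feat_second))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            moving_average_update(second_net, first_net, tau)
        if target_similarity is not None and loss is not None and loss.item() <= target_similarity:
            break                                      # termination: preset similarity data
    return first_net, second_net                       # trained first / second target networks
```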
In the embodiment of the application, unlike the prior art, in which positive and negative samples must be constructed to train the model, which increases the training difficulty, the model can be optimized solely through the consistency of the output distributions of the two networks, without constructing positive and negative samples. This reduces the samples that need to be collected and thus further reduces the training difficulty.
In step S207, an audio recognition model is determined based on the first encoding layer in the first target network or the second encoding layer in the second target network.
In the embodiment of the application, the server can determine the audio recognition model based on either the first coding layer in the first target network or the second coding layer in the second target network, thereby obtaining the audio recognition model.
In other possible embodiments, the server may also verify the accuracy, generalization, and robustness of the audio recognition model. The server may verify the audio recognition model along two dimensions: music style and acoustic scene.
In an alternative embodiment, the server may obtain an audio style data set comprising first audio segments corresponding to N audio styles, where N is a positive integer greater than 1. The N audio styles may include popular music, classical music, heavy metal music, rock music, and other styles.
The server may perform logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each first audio segment in the audio style data set to obtain third audio feature information corresponding to each first audio segment. The specific manner of extracting the log mel feature or extracting the mel cepstrum coefficient feature may refer to the above, and will not be described herein.
The server can input the third audio feature information corresponding to each first audio segment into the audio recognition model to obtain first coding feature information corresponding to each first audio segment. Using a K-nearest neighbor algorithm, the server may classify the audio style data set into a plurality of first audio fragment sets based on the first coding feature information corresponding to each first audio fragment, where each first audio fragment set comprises at least one first audio fragment in the audio style data set. When the number of first audio fragment sets equals N, the audio recognition model is determined to have converged and passed verification; when it does not, the audio recognition model is determined to have failed verification and further training is needed.
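As an illustration of this verification step, the sketch below groups the first coding feature information into fragment sets and compares the number of sets with N; the same pattern applies to the audio scene verification described next, with M scenes. The embodiment names a K-nearest neighbor algorithm without detailing the grouping procedure, so the sketch substitutes a KMeans clustering purely to show the set-counting check; that substitution and the function name are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def verify_style_sets(coding_features: np.ndarray, n_styles: int) -> bool:
    """coding_features: (num_clips, dim) first coding feature information.
    Groups the clips into fragment sets and checks whether the number of
    non-empty sets matches N (verification succeeds) or not (further training)."""
    labels = KMeans(n_clusters=n_styles, n_init=10, random_state=0).fit_predict(coding_features)
    num_sets = len(np.unique(labels))      # each set contains at least one first audio fragment
    return num_sets == n_styles
```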
In another alternative embodiment, the server may obtain an audio scene data set comprising second audio segments corresponding to M audio scenes, where M is a positive integer greater than 1. The M audio scenes may include beach scenes, train driving scenes, indoor scenes, traffic jam scenes, and the like.
The server may perform log mel feature extraction or mel cepstrum coefficient feature extraction on each second audio segment in the audio scene data set to obtain fourth audio feature information corresponding to each second audio segment. The specific manner of extracting the log mel feature or extracting the mel cepstrum coefficient feature may refer to the above, and will not be described herein.
The server can input the fourth audio feature information corresponding to each second audio segment into the audio recognition model to obtain second coding feature information corresponding to each second audio segment. Using a K-nearest neighbor algorithm, the server may classify the audio scene data set into a plurality of second audio fragment sets based on the second coding feature information corresponding to each second audio fragment, where each second audio fragment set comprises at least one second audio fragment in the audio scene data set. When the number of second audio fragment sets equals M, the audio recognition model is determined to have converged and passed verification; when it does not, the audio recognition model is determined to have failed verification and further training is needed.
By re-checking the audio recognition model against both audio styles and audio scenes, the accuracy of the audio recognition model can be verified, which supports its subsequent application.
In the embodiment of the application, a data set with a large data volume is not needed; the model can be trained with a small amount of target audio by relying on the consistency of the output distributions of the two networks, so the training difficulty is low. In addition, because the target audio underlying the first audio feature information and the second audio feature information is randomly intercepted from the original audio and the data enhancement modes are also randomly selected, the information contained in the first audio feature information and the second audio feature information is richer and broader, which establishes a good foundation for the robustness and generality of the features learned by the subsequent networks.
Fig. 8 is a flowchart illustrating an audio recognition method according to an exemplary embodiment, and as shown in fig. 8, the audio recognition method may be applied to a server or a client, including the steps of:
in step S801, audio to be recognized is acquired.
In step S803, the audio to be recognized is input into an audio recognition model trained by the audio recognition model training method described above, to obtain the coding feature information of the audio to be recognized.
In step S805, style information and/or scene information of the audio to be recognized is determined based on the encoding feature information of the audio to be recognized.
In the embodiment of the application, the server can acquire the audio to be identified, input the audio to be identified into the audio identification model obtained by training the audio identification model training method, obtain the coding characteristic information of the audio to be identified, and determine the style information and/or the scene information of the audio to be identified based on the coding characteristic information of the audio to be identified.
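A minimal sketch of this recognition flow is given below, under the assumption that style and scene information are obtained by applying separate downstream classifier heads to the coding feature information; the embodiment does not specify how that mapping is performed, so the head arguments and the function name are assumptions.

```python
import torch

def recognize_audio(model, audio_feature_info, style_head=None, scene_head=None):
    """Run the trained audio recognition model on the feature information of the
    audio to be recognized, then derive style and/or scene information."""
    model.eval()
    with torch.no_grad():
        coding_features = model(audio_feature_info)    # coding feature information
        style_info = style_head(coding_features) if style_head is not None else None
        scene_info = scene_head(coding_features) if scene_head is not None else None
    return coding_features, style_info, scene_info
```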
FIG. 9 is a block diagram of an audio recognition model training apparatus according to an exemplary embodiment. The apparatus has the function of implementing the audio recognition model training method in the above method embodiments, and the function may be implemented by hardware or by hardware executing corresponding software. Referring to fig. 9, the apparatus includes a first information determination module 901, a second information determination module 902, a network training module 903, and an identification model determination module 904.
A first information determination module 901 configured to perform determination of target audio feature information;
a second information determining module 902 configured to perform a first data enhancement process and a second data enhancement process on the target audio feature information, respectively, to obtain first audio feature information and second audio feature information;
The network training module 903 is configured to perform audio recognition training on the first original network and the second original network based on the first audio feature information and the second audio feature information, so as to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference;
the recognition model determination module 904 is configured to perform determining an audio recognition model based on the first encoding layer in the first target network or the second encoding layer in the second target network.
In some possible embodiments, the network training module is configured to perform:
performing audio identification processing on the first audio feature information through a first original network to obtain first audio output data;
Performing audio identification processing on the second audio characteristic information through a second original network to obtain second audio output data; wherein the data dimensions of the first audio output data and the second audio output data are the same;
determining audio similarity data based on the first audio output data and the second audio output data;
Training the first original network and the second original network based on the audio similarity data;
and under the condition that the iteration termination condition is met, obtaining a first target network and a second target network.
In some possible embodiments, the network training module is configured to perform:
updating the first network parameters of the first original network based on the audio similarity data to obtain updated first network parameters and updated first original network;
Updating the second network parameters of the second original network based on the updated first network parameters to obtain updated second network parameters and updated second original network;
Training the first original network and the second original network in a circulating way until the iteration termination condition is met;
and determining the trained first original network as a first target network, and determining the trained second original network as a second target network.
In some possible embodiments, the network training module is configured to perform:
Acquiring a second network parameter and a moving average parameter of a second original network;
Determining an updated second network parameter based on the updated first network parameter, the second network parameter, and the moving average parameter;
And updating the second original network based on the updated second network parameters to obtain an updated second original network.
In some possible embodiments, the first information determination module is configured to perform:
Acquiring original audio;
cutting out the original audio to obtain target audio;
and carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on the target audio to obtain target audio feature information.
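A minimal sketch of this module's processing is shown below, using librosa for the feature extraction; the clip length, mel/MFCC parameters, and function name are assumptions.

```python
import numpy as np
import librosa

def target_audio_feature_info(path: str, clip_seconds: float = 10.0,
                              use_mfcc: bool = False, n_mels: int = 64) -> np.ndarray:
    """Acquire the original audio, intercept a target clip from it, and extract
    logarithmic mel features or mel cepstrum coefficient features."""
    y, sr = librosa.load(path, sr=None)                           # original audio
    clip_len = int(clip_seconds * sr)
    start = np.random.randint(0, max(1, len(y) - clip_len))       # random interception
    target = y[start:start + clip_len]                            # target audio
    if use_mfcc:
        return librosa.feature.mfcc(y=target, sr=sr, n_mfcc=20)   # mel cepstrum coefficients
    mel = librosa.feature.melspectrogram(y=target, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                               # logarithmic mel features
```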
In some possible embodiments, the second information determination module is configured to perform:
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain first audio feature information;
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain second audio feature information; the first audio feature information and the second audio feature information are not identical.
In some possible embodiments, the apparatus further comprises a first authentication module configured to perform:
acquiring an audio style dataset; the audio style data set comprises N first audio clips corresponding to the audio styles; wherein N is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each first audio fragment in the audio style data set to obtain third audio feature information corresponding to each first audio fragment;
inputting third audio characteristic information corresponding to each first audio fragment into an audio recognition model to obtain first coding characteristic information corresponding to each first audio fragment;
classifying the audio style data set into a plurality of first audio fragment sets based on the first coding feature information corresponding to each first audio fragment; each first audio clip set includes at least one first audio clip in the audio style data set;
when the number of the first audio fragment sets satisfies N, it is determined that the audio recognition model verification is successful.
In some possible embodiments, the apparatus further comprises a second authentication module configured to perform:
acquiring an audio scene data set; the audio scene data set comprises second audio clips corresponding to the M audio scenes; wherein M is a positive integer greater than 1;
carrying out logarithmic mel feature extraction or mel cepstrum coefficient feature extraction on each second audio fragment in the audio scene data set to obtain fourth audio feature information corresponding to each second audio fragment;
inputting fourth audio characteristic information corresponding to each second audio fragment into an audio recognition model to obtain second coding characteristic information corresponding to each second audio fragment;
Classifying the audio scene data set into a plurality of second audio fragment sets based on the second coding feature information corresponding to each second audio fragment; each second set of audio segments comprises at least one second audio segment in the audio scene data set;
when the number of the second audio fragment sets satisfies M, it is determined that the audio recognition model verification is successful.
In some possible embodiments, the first encoding layer and the second encoding layer are both 38-layer residual networks.
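The embodiments state only that each encoding layer is a 38-layer residual network without giving its exact layout; the sketch below stacks generic residual blocks merely to illustrate the shape of such an encoder over spectrogram input. The depth, width, and embedding size shown are assumptions, not the patent's exact configuration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic two-convolution residual block with identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

def make_encoding_layer(channels: int = 64, num_blocks: int = 18, embed_dim: int = 256) -> nn.Module:
    """Residual encoder over a (batch, 1, n_mels, frames) spectrogram input."""
    return nn.Sequential(
        nn.Conv2d(1, channels, 3, padding=1),
        nn.ReLU(inplace=True),
        *[ResBlock(channels) for _ in range(num_blocks)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(channels, embed_dim),
    )
```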
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 10 is a block diagram of an audio recognition apparatus according to an exemplary embodiment. The apparatus has the function of implementing the audio recognition method in the above method embodiments, and the function may be implemented by hardware or by hardware executing corresponding software. Referring to fig. 10, the apparatus includes an audio acquisition module 1001, an encoding information determination module 1002, and a style scene determination module 1003.
An audio acquisition module 1001 configured to perform acquisition of audio to be recognized;
the encoding information determining module 1002 is configured to input the audio to be recognized into an audio recognition model obtained by training with the above audio recognition model training method, to obtain the coding feature information of the audio to be recognized;
The style scene determination module 1003 is configured to perform determining style information and/or scene information of the audio to be identified based on the encoding feature information of the audio to be identified.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Fig. 11 is a block diagram illustrating an apparatus 3000 for audio recognition model training or audio recognition, according to an example embodiment. For example, apparatus 3000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 11, the apparatus 3000 may include one or more of the following components: a processing component 3002, a memory 3004, a power component 3006, a multimedia component 3008, an audio component 3010, an input/output (I/O) interface 3012, a sensor component 3014, and a communication component 3016.
The processing component 3002 generally controls overall operations of the device 3000, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 3002 may include one or more processors 3020 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 3002 may include one or more modules to facilitate interactions between the processing component 3002 and other components. For example, the processing component 3002 may include a multimedia module to facilitate interaction between the multimedia component 3008 and the processing component 3002.
The memory 3004 is configured to store various types of data to support operations at the device 3000. Examples of such data include instructions for any application or method operating on device 3000, contact data, phonebook data, messages, pictures, videos, and the like. The memory 3004 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 3006 provides power to the various components of the device 3000. The power supply components 3006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 3000.
The multimedia component 3008 includes a screen between the device 3000 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia assembly 3008 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 3000 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 3010 is configured to output and/or input audio signals. For example, the audio component 3010 includes a microphone (MIC) configured to receive external audio signals when the device 3000 is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 3004 or transmitted via the communication component 3016. In some embodiments, the audio component 3010 further comprises a speaker for outputting audio signals.
The I/O interface 3012 provides an interface between the processing component 3002 and a peripheral interface module, which may be a keyboard, click wheel, button, or the like. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 3014 includes one or more sensors for providing status assessments of various aspects of the device 3000. For example, the sensor assembly 3014 may detect the on/off state of the device 3000 and the relative positioning of components such as the display and keypad of the device 3000; it may also detect a change in position of the device 3000 or a component of the device 3000, the presence or absence of user contact with the device 3000, the orientation or acceleration/deceleration of the device 3000, and a change in temperature of the device 3000. The sensor assembly 3014 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 3014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 3014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 3016 is configured to facilitate wired or wireless communication between the apparatus 3000 and other devices. The device 3000 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 3016 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 3016 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 3000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
Embodiments of the present invention also provide a computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing an audio recognition model training method, where the at least one instruction or the at least one program is loaded and executed by a processor of the electronic device to implement the audio recognition model training method provided by the above method embodiments.
Embodiments of the present invention also provide a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of the computer device reads and executes the computer program, causing the computer device to perform the method of any of the first aspects of the embodiments of the present disclosure.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (15)
1. An audio recognition model training method, comprising:
Determining target audio feature information;
Respectively carrying out first data enhancement processing and second data enhancement processing on the target audio feature information to obtain first audio feature information and second audio feature information;
Respectively performing audio recognition training on a first original network and a second original network based on the first audio feature information and the second audio feature information to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference;
Determining an audio recognition model based on a first encoding layer in the first target network or a second encoding layer in the second target network;
The audio recognition training is performed on the first original network and the second original network based on the first audio feature information and the second audio feature information respectively to obtain a first target network and a second target network, including:
Performing audio identification processing on the first audio feature information through the first original network to obtain first audio output data;
performing audio recognition processing on the second audio feature information through the second original network to obtain second audio output data;
determining audio similarity data based on the first audio output data and the second audio output data;
training the first and second original networks based on the audio similarity data;
and under the condition that the iteration termination condition is met, obtaining the first target network and the second target network.
2. The audio recognition model training method of claim 1, wherein the data dimensions of the first audio output data and the second audio output data are the same.
3. The audio recognition model training method of claim 2, wherein the training of the first and second original networks is based on the audio similarity data; and under the condition that the iteration termination condition is met, obtaining the first target network and the second target network comprises the following steps:
Updating the first network parameters of the first original network based on the audio similarity data to obtain updated first network parameters and updated first original network;
Updating the second network parameters of the second original network based on the updated first network parameters to obtain updated second network parameters and updated second original network;
Training the first original network and the second original network in a circulating way until the iteration termination condition is met;
and determining the trained first original network as the first target network, and determining the trained second original network as the second target network.
4. The method of training an audio recognition model according to claim 3, wherein updating the second network parameter of the second original network based on the updated first network parameter to obtain an updated second network parameter and an updated second original network comprises:
acquiring a second network parameter and a moving average parameter of the second original network;
determining the updated second network parameter based on the updated first network parameter, the second network parameter, and the moving average parameter;
And updating the second original network based on the updated second network parameter to obtain the updated second original network.
5. The method for training an audio recognition model according to any one of claims 1 to 4, wherein determining the target audio feature information comprises:
Acquiring original audio;
Intercepting the original audio to obtain target audio;
And carrying out logarithmic Mel characteristic extraction or Mel cepstrum coefficient characteristic extraction on the target audio to obtain the target audio characteristic information.
6. The method of claim 1, wherein the performing a first data enhancement process and a second data enhancement process on the target audio feature information to obtain first audio feature information and second audio feature information includes:
Performing one or more of audio data expansion processing, audio data fusion processing, audio data time shift processing and audio data pitch change processing on the target audio feature information to obtain first audio feature information;
Performing one or more of the audio data expansion process, the audio data fusion process, the audio data time shift process and the audio data pitch change process on the target audio feature information to obtain the second audio feature information; the first audio feature information and the second audio feature information are not identical.
7. A method of training an audio recognition model according to any of claims 1-3, further comprising, after said determining an audio recognition model based on a first coding layer in said first target network:
acquiring an audio style dataset; the audio style dataset comprises N first audio clips corresponding to audio styles; wherein N is a positive integer greater than 1;
extracting logarithmic mel characteristics or mel cepstrum coefficient characteristics of each first audio fragment in the audio style data set to obtain third audio characteristic information corresponding to each first audio fragment;
inputting the third audio characteristic information corresponding to each first audio fragment into the audio recognition model to obtain first coding characteristic information corresponding to each first audio fragment;
classifying the audio style data set into a plurality of first audio fragment sets based on the first coding characteristic information corresponding to each first audio fragment; each first audio segment set comprising at least one first audio segment in the audio style data set;
and when the number of the first audio fragment sets meets the N, determining that the audio identification model is successfully verified.
8. The method according to any one of claims 1-4, wherein after determining an audio recognition model based on a first coding layer in the first target network, further comprising:
acquiring an audio scene data set; the audio scene data set comprises second audio clips corresponding to M audio scenes; wherein M is a positive integer greater than 1;
extracting logarithmic mel characteristics or mel cepstrum coefficient characteristics of each second audio fragment in the audio scene data set to obtain fourth audio characteristic information corresponding to each second audio fragment;
Inputting fourth audio characteristic information corresponding to each second audio fragment into the audio recognition model to obtain second coding characteristic information corresponding to each second audio fragment;
Classifying the audio scene data set into a plurality of second audio fragment sets based on the second coding feature information corresponding to each second audio fragment; each second set of audio clips includes at least one second audio clip in the set of audio scene data;
and when the number of the second audio fragment sets meets M, determining that the audio identification model is successfully verified.
9. The audio recognition model training method of claim 1, wherein the first encoding layer and the second encoding layer are both 38-layer residual networks.
10. An audio recognition method, comprising:
Acquiring audio to be identified;
Inputting the audio to be identified into an audio identification model obtained by training according to the audio identification model training method of any one of claims 1 to 9, and obtaining coding characteristic information of the audio to be identified;
and determining style information and/or scene information of the audio to be identified based on the coding characteristic information of the audio to be identified.
11. An audio recognition model training device, comprising:
A first information determination module configured to perform determination of target audio feature information;
the second information determining module is configured to execute first data enhancement processing and second data enhancement processing on the target audio feature information respectively to obtain first audio feature information and second audio feature information;
The network training module is configured to perform audio recognition training on a first original network and a second original network based on the first audio feature information and the second audio feature information respectively to obtain a first target network and a second target network; the difference between the first audio output data of the first target network and the second audio output data of the second target network is smaller than or equal to a preset difference;
an identification model determination module configured to perform determining an audio identification model based on a first encoding layer in the first target network or a second encoding layer in the second target network;
The network training module is configured to perform audio recognition processing on the first audio feature information through the first original network to obtain the first audio output data; performing audio recognition processing on the second audio feature information through the second original network to obtain second audio output data; determining audio similarity data based on the first audio output data and the second audio output data; training the first and second original networks based on the audio similarity data; and under the condition that the iteration termination condition is met, obtaining the first target network and the second target network.
12. An audio recognition apparatus, comprising:
an audio acquisition module configured to perform acquisition of audio to be recognized;
The coding information determining module is configured to perform inputting the audio to be identified into an audio identification model obtained by training according to the audio identification model training method of any one of claims 1 to 9, so as to obtain coding characteristic information of the audio to be identified;
And the style scene determining module is configured to determine style information and/or scene information of the audio to be identified based on the coding characteristic information of the audio to be identified.
13. An electronic device, comprising:
A processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the audio recognition model training method of any one of claims 1 to 9 or the audio recognition method of claim 10.
14. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the audio recognition model training method of any one of claims 1 to 9 or the audio recognition method of claim 10.
15. A computer program product, characterized in that the computer program product comprises a computer program stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the audio recognition model training method of any one of claims 1 to 9 or the audio recognition method of claim 10.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211067740.8A CN115547308B (en) | 2022-09-01 | 2022-09-01 | Audio recognition model training method, audio recognition method, device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115547308A CN115547308A (en) | 2022-12-30 |
| CN115547308B true CN115547308B (en) | 2024-09-20 |
Family
ID=84725849
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211067740.8A Active CN115547308B (en) | 2022-09-01 | 2022-09-01 | Audio recognition model training method, audio recognition method, device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115547308B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115993503B (en) * | 2023-03-22 | 2023-06-06 | 广东电网有限责任公司东莞供电局 | Operation detection method, device and equipment of transformer and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111368997A (en) * | 2020-03-04 | 2020-07-03 | 支付宝(杭州)信息技术有限公司 | Training method and device of neural network model |
| CN111429946A (en) * | 2020-03-03 | 2020-07-17 | 深圳壹账通智能科技有限公司 | Voice emotion recognition method, device, medium and electronic equipment |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2058797B1 (en) * | 2007-11-12 | 2011-05-04 | Harman Becker Automotive Systems GmbH | Discrimination between foreground speech and background noise |
| JP6350935B2 (en) * | 2014-02-28 | 2018-07-04 | 国立研究開発法人情報通信研究機構 | Acoustic model generation apparatus, acoustic model production method, and program |
| US11488013B2 (en) * | 2019-05-07 | 2022-11-01 | Samsung Electronics Co., Ltd. | Model training method and apparatus |
| CN114023354A (en) * | 2021-08-24 | 2022-02-08 | 上海师范大学 | Guidance type acoustic event detection model training method based on focusing loss function |
| CN114898737A (en) * | 2022-04-14 | 2022-08-12 | 上海师范大学 | Acoustic event detection method, apparatus, electronic device and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115547308A (en) | 2022-12-30 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |