WO2024118649A1 - Systems, methods, and media for automatically transcribing lyrics of songs - Google Patents
- Publication number
- WO2024118649A1 (PCT/US2023/081418)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- song
- lyrics
- source separation
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/81—Detection of presence or absence of voice signals for discriminating voice from music
Definitions
- Turning to FIG. 4, an example illustration 400 of a tool that can be used to review the lyric transcription model output and to revise incorrect and missing lyrics and times in accordance with some embodiments is shown.
- tool 400 receives as input corresponding sets of: 1) lyrics, start time, and end time 402; and 2) audio of a song 404, in some embodiments. These lyrics can be empty if there is a sample with no predicted lyrics (which could be correct or incorrect depending on the song and portion thereof), in some embodiments.
- the tool then allows a user to listen to the audio and view the corresponding lyrics 406 that are being sung as determined by the transcription model, in some embodiments.
- the corresponding portion of the transcribed lyrics can be highlighted, as represented by box 408. Highlighting can be performed in any suitable manner, such as by changing the foreground color, the background color, and/or the font of the lyrics, by adding bolding, underlining, and/or italicization, and/or performing any other suitable modification to the display of the lyrics. Any suitable amount of the lyrics can be highlighted in some embodiments. For example, one or more words or portions of words can be highlighted in some embodiments.
- a user can edit the highlighted portion of the lyrics using edit field 410 and edit button 412.
- a space 410 can be left in the interface where there are allegedly no lyrics to allow the user to insert lyrics using an insert field 414 and an insert button 416.
- the user can pause, resume play, rewind, and fast-forward the song and lyrics display using buttons 418, 420, 422, and 424, respectively.
- play, rewind, and fast-forward speeds can be controlled by the user.
- the user can control playback, rewinding, and/or fast-forwarding to occur at any suitable fraction (e.g., small millisecond increments) and/or multiple (which can be integer or real value multiples) of the normal playback rate.
- the corrected lyrics are saved along with the corresponding audio for re-training any suitable one or more of the models described herein. This re-training can be performed as described further below in some embodiments.
- Turning to FIGS. 5-9, examples of mechanisms for training the models described above in accordance with some embodiments are illustrated.
- FIG. 5 shows an example 500 of a process for training a source separation model that can be used in some embodiments.
- the process first receives, at 502, fully mixed and mastered audio for a song and the tracks of audio used to create the mixed and mastered audio for the song.
- These components can be received from any suitable source in any suitable manner in some embodiments.
- the components can be received from a producer of the music.
- process 500 trains the source separation model to separate stems from fully mixed and mastered audio to match the tracks of audio received at 502. This training can be performed in any suitable manner in some embodiments.
- Process 500 can then loop back to 502 for the next song to be used to train the source separation model.
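- One way this training loop could be realized is sketched below; the model class, data loading, and use of an L1 loss on the stem waveforms are assumptions, since the text does not specify a loss or framework.

```python
import torch
import torch.nn as nn

def separation_train_step(model, optimizer, mixture, target_stems):
    """One update of a source separation model so that the stems it separates
    from the mixed/mastered audio match the known tracks.
    mixture: (batch, samples); target_stems: (batch, n_stems, samples)."""
    optimizer.zero_grad()
    predicted_stems = model(mixture)                  # (batch, n_stems, samples)
    loss = nn.functional.l1_loss(predicted_stems, target_stems)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage, assuming a hypothetical u-net model and a dataset of (mixture, stems):
# model = SeparationUNet(n_stems=4)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for mixture, stems in dataloader:
#     separation_train_step(model, optimizer, mixture, stems)
```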
- Turning to FIG. 6, an example 600 of a transcription model training pipeline in accordance with some embodiments is illustrated.
- a pre-trained acoustic model 602 and a pretrained language model 604 can be loaded as acoustic model 606 and language model 608, in some embodiments.
- Any suitable pre-trained acoustic model and any suitable pre-trained language model can be used as models 602 and 604 in some embodiments.
- a pre-trained acoustic model and a pre-trained language model can be created using a self-supervision pre-trained acoustic model and a self-supervision pre-trained language model.
- a pre-trained acoustic model and a pre-trained language model can be models previously used as an acoustic model and as a language model, respectively, in a transcription model and which are to be re-trained.
- pre-trained language model 604 can be a BERT model that is fine-tuned for grammar correction.
- lyrics 618 and corresponding vocal stems 620 and instrumental stems 622 are retrieved.
- Any suitable lyrics and corresponding vocal stems and instrumental stems can be retrieved in some embodiments.
- lyrics and corresponding stems for English language songs can be retrieved.
- Any suitable stems can be used in some embodiments.
- audio features of stems 620 and 622 are generated for a window of stems 620 and 622.
- Any suitable audio features can be generated in some embodiments.
- the same features as described above in connection with 206 of FIG. 2 can be extracted.
- the window can have any suitable size.
- the window can be a 30s window.
- the audio features that are extracted can be selected to match the input requirements of a first layer of neural network(s) used for the acoustic model.
- the first layer of the neural network(s) may be configured to receive a matrix of 39 MFCCs over 2000 overlapping (by 10ms) 25ms subwindows of a 30s window — i.e., a 39 x 2000 matrix. Any suitable number of sub-windows, having any suitable size and any suitable overlap (including no overlap) can be used in some embodiments. Any suitable number of features can be determined for each sub-window in some embodiments.
- the window can be at any suitable position in the song corresponding to lyrics 618, corresponding vocal stems 620 and instrumental stems 622.
- the window can be positioned to include a portion of the song having the first lyrics in the song.
- the window can be positioned at a portion of the song in which no lyrics are present so that the acoustic model can be trained to recognize portions of songs with no lyrics.
- padding can be performed at 626 to fill the input matrix when insufficient audio is present to match the input of the neural network.
- the acoustic model predicts the next lyric token 630 based on audio features generated at 626 and previous known lyrics 618. This prediction can be performed in any suitable manner, such as that described above in connection with 208 of FIG. 2 and FIG. 3, in some embodiments. Predicted token 630 is then compared to current known lyrics 631 and the weights of the models in the acoustic model are adjusted accordingly (a sketch of this training step appears below).
- the next window of the stems can be selected and the processing described above repeated for the new window by looping back to 626.
- the next window of the stems can be selected in any suitable manner in some embodiments. For example, the next window can be selected by moving the window forward 30s (or any other suitable period of time).
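- A minimal sketch of the training step described above is shown below; the acoustic model's call signature, tokenization, and use of a cross-entropy loss are assumptions.

```python
import torch
import torch.nn as nn

def acoustic_train_step(model, optimizer, vocal_feats, instr_feats,
                        prev_tokens, target_token):
    """Predict the next lyric token from the window's audio features and the
    previously known lyrics, compare it to the current known token (631), and
    adjust the model weights accordingly.
    vocal_feats/instr_feats: (batch, n_features, n_frames)
    prev_tokens: (batch, seq_len); target_token: (batch,)."""
    optimizer.zero_grad()
    logits = model(vocal_feats, instr_feats, prev_tokens)  # (batch, vocab_size)
    loss = nn.functional.cross_entropy(logits, target_token)
    loss.backward()
    optimizer.step()
    return loss.item()
```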
- language model 608 can receive altered lyrics and corresponding correct lyrics 632 at 634. Altered lyrics and corresponding correct lyrics 632 can be received from any suitable source in some embodiments. The altered lyrics can be generated by modifying correct lyrics in any suitable manner.
- altered lyrics can be generated by omitting one or more words or one or more portions of a word in the correct lyrics.
- altered lyrics can be generated by transposing words in the lyrics, changing the spelling of words in the lyrics, etc.
- language model 608 can predict corrected lyrics 636 based on the altered lyrics received at 634. These corrected lyrics can then be compared to the correct lyrics received at 634 and the language model fine-tuned based on the errors found.
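- As an illustration of how pairs of altered and correct lyrics might be produced for this fine-tuning, a sketch follows; the corruption probabilities and operations are assumptions, and the text mentions omission, transposition, and misspelling only as examples.

```python
import random

def corrupt_lyrics(correct_lyrics: str, p_drop=0.1, p_swap=0.05) -> str:
    """Generate 'altered' lyrics from correct lyrics by randomly omitting words
    and transposing adjacent words."""
    words = correct_lyrics.split()
    words = [w for w in words if random.random() > p_drop]      # omit words
    for i in range(len(words) - 1):
        if random.random() < p_swap:                            # transpose words
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

# Each (corrupt_lyrics(lyrics), lyrics) pair can then be fed to the language
# model, which is fine-tuned to map the altered lyrics back to the correct ones.
```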
- the processes described above in connection with FIG. 6 can be repeated for any suitable number of songs to train the transcription model.
- Turning to FIG. 7, an example 700 of a process for creating a pre-trained acoustic model that can be used as pre-trained model 602 of FIG. 6 in accordance with some embodiments is illustrated.
- audio without lyric data 702 can be used to perform self-supervised training 704 to form a pre-trained acoustic model 706.
- Pre-trained acoustic model 706 can be any suitable model in some embodiments.
- the pretrained acoustic model can be a model having a neural network auto-encoder architecture.
- Audio without lyric data 702 can include audio features (e.g., as described above in connection with 206 of FIG. 2, such as MFCCs) of both one or more vocal stems and one or more instrument stems for each of any suitable number of songs.
- pre-trained acoustic model 706 can be initially trained using audio features (e.g., MFCCs) of audio without lyric data 702. Model 706 can then be used as an initial acoustic model, which can then be fine-tuned using audio with lyric data 710.
- audio with lyric data 710 can include audio features (e.g., as described above in connection with 206 of FIG. 2, such as MFCCs) of both one or more vocal stems and one or more instrument stems for each of any suitable number of songs, and known-correct, time-coded lyrics in text form for each of the songs.
- representations from the self-supervised acoustic and instrument models can also be fed directly as additional hidden inputs into the corresponding model.
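- A rough sketch of such self-supervised pre-training on audio features alone is given below; the auto-encoder layout, feature shapes, and reconstruction loss are assumptions beyond the text's statement that an auto-encoder architecture can be used.

```python
import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    """Tiny auto-encoder over per-frame audio features (e.g., 39 MFCCs)."""
    def __init__(self, n_features=39, hidden=128, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_features))

    def forward(self, x):                 # x: (batch, n_frames, n_features)
        return self.decoder(self.encoder(x))

def pretrain_step(model, optimizer, features):
    """Reconstruct the input features; no lyric labels are required."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features), features)
    loss.backward()
    optimizer.step()
    return loss.item()
```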
- the language model is trained on lyrics.
- Such a model can be trained based on existing natural language processing tasks such as lyric filling, sentiment analysis, and topic modeling in some embodiments.
- domain differences in different types of music can be considered when training language models.
- different language models can be trained on clusters of songs using a similarity model, such that each model captures certain acoustic and linguistic nuances of a type of song.
- the language model not only incorporates the language structure in composing lyrics but also considers different composing styles for different types of music.
- different languages and linguistic properties can be accounted for in a single language model by incorporating all of the data and adding language and music type prediction tasks to allow local conditioning for the setting of each prediction task.
- a good training data set can be regarded as one that represents the entire population of music, including music of different genres, languages, and ages, and any other acoustic quality that would make it sound different.
- a good training dataset can cover a diverse map of music inputs, including stem-separated audio files for the source separation and lyric recognition models.
- data augmentation can be used to improve a training data set.
- Data augmentation can apply one or more transformations to the various audio recordings (stems and full source) to provide a wider variety of musical inputs capturing as much of the space of timbral and acoustic qualities as possible, in some embodiments. In this way, data augmentation can artificially increase the amount of data by generating new data points from existing data in some embodiments.
- new data augmentation techniques specific to the lyric recognition task are provided.
- the new data augmentation techniques transform the audio input along degrees of freedom, such as but not limited to timbre, pitch, speed, loudness and others, to which it is desirable for the models described herein to be invariant.
- pre-trained acoustic models or features extracted from audio can be used to understand which transformations should be required to balance the training data set in some embodiments. For example, in some embodiments, less prevalent pitches and tempos could be generated by transforming the available audio.
- FIG. 8 illustrates an example of a data augmentation pipeline in accordance with some embodiments.
- the process involves sampling from the training dataset, applying transformations to the audio, and then adding these audio transformations (alongside lyrics and other relevant metadata) to the training data.
- Examples of data augmentation techniques that can be applied include the following (a sketch of the first two appears below):
- Pitch change: apply to the vocal component of the audio to shift the pitches up and down using traditional digital signal processing techniques, in some embodiments.
- Time stretch: speed up or slow down the audio file, in some embodiments. This also requires scaling the end time in the lyric data to ensure the pairs of lyrics and audio remain aligned, in some embodiments.
- Timbre transfer: this can be achieved using differentiable digital signal processing (DDSP), for example, in some embodiments.
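- A sketch of the first two transformations using librosa follows; the file names and shift/stretch amounts are illustrative, and the lyric time codes must be rescaled by the same rate used for stretching.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("vocal_stem.wav", sr=None)       # placeholder input stem

# Pitch change: shift the vocal up two semitones without changing duration.
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Time stretch: play back 10% faster; lyric times must be divided by the rate.
rate = 1.1
y_fast = librosa.effects.time_stretch(y, rate=rate)

sf.write("vocal_stem_pitch_up2.wav", y_shifted, sr)
sf.write("vocal_stem_fast.wav", y_fast, sr)

def rescale_time_codes(lyric_rows, rate):
    """Scale (lyric, start, end) rows so the lyrics stay aligned after stretching."""
    return [(text, start / rate, end / rate) for text, start, end in lyric_rows]
```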
- FIG. 9 illustrates an example of a pipeline that can be used to train and use a timbre transfer model to change the vocal qualities of a recording while preserving the pitch and other spectral characteristics of the waveform, in some embodiments.
- Audio-text alignment can be applied to lyrics that do not have time codes by providing the lyrics and corresponding audio file as input to a forced alignment model such as but not limited to dynamic time warping.
- such algorithms can also be applied to validate and improve the accuracy of the lyric text that is time coded.
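- As a rough illustration of dynamic-time-warping-based alignment, the vocal stem can be warped against a synthesized reading of the lyric text whose word boundaries are known; the file names, feature choice, and this TTS-based proxy are all assumptions, and production forced aligners are typically more elaborate.

```python
import librosa

# Features of the isolated vocal stem and of a synthesized reading of the lyrics.
vocal, sr = librosa.load("vocal_stem.wav", sr=16000)
reading, _ = librosa.load("lyrics_reading.wav", sr=16000)

X = librosa.feature.mfcc(y=vocal, sr=sr, n_mfcc=13)
Y = librosa.feature.mfcc(y=reading, sr=sr, n_mfcc=13)

# Dynamic time warping returns a path wp of (vocal_frame, reading_frame) pairs.
D, wp = librosa.sequence.dtw(X=X, Y=Y, metric="cosine")

def vocal_time_for_reading_frame(reading_frame, wp, sr, hop_length=512):
    """Map a frame in the reading (e.g., the start of a word) to a time in the
    vocal stem via the warping path, giving an approximate time code."""
    matches = [int(v) for v, r in wp if r == reading_frame]
    if not matches:
        return None
    return librosa.frames_to_time(min(matches), sr=sr, hop_length=hop_length)
```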
- Filtering: various techniques can be applied to filter the data, such as deduplication of audio and lyric pairs, and manual inspection of datasets where the labelled alignment and that obtained via application of a forced alignment algorithm significantly diverge, in some embodiments.
- the mechanisms described herein can be implemented using a general-purpose computer or special-purpose computer, such as a server.
- Any such general-purpose computer or special-purpose computer can include any suitable hardware.
- such hardware can include a hardware processor 1002, memory and/or storage 1004, an input device controller 1006, an input device 1008, display/audio drivers 1010, display and audio output circuitry 1012, communication interface(s) 1014, an antenna 1016, and a bus 1018.
- Hardware processor 1002 can include any suitable hardware processor, such as a microprocessor, a micro-controller, digital signal processor(s), dedicated logic, and/or any other suitable circuitry for controlling the functioning of a general-purpose computer or a special purpose computer in some embodiments.
- a microprocessor such as a microprocessor, a micro-controller, digital signal processor(s), dedicated logic, and/or any other suitable circuitry for controlling the functioning of a general-purpose computer or a special purpose computer in some embodiments.
- Memory and/or storage 1004 can be any suitable memory and/or storage for storing programs, data, and/or any other suitable information in some embodiments.
- memory and/or storage 1004 can include random access memory, read-only memory, flash memory, hard disk storage, optical media, and/or any other suitable memory.
- Input device controller 1006 can be any suitable circuitry for controlling and receiving input from input device(s) 1008 in some embodiments.
- input device controller 1006 can be circuitry for receiving input from an input device 1008, such as a touch screen, from one or more buttons, from a voice recognition circuit, from a microphone, from a camera, from an optical sensor, from an accelerometer, from a magnetic field sensor, from a proximity sensor, from a touch pressure sensor, from a touch size sensor, from a temperature sensor, from a near field sensor, from an orientation sensor, and/or from any other type of input device.
- Display/audio drivers 1010 can be any suitable circuitry for controlling and driving output to one or more display/audio output circuitries 1012 in some embodiments.
- display/audio drivers 1010 can be circuitry for driving one or more display/audio output circuitries 1012, such as an LCD display, a speaker, an LED, or any other type of output device.
- Communication interface(s) 1014 can be any suitable circuitry for interfacing with one or more communication networks.
- interface(s) 1014 can include network interface card circuitry, wireless communication circuitry, and/or any other suitable type of communication network circuitry.
- Antenna 1016 can be any suitable one or more antennas for wirelessly communicating with a communication network in some embodiments. In some embodiments, antenna 1016 can be omitted when not needed.
- Bus 1018 can be any suitable mechanism for communicating between two or more components 1002, 1004, 1006, 1010, and 1014 in some embodiments.
- any suitable computer readable media can be used for storing instructions for performing the functions and/or processes described herein.
- computer readable media can be transitory or non -transitory.
- non-transitory computer readable media can include media such as non-transitory magnetic media (such as hard disks, floppy disks, and/or any other suitable magnetic media), non-transitory optical media (such as compact discs, digital video discs, Blu-ray discs, and/or any other suitable optical media), non-transitory semiconductor media (such as flash memory, electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and/or any other suitable semiconductor media), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable non-transitory tangible media.
- transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable transitory intangible media.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
Mechanisms for automatically transcribing lyrics of songs include: selecting a source separation model to be used in performing source separation on a song; selecting a transcription model to be used in performing lyric transcription on the song; using a hardware processor and the source separation model to extract audio stems from the song; using the transcription model to generate predicted lyrics for the song based on the audio stems. In some of these mechanisms, selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these mechanisms, selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, the source separation model has a u-net deep convolutional neural network architecture.
Description
SYSTEMS, METHODS, AND MEDIA FOR AUTOMATICALLY TRANSCRIBING LYRICS OF SONGS
Cross-Reference to Related Application
[0001] This application claims the benefit of United States Provisional Patent Application No. 63/428,375, filed November 28, 2022, which is hereby incorporated by reference herein in its entirety.
Background
[0002] Transcribing of lyrics of songs can be a slow manual process that needs to be performed by humans.
[0003] Accordingly, mechanisms, including systems, methods, and media, for automatically transcribing lyrics of songs are desirable.
Summary
[0004] In accordance with some embodiments, mechanisms, including systems, methods, and media, for automatically transcribing lyrics of songs are provided.
[0005] In some embodiments, systems for automatically transcribing lyrics of songs are provided, the systems comprising: a memory; and at least one hardware processor coupled to the memory and configured to at least: select a source separation model to be used in performing source separation on a song; select a transcription model to be used in performing lyric transcription on the song; use the source separation model to extract audio stems from the song; and use the transcription model to generate predicted lyrics for the song based on the audio stems. In some of these embodiments, selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, selection of the
transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, the source separation model has a u-net deep convolutional neural network architecture. In some of these embodiments, the at least one hardware processor is further configured to: select a sliding window of the audio stems; extract audio features of the audio stems from the sliding window; and determine the predicted lyrics using an acoustic model. In some of these embodiments, the audio features are Mel-frequency cepstral coefficients (MFCCs). In some of these embodiments, the at least one hardware processor is further configured to correct the predicted lyrics using a language model.
[0006] In some embodiments, methods for automatically transcribing lyrics of songs are provided, the methods comprising: selecting a source separation model to be used in performing source separation on a song; selecting a transcription model to be used in performing lyric transcription on the song; using a hardware processor and the source separation model to extract audio stems from the song; using the transcription model to generate predicted lyrics for the song based on the audio stems. In some of these embodiments, selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, the source separation model has a u-net deep convolutional neural network architecture. In some of these embodiments, the method further comprises: selecting a sliding window of the audio stems; extracting audio features of the audio stems from the sliding window; and determining the predicted lyrics using an acoustic model. In some of these embodiments, the audio features are Mel-frequency cepstral coefficients (MFCCs).
[0007] In some of these embodiments, the method further comprises correcting the predicted lyrics using a language model.
[0008] In some embodiments, non-transitory computer-readable media containing computer executable instructions that, when executed by a processor, cause the processor to perform a method for automatically transcribing lyrics of songs are provided, the method comprising: selecting a source separation model to be used in performing source separation on a song; selecting a transcription model to be used in performing lyric transcription on the song; using the source separation model to extract audio stems from the song; and using the transcription model to generate predicted lyrics for the song based on the audio stems. In some of these embodiments, selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song. In some of these embodiments, the source separation model has a u-net deep convolutional neural network architecture. In some of these embodiments, the method further comprises: selecting a sliding window of the audio stems; extracting audio features of the audio stems from the sliding window; and determining the predicted lyrics using an acoustic model. In some of these embodiments, the audio features are Mel-frequency cepstral coefficients (MFCCs). In some of these embodiments, the method further comprises correcting the predicted lyrics using a language model.
Brief Description of the Drawings
[0009] FIG. 1 is a flow diagram of an example of a process for transcribing lyrics in accordance with some embodiments.
[0010] FIG. 2 is a flow diagram of an example of a process for generating lyrics from audio stems in accordance with some embodiments.
[0011] FIG. 3 is a block diagram of an acoustic model in accordance with some embodiments.
[0012] FIG. 4 is an illustration of an example tool that can be used to review a lyric transcription model output, and revise incorrect and missing lyrics and times in accordance with some embodiments.
[0013] FIG. 5 is a flow diagram of an example of a process for training a source separation model in accordance with some embodiments.
[0014] FIG. 6 is an illustration of an example of a transcription model training pipeline in accordance with some embodiments.
[0015] FIG. 7 is a flow diagram of an example of a process for creating a pre-trained acoustic model in accordance with some embodiments.
[0016] FIG. 8 is an illustration of an example of a data augmentation pipeline in accordance with some embodiments.
[0017] FIG. 9 is an illustration of an example of a pipeline that can be used to train and use a timbre transfer model to change the vocal qualities of a recording while preserving the pitch and other spectral characteristics of the waveform in accordance with some embodiments.
[0018] FIG. 10 is a block diagram of example hardware in accordance with some embodiments.
Detailed Description
[0019] In accordance with some embodiments, mechanisms, including systems, methods, and media, for automatically transcribing lyrics of songs are provided.
[0020] Turning to FIG. 1, an example 100 of a process for transcribing lyrics in accordance with some embodiments is illustrated.
[0021] As shown, process 100 begins at 102 by generating digitized audio of a song to be transcribed if needed. For example, if a song is recorded in an analog format, the song can be digitized by sampling the song at any suitable sampling rate (e.g., 44.1 kHz) over any suitable number of bits (e.g., 16 bits).
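For already-digitized files, a minimal sketch of standardizing a song to this sample rate and bit depth is shown below; librosa and soundfile are used purely for illustration, and the file names are placeholders (the patent does not name any particular tools).

```python
import librosa
import soundfile as sf

# Load the song and resample it to 44.1 kHz (keep stereo if present).
y, sr = librosa.load("song_original.wav", sr=44100, mono=False)

# Write the result as 16-bit PCM; soundfile expects (frames, channels).
data = y.T if y.ndim > 1 else y
sf.write("song_44k_16bit.wav", data, 44100, subtype="PCM_16")
```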
[0022] Next, at 104, process 100 can select a source separation model and a transcription model to be used in performing source separation on the song and lyric transcription on the song. Any suitable source separation model and any suitable transcription model can be selected in some embodiments. For example, a source separation model and a transcription model can be selected from any suitable available separation models and any suitable available transcription models based on any suitable one or more characteristics of the song, such as genre, language (and dialect in some embodiments), year recorded, instruments used, acoustic characteristics, vocal characteristics, and/or any other suitable trait of the song, in some embodiments. Such characteristics of the song can be determined from metadata associated with the song, which can be received from any suitable source, in some embodiments.
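A minimal sketch of such metadata-driven selection is shown below; the registry contents, fallback names, and metadata fields are illustrative assumptions rather than part of the disclosure.

```python
# Hypothetical registries mapping song traits to trained model identifiers.
SOURCE_SEPARATION_MODELS = {
    ("rock", "en"): "separator_rock_en",
    ("hip-hop", "en"): "separator_hiphop_en",
}
TRANSCRIPTION_MODELS = {
    ("rock", "en"): "transcriber_rock_en",
    ("hip-hop", "en"): "transcriber_hiphop_en",
}

def select_models(metadata: dict) -> tuple[str, str]:
    """Pick a source separation model and a transcription model from song
    metadata (genre, language, etc.), falling back to generic models."""
    key = (metadata.get("genre", "").lower(), metadata.get("language", "en"))
    separation_model = SOURCE_SEPARATION_MODELS.get(key, "separator_generic")
    transcription_model = TRANSCRIPTION_MODELS.get(key, "transcriber_generic")
    return separation_model, transcription_model

# Example: select_models({"genre": "Rock", "language": "en", "year": 1994})
```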
[0023] Then, at 106, process 100 can use the selected source separation model to extract audio stems from the digital audio for the song.
[0024] An audio stem is a component of a song corresponding to a particular source of sound, or a group of sound sources relating to a single musical instrument for the song. For example, for a song recorded by a band with one singer, one drummer, and one guitarist, the stems that may be generated for that song may include one voice stem, one drum stem, and one guitar stem. The voice stem is a vocal stem, and the drum stem and guitar stem are instrument stems.
[0025] Any suitable source separation model can be used in some embodiments, and the model can produce any suitable stems. For example, in some embodiments, the source separation model can be implemented using a u-net deep convolutional neural network architecture, as implemented by Spleeter or Demucs.
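For instance, Spleeter's published Python API can produce stems along the following lines; the 4-stem configuration and paths are illustrative only, and Demucs offers a comparable interface.

```python
from spleeter.separator import Separator

# Load Spleeter's pretrained 4-stem model (vocals, drums, bass, other).
separator = Separator("spleeter:4stems")

# Writes vocals.wav, drums.wav, bass.wav, and other.wav under output/<song name>/.
separator.separate_to_file("song_44k_16bit.wav", "output/")
```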
[0026] At 108, process 100 can then use the selected transcription model to generate predicted lyrics for the song based on the stems produced by the source separation model. Any suitable transcription model can be used in some embodiments. For example, a transcription model as described below can be used in some embodiments. [0027] Turning to FIG. 2, an example 200 of a process for generating lyrics from audio stems in accordance with some embodiments is illustrated.
[0028] As shown, at 202 process 200 can receive audio stems for a song to be transcribed. These audio stems can be received in any suitable format and in any suitable manner from any suitable source, in some embodiments. For example, in some embodiments, these stems can be received from a database and/or storage. Any suitable stems of the song to be transcribed can be received, in some embodiments. For example, in some embodiments all vocal and instrument stems for the song can be received. As another example, in some embodiments, only one or more certain vocal stems and only one or more certain instrument stems can be received. As still another example, in some embodiments, one, more, or all vocal stems and one, more, or all instrument stems for only a portion of the song can be received.
[0029] Next, at 204, a window of size W for each of any one or more of the received stems can be selected by the process. Any suitable size W of window at any suitable position of the stems can be selected for each stem in some embodiments. For example, a window of 30s can be selected for each stem at the beginning of the portion of the song corresponding to the beginning of the first vocal stem of the song, in some embodiments.
[0030] Then, at 206, process 200 can generate audio features of the received stems over the selected window. Any suitable audio features can be generated in some embodiments. For example, in some embodiments Mel-frequency cepstral coefficients (MFCCs) can be extracted as the audio features of the stems. As another example, in some embodiments, Mel frequency spectrum values can be extracted as the audio features of the stems. As still another example, in some embodiments, time domain amplitudes can be extracted as the audio features of the stems.
[0031] In some embodiments, the audio features that are extracted can be selected to match the input requirements of a first layer of neural network(s) used for the acoustic model. For example, in some embodiments, the first layer of the neural network(s) may be configured to receive a matrix of 39 MFCCs over 2000 overlapping (by 10ms) 25ms subwindows of a 30s window — i.e., a 39 x 2000 matrix. Any suitable number of sub-windows, having any suitable size and any suitable overlap (including no overlap) can be used in some embodiments. Any suitable number of features can be determined for each sub-window in some embodiments.
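A sketch of producing such a matrix with librosa follows; the 15 ms hop is derived from the 25 ms sub-windows overlapping by 10 ms described above, while the analysis sample rate and file path are assumptions.

```python
import librosa
import numpy as np

sr = 16000                        # assumed analysis sample rate
win = int(0.025 * sr)             # 25 ms sub-window
hop = int(0.015 * sr)             # 10 ms overlap => 15 ms hop; 30 s / 15 ms = 2000 frames

vocals, _ = librosa.load("output/song_44k_16bit/vocals.wav", sr=sr)
window = vocals[: 30 * sr]        # one 30 s window of the vocal stem

mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=39,
                            n_fft=win, win_length=win, hop_length=hop)

# Trim or zero-pad to the fixed 39 x 2000 input shape expected by the model.
mfcc = mfcc[:, :2000]
if mfcc.shape[1] < 2000:
    mfcc = np.pad(mfcc, ((0, 0), (0, 2000 - mfcc.shape[1])))
print(mfcc.shape)                 # (39, 2000)
```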
[0032] At 208, process 200 can determine lyrics from the audio features using an acoustic model. Any suitable acoustic model can be used in some embodiments. For example, in some embodiments, an acoustic model as described below in connection with FIG. 3 can be used. The acoustic model can operate in any suitable manner, in some embodiments. For example, in some embodiments, the acoustic model can begin by predicting a first token (which is a unit of language transcription, and can be a word, a syllable, a letter, punctuation, an indicator related to transcription (such as an end-of-paragraph indicator), and/or any other suitable unit of language) from the selected window or accessing tokens determined in the previous window (if the selected window is not the first window). Continuing the example, the acoustic model can then use these tokens as input to the model to generate the
most likely next token, and so on recursively until the end token is generated indicating that the phrase is complete, in some embodiments. In some embodiments, beam search can be used to optimize the selection of successive tokens generated.
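A greedy version of this recursive decoding could be sketched as follows; the acoustic model's call signature and the special token ids are assumptions, and a beam search would track several candidate sequences instead of the single best token at each step.

```python
import torch

def decode_window(acoustic_model, vocal_feats, instr_feats,
                  start_tokens, end_token_id, max_tokens=128):
    """Recursively predict lyric tokens for one window until the end token.
    start_tokens holds tokens carried over from the previous window (or a
    start-of-sequence id for the first window)."""
    tokens = list(start_tokens)
    for _ in range(max_tokens):
        # Hypothetical interface: returns logits over the vocabulary for the
        # next token given the audio features and the tokens so far.
        logits = acoustic_model(vocal_feats, instr_feats, torch.tensor([tokens]))
        next_id = int(logits.argmax(dim=-1))      # greedy choice
        tokens.append(next_id)
        if next_id == end_token_id:               # phrase complete
            break
    return tokens
```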
[0033] Next at 210, process 200 can correct the lyrics generated by the acoustic model using a language model. The language model can be any suitable model for correcting lyrics. For example, in some embodiments, the language model can be a sequence-to-sequence language model that is trained to output a corrected sequence of linguistic tokens.
[0034] As a more particular example, in some embodiments, the language model can be a sequence-to-sequence neural network that takes as input the sequence of linguistic tokens in a phrase and outputs the corrected sequence that, based on its training, it has determined should replace it. In some embodiments, this can involve replacing, removing, and/or inserting tokens. For example, in some embodiments, a BERT model (Bidirectional Encoder Representations from Transformers) can be pre-trained on a large corpus of linguistic data, and be tuned to correct phrases in lyrics for songs.
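With the Hugging Face transformers library, such a correction pass might be sketched as below; the checkpoint name is a placeholder for a model fine-tuned on lyric-correction pairs, not a real published model.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "your-org/lyric-correction-model"   # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def correct_phrase(phrase: str) -> str:
    """Map a possibly garbled lyric phrase to a corrected sequence of tokens."""
    inputs = tokenizer(phrase, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: correct_phrase("i cant get no satis fact ion")
```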
[0035] Then at 212, process 200 can determine if the selected window is the last window. This determination can be made in any suitable manner in some embodiments. For example, in some embodiments, the selected window can be determined as being the last window if the window is at the end of the last vocal stem of the song.
[0036] If the current window is determined at 212 to not be the last window, then at 214, the process can select the next window of the stems received at 202. The next window of the stems can be selected in any suitable manner in some embodiments. For example, the next window can be selected by moving the window forward 30s (or any other suitable period of time).
[0037] Otherwise, if the current window is determined at 212 to be the last window, then at 216, the process can end.
[0038] Turning to FIG. 3, an example 300 of an acoustic model that can be used at 208 of process 200 of FIG. 2 in accordance with some embodiments is illustrated. As shown, acoustic model 300 includes an instrument encoder 302, a vocal encoder 304, and a decoder 306, and is implemented using a sequence-to-sequence architecture in which output 303 of encoder 302 is provided as an input to encoder 304, and output 305 of encoder 304 is provided as an input to decoder 306.
[0039] In some embodiments, rather than the output of encoder 302 providing an input to encoder 304, the encoders can run in parallel, their outputs combined in any suitable manner (for example, in some embodiments, the outputs of the two encoders can be concatenated) and the combined outputs provided as inputs to decoder 306. Additionally or alternatively, in some embodiments, a separate decoder (with different tasks) can be provided for each encoder and then the outputs of the decoders can be combined to provide the overall model output.
[0040] Instrument encoder 302 can be implemented using a neural network in some embodiments. Any suitable neural network can be used for the instrument encoder in some embodiments. For example, in some embodiments, the instrument encoder can be implemented using a transformer encoder architecture. More particularly, for example, in some embodiments, the instrument encoder can be a transformer encoder block with self-attention. The input to this transformer encoder block can be produced by a series of convolutional layers that reduce the dimensionality of the input space to one closer to what decoder 306 will predict. In some embodiments, such a transformer encoder can be preceded by a previously computed positional encoding to determine the positional representation of each input.
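One way such an encoder could be assembled in PyTorch is sketched below; the layer sizes, number of blocks, and sinusoidal positional encoding are assumptions chosen only to make the structure concrete.

```python
import math
import torch
import torch.nn as nn

class InstrumentEncoder(nn.Module):
    """Convolutional front-end to reduce dimensionality, a positional encoding,
    and a transformer encoder block with self-attention."""
    def __init__(self, n_features=39, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(n_features, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.d_model = d_model

    def positional_encoding(self, length, device):
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(length, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, feats):                        # feats: (batch, 39, frames)
        x = self.frontend(feats).transpose(1, 2)     # (batch, frames', d_model)
        x = x + self.positional_encoding(x.size(1), x.device)
        return self.encoder(x)                       # hidden representation
```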
[0041] In some embodiments, instrument encoder 302 can have cross-attention with vocal encoder 304, which can be, as described below, another transformer block with self-attention. Cross-attention can allow the decoder to use information from the input sequence of the encoder, in some embodiments. The cross-attention can help determine relationships between the instrumental audio and the vocal audio that help predict lyrics in some embodiments.
[0042] In some embodiments, the instrument encoder receives audio features 306 (which can be the same as the audio features described above in connection with 206 of FIG. 2 in some embodiments) corresponding to the instrument stems for the selected window and generates a hidden representation that can be used by the decoder to predict the most likely next text token in the song. The audio features can be received in any suitable manner in some embodiments. For example, in some embodiments, these features can be received from 206 of FIG. 2.
[0043] Vocal encoder 304 can be implemented using a neural network, in some embodiments. Any suitable neural network can be used for the vocal encoder in some embodiments. For example, in some embodiments, the vocal encoder can be implemented using a transformer encoder architecture. More particularly, for example, in some embodiments, the vocal encoder can be a transformer encoder block with self-attention. The input to this transformer encoder block can be produced by a series of convolutional layers that reduce the dimensionality of the input space to one closer to what decoder 306 will predict. In some embodiments, such a transformer encoder block can be preceded by a previously computed positional encoding to determine the positional representation for each embedding in the sequence.
[0044] As noted above, in some embodiments, vocal encoder 304 can have crossattention with instrument encoder 302.
[0045] In some embodiments, the vocal encoder receives audio features 308 (which can be the same as the audio features described above in connection with 206 of FIG. 2 in some
embodiments) corresponding to the vocal stems, as well as output 303 of instrument encoder 302, for the selected window and generates a hidden representation that can be used to predict the most likely next text token in the song. The audio features can be received in any suitable manner in some embodiments. For example, in some embodiments, these features can be received from 206 of FIG. 2.
[0046] Decoder 306 can be implemented using a neural network, in some embodiments.
Any suitable neural network can be used for the decoder in some embodiments. For example, in some embodiments, the decoder can be implemented using a transformer decoder architecture similar to the encoders. In this case, the layer feeding into the positional encoding and then into the transformer blocks would be an embedding layer for language instead of audio.
[0047] In some embodiments, decoder 306 can include a softmax layer to output probabilities of predicted tokens being the correct token.
[0048] The decoder receives as input output 305 of the vocal audio encoder and the previously predicted tokens 310 of the transcription in some embodiments. In response to these inputs, the decoder predicts the next token 312 in the sequence recursively in some embodiments. In some embodiments, beam search can be used to optimize the sequence of most probable tokens.
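As a minimal sketch of the recursive prediction just described, the following greedy-decoding loop (beam search is omitted for brevity) shows how a decoder of this kind could emit tokens one at a time from the vocal-encoder output. The `decoder` callable, its signature, and the special token ids are assumptions for illustration.

```python
import torch

def greedy_decode(decoder, encoder_output, bos_id=1, eos_id=2, max_len=200):
    tokens = [bos_id]
    for _ in range(max_len):
        prev = torch.tensor([tokens])                # (1, t) previously predicted tokens
        logits = decoder(prev, encoder_output)       # (1, t, vocab) next-token scores
        next_id = int(logits[0, -1].argmax())        # most probable next token
        if next_id == eos_id:                        # stop at end-of-sequence
            break
        tokens.append(next_id)
    return tokens[1:]                                # predicted transcription token ids
```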
[0049] In some embodiments, if previously trained to do so, decoder 306 can perform one or more additional tasks 314, such as predicting genre, language, speech presence, and/or any other suitable characteristic(s) of the songs.
[0050] In some embodiments, decoder 306 can have cross-attention with instrument encoder 302 and/or vocal encoder 304.
[0051] Turning to FIG. 4, an example illustration 400 of a tool that can be used to review the lyric transcription model output, and revise incorrect and missing lyrics and times, in accordance with some embodiments is shown.
[0052] As illustrated, tool 400 receives as input corresponding sets of: 1) lyrics, start time, and end time 402; and 2) audio of a song 404, in some embodiments. These lyrics can be empty if there is a sample with no predicted lyrics (which could be correct or incorrect depending on the song and portion thereof), in some embodiments. The tool then allows a user to listen to the audio and view the corresponding lyrics 406 that are being sung as determined by the transcription model, in some embodiments.
[0053] In some embodiments, as words are being sung in the song, the corresponding portion of the transcribed lyrics can be highlighted, as represented by box 408. Highlighting can be performed in any suitable manner, such as by changing the foreground color, the background color, and/or the font of the lyrics, by adding bolding, underlining, and/or italicization, and/or performing any other suitable modification to the display of the lyrics. Any suitable amount of the lyrics can be highlighted in some embodiments. For example, one or more words or portions of words can be highlighted in some embodiments.
[0054] In some embodiments, a user can edit the highlighted portion of the lyrics using edit field 410 and edit button 412.
[0055] In some embodiments, a space 410 can be left in the interface where there are allegedly no lyrics to allow the user to insert lyrics using an insert field 414 and an insert button 416.
[0056] In some embodiments, the user can pause, resume play, rewind, and fast-forward the song and lyrics display using buttons 418, 420, 422, and 424, respectively. In some embodiments, play, rewind, and fast-forward speeds can be controlled by the user. For example, in some embodiments, the user can control playback, rewinding, and/or fast-forwarding to occur at any suitable fraction (e.g., small millisecond increments) and/or multiple (which can be integer or real value multiples) of normal playback rate.
[0057] In some embodiments, when the manual review process for each clip terminates, the corrected lyrics are saved along with the corresponding audio for re-training any suitable one or more of the models described herein. This re-training can be performed as described further below in some embodiments.
[0058] Turning to FIGS. 5-9, examples of mechanisms for training the models described above in accordance with some embodiments are illustrated.
[0059] FIG. 5 shows an example 500 of a process for training a source separation model that can be used in some embodiments.
[0060] As illustrated, the process first receives fully mixed and mastered audio for a song and tracks of audio used to create mixed and mastered audio for the song at 502. These components can be received from any suitable source in any suitable manner in some embodiments. For example, in some embodiments, the components can be received from a producer of the music.
[0061] Next, at 504, process 500 trains the source separation model to separate stems from fully mixed and mastered audio to match the tracks of audio received at 502. This training can be performed in any suitable manner in some embodiments.
[0062] Process 500 can then loop back to 502 for the next song to be used to train the source separation model.
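The following is a minimal sketch of the training step at 504: the source separation model predicts stems from the fully mixed and mastered audio and its weights are updated so the predictions match the producer-supplied tracks received at 502. The `model` and `dataset` objects, the loss choice, and the hyperparameters are placeholders and assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

def train_source_separation(model, dataset, epochs=1, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                               # reconstruction loss on stems
    for _ in range(epochs):
        for mixed_audio, reference_stems in dataset:    # 502: mix plus its source tracks
            predicted_stems = model(mixed_audio)        # 504: separate stems from the mix
            loss = loss_fn(predicted_stems, reference_stems)
            opt.zero_grad()
            loss.backward()                             # adjust weights to match the tracks
            opt.step()
```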
[0063] As noted above, different source separation models can be generated for songs having different characteristics, such as instruments, genre, language, year recorded, etc.
[0064] Turning to FIG. 6, an example 600 of a transcription model training pipeline in accordance with some embodiments is illustrated.
[0065] As shown in FIG. 6, at 601 and 603, a pre-trained acoustic model 602 and a pre-trained language model 604 can be loaded as acoustic model 606 and language model 608, in some embodiments. Any suitable pre-trained acoustic model and any suitable pre-trained language model can be used as models 602 and 604 in some embodiments. For example, in some embodiments, a pre-trained acoustic model and a pre-trained language model can be created using a self-supervision pre-trained acoustic model and a self-supervision pre-trained language model. As another example, in some embodiments, a pre-trained acoustic model and a pre-trained language model can be models previously used as an acoustic model and as a language model, respectively, in a transcription model and which are to be re-trained. As yet another example, in some embodiments, pre-trained language model 604 can be a BERT model that is fine-tuned for grammar correction.
[0066] Next, at 610, 612, 614, and 616, lyrics 618 and corresponding vocal stems 620 and instrumental stems 622 are retrieved. Any suitable lyrics and corresponding vocal stems and instrumental stems can be retrieved in some embodiments. For example, in some embodiments, if the transcription model being trained is for the English language, lyrics and corresponding stems for English language songs can be retrieved. Any suitable stems can be used in some embodiments.
[0067] Then, at 626, audio features of stems 620 and 622 are generated for a window of stems 620 and 622. Any suitable audio features can be generated in some embodiments. For example, in some embodiments, the same features as described above in connection with 206 of FIG. 2 can be extracted.
[0068] In some embodiments, the window can have any suitable size. For example, in some embodiments, the window can be a 30s window.
[0069] In some embodiments, the audio features that are extracted can be selected to match the input requirements of a first layer of neural network(s) used for the acoustic model.
For example, in some embodiments, the first layer of the neural network(s) may be configured to receive a matrix of 39 MFCCs over 2000 overlapping (by 10ms) 25ms sub-windows of a 30s window (i.e., a 39 x 2000 matrix). Any suitable number of sub-windows, having any suitable size and any suitable overlap (including no overlap), can be used in some embodiments. Any suitable number of features can be determined for each sub-window in some embodiments.
[0070] In some embodiments, the window can be at any suitable position in the song corresponding to lyrics 618, corresponding vocal stems 620, and instrumental stems 622. For example, in some embodiments, the window can be positioned to include a portion of the song having the first lyrics in the song. As another example, in some embodiments, the window can be positioned at a portion of the song in which no lyrics are present so that the acoustic model can be trained to recognize portions of songs with no lyrics.
[0071] In some embodiments, padding can be performed at 626 to fill the input matrix when insufficient audio is present to match the input of the neural network.
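The following is a minimal sketch of building the 39 x 2000 MFCC input matrix described in paragraph [0069], including the padding of paragraph [0071]. It assumes a 16 kHz mono stem, 25 ms sub-windows, and a 15 ms hop (10 ms of overlap); the sample rate and exact parameter values are illustrative assumptions.

```python
import librosa
import numpy as np

def window_features(stem_path, offset_s=0.0, sr=16000, n_frames=2000):
    # Load a 30 s window of one stem starting at offset_s
    y, _ = librosa.load(stem_path, sr=sr, mono=True,
                        offset=offset_s, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=39,
                                n_fft=int(0.025 * sr),       # 25 ms sub-window
                                hop_length=int(0.015 * sr))  # 15 ms hop (10 ms overlap)
    # Pad (or truncate) the time axis so the matrix is exactly 39 x n_frames
    if mfcc.shape[1] < n_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, n_frames - mfcc.shape[1])))
    return mfcc[:, :n_frames]
```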
[0072] Next, at 628, the acoustic model predicts the next lyric token 630 based on the audio features generated at 626 and previous known lyrics 618. This prediction can be performed in any suitable manner, such as that described above in connection with 208 of FIG. 2 and FIG. 3, in some embodiments. Predicted token 630 is then compared to the current known lyrics 631 and the weights of the models in the acoustic model are adjusted accordingly.
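As a minimal sketch of this teacher-forced step, the following compares the predicted token distribution against the current known token and updates the model weights. The `acoustic_model` callable (for example, a model such as the one of FIG. 3) and its input/output shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(acoustic_model, optimizer,
                  instrument_feats, vocal_feats, prev_tokens, current_token):
    # current_token: (1,) long tensor holding the known lyric token at 631
    logits = acoustic_model(instrument_feats, vocal_feats, prev_tokens)  # (1, vocab)
    loss = F.cross_entropy(logits, current_token)   # compare prediction 630 to known lyrics 631
    optimizer.zero_grad()
    loss.backward()                                 # adjust the acoustic-model weights
    optimizer.step()
    return loss.item()
```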
[0073] If the current window is determined to not be the last window in the song, then the next window of the stems can be selected and the processing described above repeated for the new window by looping back to 626. The next window of the stems can be selected in any suitable manner in some embodiments. For example, the next window can be selected by moving the window forward 30s (or any other suitable period of time).
[0074] In some embodiments, language model 608 can receive altered lyrics and corresponding correct lyrics 632 at 634. Altered lyrics and corresponding correct lyrics 632 can be received from any suitable source in some embodiments. The altered lyrics can be generated by modifying correct lyrics in any suitable manner. For example, in some embodiments, altered lyrics can be generated by omitting one or more words or one or more portions of a word in the correct lyrics. As another example, in some embodiments, altered lyrics can be generated by transposing words in the lyrics, changing the spelling of words in the lyrics, etc.
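The following is a minimal sketch of generating an altered/correct lyric pair of the kind described above, by randomly omitting words, transposing adjacent words, and misspelling words. The corruption probabilities are illustrative assumptions.

```python
import random

def alter_lyrics(correct_line, p_drop=0.1, p_swap=0.1, p_typo=0.1):
    words = correct_line.split()
    altered = []
    for w in words:
        if random.random() < p_drop:
            continue                                    # omit a word
        if random.random() < p_typo and len(w) > 3:
            i = random.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]     # swap two letters (misspelling)
        altered.append(w)
    if len(altered) > 1 and random.random() < p_swap:
        i = random.randrange(len(altered) - 1)
        altered[i], altered[i + 1] = altered[i + 1], altered[i]   # transpose two words
    return " ".join(altered), correct_line              # (altered, correct) training pair
```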
[0075] At 638, language model 608 can predict corrected lyrics 636 based on the altered lyrics received at 634. These corrected lyrics can then be compared to the correct lyrics received at 634 and the language model fine-tuned based on the errors found.
[0076] In some embodiments, the processes described above in connection with FIG. 6 can be repeated for any suitable number of songs to train the transcription model.
[0077] Turning to FIG. 7, an example 700 of a process for creating a pre-trained acoustic model that can be used as pre-trained model 602 of FIG. 6 in accordance with some embodiments is illustrated.
[0078] As shown, audio without lyric data 702 can be used to perform self-supervised training 704 to form a pre-trained acoustic model 706. Pre-trained acoustic model 706 can be any suitable model in some embodiments. For example, in some embodiments, the pre-trained acoustic model can be a model having a neural network auto-encoder architecture. Audio without lyric data 702 can include audio features (e.g., as described above in connection with 206 of FIG. 2, such as MFCCs) of both one or more vocal stems and one or more instrument stems for each of any suitable number of songs.
[0079] In some embodiments, pre-trained acoustic model 706 can be initially trained using audio features (e.g., MFCCs) of audio without lyric data 702. Model 706 can then be used as an initial model for the acoustic model, which can then be fine-tuned using audio with lyric data 710. In some embodiments, audio with lyric data 710 can include audio features (e.g., as described above in connection with 206 of FIG. 2, such as MFCCs) of both one or more vocal stems and one or more instrument stems for each of any suitable number of songs, and known-correct, time-coded lyrics in text form for each of the songs.
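The following is a minimal sketch of self-supervised pre-training of the auto-encoder kind mentioned above: the model is trained to reconstruct audio feature frames from songs with no lyric data, and its encoder can later serve as the starting point for the acoustic model. Layer sizes, the reconstruction objective, and the data loader are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameAutoEncoder(nn.Module):
    def __init__(self, n_features=39, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, frames):                  # frames: (batch, n_features)
        return self.decoder(self.encoder(frames))

def pretrain(model, unlabeled_frames, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in unlabeled_frames:          # audio features without lyric data (702)
            recon = model(batch)
            loss = nn.functional.mse_loss(recon, batch)   # reconstruction objective
            opt.zero_grad()
            loss.backward()
            opt.step()
```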
[0080] In some embodiments, representations from the self-supervised acoustic and instrument models can also be fed directly as additional hidden inputs into the corresponding model.
[0081] In some embodiments, the language model is trained on lyrics. Such a model can be trained based on existing natural language processing tasks such as lyric filling, sentiment analysis, and topic modeling in some embodiments.
[0082] In some embodiments, domain differences in different types of music can be considered when training language models. For example, in some embodiments, different language models can be trained on clusters of songs using a similarity model, such that each model captures certain acoustic and linguistic nuances of a type of song. As a result, in some embodiments, the language model not only incorporates the language structure in composing lyrics but also considers different composing styles for different types of music.
[0083] Alternatively, in some embodiments, different languages and linguistic properties can be accounted for in a single language model by incorporating all of the data and adding language and music type prediction tasks to allow local conditioning for the setting of each prediction task.
[0084] In some embodiments, a good training data set can be regarded as a data set which can well represent the entire population of the music, which includes music with different genre types, languages, ages, and any other acoustic quality that would make it sound different. In some embodiments, a good training dataset can cover a diverse map of music inputs, including stem-separated audio files for the source separation and lyric recognition models.
[0085] In some embodiments, to improve a training data set, data augmentation can be used. Data augmentation can apply one or more transformations to the various audio recordings (stems and full source) to provide a wider variety of musical inputs capturing as much of the space of timbral and acoustic qualities as possible, in some embodiments. In this way, data augmentation can artificially increase the amount of data by generating new data points from existing data in some embodiments.
[0086] In accordance with some embodiments, new data augmentation techniques specific to the lyric recognition task are provided. The new data augmentation techniques transform the audio input along degrees of freedom, such as but not limited to timbre, pitch, speed, loudness and others, to which it is desirable for the models described herein to be invariant.
[0087] For each of the types of transformations of the audio, pre-trained acoustic models or features extracted from audio can be used to understand which transformations should be required to balance the training data set in some embodiments. For example, in some embodiments, less prevalent pitches and tempos could be generated by transforming the available audio.
[0088] FIG. 8 illustrates an example of a data augmentation pipeline in accordance with some embodiments. In some embodiments, the process involves sampling from the training dataset, applying transformations to the audio, and then adding these audio transformations (alongside lyrics and other relevant metadata) to the training data.
[0089] In accordance with some embodiments, examples of data augmentation techniques that can be applied include:
[0090] Pitch change. The vocal component of the audio can be shifted up or down in pitch using traditional digital signal processing techniques, in some embodiments.
[0091] Time stretch. Speed up or slow down the audio file, in some embodiments. This also requires scaling the end time in the lyric data to ensure the pairs of lyrics and audio remain aligned, in some embodiments.
[0092] Noise addition. Other non-spoken, lower decibel audio can be mixed with the audio file to improve both the transcription and source separation models, in some embodiments.
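The following is a minimal combined sketch of the pitch change, time stretch, and noise addition techniques above, using librosa for the signal processing. The shift amount, stretch rate, and signal-to-noise ratio are illustrative assumptions, not values from this disclosure.

```python
import numpy as np
import librosa

def augment(vocal, sr, noise=None, n_steps=2, rate=1.1, snr_db=20.0):
    # Pitch change: shift the vocal up by n_steps semitones
    shifted = librosa.effects.pitch_shift(vocal, sr=sr, n_steps=n_steps)
    # Time stretch: speed up by `rate`; lyric end times must also be scaled by 1/rate
    stretched = librosa.effects.time_stretch(vocal, rate=rate)
    # Noise addition: mix in lower-decibel, non-spoken audio at the target SNR
    noisy = vocal
    if noise is not None:
        noise = noise[:len(vocal)]
        gain = np.sqrt(np.mean(vocal**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
        noisy = vocal + gain * noise
    return shifted, stretched, noisy
```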
[0093] Timbre transfer. This can be achieved using differentiable digital signal processing (DDSP), for example, in some embodiments. FIG. 9 illustrates an example of a pipeline that can be used to train and use a timbre transfer model to change the vocal qualities of a recording while preserving the pitch and other spectral characteristics of the waveform, in some embodiments.
[0094] Audio-text alignment. In some embodiments, to augment and improve the training dataset, automatic lyric alignment can be applied to lyrics that do not have time codes by providing the lyrics and corresponding audio file as input to a forced alignment model such as but not limited to dynamic time warping. In some embodiments, such algorithms can also be applied to validate and improve the accuracy of the lyric text that is time coded.
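As a minimal sketch of the dynamic time warping option mentioned above, the following aligns two feature sequences with librosa's DTW implementation; in this task, one sequence would be derived from the song audio and the other from a rendering of the lyric text (for example, a synthesized or phoneme-level representation), which is an assumption not specified here.

```python
import librosa

def dtw_alignment(audio_features, reference_features):
    # audio_features, reference_features: (n_features, frames) arrays
    cost, warping_path = librosa.sequence.dtw(X=audio_features, Y=reference_features,
                                              metric='cosine')
    # warping_path is returned end-to-start; reverse it to read start-to-end
    return warping_path[::-1]   # (audio_frame, reference_frame) index pairs
```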
[0095] Filtering. Various techniques can be applied to filter the data such as deduplication of audio and lyric pairs, and manual inspection of datasets where the labelled alignment and that obtained via application of a forced alignment algorithm significantly diverge, in some embodiments.
[0096] References - the following references are hereby incorporated by reference herein in their entireties:
[1] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173-182. PMLR, 2016.
[2] Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando De Freitas. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
[3] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960-4964. IEEE, 2016.
[4] Gerardo Roa Dabike and Jon Barker. The Sheffield University system for the MIREX 2020: Lyrics transcription task.
[5] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764-1772. PMLR, 2014.
[6] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. Spleeter: A fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020.
[7] Chanwoo Kim and Richard M Stern. Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction. In Tenth Annual Conference of the International Speech Communication Association, 2009.
[8] Dietrich Klakow and Jochen Peters. Testing the correlation of word error rate and perplexity. Speech Communication, 38(1-2):19-28, 2002.
[9] Gabriel Meseguer-Brocal, Alice Cohen-Hadria, and Geoffrey Peeters. DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm. arXiv preprint arXiv:1906.10606, 2019.
[10] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
[11] Michael Sharp. Method and system for automatically generating lyrics of a song, December 20, 2018. US Patent App. 15/981,387.
[12] Brendan Shillingford, Yannis Assael, Matthew W Hoffman, Thomas Paine, Cian Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, et al. Large-scale visual speech recognition. arXiv preprint arXiv:1807.05162, 2018.
[13] Daniel Stoller, Simon Durand, and Sebastian Ewert. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 181-185. IEEE, 2019.
[0097] In some embodiments, the mechanisms described herein can be implemented using a general-purpose computer or special-purpose computer, such as a server. Any such general-purpose computer or special-purpose computer can include any suitable hardware. For example, as illustrated in example hardware 1000 of FIG. 10, such hardware can include a hardware processor 1002, memory and/or storage 1004, an input device controller 1006, an input device 1008, display/audio drivers 1010, display and audio output circuitry 1012, communication interface(s) 1014, an antenna 1016, and a bus 1018.
[0098] Hardware processor 1002 can include any suitable hardware processor, such as a microprocessor, a micro-controller, digital signal processor(s), dedicated logic, and/or any
other suitable circuitry for controlling the functioning of a general-purpose computer or a special purpose computer in some embodiments.
[0099] Memory and/or storage 1004 can be any suitable memory and/or storage for storing programs, data, and/or any other suitable information in some embodiments. For example, memory and/or storage 1004 can include random access memory, read-only memory, flash memory, hard disk storage, optical media, and/or any other suitable memory.
[0100] Input device controller 1006 can be any suitable circuitry for controlling and receiving input from input device(s) 1008 in some embodiments. For example, input device controller 1006 can be circuitry for receiving input from an input device 1008, such as a touch screen, from one or more buttons, from a voice recognition circuit, from a microphone, from a camera, from an optical sensor, from an accelerometer, from a magnetic field sensor, from a proximity sensor, from a touch pressure sensor, from a touch size sensor, from a temperature sensor, from a near field sensor, from an orientation sensor, and/or from any other type of input device.
[0101] Display/audio drivers 1010 can be any suitable circuitry for controlling and driving output to one or more display/audio output circuitries 1012 in some embodiments. For example, display/audio drivers 1010 can be circuitry for driving one or more display/audio output circuitries 1012, such as an LCD display, a speaker, an LED, or any other type of output device.
[0102] Communication interface(s) 1014 can be any suitable circuitry for interfacing with one or more communication networks. For example, interface(s) 1014 can include network interface card circuitry, wireless communication circuitry, and/or any other suitable type of communication network circuitry.
[0103] Antenna 1016 can be any suitable one or more antennas for wirelessly communicating with a communication network in some embodiments. In some embodiments, antenna 1016 can be omitted when not needed.
[0104] Bus 1018 can be any suitable mechanism for communicating between two or more components 1002, 1004, 1006, 1010, and 1014 in some embodiments.
[0105] Any other suitable components can additionally or alternatively be included in hardware 1000 in accordance with some embodiments.
[0106] In some embodiments, any suitable computer readable media can be used for storing instructions for performing the functions and/or processes described herein. For example, in some embodiments, computer readable media can be transitory or non-transitory. For example, non-transitory computer readable media can include media such as non-transitory magnetic media (such as hard disks, floppy disks, and/or any other suitable magnetic media), non-transitory optical media (such as compact discs, digital video discs, Blu-ray discs, and/or any other suitable optical media), non-transitory semiconductor media (such as flash memory, electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and/or any other suitable semiconductor media), any suitable media that is not fleeting or devoid of any semblance of permanence during transmission, and/or any suitable non-transitory tangible media. As another example, transitory computer readable media can include signals on networks, in wires, conductors, optical fibers, circuits, any suitable media that is fleeting and devoid of any semblance of permanence during transmission, and/or any suitable transitory intangible media.
[0107] Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention
can be made without departing from the spirit and scope of the invention, which is limited only by the claims that follow. Features of the disclosed embodiments can be combined and rearranged in various ways.
Claims
1. A system for automatically transcribing lyrics of songs, comprising:
a memory; and
at least one hardware processor coupled to the memory and configured to at least:
select a source separation model to be used in performing source separation on a song;
select a transcription model to be used in performing lyric transcription on the song;
use the source separation model to extract audio stems from the song; and
use the transcription model to generate predicted lyrics for the song based on the audio stems.
2. The system of claim 1, wherein selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
3. The system of claim 1, wherein selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
4. The system of claim 1, wherein the source separation model has a u-net deep convolutional neural network architecture.
5. The system of claim 1, wherein the at least one hardware processor is further configured to:
select a sliding window of the audio stems;
extract audio features of the audio stems from the sliding window; and
determine the predicted lyrics using an acoustic model.
6. The system of claim 5, wherein the audio features are Mel-frequency cepstral coefficients (MFCCs).
7. The system of claim 5, wherein the at least one hardware processor is further configured to correct the predicted lyrics using a language model.
8. A method for automatically transcribing lyrics of songs, comprising:
selecting a source separation model to be used in performing source separation on a song;
selecting a transcription model to be used in performing lyric transcription on the song;
using a hardware processor and the source separation model to extract audio stems from the song; and
using the transcription model to generate predicted lyrics for the song based on the audio stems.
9. The method of claim 8, wherein selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
10. The method of claim 8, wherein selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
11. The method of claim 8, wherein the source separation model has a u-net deep convolutional neural network architecture.
12. The method of claim 8, further comprising:
selecting a sliding window of the audio stems;
extracting audio features of the audio stems from the sliding window; and
determining the predicted lyrics using an acoustic model.
13. The method of claim 12, wherein the audio features are Mel-frequency cepstral coefficients (MFCCs).
14. The method of claim 12, further comprising correcting the predicted lyrics using a language model.
15. A non-transitory computer-readable medium containing computer executable instructions that, when executed by a processor, cause the processor to perform a method for automatically transcribing lyrics of songs, the method comprising:
selecting a source separation model to be used in performing source separation on a song;
selecting a transcription model to be used in performing lyric transcription on the song;
using the source separation model to extract audio stems from the song; and
using the transcription model to generate predicted lyrics for the song based on the audio stems.
16. The non-transitory computer-readable medium of claim 15, wherein selection of the source separation model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
17. The non-transitory computer-readable medium of claim 15, wherein selection of the transcription model is based on at least one of genre, language, dialect, year recorded, instruments used, acoustic, and vocal characteristics of the song.
18. The non-transitory computer-readable medium of claim 15, wherein the source separation model has a u-net deep convolutional neural network architecture.
19. The non-transitory computer-readable medium of claim 15, wherein the method further comprises:
selecting a sliding window of the audio stems;
extracting audio features of the audio stems from the sliding window; and
determining the predicted lyrics using an acoustic model.
20. The non-transitory computer-readable medium of claim 19, wherein the audio features are Mel-frequency cepstral coefficients (MFCCs).
21. The non-transitory computer-readable medium of claim 19, wherein the method further comprises correcting the predicted lyrics using a language model.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263428375P | 2022-11-28 | 2022-11-28 | |
| US63/428,375 | 2022-11-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024118649A1 true WO2024118649A1 (en) | 2024-06-06 |
Family
ID=91324817
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/081418 Ceased WO2024118649A1 (en) | 2022-11-28 | 2023-11-28 | Systems, methods, and media for automatically transcribing lyrics of songs |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024118649A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10770092B1 (en) * | 2017-09-22 | 2020-09-08 | Amazon Technologies, Inc. | Viseme data generation |
| US20210335333A1 (en) * | 2019-09-24 | 2021-10-28 | Secret Chord Laboratories, Inc. | Computing orders of modeled expectation across features of media |
| US11245950B1 (en) * | 2019-04-24 | 2022-02-08 | Amazon Technologies, Inc. | Lyrics synchronization |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23898727; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 23898727; Country of ref document: EP; Kind code of ref document: A1 |