US20230345082A1 - Interactive pronunciation learning system - Google Patents
- Publication number
- US20230345082A1 (application US 18/213,599)
- Authority
- US
- United States
- Prior art keywords
- word
- closed captioning
- content item
- dialogue
- selectable closed
- Prior art date
- Legal status
- Granted
Classifications
- H04N21/4884—Data services, e.g. news ticker, for displaying subtitles
- G10L15/04—Speech recognition; segmentation or word boundary detection
- G10L15/187—Speech recognition using phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
- G10L15/26—Speech to text systems
- G10L25/54—Speech or voice analysis specially adapted for comparison or discrimination, for retrieval
- H04N21/4856—End-user interface for client configuration for language selection, e.g. for the menu or subtitles
- H04N21/8106—Monomedia components involving special audio data, e.g. different tracks for different languages
- G10L21/003—Speech or voice signal processing for changing voice quality, e.g. pitch or formants
Definitions
- the media service allows subtitles or closed captions to be displayed along with the video so that the non-native speaker can read the text of the dialogue while listening to the dialogue. That way, the non-native person can match a word to the correct pronunciation.
- certain words may be spoken so quickly that the non-native speaker may not be able to fully grasp the word, or the non-native speaker may want to hear it multiple times in order to comprehend the pronunciation of the word perfectly. If the non-native speaker misses the word and wants to listen to it later, then the non-native speaker has to look it up in an online dictionary to hear the pronunciation.
- the non-native speaker may prefer to hear it the way an actor or actress pronounces the word in the movie rather than hearing it in a robotic voice that is often offered by an online dictionary application. Also, the non-native speaker may prefer to hear and practice the word while watching the show rather than practicing it after the show. That way, the non-native speaker can remember the pronunciation of the word the way it is pronounced in the show while it is still fresh in the non-native speaker's memory.
- a system receives a request to present a content item (e.g., a movie) for display on a device (e.g., TV).
- the system retrieves metadata of the content item, which includes the dialogue and respective timestamp information corresponding to each word in the dialogue.
- the system also retrieves a closed captioning file corresponding to the dialogue from a database of the content item.
- the metadata of the content item is compared to the retrieved closed captioning file corresponding to the dialogue.
- the system displays the closed captioning words along with the video of the content item.
- the closed captioning words are selectable via the user input interface of the device.
- the system retrieves an audio file associated with the selected closed captioning word and generates for playback a portion of the dialogue corresponding to the selected closed captioning word.
- the system provides audible pronunciation of the selected closed captioning word.
- the user may practice pronouncing the word by uttering the word after the system outputs audible pronunciation of the selected word.
- the user may use a second device (e.g., mobile phone) remote from a display device (e.g., TV) but located close to it.
- Any device capable of receiving voice input and transmitting the voice input to the streaming server or media application server is suitable for use as a second device.
- a second device remote from the first device (e.g., display device) may capture the user's voice and create a temporary audio file for the captured voice input.
- the temporary audio file may be in any audio file format such as the waveform audio file (e.g., .wav) and is transmitted to the server for pronunciation analysis.
- the system may compare the temporary file corresponding to the captured word to an audio file containing audible pronunciation of the selected word.
- the audio file may be retrieved from the database of the content item.
- the audio file includes audible pronunciation in the standard accent in a particular language or in a particular style that is pronounced in the content item.
- the system compares the temporary audio file corresponding to the captured word to an audio file containing audible pronunciation of the selected word to calculate a similarity score.
- a similarity score may indicate a level of similarity between the user's pronunciation and standard pronunciation. The higher the similarity score is, the more likely the user's pronunciation is close to the standard pronunciation of the particular word. In some embodiments, a similarity score indicates a level of similarity between the user's pronunciation and the pronunciation of a particular style uttered in the content item—the way the character in the content item pronounces a word.
- if a similarity score is over a certain threshold (e.g., 70%), the system may indicate in the user interface, with positive feedback, that the user has done a great job with the pronunciation.
- Real-time feedback may be generated for display with details, such as a comparison point or practice history (e.g., "You are improving! Better than yesterday.").
- the feedback may also provide tips for pronouncing the word (e.g., “Try to enunciate each word.”).
- the present disclosure provides an interactive pronunciation learning system that supports real-time user selection of a closed captioning word, enables playback of the audible pronunciation of the selected word the way a character of the content item pronounces it, and provides real-time feedback by comparing the user's recording of the word to an audio file of the selected word uttered by the character.
- the present disclosure further addresses the problems described above by, for example, saving network bandwidth and reducing network traffic by reducing the need to send multiple requests to a different online language learning source (e.g., an online dictionary for pronunciation) to learn the pronunciation.
- FIG. 1 depicts an exemplary user interface of a content item with a highlighted closed captioning word, in accordance with some embodiments of the disclosure
- FIG. 2 depicts an exemplary user interface of a content item with a highlighted closed captioning phrase, in accordance with some embodiments of the disclosure
- FIG. 3 depicts an exemplary user interface of a content item with non-speech information, in accordance with some embodiments of the disclosure
- FIG. 4 depicts an exemplary user interface of a content item with a slang, in accordance with some embodiments of the disclosure
- FIG. 5 depicts an exemplary user interface of a content item with a list of one or more pronunciation styles, in accordance with some embodiments of the disclosure
- FIG. 6 depicts an exemplary user interface of a content item with a list of one or more characters who uttered a closed captioning word, in accordance with some embodiments of the disclosure
- FIG. 7 depicts an exemplary user interface of providing feedback for pronunciation practice, in accordance with some embodiments of the disclosure.
- FIG. 8 depicts an exemplary embodiment of synchronizing an actual audio file to user's recording, in accordance with some embodiments of the disclosure
- FIG. 9 depicts an exemplary user interface of sharing a pronunciation recording with another user, in accordance with some embodiments of the disclosure.
- FIG. 10 depicts a flowchart of a process for providing audible pronunciation of a closed captioning word, in accordance with some embodiments of the disclosure
- FIG. 11 depicts a flowchart of a process for segmenting a content item and associating timestamps with words in a dialogue, in accordance with some embodiments of the disclosure
- FIG. 12 depicts an exemplary algorithm of generating audio files for words in a dialogue specified within a WebVTT format, in accordance with some embodiments of the disclosure
- FIG. 13 depicts an exemplary flow for providing feedback to a user's recording, in accordance with some embodiments of the disclosure
- FIG. 14 depicts an illustrative block diagram of an interactive pronunciation learning system, in accordance with some embodiments of the disclosure.
- FIG. 15 depicts an illustrative block diagram showing additional details of the system of FIG. 14 , in accordance with some embodiments of the disclosure.
- FIG. 1 depicts an exemplary user interface 100 of a content item with a highlighted closed captioning word 102 , in accordance with some embodiments of the disclosure.
- the content item (e.g., the "Mulan" movie) is presented via a media application on a user device in response to a user request to display the content item.
- the media application may be a stand-alone application implemented on user equipment devices 1414 a , 1414 b , 1414 c of FIG. 14 .
- the processes and embodiments described herein may be performed by a media application server 1404 of FIG. 14 or a streaming server 1306 of FIG. 13 .
- the media application retrieves metadata of the content item from a database of the content item.
- the metadata of the content item may comprise the dialogue and a respective timestamp corresponding to each word in the dialogue.
- the media application may also retrieve a closed captioning word file corresponding to the dialogue from a database of the content item.
- the media application compares the metadata of the content item to the closed captioning word file corresponding to the dialogue. Based on the comparison, the media application determines that at least the portion of the dialogue corresponds to the selected closed captioning word.
- a video of the content item is displayed with closed captioning words corresponding to dialogue 104 (e.g., “wait and see when we're through”) spoken in the first language (e.g., English).
- the closed captioning words are selectable via a user interface of a computing device (e.g., mobile device) remote from a display device (e.g., TV) that displays the content item.
- the closed captioning word may be selected via any type of input device such as a keyboard, mouse, or touchscreen.
- the selection of the closed captioning word is made via the display (e.g., tablet PC).
- In response to receiving the selection of the closed captioning word, the media application highlights the selected word 102 and generates for playback at least a portion of the dialogue corresponding to the selected closed captioning word. As shown in FIG. 1 , the selection was made for the closed captioning word "wait" 102 . Accordingly, an audible pronunciation of the selected closed captioning word (e.g., "wait") will be played.
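As a rough illustration of this playback step, the following Python sketch extracts the span of the dialogue audio track that corresponds to the selected word, using per-word timestamps of the kind described in the metadata above. It assumes a 16-bit PCM WAV dialogue track; the file name and the shape of the metadata entry are hypothetical, not taken from the disclosure.

```python
import wave

def extract_word_clip(dialogue_wav: str, start_ms: int, end_ms: int, out_wav: str) -> str:
    """Copy the [start_ms, end_ms] span of the dialogue track into its own WAV clip."""
    with wave.open(dialogue_wav, "rb") as src:
        rate = src.getframerate()
        start_frame = int(rate * start_ms / 1000)
        n_frames = int(rate * (end_ms - start_ms) / 1000)
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
        params = src.getparams()
    with wave.open(out_wav, "wb") as dst:
        dst.setparams(params)          # frame count in the header is corrected on close
        dst.writeframes(frames)
    return out_wav

# Hypothetical metadata entry for the selection shown in FIG. 1:
# {"word": "wait", "start_ms": 100, "end_ms": 567, "track": "mulan_dialogue.wav"}
# extract_word_clip("mulan_dialogue.wav", 100, 567, "wait.wav")
```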
- the audible pronunciation of the selected closed captioning word is different from the standard pronunciation of the selected word.
- pronunciation may vary drastically based on how the word is pronounced by a particular character in the content item or the contextual situation of the scene. For example, the intensity of how the word is said (e.g., angry v. sad), the pitch of the voice (e.g., female character v. male character), the intonation of the speech (e.g., the hometown of the character or hometown of the actor/actress), or how quickly the word is being said (e.g., urgent scene), all of which may play a part in varied pronunciation of a particular word.
- the present disclosure allows the users to hear the pronunciation of a particular word in a way that it is pronounced in a content item and learn the pronunciation of a word as a character in the content item would pronounce it.
- the playback of the content item is paused when a user selection of a closed captioning word is received. For example, when a user selects “wait” 102 in the closed captioning words, the playback of a video of the content item may be paused to play the pronunciation of the selected word. The user may also send a request to pause the video before selecting a closed captioning word.
- FIG. 2 depicts an exemplary user interface 200 of a content item with a highlighted closed captioning phrase 202 , in accordance with some embodiments of the disclosure.
- a selection may be made for a single word or multiple words. If a selection was made for a phrase (multiple words), then the media application may highlight a phrase comprising a plurality of words instead of highlighting a single word. Humans generally utter 100-130 words per minute and may run multiple words together. Therefore, a listener may hear the pronunciation of a single phrase rather than the individual words. In this case, an end time of a word may be temporally too close to a start time of a subsequent word (e.g., ⅓ second apart). This may prevent the listener from discerning individual words, and the listener may not be able to pinpoint a particular word that the listener wants to hear again in the closed captioning words. Alternatively, the system may allow only multi-word selections by the user.
- the system may highlight a phrase (e.g., “I've never seen”) instead of highlighting only the selected word (“I've”) because “I've never seen” is a collection of words that is often uttered together.
- the media application determines the temporal proximity of the first set of words (“I've never seen”) 202 in the dialogue 204 .
- the media application categorizes the first set of words as a first phrase.
- when the media application receives a selection of at least one word (e.g., "never") of the first set of words (e.g., "I've never seen") via the user interface of the user device, the media application retrieves an audio file or multiple audio files containing audible pronunciation of the first phrase ("I've never seen"). The media application generates for output the audible pronunciation of the first phrase. In this case, the adjacent words of the first set will be played sequentially.
- a threshold may be used to detect such cases: for example, the user utters "I've never seen" so fast that the gap between the end time of "never" and the start time of "seen" is less than a threshold of 0:00:002.
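A minimal sketch of the temporal-proximity grouping described above, assuming word timings are available as (text, start, end) tuples in seconds; the one-third-second gap comes from the example above, and the data shape is otherwise an assumption.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_s, end_s)

def group_into_phrases(words: List[Word], max_gap_s: float = 0.33) -> List[List[Word]]:
    """Merge words whose inter-word gap is below max_gap_s into one selectable phrase."""
    phrases: List[List[Word]] = []
    for word in words:
        if phrases and word[1] - phrases[-1][-1][2] < max_gap_s:
            phrases[-1].append(word)   # too close to the previous word: same phrase
        else:
            phrases.append([word])     # otherwise start a new phrase
    return phrases

# "I've never seen" spoken quickly collapses into a single selectable phrase.
words = [("I've", 1.00, 1.18), ("never", 1.20, 1.45), ("seen", 1.47, 1.80)]
print(group_into_phrases(words))  # [[("I've", ...), ("never", ...), ("seen", ...)]]
```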
- FIG. 3 depicts an exemplary user interface 300 of a content item with non-speech information 302 , in accordance with some embodiments of the disclosure.
- the non-speech information may include non-dialogue, such as a description of the background scene (e.g., “hair trembles with emotion”).
- Non-speech information may be available for the hearing impaired listeners to give the context of the scene (e.g., somber music).
- the non-speech information may be greyed out or marked in a way that it is clear to the user that the displayed non-speech information is not part of the dialogue.
- the non-speech information 302 is displayed within a bracket.
- the non-speech information 302 may not be selectable by the user as these words are not part of the dialogue that the character in the content item uttered.
- the non-speech information may be available as an audio file to be output in a voice other than the character who appeared in the content item.
- FIG. 4 depicts an exemplary user interface of a content item 400 with a slang 402 , in accordance with some embodiments of the disclosure.
- Some movies include certain words that are pronounced by the characters in a particular way that differs from the typical pronunciation, such as with different intonation, pitch, or tone. Some users like how these characters pronounce a word and want to hear and practice the word the way the characters in the movies pronounce it. Because such slang makes the pronunciation unique, the slang may appear in the video along with the actual words.
- For example, the slang word 402 (e.g., "Fuhgeddaboudit") is displayed along with the actual word 404 (e.g., "Forget about it").
- the slang word 402 ("Fuhgeddaboudit") may be visually distinguishable from the actual word 404 ("Forget about it") in that the slang word is highlighted in a different color or displayed in a different font than the actual word.
- the present disclosure allows the users to hear the pronunciation of a particular word the way it is pronounced in a content item, thereby allowing the user to learn the unique pronunciation of the word as a native speaker or a character in the content item would say it, including distinct audio characteristics such as emotion, pitch, tone, pause, or intonation.
- FIG. 5 depicts an exemplary user interface 500 of a content item with a list of one or more pronunciation styles 504 , 506 , 508 , in accordance with some embodiments of the disclosure.
- the media application identifies a plurality of pronunciation styles in the first language that are stored in a database.
- a standard American accent 504 , a Southern accent 506 , and a Boston accent 508 are available for the phrase "Joey does not share food" 502 .
- the media application generates for display a list of the plurality of pronunciation styles 504 , 506 , 508 on the first device.
- although exemplary user interface 500 displays particular accents as different pronunciation styles, any dialect or any other type of varied pronunciation style may be used.
- the media application receives a selection of a pronunciation style of the plurality of pronunciation styles.
- Southern accent 506 was selected.
- the media application retrieves an audio file containing audible pronunciation of the selected word in the selected style (e.g., Southern accent).
- the media application generates for output audible pronunciation of the selected word in the selected style.
- FIG. 6 depicts an exemplary user interface 600 of a content item with a list of one or more characters 604 , 606 , 608 who uttered the closed captioning word, in accordance with some embodiments of the disclosure. For example, if a user selects a word or phrase “Joey does not share food” 602 from the show “Friends,” the media application identifies whether one or more characters speak the selected word or phrase of the content item by querying the database of the content item. If audio files of “Joey does not share food” are available for one or more characters of the show, then the media application generates for display the exemplary user interface 600 that includes a list of one or more characters of the content item who spoke the selected word or phrase.
- the media application receives a selection of a character of one or more characters by the user.
- Jennifer Anniston's voice 604 was selected.
- the media application retrieves an audio file containing audible pronunciation of the selected word spoken by the selected character (e.g., Jennifer Anniston).
- the media application generates for output the retrieved audio file containing audible pronunciation of the selected word spoken by the selected character.
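The style and character lookups of FIGS. 5 and 6 could be served by a simple index over recorded utterances, as in the sketch below; the table and column names are hypothetical and only stand in for whatever database the content item actually uses.

```python
import sqlite3

# Hypothetical schema: one row per recorded utterance of a phrase in the content item.
SCHEMA = """
CREATE TABLE IF NOT EXISTS utterances (
    phrase    TEXT,
    style     TEXT,   -- e.g. 'standard', 'southern', 'boston'
    character TEXT,   -- e.g. the character or actor who spoke the phrase
    audio_uri TEXT
);
"""

def available_options(db: sqlite3.Connection, phrase: str):
    """List the pronunciation styles and characters that have audio for this phrase."""
    return db.execute(
        "SELECT DISTINCT style, character FROM utterances WHERE phrase = ?", (phrase,)
    ).fetchall()

def audio_for(db: sqlite3.Connection, phrase: str, *, style=None, character=None):
    """Fetch the audio URI matching the user's selected style or character, if any."""
    query, params = "SELECT audio_uri FROM utterances WHERE phrase = ?", [phrase]
    if style:
        query, params = query + " AND style = ?", params + [style]
    if character:
        query, params = query + " AND character = ?", params + [character]
    row = db.execute(query, params).fetchone()
    return row[0] if row else None
```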
- FIG. 7 depicts an exemplary user interface 700 of providing feedback 704 for pronunciation practice, in accordance with some embodiments of the disclosure.
- Exemplary user interface 700 may be provided in accordance with the exemplary user interfaces 100 - 500 discussed in FIGS. 1 - 5 .
- the user may repeat pronouncing the same word.
- the user may do so by uttering the word after the media application outputs audible pronunciation of the selected word.
- the user may use a second device remote from the display device (e.g., TV), such as a mobile phone or a voice assistant device, that is located close to the display device. Any device capable of receiving voice input and transmitting the voice input to the streaming server or media application server is suitable for use as a second device.
- a second device (e.g., voice assistant device) 706 remote from the first device (e.g., display device) may capture the user's voice and create a temporary audio file for the captured voice.
- the temporary audio file may be in any audio file format such as the waveform audio file (e.g., .wav) and is transmitted to the server for pronunciation analysis.
- the temporary audio file may be analyzed at a client device level by control circuitry 1510 of computing device 1414 a , 1414 b , 1414 c.
- the media application may compare the temporary file corresponding to the captured word to an audio file containing audible pronunciation of the selected word.
- the audio file may be retrieved from the database of the content item.
- the audio file includes audible pronunciation in the standard accent in a particular language or in a particular style that is pronounced in the content item.
- the media application compares the temporary audio file corresponding to the captured word to an audio file containing audible pronunciation of the selected word to calculate a similarity score. It may do so by synchronizing the time domain signals between the two files and overlaying their frequency components, as shown in FIG. 8 and explained in detail below.
- a similarity score may indicate a level of similarity between the user's pronunciation and standard pronunciation. The higher the similarity score is, the more likely the user's pronunciation is close to the standard pronunciation of the particular word. In some embodiments, a similarity score indicates a level of similarity between the user's pronunciation and the pronunciation of a particular style uttered in the content item—the way the character in the content item pronounces a word.
- if a similarity score is over a certain threshold (e.g., 70%), the media application may indicate in the user interface, with positive feedback, that the user has done a great job with the pronunciation.
- a real-time feedback 704 may be generated for display with details, such as a comparison point or practice history (e.g., "You are improving! Better than yesterday."). Feedback 704 may also provide tips for pronouncing the word (e.g., "Try to enunciate each word."). Although exemplary feedback 704 was used for illustrative purposes, any kind of feedback for improving the pronunciation may be provided. If the similarity score falls below the threshold, then the media application may include constructive feedback with descriptive details that can help with the pronunciation.
- FIG. 8 depicts an exemplary embodiment of synchronizing an actual audio file to a user's recording in accordance with some embodiments of the disclosure.
- the media application may synchronize the time domain signals between the files and overlay the frequency components in some embodiments. Based on the comparison, the media application determines how close these two files are.
- the synchronization and the comparison may be performed by any of the media application, streaming server 1306 of FIG. 13 or media application server 1404 of FIG. 14 .
- the media application may use Fast Fourier Transform (FFT) algorithms to compute a sequence of signals and convert the digital signals to spectral components.
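One way such a comparison could be scored is sketched below: magnitude spectra from an FFT are overlaid and compared with cosine similarity scaled to 0-100. It assumes both clips are mono sample arrays at the same sample rate and omits the time-domain synchronization (e.g., by cross-correlation) mentioned above, so it illustrates the idea rather than the system's actual scoring method.

```python
import numpy as np

def similarity_score(reference: np.ndarray, attempt: np.ndarray) -> float:
    """Compare two mono clips by overlaying their magnitude spectra (0-100 scale)."""
    n = max(len(reference), len(attempt))            # zero-pad to a common length
    ref_mag = np.abs(np.fft.rfft(reference, n=n))    # spectral components of the reference
    att_mag = np.abs(np.fft.rfft(attempt, n=n))      # spectral components of the attempt
    denom = np.linalg.norm(ref_mag) * np.linalg.norm(att_mag)
    if denom == 0:
        return 0.0
    return float(100.0 * np.dot(ref_mag, att_mag) / denom)
```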
- FIG. 9 depicts an exemplary user interface 900 of sharing a pronunciation recording 902 in accordance with some embodiments of the disclosure.
- Exemplary user interface 900 may be provided in accordance with the embodiment discussed in connection with FIG. 7 .
- a first user (e.g., Joe) may share his or her recording 902 with other users located in a remote location.
- Joe may select a friend 904 to whom the user wants to send the recording (e.g., a language teacher or native speaker) and cause the recording to be sent to the user's friend (e.g., Serhad, Rae, Max) by making a selection in the friends list retrieved from Joe's profile data.
- the selected user's friend may perform actions related to the recording, such as playing the recording, rating the recording, providing feedback to the recording, or creating a new recording.
- the user's friend may send the feedback or a newly-created recording back to Joe for comparison.
- FIG. 10 depicts a flowchart of a process 1000 for providing audible pronunciation of closed captioning words, in accordance with some embodiments of the disclosure. It should be noted that process 1000 may be performed by control circuitry 1502 , 1510 of FIG. 15 as instructed by the media application, which may be implemented on any client device. In addition, one or more steps of flowcharts 1100 or 1300 may be incorporated into or combined with one or more steps of any other process of FIG. 10 .
- control circuitry 1510 generates for output on a first device a content item comprising a dialogue.
- a content item may be audio-visual content that includes dialogue uttered by a character.
- control circuitry 1510 generates for display on the first device a closed captioning word corresponding to the dialogue.
- the closed captioning word may be in the language that is the same as the dialogue.
- the closed captioning word may be selectable via a user interface of the first device.
- control circuitry 1510 receives a selection of the closed captioning word via the user interface of the first device (e.g., laptop). Alternatively, a selection of the closed captioning word may be made via the user interface of a second device different from the first device.
- a video of the content item is paused.
- control circuitry 1510 generates for playback on the first device at least a portion of the dialogue corresponding to the selected closed captioning word in response to receiving the selection of the closed captioning word.
- Control circuitry 1510 generates audible pronunciation of the selected word uttered by the character in the content item.
- the audible pronunciation has its own audio characteristic, such as tone, intensity, pause, intonation, pitch, or any distinguishable audio attributes that make the pronunciation unique from the standard pronunciation.
- FIG. 11 depicts a flowchart 1100 of a process for segmenting a content item and associating timestamps with words in dialogue, in accordance with some embodiments of the disclosure. It should be noted that process 1100 may be performed by control circuitry 1502 , 1510 of FIG. 15 as instructed by the media application, which may be implemented on any client device. Alternatively, process 1100 may be performed by streaming server 1306 of FIG. 13 or media application server 1404 of FIG. 14 . In addition, one or more steps of flowcharts 1000 or 1300 may be incorporated into or combined with one or more steps of any other process of FIG. 11 .
- control circuitry 1502 splits the content item into an audio stream and a video stream.
- control circuitry 1502 segments the audio stream of the content item to a sequence of words using a speech-to-text algorithm to generate an audio word list.
- a speech-to-text algorithm or voice recognition algorithm may be used in generating an audio word list.
- metadata of the content item comprising closed caption data is retrieved from a database of the content item.
- the closed caption data includes a text version of the spoken part of the content item (e.g., dialogue).
- control circuitry 1502 detects whether the closed caption data matches the words being used in the video by comparing the closed caption data and the processed video. For example, a speech detection algorithm or image processing technique may be used to decipher or read lips of the character in the video (e.g., a character saying “forget about it”) to determine the words that are being used in the video. Additionally, in another embodiment, control circuitry 1502 detects whether words in the audio word list match the words being used in the video.
- control circuitry 1502 maps the closed caption data to the audio word list generated from the audio stream using the speech-to-text algorithm at step 1104 .
- Step 1108 may provide an additional degree of confidence that the closed caption data matches not only the video of the content item, but also the audio of the content item.
- control circuitry 1502 records the audio file, the timestamp information (e.g., a time range) of the word identified within the video, and the link to the corresponding closed caption word as part of the metadata for the video.
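The sketch below shows one way steps 1104-1110 could be stitched together, assuming the speech-to-text pass yields (word, start, end) tuples; the naive in-order matching is an assumption for illustration, not the patent's actual alignment method.

```python
def build_word_metadata(stt_words, caption_words, audio_uri_for):
    """Map each closed-caption word to the matching recognized word's time range.

    stt_words: [(word, start_s, end_s), ...] from the speech-to-text pass
    caption_words: [word, ...] from the closed caption data
    audio_uri_for: callable (start_s, end_s) -> URI of the extracted audio clip
    """
    metadata, i = [], 0
    for caption in caption_words:
        # Naive in-order alignment: advance until the recognized word matches.
        while i < len(stt_words) and stt_words[i][0].lower() != caption.lower():
            i += 1
        if i == len(stt_words):
            break  # no match; a real system would use fuzzier alignment
        word, start, end = stt_words[i]
        metadata.append({
            "caption_word": caption,
            "start": start,
            "end": end,
            "audio": audio_uri_for(start, end),
        })
        i += 1
    return metadata
```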
- FIG. 12 depicts an exemplary algorithm 1200 for generating audio files for words specified within the Web Video Text Tracks (WebVTT) format, according to some embodiments of the disclosure.
- Exemplary algorithm 1200 includes codes for generating the audio files for the dialogue.
- An audio file may include a word and associated timestamp information specified within the WebVTT format. For example, the phrase "wait and see when we're through" is spoken over the time range of 0:00.100-0:00.400.
- the range of the words is kept as tuples of words, and each spoken word is assigned a start timestamp and an end timestamp.
- the media application may create a new tag for each pronunciation and assign an audio file associated with the pronunciation of the word and the range of timestamps when the utterance appears.
- a range of timestamps may be assigned with a start timestamp of 0:00.100 and an end timestamp of 0:00.567.
- a newly generated tag for the word “wait” may be associated with the specified start timestamp and the end timestamp.
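As an illustration of the kind of processing algorithm 1200 might perform, the sketch below splits a single WebVTT cue into per-word tuples. Dividing the cue's time range in proportion to word length is purely an assumption for the example; actual per-word boundaries would come from the audio analysis of FIG. 11.

```python
import re

CUE = """00:00.100 --> 00:00.400
wait and see when we're through"""

def parse_timestamp(ts: str) -> float:
    """Convert an mm:ss.ttt WebVTT timestamp to seconds (hours are omitted here)."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def words_with_timestamps(cue: str):
    """Split one WebVTT cue into (word, start_s, end_s) tuples."""
    timing, text = cue.split("\n", 1)
    start_ts, end_ts = re.split(r"\s*-->\s*", timing)
    start, end = parse_timestamp(start_ts), parse_timestamp(end_ts)
    words = text.split()
    total = sum(len(w) for w in words)
    tuples, cursor = [], start
    for w in words:
        span = (end - start) * len(w) / total   # assumed proportional split
        tuples.append((w, round(cursor, 3), round(cursor + span, 3)))
        cursor += span
    return tuples

print(words_with_timestamps(CUE))
```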
- the audio files are part of the HLS (HTTP Live Streaming) streaming manifest for SVOD (Subscription Video-On-Demand).
- the algorithm may be implemented in various subtitle formats, such as SubRip Text (SRT) or Timed Text Markup Language (TTML).
- the algorithm may be implemented using other streaming protocols such as HLS, MPEG DASH, HSS, HDS, etc.
- FIG. 13 depicts an exemplary flow 1300 for providing feedback to a user's recording in accordance with some embodiments of the disclosure.
- a streaming server transmits a content item with closed captioning words or subtitles to a streaming video client 1304 in response to a user request to display the content item (e.g., a user plays the movie).
- a streaming server may be a server that provides content items to computing devices over communication network 1412 .
- a streaming server may be media application server 1404 .
- a streaming video client can be a rendering device such as a TV or laptop.
- a streaming video client may be any of computing devices 1414 a , 1414 b , 1414 c .
- a remote device 1302 can be any device that is capable of providing input, selecting a text, or capturing a vocal input.
- streaming video client 1304 and remote device 1302 can be integrated as a single device.
- the content item is generated for display on streaming video client 1304 .
- a user may send a request to pause the video to hear the pronunciation of a specific word at step 1312 .
- streaming video client 1304 may relay the request from remote device 1302 to streaming server 1306 .
- the user may navigate between closed captioning words displayed on a screen of streaming video client 1304 .
- the user may select a word or a phrase within the closed captioning words at remote device 1302 (e.g., by double-clicking a word).
- streaming video client 1304 may relay the selection made from remote device 1302 to streaming server 1306 .
- the selection may be made via a graphical user interface of streaming video client 1304 (e.g., a TV touchscreen).
- streaming server 1306 queries for an audio file of the selected word by looking up the manifest or metadata associated with the content item.
- streaming server 1306 sends an audio file containing audible pronunciation of the selected word to streaming video client 1304 .
- streaming video client 1304 plays audible pronunciation of the selected word. If the user wishes to practice the pronunciation, the user may repeat the word after streaming video client 1304 plays the word. The pronounced word may be captured as a recording at remote device 1302 and may be sent to streaming server 1306 at step 1324 .
- streaming video client 1304 may relay the recording file made from remote device 1302 to streaming server 1306 .
- capturing of the user's pronunciation is performed using a microphone at streaming video client 1304 (e.g., using a microphone of a laptop).
- streaming server 1306 compares the user's recording to the audio file of the selected word to calculate a similarity score at step 1326 .
- streaming server 1306 transmits the comparison result (e.g., real-time feedback) to streaming video client 1304 based on the calculated similarity score.
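A condensed sketch of the server-side handling of steps 1312-1328, written as plain message handlers; the Session structure, field names, and feedback strings are assumptions used only to show the shape of the exchange, and the similarity score is assumed to be computed elsewhere (e.g., as in the FFT sketch above).

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """In-memory stand-in for the streaming server's per-client state in flow 1300."""
    manifest: dict                  # word_id -> {"caption_word", "audio_uri", ...}
    paused: bool = False
    history: list = field(default_factory=list)

def on_pause_request(session: Session) -> None:
    """Step 1312: the remote device asks to pause so a word can be selected."""
    session.paused = True

def on_word_selected(session: Session, word_id: str) -> dict:
    """Steps 1316-1322: query the manifest and return the clip to the video client."""
    entry = session.manifest[word_id]
    return {"word": entry["caption_word"], "audio_uri": entry["audio_uri"]}

def on_recording_scored(session: Session, score: float) -> dict:
    """Steps 1326-1328: turn a similarity score into real-time feedback."""
    improving = bool(session.history) and score > session.history[-1]
    session.history.append(score)
    if score >= 70:
        message = "Great job!" + (" You are improving! Better than yesterday." if improving else "")
    else:
        message = "Try to enunciate each word."
    return {"score": score, "feedback": message}
```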
- FIG. 14 shows an illustrative block diagram of an interactive pronunciation learning system, in accordance with some embodiments of the disclosure.
- system 1400 includes one or more of media application server 1404 , content item source 1406 , and communication network 1412 .
- Communication network 1412 may be one or more networks including the Internet, a mobile phone network, mobile voice or data network (e.g., a 4G or LTE network), cable network, public switched telephone network, or other types of communication network or combinations of communication networks.
- Communication network 1412 includes one or more communication paths, such as a satellite path, a fiber-optic path, a cable path, a path that supports Internet communications (e.g., IPTV), free-space connections (e.g., for broadcast or other wireless signals), or any other suitable wired or wireless communication path or combination of such paths.
- Communication network 1412 communicatively couples various components of system 1400 to one another. For instance, server 1404 may be communicatively coupled to a video-hosting web server and content item source 1406 via communication network 1412 .
- Video-hosting web server hosts one or more video websites, such as YouTube™ and/or the like, that enable users to download or stream videos, video clips, and/or other types of content.
- the video websites also provide access to data regarding downloaded content.
- Content item source 1406 may store content item-related data from one or more types of content providers or originator of content (e.g., a television broadcaster, a Webcast provider, on-demand content provider, over-the-top content providers, or other providers of content).
- Content item source includes a content item, manifest associated with the content item, metadata associated with the content item, closed caption data or subtitles, or any other related material associated with the content item.
- the metadata or manifest of the content item may include, among other information of the content item, dialogue and associated timestamp information for each word in the dialogue.
- a remote media server may be used to store different types of content in a location remote from computing device 1414 (described below).
- User data source may provide user-related data, such as user profile or preference data described herein such as preferred selection options, previous option selection, preferred content item, preferred genre, preferred characters or actors, user's friends list, to computing device 1414 , server 1404 and/or video-hosting web server using any suitable approach.
- content item source 1406 and user data source may be integrated as one device.
- content item data from content item source 1406 may be provided to computing device 1414 using a client/server approach.
- computing device 1414 may pull content item data from a server (e.g., server 1404 ), or a server may push content item data to computing device 1414 .
- a client application residing on computing device 1414 may initiate sessions with user data source to obtain content item data when needed, e.g., when the content item data is out of date or when computing device 1414 receives a request from the user to receive data.
- Content and/or content item data delivered to computing device 1414 may be over-the-top (OTT) content.
- OTT content delivery allows Internet-enabled user devices, such as computing device 1414 , to receive content that is transferred over the Internet, including any content described above, in addition to content received over cable or satellite connections.
- OTT content is delivered via an Internet connection provided by an Internet service provider (ISP), but a third party distributes the content.
- the ISP may not be responsible for the viewing abilities, copyrights, or redistribution of the content, and may only transfer IP packets provided by the OTT content provider. Examples of OTT content providers include YouTube™, Netflix™, and HULU™, which provide audio and video via IP packets.
- OTT content providers may additionally or alternatively provide content item data described above.
- providers of OTT content can distribute applications (e.g., web-based applications or cloud-based applications), or the content can be displayed by applications stored on computing device 1414 .
- media application server 1404 accesses the content of the video website(s) hosted by video-hosting web server and, based on the accessed content, generates a variety of types of data such as metadata or manifest (e.g., terms, associations between terms and corresponding media content identifiers, dialogue, closed captions, subtitles, and/or the like) that can be accessed to facilitate the retrieving or searching of media content made available by content item source 1406 .
- server 1404 accesses metadata or manifest of the content item from content item source 1406 .
- the metadata or manifest of the content item may be generated by video-hosting web server or media application server 1404 .
- the metadata or manifest of the content item may be generated by a third-party generator that has access to the content item.
- System 1400 also includes one or more computing devices 1414 , such as user television equipment 1414 a (e.g., a set-top box), user computer equipment 1414 b , and wireless user communication device 1414 c (e.g., a smartphone device or a remote control), which users can use to interact with server 1404 , user data source, and/or content item source 1406 , via communication network 1412 , to search for desired media content.
- server 1404 may provide a user interface via computing device 1414 , by which a user can input a query for a particular item of media content made available by content item source 1406 , and generate a response to the query by accessing and/or processing data and/or manifest.
- system 1400 may include multiples of one or more illustrated components.
- system 1400 may include multiple video-hosting web servers and media application server 1404 may aggregate data from the multiple video websites hosted by multiple video-hosting web servers, respectively.
- FIG. 15 is an illustrative block diagram showing additional details of the system 1400 of FIG. 14 , in accordance with some embodiments of the disclosure.
- server 1404 includes control circuitry 1502 and Input/Output (I/O) path 1508
- control circuitry 1502 includes storage 1504 and processing circuitry 1506
- Computing device 1414 includes control circuitry 1510 , I/O path 1516 , speaker 1518 , display 1520 , camera 1524 , microphone 1526 , and user input interface 1522 .
- Control circuitry 1510 includes storage 1512 and processing circuitry 1514 . Control circuitry 1502 and/or 1510 may be based on any suitable processing circuitry such as processing circuitry 1506 and/or 1514 .
- processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores). In some embodiments, processing circuitry may be distributed across multiple separate processors, for example, multiple of the same type of processors (e.g., two Intel Core i9 processors) or multiple different processors (e.g., an Intel Core i7 processor and an Intel Core i9 processor).
- Each of storage 1504 , storage 1512 , and/or storages of other components of system 1400 may be an electronic storage device.
- the phrase “electronic storage device” or “storage device” should be understood to mean any device for storing electronic data, computer software, or firmware, such as random-access memory, read-only memory, hard drives, optical drives, digital video disc (DVD) recorders, compact disc (CD) recorders, BLU-RAY disc (BD) recorders, BLU-RAY 3D disc recorders, digital video recorders (DVR, sometimes called a personal video recorder, or PVR), solid state devices, quantum storage devices, gaming consoles, gaming media, or any other suitable fixed or removable storage devices, and/or any combination of the same.
- Each of storage 1504 , storage 1512 , and/or storages of other components of system 1400 may be used to store various types of content, content item data, and/or other types of data.
- Non-volatile memory may also be used (e.g., to launch a boot-up routine and other instructions).
- Cloud-based storage may be used to supplement storages 1504 , 1512 or instead of storages 1504 , 1512 .
- control circuitry 1502 and/or 1510 executes instructions for an application stored in memory (e.g., storage 1504 and/or 1512 ). Specifically, control circuitry 1502 and/or 1510 may be instructed by the application to perform the functions discussed herein. In some implementations, any action performed by control circuitry 1502 and/or 1510 may be based on instructions received from the application.
- the application may be implemented as software or a set of executable instructions that may be stored in storage 1504 and/or 1512 and executed by control circuitry 1502 and/or 1510 .
- the application may be a client/server application where only a client application resides on computing device 1414 , and a server application resides on server 1404 .
- the application may be implemented using any suitable architecture.
- it may be a stand-alone application wholly implemented on computing device 1414 .
- the media application may be implemented as software or a set of executable instructions, which may be stored in non-transitory storage 1512 and executed by control circuitry 1510 of a user device 1414 .
- instructions for the application are stored locally (e.g., in storage 1512 ), and data for use by the application is downloaded on a periodic basis (e.g., from an out-of-band feed, from an Internet resource, or using another suitable approach).
- Control circuitry 1510 may retrieve instructions for the application from storage 1512 and process the instructions to perform the functionality described herein. Based on the processed instructions, control circuitry 1510 may determine what action to perform when input is received from user input interface 1522 .
- control circuitry 1510 may include communication circuitry suitable for communicating with an application server (e.g., server 1404 ) or other networks or servers.
- the instructions for carrying out the functionality described herein may be stored on the application server.
- Communication circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, a telephone modem, an Ethernet card, or a wireless modem for communication with other equipment, or any other suitable communication circuitry.
- Such communication may involve the Internet or any other suitable communication networks or paths (e.g., communication network 1412 ).
- control circuitry 1510 runs a web browser that interprets web pages provided by a remote server (e.g., server 1404 ).
- the remote server may store the instructions for the application in a storage device.
- the remote server may process the stored instructions using circuitry (e.g., control circuitry 1502 ) and generate the displays discussed above and below.
- Computing device 1414 may display the content via display 1520 . This way, the processing of the instructions is performed remotely (e.g., by server 1404 ) while the resulting displays are provided locally on computing device 1414 .
- Computing device 1414 may receive inputs from the user via input interface 1522 and transmit those inputs to the remote server for processing and generating the corresponding displays.
- a user may send instructions to control circuitry 1502 and/or 1510 using user input interface 1522 .
- User input interface 1522 may be any suitable user interface, such as a remote control, trackball, keypad, keyboard, touchscreen, touchpad, stylus input, joystick, voice recognition interface, or other user input interfaces.
- User input interface 1522 may be integrated with or combined with display 1520 , which may be a monitor, a television, a liquid crystal display (LCD), electronic ink display, or any other equipment suitable for displaying visual images.
- Camera 1524 of computing device 1414 may capture an image or a video.
- a microphone 1526 of computing device 1414 may detect sound in proximity to computing device 1414 and convert the sound to electrical signals.
- Server 1404 and computing device 1414 may receive content and data via I/O paths 1508 and 1516 , respectively.
- I/O paths 1508 , 1516 may provide content (e.g., broadcast programming, on-demand programming, Internet content, content available over a local area network (LAN) or wide area network (WAN), and/or other content) and data to control circuitry 1502 , 1510 .
- Control circuitry 1502 , 1510 may be used to send and receive commands, requests, and other suitable data using I/O paths 1508 , 1516 .
- I/O paths 1508 , 1516 may connect control circuitry 1502 , 1510 (and specifically processing circuitry 1506 , 1514 ) to one or more communication paths (described below). I/O functions may be provided by one or more of these communication paths but are shown as single paths in FIG. 15 to avoid overcomplicating the drawing.
Abstract
Description
- It has been challenging for a non-native speaker to learn a foreign language. It has been particularly challenging to pick up the correct pronunciation of a word as a native speaker would pronounce it. One effective way to learn correct pronunciation is to watch content items (e.g., movies or shows) in the target language, because doing so helps the non-native speaker learn the pronunciation as a native speaker would say it and pick up the pronunciation of everyday language or slang, which may not be taught in classes or books.
- During the playback of the media, the media service allows subtitles or closed captions to be displayed along with the video so that the non-native speaker can read the text of the dialogue while listening to the dialogue. That way, the non-native speaker can match a word to the correct pronunciation. However, certain words may be spoken so quickly that the non-native speaker may not be able to fully grasp the word, or the non-native speaker may want to hear it multiple times in order to comprehend the pronunciation of the word perfectly. If the non-native speaker misses the word and wants to listen to it later, then the non-native speaker has to look it up in an online dictionary to hear the pronunciation. However, the non-native speaker may prefer to hear the word the way an actor or actress pronounces it in the movie rather than in the robotic voice that is often offered by an online dictionary application. Also, the non-native speaker may prefer to hear and practice the word while watching the show rather than practicing it after the show. That way, the non-native speaker can remember the pronunciation of the word the way it is pronounced in the show while it is still fresh in the non-native speaker's memory.
- To overcome such issues, methods and systems are described herein for a pronunciation learning support system that provides real-time audible pronunciation of a word corresponding to a dialogue upon a user selection of a closed captioning word or a word in the subtitles. For example, a system receives a request to present a content item (e.g., a movie) for display on a device (e.g., TV). In some embodiments, the system retrieves metadata of the content item, which includes the dialogue and respective timestamp information corresponding to each word in the dialogue. The system also retrieves a closed captioning file corresponding to the dialogue from a database of the content item. The metadata of the content item is compared to the retrieved closed captioning file corresponding to the dialogue. The system displays the closed captioning words along with the video of the content item.
- In some embodiments, the closed captioning words are selectable via the user input interface of the device. Upon a user selection, the system retrieves an audio file associated with the selected closed captioning word and generates for playback a portion of the dialogue corresponding to the selected closed captioning word. The system provides audible pronunciation of the selected closed captioning word.
- The user may practice pronouncing the word by uttering the word after the system outputs audible pronunciation of the selected word. In one embodiment, the user may use a second device (e.g., mobile phone) remote from a display device (e.g., TV) but located close to it. Any device capable of receiving voice input and transmitting the voice input to the streaming server or media application server is suitable for use as a second device.
- A second device (e.g., voice assistant device) remote from the first device (e.g., display device) may capture the user's voice and create a temporary audio file for the captured voice input. The temporary audio file may be in any audio file format, such as waveform audio (e.g., .wav), and is transmitted to the server for pronunciation analysis.
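As a rough sketch of this capture-and-upload step, the code below wraps raw PCM from the second device in an in-memory WAV file and posts it for analysis. The endpoint URL and audio parameters are placeholders, not details from the disclosure, and the actual capture of microphone samples is left to the device platform.

```python
import io
import wave
import urllib.request
from urllib.parse import quote

def wrap_pcm_as_wav(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    """Package raw 16-bit mono PCM from the capture device as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

def upload_for_analysis(wav_bytes: bytes, word: str) -> bytes:
    """Send the temporary audio file to a (hypothetical) pronunciation-analysis endpoint."""
    req = urllib.request.Request(
        "https://media-app.example.com/pronunciation?word=" + quote(word),  # placeholder URL
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()   # e.g., JSON feedback from the server
```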
- The system may compare the temporary file corresponding to the captured word to an audio file containing audible pronunciation of the selected word. The audio file may be retrieved from the database of the content item. The audio file includes audible pronunciation in the standard accent in a particular language or in a particular style that is pronounced in the content item. The system compares the temporary audio file corresponding to the captured word to an audio file containing audible pronunciation of the selected word to calculate a similarity score.
- A similarity score may indicate a level of similarity between the user's pronunciation and standard pronunciation. The higher the similarity score is, the more likely the user's pronunciation is close to the standard pronunciation of the particular word. In some embodiments, a similarity score indicates a level of similarity between the user's pronunciation and the pronunciation of a particular style uttered in the content item—the way the character in the content item pronounces a word.
- In some embodiments, if a similarity score is over a certain threshold (e.g., 70%), then the system may indicate in the user interface, with positive feedback, that the user has done a great job with the pronunciation. Real-time feedback may be generated for display with details, such as a comparison point or practice history (e.g., "You are improving! Better than yesterday."). The feedback may also provide tips for pronouncing the word (e.g., "Try to enunciate each word.").
- The present disclosure provides an interactive pronunciation learning system that supports real-time user selection of a closed captioning word, enables playback of the audible pronunciation of the selected word the way a character of the content item pronounces it, and provides real-time feedback by comparing the user's recording of the word to an audio file of the selected word uttered by the character. The present disclosure further addresses the problems described above by, for example, saving network bandwidth and reducing network traffic by reducing the need to send multiple requests to a different online language learning source (e.g., an online dictionary for pronunciation) to learn the pronunciation.
- It should be noted that the systems, methods, apparatuses, and/or aspects described above may be applied to, or used in accordance with, other systems, methods, apparatuses, and/or aspects described in this disclosure.
- The above and other objects and advantages of the disclosure will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
- FIG. 1 depicts an exemplary user interface of a content item with a highlighted closed captioning word, in accordance with some embodiments of the disclosure;
- FIG. 2 depicts an exemplary user interface of a content item with a highlighted closed captioning phrase, in accordance with some embodiments of the disclosure;
- FIG. 3 depicts an exemplary user interface of a content item with non-speech information, in accordance with some embodiments of the disclosure;
- FIG. 4 depicts an exemplary user interface of a content item with a slang, in accordance with some embodiments of the disclosure;
- FIG. 5 depicts an exemplary user interface of a content item with a list of one or more pronunciation styles, in accordance with some embodiments of the disclosure;
- FIG. 6 depicts an exemplary user interface of a content item with a list of one or more characters who uttered a closed captioning word, in accordance with some embodiments of the disclosure;
- FIG. 7 depicts an exemplary user interface of providing feedback for pronunciation practice, in accordance with some embodiments of the disclosure;
- FIG. 8 depicts an exemplary embodiment of synchronizing an actual audio file to a user's recording, in accordance with some embodiments of the disclosure;
- FIG. 9 depicts an exemplary user interface of sharing a pronunciation recording with another user, in accordance with some embodiments of the disclosure;
- FIG. 10 depicts a flowchart of a process for providing audible pronunciation of a closed captioning word, in accordance with some embodiments of the disclosure;
- FIG. 11 depicts a flowchart of a process for segmenting a content item and associating timestamps with words in a dialogue, in accordance with some embodiments of the disclosure;
- FIG. 12 depicts an exemplary algorithm of generating audio files for words in a dialogue specified within a WebVTT format, in accordance with some embodiments of the disclosure;
- FIG. 13 depicts an exemplary flow for providing feedback to a user's recording, in accordance with some embodiments of the disclosure;
- FIG. 14 depicts an illustrative block diagram of an interactive pronunciation learning system, in accordance with some embodiments of the disclosure; and
- FIG. 15 depicts an illustrative block diagram showing additional details of the system of FIG. 14, in accordance with some embodiments of the disclosure.
- FIG. 1 depicts an exemplary user interface 100 of a content item with a highlighted closed captioning word 102, in accordance with some embodiments of the disclosure. For example, the content item (e.g., the movie "Mulan") is presented via a media application on a user device in response to a user request to display the content item. The media application may be a stand-alone application implemented on user equipment devices 1414 a, 1414 b, 1414 c of FIG. 14. In some embodiments, the processes and embodiments described herein may be performed by media application server 1404 of FIG. 14 or streaming server 1306 of FIG. 13.
- The media application retrieves metadata of the content item from a database of the content item. The metadata of the content item may comprise the dialogue and a respective timestamp corresponding to each word in the dialogue. The media application may also retrieve a closed captioning word file corresponding to the dialogue from the database of the content item. The media application compares the metadata of the content item to the closed captioning word file corresponding to the dialogue. Based on the comparison, the media application determines that at least a portion of the dialogue corresponds to the selected closed captioning word.
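The comparison between the content item metadata and the closed captioning file can be pictured as a lookup from a selected caption word to a timestamped portion of the dialogue. The sketch below assumes a simplified per-word metadata record (text, start, end, audio reference); the field names are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueWord:
    text: str       # word as it appears in the dialogue metadata
    start: float    # start timestamp, in seconds
    end: float      # end timestamp, in seconds
    audio_url: str  # reference to the per-word audio segment


def find_dialogue_portion(selected_caption_word: str,
                          dialogue: List[DialogueWord]) -> Optional[DialogueWord]:
    """Compare a selected closed captioning word against the dialogue
    metadata and return the matching timestamped portion, if any."""
    target = selected_caption_word.strip().lower()
    for word in dialogue:
        if word.text.lower() == target:
            return word
    return None


dialogue = [DialogueWord("wait", 0.10, 0.40, "audio/wait.wav"),
            DialogueWord("and", 0.45, 0.55, "audio/and.wav")]
print(find_dialogue_portion("Wait", dialogue))
```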
- In some embodiments, a video of the content item is displayed with closed captioning words corresponding to dialogue 104 (e.g., "wait and see when we're through") spoken in a first language (e.g., English). In some embodiments, the closed captioning words are selectable via a user interface of a computing device (e.g., a mobile device) remote from the display device (e.g., a TV) that displays the content item. For example, a closed captioning word may be selected via any type of input device, such as a keyboard, mouse, or touchscreen. In some embodiments, the selection of the closed captioning word is made directly on the display device (e.g., a tablet PC).
- In response to receiving the selection of the closed captioning word, the media application highlights the selected word 102 and generates for playback at least a portion of the dialogue corresponding to the selected closed captioning word. As shown in FIG. 1, the selection was made for the closed captioning word "wait" 102. Accordingly, an audible pronunciation of the selected closed captioning word (e.g., "wait") will be played.
- In some embodiments, the audible pronunciation of the selected closed captioning word is different from the standard pronunciation of the selected word. For example, pronunciation may vary drastically based on how the word is pronounced by a particular character in the content item or on the contextual situation of the scene. The intensity with which the word is said (e.g., angry vs. sad), the pitch of the voice (e.g., female character vs. male character), the intonation of the speech (e.g., the hometown of the character or of the actor/actress), and how quickly the word is said (e.g., an urgent scene) may all play a part in the varied pronunciation of a particular word. The present disclosure allows users to hear the pronunciation of a particular word the way it is pronounced in a content item and to learn the pronunciation of the word as a character in the content item would pronounce it.
- In some embodiments, the playback of the content item is paused when a user selection of a closed captioning word is received. For example, when a user selects “wait” 102 in the closed captioning words, the playback of a video of the content item may be paused to play the pronunciation of the selected word. The user may also send a request to pause the video before selecting a closed captioning word.
- FIG. 2 depicts an exemplary user interface 200 of a content item with a highlighted closed captioning phrase 202, in accordance with some embodiments of the disclosure. In some embodiments, a selection may be made for a single word or for multiple words. If a selection is made for a phrase (multiple words), the media application may highlight a phrase comprising a plurality of words instead of highlighting a single word. Humans generally utter 100-130 words per minute and may run multiple words together. Therefore, a listener may hear the pronunciation of a single phrase rather than the individual words. In this case, the end time of a word may be temporally too close to the start time of the subsequent word (e.g., ⅓ second apart). This may prevent the listener from discerning individual words, and the listener may not be able to pinpoint, among the closed captioning words, the particular word that the listener wants to hear again. Alternatively, the system may allow only selections of more than one word.
- In one example, if the user selects "I've," the system may highlight a phrase (e.g., "I've never seen") instead of highlighting only the selected word ("I've") because "I've never seen" is a collection of words that is often uttered together. In another embodiment, the media application determines the temporal proximity of a first set of words ("I've never seen") 202 in the dialogue 204. If the temporal proximity of each word of the first set of words is less than a threshold (e.g., "I've never seen" is uttered so quickly that the gap between the end time of "never" and the start time of "seen" is less than a threshold of 0:00:002), the media application categorizes the first set of words as a first phrase. When the media application receives a selection of at least one word (e.g., "never") of the first set of words (e.g., "I've never seen") via the user interface of the user device, the media application retrieves one or more audio files containing the audible pronunciation of the first phrase ("I've never seen"). The media application generates for output the audible pronunciation of the first phrase, playing the audio for the adjacent words of the first set sequentially.
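One plausible way to implement the temporal-proximity grouping described above is sketched below: adjacent words whose gap (start of the next word minus end of the current word) is below a threshold are merged into a single phrase. The threshold value and the word-tuple representation are assumptions for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def group_into_phrases(words: List[Word], gap_threshold: float = 0.33) -> List[List[Word]]:
    """Group words into phrases when the silence between adjacent words
    is shorter than gap_threshold (roughly a third of a second here)."""
    phrases: List[List[Word]] = []
    for word in words:
        if phrases and (word[1] - phrases[-1][-1][2]) < gap_threshold:
            phrases[-1].append(word)   # too close to the previous word: same phrase
        else:
            phrases.append([word])     # otherwise start a new phrase
    return phrases


dialogue = [("I've", 1.00, 1.20), ("never", 1.25, 1.50), ("seen", 1.55, 1.90),
            ("that", 3.00, 3.20)]
print(group_into_phrases(dialogue))
# "I've never seen" forms one phrase; "that" starts a new one.
```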
- FIG. 3 depicts an exemplary user interface 300 of a content item with non-speech information 302, in accordance with some embodiments of the disclosure. The non-speech information may include non-dialogue text, such as a description of the background scene (e.g., "hair trembles with emotion"). Non-speech information may be provided for hearing-impaired viewers to convey the context of the scene (e.g., somber music). In some embodiments, the non-speech information may be greyed out or otherwise marked so that it is clear to the user that the displayed non-speech information is not part of the dialogue. In exemplary user interface 300, the non-speech information 302 is displayed within brackets. The non-speech information 302 may not be selectable by the user, as these words are not part of the dialogue uttered by a character in the content item. In some embodiments, the non-speech information may be available as an audio file to be output in a voice other than that of a character who appears in the content item.
- FIG. 4 depicts an exemplary user interface 400 of a content item with a slang word 402, in accordance with some embodiments of the disclosure. Some movies include certain words that are pronounced by the characters in a particular way that is different from the typical pronunciation, such as with a different intonation, pitch, or tone. Some users like how a word is pronounced by these characters and want to hear and practice the word the way the characters in the movie pronounce it. Because such slang makes the pronunciation unique, the slang word may appear in the video together with the actual words. For example, in exemplary user interface 400, the slang word 402 (e.g., "Fuhgeddaboudit") may be displayed with the actual word 404 (e.g., "Forget about it"). In some embodiments, the slang word 402 ("Fuhgeddaboudit") may be visually distinguishable from the actual word 404 ("Forget about it"), for example, by being highlighted in a different color or displayed in a different font than the actual words. The present disclosure allows users to hear the pronunciation of a particular word the way it is pronounced in a content item, thereby allowing the user to learn the unique pronunciation of the word like a native speaker or a character in the content item, including distinct audio characteristics such as emotion, pitch, tone, pauses, or intonation.
- FIG. 5 depicts an exemplary user interface 500 of a content item with a list of one or more pronunciation styles 504, 506, 508, in accordance with some embodiments of the disclosure. In exemplary user interface 500, standard American accent 504, Southern accent 506, and Boston accent 508 are available for the phrase "Joey does not share food" 502. The media application generates for display a list of the plurality of pronunciation styles 504, 506, 508. Although exemplary user interface 500 displays particular accents as different pronunciation styles, any dialect or any type of varied pronunciation style may be used. The media application receives a selection of a pronunciation style of the plurality of pronunciation styles. In this exemplary user interface 500, Southern accent 506 was selected. The media application retrieves an audio file containing the audible pronunciation of the selected word in the selected style (e.g., Southern accent). The media application generates for output the audible pronunciation of the selected word in the selected style.
- FIG. 6 depicts an exemplary user interface 600 of a content item with a list of one or more characters who uttered a closed captioning word, in accordance with some embodiments of the disclosure. The media application generates for display exemplary user interface 600, which includes a list of one or more characters of the content item who spoke the selected word or phrase. The media application receives a selection, by the user, of a character of the one or more characters. In this exemplary user interface 600, Jennifer Aniston's voice 604 was selected. The media application retrieves an audio file containing the audible pronunciation of the selected word spoken by the selected character (e.g., Jennifer Aniston). The media application generates for output the retrieved audio file containing the audible pronunciation of the selected word spoken by the selected character.
- FIG. 7 depicts an exemplary user interface 700 for providing feedback 704 on pronunciation practice, in accordance with some embodiments of the disclosure. Exemplary user interface 700 may be provided in accordance with the exemplary user interfaces 100-500 discussed in FIGS. 1-5. After a user pauses the video to hear the pronunciation of a certain word or phrase 702 (e.g., "wait and see when we're through"), the user may practice pronouncing the same word. The user may do so by uttering the word after the media application outputs the audible pronunciation of the selected word. In one embodiment, the user may use a second device, such as a mobile phone or a voice assistant device, that is separate from the display device (e.g., a TV) but located near it. Any device capable of receiving voice input and transmitting the voice input to the streaming server or media application server is suitable for use as the second device.
- A second device (e.g., voice assistant device) 706 remote from the first device (e.g., the display device) may capture the user's voice and create a temporary audio file for the captured voice. The temporary audio file may be in any audio file format, such as a waveform audio file (e.g., .wav), and is transmitted to the server for pronunciation analysis. In some embodiments, the temporary audio file may be analyzed at the client device level by control circuitry 1510 of computing device 1414 a, 1414 b, 1414 c.
- The media application may compare the temporary file corresponding to the captured word to an audio file containing the audible pronunciation of the selected word. The audio file may be retrieved from the database of the content item and includes the pronunciation either in a standard accent of the particular language or in the particular style in which the word is pronounced in the content item. The media application compares the temporary audio file corresponding to the captured word to the audio file containing the audible pronunciation of the selected word to calculate a similarity score. It may do so by synchronizing the time-domain signals of the two files and overlaying their frequency components, as shown in FIG. 8, which is explained in detail below.
- A similarity score may indicate a level of similarity between the user's pronunciation and a standard pronunciation. The higher the similarity score, the closer the user's pronunciation is to the standard pronunciation of the particular word. In some embodiments, the similarity score instead indicates a level of similarity between the user's pronunciation and the pronunciation of a particular style uttered in the content item, that is, the way the character in the content item pronounces the word.
- In some embodiments, if the similarity score is over a certain threshold (e.g., 70%), the media application may present positive feedback in the user interface indicating that the user has done a great job with the pronunciation. As shown in exemplary user interface 700, real-time feedback 704 may be generated for display with details such as comparison points or practice history (e.g., "You are improving! Better than yesterday."). Feedback 704 may also provide tips for pronouncing the word (e.g., "Try to enunciate each word."). Although exemplary feedback 704 is used for illustrative purposes, any kind of feedback for improving the pronunciation may be provided. If the similarity score falls below the threshold, the media application may include constructive feedback with descriptive details that can help with the pronunciation.
- FIG. 8 depicts an exemplary embodiment of synchronizing an actual audio file to a user's recording, in accordance with some embodiments of the disclosure. In some embodiments, the media application synchronizes the time-domain signals of the two files and overlays their frequency components. Based on the comparison, the media application determines how close the two files are. The synchronization and the comparison may be performed by any of the media application, streaming server 1306 of FIG. 13, or media application server 1404 of FIG. 14. In some embodiments, the media application may use Fast Fourier Transform (FFT) algorithms to compute a sequence of signals and convert the digital signals to spectral components.
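A minimal sketch of this comparison, assuming both recordings are mono signals at the same sample rate, is shown below: the signals are time-aligned by cross-correlation and then compared via the cosine similarity of their FFT magnitude spectra. This is one plausible realization rather than the implementation of FIG. 8.

```python
import numpy as np


def similarity_score(reference: np.ndarray, recording: np.ndarray) -> float:
    """Align two time-domain signals and compare their spectral content.

    Returns a value in [0, 1]; higher means the recording is closer to
    the reference pronunciation."""
    # Synchronize: shift the recording by the lag that maximizes correlation.
    lag = np.argmax(np.correlate(reference, recording, mode="full")) - (len(recording) - 1)
    if lag > 0:
        recording = np.pad(recording, (lag, 0))
    elif lag < 0:
        recording = recording[-lag:]

    # Trim both signals to the same length before the FFT.
    n = min(len(reference), len(recording))
    ref_spec = np.abs(np.fft.rfft(reference[:n]))
    rec_spec = np.abs(np.fft.rfft(recording[:n]))

    # Overlay the frequency components via cosine similarity of the spectra.
    denom = np.linalg.norm(ref_spec) * np.linalg.norm(rec_spec)
    return float(np.dot(ref_spec, rec_spec) / denom) if denom else 0.0
```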
- FIG. 9 depicts an exemplary user interface 900 for sharing a pronunciation recording 902, in accordance with some embodiments of the disclosure. Exemplary user interface 900 may be provided in accordance with the embodiment discussed in connection with FIG. 7. A first user (e.g., Joe) may share his or her recording 902 with other users located in a remote location. For example, after the recording is completed, Joe may select a friend 904 to whom he wants to send the recording (e.g., a language teacher or a native speaker) and cause the recording to be sent to that friend (e.g., Serhad, Rae, or Max) by making a selection in the friends list retrieved from Joe's profile data. The selected friend (e.g., Serhad) may perform actions related to the recording, such as playing the recording, rating the recording, providing feedback on the recording, or creating a new recording. The friend may send the feedback or a newly created recording back to Joe for comparison.
- FIG. 10 depicts a flowchart of a process 1000 for providing audible pronunciation of closed captioning words, in accordance with some embodiments of the disclosure. It should be noted that process 1000 may be performed by control circuitry 1502 and/or 1510 of FIG. 14 as instructed by the media application, which may be performed on any client device. In addition, one or more steps of the flowcharts described herein may be incorporated into or combined with one or more steps of the process of FIG. 10.
- At step 1002, control circuitry 1510 generates for output on a first device a content item comprising a dialogue. A content item may be audio-visual content that includes dialogue uttered by a character. At step 1004, control circuitry 1510 generates for display on the first device a closed captioning word corresponding to the dialogue. The closed captioning word may be in the same language as the dialogue. The closed captioning word may be selectable via a user interface of the first device. At step 1006, control circuitry 1510 receives a selection of the closed captioning word via the user interface of the first device (e.g., a laptop). Alternatively, the selection of the closed captioning word may be made via the user interface of a second device different from the first device. In some embodiments, a video of the content item is paused. At step 1008, in response to receiving the selection of the closed captioning word, control circuitry 1510 generates for playback on the first device at least a portion of the dialogue corresponding to the selected closed captioning word. Control circuitry 1510 generates the audible pronunciation of the selected word as uttered by the character in the content item. The audible pronunciation has its own audio characteristics, such as tone, intensity, pauses, intonation, pitch, or any other distinguishable audio attributes that make the pronunciation distinct from the standard pronunciation.
- FIG. 11 depicts a flowchart 1100 of a process for segmenting a content item and associating timestamps with words in a dialogue, in accordance with some embodiments of the disclosure. It should be noted that process 1100 may be performed by control circuitry 1502 and/or 1510 of FIG. 14 as instructed by the media application, which may be performed on any client device. Alternatively, process 1100 may be performed by streaming server 1306 of FIG. 13 or media application server 1404 of FIG. 14. In addition, one or more steps of the flowcharts described herein may be incorporated into or combined with one or more steps of the process of FIG. 11.
- At step 1102, control circuitry 1502 splits the content item into an audio stream and a video stream. At step 1104, control circuitry 1502 segments the audio stream of the content item into a sequence of words using a speech-to-text algorithm to generate an audio word list. A speech-to-text algorithm or a voice recognition algorithm may be used in generating the audio word list. In some embodiments, metadata of the content item comprising closed caption data is retrieved from a database of the content item. The closed caption data includes a text version of the spoken part of the content item (e.g., the dialogue).
- At step 1106, control circuitry 1502 detects whether the closed caption data matches the words being used in the video by comparing the closed caption data and the processed video. For example, a speech detection algorithm or an image processing technique may be used to decipher or read the lips of a character in the video (e.g., a character saying "forget about it") to determine the words that are being used in the video. Additionally, in another embodiment, control circuitry 1502 detects whether words in the audio word list match the words being used in the video.
- At step 1108, control circuitry 1502 maps the closed caption data to the audio word list generated from the audio stream using the speech-to-text algorithm at step 1104. Step 1108 may provide an additional degree of confidence that the closed caption data matches not only the video of the content item but also the audio of the content item. At step 1110, control circuitry 1502 records the audio file, the timestamp information (e.g., a time range) of the word identified within the video, and the link to the corresponding closed caption word as part of the metadata for the video.
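The mapping of step 1108 can be illustrated with a simple sequence alignment between the closed caption words and the speech-to-text word list, here using Python's standard difflib module. The data shapes (a caption word list and recognized words with timestamps) are assumptions made for the sketch.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Recognized = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def map_captions_to_audio(captions: List[str],
                          recognized: List[Recognized]) -> Dict[int, Recognized]:
    """Map each closed caption word (by index) to the matching recognized
    word and its timestamp range from the audio word list."""
    caption_tokens = [w.lower() for w in captions]
    audio_tokens = [w.lower() for w, _, _ in recognized]

    mapping: Dict[int, Recognized] = {}
    matcher = SequenceMatcher(a=caption_tokens, b=audio_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset] = recognized[block.b + offset]
    return mapping


captions = ["wait", "and", "see", "when", "we're", "through"]
recognized = [("wait", 0.10, 0.40), ("and", 0.45, 0.55), ("see", 0.60, 0.85),
              ("when", 0.90, 1.05), ("we're", 1.10, 1.30), ("through", 1.35, 1.70)]
print(map_captions_to_audio(captions, recognized)[0])  # ('wait', 0.1, 0.4)
```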
- FIG. 12 depicts an exemplary algorithm 1200 for generating audio files for words specified within the Web Video Text Tracks (WebVTT) format, according to some embodiments of the disclosure. Exemplary algorithm 1200 includes code for generating the audio files for the dialogue. An audio file may include a word and associated timestamp information specified within the WebVTT format. For example, the phrase "wait and see when we're through" is spoken in the time range of 0:00.100-0:00.400 for three seconds. The range of the words is kept as tuples of words, and each spoken word is assigned a start timestamp and an end timestamp. The media application may create a new tag for each pronunciation and assign to it an audio file associated with the pronunciation of the word and the range of timestamps during which the utterance appears. As shown in FIG. 12, for the word "wait," a range of timestamps may be assigned, with a start timestamp of 0:00:100 and an end timestamp of 0:00:100.567. A newly generated tag for the word "wait" may be associated with the specified start timestamp and end timestamp.
- In some embodiments, the audio files are part of the HLS (HTTP Live Streaming) manifest for SVOD (Subscription Video-On-Demand). The algorithm may be implemented in various formats, such as Secure-Reliable Transport (SRT) or Timed-Text Markup Language (TTML). In some embodiments, the algorithm may be implemented using other streaming protocols such as HLS, MPEG DASH, HSS, HDS, etc.
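For illustration only, the sketch below represents per-word cues in a WebVTT-like text block and parses them into (word, start, end) tuples; the cue values and parsing approach are assumptions and do not reproduce exemplary algorithm 1200.

```python
import re
from typing import List, Tuple

# A WebVTT-like fragment with one cue per spoken word (illustrative values).
WEBVTT_SNIPPET = """WEBVTT

00:00:00.100 --> 00:00:00.567
wait

00:00:00.600 --> 00:00:00.750
and

00:00:00.780 --> 00:00:01.100
see
"""

CUE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(\S+)")


def to_seconds(timestamp: str) -> float:
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_word_cues(vtt: str) -> List[Tuple[str, float, float]]:
    """Return (word, start, end) tuples for each per-word cue."""
    return [(word, to_seconds(start), to_seconds(end))
            for start, end, word in CUE.findall(vtt)]


print(parse_word_cues(WEBVTT_SNIPPET))
```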
- FIG. 13 depicts an exemplary flow 1300 for providing feedback on a user's recording, in accordance with some embodiments of the disclosure. At step 1310, a streaming server transmits a content item with closed captioning words or subtitles to a streaming video client 1304 in response to a user request to display the content item (e.g., the user plays the movie). A streaming server may be a server that provides content items to computing devices over communication network 1412. In some embodiments, the streaming server may be media application server 1404. A streaming video client can be a rendering device such as a TV or a laptop. In some embodiments, the streaming video client may be any of computing devices 1414 a, 1414 b, 1414 c. A remote device 1302 can be any device that is capable of providing input, selecting text, or capturing vocal input. In some embodiments, streaming video client 1304 and remote device 1302 can be integrated as a single device.
- The content item is generated for display on streaming video client 1304. At step 1312, a user may send a request to pause the video to hear the pronunciation of a specific word. In some embodiments, streaming video client 1304 may relay the request from remote device 1302 to streaming server 1306. At step 1314, the user may navigate between closed captioning words displayed on a screen of streaming video client 1304. At step 1316, the user may select a word or a phrase within the closed captioning words at remote device 1302 (e.g., by double-clicking a word). In one embodiment, streaming video client 1304 may relay the selection made at remote device 1302 to streaming server 1306. In another embodiment, the selection may be made via a graphical user interface of streaming video client 1304 (e.g., a TV touchscreen). At step 1318, in response to receiving the selection, streaming server 1306 queries for an audio file of the selected word by looking up the manifest or metadata associated with the content item.
- At step 1320, streaming server 1306 sends an audio file containing the audible pronunciation of the selected word to streaming video client 1304. At step 1322, streaming video client 1304 plays the audible pronunciation of the selected word. If the user wishes to practice the pronunciation, the user may repeat the word after streaming video client 1304 plays it. At step 1324, the pronounced word may be captured as a recording at remote device 1302 and sent to streaming server 1306. In one embodiment, streaming video client 1304 may relay the recording made at remote device 1302 to streaming server 1306. In another embodiment, the user's pronunciation is captured at streaming video client 1304 (e.g., using a microphone of a laptop). At step 1326, streaming server 1306 compares the user's recording to the audio file of the selected word to calculate a similarity score. At step 1328, streaming server 1306 transmits the comparison result (e.g., real-time feedback) to streaming video client 1304 based on the calculated similarity score.
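The server side of this exchange might look like the minimal sketch below, which uses Flask purely for illustration; the route names, the WORD_AUDIO manifest lookup, and the stubbed similarity_score helper are assumptions, as the disclosure does not specify a web framework or API.

```python
from flask import Flask, jsonify, request, send_file

app = Flask(__name__)

# Hypothetical manifest lookup: closed captioning word -> per-word audio file.
WORD_AUDIO = {"wait": "audio/mulan_wait.wav"}


def similarity_score(reference_path: str, recording_bytes: bytes) -> float:
    """Stub for the comparison step: a real system would load both audio
    signals and compare them (e.g., with an FFT-based method)."""
    return 0.82  # fixed value purely for illustration


@app.route("/pronunciation/<word>")
def get_pronunciation(word: str):
    """Steps 1318-1320: look up the selected word in the manifest and
    return the audio file containing its audible pronunciation."""
    path = WORD_AUDIO.get(word.lower())
    if path is None:
        return jsonify(error="word not found in manifest"), 404
    return send_file(path, mimetype="audio/wav")


@app.route("/pronunciation/<word>/score", methods=["POST"])
def score_recording(word: str):
    """Steps 1324-1328: compare the user's recording to the stored audio
    and return real-time feedback based on the similarity score."""
    recording = request.files["recording"].read()
    score = similarity_score(WORD_AUDIO.get(word.lower(), ""), recording)
    return jsonify(word=word, similarity=score,
                   feedback="Great job!" if score >= 0.70 else "Keep practicing.")
```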
- FIG. 14 shows an illustrative block diagram of an interactive pronunciation learning system, in accordance with some embodiments of the disclosure. In one aspect, system 1400 includes one or more of media application server 1404, content item source 1406, and communication network 1412.
- Communication network 1412 may be one or more networks including the Internet, a mobile phone network, a mobile voice or data network (e.g., a 4G or LTE network), a cable network, a public switched telephone network, or other types of communication network or combinations of communication networks. Communication network 1412 includes one or more communication paths, such as a satellite path, a fiber-optic path, a cable path, a path that supports Internet communications (e.g., IPTV), free-space connections (e.g., for broadcast or other wireless signals), or any other suitable wired or wireless communication path or combination of such paths. Communication network 1412 communicatively couples various components of system 1400 to one another. For instance, server 1404 may be communicatively coupled to the video-hosting web server and content item source 1406 via communication network 1412.
- Video-hosting web server (not shown) hosts one or more video websites, such as YouTube™ and/or the like, that enable users to download or stream videos, video clips, and/or other types of content. In addition to enabling users to download and view content, the video websites also provide access to data regarding downloaded content.
- Content item source 1406 may store content item-related data from one or more types of content providers or originators of content (e.g., a television broadcaster, a Webcast provider, an on-demand content provider, an over-the-top content provider, or another provider of content). Content item source 1406 includes a content item, a manifest associated with the content item, metadata associated with the content item, closed caption data or subtitles, or any other related material associated with the content item. The metadata or manifest of the content item may include, among other information of the content item, the dialogue and associated timestamp information for each word in the dialogue. A remote media server may be used to store different types of content in a location remote from computing device 1414 (described below). Systems and methods for remote storage of content and providing remotely stored content to user equipment are discussed in greater detail in connection with Ellis et al., U.S. Pat. No. 7,761,892, issued Jul. 20, 2010, which is hereby incorporated by reference herein in its entirety.
- User data source may provide user-related data, such as the user profile or preference data described herein (e.g., preferred selection options, previous option selections, preferred content items, preferred genres, preferred characters or actors, and the user's friends list), to computing device 1414, server 1404, and/or the video-hosting web server using any suitable approach. In some embodiments, content item source 1406 and the user data source may be integrated as one device.
- In some embodiments, content item data from content item source 1406 may be provided to computing device 1414 using a client/server approach. For example, computing device 1414 may pull content item data from a server (e.g., server 1404), or a server may push content item data to computing device 1414. In some embodiments, a client application residing on computing device 1414 may initiate sessions with the user data source to obtain content item data when needed, e.g., when the content item data is out of date or when computing device 1414 receives a request from the user to receive data.
- Content and/or content item data delivered to computing device 1414 may be over-the-top (OTT) content. OTT content delivery allows Internet-enabled user devices, such as computing device 1414, to receive content that is transferred over the Internet, including any content described above, in addition to content received over cable or satellite connections. OTT content is delivered via an Internet connection provided by an Internet service provider (ISP), but a third party distributes the content. The ISP may not be responsible for the viewing abilities, copyrights, or redistribution of the content, and may only transfer IP packets provided by the OTT content provider. Examples of OTT content providers include YouTube™, Netflix™, and Hulu™, which provide audio and video via IP packets. YouTube™ is a trademark owned by Google Inc., Netflix™ is a trademark owned by Netflix Inc., and Hulu™ is a trademark owned by Hulu, LLC. OTT content providers may additionally or alternatively provide the content item data described above. In addition to content and/or content item data, providers of OTT content can distribute applications (e.g., web-based applications or cloud-based applications), or the content can be displayed by applications stored on computing device 1414.
- As described in further detail below, media application server 1404 accesses the content of the video website(s) hosted by the video-hosting web server and, based on the accessed content, generates a variety of types of data, such as metadata or a manifest (e.g., terms, associations between terms and corresponding media content identifiers, dialogue, closed captions, subtitles, and/or the like), that can be accessed to facilitate the retrieving or searching of media content made available by content item source 1406. In some embodiments, server 1404 accesses the metadata or manifest of the content item from content item source 1406. The metadata or manifest of the content item may be generated by the video-hosting web server or media application server 1404. In some embodiments, the metadata or manifest of the content item may be generated by a third-party generator that has access to the content item.
- System 1400 also includes one or more computing devices 1414, such as user television equipment 1414 a (e.g., a set-top box), user computer equipment 1414 b, and wireless user communication device 1414 c (e.g., a smartphone device or a remote control), which users can use to interact with server 1404, the user data source, and/or content item source 1406, via communication network 1412, to search for desired media content. For instance, in some aspects, server 1404 may provide a user interface via computing device 1414, by which a user can input a query for a particular item of media content made available by content item source 1406, and generate a response to the query by accessing and/or processing data and/or the manifest. Although FIG. 14 shows one of each component, in various examples, system 1400 may include multiples of one or more illustrated components. For instance, system 1400 may include multiple video-hosting web servers, and media application server 1404 may aggregate data from the multiple video websites hosted by the multiple video-hosting web servers, respectively.
- FIG. 15 is an illustrative block diagram showing additional details of the system 1400 of FIG. 14, in accordance with some embodiments of the disclosure. In particular, server 1404 includes control circuitry 1502 and Input/Output (I/O) path 1508, and control circuitry 1502 includes storage 1504 and processing circuitry 1506. Computing device 1414 includes control circuitry 1510, I/O path 1516, speaker 1518, display 1520, camera 1524, microphone 1526, and user input interface 1522. Control circuitry 1510 includes storage 1512 and processing circuitry 1514. Control circuitry 1502 and/or 1510 may be based on any suitable processing circuitry, such as processing circuitry 1506 and/or 1514.
- As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores). In some embodiments, processing circuitry may be distributed across multiple separate processors, for example, multiple of the same type of processors (e.g., two Intel Core i9 processors) or multiple different processors (e.g., an Intel Core i7 processor and an Intel Core i9 processor).
- Each of storage 1504, storage 1512, and/or the storages of other components of system 1400 (e.g., storages of content item source 1406, the user data source, and/or the like) may be an electronic storage device. As referred to herein, the phrase "electronic storage device" or "storage device" should be understood to mean any device for storing electronic data, computer software, or firmware, such as random-access memory, read-only memory, hard drives, optical drives, digital video disc (DVD) recorders, compact disc (CD) recorders, BLU-RAY disc (BD) recorders, BLU-RAY 3D disc recorders, digital video recorders (DVR, sometimes called a personal video recorder, or PVR), solid-state devices, quantum storage devices, gaming consoles, gaming media, or any other suitable fixed or removable storage devices, and/or any combination of the same. Each of storage 1504, storage 1512, and/or the storages of other components of system 1400 may be used to store various types of content, content item data, and/or other types of data. Non-volatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Cloud-based storage may be used to supplement storages 1504 and 1512 or instead of storages 1504 and 1512.
- In some embodiments, control circuitry 1502 and/or 1510 executes instructions for an application stored in memory (e.g., storage 1504 and/or 1512). Specifically, control circuitry 1502 and/or 1510 may be instructed by the application to perform the functions discussed herein. In some implementations, any action performed by control circuitry 1502 and/or 1510 may be based on instructions received from the application. For example, the application may be implemented as software or a set of executable instructions that may be stored in storage 1504 and/or 1512 and executed by control circuitry 1502 and/or 1510. In some embodiments, the application may be a client/server application where only a client application resides on computing device 1414, and a server application resides on server 1404.
- The application (e.g., the media application) may be implemented using any suitable architecture. For example, it may be a stand-alone application wholly implemented on computing device 1414. For example, the media application may be implemented as software or a set of executable instructions, which may be stored in non-transitory storage 1512 and executed by control circuitry 1510 of a user device 1414. In such an approach, instructions for the application are stored locally (e.g., in storage 1512), and data for use by the application is downloaded on a periodic basis (e.g., from an out-of-band feed, from an Internet resource, or using another suitable approach). Control circuitry 1510 may retrieve instructions for the application from storage 1512 and process the instructions to perform the functionality described herein. Based on the processed instructions, control circuitry 1510 may determine what action to perform when input is received from user input interface 1522.
- In client/server-based embodiments, control circuitry 1510 may include communication circuitry suitable for communicating with an application server (e.g., server 1404) or other networks or servers. The instructions for carrying out the functionality described herein may be stored on the application server. Communication circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, a telephone modem, an Ethernet card, or a wireless modem for communication with other equipment, or any other suitable communication circuitry. Such communication may involve the Internet or any other suitable communication networks or paths (e.g., communication network 1412).
- In another example of a client/server-based application, control circuitry 1510 runs a web browser that interprets web pages provided by a remote server (e.g., server 1404). For example, the remote server may store the instructions for the application in a storage device. The remote server may process the stored instructions using circuitry (e.g., control circuitry 1502) and generate the displays discussed above and below. Computing device 1414 may display the content via display 1520. This way, the processing of the instructions is performed remotely (e.g., by server 1404) while the resulting displays are provided locally on computing device 1414. Computing device 1414 may receive inputs from the user via input interface 1522 and transmit those inputs to the remote server for processing and generating the corresponding displays.
- A user may send instructions to control circuitry 1502 and/or 1510 using user input interface 1522. User input interface 1522 may be any suitable user interface, such as a remote control, trackball, keypad, keyboard, touchscreen, touchpad, stylus input, joystick, voice recognition interface, or other user input interfaces. User input interface 1522 may be integrated with or combined with display 1520, which may be a monitor, a television, a liquid crystal display (LCD), an electronic ink display, or any other equipment suitable for displaying visual images.
- Camera 1524 of computing device 1414 may capture an image or a video. Microphone 1526 of computing device 1414 may detect sound in proximity to computing device 1414 and convert the sound to electrical signals.
- Server 1404 and computing device 1414 may receive content and data via I/O paths 1508 and 1516, respectively. I/O paths 1508 and 1516 may provide the content and data to control circuitry 1502 and 1510, respectively, and may be used to send and receive commands, requests, and other suitable data. I/O paths 1508 and 1516 couple control circuitry 1502, 1510 (and specifically processing circuitry 1506, 1514) to one or more communication paths (described below). I/O functions may be provided by one or more of these communication paths but are shown as single paths in FIG. 15 to avoid overcomplicating the drawing.
- The systems and processes discussed above are intended to be illustrative and not limiting. One skilled in the art would appreciate that the actions of the processes discussed herein may be omitted, modified, combined, and/or rearranged, and any additional actions may be performed without departing from the scope of the invention. More generally, the above disclosure is meant to be exemplary and not limiting. Only the claims that follow are meant to set bounds as to what the present disclosure includes. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, done in different orders, or done in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.
Patent Citations (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4570232A (en) * | 1981-12-21 | 1986-02-11 | Nippon Telegraph & Telephone Public Corporation | Speech recognition apparatus |
US5598557A (en) * | 1992-09-22 | 1997-01-28 | Caere Corporation | Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files |
US6098082A (en) * | 1996-07-15 | 2000-08-01 | At&T Corp | Method for automatically providing a compressed rendition of a video program in a format suitable for electronic searching and retrieval |
US6085160A (en) * | 1998-07-10 | 2000-07-04 | Lernout & Hauspie Speech Products N.V. | Language independent speech recognition |
US7761892B2 (en) * | 1998-07-14 | 2010-07-20 | United Video Properties, Inc. | Client server based interactive television program guide system with remote server recording |
US20020055950A1 (en) * | 1998-12-23 | 2002-05-09 | Arabesque Communications, Inc. | Synchronizing audio and text of multimedia segments |
US6473778B1 (en) * | 1998-12-24 | 2002-10-29 | At&T Corporation | Generating hypermedia documents from transcriptions of television programs using parallel text alignment |
US20060015339A1 (en) * | 1999-03-05 | 2006-01-19 | Canon Kabushiki Kaisha | Database annotation and retrieval |
US6442518B1 (en) * | 1999-07-14 | 2002-08-27 | Compaq Information Technologies Group, L.P. | Method for refining time alignments of closed captions |
US7047191B2 (en) * | 2000-03-06 | 2006-05-16 | Rochester Institute Of Technology | Method and system for providing automated captioning for AV signals |
US20020093591A1 (en) * | 2000-12-12 | 2002-07-18 | Nec Usa, Inc. | Creating audio-centric, imagecentric, and integrated audio visual summaries |
US7065524B1 (en) * | 2001-03-30 | 2006-06-20 | Pharsight Corporation | Identification and correction of confounders in a statistical analysis |
US7035468B2 (en) * | 2001-04-20 | 2006-04-25 | Front Porch Digital Inc. | Methods and apparatus for archiving, indexing and accessing audio and video data |
US20040096110A1 (en) * | 2001-04-20 | 2004-05-20 | Front Porch Digital Inc. | Methods and apparatus for archiving, indexing and accessing audio and video data |
US20030206717A1 (en) * | 2001-04-20 | 2003-11-06 | Front Porch Digital Inc. | Methods and apparatus for indexing and archiving encoded audio/video data |
US20080262996A1 (en) * | 2001-04-20 | 2008-10-23 | Front Porch Digital, Inc. | Methods and apparatus for indexing and archiving encoded audio/video data |
US7110664B2 (en) * | 2001-04-20 | 2006-09-19 | Front Porch Digital, Inc. | Methods and apparatus for indexing and archiving encoded audio-video data |
US7908628B2 (en) * | 2001-08-03 | 2011-03-15 | Comcast Ip Holdings I, Llc | Video and digital multimedia aggregator content coding and formatting |
US20030025832A1 (en) * | 2001-08-03 | 2003-02-06 | Swart William D. | Video and digital multimedia aggregator content coding and formatting |
US20030061028A1 (en) * | 2001-09-21 | 2003-03-27 | Knumi Inc. | Tool for automatically mapping multimedia annotations to ontologies |
US7092888B1 (en) * | 2001-10-26 | 2006-08-15 | Verizon Corporate Services Group Inc. | Unsupervised training in natural language call routing |
US20050227614A1 (en) * | 2001-12-24 | 2005-10-13 | Hosking Ian M | Captioning system |
US8248528B2 (en) * | 2001-12-24 | 2012-08-21 | Intrasonics S.A.R.L. | Captioning system |
US20030169366A1 (en) * | 2002-03-08 | 2003-09-11 | Umberto Lenzi | Method and apparatus for control of closed captioning |
US20070124788A1 (en) * | 2004-11-25 | 2007-05-31 | Erland Wittkoter | Appliance and method for client-sided synchronization of audio/video content and external data |
US7739253B1 (en) * | 2005-04-21 | 2010-06-15 | Sonicwall, Inc. | Link-based content ratings of pages |
US20070124147A1 (en) * | 2005-11-30 | 2007-05-31 | International Business Machines Corporation | Methods and apparatus for use in speech recognition systems for identifying unknown words and for adding previously unknown words to vocabularies and grammars of speech recognition systems |
US20080066138A1 (en) * | 2006-09-13 | 2008-03-13 | Nortel Networks Limited | Closed captioning language translation |
US8209724B2 (en) * | 2007-04-25 | 2012-06-26 | Samsung Electronics Co., Ltd. | Method and system for providing access to information of potential interest to a user |
US20080266449A1 (en) * | 2007-04-25 | 2008-10-30 | Samsung Electronics Co., Ltd. | Method and system for providing access to information of potential interest to a user |
US20090171662A1 (en) * | 2007-12-27 | 2009-07-02 | Sehda, Inc. | Robust Information Extraction from Utterances |
US7509385B1 (en) * | 2008-05-29 | 2009-03-24 | International Business Machines Corporation | Method of system for creating an electronic message |
US8131545B1 (en) * | 2008-09-25 | 2012-03-06 | Google Inc. | Aligning a transcript to audio data |
US20190303797A1 (en) * | 2018-03-30 | 2019-10-03 | International Business Machines Corporation | System and method for cognitive multilingual speech training and recognition |
US11228810B1 (en) * | 2019-04-22 | 2022-01-18 | Matan Arazi | System, method, and program product for interactively prompting user decisions |
US20220059082A1 (en) * | 2020-08-21 | 2022-02-24 | International Business Machines Corporation | Multiplicative integration in neural network transducer models for end-to-end speech recognition |
US20220093103A1 (en) * | 2020-09-23 | 2022-03-24 | Naver Corporation | Method, system, and computer-readable recording medium for managing text transcript and memo for audio file |
US20220223066A1 (en) * | 2021-01-08 | 2022-07-14 | Ping An Technology (Shenzhen) Co., Ltd. | Method, device, and computer program product for english pronunciation assessment |