CN118830020A - Structured video documentation - Google Patents
- Publication number: CN118830020A (application CN202380027426.3A)
- Authority
- CN
- China
- Prior art keywords
- query
- response
- audio
- audio data
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/7834 — Information retrieval of video data: retrieval characterised by metadata automatically derived from the content, using audio features
- G06F16/739 — Information retrieval of video data: presentation of query results in the form of a video summary, e.g. a video sequence, a composite still image, or synthesized frames
- G06F16/74 — Information retrieval of video data: browsing; visualisation therefor
- G06F16/7844 — Information retrieval of video data: retrieval characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or a transcript of audio data
- G06F40/169 — Handling natural language data: annotation, e.g. comment data or footnotes
- G06F40/30 — Handling natural language data: semantic analysis
- G06N3/096 — Neural networks; learning methods: transfer learning
- G10L15/04 — Speech recognition: segmentation; word boundary detection
- G10L15/16 — Speech recognition: speech classification or search using artificial neural networks
- G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26 — Speech recognition: speech-to-text systems
Description
Technical Field
The present disclosure relates to structured video documents.
Background Art
While video is a common way for users to consume entertainment, news, and educational content, it remains challenging for users to use video as an information medium due to the limitations on searching and recalling the content of a video. For information-based tasks, users typically interact with a timeline-based video player user interface to scrub forward/backward through a video to locate specific content that may be of interest. To some extent, the ability to generate transcripts/captions for the dialogue in a video has improved the ability to search for content in the video by allowing users to enter keyword searches that locate relevant content in the transcripts/captions. However, these user interfaces that use transcripts/captions to search for content lack the ability to semantically understand queries that are spoken (or typed) about specific content in a video, much less the ability to fulfill queries using semantically relevant information.
Summary of the Invention
One aspect of the present disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations that include: receiving a content feed including audio data corresponding to speech utterances; and processing the content feed to generate a semantically rich structured document. The structured document includes a transcription of the speech utterances and includes a plurality of words each aligned with a corresponding audio segment of the audio data, the corresponding audio segment indicating a time at which the word was recognized in the audio data. During playback of the content feed, the operations further include: receiving a query from a user requesting information contained in the content feed; and processing, by a large language model, the query and the structured document to generate a response to the query. Here, the response conveys the requested information contained in the content feed. The operations further include providing the response to the query for output from a user device associated with the user.
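The word-level alignment described above can be sketched as a simple data structure. The class and field names below are illustrative assumptions, not terminology from the disclosure.

```python
# Sketch of a word-aligned transcript: each recognized word is paired with
# the audio segment in which it was heard. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class AlignedWord:
    word: str
    start_ms: int  # start of the corresponding audio segment
    end_ms: int    # end of the corresponding audio segment


@dataclass
class StructuredDocument:
    words: list[AlignedWord]

    def transcript(self) -> str:
        # The plain transcription is recoverable by joining the aligned words.
        return " ".join(w.word for w in self.words)


doc = StructuredDocument(words=[
    AlignedWord("add", 0, 400),
    AlignedWord("half", 400, 750),
    AlignedWord("a", 750, 820),
    AlignedWord("teaspoon", 820, 1400),
    AlignedWord("of", 1400, 1500),
    AlignedWord("cumin", 1500, 2100),
])
print(doc.transcript())  # add half a teaspoon of cumin
```

Because every word carries its segment times, later operations (replay, annotation, indexing) can work purely over this structure.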
Implementations of the present disclosure may include one or more of the following optional features. In some implementations, the operations further include: extracting a segment of the transcription that includes the requested information conveyed by the response to the query, the segment of the transcription being bounded by a starting word and an ending word; identifying a starting audio segment of the audio data as the corresponding audio segment of the audio data aligned with the starting word bounding the segment of the transcription; and identifying an ending audio segment of the audio data as the corresponding audio segment of the audio data aligned with the ending word bounding the segment of the transcription. In these implementations, providing the response to the query includes replaying the audio data, from the user device associated with the user, from the starting audio segment of the audio data to the ending audio segment of the audio data. The content feed may further include image data that includes a plurality of image frames, wherein the operations further include pausing playback of the plurality of image frames of the image data while replaying the audio data from the starting audio segment of the audio data to the ending audio segment of the audio data.
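A minimal sketch of mapping a bounded transcription segment back to audio replay bounds, assuming word-level `(word, start_ms, end_ms)` alignments; the function name and tuple layout are hypothetical.

```python
# Map a transcript fragment (bounded by a start word and an end word) back
# to the audio span to replay. Illustrative sketch, not the patent's code.
def replay_bounds(words, start_idx, end_idx):
    """words: list of (word, start_ms, end_ms) alignments.
    Returns (start_ms, end_ms) of the audio span covering
    words[start_idx..end_idx] inclusive."""
    start_ms = words[start_idx][1]  # segment aligned with the starting word
    end_ms = words[end_idx][2]     # segment aligned with the ending word
    return start_ms, end_ms


words = [("add", 0, 400), ("half", 400, 750), ("a", 750, 820),
         ("teaspoon", 820, 1400), ("of", 1400, 1500), ("cumin", 1500, 2100)]
# Fragment "half ... cumin" -> replay audio from 400 ms to 2100 ms.
print(replay_bounds(words, 1, 5))  # (400, 2100)
```

During this replay, a player would pause the image frames, per the optional feature described above.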
In some examples, the content feed further includes image data that includes a plurality of image frames, and the semantically rich structured document further includes creator-provided text recognized in one or more of the plurality of image frames. Here, the creator-provided text is aligned with a corresponding audio segment of the audio data to indicate a time at which the creator-provided text was recognized in the one or more image frames. In these examples, processing the content feed to generate the semantically rich structured document may further include annotating the transcription of the speech utterances with the creator-provided text by inserting the creator-provided text between a pair of adjacent words in the transcription based on the corresponding audio segment of the audio data aligned with the creator-provided text recognized in the one or more image frames.
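The annotation step can be illustrated as inserting on-screen text between the pair of adjacent words whose segment times bracket its appearance. The naive timestamp comparison below is an assumption made purely for illustration.

```python
# Annotate a transcript with creator-provided on-screen text by inserting
# it between the adjacent words that bracket its appearance time.
# Illustrative sketch; names and marker format are hypothetical.
def annotate(words, overlay_text, overlay_ms):
    """words: list of (word, start_ms, end_ms). Insert overlay_text
    before the first word whose segment starts at or after overlay_ms."""
    out = []
    inserted = False
    for word, start, end in words:
        if not inserted and start >= overlay_ms:
            out.append(f"[on-screen: {overlay_text}]")
            inserted = True
        out.append(word)
    if not inserted:  # text appeared after the last word
        out.append(f"[on-screen: {overlay_text}]")
    return " ".join(out)


words = [("mix", 0, 300), ("the", 300, 400), ("spices", 400, 900)]
print(annotate(words, "1/2 tsp cumin", 350))
# mix the [on-screen: 1/2 tsp cumin] spices
```

The annotated transcript then carries visually conveyed facts (e.g., ingredient proportions) that were never spoken aloud.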
The response to the query may include a textual response that conveys the requested information as a coherent, focused response to the query. In some implementations, the operations further include performing text-to-speech conversion on the textual response to generate a synthesized speech representation of the response to the query, wherein providing the response to the query for output from the user device includes audibly outputting the synthesized speech representation of the response to the query from the user device. In these implementations, the operations may further include pausing playback of the content feed while the synthesized speech representation of the response to the query is audibly output from the user device. Further, the textual response to the query may further include one or more references to source material related to the requested information.
In some examples, the large language model includes a pre-trained large language model, and few-shot learning is performed using the structured document as context for the query to generate the response to the query. The query may include a question in natural language, and the response to the query may include a natural-language response to the question.
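One plausible way to use the structured document as in-context grounding for a pre-trained model is to inline it into the prompt. The prompt template below is an assumption, not a format from the disclosure, and no particular model API is implied.

```python
# Hedged sketch of prompting a pre-trained LLM with the structured document
# as context for the user's natural-language query. The template wording is
# an illustrative assumption.
def build_prompt(structured_doc: str, query: str) -> str:
    return (
        "Answer the question using only the video document below.\n"
        "Cite the speaker or on-screen text where possible.\n\n"
        f"Document:\n{structured_doc}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "Speaker 1: mix the spices [on-screen: 1/2 tsp cumin]",
    "How much cumin?",
)
print(prompt.splitlines()[-1])  # Answer:
```

The model's completion after "Answer:" would then serve as the coherent, focused textual response described above.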
In some implementations, processing the content feed to generate the semantically rich structured document includes: segmenting the audio data into a plurality of audio segments; performing speaker diarization on the plurality of audio segments to predict a diarization result that includes a corresponding speaker label assigned to each audio segment; and indexing the transcription of the speech utterances using the corresponding speaker label assigned to each audio segment segmented from the audio data.
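The diarization-based indexing can be sketched as grouping words under the speaker label of their segment; the input representation below is an illustrative assumption.

```python
# Index a transcription by speaker label after diarization: each audio
# segment carries a predicted speaker label, and the words recognized in
# that segment are grouped under it. Illustrative types and names.
from collections import defaultdict


def index_by_speaker(segments):
    """segments: list of (speaker_label, [words...]) in playback order."""
    index = defaultdict(list)
    for label, segment_words in segments:
        index[label].extend(segment_words)
    return dict(index)


segments = [
    ("speaker_1", ["first", "add", "the", "cumin"]),
    ("speaker_2", ["how", "much"]),
    ("speaker_1", ["half", "a", "teaspoon"]),
]
print(index_by_speaker(segments)["speaker_1"])
# ['first', 'add', 'the', 'cumin', 'half', 'a', 'teaspoon']
```

Such an index lets queries be scoped to what a particular speaker said, rather than to the undifferentiated transcript.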
Another aspect of the present disclosure provides a system that includes data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations that include: receiving a content feed including audio data corresponding to speech utterances; and processing the content feed to generate a semantically rich structured document. The structured document includes a transcription of the speech utterances and includes a plurality of words each aligned with a corresponding audio segment of the audio data, the corresponding audio segment indicating a time at which the word was recognized in the audio data. During playback of the content feed, the operations further include: receiving a query from a user requesting information contained in the content feed; and processing, by a large language model, the query and the structured document to generate a response to the query. Here, the response conveys the requested information contained in the content feed. The operations further include providing the response to the query for output from a user device associated with the user.
This aspect may include one or more of the following optional features. In some implementations, the operations further include: extracting a segment of the transcription that includes the requested information conveyed by the response to the query, the segment of the transcription being bounded by a starting word and an ending word; identifying a starting audio segment of the audio data as the corresponding audio segment of the audio data aligned with the starting word bounding the segment of the transcription; and identifying an ending audio segment of the audio data as the corresponding audio segment of the audio data aligned with the ending word bounding the segment of the transcription. In these implementations, providing the response to the query includes replaying the audio data, from the user device associated with the user, from the starting audio segment of the audio data to the ending audio segment of the audio data. The content feed may further include image data that includes a plurality of image frames, wherein the operations further include pausing playback of the plurality of image frames of the image data while replaying the audio data from the starting audio segment of the audio data to the ending audio segment of the audio data.
In some examples, the content feed further includes image data that includes a plurality of image frames, and the semantically rich structured document further includes creator-provided text recognized in one or more of the plurality of image frames. Here, the creator-provided text is aligned with a corresponding audio segment of the audio data to indicate a time at which the creator-provided text was recognized in the one or more image frames. In these examples, processing the content feed to generate the semantically rich structured document may further include annotating the transcription of the speech utterances with the creator-provided text by inserting the creator-provided text between a pair of adjacent words in the transcription based on the corresponding audio segment of the audio data aligned with the creator-provided text recognized in the one or more image frames.
The response to the query may include a textual response that conveys the requested information as a coherent, focused response to the query. In some implementations, the operations further include performing text-to-speech conversion on the textual response to generate a synthesized speech representation of the response to the query, wherein providing the response to the query for output from the user device includes audibly outputting the synthesized speech representation of the response to the query from the user device. In these implementations, the operations may further include pausing playback of the content feed while the synthesized speech representation of the response to the query is audibly output from the user device. Further, the textual response to the query may further include one or more references to source material related to the requested information.
In some examples, the large language model includes a pre-trained large language model, and few-shot learning is performed using the structured document as context for the query to generate the response to the query. The query may include a question in natural language, and the response to the query may include a natural-language response to the question.
In some implementations, processing the content feed to generate the semantically rich structured document includes: segmenting the audio data into a plurality of audio segments; performing speaker diarization on the plurality of audio segments to predict a diarization result that includes a corresponding speaker label assigned to each audio segment; and indexing the transcription of the speech utterances using the corresponding speaker label assigned to each audio segment segmented from the audio data.
The details of one or more implementations of the present disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an example system for allowing a user to use a structured document generated for a video to interact with the video.
FIG. 2 is a schematic diagram of an example document structurer for generating a structured document from audio data and image data of an audiovisual feed.
FIG. 3 is a schematic diagram of an example structured document including a transcription, creator-provided text, and an annotated transcription.
FIG. 4 is a schematic diagram of an example video interface that presents information from a structured document generated for a video during playback of the video.
FIG. 5 is a flowchart of an example arrangement of operations for a method of using a structured document to interact with a content feed during playback of the content feed.
FIG. 6 is a schematic diagram of an example computing device that may be used to implement the systems and methods described herein.
Like reference numbers in the various drawings indicate like elements.
Detailed Description
Video players employed by media playback applications and web browsers permit users to provide commands for controlling playback of a video. For example, a user may play/pause/stop the video, and scan forward/backward via dedicated buttons/commands or by scrubbing through the video timeline. Recent advances in automatic speech recognition (ASR) have made it possible for users to provide these video playback commands by voice. The video timeline feature enables users to preview frame-by-frame visual content, but requires the user to scrub forward and backward through the timeline multiple times to locate content of interest. For example, if a user is watching an instructional cooking video for preparing a particular recipe, in which an actor/instructor/participant (also referred to as a "speaker") speaks a list of ingredients and their respective proportions, and the user did not have time to internalize the proportion required for one of the ingredients, the user would have to manually scrub backward in the video to repeat playback of the instructor speaking the list of ingredients. Clearly, scrubbing through the timeline to locate content of interest is an inefficient and time-consuming process for the user. Moreover, interactive timeline searching is limited to the frame-by-frame visual content of the video and does not surface audio content or higher-level content such as plot/scene descriptions or topics.
Some video player user interfaces utilize text transcriptions and captions of the audio content in a video to support keyword searches entered by the user. In the above example in which the user is watching the instructional cooking video, when the speaker speaks the proportion of cumin, the user can enter a (spoken or typed) keyword search, "how much cumin?", to ascertain the proportion of cumin from the transcription of the audio content. However, suppose the instructor/actor merely referenced the ingredients by name without specifying the proportion of each ingredient required by the recipe, and the creator of the instructional cooking video instead presented/overlaid a creator-provided graphic that visually conveys the proportion of each ingredient. The user would then be unable to ascertain the information he/she is looking for via a keyword search, because that information is missing from the transcription.
Additionally, information extracted from transcribed and captioned speech in response to a keyword search is often time-consuming to read, and relevant content is difficult to locate, because transcribed and captioned speech tends to contain speech-specific disfluencies and redundancies. Since a transcription merely presents long passages of text, and captions contain only sequences of short phrases, this lack of structured organization limits the user's ability to navigate to a particular topic in the video or to ascertain any type of summary of the video's content.
Some of the inherent shortcomings of existing techniques for searching a video for relevant content can be addressed by allowing the creator of the video to embed structured text for the purpose of creating a navigable representation of the video that enables searching for content queried by a user/viewer. In addition to transcriptions and captions, a creator may embed into the video structured text documents that convey key topics, chapter titles, plot summaries, and summaries of different segments of the video. While creator-provided structured text documents can be somewhat effective in allowing users/viewers to locate relevant content in a video, the resources and expense required to author such documents and embed them into a video make the task impractical for the vast majority of creators. Moreover, even where creators are willing to author structured text documents for their video content, the content returned in response to a user's search is only as good as the structured text documents the creator chose to embed in the video. That is, it is simply an impossible feat for a creator to anticipate every conceivable type of content that might be the subject of a user's search for inclusion in a structured text document. By the same token, embedding creator-provided structured text documents into videos also fails to provide a truly interactive experience for users/viewers entering queries to locate content of interest in the video, since creator-provided structured text documents cannot be leveraged in a unified way for semantic interpretation of queries.
Implementations herein are directed toward automatically generating a semantically rich structured document for a content feed (i.e., a video) to enable semantic interpretation of queries requesting information contained in the content feed. Referring to FIG. 1, a system 100 includes a user 2 viewing a content feed 120 played back on a computing/user device 10 through a media player application 150. The media player application 150 may be a standalone application executing on the user device 10 or a web-based application accessed via a web browser. In the example shown, the content feed 120 includes a recorded instructional cooking video played on the computing device 10 for the user 2 to view and interact with. Although the examples herein depict the content feed 120 as an audiovisual (AV) feed (e.g., a video) that includes audio data 122 (e.g., audio content, an audio signal, or an audio stream) and image data 124 (e.g., image content or video content), the content feed 120 may be an audio-only feed that includes only the audio data 122, such as, but not limited to, a podcast episode or an audio book. For simplicity, the content feed 120 may be referred to herein interchangeably as a video, an AV signal, an AV feed, or simply AV data, unless otherwise specified.
The system 100 also includes a remote system 130 that communicates with the computing device 10 via a network 120. The remote system 130 may be a distributed system (e.g., a cloud computing environment or a storage abstraction) having scalable/elastic resources. The resources include computing resources 134 (e.g., data processing hardware) and/or storage resources 136 (e.g., memory hardware). In some implementations, the remote system 130 hosts (e.g., on the computing resources) the media player application 150 to coordinate playback of the content feed 120 on the computing device 10, generate the semantically rich structured document 300 for the content feed 120, and use the structured document 300 to enable the user 2 to interact with the content feed 120 via the computing device 10 by issuing queries 112 that request information contained in the content feed during playback of the content feed 120. For example, the data processing hardware 134 of the remote system 130 may execute instructions stored on the memory hardware 136 of the remote system 130 for executing the application 150. Additionally or alternatively, the media player application 150 may execute on the computing device 10 associated with the user 2. For example, data processing hardware 12 of the computing device 10 may execute instructions stored on memory hardware 14 of the computing device 10 for executing the application 150. Some examples of the data processing hardware 12 include a central processing unit (CPU), a graphics processing unit (GPU), or a tensor processing unit (TPU).
The computing device 10 includes, or is in communication with, a display 11 capable of displaying a video interface 400 for presenting the image data 124 and a speaker 18 for audible output of the audio data 122. The audio data 122 may correspond to speech utterances 123 spoken by an actor, instructor, narrator, meeting participant, presenter, or other individual recorded in the video 120. Some examples of the computing device 10 include computers, laptops, mobile computing devices, smart televisions, monitors, smart devices (e.g., smart speakers, smart displays, smart appliances), and wearable devices. In the example shown, the content feed 120 includes a recorded instructional cooking video played on the computing device 10 for the user 2 to view and interact with. Implementations herein are directed toward the media player application 150 providing an interactive experience to the user 2 during playback of the instructional cooking video 120, in which the media player application permits the user 2 to issue natural-language queries 112 requesting information contained in the video 120, whereby the application 150 uses the semantically rich structured document 300 generated for the video 120 to retrieve the requested information and provide the user 2 with a response 182 containing the requested information.
The user 2 may issue the query 112 as a spoken query 112 captured in streaming audio by a microphone 16 in communication with the computing device 10, and the application 150 (or another application) may perform speech recognition to convert the spoken query 112 into a corresponding textual representation of the query 112. Alternatively, the user 2 may also have the ability to input the query 112 via an input device 20, which may include a physical keyboard in communication with the computing device 10 or a virtual keyboard presented for display in the video interface 400. The input device 20 may also include a mouse, stylus, or graphical user interface that permits the user to input a query 112 requesting information about an object displayed in the image data 124 (e.g., a word in closed captions, a word/phrase in creator-provided text, an entity depicted in a scene of the video) by selecting, or hovering over, the object.
By way of example, the instructional cooking video 120 is playing a segment in which the actor is mixing a list of ingredients for making a popular Thai seafood curry dish called Haw Mok Talay. If the user 2 was unable to internalize the recently conveyed proportion of cumin during playback of the video 120, the user may issue the query 112 "How much cumin?" to ascertain the proportion of cumin. Without requiring the user to manually scrub backward through the video 120 to locate the segment when the proportion of cumin was conveyed, the application 150 can retrieve the proportion of cumin (i.e., the requested information) from the semantically rich structured document 300. For instance, assuming the actor explicitly spoke the proportion of cumin in the speech utterances 123, the application 150 may retrieve the proportion of cumin from a transcription 310 of the audio data 122. Among other types of information, the structured document 300 may also include creator-provided text 320 recognized in the image data 124. For example, the creator of the video 120 may briefly display creator-provided text 320 in the image data 124 specifying that the recipe calls for half a teaspoon of cumin. Thus, whether or not the actor explicitly spoke the proportion in the speech utterances 123, the application 150 can retrieve the proportion of cumin from the creator-provided text 320 included in the structured document 300 and provide a response 182 to the user's query 112 conveying that half a teaspoon of cumin is called for.
The media player application 150 includes a document structurer 200, a large language model 180, and an output module 190. The document structurer 200 is configured to receive/ingest and process the audiovisual feed 120 to generate the semantically rich structured document 300. Notably, the document structurer 200 can automatically generate the structured document 300 for the ingested audiovisual feed 120 without requiring the creator of the audiovisual feed 120 to provide any structured text or otherwise contribute to the authoring of the structured document 300. Thus, the document structurer 200 can ingest any new or existing content feed 120 and immediately generate a structured document 300 without any input from the creator of the feed 120.
The audiovisual feed 120 ingested by the document structurer 200 includes the audio data 122 and the image data 124. The audio data 122 may characterize the speech utterances 123, and the image data 124 may include a plurality of image frames 125, 125a-n (FIG. 2). As discussed above, the content feed 120 may instead include an audio-only feed 120 that includes only the audio data 122. The structured document 300 includes the transcription 310 of the speech utterances 123. As described in greater detail below with reference to FIG. 2, the transcription 310 includes a plurality of words, and the structured document 300 aligns each word in the transcription 310 with a corresponding audio segment 222 (FIG. 2) of the audio data 122 that indicates the time at which the word was recognized in the audio data 122. That is, the structured document 300 includes a timestamp for each word in the transcription 310.
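The word-to-audio-segment alignment described above can be sketched as a simple data model. This is a minimal illustration only; the class and method names are hypothetical, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class AlignedWord:
    """One transcript word aligned to the audio segment in which it was recognized."""
    text: str
    start_s: float  # start time of the aligned audio segment, in seconds
    end_s: float    # end time of the aligned audio segment

@dataclass
class StructuredDocument:
    """Semantically rich structured document: a word-level time-stamped transcription."""
    words: list[AlignedWord] = field(default_factory=list)

    def words_spoken_between(self, t0: float, t1: float) -> list[str]:
        """Return the words whose aligned audio segments overlap [t0, t1]."""
        return [w.text for w in self.words if w.end_s > t0 and w.start_s < t1]

doc = StructuredDocument(words=[
    AlignedWord("You", 12.0, 12.2),
    AlignedWord("want", 12.2, 12.4),
    AlignedWord("to", 12.4, 12.5),
    AlignedWord("toast", 12.5, 12.9),
])
print(doc.words_spoken_between(12.25, 12.6))  # ['want', 'to', 'toast']
```

Per-word timestamps like these are what later let the application map an answer back to the exact span of audio in which it was spoken.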
In some implementations, the document structurer 200 also processes the image data 124 to determine whether any creator-provided text 320 is recognized in the image data 124. In these implementations, the structured document 300 generated by the document structurer 200 will also include any creator-provided text 320 recognized in one or more image frames 125a-n (FIG. 2) of the image data 124. The document structurer 200 may use techniques such as optical character recognition to recognize any creator-provided text 320 in each image frame 125. As used herein, creator-provided text 320 may include text graphics (any combination of letters, words, or other symbols) that the creator of the video superimposes onto the scene depicted in an image frame in order to convey relevant content to the user/viewer 2. Creator-provided text 320 may also include any text recognized within the actual scene depicted in an image frame. The structured document 300 may align any recognized creator-provided text 320 with a corresponding audio segment 222 (FIG. 2) of the audio data 122 to indicate the time at which the creator-provided text 320 was recognized in the one or more image frames 125.
The document structurer 200 may additionally perform other processing techniques on the ingested audiovisual feed 120, such as, but not limited to, speaker diarization, summarization, and formatting, and save the results of these techniques in the structured document 300. Speaker diarization answers the question "who is speaking when" and has a variety of applications, including multimedia information retrieval, speaker turn analysis, audio processing, and automatic transcription of conversational speech, among others. The document structurer 200 may leverage a text generation model that consumes the transcription 310 and/or the creator-provided text 320 and outputs key topics or corresponding summaries for one or more different segments of the audiovisual feed 120. Formatting may identify different chapters/scenes in the audiovisual feed 120.
FIG. 2 shows an example of the document structurer 200, which includes a diarization module 220, an automated speech recognition (ASR) module 230, an optical character recognition (OCR) module 240, and a generator 250. The application 150 executes the ASR module 230 to generate the transcription 310 (also referred to as a transcript) of the one or more speech utterances 123 spoken by speakers (e.g., actors/participants) in the content feed 120 (e.g., an audiovisual signal including the audio data 122 and the image data 124, or an audio-only signal including only the audio data 122).
The diarization module 220 is configured to receive the audio data 122 corresponding to the utterances 123 of the speakers in the content feed 120 (and optionally the image data 124 representing faces of the speakers), segment the audio data 122 into a plurality of segments 222, 222a-n (e.g., fixed-length segments or variable-length segments), and, using a probabilistic model (e.g., a probabilistic generative model), generate diarization results 224 based on the audio data 122 (and optionally the image data 124) that include a corresponding speaker label 226 assigned to each segment 222. In other words, the diarization module 220 performs a series of speaker recognition tasks on short utterances (e.g., the segments 222) and determines whether two given segments 222 of a conversation were spoken by the same speaker. The diarization module 220 may simultaneously execute a face-tracking routine to identify which participant is speaking during which segment 222 to further refine the speaker recognition. The diarization module 220 is then configured to repeat the process for all segments 222 of the conversation. Here, the diarization results 224 provide time-stamped speaker labels 226, 226a-n for the received audio data 122 that identify not only who is speaking during a given segment 222 but also when speaker changes occur between adjacent segments 222.
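As a rough illustration of the diarization idea (deciding whether short segments were spoken by the same speaker), the sketch below greedily labels segments by comparing speaker embeddings against one representative embedding per known speaker. The greedy strategy, threshold, and two-dimensional embeddings are illustrative assumptions, not the probabilistic generative model the patent describes:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diarize(segment_embeddings, threshold=0.8):
    """Greedy diarization: compare each segment's speaker embedding against the
    first embedding seen for each known speaker; open a new speaker label when
    no existing speaker is similar enough."""
    labels, speakers = [], []
    for emb in segment_embeddings:
        scores = [cosine(emb, s) for s in speakers]
        if scores and max(scores) >= threshold:
            labels.append(scores.index(max(scores)))
        else:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
    return labels  # one speaker label per audio segment

# Two segments from speaker 0, one from speaker 1, one more from speaker 0:
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
print(diarize(embs))  # [0, 0, 1, 0]
```

A label change between adjacent segments (here, positions 1 to 2) marks a speaker turn, which is what the time-stamped speaker labels 226 record.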
The ASR module 230 is configured to receive the audio data 122 corresponding to the utterances 123 (and optionally the image data 124 representing the faces of the speakers while speaking the utterances 123). The ASR module 230 transcribes the audio data 122 into corresponding ASR results 232. Here, an ASR result 232 refers to a textual transcription (e.g., the transcription 310) of the audio data 122, or multiple candidate textual transcriptions thereof. In some examples, the ASR module 230 communicates with the diarization module 220 to leverage the diarization results 224 associated with the audio data 122 for improving speech recognition on the utterances 123. For instance, the ASR module 230 may apply different speech recognition models (e.g., language models, prosody models) for different speakers identified from the diarization results 224. Additionally or alternatively, the ASR module 230 and/or the diarization module 220 (or some other component of the application 150) may index the transcription 310 of the audio data 122 using the time-stamped speaker labels 226 predicted for each segment 222 obtained from the diarization results 224. As shown in FIG. 2, the transcription 310 for the content feed 120 may be indexed by speaker to associate portions of the transcription 310 with the respective speakers in order to identify what each speaker said.
In some implementations, the document structurer 200 receives captions for the spoken utterances 123 that were previously generated by another application or provided by the creator of the content feed 120. The captions may be time-stamped/aligned with the audio segments 222 of the audio data 122 to indicate the times at which the utterances 123 the captions represent were spoken. The captions may serve as the transcription 310 without the ASR module 230 having to process the audio data 122, or the captions may be used in combination with, or to improve, the recognition results 232 for the transcription 310 generated by processing the audio data 122. In some examples, when the captions do not include punctuation, the document structurer 200 adds punctuation to the previously generated captions to improve the accuracy of the responses 182 generated by the large language model 180.
The transcription 310 of the utterances 123 for inclusion in the structured document 300 also includes alignment information 315. The alignment information 315 provides an alignment between each word 312 (FIG. 3) of the plurality of words 312, 312a-n (FIG. 3) in the transcription 310 and the corresponding audio segment 222 of the audio data 122 that indicates the time at which the corresponding word was recognized.
The OCR module 240 is configured to recognize any creator-provided text 320 that may be present in the one or more image frames 125a-n of the image data 124. The OCR module 240 may include an OCR machine learning model (e.g., a recognizer) 244 trained to recognize any creator-provided text 320 in each image frame 125. As used herein, creator-provided text 320 may include text graphics (any combination of letters, words, or other symbols) that the creator of the video superimposes onto the scene depicted in an image frame in order to convey relevant content to the user/viewer 2. Creator-provided text 320 may also include any text recognized within the actual scene depicted in an image frame 125. In some examples, the OCR module 240 includes an OCR data store 242 that the OCR machine learning model 244 may access to recognize particular fonts, symbols, or text patterns. The creator-provided text 320 recognized in the one or more image frames 125a-n may further include corresponding alignment information 322. Here, the alignment information 322 provides an alignment between any recognized creator-provided text 320 and the corresponding audio segment 222 (FIG. 2) of the audio data 122 to indicate the time at which the creator-provided text 320 was recognized in the one or more image frames 125.
In some scenarios, the ASR module 230 uses the recognized creator-provided text 320 and the corresponding alignment information 322 to improve the accuracy of the transcription 310. Referring to the example above in which the content feed 120 includes the instructional cooking video, the ASR module 230 may produce a recognition result 232 that misrecognizes the name of the Thai dish "Haw Mok Talay" as "Hamook Taley". Similarly, previously generated captions may misrecognize the name of the dish. Some of the creator-provided text 320 recognized in the image data 124 may include the phrase "Haw Mok Talay". In some instances, the correct spelling ("Haw Mok Talay") may be a lower-confidence hypothesis in the list of candidate hypotheses included in the speech recognition result 232, whereby a match with the phrase "Haw Mok Talay" present in the recognized creator-provided text 320 boosts the confidence of the candidate hypothesis "Haw Mok Talay" in the recognition result 232 such that this phrase is ultimately selected for inclusion in the transcription 310.
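The hypothesis-boosting idea can be sketched as follows. The additive boost and function names are illustrative assumptions, not the patent's actual rescoring method:

```python
def rescore_with_ocr(hypotheses, ocr_phrases, boost=0.2):
    """Boost the score of any candidate ASR hypothesis containing a phrase
    recognized by OCR in the video frames, then return the best hypothesis.
    `hypotheses` is a list of (text, score) pairs from the recognizer."""
    rescored = []
    for text, score in hypotheses:
        if any(p.lower() in text.lower() for p in ocr_phrases):
            score += boost  # on-screen text corroborates this spelling
        rescored.append((text, score))
    return max(rescored, key=lambda h: h[1])[0]

# The correct spelling starts out as the lower-confidence hypothesis:
hyps = [("a dish called Hamook Taley", 0.55),
        ("a dish called Haw Mok Talay", 0.45)]
print(rescore_with_ocr(hyps, ["Haw Mok Talay"]))  # a dish called Haw Mok Talay
```

After the boost, the hypothesis matching the on-screen text wins and is selected for the transcription.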
In some implementations, the generator 250 receives the transcription 310, the creator-provided text 320 recognized in the one or more image frames 125, and the corresponding alignment information 315, 322, and generates the structured document 300 by annotating the transcription 310 of the speech utterances 123 with the creator-provided text 320. In these implementations, the alignment information 315, 322 can reveal the likelihood of which portions of the speech utterances 123 relate to the creator-provided text 320. For example, the generator 250 may insert the creator-provided text 320 between a pair of adjacent words in the transcription 310 based on the corresponding audio segment 222 of the audio data 122 that is aligned with the creator-provided text 320 recognized in the one or more image frames.
FIG. 3 shows an example semantically rich structured document 300 generated by the document structurer 200 of FIGS. 1 and 2 for the audiovisual feed 120 corresponding to the instructional cooking video. The transcription 310 relates to utterances 123 spoken by a speaker discussing some of the steps for making a satay marinade for the Thai dish "Haw Mok Talay". The transcription 310 includes the plurality of words 312, 312a-n, and the corresponding alignment information 315 provides an alignment between each word 312 of the plurality of words 312a-n and the corresponding audio segment 222 of the audio data 122 that indicates the time at which the corresponding word 312 was recognized. The creator-provided text 320 includes "1.5 tsp coriander" and "0.5 tsp cumin", indicating the respective proportions of coriander and cumin called for by the satay marinade. Notably, since the speaker never spoke any utterances 123 conveying these proportions, they are not included in the transcription 310. The corresponding alignment information 325 provides an alignment between the creator-provided text 320 and the corresponding audio segments 222 (FIG. 2) of the audio data 122 to indicate the times at which the creator-provided text 320 was recognized in the one or more image frames 125.
In the example shown, the structured document 300 also includes an annotated transcription 330 that uses the alignment information 315, 325 to insert the creator-provided text 320 between a pair of adjacent words (e.g., "anything" and "Coriander") in the transcription 310 based on the corresponding audio segment 222 of the audio data 122 that is aligned with the creator-provided text 320 recognized in the one or more image frames. Here, the annotated transcription 330 includes the creator-provided text 320 inserted at the relevant positions in the transcription 310.
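A minimal sketch of this timestamp-based insertion follows; the data shapes and names are hypothetical, and word start times stand in for the aligned audio segments 222:

```python
def annotate_transcription(words, overlays):
    """Insert each OCR'd creator-provided text at the transcript position whose
    word timestamps bracket the time the overlay appeared on screen.
    `words` is a list of (word, start_s); `overlays` is a list of (text, shown_at_s)."""
    out = [w for w, _ in words]
    # Insert later overlays first so earlier insertion indices stay valid.
    for text, shown_at in sorted(overlays, key=lambda o: -o[1]):
        idx = sum(1 for _, start in words if start <= shown_at)
        out.insert(idx, f"[{text}]")
    return " ".join(out)

words = [("anything", 30.0), ("Coriander", 34.0), ("seeds", 34.4)]
print(annotate_transcription(words, [("1.5 tsp coriander", 32.0)]))
# anything [1.5 tsp coriander] Coriander seeds
```

The overlay lands between "anything" and "Coriander" because it appeared on screen between those two words, which mirrors the annotated transcription 330 of FIG. 3.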
Referring back to FIG. 1, in some implementations, during playback of the content feed (e.g., the audiovisual feed) 120, the large language model 180 is configured to receive, as input, the semantically rich structured document 300 and a query 112 issued by the user 2, and to generate, as output, a response 182 to the query 112 that conveys the requested information contained in the content feed 120. In some examples, the query 112 includes a question in natural language, and the response 182 to the query 112 includes a natural language response providing an answer to the question. For instance, the response 182 to the query 112 may include a textual response generated by the large language model 180 that conveys the requested information as a coherent, focused response to the query 112. In some examples, the large language model 180 can further augment the coherent/focused response 182 to the query 112 by citing source material to highlight the authoritativeness of the information contained in the response 182. That is, the textual response 182 to the query 112 may include one or more citations to source material related to the requested information, such as links for entities mentioned in the response 182 that can direct the user 2 to additional information. In addition to generating text to provide natural language responses/answers 182 to natural language queries 112, the large language model 180 may perform other generative tasks, such as generating natural language text that summarizes one or more portions of the structured document 300.
The large language model 180 may include a pre-trained large language model 180 that is pre-trained on general world knowledge using one or more generative tasks (i.e., multi-task learning) to learn efficient contextual representations. As such, the large language model 180 may include a Multitask Unified Model (MUM). The pre-trained large language model 180 may be based on a Transformer or Conformer model, or another encoder/decoder architecture with multi-head attention mechanisms. For example, the pre-trained large language model may include one encoding branch for encoding the structured document 300, another encoding branch for encoding the query 112, and a shared decoder that receives both encodings to retrieve/generate the response answering the query. Notably, Transformer/Conformer models can be efficiently parallelized for training large-scale language models, which have been shown to generalize better, and achieve significantly better performance, than language models based on autoregressive neural network architectures such as recurrent neural network models. The pre-trained language model 180 may include over a billion parameters, and may range upwards of over a trillion parameters.
Implementations herein are directed toward the pre-trained large language model 180 performing few-shot learning, in which the pre-trained large language model uses the structured document 300 as context for generating the response 182 to the query 112. That is, few-shot learning may fine-tune the parameters of the pre-trained large language model 180 so that the language model 180 can be applied to the downstream task of retrieving relevant information contained in an audio/video source in response to a query 112 issued by the user 2. The use of few-shot learning is especially useful for tasks where limited training data is available, since the language model 180 is able to generalize well from structured documents 300 provided as labeled examples to improve retrieval of the relevant data (e.g., information contained in the audiovisual feed 120 the user 2 is currently viewing). With few-shot learning, the query 112 and the structured document 300 are provided as an input pair to the pre-trained large language model 180 such that the structured document is labeled as relevant in some manner to generating the response 182 as output. The large language model 180 is also capable of performing zero-shot learning tasks, in which the language model 180 may default to its knowledge of the world when generating the response 182 to the query 112.
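One way such (document, query) pairs might be assembled into a few-shot prompt is sketched below. The prompt wording, example triples, and function name are illustrative assumptions only, not the patent's actual prompt format:

```python
def build_prompt(structured_document: str, query: str, examples=None) -> str:
    """Assemble a few-shot prompt: labeled (document, question, answer)
    examples followed by the current structured document and the user's query,
    leaving the final answer for the language model to generate."""
    parts = []
    for doc, q, a in examples or []:
        parts.append(f"Document:\n{doc}\nQuestion: {q}\nAnswer: {a}\n")
    parts.append(f"Document:\n{structured_document}\nQuestion: {query}\nAnswer:")
    return "\n".join(parts)

prompt = build_prompt(
    "transcript annotated with [0.5 tsp cumin]",
    "How much cumin?",
    examples=[("transcript annotated with [1 cup flour]",
               "How much flour?", "One cup of flour.")],
)
print(prompt.endswith("Question: How much cumin?\nAnswer:"))  # True
```

The labeled examples steer the model toward answering from the supplied document rather than from its general world knowledge; omitting them yields the zero-shot case mentioned above.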
After the large language model 180 generates the response 182, the output module 190 is configured to provide the response 182 for output from the user computing device. The output module 190 may include any combination of a playback settings controller 190a, a user interface (UI) generator 190b, and a text-to-speech (TTS) system 190c. Continuing with the example in which the audiovisual feed 120 includes the instructional cooking video, upon realizing that details spoken by the actor in the video about how the coriander seeds used in the recipe were toasted were missed, the user 2 may provide the query 112 "What, how were they toasted?". The response 182 may include the answer "They were toasted in a dry sauté pan". In some examples, the output module 190 receives the response 182 and the structured document 300 as input and extracts a segment of the transcription 310 (and/or a segment of the annotated transcription 330 (FIG. 3)) that includes the requested information conveyed by the response to the query 112. For instance, the segment extracted from the transcription 310 may include "You want to toast them in a dry sauté pan", where the segment is bounded by the beginning word "You" and the ending word "pan". Accordingly, the output module 190 may then identify both a starting audio segment 222 (FIG. 2) of the audio data 122 as the corresponding audio segment aligned with the beginning word bounding the segment of the transcription, and an ending audio segment 222 of the audio data 122 as the corresponding audio segment aligned with the ending word bounding the segment of the transcription 310. Using the identified starting and ending audio segments 222, the output module 190 may instruct the playback settings controller 190a to replay the audio data 122 from the starting audio segment to the ending audio segment as audible output from the speaker 18, such that the utterance 123 "You want to toast them in a dry sauté pan" is replayed to convey the response 182 to the query 112. Notably, the playback settings controller 190a may pause playback of the plurality of image frames of the image data 124 while the relevant audio data 122 is replayed. In some examples, the controller 190a pauses playback of the audiovisual feed 120 in response to receiving the query 112.
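The lookup from an answer snippet to the starting and ending audio segments can be sketched as follows; names are hypothetical, and (start, end) times stand in for the aligned audio segments 222:

```python
def replay_span(transcript_words, snippet):
    """Locate the transcript span carrying the answer and return the (start, end)
    times of the audio segments aligned with its first and last words, for
    replaying just that span of the audio data.
    `transcript_words` is a list of (word, start_s, end_s) tuples."""
    words = [w for w, _, _ in transcript_words]
    target = snippet.split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return transcript_words[i][1], transcript_words[i + len(target) - 1][2]
    return None  # answer was never spoken in the audio

tw = [("You", 41.0, 41.2), ("want", 41.2, 41.4), ("to", 41.4, 41.5),
      ("toast", 41.5, 41.9), ("them", 41.9, 42.1)]
print(replay_span(tw, "want to toast"))  # (41.2, 41.9)
```

The returned span is what the playback settings controller would replay through the speaker; a `None` result corresponds to the TTS fallback described below for answers found only in on-screen text.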
In some additional examples, the output module 190 instructs the TTS system 190c to perform TTS conversion on the textual response 182 output from the large language model 180 to generate a synthesized speech representation of the response 182 to the query 112. The output module 190 may use the TTS system 190c in scenarios when the requested information conveyed by the response to the query 112 is not present in the transcription 310, and was therefore never conveyed in the spoken utterances 123. In that case, there is no opportunity to replay any portion of the audio data 122 to convey the requested information conveyed by the response 182. For example, and referring to FIG. 3, the response 182 to the query 112 "How much cumin?" may be determined by the large language model 180 solely from the creator-provided text 320, as evidenced by the annotated transcription 330. In this example, the output module 190 may receive the textual response 182 "half a teaspoon of cumin seeds" generated by the large language model 180 as the answer to the query 112, and instruct the TTS system 190c to perform text-to-speech conversion on the textual response 182 to generate a synthesized speech representation of the response 182 to the query 112. The media player application 150 may then audibly output the synthesized speech representation from the speaker 18 of the computing device 10. Notably, the playback settings controller 190a may also pause playback of the audiovisual feed 120 entirely while the synthesized speech representation conveying the response 182 "half a teaspoon of cumin seeds" is audibly output from the speaker 18 of the computing device 10. In some examples, the controller 190a pauses playback of the audiovisual feed 120 in response to receiving the query 112.
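The replay-versus-TTS routing just described can be sketched as a small dispatcher; the three callables are hypothetical hooks into the media player, not APIs from the patent:

```python
def deliver_response(answer: str, transcript: str,
                     replay_audio, speak_tts, pause_playback) -> str:
    """Route the answer: if it was spoken in the video, replay that audio;
    otherwise (e.g., it came only from on-screen creator-provided text),
    synthesize it with TTS. Playback is paused either way."""
    pause_playback()  # pause the feed while the response is delivered
    if answer.lower() in transcript.lower():
        replay_audio(answer)   # answer exists in the spoken audio
        return "replayed"
    speak_tts(answer)          # answer was never spoken; fall back to TTS
    return "synthesized"

noop = lambda *args: None
print(deliver_response("half a teaspoon of cumin seeds",
                       "you want to toast them in a dry saute pan",
                       noop, noop, noop))  # synthesized
```

Here the cumin proportion never appears in the transcript, so the dispatcher chooses the synthesized speech path, matching the FIG. 3 example.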
Additionally, the output module 190 may direct the UI generator 190b to generate a graphic of the text response 182 and present the graphic of the text response 182 in the video interface 400 displayed on the display 11 of the computing device 10 during playback of the audio-visual feed 120. Here, the user 2 may simply read the graphic of the text response 182 presented in the video interface 400 while watching the video 120. The text response 182 may include one or more references to source materials related to the requested information. For example, the graphic of the text response 182 presented in the video interface 400 may provide hyperlinks to the references to the source materials related to the requested information. The user 2 may simply hover (e.g., with a mouse) over, or touch, a word of interest in the text response presented in the video interface 400 to view additional information or be directed to another source, such as a web page.
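One way the UI generator might turn the text response and its source references into a hyperlinked graphic is sketched below. The `render_response` function and the HTML markup it emits are illustrative assumptions, not the patent's implementation:

```python
import html

def render_response(text: str, references: dict[str, str]) -> str:
    """Render the text response as HTML, wrapping each referenced term
    in a hyperlink to its source material (a sketch of what a UI
    generator such as 190b might emit)."""
    rendered = html.escape(text)
    for term, url in references.items():
        link = f'<a href="{html.escape(url, quote=True)}">{html.escape(term)}</a>'
        rendered = rendered.replace(html.escape(term), link)
    return f'<div class="response">{rendered}</div>'
```

A user hovering over or touching the linked term would then be directed to the referenced source.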
FIG. 4 provides an example video interface 400 displayed by the media player application 150 on the display 11 of the computing device 10 during playback of the audio-visual feed 120. In this example, the media player application 150 also displays, in the video interface 400, information from the semantically enriched structured document 300 generated for the audio-visual feed 120, to allow the user 2 to interact with the structured document 300 during playback of the audio-visual feed 120. For example, a transcription 310 of utterances spoken by two different speakers may be displayed, along with corresponding speaker tags 204 indicating which portions of the transcription 310 were spoken by each speaker. The structured document 300 may further provide multimodal interactions, such as hyperlinks added to specific terms or entities narrated in the transcription 310 that may be relevant to the user 2. For example, additional information about the term "cryptocurrency" may be explored by the user 2 by selecting the term or hovering the mouse over the term. Here, the video interface 400 may be populated with a definition of cryptocurrency or a snippet from a Wikipedia page about cryptocurrency.
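The diarized transcript display and the term-to-snippet interaction described above can be sketched as two small helpers. Both function names, the tuple layout, and the snippet lookup are illustrative assumptions:

```python
def format_transcript(segments: list[tuple[str, str]]) -> str:
    """Render diarized transcript segments as 'speaker: text' lines,
    mirroring the speaker tags 204 shown alongside the transcription 310."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in segments)

def snippet_for_term(term: str, snippets: dict[str, str]) -> str:
    """Return the snippet shown when the user selects or hovers over a term
    (case-insensitive lookup; a toy stand-in for fetching, e.g., a
    Wikipedia excerpt)."""
    return snippets.get(term.lower(), "")
```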
The structured document 300 may further provide summaries 410 of relevant chapters/sections/scenes of the audio-visual feed 120 for presentation in the video interface 400. Here, the summaries 410 may be generated by the large language model 180 of FIG. 1 based on information extracted from the transcription 310, the author-provided text 320, and/or the annotated transcription 330. The user 2 may select one of the summaries 410, and the video player may advance to that portion of the video.
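Selecting a chapter summary to advance playback reduces to mapping each summary to the start time of its chapter. The `Summary` structure and `seek_to_summary` helper below are illustrative assumptions, not the disclosed implementation:

```python
from dataclasses import dataclass

@dataclass
class Summary:
    title: str
    text: str
    start_seconds: float  # where this chapter/section/scene begins in the feed

def seek_to_summary(summaries: list[Summary], index: int) -> float:
    """Return the playback position for the selected chapter summary,
    i.e., where the video player should advance to."""
    return summaries[index].start_seconds
```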
The video interface 400 of the media player application 150 also provides playback settings controls 450 that allow the user 2 to control playback of the audio-visual feed 120. For example, the playback settings controls 450 may include buttons for play, pause, and seeking forward/backward, as well as a video timeline that the user 2 can manipulate to scrub through the video.
FIG. 5 provides a flow diagram of an example arrangement of operations of a method 500 for interacting with a content feed 120 using a structured document 300 during playback of the content feed 120. At operation 502, the method 500 includes receiving a content feed 120 including audio data 122. The audio data 122 corresponds to spoken utterances 123. The content feed 120 may include an audio-visual feed that additionally includes image data 124 including a plurality of image frames 125a–n.
At operation 504, the method 500 includes processing the content feed 120 to generate a semantically enriched structured document 300. Here, the structured document 300 includes a transcription 310 of the spoken utterances 123. The transcription 310 may include a plurality of words 312 each aligned with a corresponding audio segment 222 of the audio data 122, the corresponding audio segment indicating the time at which the word 312 was recognized in the audio data 122.
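The word-to-audio-segment alignment described at operation 504 can be sketched as a simple data structure plus a lookup that answers "when was this word spoken?". The `AlignedWord` type and `first_occurrence` helper are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignedWord:
    word: str
    start: float  # seconds into the audio data where the word begins
    end: float    # seconds into the audio data where the word ends

def first_occurrence(transcript: list[AlignedWord], word: str) -> Optional[float]:
    """Return the start time of the first recognition of `word` in the
    aligned transcription, or None if it was never spoken."""
    for aligned in transcript:
        if aligned.word.lower() == word.lower():
            return aligned.start
    return None
```

A lookup like this is what would let a player replay the exact audio segment in which a queried word was recognized.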
At operation 506, during playback of the content feed 120, the method 500 includes receiving, from the user 2, a query 112 requesting information contained in the content feed 120. At operation 508, during playback of the content feed 120, the method 500 includes processing, by the large language model 180, the query 112 and the structured document 300 to generate a response 182 to the query 112. Here, the response 182 conveys the requested information contained in the content feed 120. At operation 510, the method 500 includes providing the response 182 to the query 112 for output from a user device 10 associated with the user 2.
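Operations 502–510 can be strung together as a minimal pipeline: receive the feed, build a structured document, and answer a query against it. The large language model is replaced here by a keyword-matching toy stub; every name below is an illustrative assumption:

```python
def process_feed(sentences: list[str]) -> dict:
    # Operation 504: build a minimal "structured document" holding a transcription.
    return {"transcription": " ".join(sentences)}

def query_llm(query: str, structured_doc: dict) -> str:
    # Toy stand-in for the large language model: return the first
    # transcription sentence containing a query keyword.
    for sentence in structured_doc["transcription"].split(". "):
        if any(w.lower() in sentence.lower() for w in query.split()):
            return sentence
    return "not found"

def handle_query(query: str, sentences: list[str]) -> str:
    # Operations 502-510: receive feed, generate document, process query,
    # and return the response for output to the user.
    doc = process_feed(sentences)
    return query_llm(query, doc)
```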
A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an "application," an "app," or a "program." Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.
A non-transitory memory may be a physical device used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by a computing device. The non-transitory memory may be volatile and/or non-volatile addressable semiconductor memory. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM), as well as disks or tapes.
FIG. 6 is a schematic view of an example computing device 600 that may be used to implement the systems and methods described in this document. The computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions are intended to be exemplary only and are not intended to limit implementations of the inventions described and/or claimed in this document.
The computing device 600 includes a processor 610, a memory 620, a storage device 630, a high-speed interface/controller 640 connecting to the memory 620 and high-speed expansion ports 650, and a low-speed interface/controller 660 connecting to a low-speed bus 670 and the storage device 630. Each of the components 610, 620, 630, 640, 650, and 660 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 610 can process instructions for execution within the computing device 600, including instructions stored in the memory 620 or on the storage device 630, to display graphical information for a graphical user interface (GUI) on an external input/output device, such as a display 680 coupled to the high-speed interface 640. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Moreover, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 620 stores information non-transitorily within the computing device 600. The memory 620 may be a computer-readable medium, a volatile memory unit, or a non-volatile memory unit. The non-transitory memory 620 may be a physical device used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 600. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM), as well as disks or tapes.
The storage device 630 is capable of providing mass storage for the computing device 600. In some implementations, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 620, the storage device 630, or memory on the processor 610.
The high-speed controller 640 manages bandwidth-intensive operations for the computing device 600, while the low-speed controller 660 manages lower bandwidth-intensive operations. Such an allocation of duties is exemplary only. In some implementations, the high-speed controller 640 is coupled to the memory 620, the display 680 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 650, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 660 is coupled to the storage device 630 and a low-speed expansion port 690. The low-speed expansion port 690, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled, e.g., through a network adapter, to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router.
As shown in the figure, the computing device 600 may be implemented in a number of different forms. For example, it may be implemented as a standard server 600a, or multiple times in a group of such servers 600a, as a laptop computer 600b, or as part of a rack server system 600c.
Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, non-transitory computer-readable medium, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory, or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and memory may be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, or a touch screen for displaying information to the user and, optionally, a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.
Claims (24)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263268921P | 2022-03-04 | 2022-03-04 | |
US63/268,921 | 2022-03-04 | ||
PCT/US2023/063630 WO2023168373A1 (en) | 2022-03-04 | 2023-03-02 | Structured video documents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118830020A true CN118830020A (en) | 2024-10-22 |
Family
ID=85781930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202380027426.3A Pending CN118830020A (en) | 2022-03-04 | 2023-03-02 | Structured video documentation |
Country Status (6)
Country | Link |
---|---|
US (2) | US12169522B2 (en) |
EP (1) | EP4473528A1 (en) |
JP (1) | JP2025512697A (en) |
KR (1) | KR20240151201A (en) |
CN (1) | CN118830020A (en) |
WO (1) | WO2023168373A1 (en) |
2023
- 2023-03-02 CN CN202380027426.3A patent/CN118830020A/en active Pending
- 2023-03-02 JP JP2024552463A patent/JP2025512697A/en active Pending
- 2023-03-02 KR KR1020247030605A patent/KR20240151201A/en active Pending
- 2023-03-02 EP EP23714186.6A patent/EP4473528A1/en active Pending
- 2023-03-02 US US18/177,747 patent/US12169522B2/en active Active
- 2023-03-02 WO PCT/US2023/063630 patent/WO2023168373A1/en active Application Filing

2024
- 2024-11-26 US US18/961,038 patent/US20250094491A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
US20250094491A1 (en) | 2025-03-20 |
US12169522B2 (en) | 2024-12-17 |
JP2025512697A (en) | 2025-04-22 |
WO2023168373A1 (en) | 2023-09-07 |
US20230281248A1 (en) | 2023-09-07 |
KR20240151201A (en) | 2024-10-17 |
EP4473528A1 (en) | 2024-12-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||