HK40038257B

HK40038257B - Multimedia data processing method and apparatus, electronic device, and storage medium

Info

Publication number: HK40038257B
Application number: HK42021028216.6A
Authority: HK
Inventors: 李振阳; 马连洋; 衡阵
Original assignee: 腾讯科技（深圳）有限公司
Filing date: 2021-03-29
Publication date: 2024-04-19

Description

Multimedia data processing methods, devices, electronic equipment, and storage media

Technical Field

本申请涉及人工智能领域，尤其涉及一种多媒体数据的处理方法、装置、电子设备以及存储介质。This application relates to the field of artificial intelligence, and in particular to a method, apparatus, electronic device, and storage medium for processing multimedia data.

Background

随着科学技术的不断发展，传统的文字以及文字配图的多媒体内容已经不能满足用户的需求，视频、音频等多媒体内容(如短视频)逐渐成为大众获取信息以及娱乐的重要方式之一。With the continuous development of science and technology, traditional text and multimedia content with text and pictures can no longer meet the needs of users. Multimedia content such as video and audio (such as short videos) has gradually become one of the important ways for the public to obtain information and entertainment.

In daily life, users often want to go straight to the key content (such as the highlights of a short video) to obtain relevant information quickly. To meet this need, existing techniques in artificial intelligence and big data typically treat the problem as image-text matching between the multimedia content and the title information, and point users to the main media content associated with the title. However, because images and text are different modalities, this matching often performs poorly in practice. Moreover, existing image-text matching cannot be applied to audio data, which contains no images, so it cannot identify the main audio content associated with the title information. Its applicability is therefore limited and the user experience suffers.

因此，如何准确地确定出多媒体数据中的主要内容成为亟需解决的问题。Therefore, accurately identifying the main content in multimedia data has become an urgent problem to be solved.

Summary of the Invention

The embodiments of this application provide a multimedia data processing method, apparatus, electronic device, and storage medium that can determine the playback time region of the main content associated with the title information in multimedia data, which improves the user experience and is widely applicable.

第一方面，本申请实施例提供一种多媒体数据的处理方法，该方法包括：In a first aspect, embodiments of this application provide a method for processing multimedia data, the method comprising:

获取多媒体数据中包含的至少一个文本信息，以及上述多媒体数据的标题信息；Obtain at least one text information contained in the multimedia data, as well as the title information of the multimedia data;

确定上述标题信息与各上述文本信息的匹配度；Determine the degree of matching between the above title information and each of the above text information;

根据各上述文本信息对应的匹配度，确定上述多媒体数据中的目标播放时间区域；Based on the matching degree of each of the above text information, the target playback time area in the above multimedia data is determined;

根据上述目标播放时间区域对上述多媒体数据进行处理。The multimedia data is processed according to the target playback time range.

In a second aspect, embodiments of this application provide a multimedia data processing apparatus, the apparatus comprising:

获取单元，用于获取多媒体数据中包含的至少一个文本信息，以及上述多媒体数据的标题信息；The acquisition unit is used to acquire at least one text information contained in the multimedia data, as well as the title information of the multimedia data.

确定单元，用于确定上述标题信息与各上述文本信息的匹配度；The determining unit is used to determine the degree of matching between the above-mentioned title information and each of the above-mentioned text information;

上述确定单元，用于根据各上述文本信息对应的匹配度，确定上述多媒体数据中的目标播放时间区域；The aforementioned determining unit is used to determine the target playback time region in the aforementioned multimedia data based on the matching degree corresponding to each of the aforementioned text information.

播放单元，用于根据上述目标播放时间区域对上述多媒体数据进行处理。The playback unit is used to process the multimedia data according to the target playback time range.

In a third aspect, embodiments of this application provide an electronic device comprising a processor and a memory that are connected to each other;

上述存储器用于存储计算机程序；The aforementioned memory is used to store computer programs;

上述处理器被配置用于在调用上述计算机程序时，执行上述第一方面所提供的方法。The processor is configured to execute the method provided in the first aspect when the computer program is invoked.

In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that is executed by a processor to implement the method provided in the first aspect.

In a fifth aspect, embodiments of this application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the method provided in the first aspect.

In the embodiments of this application, representing the multimedia data as at least one piece of text information allows the matching degree between the title information and each piece of text information to be determined accurately in the text domain. The matching degree then measures how closely each piece of text information is associated with the title information, and the target playback time region in the multimedia data is determined from it. Further, processing the multimedia data according to the target playback time region lets users quickly locate the playback time region of the multimedia content related to the title information, which makes the content more engaging and the method widely applicable.

Brief Description of the Drawings

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

图1是本申请实施例提供的多媒体数据的处理方法的一流程图；Figure 1 is a flowchart of a multimedia data processing method provided in an embodiment of this application;

图2a是本申请实施例提供的获取视频数据中包含的文本信息的一场景示意图；Figure 2a is a schematic diagram of a scenario for obtaining text information contained in video data according to an embodiment of this application;

图2b是本申请实施例提供的获取视频数据中包含的文本信息的另一场景示意图；Figure 2b is a schematic diagram of another scenario for obtaining text information contained in video data according to an embodiment of this application;

图2c是本申请实施例提供的获取视频数据中包含的文本信息的又一场景示意图；Figure 2c is a schematic diagram of another scenario for obtaining text information contained in video data according to an embodiment of this application;

图3是本申请实施例提供的获取音频数据中包含的文本信息的场景示意图；Figure 3 is a schematic diagram of a scenario for obtaining text information contained in audio data according to an embodiment of this application;

图4是本申请实施例提供的根据相似度确定目标播放时间区域的示意图；Figure 4 is a schematic diagram of determining the target playback time region based on similarity according to an embodiment of this application;

图5是本申请实施例提供的根据关键词确定目标播放时间区域的示意图；Figure 5 is a schematic diagram of determining the target playback time region based on keywords according to an embodiment of this application;

图6是本申请实施例提供的根据指定信息确定目标播放时间区域的示意图；Figure 6 is a schematic diagram of determining the target playback time region based on specified information according to an embodiment of this application;

图7是本申请实施例提供的确定目标播放时间区域的示意图；Figure 7 is a schematic diagram of determining the target playback time region provided in an embodiment of this application;

图8是本申请实施例提供的对多媒体数据进行处理的场景示意图；Figure 8 is a schematic diagram of a scenario for processing multimedia data provided in an embodiment of this application;

图9是本申请实施例提供的多媒体数据的处理装置的结构示意图；Figure 9 is a schematic diagram of the structure of the multimedia data processing device provided in an embodiment of this application;

图10是本申请实施例提供的电子设备的结构示意图。Figure 10 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.

Detailed Description

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。根据本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of this application.

The multimedia data processing method provided in the embodiments of this application is applicable to fields such as artificial intelligence and big data, for example human-computer interaction based on natural language processing (NLP), cloud computing within cloud technology, artificial intelligence cloud services, and related data computation and processing in the big data field. It aims to convert multimedia data into text information and then determine, based on that text information, the target playback time region of the main media content in the multimedia data.

人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

自然语言处理(Nature Language processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系。Natural Language Processing (NLP) is an important field within computer science and artificial intelligence. It studies the theories and methods for enabling effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field involves natural language—the language people use in daily life—and thus it has a close relationship with linguistic research.

云技术是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来，实现数据的计算、储存、处理和共享的一种托管技术。本申请实施例所提供的多媒体数据的处理方法可基于云技术中的云计算(cloud computing)实现。Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources within a wide area network (WAN) or local area network (LAN) to achieve data computation, storage, processing, and sharing. The multimedia data processing method provided in this application embodiment can be implemented based on cloud computing.

云计算是指通过网络以按需、易扩展的方式获得所需资源，是网格计算(GridComputing)、分布式计算(Distributed Computing)、并行计算(Parallel Computing)、效用计算(Utility Computing)、网络存储(Network Storage Technologies)、虚拟化(Virtualization)、负载均衡(Load Balance)等传统计算机和网络技术发展融合的产物。Cloud computing refers to obtaining the required resources on demand and in an easily scalable manner through the network. It is the product of the development and integration of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing.

人工智能云服务，一般也被称作是AIaaS(AI as a Service，AI即服务)。这是目前主流的一种人工智能平台的服务方式，具体来说AIaaS平台会把几类常见的人工智能服务进行拆分，并在云端提供独立或者打包的服务，如语音识别处理、文本信息提取等。Artificial intelligence cloud services are generally also known as AIaaS (AI as a Service). This is currently a mainstream service model for artificial intelligence platforms. Specifically, AIaaS platforms break down several common artificial intelligence services and provide them as independent or packaged services in the cloud, such as speech recognition processing and text information extraction.

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making, insight discovery, and process optimization. With the advent of the cloud era, big data has attracted increasing attention. Relying on technologies such as massively parallel processing databases, data mining, distributed file systems, distributed databases, and the cloud computing described above, big data can effectively support the multimedia data processing method provided in this embodiment.

参见图1，图1是本申请实施例提供的多媒体数据的处理方法的一流程图。该方法可以由任一电子设备执行，如可以是服务器或者用户终端，也可以是用户终端和服务器交互完成。当由用户终端执行时，用户终端在获取到多媒体数据后，可确定多媒体数据中的目标播放时间区域，进而基于目标播放时间区域对多媒体数据进行处理。当由服务器和用户终端交互完成时，服务器可确定多媒体数据中的目标播放时间区域，进而将目标播放时间区域指示给用户终端，用户终端根据目标播放时间区域对多媒体数据进行处理。其中，服务器接收到的多媒体数据可以由用户终端发送，也可由服务器通过其他方式，如数据库、网页获取等获取，在此不做限制。其中，服务器可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或者分布式系统，还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content DeliveryNetwork，内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器或服务器集群。用户终端可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等，用户终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接，但并不局限于此。Referring to Figure 1, Figure 1 is a flowchart of a multimedia data processing method provided in an embodiment of this application. This method can be executed by any electronic device, such as a server or a user terminal, or it can be completed through interaction between the user terminal and the server. When executed by a user terminal, after acquiring multimedia data, the user terminal can determine the target playback time region in the multimedia data, and then process the multimedia data based on the target playback time region. When completed through interaction between the server and the user terminal, the server can determine the target playback time region in the multimedia data, and then indicate the target playback time region to the user terminal, which then processes the multimedia data according to the target playback time region. The multimedia data received by the server can be sent by the user terminal, or it can be obtained by the server through other means, such as databases or web pages; no limitations are imposed here. 
The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server or server cluster providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. User terminals can be smartphones, tablets, laptops, desktop computers, smart speakers, smartwatches, etc. User terminals and servers can be connected directly or indirectly through wired or wireless communication, but are not limited to these.

如图1所示，本申请实施例提供的多媒体数据的处理方法可包括如下步骤：As shown in Figure 1, the multimedia data processing method provided in this application embodiment may include the following steps:

步骤S101、获取多媒体数据中包含的至少一个文本信息，以及多媒体数据的标题信息。Step S101: Obtain at least one text information contained in the multimedia data, as well as the title information of the multimedia data.

在一些可行的实施方式中，本申请实施例中的多媒体数据包括但不限于视频数据、音频数据以及视频与音频相结合的数据，其中，视频数据为包含图像以及语音数据的多媒体数据，音频数据可以为视频数据中的语音数据，具体可基于实际应用场景确定，在此不做限制。进一步的，本申请实施例可基于多媒体数据中包含的至少一个文本信息来确定出多媒体数据中的目标播放时间区域，以基于目标播放时间区域对多媒体数据进行处理。In some feasible implementations, the multimedia data in the embodiments of this application includes, but is not limited to, video data, audio data, and data combining video and audio. The video data is multimedia data containing both image and audio data, and the audio data can be audio data from the video data. The specific details can be determined based on the actual application scenario and are not limited here. Furthermore, the embodiments of this application can determine the target playback time region in the multimedia data based on at least one text information contained within the multimedia data, so as to process the multimedia data based on the target playback time region.

在一些可行的实施方式中，对于视频数据而言，可获取视频数据中的至少一帧图像的字幕信息，将至少一帧图像的字幕信息作为视频数据中包含的至少一个文本信息。也就是说，视频数据中的任一帧图像的字幕信息可作为视频数据中包含的一个文本信息。其中，从视频数据的帧图像中获取字幕信息时，可采用OCR(Optical CharacterRecognition，光学字符识别)技术，或者其他文字识别方式、文字提取工具等获取，在此不做限制。In some feasible implementations, for video data, caption information from at least one frame of the video data can be obtained, and this caption information from at least one frame can be used as at least one piece of text information contained in the video data. That is, caption information from any frame of the video data can be used as a piece of text information contained in the video data. When obtaining caption information from the frame images of the video data, OCR (Optical Character Recognition) technology, or other text recognition methods, text extraction tools, etc., can be used, and there are no limitations on this method.

Referring to Figure 2a, Figure 2a is a schematic diagram of a scenario for obtaining text information contained in video data according to an embodiment of this application. For ease of description, assume that a piece of video data has only 9 frames and that each frame has different subtitle information; for example, the subtitle in frame 1 is "Typhoon Mangkhut is about to make landfall" and the subtitle in frame 7 is "Let's see what happened." The subtitle information of each frame can then be used as the multiple pieces of text information contained in the video data. For example, the frame 1 subtitle "Typhoon Mangkhut is about to make landfall" is one piece of text information contained in the video data, and the frame 7 subtitle "Let's see what happened" is another. Optionally, the subtitle information of only one of these frames, such as frame 1 or frame 7, can be used as a piece of text information, depending on the requirements of the actual application scenario; this is not limited here.

Specifically, because of persistence of vision, the human eye does not perceive individual still images when they are shown in rapid succession, which produces smooth, continuous playback. Video data is therefore played as a sequence of continuously changing images (for example, 24 consecutive frames per second). As a result, video data commonly contains runs of consecutive frames whose subtitle information is exactly the same. Therefore, after the frames of the video data have been determined, for consecutive frames with identical subtitle information, the subtitle information of any one or more of those frames can be used as one or more pieces of text information contained in the video data. For example, the subtitle information of the first-played frame in such a run can be used as one piece of text information contained in the video data.

Referring to Figure 2b, Figure 2b is a schematic diagram of another scenario for obtaining text information contained in video data according to an embodiment of this application. For ease of description, assume video data containing 60 frames in which frames 1 to 35 all show the same image with the subtitle "Typhoon Mangkhut is about to make landfall," and frames 36 to 60 all show the same image with the subtitle "Let's see what happened." The subtitle information of any one frame among frames 1 to 35 (such as frame 1) can then be used as one piece of text information contained in the video data, and the subtitle information of any one frame among frames 36 to 60 (such as frame 36) as another. In this way, two pieces of text information contained in the video data are determined.
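The Figure 2b scenario can be sketched as follows. This is a minimal illustration, assuming the per-frame subtitles have already been extracted (for example by OCR) into a list of strings in playback order; the function name `collapse_repeated_captions` and the choice to keep the first frame of each run are assumptions made for the example, not details fixed by the application.

```python
from itertools import groupby

def collapse_repeated_captions(frame_captions):
    """Return (first_frame_number, caption) for each run of identical captions.

    frame_captions holds one caption string per frame, in playback order.
    Frame numbers are 1-based; frames with an empty caption are skipped.
    """
    results = []
    frames_seen = 0
    for caption, run in groupby(frame_captions):
        run_length = len(list(run))
        if caption:
            # Keep the first-played frame of the run as its representative.
            results.append((frames_seen + 1, caption))
        frames_seen += run_length
    return results

# Mirrors Figure 2b: frames 1-35 share one subtitle, frames 36-60 another.
captions = (["Typhoon Mangkhut is about to make landfall"] * 35
            + ["Let's see what happened"] * 25)
texts = collapse_repeated_captions(captions)
```

Keeping the frame number alongside each subtitle makes it possible to map a piece of text information back to a playback position later.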

可选的，对于字幕信息中每个字随视频播放进程逐一出现的视频数据而言，如果将每帧图像对应的文本信息作为视频数据中包含的文本信息时，将会导致存在大量无语义或者语义表述不全的文本信息。因此对于该种视频数据，可将视频数据分成多个视频数据片段，使得每个视频数据片段由一句完整字幕信息所对应的所有帧图像。也就是说，每个视频数据片段为一句完整的字幕信息从第一个字至完整字幕信息所对应的全部帧图像。进而可从每个视频数据片段的所有帧图像中确定出包含完整字幕的帧图像，并将该帧图像的字幕信息作为视频数据中包含的至少一个文本信息。基于上述实现方式，可将视频数据中每一句完整字幕信息作为视频数据中包含的一个文本信息，提高文本信息的处理效率。Optionally, for video data where each character in the subtitle information appears sequentially as the video plays, using the text information corresponding to each frame as the text information included in the video data would result in a large amount of text information lacking semantic meaning or with incomplete semantic representation. Therefore, for this type of video data, the video data can be divided into multiple video data segments, such that each video data segment consists of all frame images corresponding to a complete subtitle line. That is, each video data segment is all frame images corresponding to a complete subtitle line from its first character to the complete subtitle line. Then, the frame image containing the complete subtitle can be determined from all frame images of each video data segment, and the subtitle information of that frame image can be used as at least one piece of text information included in the video data. Based on the above implementation, each complete subtitle line in the video data can be used as one piece of text information included in the video data, improving the processing efficiency of text information.

Referring to Figure 2c, Figure 2c is a schematic diagram of yet another scenario for obtaining text information contained in video data according to an embodiment of this application. Figure 2c shows one video data segment of a piece of video data. The segment contains 9 frames, and the subtitle information in each frame is part of one complete subtitle. For example, the subtitle in frame 1 is "台", the first character of "台风" (typhoon), and the subtitle in frame 2 is "台风". As the segment plays, the subtitle keeps changing with the picture until the complete subtitle "Typhoon Mangkhut is about to make landfall" is displayed. As Figure 2c shows, the subtitles in frames 1 to 8 are all incomplete and only the subtitle in frame 9 is complete. Therefore, for a segment such as the one in Figure 2c, the subtitle information contained in frame 9 can be used as one piece of text information contained in the video data.
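One way to pick the frame carrying the complete subtitle, as in Figure 2c, is a prefix check: a subtitle is treated as still growing while the next frame's subtitle starts with it. The sketch below works under that assumption; the function name `complete_captions` is hypothetical, and the rule would misfire if a complete subtitle happened to be a prefix of the subtitle that follows it.

```python
def complete_captions(frame_captions):
    """Return (frame_number, caption) for each caption that has stopped growing.

    A caption is skipped while the next frame's caption starts with it,
    which covers both character-by-character growth and plain repetition.
    Frame numbers are 1-based.
    """
    complete = []
    for i, caption in enumerate(frame_captions):
        if not caption:
            continue
        next_caption = frame_captions[i + 1] if i + 1 < len(frame_captions) else ""
        if next_caption.startswith(caption):
            continue  # still being revealed, or repeated in the next frame
        complete.append((i + 1, caption))
    return complete

# Two subtitles revealed one character per frame.
first, second = "Typhoon Mangkhut", "Let's see"
frames = ([first[:k] for k in range(1, len(first) + 1)]
          + [second[:k] for k in range(1, len(second) + 1)])
result = complete_captions(frames)
```

Only the last frame of each reveal survives, so each complete subtitle yields one piece of text information.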

可选的，对于不包含字幕信息的视频数据而言，若视频数据中包含语音数据，如视频旁白、新闻播报语音等，则可将语音数据进行语音识别，得到语音数据的语音识别结果，进而可将语音识别结果中的每个语句对应的文本内容，作为视频数据中包含的至少一个文本信息。Optionally, for video data that does not contain subtitle information, if the video data contains audio data, such as video narration or news broadcast audio, the audio data can be subjected to speech recognition to obtain the speech recognition result. Then, the text content corresponding to each sentence in the speech recognition result can be used as at least one piece of text information contained in the video data.

在一些可行的实施方式中，对于音频数据而言，由于音频数据中不包含帧图像，因此无法从帧图像中直接获取音频数据中包含的文本信息。此时可基于自然语言处理技术，将音频数据转化为文本内容，进而基于音频数据对应的文本内容确定音频数据中包含的至少一个文本信息。具体的，可对音频数据进行语音识别，以得到音频数据的语音识别结果。对于语音识别结果中的每一句对应的文本内容，可将其作为音频数据中包含的至少一个文本信息。也就是说，音频数据的语音识别结果中的每个语句，均可作为音频数据中包含的一个文本信息。In some feasible implementations, since audio data does not contain frame images, it is impossible to directly extract the text information contained in the audio data from the frame images. In this case, natural language processing techniques can be used to convert the audio data into text content, and then the at least one piece of text information contained in the audio data can be determined based on the corresponding text content. Specifically, speech recognition can be performed on the audio data to obtain the speech recognition result. The text content corresponding to each sentence in the speech recognition result can be used as at least one piece of text information contained in the audio data. That is, each sentence in the speech recognition result of the audio data can be used as a piece of text information contained in the audio data.

Referring to Figure 3, Figure 3 is a schematic diagram of a scenario for obtaining text information contained in audio data according to an embodiment of this application. As shown in Figure 3, after speech recognition is performed on the audio data, the recognition result includes two sentences: "Typhoon Mangkhut is about to make landfall" and "Let's see what happened." The text content of each of these two sentences can be used as one piece of text information contained in the audio data.
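Turning a recognition result into per-sentence text information can be sketched as below. The sketch assumes the speech recognizer returns a single punctuated transcript string, which not every recognizer does; the function name `split_sentences` and the set of sentence-ending marks are assumptions for illustration.

```python
import re

# A sentence is a run of non-terminator characters followed by any terminators.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")

def split_sentences(transcript):
    """Split a punctuated transcript into trimmed, non-empty sentences."""
    return [s.strip() for s in _SENTENCE.findall(transcript) if s.strip()]

transcript = "Typhoon Mangkhut is about to make landfall. Let's see what happened."
sentences = split_sentences(transcript)
```

A splitter this simple also breaks on decimal points and abbreviations, so a production system would more likely rely on the sentence boundaries and timestamps reported by the recognizer itself.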

Optionally, a sentence in the speech recognition result whose text is short expresses only limited meaning. Therefore, after the speech recognition result is obtained, short sentences can be removed from it, and the text content of each sentence whose text length is greater than a preset text length threshold can be used as one piece of text information contained in the audio data.

Optionally, when the speech recognition result of the audio data contains sentences with identical text content, either the text content of the sentence with the earliest playback time can be used as one piece of text information contained in the audio data, or the text content of every such sentence can be used as text information contained in the audio data. The choice can be made according to the requirements of the actual application scenario and is not limited here.
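The two optional filters described above, a length threshold and the handling of repeated sentences, can be combined in a few lines. This is a sketch under stated assumptions: each recognized sentence arrives as a `(start_seconds, text)` pair, and the default threshold of 4 characters, the function name `select_sentences`, and the `keep_duplicates` switch are illustrative choices rather than values given in the application.

```python
def select_sentences(sentences, min_length=4, keep_duplicates=False):
    """Filter recognized sentences given as (start_seconds, text) pairs.

    Sentences whose text length is not greater than min_length are dropped.
    Unless keep_duplicates is set, only the earliest occurrence of each
    distinct text is kept.
    """
    selected = []
    seen = set()
    for start, text in sorted(sentences):  # earliest playback time first
        if len(text) <= min_length:
            continue
        if not keep_duplicates and text in seen:
            continue
        seen.add(text)
        selected.append((start, text))
    return selected

recognized = [
    (12.0, "Let's see what happened"),
    (3.5, "Typhoon Mangkhut is about to make landfall"),
    (8.0, "Um"),
    (20.0, "Let's see what happened"),
]
unique_texts = select_sentences(recognized)
all_texts = select_sentences(recognized, keep_duplicates=True)
```

Measuring length in characters suits Chinese text; for space-delimited languages a word count may be the more natural unit.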

可选的，由于语音识别结果中往往存在部分语气词，以及其他无意义的词，因此在得到音频数据的语音识别结果之后，可对语音识别结果进行筛选，以去除语气词以及其他无意义词，进而在筛选后的语音识别结果的基础上，基于上述任一种可行的实施方式，确定音频数据中包含的至少一个文本信息。Optionally, since speech recognition results often contain some interjections and other meaningless words, after obtaining the speech recognition results of the audio data, the speech recognition results can be filtered to remove interjections and other meaningless words. Then, based on the filtered speech recognition results, at least one text information contained in the audio data can be determined according to any of the above feasible implementation methods.
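Removing interjections before the text information is determined might look like the sketch below. The filler list is a hypothetical English stand-in (the application does not enumerate the words to remove), and splitting on whitespace is an assumption that would need to be replaced by word segmentation for Chinese text.

```python
# Hypothetical filler list; a real system would use a language-specific one.
FILLER_WORDS = {"um", "uh", "er", "hmm"}

def strip_fillers(sentence, fillers=FILLER_WORDS):
    """Drop whitespace-separated words that are fillers, ignoring case and
    surrounding punctuation, and rejoin the remaining words."""
    kept = [
        word for word in sentence.split()
        if word.strip(".,!?").lower() not in fillers
    ]
    return " ".join(kept)

cleaned = strip_fillers("Um, the typhoon is, uh, about to land")
```

The cleaned sentences can then go through whichever of the selection rules above is in use.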

可选的，当音频数据为视频数据中包含的语音数据时，可基于音频数据的语音识别结果中的至少一个语句对应的文本内容，确定视频数据中包含的至少一个文本信息。Optionally, when the audio data is speech data contained in the video data, at least one text information contained in the video data can be determined based on the text content corresponding to at least one sentence in the speech recognition result of the audio data.

在一些可行的实施方式中，多媒体数据的标题信息可以为多媒体数据的文件名，可以为多媒体数据相关联的主题信息以及简要描述等，如短视频平台的视频标题，博客内容中关于视频、语音的内容标签等，具体可基于实际应用场景确定，在此不做限制。In some feasible implementations, the title information of multimedia data can be the file name of the multimedia data, the topic information associated with the multimedia data, and a brief description, such as the video title on a short video platform, or content tags about video and audio in blog content. The specific details can be determined based on the actual application scenario and are not limited here.

步骤S102、确定标题信息与各所述文本信息的匹配度。Step S102: Determine the matching degree between the title information and each of the text information.

在一些可行的实施方式中，在获取到多媒体数据的标题信息之后，可确定标题信息和至少一个文本信息中各文本信息的匹配度，进而根据各文本信息对应的匹配度来确定多媒体数据中的目标播放时间区域。In some feasible implementations, after obtaining the title information of the multimedia data, the matching degree between the title information and each text information in at least one text information can be determined, and then the target playback time area in the multimedia data can be determined based on the matching degree corresponding to each text information.

The matching degree between the title information and each piece of text information indicates how closely the two are associated, and thereby how closely the multimedia content corresponding to that text information is associated with the title information. The higher the matching degree between a piece of text information and the title information, the more closely the corresponding multimedia content is associated with the title information and the closer that content is to the title; that is, the corresponding multimedia content can be treated as the main content of the multimedia data.

在一些可行的实施方式中，标题信息与多媒体数据包含的各文本信息的匹配度，可以为标题信息与多媒体数据包含的各文本信息的文本相似度。也就是说，标题信息与任一文本信息的文本相似度越高，说明该文本信息对应的多媒体内容与标题信息的关联性越高。In some feasible implementations, the matching degree between the title information and the text information contained in the multimedia data can be considered as the text similarity between the title information and the text information contained in the multimedia data. That is, the higher the text similarity between the title information and any text information, the higher the correlation between the multimedia content corresponding to that text information and the title information.

The text similarity between the title information and each piece of text information can be determined by calculating, for example, the cosine similarity, Euclidean distance, Hamming distance, or Jaccard similarity between them. The specific measure can be determined based on the actual application scenario and is not limited here.
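As an illustration of two of the measures listed above, the following minimal sketch computes cosine similarity and Jaccard similarity over bag-of-words representations. It assumes the title and the text information have already been segmented into word lists; the function names and the sample words are hypothetical and do not come from this application.

```python
import math
from collections import Counter


def cosine_similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity between the word-count vectors of two word lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty word list has no direction to compare
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: list[str], b: list[str]) -> float:
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical, already segmented inputs.
title = ["universities", "opening", "dates", "confirmed"]
segment = ["universities", "announce", "opening", "dates"]
print(cosine_similarity(title, segment), jaccard_similarity(title, segment))
```

Euclidean and Hamming distance are distances rather than similarities, so if they were used, a smaller value would indicate a closer match and the comparison against a threshold would have to be inverted accordingly.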

In some feasible implementations, the matching degree between the title information and each piece of text information contained in the multimedia data can also be determined based on the keywords in the title information. The keywords in the title information are the words that express its main information. For example, if the title information is "Opening dates for universities in multiple locations confirmed", its keywords may be "universities", "opening dates", and "confirmed". It should be noted that the specific way of determining the keywords in the title information can be chosen according to the title information itself and the requirements of the actual application scenario, and is not limited here.

Specifically, when the matching degree corresponding to each piece of text information is determined based on the keywords in the title information, the keywords in the title information can be determined first. Each piece of text information is then segmented to obtain all of the words it contains. For each piece of text information, all of its words can be matched against the keywords in the title information to obtain the number of times each keyword appears in it, for example, "universities" twice and "opening dates" once. The greater the total number of occurrences of all title keywords in a piece of text information, the more closely that text information is associated with the title information. For instance, if the keywords appear 2 times in total in one piece of text information and 8 times in total in another, the latter is clearly more closely associated with the title information. Therefore, for each piece of text information, the total number of occurrences of all keywords in it is taken as the matching degree between the title information and that text information.
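The keyword-counting procedure described above can be sketched as follows. This is a minimal illustration that assumes each piece of text information has already been segmented into a list of words by some tokenizer, which is not shown; `keyword_match_degree` and the sample data are hypothetical names.

```python
from collections import Counter


def keyword_match_degree(words: list[str], keywords: list[str]) -> int:
    """Total number of times any title keyword occurs among the segmented words."""
    counts = Counter(words)
    # Counter returns 0 for keywords that never occur.
    return sum(counts[k] for k in keywords)


# Hypothetical title keywords and segmented text information.
KEYWORDS = ["universities", "opening dates", "confirmed"]
text_a = ["universities", "announced", "opening dates", "universities"]
text_b = ["weather", "today"]

print(keyword_match_degree(text_a, KEYWORDS))  # "universities" x2 + "opening dates" x1
print(keyword_match_degree(text_b, KEYWORDS))
```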

Optionally, although the keywords in the title information express its main information, the meaning carried by each keyword is not equally important. For example, among the keywords "universities", "opening dates", and "confirmed" in the title information "Opening dates for universities in multiple locations confirmed", the meanings of "universities" and "opening dates" are clearly more important than that of "confirmed". Using the raw number of keyword occurrences in a piece of text information as its matching degree ignores the importance of each keyword to the title information, so a piece of text information that is closely associated with the title information but contains fewer keyword occurrences may, to some extent, receive a lower matching degree. In this case, for each piece of text information, the number of occurrences of each keyword appearing in it can be determined. Then, based on the weight of each keyword, the weight sum of the keywords for that text information is obtained, and this weight sum is taken as the matching degree between the title information and that text information.
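A minimal sketch of the weighted variant follows. It assumes that the weight sum is the sum over keywords of occurrence count multiplied by keyword weight, which is one reasonable reading of the passage above; the weights themselves are invented for illustration.

```python
from collections import Counter


def weighted_match_degree(words: list[str], weights: dict[str, float]) -> float:
    """Sum of (occurrence count x keyword weight) over all title keywords."""
    counts = Counter(words)
    return sum(counts[k] * w for k, w in weights.items())


# Hypothetical weights: the content-bearing keywords outweigh "confirmed".
WEIGHTS = {"universities": 0.4, "opening dates": 0.4, "confirmed": 0.2}

relevant = ["universities", "opening dates"]       # 2 keyword occurrences
repetitive = ["confirmed", "confirmed", "confirmed"]  # 3 keyword occurrences

print(weighted_match_degree(relevant, WEIGHTS))
print(weighted_match_degree(repetitive, WEIGHTS))
```

With these assumed weights, the text with fewer but more important keyword occurrences scores higher than the text that only repeats a low-weight keyword, which is the situation the weighting is meant to correct.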

In some feasible implementations, before the matching degree between the title information and each piece of text information is determined, it can first be determined whether any piece of text information contains specified information; the matching degree is then determined only if no such text information exists. The specified information is a common stock phrase used to announce the main playback content of the multimedia data, such as "Let's see what happens next" or "This episode mainly includes the following content", both of which are common in short videos. The specific text of the specified information can be determined based on the requirements of the actual application scenario and is not limited here. In other words, when the text information included in the multimedia data contains a piece of text information with the specified information, the playback content corresponding to that text information is the main multimedia content of the multimedia data, so that text information can be regarded as strongly associated with the title information. Therefore, any piece of text information that contains the specified information can be determined to be associated with the title information.

Further, when none of the text information contains the specified information, the text similarity between the title information and each piece of text information can be determined first. If there is a text similarity that satisfies a preset condition, the text similarity corresponding to each piece of text information can be taken as its matching degree. The preset condition may be, for example, that at least one text similarity exceeds a text similarity threshold, or that the number of such text similarities exceeds a certain number. The specific condition can be determined based on the actual application scenario and is not limited here.

When none of the text similarities corresponding to the text information satisfies the preset condition, the matching degree between the title information and each piece of text information can be determined according to the number of times the title keywords appear in that text information. Alternatively, the weight sum corresponding to each piece of text information can be determined according to the number of times each title keyword appears in it and the weight of each keyword, and that weight sum is then taken as the matching degree between the title information and that text information.

Step S103: Determine the target playback time region in the multimedia data according to the matching degree corresponding to each piece of text information.

In some feasible implementations, the target playback time region is the playback time region of the multimedia content associated with the title information in the multimedia data, or the playback time region of the main content of the multimedia data. For example, if the title information of a short video is "Car accident on Xinhua Street", the target playback time region in the short video may be the playback time region of the video content showing the accident scene.

When the multimedia data is video data, the playback time region corresponding to each piece of text information is the playback time region of the frame images corresponding to that text information. When the multimedia data is audio data, the playback time region of a piece of text information is the playback time region, in the audio data, of the first character of that text information.

Optionally, the start playback time of the frame images or audio data corresponding to each piece of text information can be regarded as the playback time region corresponding to that text information. In this case, the playback time region corresponding to each piece of text information can denote a time region with a very small span, or simply the start playback time corresponding to the text information. The specific choice can be determined based on the requirements of the actual application scenario and is not limited here.

In some feasible implementations, because the matching degree corresponding to each piece of text information represents its degree of association with the title information, once the matching degrees have been determined, the playback time region corresponding to the text information that satisfies a matching condition can be determined as the target playback time region in the multimedia data. That is, when the degree of association between any piece of text information and the title information reaches a certain level, the multimedia content corresponding to that text information can be determined to be related to the title information.

The text information that satisfies the matching condition may be the text information whose matching degree is higher than a matching degree threshold, or the text information with the highest matching degree among all the text information. The specific condition can be determined based on the actual application scenario and is not limited here. The matching degree threshold can likewise be determined based on the application scenario and is not limited here.
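The two matching conditions just described (above a threshold, or highest among all) might be expressed as a single selection helper. This is a sketch only; `select_matching` is a hypothetical name, and the strict "greater than" comparison follows the wording "higher than the matching degree threshold".

```python
from typing import Optional


def select_matching(scores: list[float], threshold: Optional[float] = None) -> list[int]:
    """Return the indices of the text information that satisfies the matching condition.

    With a threshold: every index whose score is strictly higher than it.
    Without a threshold: every index that attains the highest score (ties included).
    """
    if not scores:
        return []
    if threshold is not None:
        return [i for i, s in enumerate(scores) if s > threshold]
    best = max(scores)
    return [i for i, s in enumerate(scores) if s == best]


scores = [0.2, 0.7, 0.9, 0.9]
print(select_matching(scores, threshold=0.5))
print(select_matching(scores))
```

Both conditions can return more than one index, which is why a further selection step among the matching text information may be needed.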

Optionally, when the matching degree corresponding to each piece of text information is the text similarity between the title information and that text information, the playback time region corresponding to the text information whose text similarity satisfies the above matching condition can be determined as the target playback time region in the multimedia data. When the matching degree is a text similarity, the matching degree threshold is the corresponding similarity threshold. See Figure 4, which is a schematic diagram of determining the target playback time region based on text similarity according to an embodiment of this application. Assume that the text similarity between the title information and each piece of text information is the cosine similarity. To determine the text similarity between the title information and a given piece of text information, the title information and the text information can be vectorized to obtain a text information vector and a title information vector. The cosine similarity between the title information and the text information is determined from these two vectors and taken as the matching degree between them. The playback time region corresponding to the text information that satisfies the matching condition is then determined as the target playback time region in the multimedia data.

Optionally, when the matching degree corresponding to each piece of text information in the multimedia data is determined according to the number of times the title keywords appear in that text information, the playback time region corresponding to the text information whose matching degree satisfies the matching condition can be determined as the target playback time region in the multimedia data. See Figure 5, which is a schematic diagram of determining the target playback time region based on keywords according to an embodiment of this application. In Figure 5, each piece of text information is segmented to extract its words. The words in the text information are then matched against the keywords in the title information to determine the number of times each keyword appears in each piece of text information. The matching degree corresponding to each piece of text information is determined from these counts, and the playback time region corresponding to the text information that satisfies the matching condition is determined as the target playback time region in the multimedia data.

Optionally, when the matching degree corresponding to each piece of text information is the weight sum determined from the title keywords appearing in that text information, the playback time region corresponding to the text information whose matching degree satisfies the matching condition can be determined as the target playback time region in the multimedia data. In this case, the matching condition is that the weight sum is higher than a weight sum threshold, or that the weight sum is the highest weight sum.

In some feasible implementations, to avoid determining too many target playback time regions with the above implementations, when at least two pieces of text information satisfy the matching condition, the playback time regions corresponding to one or more of them can be selected, according to a preset selection method, as the target playback time regions in the multimedia data.

Optionally, the playback time in the multimedia data of each piece of text information that satisfies the matching condition can be determined, and the playback time region corresponding to the first such piece of text information, or to the first preset number of them, is determined as the target playback time region in the multimedia data.

Optionally, a preset number of pieces of text information are randomly selected from the text information that satisfies the matching condition, and their corresponding playback time regions are determined as the target playback time regions in the multimedia data.

Optionally, according to the playback time in the multimedia data of each piece of text information that satisfies the matching condition, the playback time region corresponding to the first such piece of text information is determined as one target playback time region in the multimedia data. For any piece of text information after the first, if the time distance between its playback time region and the adjacent preceding playback time region is not less than a time distance threshold, its playback time region can also be determined as a target playback time region in the multimedia data.
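The time-distance selection above can be sketched as follows. Two assumptions are made that the text does not spell out: the "adjacent preceding playback time region" is taken to be the most recently selected target region, and the time distance is measured from the end of that region to the start of the current one. Regions are hypothetical `(start, end)` pairs in seconds.

```python
Region = tuple[float, float]  # (start, end) in seconds


def select_spaced_regions(regions: list[Region], min_gap: float) -> list[Region]:
    """Keep the first matching region, then each later region that starts at
    least `min_gap` seconds after the end of the previously kept region."""
    selected: list[Region] = []
    for start, end in sorted(regions):  # playback order
        if not selected or start - selected[-1][1] >= min_gap:
            selected.append((start, end))
    return selected


# Hypothetical playback time regions of text information that met the matching condition.
matched = [(0, 5), (6, 10), (20, 25), (26, 30), (50, 55)]
print(select_spaced_regions(matched, min_gap=10))
```

Regions that follow a kept region too closely are dropped, so clusters of adjacent matches collapse into a single target playback time region.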

It should be noted that the above ways of determining the target playback time region in the multimedia data from the text information that satisfies the matching condition, based on a preset selection method, are only examples. The specific implementation can be determined based on the actual application scenario and is not limited here.

In some feasible implementations, when the matching condition is that the matching degree is higher than the matching degree threshold, the matching degree corresponding to each piece of text information can be determined one by one in playback order, according to the playback time of each piece of text information in the multimedia data, and each matching degree is compared with the matching degree threshold as soon as it is determined. The playback time region corresponding to the first piece of text information, or to the first preset number of pieces, whose matching degree is higher than the threshold is then determined as the target playback time region in the multimedia data. Alternatively, after the playback time region corresponding to the first piece of text information whose matching degree is higher than the threshold has been determined as a target playback time region, the next piece of text information whose matching degree is higher than the threshold can be determined. If the time distance between its playback time region and the playback time region of the first such piece of text information is not less than the time distance threshold, its playback time region can also be determined as a target playback time region in the multimedia data, and so on, until all target playback time regions in the multimedia data have been determined.
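One benefit of evaluating text information in playback order is that scoring can stop as soon as enough matches are found. The following sketch illustrates this under assumed inputs: `segments` is a hypothetical list of `(region, text)` pairs, and `score` is any caller-supplied function that returns a matching degree for a piece of text.

```python
from typing import Callable

Region = tuple[float, float]  # (start, end) in seconds


def first_matches(
    segments: list[tuple[Region, str]],
    score: Callable[[str], float],
    threshold: float,
    limit: int = 1,
) -> list[Region]:
    """Score text information in playback order and stop once `limit`
    regions with a matching degree above `threshold` have been found."""
    targets: list[Region] = []
    for region, text in sorted(segments, key=lambda s: s[0][0]):
        if score(text) > threshold:
            targets.append(region)
            if len(targets) == limit:
                break  # later text information is never scored
    return targets
```

For example, with three segments where only the second and third mention the title topic, `limit=1` returns the second segment's region and the third segment is never scored.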

In some feasible implementations, when the matching degree between the title information and each piece of text information is determined by the number of times the title keywords appear in that text information, the total number of occurrences of all keywords in a piece of text information can be taken as its matching degree. If the matching condition is that this total is higher than a count threshold and multiple pieces of text information satisfy it, the weight sum corresponding to each of them is determined based on the weight of each keyword, and the playback time region corresponding to the text information with the highest weight sum is determined as the target playback time region. If several of the pieces of text information that satisfy the matching condition share the highest weight sum, the playback time region corresponding to the first of them in playback order can be determined as the target playback time region in the multimedia data. Alternatively, the target playback time region can be determined from the pieces of text information sharing the highest weight sum by using the above preset selection method, which is not limited here.
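The tie-breaking order described above (count threshold first, then highest weight sum, then earliest playback time) can be sketched as below. The `Candidate` record and its field names are hypothetical; the counts and weight sums are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    start: float        # start of the playback time region, in seconds
    end: float          # end of the playback time region, in seconds
    total_count: int    # total occurrences of all title keywords
    weight_sum: float   # weighted sum of keyword occurrences


def pick_target(cands: list[Candidate], count_threshold: int) -> Optional[Candidate]:
    """Among candidates whose keyword count exceeds the threshold, return the one
    with the highest weight sum; break remaining ties by earliest playback time."""
    matched = [c for c in cands if c.total_count > count_threshold]
    if not matched:
        return None
    best = max(c.weight_sum for c in matched)
    return min((c for c in matched if c.weight_sum == best), key=lambda c: c.start)


cands = [
    Candidate(40.0, 45.0, total_count=5, weight_sum=1.8),
    Candidate(10.0, 15.0, total_count=4, weight_sum=1.8),
    Candidate(0.0, 5.0, total_count=1, weight_sum=0.4),
    Candidate(20.0, 25.0, total_count=6, weight_sum=1.2),
]
print(pick_target(cands, count_threshold=3))
```

The mirrored case, where the weight sum is the primary matching degree and the total count breaks ties, would follow the same pattern with the two fields swapped.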

Optionally, if the matching condition is instead that the total count is the highest count, and multiple pieces of text information satisfy it (that is, several pieces of text information share the highest total count), the text information with the highest weight sum among them is determined based on the number of times each keyword appears in each of them and the weight of each keyword, and the playback time region corresponding to that text information is determined as the target playback time region. If several of the pieces of text information that satisfy the matching condition share the highest weight sum, the playback time region corresponding to the first of them in playback order can be determined as the target playback time region. Alternatively, the target playback time region can be determined from the pieces of text information sharing the highest weight sum by using the above preset selection method, which is not limited here.

In some feasible implementations, when the matching degree between the title information and each piece of text information is the weight sum corresponding to that text information, and the matching condition is that the weight sum is higher than the weight sum threshold or that the weight sum is the highest weight sum, then if multiple pieces of text information satisfy the matching condition (several have a weight sum higher than the weight sum threshold, or several share the highest weight sum), the playback time region corresponding to the text information with the highest total number of keyword occurrences is determined as the target playback time region. If several of the pieces of text information that satisfy the matching condition share the highest total count, the playback time region corresponding to the first of them in playback order can be determined as the target playback time region in the multimedia data. Alternatively, the target playback time region can be determined from the pieces of text information sharing the highest total count by using the above preset selection method, which is not limited here.

In some feasible implementations, because the specified information described in step S102 can announce the main playback content of the multimedia data, the target playback time region in the multimedia data can be determined based on the specified information before the matching degree between the title information and each piece of text information is determined. See Figure 6, which is a schematic diagram of determining the target playback time region based on specified information according to an embodiment of this application. After the at least one piece of text information contained in the multimedia data is obtained, each piece of text information can be compared with the specified information, and the playback time region corresponding to the text information that contains the specified information is determined as the target playback time region in the multimedia data. One or more items of specified information can be used for this purpose, depending on the actual application scenario, which is not limited here. For example, the playback time region corresponding to text information that contains any one of the items of specified information can be determined as the target playback time region in the multimedia data.

Specifically, each piece of text information can be matched against the specified information in playback order. During matching, the playback time region corresponding to the first piece of text information, or to the first preset number of pieces, that contains any item of specified information is determined as the target playback time region in the multimedia data.

Optionally, all the text information that contains any item of specified information can be identified first, and the text information containing different items of specified information is then selected from it. If multiple pieces of text information contain the same item of specified information, the one with the earliest playback time is selected. In other words, with this implementation, pieces of text information that each contain a different item of specified information are identified from all the text information contained in the multimedia data (the specified information contained in any one of them differs from that contained in the others), and the playback time regions corresponding to these pieces of text information are determined as the target playback time regions in the multimedia data.
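The per-phrase selection above can be sketched as follows, assuming plain substring matching is an acceptable test for "contains the specified information". The function name, the `(region, text)` input shape, and the sample phrases are illustrative only.

```python
Region = tuple[float, float]  # (start, end) in seconds


def regions_by_cue_phrase(
    segments: list[tuple[Region, str]], cue_phrases: list[str]
) -> dict[str, Region]:
    """Map each item of specified information to the playback time region of
    the earliest text information that contains it."""
    earliest: dict[str, Region] = {}
    for region, text in sorted(segments, key=lambda s: s[0][0]):  # playback order
        for phrase in cue_phrases:
            if phrase in text and phrase not in earliest:
                earliest[phrase] = region
    return earliest


CUES = ["Let's see what happens next", "This episode mainly includes the following content"]
segments = [
    ((30.0, 35.0), "This episode mainly includes the following content"),
    ((5.0, 8.0), "Let's see what happens next"),
    ((50.0, 55.0), "Let's see what happens next, shall we"),
]
print(regions_by_cue_phrase(segments, CUES))
```

In this sample, the phrase that occurs twice is mapped only to its earlier occurrence, so each item of specified information contributes at most one target playback time region.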

Further, see Figure 7, which is a schematic diagram of determining the target playback time region according to an embodiment of this application. In Figure 7, if any of the text information in the multimedia data contains the specified information, the target playback time region in the multimedia data can be determined successfully. If none of the text information contains the specified information, that is, if determining the target playback time region based on the specified information fails, the text similarity between the title information and each piece of text information can be determined and used as the matching degree, and the target playback time region is then determined based on the matching condition in the manner described above, which is not repeated here. If no target playback time region is determined when text similarity is used as the matching degree, that is, if determining the target playback time region based on text similarity fails, keyword matching can be used to determine the number of times the title keywords appear in each piece of text information and thereby the matching degree between the title information and each piece of text information. The target playback time region is then determined based on the matching condition in the manner described above, which is not repeated here.
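The fallback sequence of Figure 7 can be viewed as an ordered list of strategies that are tried until one yields a non-empty result. The sketch below assumes that each strategy is a zero-argument callable returning a list of playback time regions, with an empty list meaning that the strategy failed; these conventions and names are hypothetical.

```python
from typing import Callable, Optional

Region = tuple[float, float]  # (start, end) in seconds
Strategy = Callable[[], list[Region]]


def determine_targets(
    strategies: list[tuple[str, Strategy]]
) -> tuple[Optional[str], list[Region]]:
    """Try each strategy in order; return the name and result of the first
    one that finds at least one target playback time region."""
    for name, strategy in strategies:
        regions = strategy()
        if regions:
            return name, regions
    return None, []


# Stand-in strategies; real ones would inspect the text information.
pipeline = [
    ("specified information", lambda: []),          # no cue phrase found
    ("text similarity", lambda: [(4.0, 9.0)]),      # succeeds
    ("keyword matching", lambda: [(1.0, 2.0)]),     # never reached
]
print(determine_targets(pipeline))
```

Because the order is just the order of the list, a different sequence or a subset of the strategies can be expressed by rearranging or trimming `pipeline`.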

It should be noted that the above specific implementations of determining the target playback time region in the multimedia data based on specified information are only examples. The specific implementation can be determined based on the actual application scenario and is not limited here.

In some feasible implementations, the order in which the three approaches are combined in Figure 7 is only an example. The three approaches are determining the target playback time region based on specified information, determining it by using text similarity as the matching degree, and determining it based on the number of occurrences of each keyword in the text information. The specific order of combination can be determined based on the actual application scenario and is not limited here.

For example, the text similarity between the title information and each piece of text information can be determined first and used as the matching degree to determine the target playback time region in the multimedia data based on the matching condition. If determining the target playback time region based on text similarity fails, the matching degree of each piece of text information can be determined based on the number of times the title keywords appear in it, and the target playback time region is then determined according to the matching condition. If determining the target playback time region based on keyword occurrences also fails, the target playback time region in the multimedia data is determined based on the specified information.

Optionally, any two of the above three approaches for determining the target playback time region can be combined to obtain a new approach. The specific choice and order of combination can also be determined based on the actual application scenario and are not limited here.

例如，可先确定标题信息与各文本信息的文本相似度，将文本相似度作为匹配度以基于匹配条件确定多媒体数据中的目标播放时间区域。在基于文本相似度确定目标播放时间区域失败时，基于上述指定信息确定多媒体数据中的目标播放时间区域。For example, the text similarity between the title information and each piece of text information can be determined first. This text similarity can then be used as a matching score to determine the target playback time region in the multimedia data based on the matching criteria. If determining the target playback time region based on text similarity fails, the target playback time region in the multimedia data can be determined based on the aforementioned specified information.

步骤S104、根据目标播放时间区域对多媒体数据进行处理。Step S104: Process the multimedia data according to the target playback time range.

在一些可行的实施方式中，基于步骤S102确定出的目标播放时间区域所对应的多媒体内容，为与多媒体数据的标题信息相关联的主要内容。因此，在确定出多媒体数据中的目标播放时间区域之后，可基于目标播放时间区域生成播放提示信息，以提示多媒体数据的主要内容的目标播放时间区域。In some feasible implementations, the multimedia content corresponding to the target playback time area determined in step S102 is the main content associated with the title information of the multimedia data. Therefore, after determining the target playback time area in the multimedia data, playback prompt information can be generated based on the target playback time area to indicate the target playback time area of the main content of the multimedia data.

如对于短视频应用而言，短视频用户在使用短视频应用时，不同用户会因为其性格以及其所处环境等因素，会导致用户对于短视频的铺垫(即与标题信息不相关或者相关性较低的视频内容)长短的忍受度不同。因此，基于目标播放时间区域所生成的播放提示信息可向短视频用户提示短视频的亮点(即与标题信息相关的主要视频内容)对应的目标播放时间区域，以满足短视频用户的快速观看需求，提升用户体验度。For short video applications, different users may have varying tolerance levels for the length of prelude (video content unrelated or only marginally relevant to the title) due to factors such as personality and environment. Therefore, playback prompts generated based on the target playback time range can indicate the target playback time range of the video's highlights (the main video content relevant to the title), satisfying users' need for quick viewing and improving user experience.

For a music application, the text information corresponding to the target playback time region is related to the title information, so the audio content in that region may be the chorus (climax) of the song. The music application can therefore use the playback prompt information to indicate the target playback time region of the chorus, so that the user can listen to the chorus directly or extract it based on the playback prompt information (for example, to use the chorus as a ringtone).

Specifically, when the playback prompt information is used to indicate the target playback time region of the main content of the multimedia data, it may be text or voice prompt information, or a symbol or graphic such as a marker on the video playback progress bar. Alternatively, the frame image or text information corresponding to the target playback time region may serve as the playback prompt information. The specific form may be determined based on the requirements of the actual application scenario and is not limited here. Furthermore, while the multimedia data is being played (for example, while video frames or audio content are being played), the playback prompt information may be displayed to the user to indicate the target playback time region of the main content related to the title information, so that the user can quickly browse the main content of the multimedia data based on that region.

Referring to Figure 8, Figure 8 is a schematic diagram of a scenario for processing multimedia data according to an embodiment of this application. The multimedia data in Figure 8 is video data whose title information is "The Mystery of Dinosaur Extinction", which indicates that the multimedia data mainly presents content related to the extinction of dinosaurs through video frames. After the target playback time region of the multimedia data is determined based on step S102, the playback prompt information generated from it may be shown as the indicated time region on the video progress bar in Figure 8; that is, the indicated time region serves as the playback prompt information for the multimedia data. Alternatively, the frame image corresponding to the target playback time region may be displayed to the user as the playback prompt information at the position of that region on the video progress bar. In other words, the video content explaining that "dinosaurs have become extinct" is used as the playback prompt information to indicate the playback time region of the main content related to the title information "The Mystery of Dinosaur Extinction".

Optionally, when multiple target playback time regions are determined, multiple pieces of playback prompt information may likewise be generated to indicate to the user the playback time regions of the multiple pieces of content associated with the title information. For example, when the multimedia data is movie data, multiple pieces of playback prompt information may be generated from the multiple target playback time regions to indicate the playback time regions of several main parts of the movie (such as its climaxes), which helps improve the user's viewing experience.

Optionally, if no target playback time region can be determined for the multimedia data, the playback content may be considered to have low relevance to the title information; that is, the multimedia data may contain no substantive content. In this case, content prompt information may be generated and shown to the user when the multimedia data is played, informing the user that the multimedia data being played may contain no content associated with the title information. This reduces the time the user wastes on the multimedia data and improves applicability.
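The prompt generation described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `build_prompts`, the dictionary fields, and the wording of the content prompt are hypothetical, and a progress-bar marker is just one of the prompt forms mentioned in the description.

```python
from typing import Dict, List, Tuple

Region = Tuple[float, float]  # (start_seconds, end_seconds), an assumed representation


def build_prompts(regions: List[Region], total_seconds: float) -> List[Dict]:
    """Turn target playback time regions into progress-bar markers.

    If no region was determined, return a single content prompt instead.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not regions:
        return [{
            "kind": "content_prompt",
            "message": "This item may contain no content related to its title.",
        }]
    return [
        {
            "kind": "progress_marker",
            # Positions are fractions of the progress bar, from 0.0 to 1.0.
            "start_fraction": start / total_seconds,
            "end_fraction": end / total_seconds,
        }
        for start, end in sorted(regions)
    ]


print(build_prompts([(30.0, 45.0)], total_seconds=120.0))
print(build_prompts([], total_seconds=120.0))
```

One marker is produced per target playback time region, so the multi-region case needs no special handling.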

在一些可行的实施方式中，由于上述播放提示信息对应于多媒体数据中的主要内容，因此基于目标播放时间区域对多媒体数据进行筛选。如将目标播放时间区域时长超过一定时长阈值的多媒体数据，确定为目标多媒体数据，即目标多媒体数据的主要内容所对应的播放时长占据多媒体数据对应的总播放时长较大比例，从而可说明目标多媒体数据存在较少的与标题信息不相关的内容。In some feasible implementations, since the aforementioned playback prompt information corresponds to the main content of the multimedia data, the multimedia data is filtered based on the target playback time range. For example, multimedia data whose target playback time range duration exceeds a certain duration threshold is identified as target multimedia data. That is, the playback duration corresponding to the main content of the target multimedia data occupies a large proportion of the total playback duration of the multimedia data, which indicates that the target multimedia data contains relatively little content unrelated to the title information.

可选的，还可基于目标播放时间区域的数量作为筛选依据，筛选出目标播放时间区域较多的目标多媒体数据。即此时目标多媒体数据包含多段与标题信息相关的内容。Optionally, the number of target playback time ranges can also be used as a filtering criterion to select target multimedia data with a large number of target playback time ranges. That is, the target multimedia data at this time contains multiple segments of content related to the title information.

其中，上述基于目标播放时间区域对多媒体数据进行筛选的方式仅为示例，具体可基于实际应用场景需求确定，在此不做限制。The above method of filtering multimedia data based on the target playback time range is only an example. The specific method can be determined based on the actual application scenario requirements, and no restrictions are imposed here.
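The two screening criteria mentioned above (total duration of the target playback time regions and number of such regions) can be sketched as follows. The function names, the catalog layout, and the threshold values are assumptions for illustration only, and the sketch assumes that the regions of one item do not overlap.

```python
from typing import Dict, List, Tuple

Region = Tuple[float, float]  # (start_seconds, end_seconds), an assumed representation


def total_target_seconds(regions: List[Region]) -> float:
    """Sum the lengths of the regions (assumes they do not overlap)."""
    return sum(end - start for start, end in regions)


def filter_by_duration(catalog: Dict[str, List[Region]], min_seconds: float) -> List[str]:
    """Keep items whose target regions last longer than the duration threshold."""
    return [
        media_id
        for media_id, regions in catalog.items()
        if total_target_seconds(regions) > min_seconds
    ]


def filter_by_count(catalog: Dict[str, List[Region]], min_count: int) -> List[str]:
    """Keep items that have at least min_count target regions."""
    return [media_id for media_id, regions in catalog.items() if len(regions) >= min_count]


catalog = {
    "clip_a": [(5.0, 25.0)],
    "clip_b": [(50.0, 55.0)],
    "clip_c": [(0.0, 8.0), (20.0, 30.0)],
}
print(filter_by_duration(catalog, min_seconds=10.0))  # ['clip_a', 'clip_c']
print(filter_by_count(catalog, min_count=2))          # ['clip_c']
```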

对于短视频应用而言，短视频应用可基于目标播放时间区域对短视频进行筛选，以对短视频进行更好的推荐与管理。如短视频应用可基于目标播放时间区域对用户上传的短视频进行审核筛选，将目标播放时间区域时长较短的短视频不予以审核通过，进而提高短视频应用中的各个短视频的视频质量。或者短视频应用在向用户推荐短视频时，优先向用户推荐目标播放时间区域较长，或者目标播放时间区域较多的短视频，以提升用户短视频观看体验。For short video apps, filtering short videos based on target playback time ranges allows for better recommendation and management. For example, apps can review and filter user-uploaded videos based on their target playback time range, rejecting those with shorter durations within that range, thus improving the overall video quality. Alternatively, when recommending videos to users, apps can prioritize videos with longer or more target playback time ranges to enhance the user's viewing experience.

Optionally, a recommendation strategy for recommending multimedia data to the user may also be determined based on the target playback time region of the multimedia data. For example, multimedia data may be recommended to the user based on the user's playback habit information and the target playback time regions of the multimedia data. The playback habit information includes, but is not limited to, the duration of each piece of historical multimedia data the user has played (that is, the time required to play it in full) and the user's playback time for each piece of historical multimedia data (the time the user actually spent browsing and/or listening to it). The specific information may be determined based on the requirements of the actual application scenario and is not limited here.

Furthermore, again for a short video application, if the user's playback history shows that the user prefers shorter videos, or that the user spends little time on each video played, the user has limited tolerance for video content unrelated to the title information. The short video application may therefore use the target playback time region of each short video to recommend short videos whose target playback time region occurs earlier, so that the user reaches the video content related to the title information within a shorter time.
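A minimal sketch of this recommendation idea is given below. The names `prefers_short_viewing` and `rank_by_earliest_target`, the use of the mean viewing time as the habit signal, and the 15-second limit are all assumptions made for illustration; the description does not prescribe a particular statistic or threshold.

```python
from statistics import mean
from typing import Dict, List, Tuple

Region = Tuple[float, float]  # (start_seconds, end_seconds), an assumed representation


def prefers_short_viewing(watch_seconds: List[float], limit_seconds: float) -> bool:
    """Assumed habit signal: the average time spent per video is below a limit."""
    return bool(watch_seconds) and mean(watch_seconds) < limit_seconds


def rank_by_earliest_target(candidates: Dict[str, List[Region]]) -> List[str]:
    """Order candidates so that videos whose target region starts earliest come first.

    Candidates without any target region are left out of the ranking.
    """
    with_regions = {vid: regions for vid, regions in candidates.items() if regions}
    return sorted(
        with_regions,
        key=lambda vid: min(start for start, _ in with_regions[vid]),
    )


history = [8.0, 12.0, 10.0]  # seconds the user spent on each past video
candidates = {"v1": [(40.0, 50.0)], "v2": [(3.0, 12.0)], "v3": []}

if prefers_short_viewing(history, limit_seconds=15.0):
    print(rank_by_earliest_target(candidates))  # ['v2', 'v1']
```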

在一些可行的实施方式中，在基于多媒体数据的目标播放时间区域生成播放提示信息时，若播放提示信息与目标播放时间区域相对应的文本信息相关联，可同样基于播放提示信息确定相对应的多媒体数据推荐策略，如向用户推荐与其历史播放的多媒体数据相关的多媒体数据。In some feasible implementations, when generating playback prompt information based on the target playback time range of multimedia data, if the playback prompt information is associated with the text information corresponding to the target playback time range, a corresponding multimedia data recommendation strategy can also be determined based on the playback prompt information, such as recommending multimedia data related to the multimedia data that the user has played in the past.

其中，播放提示信息与目标播放时间区域相对应的文本信息相关联，可可以表现为播放提示信息为目标播放时间区域所对应的帧图像、文本信息等，或者为相对应的帧图像、文本信息所对应的关键词、类别标签等，具体可基于实际应用场景需求确定，在此不做限制。The playback prompt information is associated with the text information corresponding to the target playback time area. It may be the frame image or text information corresponding to the target playback time area, or the keywords or category tags corresponding to the frame image or text information. The specific information can be determined based on the actual application scenario requirements and is not limited here.

For example, for a short video application, if the user's historical short video playback data shows that the user frequently plays short videos related to "football", the short video application may identify target short videos whose playback prompt information is associated with "football" and recommend those target short videos to the user.

Furthermore, after identifying the target short videos whose playback prompt information is associated with "football", the short video application may further select the target short videos best suited to the user based on the target playback time region corresponding to the playback prompt information of each target short video. For example, target short videos whose target playback time region occurs earlier may be selected and recommended to the user first.

再例如，对于音乐应用而言，可通过用户的历史音乐播放数据确定用户经常播放的音乐类型，如“励志音乐”、“情歌”、“英文歌”等。进而音乐应用可基于各音乐对应的播放提示信息，向用户推荐相关类型的音乐，以提升用户吸引力。For example, music apps can determine the types of music a user frequently plays by analyzing their historical music playback data, such as "inspirational music," "love songs," and "English songs." Based on the playback prompts for each song, the app can then recommend related music to the user, thereby increasing user engagement.

可选的，若播放提示信息与目标播放时间区域相对应的文本信息相关联，可同样基于播放提示信息确定相对应的多媒体数据管理策略，如基于各多媒体数据对应的播放提示信息对多媒体数据进行分类，或者基于各多媒体数据的标题信息进行分类后，基于各多媒体数据对应的播放提示信息对每个类别下的多媒体数据进行进一步分类，以及基于播放提示信息确定每个类别中与该类别不相符的多媒体数据等，具体管理策略可基于实际应用场景需求确定在，在此不做限制。Optionally, if the playback prompt information is associated with text information corresponding to the target playback time area, a corresponding multimedia data management strategy can also be determined based on the playback prompt information. For example, multimedia data can be classified based on the playback prompt information corresponding to each multimedia data, or after classifying the multimedia data based on the title information of each multimedia data, the multimedia data under each category can be further classified based on the playback prompt information corresponding to each multimedia data, or multimedia data in each category that does not match the category can be identified based on the playback prompt information. The specific management strategy can be determined based on the actual application scenario requirements and is not limited here.
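As an illustration of such a management strategy, the sketch below groups multimedia items by a category tag derived from their playback prompt information and flags items whose prompt tag disagrees with the category assigned from the title. The function names and the assumption that each item carries exactly one tag are hypothetical simplifications.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_prompt_tag(prompt_tags: Dict[str, str]) -> Dict[str, List[str]]:
    """Group media ids by the category tag of their playback prompt information."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for media_id, tag in prompt_tags.items():
        groups[tag].append(media_id)
    return dict(groups)


def find_mismatches(title_category: Dict[str, str], prompt_tags: Dict[str, str]) -> List[str]:
    """Return media ids whose prompt tag differs from their title-based category."""
    return [
        media_id
        for media_id, category in title_category.items()
        if prompt_tags.get(media_id) != category
    ]


prompt_tags = {"m1": "football", "m2": "cooking", "m3": "football"}
title_category = {"m1": "football", "m2": "football", "m3": "football"}

print(group_by_prompt_tag(prompt_tags))               # {'football': ['m1', 'm3'], 'cooking': ['m2']}
print(find_mismatches(title_category, prompt_tags))   # ['m2']
```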

In the embodiments of this application, representing the multimedia data by at least one piece of text information allows the matching degree between the title information and each piece of text information to be determined accurately in the text dimension. Determining the matching degree from the text similarity between the title information and each piece of text information, or from the number of times each keyword of the title information appears in each piece of text information, provides several ways of measuring how closely the title information and the text information are related. This in turn provides several ways of determining the target playback time region in the multimedia data, which suits different application scenarios. In addition, combining different ways of determining the target playback time region further extends the available determination methods and reduces the risk that a determination fails because only a single method was used. Finally, processing the multimedia data according to the target playback time region lets the user quickly locate the playback time region of the multimedia content related to the title information, and the prompt information saves the user the time of browsing multimedia data that contains no content related to the title information, which makes the solution more attractive to users and highly applicable.

参见图9，图9是本申请实施例提供的多媒体数据的处理装置的结构示意图。本申请实施例提供的处理装置1包括：Referring to Figure 9, Figure 9 is a schematic diagram of the structure of a multimedia data processing apparatus provided in an embodiment of this application. The processing apparatus 1 provided in this embodiment includes:

获取单元11，用于获取多媒体数据中包含的至少一个文本信息，以及上述多媒体数据的标题信息；The acquisition unit 11 is used to acquire at least one text information contained in the multimedia data, as well as the title information of the multimedia data.

确定单元12，用于确定上述标题信息与各上述文本信息的匹配度；Determining unit 12 is used to determine the matching degree between the above title information and each of the above text information;

上述确定单元12，用于根据各上述文本信息对应的匹配度，确定上述多媒体数据中的目标播放时间区域；The aforementioned determining unit 12 is used to determine the target playback time region in the aforementioned multimedia data based on the matching degree corresponding to each of the aforementioned text information.

播放单元13，用于根据上述目标播放时间区域对上述多媒体数据进行处理。The playback unit 13 is used to process the multimedia data according to the target playback time range.

在一些可行的实施方式中，上述确定单元12，用于：In some feasible implementations, the determining unit 12 is used for:

确定上述标题信息与各上述文本信息的文本相似度，将上述文本相似度作为匹配度；Determine the text similarity between the above title information and each of the above text information, and use the above text similarity as the matching score;

确定上述标题信息的各关键词，对于每个上述文本信息，根据该文本信息中出现各上述关键词的次数，确定上述标题信息与该文本信息的匹配度。Identify the keywords in the title information above. For each piece of text information above, determine the matching degree between the title information and the text information based on the number of times each of the above keywords appears in the text information.

确定各上述关键词的权重；Determine the weight of each of the above keywords;

对于每个上述文本信息，根据该文本信息中出现各上述关键词的次数，以及各上述关键词的权重，确定上述标题信息与该文本信息的匹配度。For each of the above text information, the matching degree between the title information and the text information is determined based on the number of times each of the above keywords appears in the text information and the weight of each of the above keywords.

将满足匹配条件的文本信息对应的播放时间区域，确定为上述多媒体数据中的目标播放时间区域；The playback time area corresponding to the text information that meets the matching conditions is determined as the target playback time area in the above multimedia data.

上述匹配条件包括以下任一项：The matching criteria above include any of the following:

匹配度高于匹配度阈值；The matching degree is higher than the matching degree threshold;

The matching degree is the highest matching degree.
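The keyword-based matching degree and the two matching conditions listed above can be sketched as follows. This is a simplified sketch, not the implementation of this application: the function names are hypothetical, keywords are counted by plain substring matching (real Chinese text would normally need word segmentation first), and the matching degree is assumed to be the weighted sum of the keyword counts.

```python
from typing import Dict, List, Optional


def keyword_match_degree(text: str, weights: Dict[str, float]) -> float:
    """Assumed formula: sum over keywords of (occurrences in text) * (keyword weight)."""
    return sum(text.count(keyword) * weight for keyword, weight in weights.items())


def select_matches(degrees: List[float], threshold: Optional[float] = None) -> List[int]:
    """Return the indices of the text information that satisfies the matching condition.

    With a threshold: the matching degree is higher than the threshold.
    Without one: the matching degree is the highest matching degree.
    """
    if not degrees:
        return []
    if threshold is not None:
        return [i for i, degree in enumerate(degrees) if degree > threshold]
    best = max(degrees)
    return [i for i, degree in enumerate(degrees) if degree == best]


weights = {"dinosaur": 2.0, "extinction": 1.0}  # hypothetical keyword weights
texts = [
    "welcome to the show",
    "the dinosaur extinction began",
    "dinosaur fossils and more dinosaur bones",
]
degrees = [keyword_match_degree(text, weights) for text in texts]

print(degrees)                                  # [0.0, 3.0, 4.0]
print(select_matches(degrees, threshold=2.5))   # [1, 2]
print(select_matches(degrees))                  # [2]
```

Setting every weight to 1.0 reduces the sketch to the unweighted variant that uses only the number of keyword occurrences.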

在一些可行的实施方式中，若满足上述匹配条件的文本信息中包括至少两个文本信息，上述确定单元12，用于：In some feasible implementations, if the text information that satisfies the above matching conditions includes at least two text information pieces, the determining unit 12 is used to:

Based on the playback time, in the multimedia data, of each piece of text information that satisfies the matching condition, the playback time region of the first piece of text information satisfying the matching condition is determined as the target playback time region in the multimedia data.
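When several pieces of text information satisfy the matching condition, choosing the first one by playback time amounts to taking the region with the smallest start time. A minimal sketch, with `earliest_region` as a hypothetical helper name and regions assumed to be (start, end) pairs in seconds:

```python
from typing import List, Tuple

Region = Tuple[float, float]  # (start_seconds, end_seconds), an assumed representation


def earliest_region(matched: List[Region]) -> Region:
    """Return the playback time region that starts first among the matched ones."""
    if not matched:
        raise ValueError("no text information satisfied the matching condition")
    return min(matched, key=lambda region: region[0])


matched = [(95.0, 101.0), (31.5, 36.0), (60.0, 64.0)]
print(earliest_region(matched))  # (31.5, 36.0)
```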

若各上述文本信息中不存在包含指定信息的文本信息，则确定上述标题信息与各上述文本信息的文本相似度，若存在满足预设条件的文本相似度，则将各上述文本信息对应的文本相似度作为匹配度；If none of the above text information contains the specified information, then the text similarity between the title information and each of the above text information is determined. If there is a text similarity that meets the preset conditions, then the text similarity corresponding to each of the above text information is used as the matching degree.

若各上述文本信息对应的文本相似度中不存在满足上述预设条件的文本相似度，则根据各上述文本信息中出现各上述关键词的次数，确定上述标题信息与各上述文本信息的匹配度。If there is no text similarity among the text similarities corresponding to each of the above text information that meets the above preset conditions, then the matching degree between the title information and each of the above text information is determined based on the number of times each of the above keywords appears in each of the above text information.

在一些可行的实施方式中，上述确定单元12，还用于：In some feasible implementations, the determining unit 12 is further configured to:

将各上述文本信息中包含指定信息的文本信息对应的播放时间区域，确定为上述多媒体数据中的目标播放时间区域。The playback time range corresponding to the text information containing the specified information in each of the above text information is determined as the target playback time range in the above multimedia data.

在一些可行的实施方式中，上述播放单元13，用于：In some feasible implementations, the playback unit 13 described above is used for:

在接收到上述多媒体数据的播放请求时，根据上述目标播放时间区域生成上述多媒体数据对应的播放提示信息，其中，上述播放提示信息用于提示上述目标播放时间区域；Upon receiving a playback request for the aforementioned multimedia data, playback prompt information corresponding to the aforementioned multimedia data is generated based on the aforementioned target playback time range, wherein the aforementioned playback prompt information is used to indicate the aforementioned target playback time range;

播放上述多媒体数据，并向用户显示上述播放提示信息。Play the aforementioned multimedia data and display the aforementioned playback prompt information to the user.

在一些可行的实施方式中，上述多媒体数据为视频数据；上述获取单元11，用于：In some feasible implementations, the multimedia data is video data; the acquisition unit 11 is used for:

获取上述视频数据中的至少一帧图像的字幕信息，将上述至少一帧图像的字幕信息作为上述视频数据中包含的至少一个文本信息；Obtain the caption information of at least one frame of the above video data, and use the caption information of the at least one frame of the above video data as at least one piece of text information contained in the above video data.

The subtitle information of one frame image is one piece of text information.

在一些可行的实施方式中，上述多媒体数据为音频数据；上述获取单元11，用于：In some feasible implementations, the multimedia data is audio data; the acquisition unit 11 is used for:

对上述音频数据进行语音识别，得到上述音频数据的语音识别结果；Speech recognition is performed on the above audio data to obtain the speech recognition result of the above audio data;

The text content corresponding to at least one sentence in the speech recognition result is used as the at least one piece of text information contained in the audio data.

具体实现中，上述装置1可通过其内置的各个功能模块执行如上述图1中各个步骤所提供的实现方式，具体可参见上述各个步骤所提供的实现方式，在此不再赘述。In practice, the device 1 can execute the implementation methods provided by the steps in Figure 1 above through its built-in functional modules. For details, please refer to the implementation methods provided by the steps above, which will not be repeated here.

Referring to Figure 10, Figure 10 is a schematic structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 10, the electronic device 1000 in this embodiment may include a processor 1001, a network interface 1004, and a memory 1005. In addition, the electronic device 1000 may include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk storage device. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Figure 10, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.

在图10所示的电子设备1000中，网络接口1004可提供网络通讯功能；而用户接口1003主要用于为用户提供输入的接口；而处理器1001可以用于调用存储器1005中存储的设备控制应用程序，以实现：In the electronic device 1000 shown in Figure 10, the network interface 1004 provides network communication functionality; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application stored in the memory 1005 to achieve:

在一些可行的实施方式中，上述处理器1001用于：In some feasible implementations, the processor 1001 described above is used for:

The matching degree is the highest matching degree.

在一些可行的实施方式中，若满足上述匹配条件的文本信息中包括至少两个文本信息，上述处理器1001用于：In some feasible implementations, if the text information that satisfies the above matching conditions includes at least two text information pieces, the processor 1001 is used to:

在一些可行的实施方式中，上述处理器1001还用于：In some feasible implementations, the processor 1001 is further configured to:

在一些可行的实施方式中，上述多媒体数据为视频数据；上述处理器1001用于：In some feasible implementations, the multimedia data is video data; the processor 1001 is used for:

在一些可行的实施方式中，上述多媒体数据为音频数据；上述处理器1001用于：In some feasible implementations, the multimedia data is audio data; the processor 1001 is used for:

It should be understood that, in some feasible implementations, the processor 1001 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

具体实现中，上述电子设备1000可通过其内置的各个功能模块执行如上述图1中各个步骤所提供的实现方式，具体可参见上述各个步骤所提供的实现方式，在此不再赘述。In practice, the aforementioned electronic device 1000 can execute the implementation methods provided in the steps shown in Figure 1 above through its built-in functional modules. For details, please refer to the implementation methods provided in the steps above, which will not be repeated here.

An embodiment of this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method provided by the steps in Figure 1 above. For details, refer to the implementations provided for those steps, which are not repeated here.

The computer-readable storage medium may be an internal storage unit of the apparatus or device provided in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device. The computer-readable storage medium may also include a magnetic disk, an optical disk, read-only memory (ROM), or random access memory (RAM). Furthermore, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. It may also be used to temporarily store data that has been output or is to be output.

An embodiment of this application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the method provided by the steps in Figure 1.

本申请的权利要求书和说明书及附图中的术语“第一”、“第二”等是用于区别不同对象，而不是用于描述特定顺序。此外，术语“包括”和“具有”以及它们任何变形，意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或电子设备没有限定于已列出的步骤或单元，而是可选地还包括没有列出的步骤或单元，或可选地还包括对于这些过程、方法、产品或电子设备固有的其它步骤或单元。在本文中提及“实施例”意味着，结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置展示该短语并不一定均是指相同的实施例，也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是，本文所描述的实施例可以与其它实施例相结合。在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合，并且包括这些组合。The terms "first," "second," etc., used in the claims, description, and drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or electronic device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or electronic devices. References to "embodiment" herein mean that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The presentation of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this application's description and appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Those skilled in the art can implement the described functions using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.

以上所揭露的仅为本申请较佳实施例而已，不能以此来限定本申请之权利范围，因此依本申请权利要求所作的等同变化，仍属本申请所涵盖的范围。The above-disclosed embodiments are merely preferred embodiments of this application and should not be construed as limiting the scope of this application. Therefore, any equivalent variations made in accordance with the claims of this application shall still fall within the scope of this application.

Claims

1. A method for processing multimedia data, characterized in that the method comprises:

obtaining at least two pieces of text information contained in the multimedia data, and title information of the multimedia data;

determining a matching degree between the title information and each piece of text information;

if the matching degree between the title information and first text information is higher than a matching degree threshold, and the time distance between the playback time region corresponding to the first text information and the playback time region corresponding to second text information is not less than a time distance threshold, determining the playback time region corresponding to the first text information as a target playback time region in the multimedia data; wherein the first text information and the second text information are two different pieces of text information among the at least two pieces of text information, the playback time region corresponding to the second text information precedes the playback time region corresponding to the first text information, and the playback time region corresponding to the second text information is a target playback time region in the multimedia data;

upon receiving a playback request for the multimedia data, generating playback prompt information corresponding to the multimedia data based on the target playback time region, wherein the playback prompt information includes the frame image corresponding to the target playback time region; and

playing the multimedia data and displaying the playback prompt information to the user.
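For illustration only, the selection rule recited in claim 1 can be sketched in Python. The names `TextInfo` and `select_target_regions` and the threshold values are hypothetical and do not come from the application. The sketch assumes that the time distance is measured from the end of the most recently selected target region to the start of the candidate region, and that the first qualifying item is accepted when no earlier target exists; the claim does not specify either point.

```python
from dataclasses import dataclass


@dataclass
class TextInfo:
    """One piece of text information and its playback time region (seconds)."""
    text: str
    start: float
    end: float
    match: float  # matching degree against the title, assumed precomputed


def select_target_regions(items, match_threshold, min_gap):
    """Return the (start, end) regions chosen as target playback time regions.

    An item qualifies when its matching degree is higher than match_threshold
    and its region starts at least min_gap seconds after the end of the
    previously selected target region (assumed definition of time distance).
    """
    targets = []
    for item in sorted(items, key=lambda i: i.start):
        if item.match <= match_threshold:
            continue  # matching degree is not higher than the threshold
        if targets and item.start - targets[-1][1] < min_gap:
            continue  # too close to the preceding target region
        targets.append((item.start, item.end))
    return targets


# Hypothetical data: four caption-like items with precomputed matching degrees.
items = [
    TextInfo("opening remarks", 0.0, 4.0, 0.20),
    TextInfo("the goal of the match", 10.0, 14.0, 0.90),
    TextInfo("replay of the goal", 15.0, 18.0, 0.85),
    TextInfo("second goal", 40.0, 45.0, 0.80),
]
regions = select_target_regions(items, match_threshold=0.6, min_gap=10.0)
print(regions)
```

In this sketch the third item is skipped even though it matches the title well, because it starts only one second after the preceding target region ends; the time distance threshold keeps the prompts from clustering.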

2. The method according to claim 1, wherein determining the matching degree between the title information and each of the text information comprises:

if none of the pieces of text information contains specified information, determining a text similarity between the title information and each piece of text information, and, if any of the text similarities meets a preset condition, using the text similarity corresponding to each piece of text information as its matching degree; and

if none of the text similarities corresponding to the pieces of text information meets the preset condition, determining the matching degree between the title information and each piece of text information based on the number of times each keyword of the title information appears in that piece of text information.
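The two-stage fallback in claim 2 might look like the following Python sketch. Several choices here are assumptions rather than anything the claim states: `difflib.SequenceMatcher.ratio()` stands in for the text similarity measure, the preset condition is taken to be "similarity is at least `min_similarity`", and the title keywords are taken to be its whitespace-separated words. The function name `matching_degrees` is hypothetical, and the check for specified information is assumed to have been done by the caller.

```python
from difflib import SequenceMatcher


def matching_degrees(title, texts, min_similarity=0.5):
    """Return one matching degree per text, using the two-stage fallback.

    Stage 1: text similarity (difflib ratio, an assumed stand-in measure).
    Stage 2: if no similarity satisfies the assumed preset condition
    (similarity >= min_similarity), count occurrences of each title keyword.
    """
    similarities = [
        SequenceMatcher(None, title.lower(), text.lower()).ratio()
        for text in texts
    ]
    if any(s >= min_similarity for s in similarities):
        return similarities

    # Keywords are assumed to be the whitespace-separated words of the title.
    keywords = title.lower().split()
    return [
        float(sum(text.lower().split().count(k) for k in keywords))
        for text in texts
    ]


# Stage 1 applies: the first text is close to the title.
by_similarity = matching_degrees(
    "cat video", ["cat video highlights", "weather report"]
)

# Stage 2 forced for demonstration by an unreachable similarity requirement.
by_keywords = matching_degrees(
    "goal replay",
    ["nothing relevant here", "a goal and another goal"],
    min_similarity=1.1,
)
print(by_similarity, by_keywords)
```

The keyword stage returns raw counts, so its values are not on the same scale as the similarity stage; a matching degree threshold would need to be chosen per stage, which the claim leaves open.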

3. The method according to claim 1 or 2, characterized in that the method further comprises:

determining, as a target playback time region in the multimedia data, the playback time region corresponding to any piece of text information, among the pieces of text information, that contains the specified information.

4. The method according to claim 1, wherein the multimedia data is video data, and obtaining the at least two pieces of text information contained in the multimedia data comprises:

obtaining caption information of at least two frame images of the video data, and using the caption information of the at least two frame images as the at least two pieces of text information contained in the video data;

wherein the caption information of one frame image is one piece of text information.
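As a rough sketch of claim 4, per-frame captions can be turned into pieces of text information with playback time regions. The function name `captions_to_text_info`, the `(frame_index, caption)` input format, and the choice to give each frame a region lasting one frame interval are all assumptions; the caption extraction step itself (for example, OCR on the frame image) is outside this sketch.

```python
def captions_to_text_info(frame_captions, fps):
    """Map per-frame captions to (text, start_seconds, end_seconds) tuples.

    frame_captions: list of (frame_index, caption) pairs, assumed to come from
    a caption extraction step that is outside this sketch. Frames without a
    caption are skipped. Each remaining frame yields one piece of text
    information whose region lasts one frame interval (an assumption).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    infos = []
    for frame_index, caption in frame_captions:
        caption = caption.strip()
        if not caption:
            continue  # no caption on this frame
        start = frame_index / fps
        infos.append((caption, start, start + 1.0 / fps))
    return infos


# Hypothetical input: frames sampled at 4 frames per second.
infos = captions_to_text_info([(0, "hello"), (4, "  "), (8, "goal")], fps=4)
print(infos)
```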

5. The method according to claim 1, wherein the multimedia data is audio data, and obtaining the at least two pieces of text information contained in the multimedia data comprises:

performing speech recognition on the audio data to obtain a speech recognition result of the audio data; and

using the text content corresponding to at least two sentences in the speech recognition result as the at least two pieces of text information contained in the audio data.
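The sentence-splitting step of claim 5 could be sketched as follows, assuming the speech recognizer returns word-level tokens with timestamps and punctuation. That output format, the sentence-ending rule, and the names `sentences_from_words` and `_join` are hypothetical; the recognizer itself is outside this sketch.

```python
def _join(chunk):
    """Merge (token, start, end) tuples into one (text, start, end) sentence."""
    text = " ".join(token for token, _, _ in chunk)
    return (text, chunk[0][1], chunk[-1][2])


def sentences_from_words(words):
    """Group recognized words into sentences with playback time regions.

    words: list of (token, start_seconds, end_seconds), the assumed output
    format of a speech recognizer. A token ending in '.', '!' or '?' is
    assumed to close a sentence.
    """
    sentences = []
    current = []
    for token, start, end in words:
        current.append((token, start, end))
        if token.endswith((".", "!", "?")):
            sentences.append(_join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(_join(current))
    return sentences


# Hypothetical recognizer output for a short audio clip.
words = [
    ("Welcome", 0.0, 0.5), ("back.", 0.5, 1.0),
    ("Here", 1.5, 1.8), ("is", 1.8, 2.0), ("the", 2.0, 2.1), ("goal!", 2.1, 2.6),
]
sentences = sentences_from_words(words)
print(sentences)
```

Each resulting tuple carries the sentence text together with the start of its first word and the end of its last word, which serves as the playback time region for that piece of text information.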

6. A multimedia data processing apparatus, characterized in that the processing apparatus comprises:

an acquisition unit, configured to obtain at least two pieces of text information contained in the multimedia data, and title information of the multimedia data;

a determining unit, configured to determine a matching degree between the title information and each piece of text information;

the determining unit being further configured to determine the playback time region corresponding to first text information as a target playback time region in the multimedia data if the matching degree between the title information and the first text information is higher than a matching degree threshold, and the time distance between the playback time region corresponding to the first text information and the playback time region corresponding to second text information is not less than a time distance threshold; wherein the first text information and the second text information are two different pieces of text information among the at least two pieces of text information, the playback time region corresponding to the second text information precedes the playback time region corresponding to the first text information, and the playback time region corresponding to the second text information is a target playback time region in the multimedia data; and

a playback unit, configured to generate playback prompt information corresponding to the multimedia data based on the target playback time region when a playback request for the multimedia data is received, wherein the playback prompt information includes the frame image corresponding to the target playback time region.

7. An electronic device, characterized in that it comprises a processor and a memory, the processor and the memory being interconnected;

the memory is configured to store a computer program; and

the processor is configured to perform the method according to any one of claims 1 to 5 when invoking the computer program.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being executed by a processor to implement the method of any one of claims 1 to 5.