HK40035387B - Method and device for determining text label, terminal and readable storage medium - Google Patents

Method and device for determining text label, terminal and readable storage medium

Info

Publication number
HK40035387B
Authority
HK
Hong Kong
Prior art keywords
probability
word
tag
segmented
target text
Prior art date
Application number
HK42021025295.3A
Other languages
Chinese (zh)
Other versions
HK40035387A (en)
Inventor
刘刚
Original Assignee
腾讯科技(深圳)有限公司
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of HK40035387A
Publication of HK40035387B

Description

文本标签确定方法、装置、终端及可读存储介质Text tag determination method, device, terminal and readable storage medium

技术领域Technical Field

本申请涉及标签挖掘领域,特别涉及一种文本标签确定方法、装置、终端及可读存储介质。This application relates to the field of tag mining, and in particular to a method, apparatus, terminal and readable storage medium for determining text tags.

背景技术Background Technology

标签被定义为能够代表内容的最重要的关键词,在信息流内容分发过程当中,无论是图文还是视频,标签信息都非常重要,当内容有了标签后,就可以按照不同标签组织和展示内容,也可以根据标签和用户画像进行匹配实现更精准的内容推荐。Tags are defined as the most important keywords that can represent content. In the process of information flow content distribution, whether it is text and images or videos, tag information is very important. Once content has tags, it can be organized and displayed according to different tags, and more accurate content recommendations can be achieved by matching tags with user profiles.

在相关技术中,文章内容的标签提取方法包括基于TF-IDF统计特征来确定当前内容的标签,该方法倾向于过滤文章中的常见词语,保留重要词语。In related technologies, methods for extracting tags from article content include determining tags for the current content based on TF-IDF statistical features. This method tends to filter out common words in the article and retain important words.

但该基于统计的方法未考虑文章中词语与词语之间、词语与文档之间的关系,所获取到的标签与内容表达的实际语义存在偏差,获取的标签准确度不高。However, this statistical method does not consider the relationships between words in the article or between words and the document. As a result, the obtained tags deviate from the actual semantics of the content, and the accuracy of the obtained tags is not high.

发明内容Summary of the Invention

本申请提供了一种文本标签确定方法、装置、终端及可读存储介质,能够提高标签确定的准确度。所述技术方案如下:This application provides a text tag determination method, apparatus, terminal, and readable storage medium, which can improve the accuracy of tag determination. The technical solution is as follows:

一方面,提供了一种文本标签确定方法,所述方法包括:In one aspect, a method for determining a text label is provided, the method comprising:

对目标文本进行分词处理,得到分词集合,所述分词集合中包括所述目标文本分词得到的分词词汇,所述目标文本为待确定标签的文本;The target text is segmented to obtain a segmentation set, which includes the segmented words obtained from the segmentation of the target text. The target text is the text whose label is to be determined.

根据所述分词词汇的上下文关系,确定所述目标文本的第一候选标签;Based on the contextual relationships of the segmented words, determine the first candidate tag for the target text;

根据所述分词词汇在所述目标文本中的第一频率参数,和所述分词词汇在文本集合中的第二频率参数,确定所述目标文本的第二候选标签;Based on the first frequency parameter of the segmented words in the target text and the second frequency parameter of the segmented words in the text set, a second candidate tag for the target text is determined;

根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签。The tags of the target text are determined based on the first candidate tag and the second candidate tag.

另一方面,提供了一种文本标签确定装置,所述装置包括:In another aspect, a text label determining device is provided, the device comprising:

处理模块,用于对目标文本进行分词处理,得到分词集合,所述分词集合中包括所述目标文本分词得到的分词词汇,所述目标文本为待确定标签的文本;The processing module is used to perform word segmentation on the target text to obtain a word segmentation set, wherein the word segmentation set includes the word segments obtained from the word segmentation of the target text, and the target text is the text whose label is to be determined;

确定模块,用于根据所述分词词汇的上下文关系,确定所述目标文本的第一候选标签;The determination module is used to determine the first candidate tag of the target text based on the contextual relationship of the segmented words;

所述确定模块,还用于根据所述分词词汇在所述目标文本中的第一频率参数,和所述分词词汇在文本集合中的第二频率参数,确定所述目标文本的第二候选标签;The determining module is further configured to determine a second candidate tag for the target text based on a first frequency parameter of the segmented words in the target text and a second frequency parameter of the segmented words in the text set;

所述确定模块,还用于根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签。The determining module is further configured to determine the tag of the target text based on the first candidate tag and the second candidate tag.

另一方面,提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上述本申请实施例中任一所述的文本标签确定方法。In another aspect, a computer device is provided, the computer device including a processor and a memory, the memory storing at least one instruction, at least one program, code set or instruction set, the at least one instruction, the at least one program, the code set or instruction set being loaded and executed by the processor to implement the text label determination method as described in any of the embodiments of this application above.

另一方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上述本申请实施例中任一所述的文本标签确定方法。In another aspect, a computer-readable storage medium is provided, wherein at least one instruction, at least one program, code set, or instruction set is stored therein, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text label determination method as described in any of the embodiments of this application above.

另一方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例中任一所述的文本标签确定方法。In another aspect, a computer program product or computer program is provided, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform any of the text label determination methods described in the above embodiments.

本申请实施例提供的技术方案带来的有益效果至少包括:The beneficial effects of the technical solutions provided in this application include at least the following:

在通过对目标文本的每个分词词汇的上下文关系,及每个分词词汇在目标文本中的第一频率参数、在文本集合中的第二频率参数共同确定目标文本的标签,从深层语义和浅层频率两方面共同确定目标文本的标签,提高了目标文本标签确定的准确度。By combining the contextual relationships of each word segment in the target text with the first frequency parameter of each word segment in the target text and the second frequency parameter of each word segment in the text set, the label of the target text is determined from both deep semantic and shallow frequency perspectives, thus improving the accuracy of target text label determination.

附图说明Brief Description of the Drawings

为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

图1是本申请一个示例性实施例提供的实施环境示意图;Figure 1 is a schematic diagram of the implementation environment provided by an exemplary embodiment of this application;

图2是本申请一个示例性实施例提供的文本标签确定方法的流程图;Figure 2 is a flowchart of a text label determination method provided in an exemplary embodiment of this application;

图3是本申请另一个示例性实施例提供的文本标签确定方法的流程图;Figure 3 is a flowchart of a text label determination method provided in another exemplary embodiment of this application;

图4是本申请一个示例性实施例提供的基于双向长短时记忆神经网络的文本标签确定模型;Figure 4 is a text label determination model based on a bidirectional long short-term memory neural network provided in an exemplary embodiment of this application;

图5是本申请一个示例性实施例提供的注意力计算机制示意图;Figure 5 is a schematic diagram of an attention calculation mechanism provided in an exemplary embodiment of this application;

图6是本申请一个示例性实施例提供的注意力计算流程图;Figure 6 is a flowchart of attention calculation provided in an exemplary embodiment of this application;

图7是本申请一个示例性实施例提供的条件随机场确定标签的方法流程图;Figure 7 is a flowchart of a method for determining labels using a conditional random field according to an exemplary embodiment of this application;

图8是本申请一个示例性实施例提供的文本标签确定方法的系统流程示意图;Figure 8 is a schematic diagram of the system flow of a text label determination method provided in an exemplary embodiment of this application;

图9是本申请一个示例性实施例提供的标签确定的系统流程示意图;Figure 9 is a schematic diagram of the system flow for label determination provided in an exemplary embodiment of this application;

图10是本申请一个示例性实施例提供的文本标签确定装置的结构框图;Figure 10 is a structural block diagram of a text label determining device provided in an exemplary embodiment of this application;

图11是本申请另一个示例性实施例提供的文本标签确定装置的结构框图;Figure 11 is a structural block diagram of a text label determining device provided in another exemplary embodiment of this application;

图12是本申请一个示例性实施例提供的服务器的结构框图。Figure 12 is a structural block diagram of a server provided in an exemplary embodiment of this application.

具体实施方式Detailed Description of Embodiments

为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

首先,针对本申请实施例中涉及的名词进行简单介绍:First, a brief introduction to the terms used in the embodiments of this application:

人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning/deep learning.

计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(Optical Character Recognition,OCR)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、三维技术、虚拟现实、增强现实、同步定位与地图构建等技术。Computer vision (CV) is a science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing and measuring targets, and further processes images to create images more suitable for human observation or transmission to instruments. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping (SLAM).

语音技术(Speech Technology)的关键技术有自动语音识别技术(Automatic Speech Recognition,ASR)和语音合成技术(Text To Speech,TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。Key technologies in speech technology include Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and voiceprint recognition. Enabling computers to hear, see, speak, and feel is the future direction of human-computer interaction, with speech emerging as one of the most promising methods.

自然语言处理(Nature Language Processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。Natural Language Processing (NLP) is an important field within computer science and artificial intelligence. It studies the theories and methods for enabling effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field involves natural language—the language people use in daily life—and thus it has a close relationship with linguistic research. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.

机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技术。Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instruction-based learning.

词向量(word embedding),是嵌入式自然语言处理中的一组语言建模和特征学习技术的统称,其中来自词汇表的单词或短语被映射到实数的向量。从概念上讲,它涉及从每个单词一维的空间到具有更低维度的连续向量空间的数学嵌入。生成这种映射的方法包括神经网络,单词共生矩阵的降维,概率模型,可解释的知识库方法,和术语的显式表示单词出现的背景。当用作底层输入表示时,单词和短语嵌入已经被证明可以提高自然语言处理任务的性能,例如语法分析和情感分析。Word embeddings are a collective term for a set of language modeling and feature learning techniques in natural language processing, where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge base methods, and explicit representations of the context in which words occur. When used as underlying input representations, word and phrase embeddings have been shown to improve the performance of natural language processing tasks such as parsing and sentiment analysis.

标签,在推荐系统中,标签被定义为能够代表文章语义的最重要的关键词,并且适合用于用户画像和内容的匹配项,相比于分类和主题,是更细粒度的语义。存在于推荐系统各个环节,内容画像维度、用户画像维度、召回模型特征、排序模型特征、多样性打散等。In recommendation systems, tags are defined as the most important keywords that represent the semantics of an article. They are suitable for matching user profiles and content, offering a more granular semantic level compared to categories and topics. Tags exist in various stages of the recommendation system, including content profile dimensions, user profile dimensions, recall model features, ranking model features, and diversity dispersal.

本申请实施例提供的方案涉及人工智能的自然语言处理和机器学习等技术,其中,本申请提供的文本标签确定方法可以应用于如下场景中的至少一种:The solutions provided in this application involve technologies such as natural language processing and machine learning in artificial intelligence. The text label determination method provided in this application can be applied to at least one of the following scenarios:

第一,该文本标签确定方法应用于文章阅读平台的推荐系统中,该文章包括新闻文章、公众号文章、个人原创文章、书籍内容等。服务器接收用户或合作方通过终端上传的文章内容,推荐系统对内容进行标签挖掘,根据标签将内容分发至相关用户。例如,一篇报道最近房价的变化趋势和政府实施新的相关政策的新闻,服务器的推荐系统将标签确定为“房价、政策”,将其推送至最近关注过房价变化趋势的用户对应应用程序的首页或其他页面,增加该新闻被点击的可能性。First, this text tagging method is applied to the recommendation system of an article reading platform. The articles include news articles, WeChat official account articles, original personal articles, and book content. The server receives article content uploaded by users or partners via their terminals. The recommendation system performs tag mining on the content and distributes it to relevant users based on the tags. For example, a news article reporting recent housing price trends and new government policies might be tagged as "housing prices, policies" by the server's recommendation system. This would then be pushed to the homepage or other pages of the applications of users who have recently followed housing price trends, increasing the likelihood of the news being clicked.

第二,该文本标签确定方法应用于社交平台的推荐系统中,该社交平台包括博客、微型博客、社交软件的公共互动模块。服务器接收用户通过终端上传的内容,推荐系统对内容进行标签挖掘,根据标签将内容进行分类并定向推送。在一个示例中,在一个社交应用程序的“广场”上,用户可以以文字、图片、视频等形式发布动态内容,服务器的推荐系统对其标签进行挖掘,例如发布内容为“今天到北京旅游,逛了故宫和鸟巢,北京烤鸭真好吃”,推荐系统将标签确定为“北京、旅游”,并将其推荐给定位为北京的用户或推荐给也曾发布过“旅游”相关动态的用户,得以推荐更精准垂直的目标用户。Second, this text tag determination method is applied to the recommendation system of a social platform, which includes blogs, microblogs, and public interaction modules of social software. The server receives content uploaded by users through their terminals, and the recommendation system performs tag mining on the content, classifying and targeting the content based on the tags. In one example, on the "square" of a social application, users can post dynamic content in the form of text, pictures, videos, etc. The server's recommendation system mines the tags. For example, if the content is posted as "I went to Beijing for tourism today, visited the Forbidden City and the Bird's Nest, and Peking duck was delicious," the recommendation system will determine the tags as "Beijing, tourism" and recommend it to users located in Beijing or to users who have also posted "tourism" related content, thus recommending more precise and vertical target users.

第三,该文本标签确定方法应用于视频平台的推荐系统中,该视频平台可以是提供普通视频、短视频等视频内容。服务器接收用户通过终端上传内容或供应商提供的视频内容,对视频内容进行语音转文字处理或光学字符识别字幕获取视频对应的文本,针对文本进行标签挖掘,并根据用户画像进行推荐。例如,从视频中提取到的文字内容为“苹果公司即将上线iphone12,手机支持……”,推荐系统对标签进行确定为“苹果、手机”,并针对苹果和手机的关系,确定该视频属于电子产品领域,对该视频进行推荐,将其推荐给近期关注过手机买卖的用户,并不会将其推荐给常关注农业领域的用户。Third, this text tagging method is applied to the recommendation system of a video platform, which can provide video content such as regular videos and short videos. The server receives content uploaded by users through their terminals or video content provided by suppliers. It performs speech-to-text processing or optical character recognition (OCR) to obtain the corresponding text, mines tags from the text, and makes recommendations based on user profiles. For example, if the text extracted from a video is "Apple is about to launch the iPhone 12, and the phone supports…", the recommendation system identifies the tags as "Apple, phone". Based on the relationship between Apple and phones, it determines that the video belongs to the electronics product field and recommends it to users who have recently followed mobile phone sales, but not to users who frequently follow the agricultural sector.

值得注意的是,上述应用场景仅为示意性的举例,本申请提供的文本标签确定可以应用于其他标签确定的应用场景中。It is worth noting that the above application scenarios are merely illustrative examples, and the text label determination provided in this application can be applied to other label determination application scenarios.

结合上述名词简介和应用场景的介绍,对本申请实施例的实施环境进行说明。Based on the above introduction of terms and application scenarios, the implementation environment of the embodiments of this application will be described.

示意性的,请参考图1,以该文本标签确定方法应用于服务器中为例进行说明,该文本标签确定方法的实施环境包括:终端110、服务器120以及通信网络130;For illustration, please refer to Figure 1. Taking the application of this text label determination method in a server as an example, the implementation environment of this text label determination method includes: terminal 110, server 120 and communication network 130.

终端110中安装有相关的应用程序,该相关的应用程序111可以是文章阅读平台对应的应用程序、社交平台对应的应用程序或视频平台对应的应用程序,用户可以通过终端110上传想要发布的内容,也可以通过终端110接收从服务器120获取的内容,也即,终端110是内容生产端也是内容消费端。终端110与服务器120之间通过通信网络130进行数据传输。Terminal 110 has related applications installed. These applications 111 can be applications for article reading platforms, social media platforms, or video platforms. Users can upload and publish content through terminal 110, and also receive content from server 120 through terminal 110. That is, terminal 110 is both a content producer and a content consumer. Data transmission between terminal 110 and server 120 is conducted through communication network 130.

服务器120中包括内容推荐模块140和存储模块150,其中,内容推荐模块140包括机器处理模块141、内容调度模块142、人工审核模块143、排重模块144、接口模块145。The server 120 includes a content recommendation module 140 and a storage module 150. The content recommendation module 140 includes a machine processing module 141, a content scheduling module 142, a manual review module 143, a deduplication module 144, and an interface module 145.

内容推荐模块140用于抽取内容有效的实体信息作为标签,将内容标签与用户画像进行匹配,实现更有针对性的内容分发和推荐。The content recommendation module 140 is used to extract valid entity information from the content as tags, and match the content tags with user profiles to achieve more targeted content distribution and recommendation.

其中,机器处理模块141包括标签确定模块和标签确定模型,用于对输入的内容进行标签确定,并将处理结果发送至内容调度模块142。The machine processing module 141 includes a label determination module and a label determination model, which are used to determine the labels of the input content and send the processing results to the content scheduling module 142.

内容调度模块142用于负责内容流转的整个调度过程,接收接口模块145的内容,从内容数据库模块153中获取内容的元信息;内容调度模块142还用于与内容存储模块151交互,从内容存储模块151存储内容或读取内容;内容调度模块142还用于调度人工审核模块143和机器处理模块141,控制调度的顺序和优先级;内容调度模块142还用于获取通过人工审核模块143审核的内容,通过接口模块145中的出口单元分发至终端110。The content scheduling module 142 is responsible for the entire scheduling process of content flow. It receives content from the interface module 145 and obtains the content's metadata from the content database module 153. The content scheduling module 142 is also used to interact with the content storage module 151 to store or retrieve content. The content scheduling module 142 is also used to schedule the manual review module 143 and the machine processing module 141, controlling the scheduling order and priority. The content scheduling module 142 is also used to obtain content that has passed the manual review module 143 and distribute it to the terminal 110 through the exit unit in the interface module 145.

人工审核模块143是人工服务能力的载体,用于审核过滤色情、法律不允许等机器无法确定判断的内容。The manual review module 143 is the carrier of human service capabilities, used to review and filter content that is pornographic, illegal, or otherwise cannot be determined by machines.

排重模块144用于和内容调度模块142通讯,用于标题去重,封面图的图片去重,内容正文去重及视频指纹和音频指纹去重。在一个示例中,将图文内容标题和正文向量化,对图片向量去重,对于视频内容抽取视频指纹和音频指纹构建向量,然后计算向量之间的距离,如欧式距离来确定是否重复。The deduplication module 144 communicates with the content scheduling module 142 for deduplication of titles, cover images, body text, video fingerprints, and audio fingerprints. In one example, the title and body text of the image and text content are vectorized, image vectors are deduplicated, and video and audio fingerprints are extracted to construct vectors for video content. Then, the distance between the vectors, such as Euclidean distance, is calculated to determine whether there is a duplicate.
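
As an illustration of the distance-based deduplication just described, the following is a minimal sketch, not the module's actual implementation; the embedding vectors and the threshold value are assumptions made for the example.

```python
# Hedged sketch: treat two pieces of content as duplicates when the Euclidean
# distance between their vector representations falls below a threshold.
import numpy as np

def is_duplicate(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Return True when the Euclidean distance between two content vectors is small."""
    distance = np.linalg.norm(vec_a - vec_b)
    return distance < threshold

# Toy title/body embeddings (hypothetical 4-dimensional vectors).
doc1 = np.array([0.12, 0.80, 0.33, 0.05])
doc2 = np.array([0.11, 0.79, 0.35, 0.04])
print(is_duplicate(doc1, doc2))  # True: the two vectors are nearly identical
```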

接口模块145包括入口单元和出口单元,入口单元用于接收终端110的内容输入,出口单元用于向终端110输出内容。The interface module 145 includes an input unit and an output unit. The input unit is used to receive content input from the terminal 110, and the output unit is used to output content to the terminal 110.

存储模块150中包括内容数据库模块153、语料库152、内容存储模块151、爬虫模块154。The storage module 150 includes a content database module 153, a corpus 152, a content storage module 151, and a crawler module 154.

其中,内容数据库模块153用于存储内容的元信息,上述元信息包括文件大小数据、封面图链接信息、码率数据、文件格式、标题信息、发布时间数据、作者信息、视频文件大小数据、视频格式信息中的至少一个数据信息;内容数据库模块153还用于存储人工审核模块143对内容的审核结果和审核状态;内容数据库模块153还用于存储通过内容调度模块142传输的机器处理模块141、排重模块144的处理结果。The content database module 153 is used to store the metadata of the content, including at least one of the following: file size data, cover image link information, bitrate data, file format, title information, publication time data, author information, video file size data, and video format information. The content database module 153 is also used to store the review results and review status of the content by the manual review module 143. The content database module 153 is also used to store the processing results of the machine processing module 141 and the deduplication module 144 transmitted through the content scheduling module 142.

内容存储模块151用于存储内容的元信息之外的内容实体信息,例如,视频源文件和图文内容的图片源文件;内容存储模块151还用于在视频内容标签确定的时候,为机器处理模块141提供视频源文件,包括源文件中间的抽帧内容、视频文件的语音转文字和光学字符识别提供原始素材的输入。The content storage module 151 is used to store content entity information other than the metadata of the content, such as video source files and image source files of text and image content; the content storage module 151 is also used to provide video source files to the machine processing module 141 when the video content tags are determined, including the frame content extracted from the middle of the source file, the speech-to-text conversion of the video file and the input of the original material for optical character recognition.

爬虫模块154用于通过互联网获取外部领域的语料,作为领域语料信息,存储在语料库152中,语料库152中存储有分词序列与词向量的对应关系,可选的,词向量通过用来产生词向量的相关模型(word to vector,word2vec)从大规模语料中无监督学习得到。The crawler module 154 is used to obtain external domain corpus via the Internet and store it in the corpus 152 as domain corpus information. The corpus 152 stores the correspondence between word segmentation sequences and word vectors. Optionally, the word vectors are obtained from large-scale corpus through unsupervised learning using a related model (word to vector, word2vec) used to generate word vectors.

图2是本申请一个示例性实施例提供的文本标签确定方法的流程图,本申请实施例中,以该方法应用在服务器中为例进行说明,请参考图2,该方法包括:Figure 2 is a flowchart of a text label determination method provided in an exemplary embodiment of this application. In this embodiment, the method is described using an application in a server as an example. Referring to Figure 2, the method includes:

步骤201,对目标文本进行分词处理,得到分词集合。Step 201: Perform word segmentation on the target text to obtain a word segmentation set.

在本申请实施例中,服务器获取终端上传的文字、图片或视频资源。对于文字资源,直接生成目标文本。对于图片资源,服务器通过光学字符识别获取图片资源中的文字信息,生成目标文本。对于视频资源,服务器通过提取视频中的音频,对音频进行转文字处理,生成目标文本;或,抽取视频帧,对视频帧进行光学字符识别,生成目标文本。该目标文本为待确定标签的文本。In this embodiment, the server acquires text, image, or video resources uploaded by the terminal. For text resources, the target text is generated directly. For image resources, the server obtains the text information in the image resource through optical character recognition (OCR) and generates the target text. For video resources, the server extracts the audio from the video, performs audio-to-text conversion, and generates the target text; or, it extracts video frames, performs OCR on the video frames, and generates the target text. The target text is the text whose tags are to be determined.

上述分词集合中包括对目标文本分词得到的分词词汇,也即,分词集合是通过对一段文本或一句话分割成几个单独的词汇得到的集合,例如“从吴晓波到罗振宇,知识付费IP有哪些脆弱点?”进行分词得到“从”、“吴晓波”、“到”、“罗振宇”、“,”、“知识付费”、“IP”、“有”、“哪些”、“脆弱点”、“?”11个分词词汇,这11个分词词汇组成分词集合。The aforementioned word segmentation set includes word segments obtained from segmenting the target text. That is, the word segmentation set is a set obtained by dividing a text or sentence into several individual words. For example, the word segmentation of "From Wu Xiaobo to Luo Zhenyu, what are the vulnerabilities of knowledge payment IPs?" yields 11 word segments: "from", "Wu Xiaobo", "to", "Luo Zhenyu", ",", "knowledge payment", "IP", "have", "which", "vulnerabilities", and "?". These 11 word segments constitute the word segmentation set.

在分词过程中,存在多种分词方式,在相同字数的情况下,总词数越少,说明语义单元越少,相对于单个语义单元的权重越大,准确度越高,示意性的“知识付费”可以分词为“知识”和“付费”两个分词或“知识付费”一个分词,而在“从吴晓波到罗振宇,知识付费IP有哪些脆弱点?”的语义理解中,“知识付费”作为一个分词来作为标签提取,标签的指向性会更准确。In the process of word segmentation, there are multiple segmentation methods. With the same number of characters, the fewer the total number of words, the fewer the semantic units, the greater the weight relative to a single semantic unit, and the higher the accuracy. For example, "knowledge payment" can be segmented into two words, "knowledge" and "payment," or into one word, "knowledge payment." However, in the semantic understanding of "From Wu Xiaobo to Luo Zhenyu, what are the vulnerabilities of knowledge payment IPs?", "knowledge payment" is extracted as a single word as a tag, which makes the tag more accurate.

可选的,在分词过程中,使用基于词典的分词方法。给定一个分词的最大长度,以该长度进行切分,比对得到的分词是否在词典中出现,如果出现的话,该分词就是切词的结果,否则缩短词的长度。其中,切分方式分为正向切割和反向切割,分别得到正向切割结果和反向切割结果,选择长度最小的结果作为分词词汇集合。Optionally, a dictionary-based segmentation method can be used during the word segmentation process. Given a maximum word length, the word is segmented according to this length. The resulting word is compared to see if it appears in the dictionary. If it does, the segmented word is the final word; otherwise, the word length is shortened. The segmentation method is divided into forward segmentation and reverse segmentation, yielding forward and reverse segmentation results respectively. The result with the shortest length is selected as the word segmentation vocabulary set.
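
A minimal sketch of the dictionary-based segmentation described above, using bidirectional maximum matching with a toy dictionary and an assumed maximum word length; it is an illustration of the idea rather than the patented implementation.

```python
# Hedged sketch: forward and backward maximum matching against a small dictionary;
# the result with fewer words (fewer semantic units) is preferred.
DICT = {"知识付费", "知识", "付费", "脆弱点", "吴晓波", "罗振宇"}
MAX_LEN = 4  # assumed maximum word length

def max_match(text: str, reverse: bool = False) -> list:
    words, s = [], text[::-1] if reverse else text
    i = 0
    while i < len(s):
        for length in range(min(MAX_LEN, len(s) - i), 0, -1):
            piece = s[i:i + length]
            cand = piece[::-1] if reverse else piece
            if cand in DICT or length == 1:   # fall back to a single character
                words.append(cand)
                i += length
                break
    return words[::-1] if reverse else words

def segment(text: str) -> list:
    fwd, bwd = max_match(text), max_match(text, reverse=True)
    return fwd if len(fwd) <= len(bwd) else bwd

print(segment("知识付费脆弱点"))  # ['知识付费', '脆弱点']
```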

可选的,在分词过程中,使用基于隐含马尔可夫模型(Hidden Markov Model,HMM)的分词方法。获取每个状态之间转移的概率和每个状态产生该词概率,以及最初的状态,也即,每个状态在句首的概率;从最初状态开始,求状态序列集合{S,B,M,E}中四个状态分别可以产生第一个字符串的概率,并记录每个状态下的概率;判断第二个字符,首先由上一个时间状态下的状态转移到当前时间状态,并且取当前时间下每个状态能得到当前位置字符概率最大的那个,然后在记录一下上一个状态;依次处理每个字符;最后根据状态序列进行分词,得到分词词汇集合。Optionally, a word segmentation method based on a Hidden Markov Model (HMM) is used during the word segmentation process. The probability of transitions between each state and the probability of each state generating the word are obtained, along with the initial state, i.e., the probability of each state being at the beginning of the sentence. Starting from the initial state, the probability that each of the four states in the state sequence set {S, B, M, E} can generate the first string is calculated, and the probability of each state is recorded. To determine the second character, the state transitions from the previous state to the current state, and the state with the highest probability of obtaining the character at the current position is selected. The previous state is then recorded. Each character is processed sequentially. Finally, word segmentation is performed based on the state sequence to obtain a word segmentation vocabulary set.
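
A compact sketch of the HMM idea above, written as Viterbi decoding over the {S, B, M, E} state set; the start, transition, and emission probabilities are assumed to be estimated from a labelled corpus and supplied by the caller, so the structure rather than any particular numbers is what this shows.

```python
# Hedged sketch: Viterbi decoding over the BMES states described in the text.
import math

STATES = ["B", "M", "E", "S"]

def viterbi(chars, start_p, trans_p, emit_p):
    """Return the most likely {B, M, E, S} state sequence for a character sequence."""
    V = [{s: math.log(max(start_p.get(s, 0.0), 1e-12))
             + math.log(max(emit_p[s].get(chars[0], 0.0), 1e-12)) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in chars[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            prob, prev = max(
                (V[-2][p] + math.log(max(trans_p[p].get(s, 0.0), 1e-12))
                 + math.log(max(emit_p[s].get(ch, 0.0), 1e-12)), p)
                for p in STATES
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]

def states_to_words(chars, states):
    """Cut the character sequence into words at E (word end) and S (single character)."""
    words, buf = [], ""
    for ch, st in zip(chars, states):
        buf += ch
        if st in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```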

可选的,在分词过程中,使用基于二元语法的分词方法。遍历所有以当前位置为结尾的词,其中,词的长度有限制,找到该词之后再往前找到前一个词,也即,这两个词之间得到一个切分结果,最后求所有的切分结果中概率最大值,根据其对应的切分结果得到分词词汇集合。Optionally, a segmentation method based on bigrams can be used during the segmentation process. This involves iterating through all words ending at the current position, where word length is limited. After finding the current word, the previous word is found, resulting in a segmentation between these two words. Finally, the maximum probability among all segmentation results is calculated, and the segmented vocabulary set is obtained based on this maximum probability.

步骤202,根据分词词汇的上下文关系,确定目标文本的第一候选标签。Step 202: Determine the first candidate tag for the target text based on the contextual relationship of the segmented words.

在本申请实施例中,对分词词汇进行预设处理,得到分词词汇基于上下文关系的表示,也即,同时考虑该分词词汇前的上下文信息,和该分词词汇后的上下文信息,对人类用户一般读完整个句子才更好的确定词的重要度进行很好的拟合,从而能够更有效进行第一候选标签的确定。In this embodiment of the application, the segmented words are pre-processed to obtain a representation of the segmented words based on contextual relationships. That is, the contextual information before the segmented word and the contextual information after the segmented word are considered at the same time. This makes a good fit to the fact that human users usually need to read the whole sentence to better determine the importance of words, thereby enabling more effective determination of the first candidate label.

可选的,将分词词汇输入机器学习模型中,输出得到目标文本的第一候选标签。该机器学习模型包括双向长短时记忆(Bidirectional Long Short-Term Memory,Bi-LSTM)神经网络、条件随机场(Conditional Random Field,CRF)、注意力机制(Attention Mechanism)。Optionally, the segmented vocabulary is input into the machine learning model, and the output is the first candidate label of the target text. The machine learning model includes a bidirectional long short-term memory (Bi-LSTM) neural network, a conditional random field (CRF), and an attention mechanism.

可选的,对分词词汇基于词图模型(Text Rank,TR)进行标签提取,其中,该方法是由通过互联网中的超链接关系来确定一个网页排名方法(Page Rank,PR)改进而来,PR是一种基于投票思想设计的方法,当计算网页A的PR值时,需要知道有哪些网页链接到网页A,也就是要首先得到网页A的入链,然后通过入链给网页A的投票来计算网页A的PR值。请参考公式一:Optionally, tags are extracted from the segmented vocabulary using a Text Rank (TR) model. This method is an improvement on Page Rank (PR), a method for determining a webpage's ranking based on hyperlinks on the internet. PR is a voting-based approach; to calculate the PR value of webpage A, it's necessary to know which webpages link to it. This means first obtaining the inbound links to webpage A, and then calculating its PR value based on the votes cast by these inbound links. Please refer to Formula 1:
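
公式一 (Formula 1, reconstructed here in the standard PageRank form implied by the variable definitions in the next paragraph):

$$S(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$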

其中,Vi表示某个网页,Vj表示链接到Vi的网页(即Vi的入链),S(Vi)表示网页Vi的PR值,In(Vi)表示网页Vi的所有入链的集合,Out(Vj)表示网页Vj所指向的网页集合(即Vj的出链),d表示阻尼系数,是用来克服这个公式中“d*”后面的部分的固有缺陷用的:如果仅仅有求和的部分,那么该公式将无法处理没有入链的网页的PR值,因为这时,根据该公式这些网页的PR值为0,但实际情况却不是这样,所以加入了一个阻尼系数来确保每个网页都有一个大于0的PR值,在一个实例中,在0.85的阻尼系数下,大约100多次迭代PR值就能收敛到一个稳定的值,而当阻尼系数接近1时,需要的迭代次数会陡然增加很多,且排序不稳定。公式中S(Vj)前面的分数指的是Vj所有出链指向的网页应该平分Vj的PR值,这样才算是把当前票分给了当前链接到的网页。In this formula, Vi represents a webpage, Vj represents webpages linking to Vi (i.e., incoming links to Vi), S(Vi) represents the PageRank (PR) value of webpage Vi, In(Vi) represents the set of all incoming links to webpage Vi, Out(Vj) represents the set of webpages that Vj links to (i.e., the outbound links of Vj), and d represents the damping coefficient, which is used to overcome the inherent flaw in the part after "d*" in this formula: if there is only a summation part, the formula will not be able to handle the PR value of webpages without incoming links, because in this case, according to the formula, the PR value of these webpages is 0, but this is not the case in reality. Therefore, a damping coefficient is added to ensure that each webpage has a PR value greater than 0. In one example, with a damping coefficient of 0.85, the PR value can converge to a stable value in about 100 iterations. However, when the damping coefficient is close to 1, the number of iterations required will increase sharply, and the sorting will be unstable. The fraction before S(Vj) in the formula means that the PR value of Vj should be equally distributed among all the webpages pointed to by all outgoing links of Vj, so that the current vote is distributed to the currently linked webpage.

TR是由PR改进而来,在计算过程中,仅多了一个权重项Wji,用来表示两个节点之间的边连接有不同的重要程度。该方法包括:(1)将目标文本按照完整的句子进行分割;(2)对于每个句子,进行分词和词性标注处理,并过滤停用词,只保留指定词性的单词,如名词、动词、形容词,即,得到句子集合,其中集合中的元素为保留的候选标签;(3)构建候选关键词图G=(V,E),其中V为节点集,由(2)生成的候选关键词组成,然后采用共现关系构造任两点之间的边,两个节点之间存在边仅当它们对应的词汇在长度为m的窗口中共现,m表示窗口大小,即最多共现m个单词;(4)计算迭代传播各节点的权重,直至收敛;(5)对节点权重进行倒序排序,从而得到最重要的T个词汇,作为候选标签;(6)在原始的目标文本中标记(5)中得到的T个词汇,若形成相邻词组,则组合成多词标签。TR is an improvement on PR. In the calculation process, only one weight term W ji is added to indicate that the edge connection between two nodes has different importance. The method includes: (1) dividing the target text into complete sentences; (2) for each sentence, performing word segmentation and part-of-speech tagging, filtering stop words, and retaining only words with specified parts of speech, such as nouns, verbs, and adjectives, that is, obtaining a set of sentences, where the elements in the set are the retained candidate labels; (3) constructing a candidate keyword graph G = (V, E), where V is a set of nodes composed of candidate keywords generated in (2), and then using co-occurrence relations to construct an edge between any two points. An edge exists between two nodes only if their corresponding words co-occur in a window of length m, where m represents the window size, that is, at most m words co-occur; (4) calculating the weight of each node iteratively propagating until convergence; (5) sorting the node weights in reverse order to obtain the most important T words as candidate labels; (6) marking the T words obtained in (5) in the original target text. If adjacent word groups are formed, they are combined into multi-word labels.
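
A rough sketch of the TextRank procedure in steps (1)–(6) above: build a co-occurrence graph over the filtered candidate words, iterate PageRank-style scores, and keep the top-T words. The window size, damping factor, iteration count, and T are illustrative defaults, not values taken from the patent.

```python
# Hedged sketch of TextRank keyword extraction on already-filtered candidate words.
from collections import defaultdict

def textrank(words, window=5, d=0.85, iterations=100, top_t=5):
    # 1) Build the co-occurrence graph: an edge links words appearing within `window`.
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    # 2) Iterate the PageRank-style update until (approximate) convergence.
    score = {w: 1.0 for w in graph}
    for _ in range(iterations):
        score = {w: (1 - d) + d * sum(score[n] / len(graph[n]) for n in graph[w])
                 for w in graph}
    # 3) Return the top-T words as candidate tags.
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:top_t]]

# Usage with candidate words kept after part-of-speech filtering.
print(textrank(["知识付费", "IP", "脆弱点", "知识付费", "内容", "IP"]))
```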

步骤203,根据分词词汇在目标文本中的第一频率参数,和分词词汇在文本集合中的第二频率参数,确定目标文本的第二候选标签。Step 203: Determine the second candidate label of the target text based on the first frequency parameter of the segmented words in the target text and the second frequency parameter of the segmented words in the text set.

在本申请实施例中,根据第一频率参数tfi,j和第二频率参数idfi,确定分词词汇对应的词汇频率tfidfi,j,其中,第一频率参数tfi,j的计算方法请参考公式二,第二频率参数idfi的计算方法请参考公式三:In this embodiment, the word frequency tfidf i,j corresponding to the segmented words is determined based on the first frequency parameter tf i,j and the second frequency parameter idf i . The calculation method for the first frequency parameter tf i,j is given in Formula 2, and the calculation method for the second frequency parameter idf i is given in Formula 3.
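
公式二、公式三 (Formulas 2 and 3, reconstructed in the standard TF-IDF form implied by the variable definitions in the next paragraph):

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad\qquad idf_i = \log\frac{|D|}{|\{\,j : t_i \in d_j\,\}|}$$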

其中,tfi,j表示第一频率参数,idfi表示第二频率参数,ni,j表示当前分词词汇在文本集合j中出现的个数,|D|表示语料库中文本个数,|{j:ti∈dj}|表示包含分词词汇的文本集合个数。Where tf i,j represents the first frequency parameter, idf i represents the second frequency parameter, n i,j represents the number of times the current segmented word appears in text set j, |D| represents the number of texts in the corpus, and |{j:t i ∈d j }| represents the number of text sets containing the segmented word.

再确定的第一频率参数和第二频率参数之积,作为分词词汇对应的词汇频率tfidfi,j,该词汇频率的计算公式请参考公式四:The product of the first and second frequency parameters is then used as the word frequency tfidf i,j corresponding to the segmented word. The formula for calculating this word frequency is shown in Formula 4.

公式四:tfidfi,j=tfi,j×idfi Formula 4: tfidf i,j = tf i,j × idf i

将词汇频率符合频率要求的分词词汇,确定为目标文本的第二候选标签。Words whose frequency meets the frequency requirements are identified as the second candidate tags for the target text.
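
A minimal sketch of Formulas 2–4 applied to the segmented words of a target text against a small in-memory corpus; the toy corpus and the top-k selection rule standing in for "meets the frequency requirement" are assumptions for illustration.

```python
# Hedged sketch: compute tf, idf and tf-idf for the target text's segmented words.
import math
from collections import Counter

def tfidf_candidates(target_words, corpus_docs, top_k=3):
    counts = Counter(target_words)
    total = sum(counts.values())
    num_docs = len(corpus_docs)
    scores = {}
    for word, n in counts.items():
        tf = n / total                                            # Formula 2
        doc_freq = sum(1 for doc in corpus_docs if word in doc)   # |{j : t_i in d_j}|
        idf = math.log(num_docs / doc_freq) if doc_freq else 0.0  # Formula 3
        scores[word] = tf * idf                                   # Formula 4
    # Words with the highest tf-idf are kept as second candidate tags.
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

corpus = [{"房价", "政策", "上涨"}, {"旅游", "北京"}, {"手机", "苹果"}]
print(tfidf_candidates(["房价", "政策", "新闻", "房价"], corpus))
```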

步骤204,根据第一候选标签和第二候选标签确定目标文本的标签。Step 204: Determine the tags of the target text based on the first candidate tags and the second candidate tags.

在本申请实施例中,可选的,响应于所述第一候选标签和所述第二候选标签之间存在交集的情况,对第一候选标签和第二候选标签取交集,得到目标文本的标签。响应于第一候选标签和第二候选标签之间不存在交集的情况,根据预设的选择规则,从第一候选标签和第二候选标签中确定目标文本的标签。在一个示例中,该预设的选择规则为给第一候选标签和第二候选标签根据目标文本属于的领域分配不同权重,当第一候选标签的权重大于第二候选标签的权重时,选取第一候选标签作为目标文本的标签,当第一候选标签的权重小于第二候选标签的权重时,选取第二候选标签作为目标文本的标签。In this embodiment, optionally, in response to the existence of an intersection between the first candidate tag and the second candidate tag, the intersection of the first candidate tag and the second candidate tag is taken to obtain the tag of the target text. In response to the existence of no intersection between the first candidate tag and the second candidate tag, the tag of the target text is determined from the first candidate tag and the second candidate tag according to a preset selection rule. In one example, the preset selection rule is to assign different weights to the first candidate tag and the second candidate tag according to the domain to which the target text belongs. When the weight of the first candidate tag is greater than the weight of the second candidate tag, the first candidate tag is selected as the tag of the target text; when the weight of the first candidate tag is less than the weight of the second candidate tag, the second candidate tag is selected as the tag of the target text.
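
A sketch of the merging rule just described: take the intersection of the two candidate sets when it is non-empty, otherwise fall back to domain-dependent weights. The example domains and weight values are invented for illustration only.

```python
# Hedged sketch of combining the first and second candidate tag sets.
def merge_tags(first_candidates, second_candidates, domain, domain_weights=None):
    first, second = set(first_candidates), set(second_candidates)
    common = first & second
    if common:                        # the two candidate sets agree: use the intersection
        return common
    w_first, w_second = (domain_weights or {}).get(domain, (0.5, 0.5))
    return first if w_first >= w_second else second

# Hypothetical weights: in the "news" domain trust the context-based candidates more.
weights = {"news": (0.7, 0.3), "agriculture": (0.4, 0.6)}
print(merge_tags({"房价", "政策"}, {"政策", "经济"}, "news", weights))   # {'政策'}
print(merge_tags({"房价"}, {"经济"}, "news", weights))                   # {'房价'}
```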

综上所述,本实施例提供的文本标签确定方法,在通过对目标文本的每个分词词汇的上下文关系,及每个分词词汇在目标文本中的第一频率参数、在文本集合中的第二频率参数共同确定目标文本的标签,从深层语义和浅层频率两方面共同确定目标文本的标签,提高了目标文本标签确定的准确度,且通过机器学习模型有效地代替人工特征识别工程,提高了目标文本标签确定的效率。In summary, the text label determination method provided in this embodiment determines the label of the target text by jointly considering the contextual relationship of each word segment in the target text, as well as the first frequency parameter of each word segment in the target text and the second frequency parameter in the text set. This method determines the label of the target text from both deep semantic and shallow frequency perspectives, thereby improving the accuracy of target text label determination. Furthermore, by effectively replacing manual feature recognition engineering with a machine learning model, the efficiency of target text label determination is improved.

结合上述实施例,以通过机器学习模型得到基于分词词汇的上下文关系的第一候选标签为例,对本申请实施例提供的文本标签确定方法进行说明,请参考图3,其示出了一个示例性实施例提供的文本标签确定方法的流程图,该方法包括:In conjunction with the above embodiments, taking the first candidate label obtained through a machine learning model based on the contextual relationship of segmented words as an example, the text label determination method provided in this application embodiment will be described. Please refer to Figure 3, which shows a flowchart of a text label determination method provided in an exemplary embodiment. The method includes:

步骤301,对目标文本进行分词处理,得到分词集合。Step 301: Perform word segmentation on the target text to obtain a word segmentation set.

分词集合的获取过程的相关内容请参考上述步骤201,此处不再赘述。Please refer to step 201 above for details on the process of obtaining the word segmentation set; it will not be repeated here.

步骤302,对分词词汇进行特征提取,得到分词词汇的词汇向量。Step 302: Extract features from the segmented words to obtain word vectors.

在本申请实施例中,示意性的,通过爬虫实时获取互联网的大规模语料,并通过无监督方式学习得到分词对应的词汇向量,根据目标文本分解得到的分词集合,得到分词序列,将分词序列输入到语料库中查询得到对应的词汇向量。In this embodiment of the application, a large-scale corpus of the Internet is obtained in real time by crawling, and the word vectors corresponding to the word segments are learned in an unsupervised manner. The word segmentation sequence is obtained from the word segmentation set obtained by decomposing the target text, and the word segmentation sequence is input into the corpus to query the corresponding word vectors.

步骤303,对词汇向量结合上下文词汇向量进行特征分析,得到分词词汇对应的实体概率。Step 303: Perform feature analysis on the word vectors in combination with the context word vectors to obtain the entity probabilities corresponding to the segmented words.

在本申请实施例中,对所述词汇向量结合上下文词汇向量进行特征分析,得到分词词汇对应的实体概率,在一个示例中,将词汇向量输入双向长短时记忆神经网络,其中,实体概率中包括第一概率、第二概率和第三概率,第一概率表示分词词汇属于标签实体的概率,第二概率表示分词词汇不属于标签实体的概率,第三概率表示分词词汇属于标签实体内对应实体的概率。例如,若输入的分词词汇为实体,例如人名、地名或机构名等,则其第一概率高;若输入的分词词汇为非实体,例如介词、动词或形容词等,则其第二概率高;若输入的分词词汇为实体且其前一分词词汇为实体,则其第三概率高。In this embodiment, feature analysis is performed on the word vector combined with the context word vector to obtain the entity probability corresponding to the segmented word. In one example, the word vector is input into a bidirectional long short-term memory neural network. The entity probability includes a first probability, a second probability, and a third probability. The first probability represents the probability that the segmented word belongs to the labeled entity, the second probability represents the probability that the segmented word does not belong to the labeled entity, and the third probability represents the probability that the segmented word belongs to the corresponding entity within the labeled entity. For example, if the input segmented word is an entity, such as a person's name, place name, or organization name, its first probability is high; if the input segmented word is a non-entity, such as a preposition, verb, or adjective, its second probability is high; if the input segmented word is an entity and its preceding segmented word is also an entity, its third probability is high.

之后得到词汇向量对应的第一概率、第二概率以及第三概率。从第一概率、第二概率以及第三概率中,将数值最高的概率确定为分词词汇的实体概率。Then, the first probability, second probability, and third probability corresponding to the word vector are obtained. From the first probability, second probability, and third probability, the probability with the highest value is determined as the entity probability of the segmented word.

步骤304,根据实体概率从分词词汇中确定目标文本的第一候选标签。Step 304: Determine the first candidate tag of the target text from the segmented vocabulary based on the entity probability.

在本申请实施例中,滤除实体概率对应为第二概率的分词词汇,可选的,第二概率对应的分词词汇不属于标签实体,也即,是不可作为标签的分词词汇,如词性为介词的分词词汇。再根据第一概率和第三概率确定实体概率对应的分词词汇,得到第一候选标签。In this embodiment, word segmentation words whose entity probability corresponds to the second probability are filtered out. Optionally, the word segmentation words corresponding to the second probability do not belong to the tag entity, that is, they are word segmentation words that cannot be used as tags, such as word segmentation words whose part of speech is a preposition. Then, the word segmentation words corresponding to the entity probability are determined according to the first probability and the third probability to obtain the first candidate tag.

在一个示例中,请参考图4,将词汇向量410输入至双向长短时记忆神经网络(即图4中的BiLSTM)420,分词词汇411与词汇向量410对应,每个输入的词汇向量410都会输出三个预测分值,该三个预测分值分别代表第一概率B、第二概率O和第三概率I430,可选的,将预测分值输入至条件随机场(即图4中的CRF)440,得到预测标签(即图4中各个分词词汇对应的B标签、I标签或O标签)450,根据预测标签得到第一候选标签。In one example, referring to Figure 4, the word vector 410 is input into a bidirectional long short-term memory neural network (i.e., BiLSTM in Figure 4) 420. The word segmentation words 411 correspond to the word vector 410. Each input word vector 410 will output three predicted scores, which represent the first probability B, the second probability O, and the third probability I 430, respectively. Optionally, the predicted scores are input into a conditional random field (i.e., CRF in Figure 4) 440 to obtain predicted labels (i.e., B labels, I labels, or O labels corresponding to each word segmentation word in Figure 4) 450. The first candidate label is obtained based on the predicted labels.
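
A hedged sketch of the Figure 4 flow (word vectors → BiLSTM → three per-word scores for B/I/O), written with PyTorch modules; the CRF decoding step described in the text is replaced here by a greedy argmax, and all dimensions are assumed values rather than the patent's configuration.

```python
# Hedged sketch: BiLSTM emission scores for the B/I/O tags of each segmented word.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=64, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.to_tags = nn.Linear(2 * hidden_dim, num_tags)   # scores for B, I, O

    def forward(self, token_ids):
        emb = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        out, _ = self.bilstm(emb)              # (batch, seq_len, 2 * hidden_dim)
        return self.to_tags(out)               # (batch, seq_len, 3)

model = BiLSTMTagger()
tokens = torch.randint(0, 1000, (1, 6))        # one sentence of 6 word ids
scores = model(tokens)
tags = scores.argmax(dim=-1)                   # greedy stand-in for CRF decoding
print(tags.shape)                              # torch.Size([1, 6])
```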

可选的,在对词汇向量结合上下文词汇向量进行特征分析之后,增加对分词词汇的自注意力计算,计算机制请参考图5,自注意力的计算方法包括:图中的Value501是双向长短时记忆神经网络的输出词汇向量,Key502是不同于Value501的参数矩阵,注意力函数的本质可以被描述为一个查询元素(即图中的Query,对应当中自注意力计算层)503得到到一系列(键Key502-值Value501)对的映射,即通过计算Query503和各个Key502的相似性或者相关性,得到每个Key502对应Value501的权重系数,然后对Value501进行加权求和,即得到了最终的注意力数值504Attention,以公式表达为如下公式五:Optionally, after performing feature analysis on the word vectors combined with the context word vectors, self-attention calculation is added for the segmented words. The calculation mechanism is shown in Figure 5. The self-attention calculation method includes: Value501 in the figure is the output word vector of the bidirectional long short-term memory neural network, and Key502 is a parameter matrix different from Value501. The essence of the attention function can be described as a query element (i.e., Query in the figure, corresponding to the self-attention calculation layer) 503 obtaining a series of (key502-value501) pairs. That is, by calculating the similarity or relevance between Query503 and each Key502, the weight coefficient of Value501 corresponding to each Key502 is obtained. Then, the Value501 is weighted and summed to obtain the final attention value 504, which is expressed as Formula 5 below:
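
公式五 (Formula 5, reconstructed from the surrounding description as a similarity-weighted sum over the Key–Value pairs):

$$Attention(Query, Source) = \sum_{i=1}^{L_x} Similarity(Query, Key_i) \cdot Value_i$$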

其中,Attention(Query,Source)代表当前句子(Source)和当前元素对应的注意力值,Source代表输入句子,Query代表元素,Key代表参数矩阵,Lx代表句子的长度,Similarity(Query,Keyi)参见公式七。Where Attention(Query,Source) represents the attention value corresponding to the current sentence (Source) and the current element, Source represents the input sentence, Query represents the element, Key represents the parameter matrix, Lx represents the length of the sentence, and Similarity(Query,Keyi) is defined in Formula 7.

计算机制请参考图6,将其归纳为两个过程:第一个过程是根据Query603和Key602计算权重系数,第二个过程根据权重系数对Value601进行加权求和。而第一个过程又可以细分为两个阶段:第一个阶段610根据Query603和Key602计算两者的相似性或者相关性得到中间向量;第二个阶段620对第一阶段610的原始分值进行归一化处理。其中,在第一个阶段610,可以引入不同的函数和计算机制,根据Query603和某个Key602,计算两者的相似性或者相关性,最常见的方法包括:求两者的向量点积,得到第一中间向量(即图6中的S1、S2、S3、S4)612,计算方法请参考公式六611(即图中的S(Q,K)):The calculation mechanism is shown in Figure 6, which can be summarized into two processes: the first process calculates the weight coefficients based on Query603 and Key602, and the second process performs a weighted summation of Value601 based on the weight coefficients. The first process can be further divided into two stages: the first stage 610 calculates the similarity or relevance between Query603 and Key602 to obtain an intermediate vector; the second stage 620 normalizes the original scores from the first stage 610. In the first stage 610, different functions and calculation mechanisms can be introduced to calculate the similarity or relevance between Query603 and a specific Key602. The most common method includes calculating the dot product of the two vectors to obtain the first intermediate vector (i.e., S1, S2, S3, S4 in Figure 6) 612. The calculation method is shown in Formula 6 611 (i.e., S(Q,K) in the figure).

公式六:Similarity(Query,Keyi)=Query·Keyi Formula 6: Similarity(Query,Keyi) = Query·Keyi

其中,Similarity代表向量点积,Query代表元素,Keyi代表第i个参数矩阵。Where Similarity represents the vector dot product, Query represents the element, and Keyi represents the i-th parameter matrix.

第一阶段610产生的第一中间向量根据具体产生的方法不同其数值取值范围也不一样,第二阶段620引入类似SoftMax归一化621的计算方式对第一阶段的得分进行数值转换得到第二中间向量(即图中的a1、a2、a3、a4)622,一方面可以进行归一化,将原始计算分值整理成所有元素权重之和为1的概率分布;另一方面也可以通过SoftMax的内在机制更加突出重要元素的权重,计算方法请参考公式七:The first intermediate vector generated in the first stage (610) has a different value range depending on the specific generation method. The second stage (620) introduces a calculation method similar to SoftMax normalization (621) to transform the scores from the first stage into the second intermediate vector (i.e., a1, a2, a3, a4 in the diagram) (622). This normalization process can, on the one hand, organize the original calculated scores into a probability distribution where the sum of all element weights is 1; on the other hand, it can further highlight the weights of important elements through the inherent mechanism of SoftMax. For the calculation method, please refer to Formula 7.
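
公式七 (Formula 7, reconstructed as the SoftMax normalization described above):

$$a_i = SoftMax(Sim_i) = \frac{e^{Sim_i}}{\sum_{j=1}^{L_x} e^{Sim_j}}$$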

其中,ai为第二中间向量,Softmax()为softmax函数,Simi为第一阶段求出的第一中间向量,Lx表示当前句子(Source)的长度。Where ai is the second intermediate vector, Softmax() is the softmax function, Simi is the first intermediate vector obtained in the first stage, and Lx represents the length of the current sentence (Source).

第三阶段630的计算结果即为对应的权重系数,然后进行加权求和即可得到注意力数值604,请参考公式八:The result of the third stage 630 is the corresponding weight coefficient; a weighted sum then yields the attention value 604. Please refer to Formula 8:
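
公式八 (Formula 8, reconstructed as the weighted sum of the weight coefficients and the Value vectors):

$$Attention(Query, Source) = \sum_{i=1}^{L_x} a_i \cdot Value_i$$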

其中,Attention(Query,Source)代表当前句子(Source)和当前元素对应的注意力值,Lx表示当前句子(Source)的长度,ai为第二中间向量,Valuei为双向长短时记忆神经网络的输出向量。Where Attention(Query,Source) represents the attention value corresponding to the current sentence (Source) and the current element, Lx represents the length of the current sentence (Source), ai is the second intermediate vector, and Valuei is the output vector of the bidirectional long short-term memory neural network.
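
A small numerical sketch of Formulas 6–8: dot-product similarity between the query and each key, softmax normalization, and the weighted sum of the value vectors. The vector sizes are toy assumptions.

```python
# Hedged sketch of the attention computation described by Formulas 6-8.
import numpy as np

def attention(query, keys, values):
    sims = keys @ query                           # Formula 6: dot products S1..SLx
    a = np.exp(sims - sims.max())                 # Formula 7: numerically stable softmax
    a = a / a.sum()
    return (a[:, None] * values).sum(axis=0)      # Formula 8: weighted sum of the Values

rng = np.random.default_rng(0)
query = rng.normal(size=4)                        # Query vector
keys = rng.normal(size=(3, 4))                    # Key_1..Key_3
values = rng.normal(size=(3, 4))                  # Value_1..Value_3 (BiLSTM outputs)
print(attention(query, keys, values).shape)       # (4,)
```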

步骤305,根据分词词汇在目标文本中的第一频率参数,和分词词汇在文本集合中的第二频率参数,确定目标文本的第二候选标签。Step 305: Determine the second candidate tag for the target text based on the first frequency parameter of the segmented words in the target text and the second frequency parameter of the segmented words in the text set.

可选的,第二候选标签还可以通过词库特征或领域特征得到,示意性的,领域特征包括特定垂类领域的词汇编码,例如苹果在电子设备领域代表一个品牌,在农业领域代表一种水果。Optionally, the second candidate label can also be obtained through lexical features or domain features. For example, domain features include lexical codes for specific vertical domains. For instance, Apple represents a brand in the field of electronic devices and a type of fruit in the field of agriculture.

第二候选标签的确定方法的相关内容请参考上述步骤203,此处不再赘述。For details on the method for determining the second candidate label, please refer to step 203 above; it will not be repeated here.

步骤306,根据第一候选标签和第二候选标签确定目标文本的标签。Step 306: Determine the tags of the target text based on the first candidate tags and the second candidate tags.

可选的,还可以将从双向长短时记忆神经网络输出的数据进行自注意力计算得到的结果,与第二候选标签共同输入条件随机场模型。使用条件随机场确定标签的方法,请参考图7,其中,特征包括:字、词性、分词边界、特征词、词库,词库包括人名、地名、机构、影视、小说、音乐、医疗、网络舆情热词等词库。特征收集好之后需要配置特征模版,特征模版需要配置同一个特征不同位置组合,同一个位置不同特征组合,不同位置不同特征组合。在一个示例中,特征确定举例为:(1)如果当前位置701分词702的(词性703为名词)且(是否词库704=1)且(标签705=1),则让特征模板中的t1=1否则t1=0,如果权重λ1越大,模型越倾向把词库中的词当成标签。(2)如果上一个位置的分词的词性703是标点,下一个词性703是动词且当前分词是名词,则让特征模板中的t2=1,如果权重λ2越大,模型越倾向把夹在标点和动词之间的名词当作标题实体。Optionally, the result of self-attention calculation from the data output by the bidirectional long short-term memory neural network can be input into the conditional random field model along with the second candidate label. For the method of determining the label using conditional random field, please refer to Figure 7. The features include: characters, parts of speech, word segmentation boundaries, feature words, and a lexicon. The lexicon includes lexicons of personal names, place names, institutions, movies, novels, music, medical terms, and popular online public opinion terms. After the features are collected, feature templates need to be configured. The feature templates need to be configured with different combinations of the same feature at different positions, different combinations of features at the same position, and different combinations of features at different positions. In one example, feature determination is as follows: (1) If the current position 701 is segmented into 702 (part of speech 703 is a noun) and (whether the lexicon is 704 = 1) and (label 705 = 1), then let t1 = 1 in the feature template; otherwise, t1 = 0. If the weight λ1 is larger, the model is more inclined to treat the words in the lexicon as labels. (2) If the part of speech 703 of the previous word segment is punctuation, the part of speech 703 of the next word segment is verb and the current word segment is noun, then let t2 = 1 in the feature template. If the weight λ2 is larger, the model is more inclined to treat the noun sandwiched between punctuation and verb as the title entity.
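
A sketch of the two feature templates above written as indicator functions; the token fields, the example weights λ1 and λ2, and the omission of the label condition from template (1) are simplifications for illustration — a real CRF learns the weights from annotated data.

```python
# Hedged sketch of feature-template indicator functions over a token window.
from dataclasses import dataclass

@dataclass
class Token:
    word: str
    pos: str          # part of speech, e.g. "noun", "verb", "punct"
    in_lexicon: bool  # whether the word appears in one of the tag lexicons

def t1(current: Token) -> int:
    # Template (1): the current word is a noun found in the lexicon.
    return int(current.pos == "noun" and current.in_lexicon)

def t2(prev: Token, current: Token, nxt: Token) -> int:
    # Template (2): a noun sandwiched between punctuation and a verb.
    return int(prev.pos == "punct" and current.pos == "noun" and nxt.pos == "verb")

def score(prev, current, nxt, lambda1=1.5, lambda2=0.8):
    # Larger lambda1 pushes the model to treat lexicon words as tags;
    # larger lambda2 favours nouns between punctuation and a verb.
    return lambda1 * t1(current) + lambda2 * t2(prev, current, nxt)

tokens = [Token(",", "punct", False), Token("故宫", "noun", True), Token("逛", "verb", False)]
print(score(*tokens))   # 2.3
```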

综上所述,本实施例提供的文本标签确定方法,在通过对目标文本的每个分词词汇的上下文关系,及每个分词词汇在目标文本中的第一频率参数、在文本集合中的第二频率参数共同确定目标文本的标签,从深层语义和浅层频率两方面共同确定目标文本的标签,提高了目标文本标签确定的准确度,且通过机器学习模型有效地代替人工特征识别工程,提高了目标文本标签确定的效率。In summary, the text label determination method provided in this embodiment determines the label of the target text by jointly considering the contextual relationship of each word segment in the target text, as well as the first frequency parameter of each word segment in the target text and the second frequency parameter in the text set. This method determines the label of the target text from both deep semantic and shallow frequency perspectives, thereby improving the accuracy of target text label determination. Furthermore, by effectively replacing manual feature recognition engineering with a machine learning model, the efficiency of target text label determination is improved.

在一个可选的实施例中,以本申请实施例提供的文本标签确定方法应用于短视频应用程序中为例进行说明,对该方法的系统过程进行示意性说明,请参考图8,该系统流程800包括四个主要过程,该四个主要过程包括视频获取810、内容整理820、标签确定830以及视频分发840。In an optional embodiment, the text tag determination method provided in this application embodiment is used as an example to illustrate the system process of the method. Please refer to Figure 8. The system process 800 includes four main processes, namely video acquisition 810, content organization 820, tag determination 830, and video distribution 840.

视频获取810执行过程中,接收用户通过客户端上传的视频文件,可选的,该客户端对应的应用程序可以是与视频分发840执行过程中接收视频的应用程序相同,两者也可以不同。During the execution of video acquisition 810, video files uploaded by users through a client are received. Optionally, the application corresponding to this client can be the same as the application that receives the video during the execution of video distribution 840, or the two can be different.

内容整理820中包括:语料收集821、视频文本转换822、分词提取823;Content organization 820 includes: corpus collection 821, video text conversion 822, word segmentation and extraction 823;

其中,语料收集821在执行过程中,通过爬虫实时获取互联网的大规模语料,并通过无监督方式学习得到分词对应的词汇向量;视频文本转换822在执行过程中,通过抽取视频文件中音频,将音频转换为文字,得到目标文本,或抽取视频帧,通过光学字符识别视频中的文字得到目标文本;分词提取823在执行过程中,将视频文本转换822得到的目标文本分解为多个分词词汇集合。In the process of corpus collection 821, a large-scale corpus of the Internet is acquired in real time through web crawling, and the word vectors corresponding to the word segmentation are learned in an unsupervised manner; in the process of video text conversion 822, the audio in the video file is extracted and converted into text to obtain the target text, or the video frame is extracted and the text in the video is obtained through optical character recognition; in the process of word segmentation extraction 823, the target text obtained by video text conversion 822 is decomposed into multiple word segmentation vocabulary sets.

标签确定830在执行过程中,获取内容整理820得到的分词词汇,对分词词汇进行处理确定目标文本的标签,将标签与目标文本的对应关系输出到视频分发840。During the execution of tag determination 830, the word segmentation vocabulary obtained by content processing 820 is acquired, the word segmentation vocabulary is processed to determine the tags of the target text, and the correspondence between the tags and the target text is output to video distribution 840.

视频分发840在执行过程中,根据标签和目标文本的对应关系确定标签和视频的对应关系,根据标签对视频进行分类,将视频分发至应用程序对应视频分类的接口中,或,根据用户画像,将视频分发至可能对该视频感兴趣的用户客户端中。During the execution of the video distribution process, the video distribution 840 determines the correspondence between tags and videos based on the correspondence between tags and target text, classifies videos according to tags, and distributes videos to the corresponding video category interfaces in the application, or distributes videos to user clients who may be interested in the video based on user profiles.
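For illustration only, the distribution step can be read as two routing rules: send the video to the category feeds that its tags map to, and to the users whose profile interests overlap with its tags. The data structures and the sample tags below are hypothetical and only show the shape of the matching logic.

```python
# Hypothetical sketch of video distribution 840: category routing plus
# profile-based routing driven by the tags determined for the video's text.
def distribute(video_tags, tag_to_category, user_profiles):
    # 1) Category routing: every tag with a known category contributes a feed.
    categories = {tag_to_category[t] for t in video_tags if t in tag_to_category}
    # 2) Personalised routing: users whose interest tags intersect the video tags.
    interested_users = [uid for uid, interests in user_profiles.items()
                        if interests & set(video_tags)]
    return categories, interested_users

categories, users = distribute(
    ["机器学习", "短视频"],
    tag_to_category={"机器学习": "科技", "美食": "生活"},
    user_profiles={"u1": {"机器学习"}, "u2": {"美食"}},
)  # -> ({"科技"}, ["u1"])
```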

其中,请参考图9,标签确定930包括分词映射向量层931、中间向量层932、自注意力计算层933、分词映射外部特征层934、全连接中间层935、全连接层936、多分类层937、约束层938、目标文本分词词汇900。Referring to Figure 9, the label determination 930 includes a word segmentation mapping vector layer 931, an intermediate vector layer 932, a self-attention calculation layer 933, a word segmentation mapping external feature layer 934, a fully connected intermediate layer 935, a fully connected layer 936, a multi-classification layer 937, a constraint layer 938, and a target text word segmentation vocabulary 900.

分词词汇分别输入分词映射向量层931和分词映射外部特征层934。在分词映射向量层931中，分词词汇被映射成词汇向量，之后输入双向长短时记忆神经网络，即依次经过前向循环神经网络(Recurrent Neural Network，RNN)层9311和后向循环神经网络层9312，结果输入到中间向量层932，并重复上述循环神经网络的过程；从中间向量层932输出多个线性映射关系9321，并将其输入至自注意力计算层933，计算后的结果输入全连接层936。在分词映射外部特征层934中，分词词汇被映射为外部特征，结果输出至全连接中间层935进行连接，再输出至全连接层936，与自注意力计算层933输出的结果进行全连接；全连接后的结果输出至多分类层937，分类结果输入约束层938，最后输出目标文本的标签。The segmented words are input into the word-segmentation mapping vector layer 931 and the word-segmentation mapping external-feature layer 934, respectively. In the word-segmentation mapping vector layer 931, the segmented words are mapped into word vectors and then input into a bidirectional long short-term memory neural network, that is, they pass through a forward recurrent neural network (RNN) layer 9311 and a backward recurrent neural network layer 9312 in turn; the result is input into the intermediate vector layer 932, and the above recurrent process is repeated. Multiple linear mapping relationships 9321 are output from the intermediate vector layer 932 and input into the self-attention calculation layer 933, and the calculated result is input into the fully connected layer 936. In the word-segmentation mapping external-feature layer 934, the segmented words are mapped into external features, and the result is output to the fully connected intermediate layer 935 for connection and then to the fully connected layer 936, where it is fully connected with the result output by the self-attention calculation layer 933. The fully connected result is then output to the multi-classification layer 937, the classification result is input into the constraint layer 938, and finally the label of the target text is output.
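A minimal PyTorch sketch of the tag-determination network of Figure 9 is given below, assuming one possible set of layer sizes: an embedding layer for 931, a bidirectional LSTM for 9311/9312 and 932, multi-head self-attention for 933, a linear branch for the external features of 934/935, a fully connected layer for 936 and a per-word classifier for 937; the CRF constraint layer 938 is only indicated in a comment. All dimensions, class counts and names are illustrative assumptions, not the exact configuration of the original embodiment.

```python
import torch
import torch.nn as nn

class TagDeterminationNet(nn.Module):
    """Sketch of layers 931-937 in Figure 9; sizes are illustrative."""

    def __init__(self, vocab_size, emb_dim=128, hidden=128,
                 ext_feat_dim=32, num_tag_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                # 931
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)                     # 9311/9312 -> 932
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden,
                                          num_heads=4,
                                          batch_first=True)           # 933
        self.ext_fc = nn.Linear(ext_feat_dim, hidden)                  # 934 -> 935
        self.fc = nn.Linear(2 * hidden + hidden, hidden)               # 936
        self.classifier = nn.Linear(hidden, num_tag_classes)           # 937

    def forward(self, token_ids, ext_features):
        # token_ids: (batch, seq_len); ext_features: (batch, seq_len, ext_feat_dim)
        x = self.embed(token_ids)
        h, _ = self.bilstm(x)                      # contextual states for each word
        a, _ = self.attn(h, h, h)                  # self-attention over the sequence
        e = torch.relu(self.ext_fc(ext_features))  # external-feature branch
        z = torch.relu(self.fc(torch.cat([a, e], dim=-1)))
        logits = self.classifier(z)                # per-word class scores
        # A CRF constraint layer (938) would decode these scores into a valid tag path.
        return logits
```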

图10是本申请一个示例性实施例提供的文本标签确定装置的结构框图,该装置包括:Figure 10 is a structural block diagram of a text label determining device provided in an exemplary embodiment of this application. The device includes:

处理模块1010,用于对目标文本进行分词处理,得到分词集合,所述分词集合中包括所述目标文本分词得到的分词词汇,所述目标文本为待确定标签的文本;Processing module 1010 is used to perform word segmentation on target text to obtain a word segmentation set, wherein the word segmentation set includes word segments obtained from the word segmentation of the target text, and the target text is the text whose label is to be determined;

确定模块1020,用于根据所述分词词汇的上下文关系,确定所述目标文本的第一候选标签;The determining module 1020 is used to determine the first candidate tag of the target text based on the contextual relationship of the segmented words;

所述确定模块1020,还用于根据所述分词词汇在所述目标文本中的第一频率参数,和所述分词词汇在文本集合中的第二频率参数,确定所述目标文本的第二候选标签;The determining module 1020 is further configured to determine a second candidate tag for the target text based on a first frequency parameter of the segmented words in the target text and a second frequency parameter of the segmented words in the text set;

所述确定模块1020,还用于根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签。The determining module 1020 is further configured to determine the tag of the target text based on the first candidate tag and the second candidate tag.

在一个可选的实施例中,所述确定模块1020,还用于响应于所述第一候选标签和所述第二候选标签之间存在交集的情况,对所述第一候选标签和所述第二候选标签取交集,得到所述目标文本的标签。In an optional embodiment, the determining module 1020 is further configured to, in response to the existence of an intersection between the first candidate tag and the second candidate tag, take the intersection of the first candidate tag and the second candidate tag to obtain the tag of the target text.

在一个可选的实施例中，所述确定模块1020，还用于响应于所述第一候选标签和所述第二候选标签之间不存在交集的情况，根据预设的选择规则，从所述第一候选标签和所述第二候选标签中确定所述目标文本的标签。In an optional embodiment, the determining module 1020 is further configured to, in response to the case that there is no intersection between the first candidate tag and the second candidate tag, determine the tag of the target text from the first candidate tag and the second candidate tag according to a preset selection rule.
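The combination rule used by the determining module can be sketched as follows: take the intersection of the two candidate sets when it is non-empty, otherwise fall back to a preset selection rule. The score-based fallback and the top_k parameter below are illustrative assumptions; the original disclosure only requires that some preset rule be applied.

```python
# Sketch of combining the first and second candidate tags into the final tags.
def determine_tags(first_candidates, second_candidates, scores, top_k=3):
    inter = set(first_candidates) & set(second_candidates)
    if inter:
        # Candidates supported by both the semantic and the frequency branch.
        return sorted(inter, key=lambda t: scores.get(t, 0.0), reverse=True)
    # No overlap: one possible preset rule is to rank the union by a precomputed score.
    union = set(first_candidates) | set(second_candidates)
    return sorted(union, key=lambda t: scores.get(t, 0.0), reverse=True)[:top_k]
```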

在一个可选的实施例中,请参考图11,所述确定模块1020还包括:In an optional embodiment, referring to FIG11, the determining module 1020 further includes:

第一确定单元1021,用于根据所述第一频率参数和所述第二频率参数,确定所述分词词汇对应的词汇频率;The first determining unit 1021 is used to determine the word frequency corresponding to the segmented word based on the first frequency parameter and the second frequency parameter;

所述第一确定单元1021,还用于将所述词汇频率符合频率要求的分词词汇,确定为所述目标文本的第二候选标签。The first determining unit 1021 is further configured to determine the word segmentation words whose word frequencies meet the frequency requirements as the second candidate tags of the target text.
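A minimal sketch of this frequency branch is given below, assuming the first frequency parameter is the term frequency of a segmented word within the target text and the second is an inverse-document-frequency value over the text collection, combined by multiplication; treating the "frequency requirement" as keeping the top-k scores is an illustrative assumption.

```python
import math
from collections import Counter

def second_candidate_tags(target_words, corpus_docs, top_k=5):
    """target_words: segmented words of the target text.
    corpus_docs: list of segmented documents forming the text collection."""
    if not target_words:
        return []
    n_docs = len(corpus_docs)
    tf = Counter(target_words)                      # first frequency parameter
    df = Counter()
    for doc in corpus_docs:
        df.update(set(doc))
    scores = {}
    for w, freq in tf.items():
        idf = math.log((n_docs + 1) / (df[w] + 1)) + 1   # second frequency parameter (smoothed)
        scores[w] = (freq / len(target_words)) * idf      # product of the two parameters
    # "Frequency requirement" is taken here as the top-k highest scores.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_k]]
```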

在一个可选的实施例中,所述确定模块1020还包括:In an optional embodiment, the determining module 1020 further includes:

提取单元1022,用于对所述分词词汇进行特征提取,得到所述分词词汇的词汇向量;Extraction unit 1022 is used to extract features from the segmented words to obtain word vectors of the segmented words;

分析单元1023,用于对所述词汇向量结合上下文词汇向量进行特征分析,得到所述分词词汇对应的实体概率;Analysis unit 1023 is used to perform feature analysis on the word vector combined with the context word vector to obtain the entity probability corresponding to the segmented word;

第二确定单元1024,用于根据所述实体概率从所述分词词汇中确定所述目标文本的第一候选标签。The second determining unit 1024 is used to determine the first candidate tag of the target text from the segmented vocabulary based on the entity probability.
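The entity probabilities can be read as a BIO-style labelling: a word whose predicted label indicates the first probability starts a tag entity, the third probability continues it, and the second probability marks a non-entity word that is filtered out. The decoding sketch below, including the "B"/"I"/"O" label names and the example, is an illustrative reading of that scheme rather than the exact decoding of the original embodiment.

```python
# Sketch of turning per-word predicted labels into first candidate tags.
def first_candidate_tags(words, predicted_labels):
    candidates, current = [], []
    for word, label in zip(words, predicted_labels):
        if label == "B":                 # word starts a new tag entity
            if current:
                candidates.append("".join(current))
            current = [word]
        elif label == "I" and current:   # word continues the current entity
            current.append(word)
        else:                            # "O": non-entity, close any open entity
            if current:
                candidates.append("".join(current))
            current = []
    if current:
        candidates.append("".join(current))
    return candidates

# Example: ["周杰伦", "的", "新", "歌"] with labels ["B", "O", "B", "I"]
# yields the candidates ["周杰伦", "新歌"].
```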

需要说明的是:上述实施例提供的文本标签确定装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的文本标签确定装置与文本标签确定方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that the text label determining device provided in the above embodiments is only an example of the division of the above functional modules. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the text label determining device and the text label determining method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.

本申请还提供了一种服务器，该服务器包括处理器和存储器，存储器中存储有至少一条指令，至少一条指令由处理器加载并执行以实现上述各个方法实施例提供的文本标签确定方法。需要说明的是，该服务器可以是如下图12所提供的服务器。This application also provides a server, which includes a processor and a memory. The memory stores at least one instruction, which is loaded and executed by the processor to implement the text label determination method provided in the above method embodiments. It should be noted that the server can be the one shown in Figure 12 below.

请参考图12,其示出了本申请一个示例性实施例提供的服务器的结构示意图。具体来讲:所述服务器1200包括中央处理单元(Central Processing Unit,CPU)1201、包括随机存取存储器(Random Access Memory,RAM)1202和只读存储器(Read Only Memory,ROM)1203的系统存储器1204,以及连接系统存储器1204和中央处理单元1201的系统总线1205。所述服务器1200还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统1206,和用于存储操作系统1213、应用程序1214和其他程序模块1215的大容量存储设备1207。Please refer to Figure 12, which shows a schematic diagram of the structure of a server provided in an exemplary embodiment of this application. Specifically, the server 1200 includes a Central Processing Unit (CPU) 1201, a system memory 1204 including Random Access Memory (RAM) 1202 and Read Only Memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the CPU 1201. The server 1200 also includes a basic input/output system 1206 that facilitates the transfer of information between various devices within the computer, and a mass storage device 1207 for storing the operating system 1213, application programs 1214, and other program modules 1215.

所述基本输入/输出系统1206包括有用于显示信息的显示器1208和用于用户输入信息的诸如鼠标、键盘之类的输入设备1209。其中所述显示器1208和输入设备1209都通过连接到系统总线1205的输入输出控制器1210连接到中央处理单元1201。所述基本输入/输出系统1206还可以包括输入输出控制器1210以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1210还提供输出到显示屏、打印机或其他类型的输出设备。The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209 for user input, such as a mouse or keyboard. Both the display 1208 and the input device 1209 are connected to the central processing unit 1201 via an input/output controller 1210 connected to the system bus 1205. The basic input/output system 1206 may also include the input/output controller 1210 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 1210 also provides output to a display screen, printer, or other types of output devices.

所述大容量存储设备1207通过连接到系统总线1205的大容量存储控制器(未示出)连接到中央处理单元1201。所述大容量存储设备1207及其相关联的计算机可读介质为服务器1200提供非易失性存储。也就是说,所述大容量存储设备1207可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。The mass storage device 1207 is connected to the central processing unit 1201 via a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1207 may include computer-readable media (not shown) such as a hard disk or a CD-ROM drive.

不失一般性，所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、可擦除可编程只读存储器(Erasable Programmable Read Only Memory，EPROM)、带电可擦可编程只读存储器(Electrically Erasable Programmable Read Only Memory，EEPROM)、闪存或其他固态存储技术，CD-ROM、数字通用光盘(Digital Versatile Disc，DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然，本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1204和大容量存储设备1207可以统称为存储器。Without loss of generality, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented using any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage media are not limited to the above-mentioned types. The system memory 1204 and mass storage device 1207 described above can be collectively referred to as memory.

存储器存储有一个或多个程序，一个或多个程序被配置成由一个或多个中央处理单元1201执行，一个或多个程序包含用于实现上述文本标签确定方法的指令，中央处理单元1201执行该一个或多个程序实现上述各个方法实施例提供的文本标签确定方法。The memory stores one or more programs, which are configured to be executed by one or more central processing units 1201. The one or more programs contain instructions for implementing the text label determination method described above. The central processing unit 1201 executes the one or more programs to implement the text label determination method provided in the various method embodiments described above.

根据本申请的各种实施例,所述服务器1200还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器1200可以通过连接在所述系统总线1205上的网络接口单元1211连接到网络1212,或者说,也可以使用网络接口单元1211来连接到其他类型的网络或远程计算机系统(未示出)。According to various embodiments of this application, the server 1200 can also be connected to a remote computer on a network, such as the Internet. That is, the server 1200 can be connected to the network 1212 via the network interface unit 1211 connected to the system bus 1205, or the network interface unit 1211 can be used to connect to other types of networks or remote computer systems (not shown).

所述存储器还包括一个或者一个以上的程序，所述一个或者一个以上程序存储于存储器中，所述一个或者一个以上程序包含用于进行本申请实施例提供的文本标签确定方法中由服务器所执行的步骤。The memory also includes one or more programs stored in the memory, and the one or more programs include the steps executed by the server in the text label determination method provided in the embodiments of this application.

本申请实施例还提供一种计算机设备，该计算机设备包括存储器和处理器，存储器中存储有至少一条指令、至少一段程序、代码集或指令集，至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行，以实现上述实施例中任一所述的文本标签确定方法。This application also provides a computer device, which includes a memory and a processor. The memory stores at least one instruction, at least one program, code set, or instruction set. The at least one instruction, at least one program, code set, or instruction set is loaded and executed by the processor to implement any of the text label determination methods described in the above embodiments.

本申请实施例还提供一种计算机可读存储介质，该可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集，所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述实施例中任一所述的文本标签确定方法。This application also provides a computer-readable storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement any of the text label determination methods described in the above embodiments.

本申请还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例中任一所述的文本标签确定方法。This application also provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform any of the text label determination methods described in the above embodiments.

本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储介质中,该计算机可读存储介质可以是上述实施例中的存储器中所包含的计算机可读存储介质;也可以是单独存在,未装配入终端中的计算机可读存储介质。该计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现本申请实施例中任一所述的文本标签确定方法。Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. This program can be stored in a computer-readable storage medium, which may be a computer-readable storage medium included in the memory described in the above embodiments; or it may be a standalone computer-readable storage medium not assembled into the terminal. The computer-readable storage medium stores at least one instruction, at least one program segment, a code set, or an instruction set. The at least one instruction, the at least one program segment, the code set, or the instruction set is loaded and executed by the processor to implement any of the text tag determination methods described in the embodiments of this application.

可选地，该计算机可读存储介质可以包括：只读存储器(ROM，Read Only Memory)、随机存取记忆体(RAM，Random Access Memory)、固态硬盘(SSD，Solid State Drives)或光盘等。其中，随机存取记忆体可以包括电阻式随机存取记忆体(ReRAM，Resistance Random Access Memory)和动态随机存取存储器(DRAM，Dynamic Random Access Memory)。上述本申请实施例序号仅仅为了描述，不代表实施例的优劣。Optionally, the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid-state drives (SSDs), or optical discs, etc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM). The sequence numbers of the embodiments described above are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

以上所述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。The above description is merely an optional embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims (12)

1.一种文本标签确定方法，其特征在于，所述方法包括:1. A method for determining text labels, characterized in that the method comprises: 对目标文本进行分词处理，得到分词集合，所述分词集合中包括所述目标文本分词得到的分词词汇，所述目标文本为待确定标签的文本;The target text is segmented to obtain a segmentation set, which includes the segmented words obtained from the segmentation of the target text. The target text is the text whose label is to be determined. 对所述分词词汇进行特征提取，得到所述分词词汇的词汇向量;Feature extraction is performed on the segmented words to obtain the word vectors of the segmented words; 对所述词汇向量结合上下文词汇向量进行特征分析，得到所述词汇向量对应的第一概率、第二概率以及第三概率，所述第一概率表示所述分词词汇属于标签实体的概率，所述第二概率表示所述分词词汇不属于所述标签实体的概率，所述第三概率表示所述分词词汇属于标签实体内对应实体的概率;其中，若所述分词词汇为实体，则所述第一概率高;若所述分词词汇为非实体，则所述第二概率高;若所述分词词汇为实体且其前一分词词汇为实体，则所述第三概率高;Feature analysis is performed on the word vectors in conjunction with the context word vectors to obtain a first probability, a second probability, and a third probability corresponding to the word vectors. The first probability represents the probability that the word segmentation belongs to the tag entity, the second probability represents the probability that the word segmentation does not belong to the tag entity, and the third probability represents the probability that the word segmentation belongs to the corresponding entity within the tag entity. Specifically, if the word segmentation is an entity, the first probability is higher; if the word segmentation is not an entity, the second probability is higher; if the word segmentation is an entity and its preceding word segmentation is also an entity, the third probability is higher. 将所述第一概率、所述第二概率以及所述第三概率输入至条件随机场，得到所述分词词汇对应的预测标签，所述预测标签指示所述第一概率、所述第二概率和所述第三概率中的其中之一;The first probability, the second probability, and the third probability are input into a conditional random field to obtain the predicted label corresponding to the segmented word, wherein the predicted label indicates one of the first probability, the second probability, and the third probability. 滤除所述预测标签指示所述第二概率的分词词汇;Filter out word segments whose predicted labels indicate the second probability; 根据所述预测标签指示所述第一概率的分词词汇以及所述预测标签指示所述第三概率的分词词汇，确定所述目标文本的第一候选标签;Based on the word segmentation words with the first probability indicated by the predicted tags and the word segmentation words with the third probability indicated by the predicted tags, the first candidate tags of the target text are determined; 根据所述分词词汇在所述目标文本中的第一频率参数，和所述分词词汇在文本集合中的第二频率参数，确定所述目标文本的第二候选标签;Based on the first frequency parameter of the segmented words in the target text and the second frequency parameter of the segmented words in the text set, a second candidate tag for the target text is determined; 根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签。The tags of the target text are determined based on the first candidate tag and the second candidate tag.

2.根据权利要求1所述的方法，其特征在于，所述根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签，包括:2. The method according to claim 1, wherein determining the tags of the target text based on the first candidate tags and the second candidate tags comprises: 在所述第一候选标签和所述第二候选标签之间存在交集的情况下，对所述第一候选标签和所述第二候选标签取交集，得到所述目标文本的标签。If there is an intersection between the first candidate tag and the second candidate tag, the intersection of the first candidate tag and the second candidate tag is taken to obtain the tag of the target text.

3.根据权利要求1所述的方法，其特征在于，所述根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签，包括:3. The method according to claim 1, wherein determining the tags of the target text based on the first candidate tags and the second candidate tags comprises: 在所述第一候选标签和所述第二候选标签之间不存在交集的情况下，根据预设的选择规则，从所述第一候选标签和所述第二候选标签中确定所述目标文本的标签。If there is no overlap between the first candidate tag and the second candidate tag, the tag of the target text is determined from the first candidate tag and the second candidate tag according to a preset selection rule.

4.根据权利要求1至3任一所述的方法，其特征在于，所述根据所述分词词汇在所述目标文本中的第一频率参数，和所述分词词汇在文本集合中的第二频率参数，确定所述目标文本的第二候选标签，包括:4. The method according to any one of claims 1 to 3, characterized in that, determining the second candidate tag of the target text based on the first frequency parameter of the segmented words in the target text and the second frequency parameter of the segmented words in the text set includes: 根据所述第一频率参数和所述第二频率参数，确定所述分词词汇对应的词汇频率;The word frequency corresponding to the segmented word is determined based on the first frequency parameter and the second frequency parameter; 将所述词汇频率符合频率要求的分词词汇，确定为所述目标文本的第二候选标签。The word segments whose frequencies meet the frequency requirements are identified as the second candidate tags for the target text.

5.根据权利要求4所述的方法，其特征在于，所述根据所述第一频率参数和所述第二频率参数，确定所述分词词汇对应的词汇频率，包括:5. The method according to claim 4, wherein determining the word frequency corresponding to the segmented word based on the first frequency parameter and the second frequency parameter includes: 确定所述第一频率参数和所述第二频率参数之积，作为所述分词词汇对应的所述词汇频率。The product of the first frequency parameter and the second frequency parameter is determined as the word frequency corresponding to the segmented word.

6.一种文本标签确定装置，其特征在于，所述装置包括:6. A text label determining device, characterized in that the device comprises: 处理模块，用于对目标文本进行分词处理，得到分词集合，所述分词集合中包括所述目标文本分词得到的分词词汇，所述目标文本为待确定标签的文本;The processing module is used to perform word segmentation on the target text to obtain a word segmentation set, wherein the word segmentation set includes the word segments obtained from the word segmentation of the target text, and the target text is the text whose label is to be determined; 确定模块，用于对所述分词词汇进行特征提取，得到所述分词词汇的词汇向量;对所述词汇向量结合上下文词汇向量进行特征分析，预测得到所述分词词汇对应的第一概率、第二概率以及第三概率，所述第一概率表示所述分词词汇属于标签实体的概率，所述第二概率表示所述分词词汇不属于所述标签实体的概率，所述第三概率表示所述分词词汇属于标签实体内对应实体的概率;其中，若所述分词词汇为实体，则所述第一概率高;若所述分词词汇为非实体，则所述第二概率高;若所述分词词汇为实体且其前一分词词汇为实体，则所述第三概率高;将所述第一概率、所述第二概率以及所述第三概率输入至条件随机场，得到所述分词词汇对应的预测标签，所述预测标签指示所述第一概率、所述第二概率和所述第三概率中的其中之一;滤除所述预测标签指示所述第二概率的分词词汇;基于所述预测标签指示所述第一概率的分词词汇以及所述预测标签指示所述第三概率的分词词汇，确定所述目标文本的第一候选标签;A determination module is used to extract features from the segmented words to obtain word vectors for the segmented words; perform feature analysis on the word vectors in combination with the context word vectors to predict a first probability, a second probability, and a third probability corresponding to the segmented words. The first probability represents the probability that the segmented word belongs to a tag entity, the second probability represents the probability that the segmented word does not belong to the tag entity, and the third probability represents the probability that the segmented word belongs to the corresponding entity within the tag entity. Specifically, if the segmented word is an entity, the first probability is high; if the segmented word is not an entity, the second probability is high; if the segmented word is an entity and its preceding segmented word is an entity, the third probability is high. The first probability, the second probability, and the third probability are input into a conditional random field to obtain a predicted label corresponding to the segmented words. The predicted label indicates one of the first probability, the second probability, and the third probability. Segmented words whose predicted labels indicate the second probability are filtered out. Based on the segmented words whose predicted labels indicate the first probability and the segmented words whose predicted labels indicate the third probability, a first candidate label for the target text is determined. 所述确定模块，还用于根据所述分词词汇在所述目标文本中的第一频率参数，和所述分词词汇在文本集合中的第二频率参数，确定所述目标文本的第二候选标签;The determining module is further configured to determine a second candidate tag for the target text based on a first frequency parameter of the segmented words in the target text and a second frequency parameter of the segmented words in the text set; 所述确定模块，还用于根据所述第一候选标签和所述第二候选标签确定所述目标文本的标签。The determining module is further configured to determine the tag of the target text based on the first candidate tag and the second candidate tag.

7.根据权利要求6所述的装置，其特征在于，所述确定模块，还用于在所述第一候选标签和所述第二候选标签之间存在交集的情况下，对所述第一候选标签和所述第二候选标签取交集，得到所述目标文本的标签。7. The apparatus according to claim 6, wherein the determining module is further configured to, when there is an intersection between the first candidate tag and the second candidate tag, take the intersection of the first candidate tag and the second candidate tag to obtain the tag of the target text.

8.根据权利要求6所述的装置，其特征在于，所述确定模块，还用于在所述第一候选标签和所述第二候选标签之间不存在交集的情况下，根据预设的选择规则，从所述第一候选标签和所述第二候选标签中确定所述目标文本的标签。8. The apparatus according to claim 6, wherein the determining module is further configured to determine the tag of the target text from the first candidate tag and the second candidate tag according to a preset selection rule when there is no intersection between the first candidate tag and the second candidate tag.

9.根据权利要求6至8任一所述的装置，其特征在于，所述确定模块还包括:9. The apparatus according to any one of claims 6 to 8, wherein the determining module further comprises: 第一确定单元，用于根据所述第一频率参数和所述第二频率参数，确定所述分词词汇对应的词汇频率;The first determining unit is configured to determine the word frequency corresponding to the segmented word based on the first frequency parameter and the second frequency parameter; 所述第一确定单元，还用于将所述词汇频率符合频率要求的分词词汇，确定为所述目标文本的第二候选标签。The first determining unit is further configured to determine the word segmentation words whose word frequencies meet the frequency requirements as the second candidate tags of the target text.

10.一种计算机设备，其特征在于，所述计算机设备包括处理器和存储器，所述存储器中存储有至少一段程序，所述至少一段程序由所述处理器加载并执行以实现如权利要求1至5任一所述的文本标签确定方法。10. A computer device, characterized in that the computer device includes a processor and a memory, the memory storing at least one program, the at least one program being loaded and executed by the processor to implement the text tag determination method as described in any one of claims 1 to 5.

11.一种计算机可读存储介质，其特征在于，所述存储介质中存储有至少一段程序，所述至少一段程序由处理器加载并执行以实现如权利要求1至5任一所述的文本标签确定方法。11. A computer-readable storage medium, characterized in that the storage medium stores at least one program, said at least one program being loaded and executed by a processor to implement the text tag determination method as described in any one of claims 1 to 5.

12.一种计算机程序产品，其特征在于，包括计算机程序，所述计算机程序被处理器执行时实现如权利要求1至5任一所述的文本标签确定方法。12. A computer program product, characterized in that it comprises a computer program, which, when executed by a processor, implements the text label determination method as described in any one of claims 1 to 5.
