CN111914054A

CN111914054A - System and method for large scale semantic indexing

Info

Publication number: CN111914054A
Application number: CN201911258645.4A
Authority: CN
Inventors: 李定成; 张婧媛; 李平
Original assignee: Baidu USA LLC
Current assignee: Baidu USA LLC
Priority date: 2019-05-10
Filing date: 2019-12-10
Publication date: 2020-11-10
Anticipated expiration: 2039-12-10
Also published as: US11748613B2; CN111914054B; US20200356851A1

Abstract

Described herein are embodiments of a deep layer-by-layer Extreme Multi-Label Learning and Classification (XMLC) framework for facilitating semantic indexing of documents. In one or more embodiments, the deep layer-by-layer XMLC framework includes two sequential modules, a deep layer-by-layer multi-label learning module and a layer-by-layer pointer generation module. In one or more embodiments, the first module utilizes class-based dynamic max-pooling and macro-F-metric-based weight adjustment to decompose the terms of the domain ontology into multiple levels and build a specific convolutional neural network for each level network. In one or more embodiments, the second module combines the layer-wise output into a final summary semantic index. The effectiveness of the deep layer-by-layer XMLC framework implementation is demonstrated by comparing it with several state-of-the-art methods for automatic labeling on various datasets.

Description

System and method for large-scale semantic indexing

技术领域technical field

本公开总体上涉及一种用于语义索引的系统和方法。更具体地，本公开涉及一种用于具有深度逐层的极端多标签学习的语义索引的系统和方法。The present disclosure generally relates to a system and method for semantic indexing. More specifically, the present disclosure relates to a system and method for semantic indexing with deep layer-by-layer extreme multi-label learning.

背景技术Background technique

随着科学文献的爆炸式增长量，需要高效的语义索引方法来构建检索系统。即使利用有效的技术，语义索引过程仍然涉及手动整理来自科学文献的关键方面。为了概述文章的主要主题，通常会邀请领域专家使用从领域本体中选择的关键字来手动索引文章。With the explosive growth of scientific literature, efficient semantic indexing methods are required to build retrieval systems. Even with efficient techniques, the semantic indexing process still involves manually curating key aspects from the scientific literature. To outline the main topic of the article, domain experts are usually invited to manually index the article using keywords selected from the domain ontology.

因此，需要用于大规模语义索引以提高自动标记效率的系统和方法。Therefore, there is a need for systems and methods for large-scale semantic indexing to improve automatic tagging efficiency.

发明内容SUMMARY OF THE INVENTION

在本公开的一方面，提供了一种使用一个或多个处理器来执行步骤的用于多标签学习和分类的计算机实现的方法，所述方法包括：In one aspect of the present disclosure, there is provided a computer-implemented method for multi-label learning and classification using one or more processors to perform steps, the method comprising:

将原始训练文本处理成干净的训练文本；Process the original training text into clean training text;

基于训练标签的本体层次结构，将所述训练标签解析成多个层级的逐层标签；Based on the ontology hierarchical structure of the training labels, the training labels are parsed into layer-by-layer labels of multiple levels;

至少基于所述逐层标签和所述干净的训练文本，通过逐层多标签分类模型来训练多个逐层模型，其中每个逐层模型与标签的对应层级相关；At least based on the layer-by-layer labels and the clean training text, a plurality of layer-by-layer models are trained by a layer-by-layer multi-label classification model, wherein each layer-by-layer model is associated with a corresponding level of the label;

通过所训练的多个逐层模型，采用一个或多个细化策略来从一个或多个输入进行逐层预测；以及Using one or more refinement strategies to make layer-by-layer predictions from one or more inputs through the trained plurality of layer-by-layer models; and

使用点生成模型将所述逐层预测合并到用于所述一个或多个输入数据集的统一标签集中。The layer-wise predictions are combined into a unified set of labels for the one or more input datasets using a point generative model.

在本公开的另一方面，提供了一种用于大规模语义索引的多标签学习和分类的系统，所述系统包括：In another aspect of the present disclosure, a system for multi-label learning and classification for large-scale semantic indexing is provided, the system comprising:

逐层多标签分类模型，所述逐层多标签分类模型基于标签的本体层次结构将高维空间中的所述标签分解成多个层级中的逐层标签，所述逐层多标签分类模型包括多个卷积神经网络(CNN)，其中对于每个层级的CNN，每个CNN分别从用于文档的单词嵌入、用于关键字的单词嵌入、上层级标签嵌入、以及下层级标签嵌入的输入中提取特征表示，所述CNN包括：A layer-by-layer multi-label classification model, which decomposes the label in a high-dimensional space into layer-by-layer labels in multiple levels based on the ontology hierarchy of labels, and the layer-by-layer multi-label classification model includes Multiple Convolutional Neural Networks (CNNs), where, for each level of CNN, each CNN takes input from word embeddings for documents, word embeddings for keywords, upper-level label embeddings, and lower-level label embeddings, respectively Feature representation is extracted from the CNN including:

最大池化层，所述最大池化层用于动态最大池化，以从串联的嵌入中选择特征，所述串联的嵌入通过从所有输入中提取的特征表示而串联；a max-pooling layer for dynamic max-pooling to select features from concatenated embeddings concatenated by feature representations extracted from all inputs;

一个或多个标准化层和一个或多个完全连接层，用于批量标准化并且从所选择的特征中获得紧凑表示；one or more normalization layers and one or more fully connected layers for batch normalization and obtaining a compact representation from the selected features;

输出层，输出用于所述每个层级的逐层预测；以及an output layer that outputs the layer-by-layer prediction for each of the layers; and

点生成模型，将所述多个层级中的每个层级处的所述逐层预测合并到所述文档的统一标签集中，所述点生成模型包括：A point generation model incorporating the layer-by-layer predictions at each of the plurality of levels into a unified set of tags for the document, the point generation model comprising:

编码器，将所述逐层预测编码成分别对应于所述多个层级的多个编码器隐藏状态序列；an encoder, encoding the layer-by-layer prediction into a plurality of encoder hidden state sequences corresponding to the plurality of layers respectively;

多个注意力发生器，从所述多个编码器隐藏状态序列导出，以针对所述多个层级中的每个层级生成注意力分布和内容矢量；以及a plurality of attention generators derived from the plurality of encoder hidden state sequences to generate attention distributions and content vectors for each of the plurality of levels; and

解码器，至少基于针对所述多个层级中的每个层级的所述生成的内容矢量来生成多个解码器隐藏状态序列，所述解码器至少基于所述逐层预测、所述编码器隐藏状态、所述注意力发生器以及所述解码器隐藏状态来生成所述统一标签集。a decoder to generate a plurality of decoder hidden state sequences based at least on the generated content vector for each of the plurality of levels, the decoder based on at least the layer-wise prediction, the encoder concealment state, the attention generator, and the decoder hidden state to generate the unified label set.

在本公开的又一方面，提供了一种使用一个或多个处理器来执行步骤的用于多标签学习和分类的计算机实现的方法，所述方法包括：In yet another aspect of the present disclosure, there is provided a computer-implemented method for multi-label learning and classification using one or more processors to perform steps, the method comprising:

在多个分层层级中的每个处，At each of the multiple hierarchical levels,

使用卷积神经网络(CNN)，分别从用于文档的单词嵌入、用于关键字的单词嵌入、上分层层级标签嵌入、以及下分层层级标签嵌入的输入中提取特征表示；Extract feature representations from the input of word embeddings for documents, word embeddings for keywords, upper hierarchical level label embeddings, and lower hierarchical level label embeddings, respectively, using a convolutional neural network (CNN);

将从所有输入中提取的所述特征表示串联为串联嵌入；concatenating said feature representations extracted from all inputs into concatenated embeddings;

在所述CNN的最大池化层处应用动态最大池化，以从串联嵌入中选择特征；apply dynamic max-pooling at the max-pooling layer of the CNN to select features from the concatenated embeddings;

应用批量归一化以及一个或多个完全连接层，以从所选择的特征中获得紧凑表示；apply batch normalization along with one or more fully connected layers to obtain a compact representation from the selected features;

从所述CNN的输出层输出针对所述多个分层层级中的所述每个层级的逐层预测；以及outputting, from an output layer of the CNN, a layer-by-layer prediction for the each of the plurality of hierarchical levels; and

使用点生成模型将来自所述多个分层层级的所述逐层预测合并到统一标签集中。The layer-wise predictions from the plurality of hierarchical levels are combined into a unified label set using a point generative model.

附图说明Description of drawings

将参考本发明的实施方式，它们的示例可示于附图中。这些附图旨在是说明性的而非限制性的。虽然本发明大体上在这些实施方式的上下文中描述，但应理解，本发明的范围并不旨在限于这些特定实施方式。附图中的项目未按比例绘制。Reference will be made to the embodiments of the invention, examples of which may be illustrated in the accompanying drawings. The drawings are intended to be illustrative and not restrictive. While the present invention is generally described in the context of these embodiments, it is to be understood that the scope of the present invention is not intended to be limited to these specific embodiments. Items in the drawings are not drawn to scale.

附图(“图”)1描绘了根据本公开的实施方式的深度逐层的极端多标签学习和分类(深度逐层的XMLC)框架的系统架构。The accompanying drawing ("FIG.") 1 depicts the system architecture of a deep layer-by-layer extreme multi-label learning and classification (deep layer-by-layer XMLC) framework in accordance with an embodiment of the present disclosure.

图2描绘了根据本公开的实施方式的使用深度逐层的XMLC的标记过程。2 depicts a markup process using deep layer-by-layer XMLC in accordance with an embodiment of the present disclosure.

图3描绘了根据本公开的实施方式的深度多标签学习模型的神经结构。3 depicts the neural structure of a deep multi-label learning model according to an embodiment of the present disclosure.

图4描绘了根据本公开的实施方式的用于在每个层级使用神经模型进行逐层预测的过程。4 depicts a process for layer-by-layer prediction using a neural model at each layer, according to an embodiment of the present disclosure.

图5描绘了根据本公开的实施方式的用于最终合并的指针生成模型。5 depicts a pointer generation model for final merging according to an embodiment of the present disclosure.

图6描绘了根据本公开的实施方式的使用指针生成模型以便进行最终合并的过程。6 depicts the process of generating a model using pointers for final merging, according to an embodiment of the present disclosure.

图7以图形方式描绘了根据本公开的实施方式的用于标签集的层级关系。Figure 7 graphically depicts a hierarchical relationship for tag sets in accordance with an embodiment of the present disclosure.

图8描绘了根据本公开的实施方式的通过具有深度逐层的极端多标签学习和分类(XMLC)的在线宏F度量优化(OFO)获得的宏精度、查全率和F分数。8 depicts macro precision, recall and F-score obtained by online macro F-metric optimization (OFO) with deep layer-by-layer extreme multi-label learning and classification (XMLC) in accordance with an embodiment of the present disclosure.

图9描绘了根据本公开的实施方式的计算装置/信息处理系统的简化框图。9 depicts a simplified block diagram of a computing device/information handling system in accordance with an embodiment of the present disclosure.

具体实施方式Detailed ways

在以下描述中，出于解释目的，阐明具体细节以便提供对本公开的理解。然而，将对本领域的技术人员显而易见的是，可在没有这些细节的情况下实践实施方式。此外，本领域的技术人员将认识到，下文描述的本公开的实施方式可以以各种方式(例如过程、装置、系统、设备或方法)在有形的计算机可读介质上实施。In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the present disclosure. However, it will be apparent to those skilled in the art that embodiments may be practiced without these details. Furthermore, those skilled in the art will appreciate that the embodiments of the present disclosure described below can be implemented on tangible computer-readable media in various ways (eg, processes, apparatus, systems, devices, or methods).

附图中示出的组件或模块是本发明实施方式的示例性说明，并且意图避免使本公开不清楚。还应理解，在本论述的全文中，组件可描述为单独的功能单元(可包括子单元)，但是本领域的技术人员将认识到，各种组件或其部分可划分成单独组件，或者可整合在一起(包括整合在单个的系统或组件内)。应注意，本文论述的功能或操作可实施为组件。组件可以以软件、硬件、或它们的组合实施。The components or modules shown in the figures are exemplary illustrations of embodiments of the invention and are intended to avoid obscuring the present disclosure. It should also be understood that throughout this discussion, components may be described as separate functional units (which may include sub-units), although those skilled in the art will recognize that various components or portions thereof may be divided into separate components, or may be integrated together (including within a single system or component). It should be noted that the functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.

此外，附图内的组件或系统之间的连接并不旨在限于直接连接。相反，在这些组件之间的数据可由中间组件修改、重格式化、或以其它方式改变。另外，可使用另外或更少的连接。还应注意，术语“联接”、“连接”、或“通信地联接”应理解为包括直接连接、通过一个或多个中间设备来进行的间接连接、和无线连接。Furthermore, the connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, reformatted, or otherwise changed by intervening components. Additionally, additional or fewer connections may be used. It should also be noted that the terms "coupled," "connected," or "communicatively coupled" should be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.

在本说明书中对“一个实施方式”、“优选实施方式”、“实施方式”或“多个实施方式”的提及表示结合实施方式所描述的具体特征、结构、特性或功能包括在本发明的至少一个实施方式中，以及可包括在多于一个的实施方式中。另外，在本说明书的各个地方出现以上所提到的短语并不一定全都是指相同的实施方式或多个相同实施方式。References in this specification to "one embodiment," "preferred embodiment," "an embodiment," or "a plurality of embodiments" mean that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in the present invention in at least one embodiment of , and may be included in more than one embodiment. Additionally, the appearances of the above-mentioned phrases in various places in this specification are not necessarily all referring to the same embodiment or embodiments.

在本说明书的各个地方使用某些术语目的在于说明，并且不应被理解为限制。服务、功能或资源并不限于单个服务、单个功能或单个资源；这些术语的使用可指代相关服务、功能或资源的可分布或聚合的分组。图像可以是静止图像或来自视频。Certain terms are used in various places in this specification for the purpose of description and should not be construed as limiting. A service, function or resource is not limited to a single service, function or resource; use of these terms may refer to a distributable or aggregated grouping of related services, functions or resources. Images can be still images or from video.

术语“包括”、“包括有”、“包含”和“包含有”应理解为开放性的术语，并且其后任何列出内容都是实例，而不旨在限于所列项目。本文所使用的任何标题仅是为了组织目的，并且不应被用于限制说明书或权利要求的范围。本专利文献中提到的每个参考文献以其全文通过引用并入本文。The terms "comprising", "including", "including" and "comprising" are to be understood as open-ended terms and any listings thereafter are examples and are not intended to be limited to the listed items. Any headings used herein are for organizational purposes only and should not be used to limit the scope of the description or claims. Each reference mentioned in this patent document is hereby incorporated by reference in its entirety.

此外，本领域的技术人员应认识到：(1)某些步骤可以可选地执行；(2)步骤可不限于本文中所阐述的特定次序；(3)某些步骤可以以不同次序执行；以及(4)某些步骤可同时地进行。In addition, those skilled in the art will recognize that: (1) certain steps may be optionally performed; (2) the steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in a different order; and (4) Certain steps can be performed simultaneously.

A.引言A. Introduction

在医学领域，MEDLINE可能是世界上最大的生物医学文献数据库，并且医学主题词表(MeSH)是用于在MEDLINE中索引文章的领域本体。通过将查询映射到MeSH主题词表，极大地改善了医学文献搜索的体验。例如，查询少年药物使用被映射到MeSH主题词表青少年和物质相关疾病。当前，大多数映射规则以及来自MEDLINE的医学文献的最终索引由领域专家手动生成。对于语义索引的人类标记过程而言，这既昂贵又耗时。因此，迫切需要自动方法。In the medical field, MEDLINE is probably the largest database of biomedical literature in the world, and the Medical Subject Thesaurus (MeSH) is the domain ontology used to index articles in MEDLINE. The medical literature search experience is greatly improved by mapping queries to MeSH thesaurus. For example, the query juvenile drug use is mapped to the MeSH thesaurus juvenile and substance-related disorders. Currently, most mapping rules and the final index of medical literature from MEDLINE are manually generated by domain experts. This is expensive and time-consuming for the human tagging process of semantic indexing. Therefore, automatic methods are urgently needed.

然而，自动整理的任务面临重大挑战。首先，一篇文章经常用多个关键字或概念标记。此外，领域本体涉及数十万甚至数百万个标签。这些标签通常以呈森林的形式表示的层次结构来组织。同时处理大量标签、数据样本和复杂的层次结构是艰巨的任务。However, the task of tidying up presents significant challenges. First, an article is often tagged with multiple keywords or concepts. Furthermore, domain ontology involves hundreds of thousands or even millions of labels. These labels are usually organized in a hierarchy represented in the form of a forest. Handling large numbers of labels, data samples, and complex hierarchies simultaneously is a difficult task.

在本专利文献的实施方式中，自动语义索引的任务被认为是极端多标签学习和分类(XMLC)问题。与传统的多类不同，XMLC允许每个数据样本共存数百万个标签。最近，提出了几种处理XMLC的方法，包括FASTXML、LOMTrees、SLEEC、鲁棒的布隆过滤器、标签划分、快速标签嵌入以及若干种深度学习方法、使用局部神经网络(DXML和XML-CNN)的分层多标签分类。尽管这些方法在处理XMLC方面取得了一些进展，但是维数灾难(称为巨大的标签空间)和对手工制作特征工程的高要求是进一步提高有效性和效率的两个主要障碍。In the embodiments of this patent document, the task of automatic semantic indexing is considered an extreme multi-label learning and classification (XMLC) problem. Unlike traditional multiclass, XMLC allows millions of tags to coexist per data sample. Recently, several methods have been proposed to deal with XMLC, including FASTXML, LOMTrees, SLEEC, robust bloom filters, label partitioning, fast label embedding, and several deep learning methods, using local neural networks (DXML and XML-CNN) Hierarchical multi-label classification of . Although these methods have made some progress in dealing with XMLC, the curse of dimensionality (known as huge label space) and the high requirement for hand-crafted feature engineering are two major obstacles to further improving effectiveness and efficiency.

为了解决这两个问题，在本专利文献中公开了称为深度逐层的极端多标签学习和分类(深度逐层XMLC)的新颖框架的实施方式，以解决大规模语义索引的问题。在一个或多个实施方式中，深度逐层XMLC框架包括两个顺序模块。在一个或多个实施方式中，第一模块是逐层多标签分类模型。通过将大量标签(在较高维数的空间中)分解成多个层级(在较低维数的空间中)，它有效地解决了维数灾难。在一个或多个实施方式中，对于每个层级，构建具有至少两个创新点的卷积神经网络。第一个创新点包括旨在捕获标签同现和标签之间的分类关系的基于类别的动态最大池化方法。动态最大池化方法有助于紧密连接逐层分类模型。第二个创新点包括基于宏F度量优化的预测细化方法，该方法使得模块能够以增量方式自动选择标签。深度逐层XMLC框架的第二模块是分层指针生成模型，该模型通过复制和生成机制将每个层级的预测标签合并到最终的概述语义索引中。总体而言，深度逐层XMLC框架通过学习语义索引而无需任何特征工程避免了人为干预的高昂代价。整个系统架构的实施方式在图1中示出，其中在章节B中讨论更多细节。To address these two issues, an implementation of a novel framework called Deep Layer-by-Layer Extreme Multi-Label Learning and Classification (Deep Layer-by-Layer XMLC) is disclosed in this patent document to address the problem of large-scale semantic indexing. In one or more embodiments, the deep layer-by-layer XMLC framework includes two sequential modules. In one or more embodiments, the first module is a layer-wise multi-label classification model. It effectively solves the curse of dimensionality by decomposing a large number of labels (in higher dimensional space) into multiple levels (in lower dimensional space). In one or more embodiments, for each level, a convolutional neural network with at least two innovations is constructed. The first innovation includes a class-based dynamic max-pooling method that aims to capture label co-occurrences and taxonomic relationships between labels. Dynamic max pooling methods help to tightly connect layer-by-layer classification models. The second innovation includes a prediction refinement method based on macro F-metric optimization, which enables the module to automatically select labels in an incremental manner. The second module of the deep layer-by-layer XMLC framework is a hierarchical pointer generation model, which incorporates the predicted labels at each level into the final overview semantic index through a copy-and-generate mechanism. Overall, the deep layer-wise XMLC framework avoids the high cost of human intervention by learning semantic indexing without any feature engineering. An embodiment of the overall system architecture is shown in Figure 1, where more details are discussed in Section B.

本专利文献的一些贡献包括：Some of the contributions of this patent document include:

提出了深度逐层XMLC以学习大规模语义索引。它将标签分为多个层级，以减少维数灾难，同时提高训练效率。Deep layer-wise XMLC is proposed to learn large-scale semantic indexing. It divides labels into multiple layers to reduce the curse of dimensionality while improving training efficiency.

引入具有类别相关动态最大池化的新策略，以捕获标签之间的同现和类别关系。A new strategy with class-dependent dynamic max-pooling is introduced to capture co-occurrence and class relationships between labels.

探索了从宏F度量优化派生的预测细化技术的实施方式，以在线方式智能地选择最佳标签。An implementation of a prediction refinement technique derived from macro-F-metric optimization is explored to intelligently select the best labels in an online manner.

开发了分层指针生成模型，以将逐层输出合并到最终的概述语义索引中。A hierarchical pointer generation model is developed to incorporate the layer-by-layer output into the final overview semantic index.

通过将其与来自MEDLINE的MeSH的自动标记的几种最新方法以及AmazonCat13K(亚马逊Cat13K)(其是具有与MeSH性质相似的XMLC数据集)进行比较，证明了深度逐层XMLC实施方式的有效性。The effectiveness of the deep layer-by-layer XMLC implementation is demonstrated by comparing it with several state-of-the-art methods for auto-tagging of MeSH from MEDLINE and AmazonCat13K (Amazon Cat13K), which is an XMLC dataset with properties similar to MeSH.

B.方法实施方式B. Method Embodiments

XMLC存在两个主要挑战。首先，一个数据集中的标签的数量可能超过10,000，甚至高达一百万。其次，一个数据样本可能用多个标签索引，其数量典型的范围是一至几十个。There are two main challenges with XMLC. First, the number of labels in a dataset may exceed 10,000, or even as high as a million. Second, a data sample may be indexed with multiple labels, the number of which typically ranges from one to dozens.

在本专利文献中，如图1所示，公开了深度逐层XMLC框架的实施方式，以通过将每个数据样本的标签分解成多个层级来处理这两个挑战。在一个或多个实施方式中，深度逐层XMLC框架包括五个阶段：数据预处理阶段105、本体解析阶段110、逐层模型训练阶段115、逐层预测阶段120和最终合并阶段125。在数据预处理阶段，像往常一样，进行令牌化和过滤138以分别从训练文本134和验证文本136中获得干净的训练文本142和干净的验证文本144。在一个或多个实施方式中，根据一些比例随机选择训练和验证数据。与通常的自然语言处理(NLP)任务不同，使用在训练标签132上使用本体解析器140进行本体解析的额外步骤，以便将基于标签的本体层次将标签分为多个层级146。在第三阶段中，采用在方法部分中描述的神经模型来训练逐层模型148。在一个或多个实施方式中，本体解析器140用作逐层多标签分类模型(在章节A中描述的两个顺序模块中的第一个)。随后，在逐层预测阶段(或测试阶段)中，在以相似的方式对测试数据进行预处理以获得逐层预测150之后，将测试数据馈送到训练的逐层模型中以便进行标签预测或标记。在最后阶段中，与指针生成模型152进行最终合并，以输出一个或多个预测标签154，并且还筛选出一些不太相关的标签。在一个或多个实施方式中，指针生成模型152用作分层指针生成模型(在章节A中描述的两个顺序模块中的第二个)。In this patent document, as shown in Figure 1, an implementation of a deep layer-by-layer XMLC framework is disclosed to address these two challenges by decomposing the labels of each data sample into multiple layers. In one or more embodiments, the deep layer-by-layer XMLC framework includes five stages: data preprocessing stage 105 , ontology parsing stage 110 , layer-by-layer model training stage 115 , layer-by-layer prediction stage 120 , and final merging stage 125 . In the data preprocessing stage, tokenization and filtering 138 are performed as usual to obtain clean training text 142 and clean validation text 144 from training text 134 and validation text 136, respectively. In one or more embodiments, training and validation data are randomly selected according to some ratio. Unlike normal natural language processing (NLP) tasks, an additional step of ontology parsing using an ontology parser 140 on the training labels 132 is used in order to divide the labels into multiple levels 146 based on the label-based ontology hierarchy. In the third stage, the layer-by-layer model 148 is trained using the neural model described in the Methods section. In one or more embodiments, the ontology parser 140 operates as a layer-wise multi-label classification model (the first of the two sequential modules described in Section A). Subsequently, in the layer-by-layer prediction stage (or test phase), after preprocessing the test data in a similar manner to obtain layer-by-layer predictions 150, the test data is fed into the trained layer-by-layer model for label prediction or labeling . In the final stage, final merging with the pointer generation model 152 is performed to output one or more predicted labels 154, and some less relevant labels are also filtered out. In one or more embodiments, the pointer generation model 152 is used as a hierarchical pointer generation model (the second of the two sequential modules described in Section A).

图2描绘了根据本公开的实施方式的具有在深度逐层XMLC中涉及的五个步骤的标记过程。首先，使用NLP预处理器将用于训练和验证的输入原始文本分别处理(205)成干净的训练文本和干净的验证文本。在一个或多个实施方式中，单词被令牌化并且停用词被过滤掉。其次，使用本体解析器将训练标签基于它们的本体层次解析(210)成逐层标签。第三，至少基于逐层标签和干净的文本，通过逐层多标签分类模型来训练(215)多个逐层模型，其中每个逐层模型与标签的对应层级相关。第四，使用训练的多个逐层模型，利用一个或多个细化策略在一个或多个输入数据集上进行(220)逐层预测。在一个或多个实施方式中，一个或多个输入数据集可以是通过令牌化单词、停用词过滤等用相同的NLP预处理器清理的干净的验证文本。在一个或多个实施方式中，一个或多个输入数据集可以是在部署的框架上接收的文本数据。最后，训练点生成模型以将逐层预测合并(225)到统一的标签集中。2 depicts a tagging process with five steps involved in deep layer-by-layer XMLC, according to an embodiment of the present disclosure. First, the input raw text for training and validation is processed (205) into clean training text and clean validation text, respectively, using an NLP preprocessor. In one or more embodiments, words are tokenized and stop words are filtered out. Second, the training labels are parsed (210) hierarchically based on their ontology into layer-wise labels using an ontology parser. Third, a plurality of layer-by-layer models are trained (215) through a layer-by-layer multi-label classification model, where each layer-by-layer model is associated with a corresponding level of labels, based at least on layer-by-layer labels and clean text. Fourth, layer-wise prediction is performed (220) on one or more input datasets using one or more refinement strategies using the trained multiple layer-by-layer models. In one or more embodiments, one or more of the input datasets may be clean validation text cleaned with the same NLP preprocessor by tokenizing words, stopword filtering, etc. In one or more embodiments, the one or more input datasets may be textual data received on the deployed framework. Finally, the point generative model is trained to merge (225) the layer-wise predictions into a unified label set.

以下小节着重于：1)深度逐层多标签学习框架的实施方式；以及2)用于将所有层级的标签合并成一个统一标签集的指针生成模型的实施方式。The following subsections focus on: 1) an implementation of a deep layer-wise multi-label learning framework; and 2) an implementation of a pointer generation model for merging labels from all levels into one unified label set.

1.深度逐层多标签学习的实施方式1. Implementation of deep layer-by-layer multi-label learning

形式上，问题可以定义如下：给定一组输入对

深度逐层XMLC将它们分解成M个层级，并且在训练数据上训练M个神经模型。整个标签集被表示为

并且

是指

中的标签的总数。每个y_i是具有长度

的多热矢量。层级m处的每个模型在每个数据样本上预测最可能的K个标签

K是通过细化策略确定的。最后，训练指针生成模型以将每个数据样本x_i的M个层级的预测

合并到一个统一标签集y_i中。Formally, the problem can be defined as follows: given a set of input pairs

Deep layer-by-layer XMLC decomposes them into M levels and trains M neural models on the training data. The entire label set is represented as

and

Refers to

The total number of labels in . each _yi is with length

multi-heat vector. Each model at level m predicts the most likely K labels on each data sample

K is determined by the refinement strategy. Finally, the pointer generation model is trained to combine the predictions of the M levels of each data sample _xi

merged into a unified label set _yi .

1.1.特征嵌入构造的实施方式1.1. Implementation of Feature Embedding Construction

在一个或多个实施方式中，以逐层方式构造模型。如图3所示，在每个层级处用四个并行输入构建神经模型300。图4描绘了根据本公开的实施方式的用于在每个层级使用神经模型300进行逐层预测的过程。In one or more embodiments, the model is constructed in a layer-by-layer fashion. As shown in FIG. 3, a neural model 300 is constructed with four parallel inputs at each level. FIG. 4 depicts a process for layer-by-layer prediction using a neural model 300 at each layer, according to an embodiment of the present disclosure.

四个输入包括文档310的单词嵌入，关键字320的单词嵌入以及层级相关的信息(包括上层级标签嵌入330和下层级标签嵌入340)。它们为构建更具区分性的特征提供了多种信息。在一个或多个实施方式中，采用(405)卷积神经网络(CNN)300，以分别从对应的输入310、320、330和340学习大量的特征表示314、324、334和344。在一或多个实施方式中，直接从CNN学习文档嵌入314和关键字嵌入324。从上层级和下层级的预测结果的嵌入中学习其他两个嵌入，即上层级嵌入334和下层级标签嵌入344。在一或多个实施方式中，涉及两个步骤。首先，类似于用于输入文本和关键字的单词嵌入，在一个或多个实施方式中，采用Gensim来训练来自带注释的MeSH的标签嵌入。其次，在训练和测试中，可以将某些层级的某些文档的预测标签用作其上层级或下层级的输入特征。这两个嵌入不仅可以帮助捕获逐层依赖性，还可以处理XMLC中的标签不平衡问题。以这种方式，标签同现以及来自它们上层级和下层级的知识都可以有助于增强稀有标签的表示学习。The four inputs include word embeddings for documents 310, word embeddings for keywords 320, and hierarchy-related information (including upper-level tag embeddings 330 and lower-level tag embeddings 340). They provide a variety of information for building more discriminative features. In one or more embodiments, a convolutional neural network (CNN) 300 is employed (405) to learn a large number of feature representations 314, 324, 334, and 344 from corresponding inputs 310, 320, 330, and 340, respectively. In one or more embodiments, document embeddings 314 and keyword embeddings 324 are learned directly from the CNN. Two other embeddings, the upper-level embedding 334 and the lower-level label embedding 344, are learned from the embeddings of the prediction results of the upper and lower levels. In one or more embodiments, two steps are involved. First, similar to word embeddings for input text and keywords, in one or more embodiments, Gensim is employed to train label embeddings from annotated MeSH. Second, in training and testing, the predicted labels of certain documents at certain levels can be used as input features for the level above or below it. These two embeddings not only help to capture layer-by-layer dependencies, but also handle the tag imbalance problem in XMLC. In this way, both label co-occurrences and knowledge from their upper and lower levels can help to enhance representation learning for rare labels.

例如，在MeSH中，淋巴管瘤是稀有标签，并且它本身不易表示。利用其上层级MeSH淋巴管肿瘤和下层级MeSH淋巴管瘤的信息，囊性淋巴管瘤可能在嵌入空间中得到更好的表示。For example, in MeSH, lymphangioma is a rare label and itself is not easily represented. Cystic lymphangioma may be better represented in the embedding space using information from its upper-level MeSH lymphangioma and its lower-level MeSH lymphangioma.

在学习四个嵌入之后，将它们串联(410)成串联的嵌入352并传递到最大池化层350中。After learning the four embeddings, they are concatenated (410) into concatenated embeddings 352 and passed into the max pooling layer 350.

由于次序信息，原始令牌/单词可能未直接与关键字、上层级标签和下层级标签的嵌入串联。在一个或多个实施方式中，为它们的CNN特征上的原始令牌/单词构造双向长短期记忆(Bi-LSTM)，以在串联之前保留语言顺序信息。Due to the order information, the original tokens/words may not be directly concatenated with the embeddings of keywords, upper-level tags and lower-level tags. In one or more embodiments, bidirectional long short-term memory (Bi-LSTM) is constructed for the raw tokens/words on their CNN features to preserve linguistic order information prior to concatenation.

1.2.学习框架的目标函数的实施方式1.2. Implementation of the objective function of the learning framework

在一个或多个实施方式中，在嵌入串联之后，采用最大池化层350来施加(415)动态最大池化以从串联的嵌入中选择期望特征352。通过将批量归一化应用于一个或多个归一化层和一个或多个完全连接层360来从选择的特征352获得(420)紧凑表示362。之后，出于训练目的，至少基于获得的紧凑表示362，在输出层和隐藏瓶颈层370上采用(425)二进制交叉熵损失。在使用二进制交叉熵损失进行训练之后，输出层输出逐层标记380。In one or more embodiments, after the embeddings are concatenated, a max-pooling layer 350 is employed to apply (415) dynamic max-pooling to select desired features 352 from the concatenated embeddings. A compact representation 362 is obtained (420) from the selected features 352 by applying batch normalization to one or more normalization layers and one or more fully connected layers 360. Afterwards, for training purposes, at least based on the obtained compact representation 362, a (425) binary cross-entropy loss is employed on the output layer and hidden bottleneck layer 370. After training with binary cross-entropy loss, the output layer outputs layer-wise labels 380.

在一个或多个实施方式中，二进制交叉熵目标的损失函数L的公式为：In one or more embodiments, the formula for the loss function L of the binary cross-entropy objective is:

其中，

和f_j(x_i)表示输出层函数。此外，f_j(x_i)＝w_og_h(w_h[P(c₁)，...P(c_ι)])。在此，w_h∈R^h×(ιp)和

是与隐藏的瓶颈层和输出层370相关联的权重矩阵，g_h是元素方式激活函数，例如，施加到瓶颈层的S型函数(sigmoid)或双曲正切(tanh)函数，并且ιp是动态最大池化层处ι和p的乘积。ι是指馈送到池化层中的特征的数量，并且p是指池化数量。两者均由x_i中的特征的数量确定。此外，c_i是在从更低层进行池化操作P(.)之后的卷积特征矢量。in,

and f _j ( _xi ) represent the output layer function. Furthermore, f _j ( _xi )=wo _g _h ( _wh [P(c ₁ ), . . . P(c _ι )]). Here, w _h ^{∈R h×(ιp)} and

is the weight matrix associated with the hidden bottleneck layer and output layer 370, gh is an element-wise activation function, eg, a sigmoid or hyperbolic tangent ( _tanh ) function applied to the bottleneck layer, and ιp is the dynamic The product of i and p at the max pooling layer. ι refers to the number of features fed into the pooling layer, and p refers to the pooling number. Both are determined by the number of features in _xi . Furthermore, _ci is the convolutional feature vector after the pooling operation P(.) from lower layers.

1.3.以类别为方向的动态最大池化的实施方式1.3. Implementation of class-oriented dynamic max pooling

在用于文本分类的传统CNN模型中，经常采用max-over-time方案，因为从直觉上看，特征图的最大元素应当具有最重要的信息，即P(c)＝max{c}，其中c是指来自CNN的输出。然而，这种方法表现出严重的缺点。当输入文档包含多个主题时，仅使用一个值表示整个特征地图可能会丢失信息。对于多标签学习任务，多个池化可以捕获更丰富的信息。在本专利文献中，池化动态地执行为

其中

是指从索引1开始至

的c的子矢量，p是指最大池化维数。先前的工作使用固定的p。如果p被设置过大，则可能包含冗余特征。如果设置得太小，可能丢失相关特征。In traditional CNN models for text classification, the max-over-time scheme is often adopted because intuitively, the largest element of the feature map should have the most important information, i.e. P(c)=max{c}, where c refers to the output from the CNN. However, this approach exhibits serious drawbacks. Using only one value to represent the entire feature map may lose information when the input document contains multiple topics. For multi-label learning tasks, multiple pooling can capture richer information. In this patent document, pooling is performed dynamically as

in

means starting from index 1 to

A sub-vector of c, p refers to the max pooling dimension. Previous work used a fixed p. If p is set too large, redundant features may be included. If set too small, relevant features may be lost.

在一个或多个实施方式中，逐层相关信息，即标签的分类信息被合并到神经结构(例如，最大池层)中，以帮助动态地选择p。具体地，p随标签层级的分布而调整。例如，在MeSH中，所有项都分为16个类别，如“解剖学”、“生物体”、“疾病”等。每个类别涉及不同的子类别，并且每个标签涉及不同的分布。基于分布，分配不同的权重来确定p。类别的权重越大，p越大。在一个或多个实施方式中，从训练数据中初始化类别或标签的权重。In one or more embodiments, layer-wise correlation information, ie, categorical information of labels, is incorporated into the neural architecture (eg, max pooling layers) to help dynamically select p. Specifically, p is adjusted with the distribution of label levels. For example, in MeSH, all items are grouped into 16 categories such as "anatomy", "organism", "disease", etc. Each category involves a different subcategory, and each label involves a different distribution. Based on the distribution, assign different weights to determine p. The larger the weight of the category, the larger the p. In one or more embodiments, class or label weights are initialized from training data.

1.4.利用宏F度量最大化细化预测的实施方式1.4. Implementation of Maximizing Refinement Predictions Using Macro F-Metrics

如图1所示，利用嵌入和动态最大池化，网络可以进行逐层预测。在每个层级，为每个数据样本选择前K个(top-K)预测标签。然而，固定的K可能产生较高的查全率，但是较低的精度。在本专利文献的一个或多个实施方式中，通过更灵活的权重调整策略来细化预测。As shown in Figure 1, with embedding and dynamic max pooling, the network can perform layer-by-layer prediction. At each level, the top-K (top-K) predicted labels are selected for each data sample. However, a fixed K may yield higher recall, but lower precision. In one or more embodiments of this patent document, predictions are refined through a more flexible weight adjustment strategy.

在一个或多个实施方式中，将在线F度量优化(OFO)应用于权重调整。使用OFO，可以实现精度和查全率的动态平衡。在一个或多个实施方式中，OFO算法通过以在线方式进行阈值调整来优化二进制F度量。In one or more embodiments, online F-measure optimization (OFO) is applied to weight adjustment. With OFO, a dynamic balance of precision and recall can be achieved. In one or more embodiments, the OFO algorithm optimizes the binary F-measure by performing threshold adjustments in an online manner.

其中

并且

在此，

是第l个数据样本的第j个标签。

是在迭代t处从标签y_j上的第一个数据样本到第i个数据样本的累积F分数。in

and

here,

is the jth label of the lth data sample.

is the cumulative F-score from the first data sample on label y _j to the ith data sample at iteration t.

由于增量属性，OFO的阈值通过两条规则来更新。在一个或多个实施方式中，在相同的迭代(批数据)处，阈值

根据迭代间规则更新为

在不同的迭代处，它根据交叉迭代规则更新为

其中N是指一批中的数据样本的数量。在一个或多个实施方式中，当新的批次开始时，i被初始化为0，并且还不存在α或β值。在一个或多个实施方式中，最初使用来自最后一批的值。给定第i个数据样本，OFO将预测标签细化为

其中

是指在迭代t处标签y_j上的x_i的预测概率。在一个或多个实施方式中，最佳F度量

是最佳阈值

的值的两倍，为

由于提出的细化机制是动态的、逐层的和增量的，因此最佳阈值

要到训练结束后才能固定。在一个或多个实施方式中，将其保存为测试参数。Due to the incremental property, the threshold of OFO is updated by two rules. In one or more embodiments, at the same iteration (batch of data), the threshold

updated according to the inter-iteration rule to

At different iterations, it is updated according to the cross-iteration rule as

where N refers to the number of data samples in a batch. In one or more embodiments, when a new batch starts, i is initialized to 0, and there is no alpha or beta value yet. In one or more embodiments, the values from the last batch are used initially. Given the ith data sample, OFO refines the predicted labels as

in

is the predicted probability of x _i on label y _j at iteration t. In one or more embodiments, the optimal F-measure

is the optimal threshold

twice the value of , for

Since the proposed refinement mechanism is dynamic, layer-wise and incremental, the optimal threshold

It can't be fixed until after the training is over. In one or more embodiments, it is saved as a test parameter.

2.用于最终合并的指针生成模型的实施方式2. Implementation of the pointer generation model for final merge

在具有逐层输出之后，这些输出应合并成一个统一的标签集。然而，它们无法简单地组合在一起，因为与黄金标准标签或ground truth(标注真实)标签相比，简单的串联可能导致大得多数量的标签。在本专利文献中，公开了一种过滤方法以去除一些逐层标记，以确保预测标签的最终分布与黄金标准标签一致。在一个或多个实施方式中，受文本概述的启发，将每个逐层预测视为一个语句，并且将黄金标准视为概述输出。在解码、编码和注意状态期间层级之间的标签的分层关系被考虑。After having layer-wise outputs, these outputs should be merged into a unified label set. However, they cannot be simply grouped together, as simple concatenation can result in a much larger number of labels compared to gold standard labels or ground truth labels. In this patent document, a filtering method is disclosed to remove some layer-wise labels to ensure that the final distribution of predicted labels is consistent with the gold standard labels. In one or more embodiments, each layer-wise prediction is treated as a sentence, inspired by text summarization, and the gold standard is regarded as the summarization output. Hierarchical relationships of labels between levels during decoding, encoding and attention states are considered.

2.1分层指针生成模型的实施方式2.1 Implementation of the Hierarchical Pointer Generation Model

在一个或多个实施方式中，分层指针生成模型既允许从逐层预测复制标签，也允许从整个标签集生成标签。In one or more embodiments, the hierarchical pointer generation model allows both copying labels from layer-by-layer predictions and generating labels from the entire label set.

图5描绘了根据本公开的实施方式的用于最终合并的指针生成模型，并且图6描绘了根据本公开的实施方式的使用指针生成模型以便进行最终合并的过程。指针生成模型包括编码器520、注意力发生器530和解码器540。编码器接收(605)逐层预测标签的输入510，这些逐层预测标签被组织为从层级1到层级M的逐层序列。输入被编码(610)为M个序列的隐藏状态522。在一个或多个实施方式中，编码器是双向LSTM编码器。每个编码的隐藏状态在某个层级反映了预测标签的内部关系。在一个或多个实施方式中，编码器隐藏状态被表示为e^τ＝v^Ttanh(w_hγ^τ+w_ss^τ+b_attn)。在此，s^τ和γ^τ分别是预测标签周围的预测标签序列矢量和内容矢量。项v、w_h、w_s和b_attn是权重参数。在一个或多个实施方式中，内容矢量是关于标记的同现。5 depicts a pointer generation model for final merging according to an embodiment of the present disclosure, and FIG. 6 depicts a process for using a pointer generation model for final merging according to an embodiment of the present disclosure. The pointer generation model includes an encoder 520 , an attention generator 530 and a decoder 540 . The encoder receives ( 605 ) an input 510 of layer-wise predicted labels organized as a layer-by-layer sequence from level 1 to level M. The input is encoded (610) as M sequences of hidden states 522. In one or more embodiments, the encoder is a bidirectional LSTM encoder. The hidden state of each encoding reflects the internal relationship of the predicted labels at some level. In one or more embodiments, the encoder hidden state is denoted as e ^τ = v ^T _tanh (wh γ ^τ + w _s s ^τ + b _attn ). Here, ^sτ and ^γτ are the predicted label sequence vector and content vector around the predicted label, respectively. The terms v, _{wh, ws, and battn} _are _weight parameters. In one or more embodiments, the content vector is about co-occurrence of markers.

在一个或多个实施方式中，从编码器隐藏状态导出多个注意力发生器530，以便以时间步长τ生成(615)注意力分布a^τ和内容矢量γ^τ。在一个或多个实施方式中，a^τ被计算为a^τ＝softmax(e^τ)。注意力分布是预测的逐层标签上的概率分布。它用于产生γ^τ作为编码器隐藏状态的分层加权和：

其中γ_q代表第q个标签。In one or more embodiments, a plurality of attention generators 530 are derived from the encoder hidden state to generate (615) an attention distribution a ^τ and a content vector γ ^{τ at a time step τ} . In one or more embodiments, a ^τ is calculated as a ^τ =softmax(e ^τ ). The attention distribution is the probability distribution over the predicted layer-wise labels. It is used to generate ^γτ as a hierarchical weighted sum of encoder hidden states:

where γq represents the _qth label.

在一个或多个实施方式中，每个注意力发生器被称为覆盖矢量532，该覆盖矢量示出赋予每个层级的标签多少关注。众所周知，概述可能导致重复。因此，相同的标签也可以被多次生成。精心设计的覆盖矢量起着判断标签是否为重复的作用。如果不是重复，则关注度较高的标签更有可能被解码成一个正确的标签。如果重复，则避免重复的机制(在章节B中的2.3中进行描述)将标签过滤掉。基于该方法中描述的覆盖矢量，会引起注意。随后，解码器正在努力生成尺寸减小的输出。In one or more embodiments, each attention generator is referred to as a coverage vector 532, which shows how much attention is given to the labels of each hierarchy. As we all know, outlining can lead to duplication. Therefore, the same label can also be generated multiple times. The well-designed overlay vector plays the role of judging whether the label is a duplicate or not. If it is not a duplicate, a label with higher attention is more likely to be decoded into a correct label. If duplicated, the duplicate avoidance mechanism (described in 2.3 in Section B) filters the tags out. Attention is drawn based on the coverage vector described in the method. Subsequently, the decoder is working to generate a reduced size output.

在一个或多个实施方式中，为了为解码器生成解码器隐藏状态，从内容矢量γ^τ、预测的标签序列矢量s^τ、以及解码器输入y^τ(黄金标准或ground truth)获得(620)针对时间步长τ的生成概率p_gen∈[0，1]为：In one or more embodiments, to generate the decoder hidden state for the decoder, it is obtained (620) from the content vector ^γτ , the predicted label sequence vector ^sτ , and the decoder input ^yτ (gold standard or ground truth) The generation probability p _gen ∈ [0, 1] for time step τ is:

p_gen＝σ(w_hγ^τ+w_ss^τ+w_yy^τ+b_ptr) (3)p _gen =σ(w _h γ ^τ +w _s s ^τ +w _y y ^τ +b _ptr ) (3)

其中w_h、w_s、w_y和b_ptr是权重参数。在此，p_gen用作软开关，以在通过从标签分布

(在本文参见如何在2.2中计算

)中采样来从整个标签集中生成标签或者通过从注意力分布a^τ中采样来从输入序列中复制标签之间进行选择。where w _h , _ws , w _y and b _ptr are weight parameters. Here, _pgen is used as a soft switch to distribute

(See how to calculate in 2.2 in this article

) to generate labels from the entire label set or to copy labels from the input sequence by sampling from the attention distribution a ^τ .

利用以上输入逐层预测标签、编码器隐藏状态、注意力发生器和解码器隐藏状态，可以训练分层指针生成模型以生成(625)最终概述的语义索引标签的输出550。在生成输出时，生成最终标签的概率被学习。给定一个训练对

计算条件概率

以将概率链规则项的标签估计为：Using the above input layer-by-layer predicted labels, encoder hidden state, attention generator, and decoder hidden state, a hierarchical pointer generation model can be trained to generate (625) an output 550 of the final summarized semantic index label. When generating the output, the probability of generating the final label is learned. Given a training pair

Calculate Conditional Probabilities

To estimate the label of the probability chain rule term as:

其中

是

矢量的序列。通过将训练集的条件概率最大化为

来学习模型的参数，其中总和超过训练示例。in

Yes

Sequence of vectors. By maximizing the conditional probability of the training set as

to learn the parameters of the model where the sum exceeds the training examples.

2.2序列到序列概率计算的实施方式2.2 Implementation of sequence-to-sequence probability calculation

在一个或多个实施方式中，以上过程最终产生标签词汇分布

为：In one or more embodiments, the above process ultimately yields a label vocabulary distribution

for:

其中，v、v'、b和b’是可学习的参数。对于特定标签，可以从

获得。在一个或多个实施方式中，损失函数是目标标签

的负对数似然。以下示例说明了在给定其他标签的情况下一个标签的概率计算过程。where v, v', b and b' are learnable parameters. For a specific tag, it can be accessed from

get. In one or more embodiments, the loss function is the target label

The negative log-likelihood of . The following example illustrates the process of calculating the probability of one label given other labels.

图7描绘了根据本专利公开的一个或多个实施方式的标签集{营养和代谢疾病}的层次关系。图中的左边单词是标签的缩略词。例如，Wolfram综合症的首字母缩写为ws.。该示例的计算在章节B的部分2.2中进行了描述。在图7的左边，为了节省空间并且为了更好地说明过程，在此处以那些简短形式(它们是MeSH项的初始字母)绘制了Sigmoid符号。7 depicts a hierarchical relationship of a tag set {Nutrition and Metabolic Diseases} in accordance with one or more embodiments of the present patent disclosure. The words on the left in the figure are acronyms for labels. For example, the acronym for Wolfram Syndrome is ws. The calculation for this example is described in Section B, Section 2.2. On the left side of Figure 7, to save space and to better illustrate the process, the Sigmoid symbols are drawn here in those short forms (they are the initial letters of the MeSH term).

在一个或多个实施方式中，给定内容＝{nmd、md、dm、dmt1}，则遵循这些标签之间的层次关系来将p(e_ws|context)计算为：In one or more embodiments, given content={nmd, md, dm, dmt1}, following the hierarchical relationship between these tags, p(e _ws |context) is calculated as:

2.3避免重复的机制的实施方式2.3 Implementation of mechanisms to avoid duplication

指针生成模型或序列到序列模型的问题在于它可能多次从输入复制项。不需要重复项，因为每个标签都应当是唯一的。可以通过采用覆盖机制来避免重复。即，如果在一个层级的输出中已看到标签，则在其他层级生成它们的概率将变低。在本专利公开的一个或多个实施方式中，通过将覆盖机制组合到整个指针生成模型中来采用这种方法。具体地，The problem with pointer generative models or sequence-to-sequence models is that it may copy items from the input multiple times. Duplicates are not required, as each label should be unique. Duplication can be avoided by employing an overlay mechanism. That is, if labels have been seen in the output of one level, the probability of generating them at other levels will be lower. In one or more embodiments of the present patent disclosure, this approach is taken by incorporating an overlay mechanism into the overall pointer generation model. specifically,

其中

是指覆盖矢量，y^m是指第m层级。in

refers to the coverage vector, and y ^m refers to the mth level.

在本专利公开的一个或多个实施方式中，覆盖矢量由用于所有层级的一组矢量组成。对于每个覆盖矢量，

是在逐层输入上的非标准化分布，逐层输入表示迄今为止已从注意力机制接收到这些标签的覆盖程度。由于标签是按层级排序的，因此在层级的不同部分不应当重复，这种机制旨在删除在不同部分中找到的所有重复标签，并且还避免在同一层级内重复。在一个或多个实施方式中，作为对任何重复的惩罚，将

添加到注意力机制并且还将

添加到指针生成模型的总损失函数。In one or more embodiments of this patent disclosure, an overlay vector consists of a set of vectors for all levels. For each cover vector,

is the unnormalized distribution over the layer-wise input that represents how well these labels have been received so far from the attention mechanism. Since labels are ordered hierarchically, there should be no duplicates in different parts of the hierarchy, this mechanism is designed to remove all duplicate labels found in different parts, and also avoid duplication within the same level. In one or more embodiments, as a penalty for any duplication, the

added to the attention mechanism and will also

The total loss function added to the pointer generative model.

C.一些实验C. Some experiments

应当注意，这些实验和结果通过说明的方式提供并且使用一个或多个具体实施方式在具体条件下执行；因此，这些实验和它们的结果都不应被用来限制本专利文献的公开的范围。It should be noted that these experiments and results are provided by way of illustration and were performed under specific conditions using one or more specific embodiments; therefore, neither these experiments nor their results should be used to limit the scope of the disclosure of this patent document.

在此章节中，使用来自标记有MeSH和Amazon-Cat13K的美国国家医学图书馆的MEDLINE数据集来评估深度逐层XMLC的实施方式的有效性。如章节A中描述的，MEDLINE是世界上最大的生物医学文献数据库，并且医学主题词表(MeSH)是用于在MEDLINE中标记文章的领域本体。另一数据集AmazonCat13K是用于开发极端分类算法的基准数据集之一。类似于MeSH，它涉及13330个标签，所有标签都是按层次组织的。数据集规模、专家标签和层次结构性质为所呈现的框架提供了理想的测试平台。In this section, the effectiveness of an implementation of deep layer-wise XMLC is evaluated using the MEDLINE dataset from the US National Library of Medicine labeled MeSH and Amazon-Cat13K. As described in Section A, MEDLINE is the world's largest database of biomedical literature, and the Medical Subject Thesaurus (MeSH) is a domain ontology used to label articles in MEDLINE. Another dataset, AmazonCat13K, is one of the benchmark datasets used to develop extreme classification algorithms. Similar to MeSH, it involves 13330 labels, all of which are organized hierarchically. The dataset size, expert labels, and hierarchical nature provide an ideal testbed for the presented framework.

1.数据设置和预处理1. Data Setup and Preprocessing

MEDLINE中的MeSH标签的总数为26,000，其中60％出现超过1000次。在一项或多项实验设置中，在实验中删除了出现少于10次的那些MeSH标签。MEDLINE具有2600万篇带有摘要的文章。这些文章中有90％具有约20个MeSH标签。82％的文章分配了4到16个MeSH标签。在MeSH中，有350万个摘要具有MeSH标签和关键字两者。MeSH标签的本体可以分解为7个层级，其中最低层级(第7层级)包括最具体的MeSH标签，而最高层级(第1层级)具有最通用和抽象的MeSH标签。对于仅具有最低层级的MeSH标签的文章，通过以下方法对其进行扩展。从最低层级的标签开始，找出其上层级的所有标签。在一个或多个实验设置中，为提出的深度逐层XMLC框架构建7个数据集。The total number of MeSH tags in MEDLINE is 26,000, of which 60% appear more than 1000 times. In one or more experimental settings, those MeSH labels that appeared less than 10 times in the experiment were removed. MEDLINE has 26 million articles with abstracts. 90% of these articles have about 20 MeSH tags. 82% of articles were assigned between 4 and 16 MeSH tags. In MeSH, there are 3.5 million abstracts with both MeSH tags and keywords. The ontology of MeSH tags can be decomposed into 7 levels, where the lowest level (level 7) includes the most specific MeSH tags, and the highest level (level 1) has the most general and abstract MeSH tags. For articles with only the lowest level of MeSH tags, it is expanded by the following method. Start with the label at the lowest level and find all labels at the level above it. In one or more experimental settings, 7 datasets are constructed for the proposed deep layer-by-layer XMLC framework.

同时，储存来自所有7个层级的具有MeSH标签的102,167个摘要用于测试。表1中示出每个层级的数据集的统计信息。可以看出，中间层级的标签数量最多，而最高层级仅具有83个标签，并且最低层级具有2445个标签。数据量也可发现类似的趋势。两百万篇文章的标签来自层级2、3和4，而少于一百万篇文章的标签来自层级1、6和7。Meanwhile, 102,167 abstracts with MeSH tags from all 7 tiers were stored for testing. The statistics of the datasets for each tier are shown in Table 1. It can be seen that the middle level has the largest number of labels, while the highest level has only 83 labels, and the lowest level has 2445 labels. Similar trends can be found in the volume of data. Two million articles have tags from tiers 2, 3, and 4, while fewer than one million articles have tags from tiers 1, 6, and 7.

对于AmazonCat13K，由于深度逐层XMLC需要文本数据，因此无法直接使用其预处理数据集。同时，数据应基于其逐层类别来划分。发现所有标签可以分解成9个层级。有所不同的是，如果来自AmazonCat13K的文档具有较低的标签，则它必具有较高的标签，而来自MeSH的文档则不一定如此。因此，很容易找到用于测试AmazonCat13K的公共集合(仅使用具有较低类别的文档)。为了保持测试数据的合理的池化，忽略层级高于6的文档(对于层级7、8和9，分别仅为9990、385和31个文档)。For AmazonCat13K, since deep layer-by-layer XMLC requires text data, its preprocessing dataset cannot be used directly. At the same time, the data should be divided based on its layer-by-layer categories. It is found that all labels can be broken down into 9 levels. The difference is that if a document from AmazonCat13K has a lower label, it must have a higher label, which is not necessarily the case for a document from MeSH. Therefore, it is easy to find public collections for testing AmazonCat13K (using only documents with lower categories). In order to maintain reasonable pooling of test data, documents with tiers higher than 6 are ignored (only 9990, 385 and 31 documents for tiers 7, 8 and 9, respectively).

表1.数据集的统计信息。对于每个层级，存在不同的数据量。Medline中的论文不一定暗示可以将其标记为较高层级的MeSH项，即使它们被使用较低层级的MeSH项进行标记。Table 1. Statistics of the dataset. For each level, there is a different amount of data. Papers in Medline do not necessarily imply that they can be labeled as higher-level MeSH terms, even if they are labeled with lower-level MeSH terms.

在实验中，对于MEDLINE文章和关键字，在每个层级，首先根据深度逐层XMLC的第一分量来训练单个神经网络。采用训练模型对每个层级的测试数据进行预测。随后，将来自训练数据的预测的逐层标签以及黄金标准标签由指针生成模型用于最终合并。同样，针对AmazonCat14K训练逐层模型，但是后者不具有关键字。In the experiments, for MEDLINE articles and keywords, at each level, a single neural network is first trained according to the first component of the depth-wise layer-wise XMLC. Use the trained model to make predictions on the test data at each level. Subsequently, the predicted layer-wise labels from the training data, as well as the gold standard labels, are used by the pointer generation model for final merging. Likewise, a layer-wise model is trained for AmazonCat14K, but the latter does not have keywords.

2.评估指标2. Evaluation Metrics

在极端多标签分类数据集中，尽管通常存在巨大的标签空间，但是每个文档的相关标签的数量有限。这意味着为每个测试文档提供相关标签的简短的排序表是重要的。评估的重点因此是此类排序表的质量，重点在于每个列表的顶部的相关性。然而，在一个或多个实验设置中，出于与两个数据集来源进行比较的目的，使用了两个评估指标。医学界更喜欢使用精度、查全率和F分数，而一般领域的人更喜欢使用K处的精度(P@K)和归一化折扣累积增益(简称NDCG@K或G@K)。In extreme multi-label classification datasets, the number of relevant labels per document is limited despite the often huge label space. This means that it is important to provide a short sorted list of relevant tags for each test document. The focus of the evaluation is therefore on the quality of such sorted lists, with an emphasis on the relevance of the top of each list. However, in one or more experimental settings, two evaluation metrics were used for the purpose of comparison with two dataset sources. The medical community prefers to use precision, recall, and F-score, while those in the general domain prefer to use precision at K (P@K) and normalized discounted cumulative gain (NDCG@K or G@K for short).

具体地，给定预测的标签列表

其中前K个项在层级m处，精度、查全率和F分数定义如下：Specifically, given a list of predicted labels

where the top K terms are at level m, and the precision, recall and F-score are defined as follows:

其中，N是数据样本的数量，并且

是排名前K个标签中的正确标签的数量；AK_i是文章i的黄金标准标签的总数；微度量与宏度量之间的差异在于预测概率的计算。对于微度量，直到将所有正确的预测加在一起后才进行概率计算，而对于宏度量，将对每篇文章进行概率计算，并且最后将平均值用作宏分数。报告这两种度量是为了查看模型对于单篇文章和整个数据集的准确性如何。where N is the number of data samples, and

is the number of correct labels among the top K labels; AK _i is the total number of gold-standard labels for article i; the difference between micrometrics and macrometrics is the computation of predicted probabilities. For micrometrics, probability calculations are not performed until all correct predictions are added together, while for macrometrics, probability calculations are performed for each article, and the average is used as the macroscore at the end. Both metrics are reported to see how accurate the model is for individual articles and the entire dataset.

相反，P@K和NDCG@K的定义是，Instead, P@K and NDCG@K are defined as,

其中，

表示为文档的真实标签的矢量，并且

为同一文档的系统预测分数矢量。在一个或多个实验设置中，遵循P@K和NDCG@K的约定使用k＝1，3，5。in,

is a vector represented as the true labels of the document, and

A vector of predicted scores for the system for the same document. In one or more experimental settings, k=1, 3, 5 were used following the conventions of P@K and NDCG@K.

3.参数设置3. Parameter setting

对于深度逐层XMLC的神经网络，使用修正线性单元。滤波窗口设置为3、4、5。丢失(dropout)率设置为0.5，并且L2约束设置为3。最小批量大小设置为256。嵌入尺寸因不同特征而异。对于Mesh，医学单词的单词嵌入涉及500,000个唯一令牌，关键字嵌入涉及100,000多个短语，并且标签嵌入涉及26,000个MeSH项。Gensim用于训练以300为维数的嵌入。对于AmazonCat13K，利用预训练的GoogleNews-vectors-negative300.bin，其中具有300万个令牌并且以300为维数。其他超参数的值是经由对来自训练数据的较小验证集进行格点搜索来选择的。For deep layer-wise XMLC neural networks, rectified linear units are used. The filter window is set to 3, 4, 5. The dropout rate is set to 0.5, and the L2 constraint is set to 3. The minimum batch size is set to 256. Embedding size varies for different features. For Mesh, word embeddings for medical words involve 500,000 unique tokens, keyword embeddings involve over 100,000 phrases, and label embeddings involve 26,000 MeSH terms. Gensim is used to train embeddings of dimension 300. For AmazonCat13K, a pretrained GoogleNews-vectors-negative300.bin with 3 million tokens and 300 dimensions is utilized. Values for other hyperparameters were chosen via a lattice search on a smaller validation set from the training data.

4.在线F度量优化的性能4. Performance of Online F-Measure Optimization

如章节B.1.4中讨论的，在线宏F度量优化(OFO)被集成到提出的框架中。为了显示OFO的有效性，针对MeSH对图8中的前6个层级报告宏精度、查全率和F分数。尽管未显示第7层级的结果和AmazonCat13K的结果，但它们可以获得类似的性能。可以观察到，OFO有助于在宏精度与查全率之间实现平衡。进一步观察到，最佳F分数在不同的层级是不同的。如果始终选择前K个(实验中k＝10)进行逐层预测，则尽管每个层级的查全率可能高达80％左右，但仍无法获得最佳F分数。精度可以低至或小于20％。原因是在将每篇文章的MeSH标签划分为7个层级之后，大多数文章在每个层级仅具有2到5个标签。这意味着，即使所有标签均位于前10之内，尽管查全率可能为100％，但精度仅为20％至50％。在这种情况下，F分数也不高。OFO极大地删除了不太相关的标签，因此每个层级的最终预测集中的标签的数量也从2到5不等。同时，大多数正确的预测仍保留在预测集中。显然，这种调整策略极大地提高了性能。As discussed in Section B.1.4, online macro-F-metric optimization (OFO) is integrated into the proposed framework. To show the effectiveness of OFO, macro precision, recall and F-score are reported for the first 6 tiers in Figure 8 for MeSH. Although Tier 7 results and AmazonCat13K results are not shown, they can achieve similar performance. It can be observed that OFO helps to achieve a balance between macro precision and recall. It is further observed that the optimal F-score is different at different tiers. If the top K (k=10 in the experiment) are always selected for layer-by-layer prediction, the best F-score cannot be obtained although the recall rate of each layer may be as high as about 80%. Accuracy can be as low as or less than 20%. The reason is that after dividing the MeSH tags of each article into 7 tiers, most articles only have 2 to 5 tags at each tier. This means that even though all labels are in the top 10, even though the recall may be 100%, the accuracy is only 20% to 50%. In this case, the F-score is also not high. OFO greatly removes less relevant labels, so the number of labels in the final prediction set at each level also varies from 2 to 5. At the same time, most of the correct predictions remain in the prediction set. Clearly, this tuning strategy greatly improves performance.

5.逐层性能5. Layer-by-layer performance

如在章节B中讨论的，提出的深度逐层XMLC框架将XMLC的任务分解成逐层模型构建。因此，在本章节中，报告逐层预测结果以便了解整个模型的中间发展和改进。As discussed in Section B, the proposed deep layer-by-layer XMLC framework decomposes the task of XMLC into layer-by-layer model construction. Therefore, in this section, the layer-by-layer prediction results are reported in order to understand the intermediate development and improvement of the overall model.

如图3所示，逐层神经模型从来自上层级或下层级的MEDLINE集合、关键字以及预测标签中学习标签嵌入。在此报道了具有不同嵌入的逐层神经模型的性能。通过固定每个层级的前K个标签、经与逐层神经模型进行比较，进一步证明了OFO的有效性。测试从1到10的不同K值，并且发现当K为5时实现最佳性能。As shown in Figure 3, the layer-by-layer neural model learns label embeddings from MEDLINE sets, keywords, and predicted labels from upper or lower layers. Here we report the performance of layer-wise neural models with different embeddings. The effectiveness of OFO is further demonstrated by fixing the top K labels at each level and comparing with the layer-by-layer neural model. Different values of K from 1 to 10 were tested and the best performance was found to be achieved when K was 5.

表2报告了用于具有OFO和前K固定策略的逐层模型的微观性能。为了获得最佳结果，在此将K设置为5。表3中也示出宏度量的性能。可以看出，无论在微度量还是宏度量上，OFO的性能总是优于固定前K个的策略。Table 2 reports the microscopic performance for the layer-by-layer model with OFO and top-K fixation strategies. For best results, set K to 5 here. The performance of the macrometrics is also shown in Table 3. It can be seen that OFO always outperforms the strategy of fixing the top K, no matter on the micro-metric or the macro-metric.

表2和表3还报告了针对MeSH具有三种不同嵌入的逐层预测。尽管对AmazonCat13K数据集的评估不是基于F分数，但是还报道针对AmazonCat13K的微度量以示出OFO的优势。毕竟，使用OFO在过滤的输出上计算P@K和NDCG@K的结果。根据此结果，可以确定所有七个层级的明显的增量趋势。即，通过添加上层级和下层级的关键字和预测MeSH项，该预测会相应地得到快速改善。不难看出，一般而言，宏观结果要优于微观结果。其中，MeSH的第三层级和第四层级同时AmazonCat13K的第四层级和第五层级比其他数据集产生更差的结果，而对于两个数据集，层级1则获得好得多的结果。考虑到第三和第四标签数量更多(对于MeSH，分别为4,484和6,568，而对于AmazonCat13K，分别为6,181和5,372)，这是可理解的。Tables 2 and 3 also report layer-wise predictions for MeSH with three different embeddings. Although the evaluation on the AmazonCat13K dataset is not based on F-scores, a micrometric for AmazonCat13K is also reported to show the advantage of OFO. After all, the results of P@K and NDCG@K are computed on the filtered output using OFO. From this result, a clear incremental trend for all seven levels can be identified. That is, by adding upper and lower levels of keywords and predicted MeSH terms, the prediction will be quickly improved accordingly. It is not difficult to see that, in general, macro results are better than micro results. Among them, Tier 3 and Tier 4 of MeSH and Tier 4 and Tier 5 of AmazonCat13K yield worse results than the other datasets, while Tier 1 achieves much better results for both datasets. This is understandable considering that the third and fourth tags are more numerous (4,484 and 6,568, respectively, for MeSH and 6,181 and 5,372, respectively, for AmazonCat13K).

表2:使用前K个和OFO进行微度量的逐层性能。该表旨在示出当针对每个层级逐步添加新的特征时，微度量的增量式改进。同时，对于每个层级，此处还示出没有优化和具有OFO优化的前K个的结果。Table 2: Layer-wise performance for micrometrics using top-K and OFO. This table is intended to show the incremental improvement of micrometrics as new features are incrementally added for each level. Also, for each tier, the results for the top K without optimization and with OFO optimization are also shown here.

表3.使用前K个和OFO进行宏度量的逐层性能。该表旨在示出当针对每个层级逐步添加新的特征时，微度量的增量式改进。同时，对于每个层级，以类似于表2的方式，此处还示出没有优化和具有OFO优化的前K个的结果。Table 3. Layer-wise performance for macrometrics using top-K and OFO. This table is intended to show the incremental improvement of micrometrics as new features are incrementally added for each level. Meanwhile, for each tier, in a similar manner to Table 2, the results for the top K without optimization and with OFO optimization are also shown here.

6.最终合并的性能6. Performance of the final merge

所提出的深度逐层的XMLC会将逐层预测合并到具有指针生成模型的一个统一标签集中。在本章节中，针对MeSH结果，将深度逐层XMLC进一步与五种最新方法进行比较，以证明指针生成模型的有效性，五种最新方法包括：MTIDEF(Minlie Huang等人，推荐用于生物医学文章注释的MeSH项，美国医学信息协会杂志的第18卷，第5期(2011年)，第660至667页)；MeSH Now(Yuqing Mao和Zhiyong Lu.，2017年)，MeSHNow：通过学习排名在PubMed规模上自动进行MeSH索引，生物医学语义学杂志第8卷，第1期(2017年)，15)；MeSHLabeler、MeSHRanker(Ke Liu等人，MeSHLabeler：通过整合各种证据来提高大规模MeSH索引的准确性，生物资讯第31期卷，第12期(2015年)第i339至i347页)；以及深度Mesh(Shengwen Peng等人，DeepMeSH：用于改善大规模MeSH索引的深度语义表示，生物资讯，第32期卷，第12期(2016年)第i70至i79页)。所有这些现有系统都大量使用特征工程。相反，深度逐层XMLC使用有限的外部资源。对于AmazonCat13K，XML-CNN的结果在此基准数据集上报告了最新系统的状态。The proposed deep layer-by-layer XMLC incorporates layer-by-layer predictions into a unified label set with a pointer-generating model. In this section, against the MeSH results, the deep layer-by-layer XMLC is further compared with five state-of-the-art methods to demonstrate the effectiveness of the pointer generation model, including: MTIDEF (Minlie Huang et al., recommended for biomedical use) MeSH Items for Article Annotations, Journal of the American Medical Information Association, Volume 18, Issue 5 (2011), pp. 660-667); MeSH Now (Yuqing Mao and Zhiyong Lu., 2017), MeSHNow: Ranking by Learning Automated MeSH Indexing at the PubMed Scale, Journal of Biomedical Semantics, Vol. 8, Issue 1 (2017), 15); MeSHLabeler, MeSHRanker (Ke Liu et al., MeSHLabeler: Improving Large-Scale MeSH by Integrating Various Evidences) Indexing Accuracy, Bioinformatics Vol. 31, No. 12 (2015) pp. i339-i347); and Deep Mesh (Shengwen Peng et al., DeepMeSH: Deep Semantic Representations for Improving Large-Scale MeSH Indexing, Bio Information, Vol. 32, Issue 12 (2016, pp. i70-i79). All of these existing systems make heavy use of feature engineering. In contrast, deep layer-by-layer XMLC uses limited external resources. For AmazonCat13K, XML-CNN results report the state of the art system on this benchmark dataset.

从MeSH标记开始，在获得逐层结果之后，使用来自所有层级的预测结果作为输入并且以黄金标准标签作为输出来训练分层指针生成模型。对于模型训练，可以将每个标签作为独立的单位或与一个单元处于同一层级的标签(在概述共同体中称为语句)来组织输入。因此，训练两个指针生成模型，其中前者称为深度逐层XMLC_label，并且后者称为深度逐层XMLC_level。为了进行比较，将所有层级的结果相加，并且随后根据其预测概率和黄金标准(深度逐层XMLC_sampling)中的标签分布来过滤掉不太相关的标签。Starting from MeSH labeling, after obtaining layer-wise results, a hierarchical pointer generation model is trained using predictions from all layers as input and gold standard labels as output. For model training, the input can be organized with each label as a separate unit or as a label at the same level as a unit (called sentences in the overview community). Therefore, two pointer generative models are trained, where the former is called deep-level-wise XMLC _label and the latter is called deep-level-wise XMLC _level . For comparison, the results from all levels are summed, and less relevant labels are then filtered out according to their predicted probabilities and the label distribution in the gold standard (deep layer-wise XMLC _sampling ).

表4.用于MeSH数据集的深度逐层XMLC的性能。从粗体数字可以看出，最佳性能来自深度逐层XMLC。显然，基于层级的动态池化获得比基于标签的动态池更好的性能。Table 4. Performance of deep layer-wise XMLC for the MeSH dataset. As you can see from the bold numbers, the best performance comes from deep layer-by-layer XMLC. Obviously, hierarchical-based dynamic pooling achieves better performance than label-based dynamic pooling.

如表4所示，在精度、查全率和F分数的宏度量上，深度逐层XMLC_label和深度逐层XMLC_level均优于其他系统。表4中未报告的微度量也具有类似的趋势。As shown in Table 4, both depth-wise XMLC _label and depth-wise XMLC _level outperform other systems on the macro metrics of precision, recall and F-score. Micrometrics not reported in Table 4 have similar trends.

通过涉及来自MEDLINE集合和关键字的嵌入，与所有其他现有的前沿框架相比，深度逐层XMLC_label和深度逐层XMLC_level实现了好得多的性能。可以观察到，尽管F分数非常相似，但是不同的输入组织可能会导致精度和查全率的不同性能。深度逐层XMLC_label实现更好的精度，而深度逐层XMLC_level实现更好的查全率。这似乎表明，提出的分层指针生成模型考虑了单元内的标签之间的相关性。因此，具有更长输入单元的深度逐层XMLC_level获得更好的查全率。然而，它还包括更多的误报，从而降低了其精度。相比之下，深度逐层XMLC_lebel在精度上获胜，可能是因为它考虑了更多更小的单位，并且因此错过了更多的正确肯定。By involving embeddings from MEDLINE collections and keywords, deep layer-by-layer XMLC _label and deep layer-by-layer XMLC _level achieve much better performance compared to all other existing cutting-edge frameworks. It can be observed that although the F-scores are very similar, different input organizations may lead to different performances in precision and recall. The depth-wise XMLC _label achieves better precision, while the depth-wise XMLC _level achieves better recall. This seems to suggest that the proposed hierarchical pointer generation model takes into account the correlation between labels within a cell. Therefore, the depth-wise XMLC _level with longer input units achieves better recall. However, it also includes more false positives, reducing its accuracy. In contrast, the deep layer-by-layer XMLC _{lebel wins} in accuracy, probably because it considers more smaller units and therefore misses more correct positives.

同时，与大多数现有系统相比，深度逐层XMLC_sampling获得的结果要差得多。这表明分层指针生成模型在最终达到最佳性能方面可以发挥重要作用。此外，还报告了使用最大池化的深度逐层XMLC_level的结果。默认情况下，所有系统都使用动态最大池化来工作。显然，结果表明，动态最大池化比常规最大池化策略更具优势。Meanwhile, the results obtained by deep layer-by-layer XMLC _sampling are much worse than most existing systems. This suggests that the hierarchical pointer generation model can play an important role in ultimately achieving the best performance. In addition, deep layer-by-layer XMLC _level results using max pooling are also reported. By default, all systems work with dynamic max pooling. Clearly, the results show that dynamic max-pooling has advantages over conventional max-pooling strategies.

表5.用于AmazonCat13K的深度逐层XMLC的性能。使用具有指针生成网络以及基于动态池化和层级的版本。如章节C中的3.1中所述，为了将提出的方法论从医学领域扩展到更通用的一些领域，提出的模型实施方式也在AmazonCat13K上进行了测试。对于使用AmazonCat13K的人，他们更喜欢报告精度@K和NDCG@K。还列出了AmazonCat13K的XML-CNN的性能以便进行比较。Table 5. Performance of deep layer-by-layer XMLC for AmazonCat13K. Use the generation network with pointers and the version based on dynamic pooling and hierarchy. As described in Section C, 3.1, to extend the proposed methodology from the medical domain to some more general domains, the proposed model implementation is also tested on AmazonCat13K. For those using AmazonCat13K, they prefer to report accuracy @K and NDCG@K. The performance of AmazonCat13K's XML-CNN is also listed for comparison.

XML-CNN：Jingzhou Liu等人，用于极端多标签文本分类的深度学习。在日本东京的第40届国际ACM SIGIR信息检索研究和发展(SIGIR)会议的会议记录中，第115至124页。XML-CNN: Jingzhou Liu et al., Deep Learning for Extreme Multi-Label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Information Retrieval Research and Development (SIGIR), Tokyo, Japan, pp. 115-124.

对于AmazonCat13K，结果在表5中给出。还列出了XML-CNN的最新技术成果。表5示出了来自本专利公开中的工作的更高性能的结果。应当注意的是，从具有每个层级的标签的原始文本数据中提取针对深度逐层XMLC的测试数据集，而XML-CNN的工作是在由数据收集器准备的标准测试数据集上进行测试的。For AmazonCat13K, the results are given in Table 5. The latest technological achievements of XML-CNN are also listed. Table 5 shows higher performance results from the work in this patent disclosure. It should be noted that the test dataset for deep layer-wise XMLC is extracted from the raw text data with labels at each level, while the work of XML-CNN is tested on the standard test dataset prepared by the data collector. .

D.一些相关工作D. Some related work

1.基于树的方法1. Tree-based approach

由于标签数量巨大，XMLC的预测可能会在时间和空间上涉及高成本。基于树的方法致力于降低训练和测试成本。例如，通过次线性排序(LPSR)方法进行的标签划分尝试通过学习基础分类器上的层次结构来减少预测时间。一些人提出了一种称为多标签随机森林(MLRF)的方法，该方法旨在学习整个随机树，而不是依赖于学习基础分类器。提议使用FastXML来学习不在标签空间上而是在特征空间上的层次结构。它定义在一个区域中活动的标签集为该区域中存在的所有训练点的标签的并集。在层次结构的每个节点上，都会优化基于NDCG的目标。即，在每个节点处，感应超平面，并且超平面将当前节点中的文档集分成两个子集。通过返回所有叶节点中最频繁出现的标签的排序列表来进行预测。最近，一些人基于整体随机森林开发了用于社交流的多标签分类。他们集成了基础学习器和基于标签的学习器来学习分层标签。然而，由于标签空间和特征空间的维数，这些方法的训练成本很高。Due to the huge number of tags, the prediction of XMLC may involve high cost in time and space. Tree-based methods aim to reduce training and testing costs. For example, label partitioning by sublinear sorting (LPSR) methods attempts to reduce prediction time by learning a hierarchy on the base classifier. Some have proposed a method called Multi-Label Random Forest (MLRF), which aims to learn an entire random tree rather than relying on learning a base classifier. It is proposed to use FastXML to learn hierarchies not in the label space but in the feature space. It defines the label set active in a region as the union of the labels of all training points present in that region. At each node of the hierarchy, an NDCG-based objective is optimized. That is, at each node, a hyperplane is induced, and the hyperplane divides the set of documents in the current node into two subsets. Predictions are made by returning a sorted list of the most frequently occurring labels across all leaf nodes. Recently, some people have developed multi-label classification for social flow based on ensemble random forests. They integrated a base learner and a label-based learner to learn hierarchical labels. However, these methods are expensive to train due to the dimensionality of the label space and feature space.

2.嵌入方法2. Embedding method

嵌入方法试图通过将标签矢量投影到低维空间上并且由此减少标签的数量来克服由大量标签带来的难处理性问题。假设标签矩阵是低等级的。由于其强大的理论基础和处理标签相关性的能力，嵌入方法已被证明是解决XMLC问题的最流行的方法。具体地，最近提出的嵌入方法、用于极端多标签分类(SLEEC)的稀疏局部嵌入在将非线性邻域约束合并到低维嵌入空间中以便进行训练之后大大提高了精度，并且简单的k最近邻域(k-NN)聚类被用于嵌入空间中以便进行测试。在本专利公开的一个或多个实施方式中，通过探索逐层标记嵌入来采取进一步的步骤以改善神经结构的预测。Embedding methods attempt to overcome the intractability problem caused by a large number of labels by projecting the label vector onto a low-dimensional space and thereby reducing the number of labels. Assume that the label matrix is low-rank. Embedding methods have proven to be the most popular methods for solving XMLC problems due to their strong theoretical foundation and ability to deal with tag dependencies. Specifically, a recently proposed embedding method, Sparse Local Embedding for Extreme Multi-Label Classification (SLEEC), greatly improves accuracy after incorporating nonlinear neighborhood constraints into a low-dimensional embedding space for training, and a simple k nearest Neighborhood (k-NN) clustering was used in the embedding space for testing. In one or more embodiments of the present patent disclosure, further steps are taken to improve prediction of neural structures by exploring layer-by-layer label embeddings.

3.最大裕度方法3. Maximum margin method

最大裕度方法也被用来处理多标签分类。有人提出名为PD-Sparse的模型。基本上，针对在与该标签相关联的权重矩阵上具有L1和L2范数罚分的每个标签来学习线性分类器。这导致原始空间和对偶空间中的稀疏解。使用完全校正的块坐标Frank-Wolfe训练算法来实现关于原始变量和双变量的数量的次线性训练时间，同时获得比1对所有SVM和多标签分类的逻辑回归更好的性能，大大减少了训练时间和模型尺寸。然而，与1对所有SVM一样，PD-Sparse方法在算法上无法扩展到极端多标签学习。The maximum margin method is also used to handle multi-label classification. Someone proposed a model called PD-Sparse. Basically, a linear classifier is learned for each label with L1 and L2 norm penalties on the weight matrix associated with that label. This leads to sparse solutions in primal and dual spaces. Use the fully corrected block coordinate Frank-Wolfe training algorithm to achieve sub-linear training time with respect to the number of original variables and bivariates, while obtaining better performance than 1 for logistic regression for all SVMs and multi-label classification, greatly reducing training time and model size. However, like 1-versus-all SVMs, the PD-Sparse method is algorithmically not scalable to extreme multi-label learning.

4.基于深度学习的方法4. Methods based on deep learning

基于深度学习的方法也已经用于多标签学习。有人将标签空间嵌入合并到特征嵌入中。具体地，为标签A构造邻接矩阵，并且使用等式M＝(A+A²)/2推导标签图矩阵。随后，对于矩阵中的每个非零条目，将由索引p、q和M_pq组成的元组馈送到标签嵌入网络中，以与单词嵌入一起训练复合网络。在预测阶段中，在低维特征表示中执行k-NN搜索以从训练数据集中找到类似样本。将k-NN的标签的平均值设置为最终标签预测。有人提出将多标签同现模式纳入神经网络目标，以提高分类性能。他们还建议采用动态最大池化来捕获来自文档不同区域的丰富信息，并且使用附加的隐藏瓶颈层来减小模型尺寸。此外，Sigmoid输出上的二进制交叉熵损失被调制为XMLC。然而，这些方法不适用于具有复杂层次标签的数据，因为标签层次的分解会大大减少标签空间。此外，有人提出了基于玻尔兹曼CNN的混合学习网络来处理生物医学文献分类。他们的工作丰富了数据序列嵌入。这种设计不适用于巨大的标签空间。他们的实验仅着重于少于2,000个MeSH标签的种类。有人提出分层多标签分类网络(HMCN)，据称它能够同时优化局部和全局损失函数，以便从整个分类层次结构中发现局部分层分类关系和全局信息，同时惩罚分层违规。但是由于使用了完全前馈层，他们的工作具有更高的计算复杂性。即使用具有共享权重的类似LSTM的模型简化了HMCN网络，它仍然具有很高的计算负担。看来这就是为什么针对HMCN报告最多约4000个标签的数据集的原因。Deep learning based methods have also been used for multi-label learning. Some people incorporate label space embeddings into feature embeddings. Specifically, an adjacency matrix is constructed for label A, and the label graph matrix is derived using the equation M=(A+A ² )/2. Subsequently, for each non-zero entry in the matrix, a tuple consisting of indices p, q, and M _pq is fed into the label embedding network to train the composite network along with the word embeddings. In the prediction stage, a k-NN search is performed in the low-dimensional feature representation to find similar samples from the training dataset. Set the average of the labels of the k-NN as the final label prediction. It has been proposed to incorporate multi-label co-occurrence patterns into neural network objectives to improve classification performance. They also propose dynamic max pooling to capture rich information from different regions of the document, and use an additional hidden bottleneck layer to reduce model size. Furthermore, the binary cross-entropy loss on the Sigmoid output is modulated as XMLC. However, these methods are not suitable for data with complex hierarchical labels, because the label-level decomposition would greatly reduce the label space. Furthermore, a hybrid learning network based on Boltzmann CNN has been proposed to handle biomedical literature classification. Their work enriches data sequence embeddings. This design is not suitable for huge tab space. Their experiments only focused on species with fewer than 2,000 MeSH tags. A Hierarchical Multi-Label Classification Network (HMCN) has been proposed, which is claimed to be able to optimize both local and global loss functions in order to discover local hierarchical classification relationships and global information from the entire classification hierarchy, while penalizing hierarchical violations. But their work has higher computational complexity due to the use of fully feed-forward layers. Even if the HMCN network is simplified with an LSTM-like model with shared weights, it still has a high computational burden. It seems that this is why datasets of up to ~4000 labels are reported for HMCN.

E.一些结论E. Some conclusions

本文公开了用于处理极端多标签学习和分类的基于深度学习的逐层框架的实施方式，通常(为方便起见而非限制)命名为深度逐层XMLC。深度逐层XMLC的实施方式包括若干创新。首先，在一个或多个实施方式中，分裂模型训练机制将标签划分为多个层级，从而在很大程度上减少了维数灾难和训练成本。其次，在一个或多个实施方式中，将利用宏F度量的取决于类别的动态最大池化和权重调整集成到神经架构中，使得最终预测更适合于层级及其分层关系的分布。第三，在一个或多个实施方式中，分层指针生成模型成功地将逐层输出合并到一个统一标签预测中。Disclosed herein is an implementation of a deep learning-based layer-by-layer framework for handling extreme multi-label learning and classification, generally (for convenience and not limitation) named deep layer-by-layer XMLC. The implementation of deep layer-by-layer XMLC includes several innovations. First, in one or more embodiments, the split model training mechanism divides the labels into multiple hierarchies, thereby greatly reducing the curse of dimensionality and training costs. Second, in one or more embodiments, class-dependent dynamic max-pooling and weight adjustment using macro-F-metrics is integrated into the neural architecture, so that the final predictions are more suitable for the distribution of hierarchies and their hierarchical relationships. Third, in one or more embodiments, the layer-by-layer pointer generation model successfully incorporates layer-by-layer outputs into one unified label prediction.

结果显示，通过利用来自上层级和下层级的MEDLINE集合、关键字和预测标签，深度逐层XMLC的实施方式实现了最新的结果。AmazonCat13K的结果还表明，深度逐层XMLC的实施方式对于处理各种数据集足够通用。The results show that the implementation of deep layer-by-layer XMLC achieves state-of-the-art results by leveraging MEDLINE sets, keywords and predicted labels from upper and lower layers. The results of AmazonCat13K also show that the implementation of deep layer-by-layer XMLC is general enough to handle various datasets.

在本专利公开中，不难发现，深度逐层XMLC实施方式可以方便地转移到任务，如大规模语义索引，以便构建更有效且更准确的信息检索引擎并且减少昂贵的手动专家工作，如此工作中所示。In this patent disclosure, it is not difficult to find that deep layer-by-layer XMLC implementations can be easily transferred to tasks such as large-scale semantic indexing in order to build more efficient and accurate information retrieval engines and reduce expensive manual expert work, such as shown in.

本领域技术人员将认识到，其他实施方式可以包括不同的、更稳健的损失函数，以及增加更多的层来处理特征细化或权重调整，并且同时提高运行效率。Those skilled in the art will recognize that other implementations may include different, more robust loss functions, as well as adding more layers to handle feature refinement or weight adjustment, and at the same time improve operational efficiency.

F.系统实施方式F. System Implementation

在实施方式中，本专利文献的方面可涉及、可包括一个或多个信息处理系统/计算系统或者可在一个或多个信息处理系统/计算系统上实施。计算系统可包括可操作来计算、运算、确定、分类、处理、传输、接收、检索、发起、路由、交换、存储、显示、通信、显现、检测、记录、再现、处理或利用任何形式信息、智能或数据的任何手段或手段的组合。例如，计算系统可为或可包括个人计算机(例如，膝上型计算机)、平板电脑、平板手机、个人数字助理(PDA)、智能手机、智能手表、智能包装、服务器(例如，刀片式服务器或机架式服务器)、网络存储设备、摄像机或任何其他合适设备，并且可在大小、形状、性能、功能和价格方面改变。计算系统可包括随机存取存储器(RAM)、一个或多个处理资源(例如中央处理单元(CPU)或硬件或软件控制逻辑)、ROM和/或其他类型的存储器。计算系统的另外组件可包括一个或多个盘驱动器、用于与外部设备通信的一个或多个网络端口、以及各种输入和输出(I/O)设备(例如键盘、鼠标、触摸屏和/或视频显示器)。计算系统还可包括可操作为在各种硬件组件之间传输通信的一个或多个总线。In embodiments, aspects of this patent document may relate to, may include, or may be implemented on one or more information handling systems/computing systems. A computing system may include a computing system operable to compute, compute, determine, classify, process, transmit, receive, retrieve, originate, route, exchange, store, display, communicate, manifest, detect, record, reproduce, process or utilize any form of information, Any means or combination of means of intelligence or data. For example, a computing system can be or can include a personal computer (eg, a laptop computer), a tablet computer, a phablet, a personal digital assistant (PDA), a smart phone, a smart watch, a smart package, a server (eg, a blade server or rack servers), network storage devices, cameras, or any other suitable equipment, and may vary in size, shape, performance, functionality, and price. A computing system may include random access memory (RAM), one or more processing resources (eg, a central processing unit (CPU) or hardware or software control logic), ROM, and/or other types of memory. Additional components of the computing system may include one or more disk drives, one or more network ports for communicating with external devices, and various input and output (I/O) devices (eg, keyboard, mouse, touch screen, and/or video display). The computing system may also include one or more buses operable to transmit communications between the various hardware components.

图9描绘根据本公开的实施方式的计算设备/信息处理系统(或是计算系统)的简化框图。应理解，计算系统可不同地配置并且包括不同组件，包括如图9中所示的更少或更多的部件，但应理解，针对系统900所示出的功能可操作为支持计算系统的各种实施方式。9 depicts a simplified block diagram of a computing device/information handling system (or computing system) in accordance with an embodiment of the present disclosure. It should be understood that computing systems may be configured differently and include different components, including fewer or more components as shown in FIG. 9, but it should be understood that the functions shown for system 900 are operable to support various components of the computing system. an implementation.

如图9所示，计算系统900包括一个或多个中央处理单元(CPU)901，CPU 901提供计算资源并控制计算机。CPU 901可实施有微处理器等，并且还可包括一个或多个图形处理单元(GPU)919和/或用于数学计算的浮点协处理器。系统900还可包括系统存储器902，系统存储器902可呈随机存取存储器(RAM)、只读存储器(ROM)、或两者的形式。As shown in FIG. 9, a computing system 900 includes one or more central processing units (CPUs) 901, which provide computing resources and control a computer. CPU 901 may be implemented with a microprocessor or the like, and may also include one or more graphics processing units (GPUs) 919 and/or floating point coprocessors for mathematical calculations. System 900 may also include system memory 902, which may be in the form of random access memory (RAM), read only memory (ROM), or both.

如图9所示，还可提供多个控制器和外围设备。输入控制器903表示至各种输入设备904的接口，例如键盘、鼠标、触摸屏和/或触笔。计算系统900还可包括存储控制器907，该存储控制器907用于与一个或多个存储设备908对接，存储设备中的每个包括存储介质(诸如磁带或盘)或光学介质(其可用于记录用于操作系统、实用工具和应用程序的指令的程序，它们可包括实施本发明的各方面的程序的实施方式)。存储设备908还可用于存储经处理的数据或是将要根据本发明处理的数据。系统900还可包括显示控制器909，该显示控制器909用于为显示设备911提供接口，显示设备911可为阴极射线管(CRT)、薄膜晶体管(TFT)显示器、有机发光二极管、电致发光面板、等离子面板或其他类型的显示器。计算系统900还可包括用于一个或多个外围设备906的一个或多个外围控制器或接口905。外围设备的示例可包括一个或多个打印机、扫描仪、输入设备、输出设备、传感器等。通信控制器914可与一个或多个通信设备915对接，这使系统900能够通过各种网络(包括互联网、云资源(例如以太云、经以太网的光纤通道(FCoE)/数据中心桥接(DCB)云等)、局域网(LAN)、广域网(WAN)、存储区域网络(SAN))中的任一网络，或通过任何合适电磁载波信号(包括红外信号)来连接至远程设备。As shown in Figure 9, multiple controllers and peripherals may also be provided. Input controller 903 represents an interface to various input devices 904, such as a keyboard, mouse, touch screen, and/or stylus. Computing system 900 may also include a storage controller 907 for interfacing with one or more storage devices 908, each of which includes a storage medium (such as tape or disk) or an optical medium (which may be used for Programs that record instructions for operating systems, utilities, and applications, which may include implementations of programs that implement aspects of the invention). Storage device 908 may also be used to store processed data or data to be processed in accordance with the present invention. The system 900 may also include a display controller 909 for providing an interface to a display device 911, which may be a cathode ray tube (CRT), thin film transistor (TFT) display, organic light emitting diode, electroluminescence panel, plasma panel or other type of display. Computing system 900 may also include one or more peripheral controllers or interfaces 905 for one or more peripheral devices 906 . Examples of peripheral devices may include one or more printers, scanners, input devices, output devices, sensors, and the like. The communications controller 914 may interface with one or more communications devices 915, which enables the system 900 to communicate over various networks (including the Internet, cloud resources (eg, Ethernet cloud, Fibre Channel over Ethernet (FCoE)/Data Center Bridging (DCB) ) cloud, etc.), local area network (LAN), wide area network (WAN), storage area network (SAN)), or by any suitable electromagnetic carrier signal (including infrared signal) to connect to the remote device.

在示出的系统中，所有主要系统组件可连接至总线916，总线916可表示多于一个的物理总线。然而，各种系统组件可在物理上彼此接近或可不在物理上彼此接近。例如，输入数据和/或输出数据可远程地从一个物理位置传输到另一物理位置。另外，实现本发明的各方面的程序可经由网络从远程位置(例如，服务器)访问。此类数据和/或程序可通过各种机器可读介质中的任一机器可读介质来传送，机器可读介质包括但不限于：诸如硬盘、软盘和磁带的磁性介质；诸如CD-ROM和全息设备的光学介质；磁光介质；以及硬件设备，该硬件设备专门被配置成存储或存储并执行程序代码，该硬件设备例如专用集成电路(ASIC)、可编程逻辑器件(PLD)、闪存设备、以及ROM和RAM设备。In the illustrated system, all major system components may be connected to bus 916, which may represent more than one physical bus. However, the various system components may or may not be in physical proximity to each other. For example, input data and/or output data may be transmitted remotely from one physical location to another. Additionally, programs implementing aspects of the present invention may be accessed from a remote location (eg, a server) via a network. Such data and/or programs may be transmitted by any of a variety of machine-readable media including, but not limited to: magnetic media such as hard disks, floppy disks, and magnetic tapes; such as CD-ROMs and Optical media for holographic devices; magneto-optical media; and hardware devices specifically configured to store or store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices , and ROM and RAM devices.

本发明的方面可利用用于一个或多个处理器或处理单元以使步骤执行的指令在一个或多个非暂态计算机可读介质上编码。应注意，一个或多个非暂态计算机可读介质应当包括易失性存储器和非易失性存储器。应注意，替代实现方式是可能的，其包括硬件实现方式或软件/硬件实现方式。硬件实施的功能可使用ASIC、可编程的阵列、数字信号处理电路等来实现。因此，任何权利要求中的术语“手段”旨在涵盖软件实现方式和硬件实现方式两者。类似地，如本文使用的术语“计算机可读媒介或介质”包括具有实施在其上的指令程序的软件和/或硬件或它们的组合。利用所构想的这些替代实现方式，应当理解，附图以及随附描述提供本领域的技术人员编写程序代码(即，软件)和/或制造电路(即，硬件)以执行所需处理所要求的功能信息。Aspects of the invention may utilize instructions for one or more processors or processing units to be encoded on one or more non-transitory computer readable media to cause the steps to be performed. It should be noted that the one or more non-transitory computer readable media shall include both volatile memory and nonvolatile memory. It should be noted that alternative implementations are possible, including hardware implementations or software/hardware implementations. Hardware-implemented functions may be implemented using ASICs, programmable arrays, digital signal processing circuits, and the like. Accordingly, the term "means" in any claim is intended to cover both software and hardware implementations. Similarly, the term "computer-readable medium or medium" as used herein includes software and/or hardware or a combination thereof having a program of instructions embodied thereon. With these alternative implementations contemplated, it should be understood that the drawings and accompanying description provide those skilled in the art with what would be required to write program code (ie, software) and/or to manufacture circuits (ie, hardware) to perform the desired processes function information.

应当注意，本发明的实施方式还可涉及具有其上具有用于执行各种计算机实施的操作的计算机代码的非暂态有形计算机可读介质的计算机产品。介质和计算机代码可为出于本发明的目的而专门设计和构造的介质和计算机代码，或者它们可为相关领域中的技术人员已知或可用的。有形计算机可读介质的示例包括但不限于：诸如硬盘、软盘和磁带的磁性介质；诸如CD-ROM和全息设备的光学介质；磁光介质；以及专门配置成存储或存储并执行程序代码的硬件设备，例如，专用集成电路(ASIC)、可编程逻辑器件(PLD)、闪存设备、以及ROM和RAM设备。计算机代码的示例包括机器代码(例如，编译器产生的代码)以及包含可由计算机使用解释器来执行的更高级代码的文件。本发明的实施方式可整体地或部分地实施为可在由处理设备执行的程序模块中的机器可执行指令。程序模块的示例包括库、程序、例程、对象、组件和数据结构。在分布的计算环境中，程序模块可物理上定位在本地、远程或两者的设定中。It should be noted that embodiments of the present invention may also relate to computer products having non-transitory tangible computer-readable media having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be known or available to those skilled in the relevant art. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware specially configured to store or store and execute program code Devices, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code (eg, code produced by a compiler) and files containing higher-level code that can be executed by a computer using an interpreter. Embodiments of the invention may be implemented in whole or in part as machine-executable instructions in a program module executable by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In a distributed computing environment, program modules may be physically located in local, remote, or both settings.

本领域的技术人员将认识到，计算系统或编程语言对本发明的实践来说均不重要。本领域的技术人员将还将认识到，多个上述元件可物理地和/或在功能上划分成子模块或组合在一起。Those skilled in the art will recognize that neither the computing system nor the programming language is critical to the practice of the present invention. Those skilled in the art will also appreciate that a number of the above-described elements may be physically and/or functionally divided into sub-modules or grouped together.

本领域技术人员将理解，前文的示例和实施方式是示例性的，并且不限制本公开的范围。旨在说明的是，在本领域的技术人员阅读本说明书并研究附图后将对本领域的技术人员显而易见的本发明的所有、置换、增强、等同、组合或改进包括在本公开的真实精神和范围内。还应注意，任何权利要求书的元素可不同地布置，包括具有多个从属、配置和组合。Those skilled in the art will appreciate that the foregoing examples and implementations are illustrative and do not limit the scope of the present disclosure. It is intended to illustrate that all, substitutions, enhancements, equivalents, combinations or improvements of the present invention that will be apparent to those skilled in the art after reading this specification and studying the accompanying drawings are included in the true spirit of the present disclosure and within the range. It should also be noted that elements of any claim may be arranged differently, including with multiple dependencies, configurations and combinations.

Claims

1. A computer-implemented method for multi-label learning and classification using one or more processors to perform steps, the method comprising:

processing an original training text into a clean training text;

analyzing the training labels into layer-by-layer labels of a plurality of levels based on the body hierarchical structure of the training labels;

training a plurality of layer-by-layer models by a layer-by-layer multi-label classification model based at least on the layer-by-layer labels and the clean training text, wherein each layer-by-layer model is associated with a corresponding level of labels;

employing one or more refinement strategies to perform layer-by-layer predictions from one or more inputs through the trained plurality of layer-by-layer models; and

merging the layer-by-layer predictions into a unified labelset for the one or more input data sets using a point generation model.

2. The computer-implemented method of claim 1, wherein the one or more inputs include word embedding for documents, word embedding for keywords, upper level embedding, and lower level embedding.

3. The computer-implemented method of claim 2, wherein performing layer-by-layer prediction comprises:

receiving inputs for word embedding for documents, word embedding for keywords, upper level label embedding, and lower level label embedding at a Convolutional Neural Network (CNN) within each layer-by-layer model to extract feature representations from each input;

obtaining a series embedding using the feature representation extracted from each input;

performing dynamic max-pooling at a max-pooling layer to select a desired feature from the series embedding;

obtaining a compact representation from the desired features by applying bulk normalization and one or more fully connected layers; and

each layer-by-layer model is trained with binary cross-entropy losses on the output layer and the hidden bottleneck layer based at least on the obtained compact representation.

4. The computer-implemented method of claim 3 wherein a Bi-directional long short term memory (Bi-LSTM) is constructed on the feature representations extracted from the word embedding to preserve the linguistic order of documents prior to concatenation.

5. The computer-implemented method of claim 3, wherein in performing dynamic maximal pooling, layer-by-layer correlation information of tags is incorporated into neural structures of at least the maximal pooling layer to capture tag co-occurrences and classification relationships between tags in order to dynamically select a maximal pooling size.

6. The computer-implemented method of claim 1, wherein the one or more refinement strategies include macro F-metric optimization to enable each layer-by-layer model to incrementally automatically refine layer-by-layer predictions by threshold adjustment.

7. The computer-implemented method of claim 1, wherein merging the layer-by-layer predictions into a unified labelset using the point generation model comprises:

encoding, using an encoder within the point generative model, the layer-by-layer prediction into a plurality of encoder hidden state sequences corresponding respectively to the plurality of levels;

deriving a plurality of attention generators from the plurality of encoder concealment state sequences to generate an attention distribution and a content vector for each level of the plurality of levels;

obtaining generation probabilities from the content vector, predicted tag sequence vector and decoder input to generate a plurality of decoder hidden state sequences; and

generating a final output summarizing semantic index labels based at least on the decoder hidden state.

8. The computer-implemented method of claim 7, wherein an overlay mechanism is combined with the point generation model to remove duplicate items per level and across levels.

9. A system for multi-label learning and classification of large-scale semantic indexes, the system comprising:

a layer-by-layer multi-tag classification model that decomposes tags in a high-dimensional space into layer-by-layer tags in a plurality of levels based on an ontology hierarchy of the tags, the layer-by-layer multi-tag classification model comprising a plurality of convolutional neural networks, wherein for each level of convolutional neural networks, each convolutional neural network extracts a feature representation from inputs for word embedding of a document, word embedding for a keyword, upper-level tag embedding, and lower-level tag embedding, respectively, the convolutional neural networks comprising:

a max pooling layer for dynamic max pooling to select features from a series of embeddings that are concatenated by feature representations extracted from all inputs;

one or more normalization layers and one or more fully connected layers for batch normalization and obtaining compact representations from selected features;

an output layer outputting a layer-by-layer prediction for said each level; and

a point generative model incorporating the layer-by-layer predictions at each of the plurality of levels into a unified labelset of the document, the point generative model comprising:

an encoder that encodes the layer-by-layer prediction into a plurality of encoder hidden state sequences respectively corresponding to the plurality of levels;

a plurality of attention generators derived from the plurality of encoder concealment state sequences to generate an attention distribution and a content vector for each of the plurality of levels; and

a decoder to generate a plurality of sequences of decoder hidden states based at least on the generated content vectors for each of the plurality of levels, the decoder to generate the unified labelset based at least on the layer-by-layer prediction, the encoder hidden state, the attention force generator, and the decoder hidden state.

10. The system of claim 9, wherein a Bi-directional long short term memory (Bi-LSTM) is constructed on the feature representations extracted from the word embedding to preserve linguistic order of documents prior to concatenation.

11. The system of claim 9, wherein in performing dynamic maximal pooling, layer-by-layer relevant information of tags is incorporated into neural structures of the maximal pooling layer to dynamically select a maximal pooling size.

12. The system of claim 9, wherein the layer-by-layer multi-label classification model uses online F-metric optimization such that each convolutional neural network can incrementally and automatically refine layer-by-layer predictions by adjusting thresholds of the online F-metric optimization.

13. The system of claim 12, wherein the threshold is updated according to an inter-iteration rule within the same iteration and a cross-iteration rule between iterations.

14. The system of claim 9, wherein the point generation model incorporates an overlay mechanism to eliminate duplicate laboratories in and across each level.

15. The system of claim 9, wherein each convolutional neural network further comprises a hidden bottleneck layer with activation function, the convolutional neural network being pre-trained by employing binary cross-entropy loss on the output layer and the hidden bottleneck layer.

16. The system of claim 15, wherein the binary cross-entropy loss is a function involving a weight matrix associated with the bottleneck layer and the output layer.

17. A computer-implemented method for multi-label learning and classification using one or more processors to perform steps, the method comprising:

at each of the plurality of hierarchical levels,

extracting feature representations from the input for word embedding for documents, word embedding for keywords, upper hierarchical level label embedding, and lower hierarchical level label embedding, respectively, using a convolutional neural network;

concatenating the feature representations extracted from all inputs as a concatenation embedding;

applying dynamic max pooling at a max pooling layer of the convolutional neural network to select features from a series embedding;

applying bulk normalization and one or more fully connected layers to obtain a compact representation from the selected features;

outputting, from an output layer of the convolutional neural network, a layer-by-layer prediction for the each level of the plurality of hierarchical levels; and

merging the layer-by-layer predictions from the plurality of hierarchical levels into a unified labelset using a point generation model.

18. The computer-implemented method of claim 17, wherein merging the layer-by-layer predictions into a unified labelset comprises:

encoding, using an encoder, the layer-by-layer prediction into a plurality of encoder hidden state sequences respectively corresponding to the plurality of levels;

deriving a plurality of attention generators from the plurality of encoder concealment state sequences to generate an attention distribution and a content vector for each level of the plurality of levels; and

generating, using a decoder, a plurality of decoder hidden state sequences based at least on the generated content vectors for each of the plurality of hierarchical levels; and

generating the unified labelset from the decoder based at least on the layer-by-layer prediction, the encoder concealment state, the attention generator, and the decoder concealment state.

19. The method of claim 18, wherein in performing dynamic maximum pooling, both tag co-occurrences and classification relationships between tags are captured to dynamically select a maximum pooling size.

20. The method of claim 18, wherein the layer-by-layer multi-label classification model uses online F-metric optimization such that each convolutional neural network can incrementally and automatically refine layer-by-layer predictions by adjusting thresholds of the online F-metric optimization, the thresholds being updated according to inter-iteration rules within the same iteration and cross-iteration rules between iterations.