CN116569164A - System and method for intelligent categorization of content in a content management system

Info

Publication number: CN116569164A
Application number: CN202180075939.2A
Authority: CN
Inventors: S·戈沙尔; S·卡米瑞迪; J·玛雅拉; V·彼得; H·S·卡德拉巴陆
Original assignee: Oracle International Corp
Current assignee: Oracle International Corp
Priority date: 2020-09-28
Filing date: 2021-09-28
Publication date: 2023-08-08

Abstract

According to embodiments, the systems and methods described herein may be used, for example, with a content management system to provide recommendations to categorize/classify content into user-defined categories, which in turn provides a content manager with an opportunity to easily place new content into an accurate category based on previously evaluated/categorized content. The recommendation system or tool may facilitate placement of content into related categories by automatic categorization/classification of newly created/edited content. Recommendation tools can be implemented and applied across different domains by generating feature vectors from content, creating clusters in feature space based on previously categorized content, and by computing new content recommendation categories from the clustered feature space distances.

Description

System and method for intelligently categorizing content in a content management system

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

Priority Claim and Cross-Reference to Related Applications:

This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/084,174, titled "SYSTEM AND METHOD FOR SMART CATEGORIZATION OF CONTENT IN A CONTENT MANAGEMENT SYSTEM", filed September 28, 2020; and to U.S. Patent Application No. 17/486,524, titled "SYSTEM AND METHOD FOR SMART CATEGORIZATION OF CONTENT IN A CONTENT MANAGEMENT SYSTEM", filed September 27, 2021; and is related to U.S. Patent Application No. 16/657,395, titled "TECHNIQUES FOR RANKING CONTENT ITEM RECOMMENDATIONS", filed October 18, 2019; which application is a continuation-in-part of, and claims the benefit of priority to, U.S. Patent Application No. 16/581,138, titled "SMART CONTENT RECOMMENDATIONS FOR CONTENT AUTHORS", filed September 24, 2019; which application claims the benefit of priority to Indian Provisional Patent Application No. 201841039495, titled "SMART CONTENT RECOMMENDATIONS FOR AUTHORS", filed October 18, 2018; each of the above applications and their contents are incorporated herein by reference.

Technical Field

The present application relates generally to online business environments and to the management and delivery of content data, and in particular to the intelligent categorization/classification of content in a content management system.

Background

Generators and authors of original content for online publication and/or transmission may use a variety of software-based tools and techniques to generate, edit, and store newly generated content.

In a content management system, various kinds of content (e.g., documents; structured content such as blogs, articles, and press releases; and media such as images and videos) often need to be evaluated and categorized based on what they contain. This categorization/classification takes place over a hierarchical set of categories, or nodes. For example, a contract document for a real estate lease might be categorized under Legal Documents->Real Estate->Contracts. The same document (or content item) may also belong to more than one categorization at the same time. For example, the same contract document might also exist under Active Contracts->Signed.
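As a minimal sketch of the hierarchy described above, the following Python models each taxonomy as a tree of category nodes and lets one content item sit in several nodes at once. The `CategoryNode` class, the `categories_of` helper, and the item identifier `lease-001` are hypothetical names chosen for illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryNode:
    """One node in a taxonomy tree (hypothetical structure, for illustration)."""
    name: str
    children: dict = field(default_factory=dict)  # child name -> CategoryNode
    items: set = field(default_factory=set)       # ids of content items placed here

    def path(self, *names):
        """Walk down the named children in order, creating any that are missing."""
        node = self
        for n in names:
            node = node.children.setdefault(n, CategoryNode(n))
        return node


def categories_of(root, item, prefix=()):
    """Return every category path under `root` whose node contains `item`."""
    found = []
    here = prefix + (root.name,)
    if item in root.items:
        found.append("->".join(here))
    for child in root.children.values():
        found.extend(categories_of(child, item, here))
    return found


# Two separate taxonomies; the same lease contract is placed in both.
legal = CategoryNode("Legal Documents")
active = CategoryNode("Active Contracts")
legal.path("Real Estate", "Contracts").items.add("lease-001")
active.path("Signed").items.add("lease-001")

paths = categories_of(legal, "lease-001") + categories_of(active, "lease-001")
print(paths)
```

Storing item ids on the nodes, rather than a single category on the item, is one simple way to allow the same document to appear under more than one categorization.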

Categories are grouped under an organizing concept called a taxonomy. An organization often has many taxonomies that reflect how its business organizes content. When new documents or content items are added, when a new taxonomy is introduced, or when the organization of content changes significantly, the task of correctly classifying or reclassifying content falls to the end user (or content author). As the volume of content and the number of taxonomies grow, this can become costly and error-prone work.

Summary of the Invention

According to an embodiment, the systems and methods described herein may be used, for example, with a content management system to provide recommendations for categorizing/classifying content into user-defined categories, which in turn gives content managers the opportunity to easily place new content into accurate categories based on previously evaluated/categorized content.

Classifying large volumes of content in an online fashion is a complex task, with challenges such as the single-pass constraint on the data and the need for fast responses. According to an embodiment, content users categorize similar content through logical clustering, such as a hierarchical taxonomy tree, and place similar content in the same node/category of the taxonomy tree. Over time, as the number of content entities and nodes in the taxonomy tree grows, similar content entities come to reside alongside one another in the same nodes. Given this state of content organization, a computer algorithm can use the content that already resides in an evaluated/categorized taxonomy to determine where newly created/edited content likely belongs.

According to an embodiment, a recommendation system or tool may use artificial intelligence (AI) techniques to continuously learn from past data, and may help place content into relevant categories through automatic categorization/classification of newly created/edited content. The recommendation tool may be implemented and applied across different domains by generating feature vectors from content, creating clusters in feature space based on previously categorized content, and computing recommended categories for new content from its feature-space distances to those clusters.
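The three steps named above (feature vectors, clusters built from previously categorized content, and distance-based recommendation) can be sketched as follows. This is a deliberately simplified illustration, not the disclosed algorithm: it assumes feature vectors are already available as plain lists of floats, represents each category by a single centroid, and uses Euclidean distance. The function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_clusters(labeled):
    """Map each category to the centroid of its previously categorized vectors."""
    by_category = defaultdict(list)
    for vec, category in labeled:
        by_category[category].append(vec)
    return {cat: centroid(vecs) for cat, vecs in by_category.items()}


def recommend(clusters, vec, top_k=2):
    """Rank categories by Euclidean distance from `vec` to each centroid."""
    ranked = sorted(clusters, key=lambda cat: math.dist(clusters[cat], vec))
    return ranked[:top_k]


# Hypothetical 2-D feature vectors for content that users already categorized.
labeled = [
    ([0.9, 0.1], "Contracts"),
    ([0.8, 0.2], "Contracts"),
    ([0.1, 0.9], "Press Releases"),
    ([0.2, 0.8], "Press Releases"),
]
clusters = build_clusters(labeled)
suggestions = recommend(clusters, [0.7, 0.3])
print(suggestions)  # nearest category first
```

A single centroid per category only works when each category forms one compact region of feature space; categories with irregular shapes would need a finer-grained cluster representation.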

Aspects of the present disclosure relate to an artificial intelligence (AI) driven tool configured to act as an intelligent digital assistant that recommends images, text content, and other relevant media content from a content repository. Certain embodiments may include a front-end software tool with a graphical user interface (GUI) that supplements a content authoring interface used to author original media content (e.g., blog entries, online articles, etc.). In some cases, the additional GUI screens and features may be incorporated into an existing content authoring software tool, for example, as a software plug-in. The intelligent digital content recommendation tool may communicate with multiple back-end services and content repositories, for example, to analyze text and/or visual input, extract keywords or topics from the input, classify and tag the input content, and store the classified/tagged content in one or more content repositories.

Additional techniques performed in various embodiments of the intelligent digital content recommendation tool (e.g., directly by the software tool and/or indirectly by invoking back-end services) may include converting input text and/or images into vectors within a multidimensional vector space, and comparing the input content against the content of multiple repositories to find multiple relevant content options within the content repositories. Such a comparison may include a thorough and exhaustive deep search and/or a more efficient tag-based filtered search. Finally, relevant content items (e.g., images, audio and/or video clips, links to related articles, etc.) may be retrieved and presented to the content author for review and for embedding into the originally authored content.
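The difference between the exhaustive deep search and the tag-based filtered search mentioned above can be sketched as below. This is an illustration under stated assumptions: the in-memory `REPOSITORY` list, its tags and three-dimensional vectors, and the choice of cosine similarity as the comparison measure are all hypothetical and are not specified by the disclosure.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical repository: each item carries tags and a precomputed feature vector.
REPOSITORY = [
    {"id": "img-1", "tags": {"beach", "sunset"}, "vec": [0.9, 0.1, 0.0]},
    {"id": "img-2", "tags": {"city", "night"}, "vec": [0.1, 0.9, 0.1]},
    {"id": "img-3", "tags": {"beach", "surf"}, "vec": [0.7, 0.0, 0.3]},
]


def search(query_vec, tags=None, top_k=2):
    """Deep search when `tags` is None; otherwise compare only items sharing a tag."""
    if tags is None:
        candidates = REPOSITORY
    else:
        candidates = [item for item in REPOSITORY if item["tags"] & tags]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["id"] for item in ranked[:top_k]]


query = [1.0, 0.0, 0.1]
deep = search(query)                     # compares against every item
filtered = search(query, tags={"surf"})  # compares only against tagged items
print(deep, filtered)
```

The filtered variant trades recall for speed: it skips vector comparisons for items that share no tag with the query, so a relevant but untagged item would be missed.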

Although the description herein primarily illustrates application to text content, according to various embodiments the approach can be extended, through metadata extraction, to multimedia or other types of content such as images and videos.

Brief Description of the Drawings

A further understanding of the nature and advantages of embodiments according to the present disclosure may be realized by reference to the remaining portions of the specification in conjunction with the following drawings.

In the drawings, similar components and/or features may have the same reference numeral. In addition, various components of the same type may be distinguished by following the reference numeral with a dash and a second label that distinguishes among the similar components. If only the first reference numeral is used in the specification, the description applies to any one of the similar components having the same first reference numeral, irrespective of the second label.

FIG. 1 is a diagram of an example computer system architecture including a data integration cloud platform in which certain embodiments of the present disclosure may be implemented.

FIG. 2 is an example screen of a custom dashboard in a user interface for configuring, monitoring, and controlling service instances, according to certain embodiments of the present disclosure.

FIG. 3 is an architecture diagram of a data integration cloud platform, according to certain embodiments of the present disclosure.

FIG. 4 is a diagram of an example computing environment configured to perform content classification and recommendation, according to certain embodiments of the present disclosure.

FIG. 5 is another diagram of an example computing environment configured to perform content classification and recommendation, according to certain embodiments of the present disclosure.

FIG. 6 is a flowchart illustrating a process for generating feature vectors based on content resources within a content repository, according to certain embodiments of the present disclosure.

FIG. 7 is an example image in which multiple image features are identified, according to certain embodiments of the present disclosure.

FIG. 8 is an example of a text document illustrating a keyword extraction process, according to certain embodiments of the present disclosure.

FIGS. 9-11 are diagrams illustrating a process of generating and storing image tags, according to certain embodiments of the present disclosure.

FIG. 12 is a flowchart illustrating another process for comparing feature vectors and identifying related content within a content repository, according to certain embodiments of the present disclosure.

FIG. 13 is a diagram illustrating a technique for converting a submitted image into a feature vector, according to certain embodiments of the present disclosure.

FIG. 14 is an illustrative vector space populated with feature vectors, according to certain embodiments of the present disclosure.

FIG. 15 is a diagram illustrating a deep feature space vector comparison, according to certain embodiments of the present disclosure.

FIGS. 16-17 are diagrams illustrating filtered feature space vector comparisons, according to certain embodiments of the present disclosure.

FIG. 18 is a diagram representing a process of receiving and processing text input to identify related images or articles, according to certain embodiments of the present disclosure.

FIG. 19 is an example diagram illustrating a comparison of extracted keywords with image tags, according to certain embodiments of the present disclosure.

FIG. 20 is an example of keyword analysis within a 3D word vector space, according to certain embodiments of the present disclosure.

FIG. 21 is a diagram illustrating a keyword-to-tag vector space analysis, according to certain embodiments of the present disclosure.

FIG. 22 is a diagram illustrating an example of homonymous image tags, according to certain embodiments of the present disclosure.

FIGS. 23-24 are diagrams illustrating an example disambiguation process, according to certain embodiments of the present disclosure.

FIGS. 25-28 are diagrams illustrating a process of comparing feature vectors and identifying related content within a content repository, according to certain embodiments of the present disclosure.

FIGS. 29-30 are example diagrams of a text document illustrating a topic extraction process, according to certain embodiments of the present disclosure.

FIGS. 31-35 are diagrams illustrating a process of identifying related articles based on input text data, according to certain embodiments of the present disclosure.

FIG. 36 is a diagram of an example semantic text analyzer system, according to certain embodiments of the present disclosure.

FIGS. 37-38 are example user interface screens showing image recommendations provided to a user during content creation, according to certain embodiments of the present disclosure.

FIG. 39 depicts a simplified diagram of a distributed system for implementing certain embodiments according to the present disclosure.

FIG. 40 is a simplified block diagram of one or more components of a system environment by which services provided by one or more components of the system may be offered as cloud services, according to certain embodiments of the present disclosure.

FIG. 41 illustrates an example computer system in which various embodiments may be implemented.

FIG. 42 is a diagram of an example computing environment configured to evaluate and rank content items from a content repository in response to input content received from a user or client system, according to certain embodiments of the present disclosure.

FIG. 43 is a flowchart illustrating a process for identifying and ranking content items related to user content, according to certain embodiments of the present disclosure.

FIG. 44 is an example screen of a content authoring user interface, according to certain embodiments of the present disclosure.

FIG. 45 shows an example table of a matching set of content items identified by a content recommendation system, according to certain embodiments of the present disclosure.

FIG. 46 shows another example table of a matching set of content items, including ranking scores, according to certain embodiments of the present disclosure.

FIG. 47 is another example screen of a content authoring user interface, according to certain embodiments of the present disclosure.

FIG. 48 illustrates an example of a content management system environment, according to an embodiment.

FIG. 49 illustrates an example use of a content management system for the management and delivery of content data, according to an embodiment.

FIG. 50 illustrates a flowchart of intelligent content categorization, according to an embodiment.

FIG. 51 illustrates a flowchart of taxonomy creation, according to an embodiment.

FIG. 52 illustrates a flowchart of taxonomy modification, according to an embodiment.

FIG. 53 illustrates a diagram of a sample taxonomy tree with categories, according to an embodiment.

FIG. 54 further illustrates a diagram of a sample taxonomy tree with categories, according to an embodiment.

FIG. 55 further illustrates a diagram of a sample taxonomy tree with categories, according to an embodiment.

FIG. 56 illustrates a diagram for configuring auto-categorization thresholds, according to an embodiment.

FIG. 57 illustrates a diagram of a configuration for triggering (bulk) recategorization of content in a repository, according to an embodiment.

FIG. 58 illustrates a problem with traditional clustering, in which clusters can be of arbitrary shape, according to an embodiment.

FIG. 59 illustrates micro-clusters, according to an embodiment.

FIG. 60 illustrates a sample topic distribution for an apparel customer, according to an embodiment.

FIG. 61 illustrates a visualization of cluster radii, according to an embodiment.

FIG. 62 illustrates a cluster representation, according to an embodiment.

FIG. 63 illustrates the macro stage of categorization, according to an embodiment.

FIG. 64 illustrates categorization once a higher-level category has been selected by the macro step.

FIG. 65 is an illustration of how cluster weights decay over time, and illustrates an example of a damped window model, according to an embodiment.

FIG. 66 illustrates a graphical view of how shadow clusters may emerge, according to an embodiment.

FIG. 67 illustrates a graphical view of how shadow clusters emerge, according to an embodiment.

FIG. 68 illustrates suggesting that a user create a new category from uncategorized content, according to an embodiment.

FIG. 69 is a flowchart of a method for intelligently categorizing content in a content management system, according to an embodiment.

Detailed Description

According to an embodiment, the invention is illustrated, by way of example and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals indicate similar elements. It should be noted that references to "an", "one", or "some" embodiment(s) in this disclosure are not necessarily to the same embodiment, and such references mean at least one. While specific implementations are discussed, it is understood that they are provided for illustrative purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the scope and spirit of the invention.

In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of the various implementations and examples. It will be apparent, however, that various implementations may be practiced without these specific details. For example, circuits, systems, algorithms, structures, techniques, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the implementations in unnecessary detail. The figures and description are not intended to be restrictive.

Some examples, such as those disclosed with respect to the figures in this disclosure, may be described as a process depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, a sequence diagram, or a block diagram. Although a sequence diagram or a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process is terminated when its operations are completed, but it could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

The processes depicted herein, such as those described with reference to the figures in this disclosure, may be implemented in software (e.g., code, programs) executed by one or more processing units (e.g., processor cores), in hardware, or in combinations thereof. The software may be stored in a memory (e.g., on a memory device, on a non-transitory computer-readable storage medium). In some examples, the processes depicted in the sequence diagrams and flowcharts herein can be implemented by any of the systems disclosed herein. The particular series of processing steps in this disclosure is not intended to be limiting. Other sequences of steps may also be performed according to alternative examples. For example, alternative examples of the present disclosure may perform the steps outlined above in a different order. Moreover, the individual steps illustrated in the figures may include multiple sub-steps that may be performed in various sequences as appropriate to the individual step. Furthermore, additional steps may be added or removed depending on the particular application. One of ordinary skill in the art would recognize many variations, modifications, and alternatives.

In some examples, each process in the figures of this disclosure can be performed by one or more processing units. A processing unit may include one or more processors, including single-core or multicore processors, one or more cores of processors, or combinations thereof. In some examples, a processing unit can include one or more special-purpose co-processors such as graphics processors, digital signal processors (DSPs), or the like. In some examples, some or all of the processing units can be implemented using customized circuits, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).

Certain embodiments described herein may be implemented as part of a Data Integration Platform Cloud (DIPC). In general, data integration involves combining data residing in different data sources and providing users with unified access to, and a unified view of, that data. This process often arises and becomes significant in many situations, such as merging a commercial entity with existing legacy databases. With the ever-increasing volume of data and the ability to analyze that data to provide useful results ("big data"), data integration is appearing more and more frequently in enterprise software systems. For example, consider a web application in which users can query a variety of types of travel information (e.g., weather, hotels, airlines, demographics, crime statistics, etc.). Instead of requiring all of these various data types to be stored in a single database with a single schema, an enterprise application can use the unified views and virtual schemas in the DIPC to combine many heterogeneous data sources so that they can be presented to the user in a unified view.
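The travel-information example above can be sketched as a thin layer that queries differently shaped sources and returns one combined record. This is only an illustration of the unified-view idea, not of how DIPC itself works: the two in-memory sources, their schemas, and the function names are all hypothetical.

```python
# Hypothetical source 1: weather data keyed by city, stored as nested dicts.
WEATHER_BY_CITY = {"Paris": {"temp_c": 18, "conditions": "cloudy"}}

# Hypothetical source 2: hotel data stored as (city, hotel name, nightly price) rows.
HOTEL_ROWS = [
    ("Paris", "Hotel A", 120),
    ("Rome", "Hotel B", 95),
]


def weather_source(city):
    """Look up weather for a city; returns an empty dict when unknown."""
    return WEATHER_BY_CITY.get(city, {})


def hotel_source(city):
    """Filter the hotel rows for a city and map them to named fields."""
    return [{"hotel": name, "price": price}
            for row_city, name, price in HOTEL_ROWS if row_city == city]


def unified_view(city):
    """Combine both sources into one record without merging their storage."""
    return {
        "city": city,
        "weather": weather_source(city),
        "hotels": hotel_source(city),
    }


view = unified_view("Paris")
print(view)
```

Each source keeps its own schema; only the adapter functions know how to translate it, which is the property that lets new sources be added without restructuring the existing ones.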

DIPC is a cloud-based platform for data transformation, integration, replication, and governance. It provides batch and real-time data movement between cloud and on-premises data sources while maintaining data consistency with default tolerance and resilience. The DIPC may be used to connect to various data sources and to prepare, transform, replicate, govern, and/or monitor data from those sources as they are combined into one or more data warehouses. The DIPC can work with any type of data source and support any type of data in any format. The DIPC can use a Platform as a Service (PaaS) or Infrastructure as a Service (IaaS) architecture to provide cloud-based data integration for an enterprise.

The DIPC may provide a number of different utilities, including transferring entire data sources to new cloud-based deployments and allowing cloud databases to be easily accessed from the cloud platform. Data can be streamed in real time to bring new data sources up to date and to keep any number of distributed data sources synchronized. Load can be divided among the synchronized data sources so that they remain highly available to end users. An underlying data management system can be used to reduce the amount of data moved over the network for deployment into database clouds, big data clouds, third-party clouds, and so forth. A drag-and-drop user interface can be used to execute reusable extract, load, and transform (ELT) functions and templates. Real-time test environments can be created to perform reporting and data analytics in the cloud on replicated data sources, so that the data remains highly available to end users. Data migrations can be executed with zero downtime using duplicated, synchronized data sources. The synchronized data sources can also be used for seamless disaster recovery that maintains availability.

FIG. 1 illustrates a computer system architecture that uses the DIPC to integrate data from various existing platforms, according to some embodiments. A first data source 102 may include a cloud-based storage repository. A second data source 104 may include an on-premises data center. To provide unified access to, and a unified view of, the first data source 102 and the second data source 104, the DIPC 108 can use an existing library of high-performance ELT functions 106 to replicate data from the first data source 102 and the second data source 104. The DIPC 108 can also extract, enrich, and transform the data as it is stored in the new cloud platform. The DIPC 108 can then provide access to any big data utilities that reside in, or are accessible by, the cloud platform. In some embodiments, the original data sources 102 and 104 may continue to provide access to customers, while the replicated data sources in the cloud platform can be used for testing, monitoring, governance, and big data analytics. In some embodiments, data governance may be provided to profile, cleanse, and govern the data sources within an existing set of customized dashboards in a user interface.

FIG. 2 illustrates one of the customized dashboards in the user interface, which can be used to configure, monitor, and control a service instance in the DIPC 108. A summary dashboard 202 can provide a control 204 that allows users to create a service instance. Next, a series of progressive web forms can be presented to walk the user through the types of information used to create the service instance. In a first step, the user is asked to provide a service name and description, along with an email address and a service edition type. The user may also be asked for a cluster size, which specifies the number of virtual machines used in the service. The service edition type determines which applications are installed on the virtual machines. In a second step and corresponding web form, the user may provide a running cloud database deployment to store the schemas of the DIPC server. The same database may later be used to store data entities and perform integration tasks. Additionally, a storage cloud may be specified and/or provisioned as a backup utility. The user may also provide credentials that can be used to access the existing data sources used in the data integration. In a third step, the provisioning information can be confirmed and the service instance can be created. The new service instance may then be displayed in the summary area 206 of the summary dashboard 202. From there, the user can access any information for any running data integration service instance.

FIG. 3 illustrates an architectural diagram of the DIPC, according to some embodiments. Requests may be received through a browser client 302, which may be implemented using a JavaScript Extension Toolkit (JET) set of components. Alternatively or additionally, the system may receive requests through a DIPC agent 304 that operates at a customer's on-premises data center 306. The DIPC agent 304 may include a data integrator agent 308 and an agent 310 for a replication service (such as a service from Oracle). Each of these agents 308, 310 may retrieve information from the on-premises data center 306 during normal operation and transmit data back to the DIPC using a connectivity service 312.

Incoming requests may be passed through a login service 314, which may include load balancing or other utilities for routing requests through the DIPC. The login service 314 may use an identity management service, such as an identity cloud service 316, to provide security and identity management for the cloud platform as part of an integrated enterprise security fabric. The identity cloud service 316 may manage user identities for both the cloud deployments and the on-premises applications described in this embodiment. In addition to the identity cloud service 316, the DIPC may also use a PaaS service manager (PSM) tool 318 to provide an interface for managing the lifecycle of platform services in the cloud deployment. For example, the PSM tool 318 may be used to create and manage instances of the data integration service in the cloud platform.

The DIPC may be implemented on a Web logic server 320 for building and deploying enterprise applications in a cloud environment. The DIPC may include a local repository 322 that stores data policies, design information, metadata, and audit data for information passing through the DIPC. It may also include a monitoring service 324 to populate the local repository 322. A directory service 326 may include a collection of machine-readable open APIs that provide access to many of the SaaS and PaaS applications in the cloud deployment. The directory service 326 may also be available to a search application 338 that uses a distributed indexing service, such as Apache. A connectivity service 328 and a mediator service 330 may manage connections and provide transformation, validation, and routing logic for information that passes through the DIPC. Information within the DIPC may be passed using an event-driven architecture (EDA) and a corresponding message bus 332.

The DIPC may also include an orchestration service 334. The orchestration service 334 may enable automation tasks by calling REST endpoints, scripts, third-party automation frameworks, and so forth. These tasks may then be executed by the orchestration service 334 to provide the DIPC functionality. The orchestration service 334 may use runtime services to import, transform, and store data. For example, an ELT runtime service 334 may execute the library of ELT functions described above, while a replication runtime service 342 may copy data from various data sources into a cloud-deployed DIPC repository 316. In addition, the DIPC may include a code generation service 336 that provides automatic code generation for both the ELT functions and the replication functions.

Smart Content – Smart Content Recommendations

As described above, when a user creates/authors original media content (e.g., articles, newsletters, emails, blog posts, etc.), it is often useful to enhance the authored content with relevant additional content, such as related images, audio/video clips, or links to related articles or other content. However, searching for such additional content and embedding it into the user's original authored content can be difficult in several respects. An initial difficulty may involve finding safe/reliable additional content from a trusted source and ensuring that the user/author is authorized to incorporate that content into their work. Moreover, even given such a secure and authorized content repository, locating any relevant content and incorporating/embedding it into the original authored content may be a manually intensive and inefficient process for the user/author.

Accordingly, certain aspects described herein relate to a smart digital content recommendation tool. In some embodiments, the smart digital content recommendation tool may be an artificial intelligence (AI) driven tool configured to process and analyze input content (e.g., text, images) from a content author in real time, and to recommend related images, additional text content, and/or other related media content (e.g., audio or video clips, graphics, social media posts, etc.) from one or more trusted content repositories. The smart digital content recommendation tool may communicate with multiple back-end services and content repositories, for example, to analyze text and/or visual input, extract keywords or topics from the input, classify and tag the input content, and store the classified/tagged content in one or more content repositories.

Additional aspects described herein (each of which may be performed directly via the smart digital content recommendation tool executing on a client operated by the content author, and/or indirectly by invoking various back-end services) may include (a) receiving original content in the form of text and/or images as input, (b) extracting keywords and/or topics from the original content, (c) determining and storing associated keyword and/or topic tags for the original content, (d) converting the original content (e.g., the input text and/or images) into a vector within a multi-dimensional vector space, (e) comparing that vector to a plurality of other content vectors, each of which represents an additional content item in a content repository, in order to find and identify various potentially relevant additional content related to the original content input authored by the user/author, and finally (f) retrieving the identified additional content and presenting it to the author via the smart digital content recommendation tool. In some embodiments, each additional content item (e.g., an image, a link to a related article or web page, an audio or video file, a graphic, a social media post, etc.) may be displayed and/or thumbnailed by the smart digital content recommendation tool within a GUI-based tool that allows the user to drag and drop or otherwise place the additional content into the user's original authored content, including positioning, formatting, and resizing the content.
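Steps (a) through (f) can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the stop-word list, the fixed four-term vocabulary, the repository item names, and the function names `extract_keywords`, `to_vector`, and `recommend` are all hypothetical, and a simple keyword-count vector stands in for whatever vectorization a real deployment would use.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; not taken from the source document.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "at", "to", "for", "is", "with", "our"}


def extract_keywords(text):
    """Steps (b)-(c): lower-case, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def to_vector(keywords, vocabulary):
    """Step (d): map keywords to a count vector over a fixed vocabulary."""
    counts = Counter(keywords)
    return [float(counts[term]) for term in vocabulary]


def euclidean(u, v):
    """Distance used for the comparison in step (e)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recommend(text, repository, vocabulary, top_k=2):
    """Steps (a)-(f): return the names of the top_k closest repository items."""
    query = to_vector(extract_keywords(text), vocabulary)
    ranked = sorted(repository, key=lambda name: euclidean(query, repository[name]))
    return ranked[:top_k]


# Hypothetical vocabulary and repository contents, for illustration only.
vocabulary = ["coffee", "cake", "server", "database"]
repository = {
    "cafe_photo.jpg": to_vector(["coffee", "cake"], vocabulary),
    "latte_article.html": to_vector(["coffee"], vocabulary),
    "db_tuning_guide.html": to_vector(["server", "database"], vocabulary),
}
result = recommend("Coffee and cake at our new cafe", repository, vocabulary)
print(result)
```

With this toy data the query maps to the same vector as `cafe_photo.jpg`, so that item ranks first and `latte_article.html` second. Words outside the vocabulary (here "new" and "cafe") are simply ignored, which is one limitation of a fixed-vocabulary count vector.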

Referring now to FIG. 4, a block diagram is shown illustrating the various components of a system 400 for smart content classification and recommendation, including a client device 410, a content input processing and analysis service 420, a content recommendation engine 425, a content management system 435, and a content retrieval and embedding service 445. In addition, the system 400 includes one or more content repositories 440 that store content files/resources, and one or more vector spaces 430. As described in more detail below, a vector space may refer to a multi-dimensional data structure configured to store one or more feature vectors. In some embodiments, the recommendation engine 425, the associated software components and services 420 and 445, the content management system 435, and the content repository 440 (which may be stored in one or more data stores or other data structures) may be implemented and stored as a back-end server system remote from the front-end client device 410. Thus, the interaction between the client device 410 and the content recommendation engine 425 may be an Internet-based web browsing session or a client-server application session, during which the user may input originally authored content via the client device 410 and receive content recommendations from the content recommendation engine 425 in the form of additional content that is retrieved from the content repository 440 and linked or embedded into the content authoring user interface at the client device 410. Additionally or alternatively, the content recommendation engine 425 and/or the content repository 440 and related services may be implemented as dedicated software components executing on the client device 410.

The various computing infrastructure elements shown in this example (e.g., the content recommendation engine 425, the software components/services 420, 435, and 445, and the content repository 440) may correspond to a high-level computer architecture created and maintained by an enterprise or organization that provides Internet-based services and/or content to various client devices 410. The content described herein (which may also be referred to as content resources and/or content files, content links, etc.) may be stored in one or more content repositories, retrieved and classified by the content recommendation engine 425, and provided to content authors at the client devices 410. In various embodiments, content authors may input content of various different media types or file types as original content at the client device 410, and similarly, content of various different media types or file types may be stored in the content repository 440 and recommended for, or embedded into, the front-end user interface of the client device 410. These media types authored by, or recommended to, content authors may include text (e.g., an authored letter, article, or blog), images (selected by or for the author), audio or video content resources, graphics, and social media content (e.g., posts, messages, or tweets).

In some embodiments, the system 400 shown in FIG. 4 may be implemented as a cloud-based multi-tier system, in which upper-tier user devices 410 may request and receive access to network-based resources and services via the content processing/analysis component 420, and in which the application servers may be deployed and executed on an underlying set of resources (e.g., cloud-based, SaaS, IaaS, PaaS, etc.) that includes hardware and/or software resources. Moreover, although a cloud-based system may be used in some embodiments, in other examples the system 400 may use on-premises data centers, server farms, distributed computing systems, and various other non-cloud computing architectures. Some or all of the functionality described herein for the content processing/analysis component 420, the content recommendation engine 425, the content management system 435, the content retrieval and embedding component 445, and the generation and storage of the vector spaces 430 may be performed by Representational State Transfer (REST) services and/or web services, including Simple Object Access Protocol (SOAP) web services or APIs, and/or web content exposed via the Hypertext Transfer Protocol (HTTP) or HTTP Secure protocol. Thus, although not shown in FIG. 4 so as not to obscure the illustrated components with additional detail, the computing environment 400 may include additional client devices 410, one or more computer networks, one or more firewalls 435, proxy servers, and/or other intermediate network devices to facilitate the interactions between the client devices 410, the content recommendation engine 425, and the back-end content repository 440. Another embodiment of a similar system 500 is shown in more detail in FIG. 5.

Referring briefly to FIG. 5, another example diagram of a computing environment 500 is shown, illustrating a data flow/data transformation diagram for performing content classification and recommendation. The computing environment 500 shown in this example may therefore correspond to one possible implementation of the computing environment 400 described above with reference to FIG. 4. In FIG. 5, several of the diagram boxes represent particular data states or data transformations, rather than the structural hardware and/or software components described above with reference to FIG. 4. Thus, box 505 may represent input content data received via a user interface. Box 510 represents the set of keywords determined by the system 400 based on the input content 505. As discussed above, the keywords 510 may be determined by the input processing/analysis component 420 using one or more keyword extraction and/or topic modeling processes, and a text feature vector 515 may be generated based on the determined keywords 510.

Continuing with the example shown in FIG. 5, a plurality of additional feature vectors 520 may be retrieved from the content repository 440. In this example, the additional feature vectors 520 may be selected from the content repository 440 by executing one or more neural-network-trained image models and providing the determined keywords 510 to the trained models. Based on the output of the trained models, the resulting feature vectors 520 may be further narrowed by excluding those whose feature vector probability is less than z%, yielding a subset of retrieved feature vectors 525. A feature space comparison 530 may then be performed between the test feature vector 515 and the subset of retrieved feature vectors 525. In some embodiments, and as shown in this example, a closest Euclidean distance calculation may be used to identify the retrieved feature vectors 525 that are closest to the test feature vector 515. Based on the feature space comparison 530, one or more recommendations 530 may be determined, each recommendation 530 being based on an associated feature vector 525 within a threshold proximity of the test feature vector 515, and each recommendation 530 corresponding to an image in the content repository 440.
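The two-stage narrowing in this example (a probability cut-off followed by a Euclidean distance comparison against a proximity threshold) can be sketched as follows. This is an illustrative sketch only: the candidate records, the `probability` field, and the parameters `min_probability` (standing in for the z% cut-off) and `max_distance` (standing in for the threshold proximity) are assumptions, and the trained image models that would produce the probabilities are not modeled here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filter_by_probability(candidates, min_probability):
    """Drop candidates whose model probability is below the cut-off (the z% in the text)."""
    return [c for c in candidates if c["probability"] >= min_probability]


def recommend(test_vector, candidates, min_probability, max_distance):
    """Return (distance, image) pairs within max_distance, closest first."""
    kept = filter_by_probability(candidates, min_probability)
    scored = [(euclidean(test_vector, c["vector"]), c["image"]) for c in kept]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


# Hypothetical candidates; the probabilities would come from a trained model.
candidates = [
    {"image": "cafe.jpg", "vector": [1.0, 1.0, 0.0], "probability": 0.92},
    {"image": "bakery.jpg", "vector": [0.0, 1.0, 0.0], "probability": 0.75},
    {"image": "office.jpg", "vector": [1.0, 1.0, 0.0], "probability": 0.30},
    {"image": "harbor.jpg", "vector": [0.0, 0.0, 3.0], "probability": 0.88},
]
recommendations = recommend([1.0, 1.0, 0.0], candidates, min_probability=0.5, max_distance=1.5)
print(recommendations)
```

In this toy data, `office.jpg` is removed by the probability cut-off even though its vector matches the test vector exactly, and `harbor.jpg` passes the cut-off but falls outside the distance threshold, so only `cafe.jpg` and `bakery.jpg` are recommended.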

The components shown in the system 400 for providing AI-based and feature-vector-analysis-based content recommendations and services to the client devices 410 may be implemented in hardware, software, or a combination of hardware and software. For example, web services may be generated, deployed, and executed within the data center 440 using underlying system hardware or software components such as data storage devices, network resources, computing resources (e.g., servers), and various software components. In some embodiments, the web services may correspond to different software components executing on the same underlying computer server(s), networks, and data stores, and/or within the same virtual machines. Some of the web-based content, computing infrastructure instances, and/or web services provided within the content recommendation engine 425 may use dedicated hardware and/or software resources, while others may share underlying resources (e.g., a shared cloud). In either case, certain higher-level services (e.g., user applications), as well as users at the client devices, need not be aware of the underlying resources used to support the services.

In such implementations, the various application servers, database servers, and/or cloud storage systems, and other infrastructure components such as web caches, network components, etc. (not shown in this example), may include various hardware and/or software components (e.g., application programming interfaces (APIs), cloud resource managers, etc.) to provide and monitor the classification and vectorization of content resources, and to manage the underlying storage/server/network resources. The underlying resources of the content repository 440 may be stored within a content database and/or cloud storage system, which may include, for example, a set of non-volatile computer memory devices implemented as databases, file-based storage, etc., a set of network hardware and software components (e.g., routers, firewalls, gateways, load balancers, etc.), a set of host servers, and various software resources corresponding to different versions of various platforms, servers, middleware, and application software (such as stored software images, installations, builds, templates, configuration files, etc.). The data center housing the application servers of the recommendation engine 425, the vector spaces 430, and the related services/components may also include additional resources, such as hypervisors, host operating systems, resource managers, and other cloud-based applications, along with the hardware and software infrastructure to support various Internet-based services, such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). In addition, the underlying hardware of the data center may be configured to support a number of internal shared services, which may include, for example, security and identity services, integration services, repository services, enterprise management services, virus scanning services, backup and recovery services, notification services, file transfer services, etc.

As described above, many different types of computer architectures (cloud-based, web-based, hosted, multi-tier computing environments, distributed computing environments, etc.) may be used to provide web-based content recommendations from the content recommendation engine 425 (which may be implemented via one or more content recommendation application servers) to the client devices 410, in accordance with the various embodiments described herein. In certain implementations, however, a cloud computing platform may be used to provide certain advantageous features for the generation and management of web-based content. For example, in contrast to non-cloud-based implementations with fixed architectures and limited hardware resources, a cloud computing platform may provide the elasticity and scalability to quickly provision, configure, and deploy many different types of computing infrastructure instances. Moreover, public cloud, private cloud, and public-private hybrid cloud platforms may be used in various embodiments to take advantage of the features and benefits of each distinct architecture.

In addition, as shown in this example, the system 400 also includes a content management system 435. In some embodiments, the content management system 435 may include a distributed storage processing system, one or more machine-learning-based classification algorithms (and/or non-machine-learning-based algorithms), and/or a storage architecture. As discussed in more detail below, in some embodiments the content management system 435 may access content resources (e.g., web-based articles, images, audio files, video files, graphics, social media content, etc.) via one or more content repositories 440 (e.g., network-based document stores, web-based content providers, etc.). For example, within the system 400, dedicated JavaScript or other software components may be installed and run on one or more application servers, database servers, and/or cloud systems that store content objects or network-based content. These software components may be configured to retrieve content resources (e.g., articles, images, web pages, documents, etc.) and transmit them to the content management system 435 for analysis and classification. For example, each time a user within the operating organization of the system 400 imports or creates new content (such as an image or an article), the software components may transmit the content back to the content management system 435 for the various processing and analysis described below (e.g., image processing, keyword extraction, topic analysis, etc.). Moreover, although the content management system 435 is depicted in this example as being implemented separately from the content recommendation engine 425 and the content repository 440, in other examples the content management system 435 may be implemented locally with either the content recommendation engine 425 and/or the storage devices storing the content repository 440. In that case it need not receive separate content transmissions from those devices, and may instead analyze and classify the content resources stored or provided by its respective system.

One or more vector spaces 430 may also be generated and used to store feature vectors corresponding to the different content items within the content repository 440, and to compare a feature vector for original authored content (e.g., received from the client device 410) with the feature vectors of the additional content items in the content repository 440. In some embodiments, multiple multi-dimensional feature spaces 430 may be implemented within the system 400, such as a first feature space 430a for the topics of text input/articles and a second feature space 430b for images. In other embodiments, additional separate multi-dimensional feature spaces 430 may be generated for different types of content media (e.g., a feature space for audio data/files, a feature space for video data/files, a feature space for graphics, a feature space for social media content, etc.). As discussed below, a comparison algorithm may be used to determine the distances between vectors within a feature space. Thus, in a feature space of image feature vectors, the algorithm may be used to identify the images closest to a received input image; in a feature space of text feature vectors, the algorithm may be used to identify the text (e.g., articles) closest to a received block of input text; and so on. Additionally or alternatively, the comparison algorithm may use the keywords/tags of the vector spaces to determine similarities between the various media types.
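One way to picture separate feature spaces per media type is a store keyed by media type, in which each query is compared only against vectors of the same type. The class below is a hypothetical in-memory sketch, not the structure of the vector spaces 430: the class name `FeatureSpaces`, its methods, the item identifiers, and the choice of Euclidean distance via the standard-library `math.dist` (Python 3.8 or later) are assumptions made for illustration.

```python
import math
from collections import defaultdict


class FeatureSpaces:
    """One in-memory feature space per media type (hypothetical structure)."""

    def __init__(self):
        self._spaces = defaultdict(dict)  # media_type -> {item_id: vector}

    def add(self, media_type, item_id, vector):
        """Store a vector, enforcing one dimensionality per feature space."""
        space = self._spaces[media_type]
        if space:
            expected = len(next(iter(space.values())))
            if len(vector) != expected:
                raise ValueError("vector dimension does not match this feature space")
        space[item_id] = list(vector)

    def nearest(self, media_type, query):
        """Return (item_id, distance) for the closest vector of the same media type."""
        space = self._spaces.get(media_type)
        if not space:
            return None
        item_id = min(space, key=lambda key: math.dist(query, space[key]))
        return item_id, math.dist(query, space[item_id])


spaces = FeatureSpaces()
spaces.add("text", "article-1", [0.9, 0.1])
spaces.add("text", "article-2", [0.1, 0.8])
spaces.add("image", "photo-1", [0.2, 0.4, 0.6])  # images use their own dimensionality

nearest_text = spaces.nearest("text", [1.0, 0.0])
nearest_audio = spaces.nearest("audio", [0.5])  # no audio space has been populated
print(nearest_text, nearest_audio)
```

Keeping the spaces apart means text and image vectors can have different dimensions and never get compared with each other by accident. Cross-media similarity, as the text notes, would instead rely on shared keywords/tags.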

In various implementations, the system 400 may be implemented using one or more computing systems and/or networks. These computing systems may include one or more computers and/or servers, which may be general-purpose computers, specialized server computers (e.g., desktop servers, UNIX servers, mid-range servers, mainframe computers, rack-mounted servers, etc.), server farms, server clusters, distributed servers, or any other appropriate arrangement and/or combination of computing hardware. The content recommendation engine 425 may run an operating system and/or a variety of additional server applications and/or mid-tier applications, including Hypertext Transfer Protocol (HTTP) servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, Java servers, database servers, and other computing systems. The content repository 440 may include database servers, for example, those commercially available from Oracle, Microsoft, etc. Each component in the system 400 may be implemented using hardware, firmware, software, or a combination of hardware, firmware, and software.

In various implementations, each component within the system 400 may include at least one memory, one or more processing units (e.g., processor(s)), and/or storage. The processing unit(s) may be implemented as appropriate in hardware (e.g., integrated circuits), computer-executable instructions, firmware, or a combination of hardware and instructions. In some examples, the various components of the system 400 may include several subsystems and/or modules. The subsystems and/or modules in the content recommendation engine 425 may be implemented in hardware, in software executing on hardware (e.g., program code or instructions executable by a processor), or in a combination thereof. In some examples, the software may be stored in a memory (e.g., a non-transitory computer-readable medium), on a memory device, or in some other physical memory, and may be executed by one or more processing units (e.g., one or more processors, one or more processor cores, one or more graphics processing units (GPUs), etc.). Computer-executable instruction or firmware implementations of the processing unit(s) may include computer-executable or machine-executable instructions written in any suitable programming language, which can perform the various operations, functions, methods, and/or processes described herein. The memory may store program instructions that are loadable and executable on the processing unit(s), as well as data generated during the execution of these programs. The memory may be volatile (such as random access memory (RAM)) and/or non-volatile (such as read-only memory (ROM), flash memory, etc.). The memory may be implemented using any type of persistent storage device, such as a computer-readable storage medium. In some examples, the computer-readable storage medium may be configured to protect the computer from electronic communications containing malicious code.

Referring now to FIG. 6, a flowchart is shown illustrating a process for generating feature vectors based on the content resources in the content repository 440 and storing the feature vectors in the feature spaces 430. As described below, the steps in this process may be performed by one or more components in the computing environment 400, such as the content management system 435, and the various subsystems and subcomponents implemented therein.

In step 602, content resources may be retrieved from the content repository 440 or another data repository. As discussed above, individual content resources (which may also be referred to as content or content items) may correspond to data objects of any of a variety of content types: text items (e.g., text files, articles, emails, blog posts, etc.), images, audio files, video files, 2D or 3D graphic objects, social media data items, etc. In some embodiments, content items may be retrieved from a specific content repository 440, such as a proprietary data repository owned and operated by a specific trusted organization. Although the content repository 440 may be an external data source (such as an Internet web server or other remote data store), a system 400 that retrieves and vectorizes content from a local and/or privately controlled content repository 440 may achieve several technical advantages in its operation, including ensuring that content from the repository 440 will be preserved and accessible when needed, and ensuring that users/authors are authorized to use and reproduce content from the repository 440. In some cases, the retrieval in step 602 (and the subsequent steps 604-608) may be triggered in response to a new content item being stored in the content repository 440 and/or a modification to an item in the content repository 440.

In step 604, the content items retrieved in step 602 may be parsed, analyzed, or otherwise processed to extract a set of item features or characteristics. The type of parsing, processing, feature extraction, and/or analysis performed in step 604 may depend on the type of content item. For image content items, artificial intelligence-based image classification tools may be used to identify specific image features and/or generate image tags. As shown in the example image of FIG. 7, image analysis may identify multiple image features (e.g., smile, waitress, counter, vending machine, coffee cup, cake, hand, food, person, cafe, etc.), and the image may be tagged with each of these identified features. For text-based content items (such as blog posts, letters, emails, articles, etc.), the analysis performed in step 604 may include keyword extraction and processing tools (e.g., stemming, synonym retrieval, etc.), as shown in FIG. 8. Either or both types of analysis (e.g., feature extraction from images as shown in FIG. 9, and keyword/topic extraction from text content as shown in FIG. 8) may be performed via REST-based services or other web services, using analytics, machine learning algorithms, and/or artificial intelligence (AI), such as an AI-based cognitive image analysis service, or a similar AI/REST cognitive text service for the text content of FIG. 8. Similar techniques may be used in step 604 for other types of content items, such as video files, audio files, graphics, or social media posts, where, depending on the media type of the content item, a dedicated web service of the system 400 is used to extract and analyze specific features (e.g., words, objects in an image/video, facial expressions, etc.).
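As an illustration of the text branch of step 604, the following is a minimal sketch of keyword extraction with stemming. It is not the cognitive text service referred to above: the stop-word list, the suffix-stripping rule in `crude_stem`, and the frequency-based ranking in `extract_keywords` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a production service would use a far larger one.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "on", "the", "to", "we", "with"}

def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text: str, top_n: int = 5) -> list[str]:
    """Return the top_n most frequent stemmed tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    # Sort by descending count, then alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [stem for stem, _ in ranked[:top_n]]

print(extract_keywords("Climbers climbing the mountain reached the summit of the mountain."))
# → ['climb', 'mountain', 'reached', 'summit']
```

Here "Climbers" and "climbing" collapse to the same stem, which is the effect stemming is meant to have before keywords are compared with tags.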

In step 606, after specific content features (e.g., visual objects, keywords, topics, etc.) have been extracted/determined from the content item, a feature vector may be generated based on the extracted/determined features. Using various transformation techniques, each set of features associated with a content item may be transformed into a vector that can be input into a common vector space 430. The transformation algorithm may output a predetermined vector format (e.g., a 1x4096-dimensional vector). Then, in step 608, the feature vector may be stored in one or more vector spaces 430 (e.g., a topic vector space 430a for text content, an image vector space 430b for image content, and/or a combined vector space for multiple content types). The vector spaces and the feature vectors stored therein may be generated and maintained by the content management system 435, the content recommendation engine 425, and/or other components of the system 400.
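Steps 606 and 608 can be pictured with a small sketch. The description does not specify the transformation technique, so the example below substitutes feature hashing as one simple way to turn an arbitrary set of feature strings into a fixed 1x4096 vector. The names `features_to_vector`, `store_vector`, and `vector_spaces` are hypothetical, and the in-memory dictionary merely stands in for the vector spaces 430a and 430b.

```python
import hashlib
from collections import defaultdict

DIM = 4096  # matches the example 1x4096 vector format

def features_to_vector(features: list[str], dim: int = DIM) -> list[float]:
    """Map feature strings to a fixed-length vector via feature hashing (an assumption)."""
    vec = [0.0] * dim
    for feature in features:
        # A cryptographic digest is used only to get a hash that is stable across runs.
        digest = hashlib.sha256(feature.lower().encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    return vec

# Hypothetical in-memory stand-in for the vector spaces, keyed by space name.
vector_spaces: dict[str, dict[str, list[float]]] = defaultdict(dict)

def store_vector(space: str, content_id: str, features: list[str]) -> None:
    """Step 608: store the generated vector under the content item's ID."""
    vector_spaces[space][content_id] = features_to_vector(features)

store_vector("image", "img-001", ["coffee cup", "waitress", "counter", "cafe"])
```

Because every content item is mapped to the same dimensionality, vectors produced this way can be compared directly with one another, which is the property the common vector space relies on.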

In some embodiments, a subset of the extracted/determined content features may also be saved as tags associated with the content item. FIGS. 9-11 show example processes for generating and storing image tags based on an image and, conversely, for retrieving images based on image tags. Although these examples relate to image content items, similar tagging processes and/or keyword or topic extraction may be performed on text content items, audio/video content items, and the like. As shown in FIG. 9, in step 901, an image may be created and/or uploaded to, for example, the content repository 440. In step 902, the image may be transmitted from the content repository 440 to an artificial intelligence (AI)-based REST service configured to analyze the image and extract themes, topics, specific visual features, and the like. The AI REST service may determine one or more specific image tags based on the identified image features, and in step 903 the image tags may be transmitted back to the content repository, to be stored within or in association with the image in step 904. FIG. 10 shows the same process as described in FIG. 9 for generating and storing image tags based on an image. In addition, FIG. 10 shows several illustrative features 1001 that may be implemented within the AI REST service in certain embodiments, including an image tag determination/retrieval component, an Apache MXNet component, and a cognitive image service. After one or more tags have been determined for a content item, the tags may be either stored back in the content repository 440 or stored in a separate storage location. For example, referring to the example image of FIG. 7, twelve or more potential image features may be extracted from this single image, all of which may be incorporated into the feature vector. However, the AI REST service and/or the content repository 440 may determine that, for content matching, it is optimal to tag the image with only a few of its most prevalent topics (e.g., coffee, retail).

Referring now briefly to FIG. 11, another example process is shown that relates to the processes described in FIGS. 9 and 10 for generating and storing image tags based on images. FIG. 11 illustrates the reverse process, in which image tags are used to retrieve matching images from the content repository 440. In step 1101, the content authoring user interface 415 or another front-end interface may determine one or more content tags based on input received via the interface. In this example, a single content tag ("waitress") is determined from the received user input, and in step 1102 the content tag is transmitted to a search API associated with the content repository 440. The search API may be implemented within one or more separate layers of the computing system, including within the content input processing/analysis component 420, the content recommendation engine 425, and/or the content management system 435. In step 1103, data identifying the matching images determined by the search API may be transmitted back to the content authoring user interface 415, to be integrated into the interface or otherwise presented to the user.
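The tag-to-image lookup of steps 1102 and 1103 can be sketched as an inverted index. This is only an illustration of the idea, not the search API itself: the `TagIndex` class, its methods, and the sample IDs and tags are hypothetical.

```python
from collections import defaultdict

class TagIndex:
    """Hypothetical inverted index from a tag to the IDs of items carrying it."""

    def __init__(self) -> None:
        self._by_tag: dict[str, set[str]] = defaultdict(set)

    def add(self, content_id: str, tags: list[str]) -> None:
        """Register a content item under each of its tags (case-insensitive)."""
        for tag in tags:
            self._by_tag[tag.lower()].add(content_id)

    def search(self, tags: list[str]) -> list[str]:
        """Return the IDs of items carrying every requested tag, sorted for stable output."""
        if not tags:
            return []
        # .get avoids creating empty entries for unknown tags.
        sets = [self._by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*sets))

index = TagIndex()
index.add("img-7", ["waitress", "cafe", "coffee cup"])
index.add("img-9", ["mountain", "summit"])
print(index.search(["Waitress"]))  # → ['img-7']
```

A search for the single tag "waitress", as in the example of FIG. 11, returns only the items that were stored with that tag.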

Thus, after steps 602-608 have been completed for a plurality of content resources in the content repository 440, one or more vector spaces 430 may be populated with vectors, where each vector corresponds to a content item in the repository 440. Additionally, in some embodiments, a separate set of metadata tags may be generated for some or all of the content items and stored as objects separate from the vectors in the vector space 430. Such tags may be stored in any of the data stores or components shown in FIG. 4, or in a separate data repository, and each tag may be associated with a content item in the repository 440, a vector in the vector space 430, or both.

Referring now to FIG. 12, another flow chart is shown illustrating a second process for receiving originally authored content from a user via a client device 410, extracting features and/or tags from the content in real time (or near real time) during the user's authoring session, vectorizing the authored content (also in real time or near real time), and comparing the vector of the originally authored content with one or more existing vector spaces 430 in order to identify and retrieve related/associated content from one or more available content repositories 440. The steps in this process may also be performed by one or more components within the computing environment 400, such as the content recommendation engine 425 in cooperation with the client device 410, the input processing/analysis component 420, and the retrieval/embedding component 445, and the various subsystems and subcomponents implemented therein.

In step 1202, originally authored content may be received from a user via the client device 410. As discussed above, the originally authored content may correspond to text typed by the user, a new image created or imported by the user, new audio or video input recorded or imported by the user, a new graphic created by the user, etc. Thus, step 1202 may be similar to step 602 discussed above. However, whereas the content in step 602 may be previously authored/stored content retrieved from the repository 440, the content in step 1202 may be newly authored content received via a user interface, such as a web-based text input control, an image importer control, an image creation control, an audio/video creation control, etc.

In step 1204, the content received in step 1202 (e.g., the originally authored content) may be processed, for example, by the input processing/analysis component 420. Step 1204 may be similar or identical to step 604 discussed above with respect to the parsing steps, processing steps, keyword/data feature extraction steps, etc. For example, for text input received in step 1202 (e.g., a blog post, letter, email, article, etc.), the processing in step 1204 may include text parsing, identification of keywords, stemming, synonym analysis/retrieval, etc. In other examples, when an image is received in step 1202 (which may be uploaded by the user, imported from another system, and/or manually created or modified by the user via the content authoring user interface 415), step 1204 may include using an AI-based image classification tool as described above to identify specific image features and/or generate image tags. These analyses in step 1204 may be performed via REST-based services or other web services, using analytics, machine learning algorithms, and/or AI (such as an AI-based cognitive image analysis service and/or an AI/REST cognitive text service). Similar techniques/services may be used in step 1204 for other types of content items, such as video files, audio files, graphics, or social media posts, where dedicated web services of the system 400 are used to extract and analyze specific features (e.g., words, objects in images/videos, facial expressions, etc.) depending on the media type of the content item. Step 1204 may also include any of the tagging processes described herein for tagging text blocks, images, audio/video data, and/or other content with tags corresponding to any identified content topics, categories, or features.

In step 1206, one or more vectors compatible with one or more of the available vector spaces 430 may be generated based on the content received in step 1202. Step 1206 may be similar or identical to step 606 discussed above. As discussed above, the vectors may be generated based on the specific features within the content (and/or tags) identified in step 1204. The vector generation process in step 1206 may use one or more data transformation techniques, whereby the set of features associated with the originally authored content item may be transformed into a vector compatible with one of the common vector spaces 430. For example, FIG. 13 illustrates a technique by which, in step 1206, an image input received in step 1202 may be transformed into a feature vector of a predetermined vector format (e.g., a 1x4096-dimensional vector). As shown in FIG. 13, the image may be provided as input to a model that is configured to extract and learn features in the image (e.g., using convolution, pooling, and other functions) and to output a feature vector representing the input image. For example, as described in the 2014 paper "Visualizing and Understanding Convolutional Networks" by Matthew D. Zeiler and Rob Fergus of New York University, which is incorporated herein by reference, within a convolutional neural network the initial layers may detect simple features (such as linear edges) from an image, while subsequent layers detect more complex shapes and patterns. For example, the first and/or second layers in a convolutional neural network may detect simple edges or patterns, while subsequent layers may detect the actual complex objects present in the image (cups, flowers, dogs, etc.). As an example, when a convolutional neural network receives and processes an image of a human face, the first layer may detect edges in various directions, the second layer may detect different parts of the given face (e.g., eyes, nose, etc.), and the third layer may obtain a feature map of the entire face.

In step 1208, the feature vector generated in step 1206 may be compared with the compatible feature vector space 430 (or spaces 430a-n) populated during steps 602-608 discussed above. For example, FIG. 14 shows an illustrative vector space populated with feature vectors corresponding to images. In this example, each point may represent a vectorized image, and the circles in FIG. 14 (and the corresponding point colors) may indicate one of three example tags associated with the images. In this case, the image tags are "coffee", "mountain", and "bird", and it should be understood that these tags are not mutually exclusive (i.e., an image may be tagged with one, two, or all three tags). In addition, it should be understood that these tags and the layout of the multidimensional vector space in FIG. 14 are merely illustrative. In various embodiments, there is no limit on the number or type of tags that may be used, or on the dimensionality of the vector space 430.

To perform the vector space comparison in step 1208, the content recommendation engine 425 may calculate the Euclidean distance between the feature vector generated in step 1206 and each of the other feature vectors stored in the one or more vector spaces 430. Based on the calculated distances, the engine 425 may rank the feature vectors in ascending order of feature space distance, so that the smaller the distance between two feature vectors, the higher the rank. Such a technique may allow the content recommendation engine 425 to determine the set of highest-ranked feature vectors within the vector space(s) 430, which are the most similar in features, characteristics, etc. to the feature vector generated in step 1206 from the input received in step 1202. In some cases, a predetermined number (N) of the highest-ranked feature vectors may be selected in step 1208 (e.g., the 5 most similar articles, the 10 most similar images, etc.), while in other cases all feature vectors satisfying a particular proximity threshold may be selected (e.g., distance between vectors < threshold (T)).
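The ranking described for step 1208 can be sketched in a few lines. This is a minimal illustration, assuming the vector space is held as an in-memory dictionary from content ID to vector; the function names `rank_by_distance`, `top_n`, and `within_threshold` are hypothetical, and the two-dimensional vectors are used only to keep the numbers easy to check.

```python
import math

def rank_by_distance(query, space):
    """Return (content_id, distance) pairs in ascending order of Euclidean distance."""
    scored = [(cid, math.dist(query, vec)) for cid, vec in space.items()]
    return sorted(scored, key=lambda pair: pair[1])

def top_n(query, space, n):
    """Select the N highest-ranked (closest) vectors."""
    return rank_by_distance(query, space)[:n]

def within_threshold(query, space, t):
    """Select every vector whose distance from the query is below threshold T."""
    return [(cid, d) for cid, d in rank_by_distance(query, space) if d < t]

space = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
query = [0.0, 0.0]
print(top_n(query, space, 2))              # → [('a', 0.0), ('c', 1.0)]
print(within_threshold(query, space, 2.0))  # → [('a', 0.0), ('c', 1.0)]
```

The two selection modes differ only in the cut-off applied to the same ranked list: a fixed count in one case, a maximum distance in the other.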

In some embodiments, the vector comparison in step 1208 may be a "deep feature space" comparison, as shown in FIG. 15. In these embodiments, the feature vector generated in step 1206 may be compared without regard to any tags or other metadata. In other words, in a deep feature comparison, the feature vector generated in step 1206 may be compared with every other feature vector stored in the vector space 430. Although a deep feature comparison is guaranteed to find the closest vectors in the vector space 430, such a comparison may require additional processing resources and/or additional time to return the vector results. This is especially true for large vector spaces that may include thousands or even millions of feature vectors, each of which represents a separate content object/resource stored in the repository 440. For example, to calculate the Euclidean distance between two image feature vectors of size 1x4096, approximately 10,000 addition and multiplication instructions must be executed by the system 400. Therefore, if there are 10,000 images in the repository, approximately 100,000,000 operations (10,000 comparisons of roughly 10,000 operations each) must be performed.
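The cost of a deep feature space comparison can be checked with a short calculation. The sketch below assumes one particular way of tallying operations (one subtraction and one multiplication per dimension, plus the additions that sum the squared differences, ignoring the final square root); the function names are hypothetical.

```python
def ops_per_distance(dim: int) -> int:
    """Operations for one squared Euclidean distance:
    dim subtractions + dim multiplications + (dim - 1) additions."""
    return dim + dim + (dim - 1)

def ops_for_full_scan(dim: int, num_vectors: int) -> int:
    """Operations to compare one query vector against every stored vector."""
    return ops_per_distance(dim) * num_vectors

print(ops_per_distance(4096))           # → 12287
print(ops_for_full_scan(4096, 10_000))  # → 122870000
```

Under this tally, a single 1x4096 comparison costs about 12,000 operations, the same order of magnitude as the approximately 10,000 cited above, and a full scan of 10,000 stored vectors costs on the order of 10^8 operations per query.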

Accordingly, in other embodiments, the vector comparison in step 1208 may be a "filtered feature space" comparison, as shown in FIGS. 16-17. In a filtered feature space comparison, the vector space may first be filtered based on tags (and/or other characteristics, such as resource media type, creation date, etc.) to identify the subset of feature vectors within the vector space(s) 430 whose tags (and/or other characteristics) match those of the feature vector generated in step 1206. The feature vector generated in step 1206 may then be compared only with the feature vectors in that matching subset. Thus, although there is a possibility of missing close feature vectors that were filtered out and never compared, a filtered feature space comparison can be performed faster and more efficiently than a deep feature space comparison.
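A filtered feature space comparison can be sketched as a tag filter followed by the same distance ranking. The example is an illustration under assumptions: items are plain dictionaries with hypothetical `id`, `vector`, and `tags` fields, and a "match" is taken to mean that the item shares at least one tag with the query.

```python
import math

def filtered_nearest(query_vec, query_tags, items, n=5):
    """Rank only the items sharing at least one tag with the query.

    items: list of dicts with 'id', 'vector', and 'tags' (a set of strings).
    """
    # Filter first, so distances are computed for the matching subset only.
    candidates = [it for it in items if it["tags"] & query_tags]
    ranked = sorted(candidates, key=lambda it: math.dist(query_vec, it["vector"]))
    return [it["id"] for it in ranked[:n]]

items = [
    {"id": "img-1", "vector": [0.1, 0.0], "tags": {"bird"}},
    {"id": "img-2", "vector": [1.0, 1.0], "tags": {"coffee"}},
    {"id": "img-3", "vector": [2.0, 2.0], "tags": {"coffee", "mountain"}},
]
print(filtered_nearest([0.0, 0.0], {"coffee"}, items))  # → ['img-2', 'img-3']
```

The sample data also shows the trade-off noted above: `img-1` is the closest vector to the query, but it is never compared because its tags do not match.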

As described above, step 1208 may include comparing the feature vector generated in step 1206 with a single vector space or with multiple vector spaces. In some embodiments, the feature vector generated in step 1206 may be compared with the corresponding type of vector space. For example, when text input is received in step 1202, the resulting feature vector may be compared with the vectors within the topic vector space 430a, and when an image is received as input in step 1202, the resulting feature vector may be compared with the vectors within the image vector space 430b, and so on. In some embodiments, it may be possible to compare a feature vector corresponding to one type of input with a vector space containing vectors of a different type (e.g., to identify the image resources most closely related to a text-based input, or vice versa). For example, FIG. 18 represents a process of receiving text input in step 1402 and retrieving, in step 1408, both similar images (e.g., the closest in the image vector space 430b) and similar articles (e.g., the closest in the topic vector space 430a).

For embodiments that involve retrieving and/or comparing tags associated with content resources, a problem may arise when a tag of one resource is related to, but does not exactly match, the corresponding tag/keyword/feature of another resource. An example of this potential problem is illustrated in FIG. 19, in which keywords extracted from an originally authored text content resource are compared with a set of image tags stored for a collection of image content resources. In this example, none of the extracted keywords ("Everest", "Base Camp", "Summit", "Mountain", or "Himalaya") exactly matches any of the image tags ("Mountaineer", "Cappuccino", or "Macaw"). In some embodiments, word/phrase parsing and processing techniques such as stemming, word definition lookup, and/or synonym retrieval and analysis may be used to detect matches between related but non-identical terms. However, these techniques may fail even for related keywords/tags. Therefore, in some embodiments, the content processing/analysis component 420 and/or the content recommendation engine 425 may perform a word vector comparison to address this problem. As shown in the example of FIG. 20, the keywords extracted from the text document in FIG. 19 may be analyzed within a 3D word vector space, and the distance between these keywords and each image tag may be calculated. As shown in FIG. 21, the keyword-to-tag vector space analysis performed in FIG. 20 may determine that the image tag "Mountaineer" is sufficiently close to the extracted keywords within the word vector space, and should therefore be treated as an image tag match for the filtered feature space comparison.
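The keyword-to-tag comparison in a word vector space can be sketched as follows. Every number here is made up for illustration: the 3D coordinates in `WORD_VECTORS` are hypothetical rather than taken from any trained embedding, and both the use of the mean distance and the threshold of 0.5 are assumptions about how "sufficiently close" might be decided.

```python
import math

# Hypothetical 3D word embeddings, invented for this example.
WORD_VECTORS = {
    "everest": (0.9, 0.8, 0.1),
    "summit": (0.8, 0.9, 0.2),
    "mountain": (0.85, 0.85, 0.1),
    "mountaineer": (0.8, 0.8, 0.3),
    "cappuccino": (0.1, 0.2, 0.9),
    "macaw": (0.2, 0.1, 0.1),
}

def matching_tags(keywords, tags, threshold=0.5):
    """Return tags whose mean distance to the extracted keywords is below threshold."""
    matches = []
    for tag in tags:
        dists = [math.dist(WORD_VECTORS[tag], WORD_VECTORS[k]) for k in keywords]
        if sum(dists) / len(dists) < threshold:
            matches.append(tag)
    return matches

print(matching_tags(["everest", "summit", "mountain"],
                    ["mountaineer", "cappuccino", "macaw"]))
# → ['mountaineer']
```

With these coordinates, "mountaineer" lies close to the mountain-related keywords even though no string matches, so it is the only tag treated as a match for the filtered comparison.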

Another potential problem that may arise in embodiments that retrieve and/or compare tags associated with content resources is caused by homographic keywords and/or resource tags. A homograph (or homonym) is a word or phrase that has the same spelling as another but a different and unrelated meaning. An example of homographic image tags is shown in FIG. 22, in which the first image is tagged with the word "Crane", referring to a bird with long legs and a long neck, and the second image is tagged with the same word "Crane", referring to a machine with a projecting arm for moving heavy objects. In this case, the content processing/analysis component 420 and/or the content recommendation engine 425 may perform a word sense disambiguation process on the two image tags to determine which sense each instance of the word "Crane" refers to. In this example, the word sense disambiguation process may initially retrieve the WordNet database entries (or other definition data) associated with each tag, as shown in FIG. 22 for the two different "Crane" tags.

An illustrative word sense disambiguation process is shown in FIGS. 23-24. In this process, other keywords within the authored document and/or the specific context of the word "Crane" in the document (e.g., description, part of speech, tense, etc.) may be used by the content processing/analysis component 420 and/or the content recommendation engine 425 to determine the most likely meaning of the word "Crane" within the authored text document, and thereby to determine which "Crane" image tag is relevant to the authored text document. For example, referring now to FIG. 23, an input text 2301 is shown, from which a number of relevant keywords 2302 have been extracted. The first extracted keyword ("Crane") may be compared with the image tags within the content repository 440, and in this example two matching tags 2303 have been identified, corresponding to two "Crane"-tagged images 2304a and 2304b within the content repository 440.

As shown in FIG. 24, to resolve this potential word sense ambiguity, the disambiguation process may continue by comparing one or more additional keywords 2302 extracted from the input content 2301 with the other content tags of the two matching images 2304a and 2304b. In this example, the additional extracted keywords "mechanical", "machine", "lifting", and "construction" may be compared with the content tags and/or extracted features associated with each of the images 2304a and 2304b. As shown in FIG. 24, these additional comparisons may disambiguate the initial keyword match for "Crane", so that the content recommendation engine 425 returns the construction crane image 2304b rather than the bird image 2304a.
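The disambiguation of FIG. 24 can be sketched as an overlap count between the document's keywords and each candidate image's tags. The function name `disambiguate` and the tag sets are hypothetical; the figure does not list the actual tags stored for images 2304a and 2304b, so the ones below are assumptions chosen to mirror the example.

```python
def disambiguate(keywords, candidates):
    """Return the candidate ID whose tags overlap most with the context keywords.

    candidates: dict mapping an image ID to its set of tags.
    """
    context = {k.lower() for k in keywords}

    def overlap(item):
        _, tags = item
        return len(context & {t.lower() for t in tags})

    best_id, _ = max(candidates.items(), key=overlap)
    return best_id

# Hypothetical tag sets for the two "Crane" images.
candidates = {
    "2304a": {"crane", "bird", "feathers", "beak", "wetland"},
    "2304b": {"crane", "machine", "construction", "lifting", "steel"},
}
keywords = ["crane", "mechanical", "machine", "lifting", "construction"]
print(disambiguate(keywords, candidates))  # → 2304b
```

Both images match on "crane" alone, so that keyword cannot separate them; the additional keywords "machine", "lifting", and "construction" are what tip the result toward image 2304b.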

In other examples, a similar disambiguation process may be performed using image similarity. For example, the content processing/analysis component 420 and/or the content recommendation engine 425 may identify common image features between an image associated with the authored content (e.g., a drawn or authored image) and the two different "Crane" images, in order to determine which crane is the appropriate related image. These disambiguation processes may also be combined in various ways, for example, by comparing keywords extracted from the authored text document with visual features extracted from the images. Thus, for an authored text document that refers to a crane and contains the related words "boom" and "pulley", those words may be visually matched with the lower crane image if a boom and a pulley can be visually identified in that image. Similarly, if the authored text document refers to a crane and contains the related words "beak" and "feathers", the "crane" keyword may be visually matched with the upper crane image if a beak and feathers can be visually identified in that image.

Referring now to FIGS. 25-28, an end-to-end example of performing the process of FIG. 12 is shown, specifically for an embodiment that retrieves, via the user interface 415, a collection of images related to an article authored by a user. Initially, at 2501 (FIG. 25), the user enters the text of an article in the Demo Editor (AIditor) user interface. At 2502, a plurality of keywords is extracted from the text of the article, and at 2503 the AI Rest Service compares the extracted keywords with the image tags stored for the image library within the image content repository 440. FIGS. 26 and 27 illustrate the same example process as FIG. 25, with additional detail about the operation of the AI Rest Service. As shown in FIG. 26, using the techniques discussed above, the AI Rest Service identifies one or more tags (e.g., "mountaineer") as related to the authored article. As shown in FIG. 27, in some embodiments different software services can be used to perform this step, such as a first cognitive text REST service that determines a collection of keywords from the text input, and a second, internal REST service that maps keywords to image tags. Each of these services can be implemented within the content recommendation engine 425 and/or via an external service provider. Then, at step 2505 (shown in FIG. 28), the content recommendation engine 425 can transmit the determined image tag(s) to a search API associated with the image content repository 440. In some cases, the search API can be implemented in a cloud-based content center (e.g., Oracle Content Management (OCM)). In step 2506, the search API can retrieve a collection of related images based on tag matching, and in step 2507 the retrieved images (or scaled-down versions of the images) can be transmitted back and embedded within the user interface 415 (at screen region 2810).
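By way of illustration only, the tag-matching step (keywords extracted from the article compared against stored image tags) can be sketched as follows. The function name `rank_images_by_tags`, the `IMAGE_TAGS` dictionary, and the overlap-count ranking are assumptions made for this sketch; the filing does not specify how the search API scores a match.

```python
from collections import Counter


def rank_images_by_tags(keywords, image_tags, limit=3):
    """Rank images by how many of their tags match the extracted keywords."""
    wanted = {k.lower() for k in keywords}
    scores = Counter()
    for image_id, tags in image_tags.items():
        overlap = wanted & {t.lower() for t in tags}
        if overlap:
            scores[image_id] = len(overlap)
    # Highest overlap first; ties are broken by image id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [image_id for image_id, _ in ranked[:limit]]


# Hypothetical tag store standing in for the image content repository 440.
IMAGE_TAGS = {
    "img-001": ["mountaineer", "snow", "summit"],
    "img-002": ["beach", "sunset"],
    "img-003": ["mountaineer", "rope"],
}

related = rank_images_by_tags(["Mountaineer", "summit", "oxygen"], IMAGE_TAGS)
print(related)
```

In this toy run, the image sharing two tags with the article is ranked ahead of the image sharing one, and the image with no matching tag is not returned.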

Although the example shown in FIGS. 25-28 illustrates a specific embodiment that retrieves a collection of images related to an article authored by a user, it should be understood that the steps of FIG. 12 can be performed similarly to retrieve other types of content. For example, similar steps can be performed to retrieve articles (or other text documents) related to text entered by the user via the user interface 415. In other embodiments, related content resources of other media types (e.g., audio files, video clips, graphics, social media posts, etc.) can also be retrieved. In addition, when a user imports or creates input other than text in the user interface (e.g., drawn or uploaded images, spoken audio input, video input, etc.), similar steps can be performed to retrieve various types of related content resources (e.g., related articles, images, videos, audio, social media, etc.), depending on the configuration of the content recommendation engine 425 and/or user preferences.

For example, referring now to FIGS. 29-35, another example embodiment is shown in which the processing steps of FIG. 12 are performed to retrieve a collection of related articles (or other text content resources) based on originally authored text input received via the user interface 415 (e.g., a user's blog post, email, article, etc.). As shown in FIG. 29, the user has authored a new article via the user interface 415, and an AI-based REST service invoked by the content recommendation engine 425 has identified a collection of article topics. As shown in FIG. 30, the identified article topics can be compared with the topics previously identified for a collection of articles in the article content repository 440. In these examples, FIG. 29 is shown according to one embodiment and FIG. 30 shows another embodiment; FIG. 29 is only a subset of FIG. 30, so FIG. 29 could safely be removed from the figures. Thus, article topics can be determined and stored as metadata or other associated data objects using techniques similar to the process described above with reference to FIG. 6, in which image features/tags are determined and stored in association with images. Similarly, for each article stored in the repository 440, the article content repository 440 can hold metadata or other associated storage, including article topics, dates, keywords, authors, publications, etc. In the example shown in FIG. 30, articles about deaths on Mount Everest have been determined to be potentially relevant to the user's newly created article, based on matching article topics. FIGS. 31-35 illustrate an end-to-end process of using the system 400 to find articles related to an article entered by the user, similar to the steps for finding related images shown in FIGS. 25-28. In step 3101 (FIG. 31), the user creates a new article via the user interface 415. In step 3102, the content recommendation engine 425 transmits the article text to one or more software services (e.g., an AI-based REST service), and in step 3103 (FIG. 32) the software service(s) use cognitive text service functionality to analyze the text of the article and determine one or more topics of the article. In step 3104, the determined article topics are transmitted back to the content recommendation engine 425, and in step 3105 the recommendation engine 425 sends both the article text and the identified topics to a separate API (e.g., within a cloud-based content center), where, in step 3106, the article can be saved to the repository 440 for future reference and indexed based on the identified topics. Also in step 3106 (FIG. 33), the existing repository 440 of articles can be searched via a search API, based on a topic matching process (FIG. 34), to identify potentially relevant topics. Finally, in step 3107 (FIG. 35), the articles identified as potentially relevant to the newly created article can be transmitted back (in their entirety or as links only) for embedding within the user interface 415 (e.g., at user interface region 3510).

As illustrated in the examples of FIGS. 29-33 above, certain embodiments described herein can include identifying the topics of a newly created text document and/or of text documents stored within the content repository 440, as well as comparing topics and identifying topic proximity and matches. In the various embodiments described herein, various techniques, including explicit semantic analysis, can be used for text topic evaluation and for topic "proximity". As shown in FIGS. 29-30, in some cases such techniques can use a large-scale data source (e.g., Wikipedia) to provide a fine-grained semantic representation of unrestricted natural language text, representing meaning in a high-dimensional space of natural concepts derived from the data source. For example, text classification techniques can be used to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. The semantic representation can be a feature vector of a text snippet, obtained through topic modeling. Wikipedia (or another large-scale data source) can be used to bring a larger vocabulary (e.g., a bag of words) into the system in order to cover a wide range of words. The Wikipedia-based concepts can be the titles of Wikipedia pages, which serve as classes/categories when classifying a given text snippet. Given a text snippet, the closest Wikipedia page titles can be returned (e.g., "Mount Everest", "Stephen Hawking", "Car Accident", etc.), and these can be used as the classes/categories of the text. The effectiveness of such techniques can be evaluated automatically by computing the semantic relatedness between snippets of natural language text.

One advantage of using a large-scale, publicly available knowledge source (e.g., Wikipedia or another encyclopedia) in these text classification/relatedness evaluation techniques is access to a vast amount of highly organized human knowledge that is already encoded in a publicly available resource and that is continually changing and growing. Machine learning techniques can be used to build, from Wikipedia and/or other sources, a semantic interpreter that maps a snippet of natural language text to a weighted sequence of Wikipedia concepts ordered by their relevance to the input. The input text can thus be represented as a weighted vector of concepts, called an interpretation vector, so that the meaning of a text snippet is interpreted in terms of its affinity with many Wikipedia concepts. The semantic relatedness of texts can then be computed by comparing their vectors in the space defined by the concepts, for example using the cosine metric. Such semantic analysis is explicit in the sense that it manipulates manifest concepts grounded in human cognition. Because user input can be received as plain text via the user interface 415, conventional text classification algorithms can be used to rank the concepts represented by these articles according to their relevance to a given text snippet. An online encyclopedia (e.g., Wikipedia) can therefore be used directly, without deep language understanding or pre-cataloged common-sense knowledge. In some embodiments, each Wikipedia concept can be represented as an attribute vector of the words that occur in the corresponding article. The entries of these vectors can be assigned weights using, for example, a term frequency-inverse document frequency (TF-IDF) scheme. These weights quantify the strength of association between words and concepts. To speed up semantic interpretation, an inverted index can be used, which maps each word to a list of the concepts in which the word appears. The inverted index can also be used to discard insignificant associations between words and concepts, by removing those concepts whose weight for a given word is below a certain threshold. The semantic interpreter can be implemented as a centroid-based classifier that ranks Wikipedia concepts by their relevance to a received text snippet. For example, the semantic interpreter in the content recommendation engine can receive an input text snippet T and represent the snippet as a vector (e.g., using the TF-IDF scheme). The semantic interpreter can iterate over the words of the text, retrieve the corresponding entries from the inverted index, and merge them into a weighted vector. The entries of the weighted vector reflect the relevance of the corresponding concepts to the text T. To compute the semantic relatedness of a pair of text snippets, their vectors can be compared using, for example, the cosine metric.
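As a minimal sketch of the first step described above, the TF-IDF weighting of each concept's attribute vector can be illustrated as follows. The toy `KNOWLEDGE_BASE`, the whitespace tokenizer, and the plain `tf * log(N / df)` formula are assumptions chosen for brevity; the filing does not fix a particular TF-IDF variant at this point.

```python
import math
from collections import Counter


def tfidf_attribute_vectors(articles):
    """Return {concept: {word: weight}} using tf * log(N / df)."""
    tokenized = {title: text.lower().split() for title, text in articles.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many articles each word occurs.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    vectors = {}
    for title, words in tokenized.items():
        counts = Counter(words)
        vectors[title] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return vectors


# Hypothetical three-article knowledge base; each title is one concept.
KNOWLEDGE_BASE = {
    "Mount Everest": "everest summit climbers summit",
    "Car Accident": "car crash road car",
    "Mountaineering": "climbers rope summit",
}

vectors = tfidf_attribute_vectors(KNOWLEDGE_BASE)
print(vectors["Car Accident"])
```

A word that occurs often in one article and rarely elsewhere (here "car") receives a larger weight than a word that occurs once, which is the association strength the paragraph above refers to.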

In other examples, a similar approach to generating features for text categorization can involve a supervised learning task in which the words that occur in the training documents are used as features. In some examples, therefore, Wikipedia concepts can be used to augment the bag of words. Computing the semantic relatedness of a pair of texts, on the other hand, is essentially a "one-off" task, so the bag-of-words representation can be replaced with a concept-based representation. These and other related techniques are described in more detail in the paper "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis" by Evgeniy Gabrilovich and Shaul Markovitch of the Department of Computer Science at the Technion (Israel Institute of Technology), which is incorporated herein by reference for all purposes, together with the other related material discussed therein. Using the techniques described in that paper and other techniques, for a filtered subset of Wikipedia there can be one concept per article, namely the article's title. When the content recommendation engine 425 receives a text document via the user interface 415, the text can first be summarized. In these cases, after stop words are removed and the remaining words are stemmed, each unique word in the text can be assigned a weight according to its frequency and inverse frequency in the articles. Each word can then be checked to see in which Wikipedia articles (concepts) it appears, so that the content recommendation engine 425 can create a concept vector for the word. The concept vectors of all the words in the text document can be combined to form a weighted concept vector for the text document. The content recommendation engine 425 can then measure the similarity between each word's concept vector and the text's concept vector. All words whose similarity is above a certain threshold can then be selected as "keywords" of the document.
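The keyword-selection procedure just described (per-word concept vectors, a combined document concept vector, and a similarity threshold) can be sketched as follows. The word-to-concept table `WORD_CONCEPTS`, the word weights, and the threshold of 0.5 are all hypothetical values chosen for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_keywords(word_weights, word_concepts, threshold=0.5):
    """Keep the words whose concept vector is close to the document's."""
    # Document concept vector: weighted sum of the per-word concept vectors.
    doc_vec = {}
    for word, weight in word_weights.items():
        for concept, strength in word_concepts.get(word, {}).items():
            doc_vec[concept] = doc_vec.get(concept, 0.0) + weight * strength
    return sorted(
        w for w in word_weights
        if cosine(word_concepts.get(w, {}), doc_vec) >= threshold
    )


# Hypothetical concept vectors for three (already stemmed) words.
WORD_CONCEPTS = {
    "summit": {"Mount Everest": 0.9, "Mountaineering": 0.6},
    "climber": {"Mount Everest": 0.5, "Mountaineering": 0.8},
    "traffic": {"Car Accident": 0.9},
}
WORD_WEIGHTS = {"summit": 2.0, "climber": 1.0, "traffic": 0.2}

print(select_keywords(WORD_WEIGHTS, WORD_CONCEPTS))
```

In this toy document, "summit" and "climber" point at the same concepts as the document as a whole and are kept, while "traffic" points at an unrelated concept and falls below the threshold.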

Referring now to FIG. 36, an example semantic text analyzer system 3600 is shown, illustrating techniques used by the analyzer system 3600 to perform semantic text summarization in certain of the embodiments described above. In various implementations, such a system 3600 can be incorporated into and/or separate from the content recommendation engine 425.

In some implementations, the explicit semantic analysis used to compute semantic relatedness is repurposed to compute a text summary of a given text document. More specifically, the text summary is derived based on word embeddings. In other words, in contrast to typical similarity measures (such as cosine over a bag of words or edit distance over strings), the context of the n-grams (e.g., words) is captured in order to determine semantic similarity.

The given text document can be an article, a web page, or other text for which a text summary is desired. As with the classification methods described herein, the text is not limited to written language and can include other human-readable symbols, numbers, diagrams, tables, equations, formulas, and the like.

The text summarization method using explicit semantic analysis generally operates as follows: (1) grammatical units (e.g., sentences or words) are extracted from the given text document using any known technique for identifying and extracting such units; (2) each extracted grammatical unit, and the text document itself, is represented as a weighted vector of knowledge base concepts; (3) the semantic relatedness between the text document as a whole and each grammatical unit is computed using the weighted vectors; and (4) one or more of the grammatical units that are most semantically related to the text document as a whole are selected for inclusion in the text summary of the text document. In some cases, the representation as weighted vectors of knowledge base concepts can correspond to topic modeling, in which each sentence or word is first converted into a feature vector and feature differences/similarities are then computed in a high-dimensional vector space. There are various ways to convert words into vectors, such as WORD2VEC or Latent Dirichlet Allocation.

FIG. 36 illustrates text summarization using explicit semantic analysis. First, a text summarizer is built based on a knowledge base 3602. The knowledge base 3602 can be general or domain-specific. An example of a general knowledge base is a collection of encyclopedia articles, such as a collection of Wikipedia articles or of text articles from another encyclopedia. Alternatively, the knowledge base 3602 can be domain-specific, such as a collection of text articles specific to a particular technical field, for example a collection of medical, scientific, engineering, or financial articles.

Each article of the knowledge base 3602 is represented as an attribute vector of the n-grams (e.g., words) that occur in the article. The entries in the attribute vector are assigned weights. For example, a term frequency-inverse document frequency (TF-IDF) scoring scheme can be used to assign the weights. The weights in an article's attribute vector quantify the strength of association between the n-grams (e.g., words) of the article and the article as a concept.

In some implementations, the term frequency-inverse document frequency scoring scheme computes a weight for a given n-gram t of a given article document d, as shown in the following equation:

Here, tf_{t,d} denotes the frequency of n-gram t in document d, and df_t denotes the document frequency of n-gram t in the knowledge base 3602. M denotes the total number of documents in the training set, L_d denotes the length of document d in number of terms, L_avg denotes the average document length in the training corpus, and k and b are free parameters. In some implementations, k is approximately 1.5 and b is approximately 0.75.
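The equation itself is not reproduced in this text. The symbols defined above (tf_{t,d}, df_t, M, L_d, L_avg, k, and b, with k near 1.5 and b near 0.75) are those of an Okapi BM25-style weight. Purely as an assumption, and not as the equation of the original filing, one common form that is consistent with those definitions is:

```latex
w_{t,d} \;=\;
\frac{tf_{t,d}\,(k + 1)}
     {tf_{t,d} + k\left(1 - b + b\,\dfrac{L_d}{L_{avg}}\right)}
\;\cdot\;
\log\frac{M - df_t + 0.5}{df_t + 0.5}
```

In this form the first factor saturates the term frequency and normalizes it by document length, and the second factor is the inverse document frequency; other BM25 variants differ in the exact logarithmic term.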

The foregoing is one example of a term frequency-inverse document frequency scoring scheme that can be used to weight the attributes of an attribute vector. Other statistical measures that reflect the importance of an attribute (e.g., an n-gram) to an article in the knowledge base 3602 can be used. For example, other TF/IDF variants that take anchor text into account (such as BM25F) can be used with certain types of knowledge bases (such as, for example, a knowledge base of web pages or of another hyperlinked document collection).

A weighted inverted index builder computer 3604 builds a weighted inverted index 3606 from the attribute vectors that represent the articles of the knowledge base 3602. The weighted inverted index 3606 maps each distinct n-gram represented in the set of attribute vectors to a concept vector of the concepts (articles) in which the n-gram appears. Each concept in the concept vector can be weighted according to the strength of association between the concept and the n-gram to which the weighted inverted index 3606 maps the concept vector. In some implementations, the index builder computer 3604 uses the inverted index 3606 to discard insignificant associations between n-grams and concepts, by removing from a concept vector those concepts whose weight for the given n-gram is below a threshold.
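A minimal sketch of building such a weighted inverted index, with pruning of weak associations, might look like the following. The function name `build_inverted_index`, the sample attribute vectors, and the threshold value are assumptions for illustration only.

```python
from collections import defaultdict


def build_inverted_index(attribute_vectors, min_weight=0.0):
    """Invert {concept: {ngram: weight}} into {ngram: {concept: weight}}.

    Associations whose weight is below min_weight are discarded.
    """
    index = defaultdict(dict)
    for concept, attrs in attribute_vectors.items():
        for ngram, weight in attrs.items():
            if weight >= min_weight:
                index[ngram][concept] = weight
    return dict(index)


# Hypothetical attribute vectors for two knowledge base articles.
ATTRIBUTE_VECTORS = {
    "Mount Everest": {"summit": 0.8, "snow": 0.05},
    "Mountaineering": {"summit": 0.4, "rope": 0.6},
}

index = build_inverted_index(ATTRIBUTE_VECTORS, min_weight=0.1)
print(index)
```

Here the weak association between "snow" and "Mount Everest" falls below the threshold and is dropped, so "snow" does not appear in the index at all.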

To generate a text summary of a given text document 3610, grammatical units 3608 are extracted from the given text document 3610, and the semantic relatedness between each grammatical unit and the given text document 3610 is computed. A number of grammatical units that have a high degree of semantic relatedness to the given text document 3610 are selected for inclusion in the text summary.

The number of grammatical units selected for inclusion in the text summary can vary based on a variety of factors. One approach is to select a predefined number of grammatical units. The predefined number can, for example, be configured by a user of the system or learned through a machine learning process. Another approach is to select all grammatical units whose degree of semantic relatedness to the given text document 3610 is above a predefined threshold. The predefined threshold can be configured by a user of the system or learned through a machine learning process. Yet another possible approach is to determine the grammatical unit with the highest degree of semantic relatedness to the given text document 3610, and then to select every other grammatical unit whose degree of semantic relatedness to the given text document 3610 differs from that highest degree by less than a predefined threshold. The grammatical unit with the highest degree, together with any other grammatical units that fall within the predefined threshold of it, is selected for inclusion in the text summary. Again, the predefined threshold can be configured by a user of the system or learned through a machine learning process.
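The three selection approaches described above can be sketched side by side as follows. The function names and the sample relatedness scores are hypothetical; the scores stand in for the semantic relatedness of each grammatical unit to the document.

```python
def select_top_n(scores, n):
    """Approach 1: keep a predefined number of the highest-scoring units."""
    ranked = sorted(scores, key=lambda unit: (-scores[unit], unit))
    return ranked[:n]


def select_above(scores, threshold):
    """Approach 2: keep every unit whose relatedness exceeds a threshold."""
    return [unit for unit in scores if scores[unit] > threshold]


def select_near_best(scores, max_gap):
    """Approach 3: keep the best unit and every unit within max_gap of it."""
    if not scores:
        return []
    best = max(scores.values())
    return [unit for unit in scores if best - scores[unit] < max_gap]


# Hypothetical relatedness scores for four extracted sentences.
SCORES = {"s1": 0.91, "s2": 0.85, "s3": 0.40, "s4": 0.88}

print(select_top_n(SCORES, 2))
print(select_above(SCORES, 0.86))
print(select_near_best(SCORES, 0.1))
```

The third approach adapts to the document: when many units score close to the best one, more of them are kept, whereas a fixed count or fixed threshold does not take the best score into account.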

In some implementations, the grammatical units with the highest or relatively high degrees of semantic relatedness to the given text document 3610 are not always the ones selected for inclusion in the text summary. For example, a first grammatical unit that has a lower degree of semantic relatedness to the given text document 3610 than a second grammatical unit can be selected for inclusion in the text summary, while the second grammatical unit is not selected, if the second grammatical unit is not sufficiently dissimilar to the grammatical units that have already been selected for inclusion in the text summary. The degree to which a grammatical unit is dissimilar to the existing text summary can be measured in a variety of ways, such as, for example, by a lexical approach, a probabilistic approach, or a hybrid of the two. Using a dissimilarity measure when selecting grammatical units for the text summary can prevent multiple similar grammatical units from being included in the same text summary.

In some implementations, other techniques can be used to select a number of grammatical units for inclusion in the text summary according to each unit's semantic relatedness to the given text document 3610 and its dissimilarity to one or more other units; implementations are not limited to any particular technique. For example, given a number of grammatical units whose semantic relatedness to the given text document 3610 is above a threshold, the dissimilarity among the units can be measured for each combination of multiple grammatical units, and the combination whose units are most dissimilar to one another can be selected for inclusion in the text summary. As a result, the grammatical units selected for the text summary are, as a whole, highly semantically related to the text document yet dissimilar to one another. Such a text summary is more useful than one containing highly related but similar grammatical units, because similar grammatical units are more likely than dissimilar ones to be redundant with one another in the information they convey.

Another possibility is to compute a composite similarity/dissimilarity measure for the grammatical units and then select the grammatical units to include in the text summary based on their composite scores. For example, the composite measure can be a weighted average of the semantic relatedness measure and the dissimilarity measure. One possible composite measure, computed as a weighted average, is:

(a * similarity) + (b * dissimilarity)

Here, the parameter similarity represents the semantic relatedness of a grammatical unit to the input text 3610 as a whole. For example, the parameter similarity can be the similarity estimate 3620 computed for the grammatical unit. The parameter dissimilarity represents a measure of how dissimilar the grammatical unit is to a set of one or more grammatical units. For example, that set can be the set of one or more grammatical units that have already been selected for inclusion in the text summary. The parameter a represents the weight applied to the similarity measure in the weighted average, and the parameter b represents the weight applied to the dissimilarity measure. The composite measure effectively balances the similarity measure and the dissimilarity measure against each other. The two can be balanced equally (e.g., a = 0.5 and b = 0.5). Alternatively, the similarity measure can be given more weight (e.g., a = 0.8 and b = 0.2).
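One way to apply the composite measure is a greedy loop that repeatedly adds the unit with the best composite score relative to the summary built so far. The sketch below is an assumption-laden illustration: the Jaccard word-overlap distance is used as a simple lexical dissimilarity measure, dissimilarity to the summary is taken as the distance to the closest already-selected unit, and the function names and sample scores are hypothetical.

```python
def jaccard_dissimilarity(a, b):
    """Lexical dissimilarity: 1 minus the Jaccard overlap of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0


def greedy_summary(similarity, n, a=0.5, b=0.5):
    """Pick n units by maximizing (a * similarity) + (b * dissimilarity)."""
    chosen = []
    candidates = dict(similarity)
    while candidates and len(chosen) < n:
        def composite(unit):
            # Dissimilarity to the summary so far: distance to the closest
            # unit already chosen (1.0 while the summary is still empty).
            dis = min(
                (jaccard_dissimilarity(unit, c) for c in chosen), default=1.0
            )
            return a * candidates[unit] + b * dis

        best = max(candidates, key=composite)
        chosen.append(best)
        del candidates[best]
    return chosen


# Hypothetical similarity estimates for three sentences of a document.
SIMILARITY = {
    "climbers reached the summit": 0.90,
    "the climbers reached the summit today": 0.88,
    "oxygen ran low at camp": 0.70,
}

print(greedy_summary(SIMILARITY, 2))
print(greedy_summary(SIMILARITY, 2, a=1.0, b=0.0))
```

With equal weights, the near-duplicate second sentence is passed over in favor of the less related but novel third sentence; with all weight on similarity (a = 1.0, b = 0.0), the two near-duplicates are selected together.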

The grammatical units extracted from the given text document can be sentences, phrases, paragraphs, words, n-grams, or other grammatical units. Where the grammatical units 3608 extracted from the given text document 3610 are words or n-grams, the process can be regarded as keyword generation, as opposed to text summarization.

The text summarizer 3612 accepts a piece of text. The piece of text is either the given text document 3610 or a grammatical unit thereof. The piece of text is represented as an "input" vector of weighted attributes (e.g., words or n-grams) of the piece of text. Each weight in the input vector is for a corresponding attribute (e.g., word or n-gram) identified in the piece of text and represents the strength of association between the piece of text and that attribute. For example, the weights can be computed according to a TF-IDF scheme or the like.

In some implementations, the weights of the attributes in the input vector are computed as follows:

Here, tf_{t,d} is the frequency of n-gram t in the piece of text d. The parameters k, b, L_d, and L_avg are the same as before, except that they are taken with respect to the knowledge base 3602 rather than the classification training set. In some implementations, k is approximately 1.5 and b is approximately 0.75.
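This equation is likewise not reproduced in this text. Since the only symbols listed for it are tf_{t,d}, k, b, L_d, and L_avg (with no M or df_t), one plausible reading, offered purely as an assumption and not as the equation of the original filing, is the BM25-style term-frequency factor on its own:

```latex
w_{t,d} \;=\;
\frac{tf_{t,d}\,(k + 1)}
     {tf_{t,d} + k\left(1 - b + b\,\dfrac{L_d}{L_{avg}}\right)}
```

Under this reading, the weight grows with the frequency of the n-gram in the piece of text but saturates, and it is discounted for pieces of text that are longer than the average knowledge base article.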

It should be noted that other weighting schemes are possible, and embodiments are not limited to any particular weighting scheme when forming the input vector. Forming the input vector can also include unit length normalization, such as described above with respect to the training data item vectors.

The text summarizer 3612 iterates over the non-zero weighted attributes of the input vector formed from the piece of text, retrieves from the weighted inverted index 3606 the attribute vectors corresponding to those attributes, and merges the retrieved attribute vectors into a weighted vector of concepts that represents the piece of text. This weighted vector of concepts is referred to hereinafter as a "concept" vector.

The attribute vectors retrieved from the weighted inverted index 3606 for the attributes of the input vector are each also vectors of weights. The weights in an attribute vector, however, quantify the strength of association between the respective concepts of the knowledge base 3602 and the attribute that the inverted index 3606 maps to that attribute vector.

The text summarizer 3612 creates a concept vector for the piece of text. The concept vector is a vector of weights. Each weight in the concept vector represents the strength of association between a respective concept of the knowledge base 3602 and the piece of text. The text summarizer 3612 computes each concept weight in the concept vector as a sum of values, one value for each attribute that has a non-zero weight in the input vector. Each value of the sum, for a given attribute, is computed as the product of (a) the weight of the attribute in the input vector and (b) the weight of the concept in the attribute vector for that attribute. Each concept weight in the concept vector reflects the relevance of the concept to the piece of text. In some implementations, the concept vector is normalized. For example, the concept vector can be normalized for unit length or for concept length (e.g., like the class length above).
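The sum-of-products computation described above can be sketched as follows. The function name `concept_vector`, the sample index, and the choice of unit length normalization are assumptions for illustration; the index is keyed by attribute, as in the weighted inverted index 3606.

```python
import math


def concept_vector(input_vector, inverted_index, normalize=True):
    """Merge per-attribute concept weights into one concept vector.

    Each concept weight is the sum, over attributes with non-zero input
    weight, of (attribute weight) * (concept weight for that attribute).
    """
    concepts = {}
    for attr, attr_weight in input_vector.items():
        if attr_weight == 0:
            continue
        for concept, concept_weight in inverted_index.get(attr, {}).items():
            concepts[concept] = (
                concepts.get(concept, 0.0) + attr_weight * concept_weight
            )
    if normalize:
        # Unit length normalization of the resulting concept vector.
        length = math.sqrt(sum(w * w for w in concepts.values()))
        if length:
            concepts = {c: w / length for c, w in concepts.items()}
    return concepts


# Hypothetical weighted inverted index: attribute -> {concept: weight}.
INDEX = {
    "summit": {"Mount Everest": 0.8, "Mountaineering": 0.4},
    "rope": {"Mountaineering": 0.6},
}

raw = concept_vector({"summit": 1.0, "rope": 0.5, "unknown": 0.3}, INDEX, normalize=False)
unit = concept_vector({"summit": 1.0, "rope": 0.5, "unknown": 0.3}, INDEX)
print(raw)
print(unit)
```

Attributes that are absent from the index (here "unknown") contribute nothing, and "Mountaineering" accumulates contributions from both "summit" and "rope".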

The text summarizer 3612 can generate a concept vector 3616 for the input text 3610 and a concept vector 3614 for each grammatical unit 3608. The vector comparator 3618 uses a similarity measure to compare the concept vector 3614 generated for a grammatical unit with the concept vector 3616 generated for the input text 3610, producing a similarity estimate 3620. In some embodiments, a cosine similarity measure is used. Embodiments are not limited to any particular similarity measure, and any similarity measure capable of measuring the similarity between two non-zero vectors can be used.

The similarity estimate 3620 quantifies the semantic relatedness between a grammatical unit and the input text 3610 from which the grammatical unit was extracted. For example, the similarity estimate 3620 can be a value between 0 and 1, inclusive, where values closer to 1 represent higher semantic relatedness and values closer to 0 represent lower semantic relatedness.
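
As an illustration of one such similarity measure, the sketch below computes cosine similarity between two sparse concept vectors. The function name `cosine_similarity` and the dictionary representation are assumptions made for the example. The result falls between 0 and 1 only when all weights are non-negative, which this sketch assumes for concept weights.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine similarity of two sparse vectors stored as {concept: weight}."""
    dot = sum(w * vector_b.get(concept, 0.0) for concept, w in vector_a.items())
    length_a = math.sqrt(sum(w * w for w in vector_a.values()))
    length_b = math.sqrt(sum(w * w for w in vector_b.values()))
    if length_a == 0 or length_b == 0:
        # The measure is only defined for non-zero vectors.
        raise ValueError("cosine similarity requires two non-zero vectors")
    return dot / (length_a * length_b)


# Illustrative vectors only; the concept names are hypothetical.
unit_concepts = {"Finance": 1.0, "Geography": 2.0}
text_concepts = {"Finance": 2.0, "Geography": 4.0}
unrelated_concepts = {"Sports": 3.0}

same_direction = cosine_similarity(unit_concepts, text_concepts)
no_overlap = cosine_similarity(unit_concepts, unrelated_concepts)
```

Vectors that point in the same direction score close to 1 regardless of their lengths, and vectors with no shared concepts score 0.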

A similarity estimate 3620 may be calculated for each grammatical unit 3608. The similarity estimates 3620 generated for the grammatical units 3608 may be used to select one or more of the grammatical units 3608 for inclusion in a text summary of the input text 3610 (or to select one or more keywords when generating keywords for the input text 3610).
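
One possible way to turn the similarity estimates into a summary is sketched below. The source does not specify a selection policy, so the top-k rule, the optional threshold, and the name `select_units` are assumptions made for the example.

```python
def select_units(units, similarity_estimates, k=2, min_score=0.0):
    """Pick the k highest-scoring grammatical units, kept in document order.

    units:                list of grammatical units (e.g., sentences)
    similarity_estimates: one score per unit, aligned by position
    """
    if len(units) != len(similarity_estimates):
        raise ValueError("each unit needs exactly one similarity estimate")
    ranked = sorted(
        range(len(units)), key=lambda i: similarity_estimates[i], reverse=True
    )
    chosen = sorted(i for i in ranked[:k] if similarity_estimates[i] >= min_score)
    return [units[i] for i in chosen]


# Illustrative data only.
sentences = ["A", "B", "C", "D"]
scores = [0.2, 0.9, 0.5, 0.7]
summary = select_units(sentences, scores, k=2)
strict_summary = select_units(sentences, scores, k=2, min_score=0.8)
```

Restoring document order after ranking keeps the selected units readable as a summary. A keyword generator could apply the same ranking to candidate keywords instead of sentences.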

The techniques described above have various applications in text summarization, providing accurate text summaries of longer texts (such as, for example, news stories, blog posts, journal articles, web pages, etc.).

In any or all of the above embodiments, after one or more content resources (e.g., images, articles, etc.) have been identified as potentially relevant to the content that the user is currently creating via the user interface 415, the relevant content resources are transmitted back to the content recommendation engine 425, where they can be retrieved, modified, and embedded into the user interface 415, for example by the content retrieval/embedding component 445. Using the retrieval/embedding component 445, potentially relevant content resources can be presented to the user via the user interface 415 in such a way that the user can optionally include them in the content currently being created. Two example user interfaces are shown in Figures 37 and 38, in which image recommendations are provided to the user during content creation. Figure 37 shows a media recommendation pane that includes images selected based on the text of the content the user is currently authoring via the user interface. In Figure 38, visual feature analysis has been used to select a set of images that may be related to the first image ("filename.JPG") selected by the user. Similar techniques and user interface screens can be used to allow users to select and drag and drop images, links to articles and other text documents, audio/video files, etc., and embed them into the content currently being created.

Figure 39 depicts a simplified diagram of a distributed system 3900 in which the various examples discussed above may be implemented. In the example shown, the distributed system 3900 includes one or more client computing devices 3902, 3904, 3906, 3908 coupled to a server 3912 via one or more communication networks 3910. The client computing devices 3902, 3904, 3906, 3908 may be configured to run one or more applications.

In various embodiments, the server 3912 may be adapted to run one or more services or software applications that enable one or more operations associated with the content recommendation system 400. For example, a user may use a client computing device 3902, 3904, 3906, 3908 (e.g., corresponding to the content author device 410) to access one or more cloud-based services provided via the content recommendation engine 425.

In various examples, the server 3912 may also provide other services or software applications, and may include non-virtual and virtual environments. In some examples, these services may be provided to users of the client computing devices 3902, 3904, 3906, 3908 as web-based services or cloud services, such as under a software as a service (SaaS) model. Users operating the client computing devices 3902, 3904, 3906, 3908 may in turn use one or more client applications to interact with the server 3912 and use the services provided by these components.

In the configuration depicted in Figure 39, the server 3912 may include one or more components 3918, 3920, and 3922 that implement the functions performed by the server 3912. These components may include software components that may be executed by one or more processors, hardware components, or combinations thereof. It should be appreciated that a variety of different system configurations are possible, which may differ from the example distributed system 3900.

The client computing devices 3902, 3904, 3906, 3908 may include various types of computing systems, such as portable handheld devices, such as smartphones and tablets; general-purpose computers, such as personal computers and laptops; workstation computers; wearable devices, such as head-mounted displays; gaming systems, such as handheld gaming devices, gaming consoles, and Internet-enabled gaming devices; thin clients; various messaging devices; sensors and other sensing devices; and the like. These computing devices may run various types and versions of software applications and operating systems (e.g., Microsoft and Apple operating systems, UNIX or UNIX-like operating systems, Linux or Linux-like operating systems), including various mobile operating systems (e.g., Microsoft Windows, Android™, Palm). A client device may be capable of executing a variety of different applications, such as various Internet-related applications and communication applications (e.g., email applications, short message service (SMS) applications), and may use various communication protocols. A client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via this interface. Although Figure 39 depicts only four client computing devices, any number of client computing devices may be supported.

The network(s) 3910 in the distributed system 3900 may be any type of network familiar to those skilled in the art that can support data communications using any of a variety of available protocols, including, but not limited to, TCP/IP (Transmission Control Protocol/Internet Protocol), SNA (Systems Network Architecture), IPX (Internetwork Packet Exchange), AppleTalk, etc. Merely by way of example, the network(s) 3910 may be a local area network (LAN), an Ethernet-based network, a token ring, a wide area network, the Internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a public switched telephone network (PSTN), an infrared network, a wireless network (e.g., a network operating under any of the Institute of Electrical and Electronics Engineers (IEEE) 802.11 suite of protocols and/or any other wireless protocol), and/or any combination of these and/or other networks.

The server 3912 may be composed of one or more general-purpose computers, dedicated server computers (including, by way of example, PC (personal computer) servers, mid-range servers, mainframe computers, rack-mounted servers, etc.), server farms, server clusters, or any other suitable arrangement and/or combination. The server 3912 may include one or more virtual machines running virtual operating systems, or other computing architectures involving virtualization, such as one or more flexible pools of logical storage devices that can be virtualized to maintain virtual storage devices for the server. In various examples, the server 3912 may be adapted to run one or more services or software applications that perform the operations described above.

The server 3912 can run an operating system, including any of the operating systems discussed above as well as any commercially available server operating system. The server 3912 can also run any of a variety of additional server applications and/or middle-tier applications, including HTTP (Hypertext Transfer Protocol) servers, FTP (File Transfer Protocol) servers, CGI (Common Gateway Interface) servers, database servers, etc. Examples of database servers include, but are not limited to, database servers commercially available from Oracle, Microsoft, Sybase, IBM (International Business Machines), etc.

In some embodiments, the server 3912 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client computing devices 3902, 3904, 3906, 3908. As examples, the data feeds and/or event updates may include, but are not limited to, feeds or real-time updates received from one or more third-party information sources and continuous data streams, which may include real-time events related to sensor data applications, financial tickers, network performance measurement tools (e.g., network monitoring and traffic management applications), clickstream analysis tools, automobile traffic monitoring, etc. The server 3912 may also include one or more applications that display the data feeds and/or real-time events via one or more display devices of the client computing devices 3902, 3904, 3906, 3908.

The distributed system 3900 may also include one or more data repositories 3914, 3916. These data repositories may provide a mechanism for storing various types of information, such as the information described by the various examples discussed above. The data repositories 3914, 3916 may reside in a variety of locations. For example, a data repository used by the server 3912 may be local to the server 3912, or may be remote from the server 3912 and communicate with the server 3912 via a network-based or dedicated connection. The data repositories 3914, 3916 may be of different types. In some examples, a data repository used by the server 3912 may be a database, for example a relational database, such as those provided by database vendors. One or more of these databases may be adapted to store, update, and retrieve data in the database in response to SQL-formatted commands.

In some examples, applications may also use one or more of the data repositories 3914, 3916 to store application data. The data repositories used by applications may be of different types, such as, for example, a key-value store, an object store, or a general-purpose store backed by a file system.

In some examples, a cloud environment can provide one or more services, such as the services discussed above. Figure 40 is a simplified block diagram of one or more components of a system environment 4000 in which these and other services can be provided as cloud services. In the example shown in Figure 40, the cloud infrastructure system 4002 can provide one or more cloud services that can be requested by users using one or more client computing devices 4004, 4006, and 4008. The cloud infrastructure system 4002 may include one or more computers and/or servers, which may include those described above for the server 3912 of Figure 39. The computers in the cloud infrastructure system 4002 can be organized as general-purpose computers, dedicated server computers, server farms, server clusters, or any other suitable arrangement and/or combination.

The network(s) 4010 can facilitate communication and data exchange between the clients 4004, 4006, and 4008 and the cloud infrastructure system 4002. The network(s) 4010 can include one or more networks. The networks can be of the same or different types. The network(s) 4010 can support one or more communication protocols, including wired and/or wireless protocols, to facilitate communication.

The example depicted in Figure 40 is merely one example of a cloud infrastructure system and is not intended to be limiting. It should be appreciated that in other examples, the cloud infrastructure system 4002 may have more or fewer components than those shown in Figure 40, may combine two or more components, or may have a different configuration or arrangement of components. For example, while Figure 40 depicts three client computing devices, any number of client computing devices may be supported in other examples.

The term "cloud service" is generally used to refer to a service that is made available to users on demand, via a communication network such as the Internet, by a service provider's system (e.g., the cloud infrastructure system 4002). Typically, in a public cloud environment, the servers and systems that make up the cloud service provider's system are different from the customer's own on-premises servers and systems. The cloud service provider's system is managed by the cloud service provider. Customers can therefore use cloud services provided by the cloud service provider without having to purchase separate licenses, support, or hardware and software resources for the services. For example, a cloud service provider's system can host an application, and a user can order and use the application on demand and on a self-service basis via the Internet, without having to purchase infrastructure resources for executing the application. Cloud services are designed to provide easy, scalable access to applications, resources, and services. Several providers offer cloud services. For example, a provider may offer several cloud services, such as middleware services, database services, Java cloud services, and others.

In various examples, the cloud infrastructure system 4002 may provide one or more cloud services using different models, such as a software as a service (SaaS) model, a platform as a service (PaaS) model, an infrastructure as a service (IaaS) model, and other models including hybrid service models. The cloud infrastructure system 4002 may include a suite of applications, middleware, databases, and other resources that enable the provisioning of the various cloud services.

The SaaS model enables an application or software to be delivered to a customer as a service over a communication network such as the Internet, without the customer having to purchase the hardware or software for the underlying application. For example, the SaaS model can be used to provide customers with access to on-demand applications hosted by the cloud infrastructure system 4002. Examples of SaaS services include, but are not limited to, various services for human resources/capital management, customer relationship management (CRM), enterprise resource planning (ERP), supply chain management (SCM), enterprise performance management (EPM), analytics services, social applications, and others.

The IaaS model is generally used to provide infrastructure resources (e.g., servers, storage, hardware, and networking resources) to customers as a cloud service, in order to provide elastic compute and storage capabilities. A provider may offer a variety of IaaS services.

The PaaS model is generally used to provide platform and environment resources as a service, enabling customers to develop, run, and manage applications and services without having to procure, build, or maintain such resources. Examples of PaaS services include, but are not limited to, Oracle Java Cloud Service (JCS), Oracle Database Cloud Service (DBCS), data management cloud services, various application development solution services, and others.

In some examples, resources in the cloud infrastructure system 4002 can be shared by multiple users and dynamically reallocated according to demand. In addition, resources can be allocated to users in different time zones. For example, the cloud infrastructure system 4002 can enable a first set of users in a first time zone to use resources of the cloud infrastructure system for a specified number of hours, and then enable the same resources to be reallocated to another set of users located in a different time zone, thereby maximizing the utilization of the resources.

The cloud infrastructure system 4002 can provide cloud services via different deployment models. In a public cloud model, the cloud infrastructure system 4002 can be owned by a third-party cloud service provider, and the cloud services are offered to any general public customer, where the customer can be an individual or an enterprise. In certain other embodiments, under a private cloud model, the cloud infrastructure system 4002 can be operated within an organization (e.g., within an enterprise organization) and provide services to customers within the organization. For example, the customers can be various departments of an enterprise, such as a human resources department or a payroll department, or even individuals within the enterprise. In certain other embodiments, under a community cloud model, the cloud infrastructure system 4002 and the services provided can be shared by several organizations in a related community. Various other models can also be used, such as hybrids of the models mentioned above.

The client computing devices 4004, 4006, 4008 may be devices similar to those described above for the client computing devices 3902, 3904, 3906, 3908 of Figure 39. The client computing devices 4004, 4006, 4008 of Figure 40 may be configured to operate a client application, such as a web browser, a proprietary client application (e.g., Oracle Forms), or some other application, that a user of the client computing device can use to interact with the cloud infrastructure system 4002 and use the services it provides.

In various examples, the cloud infrastructure system 4002 may also provide "big data" and related computation and analysis services. The term "big data" is generally used to refer to very large data sets that can be stored and manipulated by analysts and researchers to visualize large amounts of data, detect trends, and/or otherwise interact with the data. The analysis that the cloud infrastructure system 4002 can perform may involve using, analyzing, and manipulating large data sets to detect and visualize various trends, behaviors, relationships, etc. within the data. Such analysis may be performed by one or more processors, possibly processing the data in parallel, performing simulations using the data, and so on. The data used for such analysis may include structured data (e.g., data stored in a database or structured according to a structured model) and/or unstructured data (e.g., data blobs (binary large objects)).

As depicted in the embodiment of Figure 40, the cloud infrastructure system 4002 may include infrastructure resources 4030 that are used to facilitate the provisioning of the various cloud services provided by the cloud infrastructure system 4002. The infrastructure resources 4030 may include, for example, processing resources, storage or memory resources, networking resources, and the like.

In some examples, to facilitate efficient provisioning of these resources in support of the various cloud services that the cloud infrastructure system 4002 provides for different customers, the resources can be bundled into sets of resources or resource modules (also referred to as "pods"). Each resource module or pod can include a pre-integrated and optimized combination of one or more types of resources. In some examples, different pods can be pre-provisioned for different types of cloud services. For example, a first set of pods can be provisioned for a database service, and a second set of pods can be provisioned for a Java service, where the second set of pods can include a different combination of resources than the pods in the first set. For some services, the resources allocated for provisioning the services can be shared between the services.

The cloud infrastructure system 4002 itself may internally use services 4032 that are shared by different components of the cloud infrastructure system 4002 and that facilitate the provisioning of services by the cloud infrastructure system 4002. These internally shared services may include, but are not limited to, security and identity services, integration services, enterprise repository services, enterprise manager services, virus scanning and whitelisting services, high availability, backup and recovery services, services for enabling cloud support, email services, notification services, file transfer services, and the like.

In various examples, the cloud infrastructure system 4002 may include multiple subsystems. These subsystems may be implemented in software, hardware, or a combination thereof. As shown in Figure 40, the subsystems may include a user interface subsystem 4012 that enables users or customers of the cloud infrastructure system 4002 to interact with the cloud infrastructure system 4002. The user interface subsystem 4012 may include various interfaces, such as a web interface 4014, an online store interface 4016 (where cloud services provided by the cloud infrastructure system 4002 are advertised and can be purchased by consumers), and other interfaces 4018. For example, a customer may use a client device to request (service request 4034) one or more services provided by the cloud infrastructure system 4002 using one or more of the interfaces 4014, 4016, and 4018. For example, a customer may access the online store, browse the cloud services offered by the cloud infrastructure system 4002, and place a subscription order for one or more of the offered services that the customer wishes to subscribe to. The service request may include information identifying the customer and the one or more services that the customer wishes to subscribe to. For example, a customer may place an order for services such as those discussed above. As part of the order, the customer may provide information identifying, among other things, the amount of resources the customer requires and/or the time frame in which they are needed.

In some examples, such as the example depicted in Figure 40, the cloud infrastructure system 4002 may include an order management subsystem (OMS) 4020 configured to process new orders. As part of this processing, the OMS 4020 may be configured to: generate an account for the customer (if one has not already been created); receive billing and/or accounting information from the customer, which will be used to bill the customer for providing the requested services; verify the customer information; after verification, book the order for the customer; and orchestrate various workflows to, among other things, prepare the order for provisioning.

Once the order has been properly verified, the OMS 4020 may invoke the order provisioning subsystem (OPS) 4024, which is configured to provision resources for the order, including processing resources, memory resources, and networking resources. Provisioning may include allocating resources for the order and configuring the resources to facilitate the services requested by the customer order. The manner in which resources are provisioned for an order, and the types of resources provisioned, may depend on the type of cloud service the customer has ordered. For example, according to one workflow, the OPS 4024 may be configured to determine the particular cloud service being requested and to identify a number of pods that may have been pre-configured for that particular cloud service. The number of pods allocated for an order may depend on the size/quantity/level/scope of the requested service. For example, the number of pods to be allocated may be determined based on the number of users to be supported by the service, the duration for which the service is being requested, and the like. The allocated pods may then be customized for the particular requesting customer in order to provide the requested service.
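
The pod allocation step described above might look like the sketch below. It is only an illustration under stated assumptions: the source does not give a sizing formula, so the per-pod capacity `users_per_pod`, the function name `allocate_pods`, and the pool layout keyed by service type are all hypothetical.

```python
import math


def allocate_pods(pool, service_type, num_users, users_per_pod=100):
    """Take pre-provisioned pods for one service type out of a shared pool.

    pool: {service type: list of available pod identifiers}
    The sizing rule (one pod per `users_per_pod` users, at least one pod)
    is an assumption for illustration, not a rule stated in the source.
    """
    if users_per_pod <= 0:
        raise ValueError("users_per_pod must be positive")
    available = pool.get(service_type, [])
    needed = max(1, math.ceil(num_users / users_per_pod))
    if needed > len(available):
        raise RuntimeError(
            f"need {needed} pods for {service_type!r}, only {len(available)} available"
        )
    allocated = available[:needed]
    pool[service_type] = available[needed:]  # remaining pods stay in the pool
    return allocated


# Illustrative pool with hypothetical pod identifiers.
pod_pool = {
    "database": ["db-pod-1", "db-pod-2", "db-pod-3"],
    "java": ["java-pod-1"],
}
order_pods = allocate_pods(pod_pool, "database", num_users=150)
```

A fuller model could also weigh the requested duration or service level, which the description names as further inputs to the pod count.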

The cloud infrastructure system 4002 can send a response or notification 4044 to the requesting customer to indicate when the requested service is ready for use. In some cases, information (e.g., a link) can be sent to the customer that enables the customer to start using the requested service and benefiting from it.

The cloud infrastructure system 4002 can provide services to multiple customers. For each customer, the cloud infrastructure system 4002 is responsible for managing information related to one or more subscription orders received from the customer, maintaining customer data related to the orders, and providing the requested services to the customer. The cloud infrastructure system 4002 can also collect usage statistics regarding the customer's use of the subscribed services. For example, statistics can be collected for the amount of storage used, the amount of data transferred, the number of users, the amount of system uptime and system downtime, and the like. This usage information can be used to bill the customer. Billing can be performed, for example, on a monthly cycle.
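
A usage-based monthly charge built from such statistics could be sketched as follows. The record fields mirror the statistics named above, but the class `UsageRecord`, the function `monthly_charge`, and every rate are hypothetical values chosen for the example, not pricing described in the source.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Usage statistics collected for one customer over one billing cycle."""
    storage_gb: float
    data_transferred_gb: float
    users: int


def monthly_charge(usage, rates):
    """Sum per-unit charges; `rates` maps each metric to a hypothetical unit price."""
    return (
        usage.storage_gb * rates["storage_gb"]
        + usage.data_transferred_gb * rates["data_transferred_gb"]
        + usage.users * rates["user"]
    )


# Illustrative numbers only.
example_rates = {"storage_gb": 0.02, "data_transferred_gb": 0.01, "user": 5.0}
example_usage = UsageRecord(storage_gb=100, data_transferred_gb=50, users=10)
charge = monthly_charge(example_usage, example_rates)
```

Uptime and downtime statistics are left out of this sketch. They could feed service-level credits, but the source does not say how they affect billing.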

云基础设施系统4002可以并行地向多个客户提供服务。云基础设施系统4002可以存储这些客户的信息，包括可能的专有信息。在一些示例中，云基础设施系统4002包括身份管理子系统(IMS)4028，其被配置为管理客户的信息并提供所管理的信息的分离，使得与一个客户相关的信息不可被另一个客户访问。IMS 4028可以被配置为提供各种与安全相关的服务，诸如身份服务，诸如信息访问管理、认证和授权服务、用于管理客户身份和角色及相关能力的服务，等等。The cloud infrastructure system 4002 may provide services to multiple customers in parallel. The cloud infrastructure system 4002 may store information of these customers, including possibly proprietary information. In some examples, the cloud infrastructure system 4002 includes an identity management subsystem (IMS) 4028 that is configured to manage the information of the customers and provide separation of the managed information so that information associated with one customer is not accessible to another customer. The IMS 4028 may be configured to provide various security-related services, such as identity services, such as information access management, authentication and authorization services, services for managing customer identities and roles and related capabilities, and the like.

图41图示了可以被用于实现上面讨论的各种示例的计算机系统4100的示例。在一些示例中，计算机系统4100可以被用于实现上述各种服务器和计算机系统中的任何一个。如图41中所示，计算机系统4100包括各种子系统，包括经由总线子系统4102与多个其它子系统通信的处理子系统4104。这些其它子系统可以包括处理加速单元4106、I/O子系统4108、存储子系统4118和通信子系统4124。存储子系统4118可以包括非暂态计算机可读存储介质4122和系统存储器4110。FIG. 41 illustrates an example of a computer system 4100 that can be used to implement the various examples discussed above. In some examples, the computer system 4100 can be used to implement any of the various servers and computer systems described above. As shown in FIG. 41 , the computer system 4100 includes various subsystems, including a processing subsystem 4104 that communicates with multiple other subsystems via a bus subsystem 4102. These other subsystems may include a processing acceleration unit 4106, an I/O subsystem 4108, a storage subsystem 4118, and a communication subsystem 4124. The storage subsystem 4118 may include a non-transient computer-readable storage medium 4122 and a system memory 4110.

Bus subsystem 4102 provides a mechanism for letting the various components and subsystems of computer system 4100 communicate with each other as intended. Although bus subsystem 4102 is shown schematically as a single bus, alternative examples of the bus subsystem may utilize multiple buses. Bus subsystem 4102 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. For example, such architectures may include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus, which may be implemented as a Mezzanine bus manufactured to the IEEE P1386.1 standard, and the like.

处理子系统4104控制计算机系统4100的操作，并且可以包括一个或多个处理器、专用集成电路(ASIC)或现场可编程门阵列(FPGA)。处理器可以包括单核或多核处理器。可以将计算机系统4100的处理资源组织成一个或多个处理单元4132、4134。处理单元可以包括一个或多个处理器(包括单核或多核处理器)、来自相同或不同处理器的一个或多个核心、核心和处理器的组合、或核心和处理器的其它组合。在一些示例中，处理子系统4104可以包括一个或多个专用协处理器，诸如图形处理器、数字信号处理器(DSP)等。在一些示例中，处理子系统4104的一些或全部可以使用定制电路来实现，诸如专用集成电路(ASIC)或现场可编程门阵列(FPGA)。The processing subsystem 4104 controls the operation of the computer system 4100 and may include one or more processors, application specific integrated circuits (ASICs), or field programmable gate arrays (FPGAs). The processor may include a single-core or multi-core processor. The processing resources of the computer system 4100 may be organized into one or more processing units 4132, 4134. The processing unit may include one or more processors (including single-core or multi-core processors), one or more cores from the same or different processors, a combination of cores and processors, or other combinations of cores and processors. In some examples, the processing subsystem 4104 may include one or more dedicated coprocessors, such as a graphics processor, a digital signal processor (DSP), etc. In some examples, some or all of the processing subsystem 4104 may be implemented using custom circuits, such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).

在一些示例中，处理子系统4104中的处理单元可以执行存储在系统存储器4110中或计算机可读存储介质4122上的指令。在各种示例中，处理单元可以执行各种程序或代码指令，并且可以维护多个并发执行的程序或进程。在任何给定的时间，要执行的程序代码中的一些或全部可以驻留在系统存储器4110中和/或计算机可读存储介质4122上，包括可能在一个或多个存储设备上。通过适当的编程，处理子系统4104可以提供上述各种功能性。在计算机系统4100正在执行一个或多个虚拟机的情况下，可以将一个或多个处理单元分配给每个虚拟机。In some examples, the processing units in the processing subsystem 4104 can execute instructions stored in the system memory 4110 or on the computer-readable storage medium 4122. In various examples, the processing units can execute various programs or code instructions, and can maintain multiple concurrently executed programs or processes. At any given time, some or all of the program code to be executed may reside in the system memory 4110 and/or on the computer-readable storage medium 4122, including possibly on one or more storage devices. Through appropriate programming, the processing subsystem 4104 can provide the various functionalities described above. In the case where the computer system 4100 is executing one or more virtual machines, one or more processing units can be assigned to each virtual machine.

In some examples, a processing acceleration unit 4106 may be provided for performing customized processing or for offloading some of the processing performed by processing subsystem 4104, thereby speeding up the overall processing performed by computer system 4100.

I/O subsystem 4108 may include devices and mechanisms for inputting information to computer system 4100 and/or for outputting information from or via computer system 4100. In general, use of the term "input device" is intended to include all possible types of devices and mechanisms for inputting information to computer system 4100. User interface input devices may include, for example, a keyboard, pointing devices such as a mouse or trackball, a touchpad or touch screen incorporated into a display, a scroll wheel, a click wheel, a dial, a button, a switch, a keypad, audio input devices with voice command recognition systems, microphones, and other types of input devices. User interface input devices may also include motion sensing and/or gesture recognition devices, such as a Microsoft motion sensor, that enable users to control and interact with an input device; a Microsoft 360 game controller; and devices that provide an interface for receiving input using gestures and spoken commands. User interface input devices may also include eye gesture recognition devices that detect eye activity from users (e.g., "blinking" while taking pictures and/or making a menu selection) and transform the eye gestures into input to an input device. In addition, user interface input devices may include voice recognition sensing devices that enable users to interact with voice recognition systems through voice commands.

Other examples of user interface input devices include, without limitation, three-dimensional (3D) mice, joysticks or pointing sticks, gamepads and graphics tablets, and audio/visual devices such as speakers, digital cameras, digital camcorders, portable media players, webcams, image scanners, fingerprint scanners, barcode readers, 3D scanners, 3D printers, laser rangefinders, and eye gaze tracking devices. In addition, user interface input devices may include, for example, medical imaging input devices such as computed tomography, magnetic resonance imaging, positron emission tomography, and medical ultrasonography devices. User interface input devices may also include, for example, audio input devices such as MIDI keyboards, digital musical instruments, and the like.

一般而言，术语“输出设备”的使用旨在包括所有可能类型的设备和用于从计算机系统4100向用户或其它计算机输出信息的机制。用户界面输出设备可以包括显示子系统、指示器灯或诸如音频输出设备的非可视显示器等。显示子系统可以是阴极射线管(CRT)、诸如利用液晶显示器(LCD)或等离子体显示器的平板设备、投影设备、触摸屏等。例如，用户界面输出设备可以包括但不限于，可视地传达文本、图形和音频/视频信息的各种显示设备，诸如监视器、打印机、扬声器、耳机、汽车导航系统、绘图仪、语音输出设备和调制解调器。In general, the use of the term "output device" is intended to include all possible types of devices and mechanisms for outputting information from the computer system 4100 to a user or other computers. User interface output devices may include display subsystems, indicator lights, or non-visual displays such as audio output devices, etc. The display subsystem may be a cathode ray tube (CRT), a flat panel device such as one utilizing a liquid crystal display (LCD) or plasma display, a projection device, a touch screen, etc. For example, user interface output devices may include, but are not limited to, various display devices that visually convey text, graphics, and audio/video information, such as monitors, printers, speakers, headphones, car navigation systems, plotters, voice output devices, and modems.

存储子系统4118提供用于存储由计算机系统4100使用的信息的储存库或数据存储库。存储子系统4118提供有形的非暂态计算机可读存储介质，用于存储提供一些示例的功能性的基本编程和数据构造。当由处理子系统4104执行时提供上述功能性的软件(例如，程序、代码模块、指令)可以存储在存储子系统4118中。软件可以由处理子系统4104的一个或多个处理单元执行。存储子系统4118还可以提供用于存储根据本公开使用的数据的储库。Storage subsystem 4118 provides a repository or data repository for storing information used by computer system 4100. Storage subsystem 4118 provides a tangible, non-transitory computer-readable storage medium for storing basic programming and data constructs that provide some example functionality. Software (e.g., programs, code modules, instructions) that provide the above functionality when executed by processing subsystem 4104 can be stored in storage subsystem 4118. The software can be executed by one or more processing units of processing subsystem 4104. Storage subsystem 4118 can also provide a repository for storing data used in accordance with the present disclosure.

存储子系统4118可以包括一个或多个非暂态存储器设备，包括易失性和非易失性存储器设备。如图41所示，存储子系统4118包括系统存储器4110和计算机可读存储介质4122。系统存储器4110可以包括多个存储器，包括用于在程序执行期间存储指令和数据的易失性主随机存取存储器(RAM)以及其中存储有固定指令的非易失性只读存储器(ROM)或闪存。在一些实施方式中，基本输入/输出系统(BIOS)可以典型地存储在ROM中，该基本输入/输出系统(BIOS)包含有助于诸如在启动期间在计算机系统4100内的元件之间传递信息的基本例程。RAM通常包含当前由处理子系统4104操作和执行的数据和/或程序模块。在一些实施方式中，系统存储器4110可以包括多种不同类型的存储器，诸如静态随机存取存储器(SRAM)或动态随机存取存储器(DRAM)等。The storage subsystem 4118 may include one or more non-transitory memory devices, including volatile and non-volatile memory devices. As shown in Figure 41, the storage subsystem 4118 includes a system memory 4110 and a computer-readable storage medium 4122. The system memory 4110 may include multiple memories, including a volatile main random access memory (RAM) for storing instructions and data during program execution and a non-volatile read-only memory (ROM) or flash memory in which fixed instructions are stored. In some embodiments, a basic input/output system (BIOS) may be typically stored in ROM, which contains basic routines that help to transfer information between elements within the computer system 4100, such as during startup. RAM typically contains data and/or program modules currently operated and executed by the processing subsystem 4104. In some embodiments, the system memory 4110 may include multiple different types of memory, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

By way of example and not limitation, as shown in FIG. 41, system memory 4110 may load application programs 4112 that are being executed, which may include client applications, web browsers, middle-tier applications, relational database management systems (RDBMS), and the like, as well as program data 4111 and an operating system 4116. By way of example, operating system 4116 may include various versions of Microsoft, Apple, and/or Linux operating systems, a variety of commercially available UNIX or UNIX-like operating systems (including, without limitation, the variety of GNU/Linux operating systems and the like), and/or mobile operating systems such as iOS.

Computer-readable storage media 4122 may store programming and data constructs that provide the functionality of some examples. Computer-readable storage media 4122 may provide storage of computer-readable instructions, data structures, program modules, and other data for computer system 4100. Software (programs, code modules, instructions) that provides the functionality described above when executed by processing subsystem 4104 may be stored in storage subsystem 4118. By way of example, computer-readable storage media 4122 may include non-volatile memory such as a hard disk drive, a magnetic disk drive, or an optical disk drive for CD-ROM, DVD, Blu-ray, or other optical media. Computer-readable storage media 4122 may include, but is not limited to, drives, flash memory cards, universal serial bus (USB) flash drives, secure digital (SD) cards, DVD disks, digital video tape, and the like. Computer-readable storage media 4122 may also include solid-state drives (SSDs) based on non-volatile memory (such as flash-memory-based SSDs, enterprise flash drives, solid-state ROM, and the like), SSDs based on volatile memory (such as solid-state RAM, dynamic RAM, static RAM, DRAM-based SSDs, and magnetoresistive RAM (MRAM) SSDs), and hybrid SSDs that use a combination of DRAM-based and flash-memory-based SSDs.

In some examples, storage subsystem 4118 may also include a computer-readable storage media reader 4120 that may further be connected to computer-readable storage media 4122. Reader 4120 may receive, and be configured to read data from, a memory device such as a disk, a flash drive, and the like.

在一些示例中，计算机系统4100可以支持虚拟化技术，包括但不限于处理和存储器资源的虚拟化。例如，计算机系统4100可以提供用于执行一个或多个虚拟机的支持。计算机系统4100可以执行诸如促进虚拟机的配置和管理的管理程序之类的程序。每个虚拟机一般独立于其它虚拟机运行。可以为每个虚拟机分配存储器、计算(例如，处理器、核心)、I/O和联网资源。每个虚拟机通常运行其自己的操作系统，该操作系统可以与由计算机系统4100执行的其它虚拟机执行的操作系统相同或不同。因此，计算机系统4100可以潜在地同时运行多个操作系统。In some examples, computer system 4100 can support virtualization technology, including but not limited to virtualization of processing and memory resources. For example, computer system 4100 can provide support for executing one or more virtual machines. Computer system 4100 can execute programs such as hypervisors that facilitate the configuration and management of virtual machines. Each virtual machine generally runs independently of other virtual machines. Memory, computing (e.g., processors, cores), I/O, and networking resources can be allocated to each virtual machine. Each virtual machine typically runs its own operating system, which can be the same or different from the operating system executed by other virtual machines executed by computer system 4100. Therefore, computer system 4100 can potentially run multiple operating systems simultaneously.

Communication subsystem 4124 provides an interface to other computer systems and networks. Communication subsystem 4124 serves as an interface through which computer system 4100 receives data from, and transmits data to, other systems. For example, communication subsystem 4124 may enable computer system 4100 to establish a communication channel to one or more client devices via the Internet, for receiving information from and sending information to the client devices.

Communication subsystem 4124 may support wired and/or wireless communication protocols. For example, in some examples, communication subsystem 4124 may include radio frequency (RF) transceiver components for accessing wireless voice and/or data networks (e.g., using cellular telephone technology; advanced data network technology such as 3G, 4G, or EDGE (Enhanced Data Rates for Global Evolution); WiFi (the IEEE 802.11 family of standards); other mobile communication technologies; or any combination thereof), global positioning system (GPS) receiver components, and/or other components. In some examples, communication subsystem 4124 may provide wired network connectivity (e.g., Ethernet) in addition to or instead of a wireless interface.

通信子系统4124可以以各种形式接收和传输数据。例如，在一些示例中，通信子系统4124还可以以结构化和/或非结构化的数据馈送4126、事件流4128、事件更新4130等形式接收输入通信。例如，通信子系统4124可以被配置为实时地从社交媒体网络的用户和/或诸如馈送、更新、诸如丰富站点摘要(RSS)馈送的web馈送的其它通信服务接收(或发送)数据馈送4126，和/或来自一个或多个第三方信息源的实时更新。The communication subsystem 4124 may receive and transmit data in various forms. For example, in some examples, the communication subsystem 4124 may also receive incoming communications in the form of structured and/or unstructured data feeds 4126, event streams 4128, event updates 4130, etc. For example, the communication subsystem 4124 may be configured to receive (or send) data feeds 4126 in real time from users of a social media network and/or other communication services such as feeds, updates, web feeds such as Rich Site Summary (RSS) feeds, and/or real-time updates from one or more third-party information sources.

In some examples, communication subsystem 4124 may be configured to receive data in the form of continuous data streams, which may include event streams 4128 of real-time events and/or event updates 4130, and which may be continuous or unbounded in nature with no explicit end. Examples of applications that generate continuous data may include sensor data applications, financial tickers, network performance measuring tools (e.g., network monitoring and traffic management applications), clickstream analysis tools, automobile traffic monitoring, and the like.

Communication subsystem 4124 may also be configured to output the structured and/or unstructured data feeds 4126, event streams 4128, event updates 4130, and the like to one or more databases that may be in communication with one or more streaming data source computers coupled to computer system 4100.

Computer system 4100 may be one of various types, including a handheld portable device (e.g., a cellular phone, a computing tablet, a PDA), a wearable device (e.g., a head-mounted display), a personal computer, a workstation, a mainframe, a kiosk, a server rack, or any other data processing system.

由于计算机和网络不断变化的性质，对图41中绘出的计算机系统4100的描述旨在仅仅作为具体示例。具有比图41中所绘出的系统更多或更少组件的许多其它配置是可能的。基于本文所提供的公开内容和教导，本领域普通技术人员将理解实现各种实施例的其它方式和/或方法。Due to the ever-changing nature of computers and networks, the description of the computer system 4100 depicted in FIG. 41 is intended to be used as a specific example only. Many other configurations with more or fewer components than the system depicted in FIG. 41 are possible. Based on the disclosure and teachings provided herein, one of ordinary skill in the art will appreciate other ways and/or methods of implementing various embodiments.

Smart Content-Result Ranking

对于被配置为响应于来自客户端或用户的输入来推荐特定内容(内容项)的内容推荐系统，评估内容项并相对于彼此进行排名的处理在总体推荐中起着重要的作用。例如，用户可以将一个或多个搜索术语输入搜索引擎查询界面，推荐系统可以推荐与输入搜索术语最匹配的一个或多个相关内容项(例如，网页、文档、图像等)。在其它示例中，用户可以使用内容创作系统创作原始内容，诸如电子邮件、文档、在线文章、博客文章等，该内容创作系统可以被配置为提供相关内容项的推荐，该推荐可以与用户创作的内容相关。例如，如上所述，内容推荐系统可以向用户推荐相关图像或到网页的链接等，使得如果期望的话，作者可以将推荐的内容项中的一个或多个结合到正在由用户创作的内容中。在上面的部分中描述了此类技术的几个示例。如这些示例所说明的，响应于从其它处理接收到手动输入的用户输入和/或自动输入，特定内容项的排名和推荐可以应用于许多不同的用例。待排名和/或推荐的内容项可以与推荐系统可用于搜索的图像、网页、其它媒体文件、文档、数字对象等对应。这些内容项可以存储在推荐系统可访问的一个或多个储存库中，该储存库可以是私有或公共储存库(例如，互联网)。For a content recommendation system configured to recommend specific content (content items) in response to input from a client or user, the process of evaluating content items and ranking them relative to each other plays an important role in the overall recommendation. For example, a user may enter one or more search terms into a search engine query interface, and the recommendation system may recommend one or more related content items (e.g., web pages, documents, images, etc.) that best match the input search terms. In other examples, a user may create original content such as emails, documents, online articles, blog posts, etc. using a content creation system, which may be configured to provide recommendations for related content items that may be related to the content created by the user. For example, as described above, a content recommendation system may recommend related images or links to web pages, etc. to a user, so that, if desired, the author may incorporate one or more of the recommended content items into the content being created by the user. Several examples of such techniques are described in the above sections. As illustrated by these examples, the ranking and recommendation of specific content items may be applied to many different use cases in response to receiving manually input user input and/or automatic input from other processes. 
Content items to be ranked and/or recommended may correspond to images, web pages, other media files, documents, digital objects, and the like that are available for the recommendation system to search. These content items may be stored in one or more repositories accessible to the recommendation system, which may be private or public repositories (e.g., the Internet).

However, ranking content items is not a trivial task. For example, consider a recommendation system that uses a tag matching technique to make recommendations. In such a system, the content items available for the recommendation system to search are tagged, and the content items may be stored together with their tags in one or more repositories. Tagging may be performed by a content item tagging service or application. For a content item, the one or more tags associated with the item indicate what the item contains. A value (sometimes called a tag probability) may also be associated with each tag, where the value provides a measure (e.g., a probability) that the content indicated by the tag appears in the content item. After receiving user input for which content item recommendations are to be provided (e.g., a search term or phrase, or content authored by the user), the user input may be analyzed to identify one or more tags to be associated with it. The recommendation system may then use a tag matching technique to identify, from the content items available for search, a set of content items whose associated tags match the tags associated with the user input. The recommendation system may then use some ranking algorithm to rank the content items within the identified set and display the results to the user.
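The tag matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `TaggedItem` class, the `find_matching_items` function, and the sample repository (`img-1` through `img-3`) are hypothetical names and data introduced here, and the "at least one matching tag" rule is the one used in the examples that follow.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedItem:
    """Hypothetical content item: an id plus a mapping of tag -> tag value."""
    item_id: str
    tags: dict = field(default_factory=dict)


def find_matching_items(repository, input_tags):
    """Return (item, matching_tags) pairs for items sharing at least one tag
    with the input. matching_tags keeps only the tags that matched."""
    wanted = set(input_tags)
    matches = []
    for item in repository:
        matching = {tag: value for tag, value in item.tags.items() if tag in wanted}
        if matching:
            matches.append((item, matching))
    return matches


# Illustrative data only; these items do not come from the source text.
repository = [
    TaggedItem("img-1", {"coffee": 0.5, "person": 0.5}),
    TaggedItem("img-2", {"person": 0.9, "dog": 0.1}),
    TaggedItem("img-3", {"mountain": 1.0}),
]
matches = find_matching_items(repository, ["coffee", "person"])
matched_ids = [item.item_id for item, _ in matches]
```

Here `img-3` is excluded because none of its tags match, while `img-2` is retained on the strength of a single matching tag. How the retained items are ordered is a separate ranking question.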

但是，在某些用例中，由推荐系统使用的排名算法的有效性会受到限制，并且可能不会产生最优结果。例如，考虑存在与用户输入相关联的多个标签的情况。例如，如果用户在图像搜索引擎中键入词“咖啡”和“人”，那么标签“咖啡”和“人”可以与用户输入相关联。在某些实施例中，搜索术语本身可以被视为与用户输入相关联的标签。为了简单起见，假设可用于搜索的内容项的集合包括加标签的图像。使用这两个标签，推荐系统可以从内容项(例如，图像)的集合中检索匹配的内容项的集合，其中如果与内容项相关联的至少一个标签和与用户输入相关联的标签匹配，那么认为该内容项是匹配的。考虑一种场景(情况1)，其中由推荐系统检索多个匹配的内容项，并且每个内容项都具有与之相关联的“咖啡”和“人”标签。对这些检索出的内容项进行排名的一种可能方式是(a)针对每个匹配的内容项，添加与用于那个内容项的与标签“咖啡”和“人”相关联的值，然后(b)基于它们相关联的相加总和对内容项进行排名。但是，这带来了一个问题，因为对于多个匹配的内容项，与匹配的标签相关联的值可以累加至相同的值，这是非常可能的情况，因为大多数加标签服务将概率归一化到一的尺度。例如，三个匹配的图像可以具有如下相关联的标签值：图像A((“咖啡”，0.5)，(“人”，0.5))，图像B((“咖啡”，0.2)，(“人”，0.8))和图片C((“咖啡”，0.7)，(“人”，0.3))。如可以看出的，这些匹配的图像中每个匹配的图像的匹配标签的值之和为一(“1”)，因此无法使用常规求和技术将这些图像之一排名在另一个之上。因而，仅基于图像的匹配标签值的相加总和对图像进行排名不能被用于对图像进行排名。However, in certain use cases, the effectiveness of the ranking algorithm used by the recommender system may be limited and may not produce optimal results. For example, consider a situation where there are multiple tags associated with a user input. For example, if a user types the words "coffee" and "person" into an image search engine, the tags "coffee" and "person" may be associated with the user input. In some embodiments, the search terms themselves may be considered tags associated with the user input. For simplicity, assume that the set of content items available for search includes tagged images. Using these two tags, the recommender system can retrieve a set of matching content items from a set of content items (e.g., images), where the content item is considered to be a match if at least one tag associated with the content item matches the tag associated with the user input. Consider a scenario (Case 1) in which multiple matching content items are retrieved by the recommender system, and each content item has the tags "coffee" and "person" associated with it. 
One possible way to rank these retrieved content items is to (a) for each matching content item, add the values associated with the tags "coffee" and "person" for that content item, and then (b) rank the content items based on their resulting sums. However, this presents a problem because, for multiple matching content items, the values associated with the matching tags may add up to the same total. This is very likely, because most tagging services normalize tag probabilities so that they sum to one. For example, three matching images may have the following associated tag values: Image A (("coffee", 0.5), ("person", 0.5)), Image B (("coffee", 0.2), ("person", 0.8)), and Image C (("coffee", 0.7), ("person", 0.3)). As can be seen, the values of the matching tags for each of these images sum to one ("1"), so a conventional summing technique cannot rank any one of these images above another. The sum of matching tag values alone is therefore not sufficient to rank the images.
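The tie can be reproduced directly from the tag values given for Images A, B, and C. The snippet below is a small illustrative sketch; the function name `summed_score` is an assumption, and the sums are rounded only to keep floating-point noise out of the comparison.

```python
# Tag values for the three images, taken from the example in the text.
images = {
    "A": {"coffee": 0.5, "person": 0.5},
    "B": {"coffee": 0.2, "person": 0.8},
    "C": {"coffee": 0.7, "person": 0.3},
}
query_tags = {"coffee", "person"}


def summed_score(tags, query_tags):
    """Conventional score: the sum of the values of the matching tags."""
    return round(sum(v for t, v in tags.items() if t in query_tags), 6)


scores = {name: summed_score(tags, query_tags) for name, tags in images.items()}
# Every image receives the same score, so the sum cannot order them.
all_tied = len(set(scores.values())) == 1
```

Because every score is identical, any ordering produced from `scores` is arbitrary, which is the limitation the passage describes.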

Extending the above example, the set of matching images may also include images that match only one tag (e.g., only "coffee" or only "person") and that have the same associated tag value. This again poses a ranking problem. For example, consider two matching images with the following associated tag values: Image A (("coffee", 0.5)) and Image B (("person", 0.5)). Again, conventional techniques cannot rank one of these images above the other.

当与用户输入相关联的标签多于两个时，情况进一步恶化。例如，如果用户在图像搜索引擎中键入词“咖啡”、“人”和“咖啡馆”，那么与用户输入相关联的搜索标签为“咖啡”、“人”和“咖啡馆”三个。推荐系统可以从内容项(例如，图像)的集合中检索匹配的内容项的集合，其中如果与内容项相关联的至少一个标签匹配与用户输入相关联的标签，那么认为该内容项是匹配的。匹配的内容项的匹配标签的数量可以从仅一个匹配标签到多个匹配标签(“咖啡”、“人”和“咖啡馆”示例的三个匹配标签的最大值)不同。同样在这个场景中，对于多个匹配的内容项，与匹配的标签相关联的值可以累加至相同的值。例如，三个匹配的图像可以具有如下相关联的标签值：图像A((“人”，0.8)，(“咖啡”，0.2))，图像B((“人”，0.2)，(“咖啡”，0.2)，(“咖啡馆”，0.6))和图像C((“人”，0.5)，(“咖啡馆”，0.5))。如可以看出的，这些匹配的图像中每个匹配的图像的匹配标签的值之和为一(“1”)，因此，无法使用常规求和技术将这些图像之一排名在另一个之上。The situation worsens further when there are more than two tags associated with the user input. For example, if a user types the words "coffee", "person", and "cafe" into an image search engine, the search tags associated with the user input are three: "coffee", "person", and "cafe". The recommendation system can retrieve a set of matching content items from a set of content items (e.g., images), where a content item is considered a match if at least one tag associated with the content item matches the tag associated with the user input. The number of matching tags for a matching content item can vary from only one matching tag to multiple matching tags (a maximum of three matching tags for the "coffee", "person", and "cafe" example). Also in this scenario, for multiple matching content items, the values associated with the matching tags can add up to the same value. For example, three matching images may have the following associated label values: Image A (("person", 0.8), ("coffee", 0.2)), Image B (("person", 0.2), ("coffee", 0.2), ("cafe", 0.6)), and Image C (("person", 0.5), ("cafe", 0.5)). As can be seen, the values of the matching labels for each of these matching images sum to one ("1"), and therefore, these images cannot be ranked one above the other using conventional summation techniques.
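For the three-tag example, a short sketch makes two things visible at once: the summed tag values still tie, while the number of matching tags differs from image to image. The variable names below are illustrative assumptions; the tag values are those given in the text.

```python
# Tag values for the three images in the "coffee" / "person" / "cafe" example.
images = {
    "A": {"person": 0.8, "coffee": 0.2},
    "B": {"person": 0.2, "coffee": 0.2, "cafe": 0.6},
    "C": {"person": 0.5, "cafe": 0.5},
}
query_tags = {"coffee", "person", "cafe"}

# For each image: (number of matching tags, rounded sum of matching tag values).
summary = {}
for name, tags in images.items():
    matching_values = [value for tag, value in tags.items() if tag in query_tags]
    summary[name] = (len(matching_values), round(sum(matching_values), 6))
```

In `summary`, the second element is the same for all three images, whereas the first element separates Image B (three matching tags) from Images A and C (two each). Images A and C remain tied on both measures, so the count alone does not fully order the set either.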

因而，在许多情况下，由于多种原因，简单的标签匹配技术可能不会返回最优内容推荐。例如，储存库中的某些图像(或其它内容项)可以仅用一个或两个内容标签来加标签，而其它图像/内容项可以用大量标签来加标签，包括对于单个内容项有数十个甚至数百个标签。在这样的情况下，常规标签匹配技术可能会过分推荐重加标签的内容项(例如，因为它更频繁地包括至少一个与输入术语匹配的标签)和/或可能会不足推荐这种项目(例如，因为即使匹配一个或多个内容标签，其大多数标签仍将与输入术语不匹配)。类似地，由用户或客户端系统提供的输入内容可能仅包括几个输入术语(例如，明确输入的搜索术语或从更大的输入文本中提取的主题)，或者可能包括相对大量的输入术语，这取决于接收到的输入数据。在此类情况下，常规标签匹配技术可能或者无法识别储存库中的某些相关内容项(例如，因为太少输入术语与相关内容项的标签匹配)，或者可能错误地推荐相关性较低的内容项(例如，因为相关性较低的内容项包括一个或多个匹配标签)。Thus, in many cases, simple tag matching techniques may not return optimal content recommendations for a variety of reasons. For example, some images (or other content items) in a repository may be tagged with only one or two content tags, while other images/content items may be tagged with a large number of tags, including dozens or even hundreds of tags for a single content item. In such cases, conventional tag matching techniques may over-recommend the heavily tagged content item (e.g., because it more frequently includes at least one tag that matches the input term) and/or may under-recommend such items (e.g., because even if one or more content tags are matched, most of its tags will still not match the input terms). Similarly, the input content provided by a user or client system may include only a few input terms (e.g., explicitly entered search terms or topics extracted from a larger input text), or may include a relatively large number of input terms, depending on the input data received. In such cases, conventional tag matching techniques may either fail to identify certain relevant content items in the repository (e.g., because too few input terms match the tags of the relevant content items), or may erroneously recommend less relevant content items (e.g., because the less relevant content items include one or more matching tags).

在某些实施例中，本文描述了用于评估、排名和推荐加标签的内容项的改进技术。在一些实施例中，内容推荐系统可以从客户端设备接收输入内容(诸如搜索查询、新创作的文本输入等)。一个或多个标签可以包括在从客户端设备接收到的输入内容中或与其相关联，和/或可以基于对输入内容执行的预处理和分析技术从输入内容中确定和提取。此外，内容推荐系统可以访问存储多个加标签的内容项(诸如图像、媒体内容文件、到网页的链接和/或其它文档)的内容储存库。在一些情况下，内容储存库可以存储识别加标签的内容项的数据，并且对于每个加标签的内容项，还可以存储用于每个项目的相关联的标签信息，其中用于内容项的标签信息包括识别与内容项关联的一个或多个标签的信息和每个相关联的标签的标签值。In certain embodiments, improved techniques for evaluating, ranking, and recommending tagged content items are described herein. In some embodiments, a content recommendation system may receive input content (such as a search query, newly created text input, etc.) from a client device. One or more tags may be included in or associated with the input content received from the client device, and/or may be determined and extracted from the input content based on preprocessing and analysis techniques performed on the input content. In addition, the content recommendation system may access a content repository storing multiple tagged content items (such as images, media content files, links to web pages, and/or other documents). In some cases, the content repository may store data identifying tagged content items, and for each tagged content item, associated tag information for each item may also be stored, wherein the tag information for the content item includes information identifying one or more tags associated with the content item and a tag value for each associated tag.

In response to receiving input data for which recommendations are to be made, the content recommendation system may retrieve, from the set of searchable content items in the content repository, a set of matching tagged content items, where a content item is considered a matching content item if at least one content tag associated with the content item matches a tag associated with the input content. Then, for each matching tagged content item retrieved from the content repository, the content recommendation system may calculate two scores: (1) a first score based on the number of tags associated with the content item that match tags associated with the input content (also referred to as a tag count score), and (2) a second score based on the tag value of each matching tag of the content item (also referred to as a tag value-based score or TVBS). The content recommendation system then calculates a final ranking score for each matching content item based on the first score and the second score of that content item. The final ranking scores calculated for the set of matching content items are then used to generate a ranked list of the matching content items. This ranked list is then used to identify a recommended subset of the matching content items to be output to a user or client system.
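The two-score flow just described might be sketched as follows. Several things here are assumptions rather than the disclosed method: the TVBS is taken to be the plain sum of the matching tag values, the two scores are combined by comparing tag counts first and using the TVBS only to break ties, and the items "D" and "E" are invented for illustration. The passage above states only that the final ranking score is based on both scores.

```python
from typing import Dict, List, Set, Tuple

TagValues = Dict[str, float]  # tag -> tag value for one content item


def tag_count_score(item_tags: TagValues, input_tags: Set[str]) -> int:
    """First score: how many of the item's tags match the input tags."""
    return len(input_tags.intersection(item_tags))


def tag_value_based_score(item_tags: TagValues, input_tags: Set[str]) -> float:
    """Second score (TVBS); assumed here to be the sum of matching tag values."""
    return sum(value for tag, value in item_tags.items() if tag in input_tags)


def rank_items(repository: Dict[str, TagValues],
               input_tags: Set[str]) -> List[Tuple[str, int, float]]:
    """Keep items with at least one matching tag and sort them best-first.

    Assumed combination rule: higher tag count wins, TVBS breaks ties.
    """
    scored = []
    for item_id, item_tags in repository.items():
        count = tag_count_score(item_tags, input_tags)
        if count == 0:
            continue  # not a matching content item
        scored.append((item_id, count, tag_value_based_score(item_tags, input_tags)))
    return sorted(scored, key=lambda row: (row[1], row[2]), reverse=True)


repository = {
    "A": {"person": 0.8, "coffee": 0.2},
    "B": {"person": 0.2, "coffee": 0.2, "cafe": 0.6},
    "D": {"person": 0.3, "coffee": 0.3, "dog": 0.4},  # invented item
    "E": {"tree": 1.0},                               # invented, no match
}
ranking = rank_items(repository, {"coffee", "person", "cafe"})
ranked_ids = [item_id for item_id, _, _ in ranking]
print(ranked_ids)
```

Under these assumptions, item B ranks first because it matches all three input tags, item A outranks item D because its matching tag values sum higher, and item E is never retrieved.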

Referring now to FIG. 42, a block diagram is shown illustrating a computing environment 4200, according to certain embodiments, with a content recommendation system 4220 implemented to evaluate and rank content items from a content repository 4230 in response to input content received from a user or client system 4210. Various components and subsystems within the content recommendation system 4220 are also shown in this example, including a graphical user interface (GUI) 4215 through which the client system 4210 can interact with the content recommendation system 4220 to provide input content and receive data identifying a subset of recommended content items. In some embodiments, the GUI 4215 can be the GUI of a separate client application 4215 (e.g., a web browser application) that a user uses to author content. In such embodiments, the content recommendation system 4220 can receive user-provided or user-authored content from the client application. The content recommendation system 4220 can receive the content through an application programming interface (API) that enables the client application and the content recommendation system 4220 to interact and exchange information with each other.

The embodiment depicted in FIG. 42 is merely an example and is not intended to unduly limit the scope of the claimed embodiments. Those of ordinary skill in the art will recognize many possible variations, alternatives, and modifications. For example, in some implementations, the content recommendation system 4220 may have more or fewer systems or subsystems than those shown in FIG. 42, may combine two or more systems, or may have a different configuration or arrangement of systems. The content recommendation system 4220 may be implemented as one or more computing systems, including, in some embodiments, separate systems that use independent computing and network infrastructure with dedicated and specialized hardware and software. Alternatively or additionally, one or more of these components and subsystems may be integrated into a single system that performs the separate functionalities. The various systems, subsystems, and components depicted in FIG. 42 may be implemented in software (e.g., code, instructions, programs) executed by one or more processing units (e.g., processors, cores) of the respective systems, in hardware, or in a combination thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device).

At a high level, the content recommendation system 4220 is configured to receive user input content and then make content item recommendations in response to, and based on, that user input content. Recommendations are made from a collection of content items that are available and accessible to the content recommendation system 4220 for recommendation. The collection of content items may include images, various types of documents, media content, digital objects, and the like. Based on the tag information associated with the user input content and the tag information associated with the collection of content items, the content recommendation system 4220 is configured to use tag matching techniques to identify a set of matching content items for the user input content. The content recommendation system 4220 is then configured to rank the content items in the set of matching content items using the innovative ranking techniques described in this disclosure. Based on the ranking, the content recommendation system 4220 is then configured to identify and recommend a subset of the matching content items to be output to the user or client system.

内容推荐系统4220包括内容加标签器子系统4222，该内容加标签器子系统4222被配置为接收或检索可用于由内容推荐系统4220推荐的内容项。内容项可以包括但不限于图像、网页、文档、媒体文件等。可以从一个或多个内容储存库4230接收或检索内容项。内容储存库4230可以包括各种公共或私有内容储存库，诸如库或数据库，包括图像库、文档存储库、基于web的资源的局域网或广域网(例如，互联网)等。一个或多个内容储存库4230可以本地存储到内容推荐系统4220中，而其它内容储存库可以是分离的、远离内容推荐系统4220并且可经由一个或多个计算机网络让内容推荐系统4220访问。The content recommendation system 4220 includes a content tagger subsystem 4222 configured to receive or retrieve content items that can be recommended by the content recommendation system 4220. Content items may include, but are not limited to, images, web pages, documents, media files, etc. Content items may be received or retrieved from one or more content repositories 4230. The content repositories 4230 may include various public or private content repositories, such as libraries or databases, including image repositories, document repositories, local or wide area networks of web-based resources (e.g., the Internet), etc. One or more content repositories 4230 may be stored locally in the content recommendation system 4220, while other content repositories may be separate, remote from the content recommendation system 4220 and accessible to the content recommendation system 4220 via one or more computer networks.

In certain embodiments, for each content item, the content tagger subsystem 4222 is configured to retrieve and analyze the content of the content item and to identify one or more content tags to be associated with the content item. For each tag associated with a content item, the content tagger 4222 can also determine a tag value associated with the tag, where the value provides a measure (e.g., a probability) of the content indicated by the tag appearing in the content item. The tag value can thus be a numerical measure of how well a particular content tag applies to that content item. One or more tags and corresponding tag values can be associated with a content item. For a content item with multiple associated tags and corresponding tag values, the tag values can represent the relative prominence in that content item (e.g., an image) of the theme or topic indicated by each tag. For example, a first tag associated with a content item with a relatively high tag value can indicate that the content or feature indicated by the first tag is particularly relevant and prominent in the content item. In contrast, a second tag associated with the same content item with a lower tag value can indicate that the content or feature indicated by the second tag is less prominent or prevalent in the content item than the content indicated by the first tag. For example, an image content item may have two associated tags and values as follows: (("person", 0.8), ("coffee", 0.2)). This indicates that the image contains content related to coffee (e.g., a coffee cup) and content related to a person (e.g., a person drinking coffee), and that the person is depicted more prominently in the image than the coffee (e.g., a large portion of the image may depict the person, while the coffee cup may cover only a small portion of the image). Tag values can be represented using different formats. For example, in some implementations, a tag value can be expressed as a floating point number between 0.0 and 1.0. In some implementations, the tag values of all the tags associated with a particular content item can sum to a fixed, uniform value (e.g., add up to 1).
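One simple way to obtain tag values that lie between 0.0 and 1.0 and add up to 1 is to divide each raw score by the total. The sketch below assumes that a tagger produces non-negative raw scores per tag; that assumption, the raw scores themselves, and the function name `normalize_tag_values` are illustrative only and are not taken from the disclosure.

```python
from typing import Dict


def normalize_tag_values(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative raw scores so that the tag values sum to 1."""
    if any(score < 0 for score in raw_scores.values()):
        raise ValueError("raw scores must be non-negative")
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one tag must have a positive score")
    return {tag: score / total for tag, score in raw_scores.items()}


# Hypothetical raw scores, e.g. the share of image area covered by each subject.
tag_values = normalize_tag_values({"person": 4.0, "coffee": 1.0})
print(tag_values)
```

With these made-up raw scores the result reproduces the ("person", 0.8), ("coffee", 0.2) example from the passage above.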

在一些实施例中，内容加标签器4222可以使用内容加标签服务的服务来执行针对内容项的加标签任务，包括识别要与内容项相关联的一个或多个标签以及每个标签的标签值。在某些实施例中，使用已被训练为将内容项作为输入并预测该内容项的标签和相关联的标签值的一个或多个预测性机器学习模型来实现内容加标签器4222。在一些实施例中，标签可以从用于训练模型的预配置标签的集合中选择。可以使用使用预训练的机器学习模型和/或其它基于人工智能的工具的各种机器学习技术(包括基于AI的文本或图像分类系统、主题或特征提取和/或上述技术的任何其它组合)来确定要与内容项相关联的标签和对应的标签值。In some embodiments, content tagger 4222 can use the services of a content tagging service to perform tagging tasks for content items, including identifying one or more tags to be associated with a content item and a tag value for each tag. In some embodiments, content tagger 4222 is implemented using one or more predictive machine learning models that have been trained to take a content item as input and predict a tag and associated tag value for the content item. In some embodiments, the tags can be selected from a set of preconfigured tags used to train the model. Various machine learning techniques using pretrained machine learning models and/or other artificial intelligence-based tools (including AI-based text or image classification systems, topic or feature extraction, and/or any other combination of the above techniques) can be used to determine the tags to be associated with the content item and the corresponding tag values.

In some embodiments, content items retrieved from the content repository 4230 may already include associated content tags and tag values. When a retrieved content item does not include tag information, and/or when the content recommendation system 4220 is configured to determine additional tags for content items, the content tagger 4222 can be used to update the tags of, or generate new tags for, the retrieved content item. The content tagger 4222 can use a variety of techniques to generate tag information (e.g., one or more tags and associated tag values) for a content item. For example, the content tagger 4222 can use any or all of the previously described techniques, such as parsing, processing, feature extraction, and/or other analysis techniques, to analyze the retrieved content items and determine content tags. The type of parsing, processing, feature extraction, and/or analysis can depend on the type of the content item. For example, for text-based content items, such as blog posts, letters, emails, articles, documents, etc., the analysis can include keyword extraction and processing tools (e.g., stemming, synonym retrieval, etc.), topic analysis tools, and the like.
For content items that are images, an artificial intelligence-based image classification tool can be used to identify specific image features and/or generate image tags. For example, analysis of an image may identify multiple image features, and the image may be tagged with each of these identified features. One or both types of analysis (i.e., extracting tags from the image and extracting keywords/topics from the text content) may be performed via a REST-based service or other web service using analytics, machine learning algorithms, and/or artificial intelligence (AI)-based techniques (such as an AI-based cognitive image analysis service, or a similar AI/REST cognitive text service to be used for text content). Similar techniques may be used for other types of content items, such as video files, audio files, graphics, or social media posts, where specialized web services may be used to extract and analyze specific features (e.g., words, objects in an image/video, facial expressions, etc.), depending on the media type of the content item.

In some embodiments, the content tagger 4222 may use one or more pre-trained machine learning and/or artificial intelligence-based models, trained with training data, to identify and extract the content features used to determine tags and tag values for content items. For example, a model training system may generate one or more models that are trained in advance using machine learning algorithms on a training data set that includes previous input data (e.g., text input, images, etc.) and the corresponding tags for that previous input data. In various embodiments, one or more different types of trained models may be used, including classification systems that perform supervised or semi-supervised learning techniques (such as naive Bayes models, decision tree models, logistic regression models, or deep learning models), or any other machine learning or artificial intelligence-based prediction system that performs supervised or unsupervised learning techniques. For each machine learning model or model type, the trained model may be executed by one or more computing systems; during execution, a content item is provided as input to the one or more models, and the output of a model may identify one or more tags as being associated with the content item, or may be used to identify one or more tags to be associated with the content item.
Thus, the content tagger 4222 can use a variety of different tools or techniques, such as, but not limited to, keyword extraction and processing (e.g., stemming, synonym retrieval, etc.), topic analysis, feature extraction from images, machine learning and AI-based modeling tools and text or image classification systems, and/or any other combination of the foregoing techniques to determine or generate tag information (e.g., one or more tags and associated tag values) for each content item that can be used for recommendation.

在某些实施例中，可用于推荐的内容项及其相关联的标签信息(例如，对于每个内容项，与该内容项相关联的一个或多个标签以及对应的标签值)可以存储在数据存储库4223中。在一些实施例中，内容/标签信息数据存储库4223可以存储识别从内容储存库4230中检索出的内容项的数据，其可以包括项目本身(例如，图像、网页、文档、媒体文件等)，或者附加地/可替代地，可以包括对项目的引用(例如，项目标识符、可以从中检索内容项的网络地址、项目的描述、项目的缩略图等)。在图45中示出了图示可以存储在内容/标签信息数据存储库4223中的数据的类型的示例并且在下面更详细地讨论。In some embodiments, content items that can be used for recommendation and their associated tag information (e.g., for each content item, one or more tags associated with the content item and the corresponding tag value) can be stored in data repository 4223. In some embodiments, content/tag information data repository 4223 can store data identifying content items retrieved from content repository 4230, which can include the items themselves (e.g., images, web pages, documents, media files, etc.), or additionally/alternatively, can include references to items (e.g., item identifiers, network addresses from which content items can be retrieved, descriptions of items, thumbnails of items, etc.). An example illustrating the types of data that can be stored in content/tag information data repository 4223 is shown in FIG. 45 and discussed in more detail below.
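As a rough illustration of what such a store might hold, the sketch below keeps one record per content item (an identifier, a reference from which the item can be retrieved, and its tag-to-value mapping) plus a tag index for looking up matching items. The class names, field names, and example URLs are hypothetical; the actual layout of data store 4223 is the one described with reference to FIG. 45.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TaggedItem:
    """One stored record: a reference to the item plus its tag information."""
    item_id: str
    reference: str  # e.g., a network address from which the item can be retrieved
    tags: Dict[str, float] = field(default_factory=dict)  # tag -> tag value


class TagStore:
    """In-memory stand-in for a content/tag information data store."""

    def __init__(self) -> None:
        self._items: Dict[str, TaggedItem] = {}
        self._by_tag: Dict[str, Set[str]] = defaultdict(set)

    def add(self, item: TaggedItem) -> None:
        # Simplification: assumes each item_id is added only once.
        self._items[item.item_id] = item
        for tag in item.tags:
            self._by_tag[tag].add(item.item_id)

    def matching_items(self, input_tags: Set[str]) -> List[TaggedItem]:
        """Return items that share at least one tag with the input tags."""
        ids: Set[str] = set()
        for tag in input_tags:
            ids |= self._by_tag.get(tag, set())
        return [self._items[item_id] for item_id in sorted(ids)]


store = TagStore()
store.add(TaggedItem("img-1", "https://example.com/img-1.jpg", {"person": 0.8, "coffee": 0.2}))
store.add(TaggedItem("img-2", "https://example.com/img-2.jpg", {"tree": 1.0}))
matches = [item.item_id for item in store.matching_items({"coffee", "cafe"})]
print(matches)
```

Indexing by tag is one possible design choice: it lets the "at least one matching tag" lookup touch only the items that carry an input tag instead of scanning every stored record.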

The content recommendation system 4220 includes a tag identifier subsystem 4221 configured to receive user input content from the device 4210 and to determine one or more tags to be associated with the user content. In some embodiments, the user content received from the device 4210 may include associated tags. In some other embodiments, the tag identifier 4221 may be configured to process the input data to determine one or more tags to be associated with the input data. As one example, the tag identifier 4221 may use a data tagging service to identify a set of one or more tags to be associated with the input data. The tag identifier 4221 can then provide the tags associated with the input data (and, in some implementations, the user content as well) to the recommended content item identifier and ranker subsystem 4224 (referred to as the content item ranker 4224 for brevity) for further processing.

In some embodiments, the tag identifier 4221 can use the various techniques used by the content tagger 4222, as described above, to determine one or more tags to be associated with the received user content. In some embodiments, the tag identifier 4221 and the content tagger 4222 can both use the same superset of tags, from which the tags to be associated with the user input and with the content items are determined. In certain embodiments, the tag identifier 4221 and the content tagger 4222 can use the same data tagging service to identify the tags to be associated with the user content and with the content items, respectively. In still other embodiments, the tag identifier 4221 and content tagger 4222 subsystems can be implemented as a single subsystem configured to perform similar (or even identical) processing on the content items received from the repository 4230 and on the input content received from the client system 4210.

As described, when content items are tagged by the content tagger 4222, one or more tags to be associated with each content item are identified, along with a tag value for each tag. With respect to tagging user content, in some embodiments the tag identifier 4221 is configured to determine only the tags to be associated with the user input, without any associated tag values. In such embodiments, each tag associated with the user input is given equal weight in the ranking performed by the content item ranker 4224. In some other embodiments, both tags and associated tag values may be determined for the user content and used by the content item ranker 4224 to rank content item recommendations.
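The difference between the two cases can be illustrated with a small sketch. It assumes, purely for illustration, that an input tag's weight multiplies the matching tag value of the content item; the disclosure does not state this formula here, and the function names and numbers are hypothetical.

```python
from typing import Dict, Iterable, Optional


def input_tag_weights(tags: Iterable[str],
                      values: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Give every input tag equal weight unless input tag values were determined."""
    if values is None:
        return {tag: 1.0 for tag in tags}
    return {tag: values[tag] for tag in tags}


def weighted_match_score(item_tags: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed scoring: sum of (input tag weight * item tag value) over matching tags."""
    return sum(weight * item_tags[tag] for tag, weight in weights.items() if tag in item_tags)


item = {"person": 0.8, "coffee": 0.2}

# Case 1: input tags only, so both tags count equally.
equal = input_tag_weights(["person", "coffee"])
equal_score = weighted_match_score(item, equal)

# Case 2: input tags with hypothetical tag values that emphasize "coffee".
valued = input_tag_weights(["person", "coffee"], {"person": 0.25, "coffee": 0.75})
valued_score = weighted_match_score(item, valued)

print(equal_score, valued_score)
```

In the second case the same image scores lower, because the input emphasizes "coffee" while the image is mostly about a person.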

As described above, the user content received and processed by the tag identifier 4221 can take different forms. For example, the user content can include the content of a document authored by the user (e.g., an email, an article, a blog post, a document, a social media post, an image, etc.), content created or selected by the user (e.g., a multimedia file), and the like. As another example, the user input can be a document accessed by the user (e.g., a web page). As yet another example, the user content can be search terms entered by the user (e.g., into a browser-based search engine) to perform a search. In certain embodiments, for example for search terms, the terms themselves can be used as tags.

As depicted in FIG. 42 and described above, the content item ranker 4224 receives as input, from the tag identifier 4221, information identifying a set of one or more tags associated with the user content. Based on this tag information for the user content and on the content items available for recommendation, the content item ranker 4224 is configured to use tag matching techniques to identify one or more content items that are most relevant and/or most related to the input content. Where multiple content items are identified as relevant or related to the user input, the content item ranker 4224 is further configured to rank the content items using the innovative ranking techniques described herein. The various techniques used by the content item ranker 4224 for scoring and ranking content items are described in more detail below. The content item ranker 4224 is configured to generate a ranked list of content items to be recommended to a user in response to the user input received from that user. The ranked list of content items is then provided to the recommendation selector subsystem 4225 for further processing.

使用从内容项排名器4224接收的内容项的排名列表，推荐选择器4225被配置为响应于从客户端系统4210接收的输入内容而选择要推荐给用户的一个或多个特定内容项。在某些场景中，可以选择排名列表中的所有内容项以进行推荐。在一些其它场景中，可以选择排名的内容项的子集以进行推荐，其中该子集包括少于排名列表中的所有内容项，并且基于排名列表中的内容项的排名来选择该子集中包括的一个或多个内容项。例如，推荐选择器4225可以从排名列表中选择排名最高的“X”个(例如，排名前5、排名前10等)内容项用于推荐，其中X是小于或等于排名项目的数量的某个整数。在某些实施例中，推荐选择器4225可以基于与排名列表中的内容项相关联的分数来选择要包括在要推荐给用户的子集中的内容项。例如，只有具有高于用户可配置的阈值分数的相关联分数的那些内容项才可以被选择以推荐给用户。Using the ranked list of content items received from the content item ranker 4224, the recommendation selector 4225 is configured to select one or more specific content items to be recommended to the user in response to the input content received from the client system 4210. In some scenarios, all the content items in the ranked list can be selected for recommendation. In some other scenarios, a subset of the ranked content items can be selected for recommendation, wherein the subset includes less than all the content items in the ranked list, and one or more content items included in the subset are selected based on the ranking of the content items in the ranked list. For example, the recommendation selector 4225 can select the highest ranked "X" (e.g., top 5, top 10, etc.) content items from the ranked list for recommendation, where X is a certain integer less than or equal to the number of ranked items. In some embodiments, the recommendation selector 4225 can select the content items to be included in the subset to be recommended to the user based on the scores associated with the content items in the ranked list. For example, only those content items with an associated score higher than a user-configurable threshold score can be selected to be recommended to the user.
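The two selection strategies described above (the top "X" items, or only items scoring above a configurable threshold) can be sketched as follows. The ranked list is assumed to be already sorted best-first; the function names, item identifiers, and scores are made up for illustration.

```python
from typing import List, Tuple

RankedItem = Tuple[str, float]  # (item identifier, final ranking score)


def select_top_x(ranked: List[RankedItem], x: int) -> List[RankedItem]:
    """Return the X highest-ranked items; `ranked` must be sorted best-first."""
    if x < 0:
        raise ValueError("x must be non-negative")
    return ranked[:x]  # yields fewer than X items if the list is shorter


def select_above_threshold(ranked: List[RankedItem], threshold: float) -> List[RankedItem]:
    """Return only the items whose score is strictly higher than the threshold."""
    return [(item_id, score) for item_id, score in ranked if score > threshold]


# Hypothetical ranked list with invented scores.
ranked_list = [("B", 4.0), ("A", 3.0), ("C", 3.0), ("D", 2.6)]

top_two = select_top_x(ranked_list, 2)
above = select_above_threshold(ranked_list, 2.9)
print(top_two)
print(above)
```

Selecting every item in the ranked list corresponds to calling `select_top_x` with X equal to the length of the list.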

然后，可以将识别由推荐选择器4225选择用于推荐的内容项的信息从内容推荐系统4220传送到用户的用户客户端设备4210。然后可以经由用户客户端设备将关于推荐的内容项的信息输出给用户。例如，关于推荐的信息可以经由在用户客户端设备上显示的GUI4215或者经由由用户客户端设备执行的应用4215来输出。例如，如果用户输入与用户经由由用户设备执行的浏览器显示的网页而由用户输入的搜索查询对应，那么关于推荐的信息也可以经由那个网页或由浏览器显示的附加网页输出给用户。在某些实施例中，对于每个推荐的内容项，输出给用户的信息可以包括识别该内容项的信息(例如，文本信息、图像的缩略图等)以及用于访问该内容项的信息。例如，访问内容项的信息可以采用链接(例如，URL)的形式，当用户选择该链接(例如，通过鼠标单击动作)时，该链接将导致对应的内容项被访问并经由用户客户端设备显示给用户。在一些实施例中，可以组合识别内容项的信息和用于访问内容项的信息(例如，推荐的图像的缩略图表示，其既识别图像内容项又可以由用户选择以访问图像本身)。Information identifying the content item selected for recommendation by the recommendation selector 4225 may then be transmitted from the content recommendation system 4220 to the user client device 4210 of the user. Information about the recommended content item may then be output to the user via the user client device. For example, information about the recommendation may be output via a GUI 4215 displayed on the user client device or via an application 4215 executed by the user client device. For example, if the user input corresponds to a search query entered by the user via a web page displayed by a browser executed by the user device, then information about the recommendation may also be output to the user via that web page or an additional web page displayed by the browser. In some embodiments, for each recommended content item, the information output to the user may include information identifying the content item (e.g., text information, thumbnails of images, etc.) and information for accessing the content item. For example, the information for accessing the content item may be in the form of a link (e.g., a URL), which, when the user selects the link (e.g., by a mouse click action), causes the corresponding content item to be accessed and displayed to the user via the user client device. 
In some embodiments, the information identifying a content item and the information for accessing the content item may be combined (e.g., a thumbnail representation of a recommended image that both identifies the image content item and can be selected by the user to access the image itself).

In various embodiments, the content recommendation system 4220, including its associated hardware/software components 4221-4225 and services, can be implemented as a backend service remote from the front-end client device 4210. The interaction between the client device 4210 and the content recommendation system 4220 can be an Internet-based web browsing session or a client-server application session, during which a user can enter user content (e.g., search terms, originally authored content, etc.) via the client device 4210 and receive content item recommendations from the content recommendation system 4220. Additionally or alternatively, the content recommendation system 4220 and/or the content repository 4230 and related services can be implemented as specialized software components executing directly on the client device.

In some embodiments, the system 4200 shown in FIG. 42 can be implemented as a cloud-based multi-tier system, in which upper-tier user devices 4210 request and receive access to network-based resources and services via the content recommendation system 4220, which resides on back-end application servers deployed and executed on an underlying set of resources (e.g., cloud-based, SaaS, IaaS, PaaS, etc.). Some or all of the functionality described herein for the content recommendation system 4220 can be performed by, or accessed through, Representational State Transfer (REST) services and/or web services, including Simple Object Access Protocol (SOAP) web services or APIs, and/or web content exposed via the Hypertext Transfer Protocol (HTTP) or HTTP Secure (HTTPS) protocols. Thus, although not shown in FIG. 42 so as not to obscure the depicted components with additional detail, the computing environment 4200 can include additional client devices, one or more computer networks, one or more firewalls, proxy servers, routers, gateways, load balancers, and/or other intermediate network devices that facilitate interaction between the client devices 4210, the content recommendation system 4220, and the content repositories 4230.

In various implementations, the systems depicted in the computing environment 4200 may be implemented using one or more computing systems and/or networks, including specialized server computers (such as desktop servers, UNIX servers, mid-range servers, mainframe computers, rack-mounted servers, etc.), server farms, server clusters, distributed servers, or any other suitable arrangement and/or combination of computing hardware. For example, the content recommendation system 4220 may run an operating system and/or various other server applications and/or mid-tier applications, including Hypertext Transfer Protocol (HTTP) servers, File Transfer Protocol (FTP) servers, Common Gateway Interface (CGI) servers, Java servers, database servers, and other computing systems. Any or all of the components or subsystems within the content recommendation system 4220 may include at least one memory, one or more processing units (e.g., processor(s)), and/or storage. The subsystems and/or modules in the content recommendation system 4220 may be implemented in hardware, in software executing on hardware (e.g., program code or instructions executable by a processor), or in a combination thereof.
In some examples, the software may be stored in a memory (e.g., a non-transitory computer-readable medium), on a memory device, or in some other physical memory, and may be executed by one or more processing units (e.g., one or more processors, one or more processor cores, one or more graphics processing units (GPUs), etc.). The computer executable instructions or firmware implementation of the (one or more) processing units may include computer executable instructions or machine executable instructions written in any suitable programming language, which may perform the various operations, functions, methods, and/or processes described herein. The memory may store program instructions that are loadable and executable on the (one or more) processing units, as well as data generated during the execution of these programs. The memory may be volatile (such as random access memory (RAM)) and/or non-volatile (such as read-only memory (ROM), flash memory, etc.). The memory may be implemented using any type of persistent storage device (such as a computer-readable storage medium). In some examples, the computer-readable storage medium may be configured to protect the computer from electronic communications containing malicious code.

FIG. 43 depicts a simplified flowchart 4300 of processing performed by a content recommendation system to identify and rank content items related to user content, according to certain embodiments. The processing depicted in FIG. 43 may be implemented in software (e.g., code, instructions, programs) executed by one or more processing units (e.g., processors, cores) of the respective systems, in hardware, or in combinations thereof. The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG. 43 and described below is intended to be illustrative and non-limiting. Although FIG. 43 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the processing may be performed in a different order, or some steps may be performed in parallel. The processing depicted in FIG. 43 may be performed by one or more of the systems depicted in FIG. 42, such as the content recommendation system 4220. As an example, for the embodiment depicted in FIG. 42, the processing in 4302 and 4304 may be performed by tag identifier 4221, the processing in 4306 through 4316 may be performed by content item ranker 4224, and the processing in 4318 and 4320 may be performed by recommendation selector 4225. However, it should be understood that the techniques and functionality described in conjunction with FIG. 43 need not be limited to implementation within the particular computing infrastructure shown in FIG. 42, but may be implemented using other compatible computing infrastructures described herein.

In 4302, the content recommendation system 4220 may receive input content from one or more users or client systems 4210. As discussed above with reference to FIG. 42, the input content may be received from a client device 4210 via a graphical user interface 4215 (e.g., a web-based GUI) provided by the content recommendation system 4220. In other examples, the input content may be received by a web server or back-end service executing within the content recommendation system 4220, based on data transmitted by a front-end application (e.g., a mobile application) installed on the client device 4210.

In some embodiments, the input content received in step 4302 may correspond to a set of search terms or phrases entered by a user into a search engine user interface. In other embodiments, the input content may correspond to original content authored by a user and entered into a dedicated user interface. For example, new original content may include online articles, newsletters, emails, blog posts, and the like, and such content may be entered by the user via a software-based word processor tool, an email client application, a web development tool, etc. In still other examples, the input content received in step 4302 may be an image, a graphic, audio input, or any other text and/or multimedia content generated or selected by the user via the client device 4210.

Referring briefly to FIG. 44, an example user interface 4400 is shown that includes a user interface screen 4410 allowing a user to enter originally authored content. In this example, the user interface screen 4410 is generically labeled "Content Authoring User Interface," but in various embodiments, the user interface 4410 may correspond to the interface of a word processor, an article designer or blog post authoring tool, an email client application, or the like. In this example, the user interface 4410 includes a first text box 4411 in which the user may enter a title or subject for the authored content, and a second text box 4412 in which the user may enter the full text (e.g., an article, an email body, a document, etc.) as input content. In addition, the user interface 4410 includes a selectable button 4413 that allows the user to initiate a search for related content items (e.g., images, related articles, etc.) that may be incorporated into the newly authored content. In some cases, selection of the button 4413 or a similar user interface component may initiate the processing shown in FIG. 43 by first analyzing the user content received via the user interface 4410 (e.g., the user content entered in 4411 and/or 4412) and transmitting it to the content recommendation system 4220. In other embodiments, a background process may run continuously (or periodically) within the front-end user interface to analyze new text input received from the user (e.g., content entered by the user in 4411 and/or 4412) and to re-initiate the processing of FIG. 43 in response to text updates, so that the content item recommendations can be updated continuously or periodically in real time.

Referring back to FIG. 43, in 4304, the content recommendation system 4220 determines one or more tags for the input content received in step 4302. In some embodiments, the input content received in 4302 may already have tags associated with it or embedded in it, and the tag identifier 4221 may identify and extract the set of predetermined tags associated with the input content. If the input content received in step 4302 corresponds to search terms, the tag identifier 4221 may simply use the search terms entered by the user as tags (e.g., excluding certain words such as articles, conjunctions, prepositions, quantifiers, etc.). In the case of originally authored text content, or any other untagged input content received by the content recommendation system 4220, the tag identifier 4221 may be configured to analyze various features of the input content received in 4302 and, based on the analysis, determine one or more tags to be associated with the received input content. As previously described, the tag identifier 4221 may use a variety of different techniques to determine the one or more tags to be associated with the input content received in 4302.
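As a minimal sketch of the search-term case described above, the following Python function uses each remaining query word as a tag. The function name `tags_from_search_terms` and the stop-word list are illustrative assumptions; the description names only the classes of excluded words, not a concrete list.

```python
# Hypothetical stop-word list (assumption): the description names only the
# word classes to exclude (articles, conjunctions, prepositions, quantifiers).
STOP_WORDS = {
    "a", "an", "the",                        # articles
    "and", "or", "but",                      # conjunctions
    "of", "in", "on", "for", "to", "than",   # prepositions
    "some", "any", "many",                   # quantifiers
}


def tags_from_search_terms(query: str) -> list[str]:
    """Return the search terms as tags, skipping excluded words and duplicates."""
    tags: list[str] = []
    for raw in query.lower().split():
        # Drop surrounding punctuation so "tea?" and "tea" give the same tag.
        word = raw.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS and word not in tags:
            tags.append(word)
    return tags
```

Tag determination for authored text or multimedia input would need a different analysis and is not covered by this sketch.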

Returning to the example user interface 4400 shown in FIG. 44, for this example, the input content received in 4302 may correspond to the text entered by the user in the title/subject box 4411 (i.e., "Is coffee healthier than tea for you?") and the text entered in box 4412. Based on an analysis of the subject content in 4411 and the body text of the article authored in 4412, as well as any other input content that may be provided, the content recommendation system 4220 may determine in 4304 that the tags "coffee," "tea," and "person" are associated with the user content.

In 4306, based on the tags determined for the input content in 4304, the content recommendation system 4220 uses tag matching techniques to identify a set of matching content items from the set of content items available for recommendation, where a content item is considered a match and is identified in 4306 if at least one tag associated with the content item matches a tag determined for the input content in 4304. The set of content items identified in 4306 may also be referred to as the matching set of content items, and includes the content items that are candidates for recommendation to the user in response to the input content received in 4302. In various embodiments, data processing techniques such as stemming and synonym retrieval/comparison may be used as part of the matching processing in 4306 to identify matching tags.

For example, in the embodiment depicted in FIG. 42, the set of content items available for recommendation and their associated tag information (e.g., the tags associated with the content items and the associated tag values) may be stored in the content/tag information data store 4223. As part of the processing in 4306, the content recommendation system 4220 may compare the one or more tags determined in 4304 with the tags associated with the content items available for recommendation, and identify from the set those content items having at least one associated tag that matches a tag identified in 4304.
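The matching rule in 4306 can be sketched as a simple filter. This is an illustration under stated assumptions, not the system's actual implementation: `find_matching_items` is a hypothetical name, the catalog is modeled as a plain dictionary rather than the data store 4223, tags are compared by exact string equality (no stemming or synonym handling), and the sample tag values are invented.

```python
def find_matching_items(input_tags, catalog):
    """Return the items that share at least one tag with the input content.

    catalog maps an item identifier to a dict of {tag: tag value}.
    """
    wanted = set(input_tags)
    return {
        item_id: item_tags
        for item_id, item_tags in catalog.items()
        if not wanted.isdisjoint(item_tags)  # iterating a dict yields its tags
    }


# Hypothetical catalog (assumption): tag names and values are illustrative
# and are not taken from table 4500.
catalog = {
    "Image_A": {"coffee": 0.5, "person": 0.5},
    "Image_B": {"tea": 0.7, "cup": 0.3},
    "Image_C": {"mountain": 0.6, "snow": 0.4},
}
matches = find_matching_items(["coffee", "person", "tea"], catalog)
```

Here `Image_A` and `Image_B` are kept because each carries at least one input tag, while `Image_C` is dropped.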

Continuing with the example of FIG. 44 and assuming that the tags "coffee," "person," and "tea" have been determined in 4304 for the user input content received in 4302, the example table 4500 in FIG. 45 shows the matching set of content items identified by the content recommendation system 4220 (e.g., by the content item ranker 4224) from the set of content items available for making recommendations. As can be seen from table 4500, eight different image content items have been identified as having at least one associated tag that matches at least one of the tags "coffee," "person," or "tea." As shown in this example, each matching image content item is identified using an image identifier 4501. The image description 4502 provided in FIG. 45 is a description of the content of the matching image, and is provided in FIG. 45 so that the actual images need not be shown. Each matching image has one or more associated tags 4503, and a tag value 4504 is associated with each tag. In the example of table 4500, the tag values are floating point values within a predetermined range (e.g., between 0.0 and 1.0), and the tag values of each content item sum to the same total (e.g., 1.0). Such embodiments provide additional technical advantages. For example, because the tag values of every content item sum to the same total, the TVBS helps ensure that content items with more associated tags are not artificially over-ranked or over-recommended merely because they carry a larger number of tags. In addition, as discussed below, because the tag values lie between 0.0 and 1.0, when they are multiplied together (e.g., in the use case described below), the resulting value allows content items to be ranked within a particular group or bucket while ensuring that the highest ranked content item in one group is not ranked above the lowest ranked content item in the next higher group.
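The description does not say how tag values are made to sum to a common total. One possible way, shown here purely as an assumption, is to rescale each item's raw tag weights; the function name `normalize_tag_values` and the raw weights are hypothetical.

```python
def normalize_tag_values(raw_values, total=1.0):
    """Scale one item's tag values so that they sum to `total`."""
    raw_sum = sum(raw_values.values())
    if raw_sum <= 0:
        raise ValueError("tag values must have a positive sum")
    return {tag: total * value / raw_sum for tag, value in raw_values.items()}


# Hypothetical raw weights (assumption), e.g. from an image tagging model.
normalized = normalize_tag_values({"coffee": 3.0, "person": 1.0})
```

After rescaling, every value lies between 0.0 and 1.0 and each item's values sum to 1.0, which matches the property described for table 4500.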

As can be seen from FIG. 45, a content item (an image in this example) is considered a match if at least one tag associated with the content item matches a tag associated with the input content. A matching content item may have associated tags that match one or more of the tags associated with the input content. A matching content item may also have other associated tags that differ from the tags of the input content (e.g., Image_1, Image_3, etc.).

In 4308, a tag count score (a first score) is calculated for each matching content item identified in 4306, based on the number of tags associated with the content item that match the tags determined for the input content. In some scenarios, each matching tag of a content item is given a value of one, so the tag count score calculated for the content item in 4308 equals the number of the content item's tags that match the tags associated with the input content. For the example matching images identified in FIG. 45, table 4600 in FIG. 46 lists the tag count score 4602 for each matching image. For example, the tag count score for Image_1 is "1" (one), because one tag associated with the image ("person") matches a tag associated with the input content. As another example, the tag count score for Image_2 is "2" (two), because two tags associated with the image ("coffee" and "person") match tags associated with the input content. As yet another example, the tag count score for Image_8 is "1" (one), because one tag associated with the image ("coffee") matches a tag associated with the input content.
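The tag count score with one point per matching tag reduces to a set intersection. The sketch below assumes exact string matching; `tag_count_score` is a hypothetical name, and the non-matching tag "hat" and its value are invented so that each image's values sum to 1.0.

```python
def tag_count_score(input_tags, item_tags):
    """Count the item's tags that match the input tags (one point per match)."""
    return len(set(input_tags) & set(item_tags))


input_tags = ["coffee", "person", "tea"]
# Matching tags follow the examples in the text; the non-matching tag "hat"
# and its value are assumptions added for illustration.
image_1 = {"person": 0.93, "hat": 0.07}
image_2 = {"coffee": 0.5, "person": 0.5}

score_1 = tag_count_score(input_tags, image_1)  # one matching tag: "person"
score_2 = tag_count_score(input_tags, image_2)  # two matching tags
```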

In 4310, the matching content items are grouped (bucketed) into groups or buckets based on the tag count scores calculated for the content items in 4308. In certain embodiments, a group (or bucket) contains all content items that have the same tag count score. Where each matching tag of a content item is given a value of one, each group or bucket includes the content items that have the same number of matching tags. The processing in 4310 may be optional and may not be performed in some embodiments.

Continuing with the example shown in FIG. 46, the eight matching images can be grouped into two groups or buckets: a first group or bucket that includes the content items with a tag count score of 1 (one), and a second group or bucket that includes the content items with a tag count score of 2 (two). The first group would include the images {Image_1, Image_3, Image_6, and Image_8}, and the second group would include the images {Image_2, Image_4, Image_5, and Image_7}. Notably, in this example, no matching image matches all three tags associated with the input content ("coffee," "person," "tea").
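The grouping in 4310 can be sketched as follows, using the tag count scores stated for the eight example images. The function name `bucket_by_tag_count` and the dictionary representation are assumptions made for illustration.

```python
from collections import defaultdict


def bucket_by_tag_count(tag_count_scores):
    """Group item identifiers by their tag count score."""
    buckets = defaultdict(list)
    for item_id, score in tag_count_scores.items():
        buckets[score].append(item_id)
    return dict(buckets)


# Tag count scores as described for the example of FIG. 46.
tag_count_scores = {
    "Image_1": 1, "Image_2": 2, "Image_3": 1, "Image_4": 2,
    "Image_5": 2, "Image_6": 1, "Image_7": 2, "Image_8": 1,
}
buckets = bucket_by_tag_count(tag_count_scores)
```

With this input, bucket 1 holds Image_1, Image_3, Image_6, and Image_8, and bucket 2 holds Image_2, Image_4, Image_5, and Image_7; no bucket 3 exists because no image matches all three tags.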

In 4312, for each group identified in 4310, a tag value-based score (a second score) is calculated for each candidate content item in that group. In some embodiments, the tag value-based score (TVBS) for a particular content item is calculated from the tag values associated with the matching tags of that content item.

In certain embodiments, the TVBS for a content item is calculated by multiplying together the tag values associated with those tags of the content item that match the tags associated with the input content. For example:

TVBS of Image_1 in FIG. 46 = 0.93

TVBS of Image_2 in FIG. 46 = 0.5 * 0.5 = 0.25

TVBS of Image_3 in FIG. 46 = 0.65

TVBS of Image_4 in FIG. 46 = 0.35 * 0.60 = 0.21, and so on.
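The product form of the TVBS can be sketched as below. The tag values 0.93, 0.5, 0.35, and 0.60 come from the examples above; the tag names attached to Image_4, the non-matching tags ("hat", "table"), and the function name `tvbs` are assumptions, since table 4600 itself is not reproduced here.

```python
import math


def tvbs(input_tags, item_tags):
    """Multiply the tag values of the item's tags that match the input tags."""
    wanted = set(input_tags)
    matched_values = [value for tag, value in item_tags.items() if tag in wanted]
    if not matched_values:
        raise ValueError("item has no matching tags")
    return math.prod(matched_values)


input_tags = ["coffee", "person", "tea"]
# Tag names for Image_4 and all non-matching tags are assumed for illustration.
image_1 = {"person": 0.93, "hat": 0.07}
image_2 = {"coffee": 0.5, "person": 0.5}
image_4 = {"coffee": 0.35, "tea": 0.60, "table": 0.05}

tvbs_1 = tvbs(input_tags, image_1)
tvbs_2 = tvbs(input_tags, image_2)
tvbs_4 = tvbs(input_tags, image_4)
```

Non-matching tags are ignored, so only the values of matched tags contribute to the product.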

Table 4600 shows the TVBS 4603 calculated for each matching image using the above technique. In certain embodiments, a naive Bayes approach is used to calculate the tag value-based score of a content item. For example, assuming that two tags, tag_1 and tag_2, are determined for the input content, the tag value-based score (TVBS) of an image content item can be expressed as the probability of Image_i given tags tag_1 and tag_2:

$$\mathrm{TVBS}(\mathrm{Image}_i) = P(\mathrm{Image}_i \mid \mathrm{tag}_1, \mathrm{tag}_2)$$

Extending this to n tags gives the probability of Image_i given tags tag_1 and tag_2 and ... tag_n:

$$P(\mathrm{Image}_i \mid \mathrm{tag}_1, \mathrm{tag}_2, \ldots, \mathrm{tag}_n)$$

Assuming that the tags tag_1, tag_2, ..., tag_n are independent of one another:

$$\mathrm{TVBS}_i = P(\mathrm{Image}_i \mid \mathrm{tag}_1, \mathrm{tag}_2, \ldots, \mathrm{tag}_n) = \prod_{t=1}^{n} P(\mathrm{Image}_i \mid \mathrm{tag}_t) \tag{1}$$

Now, from simple naive Bayes:

$$P(\mathrm{Image}_i \mid \mathrm{tag}_t) = \frac{P(\mathrm{tag}_t \mid \mathrm{Image}_i)\, P(\mathrm{Image}_i)}{P(\mathrm{tag}_t)} \tag{2}$$

where:

P(tag_t | Image_i) = the probability of tag_t for Image_i

P(Image_i) = each image is considered unique, so this term can be discarded or ignored

P(tag_t) = the frequency of the tag in the set of content items (i.e., the number of content items (e.g., images) in the set of content items available for recommendation that are tagged with tag_t). Because this term is in the denominator, the less frequently a tag appears in the set of content items (i.e., the fewer content items have this associated tag), the higher the TVBS of an image with that tag.

After expansion, the above formula becomes:

$$\mathrm{TVBS}_i = \prod_{t=1}^{n} \frac{P(\mathrm{tag}_t \mid \mathrm{Image}_i)}{P(\mathrm{tag}_t)} \tag{3}$$

The numerator in Equation 3 above is a product of probabilities, which yields a higher score when the probabilities are closer to equal (the denominator stays the same given that multiple images match the same tags).

The following example illustrates the application of Equation 3 in calculating the TVBS of content items. Assume that the tags determined for the input content are "person" and "coffee." Further assume that the set of content items (images) available for recommendation contains three images with the following tags and tag values:

Image A: ("person", 0.5), ("coffee", 0.5)

Image B: ("person", 0.1), ("coffee", 0.9)

Image C: ("person", 0.8), ("coffee", 0.2)

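For Image A:

P(person | image) = 0.5 and P(coffee | image) = 0.5

Applying Equation 3,

$$\mathrm{TVBS}_A = \frac{P(\text{person} \mid \text{image})}{\text{Frequency}(\text{person})} \times \frac{P(\text{coffee} \mid \text{image})}{\text{Frequency}(\text{coffee})} = \frac{0.5}{3} \times \frac{0.5}{3}$$

where "Frequency(person)" is the number of content items in the set of content items available for recommendation that have an associated "person" tag (three in this example), and "Frequency(coffee)" is likewise the number of content items that have an associated "coffee" tag (also three).

TVBS of Image A ≈ 0.028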

Using a similar technique,

TVBS of Image B = (0.1/3) * (0.9/3) = 0.01

TVBS of Image C = (0.8/3) * (0.2/3) ≈ 0.018

As can be seen from this example, when the frequencies are the same, the TVBS is higher when the probabilities are closer to equal (the denominator stays the same when multiple images match the same tags).
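The worked example for Images A, B, and C can be reproduced with a short sketch. It reads Equation 3 as the product, over the matching tags, of the tag value divided by the tag frequency, where the frequency is the count of catalog items carrying that tag; this reading is inferred from the worked values above. The function names `tag_frequencies` and `tvbs_eq3` are hypothetical.

```python
import math
from collections import Counter


def tag_frequencies(catalog):
    """Count how many items in the catalog carry each tag."""
    return Counter(tag for item_tags in catalog.values() for tag in item_tags)


def tvbs_eq3(input_tags, item_tags, frequencies):
    """Product over matching tags of (tag value / tag frequency)."""
    return math.prod(
        item_tags[tag] / frequencies[tag]
        for tag in input_tags
        if tag in item_tags
    )


# The three images and tag values from the example in the text.
catalog = {
    "A": {"person": 0.5, "coffee": 0.5},
    "B": {"person": 0.1, "coffee": 0.9},
    "C": {"person": 0.8, "coffee": 0.2},
}
frequencies = tag_frequencies(catalog)  # "person" and "coffee" each appear 3 times
scores = {
    name: tvbs_eq3(["person", "coffee"], item_tags, frequencies)
    for name, item_tags in catalog.items()
}
```

Rounded to three decimals, the scores are 0.028 for A, 0.01 for B, and 0.018 for C, so the image with the most balanced tag values ranks first.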

Per the expanded Equation 3, the frequency of content items in the set that have a particular associated tag is taken into account in calculating the TVBS (and will therefore also affect the overall ranking score, as described below). The fewer the content items that have a particular associated tag (i.e., the lower the frequency), the higher the TVBS of an image with that tag. In some embodiments, it is desirable to increase the likelihood that content items with "rare" or less frequent tags are ranked higher than content items with more frequent tags, thereby increasing the likelihood that they are included in the list of content items recommended to the user. Accordingly, the TVBS of a content item is inversely proportional to the frequency of occurrence of a particular tag in the set of content items (i.e., the number of content items in the set that have that particular associated tag).

In 4314, an overall ranking score is calculated for each matching content item identified in 4306, based on the tag count score calculated for the content item in 4308 and the TVBS calculated for the content item in 4312. In some embodiments, the overall ranking score of a candidate content item may be calculated as the sum of the tag count score (calculated in 4308) and the TVBS (calculated in 4312) for that content item. That is, for an image content item Image_i:

RankingScore(Image_i) = TagsCountScore_i + TVBS_i

For the example depicted in FIG. 46, column 4604 shows the overall ranking score calculated for each matching image by adding that image's tag count score (shown in column 4602) to its TVBS (shown in column 4603). For example, the overall ranking scores are calculated as follows:

Image_1: 1 + 0.93 = 1.93

Image_2: 2 + 0.25 = 2.25

Image_3: 1 + 0.65 = 1.65, and so on.

In step 4316, the content recommendation system 4220 (e.g., the content item ranker 4224 in the content recommendation system 4220) generates a ranked list of the matching content items based on the overall ranking scores calculated for the content items in 4314. Thus, continuing with the example of FIGS. 44-46, based on the overall ranking scores calculated for the matching images (in column 4604), the images may be ranked from highest to lowest as follows: (1) Image_2, (2) Image_5, (3) Image_4, (4) Image_7, (5) Image_6, (6) Image_1, (7) Image_8, (8) Image_3.
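Steps 4314 and 4316 can be sketched together as a sum followed by a descending sort. Only the four images whose scores are worked out in the text are used; the function name `rank_items` is an assumption.

```python
def rank_items(tag_count_scores, tvbs_scores):
    """Return (item, overall score) pairs sorted from highest to lowest score."""
    overall = {
        item: tag_count_scores[item] + tvbs_scores[item]
        for item in tag_count_scores
    }
    return sorted(overall.items(), key=lambda pair: pair[1], reverse=True)


# Scores for the four images worked out in the text.
tag_counts = {"Image_1": 1, "Image_2": 2, "Image_3": 1, "Image_4": 2}
tvbs_scores = {"Image_1": 0.93, "Image_2": 0.25, "Image_3": 0.65, "Image_4": 0.21}
ranked = rank_items(tag_counts, tvbs_scores)
```

For this subset the order is Image_2, Image_4, Image_1, Image_3, which agrees with the relative order of those images in the full eight-image ranking.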

In certain embodiments in which tag values fall within the range of 0.0 to 1.0, calculating the overall ranking score as (TagCountScore + TVBS) ensures that content items with a higher number of associated tags matching the tags of the input content are ranked higher than content items with fewer tag matches. For example, in the example of FIG. 46, an image with a tag count score of 2 (corresponding to two tags associated with the image that match tags associated with the input content) will always have a higher overall ranking score, and therefore be ranked higher, than an image with a tag count score of 1 (corresponding to one tag associated with the image that matches a tag associated with the input content). This is because, given that tag values lie between zero and one, the TVBS of an image, calculated by multiplying the tag values associated with the matching tags, cannot exceed one. This also means that, for a first group or bucket of content items corresponding to a first tag count score and a second group or bucket of content items corresponding to a second tag count score, if the first tag count score is higher than the second tag count score, then every content item in the first group will be ranked higher (due to its higher overall ranking score) than the content items in the second group. Thus, in the example in which three tags ("coffee," "person," and "tea") are determined for the input content, content items having three tags that match the tags of the input content will always be ranked ahead of content items having two matching tags, every content item having two matching tags will always be ranked ahead of content items having one matching tag, and so on. Within each group or bucket, the content items can be ranked based on their TVBS, which favors matching tags whose tag values are higher and closer to equal. However, it should be understood that in other embodiments, different equations or logic may be used to calculate the tag count score, the TVBS, and the overall ranking score in order to implement different content item ranking priorities and strategies.
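The bucket-ordering property can be written out as a short derivation. It is a sketch under one added assumption that the text does not state explicitly: every matched tag value is strictly positive, so the TVBS of a matching item is greater than zero. The symbols c_a and c_b are introduced here for the integer tag count scores of two items a and b with c_a > c_b.

```latex
% Assumption: every matched tag value lies in (0, 1], so 0 < TVBS <= 1.
% c_a and c_b are integer tag count scores with c_a > c_b.
\begin{align*}
\text{score}_a &= c_a + \mathrm{TVBS}_a > c_a \\
               &\ge c_b + 1 \\
               &\ge c_b + \mathrm{TVBS}_b = \text{score}_b
\end{align*}
```

The middle step uses the fact that the tag count scores are integers, so c_a exceeds c_b by at least one.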

In 4318, the content recommendation system 4220 (e.g., the recommendation selector 4225) may use the ranked list generated in 4316 to select one or more content items to be recommended to the user. In some scenarios, all content items in the ranked list may be selected for recommendation. In other scenarios, a subset of the ranked content items may be selected for recommendation, where the subset includes fewer than all of the content items in the ranked list, and the one or more content items included in the subset are selected based on their ranking in the ranked list. For example, the recommendation selector 4225 may select the top "X" ranked content items (e.g., the top 5, the top 10, etc.) from the ranked list for recommendation, where X is some integer less than or equal to the number of ranked items in the list. In certain embodiments, the recommendation selector 4225 may select the content items to be included in the subset recommended to the user based on the overall ranking scores associated with the content items in the ranked list. For example, only those content items having an associated score above a user-configurable threshold score may be selected for recommendation to the user.
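The two selection strategies described for 4318 (top "X" and score threshold) can be sketched in one helper. The function name `select_recommendations` and its parameters `top_x` and `min_score` are hypothetical, and combining both filters in a single call is a design choice of this sketch rather than something the description requires.

```python
def select_recommendations(ranked, top_x=None, min_score=None):
    """Select items from a ranked list of (item, score) pairs.

    ranked must already be sorted from highest to lowest score.
    """
    selected = list(ranked)
    if min_score is not None:
        # Keep only items scoring above the configurable threshold.
        selected = [(item, score) for item, score in selected if score > min_score]
    if top_x is not None:
        selected = selected[:top_x]
    return selected


ranked = [("Image_2", 2.25), ("Image_4", 2.21), ("Image_1", 1.93), ("Image_3", 1.65)]
top_two = select_recommendations(ranked, top_x=2)
above_two = select_recommendations(ranked, min_score=2.0)
everything = select_recommendations(ranked)
```

Calling the helper with no arguments returns the whole ranked list, which corresponds to the scenario in which every ranked item is recommended.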

In 4320, the content recommendation system 4220 may communicate information related to the content items selected in 4318 to the user device. Because this information includes information about the content items to be recommended to the user, it may be referred to as recommendation information. The recommendation information communicated in 4320 may also include ranking information (e.g., the overall ranking scores associated with the selected content items). This information may be used on the user device to determine how information about the recommended content items (e.g., their order) is displayed to the user via the user device. In some embodiments, as part of the recommendation information, the recommendation selector 4225 may transmit, to the client device 4210 from which the input content was received in step 4302, either the content items themselves or certain information identifying the content items (e.g., content item identifiers and descriptions, thumbnail images, network paths or download links, etc.).

The information about the selected recommendations may then be output to the user via the user device. For example, the information about the recommendations may be output via a GUI 4215 displayed on the user client device, or via an application 4215 executed by the user client device. For example, if the user input corresponds to a search query entered by the user via a web page displayed by a browser executing on the user device, the information about the recommendations may also be output to the user via the web page showing the search results, or via an additional web page displayed by the browser. In certain embodiments, for each recommended content item, the information output to the user may include information identifying the content item (e.g., text information, a thumbnail of an image, etc.) and information for accessing the content item. For example, the information for accessing the content item may take the form of a link (e.g., a URL) that, when selected by the user (e.g., by a mouse click), causes the corresponding content item to be accessed and displayed to the user via the user client device. In some embodiments, the information identifying the content item and the information for accessing the content item may be combined (e.g., a thumbnail representation of a recommended image that both identifies the image content item and can be selected by the user to access the image itself).

For example, referring to FIG. 47, an example user interface 4700 is shown, which corresponds to an updated version of the user interface screen 4400 from FIG. 44 and displays information related to the recommended images. In this example, based on the title/subject 4711, the body text 4712, and/or any other input content, the content recommendation system 4220 has selected the four highest-ranked content item images from the ranked list of images to recommend to the user. Information related to these four highest-ranked images is displayed in rank order within a dedicated portion 4714 of the user interface 4700 to present the content item recommendations. In some embodiments, thumbnail representations of the recommended images may be displayed in 4714. The user interface 4700 may support drag-and-drop functionality or other techniques that allow the user to incorporate one or more of the suggested images displayed in 4714 into the body 4712 of the content being authored.

The processing depicted in FIG. 43 and described above is not intended to be limiting. Various variations may be provided in different embodiments. For example, in the embodiment depicted in FIG. 43 and described above, the processing in 4312, 4314, and 4316 is performed for all of the matching content items identified in 4306. In some variations, the tag count scores computed in 4308 may be used to filter certain content items out of further processing. For example, where content items have different tag count scores, the content items with the lowest tag count score (or a tag count score below some other threshold) may be filtered out of the further processing in the flowchart. In some other embodiments, only those content items with the highest tag count score may be selected for further processing, thereby filtering out the other content items. For example, the content item ranker 4224 may compute the TVBS only for the highest tag score group (as determined in step 4304), or may compute the TVBS for the groups in order from the highest to the lowest tag count score, during which the computation may stop once a threshold number of candidate content items or a threshold tag count score is reached. Such filtering reduces the number of content items to be processed and may allow the overall recommendation operation to execute faster and more efficiently, using fewer processing resources (e.g., processor, memory, and network resources).
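The early-stopping variation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the content item ranker 4224: the function name `select_candidates`, its parameters, the sample scores, and the choice to keep whole score groups together are all hypothetical.

```python
from collections import defaultdict


def select_candidates(tag_count_scores, max_candidates, min_score=1):
    """Walk score groups from highest to lowest and stop at a threshold.

    Hypothetical helper: stops when enough candidates were collected or
    when the next group's score falls below min_score. Whole groups are
    kept together, so the result may exceed max_candidates.
    """
    groups = defaultdict(list)
    for item_id, score in tag_count_scores.items():
        groups[score].append(item_id)

    selected = []
    for score in sorted(groups, reverse=True):
        if score < min_score or len(selected) >= max_candidates:
            break
        selected.extend(sorted(groups[score]))
    return selected


# Illustrative (assumed) unweighted tag count scores.
scores = {"Image_1": 1, "Image_2": 2, "Image_3": 1, "Image_5": 2, "Image_6": 1}
top = select_candidates(scores, max_candidates=2)
print(top)
```

Only the items that survive this step would go on to the more expensive per-item scoring, which is where the saving in processing resources would come from.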

In the method depicted in FIG. 43 and described above, each tag associated with or determined for the input content in 4304 is given equal weight for purposes of the ranking performed by the content recommendation system 4220. Based on this assumption, the tag score for each matching content item is determined as the number of its content tags that match the tags of the input content. Thus, for purposes of determining the tag count score, each matching content tag associated with a matching content item is evaluated/weighted equally. In other embodiments, however, different weights may be assigned to the tags associated with the input content. For example, of two tags determined for the input content, one may be indicated as "more important" to the input content by giving it a higher weight than the other. For example, for the example depicted in FIGS. 44-47 and described above, in which the tags "person", "coffee", and "tea" are determined for the input content, instead of giving the three tags equal importance, the tags may be weighted as follows: person = 1, coffee = 2, and tea = 4. These weights may indicate the relative importance of the tags to the input content: "tea" is weighted more heavily than "coffee", and "coffee" more heavily than "person". In some embodiments, the logic used to compute the tag count score for each content item may be modified to take into account the different weights assigned to the tags of the input content. According to one such modified logic, the contribution of each matching tag of a content item is multiplied by the weight associated with the same tag for the input content. For example, using the weights (person = 1, coffee = 2, tea = 4) for the input content, the tag count scores for the matching images in FIG. 45 would be as follows:

Image_1:

- TagCountScore (unweighted) = "person" tag match = 1

- TagCountScore (weighted) = "person" tag match = 1(1) = 1

Image_2:

- TagCountScore (unweighted) = "coffee" and "person" tag matches = 1 + 1 = 2

- TagCountScore (weighted) = "coffee" and "person" tag matches = 2(1) + 1(1) = 3

Image_3:

- TagCountScore (unweighted) = "tea" tag match = 1

- TagCountScore (weighted) = "tea" tag match = 4(1) = 4

Image_4:

Image_5:

- TagCountScore (unweighted) = "tea" and "person" tag matches = 1 + 1 = 2

- TagCountScore (weighted) = "tea" and "person" tag matches = 4(1) + 1(1) = 5

Image_6:

- TagCountScore (unweighted) = "coffee" tag match = 1

- TagCountScore (weighted) = "coffee" tag match = 2(1) = 2

Image_7:

Image_8:

Because the tag count scores differ, the grouping or bucketing of content items performed in 4310 will also differ. Content items with content tags matching both the "tea" and "person" tags will be grouped together and assigned a tag score of 5; content items with only a single content tag matching "tea" will be grouped together and assigned a tag score of 4; content items with content tags matching both the "coffee" and "person" tags will be grouped together and assigned a tag score of 3; and so on. Thus, in such embodiments, the overall ranking of candidate content items is affected not only by how many tags of a content item match the tags of the input content, but also by which particular tags match and by the weights reflecting the relative importance of those tags. In this example, images Image_5 and Image_7 would be the highest-ranked content items overall, based on their highest tag count score of 5 (tea = 4 + person = 1).
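The weighted scoring and bucketing described above can be reproduced with a short sketch. The function names `tag_count_score` and `bucket_by_score` are hypothetical, and only the images whose matching tags are listed in the text are included; nothing is assumed about the remaining images.

```python
from collections import defaultdict

# Weights assigned to the tags of the input content (from the example).
INPUT_TAG_WEIGHTS = {"person": 1, "coffee": 2, "tea": 4}

# Matching tags per image as listed in the example. Images whose scores
# are not listed in the text are left out rather than guessed.
MATCHED_TAGS = {
    "Image_1": {"person"},
    "Image_2": {"coffee", "person"},
    "Image_3": {"tea"},
    "Image_5": {"tea", "person"},
    "Image_6": {"coffee"},
}


def tag_count_score(content_tags, input_weights, weighted=True):
    """Add one point (or the input tag's weight) per matching tag."""
    return sum(
        input_weights[tag] if weighted else 1
        for tag in content_tags
        if tag in input_weights
    )


def bucket_by_score(scores):
    """Group item ids by score, highest score first."""
    buckets = defaultdict(list)
    for item_id, score in scores.items():
        buckets[score].append(item_id)
    return dict(sorted(buckets.items(), reverse=True))


weighted = {img: tag_count_score(tags, INPUT_TAG_WEIGHTS)
            for img, tags in MATCHED_TAGS.items()}
unweighted = {img: tag_count_score(tags, INPUT_TAG_WEIGHTS, weighted=False)
              for img, tags in MATCHED_TAGS.items()}
buckets = bucket_by_score(weighted)
print(weighted)
print(buckets)
```

With equal weights, Image_2 and Image_5 tie at 2; with the example weights they separate into different buckets (3 and 5), which is the effect the weighting is meant to produce.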

Smart Content - Smart Categorization/Classification

Categorizing large volumes of content in an online fashion is a complex task that involves challenges such as a single-pass constraint on the data and the need for fast responses. According to an embodiment, content users categorize similar content through logical clustering, such as a hierarchical taxonomy tree, and place similar content in the same node/category of the taxonomy tree. Over time, as the number of content entities and nodes in the taxonomy tree grows, similar content entities come to reside alongside one another in the same nodes. Given this state of content organization, a computer algorithm (such as a categorization engine) can use the content that resides in the already evaluated/categorized taxonomy to determine where newly created/edited content might belong.

The recommendation system or tool may use artificial intelligence (AI) techniques to learn continually from past data and/or from new input resulting from generated recommendations, and may help place content into relevant categories through automatic categorization/classification of newly created/edited content.

The recommendation tool can be implemented and applied across different domains by generating feature vectors from content, creating clusters in feature space based on previously categorized content, and recommending categories for new content by computing distances from the clusters in feature space.
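The three steps named above (feature vectors, clusters built from previously categorized content, and distance-based recommendation) can be illustrated with a deliberately simple sketch. It assumes that each cluster is summarized by a single centroid and that Euclidean distance is used; the text specifies neither, and the function names and two-dimensional sample vectors are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of equally sized feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_clusters(categorized):
    """Build one centroid per category from (category, vector) pairs."""
    by_category = defaultdict(list)
    for category, vector in categorized:
        by_category[category].append(vector)
    return {cat: centroid(vecs) for cat, vecs in by_category.items()}


def recommend(clusters, vector, k=2):
    """Return up to k categories, nearest centroid first."""
    ranked = sorted(clusters, key=lambda cat: math.dist(clusters[cat], vector))
    return ranked[:k]


# Assumed toy feature vectors for previously categorized content.
history = [
    ("beverages", [1.0, 0.0]),
    ("beverages", [0.8, 0.2]),
    ("travel", [0.0, 1.0]),
    ("travel", [0.2, 0.8]),
]
clusters = build_clusters(history)
suggested = recommend(clusters, [0.7, 0.3], k=1)
print(suggested)
```

In practice the feature vectors would come from whatever feature extraction the system applies to the content; the toy vectors here stand in for that step.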

FIG. 48 illustrates an example use of a content management system environment, according to an embodiment.

More specifically, FIG. 48 illustrates an exemplary content management system that may include a categorization engine for smart categorization of content at the content management system.

As shown in FIG. 48, according to an embodiment, each of a plurality of client devices 4800, 4802, and 4804, having a user interface 4801, 4803, 4805 and physical device hardware 4806, 4807, 4808 (e.g., CPU, memory), may be provided with a content access application 4810, 4811, 4812 for execution thereon.

According to an embodiment, the client devices may communicate 4862 with an application server 4830 that includes physical computer hardware 4831 (e.g., CPU, memory) and a content management system 4832.

According to an embodiment, the content access application at a client device may communicate with the content management system via a network 4860 (e.g., the Internet or a cloud environment). The content access application may be configured to enable users 4850, 4852, 4854 to view, upload, modify, delete, or otherwise access content, such as content items 4820, 4822, 4824, at each client device. For example, new content may be added or uploaded to the content management system by a user interacting with the content access application on the associated client device. The content may be transmitted to the content management system, for example, for tagging and storage.

According to an embodiment, the content management system may be or include a platform for consolidating content that may be managed by multiple users or clients. According to an embodiment, the content management system may be configured to communicate with a content repository 4836 for storing content (or content items) 4840, and may deliver content to users via their client devices. According to an embodiment, the content repository may be a relational database management system (RDBMS), a file system, or another data store accessible to the content management system. Content may include, for example, documents, files, emails, memos, images, videos, slide presentations, conversations, and user profiles.

According to an embodiment, the content management system may be configured to associate metadata with content. The metadata may include information about a content item, such as its title, author, and publication date; historical data, such as who accessed the item and when; the storage location of the content; and the like.

According to an embodiment, the metadata may be stored in a metadata database 4838. According to an embodiment, the content management system may be configured to communicate with the metadata database to access the metadata stored therein and to store metadata generated by the system in the metadata database.

According to an embodiment, the content management system may also be configured to communicate with a search index 4839. The search index may be configured to provide indexing and searching of the content and data stored in the content repository and the metadata database. According to an embodiment, the search index may be a relational database management system (RDBMS) or a search tool such as Oracle Secure Enterprise Search (Oracle SES).

According to an embodiment, the content management system may further include a content management application 4833 and a categorization engine 4834. The categorization engine may include an artificial intelligence/machine learning engine and library 4835, and a user interface 4836, which may be used, for example, to display output indicating content categorization recommendations and/or to receive input indicating content categorization selections.

According to an embodiment, the categorization engine may be used, for example, with the content management system to provide recommendations for categorizing/classifying content into defined (e.g., user-defined) categories, which in turn gives content managers the opportunity to easily place new content into an accurate category based on previously evaluated/categorized content. The categorization engine may use artificial intelligence (AI) techniques and, for example, machine learning libraries to learn continually from past data and to assist in placing content into relevant categories by automatically categorizing/classifying newly created/edited content. The recommendation tool can be implemented and applied across different domains by generating feature vectors from content, creating clusters in feature space based on previously categorized content, and recommending categories for new content by computing distances from the clusters in feature space. Additionally or alternatively, the engine may automatically categorize content based on its computations.

In general, according to an embodiment, the categorization engine may be used to create a new taxonomy, to modify an existing taxonomy structure, and/or to categorize/classify existing and/or new content in bulk according to various recommendations or suggestions associated with various confidence scores or confidence levels (such as high, medium, or low confidence).

For example, according to an embodiment, the system or a user may ignore low-confidence recommendations of categories/classifications associated with a particular set of content, while accepting high- or medium-confidence recommendations.
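A bulk pass that keeps high- and medium-confidence recommendations and ignores low-confidence ones might look like the sketch below. The numeric cut-offs (0.8 and 0.5), the score range, and the function names are assumptions made for illustration; the text does not say how confidence scores map to levels.

```python
def confidence_level(score, high=0.8, medium=0.5):
    """Map a confidence score in [0, 1] to a level (assumed thresholds)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def actionable(recommendations, accepted_levels=("high", "medium")):
    """Keep only (category, score) pairs at an accepted confidence level."""
    return [
        (category, score)
        for category, score in recommendations
        if confidence_level(score) in accepted_levels
    ]


# Hypothetical recommendations for one content item.
recs = [("beverages", 0.91), ("travel", 0.62), ("finance", 0.20)]
kept = actionable(recs)
print(kept)
```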

According to an embodiment, the categorization engine may display such recommendations or suggestions of categories/classifications associated with a particular set of content in a user interface, for review and/or action by a content administrator.

For example, when processing a new set of content, the system may suggest or recommend one or more categories/classifications for the new content based on how various content was previously classified, and the content administrator may then choose to accept or reject the assignment of those categories/classifications to the content.

According to an embodiment, such acceptance or rejection of categorization suggestions by a user or content administrator may be stored in a database (such as a database of the content categorization engine). This history may be used to improve future content categorization recommendations.

According to an embodiment, classification may be performed as a two-stage process that includes a macro-level classification stage, in which a document's topic distribution and named-entity distribution are matched against higher-level cluster nodes, and a micro-level classification stage, in which categories are expanded and compared in feature space based on an evaluation of micro-clusters.
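One way to picture the two-stage process is the sketch below. It is a simplified illustration under several assumptions: the macro stage compares only a topic distribution (the named-entity distribution is omitted) using cosine similarity, the micro stage picks the category whose nearest micro-cluster centroid is closest in Euclidean distance, and the taxonomy layout and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def macro_stage(doc_topics, nodes):
    """Pick the higher-level node whose topic distribution matches best."""
    return max(nodes, key=lambda name: cosine(doc_topics, nodes[name]["topics"]))


def micro_stage(doc_vector, categories):
    """Pick the category with the nearest micro-cluster centroid."""
    def nearest(category):
        return min(math.dist(doc_vector, c) for c in categories[category])
    return min(categories, key=nearest)


def classify(doc_topics, doc_vector, taxonomy):
    """Run the macro stage, then the micro stage inside the chosen node."""
    node = macro_stage(doc_topics, taxonomy)
    return node, micro_stage(doc_vector, taxonomy[node]["categories"])


# Assumed toy taxonomy: each node has a topic distribution and categories,
# and each category holds a list of micro-cluster centroids.
taxonomy = {
    "Food": {
        "topics": [0.9, 0.1],
        "categories": {"Coffee": [[1.0, 0.0]], "Tea": [[0.0, 1.0], [0.2, 0.9]]},
    },
    "Travel": {
        "topics": [0.1, 0.9],
        "categories": {"Flights": [[0.5, 0.5]]},
    },
}
result = classify([0.8, 0.2], [0.1, 0.8], taxonomy)
print(result)
```

The macro stage narrows the search to one part of the taxonomy, so the micro stage compares the document against only that node's micro-clusters.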

According to an embodiment, a content management system (CMS) allows users to collaboratively manage digital content, including web content. An example content management system is Oracle Content Management (OCM), which can be provided and accessed as a cloud service. A content management system provides various features, such as storing, managing, and publishing the associated digital content. Systems such as OCM allow content to be developed and published rapidly across multiple delivery channels.

According to an embodiment, a delivery channel may be any form in which content is delivered to its consumers. Exemplary delivery channels include websites, blogs, HTML emails, storyboards, mobile applications, and the like. To support rapid development of the documents used by the associated delivery channels, the system may include features such as out-of-the-box templates, drag-and-drop components, sample page layouts, and site themes, which allow users to assemble content into publishable documents from predefined building blocks. The content management system may use these components to generate the markup language and code of a document (collectively referred to herein as "code").

According to an embodiment, a server application program interface (API) contract (administration API contract) may be provided that establishes the interaction between the content management system and a business provider system (business provider). The server API contract operates as a contract between the content management system and the business provider (e.g., an online retailer) and enables communication and data exchange between the two. According to an embodiment, based on interaction with an interactive feature, data and other metrics associated with a product may be retrieved from the business provider, or data associated with a request may be sent to the business provider to perform an action associated with the product.

According to an embodiment, a business provider system 4910 may be provided in the system to manage and deliver content data. The business provider system may include a product catalog 4911, an API such as a server API 4912, and physical computer resources such as a CPU and memory 4913.

According to an embodiment, the content management system 4920 may include a content metadata database 4921 (e.g., an OCM content and metadata database), an API such as a server API 4922, and physical computer resources such as a CPU and memory 4923.

According to an embodiment, a content administrator 4935 may interact with an administration system 4930 in such a system to create/manage and deliver content data 4903. The administration system may include or provide a user interface 4931, which may include or provide a content mapping configuration 4932. The administration system may also include physical computer resources such as a CPU and memory 4933.

According to an embodiment, the server API contract (administration API contract) 49049 may operate as a contract between the content management system and the business provider. For example, the server API contract may be configured to receive account information and credentials associated with a user account at the business provider.

According to an embodiment, the content management system may use the received account credentials to authenticate an administrative user of the content management system (e.g., a content administrator) at the business provider system via the server API (administration API); this gives the content management system access to data stored at the business provider system. For example, the content management system may request, via the server API, that the business provider send data describing/defining the set of products that the business provider offers for sale. The business provider system may then send such data to the content management system via the server API. The data may include, for example, a list of products and data describing the products in the list.

According to an embodiment, the content management system may publish content 4904 to multiple different channels, for example via a generated mobile application page 4905 or via a generated web page 4906.

According to an embodiment, an end user 4945 may access such generated pages via a client device 4940, for example via a mobile application 4941 that displays the generated mobile application page 4905, or via a web application (such as a browser 4942) that accesses and displays the generated web page 4906.

According to an embodiment, publishable content (e.g., a web page or mobile application (app) page) may be built using a code library, such as a software development kit (SDK) 4907, that defines functional and visual content components, such as a "Buy" button.

Macro/micro classification processing

As shown in FIG. 50, automatic classification of content (also referred to herein as "auto-classification") into a known taxonomy (such as a domain-specific ontology) may be accomplished by an AI/ML system.

When new content is created 5000 (i.e., new content at the content management system, such as content generated within it or content uploaded to it), an engine such as the categorization engine described above may, during a prediction phase 5015, automatically suggest 5010 one or more categories into which the newly created content should be placed.

According to an embodiment, upon receiving one or more selections for categorizing the content, the content may be placed 5020 into the one or more selected categories.

Based on such selections, during a learning phase 5035, the clusters (e.g., clusters stored in memory or in a library of the categorization engine) may be updated 5030 accordingly. The prediction phase 5015 is thereby updated for the next newly created content 5000.
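The learning-phase update can be sketched as an incremental (single-pass) centroid update, so that each categorization selection adjusts the stored cluster without revisiting earlier content. The running-mean formula is standard; the class and function names, and the assumption that a cluster is summarized by a count and a centroid, are hypothetical.

```python
class CategoryCluster:
    """Running summary of the feature vectors placed in one category."""

    def __init__(self, dimensions):
        self.count = 0
        self.centroid = [0.0] * dimensions

    def add(self, vector):
        """Fold one vector into the centroid using a running mean."""
        self.count += 1
        self.centroid = [
            c + (x - c) / self.count for c, x in zip(self.centroid, vector)
        ]


def learn(clusters, category, vector):
    """Update (or create) the cluster for the category the user selected."""
    cluster = clusters.setdefault(category, CategoryCluster(len(vector)))
    cluster.add(vector)
    return cluster


clusters = {}
learn(clusters, "Tea", [1.0, 0.0])
learn(clusters, "Tea", [0.0, 1.0])
print(clusters["Tea"].count, clusters["Tea"].centroid)
```

Because only a count and a centroid are kept per cluster, the update cost does not grow with the amount of previously categorized content.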

In other systems, such categorization is typically done by analyzing large amounts of similar data in the public domain. According to an embodiment, however, the automatic classification of content into a user-defined category hierarchy described here requires a new approach: one that continuously observes previously evaluated/categorized content, category metadata, and user behavior in response to categorization/classification suggestions.

According to an embodiment, placing unevaluated/uncategorized content into a category hierarchy, or refining content that has already been evaluated/categorized, poses a particular challenge because, at cold start, the system has very little information on which to base suggestions. In addition, businesses that use a content management system evolve over time, making older category hierarchies less relevant or even obsolete.

According to an embodiment, described herein are a system and tool for automatically classifying varied content into categories based on clusters of related concepts, using a micro-clustering approach. This approach outperforms classical clustering approaches. In addition, the tool can learn from user behavior and adapt itself over time, yielding substantial productivity improvements for content authors. According to an embodiment, the system covers the following set of use cases:

Creating a new taxonomy

According to an embodiment, the disclosed systems and methods may be used to create a new taxonomy through, for example, the following:

a. Receiving an instruction indicating that a user is creating a taxonomy. In some embodiments, the taxonomy may include a hierarchical classification structure in which each node is a category.

b. Receiving an instruction indicating that the user is categorizing/classifying at least a portion of the existing content items (e.g., content items already within the content management system) into the taxonomy by placing them into appropriate categories (e.g., one at a time or in bulk).

c. Receiving new content (e.g., content uploaded to or created within the content management system).

d. The system is automatically notified and begins learning from the above actions.

e. The system begins recommending categories to which newly created content and existing uncategorized content may belong (for a single content item or in bulk).

f. The user accepts or rejects the suggestions (one at a time or in bulk), and the user's actions inform the system, which learns from them and makes better recommendations.

According to an embodiment, at step 5100, as described above, the system may receive an instruction indicating that a user is creating one or more taxonomies. When a taxonomy is created, the instruction may also include the categories under the taxonomy. In some embodiments, the taxonomy may include a hierarchical classification structure in which each node is a category.

According to an embodiment, at step 5105, the system may receive an instruction indicating that the user is categorizing/classifying at least a portion of the existing content items (e.g., content items already in the content management system) into the taxonomies by placing them into appropriate categories (e.g., one at a time or in bulk). This may also include categorizing/classifying new content (e.g., content newly uploaded to the content management system). New content is not necessarily categorized or classified.

According to an embodiment, at step 5110, the system is automatically notified and begins learning from the above actions. The system begins recommending categories to which newly created content and existing uncategorized content may belong. This may be done for an individual content item, or such suggestions may be generated for bulk content categorization.

According to an embodiment, at step 5115, the system may receive an instruction indicating that the user has accepted or rejected the generated suggestions (e.g., one at a time or in bulk).

In this way, according to an embodiment, at step 5120, the instructions indicating the user's actions inform the system, which learns from them and makes better recommendations. Using ML and AI, the system may refine its recommendations based on the recorded acceptances or rejections of categorization suggestions.
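The taxonomy-creation workflow of steps 5100 through 5120 implies two pieces of state: a hierarchical taxonomy whose nodes are categories, and a record of accepted and rejected suggestions. A minimal sketch of both follows; the class names, fields, and the `apply_decision` helper are hypothetical and are not the system's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a hierarchical taxonomy; each node is a category."""
    name: str
    children: dict = field(default_factory=dict)
    items: set = field(default_factory=set)

    def add_child(self, name):
        """Create the child category if needed and return it."""
        if name not in self.children:
            self.children[name] = Category(name)
        return self.children[name]

    def find(self, name):
        """Depth-first search for a category by name."""
        if self.name == name:
            return self
        for child in self.children.values():
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


@dataclass
class FeedbackLog:
    """Accept/reject history that later recommendations could learn from."""
    events: list = field(default_factory=list)

    def record(self, item_id, category, accepted):
        self.events.append((item_id, category, accepted))


def apply_decision(root, log, item_id, category, accepted):
    """Record the user's decision and place the item if it was accepted."""
    node = root.find(category)
    if node is None:
        raise KeyError(f"unknown category: {category}")
    log.record(item_id, category, accepted)
    if accepted:
        node.items.add(item_id)


root = Category("Beverages")
tea = root.add_child("Tea")
log = FeedbackLog()
apply_decision(root, log, "doc-1", "Tea", accepted=True)
apply_decision(root, log, "doc-2", "Tea", accepted=False)
print(tea.items, len(log.events))
```

Rejections are logged even though they place nothing, since the text treats both outcomes as input for refining later recommendations.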

Modifications within an existing taxonomy structure

According to an embodiment, modifications within an existing taxonomy structure may be handled using the following process.

a. The user creates a new category, changes the structure of the taxonomy, or categorizes content.

b. The system generates suggestions, including suggestions for any newly added categories, for existing unevaluated/uncategorized content or newly added content (one at a time or in bulk).

c. The user accepts or rejects the suggestions (one at a time or in bulk), and the user's actions inform the system so that it can make better recommendations.

According to an embodiment, at step 5200, as described above, the system may receive instructions indicating that the user is adding a new category or modifying one or more existing taxonomies. The system may then receive instructions indicating that the user is classifying existing content or new content into the newly created category (e.g., individually or in bulk).

According to an embodiment, at step 5205, the system may automatically begin generating suggestions and recommendations, covering the newly created category, for categorizing new content as well as existing content.

According to an embodiment, at step 5210, the system may receive instructions indicating that the user accepts or rejects the generated suggestions (e.g., one at a time or in bulk).

In this way, according to an embodiment, at step 5215, the instructions indicating the user's actions inform the system, which learns from them and makes better recommendations. Based on ML and AI, the system can refine its recommendations based on the recorded acceptance or rejection of categorization suggestions.

Bulk content classification

According to an embodiment, bulk content classification may be performed using the following process.

a. The system enables users to categorize/classify a collection of content into suggested categories based on confidence scores.

b. For each category, the system can place the recommended content into three disjoint buckets (high, medium, low) based on the predicted confidence score of each content item.

c. The confidence score buckets provide a mechanism for grouping collections of content items so that bulk categorization is easier. For example, a user can accept all categorization suggestions for content in the high confidence bucket and reject all categorization suggestions for content in the low confidence bucket.

d. In addition, users can skip the accept/reject step through a global classification threshold configuration. If this option is selected, the system can automatically assign content to categories for which the recommendation confidence score is above the user-configured threshold.

e. Typically, suggestions are generated by a backend job that runs automatically at regular intervals. In addition, the system allows a repository administrator to trigger the job manually, causing the recommendation system to immediately generate suggestions or to classify content into the taxonomies assigned to the repository.
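The bucketing in items b and c can be sketched as follows. This is a minimal illustration, not the patented implementation: the numeric cut-offs `HIGH_CUTOFF` and `MEDIUM_CUTOFF`, the function names, and the sample document identifiers are all assumptions, since the text names the three buckets but gives no boundaries.

```python
from collections import defaultdict

# Hypothetical cut-offs: the source names the buckets but not their boundaries.
HIGH_CUTOFF = 0.8
MEDIUM_CUTOFF = 0.5


def bucket_for(score):
    """Map a confidence score in [0, 1] to one of three disjoint buckets."""
    if score >= HIGH_CUTOFF:
        return "high"
    if score >= MEDIUM_CUTOFF:
        return "medium"
    return "low"


def bucket_suggestions(suggestions):
    """Group (content_id, category, score) suggestions by category, then bucket."""
    grouped = defaultdict(lambda: {"high": [], "medium": [], "low": []})
    for content_id, category, score in suggestions:
        grouped[category][bucket_for(score)].append(content_id)
    return grouped


suggestions = [
    ("doc-1", "Sarees", 0.93),
    ("doc-2", "Sarees", 0.61),
    ("doc-3", "Sarees", 0.12),
    ("doc-4", "Bed Covers", 0.88),
]
buckets = bucket_suggestions(suggestions)
accepted = buckets["Sarees"]["high"]  # bulk-accept everything in the high bucket
rejected = buckets["Sarees"]["low"]   # bulk-reject everything in the low bucket
```

Because every score maps to exactly one bucket, a bulk accept of the high bucket and a bulk reject of the low bucket can never touch the same content item.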

As shown in Figure 53, according to an embodiment, a screenshot of a user interface 5300 shows a sample taxonomy tree with categories; "View Category Suggestions" 5305 can take the user to the smart suggestions screen. The user interface 5300 is designed to help categorize content individually or in bulk.

According to an embodiment, each category on the screen displays the number of suggestions next to the category. For example, content 5315 may be displayed together with multiple suggested categorizations 5320. Content suggestions are divided into three buckets (high, medium, low) based on the confidence score for placing that content in the category.

As shown in Figure 54, according to an embodiment, a screenshot of a user interface 5400 shows a sample taxonomy tree with categories; "View Category Suggestions" can take the user to the smart suggestions screen. The user interface 5400 is designed to help categorize content individually or in bulk.

As shown in Figure 54, according to an embodiment, in this example the content with high confidence 5405 and medium confidence 5410 is selected. An "Assign to Category" link 5415 enables the user to place all of the selected content into the corresponding category or categories.

According to an embodiment, as described above, based on the acceptance of suggestions that fall into the high and medium confidence buckets, the categorization engine may further refine subsequent categorization suggestions.

As shown in Figure 55, according to an embodiment, a screenshot of a user interface 5500 shows a sample taxonomy tree with categories; "View Category Suggestions" can take the user to the smart suggestions screen. The user interface 5500 is designed to help categorize content individually or in bulk.

As shown in Figure 55, according to an embodiment, the user can again select content (in this example, the items with a low confidence score 5505). The user can then select the option to reject 5510 all of the low confidence categorization suggestions.

According to an embodiment, as described above, based on the rejection of suggestions that fall into the low confidence bucket, the categorization engine may further refine subsequent categorization suggestions.

Figure 56 shows, according to an embodiment, a screenshot of a user interface 5600. The user interface 5600 is designed to allow a user to set an automatic classification threshold 5605 on the confidence score.

According to an embodiment, in other words, when the system and method are configured to allow content items to be classified or categorized automatically, a threshold can be set so that any content item with a confidence score exceeding the threshold is classified/categorized by the system automatically, without further user interaction. As shown in the figure, the taxonomy-level configuration that controls automatic classification is set between the "medium" and "high" confidence scores. This means that content with a confidence score above this threshold will be categorized/classified into the corresponding category. In addition, for content items whose confidence score is below the threshold, the categorization/classification suggestion can be discarded automatically, or, additionally or alternatively, those content items can be presented to the user for a decision.
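The threshold behavior described above can be sketched as a simple routing step. The function name `route_by_threshold`, the `discard_below` flag, and the threshold value 0.75 are illustrative assumptions; the text specifies only that scores above the configured threshold are assigned automatically and that the rest are either discarded or shown to the user.

```python
def route_by_threshold(scored_items, threshold, discard_below=False):
    """Split (content_id, category, score) suggestions around a threshold.

    Scores strictly above the threshold are assigned automatically. The rest
    are either dropped or queued for a user decision, depending on
    `discard_below` (an assumed configuration switch).
    """
    auto_assigned, needs_review, discarded = [], [], []
    for content_id, category, score in scored_items:
        if score > threshold:
            auto_assigned.append((content_id, category))
        elif discard_below:
            discarded.append((content_id, category))
        else:
            needs_review.append((content_id, category))
    return auto_assigned, needs_review, discarded


items = [
    ("doc-1", "Sarees", 0.93),
    ("doc-2", "Sarees", 0.61),
    ("doc-3", "Sarees", 0.12),
]
auto, review, dropped = route_by_threshold(items, threshold=0.75)
```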

Figure 57 shows, according to an embodiment, a screenshot of a user interface 5700. The user interface 5700 is designed to allow a user to trigger bulk classification of, for example, the documents within a target repository 5705.

According to an embodiment, for example, the screenshot shows that content items within an existing repository (e.g., at a content management system) can be reclassified via a background process. The background process can be run to produce new categorization/classification suggestions to be submitted to the user for a decision. Alternatively, for example, the background process can be run to classify content items automatically based on, for example, a computed confidence score.

Concepts and approach

In many situations, classical clustering approaches do not work well from the perspective of a content management system, because it is difficult to categorize/classify content into the correct categories for a number of reasons, including but not limited to:

a. The amount of content in a content management system can grow dramatically over time;

b. Content may come from various departments, and domain-specific solutions may not work;

c. The categories/nodes in a taxonomy tree are unbounded, and new categories are introduced frequently;

d. The number of content items in a category can range from single digits to thousands, and the same content can be assigned to multiple categories;

e. The content in a category may not be semantically close; a category may contain multiple disjoint subsets of similar content;

f. Given the large volume of data, a nearest-neighbor approach may be accurate but does not scale;

g. Because content is distributed unevenly across categories and labeled data is scarce, traditional classification algorithms may not work in this scenario;

h. Traditional clustering approaches may not be applicable, because clusters can grow to arbitrary sizes and shapes as more content is added, and rebuilding a cluster requires a pass over all of the data points (content) in the cluster, which is not acceptable from a scalability standpoint.

According to an embodiment, a data point N 5805 is shown in a cluster diagram that includes cluster 1 5800 and cluster 2 5810. From the depicted embodiment, it is clear that data point N 5805 should be part of cluster 1. However, the distance from N to the center of cluster 2 is actually smaller than the distance from N to the center of cluster 1, so a comparison against cluster centers alone would make cluster 2 appear to be the more accurate match.

According to an embodiment, with both accuracy and scalability in mind, the described approach provides a general-purpose, density-based, single-pass micro-clustering that can discover clusters of arbitrary shape (for accuracy), reshape clusters without revisiting past data (for scalability), continuously learn and evolve from recent data, and correctly predict the categories of new content in bulk, which in turn reduces the time and effort needed to place content into the correct categories.

Adaptive micro-clustering

According to an embodiment, the described approach includes two parts or aspects: (1) extracting features from documents and creating clusters (the learning phase); and (2) automatically classifying new content (the prediction phase).

1. Extracting features from documents

According to an embodiment, extracting features from documents includes creating and updating clusters (micro-clusters). In this step, the better the clusters are represented, the better the accuracy of the system. As shown in Figure 59, a single cluster center on its own may not represent the entire cluster well, so micro-clusters can additionally be formed inside the main cluster.

Figure 59 illustrates micro-clustering according to an embodiment. In the same example as Figure 58 above, the larger cluster (cluster 1 in Figure 58) is now represented by three points (e.g., micro-clusters): cluster 1 5900, cluster 1' 5901, and cluster 1'' 5902. The new data point N 5905 will now be classified as belonging to cluster 1, because the point closest to N is cluster 1' 5901 rather than cluster 2 5910.
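The effect illustrated by Figures 58 and 59 can be reproduced with a small nearest-representative sketch. The 2-D coordinates below are invented for illustration (the figures give no numbers), and `nearest_label` is a hypothetical helper name.

```python
import math


def nearest_label(point, representatives):
    """Return the label of the representative point closest to `point`."""
    label, _ = min(representatives, key=lambda rep: math.dist(point, rep[1]))
    return label


# One center per cluster: a wide cluster 1 is summarized by a single point.
single_centers = [("cluster 1", (0.0, 0.0)), ("cluster 2", (7.0, 0.0))]

# Cluster 1 is instead summarized by three micro-cluster centers.
micro_centers = [
    ("cluster 1", (0.0, 0.0)),    # cluster 1
    ("cluster 1", (4.0, 1.0)),    # cluster 1'
    ("cluster 1", (-4.0, 1.0)),   # cluster 1''
    ("cluster 2", (7.0, 0.0)),
]

n = (4.5, 0.5)  # new data point N, lying on the edge of cluster 1
by_single_center = nearest_label(n, single_centers)
by_micro_center = nearest_label(n, micro_centers)
```

With a single center per cluster, N is closer to the center of cluster 2; with micro-cluster centers, the closest representative belongs to cluster 1.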

According to an embodiment, several steps are performed to update clusters and to split clusters into micro-clusters, as described below:

A. Generating feature vectors from text

According to an embodiment, the first step is to convert the raw text document into a feature vector for further processing. The feature space for a piece of text is generated by a topic modeling system that outputs sparse text features (e.g., 200k dimensions). The text features are then projected into a dense feature space of lower dimension (2048) using a random projection technique, for fast processing. When content is sparse, the approach works from the content model itself, for example the name, the description, and other metadata used to define the taxonomy itself.
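A random projection from a sparse, high-dimensional vector to a dense, lower-dimensional one can be sketched as follows. The text does not say which variant is used; this sketch assumes a sign-based (+1/-1) projection, regenerates each feature's random row on demand so that no full matrix is stored, and uses a small output dimension for the demonstration. The function name `project` and the seeding scheme are hypothetical.

```python
import math
import random


def project(sparse_vec, out_dim, seed=0):
    """Project {feature_index: value} to a dense list of length `out_dim`.

    Each input feature owns a pseudo-random row of +1/-1 entries, rebuilt
    from (seed, feature_index) every time it is needed, so the projection
    is deterministic without materializing a 200k x out_dim matrix.
    """
    dense = [0.0] * out_dim
    scale = 1.0 / math.sqrt(out_dim)
    for index, value in sparse_vec.items():
        rng = random.Random(seed * 1_000_003 + index)
        for j in range(out_dim):
            sign = 1.0 if rng.random() < 0.5 else -1.0
            dense[j] += sign * value * scale
    return dense


# Two non-zero features out of a nominal 200k-dimensional sparse vector.
sparse_doc = {5: 1.0, 199_999: 2.0}
dense_doc = project(sparse_doc, out_dim=64)
```

The `1 / sqrt(out_dim)` scaling keeps vector lengths comparable before and after projection on average, which is what makes distances in the dense space usable as a stand-in for distances in the sparse space.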

B. Extracting named entities from text

According to an embodiment, a named entity recognizer (NER) system performs the task of automatically extracting entities such as person names, organizations, customer products, countries, book titles, music albums, and so on.

According to an embodiment, through named entity recognition the system can obtain key information from unstructured text and classify it into user-defined categories. For example, consider documents from the pharmaceutical industry, in which drug names and chemical elements are mentioned frequently, whereas documents from a publishing house contain book titles, authors, fictional characters, and the like. Such documents can be classified into their respective domains just by looking at the entities. Taking into account a document's named entity score with respect to the distribution of named entities in the user-defined categories can make the classification task more efficient, scalable, and accurate.

C. Extracting topics from text

According to an embodiment, topic extraction automatically discovers keywords and key phrases in documents by identifying term frequencies and grouping similar word patterns. A user-defined taxonomy tree can grow large both horizontally and vertically. When suggesting categories for a document, a statistical topic extraction model can be used for macro-level classification to select the higher-order nodes in the tree, followed by a feature-based micro-classifier.

For example, an apparel customer might have taxonomy tree hierarchies such as "Women > Clothing > Ethnic > Sarees > Cotton > Chanderi" and "Home Linen > Bedding > Bed Covers > Cotton > Haiba". In this example, deciding the macro-level category clearly does not require comparing against every document at the leaf level (all of the documents belong to either women's clothing or bed linen). Instead, the high-level category can be selected by looking at the topic distribution of the articles in the first three or four nodes, and the micro-level (feature comparison) classifier can then help suggest the type of saree (such as Chanderi or Banarasi) or bed cover (Haiba).

According to an embodiment, as shown, the topic distribution includes, for example, an apparel cluster 6001, a home linen cluster 6020, and a home decor cluster 6030. Example data points are shown around each cluster.

D. Creation of clusters

According to an embodiment, each category (node in the taxonomy tree) is assigned an accompanying cluster, where the cluster may include: a summarized feature space representation of the feature vectors associated with the content in the cluster, the distribution of the named entities present in the cluster's documents, and the distribution of the topics extracted from the cluster's documents.

According to an embodiment, a cluster is created together with its category (initialized with the feature vectors, named entities, and topics extracted from the category metadata).

E. Definition of clusters

According to an embodiment, clusters may be defined as follows. Cluster features: according to an embodiment, a cluster is defined by a centroid and a radius. Centroid = the mean of the feature vectors of the content in the cluster; radius = the average distance from the member feature vectors to the cluster centroid. A smaller radius indicates that the content in the cluster is closely related.
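The two definitions above translate directly into code. The sketch below assumes Euclidean distance, which the text does not specify, and uses made-up 2-D points; `centroid` and `radius` are illustrative names.

```python
import math


def centroid(vectors):
    """Mean of the member feature vectors, component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def radius(vectors):
    """Average (Euclidean, by assumption) distance from members to the centroid."""
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)


tight = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
loose = [(0.0, 0.0), (0.0, 8.0), (8.0, 0.0), (8.0, 8.0)]

tight_radius = radius(tight)  # closely related content
loose_radius = radius(loose)  # loosely related content
```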

According to an embodiment, a list of named entities, together with their frequencies, is attached to the cluster. The distribution is created by selecting the top n named entities from each document belonging to the cluster.

According to an embodiment, a topic distribution can be defined in the same way as for named entities; it identifies the most frequent topics attached to the cluster. The topic distribution is likewise created by extracting the most relevant topics from each document.
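Building such a distribution amounts to counting, per cluster, how often an item appears among the top n items of its member documents. The sketch below assumes that entities (or topics) have already been extracted as plain strings; the sample entities and the function names are invented for illustration.

```python
from collections import Counter


def top_n(items, n):
    """Return the n most frequent items found in a single document."""
    return [item for item, _ in Counter(items).most_common(n)]


def cluster_distribution(documents, n=2):
    """Aggregate each document's top-n items into a cluster-level frequency table."""
    distribution = Counter()
    for items in documents:
        distribution.update(top_n(items, n))
    return distribution


# Entities extracted from two documents of one (hypothetical) pharma category.
doc_entities = [
    ["aspirin", "aspirin", "aspirin", "ibuprofen", "ibuprofen", "Acme Pharma"],
    ["Acme Pharma", "Acme Pharma", "Acme Pharma", "aspirin", "aspirin", "sodium"],
]
entity_distribution = cluster_distribution(doc_entities, n=2)
```

The same helper works unchanged for topic distributions if the input lists hold topic labels instead of entity names.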

According to an embodiment, Figure 61 depicts two exemplary clusters, cluster 1 6100 and cluster 2 6110. As shown, cluster 1 has a smaller radius than cluster 2. As described above, a smaller radius indicates that the content in the cluster is closely related.

F. Updating cluster features

According to an embodiment, as shown in Figure 61, clusters with high density and a small radius (left) give the best possible accuracy. To achieve this, clusters are split into micro-clusters as more content (e.g., new content) is added. If new content would cause the radius of a micro-cluster to expand, it does not need to be added to that micro-cluster immediately; instead, it can simply be identified as a potential micro-cluster.

According to an embodiment, as content grows and more similar content is added, potential micro-clusters can subsequently grow into core micro-clusters; otherwise, over time, they are ignored as outliers through a damping factor. This micro-clustering approach helps find disjoint subsets of related content within a category.

Cluster centroids and radii can be updated in an online manner (without revisiting past data).
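One common way to update a centroid and a radius without revisiting past data is to keep only running sums per micro-cluster. The sketch below uses that technique as an assumption; the text does not say how the online update is done. Note that the radius computed here is the root-mean-square distance to the centroid, which can be maintained exactly from running sums, rather than the mean distance defined earlier. The class name `MicroCluster` and its fields are hypothetical.

```python
import math


class MicroCluster:
    """Summary statistics that can be updated one point at a time."""

    def __init__(self, dim):
        self.n = 0                     # number of points absorbed so far
        self.linear_sum = [0.0] * dim  # per-dimension sum of the points
        self.square_sum = 0.0          # sum of the squared norms of the points

    def add(self, point):
        """Absorb one point; earlier points are never looked at again."""
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.square_sum += sum(x * x for x in point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of the absorbed points to the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


mc = MicroCluster(dim=2)
for p in [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]:
    mc.add(p)
```

Storage per micro-cluster is constant in the number of absorbed points, which is the property the scalability argument relies on.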

According to an embodiment, Figure 62 depicts three exemplary clusters, cluster 1 6200, cluster 2 6210, and cluster 3 6230. As shown, each cluster has a different radius, and each radius reflects how closely related the content items in that cluster are. As described above, a smaller radius indicates that the content in the cluster is closely related.

According to an embodiment, as content grows, potential micro-clusters can grow into core micro-clusters as more similar content is added; otherwise, over time, they are ignored as outliers through a damping factor, as with O1 6240 and O2 6241. This micro-clustering approach helps find disjoint subsets of related content within a category.

According to an embodiment, when a new document D is placed in a cluster, the named entity distribution and topic distribution associated with that cluster are updated by extracting topics and named entities from D.

2. Classifying new content into relevant categories

According to an embodiment, classification is a two-stage process consisting of a macro-level classification stage and a micro-level classification stage, as described below:

A. Macro-level classification

According to an embodiment, as described above, each category node has a topic distribution and a named entity distribution associated with it. When a document is uploaded, topics and named entities are first extracted from the document, and a conditional probability score is computed for each high-level category in the taxonomy tree.

Figure 63 illustrates the macro stage of categorization according to an embodiment. More specifically, Figure 63 depicts an embodiment in which the topic distribution and named entity distribution of a document are matched against the high-level cluster nodes, namely nodes 6300-6309.

According to an embodiment, the score function can be read as answering the following question: given a list of topics, e.g., "shirt", "cotton", and so on (and the weight of each topic for the document), how likely is it that the document belongs to the "Women's Clothing" category? Given a document D that has n topics (e.g., topic-1, topic-2, ..., topic-n) and a weight for each topic (weight_topic-1, weight_topic-2, ..., weight_topic-n), the joint probability of the document with respect to a category Catg-C can be calculated as:

$$
\begin{aligned}
\mathrm{Score}(D \mid \text{Catg-C}) ={}& P(\text{topic-1} \mid \text{Catg-C}) \cdot \mathrm{weight}_{\text{topic-1}} \\
&+ P(\text{topic-2} \mid \text{Catg-C}) \cdot \mathrm{weight}_{\text{topic-2}} \\
&+ \dots + P(\text{topic-}n \mid \text{Catg-C}) \cdot \mathrm{weight}_{\text{topic-}n}
\end{aligned}
$$
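The score is a weighted sum and can be sketched in a few lines. The per-category topic probabilities and the document topic weights below are invented numbers used only to exercise the formula; topics unseen in a category are assumed to contribute a probability of zero.

```python
def macro_score(doc_topic_weights, category_topic_probs):
    """Score(D | C) = sum over the document's topics of P(topic | C) * weight."""
    return sum(
        weight * category_topic_probs.get(topic, 0.0)  # unseen topic -> 0 (assumption)
        for topic, weight in doc_topic_weights.items()
    )


# Topic weights extracted from one document (illustrative values).
doc = {"shirt": 0.6, "cotton": 0.4}

# P(topic | category) for two high-level categories (illustrative values).
categories = {
    "Women's Clothing": {"shirt": 0.30, "cotton": 0.20, "saree": 0.25},
    "Home Linen": {"bed cover": 0.35, "cotton": 0.25},
}

scores = {name: macro_score(doc, probs) for name, probs in categories.items()}
best_category = max(scores, key=scores.get)
```

Here "Women's Clothing" scores 0.6 * 0.30 + 0.4 * 0.20 = 0.26 and "Home Linen" scores 0.4 * 0.25 = 0.10, so the macro stage would descend into the "Women's Clothing" subtree.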

B. Micro-level classification

According to an embodiment, during this stage, feature comparison is performed to classify the document at a finer level, against the documents that reside in the child nodes of the category selected at the macro stage.

Figure 64 illustrates categorization once a higher-level category has been selected by the macro step. According to an embodiment, in the micro step, the categories under the selected subtree (here, Clothing 6401 --> Women 6405), namely categories 6410-6415, are expanded and compared in feature space.

According to an embodiment, to categorize/classify new content into a category, the cosine similarity between the content and the available micro-clusters can be computed in feature space, and the category to which the most similar micro-cluster belongs can be recommended for the new article.
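The micro-level comparison can be sketched as a cosine similarity search over micro-cluster centroids. The 3-dimensional vectors and category names are illustrative only; real feature vectors would have the projected dimensionality described earlier, and `recommend_category` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_category(content_vec, micro_clusters):
    """micro_clusters: list of (category, centroid). Returns (category, similarity)."""
    category, center = max(
        micro_clusters, key=lambda mc: cosine_similarity(content_vec, mc[1])
    )
    return category, cosine_similarity(content_vec, center)


micro_clusters = [
    ("Chanderi", (1.0, 0.0, 0.0)),
    ("Banarasi", (0.0, 1.0, 0.0)),
    ("Chanderi", (0.7, 0.7, 0.0)),  # a second micro-cluster of the same category
]
category, similarity = recommend_category((0.1, 0.9, 0.1), micro_clusters)
```

A category may own several micro-clusters; the recommendation takes the category of whichever single micro-cluster is most similar.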

According to an embodiment, as new categories are introduced by, for example, content users, and as more content is placed in the categories, the number of micro-clusters can become very large. Comparing new content against every existing micro-cluster in a high-dimensional feature space can be resource intensive and can hurt system performance. This problem can be addressed with the damped window model used in data stream clustering.

Damped window model (weights decay over time):

According to an embodiment, when the damped window model is used, each content item is associated with a weight that depends on its arrival time. When new content arrives, it is assigned the highest possible weight, and that weight decreases over time (e.g., exponentially) according to an aging function. The aging function commonly used with the damped window model is an exponential decay function. The weight of a micro-cluster can be calculated as the sum of the weights of the content it contains. When new content is compared against the existing micro-clusters, the micro-clusters can first be sorted by weight (in descending order), and only the top n micro-clusters are then selected for the feature similarity computation. In this way, dense, recently used micro-clusters are given higher priority than micro-clusters that are used less often.

Figure 65 is a diagram of how cluster weights decay over time according to an embodiment, and illustrates an example of the damped window model.

According to an embodiment, as shown in Figure 65, three categories are shown: category 1 6510, category 2 6520, and category 3 6530. According to an embodiment, the damping function applied to the categories is:

where λ = 1

According to an embodiment, using the above equation, if 100 articles are added to category 1 6500 at t1, as shown by the bold score box 6501 in the figure, then category 1 has a score of 100 at t1, and that score decreases as shown. If no more articles are added to category 1 before t5, the score computed by the damping function for those 100 articles will be 6.25.

According to an embodiment, using the above equation, if 30 articles are added to category 2 6510 at t1, 20 articles at t2, and 10 articles at t4, as shown by the bold score boxes 6511, 6512, and 6513 in the figure, then category 2 has a score of 30 at t1, 35 at t2, 17.5 at t3, and 18.75 at t4. If no more articles are added to category 2 before t5, the score for the 60 articles will be 9.375.

According to an embodiment, using the above equation, if 10 articles are added to category 3 6510 at each of t2, t3, t4, and t5, as shown by the bold score boxes 6521, 6522, 6523, and 6524 in the figure, then category 3 has a score of 0 at t1, 10 at t2, 15 at t3, 17.5 at t4, and, with 40 articles in total, 18.75 at t5.
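The worked figures for the three categories can be checked with a short simulation. The damping function itself is not reproduced in the text, so the sketch assumes the form commonly used with damped window models, a weight of 2^(-λ·Δt) after Δt time steps; with λ = 1 this halves the accumulated score at every step. That assumed form is inferred from the worked figures, and `decayed_scores` is a hypothetical helper.

```python
def decayed_scores(arrivals, steps, lam=1.0):
    """Score of a category at t1..t<steps>.

    arrivals maps a time step to the number of articles added at that step.
    At each step the previous score is multiplied by 2 ** (-lam), an assumed
    damping form, and the newly arrived articles are added at full weight.
    """
    scores = []
    score = 0.0
    for t in range(1, steps + 1):
        score = score * 2.0 ** (-lam) + arrivals.get(t, 0)
        scores.append(score)
    return scores


category_1 = decayed_scores({1: 100}, steps=5)
category_2 = decayed_scores({1: 30, 2: 20, 4: 10}, steps=5)
category_3 = decayed_scores({2: 10, 3: 10, 4: 10, 5: 10}, steps=5)
```

Under this assumption, category 1 decays 100, 50, 25, 12.5, 6.25; category 2 evolves 30, 35, 17.5, 18.75, 9.375; and category 3 evolves 0, 10, 15, 17.5, 18.75. A category that keeps receiving a few articles therefore overtakes one that received many articles long ago.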

Learning from user behavior

According to an embodiment, the user can choose to accept or reject the suggestions provided by the system for categorizing/classifying content. Depending on the user action, the system can add to its machine learning database, for example, in the following ways:

User accepts a suggestion

According to an embodiment, a recent acceptance strengthens the existing signal. The content now becomes part of the category's cluster, and the content features are incorporated into the cluster features. This is not a plain average update of the cluster features; the damping factor plays a major role in the update. The previous cluster features are partially attenuated and then added to the features of the recently placed content to compute the updated cluster features. This enables the system to give more weight to recently added content when classifying the next batch. The frequencies of named entities and topics are also updated for the category.
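One way to realize "attenuate the previous cluster features, then add the new content features" is a damped running sum. The sketch below is an assumption about the update rule, which the text does not spell out; the damping value 0.5, the function name, and the sample vectors are illustrative.

```python
def accept_into_cluster(cluster_sum, cluster_weight, content_vec, damping=0.5):
    """Fold accepted content into a cluster using a damped update.

    The previous feature sum and weight are attenuated by `damping` before
    the new content is added at full weight, so recent content counts more
    than it would under a plain running average.
    """
    new_sum = [damping * s + x for s, x in zip(cluster_sum, content_vec)]
    new_weight = damping * cluster_weight + 1.0
    new_centroid = [s / new_weight for s in new_sum]
    return new_sum, new_weight, new_centroid


# A cluster built from four items, all at (1, 0); the user accepts one at (0, 1).
old_sum, old_weight = [4.0, 0.0], 4.0
new_sum, new_weight, new_centroid = accept_into_cluster(old_sum, old_weight, [0.0, 1.0])

# For comparison, a plain average over the five items would give (0.8, 0.2).
plain_average = [(4.0 * 1.0 + 0.0) / 5.0, (4.0 * 0.0 + 1.0) / 5.0]
```

With damping, the centroid moves to roughly (0.67, 0.33), further toward the newly accepted content than the plain average does.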

User rejects a suggestion

According to an embodiment, when a user rejects a particular content categorization/classification, the rejection ultimately leads to one of two situations:

i. The user rejects the suggestion and places the article in a different category: this action has the same effect as an acceptance. The category in which the user now places the content absorbs the weighted content features, and the updated cluster features will begin to recommend similar content with higher confidence.

ii. The user rejects the suggestion and the content remains "uncategorized": in this case, the system has no way of knowing what to do with the content (or with similar content), because it does not contribute to any cluster. To keep track of such content, a category called a "shadow cluster" is introduced. The shadow cluster contains the rejected, uncategorized content from all previously rejected suggestions. When new content arrives and its similarity score is highest for the shadow cluster, the system does not suggest any category for that content.
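The suppression rule in case ii can be sketched by treating the shadow cluster as one more candidate during similarity search and returning no suggestion when it wins. The `SHADOW` label, the 2-D vectors, and the function names are hypothetical.

```python
import math

SHADOW = "__shadow__"  # hypothetical label marking the shadow cluster


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def suggest(content_vec, clusters):
    """clusters: list of (label, centroid). Returns a category, or None.

    None means the content is most similar to previously rejected,
    uncategorized content, so no category is suggested.
    """
    label, _ = max(clusters, key=lambda c: cosine(content_vec, c[1]))
    return None if label == SHADOW else label


clusters = [
    ("Sarees", (1.0, 0.0)),
    ("Bed Covers", (0.0, 1.0)),
    (SHADOW, (-1.0, 0.2)),  # centroid of rejected, uncategorized content
]
normal_suggestion = suggest((0.9, 0.1), clusters)
suppressed_suggestion = suggest((-0.8, 0.1), clusters)
```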

Figure 66 illustrates a graphical view of how a shadow cluster may arise, according to an embodiment.

According to an embodiment, as described above, cluster 1 6600, cluster 2 6610, and cluster 3 6620 may be populated with assigned/categorized/classified content items.

However, according to an embodiment, as described above, the shadow cluster 6640 may grow over time. The shape and size of such shadow clusters may grow without bound, so the system may end up suggesting no category for most newly created content. To avoid this, the same micro-clustering scheme can be applied to the shadow cluster. Micro-clustering helps group similar uncategorized content within the shadow cluster.

Figure 67 illustrates a graphical view of how shadow clusters may arise, according to an embodiment.

More specifically, Figure 67 illustrates that shadow cluster 1 6700, shadow cluster 2 6710, and shadow cluster 3 6720 may be generated as micro-clusters within the shadow cluster, according to an embodiment.

According to an embodiment, as the micro-clusters within the shadow cluster grow, the recommendation system (e.g., the categorization engine) may begin to generate suggestions for such new micro-clusters and provide these suggestions to the user. This uncategorized content may form a category, and a name for the category is suggested based on the most frequently used topics in the system. The user can form a new category from this content through the "Create New Category" option.
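The growth of micro-clusters inside the shadow cluster, and the resulting new-category suggestions, might be sketched as below. The micro-clustering scheme referred to above is not reproduced in this passage, so a simple greedy threshold assignment stands in for it. The class `MicroCluster`, the functions `add_to_shadow` and `new_category_suggestions`, the similarity threshold, the minimum size, and the choice to name a category after the most frequent topic within the micro-cluster are all assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence


@dataclass
class MicroCluster:
    centroid: List[float]
    topics: Counter = field(default_factory=Counter)
    size: int = 0

    def add(self, features: Sequence[float], topics: Iterable[str]) -> None:
        """Fold one content item into the running-mean centroid and topic counts."""
        self.size += 1
        self.centroid = [c + (x - c) / self.size for c, x in zip(self.centroid, features)]
        self.topics.update(topics)


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add_to_shadow(
    micro_clusters: List[MicroCluster],
    features: Sequence[float],
    topics: Iterable[str],
    threshold: float = 0.8,  # assumed similarity threshold
) -> MicroCluster:
    """Place rejected content in the closest micro-cluster, or start a new one."""
    best = max(micro_clusters, key=lambda m: _cosine(m.centroid, features), default=None)
    if best is None or _cosine(best.centroid, features) < threshold:
        best = MicroCluster(centroid=list(features))
        micro_clusters.append(best)
    best.add(features, topics)
    return best


def new_category_suggestions(
    micro_clusters: List[MicroCluster],
    min_size: int = 3,  # assumed size at which a micro-cluster becomes a candidate
) -> List[str]:
    """Suggest a category name for every micro-cluster that has grown large enough."""
    return [
        m.topics.most_common(1)[0][0]
        for m in micro_clusters
        if m.size >= min_size and m.topics
    ]
```

For example, three rejected travel-related items with similar feature vectors would end up in one micro-cluster and yield a suggested category named after their dominant topic, while a single unrelated item would stay in its own micro-cluster and produce no suggestion.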

According to an embodiment, Figure 68 shows an exemplary screenshot of a user interface 6800. In the depicted embodiment, uncategorized content 6805 is displayed. The user may be presented with a suggestion to create a new category 6810 (e.g., in the taxonomy) for such uncategorized content.

According to an embodiment, at step 6900, the method may provide one or more computers, including a processor, that provide access to a content management system.

According to an embodiment, at step 6910, the method may provide, at the one or more computers, a content categorization engine that has access to a taxonomy.

According to an embodiment, at step 6920, the method may generate, by a recommendation system of the content categorization engine, feature vectors from content at the content management system, the recommendation system having access to an AI/ML engine and to a database of the content categorization engine, wherein the generation of the feature vectors is based at least on an evaluation of previously categorized content within the taxonomy.

According to an embodiment, at step 6930, the method may use the generated feature vectors to categorize new content into the taxonomy.
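Steps 6920 and 6930 can be sketched end to end as follows. This is a toy illustration under stated assumptions: bag-of-words term counts stand in for whatever features the AI/ML engine actually produces, one centroid per category stands in for the clusters, and Euclidean distance stands in for the feature-space distance. All function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_vocabulary(categorized: Sequence[Tuple[str, str]]) -> Dict[str, int]:
    """Map every word seen in previously categorized content to a vector index."""
    words = sorted({w for text, _ in categorized for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}


def feature_vector(text: str, vocab: Dict[str, int]) -> List[float]:
    """Toy feature vector: term counts over the known vocabulary."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def build_clusters(
    categorized: Sequence[Tuple[str, str]], vocab: Dict[str, int]
) -> Dict[str, List[float]]:
    """One centroid per category, computed from previously categorized content."""
    grouped: Dict[str, List[List[float]]] = defaultdict(list)
    for text, category in categorized:
        grouped[category].append(feature_vector(text, vocab))
    return {
        category: [sum(column) / len(vectors) for column in zip(*vectors)]
        for category, vectors in grouped.items()
    }


def rank_categories(
    text: str, clusters: Dict[str, List[float]], vocab: Dict[str, int]
) -> List[str]:
    """Order categories by feature-space distance, closest first."""
    vec = feature_vector(text, vocab)
    return sorted(clusters, key=lambda category: math.dist(vec, clusters[category]))
```

A caller would build the vocabulary and clusters once from the already categorized content, then call `rank_categories` for each new item and present the first entry (or the first few) as the recommended category.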

Although specific embodiments have been described, various modifications, alterations, alternative constructions, and equivalents are possible. The embodiments described in this disclosure are not restricted to operation within certain specific data processing environments, but are free to operate within a plurality of data processing environments. In addition, although embodiments have been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that this is not intended to be limiting. Although some flowcharts describe operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may have additional steps not included in the figures. Various features and aspects of the above-described embodiments may be used individually or jointly.

Furthermore, although the embodiments described in this disclosure have been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also possible. Some embodiments described herein may be implemented only in hardware, only in software, or in a combination thereof. The various processes described herein can be implemented on the same processor or on different processors in any combination.

Where a device, system, component, or module is described as being configured to perform certain operations or functions, such configuration can be accomplished, for example, by designing electronic circuits to perform the operations, by programming programmable electronic circuits (such as microprocessors) to perform the operations, such as by executing computer instructions or code, by programming processors or cores to execute code or instructions stored on a non-transitory storage medium, or by any combination thereof. Processes can communicate using a variety of techniques, including but not limited to conventional techniques for inter-process communication, and different pairs of processes may use different techniques, or the same pair of processes may use different techniques at different times.

Specific details are given in this disclosure to provide a thorough understanding of the embodiments. However, embodiments may be practiced without these specific details. For example, well-known circuits, processes, algorithms, structures, and techniques have been shown without unnecessary detail in order to avoid obscuring the embodiments. This description provides example embodiments only and is not intended to limit the scope, applicability, or configuration of other embodiments. Rather, the preceding description of the embodiments will provide those skilled in the art with an enabling description for implementing various embodiments. Various changes may be made in the function and arrangement of elements.

Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereto without departing from the broader spirit and scope of the present disclosure. Thus, although specific embodiments have been described, these are not intended to be limiting; various modifications and equivalents are within the scope of the present disclosure.

The embodiments described herein can be implemented using one or more general-purpose or special-purpose digital computers, computing devices, machines, or microprocessors, or other types of computers, including one or more processors, memory, and/or computer-readable storage media programmed according to the teachings of the present disclosure. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art.

According to some embodiments, the features described herein may be implemented, in whole or in part, in a cloud environment, as part of a cloud computing system, or as a service of a cloud computing system that enables on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services), and which can include, for example, characteristics as defined by the National Institute of Standards and Technology, such as: on-demand self-service; broad network access; resource pooling; rapid elasticity; and measured service. Example cloud deployment models can include public cloud, private cloud, and hybrid cloud, while example cloud service models can include Software as a Service (SaaS), Platform as a Service (PaaS), Database as a Service (DBaaS), and Infrastructure as a Service (IaaS). According to an embodiment, unless otherwise specified, a cloud as used herein can encompass public cloud, private cloud, and hybrid cloud embodiments, and all cloud deployment models, including, but not limited to, cloud SaaS, cloud DBaaS, cloud PaaS, and cloud IaaS.

According to some embodiments, a computer program product may be provided, which is a non-transitory computer-readable storage medium (or media) having instructions stored thereon/therein that can be used to program a computer to perform any of the processes described herein. Examples of such storage media can include, but are not limited to, hard disk drives, hard disks, hard drives, fixed disks, or other electromechanical data storage devices, floppy disks, optical discs, DVDs, CD-ROMs, microdrives, magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems, or other types of storage media or devices suitable for non-transitory storage of instructions and/or data.

The foregoing description has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the principles of the present teachings and their practical application, thereby enabling others skilled in the art to understand the various embodiments and the various modifications that are suited to the particular use contemplated. It is intended that the scope be defined by the appended claims and their equivalents.

Claims

1. A system for intelligently classifying content in a content management system, comprising:

one or more computers, including processors, that provide access to the content management system;

a content categorization engine provided at the one or more computers, the content categorization engine having access to a taxonomy;

a recommendation system of the content categorization engine, the recommendation system generating feature vectors from content at the content management system, the recommendation system having access to a database of the content categorization engine;

wherein the generation of the feature vectors is based at least on an evaluation of previously categorized content within the taxonomy; and

wherein the generated feature vectors are used to categorize new content into the taxonomy.

2. The system of claim 1, wherein the recommendation system creates clusters in a feature space based on previously categorized content.

3. The system of claim 2, wherein the recommendation system generates one or more recommendations for placing new content into the taxonomy by computing feature-space distances from the clusters.

4. The system of claim 1,

wherein the recommendation system is used to create a new taxonomy or to modify the taxonomy.

5. The system of claim 1,

wherein the database of the content categorization engine includes a history of user acceptance of previous categorization recommendations; and

wherein the database of the content categorization engine includes a history of user rejection of previous categorization recommendations.

6. The system of claim 5,

wherein the recommendation system generates one or more recommendations for placing new content into the taxonomy based on the user's history of accepting previous categorization recommendations and the user's history of rejecting previous categorization recommendations.

7. The system of claim 1,

wherein the recommendation system generates recommendations for creating new categories within the taxonomy for a plurality of uncategorized content items.

8. A method for intelligently categorizing content in a content management system, comprising:

providing one or more computers, including a processor, that provide access to the content management system;

providing a content categorization engine at the one or more computers, the content categorization engine having access to a taxonomy;

generating, by a recommendation system of the content categorization engine, a feature vector from content at the content management system, the recommendation system having access to a database of the content categorization engine, wherein the generation of the feature vector is based at least on an evaluation of previously categorized content within the taxonomy; and

using the generated feature vector to categorize new content into the taxonomy.

9. The method of claim 8, further comprising:

creating, by the recommendation system, clusters in a feature space based on previously categorized content within the taxonomy.

10. The method of claim 9, further comprising:

generating, by the recommendation system, one or more recommendations for placing new content into the taxonomy by computing feature-space distances from the clusters.

11. The method of claim 8,

12. The method of claim 8,

13. The method of claim 12, further comprising:

generating, by the recommendation system, one or more recommendations for placing new content into the taxonomy based on the user's history of accepting previous categorization recommendations and the user's history of rejecting previous categorization recommendations.

14. The method of claim 8, further comprising:

generating, by the recommendation system, recommendations for creating new categories within the taxonomy for a plurality of uncategorized content items.

15. A non-transitory computer readable storage medium having instructions thereon which, when read and executed by a computer, cause the computer to perform steps comprising:

providing a computer including one or more processors that provides access to the content management system;

generating, by a recommendation system of the content categorization engine, a feature vector from content at the content management system, the recommendation system having access to a database of the content categorization engine, wherein the generation of the feature vector is based at least on an evaluation of previously categorized content within the taxonomy; and

using the generated feature vector to categorize new content into the taxonomy.

16. The non-transitory computer readable storage medium of claim 15, said steps further comprising:

17. The non-transitory computer readable storage medium of claim 16, said steps further comprising:

18. The non-transitory computer readable storage medium of claim 15,

19. The non-transitory computer readable storage medium of claim 15,

20. The non-transitory computer readable storage medium of claim 19, said steps further comprising: