WO2025232574A1

WO2025232574A1 - Knowledge graph construction

Info

Publication number: WO2025232574A1
Application number: PCT/CN2025/091106
Authority: WO
Inventors: 李凤; 李佩; 王伟
Original assignee: Chongqing Ant Consumer Finance Co Ltd
Current assignee: Chongqing Ant Consumer Finance Co Ltd
Priority date: 2024-05-08
Filing date: 2025-04-25
Publication date: 2025-11-13
Anticipated expiration: 2026-11-08
Also published as: CN118520940A

Abstract

Disclosed in the embodiments of the present disclosure are a knowledge graph construction method and apparatus, a storage medium and a terminal. The method comprises: generating at least one user entity on the basis of user data in a target scenario, and generating at least one data entity on the basis of unstructured data in the target scenario; for the data entities of each type, separately extracting a data association relationship for every two data entities of the type; and on the basis of a corresponding relationship between each user entity and each data entity and the data association relationships between the data entities, constructing a knowledge graph corresponding to the target scenario. The present application generates the user entity and the data entities of the unstructured data on the basis of original data in scenarios; and by means of analysis of the unstructured data, the present application extracts the association relationship between any two entities among data entities of each type, and uses same as edges for connecting entity nodes, thus effectively integrating data of various types in the scenarios, and constructing knowledge graphs with richer and more comprehensive knowledge in the target scenarios.

Description

Knowledge Graph Construction

Technical Field

Embodiments of this disclosure relate to the field of computer technology, and in particular to knowledge graph construction.

Background

Generally, many types of user-related data can be obtained from various data sources, such as personal information, behavioral data, and uploaded text and image materials. These data reflect users' needs and preferences for services. Aggregating and analyzing multiple types of data helps the platform understand users better, improves the match between user needs and services, and thus enhances the user experience. However, data obtained from different sources often differ in type and structure. A knowledge graph construction method that can integrate the features of multiple types of data is therefore needed to describe user preferences accurately.

Summary of the Invention

Embodiments of this disclosure provide a knowledge graph construction method, apparatus, storage medium, and terminal, which can solve the technical problem in the related art that knowledge graphs cannot be constructed from unstructured data.

In a first aspect, embodiments of this disclosure provide a knowledge graph construction method, the method comprising: acquiring user data and at least one type of unstructured data in a target scenario, generating at least one user entity based on the user data, and generating at least one data entity based on the unstructured data; for each type of data entity, extracting a data association relationship for every two data entities of that type; and constructing a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, wherein the entities serve as nodes of the knowledge graph and the relationships between the entities serve as edges connecting the nodes.

In a second aspect, embodiments of this disclosure provide a knowledge graph construction apparatus, the apparatus comprising: an entity preparation module, configured to acquire user data and at least one type of unstructured data in a target scenario, generate at least one user entity based on the user data, and generate at least one data entity based on the unstructured data; a relationship extraction module, configured to extract, for each type of data entity, a data association relationship for every two data entities of that type; and a graph construction module, configured to construct a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, wherein the entities serve as nodes of the knowledge graph and the relationships between the entities serve as edges connecting the nodes.

In a third aspect, embodiments of this disclosure provide a computer program product containing instructions that, when the computer program product runs on a computer or processor, cause the computer or processor to perform the steps of the method described above.

In a fourth aspect, embodiments of this disclosure provide a computer storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the steps of the method described above.

In a fifth aspect, embodiments of this disclosure provide a terminal including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being adapted to be loaded by the processor to perform the steps of the method described above.

The beneficial effects of the technical solutions provided by some embodiments of this disclosure include at least the following. Embodiments of this disclosure provide a knowledge graph construction method that acquires user data and at least one type of unstructured data in a target scenario, generates at least one user entity based on the user data, and generates at least one data entity based on the unstructured data; for each type of data entity, extracts a data association relationship for every two data entities of that type; and constructs a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, with the entities serving as nodes and the relationships between the entities serving as edges connecting the nodes. User entities and data entities for the unstructured data are generated from the original data in the scenario, and, by analyzing the unstructured data, the association relationship between any two entities of each type of data entity is extracted and used as an edge connecting entity nodes. This links all of the unstructured data, effectively integrates the different types of data in the scenario, and yields a knowledge graph for the target scenario that contains richer and more comprehensive knowledge and is easier to analyze.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of this disclosure or in the related art more clearly, the drawings used in describing the embodiments or the related art are briefly introduced below. The drawings described below are only some embodiments of this disclosure; those skilled in the art can obtain other drawings from them without creative effort.

Figure 1 is an exemplary system architecture diagram for a knowledge graph construction method provided in an embodiment of this disclosure;

Figure 2 is a flowchart of a knowledge graph construction method provided in an embodiment of this disclosure;

Figure 3 is a flowchart of a knowledge graph construction method provided in another embodiment of this disclosure;

Figure 4 is an example diagram of the logical modules of a knowledge graph construction method provided in an embodiment of this disclosure;

Figure 5 is a structural block diagram of a knowledge graph construction apparatus provided in an embodiment of this disclosure;

Figure 6 is a schematic structural diagram of a terminal provided in an embodiment of this disclosure.

Detailed Description

To make the features and advantages of the embodiments of this disclosure clearer and easier to understand, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art from the embodiments of this disclosure without creative effort fall within the protection scope of the embodiments of this disclosure.

Where the following description refers to the accompanying drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this disclosure as detailed in the appended claims.

In the era of big data, the data on a platform comes from many sources. For example, in addition to the personal information a user fills in, a user's data may come from several data modalities such as text, image, and audio. Data of different modalities describe different aspects of the information, so multimodal data reflect users' needs and preferences for services more comprehensively. Aggregating and analyzing multiple types of data helps build a knowledge base relating the platform, its services, and its users. This knowledge base can be used to analyze user needs and locate target markets, so that the platform understands market and user needs, provides better-targeted services, and improves the user experience.

However, unlike structured data, which is easy to process, much of the data obtained from different data sources is unstructured. Such data often differ in type and characteristics, contain richer raw information, and are comparatively difficult to process. Traditional data processing methods therefore struggle to integrate these data effectively and cannot conveniently link data of different types. As a result, the connection between data and users is weak, the data are not put to proper use, and the platform's ability to provide personalized, high-quality services is limited.

Embodiments of this disclosure therefore provide a knowledge graph construction method that acquires user data and at least one type of unstructured data in a target scenario, generates at least one user entity based on the user data, and generates at least one data entity based on the unstructured data; for each type of data entity, extracts a data association relationship for every two data entities of that type; and constructs a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, with the entities serving as nodes and the relationships between the entities serving as edges connecting the nodes. This addresses the technical problem described above, namely that a knowledge graph cannot be constructed from unstructured data.

Refer to Figure 1, which is an exemplary system architecture diagram for a knowledge graph construction method provided in an embodiment of this disclosure.

As shown in Figure 1, the system architecture may include a terminal 101, a network 102, and a server 103. The network 102 is the medium that provides a communication link between the terminal 101 and the server 103. The network 102 may include various types of wired or wireless communication links. For example, wired communication links include optical fiber, twisted pair, or coaxial cable, and wireless communication links include Bluetooth, Wireless-Fidelity (Wi-Fi), or microwave communication links.

The terminal 101 can interact with the server 103 via the network 102 to receive messages from the server 103 or send messages to it, or to receive messages or data that other users have sent to the server 103. The terminal 101 may be hardware or software. When the terminal 101 is hardware, it may be any of various electronic devices, including but not limited to smartwatches, smartphones, tablets, laptop computers, and desktop computers. When the terminal 101 is software, it may be installed on the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module; no specific limitation is made here.

In the embodiments of this disclosure, the terminal 101 acquires user data and at least one type of unstructured data in the target scenario, generates at least one user entity based on the user data, and generates at least one data entity based on the unstructured data. Then, for each type of data entity, the terminal 101 extracts a data association relationship for every two data entities of that type. The terminal 101 can then construct a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, with the entities serving as nodes and the relationships between the entities serving as edges connecting the nodes.

The server 103 may be a business server that provides various services. The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module; no specific limitation is made here.

Alternatively, the system architecture may omit the server 103. In other words, the server 103 is an optional device in the embodiments of this disclosure, and the method provided herein can be applied to a system that includes only the terminal 101. The embodiments of this disclosure place no limitation on this.

It should be understood that the numbers of terminals, networks, and servers in Figure 1 are merely illustrative; there may be any number of each, depending on implementation needs.

Refer to Figure 2, which is a flowchart of a knowledge graph construction method provided in an embodiment of this disclosure. The method may be executed by a terminal that performs knowledge graph construction, by a processor in such a terminal, or by a knowledge graph construction service in such a terminal. For ease of description, the following takes the processor in the terminal as the executing entity when describing the specific steps of the method.

As shown in Figure 2, the knowledge graph construction method may include at least the following steps.

S202. Acquire user data and at least one type of unstructured data in the target scenario, generate at least one user entity based on the user data, and generate at least one data entity based on the unstructured data.

Optionally, given the large volume of data on a platform, a knowledge graph can be constructed to capture the associations between users and data when the data are analyzed for user-related decisions. A knowledge graph is an important component of artificial intelligence technology, with strong capabilities in semantic processing, interlinked organization, information retrieval, and knowledge reasoning. It is a semantic network graph in which nodes (or vertices) represent entities or concepts and edges represent relationships, describing the entities and concepts that exist in the real world and the relationships among them. A knowledge graph is also called a knowledge domain visualization or knowledge domain map: a series of different graphs that show the development and structure of knowledge, using visualization techniques to describe knowledge resources and their carriers and to mine, analyze, construct, draw, and display knowledge and its interconnections. The entities and relationships in a knowledge graph are generally represented as triples, each consisting of a head entity, a tail entity, and a relationship. Entities are linked to one another through relationships, forming a networked knowledge structure.
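The triple representation described above can be illustrated with a small Python sketch. The `Triple` type, the identifier scheme (`user:u1`, `text:t1`), and the relation labels are illustrative assumptions; the disclosure does not prescribe a storage format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # identifier of the head entity
    relation: str  # relationship label (hypothetical labels below)
    tail: str      # identifier of the tail entity


# A tiny graph: two users, one text each, and a link between the texts.
triples = [
    Triple("user:u1", "has_data", "text:t1"),
    Triple("user:u2", "has_data", "text:t2"),
    Triple("text:t1", "similar_to", "text:t2"),
]


def neighbors(entity: str, triples: list[Triple]) -> set[str]:
    """Return the entities directly connected to `entity`, in either direction."""
    found = set()
    for t in triples:
        if t.head == entity:
            found.add(t.tail)
        elif t.tail == entity:
            found.add(t.head)
    return found
```

Here `neighbors("text:t1", triples)` yields both the owning user and the related text, which is the networked structure the paragraph refers to.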

Optionally, most knowledge graphs today are built from structured data. Structured data is usually stored in a database and can be logically expressed as a two-dimensional table in which every column has a specific meaning. Because structured data has a fixed format and structure, it is relatively easy for computers to process. Unstructured data such as text, images, and speech, by contrast, is not easily represented as two-dimensional database tables. It contains rich raw information, but its varied formats and lack of a unified logical structure make it harder to process. Traditional data processing methods cannot integrate unstructured data effectively, which limits the platform's ability to provide personalized, high-quality services. In addition, most current user data analysis methods try to understand users through personal profiles and basic user behavior. They do not consider behavioral correlations between different users or between different behaviors of the same user, which makes it difficult to understand user needs and preferences in depth.

Optionally, unstructured data can be used to construct a knowledge graph with richer and more comprehensive information. The first step is to construct the entities that serve as nodes in the graph. User data and at least one type of unstructured data are acquired in the target scenario, and each user and each item of unstructured data is converted into an entity node. The association relationships between the data are then analyzed so that the entities are connected by relationship edges, yielding a knowledge graph that links users and unstructured data. Specifically, the types of unstructured data may include at least one of text data, image data, and audio data, and the content of the data relates to the user and to the services of the target scenario. On a financial institution's platform, for example, text data may be the text records of a user's consultation with an account manager or the written description of a service the user selected; image data may be images of supporting documents uploaded by the user; and audio data may be voice clips uploaded by the user.

Optionally, a user entity is generated for each user based on the user data. The identifier of the user entity is the user's unique identifier in the target scenario, and its attributes are basic attribute information such as the user's identity information and interest preferences. Similarly, a data entity is generated for each data item based on the unstructured data. The identifier of the data entity is the item's unique identifier in the target scenario, and its attributes are the key-point information and descriptive information of the data itself.
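A minimal sketch of this entity generation step follows, assuming input records arrive as plain dictionaries. The field names (`user_id`, `data_id`, `modality`, `key_points`, `description`) and the `Entity` class are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    entity_id: str  # unique identifier within the target scenario
    kind: str       # "user", or a data modality such as "text", "image", "audio"
    # Attributes are excluded from equality and hashing so that the
    # identifier and kind alone determine entity identity.
    attributes: dict = field(default_factory=dict, compare=False)


def make_user_entity(record: dict) -> Entity:
    """Build a user entity; every field except the ID becomes an attribute."""
    attrs = {k: v for k, v in record.items() if k != "user_id"}
    return Entity(entity_id=record["user_id"], kind="user", attributes=attrs)


def make_data_entity(item: dict) -> Entity:
    """Build a data entity whose attributes are key points and a description."""
    attrs = {
        "key_points": item.get("key_points", []),
        "description": item.get("description", ""),
    }
    return Entity(entity_id=item["data_id"], kind=item["modality"], attributes=attrs)


user = make_user_entity(
    {"user_id": "u1", "identity": "verified", "interests": ["loans"]}
)
doc = make_data_entity(
    {"data_id": "t1", "modality": "text", "key_points": ["rate"],
     "description": "consultation record"}
)
```

Keeping the modality in `kind` makes it straightforward to group data entities by type in the later relationship extraction step.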

Further, the relationship between each user and that user's own unstructured data can be established first. This amounts to building a knowledge subgraph centered on the user entity from the user's own data and the text, image, and voice data belonging to that user. The associations between the unstructured data items are extracted afterwards, so that the different users and different data on the platform are all included in the knowledge graph.

S204. For each type of data entity, extract a data association relationship for every two data entities of that type.

Optionally, because data of different types may differ in characteristics and structure, relationships between unstructured data are extracted category by category, among data of the same category. That is, for each type of data entity, a data association relationship is extracted for every two data entities of that type. The data association relationship describes the degree of similarity between the two data entities. This establishes how any two data items of the same type are related, which connects the large volume of data and the users in the target scenario and allows unstructured data to be used effectively to analyze user needs and preferences and to provide better services.
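The per-type pairwise extraction in S204 can be sketched as follows. The `pairwise_relations` function and the Jaccard token overlap used as the similarity measure are stand-ins chosen for illustration; the disclosure leaves the actual similarity model to each data type.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_relations(entities, similarity):
    """Score every unordered pair of entities that share the same type.

    entities:   iterable of (entity_id, kind, payload) tuples
    similarity: callable(kind, payload_a, payload_b) -> float
    Returns a list of (id_a, id_b, score); pairs never cross types.
    """
    by_kind = defaultdict(list)
    for entity_id, kind, payload in entities:
        by_kind[kind].append((entity_id, payload))

    relations = []
    for kind, members in by_kind.items():
        for (id_a, a), (id_b, b) in combinations(members, 2):
            relations.append((id_a, id_b, similarity(kind, a, b)))
    return relations


def jaccard(kind, a, b):
    """Toy similarity: token overlap between two whitespace-split strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


entities = [
    ("t1", "text", "loan rate query"),
    ("t2", "text", "loan rate change"),
    ("t3", "text", "card limit"),
    ("i1", "image", "id card front"),
    ("i2", "image", "id card back"),
]
relations = pairwise_relations(entities, jaccard)
```

Note that scoring every pair is quadratic in the number of entities of each type, so a large deployment would likely need some form of candidate pruning; that is outside what this sketch shows.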

S206. Construct a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, with the entities serving as nodes of the knowledge graph and the relationships between the entities serving as edges connecting the nodes.

Optionally, once the relationships between the data have been extracted, the entities are used as nodes of the knowledge graph and the relationships between the entities are used as edges connecting the nodes. That is, the knowledge graph for the target scenario is constructed from the user entities and data entities, the correspondence between each user entity and each data entity, and the data association relationships between the data entities. The resulting knowledge graph links all of the unstructured data and effectively integrates the different types of data in the scenario, which helps the platform understand users better, improves the match between user needs and services, and thus improves the user experience.
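A minimal sketch of step S206 follows, assuming the user-to-data correspondence is available as a mapping from data identifier to owning user identifier. The function names and the edge labels `has_data` and `similar_to` are illustrative assumptions.

```python
from collections import defaultdict, deque


def build_graph(user_ids, data_owner, data_relations):
    """Assemble nodes and edges from entities and their relationships.

    user_ids:       iterable of user entity IDs
    data_owner:     dict mapping data entity ID -> owning user entity ID
    data_relations: iterable of (data_id_a, data_id_b, score)
    """
    nodes = {uid: "user" for uid in user_ids}
    edges = []
    for data_id, owner_id in data_owner.items():
        nodes[data_id] = "data"
        edges.append((owner_id, data_id, "has_data"))
    for id_a, id_b, score in data_relations:
        edges.append((id_a, id_b, f"similar_to:{score:.2f}"))
    return nodes, edges


def reachable(edges, start):
    """Breadth-first search over undirected edges from `start`."""
    adjacency = defaultdict(set)
    for a, b, _ in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


nodes, edges = build_graph(
    user_ids=["u1", "u2"],
    data_owner={"t1": "u1", "t2": "u2"},
    data_relations=[("t1", "t2", 0.5)],
)
```

In this example the two users have no direct edge, yet `reachable(edges, "u1")` includes `u2` because their texts are related, which illustrates how data-to-data edges connect otherwise separate user-centered subgraphs.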

The embodiments of this disclosure thus provide a knowledge graph construction method that acquires user data and at least one type of unstructured data in a target scenario, generates at least one user entity based on the user data, and generates at least one data entity based on the unstructured data; for each type of data entity, extracts a data association relationship for every two data entities of that type; and constructs a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, with the entities serving as nodes and the relationships between the entities serving as edges connecting the nodes. User entities and data entities for the unstructured data are generated from the original data in the scenario, and, by analyzing the unstructured data, the association relationship between any two entities of each type of data entity is extracted and used as an edge connecting entity nodes. This links all of the unstructured data, effectively integrates the different types of data in the scenario, and yields a knowledge graph for the target scenario that contains richer and more comprehensive knowledge and is easier to analyze.

Refer to Figure 3, which is a flowchart of a knowledge graph construction method provided in another embodiment of this disclosure.

As shown in Figure 3, the knowledge graph construction method may include at least the following steps.

S302. Acquire user data and at least one type of unstructured data in the target scenario, and generate at least one user entity based on the user data.

Optionally, as described in the embodiment above, a knowledge graph with richer and more comprehensive information can be built by using user data and unstructured data to construct the entities that serve as nodes, and then analyzing the association relationships between the data so that the entities are connected by relationship edges, yielding a knowledge graph that links users and unstructured data. The first step is therefore to acquire user data and at least one type of unstructured data in the target scenario and to convert each user and each item of unstructured data into an entity node. Refer to Figure 4, which is an example diagram of the logical modules of a knowledge graph construction method provided in an embodiment of this disclosure. As shown in Figure 4, the entity preparation module generates the corresponding user entities and data entities from the user data and the unstructured data. User entities and data entities are two different types of entity: a user entity represents a real user in the target scenario, while a data entity corresponds to an item of unstructured data in the target scenario. To distinguish them in the representation of the knowledge graph, data entities and user entities can be drawn with different node colors. In the graph construction module, for example, data entities are shown as black spheres and user entities as light gray spheres.

S304. When the unstructured data includes text data, generate at least one text entity based on the text data, wherein the text data includes at least one text, each text corresponds to one text entity, and each text entity has a unique text identifier.

Optionally, in the target scenario, the text messages a user sends during a consultation, the descriptions of products the user has purchased, and similar texts may contain information about the user's needs and points of interest. Such texts can therefore serve as user-related text data, and building the knowledge graph on this text data supports a deeper understanding of the user. Accordingly, when the unstructured data includes text data, generating data entities means generating at least one text entity based on the text data, where the text data includes at least one text and each text corresponds to one text entity. In the knowledge graph, the unique text identifier of a text entity is the unique identifier of its original text.

S306. Input the text data into a text relationship extraction model, calculate the text similarity between every two texts through the text relationship extraction model, and determine the text association relationship between every two text entities based on the text similarity.

Optionally, referring again to Figure 4, in the relationship extraction logical module, once the text entities are prepared, the association relationship between every two text entities is extracted. The connections and relevance between different texts are analyzed in depth, and this relevance is used as relational knowledge when constructing the knowledge graph, so that the graph contains richer and more accurate knowledge. Because text has a complex data structure and carries rich semantic, lexical, and emotional information, a neural network model capable of processing text can be used for text feature representation and relevance calculation when extracting relationships between text entities. Specifically, a pre-trained BERT model can be used for the Natural Language Processing (NLP) task: starting from the BERT model, the text relationship extraction model of this embodiment is obtained through targeted training on sample text data from the target scenario. The text data is input into the text relationship extraction model, the text similarity between every two texts is calculated through the model, and the text association relationship between every two text entities is determined based on the text similarity. It should be noted that, in the embodiments of the present disclosure, the text relationship extraction model may be used only to produce feature representations of the texts, with a separate calculation module computing the similarity between texts from the representations output by the model. Alternatively, the text relationship extraction model may both produce the feature representations and compute the text similarity directly, outputting the similarity result.
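The first variant described above, in which the model only outputs text representations and a separate module computes similarity, can be sketched as follows. Cosine similarity is an assumed choice of measure, and `toy_encode` is a hypothetical stand-in for the fine-tuned model so that the example runs without any model weights.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pairwise_text_similarity(texts: Dict[str, str],
                             encode: Callable[[str], Vector]
                             ) -> Dict[Tuple[str, str], float]:
    """Encode each text once, then score every unordered pair of text ids."""
    vectors = {text_id: encode(text) for text_id, text in texts.items()}
    return {(a, b): cosine_similarity(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}


def toy_encode(text: str) -> Vector:
    """Stand-in for the trained model: counts over a tiny fixed vocabulary."""
    vocab = ["loan", "repay", "invest", "fund"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```

In practice `encode` would wrap the trained text relationship extraction model; everything else stays unchanged.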

Optionally, after the raw text data is obtained, it is first cleaned to remove numbers, symbols, and other content irrelevant to the target scenario. Texts may differ in length and some may be long, while the model usually requires that the input length not exceed a threshold when extracting text representation vectors. When comparing texts, a sliding-window slicing approach can therefore be used: each text in the text data is divided into at least one text slice of a preset length, the text data is input into the text relationship extraction model, and the model calculates the slice similarity of every text slice pair between every two texts, where a text slice pair consists of one slice from each of the two texts being compared. The slice similarity that satisfies a preset condition is then selected as the text similarity of the two texts. For example, if text a is divided into fragment 1 and fragment 2, and text b is divided into fragment 3 and fragment 4, the text similarity between a and b is obtained by first calculating the slice similarity of every fragment pair, that is, fragments 1 and 3, fragments 1 and 4, fragments 2 and 3, and fragments 2 and 4, and then taking the largest slice similarity as the text similarity between a and b. This addresses the problem that a long text cannot be processed by the model in a single pass.
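The sliding-window comparison can be sketched as below, taking the maximum slice similarity as the preset condition, as in the fragment example. The window and stride values are assumptions, and `char_overlap` is a hypothetical toy measure standing in for the model's slice similarity.

```python
from typing import Callable, List


def slice_text(text: str, window: int, stride: int) -> List[str]:
    """Cut `text` into slices of at most `window` characters, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(text) <= window:
        return [text]
    slices = []
    start = 0
    while True:
        slices.append(text[start:start + window])
        if start + window >= len(text):
            break  # the current slice already reaches the end of the text
        start += stride
    return slices


def text_similarity(a: str, b: str, window: int, stride: int,
                    slice_sim: Callable[[str, str], float]) -> float:
    """Score every slice pair and keep the maximum as the text similarity."""
    return max(slice_sim(x, y)
               for x in slice_text(a, window, stride)
               for y in slice_text(b, window, stride))


def char_overlap(x: str, y: str) -> float:
    """Toy slice similarity (Jaccard over characters); stands in for the model."""
    sx, sy = set(x), set(y)
    return len(sx & sy) / len(sx | sy) if sx | sy else 0.0
```

A stride smaller than the window makes adjacent slices overlap, which reduces the chance that a relevant passage is split across a slice boundary.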

Optionally, in addition to sliding-window slicing, a text can be condensed by extracting a summary. A text summary is extracted for each text in the text data, all text summaries are input into the text relationship extraction model, the model calculates the summary similarity between every two text summaries, and the summary similarity of two texts is taken as their text similarity. Specifically, a Large Language Model (LLM) can be introduced to extract the summaries. LLMs are good at understanding long text and can extract accurate summaries without losing the rich semantics of the text. Once summaries that meet the input requirements of the text relationship extraction model are obtained, the model outputs the summary similarity, which is used as the text similarity between the original texts.

Furthermore, the LLM can also be used to extract attributes for text entities. Prompts are provided to the LLM that specify the analysis to perform, the content to output, the required output format, and so on. Following the prompt, the LLM extracts points of interest from the text. For example, in a financial scenario, the points of interest of a text may be lending products, investment recommendations, and the like. These points of interest represent the key content of the text and can be used directly as attribute information of the text entity.
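A prompt of the kind described above might be structured as in the following sketch. The prompt wording, the JSON output format, and the `llm` callable are all assumptions made for illustration; the disclosure does not specify them, and any real LLM client would be wrapped to match the `Callable[[str], str]` shape.

```python
import json
from typing import Callable, List

# Hypothetical prompt: states the analysis, the content, and the output format.
PROMPT_TEMPLATE = (
    "You are analyzing customer text from a financial service scenario.\n"
    "Task: list the points of interest the text focuses on "
    "(for example lending products or investment recommendations).\n"
    'Output format: a JSON object {{"points_of_interest": [<string>, ...]}} '
    "and nothing else.\n"
    "Text:\n{text}\n"
)


def extract_attributes(text: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the model for points of interest and return them as entity attributes."""
    reply = llm(PROMPT_TEMPLATE.format(text=text))
    try:
        points = json.loads(reply)["points_of_interest"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []  # malformed reply: attach no attributes rather than guess
    if not isinstance(points, list):
        return []
    return [p for p in points if isinstance(p, str)]
```

Requesting a fixed machine-readable format keeps the extracted points of interest easy to attach to the text entity as attributes.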

S308. When the unstructured data includes image data, generate at least one image entity based on the image data. The image data includes at least one image, each image corresponds to one image entity, and each image entity has a unique image identifier.

Optionally, in the target scenario, users may upload images of their identity information, qualification certificates, and the like when obtaining services or purchasing products. These images contain a large amount of user-related information, and building the knowledge graph on this image data supports a deeper understanding of the user. Accordingly, when the unstructured data includes image data, generating data entities means generating at least one image entity based on the image data, where the image data includes at least one image and each image corresponds to one image entity. In the knowledge graph, the unique image identifier of an image entity is the unique identifier of its original image.

S310. Extract an image content association relationship for every two image entities, and extract an image ontology association relationship for every two image entities.

Optionally, the information in an image usually lies in two aspects: the text contained in the image, and the pixel information of the image itself. The correlation between images can therefore be determined from each aspect separately. That is, referring again to Figure 4, when extracting the data association relationship between every two image entities, an image content association relationship and an image ontology association relationship are both extracted for every two image entities.

Optionally, extracting the image content association relationship for every two image entities requires extracting and comparing the text in the images. The target text content of each image is extracted with an image text recognition algorithm such as OCR, the content similarity of the target text content of every two images is calculated with a method such as Manhattan distance, cosine similarity, or simhash, and the image content association relationship between every two image entities is determined based on the content similarity. When extracting the target text content, the key information in an image can be extracted selectively. The types of key information are classified in advance by technicians. For example, all images may be divided into those related to loan repayment and those related to investment, and images belonging to different key information types are processed by algorithms with different focuses to recognize the corresponding key information.
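Of the methods named above, simhash can be sketched with the standard library alone. The inputs are assumed to be text already recognized by OCR; whitespace tokenization, equal token weights, MD5 as the per-token hash, and the 64-bit fingerprint size are illustrative choices, not requirements of the disclosure.

```python
import hashlib
from typing import List

BITS = 64  # fingerprint width; 64 bits are taken from each token's MD5 digest


def simhash(tokens: List[str]) -> int:
    """Compute a SimHash fingerprint: each token votes +1/-1 on every bit."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def content_similarity(text_a: str, text_b: str) -> float:
    """1 minus the normalized Hamming distance between the two fingerprints."""
    fa = simhash(text_a.split())
    fb = simhash(text_b.split())
    return 1.0 - bin(fa ^ fb).count("1") / BITS
```

Because the fingerprints are fixed-size integers, comparing every pair of images reduces to an XOR and a bit count per pair.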

Optionally, when extracting the image ontology association relationship for every two image entities, the pixel features of the images themselves are analyzed and compared. This makes it possible to attend to the positions of different elements in an image, such as the position of a stamp or a signature, and thus to infer the correlation between images more accurately. In the embodiments of the present disclosure, the image data is input into an image relationship extraction model, the image similarity between every two images is calculated through the model, and the image ontology association relationship between every two image entities is determined based on the image similarity. The image relationship extraction model can use the Swin Transformer model to extract feature representations of the images. The advantage of the Swin Transformer is its hierarchical Transformer architecture, which processes features at different scales through a hierarchical attention mechanism and is well suited to image representation tasks.

S312. When the unstructured data includes audio data, generate at least one audio entity based on the audio data. The audio data includes at least one audio, each audio corresponds to one audio entity, and each audio entity has a unique audio identifier.

Optionally, in the target scenario, audio associated with a user also carries user-related information, and building the knowledge graph on this audio data helps supplement the user information in the graph. Accordingly, when the unstructured data includes audio data, generating data entities means generating at least one audio entity based on the audio data, where the audio data includes at least one audio and each audio corresponds to one audio entity. In the knowledge graph, the unique audio identifier of an audio entity is the unique identifier of its original audio.

S314. Obtain the voiceprint information corresponding to each audio in the audio data based on a voiceprint extraction algorithm; calculate the voiceprint similarity of the voiceprint information of every two audios, and determine the audio association relationship between every two audio entities based on the voiceprint similarity.

Optionally, referring again to Figure 4, when extracting the data association relationship between every two audios, the voiceprint information of each audio in the audio data is obtained with a voiceprint extraction algorithm such as x-vector or Conformer, and a voiceprint library is constructed. The Conformer algorithm combines a convolutional neural network (CNN) with the Transformer architecture, using the spatial awareness of convolutional layers and the global dependency modeling of the Transformer to extract more robust and discriminative voiceprint representations. The voiceprint similarity of the voiceprint information of every two audios is then calculated, and the audio association relationship between every two audio entities is determined based on the voiceprint similarity. When new audio data appears later, its voiceprint information is extracted and a voiceprint search is run against the voiceprint library to check whether any stored voiceprint has a similarity that satisfies the condition, thereby determining the position of the new audio in the knowledge graph and its connections to other audio entities.
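The voiceprint library and its retrieval step can be sketched as follows. The voiceprints are assumed to be fixed-length vectors already produced by a voiceprint extraction model; the class name `VoiceprintLibrary`, the cosine measure, and the linear scan are illustrative assumptions, and a production system would more likely use an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceprintLibrary:
    """In-memory store of audio id -> voiceprint vector with threshold search."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self._prints: Dict[str, Vector] = {}

    def search(self, voiceprint: Vector) -> List[Tuple[str, float]]:
        """Return stored audios meeting the threshold, most similar first."""
        scored = [(audio_id, cosine(voiceprint, stored))
                  for audio_id, stored in self._prints.items()]
        hits = [m for m in scored if m[1] >= self.threshold]
        return sorted(hits, key=lambda m: m[1], reverse=True)

    def add(self, audio_id: str, voiceprint: Vector) -> List[Tuple[str, float]]:
        """Register a new audio and return the existing audios it links to."""
        matches = self.search(voiceprint)
        self._prints[audio_id] = voiceprint
        return matches
```

Each match returned by `add` corresponds to a candidate edge between the new audio entity and an existing one.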

S316. For each type, determine the target data association relationships that satisfy the similarity condition of that type, and construct the knowledge graph corresponding to the target scenario based on the correspondence between the user entities and the data entities and the target data association relationships between the data entities.

Optionally, as the above embodiments show, when extracting the data association relationships between data entities of each type, the similarity between every two data items of each type of unstructured data can be calculated first, and the data association relationship between every two data entities of that type is then determined based on the similarity.

Specifically, a similarity above a certain level indicates that two data items are highly correlated, while a low similarity indicates weak correlation. When adding data association relationships to the knowledge graph, the relationships therefore need to be filtered by similarity, which prevents the graph from accumulating redundant data and preserves its accuracy. In the embodiments of the present disclosure, the target data association relationships that satisfy the similarity condition of each type are determined for that type, and the knowledge graph corresponding to the target scenario is then constructed based on the correspondence between the user entities and the data entities and the target data association relationships between the data entities.

It should be noted that different types of data have different characteristics, and their relationships differ accordingly. The similarity condition of each type can therefore be derived from the data characteristics and relationship characteristics of that type: combined with the actual data requirements of the target scenario, a similarity threshold is determined for each type and used as the filtering condition. During filtering, data association relationships whose similarity is greater than the threshold enter the graph, and relationships that do not meet the requirement are discarded.
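The per-type filtering and the subsequent graph assembly can be sketched as below. The threshold values, the relation type names, and the `"owns"` label for user-to-data correspondence are hypothetical; real thresholds would be chosen from the data of the target scenario.

```python
from typing import Dict, List, Tuple

# (source id, target id, relation type, similarity)
Edge = Tuple[str, str, str, float]
Graph = Dict[str, List[Tuple[str, str]]]

# Hypothetical per-type thresholds, for illustration only.
THRESHOLDS: Dict[str, float] = {
    "text": 0.80, "image_content": 0.85, "image_ontology": 0.75, "audio": 0.90,
}


def filter_relations(candidates: List[Edge],
                     thresholds: Dict[str, float]) -> List[Edge]:
    """Keep only relations whose similarity exceeds the threshold of their type."""
    return [e for e in candidates if e[3] > thresholds[e[2]]]


def build_graph(ownership: List[Tuple[str, str]],
                relations: List[Edge]) -> Graph:
    """Undirected adjacency list: node id -> [(neighbor id, edge label)]."""
    graph: Graph = {}

    def link(a: str, b: str, label: str) -> None:
        graph.setdefault(a, []).append((b, label))
        graph.setdefault(b, []).append((a, label))

    for user_id, data_id in ownership:
        link(user_id, data_id, "owns")
    for src, dst, rel_type, _ in relations:
        link(src, dst, rel_type)
    return graph
```

Entities whose candidate relations are all discarded still enter the graph if they correspond to a user, since the ownership edges are added independently of the filter.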

S318. Perform personalized user analysis and/or product recommendation in the target scenario based on the knowledge graph.

Optionally, once the knowledge graph for the target scenario has been built, personalized user analysis and/or product recommendation can be performed in the target scenario based on the knowledge graph. Applying the knowledge graph to personalized services and product recommendation makes it possible to use the user information in the graph and the implicit information of the unstructured data to offer services and products that better match users' needs, thereby improving the user experience.
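The disclosure does not specify how the analysis or recommendation is carried out. As one hypothetical illustration, users can be related through their data: starting from a user, follow a correspondence edge to a data entity, then an association edge to a similar data entity, then a correspondence edge to another user. The adjacency-list shape and the `"owns"` label below are assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node id -> [(neighbor id, edge label)]


def related_users(graph: Graph, user_id: str) -> Set[str]:
    """Users reachable via user -> data -> associated data -> user."""
    result: Set[str] = set()
    for data_id, label in graph.get(user_id, []):
        if label != "owns":
            continue
        for other_data, relation in graph.get(data_id, []):
            if relation == "owns":
                continue  # only follow data-to-data association edges here
            for neighbor, neighbor_label in graph.get(other_data, []):
                if neighbor_label == "owns" and neighbor != user_id:
                    result.add(neighbor)
    return result
```

Products taken up by the related users could then serve as recommendation candidates for the starting user.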

In the embodiments of the present disclosure, a knowledge graph construction method is provided. When the unstructured data includes text data, at least one text entity is generated based on the text data, the text data is input into the text relationship extraction model, the text similarity between every two texts is calculated through the model, and the text association relationship between every two text entities is determined based on the text similarity. Applying a large model to long-text understanding supplies additional information and enriches the semantic information, and the model's strong knowledge insight capability yields more accurate similarity results. When the unstructured data includes image data, at least one image entity is generated based on the image data, and an image content association relationship and an image ontology association relationship are extracted for every two image entities. The correlation between images is thus analyzed from multiple aspects, including image content and image ontology, so that the positions of key elements on the images are considered alongside the document content. When the unstructured data includes audio data, at least one audio entity is generated based on the audio data, the voiceprint information of each audio is obtained with a voiceprint extraction algorithm, the voiceprint similarity of the voiceprint information of every two audios is calculated, and the audio association relationship between every two audio entities is determined based on the voiceprint similarity. The audio relationship extraction problem is solved by constructing a voiceprint library with a retrieval capability. The target data association relationships that satisfy the similarity condition of each type are determined for that type, and the knowledge graph is constructed based on the target data association relationships. Filtering the data association relationships by similarity prevents the graph from accumulating redundant data and preserves its accuracy. Personalized user analysis and/or product recommendation is performed in the target scenario based on the knowledge graph, using the user information in the graph and the implicit information of the unstructured data to improve the user experience.

Please refer to Figure 5, which is a structural block diagram of a knowledge graph construction apparatus provided in an embodiment of the present disclosure. As shown in Figure 5, the knowledge graph construction apparatus 500 includes: an entity preparation module 510, configured to acquire user data and at least one type of unstructured data in a target scenario, generate at least one user entity based on the user data, and generate at least one data entity based on the unstructured data; a relationship extraction module 520, configured to extract, for the data entities of each type, a data association relationship for every two data entities of that type; and a graph construction module 530, configured to construct a knowledge graph corresponding to the target scenario based on the correspondence between the user entities and the data entities and the data association relationships between the data entities, where the entities serve as nodes of the knowledge graph and the relationships between entities serve as edges connecting the nodes.

Optionally, the types of unstructured data include at least one of text data, image data, and audio data.

Optionally, when the unstructured data includes text data, the entity preparation module 510 is further configured to generate at least one text entity based on the text data, where the text data includes at least one text, each text corresponds to one text entity, and each text entity has a unique text identifier; the relationship extraction module 520 is further configured to input the text data into the text relationship extraction model, calculate the text similarity between every two texts through the model, and determine the text association relationship between every two text entities based on the text similarity.

Optionally, the relationship extraction module 520 is further configured to divide each text in the text data into at least one text slice of a preset length; input the text data into the text relationship extraction model and calculate, through the model, the slice similarity of every text slice pair between every two texts, where a text slice pair consists of one text slice from each of the two texts being compared; and select the slice similarity between every two texts that satisfies a preset condition as the text similarity.

Optionally, the relationship extraction module 520 is further configured to extract a text summary for each text in the text data, input all text summaries into the text relationship extraction model, and calculate the summary similarity between every two text summaries through the model; and to take the summary similarity between every two texts as the text similarity.

Optionally, when the unstructured data includes image data, the entity preparation module 510 is further configured to generate at least one image entity based on the image data, where the image data includes at least one image, each image corresponds to one image entity, and each image entity has a unique image identifier; the relationship extraction module 520 is further configured to extract an image content association relationship for every two image entities and to extract an image ontology association relationship for every two image entities.

Optionally, the relationship extraction module 520 is further configured to extract the target text content of each image based on an image text recognition algorithm; calculate the content similarity of the target text content of every two images; and determine the image content association relationship between every two image entities based on the content similarity.

Optionally, the relationship extraction module 520 is further configured to input the image data into the image relationship extraction model, calculate the image similarity between every two images through the model, and determine the image ontology association relationship between every two image entities based on the image similarity.

Optionally, when the unstructured data includes audio data, the entity preparation module 510 is further configured to generate at least one audio entity based on the audio data, where the audio data includes at least one audio, each audio corresponds to one audio entity, and each audio entity has a unique audio identifier; the relationship extraction module 520 is further configured to obtain the voiceprint information of each audio in the audio data based on a voiceprint extraction algorithm, calculate the voiceprint similarity of the voiceprint information of every two audios, and determine the audio association relationship between every two audio entities based on the voiceprint similarity.

Optionally, the relationship extraction module 520 is further configured to calculate the similarity between every two data items of each type of unstructured data and to determine the data association relationship between every two data entities of each type based on the similarity; the graph construction module 530 is further configured to determine, for each type, the target data association relationships that satisfy the similarity condition of that type, and to construct the knowledge graph corresponding to the target scenario based on the correspondence between the user entities and the data entities and the target data association relationships between the data entities; where the similarity condition of each type is derived from the data characteristics and relationship characteristics of that type.

Optionally, the knowledge graph construction apparatus 500 further includes a graph application module, configured to perform personalized user analysis and/or product recommendation in the target scenario based on the knowledge graph.

In the embodiments of the present disclosure, a knowledge graph construction apparatus is provided, in which: the entity preparation module is configured to acquire user data and at least one type of unstructured data in a target scenario, generate at least one user entity based on the user data, and generate at least one data entity based on the unstructured data; the relationship extraction module is configured to extract, for the data entities of each type, a data association relationship for every two data entities of that type; and the graph construction module is configured to construct a knowledge graph corresponding to the target scenario based on the correspondence between the user entities and the data entities and the data association relationships between the data entities, where the entities serve as nodes of the knowledge graph and the relationships between entities serve as edges connecting the nodes. User entities and data entities of the unstructured data are generated from the original data in the scenario, and, by analyzing the unstructured data, the association relationship between any two entities is extracted for each type of data entity and used as an edge connecting entity nodes. All unstructured data is thereby linked, the different types of data in the scenario are effectively integrated, and a knowledge graph is constructed for the target scenario that is richer and more comprehensive in knowledge and easier to analyze.

An embodiment of the present disclosure provides a computer program product containing instructions which, when the computer program product runs on a computer or processor, cause the computer or processor to perform the steps of the method of any one of the above embodiments.

An embodiment of the present disclosure further provides a computer storage medium. The computer storage medium can store a plurality of instructions, and the instructions are adapted to be loaded by a processor to perform the steps of the method of any one of the above embodiments.

Please refer to Figure 6, which is a schematic structural diagram of a terminal provided in an embodiment of the present disclosure. As shown in Figure 6, the terminal 600 may include: at least one terminal processor 601, at least one network interface 604, a user interface 603, a memory 605, and at least one communication bus 602.

The communication bus 602 is used to implement connection and communication between these components.

The user interface 603 may include a display and a camera. Optionally, the user interface 603 may also include a standard wired interface and a wireless interface.

The network interface 604 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface).

The terminal processor 601 may include one or more processing cores. The terminal processor 601 connects the various parts of the terminal 600 through various interfaces and lines, and performs the functions of the terminal 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 605 and by calling data stored in the memory 605. Optionally, the terminal processor 601 may be implemented in at least one hardware form among Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The terminal processor 601 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content to be shown on the display; and the modem handles wireless communication. It can be understood that the modem may also be implemented on a separate chip instead of being integrated into the terminal processor 601.

The memory 605 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 605 includes a non-transitory computer-readable storage medium. The memory 605 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 605 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playback function, or an image playback function), and instructions for implementing the method embodiments described above; the data storage area may store the data involved in the method embodiments described above. Optionally, the memory 605 may also be at least one storage device located remotely from the terminal processor 601. As shown in FIG. 6, the memory 605, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a knowledge graph construction program.

In the terminal 600 shown in FIG. 6, the user interface 603 is mainly used to provide an input interface for the user and to obtain the data the user enters, while the terminal processor 601 may be used to call the knowledge graph construction program stored in the memory 605 and specifically perform the following operations: acquiring user data in a target scenario and at least one type of unstructured data, generating at least one user entity based on the user data, and generating at least one data entity based on the unstructured data; for the data entities of each type, extracting a data association relationship for every two data entities of that type; and constructing a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, wherein in the knowledge graph each entity serves as a node and the relationships between entities serve as edges connecting the nodes.
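
The three operations above can be illustrated with a minimal sketch. It is not the disclosed implementation: the `KnowledgeGraph` class, the `owns` edge label, the `related` callback, and the sample data are all hypothetical, chosen only to show entities becoming nodes and relationships becoming edges.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source, relation, target)


def build_graph(users, items, related):
    """users: {user_id: [item_id, ...]}; items: {item_id: (type, payload)}.

    `related` is a hypothetical callback (type, payload_a, payload_b) -> bool
    standing in for the per-type relationship extraction step.
    """
    graph = KnowledgeGraph()
    for user_id in users:
        graph.nodes[user_id] = "user"
    for item_id, (item_type, _) in items.items():
        graph.nodes[item_id] = item_type
    # Correspondence between each user entity and its data entities.
    for user_id, item_ids in users.items():
        for item_id in item_ids:
            graph.edges.append((user_id, "owns", item_id))
    # Pairwise association, evaluated only within a single data type.
    by_type = {}
    for item_id, (item_type, payload) in items.items():
        by_type.setdefault(item_type, []).append((item_id, payload))
    for item_type, members in by_type.items():
        for (id_a, pay_a), (id_b, pay_b) in combinations(members, 2):
            if related(item_type, pay_a, pay_b):
                graph.edges.append((id_a, f"{item_type}_related", id_b))
    return graph


def share_word(item_type, a, b):
    # Toy relation: two texts are associated if they share any word.
    return bool(set(a.lower().split()) & set(b.lower().split()))


users = {"u1": ["t1"], "u2": ["t2", "t3"]}
items = {
    "t1": ("text", "loan repayment plan"),
    "t2": ("text", "repayment reminder"),
    "t3": ("text", "holiday photos"),
}
graph = build_graph(users, items, share_word)
```

In this toy data, `t1` and `t2` share the word "repayment", so the sketch links them and thereby connects users `u1` and `u2` through their data entities.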

In some embodiments, the type of the unstructured data includes at least one of text data, image data, and audio data.

In some embodiments, when the unstructured data includes text data, the terminal processor 601, when generating at least one data entity based on the unstructured data, specifically performs the following step: generating at least one text entity based on the text data, wherein the text data includes at least one text, each text corresponds to one text entity, and each text entity has a unique text identifier. When extracting, for the data entities of each type, a data association relationship for every two data entities of that type, the terminal processor 601 specifically performs the following steps: inputting the text data into a text relationship extraction model, calculating the text similarity between every two texts through the text relationship extraction model, and determining the text association relationship between every two text entities based on the text similarity.

In some embodiments, when inputting the text data into the text relationship extraction model and calculating the text similarity between every two texts through the model, the terminal processor 601 specifically performs the following steps: dividing each text in the text data into at least one text slice of a preset length; inputting the text data into the text relationship extraction model and calculating, through the model, the slice similarity of each text slice pair between every two texts, wherein a text slice pair consists of one text slice from each of the two texts being compared; and selecting, for every two texts, the slice similarity that satisfies a preset condition as the text similarity.
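
The slicing variant can be sketched as follows. Two assumptions are made for illustration only: `difflib.SequenceMatcher` stands in for the text relationship extraction model, and the "preset condition" is taken to be "the highest slice similarity". The source specifies neither.

```python
from difflib import SequenceMatcher


def slice_text(text, size):
    """Cut a text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]


def slice_similarity(slice_a, slice_b):
    # Stand-in for the model's similarity score (assumption).
    return SequenceMatcher(None, slice_a, slice_b).ratio()


def text_similarity(text_a, text_b, size=20):
    """Score every slice pair and keep the best one (assumed preset condition)."""
    return max(
        slice_similarity(sa, sb)
        for sa in slice_text(text_a, size)
        for sb in slice_text(text_b, size)
    )
```

Taking the maximum means two long texts count as similar if any one passage of each matches closely; a different preset condition (for example, an average over the top few pairs) would only change the final aggregation line.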

In some embodiments, when inputting the text data into the text relationship extraction model and calculating the text similarity between every two texts through the model, the terminal processor 601 specifically performs the following steps: extracting a text summary from each text in the text data, inputting all text summaries into the text relationship extraction model, and calculating the summary similarity between every two text summaries through the model; and determining the summary similarity of every two texts as their text similarity.
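
A minimal sketch of the summary variant is given below. The summarizer (keep the leading sentence) and the Jaccard word-overlap score are placeholders chosen for illustration; the source does not say how summaries are extracted or how the model scores them.

```python
import re
from itertools import combinations


def summarize(text, max_sentences=1):
    # Hypothetical summarizer: keep only the leading sentence(s).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


def jaccard(a, b):
    # Word-set overlap, standing in for the model's summary similarity.
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def summary_similarities(texts):
    """texts: {text_id: text}. Returns {(id_a, id_b): similarity}."""
    summaries = {text_id: summarize(text) for text_id, text in texts.items()}
    return {
        (a, b): jaccard(summaries[a], summaries[b])
        for a, b in combinations(sorted(summaries), 2)
    }
```

Comparing summaries instead of full texts reduces the amount of input per pair, which matters because the number of pairs grows quadratically with the number of texts.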

In some embodiments, when the unstructured data includes image data, the terminal processor 601, when generating at least one data entity based on the unstructured data, specifically performs the following step: generating at least one image entity based on the image data, wherein the image data includes at least one image, each image corresponds to one image entity, and each image entity has a unique image identifier. When extracting, for the data entities of each type, a data association relationship for every two data entities of that type, the terminal processor 601 specifically performs the following steps: extracting an image content association relationship for every two image entities, and extracting an image ontology association relationship for every two image entities.

In some embodiments, when extracting the image content association relationship for every two image entities, the terminal processor 601 specifically performs the following steps: extracting the target text content from each image based on an image text recognition algorithm; and calculating the content similarity of the target text content of every two images, and determining the image content association relationship between every two image entities based on the content similarity.
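
The content-similarity step can be sketched as below, assuming the image text recognition has already run and produced one string per image. The bag-of-words cosine score and the `threshold` value are illustrative assumptions, not details from the source.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(bag_a, bag_b):
    """Cosine similarity between two word-count bags."""
    dot = sum(bag_a[t] * bag_b[t] for t in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_relations(recognized_text, threshold=0.5):
    """recognized_text: {image_id: text recognized from that image}.

    Returns (image_a, image_b, similarity) for pairs at or above the
    hypothetical threshold.
    """
    bags = {
        image_id: Counter(text.lower().split())
        for image_id, text in recognized_text.items()
    }
    relations = []
    for a, b in combinations(sorted(bags), 2):
        score = cosine(bags[a], bags[b])
        if score >= threshold:
            relations.append((a, b, score))
    return relations
```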

In some embodiments, when extracting the image ontology association relationship for every two image entities, the terminal processor 601 specifically performs the following steps: inputting the image data into an image relationship extraction model, calculating the image similarity between every two images through the image relationship extraction model, and determining the image ontology association relationship between every two image entities based on the image similarity.
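
The source does not describe the image relationship extraction model. As a stand-in that still compares the images themselves rather than their recognized text, the sketch below uses a simple average hash over small grayscale pixel grids; both the hashing scheme and the input format are assumptions made for illustration.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values. Returns one bit per pixel."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def image_similarity(image_a, image_b):
    """Fraction of matching hash bits between two equally sized images."""
    hash_a, hash_b = average_hash(image_a), average_hash(image_b)
    if len(hash_a) != len(hash_b):
        raise ValueError("images must have the same number of pixels")
    matches = sum(x == y for x, y in zip(hash_a, hash_b))
    return matches / len(hash_a)
```

A learned embedding model would replace `average_hash` in practice; the pairwise comparison and the thresholding that follows it would keep the same shape.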

In some embodiments, when the unstructured data includes audio data, the terminal processor 601, when generating at least one data entity based on the unstructured data, specifically performs the following step: generating at least one audio entity based on the audio data, wherein the audio data includes at least one audio, each audio corresponds to one audio entity, and each audio entity has a unique audio identifier. When extracting, for the data entities of each type, a data association relationship for every two data entities of that type, the terminal processor 601 specifically performs the following steps: obtaining the voiceprint information of each audio in the audio data based on a voiceprint extraction algorithm; and calculating the voiceprint similarity of the voiceprint information of every two audios, and determining the audio association relationship between every two audio entities based on the voiceprint similarity.
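
Assuming the voiceprint extraction algorithm yields a fixed-length numeric vector per audio (an assumption; the source does not specify the representation), the voiceprint comparison can be sketched with cosine similarity. The `threshold` value is hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def audio_relations(voiceprints, threshold=0.9):
    """voiceprints: {audio_id: vector}. Returns pairs judged to be associated."""
    return [
        (a, b)
        for a, b in combinations(sorted(voiceprints), 2)
        if cosine(voiceprints[a], voiceprints[b]) >= threshold
    ]
```

A high voiceprint similarity suggests the two recordings come from the same speaker, which is the kind of link an audio association edge would record.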

In some embodiments, when extracting, for the data entities of each type, a data association relationship for every two data entities of that type, the terminal processor 601 specifically performs the following steps: calculating the similarity between every two data items in each type of unstructured data, and determining the data association relationship between every two data entities of each type based on the similarity. When constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, the terminal processor 601 specifically performs the following steps: determining, for each type, the target data association relationships that satisfy the similarity condition of that type, and constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the target data association relationships between the data entities; wherein the similarity condition of each type is derived from the data characteristics and relationship characteristics of that type.
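
The per-type filtering can be sketched as a lookup of one similarity condition per data type. The threshold values below are hypothetical placeholders; the source says only that each type's condition follows from its own data and relationship characteristics.

```python
# Hypothetical per-type similarity conditions (minimum scores).
THRESHOLDS = {"text": 0.6, "image": 0.8, "audio": 0.9}


def select_target_relations(candidates, thresholds=THRESHOLDS):
    """candidates: iterable of (data_type, id_a, id_b, similarity).

    Keeps only the relationships whose similarity meets the condition
    configured for their own data type.
    """
    return [
        (data_type, id_a, id_b, score)
        for data_type, id_a, id_b, score in candidates
        if score >= thresholds[data_type]
    ]
```

Keeping the conditions separate per type lets the same score mean different things: in this sketch a 0.7 score is enough to link two texts but not two images.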

In some embodiments, after constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, the terminal processor 601 further performs the following step: performing personalized user analysis and/or product recommendation in the target scenario based on the knowledge graph.
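
The source does not say how the graph is used for analysis or recommendation. One plausible illustration, offered purely as an assumption, is to find users who are reachable from a given user through associated data entities; such users could then seed a recommendation step. The function name and the edge-list input format are hypothetical.

```python
from collections import defaultdict


def similar_users(owns, related, user):
    """owns: [(user_id, item_id)]; related: [(item_id, item_id)].

    Returns the users who own a data entity associated with one of
    `user`'s own data entities.
    """
    owners = defaultdict(set)
    items_of = defaultdict(set)
    for user_id, item_id in owns:
        owners[item_id].add(user_id)
        items_of[user_id].add(item_id)
    neighbours = defaultdict(set)
    for a, b in related:
        neighbours[a].add(b)
        neighbours[b].add(a)
    found = set()
    for item in items_of[user]:
        for other in neighbours[item]:
            found |= owners[other]
    found.discard(user)
    return found
```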

In the several embodiments provided in this disclosure, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation. For instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or modules, and may be electrical, mechanical, or of another form.

The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The above embodiments may be implemented, in whole or in part, in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a Digital Versatile Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)).

It should be noted that, for simplicity of description, the foregoing method embodiments are each described as a series of combined actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited by the described order of actions, because according to the embodiments of this disclosure some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the embodiments of this disclosure.

It should also be noted that the information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in the embodiments of this disclosure are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the user data, text data, image data, and audio data involved in this disclosure are all obtained under full authorization.

The foregoing describes specific embodiments of this disclosure. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the relevant descriptions of the other embodiments.

The above is a description of the knowledge graph construction method, apparatus, storage medium, and terminal provided in the embodiments of this disclosure. Those skilled in the art may, based on the ideas of the embodiments of this disclosure, make changes to the specific implementations and the scope of application. In summary, the content of this disclosure should not be construed as limiting the embodiments of this disclosure.

Claims

A knowledge graph construction method, the method comprising:

acquiring user data in a target scenario and at least one type of unstructured data, generating at least one user entity based on the user data, and generating at least one data entity based on the unstructured data;

for the data entities of each type, extracting a data association relationship for every two data entities of that type; and

constructing a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, wherein in the knowledge graph each entity serves as a node and the relationships between entities serve as edges connecting the nodes.

The method according to claim 1, wherein the type of the unstructured data includes at least one of text data, image data, and audio data.

The method according to claim 2, wherein, when the unstructured data includes text data, generating at least one data entity based on the unstructured data comprises:

generating at least one text entity based on the text data, wherein the text data includes at least one text, each text corresponds to one text entity, and each text entity has a unique text identifier;

and extracting, for the data entities of each type, a data association relationship for every two data entities of that type comprises:

inputting the text data into a text relationship extraction model, calculating the text similarity between every two texts through the text relationship extraction model, and determining the text association relationship between every two text entities based on the text similarity.

The method according to claim 3, wherein inputting the text data into the text relationship extraction model and calculating the text similarity between every two texts through the text relationship extraction model comprises:

dividing each text in the text data into at least one text slice of a preset length;

inputting the text data into the text relationship extraction model and calculating, through the text relationship extraction model, the slice similarity of each text slice pair between every two texts, wherein a text slice pair consists of one text slice from each of the two texts being compared; and

selecting, for every two texts, the slice similarity that satisfies a preset condition as the text similarity.

extracting a text summary from each text in the text data, inputting all text summaries into the text relationship extraction model, and calculating the summary similarity between every two text summaries through the text relationship extraction model; and

determining the summary similarity of every two texts as the text similarity.

The method according to claim 2, wherein, when the unstructured data includes image data, generating at least one data entity based on the unstructured data comprises:

generating at least one image entity based on the image data, wherein the image data includes at least one image, each image corresponds to one image entity, and each image entity has a unique image identifier;

extracting an image content association relationship for every two image entities, and extracting an image ontology association relationship for every two image entities.

The method according to claim 6, wherein extracting the image content association relationship for every two image entities comprises:

extracting target text content from each image based on an image text recognition algorithm; and

calculating the content similarity of the target text content of every two images, and determining the image content association relationship between every two image entities based on the content similarity.

The method according to claim 6, wherein extracting the image ontology association relationship for every two image entities comprises:

inputting the image data into an image relationship extraction model, calculating the image similarity between every two images through the image relationship extraction model, and determining the image ontology association relationship between every two image entities based on the image similarity.

The method according to claim 2, wherein, when the unstructured data includes audio data, generating at least one data entity based on the unstructured data comprises:

generating at least one audio entity based on the audio data, wherein the audio data includes at least one audio, each audio corresponds to one audio entity, and each audio entity has a unique audio identifier;

obtaining the voiceprint information of each audio in the audio data based on a voiceprint extraction algorithm; and

calculating the voiceprint similarity of the voiceprint information of every two audios, and determining the audio association relationship between every two audio entities based on the voiceprint similarity.

The method according to claim 1, wherein extracting, for the data entities of each type, a data association relationship for every two data entities of that type comprises:

calculating the similarity between every two data items in each type of unstructured data, and determining the data association relationship between every two data entities of each type based on the similarity;

and constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities comprises:

determining, for each type, the target data association relationships that satisfy the similarity condition of that type, and constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the target data association relationships between the data entities;

wherein the similarity condition of each type is derived from the data characteristics and relationship characteristics of that type.

The method according to claim 1, wherein, after constructing the knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, the method further comprises:

performing personalized user analysis and/or product recommendation in the target scenario based on the knowledge graph.

A knowledge graph construction apparatus, the apparatus comprising:

an entity preparation module, configured to acquire user data in a target scenario and at least one type of unstructured data, generate at least one user entity based on the user data, and generate at least one data entity based on the unstructured data;

a relationship extraction module, configured to extract, for the data entities of each type, a data association relationship for every two data entities of that type; and

a knowledge graph construction module, configured to construct a knowledge graph corresponding to the target scenario based on the correspondence between each user entity and each data entity and the data association relationships between the data entities, wherein in the knowledge graph each entity serves as a node and the relationships between entities serve as edges connecting the nodes.

A computer program product comprising instructions that, when run on a computer or processor, cause the computer or processor to perform the steps of the method according to any one of claims 1 to 11.

A computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the method according to any one of claims 1 to 11.

A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 11.