CN107291807B

CN107291807B - SPARQL query optimization method based on graph traversal

Info

Publication number: CN107291807B
Application number: CN201710343003.9A
Authority: CN
Inventors: 李亮; 沈志宏; 周园春; 黎建辉; 朱小杰; 刘东江; 李跃鹏
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2017-05-16
Filing date: 2017-05-16
Publication date: 2020-10-16
Anticipated expiration: 2037-05-16
Also published as: CN107291807A

Abstract

The invention discloses a SPARQL query optimization method based on graph traversal. The method is as follows: 1) Use the attribute graph to represent the triples in the RDF data, and then use the Bigtable model to store the RDF data to obtain the Bigtable data corresponding to the RDF data; 2) Convert the SPARQL query to traverse the RDF attribute graph; 3) According to the steps 2) The obtained traversal sequence traverses all nodes in the Bigtable data that meet the conditions, and completes the SPARQL query. On the one hand, the invention eliminates the dependence of traditional SPARQL query on data structures such as Hash, reduces the generation of intermediate data, and avoids the connection calculation of large-scale RDF data; on the other hand, it can effectively use Bigtable-based big data processing technology to store and Manage the massive amount of RDF related knowledge network data, and accelerate the query and analysis of RDF related data.

Description

A SPARQL Query Optimization Method Based on Graph Traversal

技术领域technical field

本发明涉及一种基于图遍历的SPARQL查询执行方法，具体涉及一种面向大数据关联的存储和查询的方法及系统。The invention relates to a SPARQL query execution method based on graph traversal, in particular to a method and system for storing and querying big data association.

背景技术Background technique

图数据挖掘和分析是大数据的新领域，通过建立关联万维网资源、微生物菌种资源、以及科研资源等的关联关系，支持基于数据关联的信息挖掘和科学发现。资源描述框架(Resource Description Framework,简称RDF)是用于表达关于万维网(World Wide Web)资源的信息的语言，能够表达任何可在互联网上被标识的事物的信息，如页面标题、作者和修改时间以及不同数据之间的关联关系。RDF规范提供了描述资源的基础性词汇表，定义了各领域应用如WDCM(Mircen World Data Centre for Microorganisms)微生物描述资源词汇表时必须遵循的规则。SPARQL(SPARQL Protocol and RDF Query Language)是为RDF开发的一种查询语言和数据获取协议，由W3C国际标准组织推荐的RDF数据模型定义，用于查询任何能用RDF表示的信息资源。SPARQL协议和RDF查询语言(SPARQL)于2008年1月15日正式成为一项W3C推荐标准。Graph data mining and analysis is a new field of big data. It supports information mining and scientific discovery based on data association by establishing association relationships between World Wide Web resources, microbial strain resources, and scientific research resources. Resource Description Framework (RDF) is a language used to express information about World Wide Web (World Wide Web) resources, capable of expressing any information that can be identified on the Internet, such as page title, author and modification time and the relationship between different data. The RDF specification provides a basic vocabulary for describing resources, and defines the rules that must be followed when applications in various fields such as WDCM (Mircen World Data Centre for Microorganisms) microbial description resource vocabulary. SPARQL (SPARQL Protocol and RDF Query Language) is a query language and data acquisition protocol developed for RDF. It is defined by the RDF data model recommended by the W3C International Standards Organization and is used to query any information resources that can be represented by RDF. The SPARQL Protocol and RDF Query Language (SPARQL) officially became a W3C Recommendation on January 15, 2008.

由于RDF使用了结构化的XML数据，检索和查询能够理解元数据的精确含义，搜索变得更为智能和准确，有效避免了检索经常返回无关数据的情况。RDF文件包含若干资源描述，每个资源描述由若干语句构成，每个语句由资源、属性类型、属性值构成三元组，表示资源具有一个属性。资源对应于自然语言中的主语，属性类型对应于谓语，属性值对应于宾语，多个RDF资源文件构成完整的资源描述和关联图。随着关联网络的数据规模越来越大，其表达和处理的数据类型越来越多，对RDF数据存储和SPARQL查询的实时性构成挑战。因此，采用可扩展的新型大数据架构提升RDF数据的存储和管理效率，提高SPARQL查询速度和分析能力非常重要。基于大图数据处理技术如Bigtable的图数据存储和管理框架，由于其出色的大规模数据处理能力，成为知识图谱网络的新发展方向。Because RDF uses structured XML data, retrieval and query can understand the precise meaning of metadata, and search becomes more intelligent and accurate, effectively avoiding the situation that retrieval often returns irrelevant data. The RDF file contains several resource descriptions, each resource description consists of several statements, and each statement consists of a triplet of resource, attribute type, and attribute value, indicating that the resource has an attribute. A resource corresponds to a subject in natural language, an attribute type corresponds to a predicate, and an attribute value corresponds to an object. Multiple RDF resource files constitute a complete resource description and association graph. With the increasing scale of data in relational networks, more and more data types are expressed and processed, posing challenges to the real-time performance of RDF data storage and SPARQL queries. Therefore, it is very important to adopt a new scalable big data architecture to improve the storage and management efficiency of RDF data, and to improve the SPARQL query speed and analysis capabilities. The graph data storage and management framework based on big graph data processing technologies such as Bigtable has become a new development direction of knowledge graph networks due to its excellent large-scale data processing capabilities.

目前，RDF数据主要采用关系数据库表或KV数据仓库存储和管理RDF三元组，通过自连接的方式实现RDF三元组的主语、谓语和宾语的子图匹配和SPARQL查询，通过Hash或Index支持本地数据的快速查询和检索，其典型实现如Virtuoso RDF图数据库。其分布式版本主要采用联邦方式，将RDF数据查询和分布式计算框架融合到一个统一的框架中，具体为：将SPARQL查询解析并分发到各个节点，每个节点运行子图匹配计算然后汇总每个节点的匹配结果。该架构简单且易于实现，支持RDF数据的分布式查询和快速返回，简化了面向较大大规模知识关联网络的设计和开发。At present, RDF data mainly uses relational database tables or KV data warehouse to store and manage RDF triples, and realize subgraph matching and SPARQL query of subject, predicate and object of RDF triples through self-connection, and support through Hash or Index Fast query and retrieval of local data, typically implemented as a Virtuoso RDF graph database. Its distributed version mainly adopts the federation method, which integrates the RDF data query and distributed computing framework into a unified framework. Specifically, the SPARQL query is parsed and distributed to each node, and each node runs subgraph matching calculation and then summarizes each node. matching results for each node. The architecture is simple and easy to implement, supports distributed query and fast return of RDF data, and simplifies the design and development of large-scale knowledge association networks.

然而，基于联邦和子图匹配的分布式查询方式，每次查询均需要将SPARQL查询分解为子图匹配分发到多个结点，运行子图匹配并返回结果，容易导致大量的节点通信和中间数据。当数据规模超大时，系统面临如下问题：However, based on the distributed query method of federation and subgraph matching, each query needs to decompose the SPARQL query into subgraph matching and distribute to multiple nodes, run subgraph matching and return the results, which easily leads to a large amount of node communication and intermediate data. . When the data scale is very large, the system faces the following problems:

1)高开销的自连接操作。针对分布式系统，数据表的连接操作导致系统结点间大量的数据通信。当数据量大和机器结点较多时，自连接开销大，查询延迟明显增加，不利于系统的横向扩展。1) Self-connection operation with high overhead. For distributed systems, the join operation of data tables results in a large amount of data communication between system nodes. When the amount of data is large and there are many machine nodes, the self-connection overhead is large, and the query delay increases significantly, which is not conducive to the horizontal expansion of the system.

2)大量的中间数据。SPARQL查询分解后被分发到多个节点分别运行，每个节点相当于一个查询引擎，其运行产生大量的中间数据，增加了系统的内存消耗，减少了系统并行处理的SPARQL个数。2) A large amount of intermediate data. After the SPARQL query is decomposed, it is distributed to multiple nodes to run separately. Each node is equivalent to a query engine. Its operation generates a large amount of intermediate data, which increases the memory consumption of the system and reduces the number of SPARQL processed in parallel by the system.

3)对数据分片要求高。联邦式查询将SPARQL查询分发到各个节点分别完成，对数据分片的质量要求较高。如果不同划分之间存在大量的关联关系，则SPARQL子图匹配不能在多个数据结点并行执行，降低了系统的运行效率。3) High requirements for data fragmentation. Federated query distributes SPARQL queries to each node for completion, which requires higher quality of data sharding. If there are a large number of associations between different partitions, SPARQL subgraph matching cannot be performed in parallel on multiple data nodes, which reduces the operating efficiency of the system.

由于这些问题的存在，基于联邦方式的SPARQL查询难以有效应对大规模RDF关联数据的大规模增长和满足知识网络关联应用的实时性查询需求，查询时间随数据规模的增长而增加。而基于Bigtable的数据处理技术由于缺乏大规模表的连接操作(Join)，难以应用到海量RDF知识网络关联数据的处理中。Due to the existence of these problems, it is difficult for SPARQL queries based on federated methods to effectively cope with the large-scale growth of large-scale RDF linked data and meet the real-time query requirements of knowledge network linked applications, and the query time increases with the growth of data scale. However, the data processing technology based on Bigtable is difficult to be applied to the processing of massive RDF knowledge network associated data due to the lack of large-scale table join operations.

发明内容SUMMARY OF THE INVENTION

针对RDF大数据SPARQL查询存在的问题，本发明的目的在于提供一种基于图遍历的SPARQL查询优化方法。Aiming at the problems existing in RDF big data SPARQL query, the purpose of the present invention is to provide a SPARQL query optimization method based on graph traversal.

本发明的技术方案为：The technical scheme of the present invention is:

一种基于图遍历的SPARQL查询优化方法，其步骤为：A SPARQL query optimization method based on graph traversal, the steps are:

1)使用属性图表示RDF数据中三元组，然后利用Bigtable模型存储RDF数据，得到RDF数据对应的Bigtable数据；1) Use the attribute graph to represent the triples in the RDF data, and then use the Bigtable model to store the RDF data to obtain the Bigtable data corresponding to the RDF data;

2)将SPARQL查询转化对RDF属性图的遍历；2) Convert the SPARQL query to the traversal of the RDF property graph;

3)根据步骤2)获得的遍历序列，遍历Bigtable数据中满足条件的所有节点，完成SPARQL查询。3) According to the traversal sequence obtained in step 2), traverse all the nodes in the Bigtable data that meet the conditions, and complete the SPARQL query.

利用Bigtable模型存储RDF数据的方法为：The method of using the Bigtable model to store RDF data is as follows:

21)针对RDF数据中的每一RDF三元组(sub，pre，obj)，将主语sub作为节点v，存储为Bigtable模型中一行；21) For each RDF triple (sub, pre, obj) in the RDF data, use the subject sub as the node v and store it as a row in the Bigtable model;

22)判断宾语obj的类型：a)若宾语obj为rdf:literal，即谓语pre为主语sub的属性，则将宾语obj作为该节点v的属性值，然后将谓语pre作为该节点v的属性名称，存储为该节点v所在行的一个存贮单元Cell；b)若宾语obj为rdf:resource，即谓语pre为主语sub关联到其它节点的边，则将宾语obj作为一独立节点w，存储为Bigtable模型中一行；然后将谓语pre作为该节点v的出边，指向该节点w，存储为该节点v所在行的一个Cell，并且将谓语pre作为该节点w的入边，出自该节点v，存储为该节点w所在行的一个Cell。22) Determine the type of the object obj: a) If the object obj is rdf:literal, that is, the predicate pre is an attribute of the subject sub, then the object obj is used as the attribute value of the node v, and then the predicate pre is used as the attribute name of the node v , stored as a storage unit Cell of the row where the node v is located; b) If the object obj is rdf:resource, that is, the predicate pre is associated with the subject sub to the edge of other nodes, then the object obj is taken as an independent node w, stored as One row in the Bigtable model; then use the predicate pre as the outgoing edge of the node v, point to the node w, store it as a Cell of the row where the node v is located, and use the predicate pre as the incoming edge of the node w, from the node v, Stored as a Cell of the row where the node w is located.

进一步的，步骤2)中，将SPARQL查询转化为Gremlin图遍历，实现将SPARQL查询转化对RDF属性图的遍历。Further, in step 2), the SPARQL query is converted into Gremlin graph traversal, so as to realize the traversal of the RDF property graph by converting the SPARQL query.

进一步的，将SPARQL查询转化为Gremlin图遍历的方法为：对于SPARQL查询的where子句中的每个三元组(sub，pre，obj)，如果该三元组为谓语pre表示属性图中属性的三元组，则将该三元组转化为对主语sub所代表属性图中节点的过滤has(pre,obj)，obj为过滤条件，pre为谓语pre所代表的属性名称；如果该三元组为谓语pre表示属性图中边的三元组，则将该三元组转化为主语sub所代表节点到宾语obj所代表节点的遍历sub.out(pre)->obj，out表示出边，pre表示谓语pre所代表的边的关联标签；通过处理where子句的所有三元组，获得所述遍历序列。Further, the method for converting a SPARQL query into a Gremlin graph traversal is: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the triple is the predicate pre, it represents the property in the property graph. The triplet is converted into a filter has(pre,obj) for the node in the attribute graph represented by the subject sub, where obj is the filter condition, and pre is the attribute name represented by the predicate pre; if the triplet The group is a triplet whose predicate pre represents an edge in the attribute graph, then the triplet is converted into a traversal sub.out(pre)->obj from the node represented by the subject sub to the node represented by the object obj, and out represents the edge, pre represents the associated label of the edge represented by the predicate pre; the traversal sequence is obtained by processing all triples of the where clause.

进一步的，根据步骤2)获得的遍历序列，遍历Bigtable数据中满足条件的所有节点的方法为：所述遍历序列为过滤和关联边组成的遍历序列；首先根据关联边组成遍历路径，然后利用过滤消除无效的遍历路径，得到有效遍历路径；然后根据有效遍历路径遍历Bigtable数据中满足条件的所有节点，然后组织有效遍历路径的节点和边的属性值。Further, according to the traversal sequence obtained in step 2), the method for traversing all nodes that meet the conditions in the Bigtable data is: the traversing sequence is a traversing sequence composed of filtering and associated edges; first, a traversing path is formed according to the associated edges, and then filtering Eliminate invalid traversal paths to obtain valid traversal paths; then traverse all the nodes in the Bigtable data that meet the conditions according to the valid traversal paths, and then organize the attribute values of the nodes and edges of the valid traversal paths.

本发明包括Bigtable数据存储、SPARQL到Gremlin遍历转化、Gremlin图遍历执行。The invention includes Bigtable data storage, SPARQL to Gremlin traversal conversion, and Gremlin graph traversal execution.

其功能描述如下：Its functions are described as follows:

1)基于Bigtable模型存储RDF数据1) Store RDF data based on the Bigtable model

RDF是一种面向互联网的图数据格式，采用<主语，谓语，宾语>三元组表示图数据，其主语(Subject)表示图节点，谓语为sub主语节点的属性名称，对应的宾语obj为属性。宾语(Object)主要包括rdf:literal和rdf:resource两种情况，前者表示主语节点的属性，属性名为谓语；后者表示主语节点关联到的其它节点，谓语标识主语关联到其它节点的边，边的标签为谓语。RDF图对应的属性图为图1(a)，其Bigtable表示如图1(b)所示。RDF is an Internet-oriented graph data format. It uses <subject, predicate, object> triples to represent graph data, the subject (Subject) represents the graph node, the predicate is the attribute name of the sub subject node, and the corresponding object obj is the attribute . Object mainly includes two cases: rdf:literal and rdf:resource. The former represents the attribute of the subject node, and the attribute name is the predicate; the latter represents other nodes associated with the subject node, and the predicate identifies the edge that the subject is related to other nodes. Edge labels are predicates. The attribute graph corresponding to the RDF graph is shown in Figure 1(a), and its Bigtable representation is shown in Figure 1(b).

本发明使用Bigtable存储RDF图，针对RDF数据三元组：主语(Subject)、谓语(Predicate)、宾语(Object)，依据其特性以及RDF模型(即RDF本体)，其结构解析如下：The present invention uses Bigtable to store RDF graphs, and for RDF data triples: subject (Subject), predicate (Predicate), object (Object), according to its characteristics and RDF model (that is, RDF ontology), its structure analysis is as follows:

针对“sub pre obj.”RDF三元组，将主语sub作为节点v，存储为Bigtable模型中一行。For the "sub pre obj." RDF triple, take the subject sub as the node v, and store it as a row in the Bigtable model.

若宾语obj为rdf:literal，则将宾语obj作为节点v的属性值，将谓语pre作为该节点v的属性名称，存储为v所在行的一个Cell(存贮单元)。否则，将obj作为独立节点w，存储为Bigtable中一行。将谓语pre作为主语sub节点v的出边，指向该节点w，将谓语pre存储为节点v所在行的一个Cell；将谓语pre作为宾语obj节点w的入边，出自节点v(即入边的出发点为节点v)，存储为节点w所在行的一个Cell。If the object obj is rdf:literal, the object obj is used as the attribute value of the node v, the predicate pre is used as the attribute name of the node v, and it is stored as a Cell (storage unit) in the row where v is located. Otherwise, store obj as an independent node w as a row in Bigtable. Take the predicate pre as the outgoing edge of the subject sub node v, point to the node w, and store the predicate pre as a Cell in the row where the node v is located; take the predicate pre as the incoming edge of the object obj node w, from the node v (that is, the incoming edge of the predicate pre The starting point is node v), which is stored as a Cell in the row where node w is located.

其中，rdf:lieral为宾语的一种类型，由Apache Jena或其他RDF工具解析。其中，Bigtable是一种海量数据的存储结构，能够高效支持海量图数据的存储和管理。2)将SPARQL查询转化为Gremlin图遍历Among them, rdf:lieral is a type of object, which is parsed by Apache Jena or other RDF tools. Among them, Bigtable is a storage structure for massive data, which can efficiently support the storage and management of massive graph data. 2) Convert SPARQL query to Gremlin graph traversal

如图2所示，SPARQL查询使用where子句的多个RDF三元组序列表示子图匹配。本发明使用图遍历实现SPARQL子图匹配，将SPARQL查询的where子句中的三元组序列转化为对图中节点和边的过滤和遍历。针对where子句中的每个“sub pre obj.”三元组，其转化流程如下所示：针对谓语pre表示属性图中属性的三元组，转化为对sub所代表属性图中节点的过滤has(pre,obj)，obj为过滤条件，pre为谓语pre所代表的属性名称。针对谓语pre表示属性图中边的三元组，转化为sub所代表节点到obj所代表节点的遍历，其gremlin代码为sub.out(pre)->obj，out表示出边，pre表示谓语pre所代表的边的关联标签。As shown in Figure 2, a SPARQL query expresses subgraph matching using a sequence of multiple RDF triples of where clauses. The invention uses graph traversal to realize SPARQL subgraph matching, and converts the triplet sequence in the where clause of the SPARQL query into filtering and traversing of nodes and edges in the graph. For each "sub pre obj." triplet in the where clause, the transformation process is as follows: For the triplet of the attribute in the attribute graph represented by the predicate pre, it is transformed into the filtering of the nodes in the attribute graph represented by sub has(pre,obj), obj is the filter condition, and pre is the attribute name represented by the predicate pre. For the triplet of the edge in the attribute graph represented by the predicate pre, it is converted into the traversal of the node represented by sub to the node represented by obj. The gremlin code is sub.out(pre)->obj, out represents the edge, and pre represents the predicate pre The associated label of the edge represented.

通过处理where子句的所有三元组，获得过滤和关联边组成的遍历序列。By processing all triples of the where clause, a traversal sequence consisting of filtered and associated edges is obtained.

3)图遍历执行3) Graph traversal execution

针对步骤(2)获得的遍历序列，遍历步骤1)中基于Bigtable模型存储RDF数据得到的Bigtable数据中满足条件的所有节点。其关联边组成遍历路径，而过滤用于消除无效的遍历路径。遍历完成后，组织有效遍历路径的节点和边的属性值，按用户需求返回。由于是直接对Bigtable访问，查询过程基本没有中间数据的产生。所述RDF数据指W3C国际标准组织定义的图数据表示方法，采用三元组表示数据和数据之间的关联关系，所述主语、谓语、宾语指RDF数据的标准数据结构。所述Gremlin指面向图数据遍历的标准语言。For the traversal sequence obtained in step (2), traverse all nodes that satisfy the conditions in the Bigtable data obtained by storing the RDF data based on the Bigtable model in step 1). Its associated edges make up traversal paths, and filtering is used to eliminate invalid traversal paths. After the traversal is completed, organize the attribute values of the nodes and edges of the valid traversal path, and return them according to user requirements. Due to the direct access to Bigtable, there is basically no intermediate data generated during the query process. The RDF data refers to the graph data representation method defined by the W3C International Standards Organization, using triples to represent the relationship between data and the data, and the subject, predicate, and object refer to the standard data structure of RDF data. The Gremlin refers to a standard language for graph data traversal.

本发明的有益效果在于：The beneficial effects of the present invention are:

针对目前大规模RDF图数据存储可扩展性差和查询效率较低的问题，提出了一种基于Bigtable的RDF图数据存储和查询方法，通过将RDF图数据转换为Bigtable模型的数据格式，支持大规模图数据的水平扩展；通过将SPARQL查询转化为基于图遍历的Bigtable数据访问，避免了RDF数据连接导致的开销较大和扩展性差的问题，减少了查询过程中间数据的产生。而且由于将SPARQL查询过程转化为对Bigtable的访问，能够利用缓存减少访问次数。Aiming at the problems of poor scalability and low query efficiency of large-scale RDF graph data storage, a method for storing and querying RDF graph data based on Bigtable is proposed. Horizontal expansion of graph data; by converting SPARQL queries into Bigtable data access based on graph traversal, the problems of high overhead and poor scalability caused by RDF data connections are avoided, and the generation of intermediate data in the query process is reduced. And because the SPARQL query process is converted into Bigtable access, the cache can be used to reduce the number of accesses.

本发明解决了海量大规模RDF数据的分布式存储和低延迟查询的问题，一方面消除了传统SPARQL查询对Hash等数据结构的依赖，减少了中间数据的产生，避免了大规模RDF数据的连接计算；另一方面，能有效利用基于Bigtable的大数据处理技术存储和管理RDF海量关联知识网络数据，利用缓存技术和索引技术加速RDF关联数据的查询和分析。The invention solves the problems of distributed storage and low-latency query of massive large-scale RDF data, on the one hand, it eliminates the dependence of traditional SPARQL query on data structures such as Hash, reduces the generation of intermediate data, and avoids the connection of large-scale RDF data On the other hand, it can effectively use Bigtable-based big data processing technology to store and manage massive RDF related knowledge network data, and use caching technology and indexing technology to accelerate RDF related data query and analysis.

附图说明Description of drawings

图1为基于Bigtable模型存储RDF数据表示图；Figure 1 is a representation of storing RDF data based on the Bigtable model;

(a)RDF数据属性图，(b)基于Bigtable模型RDF数据图；(a) RDF data attribute graph, (b) RDF data graph based on Bigtable model;

图2为基于图遍历的SPARQL查询执行流程图；Fig. 2 is the SPARQL query execution flow chart based on graph traversal;

图3为小数据集vs大数据集对比图。Figure 3 is a comparison diagram of small data sets vs large data sets.

具体实施方式Detailed ways

为使本发明的上述特征和优点能更明显易懂，下文特举实施例，并配合所附图作详细说明如下。In order to make the above-mentioned features and advantages of the present invention more obvious and easy to understand, the following embodiments are given and described in detail with the accompanying drawings as follows.

如图2所示，一种基于图遍历的SPARQL执行引擎，由Bigtable数据存储、SPARQL到Gremlin转化，图遍历执行组成。本发明依据RDF三元组宾语的rdf:literal特性和rdf:resource特性，使用属性图表示RDF三元组的关联关系和literal值，使用Bigtable数据模型存储和管理RDF数据，使用图遍历实现SPARQL对RDF关联数据的查询和分析。As shown in Figure 2, a graph traversal-based SPARQL execution engine consists of Bigtable data storage, SPARQL to Gremlin conversion, and graph traversal execution. According to the rdf:literal and rdf:resource characteristics of the object of the RDF triplet, the invention uses the attribute graph to represent the association relationship and literal value of the RDF triplet, uses the Bigtable data model to store and manage the RDF data, and uses the graph traversal to realize the SPARQL pairing Query and analysis of RDF linked data.

目前面向RDF数据的数据仓库以三元组为基本单位存储和管理RDF知识网络数据，依赖表自连接的方式实现子图匹配，从而实现对RDF数据的SPARQL查询和分析。At present, the data warehouse for RDF data uses triples as the basic unit to store and manage RDF knowledge network data, and relies on table self-connection to achieve subgraph matching, thereby realizing SPARQL query and analysis of RDF data.

本发明则采用Bigtable存储和管理RDF数据，将SPARQL查询和分析转化为对RDF属性图的遍历，通过对Bigtable的访问完成SPARQL查询。如此，本发明实现了对RDF海量数据的分布式扩展，避免了连接操作对SPARQL查询的不利影响。下面描述RDF数据的Bigtable存储和SPARQL到Gremlin转化按如图1所示进行设计来支持RDF数据的快速检索和分析：The present invention uses Bigtable to store and manage RDF data, converts SPARQL query and analysis into traversal of RDF attribute graph, and completes SPARQL query through Bigtable access. In this way, the present invention realizes the distributed expansion of the RDF massive data, and avoids the adverse effect of the connection operation on the SPARQL query. The following describes the Bigtable storage of RDF data and the SPARQL to Gremlin conversion designed as shown in Figure 1 to support fast retrieval and analysis of RDF data:

Bigtable数据存储：如图1所示，针对RDF三元组数据，将其表示为如图1(a)所示的属性图。如“<tax1><type><taxNode>.”，其宾语类型为rdf:literal，则转化为主语tax1所代表节点的属性值，属性名称为谓语；“<tax1><x-taxon><gene1>.”，其宾语为rdf:resource，则转化为主语tax1所代表节点指向宾语所代表gene1节点的边，谓语为边的标签，如此将所有的RDF三元组和属性图对应起来，并采用Bigtable数据结构存储属性和边。Bigtable data storage: As shown in Figure 1, for the RDF triple data, it is represented as an attribute graph as shown in Figure 1(a). For example, "<tax1><type><taxNode>.", whose object type is rdf:literal, is converted into the attribute value of the node represented by the subject tax1, and the attribute name is the predicate; "<tax1><x-taxon><gene1 >.", its object is rdf:resource, then the node represented by the subject tax1 points to the edge of the gene1 node represented by the object, and the predicate is the label of the edge, so that all the RDF triples and attribute graphs correspond, and use Bigtable data structures store attributes and edges.

SPARQL到Gremlin：将SPARQL查询中where子句的三元组转化为对属性图的遍历，其中SPARQL为面向RDF数据的查询语言，而Gremlin是面向属性图的遍历语言。SPARQL to Gremlin: Convert triples of where clauses in SPARQL queries into traversal of property graphs, where SPARQL is a query language for RDF data, and Gremlin is a traversal language for property graphs.

针对where子句的三元组“？tax<type><taxNode>.”，由于在属性图中type表示属性，则将该关联转化为gremlin中对节点属性type的过滤has(type,taxNode)step，其执行过程为依据taxNode对BigTable数据的过滤，减小遍历范围，提高遍历效率。For the triplet "?tax<type><taxNode>." of the where clause, since type represents an attribute in the attribute graph, the association is converted into the filter has(type,taxNode)step of the node attribute type in gremlin , whose execution process is to filter BigTable data according to taxNode, reduce the traversal range and improve the traversal efficiency.

针对“？tax<x-taxon>？gene.”，由于在属性图中x-taxon为节点之间的边，因此将该关联转化为gremlin中对tax所代表节点出边的遍历out(x-taxon)step，通过对Bigtable数据的访问实现该遍历。For "?tax<x-taxon>?gene.", since x-taxon is an edge between nodes in the attribute graph, this association is transformed into a traversal out(x- taxon) step, which implements this traversal through access to Bigtable data.

本发明选取微生物关联数据集WDCM的3亿规模小数据集、30亿规模的大数据集和16条标准SPARQL查询对发明进行测试，给出使用发明中所提出的面向大数据的基于图遍历的SPARQL执行引擎的一个具体实施过程，查询重复过程为10次，并去除最大值和最小值。The invention selects the 300 million scale small data set, 3 billion scale large data set and 16 standard SPARQL queries of the microbial association data set WDCM to test the invention, and provides a big data-oriented graph traversal based method proposed in the invention to test the invention. A specific implementation process of the SPARQL execution engine, the query repetition process is 10 times, and the maximum and minimum values are removed.

测试环境为4台支持BigTable的HBase集群，HBase版本为0.98.23.hadoop1，每个节点32G内存、12核CPU、28T磁盘，节点之间通过万兆交换机互联。Gremlin查询和分析引擎为Titan 1.0.0，SparQLToGremlin转化由项目组开发。The test environment is 4 HBase clusters that support BigTable. The HBase version is 0.98.23.hadoop1. Each node has 32G memory, 12-core CPU, and 28T disk. The nodes are interconnected through 10G switches. The Gremlin query and analysis engine is Titan 1.0.0, and the SparQLToGremlin conversion is developed by the project team.

使用本发明系统得到的查询运行时间如下：The query running time obtained using the system of the present invention is as follows:

如图3所示，对于3亿规模的小数据集，查询需要的时间在1s左右。As shown in Figure 3, for a small dataset with a scale of 300 million, the query takes about 1s.

针对30亿三元组规模的大数据集，由于图遍历负载有效分散在多个结点，16条查询语句所用时间小于1s，反而优于小数据集的查询时间。For a large data set with a scale of 3 billion triples, since the graph traversal load is effectively distributed in multiple nodes, the time required for 16 query statements is less than 1s, which is better than the query time of small data sets.

实验结果表明，针对不断增加的RDF数据，本发明能够有效利用Bigtable分布式数据存储的优势和图遍历查询的优势，保持查询时间恒定，很好地解决了目前RDF数据大规模增加时SPARQL查询时间明显增加的问题。The experimental results show that, for the increasing RDF data, the present invention can effectively utilize the advantages of Bigtable distributed data storage and graph traversal query, keep the query time constant, and solve the current SPARQL query time when the RDF data increases on a large scale. Significantly increased problems.

以上实施仅用以说明本发明的技术方案而非对其进行限制，本领域的普通技术人员可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明的精神和范围，本发明的保护范围应以权利要求书所述为准。The above implementation is only used to illustrate the technical solution of the present invention and not to limit it. Those of ordinary skill in the art can modify or equivalently replace the technical solution of the present invention without departing from the spirit and scope of the present invention. Protection of the present invention The scope should be as stated in the claims.

Claims

1. A SPARQL query optimization method based on graph traversal comprises the following steps:

1) representing the triples in the RDF data by using the attribute map, and then storing the RDF data by using a Bigtable model to obtain Bigtable data corresponding to the RDF data;

2) converting the SPARQL query into traversal of the RDF attribute graph;

3) traversing all nodes meeting the conditions in the Bigtable data according to the traversal sequence obtained in the step 2), and completing SPARQL query;

the method for storing RDF data by using the Bigtable model comprises the following steps: 21) for each RDF triple (sub, pre, obj) in the RDF data, storing a subject sub as a node v as one line in a Bigtable model; 22) judging the type of the object obj: a) if the object obj is rdf, i.e. the predicate pre is the attribute of the subject sub, taking the object obj as the attribute value of the node v, and then taking the predicate pre as the attribute name of the node v to be stored as a storage unit Cell in the row of the node v; b) if the object obj is rdf, resource, namely the predicate pre is the edge of the subject sub associated to other nodes, the object obj is taken as an independent node w and is stored as one line in the Bigtable model; then, the predicate pre is taken as the outgoing edge of the node v, points to the node w, and is stored as a Cell of the row where the node v is located, and the predicate pre is taken as the incoming edge of the node w, comes from the node v, and is stored as a Cell of the row where the node w is located.

2. The method of claim 1, wherein in step 2), converting the SPARQL query into a Gremlin graph traversal, implements converting the SPARQL query into a traversal of the RDF attribute graph.

3. The method of claim 2, wherein the method of translating the SPARQL query into a Gremlin graph traversal is by: for each triple (sub, pre, obj) in the where clause of the SPARQL query, if the triple is a triple in which a predicate pre represents an attribute in an attribute graph, converting the triple into a filter has (pre, obj) of a node in the attribute graph represented by a subject sub, where obj is a filter condition and pre is an attribute name represented by the predicate pre; if the triple is a triple of which the predicate pre represents an edge in the attribute graph, converting the triple into traversal sub.out (pre) - > obj from a node represented by the subject sub to a node represented by the object obj, wherein out represents the edge, and pre represents an associated label of the edge represented by the predicate pre; and processing all the triples of the where clause to obtain the traversal sequence.

4. The method of claim 1, wherein the method of traversing all nodes satisfying the condition in the Bigtable data according to the traversal sequence obtained in step 2) is: the traversal sequence is formed by filtering and associated edges; firstly, forming a traversal path according to the associated edges, and then eliminating an invalid traversal path by filtering to obtain an effective traversal path; and traversing all nodes meeting the conditions in the Bigtable data according to the effective traversal path, and then organizing the attribute values of the nodes and edges of the effective traversal path.