KR20010109945A

KR20010109945A - RS-tree for k-nearest neighbor queries with non spatial selection predicates and method for using it

Info

Publication number: KR20010109945A
Application number: KR1020000030766A
Authority: KR
Inventors: 박동주; 김형주
Original assignee: 박동주; 김형주
Priority date: 2000-06-05
Filing date: 2000-06-05
Publication date: 2001-12-12

Abstract

본 발명에 의하여 잘타손과 사멧이 제안한 점증적 최근접 인덱스 방법(알고리즘)의 성능을 향상시킬 수 있는 방법과 여기에 적용되어서 질의를 효과적으로 수행할 수 있는 RS트리구조가 개시된다.The present invention discloses a method for improving the performance of the incremental nearest index method (algorithm) proposed by Jalthason and Samet, and an RS tree structure that can be applied to the query effectively.

본 발명에 의하면, 종래 데이터베이스에서 질의에 사용되며, 잘타손(Hjaltason)과 사멧(Samet)이 제안한 R트리에 기반한 점증적 최근접 방법의 문제점인 불필요한 투플발생빈도수를 현저하게 감소시킴으로서 멀티미티어 데이터베이스 또는 지리정보시스템등에서 요구하는 다차원 공간객체집합에 대한 효과적인 인덱싱이 가능하다.According to the present invention, a multimedia database is used for a query in a conventional database by significantly reducing an unnecessary tuple occurrence frequency, which is a problem of an incremental close proximity method based on an R tree proposed by Hjaltason and Samet. Effective indexing of multidimensional spatial objects required by geographic information systems is possible.

Description

RS-tree for k-nearest neighbor queries with non spatial selection predicates and method for using it}

본 발명은 멀티미디어 데이터베이스 또는 GIS에서 인덱스방법으로 사용되는 잘타손(Hjaltason)과 사멧(Samet)이 제안한 점증적 최근접 알고리즘의 문제점인 투플발생빈도수를 감소시킴으로써 멀티미티어 데이터베이스 또는 지리정보시스템등에서 요구하는 다차원 공간객체집합에 대한 효과적인 검색이 가능한 인덱스방법에 관한 것이다.The present invention can reduce the number of tuple occurrences, which is a problem of the incremental nearest neighbor algorithm proposed by Hjaltason and Samet, which is used as an indexing method in a multimedia database or a GIS. The present invention relates to an index method that enables an efficient search on a spatial object set.

최근 지리정보시스템(GIS) 또는 이미지 검색 시스템 등의 멀티미디어 자료검색장치들에서는 다차원 공간객체집합을 대상으로 효율적인 "k-최근접 질의(k-nearest neighbor query)"의 처리를 요구한다. 이러한 질의를 위하여 효율적인 최적화 기법들이 개발되어 왔다. 그 중에서 Hjaltason과 Samet이 제안한 점증적 최근접 기법(Incremental Nearest Neighbor Algorithm)은 대화방식(interactive fashion)으로 동작하므로 "거리 브라우징(distance browsing)"을 요구하는 k-최근접 질의 처리에 가장 적합하다고 알려져 있다. 하기에 기록된 것은 종래의 k-최근접 질의 처리의 일례를 나타낸 것이다.Recently, multimedia data retrieval devices such as GIS or image retrieval systems require efficient processing of "k-nearest neighbor queries" for multidimensional spatial objects. Efficient optimization techniques have been developed for these queries. Among them, the Incremental Nearest Neighbor Algorithm proposed by Hjaltason and Samet works best in k-nearest query processing requiring "distance browsing" because it operates in an interactive fashion. have. What is written below shows an example of conventional k-nearest query processing.

SELECT *SELECT *

FROM DISCFROM DISC

WHERE artist = `Beatles'WHERE artist = `Beatles'

ORDER BY distance(color, `red')ORDER BY distance (color, `red ')

STOP AFTER k;STOP AFTER k;

상기에 기재된 질의(query)는 페이건(Fagin)의 멀티미디어 질의(query), "artist=`Beatles' ∧ color=`red'"를 확장 SQL문으로 표현한 것으로, 이것의 의미는 "Beatles의 음반(DISC) 중에서 음반의 색(color)이 red에 가장 가까운 k 개를 검색하라"는 것이다. 여기서 비공간 검색 조건(artist=`Beatles`)을 만족하는 k 개의 질의 결과를 얻기까지 질의 객체인 red로부터 거리 순서대로 공간 객체(color의 속성값)를 하나씩 검색해 낼 수 있는 거리 브라우징(distance browsing) 기능이 요구된다.The query described above expresses Fagin's multimedia query, "artist =` Beatles '∧ color = `red'" as an extended SQL statement, which means "DISC of Beatles". ) To find the k closest to the red of the album. Here, distance browsing, which can retrieve spatial objects (attribute values of color) one by one, in order of distance from the query object red, until k query results satisfying the non-spatial search condition (artist = `Beatles`) are obtained. Function is required.

Hjaltason과 Samet의 점증적 최근접 알고리즘은 R-트리와 같은 계층적 인덱스와 순위 큐(priority queue)를 이용하여, 주어진 질의 객체로부터 거리순서대로 다음 최근접 공간 객체를 포함하는 투플(tuple) 식별자(TID)를 하나씩 상위 연산자(예를들면, artist=`Beatles'를 처리하기 위한 select 연산자)에게 전달한다. 상위 연산자는 TID를 이용하여 해당 투플에 접근한 후 그 밖의 비공간 검색 조건을 처리하며, 요구된 k 개의 질의 결과가 추출될 때 까지 이러한 과정이 반복한다. 상기한 알고리즘의 가장 큰 장점은, 상위 연산자가 또다른 투플식별자(TID)를 요구할 때 질의를 처음부터 재시작(restart)하지 않고 이전의 작업 상태에서 다음 최근접 공간 객체를 포함하는 투플의 TID를 제공할 수 있다는 것이다.Hjaltason and Samet's incremental nearest-neighbor algorithm uses hierarchical indexes such as R-trees and priority queues to identify tuple identifiers containing the next nearest spatial object in the order of distance from a given query object. Pass TIDs one by one to the parent operator (for example, the select operator to handle artist = `Beatles'). The parent operator uses the TID to access the tuple and processes other non-spatial search conditions. This process is repeated until the k query results are retrieved. The main advantage of the above algorithm is that when the parent operator requires another Tuple Identifier (TID), it provides the TID of the tuple containing the next nearest spatial object in the previous working state without restarting the query from the beginning. It can be done.

그러나, 상기 점증적 최근접 알고리즘은 k 개의 질의 결과를 얻기까지 많은 수의 불필요한 투플이 발생되는 문제점이 있다. 즉 비공간 검색 조건을 만족시키지 않는 투플들을 상위 연산자에게 전달하기 때문에 검색속도면에서 효율적이지 못하였다. 이것은 k 개의 질의 결과를 얻기까지 불필요한 투플이 많은 경우 그 만큼 사용자의 응답 시간이 길어지기 때문이다. 상기와 같이 불필요한 투플이 발생하는 이유는, 상기의 점증적 최근접 알고리즘이 질의에 사용된 비공간 검색 조건과는 무관하게 작동하기 때문이다.However, the cumulative nearest algorithm has a problem in that a large number of unnecessary tuples are generated until k query results are obtained. That is, it is not efficient in terms of search speed because tuples that do not satisfy the non-spatial search condition are delivered to the upper level operator. This is because the response time of the user becomes longer when there are many unnecessary tuples before obtaining k query results. The reason why unnecessary tuples occur as described above is that the incremental nearest neighbor algorithm operates regardless of the non-spatial search condition used in the query.

Hjaltason과 Samet이 제안한 점증적 최근접 알고리즘에 대해서 더욱 상세히 살펴보고, 비공간 검색 조건이 포함된 k-최근접 질의를 처리할 때 발생하는 문제점을 설명한다. Hjaltason과 Samet은 R-트리를 이용한 점증적 최근접 알고리즘을 제안하였으며 이것을 RtreeINN 알고리즘이라 부른다.We will examine the incremental nearest algorithm proposed by Hjaltason and Samet in more detail, and describe the problems that occur when processing k-nearest queries with non-spatial search conditions. Hjaltason and Samet have proposed an incremental nearest neighbor algorithm using R-tree, which is called RtreeINN algorithm.

상기 RtreeINN 알고리즘은, R-트리의 루트 노드부터 탐색하면서, 다음 탐색 노드는 큐의 최전방 레코드에서 얻으며, 그 노드를 탐색한 후 그 노드의 자식 노드 또는 공간 객체를 큐에 삽입한다. 이때 질의 객체로부터 자식 노드 또는 공간 객체와의 거리를 함께 삽입한다. 큐는 항상 거리 순서대로 유지된다. 만약 큐의 맨 앞에서 꺼낸 데이터가 공간 객체일 때, 이것을 포함하는 튜플 식별자(TID)를 비공간 검색 조건 처리를 위한 질의 처리 연산자에게 리턴한다. k 개의 질의 결과를 얻거나 또는 큐에 레코드가 없을 때까지 이 과정을 계속 반복한다. 상술한 바와 같이, RtreeINN 알고리즘은 k'(> k) 투플이 필요할 때 질의를 처음부터 재실행하지 않고 다음 TID를 상위 연산자에게 전달할 수 있다.The RtreeINN algorithm searches from the root node of the R-tree, while the next search node is taken from the foremost record of the queue, and after searching for that node, inserts the node's child or spatial object into the queue. In this case, the distance between the child node and the spatial object is inserted from the query object. The cues are always kept in distance order. If the data retrieved from the front of the queue is a spatial object, the tuple identifier (TID) containing it is returned to the query processing operator for processing the non-spatial search condition. Repeat this process until you have k query results or no records in the queue. As described above, the RtreeINN algorithm can pass the next TID to the higher-level operator without rerunning the query from the beginning when k '(> k) tuples are needed.

Berchtold 등은 R-트리와 같은 계층적 인덱스를 이용하는 k-최근접 알고리즘의 "최적성(optimality)"을 정의하고, RtreeINN 알고리즘이 최적성을 보장함을 증명하였다. 최적성의 정의는 다음과 같다.Berchtold et al. Defined the "optimality" of the k-nearest algorithm using hierarchical indexes such as R-trees, and proved that the RtreeINN algorithm ensures optimality. The definition of optimality is as follows.

정의 1 k-최적성(k-Optimality)Definition 1 k-Optimality

최근접 질의를 처리할 때 접근된 전체 페이지 수와 SP(Q, r)와 겹치는(intersect) 페이지 수가 같을 때, 그 k-최근접 알고리즘은 최적성을 갖는다. 여기서 Q와 r은 각각 질의 객체와, Q로부터 k-번째 거리에 위치하는 공간 객체(O^k)와의 거리를 뜻한다(r=||Q - O^k||). 그리고, SP(Q, r)는 중심이 Q이고 반지름이 r인 구면체(sphere)를 나타낸다.The k-nearest algorithm is optimal when the total number of pages accessed and the number of pages intersect with SP (Q, r) are the same when processing the nearest query. Where Q and r are the distances between the query object and the spatial object O ^k located at the k-th distance from Q (r = || Q-O ^k ||). In addition, SP (Q, r) represents a sphere having a center of Q and a radius of r.

정의 1에 의해 RtreeINN 알고리즘은 최적성을 갖는다. 따라서, k 개의 질의 결과를 얻기까지 RtreeINN 알고리즘의 비용은 SP(Q, r)과 겹치는 R-트리의 페이지 수와 같다. 여기서의 비용은 비공간 검색 조건이 전혀 포함되지 않은 k-최근접 질의의 비용을 의미한다.By definition 1, the RtreeINN algorithm is optimal. Therefore, the cost of the RtreeINN algorithm until we get k query results is equal to the number of pages of the R-tree that overlap with SP (Q, r). The cost here means the cost of the k-nearest query without any non-spatial search conditions.

상기와 같이 RtreeNN 알고리즘을 이용하여 이러한 질의를 처리할 때 전체 비용 C_all= C_rtree+ C_tuple과 같다. 여기서 C_rtree는 R-트리 비용, C_tuple은 비공간 검색 조건을 처리하기 위한 투플 접근 비용을 의미한다. 그 외 전체 비용에는 순위 큐 연산 비용와 CPU 비용 등이 포함되지만 다른 비용에 비해 작기 때문에 본 발명에서는 무시한다.As described above, when processing such a query using the RtreeNN algorithm, the total cost C _all = C _rtree + C _tuple is equal. Where C _rtree is the R-tree cost and C _tuple is the tuple access cost to handle non-spatial search conditions. Other overall costs include rank queue operation costs, CPU costs, and the like, but are ignored in the present invention because they are small compared to other costs.

질의 객체를 Q, 비공간 검색 조건을 만족하는 k-번째 공간 객체를 O^k'(k' ≥ k), 반지름 r = ||Q - O^k'||이라고 할 때, C_rtree는 RtreeNN 알고리즘의 최적성에 따라 SP(Q, r)와 겹치는 R-트리 노드의 수와 같다. 다음으로, C_tuple을 구하기 위해 k'을 유추한다. 상위 연산자에게 전달되는 TID 집합의 크기는 k'과 동일하며, 질의 결과의 수 k가 주어질 때 k'을 다음과 같이 유추할 수 있다. 공간 속성(color)과 비공간 속성(artist)간에 연관성(correlation)이 전혀 없고, 질의에 사용된속성값(`Beatles')의 선택율(selectivity)을 S라고 하자. 그러면, i번째 질의 결과를 만족하는 공간 객체를 Oⁱ라고 할 때, (ｉ+1) 번째 질의 결과를 얻기 위해서는 d_i≤ ||Q - O|| ≤ d_i+1(d_i= ||Q - Oⁱ||)를 만족하는 (1/s)개의 공간 객체 O를 더 접근해야 한다. 따라서,가 된다. 여기서 k'은 k에 대해 선형적으로 증가함을 알 수 있다. 이때, 평균 접근 페이지 수를 구하는 Yao의 방정식을 사용하면과 같다. 여기서 N은 투플의 수, B는 페이지 크기, T는 투플 크기를 의미한다.When the query object is Q, and the k-th spatial object that satisfies the non-spatial search condition is O ^{k '} (k' ≥ k), and the radius r = || Q-O ^{k '} ||, C _rtree is the RtreeNN algorithm. It is equal to the number of R-tree nodes that overlap SP (Q, r) depending on the optimality. Next, infer k 'to obtain a C _tuple . The size of the TID set passed to the upper level operator is equal to k '. When the number of query results k is given, k' can be inferred as follows. Assume that there is no correlation between the spatial attributes (color) and the nonspatial attributes (artist), and the selectivity of the attribute values (`Beatles') used in the query is S. Then, when the spatial object satisfying the i th query result is O ⁱ , to obtain the (i + 1) th query result, d _i ≤ || Q-O || We need to access (1 / s) spatial objects O that satisfy ≤ d _{i + 1} (d _i = || Q-O ⁱ ||). therefore, Becomes It can be seen that k 'increases linearly with respect to k. Using Yao's equation to find the average number of pages accessed, Same as Where N is the number of tuples, B is the page size, and T is the tuple size.

비공간 검색 조건이 포함된 k-최근접 질의를 처리할 때 사용되는 RtreeINN 알고리즘의 문제점을 살펴보면 다음과 같다.The problems of the RtreeINN algorithm used when processing k-nearest queries with non-spatial search conditions are as follows.

상기에서 구한 k'의 근사값을 확인하기 위해 페이건의 RtreeINN 알고리즘을 구현하여 "artist=`Beatles' ∧ color=`red'" 형태의 질의를 여러 번 수행하여 도 1과 같은 결과를 얻었다. 도 1에 표시된 바와 같이 예측결과를 나타내는 그래프는 선형적으로 증가함을 알 수 있다. 그러나, 비공간 검색 조건을 만족하는 작은 수(k)의 질의 결과에 비해 접근해야 하는 투플의 수(k')가 너무 많음을 알 수 있다. 이러한 사실은 k' 개의 후보 투플 집합에는 너무 많은 불필요한 투플들을 포함하고 있다는 것을 나타낸다. 결국, k'가 커지면 C_tuple비용이 커지게 되고 사용자 응답 시간이 길어진다. 결론적 말하여, RtreeINN 알고리즘이 비공간 검색 조건이 포함된 k-최근접 질의를 처리를 하는 데는 적합하지만 많은 후보 투플들을 접근해야 하므로 비효율적임을 알 수 있다. 그 이유는 RtreeINN 알고리즘에서 사용하는 R-트리가 불필요한 투플들을 부분적으로 제거할 수 있는 기능이 없기 때문이다.In order to check the approximate value of k ', the implementation of Pagan's RtreeINN algorithm was performed several times in the form of "artist =` Beatles' ∧ color = `red'" to obtain a result as shown in FIG. As shown in FIG. 1, the graph representing the prediction result increases linearly. However, it can be seen that the number of tuples (k ') to be approached is too large compared to the small number (k) of query results satisfying the non-spatial search condition. This indicates that the set of k 'candidate tuples contains too many unnecessary tuples. As a result, the larger k ', the higher the cost of C _tuple and the longer the user response time. In conclusion, the RtreeINN algorithm is suitable for processing k-nearest queries with non-spatial search conditions, but it is inefficient because many candidate tuples must be accessed. This is because the R-tree used by the RtreeINN algorithm does not have the ability to partially remove unnecessary tuples.

예를 들어, 도 2에서와 같이, 질의 객체 Query로부터 각 R-트리 노드까지의 거리를, 거리 순서를 만족한다고 가정한다. RtreeINN 알고리즘에 따르면,,다음으로를 처리할 때,노드의 각 자식 노드를 미리 계산한 거리와 함께 큐에 삽입할 것이다. 여기서,과의 하위 트리에 비공간 검색 조건을 만족시키는 투플이 하나도 없다고 가정한다. 가정에 따라과가 쓸모없는 노드임에도 불구하고, RtreeINN 알고리즘은 그 노드들로 인해 미래에 많은 불필요한 투플들이 생성된다는 것을 고려하지 않고 그 노드들을 큐에 삽입한다. 따라서 불필요한 투플이 증가하는 문제점이 있다.For example, as shown in Figure 2, each R-tree node from the query object Query Distance to Street order Assume that According to the RtreeINN algorithm, , to the next When processing Each child node of the node will be inserted into the queue with a precomputed distance. here, and Suppose no tuples satisfy the non-spatial search conditions in the subtree of. According to home and Although is a useless node, the RtreeINN algorithm enqueues the nodes without considering that they create many unnecessary tuples in the future. Therefore, there is a problem that unnecessary tuples increase.

본 발명은 상기와 같은 문제점을 해결하여 안출된 것으로서, 본 발명의 목적은 투플의 제거에 효과적인 새로운 형식의 RS 트리를 제공하는데 있다.The present invention has been made to solve the above problems, and an object of the present invention is to provide a new type of RS tree effective for the removal of tuples.

본 발명의 다른 목적은, R-트리가 아닌 RS-트리에 기반한 점증적 최근접 알고리즘을 이용한 인덱스방법을 제공하는데 있다.Another object of the present invention is to provide an indexing method using an incremental nearest neighbor algorithm based on an RS-tree rather than an R-tree.

본 발명의 다른 목적은 불필요한 투플수를 감소시킴으로서 검색을 더욱 효율적으로 수행할 수 있는 비공간검색조건이 포함된 케이-최근접 질의를 위한 인덱스방법을 제공하는데 있다.Another object of the present invention is to provide an indexing method for a K-nearest query including a non-spatial search condition that can perform a search more efficiently by reducing the number of unnecessary tuples.

도 1은 종래의 RtreeINN 알고리즘을 이용한 질의에 대한 수행결과의 그래프,1 is a conventional RtreeINN Graph of execution result for query using algorithm,

도 2는 종래의 RtreeINN 알고리즘에 의한 탐색공간을 나타내는 그래프,2 is a conventional RtreeINN A graph representing the search space by the algorithm,

도 3은 본 발명의 RS 트리의 생성 및 구조를 나타내는 개념도,3 is a conceptual diagram illustrating generation and structure of an RS tree of the present invention;

도 4는 본 발명에 의한 시그니쳐분할 상태를 나타내는 개념도,4 is a conceptual diagram illustrating a signature split state according to the present invention;

도 5 내지 도 6은 본 발명의 RS 트리에 기초한 알고리즘을 나타내는 코드예,5 to 6 are code examples showing an algorithm based on the RS tree of the present invention;

도 7은 본 발명에 의한 질의결과에 따른 실험결과를 나타내는 그래프,7 is a graph showing an experimental result according to the query result according to the present invention;

도 8은 시그니쳐 분할함수에 의하여 분할된 상태의 실험결과를 나타내는 그 래프,8 is a graph showing an experimental result in a state divided by the signature division function;

도 9는 데이타 크기에 따른 성능평가실험결과를 나타내는 그래프,9 is a graph showing the results of the performance evaluation experiment according to the data size;

도 10은 공간데이타의 차원(d)에 따른 성능평가실험결과를 나타내는 그래프,10 is a graph showing a performance evaluation test result according to the dimension (d) of spatial data;

도 11은 실세계 데이터세트에서의 실험결과를 나타내는 그래프.11 is a graph showing experimental results in a real world dataset.

상기 목적을 달성하기 위한 본 발명에 의하여, RS트리구조와, S-트리의 제거효과와 이를 높이기 위한 시그니쳐분할방법, 그리고 RS-트리를 이용한 점증적 최근접 알고리즘이 제공된다.According to the present invention for achieving the above object, there is provided an RS tree structure, a removal effect of the S-tree, a signature division method for increasing it, and an incremental nearest neighbor algorithm using the RS-tree.

본 발명의 RS 트리는, 비공간검색조건이 포함된 케이-최근접 질의를 위한 트리구조에 있어서, 상기 트리구조가 R 트리와 S 트리의 조합으로 구성되는 트리구조로서, 상기 R트리가 공간속성에 대한 인덱스로 사용되고, 상기 S트리가 비공간속성에 대한 인덱스로 사용되고, R트리와 동일한 계층적 구조의 노드집합으로 구성되며, 각 노드는 다수의 엔트리로 구성되고, S트리의 각 노드와 그 노드를 구성하는 엔트리가 각각 R 트리의 노드와 엔트리에 일대일 대응하며, 인덱스노드레벨만을 갖는 것을 특징으로 한다.The RS tree of the present invention is a tree structure for a K-nearest query with non-spatial search conditions, wherein the tree structure is composed of a combination of an R tree and an S tree, and the R tree has a spatial attribute. It is used as an index for the S tree, and the S tree is used as an index for a non-spatial attribute. The tree is composed of a set of nodes of the same hierarchical structure as the R tree, and each node is composed of a plurality of entries. Each entry constituting the s corresponds to a node of the R tree and the entry one-to-one, and has only an index node level.

본 발명에 의한 S-트리는 다음과 같은 속성을 가진다.The S-tree according to the present invention has the following properties.

♥"시그니쳐 집합"으로서 각 시그니쳐는 비공간 속성값들의 집합을 표현하는 집합 값(set value)이다.♥ As a "signature set", each signature is a set value representing a set of non-spatial attribute values.

♥S-트리는 계층적 시그니쳐 화일에 기반하므로 R-트리와 똑같은 계층적 구조를 지원한다.♥ S-trees are based on hierarchical signature files, so they support the same hierarchical structure as R-trees.

♥S-트리는 R-트리와 독립적인 자료 구조를 가지며, 또한 독립적인 저장 구조를 가진다.♥ S-trees have data structures that are independent of R-trees, and have independent storage structures.

♥S-트리도 R-트리의 MBR과 같이 레벨에 따른 포함관계(hierarchical inclusion relationship)를 잘 지원한다.♥ S-trees also support hierarchical inclusion relationships as well as MBRs in R-trees.

♥작은 크기의 시그니쳐를 사용하므로 S-트리의 저장 공간이 작다.♥ Small size signature saves S-tree storage.

♥S-트리의 시그니쳐는 0 또는 1로 이루어진 비트 스트림(bit stream)이므로 AND 또는OR과 같은 연산 비용이 저렴하다.The signature of the ♥ -S-tree is a bit stream of zeros or ones, so the operation cost such as AND or OR is low.

그러나 R 트리는 종래 사용되는 것과 본질적으로 동일한 구조를 갖는다.However, the R tree has essentially the same structure as that used conventionally.

이하 첨부된 도면을 참고하여 본 발명을 상세히 설명하면 다음과 같다.Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

본 발명에 의한 RS-트리는, R-트리와 계층적 시그니쳐 화일 구조를 갖는 S-트리의 조합으로 구성된다. 상기 R-트리는 color와 같은 공간 속성에 대한 인덱스로 사용되며, S-트리는 artist와 같은 비공간 속성에 대한 인덱스로 사용된다. 일반적으로 하나의 투플은 여러 개의 비공간 속성으로 구성될 수 있으므로, RS-트리를 하나의 공간 속성과 m 개의 비공간 속성을 위한 인덱스, 즉 RS^m-트리로 확장할 수 있다. 새로운 인덱스를인 RS-트리를 중심으로 설명한다.The RS-tree according to the present invention is composed of a combination of an R-tree and an S-tree having a hierarchical signature file structure. The R-tree is used as an index for spatial attributes such as color, and the S-tree is used as an index for non-spatial attributes such as artist. In general, one tuple can be composed of several non-spatial attributes, so the RS-tree can be extended with an index for one spatial attribute and m non-spatial attributes, that is, the RS ^m -tree. New index The following description will focus on the RS-tree.

RS-트리를 구성하는 R-트리는 기존의 구조와 동일하므로 S-트리를 중심으로 전체 인덱스를 설명한다. S-트리는 R-트리와 같이 계층적 구조의 노드의 집합으로 구성되며, 각 노드는 여러 개의 엔트리들로 채워진다. S-트리의 각 노드와 그 노드를 구성하는 엔트리는 각각 R-트리의 노드와 엔트리에 일 대 일 대응된다. 그러나, S-트리는 R-트리와 달리 인덱스 노드 레벨만을 가진다. 그 이유는, R-트리는 인덱스 노드에 비해 데이타 노드의 저장 공간 비율이 매우 높으므로 R-트리 데이타 노드와 일대일 대응되는 S-트리의 데이타 노드를 두면 저장 공간의 오버헤드(overhead)가 너무 커지기 때문이다. 물론, S-트리에 데이타 노드를 두면 전체 성능을 높일 수 있지만 저장 공간의 오버헤드로 인해 비현실적이다.Since the R-tree constituting the RS-tree is the same as the existing structure, the entire index will be described based on the S-tree. An S-tree consists of a set of hierarchical nodes like an R-tree, with each node filled with a number of entries. Each node of the S-tree and the entry constituting the node correspond one-to-one to the nodes and entries of the R-tree, respectively. However, unlike the R-tree, the S-tree has only the index node level. The reason is that since the R-tree has a higher ratio of storage space to the data nodes compared to the index nodes, the storage space overhead becomes too large when the data nodes of the S-tree correspond one-to-one with the R-tree data nodes. to be. Of course, putting data nodes in an S-tree can increase overall performance, but it is unrealistic because of the storage space overhead.

S-트리 노드의 j 번째 엔트리는, 해당 R-트리 노드의 j 번째 엔트리의 서브트리에 포함되는 모든 비공간 속성값(예를 들면, artist의 `Beatles')의 집합을 표현하는 집합 값(set value)을 가진다. S-트리 노드의 엔트리가 집합 형태의 값을 갖는 이유는 "비공간 검색 조건을 만족시키지 않는 R-트리 노드"를 쉽게 제거하기 위해서다.The j th entry of the S-tree node is a set value representing the set of all non-spatial attribute values (eg, artist's 'Beatles') contained in the subtree of the j th entry of the R-tree node. value). The reason why an entry of an S-tree node has a value of a set form is to easily remove the "R-tree node which does not satisfy the non-spatial search condition".

S-트리는 R-트리가 일괄적재(bulk-loading) 방식으로 생성되는 시점에서 동시에 생성된다고 가정한다. 이때 S-트리 노드의 엔트리가 가지는 시그니쳐(signature)의 생성은 크게 S-트리의 레벨(S-트리의 맨 하단층)과로 구분된다. 레벨에서 S-트리 노드의 엔트리가 갖는 시그니쳐는, 1) 엔트리 자신과 대응되는 R-트리 노드의 엔트리의 서브트리가 가질 수 있는 모든 비공간 속성값의 집합을 구한 다음, 2) 그 집합에 속하는 각 비공간 속성값을 해쉬(hash) 함수를 써서 고정길이의 비트 스트림 즉, 시그니쳐로 변환하고, 3) 생성된 시그니쳐 집합의 모든 원소에 대해 비트 단위로 합연산(OR)한 결과와 같다. 그 이외의 레벨에서는 S-트리 노드의 엔트리가 가리키는 자식 노드의 모든 시그니쳐를 합연산한 결과를 그 엔트리의 시그니쳐로 사용한다.S-trees are assumed to be created at the same time that the R-trees are created in a bulk-loading manner. At this time, the signature generation of the entry of the S-tree node is largely the level of the S-tree. (At the bottom of the S-tree) Separated by. level The signature of an entry of an S-tree node in S1 is obtained by 1) obtaining a set of all non-spatial attribute values that a subtree of an entry of the R-tree node entry corresponding to the entry itself may have, and 2) each belonging to the set. It converts the non-spatial attribute value into a fixed-length bit stream, that is, a signature using a hash function, and 3) sums the bitwise OR of all elements of the generated signature set. At other levels, the sum of all the signatures of the child nodes indicated by the entry of the S-tree node is used as the signature of the entry.

도 3(a)에서 테이블 DISC는 공간 속성으로 color를, 비공간 속성으로 artist를 가지며, 각각 R-트리와 S-트리의 키로 사용된다. 그림 3(b)는 DISC 테이블의 각 artist 속성값에 대해 해쉬 함수를 이용하여 변환된 고정 길이의 시그니쳐 테이블을 의미한다. color 속성에 대한 d-차원 공간이 그림 3(c)와 같을 때, 해당 R-트리는 그림 3(d)의 왼쪽 부분의 트리와 같다. 각 R-트리 인덱스 노드에 일대일 대응되는 노드(예:↔)의 집합으로 구성된 트리가 S-트리이다. 레벨에서노드의시그니쳐를 생성하는 과정은, 먼저 대응되는 R-트리 노드의 엔트리를 찾고,가 가리키는 데이타 노드에 포함된 TID=1, 2를 통해 테이블 DISC에서 artist의 속성값 {`Beatles', `Sting'}을 얻고, 해쉬 함수를 통해 시그니쳐 {01101010, 00111001}로 변환한 후, 두 개의 시그니쳐를 합연산한다(01101010 ∧ 00111001 = 01111011). 시그니쳐와도와 같은 방식으로 만들어진다. 레벨인 경우, 시그니쳐은 자신의 하위 노드가 가지는 모든 시그니쳐,,의 합연산 01111011 ∧ 01111010 ∧ 00111011 = 01111011과 같다. 이런 방식으로 최상위 S-트리 노드의 시그니쳐가 생성되고, 전체 S-트리가 만들어진다.In FIG. 3 (a), the table DISC has color as a spatial attribute and artist as a non-spatial attribute, and is used as a key of an R-tree and an S-tree, respectively. Figure 3 (b) shows a fixed-length signature table converted using a hash function for each artist attribute value of the DISC table. When the d-dimensional space for the color attribute is shown in Figure 3 (c), the corresponding R-tree is the tree in the left part of Figure 3 (d). One-to-one correspondence to each R-tree index node (for example, ↔ Is a S-tree. level in Node The process of generating a signature may include a corresponding R-tree node. Entry Looking up, The data node pointed to The attribute values {`Beatles ',` Sting'} of the artist are obtained from the table DISC through TID = 1, 2, and the signatures are converted into signatures {01101010, 00111001} using a hash function, and then the two signatures are combined. (01101010 ∧ 00111001 = 01111011). Signature Wow Degree Is made in the same way. level If, signature Has its own child node All signatures , , The sum of 01111011 ∧ 01111010 ∧ 00111011 = 01111011. In this way the signature of the top S-tree node is generated, and the entire S-tree is created.

하기에서 S-트리가 첨가된 RtreeINN 알고리즘의 관점에서 S-트리에 의해 불필요한 R-트리 노드와 객체(또는 TID)가 어떻게 제거되는지 설명한다.The following describes how unnecessary R-tree nodes and objects (or TIDs) are removed by the S-tree in terms of the RtreeINN algorithm with the S-tree added.

큐의 최전방에 존재하는 다음 탐색 노드(next target node)를, 그 노드의 자식 노드를이라 하면, 본 발명에 의한 새로운 알고리즘은 각각의로부터 질의 객체까지의 거리를 계산하기 전에와 대응되는 S-트리 노드의 시그니쳐를 이용하여의 서브트리가 비공간 질의값(예를 들면, `Beatles')을 포함하는지를 먼저 검사한다. 즉, 비공간 질의값의 시그니쳐(예를 들면, `Beatles'의 시그니쳐 01101010)를 질의 시그니쳐(query signature)라고 하면,를 만족하는지 검사한다. 여기서 연산자 ∧는 비트 단위의 곱연산(AND)을 의미하며, 이러한 검사를 시그니쳐 검사(Signature Checking)라고 부른다. 시그니쳐 검사를 통과한다는 것은, 해당 R-트리의 자식 노드가 가질 수 있는 모든 데이타 노드에 질의에 사용된 비공간 속성값(예: `Beatles')을 갖는 투플이 있을 수 있다는 것을 의미하며, 그렇지 않는 경우는 그 반대의 의미를 지닌다. 따라서, S-트리의 노드에 포함되는 모든 시그니쳐 중 시그니쳐를 통과하는에 대응되는 R-트리의 자식 노드를 큐에 삽입하면 된다. 물론, 시그니쳐 검사를 위해 S-트리 노드 I/O가 발생하지만 이 비용보다 S-트리의 노드 제거에 의한 이익이 실제로 더 큼을 알 수 있다.Next target node in the front of the queue , The child of that node In this regard, the new algorithm according to the present invention Before calculating the distance from to the query object S-tree node corresponding to Signature Using First check whether the subtree of contains non-spatial query values (eg `Beatles'). That is, the signature of a non-spatial query value (for example, the signature 01101010 of `Beatles') is a query signature. Speaking of Check if it satisfies. Here, the operator 의미 means a bitwise multiplication (AND), and this check is called a signature check. Passing the signature check means that any data node that a child node of that R-tree can have, may have a tuple with non-spatial attribute values (eg `Beatles') used in the query, otherwise The case has the opposite meaning. Thus, nodes in the S-tree Which passes the signature among all signatures included in Insert a child node of the R-tree corresponding to the queue. Of course, S-tree node I / O occurs for signature checking, but it can be seen that the benefit of node removal of the S-tree is actually greater than this cost.

도 3을 다시 참고하면, 현재 큐에서 꺼낸 탐색 노드가 RIN2, 엔트리,,이 각각 가리키는 자식 노드가,,, 질의 객체로부터 각각의 자식 노드까지의 거리가 각각,,, 거리 관계가을 만족시킨다고 하자. 그리고 질의 시그니쳐를 `Beatles'의 시그니쳐, 즉= 01101010이라고 가정한다. S-트리를 이용하는 알고리즘에서는 RIN2의 엔트리가 가리키는 자식 노드를 큐에 삽입하기 전에 먼저 S-트리를 이용한 시그니쳐 검사를 수행한다. 시그니쳐 검사에서(= 01111010)은를 만족시키지 않으므로는 큐에 삽입되지 않는다. 반면, RtreeINN 알고리즘에서는의 서브트리에 비공간 질의 값과 동일한 비공간 속성값이 존재하는지를전혀 고려하지 않기 때문에를 그대로 큐에 삽입할 것이다. 이후과 같은 노드들은 많은 불필요한 투플들을 생성하게 될 것이다.Referring again to Figure 3, the search node taken out of the current queue is RIN2, entry , , Each of these child nodes , , , The distance from the query object to each child node , , , Distance relationship Let's satisfy. And the signature of the query is the signature of the Beatles, Assume that = 01101010. For algorithms using S-trees, entry of RIN2 Child nodes pointed to The signature check using S-tree is performed before inserting into the queue. In signature check (= 01111010) is Does not satisfy Is not inserted into the queue. On the other hand, in the RtreeINN algorithm Because no consideration is given to whether a nonspatial attribute value equal to the nonspatial query value exists in the subtree of Will be inserted into the queue as is. after Nodes such as will generate many unnecessary tuples.

상위 레벨로 갈수록 S-트리의 시그니쳐는 모든 비트가 `1'로 채워질 "시그니쳐 포화(signature saturation)" 상태에 이를 수 있다. 그 이유는 상술한 바와 같이 S-트리의 생성에서 레벨이 증가하면 시그니쳐를 생성하기 위해 합연산되는 자식 시그니쳐의 수가 증가하기 때문이다. 이때 발생하는 문제가 "false drop"이 많이 발생한다는 것이다. 어떤 투플이 false drop이라는 것은 시그니쳐 검사를 통과했음에도 불구하고 결국에는 질의 조건을 만족시키지 않는다는 것을 의미한다. false drop의 문제는 다수의 시그니쳐가 합연산될 때 발생하는 "팬텀 효과(phantom effect)" 때문에 발생된다. 예를 들면, 그림 3(d)에서=01101010일 때의 TID = 3, 4인 투플은이 시그니쳐 검사를 통과하지만, 결국 artist=`Beatles'를 만족하지 않으므로 false drop이 된다. 여기서 시그니쳐은에 대해서 팬텀 효과를 발생시킨 것이다. 이후 이러한 투플은 결국 불필요한 투플로 판명날 것이다. 후에 시그니쳐 포화로 인해 발생하는 다량의 false drop의 수를 줄일 수 있는 "시그니쳐 분할 방식"에 대해서 설명한다.At higher levels, the signature of the S-tree may reach a "signature saturation" state where all bits are filled with `1 '. This is because, as described above, as the level increases in the generation of the S-tree, the number of child signatures that are combined to generate the signature increases. The problem with this is that a lot of "false drop" occurs. If a tuple is false drop, it means that even though it passes the signature check, it eventually does not satisfy the query condition. The problem of false drop is caused by the "phantom effect" that occurs when multiple signatures are combined. For example, in Figure 3 (d) When = 01101010 The tuple of TID = 3, 4 This signature check passes, but eventually drops because it does not satisfy artist = `Beatles'. Signature here silver This is what caused the phantom effect. This tuple will eventually turn out to be an unnecessary tuple. Later, the "signature splitting method" will be described to reduce the number of large false drops caused by signature saturation.

수식적으로 "시그니쳐 포화"를 확인할 수 있다. 자식 시그니쳐와 부모 시그니쳐 각각의 i번째 비트 값이 `1'일 확률이 각각,이고, 그룹에 속하는 자식 시그니쳐의 수가일 때,과 같이 주어진다. 비교적 작은에 대한는 작은에도 불구하고 거의 `1'에 가까와진다는 것을 추정할수 있다. 이로부터 S-트리의 레벨이 증가할수록 시그니쳐 포화에 급격하게 도달한다는 사실을 알 수 있다. 왜냐 하면, S-트리의 레벨이 증가하면 동시에도 증가하기 때문이다. 따라서, 팬텀 효과로 인해 더 많은 false drop이 발생할 것이다. 우리는또는를 작게 만듬으로써 시그니쳐 포화 속도를 지연시킬 수 있다. 작은를 선택하면 확실히 그 속도를 지연시킬 수 있지만, 작은를 위해 매우 큰 크기의 시그니쳐가 필요함에도 불구하고 그 효과는 그리 크지 않다. 대신,을 작게 만들 수 있는 "시그니쳐 분할(signature chopping)"이라는 방법을 제안한다.The signature "saturation saturation" can be confirmed. The probability that the i th bit value of each child and parent signature is '1' is respectively , , The number of child signatures belonging to the group when, Is given by Relatively small For Is small In spite of this, it can be estimated that it is almost '1'. It can be seen from this that the signature saturation is reached rapidly as the level of the S-tree increases. Because as the level of the S-tree increases, Because also increases. Thus, the phantom effect will cause more false drops. We are or By making the smaller, we can delay the signature saturation rate. small Selecting it will definitely slow down its speed, but small Despite the need for a very large signature, the effect is not so great. instead, We propose a method called "signature chopping," which can make.

시그니쳐 분할의 목표는, S-트리 노드의 각 시그니쳐를 시그니쳐 분할 함수(예를 들면,)에 의해 여러 개의 시그니쳐로 분할함으로써,을 가능하면 작게 만들고 더나아가 팬텀 효과를 줄이는 데 있다. 레벨 l 상에 존재하는 어떤 S-트리 노드가 갖는 각각의 시그니쳐에 대해서,는개의 분할 시그니쳐(chopped-signature),로 나누어진다(레벨 l이 높은 경우가 발생할 수 있으므로 이런 경우의가 된다). S-트리의 생성시에,의 자식 노드가 갖는 C 개의 시그니쳐들은개의 버킷(bucket)에씩 배분된다. 그런 후, j-번째 버킷에 포함된 모든 시그니쳐를 합연산함으로써가 생성된다. 여기서,를 OR 연산자라고 할 때을 만족시킨다. 따라서, 팬텀 효과는 시그니쳐 분할 전에 비해 약배만큼 감소한다. 도 4에서이고이라고 가정할 때,는개의,,,로 분할된다(의 경우는이므로개로 분할된다). 시그니쳐 분할 전의는 자신의 자식 노드가 갖는 네 개의,,,을 합연산하여 생성되는 반면, 시그니쳐 분할 후의는 네 개의 시그니쳐로 분할되며 각각은의 자식 노드 중에서 단 하나의 시그니쳐를 자식 시그니쳐로 갖는다.The goal of the signature split is to split each signature of the S-tree node into a signature split function. (For example, By dividing into several signatures, Make it as small as possible, and furthermore, reduce the phantom effect. The signature of each S-tree node on level l about, Is Chopped-signatures, Divided by (if level l is high May occur because of this Becomes). At the time of creation of the S-tree, The C signatures of the child nodes of Buckets It is distributed by each. Then, by concatenating all the signatures contained in the j-th bucket Is generated. here, Is called the OR operator Satisfies Thus, the phantom effect is roughly comparable to before signature division. Decreases by a factor In Figure 4 ego Assume that Is doggy , , , Divided by In the case of Because of Divided into dogs). Before the signature split Has four child nodes , , , Generated by concatenating, while Is divided into four signatures, each of which is Only one of the child nodes of has the signature as the child signature.

시그니쳐 분할 방식에서, 시그니쳐 검사는개의 분할 시그니쳐 각각에 대하여 수행된다.를 큐에서 꺼낸 다음 탐색 노드를와 대응되는 S-노드라고 정의하며,는의 j-번째 시그니쳐라고 가정하자.의개의 분할 시그니쳐에 대해서, 어떤 분할 시그니쳐도 시그니쳐 검사를 통과하지 못하면, 시그니쳐 분할을 적용하지 않은 방식과 같이의 j-번째 엔트리의 자식 노드(=)는 제거된다. 즉, 큐에 삽입되지 않는다. 그렇지 않는 경우,는 추가적인 힌트(hint), 예를 들면, 비트맵(bitmap)과 함께 큐에 삽입된다. 여기서 힌트는, 시그니쳐 검사를 통과하지 못한 각 분할 시그니쳐에게 배분된노드의개의 엔트리의 모든 서브트리들은 더이상 접근될 필요가 없으므로 제거된다는 의미를 갖는다. 여기서,의 각 분할 시그니쳐는노드의개의 엔트리를 책임진다. 도 4에서 시그니쳐 검사를 통과한 시그니쳐를 진한 영역으로 표시하며, S-트리 노드에 대응되는 R-트리 노드를라고 하자.의 분할 시그니쳐는 어느 것도 시그니쳐 검사를 통과하지 않는다면, R-트리 노드는 큐에 삽입되지 않는다. 반면,의 분할 시그니쳐 중에서과는 시그니쳐 검사를 통과하므로 해당 R-트리 노드을 큐에 삽입한다. 이때,와은 시그니쳐 검사를 통과하지 못하였기 때문에의 엔트리 중에서 두 번째와 세 번째 엔트리에 대해서는 어떤 작업(예를 들면, 거리의 계산)도 수행할 필요가 없다는 것을 표시하는 비트맵을 큐에 삽입한다.In signature splitting, signature checking Is performed on each of the partition signatures. Remove the queue from the navigation node To Defined as the S-node, Is Suppose that is the j-th signature of. of For two split signatures, if no split signature passes the signature check, then the signature split is not applied. The child node of the j-th entry of (= ) Is removed. That is, it is not inserted into the queue. Otherwise, Is inserted into the queue with additional hints, for example, a bitmap. Here the hints are assigned to each split signature that does not pass the signature check. Node All subtrees of entries have to be removed because they no longer need to be accessed. here, Each split signature of Node Responsible for entries. In FIG. 4, the signature that passed the signature check is indicated by a dark area, and the S-tree node is shown. R-tree node corresponding to Let's say Split signature If none passes the signature check, then the R-tree node Is not inserted into the queue. On the other hand, Of the split signatures and Passes the signature check, so that R-tree node To the queue. At this time, Wow Did not pass the signature check, Inserts a bitmap into the queue indicating that there is no need to perform any operations (eg, calculate distance) on the second and third entries of the.

RtreeINN 알고리즘을 보완한 RS-트리 기반 점증적 최근접 알고리즘(이하 RStreeINN 알고리즘)을 설명하면 다음과 같다. 도 5 및 도 6에 도시된 알고리즘 1과 2는 각각 RStreeINN 알고리즘과 시그니쳐 검사 루틴을 나타낸다. RStreeINN 알고리즘에서 추가적인 부분은 큐에서 꺼낸 다음 탐색 노드가 갖는 자식 노드 중 어떤 노드가 제거되었는지 확인하는 단계(Algorithm 1의 단계 9과 21)와, 큐에 그것들을 삽입하기 전에 시그니쳐 검사를 수행하는 단계(Algorithm 1의 단계 10)이다. 알고리즘 2에서, AllocSize(l)은 각각의 분할 시그니쳐에 배분된, 레벨 l 상의 R-트리 노드의 엔트리의 수를 의미한다. RStreeINN 알고리즘은 비교 연산자(> 또는 <)가 포함된 비공간 검색 조건문이 포함된 질의에 적합하지 않다. 그러나, 이러한질의의 처리는 RS-트리의 R-트리를 이용한 RtreeINN 알고리즘을 그대로 적용함으로써 가능하다. 왜냐 하면, RS-트리에서의 R-트리는 S-트리와 독립적인 저장 구조를 갖기 때문이다. 도 5 및 도 6에는 상세한 알고리즘의 소스코드가 표시되어 있다.RS-tree based incremental nearest algorithm (hereinafter referred to as RStreeINN algorithm) that complements the RtreeINN algorithm is described as follows. Algorithms 1 and 2 shown in FIGS. 5 and 6 respectively represent an RStreeINN algorithm and a signature check routine. An additional part of the RStreeINN algorithm is taking the queue out and checking which of the child nodes the search node has removed (steps 9 and 21 of Algorithm 1), and performing signature checking before inserting them into the queue ( Step 10 of Algorithm 1). In Algorithm 2, AllocSize (l) refers to the number of entries of R-tree nodes on level l, allocated to each split signature. The RStreeINN algorithm is not suitable for queries containing non-spatial search conditions with comparison operators (> or <). However, this query can be processed by applying the RtreeINN algorithm using the R-tree of the RS-tree as it is. This is because the R-tree in the RS-tree has a storage structure independent of the S-tree. 5 and 6 show the source code of the detailed algorithm.

상기와 같은 본 발명의 방법을 적용한 성능실험의 결과가 하기에 설명된다.실험을 위해 기본 테이블 DISC를 생성하였으며, 그 스키마는 DISC(artist, type, country, color)와 같다. color는 공간 속성으로서 d-차원의 점(point)으로 표현되며, 그 이외는 비공간 속성으로서 30 바이트 크기의 텍스트 형식의 값을 가진다. color 속성값은 균일(uniform) 분포를 따르는 인위 데이터와 실세계 데이타로 나누어 실험하였다. color 속성 이외의 속성들은 Zipfian 분포를 따르는 속성값을 가지며, 또한 각각의 속성은 전체 투플 수의 0.5%에 해당하는 이산(distinct) 값을 가진다. DISC 테이블을 저장할 때의 페이지의 크기는 4K 바이트이다.The results of the performance experiment applying the method of the present invention as described above are described below. A base table DISC was created for the experiment, and the schema is the same as DISC (artist, type, country, color). color is a spatial attribute, expressed as a d-dimensional point; otherwise, it is a non-spatial attribute with a 30-byte text value. The color attribute value was tested by dividing the artificial data and the real-world data along the uniform distribution. Attributes other than the color attribute have attribute values according to the Zipfian distribution, and each attribute has a discrete value corresponding to 0.5% of the total tuple number. The page size when storing the DISC table is 4K bytes.

color 속성값을 저장하는 인덱스 구조로서 R^*-트리[2]를 사용하였으며, R^*-트리의 노드 크기는 4K 바이트이다. RStreeINN 알고리즘의 실험을 위한 S-트리는 R^*-트리를 일괄적재 방식으로 생성할 때 동시에 생성하였으며, 이때 여러 시그니쳐 분할 함수 f(l)을 적용하였다(예를 들면,). 만약, f(l)에 의해 분할된 시그니쳐의 수가 많거나 공간 객체의 차원이 낮아 노드의 유효 용량 C가 커지는 경우 S-트리의 오버플로우 페이지에 나머지 시그니쳐를 저장하였다. 비공간 속성값에 대한 시그니쳐 생성을 위한 해쉬 함수는 [8]을 따르며, 각 실험에서 시그니쳐 크기 F=64비트(=8바이트)를 사용하였다. 두 알고리즘에서 사용하는 순위 큐는모두 주 메모리에서 동작하도록 구현되었다.R ^* -tree [2] is used as an index structure to store color attribute values, and the node size of R ^* -tree is 4K bytes. The S-tree for the experiment of the RStreeINN algorithm was created at the same time when the R ^* -tree was created in a batch-loading manner, and various signature splitting functions f (l) were applied (for example, ). If the number of signatures divided by f (l) is large or the dimension of the spatial object is low and the effective capacity C of the node becomes large, the remaining signatures are stored in the overflow page of the S-tree. The hash function for signature generation for non-spatial attribute values follows [8], and the signature size F = 64 bits (= 8 bytes) was used in each experiment. The rank queues used by both algorithms are implemented to run in main memory.

실험 측정값은 질의 수행시간, 페이지 접근 수, 투플 접근 수로 나누어지고, 모든 실험에서 실험 질의는 (d-차원의 점, 비공간 속성값)으로 구성되며, 각 실험 결과는 50 개의 실험 질의를 수행한 결과의 평균값이다. d-차원 점은 무작위 방식으로, 비공간 속성값은 Zipfian 분포에 따른 선택율이 가장 높은 상위 50개의 속성값을 선택하였다. 그 이유는, 선택율이 높으면 동일한 속성값을 갖는 투플의 수가 증가하며, 따라서 질의 최적화시 정렬(sort) 비용이 큰 순차 스캔(scan) 방식보다 공간 인덱스를 사용할 가능성이 커지기 때문이다. 질의 결과 k는 1, ..., 35 범위를 가지며, k가 비교적 작은 이유는, k가 상대적으로 큰 경우 순차 검색이 유리하고 실제적으로 작은 k가 사용되기 때문이다[11]. R^*-트리의 캐쉬의 크기는 하나의 페이지이며, 테이블 접근 때 사용되는 캐쉬의 크기는 전체 테이블의 크기의 1%를 사용하였다. 매 질의 수행시 시스템의 OS 캐쉬의 영향을 최소화하기 위해 시스템의 메모리보다 큰 파일 스캔 작업을 먼저 수행하였다. 두 알고리즘의 구현 프로그램을 수행시키기 위해서 Pentium 133MHz 프로세서, 128MB의 메모리, 6GB 디스크 드라이브를 갖는 시스템과 운영 체제 PowerLinux 6.0을 사용하였다.Experimental measurements are divided into query execution time, page accesses, and tuple accesses. In all experiments, experimental queries consist of (d-dimensional point, non-spatial property values), and each experimental result performs 50 experimental queries. The mean value of one result. The d-dimensional points were randomly selected and the top 50 attribute values with the highest selectivity according to the Zipfian distribution were selected. The reason for this is that a high selectivity increases the number of tuples having the same attribute value, and therefore, the possibility of using spatial indexes is higher than that of a sequential scan method, which has a high sort cost in query optimization. The query result k has a range of 1, ..., 35, and k is relatively small because sequential search is advantageous when k is relatively large and a practically small k is used [11]. The cache size of the R ^* -tree is one page, and the size of the cache used for table access is 1% of the size of the entire table. In order to minimize the impact of the system's OS cache on every query, file scans larger than the system's memory are performed first. To run the implementation of the two algorithms, we used a system and operating system PowerLinux 6.0 with a Pentium 133MHz processor, 128MB of memory and a 6GB disk drive.

상기 실험은 인위적인 데이타를 사용하였으며, 실험 결과는 R*-트리가 중저차원에서 효율적이므로인 공간 데이타를 중심으로 수행된 것이다. 질의 결과 k에 따른 성능 평가: 테이블 DISC의 color 속성값의 차원, artist의 속성값의 Zipfian 변수 z=0.5, 전체 투플의 수, 시그니쳐 분할 함수,시그니쳐의 비트 수 F=64일 때, 질의 수행시간, 페이지 접근 수, 투플 접근 수는 각각 도 7a, 7b, 7c과 같다.The experiment used artificial data, and the experimental results indicate that the R * -tree is efficient This is done based on spatial data. Performance Evaluation According to Query Result k: Dimension of Color Attribute Value of Table DISC , the Zipfian variable z = 0.5 for the attribute value of the artist, the total number of tuples , Signature split function When the number of bits of the signature F = 64, the query execution time, the number of page accesses, and the number of tuple accesses are shown in FIGS. 7A, 7B, and 7C, respectively.

도 7a에서 투플 접근 수는 k개의 질의 결과를 얻기까지 전체 후보 투플의 수와 같다. 두 알고리즘 모두 k가 커질수록 접근해야 하는 투플 수가 선형적으로 증가하지만, 직선의 기울기에서 크게 차이가 난다. 그 이유는 RStreeINN 알고리즘이 상기에서 설명한 S-트리의 제거 효과를 가지기 때문이다. 도 7b에서 두 알고리즘의 페이지 접근 수는 도 7a의 투플 접근 수보다 약간 큰 것을 알 수 있다. 그 이유는 두 알고리즘이 R-트리 또는 S-트리의 페이지를 추가적으로 접근하기 때문이다. 두 알고리즘 모두 접근 페이지 수가 k가 커짐에 따라 선형적으로 증가하며, 도 7a의 그래프와 거의 닮은 꼴을 유지한다. 그 이유는, 실험 데이타가 상대적으로 작은 경우 접근된 페이지 수는 공간 인덱스보다 투플 접근 비용에 더 영향을 받기 때문이다. 도 7b의 페이지 접근 수에서의 차이는 도 7c의 질의 수행 시간에 그대로 반영된다. 그러나, 도 7c에서 도 7b의 페이지 접근 수의 차이에서 예상할 수 있는 두 알고리즘의 성능 차이의 폭이 줄어든 것을 알 수 있다. 그 이유는, RStreeINN 알고리즘에서는 R-트리와 S-트리를 교대로 접근해야 하므로 랜덤 디스크 I/O가 발생하여 추가적인 비용을 더 지불해야 하기 때문이다.In FIG. 7A, the tuple access number is equal to the total number of candidate tuples until k query results are obtained. Both algorithms linearly increase the number of tuples that need to be approached as k increases, but differ greatly in the slope of the straight line. This is because the RStreeINN algorithm has the above-described elimination effect of the S-tree. In FIG. 7B, the number of page accesses of the two algorithms is slightly larger than the number of tuple accesses in FIG. 7A. This is because the two algorithms additionally access pages of the R-tree or S-tree. Both algorithms increase linearly as the number of access pages increases, k, and remains almost similar to the graph of FIG. 7A. The reason is that the number of pages accessed is more affected by the tuple access cost than the spatial index when the experimental data is relatively small. The difference in the number of page accesses of FIG. 7B is reflected in the query execution time of FIG. 7C. However, it can be seen from FIG. 7C that the width of the performance difference of the two algorithms that can be expected from the difference in the number of page accesses of FIG. 7B is reduced. The reason is that the RStreeINN algorithm requires the R-tree and the S-tree to be accessed alternately, resulting in additional cost due to random disk I / O.

시그니쳐 분할 함수에 따른 성능 평가에 대하여 설명하면 다음과 같다.Signature split function The performance evaluation according to the following description is as follows.

RStreeINN 알고리즘에서 S-트리 노드의 각 엔트리가 가질 수 있는 시그니쳐의 수에 따라 전체 성능이 달라질 수 있다. 전체 성능은 시그니쳐의 수에 비례한다는 것을 예상할 수 있다. 오버플로우 페이지가 생기지 않는 범위내에서 시그니쳐분할 함수 f(l)에 따른 실험을 수행하였으며, 그 결과는 도 8에 표시되어 있다.In the RStreeINN algorithm, the overall performance may vary depending on the number of signatures that each entry of an S-tree node can have. It can be expected that the overall performance is proportional to the number of signatures. The experiment was performed according to the signature division function f (l) within the range of no overflow page, and the result is shown in FIG. 8.

도 8a에서 RStreeINN 기법의 경우 시그니쳐 분할 수가 증가함에 따라 페이지 접근 수가 상당히 감소함을 알 수 있으며, RtreeINN 기법과 비교할 때 큰 폭의 차이를 보인다. 이러한 차이는 도 8b의 질의 수행 시간에 그대로 반영된다. 그러나,인 경우 RStreeINN 알고리즘이 RtreeINN 알고리즘보다 성능이 나쁘다. 그 이유는, RStreeINN 알고리즘에서는 S-트리를 통해 얻는 이익보다 S-트리의 접근으로 발생하는 랜덤 디스크 I/O에 의한 손해가 더 크기 때문이다. 그림 8c는 k=10일 때 f(l)에 따른 큐의 크기를 나타내며, NoSign은 RtreeINN 알고리즘에, base>2인 경우는 RStreeINN 알고리즘에 해당한다. 그림에서 아래쪽과 윗쪽 박스의 숫자는 각각 큐에 삽입된 R-트리 노드의 수와 전체 큐 크기를 의미하며, 두 값의 차는 큐에 삽입된 객체(또는 TID)의 수를 의미한다. 도 8c에서 시그니쳐 분할 수가 커짐에 따라 큐에 삽입된 R-트리 노드와 객체의 수가 줄어듬을 알 수 있으며, S-트리에서 시그니쳐 포화와 팬텀 효과가 시그니쳐 분할에 의해 완화되었음을 의미한다. 그리고, 큐 크기의 차이는 성능의 차이로 나타남을 알 수 있다. 이것이 도 8a-8b에 표시되어 있다.In FIG. 8A, it can be seen that the number of page accesses decreases significantly as the number of signature divisions increases, and the difference is large compared to the RtreeINN technique. This difference is reflected in the query execution time of FIG. 8B. But, RStreeINN algorithm performs worse than RtreeINN algorithm. The reason is that the RStreeINN algorithm suffers more damage from random disk I / O due to S-tree access than the benefit from S-tree. Figure 8c shows the queue size according to f (l) when k = 10. NoSign corresponds to the RtreeINN algorithm, and base> 2 corresponds to the RStreeINN algorithm. The numbers in the bottom and top boxes in the figure represent the number of R-tree nodes inserted into the queue and the total queue size, respectively, and the difference between the two values represents the number of objects (or TIDs) inserted in the queue. In FIG. 8C, as the number of signature partitions increases, the number of R-tree nodes and objects inserted into a queue decreases, indicating that signature saturation and phantom effects in the S-tree are alleviated by signature partitioning. In addition, it can be seen that the difference in queue size is represented by a difference in performance. This is shown in Figures 8A-8B.

데이타 크기에 따른 성능 평가결과는 다음과 같다.Performance evaluation results according to the data size are as follows.

상기 실험에서 실험 변수값은비트를 사용하였다. 도 9를 참고하면, 데이타 크기가 커짐에 따라 RStreeINN 알고리즘이 RtreeINN 알고리즘보다 성능이 우수함을 알 수 있다. 그리고, RtreeINN 알고리즘에서는 데이타 크기가 커짐에 따라 질의 수행 시간도 비교적 큰 기울기로 증가하지만, 반면에 RStreeINN 알고리즘에서는 그래프의 기울기가 완만하다. 이 사실에서 RStreeINN 알고리즘은 데이타 크기에 상관없이 그 성능을 보장함을 알 수 있다.In the above experiment, the experimental variable value is Bit was used. Referring to FIG. 9, as the data size increases, the RStreeINN algorithm performs better than the RtreeINN algorithm. In the RtreeINN algorithm, as the data size increases, the query execution time also increases with a relatively large slope, whereas in the RStreeINN algorithm, the slope of the graph is gentle. This fact shows that the RStreeINN algorithm guarantees its performance regardless of the data size.

공간 데이타의 차원(d)에 따른 성능 평가실험에서, 실험 변수 값은 비트를 사용하였으며, d=2인 경우 S-트리에서 각 노드마다 하나의 오버플로우 페이지가 발생하였다. 도 10에서, 각 차원에 대해 RStreeINN이 RtreeINN보다 우수한 것을 알 수 있다. 그러나, 차원이 증가할수록 성능의 차이는 감소한다. 그 이유는, 고차원 R-트리에서는 저차원과 달리 데이타 노드의 용량이 작아서 질의의 비공간 속성값이 전체 데이타 노드에 고르게 분포하여 "시그니쳐 검사"를 통과할 확률이 높게 되어 RStreeINN가 접근하는 R-트리 노드의 수가 증가하기 때문이다In the performance evaluation experiment according to the dimension (d) of the spatial data, Bit is used, and when d = 2, one overflow page is generated for each node in the S-tree. 10, it can be seen that RStreeINN is superior to RtreeINN for each dimension. However, as the dimension increases, the difference in performance decreases. The reason for this is that in high-dimensional R-trees, unlike low-level data nodes, the capacity of data nodes is small, so that the non-spatial attribute values of a query are evenly distributed across all data nodes, which increases the probability of passing the "signature check". This is because the number of tree nodes increases

실세계 데이터세트(Real Dataset)에서의 실험결과를 설명하면 다음과 같다.The experimental results in the real world dataset are as follows.

상기 실험에서는 공간 데이타는 실세계 데이타로서 16 차원의, CAD 모델에 사용하는 고차원 푸리에(Fourier) 점을 사용하였으며, 비공간 데이타는 인위 데이타로서 Zipfian 변수 z=0.5를 갖는 분포를 따른다. 그 이외의 실험 변수 값은비트와 같다. 차원 d=16일 때 R^*-트리의 성능은 효율적이지 못하지만, 두 알고리즘의 성능 비교에는 별 무리가 없으리라 생각한다. 고차원에서는 R-트리 노드의 용량이 작아지기 때문에 두 알고리즘이 중저차원에서와 같은 성능은 보이지 않지만, RStreeINN이 RtreeINN보다 우위에 있음을 알 수 있다. 이것이 도 11에 표시되어 있다.In this experiment, the spatial data uses 16-dimensional, high-dimensional Fourier points used in the CAD model as real-world data, and the non-spatial data follows a distribution having a Zipfian variable z = 0.5 as artificial data. Other experimental variable values Like a bit. The performance of the R ^* -tree is not efficient when the dimension d = 16, but there is no problem in comparing the performance of the two algorithms. In the higher dimension, since the capacity of the R-tree node is smaller, the two algorithms do not show the same performance as the lower and lower dimensions, but RStreeINN is superior to RtreeINN. This is shown in FIG.

상기와 같이 본 발명에 의하면, 종래 데이터베이스에서 질의에 있어서 사용되며, 잘타손(Hjaltason)과 사멧(Samet)이 제안한 R트리에 기반한 점증적 최근접 방법의 문제점인 불필요한 투플발생빈도수를 현저하게 감소시킴으로서 멀티미티어 데이터베이스 또는 지리정보시스템등에서 요구하는 다차원 공간객체집합에 대한 효과적인 질의가 가능하다.As described above, according to the present invention, it is used in a query in a conventional database, and by significantly reducing the number of unnecessary tuple occurrences, which is a problem of the incremental nearest method based on the R tree proposed by Hjaltason and Samet. Effective querying for multidimensional spatial object sets required by multimedia databases or geographic information systems is possible.

Claims

In the tree structure for K-nearest query of a database with non-spatial search conditions,

A tree structure in which the tree structure is composed of a combination of an R tree and an S tree,

The R tree is used as an index for the space attribute,

The S tree is used as an index for a non-spatial attribute, and is composed of a node set having the same hierarchical structure as the R tree, each node is composed of a plurality of entries, and each node of the S tree and an entry constituting the node. RS tree structure for a K-nearest query of a database including a non-spatial search condition, wherein each corresponds to a node and an entry of the R tree one-to-one, and has only an index node level.

The method of claim 1,

At the same time as generating the R tree, an S tree is generated.

The S tree is based on a hierarchical signature file, and the non-spatial search condition is included in that the generation of the signature of the entry of the S tree node is divided into levels (l = 1) and l> 1. RS-tree structure for K-nearest query.

The method of claim 2,

The signature, at level l = 1) obtains the set of all non-spatial attribute values that the subtree of the entry of the R-tree node corresponding to the entry itself can have, and 2) each non-spatial attribute belonging to the set. The value is converted into a signature that is a fixed-length bit stream using a hash function, and 3) it is generated by bitwise ORing all elements of the generated signature set, and at the remaining levels ℓ> 1 RS tree structure for K-nearest query with non-spatial search conditions, which is generated by combining all signatures of child nodes indicated by entries of tree nodes.

The method according to claim 1, wherein

For each of the signature S _i (1≤i≤C) any S- tree nodes existing in the level ℓ has the S _i is, S _i, f satisfying _j (1≤j≤C) RS tree structure for K-nearest query with non-spatial search conditions characterized by being divided into (ℓ) signatures.

The method of claim 4, wherein

The next navigation node at the forefront of the queue , The child of that node La, each Before calculating the distance from to the query object S-tree node corresponding to Signature Wow, The subtree of is a query signature for nonspatial query values. When I say RS tree structure for k-nearest query with non-spatial search conditions characterized by checking whether

In an incremental nearest method based on an R tree that indexes according to a query to search a database,

This method,

Checking which of the child nodes the search node has had been removed from the queue;

And incrementally performing a signature check before inserting a child node into the queue.

The method of claim 6,

The signature check checks the next search node at the forefront of the queue. , The child of that node La, each Before calculating the distance from to the query object S-tree node corresponding to Signature Wow, The subtree of is a query signature for nonspatial query values. When I say Incremental closest method, characterized in that the test to satisfy.