KR102848186B1

KR102848186B1 - System for Anonymizing Transaction Data with Set Values and Method thereof

Info

Publication number: KR102848186B1
Application number: KR1020220151655A
Authority: KR
Inventors: 김순석
Original assignee: 주식회사 이지서티
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2025-08-20
Anticipated expiration: 2042-11-14
Also published as: KR20240070155A

Abstract

본 발명은 본 발명은 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템 및 그 방법에 관한 것으로, 집합값을 갖는 트랜잭션 데이터에 대해 정보 손실을 최소화 하면서도 강력한 프라이버시를 보장할 수 있는 새로운 기법으로, 기존 k-익명성과 동일하게 하향식 파티셔닝 알고리즘을 기본적으로 채택하여 안전성과 CPU 수행시간을 최적화하면서도 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였고, 기존 k-익명성에 비하여 상향 트리 탐색에 대한 시간은 추가되지만 이 시간은 최종 트랜잭션이 익명처리된 이후 남은 트랜잭션들에 대해 정보손실을 최적화하는 일종의 보정 작업이며 기존 k-익명성 알고리즘에 비해 전체 소요시간에 비해 미미하며, 정보 손실 측면에서는 기존 k-익명성이나 HgHs기법 등 현재까지 제안된 어느 기법들보다 매우 우수한 성능을 제공하는데 그 목적이 있다.The present invention relates to a system and method for anonymizing transaction data having a set value, and is a new technique that can guarantee strong privacy while minimizing information loss for transaction data having a set value. It basically adopts a top-down partitioning algorithm like the existing k-anonymity to optimize safety and CPU execution time, and improves the shortcomings of the existing k-anonymity by additionally reallocating transactions through a bottom-up tree search process after the partitioning process. Although the time for the bottom-up tree search is added compared to the existing k-anonymity, this time is a kind of compensation work that optimizes information loss for the remaining transactions after the final transaction is anonymized, and is insignificant compared to the overall time required compared to the existing k-anonymity algorithm. In terms of information loss, it aims to provide much better performance than any technique proposed to date, such as the existing k-anonymity or the HgHs technique.

Description

System for anonymizing transaction data with set values and method thereof

본 발명은 관계형 데이터베이스에서 각각의 속성값이 집합으로 구성된 집합값을 갖는 트랜잭션 데이터에서의 익명처리 시스템 및 방법에 관한 것이다. The present invention relates to a system and method for anonymizing transaction data in which each attribute value in a relational database has a set value.

통상적으로 정형(structured) 데이터는 데이터셋이 관계형 데이터베이스 형태로 구성된다. 예를 들어, 각 레코드에는 한 사람 한 사람 개인의 데이터들이 저장되고(이를 대개 '마이크로 데이터'라 부름) 각 컬럼에는 그 개인의 이름, 나이, 주소 등의 속성값들이 각각 저장되는 형태를 지닌다. Typically, structured data is organized as a dataset in a relational database. For example, each record contains individual data (often referred to as "microdata"), and each column contains attributes such as the individual's name, age, and address.

반정형(semi-structured) 데이터는 예를 들어 마트에서 각종 상품들을 구입하는 경우나 혹은 병원에서 한 환자에 대해 여러 증상들을 갖는 경우 등 다양한 사례가 존재한다. 대개 이러한 형태의 데이터를 반정형(semi-structured) 데이터라 부르며 특히 집합값들을 갖는 이러한 형태를 트랜잭션(transaction) 데이터라 부른다.Semi-structured data exists in a variety of contexts, such as purchasing various products at a supermarket or describing a patient's various symptoms at a hospital. This type of data is generally referred to as semi-structured data, and data containing aggregate values is specifically called transaction data.

그동안 많은 연구자들은 반정형 트랜잭션 데이터에서의 익명처리 문제에 대하여 그 해결 방안들이 제시되어 왔다. Meanwhile, many researchers have proposed solutions to the problem of anonymization in semi-structured transaction data.

예를 들어, 선행문헌 1,'M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacy Preserving Anonymization of Set-valued Data. In: VLDB 2008'과, 선행문헌 2,'M. Terrovitis and D. Tsitsigkos, Amnesia, Institute for the Management of Information Systems, https://amnesia.openaire.eu/ (consulted in September 2020)' 연구에서 Manolis Terrovitis와 Nikos Mamoulis는 공격자가 그 사람이 구매한 항목의 하위 집합에 대해 부분적으로 알고 있는 경우 데이터베이스 D를 직접 공개하면 특정 거래와 관련된 사람의 신원이 공개될 수 있음을 관찰하였다. For example, in the study of prior literature 1, 'M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacy Preserving Anonymization of Set-valued Data. In: VLDB 2008' and prior literature 2, 'M. Terrovitis and D. Tsitsigkos, Amnesia, Institute for the Management of Information Systems, https://amnesia.openaire.eu/ (consulted in September 2020)', Manolis Terrovitis and Nikos Mamoulis observed that directly disclosing a database D can reveal the identity of a person involved in a specific transaction if the attacker has partial knowledge of a subset of the items purchased by that person.

따라서, 이에 대한 해결책으로 k^m-익명성(일명 '글로벌 일반화')을 제안한 바 있다. 즉, 공격자의 최대 지식이 특정 트랜잭션(거래 레코드 집합 중)의 최대 m개 항목이라고 가정할 때, 우리는 그가 데이터베이스 D에 있는 k개의 게시된 트랜잭션 집합과 트랜잭션을 구별하는 것을 방지하고자 하였다. 마찬가지로 m개 이하의 항목 집합에 대해 게시된 데이터베이스 D'에 이 집합을 포함하는 최소한 k개의 트랜잭션이 있어야 한다. 익명성에 대한 이러한 정의는 집합 값 데이터의 경우 공격자가 사전에 알고 있는 민감한 집합 값 중 일부를 사용하여 개인을 식별할 수 있다는 사실을 강조하고 있다. Therefore, as a solution to this, we proposed k ^m -anonymity (aka 'global generalization'). That is, assuming that the attacker's maximum knowledge is at most m items of a particular transaction (out of a set of transaction records), we want to prevent him from distinguishing the transaction from the set of k published transactions in database D. Similarly, for any set of items less than m, there must be at least k transactions in the published database D' that contain this set. This definition of anonymity emphasizes that for set-valued data, an attacker can identify individuals using some of the sensitive set values that he knows in advance.

그러나 예를 들어, 어떤 경우에는 트랜잭션에서 공격자가 알 수 있는 항목의 수에 대한 경계를 미리 결정하는 것이 불가능할 수도 있다. 이 경우 m에 대한 안전한 값을 선택하는 것 자체가 불가능하다. 또 다른 예는 집합의 일부 항목을 사용하여 특정 거래가 개인과 연결되지 않도록 제외할 수 있는 경우 발생한다. 예를 들어 공격자는 일반적으로 65세 이상의 사람들만 일부 항목을 구매하고 특정 지역에 있는 사람들만 다른 항목을 구매한다는 것을 알고 있을 수 있다. 이러한 경우 k^m-익명성은 공격에 대한 보호를 제공할 수 없다.However, for example, in some cases, it may be impossible to predetermine a bound on the number of items an attacker can learn in a transaction. In this case, choosing a secure value for m itself is impossible. Another example occurs when some items in a set can be used to exclude certain transactions from being linked to individuals. For example, an attacker might know that only people over 65 typically purchase some items, and only people in a specific geographic area purchase other items. In such cases, k- ^m -anonymity cannot provide protection against the attack.

한편, 선행문헌 4, 'Y. Xu, B. C. M. Fung, K. Wang, A. W.-C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM (2008)' 와, 선행문헌 5,'Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In KDD (2008)' 논문에서 Y. Xu 등은 집합값 데이터를 익명화하기 위해 보다 정교한 프라이버시 기준인 (h, k, p)-일관성을 제안한 바 있다. (h, k, p)-일관성은 민감하지 않은 p 항목 조합에 대해 이러한 항목을 포함하는 데이터베이스에 최소한 k개의 트랜잭션이 있고 최대 h%의 트랜잭션에 일부 민감한 항목이 포함되어 있음을 보장한다. 즉, 매개변수 p를 사용하여 공격자의 사전 지식을 모델링하여 익명화의 유연성을 제공하고 있다. k^m-익명성은 h=100% 및 p=m인 특수한 경우라 볼 수 있다. 그러나 이 방법은 안전성을 강화하기 위하여 분포가 비교적 적은 전체 항목들에 대해 삭제(suppression) 기술을 사용함으로써 높은 정보 손실을 갖는 문제가 있다. 우리는 다음 장에서 이러한 사례에 대한 예시를 제공할 것이다. 또한 이 모델은 준식별자와 민감정보에 대한 분류가 존재한다는 가정을 갖고 있어 그렇지 않은 문제에는 직접 적용을 할 수 없다.Meanwhile, in the papers of previous literature 4, 'Y. Xu, BCM Fung, K. Wang, AW-C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM (2008)' and previous literature 5, 'Y. Xu, K. Wang, AWC Fu, and P.S. Yu. Anonymizing transaction databases for publication. In KDD (2008)', Y. Xu et al. proposed (h, k, p)-consistency, a more sophisticated privacy criterion for anonymizing set-valued data. (h, k, p)-consistency ensures that for any combination of p insensitive items, there are at least k transactions in the database containing these items, and at most h% of the transactions contain some sensitive items. That is, it provides flexibility in anonymization by modeling the attacker's prior knowledge using the parameter p. k ^m -anonymity can be viewed as a special case where h = 100% and p = m. However, this method suffers from high information loss due to the use of suppression techniques for relatively small distributions of all items to enhance security. We will provide examples of such cases in the next chapter. Furthermore, this model assumes the existence of quasi-identifiers and classifications for sensitive information, making it inapplicable to problems where this is not the case.

선행문헌 6,'Yeye He'와 'Jeffrey F. Naughton'은 'Y. He and J. Naughton, Anonymization of Set-valued Data via Top-down Local Generalization. In: VLDB 2009 (2009)' 논문에서 위 연구들에서 제안한 이러한 문제들을 해결하기 위하여 소위 '로컬 일반화' 기법이라 부르는 k-익명성 기법을 제안하였다. In the previous literature 6, 'Yeye He' and 'Jeffrey F. Naughton' proposed the k-anonymity technique, also called the 'local generalization' technique, to solve the problems suggested in the above studies in the paper 'Y. He and J. Naughton, Anonymization of Set-valued Data via Top-down Local Generalization. In: VLDB 2009 (2009)'.

이 기법은 각 트랜잭션이 적어도 k-1개의 다른 트랜잭션과 동일하면 k-익명성을 만족하는 것으로 정의하였다. 먼저 k^m-익명성은 공격자가 m개 이하의 항목을 알고 있는 경우에만 개인의 프라이버시를 보호하는 반면 k-익명성은 매개변수 m이 없을 때 공격자가 알 수 있는 항목 수에 제한이 필요하지 않다. 일반적으로 k^m-익명성의 m이 작을수록 k^m-익명성이 제공하는 프라이버시가 약해질 수 밖에 없다. 그러나 k-익명성은 k^m-익명성보다 보다 강력한 프라이버시를 보장한다. 또한 k^m-익명성이 상향식 접근이라면 k-익명성은 하향식 기반의 그리디(greedy) 파티셔닝(트리 분리) 알고리즘을 사용하여 기존 기법들보다 CPU 수행 시간이 짧다는 장점이 있다. This technique is defined as satisfying k-anonymity if each transaction is identical to at least k-1 other transactions. First, k ^m -anonymity protects an individual's privacy only when the attacker knows m or fewer items, whereas k-anonymity does not require a limit on the number of items that the attacker can know when there is no parameter m. In general, the smaller m of k ^m -anonymity, the weaker the privacy provided by k ^m -anonymity. However, k-anonymity guarantees stronger privacy than k ^m -anonymity. In addition, while k ^m -anonymity is a bottom-up approach, k-anonymity has the advantage of shorter CPU execution time than existing techniques because it uses a greedy partitioning (tree splitting) algorithm based on a top-down approach.

여기서, k-익명성 알고리즘은 기본적으로 하향식 그리디 분할 알고리즘(top-down greedy partitioning algorithm)을 이용하여 일반화 계층 트리에서 그룹화되어야 하는 트랜잭션을 결정하는데 사용하고 있다.Here, the k-anonymity algorithm is basically used to determine which transactions should be grouped in the generalized hierarchical tree by using a top-down greedy partitioning algorithm.

도 1은 원본 데이터베이스 D에서의 2-익명성(로컬일반화) 및 일반화 트리 계층 예시도로서, 계층 구조의 루트 수준으로 모든 트랜잭션을 일반화하는 것으로 시작한다. 이것은 시작점으로 모든 트랜잭션이 루트로 일반화된 후 동일한 표현("ALL")을 공유하기 때문에 데이터베이스에 최소한 k개의 트랜잭션이 있는 한 항상 하나의 파티션으로 익명화를 생성한다. 이 시작점에서 초기 파티션을 다음 Anonymize 루틴으로 전달한다. 이 루틴에서는 현재 파티션을 하위 파티션으로 만들고 모든 결과 하위 파티션에서 Anonymize를 재귀적으로 호출한다. 더 이상 분할할 수 없으면 분할 프로세스가 종료된다. Figure 1 illustrates a two-part anonymization (local generalization) and generalization tree hierarchy in the original database D. It begins by generalizing all transactions to the root level of the hierarchy. This creates anonymized partitions as long as there are at least k transactions in the database, since all transactions, after generalizing to the root, share the same representation ("ALL"). From this starting point, the initial partition is passed to the Anonymize routine, which subpartitions the current partition and recursively calls Anonymize on all resulting subpartitions. The partitioning process terminates when no further partitioning is possible.

상기 k-익명성은 기본적으로 하향식 파티션 접근 방식을 채택하고 있기 때문에 트랜잭션 각 항목들의 일반화 트리 구조에서 한번 파티션된 도메인 내에서만 그리고 특히 유일값을 갖는 항목들에 대해서도 동일한 일반화를 적용하고 있어 정보 손실이 크다는 단점이 있다. Junqiang Liu와 Ke Wang은 선행문헌 8의 논문 'J. Liu and K. Wang, Anonymizing Transaction Data by Integrating Suppression and Generalization, In: Zaki, M.J., Yu, J. X., Ravindran, B., and Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer (2010)'에서 k-익명성의 상기와 같은 단점을 지적하여 새로운 기법인 HgHs(Heuristic generalization with Heuristic suppression) 알고리즘을 제안한 바 있다. The above k-anonymity basically adopts a top-down partitioning approach, so it has a disadvantage of significant information loss because it applies the same generalization only within the partitioned domain in the generalization tree structure of each transaction item, and especially for items with unique values. Junqiang Liu and Ke Wang pointed out the above shortcomings of k-anonymity in their paper 'J. Liu and K. Wang, Anonymizing Transaction Data by Integrating Suppression and Generalization, In: Zaki, M.J., Yu, J. X., Ravindran, B., and Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer (2010)' and proposed a new technique, the HgHs (Heuristic generalization with Heuristic suppression) algorithm.

상기 HgHs 알고리즘 기법은 k^m-익명성을 강화하기 위해 글로벌 일반화 기법[1,2]과 전체 항목 삭제 기법을 통합한 것이다. 세부 알고리즘은 2단계로 일반화 컷과 컷의 삭제 시나리오(SS, Suppression Scenario)로 이루어진다. [단계 1]은 'full subtree generalization'이라 불리우는 외부 루프로 일반화 계층 트리에서 정보 손실이 가장 적은 최적의 컷(cut)을 탐색한다. [단계 2]는 total item suppression이라 불리우는 내부 루프로 컷 내 서브트리에서 k^m-익명성을 만족하는지를 평가해서 만족하지 않는 항목들을 삭제하는 과정이다. The above HgHs algorithm technique integrates the global generalization technique [1,2] and the total item deletion technique to strengthen k ^m -anonymity. The detailed algorithm consists of two steps: generalization cut and cut deletion scenario (SS). [Step 1] is an outer loop called 'full subtree generalization', which searches for the optimal cut with the least information loss in the generalization hierarchy tree. [Step 2] is an inner loop called total item suppression, which evaluates whether the k ^m -anonymity is satisfied in the subtree within the cut and deletes items that do not satisfy it.

상기 HgHs 기법은 일반화 트리 구조에서 휴리스틱한 최적의 컷팅(트리 분리) 지점을 찾아 일반화와 삭제 기법을 적용한 것으로 k-익명성 기법에 비해 정보 손실은 적지만 CPU 수행 시간이 매우 길다는 단점이 있었다. The above HgHs technique finds the heuristic optimal cutting (tree splitting) point in the generalized tree structure and applies generalization and deletion techniques. Compared to the k-anonymity technique, it has less information loss, but has the disadvantage of very long CPU execution time.

1. M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacy Preserving Anonymization of Set-valued Data. In: VLDB 2008 (2008)1. M. Terrovitis, N. Mamoulis, and P. Kalnis, Privacy Preserving Anonymization of Set-valued Data. In: VLDB 2008 (2008) 2. M. Terrovitis and D. Tsitsigkos, Amnesia, Institute for the Management of Information Systems, https://amnesia.openaire.eu/ (consulted in September 2020)2. M. Terrovitis and D. Tsitsigkos, Amnesia, Institute for the Management of Information Systems, https://amnesia.openaire.eu/ (consulted in September 2020) 3. M. Cunha, R. Mendes, and J. P. Vilela, A survey of privacy-preserving mechanisms for heterogeneous data types, Computer Science Review vol 41 (2021)3. M. Cunha, R. Mendes, and J. P. Vilela, A survey of privacy-preserving mechanisms for heterogeneous data types, Computer Science Review vol 41 (2021) 4. Y. Xu, B. C. M. Fung, K. Wang, A. W.-C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM (2008)4. Y. Xu, B. C. M. Fung, K. Wang, A. W.-C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM (2008) 5. Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In KDD (2008)5. Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In K.D.D. (2008) 6. Y. He and J. Naughton, Anonymization of Set-valued Data via Top-down Local Generalization. In: VLDB 2009 (2009)6. Y. He and J. Naughton, Anonymization of Set-valued Data via Top-down Local Generalization. In: VLDB 2009 (2009) 7. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of ACM SIGMOD (2000)7. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of ACM SIGMOD (2000) 8. J. Liu and K. Wang, Anonymizing Transaction Data by Integrating Suppression and Generalization, In: Zaki, M.J., Yu, J. X., Ravindran, B., and Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer (2010)8. J. Liu and K. Wang, Anonymizing Transaction Data by Integrating Suppression and Generalization, In: Zaki, M.J., Yu, J. X., Ravindran, B., and Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer (2010)

따라서, 본 발명은 종래의 기술의 문제점을 개선하기 위하여, 집합값을 갖는 트랜잭션 데이터에 대해 정보 손실을 최소화하면서도 강력한 프라이버시를 보장할 수 있는 새로운 익명처리 시스템 및 그 방법을 제공하는데 그 목적이 있다. 또한, 기존 k-익명성과 동일하게 하향식 파티셔닝 알고리즘을 기본적으로 채택하여 안전성과 CPU 수행시간을 최적화하면서도 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였고, 기존 k-익명성에 비하여 상향 트리 탐색에 대한 시간은 추가되지만 이 시간은 최종 트랜잭션이 익명처리된 이후 남은 트랜잭션들에 대해 정보손실을 최적화하는 일종의 보정 작업이며 기존 k-익명성 알고리즘에 비해 전체 소요시간에 비해 미미하며, 정보 손실 측면에서는 기존 k-익명성이나 HgHs기법 등 현재까지 제안된 어느 기법들보다 매우 우수한 성능을 보이는 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템 및 그 방법을 제공하는데 그 목적이 있다.Accordingly, the present invention aims to provide a novel anonymization system and method for transaction data having aggregate values, which can guarantee strong privacy while minimizing information loss, in order to improve the problems of the prior art. In addition, the present invention basically adopts a top-down partitioning algorithm, similar to the existing k-anonymity, to optimize security and CPU execution time, while simultaneously reallocating transactions through an additional bottom-up tree search process after the partitioning process, thereby improving the shortcomings of the existing k-anonymity. Although the time for the bottom-up tree search is additional compared to the existing k-anonymity, this time is a kind of compensation process that optimizes information loss for the remaining transactions after the final transaction is anonymized, and is insignificant compared to the overall time required compared to the existing k-anonymity algorithm. In terms of information loss, the present invention aims to provide a system and method for anonymizing transaction data having aggregate values, which exhibits far superior performance compared to any other technique proposed to date, such as the existing k-anonymity or the HgHs technique.

본 발명의 목적을 달성하기 위한 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템은 관계형 데이터베이스에서 개인정보가 포함된 트랜잭션 데이터를 수집하고, 중앙처리장치(CPU)를 통해 데이터 익명처리를 수행하는 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템에 있어서, 상기 수집된 트랜잭션 데이터를 익명처리 잡(Job) 할당을 수행하는 익명처리 매니저; 상기 익명처리 매니저로부터 할당된 잡에 대한 데이터 익명처리를 익명처리 잡 스케줄러(Job Scheduler)에 의해 수행하는 중앙처리장치(CPU); 상기 중앙처리장치에서 익명처리가 완료된 익명 트랜잭션 데이터를 저장 및 보관하고 허가된 사용자 단말기로 이를 제공하는 스토리지;를 포함하여 이루어진 것을 특징으로 한다. In order to achieve the object of the present invention, a system for anonymizing transaction data having a set value is provided, which collects transaction data containing personal information from a relational database and performs data anonymization through a central processing unit (CPU), and is characterized by comprising: an anonymization manager for assigning anonymization jobs to the collected transaction data; a central processing unit (CPU) for performing data anonymization for jobs assigned from the anonymization manager through an anonymization job scheduler; and a storage for storing and preserving anonymous transaction data for which anonymization has been completed in the central processing unit and providing the same to an authorized user terminal.

여기서, 상기 중앙처리장치의 익명처리 수행은 하향식 파티셔닝 알고리즘 기반으로 수행하며, 그 이후 하향식 트리 탐색과정을 통해 트랜잭션을 재할당하는 것을 특징으로 한다.Here, the anonymization processing of the central processing unit is performed based on a top-down partitioning algorithm, and is characterized by reallocating transactions through a top-down tree search process thereafter.

본 발명의 목적을 달성하기 위한 집합값을 갖는 트랜잭션 데이터의 익명처리 방법은 관계형 데이터베이스에서 개인정보가 포함된 트랜잭션 데이터를 익명처리 매니저를 통해 수집되고, 중앙처리장치(CPU)를 통해 데이터 익명처리를 수행하는 집합값을 갖는 트랜잭션 데이터의 익명처리 방법에 있어서, 상기 익명처리 매니저는 상기 관계형 데이터베이스에서 수집된 트랜잭션 데이터를 잡 스케줄러(Job Scheduler)에 익명처리 잡 할당을 수행하는 제1과정; 상기 중앙처리장치는 상기 잡 스케줄러에 할당된 잡에 대한 익명처리를 수행하는 제2과정; 및 상기 중앙처리장치에서 익명처리가 완료된 익명 트랜잭션 데이터를 스토리지에 저장하는 제3과정;을 포함하여 이루어진 것을 특징으로 한다.A method for anonymizing transaction data having a set value to achieve the object of the present invention is characterized in that the method comprises: a first step in which transaction data including personal information is collected from a relational database through an anonymization manager, and a central processing unit (CPU) performs data anonymization, wherein the anonymization manager performs an anonymization job assignment for the transaction data collected from the relational database to a job scheduler; a second step in which the central processing unit performs anonymization for the job assigned to the job scheduler; and a third step in which the central processing unit stores anonymous transaction data, on which anonymization has been completed, in storage.

여기서, 상기 제2과정은 처리 대상 트랜잭션 항목들에 대한 일반화 트리를 구성하고 파티션 과정을 수행하는 1단계; 상기 제1단계에서 분할된 파티션 중 첫번째 파티션부터 순차적으로 선택하는 2단계; 상기 선택된 파티션을 트리의 하위 노드들로 확장하는 3단계; 상기 확장된 파티션에 각각의 트랜잭션들을 할당하는 4단계; 상기 할당된 트랜잭션들에 대하여 사전에 주어진 k-익명성 프라이버시 모델을 만족하는지를 검토하기 위한 밸런스 파티션(Balance_patrition()) 루틴을 수행하는 5단계; 및 상기 5단계에서 검토결과 K-익명성 프라이버시 모델(K-anonymity privacy model) 을 만족할 경우 정보손실 최소화를 위하여 남은 트랜잭션들에 대하여 재처리를 수행한 후 익명처리를 종료하도록 최종 파티션(Final_partition()) 루틴을 수행하는 6단계;를 포함하여 이루어진 것을 특징으로 한다.Here, the second process is characterized by comprising: a first step of constructing a generalized tree for transaction items to be processed and performing a partitioning process; a second step of sequentially selecting a first partition among the partitions divided in the first step; a third step of expanding the selected partition to lower nodes of the tree; a fourth step of assigning each transaction to the expanded partition; a fifth step of performing a balance partition (Balance_patrition()) routine to check whether the assigned transactions satisfy a pre-given k-anonymity privacy model; and a sixth step of performing a final partition (Final_partition()) routine to terminate anonymization processing after performing reprocessing on the remaining transactions in order to minimize information loss if the K-anonymity privacy model is satisfied as a result of the review in the fifth step.

또한, 상기 제5단계는 K-익명성을 만족하지 못할 경우 해당 트랜잭션(t)을 분할된 파티션 내에서 할당할 수 있는 다른 파티션이 있는지를 검토하되, K-익명성을 만족할 때까지 해당 트랜잭션(tj)을 해당 파티션에 할당 하는 단계;를 더 포함하며, K-익명성을 만족하지 못할 경우 해당 트랜잭션(t)을 분할된 파티션 내에서 할당할 수 있는 다른 파티션이 있는지를 검토하되, 해당 트랜잭션(t)를 분할된 파티션 내에서 할당할 수 있는 다른 파티션이 없을 경우 해당 트랜잭션(tj)를 대기큐(WaitForQ, 이하 'WFQ')에 할당하는 단계;를 더 포함하며, 상기 K-익명성 검토에 만족할 경우 더 이상의 확장할 노드가 없을 때까지 상기 제1 내지 제4단계를 반복 수행하며, 상기 더 이상 확장할 노드가 없으면 최종 큐 루틴(FinalQ_data())을 수행하여 최종 일반화된 일종의 후보 트랜잭션과 이에 대한 파티션을 별도의 최종 큐(이하 'FinalQ')에 저장하는 것을 특징으로 한다.In addition, the fifth step further includes a step of checking whether there is another partition to which the transaction (t) can be assigned within the partitioned partition if K-anonymity is not satisfied, and assigning the transaction (tj) to the partition until K-anonymity is satisfied; and a step of checking whether there is another partition to which the transaction (t) can be assigned within the partitioned partition if K-anonymity is not satisfied, and assigning the transaction (tj) to a waiting queue (WaitForQ, hereinafter referred to as 'WFQ') if there is no other partition to which the transaction (t) can be assigned within the partitioned partition; and if the K-anonymity check is satisfied, the first to fourth steps are repeatedly performed until there are no more nodes to be expanded, and if there are no more nodes to be expanded, a final queue routine (FinalQ_data()) is performed to store a final generalized candidate transaction and its partition in a separate final queue (hereinafter referred to as 'FinalQ').

또한, 상기 제6단계는 WFQ에 저장된 각 파티션들을 상향식으로 탐색하면서 FinalQ에 저장된 파티션과 매칭하여 K-익명성을 만족하는 파티션이 존재하는지를 체크하는 과정을 수행하되, 만족하는 파티션이 있을 경우 해당 트랜잭션과 파티션은 WFQ에서 FinalQ로 이동하는 것을 특징으로 한다.In addition, the sixth step is characterized by performing a process of checking whether there is a partition that satisfies K-anonymity by matching the partitions stored in FinalQ with the partitions stored in WFQ while searching each partition stored in WFQ in a bottom-up manner, and if there is a partition that satisfies it, the transaction and partition are moved from WFQ to FinalQ.

본 발명에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템 및 그 방법은 종래의 k-익명성(로컬일반화)와 동일하게 하향식 파티셔닝 알고리즘을 기본적으로 채택하여 안전성과 중앙처리장치(CPU) 수행시간을 최적화하면서도 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였다. The anonymization system and method for transaction data having a set value according to the present invention basically adopt a top-down partitioning algorithm similar to the conventional k-anonymity (local generalization) to optimize safety and central processing unit (CPU) execution time, while improving the shortcomings of the conventional k-anonymity by additionally reallocating transactions through a bottom-up tree search process after the partitioning process.

또한, 종래의 k-익명성은 기본적으로 하향식 파티션 접근 방식을 채택하고 있기 때문에 트랜잭션 각 항목들의 일반화 트리 구조에서 한번 파티션된 도메인 내에서만 그리고 특히 유일값을 갖는 항목들에 대해서도 동일한 일반화를 적용하고 있어 정보 손실이 크다는 단점이 있으나, 본 발명은 하향식 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였다.In addition, since the conventional k-anonymity basically adopts a top-down partitioning approach, it has a disadvantage of significant information loss because it applies the same generalization only within a partitioned domain and especially to items with unique values in the generalization tree structure of each transaction item. However, the present invention improves the disadvantages of the existing k-anonymity by reassigning transactions through an additional bottom-up tree search process after the top-down partitioning process.

또한, 종래의 k-익명성에 비하여 상향 트리 탐색에 대한 시간은 추가되지만 이 시간은 최종 트랜잭션이 익명처리된 이후 남은 트랜잭션들에 대해 정보손실을 최적화하는 일종의 보정 작업이며 기존 k-익명성 알고리즘에 비해 전체 소요시간에 비해 미미하며 종래의 HgHs 알고리즘 기법에 비해서는 매우 높은 처리 시간을 보일 것으로 예상된다. 이에 반해 정보 손실 측면에서는 기존 k-익명성이나 HgHs기법 등 현재까지 제안된 어느 기법들보다 매우 우수한 성능을 보인다.Furthermore, although the time for the upward tree search is increased compared to the conventional k-anonymity, this time is a kind of compensation operation that optimizes the information loss for the remaining transactions after the final transaction is anonymized. Compared to the existing k-anonymity algorithm, the total required time is minimal, and it is expected to show a much higher processing time than the conventional HgHs algorithm technique. In contrast, in terms of information loss, it shows much better performance than any other technique proposed to date, such as the conventional k-anonymity or HgHs technique.

또한, k-익명성(로컬일반화) 알고리즘을 채택하였지만, HgHs 알고리즘에 비하여 정보 손실이 적어 공개된 데이터 분석시 분석가에게 보다 많은 유용성을 제공하고 있다. CPU 실행 시간 측면에 있어서는 기본적으로 너비 우선 검색(breadth-first search) 접근 방식을 채택하고 있는 HgHs 알고리즘에 비하여 훨씬 빠른 성능을 보이는 반면에 k-익명성(로컬일반화) 알고리즘에 비하여 Final_partition() 루틴 과정을 추가로 거쳐야 되기 때문에 그 만큼의 비용이 추가로 드는 단점이 있으나, 이 과정은 이미 계층 트리의 확장이 완료된 이후 남은 일부 트랜잭션들에 대해 수행되는 것이기에 수행시간이 크게 높지는 않을 것으로 예상된다.In addition, although it adopts the k-anonymity (local generalization) algorithm, it provides more usefulness to analysts when analyzing public data because it has less information loss compared to the HgHs algorithm. In terms of CPU execution time, it shows much faster performance than the HgHs algorithm, which basically adopts the breadth-first search approach, but it has the disadvantage of incurring additional cost because it has to go through the Final_partition() routine process compared to the k-anonymity (local generalization) algorithm. However, since this process is performed on some of the remaining transactions after the expansion of the hierarchical tree is completed, the execution time is not expected to be significantly high.

한편, 안전성 측면에서 볼 때 k-익명성(로컬일반화)와 HgHs 알고리즘과 동일하게 k-익명성의 안전성 조건을 만족하고 있어 동일한 성능을 갖는다. 또한 이 성능은 앞서 서두에서도 언급했듯이 기존 2-^∞익명성(글로벌일반화)와 2-^∞익명성(삭제) 알고리즘에 비해 훨씬 우수한 안전성을 가지고 있다. 즉, k^m-익명성은 공격자가 m개 이하의 항목을 알고 있는 경우에만 개인의 프라이버시를 보호하는 반면 k-익명성은 매개변수 m이 없을 때 공격자가 알 수 있는 항목 수에 제한이 필요하지 않다. 일반적으로 k^m-익명성의 m이 작을수록 k^m-익명성이 제공하는 프라이버시가 약해질 수 밖에 없다. 그러나 k-익명성은 k^m-익명성보다 보다 강력한 프라이버시를 보장하는 효과가 있다.Meanwhile, in terms of security, it satisfies the security conditions of k-anonymity in the same way as k-anonymity (local generalization) and the HgHs algorithm, so it has the same performance. In addition, as mentioned in the introduction, this performance is much better than the existing 2- ^∞ anonymity (global generalization) and 2- ^∞ anonymity (deletion) algorithms. That is, k ^m -anonymity protects an individual's privacy only when the attacker knows m or fewer items, whereas ^k -anonymity does not require a limit on the number of items that the attacker can know when there is no parameter m. In general, the smaller m in k m -anonymity, the weaker the privacy provided by k ^m -anonymity. However, k-anonymity has the effect of guaranteeing stronger privacy than k ^m -anonymity.

도 1은 종래기술에 따른 원본 데이터베이스 D에서의 2-익명성(로컬일반화) 및 일반화 트리 계층 예시도이고,
도 2는 본 발명의 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템의 구성도이고,
도 3은 본 발명의 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 과정의 전체 흐름도이고,
도 4는 도 3에서 익명성 검토 및 트랜잭션 할당을 확정하는 과정과 최종 파티션 루틴에 대한 흐름도이고,
도 5a 내지 5d는 원본 데이터베이스 D에 대하여 제안 알고리즘을 적용한 예시를 나타낸 것이고,
도 6은 하향식 로컬일반화 상향식 일반화 계층 트리 탐색을 이용한 제안 알고리즘 의사코드를 표시한 도면이다.Figure 1 is an example of a 2-anonymity (local generalization) and generalization tree hierarchy in the original database D according to the prior art.
Figure 2 is a configuration diagram of an anonymous processing system for transaction data having a set value according to an embodiment of the present invention.
Figure 3 is a flowchart of the entire process of anonymizing transaction data having a set value according to an embodiment of the present invention.
Figure 4 is a flowchart for the process of confirming anonymity review and transaction allocation in Figure 3 and the final partition routine.
Figures 5a to 5d show examples of applying the proposed algorithm to the original database D.
Figure 6 is a diagram showing the pseudocode of the proposed algorithm using top-down local generalization and bottom-up generalization hierarchical tree search.

상술한 본 발명의 특징 및 효과는 첨부된 도면과 관련한 다음의 상세한 설명을 통하여 보다 분명해질 것이며, 그에 따라 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자가 본 발명의 기술적 사상을 용이하게 실시할 수 있을 것이다. The features and effects of the present invention described above will become more apparent through the following detailed description with reference to the attached drawings, so that a person having ordinary skill in the art to which the present invention pertains can easily implement the technical idea of the present invention.

본 발명은 다양한 변경을 가할 수 있고 여러 가지 형태를 가질 수 있는 바, 특정 실시 예들을 예시하고 본문에 상세하게 설명하고자 한다. The present invention may be modified in various ways and may take many forms, and specific embodiments are exemplified and described in detail herein.

그러나, 이는 본 발명을 특정한 개시형태에 대해 한정하려는 것이 아니며, 본 발명의 사상 및 기술 범위에 포함되는 모든 변경, 균등물 내지 대체물을 포함하는 것으로 이해되어야 한다. However, this is not intended to limit the present invention to a specific disclosure form, and it should be understood that it includes all modifications, equivalents, or substitutes included in the spirit and technical scope of the present invention.

본 출원에서 사용한 용어는 단지 특정한 실시 예들을 설명하기 위해 사용된 것으로, 본 발명을 한정하려는 의도가 아니다.The terminology used in this application is for the purpose of describing specific embodiments only and is not intended to limit the present invention.

이하, 본 발명의 바람직한 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템 및 방법에 대하여 첨부된 도면을 참조하여 상세히 설명하면 다음 과 같다.Hereinafter, a system and method for anonymous processing of transaction data having a set value according to a preferred embodiment of the present invention will be described in detail with reference to the attached drawings.

도 2는 본 발명 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템의 구성도로서, 원천데이터로부터 개인정보가 포함된 트랜잭션 데이터를 수집하고, 사용자단말기(140)로 익명데이터를 제공하는 익명처리 매니저(111) 및 익명처리 웹화면(112)을 포함하는 운영체제부(OS)와, 상기 익명처리 매니저(111)에 의해 할당된 잡(Job)에 대한 데이터 익명처리를 각각의 익명처리 잡 스케줄러(121)(Node #1, …, Node #N)들에 의해 수행하는 중앙처리장치(CPU)(120)와, 상기 중앙처리장치(120)에서 익명처리가 완료된 익명 트랜잭션 데이터를 저장 및 보관하고, 허가된 사용자 단말기(140)로 이를 제공하는 스토리지(130)로 구성된다.FIG. 2 is a configuration diagram of an anonymization system for transaction data having a set value according to an embodiment of the present invention, comprising an operating system (OS) including an anonymization manager (111) and an anonymization web screen (112) that collects transaction data containing personal information from source data and provides the anonymous data to a user terminal (140), a central processing unit (CPU) (120) that performs data anonymization for a job assigned by the anonymization manager (111) by each anonymization job scheduler (121) (Node #1, ..., Node #N), and a storage (130) that stores and preserves anonymous transaction data that has been anonymized in the central processing unit (120) and provides the same to an authorized user terminal (140).

이와 같이 구성된 본 발명의 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템의 작용에 대하여 상세히 설명하면 다음과 같다.The operation of the anonymization system for transaction data having a set value according to an embodiment of the present invention configured in this manner is described in detail as follows.

먼저, 본 발명은 기존 k-익명성(로컬일반화)과 동일하게 하향식 파티셔닝 알고리즘을 기본적으로 채택하여 안전성과 중앙처리장치(120)의 수행시간을 최적화하면서도 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였고, 기존 k-익명성은 기본적으로 하향식 파티션 접근 방식을 채택하고 있기 때문에 트랜잭션 각 항목들의 일반화 트리 구조에서 한번 파티션된 도메인 내에서만 그리고 특히 유일값을 갖는 항목들에 대해서도 동일한 일반화를 적용하고 있어 정보 손실이 크다는 단점이 있었으나, 본 발명에서는 하향식 파티셔닝 과정 이후에 추가로 상향식 트리 탐색 과정을 통하여 트랜잭션을 재할당함으로써 기존 k-익명성의 단점을 개선하였다.First, the present invention basically adopts the top-down partitioning algorithm, which is the same as the existing k-anonymity (local generalization), to optimize the safety and the execution time of the central processing unit (120), and at the same time, improves the shortcomings of the existing k-anonymity by additionally reallocating transactions through a bottom-up tree search process after the partitioning process. Since the existing k-anonymity basically adopts the top-down partitioning approach, it has the disadvantage of a large loss of information because the same generalization is applied only within a partitioned domain in the generalization tree structure of each transaction item, and especially for items with unique values. However, the present invention improves the shortcomings of the existing k-anonymity by additionally reallocating transactions through a bottom-up tree search process after the top-down partitioning process.

도 2를 참조하면, 상기 운영체제부(OS)(111)의 익명처리 매니저(111)에서 원천 데이터로부터 개인정보가 포함된 트랜잭션 데이터 수집하고, 상기 익명처리 매니저(111)를 통하여 상기 중앙처리장치(120)의 익명처리 잡(Job) 스케줄러(121)에 익명처리 잡(Job)을 할당함에 따라, 상기 중앙처리장치(120)은 할당된 잡(Job)에 대한 익명처리 수행한다.Referring to FIG. 2, the anonymous processing manager (111) of the operating system (OS) (111) collects transaction data containing personal information from source data, and assigns an anonymous processing job to the anonymous processing job scheduler (121) of the central processing unit (120) through the anonymous processing manager (111), so that the central processing unit (120) performs anonymous processing on the assigned job.

상기 중앙처리장치(120)에서 익명처리가 완료된 익명 트랜잭션 데이터는 상기 스토리지(130)(트랜젝션 데이터 #1 ~ #N)에 저장 및 보관된다.Anonymous transaction data that has been anonymized in the central processing unit (120) is stored and kept in the storage (130) (transaction data #1 to #N).

상기 스토리지(130)에 보관된 트랜잭션 데이터는 접근이 허가된 상기 사용자 단말기(140)로부터 익명 트랜잭션 데이터 요청이 수신되면 해당 익명 트랜잭션 데이터를 조회하여 상기 익명처리 웹(Web)화면(112)을 통해 상기 사용자 단말기(140)로 제공한다. When an anonymous transaction data request is received from the user terminal (140) to which access is permitted, the transaction data stored in the storage (130) is retrieved and provided to the user terminal (140) through the anonymous processing web screen (112).

상기 익명처리 수행절차는 먼저, 상기 익명처리 매니저(111)을 통해 익명처리 Job 스케줄러(121)에 익명처리 잡 할당을 수행하고, 상기 익명처리 매니저(111)와 중앙처리장치(120)을 통해 할당된 잡에 대한 익명처리 과정을 수행한다.The above-mentioned anonymization processing procedure first performs anonymization job assignment to anonymization job scheduler (121) through the anonymization processing manager (111), and then performs anonymization processing for the assigned job through the anonymization processing manager (111) and the central processing unit (120).

도 3은 본 발명의 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리과정의 전체 흐름도로서, 이를 참조하면 상기 중앙처리장치(120)는 다음 과 같이 각 단계를 순차적으로 익명처리 과정을 수행한다. FIG. 3 is a flowchart illustrating the entire process of anonymizing transaction data having a set value according to an embodiment of the present invention. Referring to this, the central processing unit (120) sequentially performs the anonymizing process for each step as follows.

1) [Anonymize(partition)]: 파티션(트리를 버킷으로 분할) 과정(S110)으로, 처리 대상 트랜잭션 항목들에 대한 일반화 트리를 구성하고 파티션 과정(트리 분할) 수행한다. 1) [Anonymize(partition)]: A partition (dividing a tree into buckets) process (S110) constructs a generalized tree for transaction items to be processed and performs a partition process (tree splitting).

2) [Pick_node()]: 파티션선택 과정(S120)으로, 분할된 파티션 중 첫번째 파티션부터 순차적으로 선택한다. 2) [Pick_node()]: Partition selection process (S120), sequentially selecting from the first partition among the divided partitions.

3) [Expand_node()]: 트리확장(Drill-down)과정(S130)으로, 선택된 파티션을 트리의 하위 노드들로 확장한다. 3) [Expand_node()]: This is a tree expansion (drill-down) process (S130), which expands the selected partition into the lower nodes of the tree.

4) [Distribute_data()]: 트랜잭션을 파티션(버킷)에 할당 과정(S140)으로, 확장된 파티션에 각각의 트랜잭션들을 할당한다. 4) [Distribute_data()]: Assigns transactions to partitions (buckets) in the process (S140), each transaction is assigned to an extended partition.

5) [Balance_partition()]: K-익명성 검토 및 트랜잭션 할당확정 과정(S150)으로, 할당된 트랜잭션들에 대하여 사전에 주어진 k-익명성 프라이버시 모델을 만족하는지를 검토하여 만족시 할당된 트랜잭션을 최종 확정하고, 후술하는 '[Final_partition()]'과정(S160)으로 이동하며, 불만족할 경우에는 상기 트리확장과정(S130)으로 이동한다. 5) [Balance_partition()]: This is a K-anonymity review and transaction allocation confirmation process (S150). It checks whether the allocated transactions satisfy the k-anonymity privacy model given in advance. If satisfied, the allocated transactions are finally confirmed and move on to the '[Final_partition()]' process (S160) described below. If not satisfied, move on to the tree expansion process (S130).

6) [Sub partition()]: 만약 이때도 만족하는 파티션이 없을 경우 해당 상태를 저장하고 서브 파티션[Sub partition()](S170)으로 이동하여 다음 분할된 파티션(S120)을 선택한다. 6) [Sub partition()]: If there is still no satisfactory partition at this time, save the state and move to sub partition [Sub partition()] (S170) to select the next partition (S120).

7) [Final_partition()]: 최종 파티션 과정(S160)으로, 상기 K-익명성 검토 및 트랜잭션 할당확정 과정(S150)에서 정보손실 최소화를 위하여 마지막까지 만족하지 못한 파티션들에 대하여 재처리(정보손실이 가장 적은 파티션부터 일반화된 값으로 대체)과정을 수행하여 모든 파티션들에 대하여 트랜잭션을 모두 할당하므로 익명처리과정을 종료한다. 7) [Final_partition()]: In the final partition process (S160), in order to minimize information loss in the K-anonymity review and transaction allocation confirmation process (S150), a reprocessing process (replacing with generalized values starting from the partition with the least information loss) is performed on the partitions that were not satisfied until the end, thereby allocating all transactions to all partitions, thereby ending the anonymization process.

상기 각 과정을 통해 익명처리가 완료된 익명 트랜잭션 데이터를 스토리지(130)에 저장 및 보관한다.Through each of the above processes, anonymous transaction data that has been anonymized is stored and preserved in storage (130).

도 4를 참조하여 상기 K-익명성 검토 및 트랜잭션 할당확정 과정[Balance_partition()](S150)과 최종 파티션 [Final_partition()](S160)에 대하여 보다 상세히 설명하면 다음과 같다.Referring to FIG. 4, the K-anonymity review and transaction allocation confirmation process [Balance_partition()] (S150) and the final partition [Final_partition()] (S160) are described in more detail as follows.

우선 Balance_partition() 루틴(S150)에서 드릴다운 즉, 일반화 계층 트리를 아래와 확장할 더 이상의 노드가 없는 경우 FinalQ_data() 루틴(S160)을 수행하여 최종 일반화된 일종의 후보 트랜잭션과 이에 대한 파티션을 별도의 FinalQ에 저장한다. First, in the Balance_partition() routine (S150), if there are no more nodes to drill down to, that is, expand the generalized hierarchy tree, the FinalQ_data() routine (S160) is performed to store the final generalized candidate transaction and its partition in a separate FinalQ.

그리고 Balance_partition() 루틴(S150)을 통해 드릴다운과정에서 해당 트랜잭션을 분할된 파티션 내에서 할당 할 수 있는 다른 파티션이 없을 경우 WaitForQ_data() 루틴을 수행하여 최종 드릴다운된 파티션과 해당 트랜잭션을 별도의 WFQ(WaitForQ)에 저장한다. And, if there is no other partition to which the transaction can be assigned within the partition divided during the drill-down process through the Balance_partition() routine (S150), the WaitForQ_data() routine is performed to store the final drill-down partition and the transaction in a separate WFQ (WaitForQ).

그 다음 [Final_partition()] 루틴(S160)으로 넘어가는데 이 루틴에서 WFQ에 저장된 각 파티션들을 상향식으로 탐색하면서 FinalQ에 저장된 파티션과 매칭하여 k-익명성을 만족하는 파티션이 존재하는지를 추가로 체크한다. Next, it moves to the [Final_partition()] routine (S160), which searches each partition stored in WFQ in a bottom-up manner and additionally checks whether there is a partition that satisfies k-anonymity by matching it with the partition stored in FinalQ.

만일, 이 과정에서 만족하는 파티션이 있을 경우 해당 트랜잭션과 파티션은 WFQ에서 FinalQ로 이동하게 되고 그렇지 않은 경우는 다음 단계로 이동한다. If there is a partition that satisfies this process, the transaction and partition are moved from WFQ to FinalQ, otherwise they move to the next step.

마지막 단계는 WFQ에 남아 있는 파티션들(루트 바로 아래 파티션들만 남게됨) 중 정보 손실이 가장 적은 파티션부터 상위 레벨인 루트로 일반화하면서 k-익명성을 만족할 때까지 반복하여 WFQ의 트랜잭션과 파티션들을 FinalQ로 보내게 된다. 이때 FinalQ에 있는 트랜잭션들이 외부로 최종 공개된다. The final step is to generalize the partitions remaining in the WFQ (only the partitions immediately below the root remain) from the partition with the least information loss to the root, which is the upper level, and repeat this process until k-anonymity is satisfied, sending the transactions and partitions in the WFQ to the FinalQ. At this time, the transactions in the FinalQ are finally made public.

도 5는 다음 표 1의 원본 데이터베이스 D에 대하여 제안 알고리즘을 적용한 예시를 나타낸 것으로 이를 참조하여 상세히 설명하면 다음과 같다.Figure 5 shows an example of applying the proposed algorithm to the original database D in Table 1 below. The detailed description is as follows with reference to this.

여기서, 표 1은 트랜잭션 데이터베이스 D에서의 기존 익명처리 기법들과 비교한 것이며, 표 1에서 '*'표시는 삭제를 의미하며, 제안 알고리즘은 본 발명에 따른 알고리즘이다.Here, Table 1 compares existing anonymization techniques in transaction database D, and in Table 1, the '*' sign indicates deletion, and the proposed algorithm is an algorithm according to the present invention.

도 5a 내지 도 5d에서 1-라운드부터 6-라운드까지는 도 6에서의 Anonymize(partition) 루틴으로 일반화 계층 트리의 루트를 하향식으로 탐색하면서 k=2 익명성을 만족하는 최적의 파티션을 찾아나가는 과정이다. In FIGS. 5a to 5d, rounds 1 to 6 are a process of searching the root of the generalized hierarchical tree in a top-down manner using the Anonymize(partition) routine in FIG. 6 to find the optimal partition that satisfies k=2 anonymity.

1-라운드는 맨 먼저 트리의 루트 T에서부터 시작하여 T 바로 아래 계층에 있는 노드를 분할한 다음 분할된 각 파티션에 t1부터 t8까지 트랜잭션을 할당한다.(도 3, S110)Round 1 starts from the root T of the tree, splits the nodes in the hierarchy immediately below T, and then assigns transactions t1 to t8 to each partition. (Figure 3, S110)

2-라운드에서는 다시 하향식 탐색을 반복하여 노드를 아래 계층으로 확장해 나가는 과정이다. 이때 파티션 [P, Q]의 경우 각각 [P, J, M]과 [H, K Q] 중 하나를 선택할 수 있다.(도 3, S120) In Round 2, the top-down search is repeated again, expanding the nodes to the lower layers. For each partition [P, Q], either [P, J, M] or [H, K Q] can be selected (Figure 3, S120).

이때는 할당된 트랜잭션 t2, t3, t4, t5에 대하여 [Q] <- [J, M]과 [P] <- [H, K] 중 정보 손실인 NCP를 측정하여 값이 적은 즉, 정보 손실이 적은 파티션인 [P, J, M]으로 드릴다운(확장)된다.(도 3, S130) At this time, for the assigned transactions t2, t3, t4, and t5, the NCP, which is information loss, is measured among [Q] <- [J, M] and [P] <- [H, K], and the partition with the smaller value, that is, the partition with the smaller information loss, is drilled down (expanded) to [P, J, M]. (Fig. 3, S130)

또한 1-라운드에서와 동일한 방법으로 드릴다운(확장)된 각 파티션 [H, K]에 대하여 t1의 트랜잭션을, [P, J, M]에 대하여 t2, t3, t4, t5의 트랜잭션을 할당한다.(도 3, S140)Also, for each partition [H, K] drilled down (expanded) in the same way as in the first round, the transaction of t1 is assigned, and for [P, J, M], the transactions of t2, t3, t4, and t5 are assigned. (Fig. 3, S140)

3-라운드에서는 2-라운드와 동일한 과정을 반복하여 [H, K] -> [H, c, d]로 확장되었지만 k=2 익명성을 만족하지 않아 다시 상위 계층으로 롤백(rollback)되어 상위 계층의 파티션들과 k=2 익명성을 만족하는지를 반복한다. 이때 만족하는 파티션이 더 이상 없기 때문에 파티션 [H, c, d]는 확장을 멈추고 대기큐인 WFQ에 해당 트랜잭션 t1과 함께 저장된다.(도 4, S150)In Round 3, the same process as Round 2 is repeated to expand [H, K] -> [H, c, d], but since it does not satisfy the k=2 anonymity requirement, it rolls back to the upper layer, and the upper layer partitions are tested to see if they satisfy the k=2 anonymity requirement. At this point, since no more partitions satisfy the k=2 anonymity requirement, the partition [H, c, d] stops expanding and is stored in the waiting queue (WFQ) along with the corresponding transaction t1 (Figure 4, S150).

4-라운드에서는 다시 [P, J, M]에 대해 [P, J], [P, M], [P, J, M]으로 확장이 일어나고(도 3, S140) 1-라운드에서와 동일한 방법으로 각 파티션 [P, J]에 대하여 t2, t5의 트랜잭션을 [P, J, M]에 대하여 t3, t4의 트랜잭션을 할당한다.(도 4, S150)In the 4th round, expansion occurs again for [P, J, M] to [P, J], [P, M], [P, J, M] (Fig. 3, S140), and in the same way as in the 1st round, transactions t2 and t5 are assigned to each partition [P, J] and transactions t3 and t4 are assigned to [P, J, M]. (Fig. 4, S150)

5-라운드에서는 다시 [P, J] -> [P, f, g]로 [P, J, M] -> [K, J, M]으로 확장이 일어나고(도 3, S140) 최종 [P, f, g]: t2, t5가 K=2 익명성을 만족하여 FinalQ : {[P,f,g]: t2, t5}에 저장된다.(도 4, S150)In the 5th round, the expansion occurs again from [P, J] -> [P, f, g] to [P, J, M] -> [K, J, M] (Fig. 3, S140), and the final [P, f, g]: t2, t5 satisfies K=2 anonymity and is stored in FinalQ: {[P, f, g]: t2, t5} (Fig. 4, S150).

6-라운드에서는 5-라운드와 동일한 확장 과정을 반복하여 파티션 [K,f,M]: t3, t4가 K=2 익명성을 만족하여 FinalQ : {[K,f,M]: t3, t4}에 저장되고 이후 더 이상 확장할 노드가 없으므로 Anonymize(partition) 루틴을 종료하게 된다.(도 4, S150) In the 6th round, the same expansion process as in the 5th round is repeated until the partition [K,f,M]: t3, t4 satisfies the K=2 anonymity and is stored in FinalQ: {[K,f,M]: t3, t4}. After that, there are no more nodes to expand, so the Anonymize(partition) routine is terminated. (Fig. 4, S150)

7-라운드부터 마지막 11-라운드까지는 Final_partition() 루틴으로 크게 2가지 과정을 거치게 된다.(도 4, S160)From the 7th round to the last 11th round, the Final_partition() routine goes through two major processes (Fig. 4, S160).

도 4를 참조하면, 첫 번째 과정은 그동안 더이상 2-익명성 매칭이 되지 않아 남아있는 WFQ의 파티션(그리고 파티션에 할당된 트랜잭션)들을 상향식으로 계층트리를 롤백(rollback)하면서 기존 FinalQ에 있는 파티션과 대조하여 2-익명성 매칭이 되는 파티션을 찾아 WFQ에서 FianlQ로 이동하는 과정이다. Referring to Figure 4, the first process is to roll back the hierarchical tree in a bottom-up manner for the remaining WFQ partitions (and transactions assigned to the partitions) that no longer have 2-anonymity matching, compare them with the partitions in the existing FinalQ, and move the partitions that have 2-anonymity matching from WFQ to FinalQ.

만일 이 과정에서 매칭이 되지 않아 실패할 경우 다음 과정으로 이동하게 된다. 다음 두 번째 과정은 더 이상 분할이 되지 않고 남아있는 WFQ의 파티션(그리고 파티션에 할당된 트랜잭션)들을 정보 손실이 가장 낮은 파티션부터 하나씩 상위 계층으로 일반화하면서 k=2 익명성을 만족하는 파티션(그리고 파티션에 할당된 트랜잭션)들이 있는지를 탐색하여 만족할 경우 WFQ에서 FinalQ로 이동하는 과정이다. If this process fails due to a mismatch, it moves to the next step. The second step is to generalize the remaining WFQ partitions (and the transactions assigned to them) from the partition with the lowest information loss to the upper layer one by one, and search for partitions (and the transactions assigned to them) that satisfy k=2 anonymity. If they are satisfied, it moves from WFQ to FinalQ.

7-라운드에서는 먼저 남은 파티션 [e,i], [e], [i]에 대해 더 이상 분할이 안되고 k=2 익명성을 만족하지 않음으로 모두 WFQ로 보낸 다음 Final_partition() 루틴을 수행하여 WFQ의 각 파티션들을 롤백해 가면서 FinalQ의 파티션과 매칭하여 k=2 익명성을 만족하는 트랜잭션이 있는지 탐색한다. In the 7th round, first, the remaining partitions [e,i], [e], [i] are no longer splittable and do not satisfy the k=2 anonymity, so they are all sent to the WFQ. Then, the Final_partition() routine is executed to roll back each partition of the WFQ and search for a transaction that satisfies the k=2 anonymity by matching it with the partition of the FinalQ.

8-라운드에서는 파티션 [P]: t1가 k=2 익명성을 만족하여 파티션 [P]: t1을 FinalQ에 할당하고 WFQ에서 제거한다. In the 8th round, partition [P]: t1 satisfies k=2 anonymity, so partition [P]: t1 is assigned to FinalQ and removed from WFQ.

9-라운드에서는 WFQ에 남은 트랜잭션들(t6, t7, t8)에 대하여 NCP가 가장 적은 [e]: t7, [i]: t8 중 [e]:t7을 T로 할당한 후, 다른 트랜잭션들 중 [e]가 있는 [e,i]를 [T,i]로 WFQ를 업데이트한다. In the 9th round, among the remaining transactions (t6, t7, t8) in the WFQ, among [e]: t7, [i]: t8 with the smallest NCP, [e]:t7 is assigned to T, and among the other transactions, [e,i] that have [e] is updated to [T,i] in the WFQ.

10-라운드에서는 WFQ에 남아있는 파티션들([T,i], [T], [i]) 중 k=2 익명성을 만족하는지를 체크한다. 이때 파티션 {[T,i]: t6, [T]: t7, [i]: t8}이 모두 k=2 익명성을 만족하여 이들 파티션들을 모두 FinalQ에 할당하고 WFQ에서 제거한다. 마지막 11-라운드에서는 WFQ가 비었으므로 Final_partition() 루틴을 종료하고 FinalQ에 있는 값을 출력하여 최종 공개한다.In the 10th round, we check whether the partitions ([T,i], [T], [i]) remaining in the WFQ satisfy k=2 anonymity. At this time, partitions {[T,i]: t6, [T]: t7, [i]: t8} all satisfy k=2 anonymity, so all of these partitions are assigned to FinalQ and removed from WFQ. In the final 11th round, since WFQ is empty, the Final_partition() routine is terminated and the values in FinalQ are output and finally made public.

이와 같이 본 발명은 기본적으로 종래의 K-익명성(로컬일반화) 알고리즘을 채택하였지만, K-익명성(로컬일반화) 또는 HgHs 알고리즘에 비하여 정보 손실이 적어 공개된 데이터 분석시 분석가에게 보다 많은 유용성을 제공하고 있다. 중앙처리장치의 실행 시간 측면에 있어서는 기본적으로 너비 우선 검색(breadth-first search) 접근 방식을 채택하고 있는 HgHs 알고리즘에 비하여 훨씬 빠른 성능을 보이고 있다. Thus, the present invention fundamentally adopts the conventional K-anonymity (local generalization) algorithm, but compared to the K-anonymity (local generalization) or HgHs algorithms, it exhibits less information loss, providing greater utility to analysts when analyzing public data. In terms of CPU execution time, it demonstrates significantly faster performance than the HgHs algorithm, which fundamentally adopts a breadth-first search approach.

반면에 k-익명성(로컬일반화) 알고리즘에 비하여 Final_partition() 루틴 과정을 추가로 거쳐야 하기 때문에 그 만큼의 비용이 추가로 드는 단점이 있으나, 상기 Final_partition() 루틴 과정은 이미 계층 트리의 확장이 완료된 이후 남은 일부 트랜잭션들에 대해 수행되는 것이기에 수행시간이 크게 높지는 않을 것으로 예상된다.On the other hand, compared to the k-anonymity (local generalization) algorithm, there is a disadvantage in that it requires an additional Final_partition() routine process, which incurs additional costs. However, since the Final_partition() routine process is performed on some of the remaining transactions after the expansion of the hierarchical tree has already been completed, it is expected that the execution time will not be significantly high.

한편, 안전성 측면에서 볼 때 종래의 K-익명성(로컬일반화)와 HgHs 알고리즘과 동일하게 K-익명성의 안전성 조건을 만족하고 있어 동일한 성능을 갖는다. Meanwhile, in terms of security, it satisfies the security conditions of K-anonymity in the same way as the conventional K-anonymity (local generalization) and HgHs algorithm, and thus has the same performance.

또한 이 성능은 종래의 2-^∞익명성(글로벌일반화) 와 2-^∞익명성(삭제) 알고리즘에 비해 훨씬 우수한 안전성을 가지고 있다. 즉, k^m-익명성은 공격자가 m개 이하의 항목을 알고 있는 경우에만 개인의 프라이버시를 보호하는 반면 k-익명성은 매개변수 m이 없을 때 공격자가 알 수 있는 항목 수에 제한이 필요하지 않다. Moreover, this performance has much better security than the conventional 2- ^∞ anonymity (global generalization) and 2- ^∞ anonymity (deletion) algorithms. That is, k ^m -anonymity protects an individual's privacy only when the attacker knows no more than m items, whereas k-anonymity does not require a limit on the number of items that the attacker can know when there is no parameter m.

일반적으로 k^m-익명성의 m이 작을수록 k^m-익명성이 제공하는 프라이버시가 약해질 수밖에 없으나, k-익명성은 k^m-익명성보다 보다 강력한 프라이버시를 보장한다.In general, the smaller the m of k ^m -anonymity, the weaker the privacy provided by k ^m -anonymity, but k-anonymity guarantees stronger privacy than k ^m -anonymity.

이상과 같이, 본 발명의 실시예에 따른 집합값을 갖는 트랜잭션 데이터의 익명처리 시스템 및 방법은 비록 한정된 실시예와 도면에 의해 설명되었으나 이 실시예에 의해 한정되지 않으며, 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자에 의해 본 발명의 기술 사상과 아래에 기재될 특허청구범위의 균등범위 내에서 다양한 수정 및 변형 가능함은 물론이다.As described above, although the anonymization system and method for transaction data having a set value according to an embodiment of the present invention has been described by limited embodiments and drawings, it is not limited to these embodiments, and it is obvious that various modifications and variations can be made within the scope of the technical idea of the present invention and the equivalent scope of the patent claims to be described below by a person having ordinary skill in the art to which the present invention pertains.

110 : 운영체제부 111 : 익명처리 매니저
112 : 익명처리 웹화면 120 : 중앙처리장치
121 : 익명처리 잡 스케줄러 130 : 스토리지
140 : 사용자 단말기110: Operating System Department 111: Anonymous Processing Manager
112: Anonymous web screen 120: Central processing unit
121: Anonymous processing job scheduler 130: Storage
140: User terminal

Claims

delete

In a method for anonymizing transaction data having a set value, in which transaction data containing personal information is collected from source data through an anonymization manager and data anonymization is performed through a central processing unit (CPU),
The above-mentioned anonymization manager performs a first process of assigning anonymization jobs to a job scheduler for transaction data collected from the above-mentioned source data;
The central processing unit performs a second process of anonymous processing for a job assigned to the job scheduler; and
A third process of storing anonymous transaction data, anonymized in the central processing unit, in a transaction database;
The anonymization process of the second step is performed based on a top-down partitioning algorithm, and then transactions are reallocated through a top-down tree search process.
The above second process
Step 1: Construct a generalization tree for the transaction items to be processed and perform a partitioning process;
Step 2: sequentially selecting the first partition from among the partitions divided in the above step 1;
Step 3: extending the selected partition to the lower nodes of the tree;
Step 4: Assigning each transaction to the extended partition;
Step 5: performing a balance partition (Balance_patrition()) routine to check whether the allocated transactions satisfy a pre-given k-anonymity privacy model; and
A method for anonymizing transaction data having a set value, characterized by including a step 6 of performing a final partition (Final_partition()) routine to terminate anonymization after reprocessing the remaining transactions to minimize information loss if the K-anonymity privacy model is satisfied as a result of the review in the above step 5.

delete

In paragraph 4,
A method for anonymizing transaction data having a set value, characterized in that the fifth step further includes a step of checking whether there is another partition to which the transaction (t) can be assigned within the divided partition if K-anonymity is not satisfied, and assigning the transaction (tj) to the partition until K-anonymity is satisfied.

In paragraph 4,
A method for anonymizing transaction data having a set value, characterized in that the above 5th step further includes a step of checking whether there is another partition to which the transaction (t) can be assigned within the partitioned partition if K-anonymity is not satisfied, and if there is no other partition to which the transaction (t) can be assigned within the partitioned partition, assigning the transaction (tj) to a waiting queue (WaitForQ, hereinafter referred to as 'WFQ').

In paragraph 4,
If the K-anonymity check is satisfied in the above step 5, steps 1 to 4 are repeated until there are no more nodes to expand.
A method for anonymizing transaction data having a set value, characterized in that when there are no more nodes to be expanded, a final queue routine (FinalQ_data()) is performed to store a final generalized type of candidate transaction and its partition in a separate final queue (FinalQ).

In paragraph 4,
The above 6th step is a method for anonymizing transaction data having a set value, characterized in that it performs a process of checking whether there is a partition that satisfies K-anonymity by matching the partitions stored in FinalQ with the partitions stored in WFQ while searching each partition stored in WFQ in a bottom-up manner, and if there is a satisfying partition, the transaction and partition are moved from WFQ to FinalQ.

In Article 10,
A method for anonymizing transaction data having a set value, characterized in that transactions and partitions of WFQ are repeatedly sent to FinalQ by generalizing from the partition with the least information loss among the partitions remaining in the above WFQ to the root, which is the upper level, until K-anonymity is satisfied.