JP2019101809A

JP2019101809A - Anonymization device, anonymization method, and anonymization program

Info

Publication number: JP2019101809A
Application number: JP2017232733A
Authority: JP
Inventors: Tomoaki Mitsumoto (三本 知明); Shinsaku Kiyomoto (清本 晋作)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2019-06-24
Anticipated expiration: 2037-12-04
Also published as: JP6779854B2

Abstract

Problem to be solved: To provide an anonymization device, an anonymization method, and an anonymization program that can output only safe records while preserving utility.
Solution: An anonymization device 1 includes: a data set input unit 11 that periodically receives input of a data set having the same attributes; a hierarchical tree input unit 12 that receives input of a generalization hierarchy tree for each attribute; an anonymization rule input unit 13 that receives input of anonymization rules; an output rule input unit 14 that receives input of output rules defining the conditions under which a record may be output; an anonymization processing unit 16 that anonymizes the entire data set according to the anonymization rules; an anonymized data output unit 17 that outputs, from the anonymized data set, only the records that satisfy the output rules; and a storage unit 20 that stores, in their pre-anonymization state, the records held back because they did not satisfy the output rules. The anonymization processing unit 16 adds the held-back records to the received data set and then performs anonymization.
Selected drawing: FIG. 1

Description

The present invention relates to a device, a method, and a program for anonymizing a data set.

Conventionally, when a data set containing personal information, such as movement history or purchase history together with user attributes, is analyzed and used for purposes such as advertisement delivery, the risk that an individual is identified from a record and personal information is leaked has had to be avoided. For this reason, a data set containing personal information is provided only after it has been anonymized.
Methods for anonymizing a data set automatically include clustering records that are close to one another and rounding them together, and giving each attribute a tree structure and repeating generalization until k-anonymity is achieved (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1: Ji-Won Byun, Ashish Kamra, Elisa Bertino, Ninghui Li, Efficient k-anonymization using clustering techniques, Proceedings of the 12th international conference on Database systems for advanced applications, April 09-12, 2007, Bangkok, Thailand
Non-Patent Document 2: Kristen LeFevre, David J. DeWitt, Raghu Ramakrishnan, Incognito: efficient full-domain K-anonymity, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland

However, a method that automatically rounds nearby records clusters them purely by inter-record distance, without regard to what the data means. The result can be a generalization of little use to the user, such as an age range of "17-21".
Even with a tree-structure-based method, the goal is to achieve k-anonymity, so as the number of attributes grows, most attributes are heavily generalized, and the resulting data set often has little practical value.

An object of the present invention is to provide an anonymization device, an anonymization method, and an anonymization program that can output only safe records while preserving utility.

An anonymization device according to the present invention includes: a data set input unit that periodically receives input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input unit that receives, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input unit that receives input of anonymization rules suited to how the data set will be used; an output rule input unit that receives input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing unit that anonymizes the entire data set according to the anonymization rules; an anonymized data output unit that outputs, from the data set anonymized by the anonymization processing unit, only the records that satisfy the output rules; and a held-back record storage unit that stores, in their pre-anonymization state, the records held back because they did not satisfy the output rules. The anonymization processing unit adds the records stored in the held-back record storage unit to the data set received by the data set input unit and then performs anonymization.

When a record includes a date-and-time attribute, the held-back record storage unit may store the record with the date-and-time information deleted.

The held-back record storage unit may delete records for which a predetermined period has elapsed.

In an anonymization method according to the present invention, a computer executes: a data set input step of periodically receiving input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input step of receiving, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input step of receiving input of anonymization rules suited to how the data set will be used; an output rule input step of receiving input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing step of anonymizing the entire data set according to the anonymization rules; an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rules; and a held-back record storage step of storing, in their pre-anonymization state, the records held back because they did not satisfy the output rules. In the anonymization processing step, the records stored in the held-back record storage step are added to the data set received in the data set input step, and anonymization is then performed.

An anonymization program according to the present invention causes a computer to execute: a data set input step of periodically receiving input of a data set consisting of a plurality of records having the same attributes; a hierarchical tree input step of receiving, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes; an anonymization rule input step of receiving input of anonymization rules suited to how the data set will be used; an output rule input step of receiving input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified; an anonymization processing step of anonymizing the entire data set according to the anonymization rules; an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rules; and a held-back record storage step of storing, in their pre-anonymization state, the records held back because they did not satisfy the output rules. In the anonymization processing step, the program causes the computer to add the records stored in the held-back record storage step to the data set received in the data set input step and then perform anonymization.

According to the present invention, when a data set is anonymized, only safe records can be output while utility is preserved.

FIG. 1 is a diagram showing the functional configuration of the anonymization device according to the first embodiment.
FIG. 2 is a diagram showing the input and output information of the anonymization device according to the first embodiment.
FIG. 3 is a diagram showing the input and output information of the anonymization device according to the second embodiment.

A first embodiment of the present invention is described below.
FIG. 1 is a diagram showing the functional configuration of the anonymization device 1 according to this embodiment.
The anonymization device 1 is an information processing device (computer) such as a server or a personal computer, and includes a control unit 10, a storage unit 20, and various input/output devices.

The control unit 10 controls the anonymization device 1 as a whole and implements the functions of this embodiment by reading and executing, as needed, the programs stored in the storage unit 20. The control unit 10 may be a CPU.

The storage unit 20 is a storage area for the programs that make the hardware function as the anonymization device 1 and for various data. It may be a ROM, a RAM, a flash memory, a hard disk drive (HDD), or the like. Specifically, the storage unit 20 stores the anonymization program that causes the control unit 10 to execute the functions of this embodiment, as well as the data sets to be processed and various files.

The control unit 10 includes a data set input unit 11, a hierarchical tree input unit 12, an anonymization rule input unit 13, an output rule input unit 14, a setting information input unit 15, an anonymization processing unit 16, and an anonymized data output unit 17.

The data set input unit 11 periodically receives, for example by batch processing, input of a data set consisting of a plurality of records having the same attributes. For example, one day's worth of data is imported once a day and passed to the anonymization processing unit 16.

Each record of the data set to be anonymized consists of a plurality of attributes. The data types of the attributes include qualitative data, quantitative data, and code-type data.
Qualitative data is, for example, an address such as "Tokyo" or "Kyoto".
Quantitative data is, for example, a numerical value such as "1.5" or "30".
Code-type data is data in which each digit carries meaning, such as a postal code.

The hierarchical tree input unit 12 receives, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes.
In a generalization hierarchy tree, for example, nodes of qualitative data such as "Tokyo" and "Kyoto" have the nodes "Kanto" and "Kansai", respectively, in the level above them. Nodes of quantitative data such as "13", "14", and "15" have a node such as "13-15" or "minor" in the level above them. A node of code-type data such as "123-4567" has above it a node in which some digits are masked, such as "123-45**".
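For illustration only, the three kinds of hierarchy in the examples above can be sketched in Python. The parent-map representation, the function names `generalize` and `mask_code`, and the placement of "minor" one level above "13-15" are assumptions made for this sketch. The specification does not prescribe a data structure.

```python
# Hypothetical representation: each tree maps a node to its parent node.
ADDRESS_TREE = {"Tokyo": "Kanto", "Kyoto": "Kansai"}
# Assumption: "minor" is placed one level above "13-15".
AGE_TREE = {"13": "13-15", "14": "13-15", "15": "13-15", "13-15": "minor"}


def generalize(value, parent_of, levels):
    """Climb up to `levels` steps from `value` toward the top of the tree."""
    for _ in range(levels):
        if value not in parent_of:
            break  # already at the top of this tree
        value = parent_of[value]
    return value


def mask_code(code, masked_digits):
    """Generalize code-type data by replacing trailing digits with '*'."""
    chars = list(code)
    remaining = masked_digits
    for i in range(len(chars) - 1, -1, -1):
        if remaining == 0:
            break
        if chars[i].isdigit():
            chars[i] = "*"
            remaining -= 1
    return "".join(chars)


print(generalize("Tokyo", ADDRESS_TREE, 1))  # Kanto
print(generalize("14", AGE_TREE, 2))         # minor
print(mask_code("123-4567", 2))              # 123-45**
```

Masking digits from the right reflects the idea that the leading digits of a code such as a postal code carry the coarser meaning. Other code schemes may need a different masking order.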

The anonymization rule input unit 13 receives file input of anonymization rules suited to how the data set will be used.
The anonymization rules define generalization rules and conditional generalization rules, for example: generalize attribute x up to tree height h; delete part or all of attribute y; or, for records that have n or more identical records, generalize attribute z up to tree height h. A rule may have multiple conditions, combined using, for example, "and" or "or".
The anonymization rules are not limited to generalization based on a generalization hierarchy tree. Other anonymization techniques, such as sampling, swapping, or noise addition, may also be used.
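As a rough sketch of how unconditional rules of this kind might be applied, the following Python example uses a hypothetical rule format (dictionaries with `op`, `attr`, and `levels` keys). The format of the actual rule file is not given in the specification. Here `levels` counts steps upward from the current node, which is an assumed simplification of "tree height h". Conditional rules, sampling, swapping, and noise addition are omitted.

```python
def apply_rules(records, rules, trees):
    """Return anonymized copies of `records`; the inputs are left unchanged."""
    anonymized = []
    for record in records:
        out = dict(record)
        for rule in rules:
            attr = rule["attr"]
            if attr not in out:
                continue
            if rule["op"] == "delete":
                del out[attr]
            elif rule["op"] == "generalize":
                value = out[attr]
                for _ in range(rule["levels"]):
                    # Stay at the current node once the top of the tree is reached.
                    value = trees[attr].get(value, value)
                out[attr] = value
        anonymized.append(out)
    return anonymized


# Hypothetical inputs: one parent map per attribute, and two rules.
trees = {"address": {"Tokyo": "Kanto", "Kyoto": "Kansai"}}
rules = [
    {"op": "generalize", "attr": "address", "levels": 1},
    {"op": "delete", "attr": "name"},
]
records = [{"name": "A", "address": "Tokyo"}, {"name": "B", "address": "Kyoto"}]
print(apply_rules(records, rules, trees))
# [{'address': 'Kanto'}, {'address': 'Kansai'}]
```

Working on copies keeps the original records available, which matters when records that fail the output rules must later be stored in their pre-anonymization state.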

The output rule input unit 14 receives file input of output rules that define the conditions under which a record may be output, so that the risk of an individual being identified from the information in a record of the data set is kept below a predetermined level.
An output rule may be expressed as a threshold, for example on the number k of identical records or on the individual identification probability p.
Output rules may be defined independently for each record. They may also be defined according to the recipient of the data set. For example, a conditional threshold may be defined as an output rule, such that a given record may be disclosed to companies of size xx or larger if the number of identical records satisfies k ≥ 2, and to companies of size yy or smaller if k > 10.
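A threshold on the number k of identical records could be checked along the following lines. This is a minimal Python sketch under the assumption that two anonymized records count as identical when all of their attribute values match. The function name `split_by_k` is hypothetical, and per-record or per-recipient thresholds are not modeled.

```python
from collections import Counter


def split_by_k(anonymized, k):
    """Split record indices into (output, held_back) using a threshold k."""
    # A record's key is its full set of attribute values after anonymization.
    keys = [tuple(sorted(record.items())) for record in anonymized]
    counts = Counter(keys)
    output = [i for i, key in enumerate(keys) if counts[key] >= k]
    held_back = [i for i, key in enumerate(keys) if counts[key] < k]
    return output, held_back


anonymized = [
    {"address": "Kanto", "age": "13-15"},
    {"address": "Kanto", "age": "13-15"},
    {"address": "Kansai", "age": "13-15"},
]
print(split_by_k(anonymized, 2))  # ([0, 1], [2])
```

Returning indices instead of records makes it straightforward to look up the original, pre-anonymization record that corresponds to each record that failed the rule.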

The setting information input unit 15 receives input of a file that specifies where the various outputs, such as the anonymized data set and log files, are to be saved.

The anonymization processing unit 16 anonymizes the entire data set according to the anonymization rules.
The resulting data set is anonymized to a level suited to the intended use of the data.

The anonymized data output unit 17 outputs, from the data set anonymized by the anonymization processing unit 16, only the records that satisfy the output rules.
For records that do not satisfy the output rules, the anonymized data output unit 17 performs processing as appropriate, such as deleting them from the data set, masking them, or repeating generalization until they satisfy the output rules.
As a result, an anonymized data set whose safety is assured is output.
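One of the options mentioned above, repeating generalization until the output rule is satisfied, might look like the following for a single attribute. This Python sketch assumes a threshold k on the number of identical values and a parent-map tree. Both the function name `generalize_until_safe` and the use of `None` to mark a value that must be deleted are assumptions of the sketch. "Osaka" is an illustrative value not taken from the specification.

```python
from collections import Counter


def generalize_until_safe(values, parent_of, k):
    """Generalize only the failing values, one level at a time.

    Values that still occur fewer than k times at the top of the tree
    are returned as None (treated here as deleted).
    """
    values = list(values)
    while True:
        counts = Counter(values)
        failing = [
            i for i, v in enumerate(values)
            if counts[v] < k and v in parent_of
        ]
        if not failing:
            break
        for i in failing:
            values[i] = parent_of[values[i]]
    counts = Counter(values)
    return [v if counts[v] >= k else None for v in values]


tree = {"Tokyo": "Kanto", "Kyoto": "Kansai", "Osaka": "Kansai"}
print(generalize_until_safe(["Tokyo", "Tokyo", "Kyoto", "Osaka"], tree, 2))
# ['Tokyo', 'Tokyo', 'Kansai', 'Kansai']
```

Only the records that fail the rule are generalized further, so records that are already safe keep their more specific values.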

The anonymized data output unit 17 also outputs various log files and reports together with the anonymized data set and stores them in the storage unit 20.

FIG. 2 is a diagram showing the input and output information of the anonymization device 1 according to this embodiment.
As described above, the anonymization device 1 receives as input, in addition to the data set to be anonymized, the generalization hierarchy trees, an anonymization rule file, an output rule file, and other setting files.
When it outputs the anonymized data set, the anonymization device 1 also outputs an anonymization log file, an error log file, and an anonymization report.

The anonymization log file records, for each record that did not satisfy the output rules, an ID that links the record to the anonymized data set, together with the record's original attribute information before anonymization.
The error log file records error messages when the anonymization process does not finish normally.
The anonymization report describes, for the purpose of security control measures, what anonymization was performed under the anonymization rules and how much risk remains under the output rules.

According to this embodiment, the anonymization device 1 performs anonymization such as generalization according to anonymization rules tailored to the intended use, and then outputs only the anonymized records that satisfy the output rules, which keep the risk of an individual being identified below a predetermined level.
Conventional anonymization methods process the entire data set so that it meets the output condition, so outliers are heavily generalized together with the other records and utility suffers. The anonymization device 1 of this embodiment first makes explicit, in the anonymization rules, the information to be retained, which differs by use case. Then, after anonymization, it outputs only the safe records that satisfy the output rules. Outliers whose safety falls below the predetermined level are thereby excluded, and high utility is maintained.
As a result, unlike conventional automatic anonymization methods, the anonymization device 1 can guarantee both a fixed set of processing rules and a level of safety, and can therefore generate a data set that is useful to users of the anonymized data.

Furthermore, by defining output rules independently for each record of the data set, the anonymization device 1 can define the safety of each record more appropriately.
In addition, by defining output rules according to the recipient of the data set, the anonymization device 1 can output a data set appropriate to the intended use.

[Second Embodiment]
A second embodiment of the present invention is described below.
Components that are the same as in the first embodiment are given the same reference numerals, and their description is omitted or simplified.

When a data set is anonymized automatically, a record that is an outlier, with no other similar records, loses much of its information through heavy generalization or through deletion of all or part of the record. In the first embodiment as well, utility may fall if some records are deleted or heavily generalized in order to satisfy the safety required by the output rules.
However, when data sets are input periodically, records similar to an outlier can be expected to accumulate over time, so the degree of generalization may be reduced.
In this embodiment, therefore, the anonymization device 1 maintains utility by merging records that did not satisfy the output rules with data sets input later and processing them together.

In this embodiment, the functions of the anonymization processing unit 16 and the anonymized data output unit 17 differ from those in the first embodiment.
The anonymized data output unit 17 stores the held-back records, that is, the records that did not satisfy the output rules and were not output, in the storage unit 20 (held-back record storage unit) in their original pre-anonymization state, as an over-anonymization target data set.
The anonymization processing unit 16 adds the over-anonymization target data set stored in the storage unit 20 to a data set received by the data set input unit 11 on a subsequent occasion, and then performs anonymization.

When sampling is adopted as an anonymization rule, some records are excluded from output regardless of the output rules. These records need not be included in the over-anonymization target data set.

When a held-back record includes a date-and-time attribute, the anonymized data output unit 17 deletes the date-and-time information before adding the record to the over-anonymization target data set. Date-and-time information never takes the same value as the records of a data set input on a subsequent occasion (for example, the next day). Therefore, when the date-and-time information is subject to anonymization processing, deleting it from the held-back record keeps the record from remaining an outlier in subsequent anonymization runs.
During anonymization, the anonymization processing unit 16 may delete records in the over-anonymization target data set for which a predetermined period has elapsed.
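The two housekeeping steps described here, deleting date-and-time attributes and discarding records after a predetermined period, can be sketched in Python as follows. The specification does not say how the elapsed period is tracked. This sketch assumes that the run date is kept as bookkeeping alongside the record, not as one of its attributes. The function names and the seven-day limit are hypothetical.

```python
from datetime import date, timedelta


def to_holding_entry(record, datetime_attrs, run_date):
    """Drop date-and-time attributes and pair the record with the run date."""
    stripped = {k: v for k, v in record.items() if k not in datetime_attrs}
    return (run_date, stripped)


def drop_expired(entries, today, max_age):
    """Keep only entries that have been held for no longer than `max_age`."""
    return [(held_on, rec) for held_on, rec in entries if today - held_on <= max_age]


record = {"timestamp": "2017-12-04 10:15", "address": "Kyoto"}
entry = to_holding_entry(record, {"timestamp"}, date(2017, 12, 4))
print(entry[1])  # {'address': 'Kyoto'}

kept = drop_expired([entry], date(2017, 12, 20), timedelta(days=7))
print(len(kept))  # 0
```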

FIG. 3 is a diagram showing the input and output information of the anonymization device 1 according to this embodiment.
In the second embodiment, in addition to the output data of the first embodiment, the anonymization device 1 outputs the over-anonymization target data set and stores it in a holding database (DB) in the storage unit 20. At this point, the date-and-time information is deleted from the data set stored in the holding DB.
At the next anonymization run, the anonymization device 1 adds the data set stored in the holding DB to the data set to be anonymized and performs anonymization.
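The cycle of holding back records and merging them into the next run can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation. It assumes a threshold k on identical anonymized records as the output rule, a plain list standing in for the holding DB, and a caller-supplied `anonymize` function. The name `run_cycle` and the "Osaka" record are hypothetical.

```python
from collections import Counter


def run_cycle(new_batch, held_over, anonymize, k):
    """One periodic run: merge, anonymize, output safe records, hold the rest."""
    combined = held_over + new_batch
    anonymized = [anonymize(record) for record in combined]
    keys = [tuple(sorted(a.items())) for a in anonymized]
    counts = Counter(keys)
    output, next_held = [], []
    for original, anon, key in zip(combined, anonymized, keys):
        if counts[key] >= k:
            output.append(anon)
        else:
            next_held.append(original)  # stored in pre-anonymization form
    return output, next_held


REGION = {"Tokyo": "Kanto", "Kyoto": "Kansai", "Osaka": "Kansai"}


def anonymize(record):
    return {"address": REGION[record["address"]]}


day1 = [{"address": "Tokyo"}, {"address": "Tokyo"}, {"address": "Kyoto"}]
output1, holding_db = run_cycle(day1, [], anonymize, k=2)
print(output1)     # [{'address': 'Kanto'}, {'address': 'Kanto'}]
print(holding_db)  # [{'address': 'Kyoto'}]

day2 = [{"address": "Osaka"}]
output2, holding_db = run_cycle(day2, holding_db, anonymize, k=2)
print(output2)     # [{'address': 'Kansai'}, {'address': 'Kansai'}]
print(holding_db)  # []
```

In the first run the "Kyoto" record is an outlier and is held back. In the second run it is merged with a similar record and both become eligible for output, which is the effect this embodiment aims for.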

According to this embodiment, the anonymization device 1 stores the held-back records that did not satisfy the output rules in the holding DB in their pre-anonymization state, and reuses them in anonymization by adding them to data sets input on subsequent occasions.
Therefore, when anonymization is performed repeatedly on data sets having the same attributes, the anonymization device 1 temporarily holds back high-risk records and anonymizes them later. A record that is an outlier in the current run thus becomes more likely to be output in a later run, and utility improves.

Furthermore, when a record includes a date-and-time attribute, the anonymization device 1 deletes the date-and-time information before storing the record in the holding DB. This prevents the record from becoming an outlier in the next data set (for example, the next day's) and makes it more likely to be output.
In addition, by deleting from the holding DB records for which the predetermined period has elapsed, the anonymization device 1 can exclude records that are no longer appropriate to merge into the processing target, which improves the utility of the output data.

Embodiments of the present invention have been described above, but the present invention is not limited to them. The effects described in the embodiments are merely a list of the most favorable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

The anonymization method performed by the anonymization device 1 is implemented in software. When it is implemented in software, the programs constituting the software are installed on an information processing device (computer). These programs may be distributed to users on removable media such as a CD-ROM, or distributed by being downloaded to a user's computer over a network. They may also be provided to a user's computer as a web service over a network without being downloaded.

1 Anonymization device
10 Control unit
11 Data set input unit
12 Hierarchical tree input unit
13 Anonymization rule input unit
14 Output rule input unit
15 Setting information input unit
16 Anonymization processing unit
17 Anonymized data output unit
20 Storage unit

Claims

An anonymization device comprising:
a data set input unit that periodically receives input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input unit that receives, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input unit that receives input of anonymization rules suited to how the data set will be used;
an output rule input unit that receives input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing unit that anonymizes the entire data set according to the anonymization rules;
an anonymized data output unit that outputs, from the data set anonymized by the anonymization processing unit, only the records that satisfy the output rules; and
a held-back record storage unit that stores, in their pre-anonymization state, the records held back because they did not satisfy the output rules,
wherein the anonymization processing unit adds the records stored in the held-back record storage unit to the data set received by the data set input unit and then performs anonymization.

The anonymization device according to claim 1, wherein, when a record includes a date-and-time attribute, the held-back record storage unit stores the record with the date-and-time information deleted.

The anonymization device according to claim 1 or 2, wherein the held-back record storage unit deletes records for which a predetermined period has elapsed.

An anonymization method in which a computer executes:
a data set input step of periodically receiving input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input step of receiving, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input step of receiving input of anonymization rules suited to how the data set will be used;
an output rule input step of receiving input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing step of anonymizing the entire data set according to the anonymization rules;
an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rules; and
a held-back record storage step of storing, in their pre-anonymization state, the records held back because they did not satisfy the output rules,
wherein, in the anonymization processing step, the records stored in the held-back record storage step are added to the data set received in the data set input step and anonymization is then performed.

An anonymization program for causing a computer to execute:
a data set input step of periodically receiving input of a data set consisting of a plurality of records having the same attributes;
a hierarchical tree input step of receiving, for each attribute included in the data set, input of a generalization hierarchy tree having generalized upper nodes;
an anonymization rule input step of receiving input of anonymization rules suited to how the data set will be used;
an output rule input step of receiving input of output rules defining the conditions under which a record may be output, based on the risk that an individual is identified;
an anonymization processing step of anonymizing the entire data set according to the anonymization rules;
an anonymized data output step of outputting, from the data set anonymized in the anonymization processing step, only the records that satisfy the output rules; and
a held-back record storage step of storing, in their pre-anonymization state, the records held back because they did not satisfy the output rules,
wherein, in the anonymization processing step, the program causes the computer to add the records stored in the held-back record storage step to the data set received in the data set input step and then perform anonymization.