CN110378144A

CN110378144A - Privacy protection method and system supporting range query in data-as-a-service mode

Info

Publication number: CN110378144A
Application number: CN201910481273.5A
Authority: CN
Inventors: 吴广君; 王勇; 王振宇; 李军
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-06-04
Filing date: 2019-06-04
Publication date: 2019-10-25
Anticipated expiration: 2039-06-04
Also published as: CN110378144B

Abstract

The invention relates to a privacy protection method and system supporting range queries in a data-as-a-service (DaaS) mode. In the DaaS management model, the security policy of the data service provider may be incomplete, and the data owner does not fully trust the provider. Such an environment calls for a complete mechanism that protects data privacy and security while keeping queries reasonably efficient. Existing DaaS management schemes have low time efficiency and are exposed to attacks on private information. The invention proposes a complete, privacy-preserving solution that supports range queries and data verification. Its core is to partition the data and to build an index for the data in each partition from partial sums of hash function values; a query precision is introduced to avoid false hits, and a verification matrix is introduced to verify data. Experiments show that the invention has good time efficiency and substantially reduces leakage of data information.

Description

Privacy protection method and system supporting range query in data-as-a-service mode

Technical field

The invention belongs to the field of data security technologies such as data management and privacy protection, and specifically concerns the design of a data-as-a-service (DaaS) data management mode that protects user privacy.

Background

Data as a Service (DaaS) has become a data management model of the cloud computing era. Data organizers obtain on-demand data storage by purchasing a service; moving storage tasks to the cloud reduces enterprise costs and increases data management capacity. Data privacy, however, has become an issue that data organizers must address, and leaks of user privacy have already caused serious social problems.

Among user privacy protection techniques, the first is a sound access control mechanism, so that illegitimate identities cannot obtain access to the data. The second is data encryption: key data or all data are stored in encrypted form, and when the data are used they are decrypted to recover the original data before any statistical analysis is performed. The data-as-a-service model is currently an effective way to manage and use data. The data usually reside on the servers of a data service provider, and data users obtain the data from those servers when they need them. The advantage of the model is that data access is not tied to a time or place, and data organizers do not need to set up their own hardware as data servers. Storing data with a data service provider, however, increases the risk of data privacy leaks.

In the DaaS model, the business process involves three roles: (1) the data organizer, who owns the data; (2) the data service provider, which stores the users' encrypted data; and (3) the data user (the data-using customer), who queries the data through the service offered by the data service provider. As shown in Figure 1, the data user relies on the storage service of the data service provider. For example, a data organizer that wants to deploy its data TRADE(tno cost,date) to a data service provider first encrypts the data items to protect privacy, obtaining Encrypt(TRADE), and submits the result to the provider. Once a data user has the data organizer's consent and has received the codebook, it can query the data and decrypt them to obtain the original data. The trust relationships and service mode among the three roles are also shown in Figure 1. The data user and the data organizer trust each other, so the data user can obtain the cryptographic policy of the data and use the data query service. The data service provider, by contrast, is not fully trusted by either of them: its storage policy is not completely reliable, and the data may be illegally stolen or tampered with.

To protect data privacy when the data service provider is not trusted, the data must be stored encrypted, and query requests must also be transformed before they are submitted, so that the provider does not learn the underlying data. At the same time, encrypted data must remain operable, and query results must be verifiable.

Range queries over encrypted data and verification of query results are the core techniques for keeping data and services usable. Existing methods for range queries over encrypted data are order-preserving encryption and bucketization. With order-preserving encryption, d₁ < d₂ implies Encrypt(d₁) < Encrypt(d₂) after encryption. Algorithms of this kind exist, including OPE and OPES, but they are time-consuming, and inserting new data is costly and consumes more computing resources. Bucketization divides the data range into several discrete intervals and assigns an identifier to each bucket. In the ideal case, where each bucket holds at most one data item, queries produce no false hits; in practice, however, data are often not uniformly distributed and queries frequently return false hits. For data verification, current techniques generally use Merkle hash trees. This approach is difficult to apply to multi-dimensional data and has high space requirements; for two-dimensional data, for example, the space complexity is O(n²). Moreover, when a data item is updated, the hash values of all parent nodes of its Merkle node must be updated, and all data of the data set take part in the computation.
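The false-hit problem of bucketization can be illustrated with a small Python sketch. The bucket boundaries, the sample values, and the helper names `bucket_id` and `bucket_range_query` are illustrative assumptions and do not come from the patent; the server is assumed to see only bucket numbers, so it returns whole buckets.

```python
from bisect import bisect_right

def bucket_id(value, boundaries):
    """Return the index k of the bucket [boundaries[k], boundaries[k+1]) holding value."""
    return bisect_right(boundaries, value) - 1

def bucket_range_query(values, boundaries, lo, hi):
    """Return (candidates, false_hits) for the closed query range [lo, hi]."""
    # Every bucket touched by the range is returned in full.
    wanted = set(range(bucket_id(lo, boundaries), bucket_id(hi, boundaries) + 1))
    candidates = [v for v in values if bucket_id(v, boundaries) in wanted]
    # Values returned by the server that the client must discard after decryption.
    false_hits = [v for v in candidates if not lo <= v <= hi]
    return candidates, false_hits

boundaries = [0, 10, 20, 30]       # three buckets: [0,10), [10,20), [20,30)
values = [1, 2, 3, 4, 12, 25]      # skewed data: most values fall in the first bucket
candidates, false_hits = bucket_range_query(values, boundaries, 3, 12)
```

For the query range [3, 12], the first two buckets are returned in full, so the values 1 and 2 come back even though they lie outside the range.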

Summary of the invention

In view of the above problems, the invention provides a verifiable data privacy protection method for the data-as-a-service (DaaS) model that supports range queries and also supports verification of the correctness of query results.

The overall technical solution of the invention is as follows:

1. The value range U is divided into N intervals, each of which is assigned a unique identifier. An interval is the unit in which the hash index and the hash signature chain are updated when data are inserted, and also the unit in which the security policy is updated. Updating the security policy means that interval identifiers can be changed to prevent importance attacks based on access frequency, and that the encryption key can be changed so that the data permissions of certain data users expire automatically, which prevents abuse of data access rights.

2. A query precision Φ and flag records are introduced, so that a range search can locate the specific position corresponding to each boundary of the range.

3. Within a partition, a hash index is computed whose order is consistent with the order of the attribute values of the data. The order is preserved by accumulating the hash function values of the elements of the partition.

4. The invention designs a hash signature chain and a verification matrix for data verification. Each hash signature in the chain is obtained from the hashes of its own data item and of the adjacent data item. The hash signature chain is stored at the data service provider. To verify the hash signature chain and the data items, the invention designs a verification matrix for checking query results. Through the verification matrix, the correctness and integrity of the data can be verified in three ways: self-certification, other-certification, and co-certification.

Specifically, the technical solution adopted by the invention is as follows:

A privacy protection method supporting range query in a data-as-a-service mode, comprising the following steps:

1) The data organization end divides the value range of the data into several intervals and assigns each interval a unique identifier; the mapping between intervals and identifiers forms part of a codebook; the data organization end authorizes the codebook to trusted data user ends;

2) The data organization end builds an order-preserving hash index for the data items in the same interval and computes a hash signature chain; each hash signature in the chain is obtained from the hashes of the data item itself and of the data item linked to it;

3) When inserting data, the data organization end sets flag records in the hash index, and then submits the encrypted data items of each interval, together with the corresponding hash index and hash signature chain, to the data server end;

4) When the data user end performs a range search against the data server end, it uses the query precision set by the data organization end and the flag records to locate the specific positions corresponding to the range boundaries;

5) After receiving the data returned by the data server end, the data user end decrypts the data with the codebook authorized by the data organization end and verifies the data with the verification matrix of the hash signatures.

A privacy protection system supporting range query in a data-as-a-service mode, comprising a data organization end, a data server end, and a data user end;

The data organization end divides the value range of the data into several intervals and assigns each interval a unique identifier; the mapping between intervals and identifiers forms part of a codebook; the data organization end authorizes the codebook to trusted data user ends;

The data organization end builds an order-preserving hash index for the data items in the same interval and computes a hash signature chain; each hash signature in the chain is obtained from the hashes of the data item itself and of the data item linked to it;

When inserting data, the data organization end sets flag records in the hash index, and then submits the encrypted data items of each interval, together with the corresponding hash index and hash signature chain, to the data server end;

When the data user end performs a range search against the data server end, it uses the query precision set by the data organization end and the flag records to locate the specific positions corresponding to the range boundaries;

After receiving the data returned by the data server end, the data user end decrypts the data with the codebook authorized by the data organization end and verifies the data with the verification matrix of the hash signatures.

The invention designs an efficient and secure data-as-a-service (DaaS) model. Through its design of data submission, query, and verification, it provides security guarantees over the whole life cycle of the data and, with good time efficiency, offers a complete and secure DaaS model that provides data management and protects data privacy. The solution has the following advantages and effects:

1. It achieves exact range queries over encrypted data, with no false hits in the query results. By defining the query precision Φ, a boundary value of a range query can be mapped to a boundary position in the data set even if the value itself is not in the data set.

2. It has good time efficiency. With value range partitioning, all operations complete in O(1) time with respect to the total size n of the data set. Data submission, update, and query are all time-efficient.

3. It adds verification of the correctness of query results. The hash signatures used for verification are stored at the data service provider, making the most of the provider's service. For more thorough verification, the verification matrix checks signatures and data items in three ways, so that deletion, forgery, and corruption of data can be detected.

4. It protects data privacy over the whole data life cycle. From data submission through data storage, data query, and data correctness verification, privacy is protected along the entire data transmission process, which effectively reduces leakage of real data and prevents malicious statistical analysis of the data.

5. With data partitioning, the data security policy can be upgraded partition by partition. Gradually updating identifiers and partitions, one partition at a time, can invalidate the codebooks held by data users, which helps control data security.

Description of the drawings

Figure 1 is a schematic diagram of the organizational structure of the data-as-a-service (DaaS) model and of the trust relationships among its roles.

Figure 2 is an overall flow diagram of the solution of the invention. It has three main parts: the data organizer, the data service provider, and the data user. There are five interfaces: the data management interfaces and query interfaces of the data organizer and the data service provider, and the query matching and data verification interfaces of the data user.

Figure 3 is a schematic diagram of the hash signature designed in the solution, where d denotes a data item and s denotes the hash signature stored together with that data item.

Figure 4 is a schematic diagram of the hash signatures of the data when the solution is used for multi-dimensional queries. A and B denote the two queried attributes; the data items within one partition of attribute A are also kept in order by attribute B.

Figure 5 is a schematic diagram of a range query in the solution, where Q denotes the query, a and b are the boundaries of the query range, and id is the identifier of a specific value range partition.

Figure 6 compares the processing time of CPS queries.

Figure 7 compares data index values with the original data values.

Detailed description

The main technical points of the invention are the core steps of value range partitioning, obtaining the hash index, computing the hash signature chain, and submitting, querying, and verifying data. Figure 2 is the overall flow diagram of the solution, which has three main parts: the data organizer, the data service provider, and the data user. These are also referred to as the data organization end, the data server end, and the data user end, respectively. The implementation of each part is described in detail below.

1. Value range partitioning.

The data organizer takes the value range of the attribute to be indexed for queries as U, divides it into N intervals according to the data distribution, and assigns each interval an identifier id_i. An identifier uniquely matches one interval. The mapping between intervals and identifiers forms part of the codebook, and the codebook is owned by the data organizer.
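A minimal Python sketch of this step follows. It assumes equal-width intervals and random hexadecimal identifiers, whereas the text only says that the range is divided according to the data distribution; the names `build_codebook` and `partition_id` are hypothetical.

```python
import secrets
from bisect import bisect_right

def build_codebook(lower, upper, n):
    """Split [lower, upper) into n equal-width intervals with unique random identifiers."""
    width = (upper - lower) / n
    starts = [lower + k * width for k in range(n)]
    ids = set()
    while len(ids) < n:                  # draw until n distinct identifiers exist
        ids.add(secrets.token_hex(8))
    return starts, list(ids)

def partition_id(value, starts, ids, upper):
    """Look up the identifier of the interval that contains value."""
    if not starts[0] <= value < upper:
        raise ValueError("value outside the indexed range")
    return ids[bisect_right(starts, value) - 1]

# Only the data organizer (and the data users it authorizes) hold starts and ids.
starts, ids = build_codebook(0, 100, 4)
```

Because the identifiers carry no ordering information, a party that sees only identifiers cannot tell which interval of U a record belongs to without the codebook.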

2. Obtaining the hash index.

This step produces an order-preserving hash index and is carried out by the data organizer. For data d₁ < d₂, the index values satisfy Index(d₁) < Index(d₂). To obtain an order-preserving hash value that can also be computed quickly, partial sums of hash values are used within each partition. That is, for the data D_i = {d₁, d₂, ..., d_Ni} of partition U_i, the hash index of an element is obtained by accumulating the hash function values of the elements of the partition.
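The partial-sum idea can be sketched in Python as follows. The exact index formula is not reproduced in this text, so the sketch assumes that the index of the k-th smallest element is the sum of the hash values of the first k elements; the keyed SHA-256 hash and the names `positive_hash` and `partition_index` are assumptions made for illustration.

```python
import hashlib
from itertools import accumulate

def positive_hash(value, key):
    """Keyed hash mapped to a strictly positive integer (hypothetical helper)."""
    digest = hashlib.sha256(key + repr(value).encode()).digest()
    return int.from_bytes(digest[:4], "big") + 1

def partition_index(values, key):
    """Map each distinct value of one partition to a running sum of hash values."""
    ordered = sorted(set(values))
    sums = accumulate(positive_hash(v, key) for v in ordered)
    return dict(zip(ordered, sums))

index = partition_index([42, 7, 19, 7], key=b"demo-key")
```

Every summand is positive, so the running sums increase with the sorted values, while the gaps between consecutive index values depend on the hash rather than on the data. Inserting a value shifts the sums of all later elements, which matches the need to recompute a whole partition on insertion.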

3. Obtaining the hash signature chain and the verification matrix.

This step serves to verify the correctness and completeness of query results and is carried out by the data organizer. The hash signature chain is also computed per partition. As shown in Figure 3, the data items of each partition are sorted, and each data record stores a hash signature computed jointly from the hash value of its own data item and that of the next data item. The signature formula is:

$$S(\mathrm{data}) = \mathrm{MaxP}\big(\mathrm{SHash}(d_i),\, 1/\varepsilon_1\big) + \mathrm{MaxP}\big(\mathrm{SHash}(d_{i-1}),\, 1/\varepsilon_2\big) \qquad (1)$$

where ε₁ and ε₂ are parameters of the computation, S(data) is the hash signature of the data item, MaxP(a, b) is the largest multiple of the integer b that is less than a (a need not be an integer), and SHash is the hash function applied to the data item d_i. ε₁ and ε₂ determine the collision rate of the signature formula: the smaller ε₁ and ε₂ are, and when prime values are chosen, the lower the collision rate. ε₁ and ε₂ are chosen by the data organizer and, for verification to be reliable, must not be known to the data service provider.
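A hedged Python sketch of signature formula (1) is given below. The parameters `m1` and `m2` stand for 1/ε₁ and 1/ε₂ and are assumed to be integers; the text does not specify SHash, so a truncated SHA-256 is used as a stand-in; and because formula (1) as printed uses d_(i-1) while the prose speaks of the next data item, the second argument is simply called `neighbour`.

```python
import hashlib
import math

def max_p(a, b):
    """Largest multiple of the integer b that is strictly less than a."""
    return (math.ceil(a / b) - 1) * b

def s_hash(item):
    """Stand-in for SHash: the first 4 bytes of SHA-256 as an integer (an assumption)."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def signature(item, neighbour, m1, m2):
    """Signature of item chained to its neighbour; m1 = 1/eps1 and m2 = 1/eps2."""
    return max_p(s_hash(item), m1) + max_p(s_hash(neighbour), m2)

# Hypothetical secret parameters kept by the data organizer.
sig = signature("record-a", "record-b", 7, 11)
```

Since the provider does not know `m1` and `m2`, it cannot recompute a matching signature for a forged or reordered record, which is the property the verification step relies on.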

With this formula, every data item and every hash signature can be proved in three ways: 1) self-certification; 2) other-certification, in which a hash signature is proved by the following data item and a data item is proved by the preceding hash signature in the chain; 3) co-certification, in which the signature formula is satisfied. Based on this, a verification matrix of hash signatures is designed:

$$S = \begin{pmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \end{pmatrix}$$

where s_ij indicates whether the hash signature or the data item satisfies each of the three kinds of proof. Other-certification and self-certification are defined by formulas (2) and (3). Here s_i is the hash signature of data item d_i returned by the data service provider. s₁₁, s₁₂, s₁₃ indicate whether the hash signature satisfies self-certification, other-certification, and co-certification, and s₂₁, s₂₂, s₂₃ indicate the same for the data item. A value of 1 means satisfied, and 0 means not satisfied.

Here β₁ = 1-ε₁ and β₂ = 1-ε₂, and β₃ = 1-ε₁·ε₂ expresses the error of the three kinds of proof. The operation "*" is also defined:

$$Au = S * A, \qquad au_{ij} = \max_{k}\,\big(s_{ik} \times a_{kj}\big).$$

where s_ik is an element of matrix S, with row index i and column index k, and a_kj is an element of matrix A, with row index k and column index j.
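The "*" operation is a max-product composition: an ordinary matrix product in which the sum over k is replaced by a maximum. The Python sketch below shows it on made-up inputs; the entries of `A` are placeholders and are not the β-based matrix A of the patent, which is not reproduced in this text.

```python
def max_product(S, A):
    """Compute Au = S * A with au[i][j] = max over k of S[i][k] * A[k][j]."""
    inner, cols = len(A), len(A[0])
    if any(len(row) != inner for row in S):
        raise ValueError("column count of S must equal row count of A")
    return [
        [max(row[k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for row in S
    ]

# Row 1: flags for the hash signature, row 2: flags for the data item (1 = satisfied).
S = [[1, 0, 1],
     [1, 1, 0]]
# Hypothetical confidence weights, chosen only to make the example concrete.
A = [[0.90, 0.50],
     [0.80, 0.70],
     [0.99, 0.60]]
Au = max_product(S, A)
```

Each entry of `Au` is therefore the strongest weight among the proofs that were actually satisfied, and it is 0 when none of them was.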

4. Submitting data

When the data organizer submits data items to the data service provider, two computations are involved: the hash index and the hash signature. First, the identifier id of the partition to which the data belong is obtained by matching against the codebook. To keep the hash indexes and hash signatures of the data in that partition correct, all data of the partition are then requested from the data service provider using this id (requesting all data of the partition is a general step of insertion; on the first submission the request returns empty data). The new data are inserted and the data in the partition are sorted, after which the hash indexes and hash signatures are computed from the index formula and the signature formula. Finally, all data of the partition are submitted to the data service provider.
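The fetch, insert, sort, recompute, and resubmit cycle can be sketched in Python with an in-memory dictionary standing in for the data service provider. Encryption and hash signatures are left out to keep the sketch short, the index is assumed to be a running sum of hash values, and the names `toy_hash`, `submit`, and `provider` are hypothetical.

```python
import hashlib
from itertools import accumulate

def toy_hash(value):
    """Positive integer hash; a stand-in for the scheme's hash function."""
    digest = hashlib.sha256(repr(value).encode()).digest()
    return int.from_bytes(digest[:4], "big") + 1

def submit(provider, partition_id, new_value):
    """Fetch a partition, insert new_value, re-sort, re-index, and resubmit it."""
    stored = provider.get(partition_id, [])      # empty on the first submission
    values = sorted({value for _, value in stored} | {new_value})
    indexes = accumulate(toy_hash(value) for value in values)
    provider[partition_id] = list(zip(indexes, values))
    return provider[partition_id]

provider = {}                 # in-memory stand-in for the provider's storage
submit(provider, "id_7", 30)
rows = submit(provider, "id_7", 10)
```

Only the affected partition is read and rewritten, so the cost of an insertion depends on the size of one partition rather than on the size of the whole data set.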

Flag records may need to be inserted when the index is built. For a partition [a, b) with query precision Φ, a marker item, the flag record v', is inserted in front of an inserted data item v. From the way the hash index is obtained, it follows that for a boundary value that is not in the data set, neither the smallest stored value greater than it nor the largest stored value smaller than it can be determined. In other words, for a query interval Q(c, d), the boundary values c and d cannot be located. The query precision solves this problem. The query precision Φ is the smallest resolution at which the data of an interval can be queried. It is chosen by the data organizer, and data users, who need it when querying, receive it from the data organizer. If the query precision of partition [a, b) is Φ, the interval [a, b) can be divided into the discrete values {a, a+Φ, ..., a+iΦ, ..., b}. The query boundaries c and d are then each converted to a value of the form a+iΦ, which can be located in the data set through a flag record v'. Because v' < v, the index within the partition remains ordered. For a flag record v' = a+iΦ, the additional value i of the flag record must be stored, and the partition identifier id of the flag record is the same as that of the data item v. The value i is computed as (v-a)/Φ, taken as an integer.

Φ should be defined according to the data distribution of the partition, and different partitions need not use the same Φ. Where the data are sparse, Φ can be larger; where the data in a partition are relatively dense, Φ should be relatively small. For integer data, for example, Φ can be set to 1. Φ is also unknown to the data service provider.
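The flag record computation can be sketched in Python as follows. The sketch assumes that "taken as an integer" means rounding down, and the partition [100, 200) with Φ = 5 is a made-up example; `flag_record` is a hypothetical helper name.

```python
import math

def flag_record(v, a, phi):
    """Return (i, v_flag) with i = floor((v - a) / phi) and v_flag = a + i * phi."""
    i = math.floor((v - a) / phi)
    return i, a + i * phi

# Made-up partition [100, 200) with query precision phi = 5.
i_data, flag_for_data = flag_record(137, 100, 5)     # flag record stored with v = 137
i_query, flag_for_query = flag_record(142, 100, 5)   # query boundary c = 142 snaps to the grid
```

Here the data item 137 gets the flag record 135 with additional value 7, and a query boundary of 142 maps to the grid value 140 with additional value 8, so a boundary that is absent from the data set can still be positioned relative to stored flag records.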

When a data item is submitted, a record may have several indexed attributes. The partitions on different attributes differ, so to compute the index values and hash signatures for several attributes, the attributes must be processed in order. Suppose an inserted data item has two attributes, A and B. First, all data items of the partition are fetched using the partition identifier of the A value, the new data item is inserted, and the indexes and hash signatures are recomputed. Then all data items of the partition identified by the partition identifier of the B value are fetched, their index and signature data are updated, and the result is submitted to the server of the data service provider. As shown in Figure 4, where A and B denote the two queried attributes, the data items within one partition of attribute A are also kept in order by attribute B.

5. Querying data

In a range search, the intervals touched by the range fall into two classes: 1) intervals whose data all satisfy the query; and 2) intervals at the query boundaries, part of which may lie outside the query range. After the range has been split into these two classes, the partition identifiers are matched and the index values and flag record additional values are computed. This information is submitted to the server of the data service provider, which returns the data to the data user. The data user then decrypts the data (using the codebook authorized by the data organizer) and verifies them. Figure 5 illustrates a range query, where q₁ and q₂ denote the flag records computed from the values a and b.
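The split into fully covered partitions and boundary partitions can be sketched in Python. The sketch assumes half-open partitions [start, end), a closed query range [lo, hi], and a plaintext codebook given as a list of tuples; `classify_partitions` and the sample codebook are illustrative only.

```python
def classify_partitions(partitions, lo, hi):
    """Split the partitions touched by [lo, hi] into fully covered and boundary ones.

    partitions is a list of (identifier, start, end) for half-open intervals [start, end).
    """
    full, boundary = [], []
    for ident, start, end in partitions:
        if end <= lo or start > hi:
            continue                      # the partition does not intersect the range
        if lo <= start and end <= hi:
            full.append(ident)            # every value in the partition satisfies the query
        else:
            boundary.append(ident)        # needs index values and flag records to trim
    return full, boundary

codebook = [("p0", 0, 10), ("p1", 10, 20), ("p2", 20, 30), ("p3", 30, 40)]
full, boundary = classify_partitions(codebook, 15, 32)
```

For fully covered partitions only the identifier needs to be sent, while for boundary partitions the data user also sends the index values and flag record additional values derived from the range boundaries.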

Each data record the data user receives from a query consists of a data item d_i and a hash signature s_i. From these, the signature matrix of each data record i is computed. By the definition of Au above, the following can be read from Au: the probability that no data are missing before the data item is au₁₂, the probability that no data are missing after the data item is au₂₂, the probability that the data item is correct is au₂₁, and the probability that the hash signature is correct is au₁₁.

6. Experimental data and conclusions

The solution was evaluated in terms of time efficiency and data distribution, which confirmed its feasibility and its good time efficiency. The advantages of the solution are analyzed through the experiments below. The experiments use a simulated data set TRADE(tno cost,date) and TLSPD, a public database from labor statistics.

1) Time efficiency.

Time efficiency is one of the main performance indicators of the invention. The execution time of the solution is compared with that of OPHF, as shown in Table 1. OPHF is an ordered index algorithm that does not use partitioning.

Table 1: Execution time of the different schemes (milliseconds per data item)

| Number of partitions (N) | Total time | Encryption and decryption time | Query preprocessing time | Query time at the data service provider |
|---|---|---|---|---|
| 500 | 54.90754 | 0.52868 | 0.3628 | 54.01606 |
| 1000 | 54.6431 | 0.45218 | 0.22302 | 53.9679 |
| 2000 | 52.6859 | 0.40784 | 0.14428 | 52.13378 |
| OPHF | 4957.006 | 4.842 | 4929.932 | 22.096 |

In the experiment, the simulated data set TRADE(tno cost,date) contains 10,000 records, with cost generated at random from a uniform distribution over the range 0 to 10,000,000. Our scheme, CPS, was first applied by indexing the cost attribute, with the value range divided into 500, 1000, and 2000 intervals. The results are shown in Figure 6: the finer the partitioning, the less time is consumed. With more intervals, however, the codebook that the data owner must keep grows larger. The comparison with OPHF also shows that the solution has good time efficiency.

2) Data privacy protection

This part shows that the data index of the present invention does not leak private information about the data, in particular neither the distribution of the data nor its maximum and minimum values. The experimental results are shown in Figure 7: within a partition the data index is increasing, and the increase is linear and independent of the data distribution. Panel (a) of Figure 7 uses the uniformly distributed simulated data, and panel (b) uses the real database. Regardless of how the real data is distributed, as long as the partitions are set reasonably, so that the data items in each partition are as nearly the same as possible, importance attacks can be avoided. In addition, dynamically updating the identifier of each partition prevents malicious analysis based on query frequency.
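The behavior described here, an index that increases within a partition without following the spacing of the underlying values, can be illustrated with a partial sum of hash values. The sketch below is an assumption-laden illustration, not the patented construction: `shash` is a hypothetical stand-in built on SHA-256, and the modulus of 1000 is arbitrary.

```python
import hashlib
from itertools import accumulate


def shash(value, modulus=1000):
    """Map a value to a positive integer in [1, modulus] (hypothetical hash choice)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus + 1


def partition_index(sorted_values):
    """Index of the i-th item = sum of the hash values of items 1..i in the partition."""
    return list(accumulate(shash(v) for v in sorted_values))


evenly_spaced = [10, 20, 30, 40, 50]
skewed = [1, 2, 3, 1000, 900000]
for values in (evenly_spaced, skewed):
    # Every term added is positive, so the index is strictly increasing,
    # while the step sizes depend on hash values rather than on value gaps.
    print(partition_index(values))
```

Because each step adds a bounded positive hash value, the index grows roughly in proportion to the item's position in the partition, whatever the gaps between the plaintext values are.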

The scheme has been verified experimentally in terms of time efficiency and data privacy. The experiments show that the scheme has good time efficiency, with very little computational cost in the data submission, data query, and data verification phases. The scheme is also complete: under the relational data model, it allows data to be managed effectively and securely, allows encrypted data to be queried securely and efficiently even when the data service provider is untrusted, and allows the integrity and correctness of query results to be verified.

Another embodiment of the present invention provides a privacy protection system supporting range query in the data-as-a-service mode, comprising a data organization end, a data server end, and a data user end.

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention is defined by the claims.

Claims

1. A privacy protection method supporting range query under a data-as-a-service mode, characterized in that it comprises the following steps:

1) The data organization end divides the value range of the data into several intervals and assigns each interval a unique identifier; the mapping between intervals and identifiers forms part of the codebook; the data organization end authorizes the codebook to trusted data user ends;

2) The data organization end builds an order-preserving hash index for the data items in the same interval and calculates a hash signature chain; each hash signature in the chain is obtained from the hashes of the data item itself and of the data item linked to it;

3) When inserting data, the data organization end sets flag records in the hash index, and then submits the encrypted data items of each interval, the corresponding hash index, and the hash signature chain to the data server end;

4) When the data user end performs a range query against the data server end, the exact positions of the range boundaries are located by means of the query precision and the flag records set by the data organization end;

5) After receiving the data returned by the data server end, the data user end decrypts the data with the codebook authorized by the data organization end, and verifies the data with the verification matrix of the hash signatures.

2. The method according to claim 1, wherein each interval is the unit in which the hash index and the hash signature chain are updated when data is inserted, and is also the unit in which the security policy is updated; updating the security policy means updating the identifier of the interval, to prevent importance attacks based on access frequency, and changing the encryption key so that the data permissions of certain data users expire automatically, thereby preventing abuse of data access permissions.

3. The method according to claim 1, wherein, when the order-preserving hash index is built, the order is preserved by accumulating the hash function values of the elements in the interval.

4. The method according to claim 1, wherein the query precision allows the query boundary to be located when a range query is performed, and setting different query precisions in different intervals does not leak the data distribution; when a range query is performed, false hits are avoided by comparing the flags of the flag records with the boundary values.

5. The method according to claim 1, wherein, within the same interval, the index values of the hash index are in the same order as the data records, so that a range query only needs to locate the positions of the boundaries to obtain the query range.
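Because the index values within an interval follow the order of the data records, the server-side part of a range query reduces to two boundary lookups. The sketch below illustrates that idea with binary search; the function `range_slice`, the sample index values, and the record labels are hypothetical, and the way the data user end derives the boundary index values (query precision and flag records) is deliberately left out.

```python
from bisect import bisect_left, bisect_right


def range_slice(index_values, records, lower_index, upper_index):
    """Return the records whose index value lies in [lower_index, upper_index].

    index_values must be sorted in ascending order and aligned with records.
    """
    start = bisect_left(index_values, lower_index)    # first position >= lower bound
    end = bisect_right(index_values, upper_index)     # first position > upper bound
    return records[start:end]


index_values = [17, 40, 95, 130, 188]                 # made-up index values
records = ["enc_a", "enc_b", "enc_c", "enc_d", "enc_e"]
print(range_slice(index_values, records, 40, 130))    # ['enc_b', 'enc_c', 'enc_d']
```

Only the two boundary positions are searched for; everything between them is returned without inspecting the encrypted records.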

6. The method according to claim 1, wherein the verification matrix is calculated from the hash signatures and can verify both the hash signatures and the data items; the data is verified by proof methods that can determine whether data has been deleted, forged, or corrupted.

7. The method according to claim 6, wherein the calculation formula of the hash signature is:

$$S(\mathrm{data}) = \mathrm{MaxP}\left(\mathrm{SHash}(d_i),\ 1/\varepsilon_1\right) + \mathrm{MaxP}\left(\mathrm{SHash}(d_{i-1}),\ 1/\varepsilon_2\right)$$

where ε₁ and ε₂ are calculation parameters, S(data) denotes the hash signature of the data, MaxP(a, b) denotes the largest multiple of the integer b that is less than a, and SHash is the hash function applied to the data item d_i; ε₁ and ε₂ determine the collision rate of the signature formula. According to the signature formula, when ε₁ and ε₂ are smaller and a prime number is selected, the collision rate of the signature formula is lower.
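The signature formula can be sketched in a few lines. This is an illustration under stated assumptions rather than the claimed implementation: MaxP(a, b) is read as the largest multiple of b strictly less than a, `shash` is a hypothetical SHA-256-based stand-in for SHash, and the values 1/ε₁ = 7 and 1/ε₂ = 11 are arbitrary primes chosen for the example. How the first item of a chain is signed is not covered here, so the chain below starts at the second item.

```python
import hashlib


def shash(value):
    """Hypothetical stand-in for SHash: a positive integer derived from SHA-256."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:4], "big") + 1


def max_p(a, b):
    """Largest multiple of b that is strictly less than a (a and b positive integers)."""
    return ((a - 1) // b) * b


def signature(d_i, d_prev, inv_eps1=7, inv_eps2=11):
    """S = MaxP(SHash(d_i), 1/eps1) + MaxP(SHash(d_{i-1}), 1/eps2)."""
    return max_p(shash(d_i), inv_eps1) + max_p(shash(d_prev), inv_eps2)


def signature_chain(items):
    """Sign each item together with its predecessor (assumed chaining order)."""
    return [signature(items[i], items[i - 1]) for i in range(1, len(items))]


print(signature_chain([120, 340, 560, 780]))
```

Each signature depends on two neighboring items, so removing or altering one item disturbs the signatures on both sides of it.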

8. The method according to claim 7, wherein the verification matrix is:

where s_ij indicates whether the hash signature or the data item satisfies each of the three proof methods: s₁₁, s₁₂, and s₁₃ respectively indicate whether the hash signature satisfies self-certification, other certification, and joint certification, and s₂₁, s₂₂, and s₂₃ respectively indicate whether the data item satisfies self-certification, other certification, and joint certification;

where β₁, β₂, and β₃ denote the errors of the three proofs, with β₁ = 1-ε₁, β₂ = 1-ε₂, and β₃ = 1-ε₁·ε₂;

where Au = S*A, with au_ij = max(s_ik × a_kj) taken over k; s_ik denotes the element of matrix S in row i and column k, and a_kj denotes the element of matrix A in row k and column j.
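The composition Au = S*A defined by au_ij = max(s_ik × a_kj) is a max-product rather than the usual sum-product of matrices. The sketch below shows only that operation. The contents of S, the values of ε₁ and ε₂, and above all the placement of β₁, β₂, and β₃ inside A are hypothetical placeholders, since the layout of A is not recoverable from the text here.

```python
def max_product(S, A):
    """Compute Au where Au[i][j] = max over k of S[i][k] * A[k][j]."""
    inner = len(A)
    if any(len(row) != inner for row in S):
        raise ValueError("columns of S must match rows of A")
    cols = len(A[0])
    return [
        [max(S[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(S))
    ]


eps1, eps2 = 0.1, 0.2                       # illustrative parameter values
b1, b2, b3 = 1 - eps1, 1 - eps2, 1 - eps1 * eps2

S = [[1, 0, 1],                             # hash signature: which proofs hold
     [1, 1, 0]]                             # data item: which proofs hold
A = [[b1, b1],                              # hypothetical placement of the betas
     [b2, b2],
     [b3, b3]]

Au = max_product(S, A)
print(Au)
```

With a 2×3 matrix S and a 3×2 matrix A, the result is 2×2, which matches the four entries au₁₁, au₁₂, au₂₁, and au₂₂ referred to in claim 9.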

9. The method according to claim 1, wherein, after the data user end obtains the queried data, each datum comprises a data item d_i and a hash signature s_i, and the signature matrix of each datum i is then calculated, in which: the probability that no data is missing before the data item is au₁₂, the probability that no data is missing after the data item is au₂₂, the probability that the data item is correct is au₂₁, and the probability that the hash signature is correct is au₁₁.

10. A privacy protection system supporting range query under the data-as-a-service mode, characterized in that it comprises a data organization end, a data server end, and a data user end;

The data organization end divides the value range of the data into several intervals and assigns each interval a unique identifier; the mapping between intervals and identifiers forms part of the codebook; the data organization end authorizes the codebook to trusted data user ends;

The data organization end builds an order-preserving hash index for the data items in the same interval and calculates a hash signature chain; each hash signature in the chain is obtained from the hashes of the data item itself and of the data item linked to it;

When inserting data, the data organization end sets flag records in the hash index, and then submits the encrypted data items of each interval, the corresponding hash index, and the hash signature chain to the data server end;

When the data user end performs a range query against the data server end, the exact positions of the range boundaries are located by means of the query precision and the flag records set by the data organization end;

After receiving the data returned by the data server end, the data user end decrypts the data with the codebook authorized by the data organization end, and verifies the data with the verification matrix of the hash signatures.