CN110660450A

CN110660450A - Safety counting query and integrity verification device and method based on encrypted genome data

Info

Publication number: CN110660450A
Application number: CN201910899612.1A
Authority: CN
Inventors: 王雷; 邹赛; 朱贤友; 陈治平
Original assignee: Chongqing College of Electronic Engineering; Changsha University
Current assignee: Chongqing College of Electronic Engineering; Changsha University
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2020-01-07

Abstract

The present invention proposes a method for securely sharing and computing genome data on a not-so-honest cloud server. First, the method proposed in the present invention processes the biomedical data to ensure the privacy and integrity of the shared data. Secondly, the proposed method of the present invention uses the hashMap method to store data, which improves the encryption efficiency and efficiently realizes the secure count query. The present invention uses the existing single nucleotide polymorphism (SNP) sequence database to evaluate the method of the present invention, and experiments show that completing the count query is more flexible, easy to implement and high security for practical applications.

Description

A secure counting query and integrity verification device based on encrypted genomic data and method

技术领域technical field

本发明涉及生物信息技术领域，具体涉及一种基于加密基因组数据的安全计数查询与完整性验证装置。The invention relates to the technical field of biological information, in particular to a secure count query and integrity verification device based on encrypted genome data.

背景技术Background technique

临床医学实践在医疗保健领域发挥至关重要的作用，基因组学的研究也越来越受欢迎。对于基因组学的研究也有助于识别疾病与某种基因之间的潜在相关性，为此，生物医学研究者对病患的临床状态和DNA序列数据进行了大规模调查，大部分基因数据的分析都基于美国国立卫生研究院全基因组关联研究(GWAS)。为了提高研究的准确性，需要将不同来源的数据汇总起来，为此人们建立了各种共享和存储数据的服务系统，例如美国的基因型和表型数据库(dbGaP)和Welcomme Trust在英国的生物银行计划。基因组数据涉及患者的隐私，若数据被泄露可能会导致诸多社会与法律问题，例如，健康保险公司得知特定癌症可能性的基因携带突变的个人信息可能会拒绝对其进行保险。基于基因数据的敏感性，多个机构共享基因数据需要通过隐私保护方法来存储和访问基因组数据。The practice of clinical medicine plays a vital role in healthcare, and the study of genomics is gaining popularity. The study of genomics can also help to identify potential associations between diseases and certain genes. To this end, biomedical researchers have conducted large-scale surveys of patients' clinical status and DNA sequence data. Most of the genetic data are analyzed. Both are based on the NIH Genome-Wide Association Study (GWAS). In order to improve the accuracy of research, data from different sources needs to be aggregated, and various services for sharing and storing data have been established for this purpose, such as the database of genotype and phenotype (dbGaP) in the United States and the Welcome Trust Biotechnology in the United Kingdom. bank plan. Genomic data concerns the privacy of patients, and a breach of the data could lead to many social and legal issues. For example, health insurance companies may refuse to insure personal information about mutations in genes that are known to have a specific cancer likelihood. Based on the sensitivity of genetic data, sharing genetic data among multiple institutions requires privacy-preserving methods to store and access genomic data.

生物医学研究越来越依赖于大量的基因组和临床数据，随之而来备受广大学者关注的是在共享和管理这些数据时，如何保证相应个人的隐私和系统的整体安全。过去，为了保护主体的敏感信息，多个组织共享数据时删除可能识别个人身份的重要标识符。然而，研究表明使用自动化方法可以很简单的推算出主体身份。当前研究为了保护数据的隐私性，使用加密协议来共享，管理和分析生物医学数据，并将加密的数据外包给拥有庞大的存储和计算资源的第三方云服务提供商。与此同时，第三方也成为了可能造成研究主体隐私被侵犯的潜在对象。Biomedical research is increasingly reliant on a large amount of genomic and clinical data, which has attracted much attention from scholars on how to ensure the privacy of the corresponding individuals and the overall security of the system when sharing and managing these data. In the past, to protect the sensitive information of subjects, multiple organizations shared data to remove important identifiers that could identify individuals. However, research has shown that subject identities can be easily deduced using automated methods. In order to protect the privacy of data, current research uses encrypted protocols to share, manage and analyze biomedical data, and outsource encrypted data to third-party cloud service providers with huge storage and computing resources. At the same time, third parties have also become potential targets that may cause the privacy of research subjects to be violated.

现有技术中存在以特定形式在集中式数据库中管理临床基因组数据的安全框架，这些框架提出了在一个不太诚实的云服务上进行基因组数据安全共享和存储的方法。首先，数据拥有者将临床信息发送给第三方机构进行加密和认证，第三方机构将聚合的大量数据以一定结构发送到云服务进行存储。之后，云服务代表研究者执行查询(例如患有高血压患者和特定的基因组变异特征的患者记录数量)，并将查询结果返回给研究者。然而，直接将查询结果返回给研究者，无法排除攻击者攻击研究人员并俘获加密协议，仍然存在数据风险。Security frameworks exist in the prior art to manage clinical genomic data in a centralized database in a specific form, and these frameworks propose methods for the secure sharing and storage of genomic data on a less-than-honest cloud service. First, the data owner sends the clinical information to a third-party organization for encryption and authentication, and the third-party organization sends a large amount of aggregated data to a cloud service for storage in a certain structure. Afterwards, the cloud service executes queries (such as the number of patient records with hypertension and specific genomic variant signatures) on behalf of the researcher, and returns the query results to the researcher. However, directly returning the query results to the researcher cannot rule out that attackers attack the researcher and capture the encryption protocol, and there are still data risks.

发明内容SUMMARY OF THE INVENTION

为了实现在对共享医学数据查询的同时又不泄露数据主体身份下进行科学研究，本发明提出一种基于加密基因组数据的安全计数查询与完整性验证装置，包括数据拥有者终端，认证机构服务器，云服务器，代理机构终端，研究人员终端；In order to carry out scientific research without revealing the identity of the data subject while querying shared medical data, the present invention proposes a security counting query and integrity verification device based on encrypted genome data, including a data owner terminal, a certification authority server, Cloud server, agency terminal, researcher terminal;

认证机构终端从数据拥有者终端接收患者记录并加密；The certification authority terminal receives and encrypts the patient record from the data owner terminal;

数据拥有者终端将患者的基因组数据按照规定的格式处理，并以明文的形式发送给认证机构服务器；The data owner terminal processes the patient's genomic data in the prescribed format and sends it to the certification authority server in clear text;

认证机构服务器获得不同数据拥有者共享的数据后，构建聚合共享加密数据的可搜索的哈希图，并将哈希图发送到云服务；After the certification authority server obtains the data shared by different data owners, it builds a searchable hash map that aggregates the shared encrypted data, and sends the hash map to the cloud service;

云服务从认证机构获取加密数据并处理代理机构终端发送的加密查询请求，执行所述查询请求将查询结果加密发送至代理机构终端；The cloud service obtains encrypted data from the certification body and processes the encrypted query request sent by the agency terminal, executes the query request, and encrypts the query result and sends the query result to the agency terminal;

研究人员终端响应用户操作，将科研人员查询请求发送至代理机构终端；The researcher terminal responds to the user operation and sends the researcher query request to the agency terminal;

代理机构终端接收科研人员查询请求后将查询请求数据加密并将加密后的查询请求发送至云服务器，从云服务器获取查询结果后使用私钥解密所述查询结果并发送至研究人员终端。After receiving the researcher's query request, the agency terminal encrypts the query request data and sends the encrypted query request to the cloud server. After obtaining the query result from the cloud server, it decrypts the query result using the private key and sends it to the researcher's terminal.

进一步的，云服务器生成特定基因的一组SNP数据集和特定基因在数据库中出现的频率值，然后将该值标准化为记录总数；Further, the cloud server generates a set of SNP data sets for a specific gene and the frequency value of the specific gene in the database, and then normalizes the value to the total number of records;

云服务器将计数查询的总数表示为： The cloud server expresses the total number of count queries as:

其中，D为数据库Q为查询，D＝{S₁,S₂,…,S_n}表示n位患者的SNP序列，计数查询定义为查找D中满足Q的多个查询条件q的患者人数，患者的m个SNP位点的基因序列表示为S＝{d₁,d₂,…,d_m}，其中d_i(1≤i≤m)表示该患者的第i个位点的SNP值。Among them, D is the database Q is the query, D={S ₁ , S ₂ ,...,S _n } represents the SNP sequence of n patients, and the count query is defined as finding the number of patients in D that satisfy multiple query conditions q of Q, The gene sequence of m SNP sites of a patient is represented as S={d ₁ ,d ₂ ,...,d _m }, where d _i (1≤i≤m) represents the SNP value of the i-th site of the patient.

进一步的，认证机构服务器收到数据拥有者提供的数据后，对于每位患者的SNP序列创建地图的一个Entity,所有的数据处理后生成一个映射表M,M中包含每位患者的基因型和表型信息，创建M之后，对于每一条来自数据拥有者的新的记录，认证机构服务器创建并更新M的Entity信息；Further, after receiving the data provided by the data owner, the certification body server creates an Entity of the map for the SNP sequence of each patient, and after all data processing, a mapping table M is generated, which contains the genotype and the genotype of each patient. Phenotypic information, after M is created, for each new record from the data owner, the certification authority server creates and updates the Entity information of M;

每个Entity包含以下内容：Each Entity contains the following:

key:Entity中key值指向具体的数据，索引值，key: The key value in Entity points to specific data, index value,

geno：表示每位患者的SNP序列，一个患者的SNP序列表示为，S＝{d₁,d₂,…,d_m}，sid是SNP位点唯一的标识，代表基因序列中特定的位置，其中碱基对d_i的位点标识sid＝i,(1≤i≤n)，表示为

geno: indicates the SNP sequence of each patient, the SNP sequence of a patient is expressed as, S={d ₁ , d ₂ ,...,d _m }, sid is the unique identifier of the SNP site, representing a specific position in the gene sequence, Among them, the site identification of base pair d _i is sid=i, (1≤i≤n), which is expressed as

I:基因组序列对应的表型数据集，I: the phenotype dataset corresponding to the genome sequence,

count:在数据库中，当前SNP序列出现的频率,表示与基因型和表型相匹配的记录的总数，count: In the database, the frequency of the current SNP sequence, indicating the total number of records matching the genotype and phenotype,

next:下一个Entity的key值。next: The key value of the next Entity.

进一步的，认证机构服务器创建哈希表的复杂度为O(mn)，其中m是数据库中的记录数量，n是在每一条序列记录中不同的SNP位点的数目，定义每个Entity为θ，其组成部分表示为θ(key,geno,I,count,next)。Further, the complexity of the hash table created by the certification authority server is O(mn), where m is the number of records in the database, n is the number of different SNP sites in each sequence record, and each Entity is defined as θ. , whose components are denoted as θ(key, geno, I, count, next).

进一步的，认证机构服务器使用Bloom filter技术对表型数据进行处理，对基因组序列中的每条SNPs记录,在Bloom filter中插入相应的表型，如果多个表型与相同的基因组序列相关联，则将它们全部插入该基因组序列entity的Bloom滤波器中，M中的每个Entity都包含Bloom filter值I，表示与该基因组序列对应的表型信息；Further, the certification authority server uses the Bloom filter technology to process the phenotype data, inserts the corresponding phenotype into the Bloom filter for each SNPs record in the genome sequence, if multiple phenotypes are associated with the same genome sequence, Then insert them all into the Bloom filter of the genome sequence entity, and each Entity in M contains the Bloom filter value I, representing the phenotypic information corresponding to the genome sequence;

认证机构服务器设置Bloom滤波器中使用的哈希函数H和公用字母表Σ的域，Σ的域是所有可能表型的集合，对于基因组序列中的每个entity，在Bloom滤波器中插入相应的表型，Σ中的每个表型都被映射到一个唯一的数字，并且将该数字插入到Bloom滤波器中，The certification authority server sets the hash function H used in the Bloom filter and the domain of the common alphabet Σ, where the domain of Σ is the set of all possible phenotypes, and for each entity in the genome sequence, inserts in the Bloom filter the corresponding Phenotype, each phenotype in Σ is mapped to a unique number, and that number is plugged into the Bloom filter,

认证机构服务器使用伪随机函数F选择出随机密钥k,将每个Bloom filter加密为

The certification authority server uses the pseudo-random function F to select a random key k, and encrypts each Bloom filter as

认证机构服务器将私钥sk发送给代理机构终端，并提供PRF的密钥k和用于Bloomfilter的散列函数F_k；The certification authority server sends the private key sk to the agency terminal, and provides the key k of the PRF and the hash function F _k used for Bloomfilter;

认证机构服务器将查询请求的表型值I_q散列的位置与每个Entity的I_i散列位置进行匹配；The certification authority server matches the location of the phenotype value I _q hash of the query request with the I _i hash location of each Entity;

认证机构服务器向代理机构终端发送PRF密钥k，对加密的Bloom filterI′_i采用公式

进行解密。The certification authority server sends the PRF key k to the agency terminal, and uses the formula for the encrypted Bloom filter _I'i

to decrypt.

进一步的，认证机构服务器生成一个密钥对(pk,sk)，其中pk是公钥，sk是私钥，使用公钥pk对哈希表中的每个entity加密，将entity的加密表示为E_pk(θ)；Further, the certification authority server generates a key pair (pk, sk), where pk is the public key and sk is the private key, and the public key pk is used to encrypt each entity in the hash table, and the encryption of the entity is represented as E. _pk (θ);

认证机构服务器将哈希图M中所有的entity均被加密后的哈希图表示为

并将

发送给云服务器；The certification authority server expresses the hash map in which all entities in the hash map M are encrypted as

and will

Send to cloud server;

认证机构服务器将密钥对(pk,sk)发送给代理机构终端。The certificate authority server sends the key pair (pk,sk) to the agency terminal.

进一步的，云服务器将查询结果R_q发送到代理机构，代理机构终端用密钥sk解密SNP值并采用汉明码检测位来校验数据的完整性，对于返回的满足查询条件的查询结果：Further, the cloud server sends the query result R _q to the agency, and the agency terminal decrypts the SNP value with the key sk and uses the Hamming code detection bit to verify the integrity of the data. For the returned query result that meets the query conditions:

R_q＝{E_pk(count)|E_pk(B₁)|E_pk(B₂)|…|E_pk(B_num)|I_i′}R _q ={E _pk (count)|E _pk (B ₁ )|E _pk (B ₂ )|…|E _pk (B _num )|I _i ′}

代理机构终端用私钥sk解密每一个SNP值，解密后的

将

重新计算汉明码检测位为T′；The agency terminal decrypts each SNP value with the private key sk, and the decrypted

Will

Recalculate the Hamming code detection bit as T';

代理机构终端通过对比T′与T_q的相关性来判定数据是否完整，The agency terminal determines whether the data is complete by comparing the correlation between T' and T _q .

若T′＝T_q，则代理机构终端判定数据是安全可信的；If T′=T _q , the agency terminal determines that the data is safe and reliable;

若T′≠T_q,且T′∈{h_A,h_C,h_G,h_T}，则代理机构终端通过检测位T_q对返回的

进行纠正并更新哈希图；If T′≠T _q , and T′∈{h _A ,h _C ,h _G ,h _T }, then the agency terminal checks the bit T _q for the returned

make corrections and update the hashmap;

若T′≠T_q,且

则代理机构终端判定原数据

被篡改，并且被篡改成其它内容，此时对于患者的SNP序列已经不可信，返回的查询结果计数值不可信。If T′≠T _q , and

Then the agency terminal determines the original data

If it has been tampered with and has been tampered with other content, the SNP sequence of the patient is no longer credible, and the returned query result count value is not credible.

本发明还提供一种基于加密基因组数据的安全计数查询与完整性验证方法，包括以下步骤：The present invention also provides a security counting query and integrity verification method based on encrypted genome data, comprising the following steps:

本发明的有益效果是：The beneficial effects of the present invention are:

(1)本发明提供了一个安全的系统查询框架。本发明使用Paillier密码系统对基因序列加密，将加密的数据存储在第三方云服务，并在云服务不知道密钥的情况下进行计数查询，保证了数据的隐私性。引用代理机构对返回的加密节点的查询结果解密验证，保证了数据的完整性。(1) The present invention provides a secure system query framework. The invention uses the Paillier cryptosystem to encrypt the gene sequence, stores the encrypted data in a third-party cloud service, and performs a counting query without the cloud service knowing the key, thereby ensuring the privacy of the data. The reference agency decrypts and verifies the query result of the returned encrypted node to ensure the integrity of the data.

(2)本发明提供了一种对原始数据处理的技术。本发明使用hamming code技术对基因数据添加检测位，通过对返回的查询基因序列进行校验，验证计数查询结果的可信性。(2) The present invention provides a technology for processing raw data. The invention uses the hammering code technology to add detection bits to the gene data, and verifies the reliability of the counting query result by checking the returned query gene sequence.

(3)本发明提出的模型为数据隐私、查询隐私和输出隐私提供安全保证。查询发起者通过代理机构和云服务器完成安全的交互查询。(3) The model proposed by the present invention provides security guarantees for data privacy, query privacy and output privacy. The query initiator completes the secure interactive query through the agency and the cloud server.

(4)本发明提供实验研究，以证明本发明提出的方法的可行性。对于5000条记录的40个SNP位点的查询，Kantar认证机构服务器oglu et al.和Canim et al.,分别用了30min和80s。本发明提出的方法执行相同的查询用了3.9s。时间显著缩短。(4) The present invention provides experimental studies to prove the feasibility of the method proposed by the present invention. For the query of 40 SNP loci of 5000 records, the Kantar certification authority server, oglu et al. and Canim et al., took 30 min and 80 s, respectively. The method proposed by the present invention takes 3.9s to execute the same query. Time is significantly shortened.

(5)本发明的方法允许外包协议对存储的加密数据和密文状态下的查询命令进行匹配完成计数查询。为了保护数据隐私，本发明使用同态加密方法加密基因数据。为了验证查询结果的完整性，本发明使用汉明码技术对基因组数据添加检测位，基于汉明码检测位的可纠错性可以对查询结果进行验证，在保证基因组数据的安全性和完整性方面拥有良好的性能。(5) The method of the present invention allows the outsourcing protocol to match the stored encrypted data and the query command in the cipher text state to complete the counting query. In order to protect data privacy, the present invention uses homomorphic encryption to encrypt genetic data. In order to verify the integrity of the query result, the present invention uses the Hamming code technology to add detection bits to the genomic data, and the query result can be verified based on the error correctability of the Hamming code detection bits, which has the advantages of ensuring the security and integrity of the genome data. good performance.

附图说明Description of drawings

图1为本发明一实施例装置结构图。FIG. 1 is a structural diagram of an apparatus according to an embodiment of the present invention.

图2为本发明一实施例试验结果示意图。FIG. 2 is a schematic diagram of the test results of an embodiment of the present invention.

图3为本发明一实施例试验结果示意图。FIG. 3 is a schematic diagram of a test result of an embodiment of the present invention.

图4为本发明一实施例试验结果示意图。FIG. 4 is a schematic diagram of the test results of an embodiment of the present invention.

具体实施方式Detailed ways

本发明提出了一种在一个不太诚实的云服务器上进行基因组数据安全共享和计算的方法。首先，本发明提出的方法对生物医学数据进行处理，确保共享数据的隐私与完整性。其次，本发明的提出的方法使用hashMap的方式存储数据，提高了加密效率，高效的实现安全计数查询。本发明使用现有的单核苷酸多态性(SNP)序列数据库评估本发明的方法，实验表明完成计数查询对于实际应用是更加灵活、易于实现且安全性高。The present invention proposes a method for securely sharing and computing genome data on a not-so-honest cloud server. First, the method proposed in the present invention processes the biomedical data to ensure the privacy and integrity of the shared data. Secondly, the proposed method of the present invention uses the hashMap method to store data, which improves the encryption efficiency and efficiently realizes the secure count query. The present invention uses the existing single nucleotide polymorphism (SNP) sequence database to evaluate the method of the present invention, and experiments show that completing the count query is more flexible, easy to implement and high security for practical applications.

1下面对本发明的设计原理和有益效果进行说明1 The design principles and beneficial effects of the present invention are described below

在本发明中，本发明的目标是设计一种基于外包基因组数据安全查询的计数框架，通过计数查询确定数据库中与查询条件相匹配的记录数量。本发明提出的框架使用安全的加密机制处理和存储数据，并通过第三方的代理机构终端对查询的每条存储记录进行安全评估，如图1所示是本发明的系统框架。本发明的具体贡献如下：In the present invention, the objective of the present invention is to design a counting framework based on the secure query of outsourced genomic data, and determine the number of records in the database that match the query conditions through the counting query. The framework proposed by the present invention uses a secure encryption mechanism to process and store data, and performs security assessment on each stored record queried through a third-party agency terminal. Figure 1 shows the system framework of the present invention. The specific contributions of the present invention are as follows:

(1)本发明提供了一个安全的系统查询框架。本发明使用Paillier密码系统对基因序列加密，将加密的数据存储在第三方云服务，并在云服务不知道密钥的情况下进行计数查询，保证了数据的隐私性。引用代理机构终端对返回的加密节点的查询结果解密验证，保证了数据的完整性。(1) The present invention provides a secure system query framework. The invention uses the Paillier cryptosystem to encrypt the gene sequence, stores the encrypted data in a third-party cloud service, and performs a counting query without the cloud service knowing the key, thereby ensuring the privacy of the data. The reference agency terminal decrypts and verifies the returned query result of the encrypted node to ensure the integrity of the data.

(2)本发明提供了一种对原始数据处理的技术。本发明使用hamming code技术对基因数据添加检测位，通过对返回的查询基因序列进行校验，验证计数查询结果的可信性(2) The present invention provides a technology for processing raw data. The invention uses the hammering code technology to add detection bits to the gene data, and verifies the reliability of the counting query result by checking the returned query gene sequence.

(3)本发明提出的模型为数据隐私、查询隐私和输出隐私提供安全保证。查询发起者通过代理机构终端和云服务器完成安全的交互查询。(3) The model proposed by the present invention provides security guarantees for data privacy, query privacy and output privacy. The query initiator completes the secure interactive query through the agency terminal and the cloud server.

(4)本发明提供实验研究，以证明本发明提出的方法的可行性。对于5000条记录的40个SNP位点的查询，Kantar认证机构服务器oglu et al.and Canim et al.,分别用了30min和80s。本发明提出的方法执行相同的查询用了3.9s。(4) The present invention provides experimental studies to prove the feasibility of the method proposed by the present invention. For the query of 40 SNP loci of 5000 records, the Kantar certification authority server, oglu et al. and Canim et al., took 30 min and 80 s, respectively. The method proposed by the present invention takes 3.9s to execute the same query.

2下面对本发明涉及的相关概念进行介绍2 The related concepts involved in the present invention are introduced as follows

人类基因组包含关于人类生物学的基本数据和私人诊断信息，查询个人基因组数据是非常敏感的。为了保护数据的隐私，Ayday等人提出了一种基于同态加密的隐私保护方法和系统，使用存储和处理单元以加密形式存储敏感数据。Bruekers等人提出了半诚实攻击者模型的解决方案，基于有限DNA同态加密，其复杂性在一定程度上取决于要容忍的错误数量。Eppstein等人使用了隐私增强型可逆布隆过滤器(PIBF)，提出了一种在压缩基因组数据上运行的隐私保护比较方法。The human genome contains basic data about human biology and private diagnostic information, and querying personal genomic data is very sensitive. To protect the privacy of data, Ayday et al. propose a privacy protection method and system based on homomorphic encryption, which uses storage and processing units to store sensitive data in encrypted form. Bruekers et al. propose a solution to the semi-honest attacker model, based on finite DNA homomorphic encryption, the complexity of which depends in part on the number of errors to tolerate. Using privacy-enhancing reversible bloom filters (PIBF), Eppstein et al. propose a privacy-preserving comparison method operating on compressed genomic data.

基因组数据的隐私保护协议也可能会泄露敏感数据，针对查询协议的隐私性保护，Atallah等人提出了第一个隐私保护序列比较算法。然而，由于其密集的计算要求不合理，Jha等人介绍了序列比较和使用安全的双方通信协议的距离计算的乱码电路。该解决方案的主要缺点是无法处理大规模计算。Troncoso-Pastoriza等在半诚实的环境中提供DNA匹配安全性隐私保护，通过对自动机不经意的评估方式，提出用于安全匹配的协议，但是协议处理时间稍长。Blanton和Aliasgari等人为了改进Troncoso-Pastoriza等人协议沟通时间问题，提出了一种安全的计算外包协议。Privacy-preserving protocols for genomic data may also leak sensitive data. For the privacy-preserving query protocols, Atallah et al. proposed the first privacy-preserving sequence comparison algorithm. However, due to its unreasonable intensive computational requirements, Jha et al. introduced a garbled circuit for sequence comparison and distance computation using a secure two-party communication protocol. The main disadvantage of this solution is that it cannot handle large-scale computations. Troncoso-Pastoriza et al. provide DNA matching security privacy protection in a semi-honest environment, and propose a protocol for secure matching by inadvertently evaluating the automaton, but the protocol processing time is slightly longer. In order to improve the communication time problem of the Troncoso-Pastoriza et al. protocol, Blanton and Aliasgari et al. proposed a secure computing outsourcing protocol.

针对计数查询的基因组数据安全外包问题，Kantar认证机构服务器oglu等人最早提出了涉及两个第三方的加密模型，但是该模型并不提供查询隐私。Perl等人提出了一种用Bloom滤波器和同态加密相结合来搜索生物医学数据的方法。他们将搜索的任务完全外包给第三方云服务器。Canim等人使用防篡改的加密硬件提供数据存储服务器(DS)，实现对单个第三方基因组数据的安全存储，共享和查询，该方法查询的大小仅限于防篡改硬件的内存大小。谢等人提出了一种安全的遗传关联研究的荟萃分析的方案，将数据保存在相应的数据所有者的场所中。Ignatenko和Petkovic提出了一种在私人DNA数据库中搜索和匹配DNA序列的解决方案，将DNA序列表示为上下文树用于保护隐私的匹配和相似性搜索的索引信息。上下文树是使用称为上下文树加权(CTW)的通用压缩技术构建的。For the security outsourcing of genomic data for counting query, Kantar certification authority server oglu et al. first proposed an encryption model involving two third parties, but this model does not provide query privacy. Perl et al. proposed a method for searching biomedical data using a combination of Bloom filters and homomorphic encryption. They completely outsource the task of searching to a third-party cloud server. Canim et al. used tamper-resistant cryptographic hardware to provide a data storage server (DS), enabling secure storage, sharing, and querying of single third-party genomic data, and the size of the query is limited to the memory size of the tamper-resistant hardware. Xie et al. propose a protocol for a safe meta-analysis of genetic association studies, keeping the data in the premises of the respective data owners. Ignatenko and Petkovic proposed a solution for searching and matching DNA sequences in private DNA databases, representing DNA sequences as context trees for indexing information for privacy-preserving match and similarity searches. Context trees are built using a common compression technique called Context Tree Weighting (CTW).

本发明提出了一种基因组数据安全计数查询的隐私与完整性保护的方法。本发明的方法允许外包协议对存储的加密数据和密文状态下的查询命令进行匹配完成计数查询。为了保护数据隐私，本发明使用同态加密方法加密基因数据。为了验证查询结果的完整性，本发明使用汉明码技术对基因组数据添加检测位，基于汉明码检测位的可纠错性可以对查询结果进行验证，在保证基因组数据的安全性和完整性方面拥有良好的性能。The invention proposes a privacy and integrity protection method for genome data security count query. The method of the invention allows the outsourcing protocol to match the stored encrypted data and the query command in the cipher text state to complete the counting query. In order to protect data privacy, the present invention uses homomorphic encryption to encrypt genetic data. In order to verify the integrity of the query result, the present invention uses the Hamming code technology to add detection bits to the genomic data, and the query result can be verified based on the error correctability of the Hamming code detection bits, which has the advantages of ensuring the security and integrity of the genome data. good performance.

3.下面对本发明装置模型进行说明3. The device model of the present invention will be described below

3.1系统设计3.1 System Design

在本部分中，介绍本发明提出的安全框架。如图1所示，框架中包含五个主要的参与者：数据拥有者，认证机构服务器，云服务器，代理机构终端，研究人员终端。每个实体负责执行不同的特定的任务，保证整个系统安全和功能良好。该框架中的信息流包括两个阶段：数据集成阶段和查询处理阶段。在数据集成阶段，认证机构服务器从数据拥有者接收患者记录并加密。在查询处理阶段，代理机构终端对科研人员查询条件加密并发送到云服务执行查询。In this section, the security framework proposed by the present invention is introduced. As shown in Figure 1, the framework contains five main players: data owner, certification authority server, cloud server, agency terminal, and researcher terminal. Each entity is responsible for performing different specific tasks to keep the entire system safe and functional. The information flow in this framework consists of two phases: the data integration phase and the query processing phase. During the data integration phase, the certificate authority server receives patient records from the data owner and encrypts them. In the query processing stage, the agency terminal encrypts the researcher's query conditions and sends them to the cloud service to execute the query.

数据拥有者是由拥有基因组数据的多个机构组成，这些机构可能是医院，学术研究机构或政府研究机构等。这些机构将患者的基因组数据按照规定的格式处理，并以明文的形式发送给认证机构服务器。Data owners are composed of multiple institutions that own genomic data, which may be hospitals, academic research institutions, or government research institutions, etc. These institutions process the patient's genomic data in a prescribed format and send it to the certification authority server in clear text.

认证机构服务器作为第三方机构，对框架的安全性至关重要。认证机构服务器主要的任务：As a third-party authority, the certification authority server is critical to the security of the framework. The main tasks of the certification authority server:

1)处理数据。认证机构服务器获得不同数据拥有者共享的数据后，构建聚合共享加密数据的可搜索的hashMap，并将其发送到云服务。而用户查询操作基本上是在加密索引树上执行的。索引树包含来自共享数据的所有记录，对于记录的添加和删除，认证机构服务器可以相应地更新树。1) Process the data. After the certification authority server obtains the data shared by different data owners, it constructs a searchable hashMap that aggregates the shared encrypted data, and sends it to the cloud service. The user query operation is basically performed on the encrypted index tree. The index tree contains all records from the shared data, and for additions and deletions of records, the certification authority server can update the tree accordingly.

2)生成密钥。存储在哈希表中的敏感数据都是需要密钥加密的，执行机构需要管理用于数据加密的公钥和解密的私钥。2) Generate a key. Sensitive data stored in the hash table needs to be encrypted with a key, and the executive agency needs to manage the public key for data encryption and the private key for decryption.

云服务从认证机构服务器获取加密数据，负责处理代理机构终端发送的加密查询请求。云服务器执行该查询并将加密结果发送给代理机构终端。The cloud service obtains encrypted data from the certification authority server, and is responsible for processing encrypted query requests sent by the agency terminal. The cloud server executes the query and sends the encrypted result to the agency terminal.

代理机构终端处理与科研人员的所有通信。代理机构终端接收科研人员的查询请求后，将查询数据加密，并将加密后的查询请求发送给云服务器，得到查询结果后解密后发送给研究人员终端。Agency terminals handle all communications with researchers. After receiving the query request from the researcher, the agency terminal encrypts the query data, sends the encrypted query request to the cloud server, obtains the query result, decrypts it, and sends it to the researcher's terminal.

科研人员代表任何有兴趣对存储在云服务器中的共享数据执行查询的个人或组织。科研人员将查询请求发送给代理机构终端加密，代理机构终端收到加密查询结果之后，使用私钥解密这个结果并发送给研究人员终端，研究人员终端得到最终的输出结果。Scientists represent any individual or organization interested in performing queries on shared data stored in cloud servers. The researcher sends the query request to the agency terminal for encryption. After the agency terminal receives the encrypted query result, it decrypts the result with the private key and sends it to the researcher terminal, and the researcher terminal obtains the final output result.

3.2攻击模型与安全目标3.2 Attack Model and Security Target

本发明的目标是云服务器不知道共享基因组数据的任何信息，认证机构服务器和云服务器都不了解研究人员终端进行的查询。本发明假设认证机构服务器是可信的实体。认证机构服务器执行验证作用与NIH的数据访问委员会(DAC)相同，它负责哈希表的生成和加密，并验证申请访问数据的个体和组织的身份。在本发明的结构框架中，本发明假设云服务器是半诚实的，它本身正确的遵守协议，并不打算恶意产生错误的结果。云服务器与认证机构服务器不同，它存储了大量敏感数据，也是执行查询处理的场所，当云服务器被俘获时，攻击者可以获取大量数据，伪造查询结果或向用户提供不完整的查询结果等。一旦云服务器被攻击，其危害性将大大超过俘获认证机构服务器被俘获所造成的风险。因此，本发明重点关注云服务器被攻击的情况。The goal of the present invention is that the cloud server does not know any information about the shared genome data, and neither the certification authority server nor the cloud server know about the query made by the researcher's terminal. The present invention assumes that the certificate authority server is a trusted entity. The certification authority server performs the same verification role as the NIH's Data Access Committee (DAC), which is responsible for the generation and encryption of hash tables and for verifying the identities of individuals and organizations requesting access to data. In the structural framework of the present invention, the present invention assumes that the cloud server is semi-honest, it itself correctly complies with the protocol, and does not intend to maliciously produce erroneous results. Unlike certification authority servers, cloud servers store a large amount of sensitive data and are also places to perform query processing. When a cloud server is captured, attackers can obtain large amounts of data, forge query results, or provide users with incomplete query results. Once the cloud server is attacked, the harm will greatly exceed the risk caused by the capture of the certificate authority server. Therefore, the present invention focuses on the situation where the cloud server is attacked.

本发明的安全目标主要有以下两点：The security objectives of the present invention mainly include the following two points:

(1)数据隐私和查询隐私：数据隐私意味着云服务器无法知道data owner存储的明文数据，这确保了攻击者无法理解存储在云服务器的数据；查询隐私意味着云服务器无法知道研究人员终端查询请求的实际值，这确保了攻击者无法理解或推断云服务器收到的查询请求。(1) Data privacy and query privacy: Data privacy means that the cloud server cannot know the plaintext data stored by the data owner, which ensures that the attacker cannot understand the data stored in the cloud server; query privacy means that the cloud server cannot know the researcher’s terminal query The actual value of the request, which ensures that the query request received by the cloud server cannot be understood or deduced by an attacker.

(2)数据完整性：该目标旨在验证返回给研究人员终端的查询结果的正确性。如果云服务器返回给Agency的查询结果被攻击者删除或伪造，那么Agency可以将其检测出来，并基于本发明提出的协议对查询结果分析纠正。(2) Data integrity: This goal is to verify the correctness of the query results returned to the researcher's terminal. If the query result returned by the cloud server to the Agency is deleted or forged by the attacker, the Agency can detect it and analyze and correct the query result based on the protocol proposed by the present invention.

4下面对本发明采用的数据处理方法进行说明4 The data processing method adopted by the present invention will be described below

人类基因组是生物体的完整遗传信息集，遗传信息位于染色体内部，每条染色体都含有基因，这些基因负责控制人体的各种功能。基因组序列数据由核苷酸的四种碱基{A,C,G,T}构成，碱基在每条DNA链上排列的差异导致个体之间的独特性。大多数DNA序列在整个人群中是保守的，由于遗传变异，每个人的DNA中有大约0.5％与参考基因组不同。基因组中存在几种类型的变异方式，例如单核苷酸多态性(SNP)，拷贝数变异(CNV)和重排等。单核苷酸多态性是指在等位基因上由单个核苷酸的变异所引起的DNA序列多态性，也是最常见的DNA变异形式。大多数SNP对人体健康没有任何影响。但是一些SNP直接导致了人体内某种疾病的发生。针对SNP序列的分析在基因组数据研究中是很常见的。在dbGap数据库中包含患者的基因组序列和对应的临床信息，基因组序列是与SNP相关的位点信息和诊断。本发明假设一个序列S由多个SNP组成，表示为S＝{α₁,α₂,…,α_γ}，其中α_i表示第i位的SNP，它是由一对核苷酸组成。The human genome is the complete set of genetic information of an organism. The genetic information is located inside the chromosomes, and each chromosome contains genes that are responsible for controlling various functions of the human body. Genome sequence data consists of four bases of nucleotides {A, C, G, T}, and differences in the arrangement of bases on each DNA strand lead to uniqueness between individuals. Most DNA sequences are conserved across the population, and about 0.5% of each individual's DNA differs from the reference genome due to genetic variation. Several types of variation patterns exist in the genome, such as single nucleotide polymorphisms (SNPs), copy number variations (CNVs), and rearrangements. Single nucleotide polymorphism refers to the DNA sequence polymorphism caused by the variation of a single nucleotide on the allele, and it is also the most common form of DNA variation. Most SNPs do not have any effect on human health. But some SNPs directly lead to the occurrence of certain diseases in the human body. Analysis of SNP sequences is common in genomic data studies. The dbGap database contains the patient's genome sequence and corresponding clinical information, and the genome sequence is the site information and diagnosis related to the SNP. The present invention assumes that a sequence S consists of multiple SNPs, expressed as S={α ₁ ,α ₂ ,...,α _γ }, where α _i represents the SNP at the i-th position, which is composed of a pair of nucleotides.

Table 1 Data sequence of SNPsTable 1 Data sequence of SNPs

4.1安全计数查询4.1 Security count query

对于基因组数据的研究，最常见的执行任务就是确定数据库中有多少样本满足某些特征。计数查询是遗传关联研究的一个重要步骤，它有助于研究人员终端确定影响人体特定疾病的基因。例如，研究人员终端查询个体中DSP1基因的各种SNP与diagnosis阿尔茨海默病之间是否存在关联。类似的SNP-疾病关联研究在人类基因组学研究中越来越普遍。对于生物学研究人员终端来说，他需要得到特定基因的一组SNP数据集(例如，SNP₁＝AG∩SNP₂＝CT∩Diagnosis＝Alzheimer's Disease)在数据库中出现的频率值，然后将该值标准化为记录总数。例如，给定一个数据库D和一个查询Q，其中D＝{S₁,S₂,…,S_n}表示n位患者的SNP序列。计数查询可以定义为查找D中满足Q的多个查询条件q的患者人数。假设某位患者的m个SNP位点的基因序列表示为S＝{d₁,d₂,…,d_m}，其中d_i(1≤i≤m)表示该患者的第i个位点的SNP值，则计数查询的总数可以表示为：

For the study of genomic data, the most commonly performed task is to determine how many samples in the database satisfy certain characteristics. Count queries are an important step in genetic association studies, helping researchers to pinpoint genes that affect specific diseases in humans. For example, the researchers terminally queried whether there was an association between various SNPs in the DSP1 gene in an individual and diagnosis of Alzheimer's disease. Similar SNP-disease association studies are increasingly common in human genomics studies. For the biological researcher terminal, he needs to get the frequency value of a set of SNP data sets (eg, SNP ₁ =AG∩SNP ₂ =CT∩Diagnosis=Alzheimer's Disease) for a specific gene in the database, and then the value Normalized to the total number of records. For example, given a database D and a query Q, where D={S ₁ , S ₂ , . . . , Sn } represents the SNP sequences of _n patients. A count query can be defined as finding the number of patients in D that satisfy multiple query conditions q of Q. Suppose the gene sequence of m SNP loci of a patient is represented as S={d ₁ ,d ₂ ,...,d _m }, where d _i (1≤i≤m) represents the ith locus of the patient SNP value, then the total number of count queries can be expressed as:

如果数据以明文的形式存储，则计数查询是一项简单的操作，传统的数据库管理系统支持计数查询的操作。但是数据库管理系统并不能用于对加密的数据的查询。那么，数据库如何在不解密数据的情况下响应对加密值的计数查询，本发明在第5节提出了一种安全的数据处理方法hashMap，用于执行查询用户指定条件下的记录数，而无需解密存储在云服务器中的数据。Counting queries are a simple operation if the data is stored in plaintext, and traditional database management systems support counting queries. But database management systems cannot be used to query encrypted data. Then, how does the database respond to the count query of the encrypted value without decrypting the data? The present invention proposes a safe data processing method hashMap in Section 5, which is used to query the number of records under the conditions specified by the user without requiring Decrypt data stored in cloud servers.

4.2数据加密4.2 Data encryption

为了保护患者隐私，认证机构服务器需要对数据库中的数据进行加密以防止敏感数据的泄露。在本部分中，本发明将提供针对本发明框架有关的加密方案。In order to protect patient privacy, the certification authority server needs to encrypt the data in the database to prevent the leakage of sensitive data. In this section, the present invention will provide encryption schemes related to the framework of the present invention.

4.2.1汉明码4.2.1 Hamming code

汉明码(Hamming Code)是一种线性调试码，它在传输的消息流中插入验证码，对数据存储和传输过程中可能产生的数据位错误进行侦测并更正。汉明码利用了奇偶校验位的概念，在数据位后面增加特定位的比特以验证数据的有效性。本发明使用汉明码技术对基因数据进行处理，不仅可以验证数据是否有效，还能在数据出错的情况下指明错误信息并纠正。Hamming code (Hamming Code) is a linear debugging code, which inserts verification code into the transmitted message stream to detect and correct data bit errors that may occur during data storage and transmission. Hamming code utilizes the concept of parity bits, adding specific bits after the data bits to verify the validity of the data. The invention uses the Hamming code technology to process the genetic data, which can not only verify whether the data is valid, but also indicate the wrong information and correct it when the data is wrong.

汉明码的实现原理是在原来的数据中插入k位二进制数据作为校验位，把原来的n位数据变为z位编码。编码时需要满足不等式2^k-1≥z，其中z＝n+k。在汉明码规定所得到的z位编码的2^k-1(k≥0,2^k-1＜z)位上插入特殊校验码，其余位用原码按照顺序放置。汉明码的详细编码规则如下：The implementation principle of Hamming code is to insert k-bit binary data into the original data as a check digit, and change the original n-bit data into z-bit encoding. During encoding, the inequality 2 ^k -1 ≥ z needs to be satisfied, where z=n+k. A special check code is inserted into the 2 ^k-1 (k≥0, 2 k ^-1 <z) bits of the z-bit code obtained by the Hamming code, and the remaining bits are placed in order with the original code. The detailed encoding rules of Hamming code are as follows:

1)在新的编码位(即校验位)的2^k-1位上填入0，然后把新的编码的其余位用源码按顺序填入。1) Fill in 0 in the 2 ^k-1 bits of the new coding bit (ie, check digit), and then fill in the remaining bits of the new coding with the source code in sequence.

2)校验位的编码方式为：第k位校验码从则从新的编码的第2^k-1位开始，每计算2^k-1位的异或，跳2^k-1位，再计算下一组2^k-1位的异或，填入2^k-1位。2) The encoding method of the check digit is: the k-th check digit starts from the 2 ^k-1 bit of the new code, and every time the XOR of 2 k-1 bits is calculated, 2 ^k-1 ^bits are skipped, and then calculated The XOR of the next set of ^2k-1 bits, fill in the ^2k-1 bits.

比如,对于核苷酸中的腺嘌呤C，(C)2＝1000011,n＝7,由公式2^k-1≥z，z＝n+k可得k取得最小值为4。则添加4位检测码：For example, for adenine C in nucleotides, (C) 2 = 1000011, n = 7, from the formula 2 ^k -1 ≥ z, z = n + k, the minimum value of k can be obtained to be 4. Then add a 4-digit detection code:

首先在2^k-1位填入0，表示为00100000011。第1位校验码位于新的编码的第1位(2¹ ^-1)(汉明码从1位开始)，计算1,3,5,7,9,11位的异或为

将0填入新的编码的第1位，表示为00100000011。First fill in 0 in 2 ^k-1 bits, which is represented as 00100000011. The 1st digit check code is located in the 1st digit (2 ¹ ^-1 ) of the new code (Hamming code starts from 1 digit), and the XOR of 1, 3, 5, 7, 9, and 11 digits is calculated as

Fill 0 into the first bit of the new code, which is represented as 00100000011.

第2位校验码位于新的编码的第2位(2^2-1)，计算2,3,6,7,10,11位的异或为填入新的编码的第2位，表示为011000000011。第3位校验码位于新的编码的第4位(2^3-1)，计算4,5,6,7位的异或为填入新的编码的第4位表示为011000000011。第4位校验码位于新的编码的第8位(2^4-1)，计算8-11位的异或为

填入新的编码的第8位表示为011000000011。最终C的二进制汉明码表示为表示为011000000011。The 2nd bit check code is located in the 2nd bit (2 ^2-1 ) of the new code, and the XOR of bits 2, 3, 6, 7, 10, and 11 is calculated as Fill in the second digit of the new code, which is represented as 011000000011. The 3rd bit check code is located in the 4th bit (2 ^3-1 ) of the new code, and the XOR of bits 4, 5, 6, and 7 is calculated as Fill in the 4th bit of the new code and represent it as 011000000011. The 4th bit check code is located in the 8th bit (2 ^4-1 ) of the new code, and the XOR of the 8-11 bits is calculated as

Fill in the 8th bit of the new code and represent it as 011000000011. The final binary Hamming code representation of C is represented as 011000000011.

本发明使用汉明码处理原数据并利用其纠错功能来检测计数查询结果。对于四种核苷酸A，C，G，T，分别计算出其对应的AS认证机构服务器I码二进制校验位{h_A,h_C,h_G,h_T}，其中h_A＝0001,h_C＝0100,h_G＝1101,h_T＝0011。假设一对核苷酸d_i添加检测位t后表示为B_i，其中t是集合{h_A,h_C,h_G,h_T}中任意两个的元素组合，共2⁴种组合方式。例如，d_i＝CC的二进制表示为1000011 1000011，在每个碱基的第1、2、4和8位分别添加检测位0100,0100后，表示为01100000011 01100000011。The invention uses the Hamming code to process the original data and utilizes its error correction function to detect the counting query result. For the four kinds of nucleotides A, C, G, T, calculate the corresponding AS certification authority server I code binary check digit {h _A , h _C , h _G , h _T }, where h _A =0001, h _C =0100, h _G =1101, h _T =0011. Suppose a pair of nucleotides d _i is represented as B _i after adding the detection bit t, where t is the combination of any two elements in the set {h _A , h _C , h _G , h _T }, there are ²⁴ combinations in total. For example, the binary representation of d _i = CC is 1000011 1000011, and after adding detection bits 0100, 0100 to the 1st, 2nd, 4th and 8th bits of each base, it is represented as 01100000011 01100000011.

4.2.2 Paillier cryptosystem4.2.2 Paillier cryptosystem

为了实现架构的简单和灵活性，本发明使用Paillier cryptosystem加密算法。该加密算法满足语义安全性，保证具有有限计算能力和拥有密文的对手无法获取明文信息。Paillier cryptosystem是一种用于公钥加密的概率性非对称算法，当相同的消息被多次加密时，该密码系统产生不同的密文。在Paillier加密算法中会生成一对密钥，一个是私钥sk，另一个是公钥pk。公钥和私钥分别用于数据的加密和解密。Paillier密码系统可以定义如下。(参考维基翻译)In order to achieve the simplicity and flexibility of the architecture, the present invention uses the Paillier cryptosystem encryption algorithm. The encryption algorithm satisfies semantic security, ensuring that an adversary with limited computing power and ciphertext cannot obtain plaintext information. Paillier cryptosystem is a probabilistic asymmetric algorithm for public key encryption that produces different ciphertexts when the same message is encrypted multiple times. In the Paillier encryption algorithm, a pair of keys is generated, one is the private key sk, and the other is the public key pk. The public and private keys are used for data encryption and decryption, respectively. The Paillier cryptosystem can be defined as follows. (refer to wiki translation)

密钥生成阶段：随机选择两个大素数p和q，使得它们彼此独立gcd(pq,(p-1)(q-1))＝1。计算η＝pq和λ＝lcm(p-1,q-1)，选择随机整数g使得

通过对g进行模块乘法逆运算来确定n的划分顺序：μ＝(L(g^λmodη²))^-1modη，其中L被定义

由此可得公钥(η,g)，私钥(λ，μ)。Key generation phase: randomly select two large prime numbers p and q so that they are independent of each other gcd(pq,(p-1)(q-1))=1. Calculate η=pq and λ=lcm(p-1,q-1), choose a random integer g such that

The division order of n is determined by performing a modular multiplication inverse operation on g: μ=(L(g ^λ mod η ² )) ⁻¹ mod η, where L is defined

From this, the public key (η, g) and the private key (λ, μ) can be obtained.

加密阶段：对于需要加密的消息ω(0≤m＜η)，选择随机数γ(0＜γ＜η)计算密文c：c＝g^ω·γ^ηmodη²。Encryption stage: For the message ω (0≤m<η) that needs to be encrypted, select a random number γ (0<γ<η) to calculate the ciphertext c: c=g ^ω ·γ ^η modη ² .

解密阶段：对于需要解密的密文c满足

解密后的明文表示为：ω＝L(c^λmodη²)·μmodη。Decryption stage: For the ciphertext c that needs to be decrypted,

The decrypted plaintext is expressed as: ω=L(c ^λ modη ² )·μmodη.

同态属性：假设有两个数据ω₁和ω₂，则其密文的乘积解密后为它们相应的明文之和：E_sk(E_pk(ω₁)·E_pk(ω₂)modη²)＝ω₁+ω₂modη。Homomorphic property: Assuming there are two data ω ₁ and ω ₂ , the product of their ciphertexts is decrypted as the sum of their corresponding plaintexts: E _sk (E _pk (ω ₁ )·E _pk (ω ₂ )modη ² ) =ω ₁ +ω ₂ modη.

增加一个明文为另一个明文的指数值，解密后则为两个明文的乘积：Adding a plaintext is the exponent value of another plaintext, and after decryption, it is the product of the two plaintexts:

假设数据库中有n位患者，Data＝(S¹,S²,…,Sⁿ)表示数据库中的基因序列数据集，第j位患者的SNP序列的基因组数据可以表示为

其中1≤j≤n。加密过程中每个基因组数据

通过汉明码添加检测位表示为

对

加密可表示为

其中“|”表示将检测位添加到碱基对

中。每个数据加密的密文表示为：Assuming that there are n patients in the database, Data=(S ¹ , S ² ,...,S ⁿ ) represents the gene sequence data set in the database, and the genomic data of the SNP sequence of the jth patient can be expressed as

where 1≤j≤n. Data per genome during encryption

Adding detection bits via Hamming code is expressed as

right

Encryption can be expressed as

Where "|" means that the bit will be detected Add to base pair

middle. The encrypted ciphertext of each data is represented as:

5下面对本发明系统设计进行说明5 The system design of the present invention will be described below

在本部分中，本发明将介绍本发明使用的数据处理模型。首先，认证机构服务器对数据库中患者基因序列和临床信息进行处理。患者的临床信息主要包括一些表型的疾病诊断。假设疾病类型有I＝{I₁,I₂,…,I_g}种,其中(1≤g≤n)，集合中的某一个I值代表一种疾病。每一位患者可以患有一种或多种疾病。之后，针对患者案例的基因型和表型数据讨论如何将其整合到hashMap。如table1所示是5位患者的基因型序列和表型类型。In this section, the present invention will introduce the data processing model used by the present invention. First, the certification body server processes the patient's genetic sequence and clinical information in the database. The patient's clinical information mainly includes some phenotypic disease diagnosis. It is assumed that there are I= _{ I ₁ , I ₂ , . Each patient can have one or more diseases. Afterwards, the genotype and phenotype data of patient cases are discussed how to integrate it into the hashMap. As shown in table1 are the genotype sequences and phenotype types of 5 patients.

5.1生成hashMap5.1 Generate hashMap

认证机构服务器收到数据拥有者提供的数据后，对于每位患者的SNP序列创建hashMap的一个Entity,所有的数据处理后生成一个映射表M,M中包含每位患者的基因型和表型信息。创建M之后，对于每一条来自数据拥有者的新的记录，认证机构服务器需要创建并更新M的Entity信息。每个Entity包含以下内容：After the certification body server receives the data provided by the data owner, it creates an Entity of a hashMap for the SNP sequence of each patient, and after all data processing, a mapping table M is generated, which contains the genotype and phenotype information of each patient. . After M is created, for each new record from the data owner, the certification authority server needs to create and update the Entity information of M. Each Entity contains the following:

key:Entity中key值指向具体的数据，索引值。key: The key value in the Entity points to the specific data, the index value.

geno:Entity中的value值的一部分。它是每位患者的SNP序列。例如一个患者的SNP序列表示为S＝{d₁,d₂,…,d_m}，sid是SNP位点唯一的标识，代表基因序列中特定的位置。其中碱基对d_i的位点标识sid＝i,(1≤i≤n)。可以表示为 geno: part of the value in Entity. It is the SNP sequence for each patient. For example, the SNP sequence of a patient is represented as S={d ₁ ,d ₂ ,...,d _m }, and sid is the unique identifier of the SNP site, representing a specific position in the gene sequence. The site identification of base pair d _i is sid=i, (1≤i≤n). It can be expressed as

I:基因组序列对应的表型数据集。I: The phenotype dataset corresponding to the genome sequence.

count:在数据库中，当前SNP序列出现的频率,表示与基因型和表型相匹配的记录的总数。count: In the database, the frequency of the current SNP sequence, indicating the total number of records matching the genotype and phenotype.

next:下一个Entity的key值next: the key value of the next Entity

Table 2 Hash Map for Table1Table 2 Hash Map for Table1

创建哈希表的复杂度为O(mn)，其中m是数据库中的记录数量，n是在每一条序列记录中不同的SNP位点的数目。本发明定义每个Entity为θ，其组成部分表示为θ(key,geno,I,count,next)。根据哈希表的特征和table1的数据本发明可以得到table2。The complexity of creating the hash table is O(mn), where m is the number of records in the database and n is the number of distinct SNP sites in each sequence record. The present invention defines each Entity as θ, and its components are expressed as θ(key, geno, I, count, next). According to the characteristics of the hash table and the data of table1, the present invention can obtain table2.

5.2加密hashMap5.2 Encrypted hashMap

5.2.1布隆过滤器的加密5.2.1 Encryption of Bloom Filters

M中的每个Entity都包含Bloom filter值I，表示与该基因组序列对应的表型信息。认证机构服务器使用Bloom filter技术对表型数据进行处理。基因组序列中的每条SNPs记录,将在Bloom filter中插入相应的表型。如果多个表型与相同的基因组序列相关联，本发明将它们全部插入该基因组序列entity的Bloom滤波器中。Each Entity in M contains a Bloom filter value of I, representing the phenotypic information corresponding to that genomic sequence. The certification authority server uses Bloom filter technology to process the phenotypic data. For each SNPs record in the genome sequence, the corresponding phenotype will be inserted in the Bloom filter. If multiple phenotypes are associated with the same genomic sequence, the present invention inserts all of them into the Bloom filter for that genomic sequence entity.

认证机构服务器设置Bloom滤波器中使用的哈希函数H和公用字母表Σ的域。Σ的域是所有可能表型的集合。对于基因组序列中的每个entity，将在Bloom滤波器中插入相应的表型。Σ中的每个表型都被映射到一个唯一的数字，并且该数字将被插入到Bloom滤波器中。The certificate authority server sets the hash function H used in the Bloom filter and the domain of the common alphabet Σ. The domain of Σ is the set of all possible phenotypes. For each entity in the genome sequence, the corresponding phenotype will be inserted in the Bloom filter. Each phenotype in Σ is mapped to a unique number, and this number will be interpolated into the Bloom filter.

M中每个Entity上的Bloom filter的长度都是相同的，为了加密Bloom filter，本发明在密钥大小为128位的CTR模式下使用了AES。认证机构服务器使用伪随机函数(PRF)F选择出随机密钥k,将每个Bloom filter加密为The length of the Bloom filter on each Entity in M is the same. In order to encrypt the Bloom filter, the present invention uses AES in the CTR mode with a key size of 128 bits. The certification authority server uses a pseudo-random function (PRF) F to select a random key k, and encrypts each Bloom filter as

该加密过程是在entity的其他数据在被加密的同时完成的。认证机构服务器除了需要将私钥sk发送给代理机构终端，还需要提供PRF的密钥k和用于Bloom filter的散列函数F_k。为了在搜索期间匹配来自查询的表型，本发明需要将查询请求的表型值I_q散列的位置与每个Entity的I_i散列位置进行匹配。如果I_q中所有被置为1的位置在I_i中也被置为1，则意味着来自查询的表型与存储在树节点中的表型相匹配。This encryption process is done while the entity's other data is being encrypted. In addition to sending the private key sk to the agency terminal, the certification authority server also needs to provide the key k of the PRF and the hash function F _k used for the Bloom filter. In order to match phenotypes from a query during a search, the present invention requires matching the location of the _Iq hash of the phenotype value Iq requested by the query with the location of the _Ii hash of each Entity. If all positions set to 1 in _Iq are also set to 1 in _Ii , it means that the phenotype from the query matches the phenotype stored in the tree node.

由于在树的每个节点处表示的布卢姆过滤器都被加密，为了检查这些布卢姆过滤器中的任何一个的位置，本发明首先需要解密它。认证机构服务器向代理机构终端提供PRF密钥k。对加密的Bloom filterI′_i使用电路内的SFE进行解密Since the bloom filters represented at each node of the tree are encrypted, in order to check the position of any of these bloom filters, the present invention first needs to decrypt it. The certificate authority server provides the PRF key k to the agency terminal. Decrypt encrypted Bloom filter _I'i using in-circuit SFE

5.2.2 Paillier Cryptosystem加密5.2.2 Paillier Cryptosystem encryption

认证机构服务器使用Paillier Cryptosystem加密M中的所有Entity。认证机构服务器使用密钥生成算法为同态加密方案生成一个密钥对(pk,sk)，其中pk是公钥，sk是私钥；使用公钥pk对哈希表中的每个entity加密。为了在保持系统安全性的同时使整个搜索过程足够快，本发明只加密每个entity的敏感属性。由于云服务器在本发明的系统中是半诚实的，它必须能够保证查询者知道entity的next键值，因此本发明不需要加密next信息。对entity的加密可以表示为E_pk(θ)，加密之后，每个entity likeθ(key,E_pk(geno),E_pk(count),I′_i,next)。本发明定义M中所有的entity均被加密后的hashMap表示为

最后，认证机构服务器将

发送给云服务器。同时，认证机构服务器将密钥对(pk,sk)发送给代理机构终端。The certificate authority server encrypts all entities in M using the Paillier Cryptosystem. The certificate authority server uses the key generation algorithm to generate a key pair (pk,sk) for the homomorphic encryption scheme, where pk is the public key and sk is the private key; each entity in the hash table is encrypted using the public key pk. In order to make the whole search process fast enough while maintaining the security of the system, the present invention only encrypts the sensitive attributes of each entity. Since the cloud server is semi-honest in the system of the present invention, it must be able to ensure that the queryer knows the next key value of the entity, so the present invention does not need to encrypt the next information. The encryption of the entity can be expressed as E _pk (θ). After encryption, each entity is like θ (key, E _pk (geno), E _pk (count), I′ _i , next). All entities in the definition M of the present invention are represented by the encrypted hashMap as

Finally, the certificate authority server will

sent to the cloud server. At the same time, the certificate authority server sends the key pair (pk, sk) to the agency terminal.

5.3查询加密hashMap5.3 Query encrypted hashMap

云服务器接收到认证机构服务器发送的

执行代理机构终端发送的加密查询请求。查询的主要思想是查询满足查询表型为I_q，并且满足匹配存储在geno中的特定位点的基因序列。云服务器需要知道研究人员终端查询请求中的sid,将存储在Entity中的geno的特定位点的SNP与研究人员终端查询条件中位点的SNP相匹配。由于SNP的值都被加密，并且本发明使用的加密方案是概率性的，云服务器无法确定这些值是否匹配。云服务器可以将加密的valn值发送给代理机构终端，由代理机构终端解密加密的SNP并检查相等性。The cloud server receives the data sent by the certification authority server

Execute the encrypted query request sent by the agency terminal. The main idea of the query is to query the gene sequence that satisfies the query phenotype I _q , and that matches the specific locus stored in the geno. The cloud server needs to know the sid in the researcher's terminal query request, and matches the SNP of the specific site of the geno stored in the Entity with the SNP of the site in the researcher's terminal query condition. Since the values of the SNPs are all encrypted, and the encryption scheme used in the present invention is probabilistic, the cloud server cannot determine whether the values match. The cloud server can send the encrypted valn value to the agency terminal, which decrypts the encrypted SNP and checks for equality.

假设研究人员终端发送查询请求Q_q＝(q₁∩q₂∩…∩q_num∩I_q)，其中有num个查询条件。q_l(1≤l≤num)表示某位点SNP的具体基因值。例如,查询Q_q中的条件为SNP₂＝CC,SNP₃＝TT,SNP₅＝AG,I_q＝I₁，q₁表示条件SNP₂＝CC。代理机构终端对每个查询条件q的碱基对添加检测码B_q＝(T_q|q)。用公有密钥pk对查询条件加密,对于每个位点的查询条件可以表示θ_q(sid,E_pk(B_q))。查询请求加密后连接为：E_pk(θ_q)＝E_pk(θ₁)∩E_pk(θ₂)∩…∩E_pk(θ_num)∩I_q′。Suppose that the researcher terminal sends a query request Q _q =(q ₁ ∩q ₂ ∩...∩q _num ∩I _q ), in which there are num query conditions. q _l (1≤l≤num) represents the specific gene value of the SNP at a certain locus. For example, the conditions in the query Q _q are SNP ₂ =CC, SNP ₃ =TT, SNP ₅ =AG, I _q =I ₁ , and q ₁ represents the condition SNP ₂ =CC. The agency terminal adds the detection code B _q = (T _q |q) to the base pair of each query condition q. The query condition is encrypted with the public key pk, and the query condition for each site can be expressed as θ _q (sid, E _pk (B _q )). The encrypted query request is connected as: E _pk (θ _q )=E _pk (θ ₁ )∩E _pk (θ ₂ )∩…∩E _pk (θ _num )∩I _q ′.

例如，对table2中M进行上述查询Q_q操作，遍历每个entity中geno的位点sid＝2，sid＝3，sid＝5的SNP值，将这些值发送个代理机构终端进行解密验证，可以得出数据库中满足查询条件的案例。如图table2所示，只有case#1满足查询条件，返回个研究人员终端的查询计数结果count＝1。For example, perform the above query Q _q operation on M in table2, traverse the SNP values of sid=2, sid=3, and sid=5 of the geno site in each entity, and send these values to an agency terminal for decryption and verification, you can Find the cases in the database that satisfy the query conditions. As shown in Figure 2, only case#1 satisfies the query condition, and the query count result of each researcher terminal is returned as count=1.

5.4数据完整性分析5.4 Data Integrity Analysis

云服务器将查询结果R_q发送到代理机构终端，代理机构终端用密钥sk解密SNP值。为了确保代理机构终端和云服务器之间交互信息的隐私性，本发明采用汉明码检测位来校验数据的完整性。对于返回的满足查询条件的查询结果：The cloud server sends the query result R _q to the agency terminal, and the agency terminal decrypts the SNP value with the key sk. In order to ensure the privacy of the information exchanged between the agency terminal and the cloud server, the present invention adopts the Hamming code detection bit to verify the integrity of the data. For the returned query results that meet the query conditions:

代理机构终端用私钥sk解密每一个SNP值，解密后的

本发明将

重新计算汉明码检测位为T′。本发明通过对比T′与T_q的相关性来判定数据是否完整。The agency terminal decrypts each SNP value with the private key sk, and the decrypted

The present invention will

The Hamming code detection bit is recalculated as T'. The present invention determines whether the data is complete by comparing the correlation between T' and _Tq .

1)T′＝T_q，该种情况下，说明数据完整，数据是安全可信的；1) T′=T _q , in this case, it means that the data is complete and the data is safe and reliable;

2)T′≠T_q,且T′∈{h_A,h_C,h_G,h_T}，表示查询条件中的geno值已经被篡改，但仍然是A,C,G,T四种核苷酸的组合。可以通过检测位T_q对返回的

进行纠正并可更新hashMap。返回的计数值count已经不可信。2) T′≠T _q , and T′∈{h _A ,h _C ,h _G ,h _T }, indicating that the geno value in the query condition has been tampered with, but it is still four cores of A, C, G, and T. A combination of glucosinolates. can be returned by detecting the bit T _q pair

Make corrections and update hashMap. The returned count value count is no longer reliable.

3)T′≠T_q,且

表示原数据

被篡改，并且被篡改成其它内容，此时对于患者的SNP序列已经不可信，返回的查询结果计数值不可信。3) T′≠T _q , and

represent the original data

关于安全性分析，本发明只假设数据库中的geno序列向参与者透露，那么将导致很严重的安全问题。本发明考虑参与者在系统的不同阶段推断信息的能力，进行评估本发明提出的隐私协议和系统的安全性。下面给出了本发明提出的模型中不同参与者的泄漏概况。Regarding the security analysis, the present invention only assumes that the geno sequences in the database are disclosed to the participants, which will lead to serious security problems. The present invention evaluates the security of the privacy protocol and system proposed by the present invention considering the ability of participants to infer information at different stages of the system. The leakage profiles of the different actors in the model proposed by the present invention are given below.

HashMap的构建和加密阶段数据泄露：认证机构服务器作为可信实体，负责M的生成和加密，在这个阶段认证机构服务器不存在数据的泄露。Data leakage in the construction and encryption stage of HashMap: The certification authority server, as a trusted entity, is responsible for the generation and encryption of M, and there is no data leakage on the certification authority server at this stage.

每个查询中对云服务器的泄露：在查询执行的过程中，云服务器完全不涉及明文信息，因此，在查询阶段数据信息不会透露给云服务器。Leakage of the cloud server in each query: During the query execution process, the cloud server does not involve plaintext information at all. Therefore, the data information will not be disclosed to the cloud server during the query stage.

每个查询向代理机构终端和研究人员终端的泄露：代理机构终端主要贡献是对研究人员终端的查询请求用密钥加密发送到云服务器,接收云服务器返回的查询结果，并对其解密返回给研究人员终端。对于返回给代理机构终端的SNP查询结果，本发明使用汉明码技术进行验证，研究人员终端是不能直接从汉明码检测位推断出任何信息的。在这里，本发明不考虑输出中的任何隐私泄露。因此，本系统是安全可靠的。Leakage of each query to the agency terminal and the researcher terminal: The main contribution of the agency terminal is to encrypt the query request from the researcher terminal and send it to the cloud server, receive the query result returned by the cloud server, and decrypt it and return it to the cloud server. Researcher terminal. For the SNP query result returned to the agency terminal, the present invention uses the Hamming code technology to verify, and the researcher's terminal cannot directly infer any information from the Hamming code detection bits. Here, the present invention does not consider any privacy leakage in the output. Therefore, the system is safe and reliable.

6下面对本发明的实验分析数据进行说明6 The experimental analysis data of the present invention will be described below

自人类基因组计划开始以来，研究人员终端报告了大量的SNP。优质的SNP位点成为候选基因的可用性使得候选基因与全基因组关联性研究成为可能。连锁不平衡(LD)技术已经被广泛应用于开发高质量的SNP标记图。当应用于疾病-基因作图时，通过关联分析评估LD，其需要比较受影响的疾病(例如，阿尔茨海默氏症的诊断)和对照个体(例如，没有阿尔茨海默氏症的诊断)之间的等位基因或单倍型频率。Toivonen等人提出了一种称为单倍型模式挖掘(HPM)的LD映射数据挖掘方法用于生成模拟的SNP数据集。本发明提供的数据隐私保护框架，通过使用模拟的SNP数据集进行测试实验以评估其实用性。本发明系统的源代码是用JAVA程序编码语言实现的。本发明的实验环境是Intel i7-4710MQ(4核1.6GHz)CPU、8GB内存；软件环境为Windows 10操作系统和IntelliJ IDEA运行平台。在实验中，本发明提供两台不同的机器运行云服务器和认证机构服务器。本发明在实验中使用了汉明码技术对基因数据添加检测位，并使用1024位的Paillier加密技术。Since the beginning of the Human Genome Project, researchers have terminally reported a large number of SNPs. The availability of high-quality SNP loci as candidate genes makes it possible to study the association between candidate genes and the whole genome. Linkage disequilibrium (LD) techniques have been widely used to develop high-quality SNP marker maps. When applied to disease-gene mapping, LD is assessed by association analysis, which entails comparing affected disease (eg, a diagnosis of Alzheimer's) to control individuals (eg, no diagnosis of Alzheimer's) ) between allele or haplotype frequencies. Toivonen et al. proposed a LD mapping data mining method called Haplotype Pattern Mining (HPM) for generating simulated SNP datasets. The data privacy protection framework provided by the present invention evaluates its practicality by conducting test experiments using simulated SNP data sets. The source code of the system of the present invention is realized by JAVA program coding language. The experimental environment of the present invention is Intel i7-4710MQ (4-core 1.6GHz) CPU and 8GB memory; the software environment is Windows 10 operating system and IntelliJ IDEA running platform. In the experiment, the present invention provides two different machines to run the cloud server and the certification authority server. In the present invention, the Hamming code technology is used in the experiment to add detection bits to the genetic data, and the 1024-bit Paillier encryption technology is used.

在本发明的安全计数查询实验中，本发明使用了四个具有500,10 00和5000个SNPs数据集以及涉及10个，20个，30个和40个随机选择的SNP序列的四个不同查询大小。本发明测试了不同的记录数量对hashMap构建时间的影响。本发明测试的结果与MohammadZahidul Hasan等人使用的index tree的方法和M.Canim等人使用cryptographichardware SCPs来处理敏感的生物医学数据的方法进行对比。实验结果表明本发明提出的方法在加密时间和查询执行时间性能均优于。本实验从以下三个方面进行分析：In the safe count query experiments of the present invention, the present invention used four datasets with 500, 1000 and 5000 SNPs and four different queries involving 10, 20, 30 and 40 randomly selected SNP sequences size. The present invention tests the influence of different record numbers on the construction time of the hashMap. The results of the present tests were compared to the index tree method used by Mohammad Zahidul Hasan et al. and the method of M. Canim et al using cryptographichardware SCPs to process sensitive biomedical data. The experimental results show that the method proposed by the present invention is superior in both encryption time and query execution time. This experiment is analyzed from the following three aspects:

(1)HashMap生成时间。它指处理基因组数据库并使用基因型和表型构建hashMap所需的时间。本发明分析了包含500和1000数量的SNP的数据集的创建时间，如图2所示。实验表明，随着SNP数量的增加需要消耗的时间不断增加。与Mohammad Zahidul Hasan等人使用的index tree创建时间对比，本发明的方法在创建时间要比index tree时间多，但是以秒计算，对于实验运行时间来说只占有很小的比例。(1) HashMap generation time. It refers to the time it takes to process a genome database and build a hashMap using genotypes and phenotypes. The present invention analyzes the creation time of datasets containing 500 and 1000 numbers of SNPs, as shown in Figure 2. Experiments show that with the increase of the number of SNPs, the time required to consume increases continuously. Compared with the creation time of the index tree used by Mohammad Zahidul Hasan et al., the method of the present invention takes more time to create the index tree, but it only occupies a small proportion of the experiment running time in seconds.

(2)加密时间。本发明的方法中对于数据的加密时间主要消耗在对每个entity的碱基对和计数值加密，本发明使用的CTR方案加密Bloom filter并不需要很长时间，可忽略不计。其次，加密的时间也取决于SNP数据集中的序列的总数。如图3，本发明对具有500和1000个SNP的数据集的加密时间进行分析。随着SNP数量的增加需要消耗的加密时间不断增加，本发明的方法对比Index Tree在加密时间上有很明显的优势。例如对于1000records，本发明仅需要2.11min,(2) Encryption time. In the method of the present invention, the encryption time for data is mainly consumed in encrypting the base pair and count value of each entity. The CTR scheme used in the present invention does not take a long time to encrypt the Bloom filter, which can be ignored. Second, the time to encrypt also depends on the total number of sequences in the SNP dataset. Figure 3, the present invention analyzes the encryption time of data sets with 500 and 1000 SNPs. With the increase of the number of SNPs, the encryption time that needs to be consumed increases continuously, and the method of the present invention has obvious advantages in encryption time compared with the Index Tree. For example, for 1000records, the present invention only needs 2.11min,

(3)查询时间。查询时间指研究人员终端提供查询请求到得到返回结果所用的时间。为了计算查询时间，本发明随机选择10，20，30和40个大小的SNP序列，对5000条记录执行执行查询。如图4，列出了这些查询大小在加密的hashMap上的执行时间。由于需要对hashMap中entity中表型信息进行搜索匹配，本发明需要对hashMap中所有的entity进行遍历查询以获得满足查询条件的基因序列。在图2，本发明给出了本发明的方法执行的查询时间与的方法执行时间的比较，随着查询数目的增加，本发明的执行时间呈线性增长的趋势，时间以秒为单位。(3) Query time. The query time refers to the time taken by the researcher terminal to provide the query request to the return result. To calculate the query time, the present invention randomly selects SNP sequences of size 10, 20, 30 and 40, and executes the query on 5000 records. Figure 4 lists the execution times of these query sizes on an encrypted hashMap. Since the phenotypic information in the entities in the hashMap needs to be searched and matched, the present invention needs to traverse and query all the entities in the hashMap to obtain gene sequences that satisfy the query conditions. In Fig. 2, the present invention provides a comparison between the query time executed by the method of the present invention and the method execution time of the present invention. With the increase of the number of queries, the execution time of the present invention is linearly increased, and the time is in seconds.

Table 3 Comparison of query execution time on 5000 recordsTable 3 Comparison of query execution time on 5000 records

通过改变计数查询的记录大小和查询数据集总数，本发明分析了使用hashMap存储数据时间和执行查询的时间，并与已经提出的方法进行了对比。本发明的方法通过使用hashMap存储数据，并使用同态加密方案，有效的保护了数据的隐私性。本发明的方法支持存储大量的数据集。对entity的敏感数据的处理和加密并不会对记录数量产生直接影响。本发明提出使用汉明码的纠错性对原始数据添加检测位，在代理机构终端对返回的结果进行完整性验证，有效地保证返回给研究人员终端的结果的准确性。这些特性使得本发明提出的方法可以更加安全，高效的对加密的数据执行计数查询。By changing the record size of the count query and the total number of query data sets, the present invention analyzes the time to store data using a hashMap and the time to execute a query, and compares with the methods already proposed. The method of the invention effectively protects the privacy of the data by using the hashMap to store the data and using the homomorphic encryption scheme. The method of the present invention supports the storage of large data sets. The processing and encryption of the entity's sensitive data does not have a direct impact on the number of records. The present invention proposes to use the error correction property of Hamming code to add detection bits to the original data, and to verify the integrity of the returned result at the agency terminal, so as to effectively ensure the accuracy of the result returned to the researcher's terminal. These characteristics make the method proposed by the present invention more secure and efficient to perform counting query on encrypted data.

本发明针对加密基因组数据的安全计数查询问题，在分析现有方案的基础上提出了一种安全有效的基因组数据外包方法。为了实现数据的隐私性，该方法将数据库中的数据以hashMap方式存储，并将存储的数据外包给第三方云服务器。通过使用第三方代理机构终端，实现了研究人员终端在云服务的安全计数查询。为了验证数据的完整性，本发明提出使用hamming code技术对基因组数据添加检测位，保证在数据处理和查询执行阶段不会显示任何敏感的基因数据。Aiming at the security counting query problem of encrypted genome data, the invention proposes a safe and effective genome data outsourcing method on the basis of analyzing the existing scheme. In order to realize the privacy of data, this method stores the data in the database as a hashMap, and outsources the stored data to a third-party cloud server. By using the third-party agency terminal, the security counting query of the researcher's terminal in the cloud service is realized. In order to verify the integrity of the data, the present invention proposes to use the hammering code technology to add detection bits to the genomic data to ensure that no sensitive genetic data is displayed during the data processing and query execution stages.

Claims

1. a security counting query and integrity verification device based on encrypted genomic data, is characterized in that, comprises data owner terminal, certification authority server, cloud server, agency terminal, researcher terminal;

The certification authority terminal receives and encrypts the patient record from the data owner terminal;

The data owner terminal processes the patient's genomic data in the prescribed format and sends it to the certification authority server in clear text;

After the certification authority server obtains the data shared by different data owners, it builds a searchable hash map that aggregates the shared encrypted data, and sends the hash map to the cloud service;

The cloud service obtains encrypted data from the certification body and processes the encrypted query request sent by the agency terminal, executes the query request, and encrypts the query result and sends the query result to the agency terminal;

The researcher terminal responds to the user operation and sends the researcher query request to the agency terminal;

After receiving the researcher's query request, the agency terminal encrypts the query request data and sends the encrypted query request to the cloud server. After obtaining the query result from the cloud server, it decrypts the query result using the private key and sends it to the researcher's terminal.

2. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 1, is characterized in that,

The cloud server generates a set of SNP datasets for a specific gene and the frequency value of the specific gene in the database, and then normalizes the value to the total number of records;

The cloud server expresses the total number of count queries as:

Among them, D is the database Q is the query, D={S ₁ , S ₂ ,...,S _n } represents the SNP sequence of n patients, and the count query is defined as finding the number of patients in D that satisfy multiple query conditions q of Q, The gene sequence of m SNP sites of a patient is represented as S={d ₁ ,d ₂ ,...,d _m }, where d _i (1≤i≤m) represents the SNP value of the i-th site of the patient.

3. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 1, is characterized in that,

After the certification body server receives the data provided by the data owner, it creates an Entity of the map for the SNP sequence of each patient, and after all data processing, a mapping table M is generated, which contains the genotype and phenotype information of each patient. , after creating M, for each new record from the data owner, the certification authority server creates and updates the Entity information of M;

Each Entity contains the following:

key: The key value in Entity points to specific data, index value,

I: the phenotype dataset corresponding to the genome sequence,

count: In the database, the frequency of the current SNP sequence, indicating the total number of records matching the genotype and phenotype,

next: The key value of the next Entity.

4. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 3, is characterized in that,

The complexity of the hash table created by the certification authority server is O(mn), where m is the number of records in the database, n is the number of different SNP sites in each sequence record, and each Entity is defined as θ, which consists of The part is denoted as θ(key, geno, I, count, next).

5. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 1, is characterized in that,

The certification authority server uses the Bloom filter technology to process the phenotype data. For each SNPs record in the genome sequence, insert the corresponding phenotype in the Bloom filter. If multiple phenotypes are associated with the same genome sequence, they will be added. All are inserted into the Bloom filter of the genome sequence entity, each entity in M contains the Bloomfilter value I, indicating the phenotypic information corresponding to the genome sequence;

The certification authority server sets the hash function H used in the Bloom filter and the domain of the common alphabet Σ, where the domain of Σ is the set of all possible phenotypes, and for each entity in the genome sequence, inserts in the Bloom filter the corresponding Phenotype, each phenotype in Σ is mapped to a unique number, and that number is plugged into the Bloom filter,

The certification authority server sends the private key sk to the agency terminal, and provides the key k of the PRF and the hash function F _k used for Bloomfilter;

The certification authority server matches the location of the phenotype value I _q hash of the query request with the I _i hash location of each Entity;

The certification authority server sends the PRF key k to the agency terminal, and adopts the formula for the encrypted Bloom filterI _i '

to decrypt.

6. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 1, is characterized in that,

The certification authority server generates a key pair (pk, sk), where pk is the public key and sk is the private key, and uses the public key pk to encrypt each entity in the hash table, expressing the encryption of the entity as E _pk (θ );

The certification authority server expresses the hash map in which all entities in the hash map M are encrypted as

and will

Send to cloud server;

The certificate authority server sends the key pair (pk,sk) to the agency terminal.

7. a kind of security counting query and integrity verification device based on encrypted genome data as claimed in claim 1, is characterized in that,

The cloud server sends the query result R _q to the agency, and the agency terminal decrypts the SNP value with the key sk and uses the Hamming code to check the integrity of the data. For the returned query result that meets the query conditions:

R _q ={E _pk (count)|E _pk (B ₁ )|E _pk (B ₂ )|…|E _pk (B _num )|I _i ′}

The agency terminal decrypts each SNP value with the private key sk, and the decrypted

Will

Recalculate the Hamming code detection bit as T';

The agency terminal determines whether the data is complete by comparing the correlation between T' and T _q .

If T′=T _q , the agency terminal determines that the data is safe and reliable;

If T′≠T _q , and T′∈{h _A ,h _C ,h _G ,h _T }, then the agency terminal checks the bit T _q for the returned

make corrections and update the hashmap;

If T′≠T _q , and

Then the agency terminal determines the original data If it has been tampered with and has been tampered with other content, the SNP sequence of the patient is no longer credible, and the returned query result count value is not credible.

8. A security counting query and integrity verification method based on encrypted genomic data, comprising the following steps: