GB2469673A - Encrypting data tags, metadata and labels in datasets to protect the identity of individuals when combining datasets or databases

Info

Publication number: GB2469673A
Application number: GB0906975A
Authority: GB
Inventors: Robert Navarro; Steven Gilbert
Original assignee: SAPIOR Ltd
Current assignee: SAPIOR Ltd
Priority date: 2009-04-23
Filing date: 2009-04-23
Publication date: 2010-10-27
Also published as: GB0906975D0

Abstract

Disclosed is a method for joining datasets from a plurality of providers, each of which provides at least one dataset, data record or file relating to a subject. The method has the steps of associating the datasets with a label which is unique to the subject of that dataset, providing a primary and a secondary encryption key, encrypting the label using the primary encryption key in order to provide the dataset with a primary encrypted label, compiling a database from the datasets, and matching datasets within the database according to their primary encrypted label. Datasets have both a primary encrypted label and a secondary encrypted label, the secondary encrypted label being generated by encrypting the label of that dataset using the secondary encryption key. The dataset may be medical records and the subject may be a person. The encrypted label may be replaced with an arbitrary tag and the encrypted label may be removed once the database has been compiled. The databases may be generated from the records for epidemiology research, such that the anonymity and privacy of the patients are protected.

Description

Joining Datasets Using Encryption

Field of the Invention

This invention relates to a method and apparatus for joining datasets from a plurality of providers.

Background to the Invention

Medical science often relies upon researchers being able to gather large amounts of information from disparate sources. The science of epidemiology, for example, is quite impossible without detailed knowledge of the health of a population. Many of the treatments that we enjoy today depended at the start upon information gathered from large samples. For these reasons it is beneficial for researchers to have access to large databases of medical information.

However, this need for sharing information is in direct conflict with privacy rights.

Patients may provide medical information to any of a number of institutions aside from their General Practitioner (GP) and local hospital. Relevant information may also be held by various clinics, specialist treatment centres, disease registries or other national databases. For example, in the National Health Service (NHS) in the UK there is the Improving Access to Psychological Therapies (IAPT) programme and the Secondary Uses Service (SUS), among others. It is beneficial to researchers to have access to all of this information, and to be able to cross reference information about a patient, but it would be highly unethical to provide the researchers indiscriminate access to all of this data without full and informed consent from the patient. This sort of indiscriminate consent is very unlikely to be forthcoming.

Attempts have been made to overcome this conflict by ensuring anonymity when collecting medical data. If a patient's identity is not known to the researchers, then that patient's privacy is more easily protected. These solutions are often only partially successful. Collecting and cross referencing so much data while maintaining anonymity is an extremely difficult task.

One particularly difficult problem for a researcher to overcome is that of notification in the event of a discovery. For example, if a researcher finds that one of the patients in their data set is suffering from a treatable condition, then the researcher is obligated to inform the patient with the condition so that they can seek treatment. Ethics review boards will typically require researchers to make provisions to deal with this eventuality. If no provisions are made, then the research may not be allowed to proceed.

Because of this, true anonymity whereby a patient's identity can never be recovered from the data is often not an option. But the patient's identity must still be protected. An effective way for a researcher to draw together data from multiple sources while still maintaining anonymity would therefore be very valuable, especially if it allowed anonymity to be broken under certain, very specific, circumstances.

Summary of the Invention

Accordingly, this invention provides a method for joining datasets from a plurality of providers, wherein providers provide at least one dataset which relates to one of a plurality of subjects. The method comprises: associating at least one dataset with a label which is unique to the subject of that dataset; providing a primary encryption key and at least one secondary encryption key; encrypting at least one label using the primary encryption key in order to provide said dataset with a primary encrypted label; compiling a database from the datasets; and matching datasets within the database according to their primary encrypted label. At least one dataset comprises both a primary encrypted label and a secondary encrypted label, the secondary encrypted label being generated by encrypting the label of that dataset using a secondary encryption key.

In this way the invention provides a method for compiling a database. The primary and secondary encrypted labels function as pseudonyms for a dataset. Because labels are determined by the subject of a dataset, datasets can then be arranged according to their subject by matching the primary encrypted labels between different datasets. This preserves the anonymity of the subjects while still allowing cross referencing within a database. Some portion of the datasets are provided with a secondary encrypted label, which can later be used to reliably identify the subject of a dataset should this prove to be necessary. Therefore the invention provides a method by which anonymity of subjects can be preserved or broken under specific conditions and by specific providers. These conditions can be established before the database is compiled according to the needs of the providers, the subjects and the database compilers.
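As a minimal sketch of the matching step, the following Python fragment groups datasets by a primary encrypted label. It assumes a keyed hash (HMAC-SHA256) as a stand-in for encryption under the primary key, because both yield a fixed pseudonym per label that cannot be reversed. The key value, field names and sample records are hypothetical and are not taken from the specification.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: a keyed hash stands in for the primary encryption step.
PRIMARY_KEY = b"demo-primary-key"  # hypothetical value shared by all providers


def primary_encrypted_label(label: str) -> str:
    """Map a subject label to a fixed pseudonym that cannot be reversed."""
    return hmac.new(PRIMARY_KEY, label.encode("utf-8"), hashlib.sha256).hexdigest()


# Datasets from two hypothetical providers; the plain label is not stored.
datasets = [
    {"provider": "clinic", "attribute": "bp 120/80",
     "pel": primary_encrypted_label("Jane Doe|1970-01-01|AB1 2CD")},
    {"provider": "registry", "attribute": "asthma",
     "pel": primary_encrypted_label("Jane Doe|1970-01-01|AB1 2CD")},
    {"provider": "clinic", "attribute": "bp 135/85",
     "pel": primary_encrypted_label("John Roe|1980-02-02|EF3 4GH")},
]

# Match datasets within the database according to their primary encrypted label.
matched = defaultdict(list)
for dataset in datasets:
    matched[dataset["pel"]].append(dataset)
```

Each key of `matched` is a pseudonym, and the list under it holds every dataset that relates to the same subject, without the subject's label appearing anywhere in the compiled structure.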

The anonymising process is irreversible for some providers and users of the eventual database. In the event that a user of a database compiled according to the invention identifies a beneficial option, such as a medical treatment, for the subjects of that database, the database will still allow the accurate identification and targeting of those subjects. The invention provides an efficient solution to the problem of creating such a database.

The label will typically be stored, initially, as part of the dataset. Hence, if the dataset comprises entries in a spreadsheet, then the label will also typically initially be entered into the same spreadsheet. Once the primary and secondary encrypted labels have been created and added to the datasets, the labels can be deleted.

At each stage, the data being handled in this method may be further encrypted to add additional security. For example, datasets may be encrypted for transmission to the database, and later queries to the database may also be encrypted.

In a preferred embodiment there are a plurality of secondary encryption keys. Where this is the case, it will usually be that at least one secondary encryption key is unique to one provider. Typically each secondary encryption key is unique to one provider. By dividing up the encryption keys in this way, access to the information about the subject of any given dataset can be carefully controlled. Any interested party will typically know how to decrypt some of the secondary encrypted labels but not all. Therefore they will not be able to identify the subject of any dataset that does not include a secondary encrypted label that they know how to decrypt.

The primary encryption key may desirably be a public key for an asymmetric encryption algorithm. Where this is the case, all records of the associated private key generated with the primary public key are usually destroyed. Without this private key it is effectively impossible to decode the primary encrypted label, further preserving the anonymity of the subjects.
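The idea of encrypting with a public key whose private counterpart has been discarded can be illustrated with a toy example. The sketch below assumes textbook RSA without padding, built from two well-known Mersenne primes using only the Python standard library. It is not secure and is not the scheme used in the specification, which does not name an algorithm. All function and variable names are hypothetical.

```python
# Toy textbook RSA: no padding, small hard-coded primes. Illustrative only.
P = 2**61 - 1  # known Mersenne prime
Q = 2**89 - 1  # known Mersenne prime
E = 65537      # conventional public exponent


def make_primary_public_key():
    """Return (n, e) only; the private exponent is computed and then discarded."""
    n = P * Q
    d = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)
    del d  # stands in for destroying all records of the private key
    # Note: P and Q are hard-coded above, so this toy key is trivially
    # recoverable. A real key pair would use random primes, also discarded.
    return n, E


def encrypt_label(label: str, public_key) -> int:
    """Encrypt a short label deterministically with the public key."""
    n, e = public_key
    m = int.from_bytes(label.encode("utf-8"), "big")
    if m >= n:
        raise ValueError("label too long for this toy modulus")
    return pow(m, e, n)


public_key = make_primary_public_key()
c1 = encrypt_label("Jane Doe", public_key)
c2 = encrypt_label("Jane Doe", public_key)  # same label, same ciphertext
c3 = encrypt_label("John Roe", public_key)  # different label
```

Because the encryption here is deterministic, `c1` and `c2` are equal and can be used for matching. A production system would need a vetted cryptographic library and a deliberate choice of deterministic scheme.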

Typically, the secondary encryption key is a key for a symmetric encryption algorithm, which requires fewer resources to produce and distribute than asymmetric keys. However, the secondary encryption key may be a public key for an asymmetric encryption algorithm if additional security is required.

In one advantageous embodiment, the subject of at least one dataset is a person. Where the subject is a person, the label will typically comprise identifying information such as the person's name, their address or their date of birth. The information chosen for use in the labels must be known to all providers, or cross referencing datasets will not be possible.

At least one dataset may comprise medical records. The invention is especially suited to such instances, thanks to the high standards of anonymity and ethics that must be maintained when handling medical records.

Providers may be any individual or body which maintains records on a subject suitable for the database. Where the datasets comprise medical records, the providers will usually include general practitioners and medical bodies such as hospitals or clinics and may also include disease registries.

Typically, the labels which have been encrypted with the primary encryption key are deleted from the database once the datasets have been matched, in order to provide further security and reduce the size of the database.

In a preferred embodiment, each label is consistently replaced with an arbitrarily generated tag before the database is compiled. This helps to protect the eventual database from attacks wherein an attacker trying to break the encryption searches for keys that produce valid outcomes: names, addresses, dates of birth and so on. Typically, the labels are replaced after they have been encrypted. However, it is also possible to replace the labels while they are still in plain text and then encrypt the replacements.

The invention extends to data processing apparatus configured to operate in accordance with the method of the invention. The invention further extends to computer software which configures general-purpose data processing apparatus to operate according to the method of the invention.

Brief Description of the Drawings

An embodiment of the invention will now be described, by way of example only, and with reference to the accompanying drawings, in which: Figure 1 is a diagram showing how a first dataset is encrypted by a provider in category A; Figure 2 is a diagram showing how a second dataset is encrypted by a provider in category B; Figure 3 is a diagram showing how a first and second dataset are joined in a database; and Figures 4 and 5 are diagrams showing how the first and second dataset are encrypted according to a second embodiment of the invention.

Detailed Description of Exemplary Embodiments

In a first method of joining datasets according to the invention, a researcher gathers datasets from a plurality of providers in order to build a database. The researcher begins by creating a number of asymmetric encryption key pairs, each of which comprises a public key which is used for encrypting information and a private key which can then be used to decrypt the information. The researcher then assigns each provider to category A or category B, and provides public encryption keys to each provider according to their category. Providers in category B will eventually be able to identify at least some of the patients in the database, when provided with a pseudonym from the patient's dataset. Providers in category A will not.

Figure 1 illustrates how a first dataset 1 is encrypted by an information provider in category A. A category A provider may be an institution such as the Improving Access to Psychological Therapies (IAPT) programme or the Secondary Uses Service (SUS), both of which are part of the National Health Service (NHS) in the UK and hold extensive patient records. Other examples include clinics, specialist treatment centres, disease registries and other national databases, or some other body entirely, possibly not connected to a government health service. The first dataset 1 comprises both identity entries, which are labelled id1, id2 and id3, and attribute entries, which are labelled att1 and att2. For simplicity, Figure 1 shows only dataset 1, which relates to a single patient; typically, however, each provider will contribute a plurality of patient records.

In this example, the identity entries id1, id2 and id3 are the name, date of birth and postcode of the patient. These three pieces of information are used because they will usually be known to all of the providers, and they are typically sufficient to uniquely identify any given patient. The attribute entries att1 and att2 contain the information from the category A provider which is intended for use in the database. Depending upon the provider and the researcher's field of interest, this information could be almost anything. It may contain information about the patient's physiology, such as blood pressure readings or the results of scans and x-rays. It may also contain biographical information such as ethnic background, age and number of children. There can be as many attribute entries as are required.

Once the identity entries id1, id2 and id3 have been checked to ensure that they are correct and properly formatted, the researcher encrypts them using Public Key 1 as shown in Figure 1. The attribute entries att1 and att2 are not encrypted at this stage. Public Key 1 is the encryption key from one of the asymmetric encryption key pairs that the researcher generated earlier. However, the researcher does not record the equivalent private key which would normally be used for decryption. Therefore, once the identity entries have been encrypted they create a pseudonym which is unique to the patient but cannot be decrypted.

Pseudonymised category A data 2 is therefore created, which comprises the first encrypted identity entries id1', id2' and id3' and the attribute entries att1 and att2. The pseudonymised category A data 2 is then compressed into a Zip file along with the equivalent pseudonymised category A data for a number of other patients, and encrypted using Symmetric Key 1 to create the package of data labelled pack1.
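The compression step can be sketched as follows using Python's standard `zipfile` module. This is an illustration under assumptions: the JSON record layout and the function names are hypothetical, and the subsequent encryption of the archive with Symmetric Key 1 is deliberately not shown, since it would require a vetted cryptographic library.

```python
import io
import json
import zipfile


def build_package(records) -> bytes:
    """Compress pseudonymised records into an in-memory Zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for index, record in enumerate(records):
            # One JSON member per patient record (hypothetical layout).
            archive.writestr(f"record_{index}.json", json.dumps(record))
    return buffer.getvalue()


def read_package(package: bytes):
    """Recover the records from an archive produced by build_package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return [json.loads(archive.read(name)) for name in archive.namelist()]


# Placeholder values stand in for the encrypted identity entries.
pseudonymised_records = [
    {"first_ids": ["c1", "c2", "c3"], "att1": "bp 120/80", "att2": "non-smoker"},
    {"first_ids": ["d1", "d2", "d3"], "att1": "bp 135/85", "att2": "smoker"},
]
package = build_package(pseudonymised_records)
```

The resulting bytes would then be encrypted before transmission; `read_package` shows the corresponding unpacking on the researcher's side once decryption has taken place.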

Figure 2 illustrates how a second dataset 3 is encrypted by an information provider in category B. Again, only one dataset relating to one patient is shown, but each provider will typically contribute a plurality of patient records. The category B provider will usually be a GP or another healthcare provider who has direct and regular interaction with the patient.

The category B provider follows a procedure which is similar to the one followed by the category A provider illustrated in Figure 1. The second dataset 3 comprises identity entries id1, id2 and id3 as well as attribute entries att1 and att2. Again, the identity entries id1, id2 and id3 are the name, date of birth and postcode of the patient. The category B provider encrypts the identity entries using Public Key 1, to generate first encrypted identity entries id1', id2' and id3'. For any given patient, id1', id2' and id3' will always be the same, regardless of which provider performs the encryption. This is because the identity entries id1, id2 and id3 will always be the same for each patient, and are always encrypted with the same encryption key, Public Key 1.

The category B provider will also encrypt the identity information id1, id2 and id3 using Symmetric Key 2 to create second encrypted identity entries id1", id2" and id3". Symmetric Key 2 is retained by the category B provider as a part of their records, keeping it secret from the other providers. In an alternative embodiment, the category B provider may use an asymmetric key at this point.

Pseudonymised category B data 4 is therefore created, and comprises the first encrypted identity entries id1', id2' and id3', the second encrypted identity entries id1", id2" and id3", the attribute entries att1 and att2, and the provider identification number i. A different provider identification number is assigned to each provider in category B. The pseudonymised category B data 4 can therefore be easily attributed to a particular category B provider.

The pseudonymised category B data 4 is compressed into a Zip file along with the equivalent pseudonymised category B data for a number of other patients, and encrypted using Symmetric Key 1 to create the package of data labelled pack2.

Once the packages of data, pack1 and pack2, are prepared, each provider transmits their package to the researcher so that the database can be compiled. The information will typically be transmitted over the internet, either directly or by email, but it can also travel by other means, for example by being stored on a flash memory drive which is delivered by courier.

Figure 3 illustrates how the database is compiled. The researcher uses Private Key 2 to decrypt pack1 and pack2, deriving the pseudonymised category A data 2 and the pseudonymised category B data 4. Data for each patient from different providers can then be matched up using the first encrypted identity entries id1', id2' and id3', which will always be the same for a given patient, regardless of the provider. The researcher can therefore combine the data on each patient from multiple sources to create pseudonymised compiled data 5. For each patient, the pseudonymised compiled data 5 comprises the second encrypted identity entries id1", id2" and id3", attribute entries drawn from one or more providers, and a provider identification number i. The first encrypted identity entries id1', id2' and id3' will then typically be deleted, although they can be retained if the researcher anticipates adding further data to the database at a future date.
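The compilation step described above can be sketched as a simple group-and-merge over the unpacked records. The record layout, the field names and the `keep_first` option are hypothetical. Short placeholder strings stand in for the encrypted identity entries, and attribute names are prefixed per provider so that merged attributes do not overwrite one another.

```python
from collections import defaultdict


def compile_database(records, keep_first=False):
    """Merge records that share the same first encrypted identity."""
    grouped = defaultdict(
        lambda: {"second_ids": [], "provider_ids": [], "attributes": {}}
    )
    for record in records:
        entry = grouped[record["first_id"]]
        entry["attributes"].update(record["attributes"])
        # Only category B records carry a second encrypted identity.
        if "second_id" in record:
            entry["second_ids"].append(record["second_id"])
            entry["provider_ids"].append(record["provider_id"])

    compiled = []
    for first_id, entry in grouped.items():
        row = dict(entry)
        if keep_first:
            # Retained only if further data is expected to be added later.
            row["first_id"] = first_id
        compiled.append(row)
    return compiled


records = [
    {"first_id": "p-001", "attributes": {"A.att1": "scan clear"}},
    {"first_id": "p-001", "second_id": "s-9f2", "provider_id": 7,
     "attributes": {"B.att1": "bp 120/80"}},
    {"first_id": "p-002", "attributes": {"A.att1": "x-ray pending"}},
]
database = compile_database(records)
```

By default the first encrypted identity is used only as the grouping key and is dropped from the compiled rows, mirroring the typical deletion described in the text.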

The researcher now has access to a database of attribute entries which they can use in their research. Each patient is kept anonymous, known only to the researchers by their pseudonym of the second encrypted identity entries id1", id2" and id3".

It is important that the identity entries id1, id2 and id3 are accurate and formatted according to precise rules. Any mistakes in the identity entries will make it impossible to completely cross reference datasets later. For this reason validation and cleansing operations will usually be carried out at the provider in order to check that id1, id2 and id3 comply with rigid rules of formatting for all of the patients. The attribute entries are less critical to the operation of the database, although careful arrangement can help considerably when the researcher later comes to study the database.
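A validation and cleansing step of this kind might look like the sketch below. The specific rules (upper-case names with collapsed whitespace, ISO dates, postcodes without spaces) and the accepted input date formats are assumptions chosen for illustration; the specification does not prescribe them.

```python
import re
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def normalise_identity(name: str, date_of_birth: str, postcode: str):
    """Return (name, date of birth, postcode) in one canonical form."""
    clean_name = re.sub(r"\s+", " ", name).strip().upper()

    clean_dob = None
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(date_of_birth.strip(), fmt)
        except ValueError:
            continue
        clean_dob = parsed.date().isoformat()
        break
    if clean_dob is None:
        raise ValueError(f"unrecognised date of birth: {date_of_birth!r}")

    clean_postcode = re.sub(r"\s+", "", postcode).upper()
    if not clean_name or not clean_postcode:
        raise ValueError("name and postcode must not be empty")
    return clean_name, clean_dob, clean_postcode


canonical = normalise_identity("  jane   doe ", "01/02/1970", "ab1 2cd")
```

Two providers that hold the same patient under slightly different spellings or date formats would then produce identical identity entries, and therefore identical pseudonyms, after normalisation.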

Care must be taken with the information provided in the attribute entries that the patient cannot be identified simply by studying those entries. For example, although the date of birth is provided as identity entry id2, the researcher will not have access to this information. It may therefore be necessary to provide the age of the patient as one of the attributes. It will typically be sufficient to provide an approximate age as an attribute and avoid giving the exact date of birth, therefore helping to maintain anonymity on the part of the patient.
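One way to derive an approximate age attribute is to report an age band rather than the date of birth. The sketch below assumes ten-year bands and a reference date supplied by the caller; both choices, and the function name, are illustrative.

```python
from datetime import date


def age_band(date_of_birth: date, on: date, width: int = 10) -> str:
    """Return an age range such as '30-39' instead of an exact age."""
    had_birthday = (on.month, on.day) >= (date_of_birth.month, date_of_birth.day)
    age = on.year - date_of_birth.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


band = age_band(date(1970, 6, 15), on=date(2009, 4, 23))  # "30-39"
```

Wider bands reveal less about the patient but also give the researcher less to work with, so the width would be chosen to suit the study.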

In the event that the researcher makes a discovery and needs to be able to contact a patient, that patient can still be identified. To do this, the researcher determines which category B provider supplied the information for the patient using the provider identification number i. The researcher then contacts the category B provider and tells them the encrypted identity entries for the patient in question: id1", id2" and id3". The category B provider uses Symmetric Key 2 to decrypt this data and identify the patient.
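The re-identification flow can be sketched as follows. The cipher here is a toy XOR stream derived from SHA-256, used only so that the example runs with the standard library. It is an assumption for illustration, it is not secure, and it is not the symmetric algorithm of the specification, which is left unspecified. The key values, provider numbers and record layout are hypothetical.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """Derive a repeatable byte stream from the key (toy construction)."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def toy_crypt(key: bytes, data: bytes) -> bytes:
    """XOR with the key stream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


# Each category B provider keeps its own secret key. The dictionary models
# those separate key stores; the researcher would not hold these keys.
provider_keys = {1: b"key held by provider 1", 2: b"key held by provider 2"}

compiled_record = {
    "provider_id": 1,
    "second_id": toy_crypt(provider_keys[1], b"JANE DOE|1970-02-01|AB12CD"),
    "attributes": {"att1": "bp 120/80"},
}


def reidentify(record, keys_by_provider) -> str:
    """Route the pseudonym to the right provider and decrypt it there."""
    key = keys_by_provider[record["provider_id"]]
    return toy_crypt(key, record["second_id"]).decode("utf-8", errors="replace")


identity = reidentify(compiled_record, provider_keys)
```

Only the provider named by the provider identification number recovers the patient's details; decrypting the same pseudonym with another provider's key yields meaningless bytes.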

Typically, in order that every patient can be identified when necessary, every patient in the database will need to be included in the records of at least one category B provider. There may be some rare exceptions to this, for example when datasets are being provided for patients both living and dead. For those patients without a category B provider, data from multiple category A providers can still be matched using the method described above.

The researcher may visit the providers to assist them in preparing and transmitting their datasets. Where this is the case, the researcher may briefly have access to sensitive data such as the identity of patients. However, such information is still hidden from anyone who later administers the database.

If it is necessary for a patient to be identifiable to more than one provider, then there is no reason why a patient cannot have more than one provider in category B. In this event, the compiled datasets will comprise more than one second encrypted identity entry, and more than one provider identification number.

In a second method of joining datasets according to the invention, all of the providers are assigned to category B. In this embodiment, data on some patients will still originate with more than one provider, and consequently the compiled datasets will have more than one second encrypted identity entry, and more than one provider identification number.

In a third method of joining datasets according to the invention, the pseudonymisation of identity entries comprises an additional step. As illustrated in Figure 4, a provider in category A who is providing data according to the third method will first encrypt the identity entries using Public Key 1 as described above. The first encrypted identity entries id1', id2' and id3' are then transmitted to a server 6 which generates an arbitrary tag, Tag1, to replace them. In an alternative embodiment, each of the first encrypted identity entries id1', id2' and id3' is replaced by an individual tag. Tag1 is then encrypted, along with the attributes data, to form pack3.

Because Tag1 is arbitrary, it is much less vulnerable to undesirable decryption than the first encrypted identity entries id1', id2' and id3'. In order to establish the connection between a patient and their data, an attacker would have to gain access to the server 6. Even if they managed this, the attacker would still have to decrypt the first encrypted identity entries id1', id2' and id3'.

The tags provided by server 6 are arbitrary but consistent. Therefore the server 6 will always provide Tag1 in response to first encrypted identity entries id1', id2' and id3'. Therefore data from multiple providers can still be joined using Tag1 to match up data referring to the same patient.
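An arbitrary but consistent tag service can be sketched with a lookup table and a random token generator. The class name, method name and tag length are assumptions, and short placeholder strings stand in for the first encrypted identity entries.

```python
import secrets


class TagServer:
    """Hands out arbitrary tags and repeats the same tag for the same input."""

    def __init__(self):
        self._tags = {}

    def tag_for(self, encrypted_entries: tuple) -> str:
        # A tag is random, so it carries no information about its input;
        # storing it makes later requests for the same input consistent.
        if encrypted_entries not in self._tags:
            self._tags[encrypted_entries] = secrets.token_hex(8)
        return self._tags[encrypted_entries]


server = TagServer()
tag_from_provider_a = server.tag_for(("c1", "c2", "c3"))
tag_from_provider_b = server.tag_for(("c1", "c2", "c3"))  # same patient
tag_other_patient = server.tag_for(("d1", "d2", "d3"))
```

Both providers receive the same tag for the same patient, so their records can still be joined, while the tag itself cannot be attacked as a ciphertext.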

Figure 5 shows a provider in category B who is providing data according to the third method. In this case, the first encrypted identity entries id1', id2' and id3' are provided to the server 6 and replaced with Tag1, while the second encrypted identity entries id1", id2" and id3" are replaced with Tag2. The tags and attribute data are then encrypted to create pack4. Again, in an alternative embodiment an individual tag may be provided for each of the encrypted identity entries id1', id2', id3', id1", id2" and id3".

A database can then be compiled as described above, using tags to match records.

If it is necessary to identify a patient, the researcher can identify the category B provider using the provider identification number i as in previous embodiments. The researcher then sends the provider Tag2. Because the tags are consistent, the category B provider can use Tag2 to look up the second encrypted identity entries id1", id2" and id3" on the server 6. The category B provider then uses Symmetric Key 2 to decrypt this data and identify the patient.
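The reverse lookup described here implies that the server keeps the mapping in both directions. A minimal sketch follows, with hypothetical class and method names and placeholder strings for the second encrypted identity entries. Access control on the lookup and the final decryption with Symmetric Key 2 are not modelled.

```python
import secrets


class TagServer:
    """Stores tag assignments so that a tag can be resolved back later."""

    def __init__(self):
        self._tag_by_entries = {}
        self._entries_by_tag = {}

    def tag_for(self, entries: tuple) -> str:
        if entries not in self._tag_by_entries:
            tag = secrets.token_hex(8)
            self._tag_by_entries[entries] = tag
            self._entries_by_tag[tag] = entries
        return self._tag_by_entries[entries]

    def entries_for(self, tag: str) -> tuple:
        """Return the encrypted entries that a tag replaced."""
        return self._entries_by_tag[tag]


server = TagServer()
second_entries = ("s1", "s2", "s3")    # placeholders for encrypted entries
tag2 = server.tag_for(second_entries)  # stored in the researcher's database
recovered = server.entries_for(tag2)   # looked up by the category B provider
```

The provider would then decrypt `recovered` with its own key to identify the patient.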

The tags Tag1 and Tag2 are typically much shorter in length than the first and second encrypted identity entries, which helps to reduce the size and hence the cost of the database that the researcher builds.

As described above, methods of joining data according to the invention are well suited to use with medical data. However, the same method can be used to join other sorts of data for other reasons. For example, an identical system could be used to join data on housing from a plurality of local authorities, or by an organisation such as the Financial Services Authority (FSA) in order to compile databases of financial information about its regulated companies.

Claims

1. A method for joining datasets from a plurality of providers, wherein providers provide at least one dataset which relates to one of a plurality of subjects, the method comprising: associating at least one dataset with a label which is unique to the subject of that dataset; providing a primary encryption key and at least one secondary encryption key; encrypting at least one label using the primary encryption key in order to provide said dataset with a primary encrypted label; compiling a database from the datasets; and matching datasets within the database according to their primary encrypted label, wherein at least one dataset comprises both a primary encrypted label and a secondary encrypted label, the secondary encrypted label being generated by encrypting the label of that dataset using a secondary encryption key.
2. A method as claimed in claim 1, wherein there are a plurality of secondary encryption keys.
3. A method as claimed in claim 1 or 2, wherein at least one secondary encryption key is unique to one provider.
4. A method as claimed in any preceding claim, wherein the primary encryption key is a public key for an asymmetric encryption algorithm.
5. A method as claimed in claim 4, wherein all records of the private key generated with the primary encryption key are destroyed.
6. A method as claimed in any preceding claim, wherein each secondary encryption key is a key for a symmetric encryption algorithm.
7. A method as claimed in any preceding claim, wherein the subject of at least one dataset is a person.
8. A method as claimed in any preceding claim, wherein at least one dataset comprises medical records.
9. A method as claimed in any preceding claim, wherein the labels encrypted with the primary encryption key are deleted from the database once the datasets have been matched.
10. A method as claimed in any preceding claim, wherein each label is consistently replaced with an arbitrarily generated tag before the database is compiled.
11. Data processing apparatus configured to operate in accordance with the method of any preceding claim.
12. Computer software which configures general-purpose data processing apparatus to operate according to the method of any of claims 1 to 10.