WO2015085358A1

WO2015085358A1 - A method and system for analysing test data to check for the presence of personally identifiable information

Info

Publication number: WO2015085358A1
Application number: PCT/AU2014/050385
Authority: WO
Inventors: Niall CRAWFORD; Liam McCRORY
Original assignee: ENOV8 DATA Pty Ltd
Current assignee: ENOV8 DATA Pty Ltd
Priority date: 2013-12-10
Filing date: 2014-11-28
Publication date: 2015-06-18
Anticipated expiration: 2016-06-10

Abstract

Systems and methods of analysing test data to check for the presence of personally identifiable information including the steps of: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.

Description

A METHOD AND SYSTEM FOR ANALYSING TEST DATA TO CHECK FOR THE PRESENCE OF PERSONALLY IDENTIFIABLE INFORMATION

Technical Field

The present invention relates to methods and systems for analysing test data and particularly relates to methods and systems for checking test data for the presence of personally identifiable information.

Background to the Invention

Most businesses and other organisations use computer software systems which produce, store and manipulate data, known as production data, as part of the ongoing activities of the business. Over time, it typically becomes necessary to revise, upgrade or replace various hardware or software components of systems. Prior to implementing such changes it is customary to perform thorough testing to ensure correct operation of the proposed system modifications, thus avoiding malfunctions and disruption to the operation of the business. To this end, test data is prepared which mirrors the production data.

Preparation of "production-like" data is typically done in one of two ways: (1) via an Extract, Transform & Load (ETL) process, i.e. where the data is extracted from the production environment itself and is then manipulated (e.g. subset and privatised) before loading into the non-production environment (Development, Test & Training); and/or (2) fabricated/created from scratch. The former process (1) tends to be prevalent as it tends to deliver test data that looks and behaves more like the real production environment data.

The process of privatisation, during the ETL process, is necessary as a way of ensuring that Personally Identifiable Information (PII) data is obfuscated, so that customer, business and employee details are not disseminated and thus remain protected.

Despite all care being taken, it can occur that production data makes its way into test systems. This is particularly the case in large organisations or in complex systems which are maintained by a large number of people. As a result, Personally Identifiable Information becomes vulnerable to being accessed by and/or disseminated to unauthorised persons.

It would therefore be desirable to be able to analyse test data to check for the presence of Personally Identifiable Information.

Summary of the Invention

In a first aspect the present invention provides a method of analysing test data to check for the presence of personally identifiable information including the steps of: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.

The step of comparing may include calculating the correlation between the typical distribution and the actual distribution.

The method may further include the step of scanning the test data for personally identifiable information data types.

The data types include, but are not limited to, any one of First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards.

In a second aspect the present invention provides a system for analysing test data to check for the presence of personally identifiable information including:

calculating means for calculating the actual distribution of values for at least one field of the test data; comparing means for comparing a typical distribution of values with the actual distribution; and display means for providing an indication of the likely presence of personally identifiable information based on the result of the comparison.

The comparing means may be arranged to calculate a value representing the correlation between the typical distribution and the actual distribution.

In a third aspect the invention provides a software program including instructions which, when carried out by a processor, cause a computing system to operate a method according to the first aspect of the invention or to embody a system according to the second aspect of the invention.

In a fourth aspect the present invention provides a computer readable medium which is populated with a software program according to the third aspect of the invention.

The invention is based on the realisation that the act of privatisation changes both data content (column values) and the distribution of these values. This allows automatic validation of whether data has been adequately transformed (privatised) or not.

Brief Description of the Drawings

An embodiment of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram illustrating an embodiment of the invention; and Figure 2 is an example screenshot showing the operation of the invention.

Detailed Description of the Preferred Embodiment

Referring to figure 1, an organisation has a number of production databases 5 and a number of non-production environments 10, i.e. development, test and training. To ensure that the non-production environments 10 have data that "behaves" like production, it is necessary for the data to be copied from production using an Extract, Transform & Load (ETL) mechanism 15. Furthermore, to ensure that the organisation is compliant with various national and regulatory bodies it is also necessary for any Personally Identifiable Information ("PII") to undergo some form of privatisation (e.g. masking, obfuscation) process.

The challenge with the simple ETL process described above is that organisational data is typically non-trivial. Data is often large, uses complex data structures and is housed in many locations and on various different technologies, e.g. Mainframe, Windows, UNIX, Relational Databases, Hierarchical Databases, Flat files etc. The complexity of the data inevitably means that mistakes/omissions are likely and PII will be copied (and not privatised) into one or more non-production environments. Once the PII is in the secondary environment it is exposed to being used for secondary purposes and potentially to theft.

To mitigate against the risk of PII data finding its way into the non-production environments 10, a two-step automated profiling 20 and validation 25 process is carried out. The method according to the invention is typically initiated by a user accessing the system using their own computing device with a display screen. The computer processes initiated by the user may run on their local computing device, or on another computing device under instruction from the user's device. In some embodiments, the method of the invention may be configured to run at specified intervals, or in response to various triggers, such as the loading of new data into the test environment.

In the profiling 20 step, the data in the test environments is automatically scanned for PII data types. Common PII types of interest are First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards. This scanning is done by searching for common keywords (e.g. Smith, Jones etc.) and through other value/string matching means (e.g. Regular Expressions and Checksum analysis). Once a column is identified to be of a certain type (e.g. last name), its details are passed to the validation subroutine 25.

In the validation 25 step, the validation subroutine then analyses the column (data) for distribution patterns that would typically be found in a production system. A simple example would be the distribution of surnames 30 in an English speaking country. These distribution patterns have been previously determined and are stored in the system as pattern files 28.

Example:

Number of Smiths versus number of Taylors versus number of Fosters.

Production distribution pattern is approximately 10:3:1.
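The keyword, regular-expression and checksum matching mentioned for the profiling step 20 can be illustrated with a minimal sketch. It is not the implementation described in this specification: the patterns, the keyword list, the 50% agreement threshold and the names `classify_value`, `luhn_valid` and `profile_column` are all assumptions chosen for illustration. The checksum shown is the Luhn check commonly used for payment card numbers.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical, deliberately simple patterns and keywords (assumptions).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CARD_RE = re.compile(r"^\d{13,19}$")
COMMON_SURNAMES = {"SMITH", "JONES", "TAYLOR", "YOUNG", "FOSTER", "BARNETT"}


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value: str) -> Optional[str]:
    """Guess the PII type of a single value, or None if nothing matches."""
    v = value.strip()
    if EMAIL_RE.match(v):
        return "EMAIL"
    digits = v.replace(" ", "").replace("-", "")
    if CARD_RE.match(digits) and luhn_valid(digits):
        return "CREDIT_CARD"
    if v.upper() in COMMON_SURNAMES:
        return "LAST_NAME"
    return None


def profile_column(values: Iterable[str], threshold: float = 0.5) -> Optional[str]:
    """Return the dominant PII type if at least `threshold` of values agree."""
    values = list(values)
    if not values:
        return None
    counts = Counter(t for t in map(classify_value, values) if t is not None)
    if not counts:
        return None
    pii_type, hits = counts.most_common(1)[0]
    return pii_type if hits / len(values) >= threshold else None
```

A column flagged in this way (for example as `LAST_NAME`) would then be handed to the validation step for distribution analysis.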

Taking the example of surnames above, the validation subroutine 25 takes a significant sample of data from the non-production database 10 and calculates the distribution of values in the Surnames columns. The validation routine 25 then compares the calculated distribution against the known typical production distribution by calculating the correlation between the two. If it has a similar pattern (say 10:3:1) then it is likely the data has NOT been properly privatised; that is, it is likely that the data in the non-production database is either production data or a production clone. The greater the discrepancy (lack of correlation) with the production pattern, the lower the likelihood that it is production data (indicating masking or obfuscation has taken place).

The size of the sample taken by the validation routine is configurable; the greater the sample, the higher the likelihood of accuracy. A typical sample size would be 25,000 rows.

Referring to figure 2, a user interface of the system is shown. Once the profiling 20 & validation 25 steps are complete, the validation subroutine returns a percentage value indicating how closely the pattern found correlates with the typical production pattern. A score of 100% (1) indicates that the validation routine has found an exact distribution match. A score of 0% (0) indicates that absolutely no correlation was found. A negative score of -100% (-1) highlights a totally negative correlation.

A perfect score/correlation of 100% (1) would be obtained by the sequence (order of popularity) matching perfectly. In the case of the example distribution shown in figure 1, 100% would be achieved by: Smith "being greater than" Jones "being greater than" Taylor "being greater than" Young "being greater than" Foster "being greater than" Barnett.

In the example in figure 2 both EMPLOYEE_FIRST_NAME and EMPLOYEE_LAST_NAME have very high correlations (97% and 92%), labelled A. One would infer that this indicates the presence of production cloned (non-obfuscated) PII information.

To add further accuracy & confirmation of the result the tool also provides a sample of data that has been found in each of the tables. This information can then be used to streamline further analysis & checks.

In addition to distribution of surnames, other distribution patterns can be used such as first names, or addresses. The patterns can be tailored to suit a particular set of data, or subset of data, such as by being relevant to a particular location. The following example demonstrates the use of patterns of street names in Sydney, Australia.

Distribution of Sydney Street Name (Prefix)

• Prefix: John, Rating: 18

• Prefix: Albert, Rating: 9

• Prefix: Young, Rating: 6

• Prefix: Nelson, Rating: 2

Rating is based on real occurrence. In this instance one would expect to find twice as many Johns as Alberts etc.

Distribution of Sydney Street Name (Suffix)

An alternative to using a prefix is to analyse suffix.

For example, Sydney Street Suffix:

• Suffix: Street, Rating: 1166

• Suffix: Lane, Rating: 571

• Suffix: Avenue, Rating: 114

• Suffix: Way, Rating: 10

In this instance one would expect to find twice as many Streets as Lanes etc.

Pattern matching can apply to virtually anything, not just alphanumeric strings. It can apply to numbers and/or sequences of numbers. For example, larger banks (issuers) have a greater number of credit cards in distribution. A credit card's prefix determines the issuer. As such one would expect to find more credit cards attributed to a larger bank than to a smaller bank.
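Tallying a feature such as a street suffix or a card-number prefix can be sketched as below. This is only an illustration under stated assumptions: the helper names `feature_distribution`, `street_suffix` and `card_prefix`, the choice of the last word as the suffix, and the sample values are all hypothetical and do not come from the specification.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def feature_distribution(
    values: Iterable[str], extract: Callable[[str], Optional[str]]
) -> Counter:
    """Count how often each extracted feature occurs across the values."""
    counts: Counter = Counter()
    for value in values:
        feature = extract(value)
        if feature:
            counts[feature] += 1
    return counts


def street_suffix(address: str) -> Optional[str]:
    """Treat the last word of an address as its suffix (an assumption)."""
    words = address.split()
    return words[-1].capitalize() if words else None


def card_prefix(number: str, length: int = 1) -> Optional[str]:
    """Return the leading digits of a card number, ignoring separators."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return digits[:length] if len(digits) >= length else None


# Hypothetical sample rows, for illustration only.
addresses = ["12 John Street", "4 Albert Lane", "7 Young Street",
             "1 Nelson Avenue", "9 George street"]
suffix_counts = feature_distribution(addresses, street_suffix)
```

The resulting counts can be compared against a stored pattern in the same way as the surname counts.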

The calculation of the correlation between production data and testing data is managed through a four-step process:

Step-1 Identify a baseline-set of values with a strong "distribution/popularity" spread. From production, identify a number of distinct values that have a distinctive spread (popularity count). By distinctive, that is to say the regularity of the value (popularity of the value) is clearly different from other items in the data set. Refer to the surname example below. Each value (surname) is twice as popular as the next.

Example SURNAME Baseline Set:

VALUE      COUNT    PROD POPULARITY RATIO*
SMITH      64000    0.504
JONES      32000    0.252
TAYLOR     16000    0.126
YOUNG       8000    0.063
FOSTER      4000    0.031
BARNETT     2000    0.016

* Popularity Ratio is the ratio COUNT / SUM-OF-SELECTION-COUNT
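The popularity ratio defined in the footnote can be sketched as follows. The function name `popularity_ratios` is an assumption; the counts are those of the baseline set above. Note that dividing by the sum of only the six listed counts (126000) gives values that differ slightly from the printed column (for example about 0.508 rather than 0.504 for SMITH), which may mean the printed ratios were computed over a slightly larger selection; that is an inference, not something the text states.

```python
from typing import Dict


def popularity_ratios(counts: Dict[str, int]) -> Dict[str, float]:
    """Return COUNT / SUM-OF-SELECTION-COUNT for each value in the selection."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("selection is empty")
    return {value: count / total for value, count in counts.items()}


# Counts from the example baseline set.
baseline_counts = {
    "SMITH": 64000,
    "JONES": 32000,
    "TAYLOR": 16000,
    "YOUNG": 8000,
    "FOSTER": 4000,
    "BARNETT": 2000,
}
baseline_ratios = popularity_ratios(baseline_counts)
```

Because every count is divided by the same total, the ordering and relative spacing of the values are unchanged, which is why counts and ratios lead to the same correlation score.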

Step-2 Build an equivalent comparison-set from your Target Data Source (e.g. Test DB).

Example SURNAME Comparison Set:

VALUE      COUNT    TEST POPULARITY RATIO*
SMITH      44       0.344
JONES      32       0.250
TAYLOR     28       0.219
YOUNG       8       0.063
FOSTER              0.063
BARNETT             0.039

* Popularity Ratio is the ratio COUNT / SUM-OF-SELECTION-COUNT

Step-3 Apply a Correlation Formula to the two Data Sets

Compare the correlation of the two associated rows using a Data Correlation formula. For example: CORRELATE (P, T)

Where P = "Production Popularity Ratio" Data Set

Where T = "Test Popularity Ratio" Data Set

The "Count" Data Set can also be used & would provide the same results (score). In one embodiment, the correlation formula applied is of the type known as the Spearman Rank Correlation formula. However any rank correlation or standard correlation (e.g. Pearson's) formula could be used.
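A minimal sketch of CORRELATE (P, T) using Spearman rank correlation is given below, applied to the two popularity-ratio columns from the example tables. The helper names `average_ranks`, `pearson` and `correlate` are assumptions, and handling ties by averaging their ranks is one common convention rather than something the specification prescribes.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank ascending (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlate(p: Sequence[float], t: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(p) != len(t) or len(p) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(p), average_ranks(t))


# Popularity ratios from the example tables, in the order SMITH .. BARNETT.
P = [0.504, 0.252, 0.126, 0.063, 0.031, 0.016]
T = [0.344, 0.250, 0.219, 0.063, 0.063, 0.039]
score = correlate(P, T)
```

For these example values the score is close to 1 (about 0.986), because the only departure from the production ordering is the tie between YOUNG and FOSTER in the test set.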

Step-4 Display Results

Most correlation systems provide a score of -1 (inverse) to 0 (no correlation) to 1 (positive). To simplify the analyst's understanding, these scores are converted into bands.

The banding is configurable, and by default is set as follows:

HIGH: High to Very High Correlation (Prod Like)

MEDIUM: Medium Positive Correlation (Undetermined)

LOW: Low Negative Correlation (Non Prod Like)

The relationship between score & band will depend on the Rank Correlation formula used.
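Converting a score into a band and a percentage can be sketched as follows. The specification gives no numeric boundaries, so the thresholds 0.7 and 0.3 and the function names `to_band` and `to_percent` are purely illustrative assumptions; they are exposed as parameters to reflect that the banding is configurable.

```python
def to_band(score: float, high: float = 0.7, medium: float = 0.3) -> str:
    """Map a correlation score in [-1, 1] to HIGH, MEDIUM or LOW.

    The default thresholds are hypothetical, not taken from the specification.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie between -1 and 1")
    if score >= high:
        return "HIGH"    # production-like
    if score >= medium:
        return "MEDIUM"  # undetermined
    return "LOW"         # not production-like


def to_percent(score: float) -> str:
    """Express a score as a whole-number percentage, e.g. 0.97 -> '97%'."""
    return f"{round(score * 100)}%"
```

With these assumed defaults, a score of 0.97 would be reported as "97%" in the HIGH band, while a score of -1 would be reported as "-100%" in the LOW band.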

It can be seen that embodiments of the invention have at least one of the following advantages:

• Rapid identification of all E2E data that can be classified as of type PII.

• Automatic identification of whether this non-production hosted PII data contains production patterns, thus indicating the likelihood of the data actually being unmasked production data.

• A mechanism to further audit the risk through use of sample data.

Any reference to prior art contained herein is not to be taken as an admission that the information is common general knowledge, unless otherwise indicated.

Finally, it is to be appreciated that various alterations or additions may be made to the parts previously described without departing from the spirit or ambit of the present invention.

Claims

CLAIMS:

1. A method of analysing test data to check for the presence of personally identifiable information including the steps of:

determining a typical distribution of values for at least one field of data in a production environment;

calculating the actual distribution of values for at least one field of the test data;

comparing the typical distribution with the actual distribution; and

providing an indication of the likely presence of personally identifiable information based on the result of the comparison.

2. A method according to claim 1 wherein the step of comparing includes calculating the correlation between the typical distribution and the actual distribution.

3. A method according to either of claim 1 or claim 2 further including the step of scanning the test data for personally identifiable information data types.

4. A method according to claim 3 wherein the data types include any one of First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards.

5. A system for analysing test data to check for the presence of personally identifiable information including:

calculating means for calculating the actual distribution of values for at least one field of the test data;

comparing means for comparing a typical distribution of values with the actual distribution; and

display means for providing an indication of the likely presence of personally identifiable information based on the result of the comparison.

6. A system according to claim 5 wherein the comparing means is arranged to calculate a value representing the correlation between the typical distribution and the actual distribution.

7. A software program including instructions which, when carried out by a processor, cause a computing system to operate a method according to any one of claims 1 to 4 or to embody a system according to any one of claims 5 or 6.

8. A computer readable medium which is populated with a software program according to claim 7.