WO2015085358A1 - Method and system for analysing test data to check for the presence of personally identifiable information - Google Patents

Method and system for analysing test data to check for the presence of personally identifiable information

Info

Publication number
WO2015085358A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
distribution
identifiable information
personally identifiable
test data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/AU2014/050385
Other languages
English (en)
Inventor
Niall CRAWFORD
Liam McCRORY
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ENOV8 DATA Pty Ltd
Original Assignee
ENOV8 DATA Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2013904792A
Application filed by ENOV8 DATA Pty Ltd
Publication of WO2015085358A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Definitions

  • the present invention relates to methods and systems for analysing test data and particularly relates to methods and systems for checking test data for the presence of personally identifiable information.
  • test data is prepared which mirrors the production data.
  • Preparation of "production-like" data is typically done in one of two ways: (1) via an Extract, Transform & Load (ETL) process, i.e. where the data is extracted from the production environment itself and is then manipulated (e.g. subset and privatised) before loading into the non-production environment (Development, Test & Training), and/or (2) fabricated/created from scratch.
  • ETL Extract, Transform & Load
  • the former process tends to be prevalent as it tends to deliver test data that looks and behaves more like the real production environment data.
  • the process of privatisation, during the ETL process, is necessary as a way of ensuring that Personally Identifiable Information (PII) data is obfuscated, so that customer, business and employee details are not disseminated and thus remain protected.
  • PII Personally Identifiable Information
  • the present invention provides a method of analysing test data to check for the presence of personally identifiable information including the steps of: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
  • the step of comparing may include calculating the correlation between the typical distribution and the actual distribution.
  • the method may further include the step of scanning the test data for personally identifiable information data types.
  • the data types include, and are not limited to, any one of First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards.
  • the present invention provides a system for analysing test data to check for the presence of personally identifiable information including:
  • calculating means for calculating the actual distribution of values for at least one field of the test data; comparing means for comparing a typical distribution of values with the actual distribution; and display means for providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
  • the comparing means may be arranged to calculate a value representing the correlation between the typical distribution and the actual distribution.
  • the invention provides a software program including instructions which, when carried out by a processor, cause a computing system to operate a method according to the first aspect of the invention or to embody a system according to the second aspect of the invention.
  • the present invention provides a computer readable medium which is populated with a software program according to the third aspect of the invention.
  • the invention is based on the realisation that the act of privatisation changes both data content (column values) and the distribution of these values. This allows automatic validation of whether data has been adequately transformed (privatised) or not.
  • Figure 1 is a schematic diagram illustrating an embodiment of the invention.
  • Figure 2 is an example screenshot showing the operation of the invention.
  • an organisation has a number of production databases 5 and a number of non-production environments 10, i.e. development, test and training.
  • ETL Extract, Transform & Load
  • PII Personally Identifiable Information
  • a two-step automated profiling 20 and validation 25 process is carried out.
  • the method according to the invention is typically initiated by a user accessing the system using their own computing device with a display screen.
  • the computer processes initiated by the user may run on their local computing device, or on another computing device under instruction from the user's device.
  • the method of the invention may be configured to run at specified intervals, or in response to various triggers, such as the loading of new data into the test environment.
  • the data in the test environments is automatically scanned for PII data types.
  • Common PII types of interest are First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards. This scanning is done by searching for common keywords (e.g. Smith, Jones etc.) and through other value/string matching means (e.g. Regular Expressions and Checksum analysis).
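The scanning step described above (keywords, regular expressions and checksum analysis) might be sketched roughly as follows. This is only an illustration, not the applicant's implementation: the keyword list, the e-mail pattern, the 50% threshold and the function names `luhn_valid` and `guess_pii_type` are all assumptions. The Luhn algorithm is used here as one well-known example of a checksum for card numbers.

```python
import re

# Hypothetical keyword list and pattern; a real scanner would use far larger sets.
COMMON_SURNAMES = {"smith", "jones", "williams", "brown"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 12:  # assumption: shorter strings are not card numbers
        return False
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def guess_pii_type(values, threshold=0.5):
    """Guess a column's PII type from a sample of its values, or return None."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return None

    def share(pred):
        # Fraction of sampled values that satisfy the predicate.
        return sum(1 for v in values if pred(v)) / len(values)

    if share(lambda v: bool(EMAIL_RE.match(v))) >= threshold:
        return "Email"
    if share(luhn_valid) >= threshold:
        return "Credit Card"
    if share(lambda v: v.lower() in COMMON_SURNAMES) >= threshold:
        return "Last Name"
    return None
```

A column whose sampled values mostly match one of these checks would then be handed on for distribution analysis.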
  • a column is identified to be of a certain type (e.g. last name)
  • its details are passed to the validation subroutine 25.
  • the validation subroutine then analyses the column
  • the validation subroutine 25 takes a significant sample of data from the non-production database 10 and calculates the distribution of values in the Surnames column. The validation routine 25 then compares the calculated distribution against the known typical production distribution by calculating the correlation between the two. If it has a similar pattern (say 10:3:1), then it is likely the data has NOT been properly privatised; that is, the data in the non-production database is likely either production data or a production clone. The greater the discrepancy (lack of correlation) with the production pattern, the lower the likelihood that it is production data (indicating masking or obfuscation has taken place).
  • the size of the sample taken by the validation routine is configurable; the greater the sample, the higher the likelihood of accuracy.
  • a typical sample size would be 25,000 rows.
  • the validation subroutine returns a percentage value indicating how closely the pattern found correlates with the typical production pattern. A score of 100% (1) indicates that the validation routine has found an exact distribution match. A score of 0% (0) indicates that absolutely no correlation was found. A negative score of -100% (-1) indicates a totally negative correlation.
  • EMPLOYEE.LAST_NAME have very high correlations (97% and 92%), labelled A. One would infer that this indicates the presence of production-cloned (non-obfuscated) information.
  • the tool also provides a sample of data that has been found in each of the tables. This information can then be used to streamline further analysis & checks.
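As a rough illustration of the validation step, the sketch below samples a column, counts how often each baseline value occurs, and reports the correlation with a typical production distribution as a percentage. The function names, the fixed random seed, the use of Pearson's coefficient here, and the choice to report 0 when the observed counts are flat are assumptions made for the example. The 25,000-row default and the 10:3:1 pattern come from the text.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a flat distribution counts as "no correlation"
    return cov / (sx * sy)


def validate_column(test_values, typical_counts, sample_size=25_000, seed=0):
    """Return a percentage score comparing a sampled column with a typical distribution."""
    values = list(test_values)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    actual = Counter(values)
    keys = list(typical_counts)
    typical = [typical_counts[k] for k in keys]
    observed = [actual.get(k, 0) for k in keys]  # baseline values missing from the sample count as 0
    return round(100 * pearson(typical, observed), 1)


# Hypothetical typical production pattern of 10:3:1.
typical = {"Smith": 10, "Jones": 3, "Nguyen": 1}
clone = ["Smith"] * 100 + ["Jones"] * 30 + ["Nguyen"] * 10
masked = ["Aaaa"] * 70 + ["Bbbb"] * 70
print(validate_column(clone, typical))   # high score: likely a production clone
print(validate_column(masked, typical))  # no correlation: likely privatised
```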
  • distribution patterns can be used such as first names, or addresses.
  • the patterns can be tailored to suit a particular set of data, or subset of data, such as by being relevant to a particular location.
  • the following example demonstrates the use of patterns of street names in Sydney, Australia.
  • Pattern matching can apply to virtually anything, not just alphanumeric strings. It can apply to numbers and/or sequences of numbers. For example, larger banks (issuers) have a greater number of credit cards in distribution. A credit card's prefix determines the issuer. As such, one would expect to find more credit cards attributed to a larger bank than to a smaller bank.
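To make the credit card example concrete, one could count card numbers by their leading digits and compare the resulting spread with the issuer spread expected in production. The sketch below only performs the counting. The function name `prefix_distribution` and the six-digit prefix length are assumptions for illustration (issuer prefixes have traditionally been six digits long, but the text does not specify a length).

```python
from collections import Counter


def prefix_distribution(card_numbers, prefix_len=6):
    """Count card numbers per leading-digit prefix, most common prefix first."""
    prefixes = Counter()
    for number in card_numbers:
        # Keep ASCII digits only, so spaces and dashes are ignored.
        digits = "".join(ch for ch in number if "0" <= ch <= "9")
        if len(digits) >= prefix_len:
            prefixes[digits[:prefix_len]] += 1
    return prefixes.most_common()


# Made-up card numbers, used purely to show the shape of the output.
cards = [
    "4000 0012 3456 7890",
    "4000-0099-9999-9999",
    "5100 0012 3456 7890",
    "n/a",
]
print(prefix_distribution(cards))
```

The per-prefix counts can then be treated like any other value distribution and correlated against a production baseline.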
  • the calculation of the correlation between production data and testing data is managed through a 4-step process
  • Step-1 Identify a baseline-set of values with a strong "distribution/popularity" spread. From production, identify a number of distinct values that have a distinctive spread (popularity count). By distinctive, that is to say the regularity of the value (popularity of the value) is clearly different from other items in the data set. Refer to the surname example below. Each value (surname) is twice as popular as the next.
  • Step-2 Build an equivalent comparison-set from your Target Data Source (e.g. Test DB)
  • Target Data Source e.g. Test DB
  • Example SURNAME Comparison Set VALUE COUNT TEST POPULARITY RATIO*
  • * Popularity Ratio is the ratio of COUNT/SUM-OF-SELECTION-COUNT
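The comparison set of Step-2 can be illustrated with a short sketch that counts only the baseline values in the target column and divides each count by the sum of those counts, following the COUNT/SUM-OF-SELECTION-COUNT definition above. The function name `comparison_set` and the sample surnames and counts are hypothetical.

```python
from collections import Counter


def comparison_set(column_values, baseline_values):
    """Return {value: (count, popularity_ratio)} for the baseline values only."""
    wanted = list(baseline_values)
    wanted_set = set(wanted)
    counts = Counter(v for v in column_values if v in wanted_set)
    total = sum(counts.values())  # SUM-OF-SELECTION-COUNT
    return {
        v: (counts[v], counts[v] / total if total else 0.0)
        for v in wanted
    }


# Hypothetical test-database column; "Other" is not in the baseline and is ignored.
column = ["Smith"] * 8 + ["Jones"] * 4 + ["Brown"] * 2 + ["Lee"] * 2 + ["Other"] * 5
print(comparison_set(column, ["Smith", "Jones", "Brown", "Lee"]))
```

Values that never occur in the target column still appear in the result with a count and ratio of zero, so the baseline and comparison sets stay aligned.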
  • Step-3 Apply a Correlation Formula to the two Data Sets
  • the "Count" Data Set can also be used & would provide the same results (score).
  • the correlation formula applied is of the type known as the Spearman Rank Correlation formula.
  • any rank correlation and/or standard correlation (e.g. Pearson's) formula could be used.
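A minimal sketch of a Spearman rank correlation is given below, computed as Pearson's coefficient applied to the ranks of each data set, with tied values sharing the mean of their ranks. The helper names, the sample counts, and the choice to return 0 when one set has no variation are assumptions for illustration. Because popularity ratios are just counts divided by a constant, the ranks are unchanged, which is consistent with the remark that the "Count" data set gives the same score.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: no variation is reported as no correlation
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical baseline and test counts for four surnames.
baseline_counts = [800, 400, 200, 100]
test_counts = [790, 410, 190, 105]
test_ratios = [c / sum(test_counts) for c in test_counts]
print(spearman(baseline_counts, test_counts))
print(spearman(baseline_counts, test_ratios))
```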
  • the banding is configurable, and by default is set as follows:

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to systems and methods for analysing test data to check for the presence of personally identifiable information, including the following steps: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
PCT/AU2014/050385 2013-12-10 2014-11-28 Method and system for analysing test data to check for the presence of personally identifiable information Ceased WO2015085358A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2013904792A AU2013904792A0 (en) 2013-12-10 A method and system for analysing test data to check for the presence of personally identifiable information
AU2013904792 2013-12-10

Publications (1)

Publication Number Publication Date
WO2015085358A1 (fr) 2015-06-18

Family

ID=53370367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2014/050385 Ceased WO2015085358A1 (fr) 2013-12-10 2014-11-28 Method and system for analysing test data to check for the presence of personally identifiable information

Country Status (1)

Country Link
WO (1) WO2015085358A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040199781A1 (en) * 2001-08-30 2004-10-07 Erickson Lars Carl Data source privacy screening systems and methods
US7269578B2 (en) * 2001-04-10 2007-09-11 Latanya Sweeney Systems and methods for deidentifying entries in a data source
US8069053B2 (en) * 2008-08-13 2011-11-29 Hartford Fire Insurance Company Systems and methods for de-identification of personal data

Similar Documents

Publication Publication Date Title
US12045225B2 (en) Multi-table data validation tool
NL2012438B1 (en) Resolving similar entities from a database.
CA2748425C (fr) Batch entity representation identification using field match templates
US8996524B2 (en) Automatically mining patterns for rule based data standardization systems
CN104516882B (zh) Method and device for determining the degree of harm of an SQL statement
US10572480B2 (en) Adaptive intersect query processing
US9336286B2 (en) Graphical record matching process replay for a data quality user interface
Kumar et al. Feature selection techniques to counter class imbalance problem for aging related bug prediction: aging related bug prediction
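JP6419667B2 (ja) Test DB data generation method and apparatus
CN106126736A (zh) Personalised software developer recommendation method for software security bug fixing
KR101742041B1 (ko) Apparatus for protecting personal information, method for protecting personal information, and storage medium storing a program for protecting personal information
CN103180848B (zh) A system and method for replicating data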
Masood-Al-Farooq SQL Server 2014 Development Essentials
Hadler et al. An improved version of a tool mark comparison algorithm
WO2015085358A1 (fr) Method and system for analysing test data to check for the presence of personally identifiable information
US11250127B2 (en) Binary software composition analysis
US20160104166A1 (en) Computerized account database access tool
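CN116860311A (zh) Script analysis method, apparatus, computer device and storage medium
CA2748676C (fr) Entity representation determination using entity presentation information
CN115658662A (zh) A method based on database security quality inspection and automatic rectification
JP2017010376A (ja) Mart-less verification support system and mart-less verification support method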
US12326954B2 (en) Method and system for identifying data and managing access thereto across multiple data platforms
GB2475796A (en) Identifying an entity representation by constructing a comprehensive search criteria
Grocevs et al. Modern Algorithms to Identify Plagiarism
Davidson et al. Exam Ref 70-762 Developing SQL Databases

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14869514

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14869514

Country of ref document: EP

Kind code of ref document: A1