WO2015085358A1 - Method and system for analysing test data to check for the presence of personally identifiable information - Google Patents
Method and system for analysing test data to check for the presence of personally identifiable information
- Publication number
- WO2015085358A1 (PCT/AU2014/050385)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- distribution
- identifiable information
- personally identifiable
- test data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
Definitions
- the present invention relates to methods and systems for analysing test data and particularly relates to methods and systems for checking test data for the presence of personally identifiable information.
- test data is prepared which mirrors the production data.
- Preparation of "production-like" data is typically done in one of two ways: (1) via an Extract, Transform & Load (ETL) process, i.e. where the data is extracted from the production environment itself and is then manipulated (e.g. subset and privatised) before loading into the non-production environment (Development, Test & Training), and/or (2) fabricated/created from scratch.
- ETL Extract, Transform & Load
- the former process tends to be prevalent as it tends to deliver test data that looks and behaves more like the real production environment data.
- the process of privatisation, during the ETL process, is necessary as a way of ensuring that Personally Identifiable Information (PII) data is obfuscated, which ultimately ensures that customer, business and employee details are not disseminated and thus remain protected.
- PII Personally Identifiable Information
- the present invention provides a method of analysing test data to check for the presence of personally identifiable information including the steps of: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
- the step of comparing may include calculating the correlation between the typical distribution and the actual distribution.
- the method may further include the step of scanning the test data for personally identifiable information data types.
- the data types include, but are not limited to, any one of First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards.
- the present invention provides a system for analysing test data to check for the presence of personally identifiable information including:
- calculating means for calculating the actual distribution of values for at least one field of the test data; comparing means for comparing a typical distribution of values with the actual distribution; and display means for providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
- the comparing means may be arranged to calculate a value representing the correlation between the typical distribution and the actual distribution.
- the invention provides a software program including instructions which, when carried out by a processor, cause a computing system to operate a method according to the first aspect of the invention or to embody a system according to the second aspect of the invention.
- the present invention provides a computer readable medium which is populated with a software program according to the third aspect of the invention.
- the invention is based on the realisation that the act of privatisation changes both data content (column values) and the distribution of these values. This allows automatic validation of whether data has been adequately transformed (privatised) or not.
- Figure 1 is a schematic diagram illustrating an embodiment of the invention.
- Figure 2 is an example screenshot showing the operation of the invention.
- an organisation has a number of production databases 5 and a number of non-production environments 10, i.e. development, test and training.
- a two-step automated profiling 20 and validation 25 process is carried out.
- the method according to the invention is typically initiated by a user accessing the system using their own computing device with a display screen.
- the computer processes initiated by the user may run on their local computing device, or on another computing device under instruction from the user's device.
- the method of the invention may be configured to run at specified intervals, or in response to various triggers, such as the loading of new data into the test environment.
- the data in the test environments is automatically scanned for PII data types.
- Common PII types of interest are First Name, Last Name, Email, Address, Telephone Number, Tax File/Social Security numbers, Driving License, Passport & Credit Cards. This scanning is done by searching for common keywords (e.g. Smith, Jones etc.) and through other value/string matching means (e.g. Regular Expressions and Checksum analysis).
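The scanning step can be illustrated with a minimal sketch. It assumes a tiny hypothetical keyword list (`COMMON_SURNAMES`), a deliberately simple email pattern, the Luhn checksum as the checksum analysis for card numbers, and an arbitrary 50% match threshold; none of these specifics come from the patent text, which names only the general techniques.

```python
import re

# Hypothetical keyword list and pattern; a real scanner would use much larger dictionaries.
COMMON_SURNAMES = {"smith", "jones", "williams", "brown"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # assumed minimum length for a card number
        return False
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def guess_pii_type(values, threshold=0.5):
    """Guess a column's PII type from the share of sampled values matching each test."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return None
    tests = {
        "email": lambda v: bool(EMAIL_RE.match(v)),
        "credit_card": luhn_valid,
        "last_name": lambda v: v.lower() in COMMON_SURNAMES,
    }
    for name, test in tests.items():
        hits = sum(1 for v in values if test(v))
        if hits / len(values) >= threshold:
            return name
    return None
```

A column whose sampled values mostly match one of the tests would then be passed on for validation under the guessed type.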
- a column is identified to be of a certain type (e.g. last name)
- its details are passed to the validation subroutine 25.
- the validation subroutine then analyses the column
- the validation subroutine 25 takes a significant sample of data from the non-production database 10 and calculates the distribution of values in the Surnames column. The validation routine 25 then compares the calculated distribution against the known typical production distribution by calculating the correlation between the two. If it has a similar pattern (say 10:3:1) then it is likely that the data has NOT been properly privatised; the data in the non-production database is then likely to be either production data or a production clone. The greater the discrepancy (lack of correlation) with the production pattern, the lower the likelihood that it is production data (indicating that masking or obfuscation has taken place).
- the size of the sample taken by the validation routine is configurable, the greater the sample the higher the likelihood of accuracy.
- a typical sample size would be 25,000 rows.
- the validation subroutine returns a percentage value indicating how closely the pattern found correlates with the typical production pattern. A score of 100% (1) indicates that the validation routine has found an exact distribution match. A score of 0% (0) indicates that absolutely no correlation was found. A negative score of -100% (-1) highlights a totally negative correlation.
- EMPLOYEE LAST_NAME - have very high correlations (97% and 92%), labelled A. One would infer that this indicates the presence of production-cloned (non-obfuscated) information.
- the tool also provides a sample of data that has been found in each of the tables. This information can then be used to streamline further analysis & checks.
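As a rough sketch of how such a percentage score could be produced, the example below counts how often each baseline value occurs in a sample and correlates those counts with a typical production pattern such as 10:3:1. It uses a plain Pearson correlation for brevity; the function names, the surnames and the rounding to whole percent are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def validation_score(typical, sample_values):
    """Return a score from -100 to 100 comparing sample counts with the typical pattern."""
    counts = Counter(sample_values)
    keys = list(typical)
    expected = [typical[k] for k in keys]
    actual = [counts.get(k, 0) for k in keys]
    return round(100 * pearson(expected, actual))


# Hypothetical typical production pattern (10:3:1) and two test samples.
typical = {"Smith": 10, "Jones": 3, "Brown": 1}
cloned = ["Smith"] * 100 + ["Jones"] * 30 + ["Brown"] * 10
masked = ["Smith"] * 10 + ["Jones"] * 30 + ["Brown"] * 100
```

With these inputs the cloned sample follows the production pattern exactly, while the masked sample reverses it and yields a negative score.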
- distribution patterns can be used such as first names, or addresses.
- the patterns can be tailored to suit a particular set of data, or subset of data, such as by being relevant to a particular location.
- the following example demonstrates the use of patterns of street names in Sydney, Australia.
- Pattern matching can apply to virtually anything, not just alphanumeric strings. It can apply to numbers and/or sequences of numbers. For example, larger banks (issuers) have a greater number of credit cards in distribution. A credit card's prefix determines the issuer. As such, one would expect to find more credit cards attributed to a larger bank than to a smaller bank.
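The credit card case can be sketched in the same way: group card numbers by their leading digits and look at the share held by each prefix. The six-digit prefix length and the sample numbers below are assumptions for illustration, not values taken from the patent.

```python
from collections import Counter


def prefix_distribution(card_numbers, prefix_len=6):
    """Return the share of cards per issuer prefix, most common prefix first."""
    prefixes = Counter(
        "".join(c for c in number if c.isdigit())[:prefix_len]
        for number in card_numbers
    )
    total = sum(prefixes.values())
    return {prefix: count / total for prefix, count in prefixes.most_common()}


# Made-up card numbers: two share one prefix, one has another.
cards = ["4000 0011 1111 1111", "4000001222222222", "5100003333333333"]
shares = prefix_distribution(cards)
```

The resulting shares can be compared against a typical production distribution of issuer prefixes in the same way as surname counts.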
- the calculation of the correlation between production data and testing data is managed through a 4 step process
- Step-1 Identify a baseline-set of values with a strong "distribution/popularity" spread. From production, identify a number of distinct values that have a distinctive spread (popularity count). By distinctive, that is to say the regularity of the value (popularity of the value) is clearly different from other items in the data set. Refer to the surname example below. Each value (surname) is twice as popular as the next.
- Step-2 Build an equivalent comparison-set from your Target Data Source (e.g. Test DB)
- Target Data Source e.g. Test DB
- Example SURNAME Comparison Set: VALUE | COUNT (TEST) | POPULARITY RATIO*
- * Popularity Ratio is the ratio of COUNT/SUM-OF-SELECTION-COUNT
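A comparison set of this shape could be built as in the following sketch, where the popularity ratio of each baseline value is its count divided by the sum of the counts of all selected values. The function name, surnames and counts are hypothetical.

```python
from collections import Counter


def comparison_set(baseline_values, target_column):
    """Return (value, count, popularity_ratio) rows for the baseline values."""
    counts = Counter(target_column)
    selection = {value: counts.get(value, 0) for value in baseline_values}
    total = sum(selection.values())  # SUM-OF-SELECTION-COUNT
    return [
        (value, count, count / total if total else 0.0)
        for value, count in selection.items()
    ]


# Hypothetical test column: values outside the baseline set are ignored.
column = ["Smith"] * 4 + ["Jones"] * 2 + ["Brown"] * 2 + ["Other"] * 5
rows = comparison_set(["Smith", "Jones", "Brown"], column)
```

Only values in the baseline set contribute to the denominator, so the ratios of the selected values always sum to 1 when at least one of them is present.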
- Step-3 Apply a Correlation Formula to the two Data Sets
- the "Count" Data Set can also be used & would provide the same results (score).
- the correlation formula applied is of the type known as the Spearman Rank Correlation formula.
- any rank correlation or standard correlation (e.g. Pearson's) formula could be used.
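A minimal sketch of a Spearman rank correlation follows, computed as the Pearson correlation of the ranks, with tied values given their average rank. The data values are made up. Because only ranks matter, feeding in raw counts or popularity ratios gives the same score.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical baseline popularity (each value twice as popular as the next)
# and counts found in a test database for the same values.
production = [8, 4, 2, 1]
test_counts = [800, 390, 210, 95]
test_ratios = [c / sum(test_counts) for c in test_counts]
```

Here `spearman(production, test_counts)` and `spearman(production, test_ratios)` both evaluate to 1.0, since the test data preserves the production ordering.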
- the banding is configurable, and by default is set as follows:
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to systems and methods for analysing test data to check for the presence of personally identifiable information, including the steps of: determining a typical distribution of values for at least one field of data in a production environment; calculating the actual distribution of values for at least one field of the test data; comparing the typical distribution with the actual distribution; and providing an indication of the likely presence of personally identifiable information based on the result of the comparison.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2013904792A0 (en) | 2013-12-10 | | A method and system for analysing test data to check for the presence of personally identifiable information |
| AU2013904792 | 2013-12-10 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2015085358A1 (fr) | 2015-06-18 |
Family
ID=53370367
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/AU2014/050385 (Ceased), WO2015085358A1 (fr) | Method and system for analysing test data to check for the presence of personally identifiable information | 2013-12-10 | 2014-11-28 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2015085358A1 (fr) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040199781A1 (en) * | 2001-08-30 | 2004-10-07 | Erickson Lars Carl | Data source privacy screening systems and methods |
| US7269578B2 (en) * | 2001-04-10 | 2007-09-11 | Latanya Sweeney | Systems and methods for deidentifying entries in a data source |
| US8069053B2 (en) * | 2008-08-13 | 2011-11-29 | Hartford Fire Insurance Company | Systems and methods for de-identification of personal data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14869514; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 14869514; Country of ref document: EP; Kind code of ref document: A1 |