
CN101878461B - Method and system for analysis of a system for matching data records


Info

Publication number
CN101878461B
Authority
CN
China
Prior art keywords
entity
grouping
data
analyze
analysis
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200880117086.9A
Other languages
Chinese (zh)
Other versions
CN101878461A (en)
Inventor
G·戈登博格
S·舒玛彻
J·伍德斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Application filed by International Business Machines Corp
Publication of CN101878461A
Application granted
Publication of CN101878461B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0769 Readable error formats, e.g. cross-platform generic formats, human understandable formats
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments disclosed herein provide a system and method for analyzing an identity hub. Particularly, a user can connect to the identity hub, load an initial set of data records, create and/or edit an identity hub configuration locally, analyze and/or validate the configuration via a set of analysis tools, including an entity analysis tool, a data analysis tool, a bucket analysis tool, and a linkage analysis tool, and remotely deploy the validated configuration to an identity hub instance. In some embodiments, through a graphical user interface, these analysis tools enable the user to analyze and modify the configuration of the identity hub in real time while the identity hub is operating to ensure data quality and enhance system performance.

Description

Method and system for analysis of a system for matching data records
Cross-reference to related applications
This application claims priority to U.S. Provisional Application No. 60/997,038, entitled "METHOD AND SYSTEM FOR ANALYSIS OF A SYSTEM FOR MATCHING DATA RECORDS," filed September 28, 2007, the entire contents of which are incorporated herein by reference. This application also relates to the following U.S. patent applications:
  • No. 12/056,720, entitled "METHOD AND SYSTEM FOR MANAGING ENTITIES," filed March 27, 2008
  • No. 11/967,588, entitled "METHOD AND SYSTEM FOR PARSING LANGUAGES," filed December 31, 2007
  • No. 11/904,750, entitled "METHOD AND SYSTEM FOR INDEXING, RELATING AND MANAGING INFORMATION ABOUT ENTITIES," filed September 28, 2007
  • No. 11/901,040, entitled "HIERARCHY GLOBAL MANAGEMENT SYSTEM AND USER INTERFACE," filed September 14, 2007
  • No. 11/900,769, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS," filed September 13, 2007
  • No. 11/824,210, entitled "METHOD AND SYSTEM FOR PROJECT MANAGEMENT," filed June 29, 2007
  • No. 11/809,792, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING," filed June 1, 2007
  • No. 11/702,410, entitled "METHOD AND SYSTEM FOR A GRAPHICAL USER INTERFACE FOR CONFIGURATION OF AN ALGORITHM FOR THE MATCHING OF DATA RECORDS," filed February 5, 2007
  • No. 11/656,111, entitled "METHOD AND SYSTEM FOR INDEXING INFORMATION ABOUT ENTITIES WITH RESPECT TO HIERARCHIES," filed January 22, 2007
  • No. 11/522,223, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS PERSONAL NAMES," filed September 15, 2006
  • No. 11/521,928, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS BUSINESS NAMES," filed September 15, 2006
All applications cited in this section are incorporated herein in their entirety for all purposes.
Technical field
The present disclosure relates generally to associating data records, and in particular to identifying data records that may contain information about the same entity so that those data records can be associated. More particularly, embodiments disclosed herein may relate to the analysis of a system for identifying and associating data records, including analysis relating to the performance or configuration of such a system.
Background
Today, most enterprises retain large amounts of data relating to various aspects of their operations, such as inventory, customers, and products. Data about entities such as people, products, parts, or any other objects can be stored in digital form in a data store such as a computer database. Such computer databases allow data about an entity to be accessed quickly and to be cross-referenced with other related data about the same entity. A database also allows a person to query it for the data records pertaining to a particular entity, so that data records pertaining to the same entity from different data stores can be correlated.
Data stores, however, have several limitations that may restrict the ability to find the correct data about an entity within them. The data in a data store is only as accurate as the person who entered it or the original data source. A mistake made while entering data into the data store may therefore cause a search for data about an entity to miss relevant data about that entity, for example because a person's last name was misspelled or a social security number was entered incorrectly. A variety of such problems can be envisioned: two separate records may be created for an entity that already has a record in the database, so that multiple data records contain information about the same entity while, for example, the names or identification numbers in the two records differ, making it difficult to correlate the data records that refer to the same entity.
For an enterprise that operates one or more data stores containing large numbers of data records, the ability to locate relevant information about a particular entity within and across the respective databases is very important, but not easily achieved. In addition, any error made by any information source when entering data (including, without limitation, creating more than one data record for the same entity) may cause relevant data to be missed when the database is searched for a particular entity. Furthermore, where multiple information sources are involved, each source may have a slightly different data syntax or format, which can further complicate finding data in the database. An example from the healthcare field of the need to correctly identify the entity to which a data record refers, and to locate all data records referring to an entity, is a healthcare organization whose affiliated hospitals each have one or more information sources containing information about their patients, and which collects the information from each hospital in a master database. Data records from all of the information sources that pertain to the same patient must be linked so that information about a given patient can be searched for across all of the hospital records.
Several problems limit the ability to find all of the relevant data about an entity in such a database. Because data records are received independently from one or more information sources, multiple data records may exist for a particular entity, a problem that can be described as data fragmentation. With fragmented data, a query of the master database may not retrieve all of the relevant information about a particular entity. In addition, as noted above, misspellings that occur during data entry may cause a query to miss some relevant information about an entity, a problem of data inaccessibility. Furthermore, a large database may contain data records that appear identical, such as multiple records for people named Jim Smith. A query of the database will retrieve all of these data records, and the person who made the query will often pick one of the retrieved records at random, which may be the wrong one. That person will often not attempt to determine which record is appropriate. This can lead to retrieving the data record for the wrong entity even when the correct data record is available. These problems limit the ability to locate information about a particular entity in the database.
To reduce the amount of data that must be reviewed, and to prevent users from selecting the wrong data record, it is also desirable to identify and associate data records from multiple information sources that may contain information about the same entity. Conventional systems exist that locate duplicate data records in a database and delete them, but these systems can locate only data records that are substantially identical to one another. Such conventional systems therefore cannot determine whether two data records that differ slightly, for example in the last name, nevertheless contain information about the same entity. Nor do these conventional systems attempt to index data records from multiple different information sources, locate data records in one or more information sources that contain information about the same entity, and link those data records together. It is therefore desirable to be able to associate data records pertaining to the same entity from multiple information sources, regardless of differences between the attributes of those data records, and to aggregate and present the information from those different data records in a cohesive manner. In practice, however, providing an accurate, consolidated view of information from multiple information sources may be extremely difficult.
Summary of the invention
Because data records from multiple different sources may differ both in their format and in the data they contain, configuring a system for processing such data can be a difficult task. The difficulty arises partly because configuration may be a labor-intensive task that requires substantial expertise in the structure and capabilities of the system used to associate data records, and partly because it requires a high degree of analysis and close attention to detail to ensure that the final configuration of the algorithms used to associate data records produces the desired results.
The individual needs of the users of such a system may further aggravate these difficulties. In some industries, such as healthcare, it may be important that data records not be associated erroneously (a false positive), while users in other, less critical industries may be less concerned about erroneous associations and more concerned with associating data records that may belong to the same entity, so as to avoid cases in which data records that should be associated are not (a false negative). Indeed, some users may have strict requirements or guidelines regarding the number of false positives or false negatives that is tolerable.
Because at least some portions of such a system may be configured or tuned using a sample set of data, a configuration established on that initial data sample may not produce the desired results when it is applied to all of the data or to a larger sample.
It may, however, be difficult to determine how the system is operating under a given configuration, and because the algorithms the system uses may be very complex, it may be difficult to modify the configuration to achieve the desired results even when the system's operation can be determined.
What is needed, therefore, are systems and methods for analyzing the operation of a system that associates data records, so that the system can be configured according to the needs of its users.
Embodiments disclosed herein provide systems and methods for analyzing and presenting performance parameters related to a system for indexing or associating data records. These systems and methods may provide software tools useful for statistically analyzing and presenting data relating to the configuration or performance of the Identity Hub™ from Initiate Systems, Inc. Example embodiments of the Initiate Identity Hub™ can be found in the U.S. patent applications referenced in this disclosure.
In some embodiments, these tools include a bucket analysis tool, a data analysis tool, an entity analysis tool, and a linkage analysis (or threshold analysis) tool. More particularly, in one embodiment the bucket analysis tool may be used to analyze and present data relating to candidate generation and selection (i.e., bucketing) within an identity hub. In one embodiment, the entity analysis tool may be used to analyze and present data relating to the association of data records. In one embodiment, the linkage analysis tool may be used to analyze and present data relating to the setting of the various threshold levels used for linking data records and their effect on the system. The tools may also provide a predictive capability: a user can submit a possible parameter value, and the tool can calculate and predict the effect of that value on the operation or performance of the system.
In some embodiments, a graphical user interface may be presented for use with these tools, so that data relating to the configuration or performance of the identity hub can be presented to the user graphically and the user can interact with the analysis tools to obtain the desired information. This graphical user interface may also be provided together with another graphical user interface for configuring the identity hub, or may include at least part of that interface's functionality, so that a user can change the configuration of the identity hub and analyze the results of the change. These interfaces may comprise, for example, one or more web pages accessible through a web browser. The web pages may be in HTML or XHTML format, for example, and may provide navigation to other web pages through hypertext links. A user may obtain these web pages from the local computer or from a remote web server (for example, using the Hypertext Transfer Protocol, HTTP); the server may restrict access to a private network (e.g., a corporate intranet), or it may publish the pages on the World Wide Web.
In one embodiment, the graphical user interface may be presented within a configuration tool, so that the various analyses can be presented on demand to a user configuring the identity hub, enabling the user to find data anomalies in the data of the information sources that use the identity hub. The interface may also provide the ability to save the determined statistics or other identity hub parameters with a particular configuration of the identity hub, so that the functioning of the identity hub can be compared at different times and under different configurations.
One or more buckets may be created when a data record enters the identity hub or when the identity hub is searched based on one or more criteria. The performance of the system (e.g., throughput time) can therefore depend heavily on the size of the buckets created in a given instance. It may thus be desirable to obtain statistics about the sizes or types of the buckets created, why and how they were created, the data records that make up the buckets, how the buckets affect the performance of the system, and so on.
Accordingly, in one embodiment the bucket analysis tool may provide a profile of the bucket distribution, such as the sizes of the various buckets that have been generated, the data records that make up those buckets, and the data records associated with the identity hub that were not placed in any bucket. Large buckets (e.g., more than 1,000 data records) may indicate that data frequencies differ from what was expected or that some anonymous or common data values have not been properly accounted for. For example, if an organization uses the name "John Doe" for unknown data records, that name will occur an unusual number of times. Small buckets may indicate that the bucketing criteria currently in use are too strict.
The bucket analysis tool may therefore provide not only a profile of the bucket distribution but also the effect of that distribution, or of another distribution, on the throughput of the identity hub, to ensure that the hub's performance is within the desired range. Similarly, the bucket analysis tool may provide the ability to view or analyze the algorithms used to create buckets and the particular data records that make up those buckets, as well as the ability to reconfigure certain parameters of the identity hub either directly or through another application. In combination with this functionality, the bucket analysis tool may also provide the ability to estimate in real time the performance of the identity hub under load, so that performance can be kept within desired parameters.
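The bucket profile described above can be illustrated with a minimal sketch. The bucketing rule used here (first letter of the last name plus birth year), the record fields, and the function names are all hypothetical and are not taken from the disclosure; the sketch only shows how a size distribution and a list of oversized buckets might be derived once some bucketing rule exists.

```python
from collections import Counter, defaultdict


def bucket_key(record):
    """Hypothetical bucketing rule: first letter of last name plus birth year."""
    return (record["last"][:1].upper(), record["birth_year"])


def build_buckets(records):
    """Group record ids by bucket key."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_key(rec)].append(rec["id"])
    return buckets


def size_distribution(buckets):
    """Map bucket size -> number of buckets having that size."""
    return Counter(len(members) for members in buckets.values())


def oversized(buckets, limit=1000):
    """Return the keys of buckets holding more than `limit` records."""
    return sorted(key for key, members in buckets.items() if len(members) > limit)


records = [
    {"id": 1, "last": "Doe", "birth_year": 1970},
    {"id": 2, "last": "Doe", "birth_year": 1970},
    {"id": 3, "last": "Dunn", "birth_year": 1970},
    {"id": 4, "last": "Smith", "birth_year": 1982},
]
buckets = build_buckets(records)
distribution = size_distribution(buckets)   # one bucket of size 3, one of size 1
large = oversized(buckets, limit=2)         # [("D", 1970)]
```

The default limit of 1,000 mirrors the example figure in the text; a real deployment would presumably choose the limit from its own throughput measurements.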
In some cases, although few or no links between data records can itself indicate a problem, some data records may be linked or associated (e.g., as an entity) erroneously because of anomalies in the member data records. Such data anomalies, and other problems relating to the linking or association of data records, can be better analyzed or diagnosed by analyzing the distribution of entity sizes. In one embodiment, the entity analysis tool may provide the ability to calculate and display the distribution of entity sizes, showing how many entities comprise one data record, how many comprise two data records, and so on. Oddities or outliers in the distribution may indicate a problem, or indicate that the configuration of the identity hub needs to be changed (e.g., to account for anonymous names or addresses). The entity analysis tool may provide further analysis capabilities. Examples include the ability to view the entities in the distribution by size, to analyze each entity in a distribution group (e.g., the entities comprising three member data records), to view each member data record in an entity (e.g., its attribute values), or to compare two or more members of an entity (e.g., their attribute values) in order to determine why those member data records were linked.
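As a rough illustration of the entity-size analysis, the sketch below assumes that linkage results are available as a mapping from member record id to entity id. That representation, and every name in the code, is an assumption made for the example rather than the data model of the identity hub.

```python
from collections import Counter


def entity_size_distribution(entity_of):
    """entity_of maps member record id -> entity id.

    Returns a Counter mapping entity size -> number of entities of that size.
    """
    sizes = Counter(entity_of.values())
    return Counter(sizes.values())


def entities_of_size(entity_of, size):
    """List the entities (with their member ids) that have exactly `size` members."""
    members = {}
    for rec_id, ent in entity_of.items():
        members.setdefault(ent, []).append(rec_id)
    return {ent: sorted(ids) for ent, ids in members.items() if len(ids) == size}


def differing_attributes(a, b):
    """Compare two member records and return the attributes whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


entity_of = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e3", "r5": "e3", "r6": "e3"}
size_counts = entity_size_distribution(entity_of)
triples = entities_of_size(entity_of, 3)
diff = differing_attributes(
    {"name": "Jim Smith", "zip": "78701"},
    {"name": "James Smith", "zip": "78701"},
)
```

Here `size_counts` reports one entity each of sizes 1, 2, and 3, `triples` isolates the three-member entity for closer inspection, and `diff` shows that the two compared members differ only in the name attribute.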
Embodiments of the identity hub may be configured with soft-link and auto-link thresholds. These thresholds can greatly affect the performance of the identity hub. Some embodiments disclosed herein therefore provide the user with the ability to analyze and understand how the configured soft-link and auto-link thresholds affect system performance (e.g., false negatives or false positives, throughput) and to analyze how adjusting these thresholds would change the performance of the identity hub.
More particularly, in some embodiments these interfaces and displays may provide the user with the ability to select a desired false positive rate or false negative rate and to see the effect on the threshold levels. In some embodiments of the threshold analysis tool disclosed herein, the user can determine what the threshold levels should be in order to achieve a desired false positive or false negative rate. In some embodiments, links between data records that fall between the soft-link and auto-link thresholds must be reviewed manually. Some embodiments of the threshold analysis tool may provide an estimate of the amount of manual review that the configured soft-link and auto-link thresholds will generate. Some embodiments of the threshold analysis tool may allow the user to adjust the desired false positive and false negative rates or percentages, in response to which the tool shows what the threshold levels would be, and vice versa.
In one embodiment, the false positive rate may be related to the size of the problem (e.g., the number of data records), and the false negative rate may be related to the amount of information in each data record. A false positive rate or curve can therefore be estimated based on the number of records, and a false negative rate or curve can be estimated based on the distribution of data across all records. Because these estimates may be tied to the weight generation associated with the identity hub, they can be made after the weights have been generated. These curves can then be fitted or refined using the analysis tool based on a clerical review of a set of linked data records, in which a user determines whether records were linked correctly or incorrectly (which may occur, for example, during configuration of the identity hub). In some embodiments, the curves may be presented to the user graphically together with a graphical representation of the thresholds, so that the user can adjust the various false positive or false negative rates and see where the various thresholds would be set and the amount of manual review that would result from those thresholds.
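The interplay between the two thresholds, the error rates, and the manual review workload can be sketched as follows. This is not the estimation method of the disclosure, which derives curves from record counts and data distributions; instead it is a simpler stand-in that counts outcomes over a small, hypothetical sample of scored record pairs whose true match status is assumed to be known from review. The scores, threshold values, and function names are all illustrative.

```python
def classify(score, soft_link, auto_link):
    """Place a pair score into one of three outcomes relative to the thresholds."""
    if score >= auto_link:
        return "auto-link"
    if score >= soft_link:
        return "review"
    return "no-link"


def threshold_report(scored_pairs, soft_link, auto_link):
    """scored_pairs: list of (score, is_true_match) taken from a reviewed sample."""
    counts = {"auto-link": 0, "review": 0, "no-link": 0}
    false_pos = false_neg = 0
    for score, is_match in scored_pairs:
        outcome = classify(score, soft_link, auto_link)
        counts[outcome] += 1
        if outcome == "auto-link" and not is_match:
            false_pos += 1      # linked automatically but should not have been
        elif outcome == "no-link" and is_match:
            false_neg += 1      # left unlinked but should have been linked
    total = len(scored_pairs)
    return {
        "counts": counts,
        "false_positive_rate": false_pos / total if total else 0.0,
        "false_negative_rate": false_neg / total if total else 0.0,
    }


# Hypothetical reviewed sample: (comparison score, reviewer says "same entity").
sample = [(12.5, True), (11.0, True), (9.0, False), (7.5, True),
          (6.0, False), (4.0, True), (2.0, False), (1.0, False)]

report = threshold_report(sample, soft_link=5.0, auto_link=8.0)
stricter = threshold_report(sample, soft_link=5.0, auto_link=10.0)
```

With these illustrative numbers, raising the auto-link threshold from 8.0 to 10.0 removes the one false positive from the sample but moves one more pair into the manual review band, which is the kind of trade-off the threshold analysis tool is described as exposing.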
Embodiments disclosed herein can thus analyze, in real time, the configuration and performance of an identity hub capable of processing and matching large sets of data records. These tools provide a way to ensure that the throughput of the identity hub and the quality of the results it delivers meet customer requirements. Other features, advantages, and objects of the disclosure will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which form part of this specification, are included to depict certain aspects of the disclosure. A clearer impression of the disclosure, and of the components and operation of systems provided with it, will become apparent by reference to the exemplary, and therefore non-limiting, embodiments illustrated in the drawings. Wherever possible, the same reference numerals are used throughout the drawings to refer to the same or like features (elements). The drawings are not necessarily drawn to scale.
Fig. 1 illustrates an exemplary architecture of one embodiment of a system for matching data records.
Figs. 2A and 2B illustrate representations of two embodiments of data records.
Fig. 3 illustrates a flow diagram of one embodiment of comparing data records.
Fig. 4 illustrates an architecture of one embodiment of a system for configuring and analyzing an identity hub.
Fig. 5 illustrates a flow diagram of one embodiment of a method for configuring an identity hub.
Fig. 6 illustrates a screenshot of one embodiment of a graphical user interface through which the configuration of an identity hub can be analyzed.
Figs. 7A and 7B illustrate screenshots of one embodiment of a configuration editor through which the configuration of an identity hub can be modified.
Figs. 8A and 8B illustrate screenshots of one embodiment of a configuration editor through which an operation configuration can be modified.
Figs. 9A and 9B illustrate screenshots of one embodiment of an algorithm editor through which the algorithms associated with an entity type in an identity hub can be modified.
Figs. 10A and 10B illustrate screenshots of one embodiment of a graphical user interface through which the configuration of an identity hub can be accessed.
Fig. 11 illustrates a flow diagram of one embodiment of a method for analyzing the configuration of an identity hub.
Figs. 12A and 12B illustrate screenshots of one embodiment of an entity analysis tool.
Fig. 13 illustrates a screenshot of one embodiment of a data analysis tool.
Fig. 14 illustrates a screenshot of one embodiment of a bucket analysis tool.
Fig. 15 illustrates a screenshot of one embodiment of a linkage analysis tool.
Fig. 16 illustrates a screenshot of one embodiment of a graphical user interface through which the error rates and thresholds associated with member records in an identity hub can be analyzed.
Fig. 17 illustrates the relationship between system performance and the tolerances for the false positive and false negative rates associated with linking member records in an identity hub.
Detailed description
The disclosure and its various features and advantageous details are explained more fully with reference to the exemplary, and therefore non-limiting, embodiments illustrated in the accompanying drawings and detailed in the following description. Descriptions of known programming techniques, computer software, hardware, operating platforms, and protocols may be omitted so as not to unnecessarily obscure the disclosure in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions, and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
Software implementing embodiments disclosed herein may be implemented in suitable computer-executable instructions that may reside on a computer-readable storage medium. Within this disclosure, the term "computer-readable storage medium" encompasses all types of data storage media that can be read by a processor. Examples of computer-readable storage media include random access memories, read-only memories, hard drives, data cartridges, magnetic tapes, floppy diskettes, flash memory drives, optical data storage devices, compact-disc read-only memories, and other appropriate computer memories and data storage devices.
As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having," or any other variation thereof are intended to cover a non-exclusive inclusion. For example, a process, product, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, product, article, or apparatus. Further, unless expressly stated to the contrary, "or" refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
Additionally, any examples or illustrations given herein are not to be regarded in any way as restrictions on, limits to, or express definitions of any term or terms with which they are used. Instead, these examples or illustrations are to be regarded as being described with respect to one particular embodiment and as illustrative only. Those of ordinary skill in the art will appreciate that any term or terms with which these examples or illustrations are used encompass other embodiments, as well as implementations and adaptations thereof, which may or may not be given therewith or elsewhere in the specification, and all such embodiments are intended to be included within the scope of that term or terms. Language designating such non-limiting examples and illustrations includes, but is not limited to: "such as," "in one embodiment," and the like.
Reference is now made in detail to the exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numerals are used throughout the drawings to refer to the same or like parts (elements).
Embodiment more disclosed herein can supplement the embodiment about the system and method for the information of entity for the information source index from different describing in 5,991, No. 758 United States Patent (USP)s announcing as on November 23rd, 1999, and this patent is contained in this by reference.It is " METHOD AND SYSTEM FOR INDEXING INFORMATIONABOUT ENTITIES WITH RESPECT TO HIERARCHIES " 11/656 that embodiment more disclosed herein can supplement the title submitted to 2007 above-mentioned January 22, disclosed for pressing level index about the embodiment of the entity handles system and method for the information of entity in No. 111 U.S. Patent applications, it is also contained in this by reference.
Fig. 1 is a block diagram illustrating an exemplary architecture of one embodiment of entity handling system 30. Entity handling system 30 may include identity hinge (Identity Hub) 32, which processes, updates, or stores data from data records about one or more entities from one or more information sources 34, 36, 38 and responds to commands or queries from a plurality of operation sides 40, 42, 44, where the operation sides may be human users and/or information systems. Identity hinge 32 may operate on data records from a single information source or, as shown, on data records from multiple information sources. The entities tracked using embodiments of identity hinge 32 may include, for example, patients in a hospital, participants in a health care system, parts in a warehouse, or any other entity that may have data records, and information contained in those data records, associated with it. Identity hinge 32 may be one or more computer systems with at least one central processing unit (CPU) 45 that executes computer-readable instructions (e.g., a software application) stored on one or more computer-readable storage media to perform the functions of identity hinge 32. As will be understood by those skilled in the art, identity hinge 32 may also be implemented using hardware circuitry or a combination of software and hardware.
In the example of Fig. 1, identity hinge 32 may receive data records from information sources 34, 36, 38 and write corrected data back to information sources 34, 36, 38. The corrected data sent to information sources 34, 36, 38 may include information that was correct but has changed, information about fixing information in a data record, and/or information about links between data records.
In addition, one of the operation sides 40, 42, 44 may send a query to identity hinge 32 and receive a response to the query from identity hinge 32. Information sources 34, 36, 38 may be, for example, different databases that may hold data records about the same entities. For example, in the health care field, each information source 34, 36, 38 may be associated with a particular hospital in a health care organization, and the health care organization may use identity hinge 32 to relate the data records associated with the plurality of hospitals, so that a data record for a patient in Los Angeles can be located when that patient is on vacation and enters a hospital in New York. Identity hinge 32 may be centrally located, and information sources 34, 36, 38 and users 40, 42, 44 may be remote from identity hinge 32 and connected to it via, for example, a communication link, which may be the Internet or any other type of communication network, such as a wide area network, an intranet, a wireless network, a leased network, etc.
In certain embodiments, identity hinge 32 can have its oneself database, and this database is kept at complete data recording in identity hinge 32.In certain embodiments, identity hinge 32 can also only comprise (is for example enough to the data of identification data record, address in particular source 34,36,38) or any part of the data field (data field) that comprises partial data record, make identity hinge 32 from information source 34,36,38, to fetch whole data recording when needed.Identity hinge 32 can be used entity identifier or the federated database that separates with real data record links together the data recording comprising about the information of same entity.Therefore, identity hinge 32 can keep the link between the data recording in one or more information sources 34,36,38, but needn't keep the single consistent data recording of entity.
In some embodiments, identity hinge 32 may link data records in information sources 34, 36, 38 by comparing a data record (received from an operation side or from an information source 34, 36, 38) with other data records in information sources 34, 36, 38 to identify data records that should be linked together. This identification process may entail comparing one or more attributes of the data record with the like attributes of the other data records. For example, a name attribute of one record may be compared with the names of other data records, a social security number may be compared with the social security number of another record, and so on. In this manner, data records that should be linked can be identified.
It will be apparent to those skilled in the art that information sources 34, 36, 38 and operation sides 40, 42, 44 may be affiliated with similar or different organizations and/or owners and may be physically separate and/or remote from one another. For example, information source 34 may be affiliated with a Los Angeles hospital operated by one health care network, while information source 36 may be affiliated with a New York hospital operated by another health care network that may be owned by a French company. Data records from information sources 34, 36, 38 may therefore have different formats, different languages, etc.
This can be shown more clearly with reference to Fig. 2A and Fig. 2B, which illustrate two embodiments of example data records. Each of these data records 200, 202 has a set of fields 210 corresponding to a set of attributes of the data record. For example, one attribute of each of records 200, 202 may be a name, another may be a taxpayer number, etc. An attribute may comprise multiple fields 210 of a data record 200, 202. For example, the address attribute of data record 202 may comprise fields 210c, 210d and 210e, which are the street, city and state fields, respectively.
However, data records 200, 202 may each have a different format. For example, data record 202 may have a field 210 for the attribute "insurance company" while data record 200 may have no such field. Moreover, like attributes may also have different formats. For example, name field 210b in record 202 may accept a full name, while name field 210a in record 200 may be designed to accept a name of limited length. Such differences can be problematic when two or more data records (e.g., attributes of the data records) are compared to identify data records that should be linked. For example, the names "Bobs Flower Shop" and "Bobs Very Pretty Flower Shoppe" are similar but not identical. In addition, typographical errors or mistakes made when entering data into a data record also affect the comparison of data records and therefore the result (e.g., comparing the names "Bobs Pretty Flower Shop" and "Bobs Pretty Glower Shop", where "Glower" results from a misspelling when entering the word "Flower").
Business names in data records may present a number of rather particular problems because of their nature. Some business names may be very short (e.g., "Quick-E-Mart") while others may be very long (e.g., "San Francisco's Best Coffee Shop"). In addition, business names frequently use similar words (e.g., "Shop", "Inc.", "Co.") which, when data records in the same language are compared, should carry relatively little weight in a heuristic evaluation of the names. Acronyms are also often used in business names; for example, a business named "New York City Bagel" may frequently be entered into a data record as "NYC Bagel".
As will be described in further detail below, embodiments of identity hinge 32 disclosed herein employ algorithms that can take these particular characteristics into account when comparing business names. Specifically, some algorithms employed by identity hinge 32 support acronyms, take into account the frequency of certain words in business names, and take into account the order of tokens in a business name (e.g., the name "Clinic of Austin" may be considered virtually identical to "Austin Clinic"). Some algorithms use a variety of name comparison techniques to produce a weight based on a comparison (e.g., the similarity) of the names in different records, and this weight can then be used to determine whether two records should be linked. These name comparison techniques include various phonetic comparison methods, weighting based on the frequency of name tokens, prefix matching, alias matching, etc. In some embodiments, the tokens of the name attribute of each record are compared against one another using a token matching method (e.g., whether the tokens match exactly, match phonetically, etc.). Weights may then be assigned to these matches based on the match determined (e.g., an exact match is given a first weight, a certain type of prefix match is given a second weight, etc.). These weights may then be summed to determine an overall weight for the degree of match between the name attributes of the two data records. Exemplary embodiments of a suitable weight generation method are described in the above-referenced U.S. Patent Application No. 11/809,792, filed June 1, 2007, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING", which is incorporated herein by reference. Exemplary embodiments of suitable name comparison techniques are described in the above-referenced U.S. Patent Application No. 11/522,223, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS PERSONAL NAMES", and U.S. Patent Application No. 11/521,928, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS BUSINESS NAMES", both of which are incorporated herein by reference.
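The token-by-token weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the weight values, the minimum prefix length and the greedy pairing of tokens are hypothetical choices made for the example, and phonetic and alias matching are left out. The weight generation methods of the referenced applications are not reproduced here.

```python
# Hypothetical weights; in practice they would come from a weight generation step.
EXACT_WEIGHT = 2.0
PREFIX_WEIGHT = 1.0
MIN_PREFIX_LEN = 3  # assumption: very short tokens do not count as prefixes


def match_weight(a, b):
    """Weight for one pair of tokens: exact match, prefix match, or no match."""
    if a == b:
        return EXACT_WEIGHT
    shorter, longer = sorted((a, b), key=len)
    if len(shorter) >= MIN_PREFIX_LEN and longer.startswith(shorter):
        return PREFIX_WEIGHT
    return 0.0


def name_weight(tokens_a, tokens_b):
    """Sum the weights of the best token matches, ignoring token order."""
    remaining = list(tokens_b)
    total = 0.0
    for token in tokens_a:
        best = max(remaining, key=lambda other: match_weight(token, other), default=None)
        if best is not None and match_weight(token, best) > 0:
            total += match_weight(token, best)
            remaining.remove(best)  # each token of the second name is used at most once
    return total
```

Because tokens are paired regardless of position, `name_weight(["CLINIC", "OF", "AUSTIN"], ["AUSTIN", "CLINIC"])` gives the same total as comparing "AUSTIN CLINIC" with itself, which reflects the order-insensitivity mentioned in the text.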
Fig. 3 illustrates an example of a method for identifying records that belong to the same entity. In step 310, a set of data records to be evaluated may be pushed to or pulled in at identity hinge 32. These data records may include, for example, one or more new data records to be compared with a set of existing data records (which may already exist in, for example, information sources 34, 36, 38 or may be provided to identity hinge 32). In step 320, the data records to be compared may be standardized if they have not been already. This standardization may include standardizing the attributes of a data record so that the data record is converted from its original format to a standard format. Subsequent comparisons between like attributes of different data records can then be carried out according to the standard format of both the attributes and the data records. As will be apparent to those skilled in the art, each attribute of the data records to be compared may be standardized according to a different format, a different set of semantics or lexicon, etc., and the standardization of each attribute into its corresponding standard form may be accomplished by a distinct function. Thus, each data record may be standardized into a standard format by standardizing its various attributes, each attribute being standardized by a corresponding function (these attribute functions may, of course, be used to standardize multiple types of attributes).
For example, name attribute field 210a of data record 200 may be evaluated to produce a set of tokens for the name attribute (e.g., "Bobs", "Pretty", "Flower" and "Shop"), and these tokens may be concatenated according to a certain format to produce a standardized attribute (e.g., "BOBS:PRETTY:FLOWER:SHOP"), so that the standardized attribute can subsequently be parsed to produce the tokens that make up the name attribute. As another example, when a name is standardized, consecutive single-character tokens may be combined into one token (e.g., I.B.M. becomes IBM) and substitutions may be made (e.g., "Co." is replaced by "Company", "Inc." is replaced by "Incorporated", etc.). An equivalence table containing abbreviations and their equivalent substitutions may be stored in a database associated with identity hinge 32. Pseudo-code for one embodiment of standardizing business names is as follows:
BusinessNameParse(inputString, equivalenceTable):
    STRING outputString
    for c in inputString:
        if c is a LETTER or a DIGIT:
            copy c to outputString
        else if c is one of the following characters [&, ', `] (ampersand, single quote, back quote):
            skip c (do not replace with a space)
        else: // non-ALPHA-DIGIT, non-[&, ', `] character
            if the last character in outputString is not a space, copy a space to outputString
    // Now extract the tokens.
    tokenList = []
    for token in outputString: // outputString is a list of tokens separated by spaces
        if (token is a single character and it is followed by one or more single characters):
            combine the single tokens into a single token
        if (equivalenceTable maps token):
            replace token with its equivalence
        append token to tokenList
    return tokenList
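The pseudo-code above can be turned into a small runnable sketch. The function name `business_name_parse` and the contents of the equivalence table passed in are illustrative assumptions; the typographic apostrophe is also treated as a skipped character, which is an assumption not stated in the pseudo-code.

```python
# Characters that are dropped without inserting a space.
# Assumption: the typographic apostrophe (U+2019) is treated like a single quote.
SKIPPED_CHARS = {"&", "'", "`", "\u2019"}


def business_name_parse(input_string, equivalence_table):
    """Split a business name into standardized tokens, following the pseudo-code."""
    chars = []
    for c in input_string:
        if c.isalnum():
            chars.append(c)
        elif c in SKIPPED_CHARS:
            continue  # skip, do not replace with a space
        elif chars and chars[-1] != " ":
            chars.append(" ")  # any other character becomes a single separator

    raw_tokens = "".join(chars).split()

    tokens = []
    i = 0
    while i < len(raw_tokens):
        # Find a run of consecutive single-character tokens starting at i.
        j = i
        while j < len(raw_tokens) and len(raw_tokens[j]) == 1:
            j += 1
        if j - i >= 2:
            token = "".join(raw_tokens[i:j])  # e.g. I, B, M -> IBM
            i = j
        else:
            token = raw_tokens[i]
            i += 1
        # Replace the token with its equivalent if the table maps it.
        tokens.append(equivalence_table.get(token, token))
    return tokens
```

For example, `business_name_parse("I.B.M. Co.", {"Co": "Company"})` yields the tokens `IBM` and `Company`, and joining the upper-cased tokens of "Bob's Pretty Flower Shop" with ":" gives the standardized form "BOBS:PRETTY:FLOWER:SHOP" mentioned in the text.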
Whatever technique is used, once the attributes of the data records to be compared, and the data records themselves, have been standardized into a standard form in step 320, a set of candidates may be selected from the existing data records in step 330 for comparison with the new or incoming data record. This candidate selection process (also referred to herein as grouping, or bucketing) may include comparing one or more attributes of the new or incoming data record with the existing data records to determine which existing data records are similar enough to the new data record to warrant further comparison. Each set of candidates (bucket group) may be based on a comparison of each of a set of attributes between data records (e.g., between the incoming data record and the existing data records) using a candidate selection function (bucketing function) corresponding to that attribute. For example, one set of candidates (i.e., a bucket group) may be selected based on a comparison of the name and address attributes using a candidate selection function designed to compare names and another designed to compare addresses.
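Candidate selection can be sketched as an index from bucket keys to record identifiers. The bucketing function below, which combines each name token with the city, is a hypothetical example chosen for illustration; the disclosure does not specify how bucket keys are formed, and the record layout (dictionaries with `name` and `city` keys) is likewise assumed.

```python
from collections import defaultdict


def name_city_bucket(record):
    """Hypothetical bucketing function: one key per name token combined with the city."""
    city = record["city"].upper()
    return {f"{token}|{city}" for token in record["name"].upper().split()}


def build_bucket_index(records, bucket_functions):
    """Index existing records (id -> record) by every bucket key they produce."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for bucket_function in bucket_functions:
            for key in bucket_function(record):
                index[key].add(record_id)
    return index


def select_candidates(new_record, index, bucket_functions):
    """Return ids of existing records sharing at least one bucket key with the new record."""
    candidates = set()
    for bucket_function in bucket_functions:
        for key in bucket_function(new_record):
            candidates |= index.get(key, set())
    return candidates
```

Only the records returned by `select_candidates` would then go through the more detailed and more expensive attribute comparison, which is the purpose of bucketing.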
Then, in step 340, the data records making up these sets of candidates may undergo a more detailed comparison with the new or incoming record, in which a set of attributes is compared between the records to determine whether an existing data record should be linked or associated with the new data record. This more detailed comparison may entail comparing one or more of the set of attributes of one record (e.g., an existing record) with the corresponding attributes of the other record (e.g., the new or incoming record) to produce a score for each attribute comparison. The scores for the set of attributes may then be summed to produce an overall score, which can then be compared with thresholds to determine whether the two records should be linked. For example, if the overall score is less than a first threshold (called the soft link or review threshold), the records are not linked; if the overall score is greater than a second threshold (called the AutoLink threshold), the records may be linked; and if the overall score falls between the two thresholds, the records may be linked and flagged for review by a user.
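The two-threshold decision can be written as a short sketch. The function name, the result labels and the example scores are hypothetical; scores exactly equal to a threshold are treated as falling between the thresholds, which is an assumption because the text does not address that case.

```python
def link_decision(attribute_scores, review_threshold, autolink_threshold):
    """Sum per-attribute scores and classify the pair of records.

    Returns "no_link", "link_for_review" or "autolink" (hypothetical labels).
    """
    total = sum(attribute_scores.values())
    if total < review_threshold:
        return "no_link"
    if total > autolink_threshold:
        return "autolink"
    # Between the thresholds (boundary values included by assumption):
    # link the records and flag them for user review.
    return "link_for_review"
```

For instance, with a review threshold of 5.0 and an AutoLink threshold of 10.0, attribute scores of 4.0 for the name and 3.5 for the address sum to 7.5, so the pair would be linked and flagged for review.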
Fig. 4 illustrates the architecture of one embodiment of system 10 for configuring, and analyzing the configuration of, identity hinge 32. In some embodiments, system 10 comprises computer 40 and worktable 20. Worktable 20 is a software program that is stored in the memory of computer 40 and comprises computer instructions readable by the processor of computer 40. Worktable 20 is installed and runs on computer 40, and computer 40 communicates with identity hinge 32 over network 15. Network 15 may represent a public network, a private network, or a combination thereof. Worktable 20 comprises a plurality of functions, including configuration tool 400, which user 51 can access through graphical user interface 50. In some embodiments, user interface 50 represents one or more user interfaces of worktable 20. In some embodiments, through user interface 50, worktable 20 enables user 51 to create, edit and/or confirm an identity hinge configuration, saves the identity hinge configuration locally on computer-readable storage medium 56, and remotely deploys the confirmed configuration over network 15 to an identity hinge instance of identity hinge 32. Computer-readable storage medium 56 may be internal or external to computer 40.
As will be understood by those skilled in the art, computer 40 represents any networked computing device that is specially programmed with an embodiment of worktable 20 to configure and analyze an identity hinge configuration locally and to remotely deploy the (confirmed) configuration over a network to an instance of the identity hinge. One embodiment of a method for configuring identity hinge 32 through worktable 20 is described below with reference to Fig. 5. An embodiment of user interface 50 of worktable 20 is described below with reference to Fig. 6.
In some embodiments, configuration tool 400 comprises config editor 410, algorithm editor 420 and analysis tools 430. In some embodiments, analysis tools 430 comprise data analysis tool 432, entity analysis tool 434, grouping (bucket) analysis tool 436 and link analysis tool 438. In some embodiments, through config editor 410, worktable 20 provides user 51 with the ability to create a new configuration of identity hinge 32 or to load an existing configuration of identity hinge 32 stored on computer-readable storage medium 56. In some embodiments, an identity hinge configuration comprises views of member records, attributes of member records, and segments defined for a specific implementation of identity hinge 32. For further teachings on implementation-defined segments, readers are directed to U.S. Patent Application No. 11/900,769, filed September 13, 2007, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS", which is incorporated herein by reference. Details of configuring identity hinge 32 are described below with reference to Figs. 7-8.
Identity hinge 32 uses a plurality of algorithms to compare and score the similarities and differences of member attributes. More specifically, identity hinge 32 applies these algorithms to data to create tasks and to support search functions. In some embodiments, through algorithm editor 420, worktable 20 provides user 51 with the ability to define and customize algorithms for a specific implementation of identity hinge 32. Embodiments of algorithm editor 420 are described below with reference to Figs. 9A-9B.
In some embodiments, through data analysis tool 432, user 51 can analyze the attribute validity of the data records in identity hinge 32. In some embodiments, through entity analysis tool 434, user 51 can analyze the entities associated with the data records in identity hinge 32. In some embodiments, through grouping (bucket) analysis tool 436, user 51 can analyze groupings (groups of candidate records) and the effect of the grouping strategy on identity hinge 32. In some embodiments, through link analysis tool 438, user 51 can analyze the link error rates associated with member records and the thresholds used when scoring data derived from those records. Some embodiments of analysis tools 430 are described below with reference to Figs. 10-17.
Fig. 5 is a flow diagram illustrating one embodiment of a method for configuring identity hinge 32. With worktable 20 installed and running on computer 40, in step 510 user 51 can access worktable 20 and create a new Initiate project or open an existing Initiate project. In some embodiments, an Initiate project is a container for holding an identity hinge configuration and the files associated with it. In some embodiments, an Initiate project comprises a plurality of artifacts. Examples of these artifacts include an identity hinge configuration, the algorithms used by that identity hinge configuration, and the results of analyses obtained from analysis tools 430. In step 520, user 51 can create a new configuration or open an existing configuration in the Initiate project created or opened in step 510. In step 530, through user interface 50, user 51 can analyze, modify and/or confirm the configuration created or opened in step 520. In step 540, user 51 can save the configuration locally on computer 40. In step 540, user 51 can remotely deploy the saved and confirmed configuration to an instance of identity hinge 32 through a network connection to the server running that instance. In some embodiments, the identity hinge configuration and algorithms can be deployed directly to the instance of identity hinge 32 in real time. In some embodiments, some tasks (operations) may need to be executed directly by identity hinge 32 outside of deploying a configuration. In that case, some embodiments of worktable 20 can provide a means to run a single operation or a group of operations, execute them directly on identity hinge 32, and show the progress or status of their execution to user 51 in a worktable view through user interface 50. In some embodiments, user 51 can retrieve or view the operation results from identity hinge 32 at computer 40 through user interface 50. For some embodiments of user interface 50, readers are directed to U.S. Patent Application No. 11/901,040, filed September 14, 2007, entitled "HIERARCHY GLOBAL MANAGEMENT SYSTEM AND USER INTERFACE", which is incorporated herein by reference.
Fig. 6 illustrates screenshot 60 of one embodiment of user interface 50. More specifically, screenshot 60 shows an example layout of config editor 410 of worktable 20 as displayed on computer 40 through one embodiment of user interface 50. In this example, config editor 410 comprises menu 61, shortcuts 63 and a set of work areas 64, 65, 66 and 67 called views. Menu 61 provides access to various menu items, each of which provides a different set of functions. For example, through menu item "startup" 62, user 51 can create a new startup (Initiate) project, import an identity hinge configuration, deploy an identity hinge configuration, create a new operation group, confirm partial weights, etc. Shortcuts 63 provide quick access to the functions of worktable 20 currently in use. For example, user 51 can switch quickly between config editor 410 and analysis tools 430 through shortcuts 63. Views 64, 65, 66 and 67 are individual windows that contain particular types of data. Most views can be moved to a different area of the user interface on screen by dragging and dropping their tabs. To change views, user 51 can select "show view" under menu item "window" of menu 61. The following is a brief description of the views included in one embodiment of user interface 50 of worktable 20. All of these views can be hidden and expanded within worktable 20.
Omniselector view
The omniselector view provides a tree structure for browsing worktable artifacts.
The following functions can be accessed from the omniselector view:
Traverse the project directory
Open and view project files
Copy, paste, move, delete and rename project files
Import resources
Refresh imported resources
Select a working set of files (and hide the unused files in that working set)
Deselect the working set of files
Characteristic view
The characteristic view enables the user to edit the characteristic values of any component created by the user.
Problem view
The problem view provides a list of configuration and confirmation problems in the worktable. Most confirmation is done when file resources in the project are saved, so errors appear immediately.
Control desk view
The control desk view shows progress messages and errors during large background tasks.
Operation view
The operation view shows the progress or completion (execution) status of an operation or operation group.
More details of the operation view are described below with reference to Figs. 8A and 8B.
Analysis view
The analysis view displays the results of analysis queries. To populate the data in this view, the worktable needs to be connected to the hinge so that the hinge can process the query.
Search view
The search view displays the results of searches in an existing configuration. The user can open a configuration object by double-clicking its row in the search view.
In some embodiments, worktable 20 provides several specialized editors, such as config editor 410 and algorithm editor 420. In some embodiments, worktable 20 also supports other editor types, including standard text and Java editors. Figs. 7A and 7B illustrate screenshots 70a and 70b of one embodiment of config editor 410, through which hinge configuration 71 of identity hinge 32 can be modified.
More specifically, screenshot 70a shows a representation of hinge configuration 71 imported into worktable 20. In some embodiments, config editor 410 may comprise navigation menu 72, which shows views of applications, attribute types, information sources, links, member types, relationship types, etc. Referring to Fig. 7A, member type view 73 enables the user to add, edit and remove member types. In some embodiments, a member type identifies the "kind of object" the data belongs to (e.g., person, provider, guest or organization). In some embodiments, five objects can be configured for a particular member type, each with its own tab (view): attributes, entity types, synthetic views, sources and algorithms.
In some embodiments, the attribute type view enables the user to view the attributes associated with a member type. For example, for member type person 74, the attributes tab may show the attributes associated with member type person 74, such as APPT and date of birth. In this example, attribute APPT has attribute type MEMAPPT and attribute date of birth has attribute type MEMDATE. In some embodiments, attribute types (segments) conform to the Initiate data schema to define hinge behavior and member information. In some embodiments, attribute types comprise member attribute types and relationship attribute types. In some embodiments, member attribute types comprise predefined ("fixed") attribute types and implementation-defined attribute types, which are described in the above-mentioned U.S. Patent Application No. 11/900,769, filed September 13, 2007, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS". Implementation-defined attribute types can be created when an identity hinge is implemented and are therefore not associated with generated classes. Relationship attribute types are attribute types specific to relationships. An attribute type cannot be both a member attribute type and a relationship attribute type.
In some embodiments, the entity type view makes it possible to manage entity types, such as identities or thresholds. For further teachings on entity management, readers are directed to U.S. Patent Application No. 12/056,720, filed March 27, 2008, entitled "METHOD AND SYSTEM FOR MANAGING ENTITIES", and U.S. Patent Application No. 11/656,111, filed January 22, 2007, entitled "METHOD AND SYSTEM FOR INDEXING INFORMATION ABOUT ENTITIES WITH RESPECT TO HIERARCHIES", both of which are incorporated herein by reference.
In some embodiments, a synthetic view represents a complete picture of a member as defined by the user. Configuring a synthetic view can establish the rules that control the behavior and display of member attribute data in worktable 20. For example, the member attribute data for a particular member may consist of a name, address, phone number and social security number.
In some embodiments, the source view enables the user to add and manage information about the sources that interact with worktable 20. Examples of sources may include definitional sources and information sources. Examples of information sources may include the above-described sources 34, 36, 38. A definitional source is a source in which members (records) are created and usually updated. In some embodiments, worktable 20 can send updates to a definitional source.
In some embodiments, the algorithms tab enables the user to create or identify the active algorithm that the hinge uses to process comparisons. In some embodiments, only one algorithm can be active for each member type on a hinge instance. These algorithms (active and inactive) are based on the member types currently defined in the hinge configuration. Each newly created algorithm must be associated with a member type in the hinge configuration (see Figs. 9A and 9B).
In some embodiments, links may be formed automatically (AutoLink) for records whose scores are above the AutoLink threshold, or formed manually by a user while tasks are being resolved (clerical review). The purpose of a link is to enable an accurate, comprehensive (enterprise-wide) understanding of a member (record). Referring to Fig. 7B, in some embodiments, link view 76 of config editor 410 may provide link types 77 and link statuses 78. This functionality can be used to add or edit link types and their associated statuses. In this example, link types 77 lists the link ID, link type and kind that define valid entity relationships, and link statuses 78 lists the status ID, link status and category that represent workflow states of those relationships. In some embodiments, these columns can be sorted in ascending or descending order by clicking on a column heading.
Referring briefly to Fig. 7A, navigation menu 72 also shows an application view and a relationship type view. The application view may list several functions. In some embodiments, the user can use the functions in this component to mark an application as active or inactive. In some embodiments, an enterprise user can add to and remove from the application view the Initiate applications implemented at the enterprise site. The relationship type view may show the available relationship types. A relationship type is a type of association that can exist between two different (or identical) entity types. For example, one person can manage another person, or one organization can legally own another organization. In some embodiments, the user can use the functions in this component to manage relationships between entities. For further teachings on relating information about entities, readers are directed to U.S. Patent Application No. 11/904,750, filed September 28, 2007, entitled "METHOD AND SYSTEM FOR INDEXING, RELATING AND MANAGING INFORMATION ABOUT ENTITIES", which is incorporated herein by reference. For the sake of brevity, not all available views are shown or described in this disclosure. However, those skilled in the art will appreciate that additional views, and additional functions provided through those views, are also possible. For example, a string view may enable the user to create rules or guidance that tell an algorithm how to handle certain incoming data values. As another example, an audit view may enable the user to establish audit logging of interactions with identity hinge 32 and of the users who perform those interactions.
In some embodiments of worktable 20, the container that holds a hinge configuration and its associated files is called a project. Before a hinge configuration can be imported into a project, the user needs to create a new project or import an existing one. To create a new project, the user can select "new startup project..." from "startup" menu 61 and enter a name for the new project. The new project can be created from a worktable template either in the current workspace directory or in a user-specified location outside the current workspace (e.g., another local drive or a network drive). For further teachings on some embodiments of project management, readers are directed to U.S. Patent Application No. 11/824,210, filed June 29, 2007, entitled "METHOD AND SYSTEM FOR PROJECT MANAGEMENT", which is incorporated herein by reference.
Worktable 20 then creates the project and adds the following directories under the workspace directory:
Flow: contains the flow files (.iflow)
Function: contains any custom functions
Library: contains any additional Java library files (.jar) required for deployment
Service: contains all data source WSDL files (.wsdl) imported into the project
Src: contains any additional Java source files (.java) that are needed
Anonutil: contains sample default value files and filter files
Processor: contains script support for packaging Java processors
Operation: holds information related to hinge-project registration
A project is associated with the identity hinge 32 through a connection to the server running an instance of the identity hinge 32. There are several types of connection, including production and test. In certain embodiments, a connection to an instance of the identity hinge 32 can be added, edited, or removed through the corresponding functions of the "Startup" menu item 62, accessed from menu 61 (see Fig. 6). A hinge configuration can be imported into a project through the "Import Hub Configuration..." function of the "Startup" menu 62. In certain embodiments, a user name and password may be required to retrieve the hinge configuration information from the identity hinge 32. In certain embodiments, the name of the imported hinge configuration can be displayed in the navigator view 64 of config editor 410, and the components of the imported hinge configuration can be displayed in the workspace 65.
Fig. 8A and Fig. 8B show screenshots 80a and 80b of an embodiment of the config editor 410 through which operation configurations can be modified. In some embodiments of worktable 20, a task performed by the identity hinge 32 can be called an operation, and a group of one or more operations can form an operation group. In certain embodiments, the available operations (tasks) can be classified as configuration operations, data analysis operations, hinge management operations, and so on. In certain embodiments, operation results can be saved per project on the server running the identity hinge 32, and in many cases can be retrieved from that server or viewed at computer 40. In certain embodiments, the following non-exhaustive list of tasks can be performed through the operation view in config editor 410:
Deploy configuration to hinge
Generate weights
Create threshold analysis pairs
Retrieve files from hinge
Deploy hinge configuration
This function deploys the configuration project to the hinge. This operation can be used (instead of the Startup menu option described above) to perform the deployment together with another operation. When this operation is executed, the hinge is automatically stopped and restarted. When it is run from the Startup menu 62, the following options are available:
Deploy weight tables. When selected, this option causes the weight tables in the selected worktable project directory to be deployed to the hinge.
Create and/or drop database tables as needed. When selected, this option allows database table operations to be performed as needed to support the configuration.
Check group synchronization. When selected, this option checks whether the locally listed operation groups are up to date with the groups defined in the hinge. In one embodiment, if this option is selected and the groups do not match, the deployment can be stopped.
Generate weights
This function performs the weight generation task. The operation requires derivative data (comparison data and grouping data) as input. In certain embodiments, the derivative data files can be generated during the standardization and grouping steps 320 and 330 described above, using for example mpxdata, mpxprep, mpxfsdvd, or mpxredvd. As an example, Fig. 8A shows screenshot 80a, which illustrates how this operation can be configured through an embodiment of config editor 410. In particular, for entity type id 84, an embodiment of config editor 410 can show multiple tabs, including steps, input and output, performance tuning, options, and log tabs. In certain embodiments, the steps tab can enable the user to select which weight generation steps to run and to indicate whether the subsequent steps should be run through to the end of the process. Examples of weight generation steps can include:
Delete artifacts from previous runs
Generate counts for all attribute values
Generate random pairs of members
Derive random data by comparing random members
Reduce the matched candidate pairs
Generate the matched set, match statistics, and initial weights
Skip the final step because there are too few attributes
Iterate the previous steps and check the weights for convergence
Run all remaining steps through to the end of the process
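The iteration and convergence check named in the steps above can be pictured with a minimal Python sketch. It is an illustration only: it assumes weights are held in a dictionary keyed by attribute name, and that a caller-supplied `recompute` function (hypothetical, standing in for the actual weight generation steps, which are not spelled out here) returns updated weights. The `tolerance` and `max_iterations` parameters play the roles of the convergence threshold and the maximum iteration count; their default values are arbitrary.

```python
def iterate_until_converged(weights, recompute, tolerance=0.2, max_iterations=10):
    """Repeat a weight update until no weight moves by `tolerance` or more.

    `recompute` is a hypothetical stand-in for one pass of weight generation.
    Returns (weights, iterations_run, converged).
    """
    for iteration in range(1, max_iterations + 1):
        new_weights = recompute(weights)
        # Largest absolute change of any single weight in this pass.
        delta = max(
            (abs(new_weights[k] - weights[k]) for k in weights), default=0.0
        )
        weights = new_weights
        if delta < tolerance:
            return weights, iteration, True
    return weights, max_iterations, False
```

With a toy update that halves every weight, starting from `{"name": 8.0}`, the loop stops once the change falls below the tolerance; a real update step would of course behave differently.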
In certain embodiments, the input and output tab can allow the user to specify various input/output directories. Examples of input/output directories can include:
BXM input directory: specifies the input directory from which the bulk cross-match results are read. This directory must match the output directory used by the mpx function that generated the derivative data.
Working directory: specifies the directory in the worktable project in which the weight tables will be saved. In one embodiment, the default is the weights directory. All files are saved to a subdirectory of the specified working directory named after the entity type.
FRQ output directory: specifies the output directory to which the generated attribute frequency data is written.
UPAIRS output directory: specifies the output directory to which the generated random pair data is written.
USAMPS output directory: specifies the output directory to which the generated unmatched sample pair data is written.
MPAIRS output directory: specifies the output directory to which the generated matched pair data is written.
MSAMPS output directory: specifies the output directory to which the generated matched sample pair data is written.
RUN output directory: specifies the output directory to which the generated weights are written. An incrementing number is appended to this directory name for each iteration.
In certain embodiments, the performance tuning tab can allow the user to modify the following parameters:
Number of threads
Maximum number of iterations in the final step
Number of comparison grouping partitions
Number of random pair grouping partitions
Number of matched pair grouping partitions
Number of frequency partitions
Maximum number of input/output partitions
Audrecno used for auditing
Number of random pairs to generate
Interval for reporting processed records
Maximum grouping set size
Minimum weight for writing item records
In certain embodiments, the options tab can provide the following options to the user:
Encoding. In certain embodiments, worktable 20 supports LATIN1, UTF8, and UTF16 encodings. Other encodings can also be used. For further description of parsing data records in different languages, the reader is referred to U.S. Patent Application No. 11/967,588, filed December 31, 2007, entitled "METHOD AND SYSTEM FOR PARSING LANGUAGES", which is incorporated herein by reference.
Audit. In certain embodiments, worktable 20 supports auditing of a set of data records.
Comparison mode. In certain embodiments, this option can be used to limit the comparison functions. For example, weights can be generated only for matching and linking, only for searching, or for matching, linking, and searching.
In certain embodiments, the following weight generation parameters can be found under the options tab of screenshot 80a in Fig. 8A. The data here include thresholds that are specific to individual sources.
Comparison percentage threshold (wgtNRM): defines the threshold of the third filter used in attribute matching.
Attribute comparison threshold (wgtABS): defines the threshold of the second filter used in attribute matching.
Convergence threshold (wgtCNV): defines the tolerance for weight generation convergence.
Data quality percentage (wgtQOD): defines the matched-set error rate used for initial weight estimation.
False negative rate (wgtFNR): defines the false negative rate used for calculating the office worker reexamination and AutoLink thresholds.
False positive rate (wgtFPR): defines the false positive rate used for calculating the office worker reexamination and AutoLink thresholds.
Comparison threshold (wgtMAT): defines the threshold of the first filter used in matching.
Minimum attribute count (wgtFLR): defines the lower limit of the attribute value frequency count.
In certain embodiments, the log tab can provide the following logging options to the user:
Trace log
Debug log
Timer log
SQL log
When the weight generation operation completes, the results can be reviewed and the weights can be saved locally. In certain embodiments, the output of weight generation can be copied from the hinge to the project. For further description of weight generation, the reader is referred to U.S. Patent Application No. 11/809,792, filed June 1, 2007, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING", the entire content of which is incorporated herein by reference.
As an example of a data analysis operation, Fig. 8B shows screenshot 80b, which illustrates how a threshold analysis pair generation operation can be configured through an embodiment of config editor 410. In particular, an embodiment of config editor 410 can allow the user to specify the entity type and the appropriate input directory and output file. The user can further specify the score range and the number of pairs for each score. In the example of Fig. 8B, the minimum score is 8.0 and the maximum score is 25.0. In this example, the sample generator takes 10 random pairs in each of the 171 score bins (8.0 to 25.0, in increments of 0.1).
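The bin arithmetic in this example can be checked with a short Python sketch. It is a sketch under stated assumptions rather than the sample generator itself: the function names are hypothetical, scored pairs are assumed to arrive as `(member_a, member_b, score)` tuples, and assigning a pair to a bin by rounding its score to one decimal place is an assumption made here for illustration.

```python
import random

def score_bins(min_score=8.0, max_score=25.0, step=0.1):
    """Return the bin values from min_score to max_score inclusive."""
    # Count steps as an integer to avoid floating-point drift.
    n = round((max_score - min_score) / step) + 1
    return [round(min_score + i * step, 1) for i in range(n)]

def sample_pairs(scored_pairs, bins, per_bin=10, seed=0):
    """Pick up to `per_bin` random pairs for each score bin."""
    rng = random.Random(seed)
    by_bin = {b: [] for b in bins}
    for a, b, score in scored_pairs:
        key = round(score, 1)  # assumed binning rule
        if key in by_bin:
            by_bin[key].append((a, b, score))
    return {b: rng.sample(p, min(per_bin, len(p))) for b, p in by_bin.items()}
```

With the defaults above, `score_bins()` yields 171 values, which matches the count given in the example (170 increments of 0.1 between 8.0 and 25.0, plus the starting value).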
As described above with reference to Fig. 7A, a newly created algorithm must be associated with a member type in the hinge. Fig. 9A and Fig. 9B show screenshots 90a and 90b of an embodiment of algorithm editor 420. In certain embodiments, algorithm editor 420 enables the user to edit the algorithm files that the identity hinge 32 uses to apply comparison logic. In particular, when an algorithm is first created, it is empty. Algorithm editor 420 enables the user to add algorithm components in the editor and to connect them, using the palette 91, to build the algorithm. In the example of Fig. 9A, screenshot 90a shows the algorithm associated with member type "person" 74. In certain embodiments, although only one algorithm can be set as the "active" algorithm at any given time, multiple algorithms can be associated with a particular member type. Algorithms are edited locally, so the database is not changed until the edits take full effect.
As shown in Fig. 9A and Fig. 9B, an algorithm can include multiple components, including attribute components, standardization function components, comparison and query role components, and grouping and comparison function components. The user can modify an algorithm by adding, modifying, or deleting one or more algorithm components. Attribute components allow the user to define the properties or fields of data elements. These attributes are filtered by the member type of the algorithm. Standardization function components include functions that standardize or format the input source data for comparison, grouping, and search (query) purposes. This can mean capitalizing characters, removing punctuation, checking for anonymous values, and sorting data. After standardization, the data can be saved as the comparison component of the derivative data and used in generating the grouping data. In certain embodiments, the standardized data is not saved in the hinge database and therefore does not change the member data. For example, a telephone number can be entered in the source as 232-123-4567. Although the standardization routine may remove the dashes and the area code and format the number as 1234567, the number stored in database 46 of identity hinge 32 remains 232-123-4567. Comparison and query role components enable the user to define how the comparison functions and/or query functions are used in the algorithm. Grouping functions can be used to identify grouping data, which identifies groups that share information. For example, groupings can be defined on name (first name, last name, middle name), date of birth plus last name, address, and social security number (SSN). This component also enables the user to define combinations of data elements within a grouping. For further description of embodiments of algorithm editor 420, the reader is referred to U.S. Patent Application No. 11/702,410, filed February 5, 2007, entitled "METHOD AND SYSTEM FOR A GRAPHICAL USER INTERFACE FOR CONFIGURATION OF AN ALGORITHM FOR THE MATCHING OF DATA RECORDS", which is incorporated herein by reference.
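The telephone number example can be illustrated with a minimal Python sketch. The function name and the seven-digit local length are assumptions chosen to reproduce the example; the actual standardization functions are not specified here.

```python
import re

def standardize_phone(raw: str, local_digits: int = 7) -> str:
    """Strip non-digits and keep only the trailing local number.

    Hypothetical helper: drops dashes and the leading area code so that
    '232-123-4567' is compared as '1234567'.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-local_digits:]

stored_value = "232-123-4567"              # what stays in the database
derived_value = standardize_phone(stored_value)  # used only for comparison
```

The point of the sketch is that `stored_value` is never modified: the standardized form exists only as derived data.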
Therefore, in one embodiment, a method for analyzing an identity hinge can include generating a configuration of the identity hinge using a set of initial data records; analyzing the groupings created from that set of initial data records, or a subset of it, according to the grouping strategy associated with the identity hinge configuration; analyzing the effect of those groupings on the performance of the identity hinge; and then changing the grouping strategy accordingly. In one embodiment, the grouping strategy can be changed by editing an algorithm used in creating the groupings or by changing one or more parameter values associated with that algorithm. In one embodiment, the algorithm can be associated with an entity type.
In one embodiment, in addition to the core algorithm configuration functions described above, thresholds and automatic weight generation parameters can also be configured through the weight properties tab 92 of algorithm editor 420. Because weight properties are associated with an entity type, the user must first select an entity type in order to view them. In the present example, screenshot 90b shows the thresholds and weight properties of entity type id 84.
For further description of weight generation (including weight generation convergence), the reader is referred to U.S. Patent Application No. 11/809,792, filed June 1, 2007, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING", which is incorporated herein by reference.
With reference to Fig. 9B, after the weights are determined, the user can use threshold calculator 93 to set manually, or to calculate, the appropriate office worker reexamination and AutoLink thresholds for a particular hinge configuration. Threshold calculator 93 can use sample data from database 46 of identity hinge 32 to calculate appropriate office worker reexamination and AutoLink thresholds. In certain embodiments, the user can also use threshold calculator 93 to set the office worker reexamination threshold and the AutoLink threshold and thereby obtain estimates of the false positive rate, the false negative rate, and the number of tasks. In certain embodiments, the thresholds can be calculated using an estimated false positive rate (FPR) or a statistical FPR based on evaluated sample pair data. These values can apply to selected (or all) source pairs. The statistical option requires the user first to run the threshold analysis pair generation operation described above and then to perform the action that retrieves the operation results on the completed operation.
In certain embodiments, worktable 20 provides candidate thresholds. The user can review the candidate thresholds, tasks, and links and determine the appropriate thresholds for a particular configuration. In certain embodiments, the candidate thresholds can be calculated as follows:
AutoLink threshold
The candidate AutoLink threshold depends on the file size and the allowable false positive rate. Let fpr be the allowable false positive rate (default value 10^(-5)) and let num be the number of records in the data set. The candidate AutoLink threshold is thresh_al = -ln[-ln(1-fpr)/num]/ln(10), where ln is the natural logarithm (base e).
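As a worked illustration of the candidate AutoLink formula, the following Python sketch evaluates it directly. The function name is hypothetical, and the default of 1e-5 for fpr reflects a reading of the default stated above, so it should be treated as an assumption.

```python
import math

def candidate_autolink_threshold(num_records: int, fpr: float = 1e-5) -> float:
    """thresh_al = -ln[-ln(1 - fpr) / num] / ln(10)."""
    if num_records <= 0 or not 0.0 < fpr < 1.0:
        raise ValueError("num_records must be positive and 0 < fpr < 1")
    # log1p(-fpr) computes ln(1 - fpr) accurately for very small fpr.
    inner = -math.log1p(-fpr) / num_records
    return -math.log(inner) / math.log(10)
```

For small fpr the expression is close to log10(num) - log10(fpr), so a data set of one million records with fpr of 1e-5 gives a candidate threshold of roughly 11, and the threshold grows by about 1 for every tenfold increase in the number of records.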
Office worker reexamination threshold
The candidate office worker reexamination threshold is set based on the desired false negative rate (fnr). For example, if 95% of duplicates should score above the office worker reexamination threshold, fnr is set to 0.05, which is the default. The actual fnr value can depend on the weights calculated for matching, the fraction of the time each attribute has a valid value, and the distribution of those values. A bootstrap procedure can be used to determine the empirical distribution of matched-set scores, and the office worker reexamination threshold is calculated from that distribution. For this bootstrap, a list of random members is generated, the information of each member is calculated, and the empirical distribution is formed from this sample as follows:
Select numebt random members from the database, potentially with repetition. Call them memrecno_1, memrecno_2, ..., memrecno_numebt. Score each of these members against itself (that is, calculate the member's information). Call these scores s_1, s_2, ..., s_numebt. Let s_min be the smallest of these scores and s_max the largest, create a table running from s_min to s_max in increments of 0.1, and bin the scores. This table will have the following n=(s_max-s_min)/0.1 rows:
Table 1: Matched-set score distribution
Value | Count | Frequency
s_min | c_1 = number of s_i equal to s_min | f_1 = c_1/numebt
s_min+0.1 | c_2 = number of s_i equal to s_min+0.1 | f_2 = c_2/numebt
s_min+0.2 | c_3 = number of s_i equal to s_min+0.2 | f_3 = c_3/numebt
... | ... | ...
s_max | c_n = number of s_i equal to s_max | f_n = c_n/numebt
Now, let j be the first index such that
f_1+f_2+...+f_j>fnr
The candidate office worker reexamination threshold is then
thresh_cl=s_min+(j-1)*0.1.
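The bootstrap calculation above can be sketched in Python as follows. This is a minimal illustration, not the worktable's implementation: the function name is hypothetical, the self-scores s_1 ... s_numebt are assumed to have been computed already, and scores are converted to integer tenths so that the 0.1 increments are exact.

```python
from collections import Counter

def candidate_review_threshold(self_scores, fnr=0.05):
    """Return s_min + (j - 1) * 0.1 for the first j with f_1 + ... + f_j > fnr."""
    if not self_scores:
        raise ValueError("at least one self-score is required")
    tenths = [round(s * 10) for s in self_scores]  # scores in units of 0.1
    counts = Counter(tenths)                       # c_k for each table row
    lowest, highest = min(tenths), max(tenths)
    cumulative = 0
    for j, t in enumerate(range(lowest, highest + 1), start=1):
        cumulative += counts[t]
        # Compare counts instead of frequencies: sum(c) > fnr * numebt.
        if cumulative > fnr * len(tenths):
            return (lowest + (j - 1)) / 10
    return highest / 10
```

For instance, if 10 of 100 sampled members score 10.0 against themselves and the rest score 12.0, the first row already holds more than 5% of the sample, so the candidate threshold is 10.0; if only 1 of 100 scores 10.0, the cumulative frequency does not exceed 0.05 until the 12.0 row.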
In the embodiments disclosed herein, the configuration tools described above are combined with analysis tools for analyzing aspects of the configuration such as groupings and entities. These tools can evaluate the configuration and help find errors and potential performance problems related to it. Specifically, these tools can help the user configure the hinge seamlessly and confirm the correctness of the configuration.
With reference to Fig. 10A and Fig. 10B, some embodiments of worktable 20 can include an analysis view function that implements analysis tools 430. The analysis view function can provide users who configure the hinge with a set of query tools for analyzing a hinge configuration. To supply the data for analysis, the analysis view needs to be associated with a hinge instance. Fig. 10A shows screenshot 100a of an embodiment of user interface 50, in which the hinge is selected as the analysis source for project demo 81, and hinge configuration 71, member type "person" 74, and entity type id 84 are selected for analysis. As shown in Fig. 10A, analysis data can be saved to a snapshot by selecting the "Save analysis data to snapshot" option and providing a name in the analysis id field. In certain embodiments, snapshots are saved in XML format in a "Snapshots" folder in the navigator view. In certain embodiments, with reference to Fig. 4, snapshots can be stored locally on computer-readable storage medium 56 of computer 40. By saving data to snapshots, the user can compare analysis data from before and after a configuration change, or analysis data from different points in time. Multiple copies of the same query can be saved in a single snapshot, provided their input parameters differ.
Fig. 10B shows screenshot 100b of an embodiment of user interface 50, in which a snapshot is selected as the analysis source for project Alpha and main_hub_Bucket3-10-08 is selected from the available snapshots. In this example, member type "person" 74 and entity type id 84 are selected for analysis. When the analysis view has a data source associated with it, the user can load one or more queries and view the results. Each query displays a particular set of data. In certain embodiments, the available queries are classified into data analysis, entity analysis, grouping analysis, and link analysis types.
Fig. 11 shows a flowchart of an embodiment of a method for analyzing an identity hinge configuration. As mentioned above, the tools in embodiments of worktable 20 are combined so that they can help a user configure an instance of identity hinge 32 seamlessly and confirm the correctness of the configuration in real time. Accordingly, the method steps shown in Fig. 11 are intended to illustrate an example process and should in no way be construed as limiting. For example, once member pairs have been sampled, comparison data and grouping data (derivative data) have been created, weights have been determined, and appropriate AutoLink (AL) and office worker reexamination (CR) thresholds have been determined, some analysis of the groupings, such as grouping size and grouping distribution, can be carried out early. This early analysis can help identify data anomalies at an early stage. Therefore, not all of the steps in Fig. 11 are necessary, and some embodiments of a method for analyzing a system for matching records can include one or more of the steps of Fig. 11. In addition, the steps in Fig. 11 need not be performed in a particular order. For example, as part of the weight generation process (step 103), a set of recommended thresholds (candidate thresholds) can be generated. At that point, the user can perform threshold analysis (step 107) and view the estimated false positive and false negative rates for a range of thresholds. After setting the thresholds and completing a (possibly final) cross-match, the user can review the entities (step 105) for possible errors (missing values, etc.). If the hinge is selected as the analysis source, the user can use entity analysis tool 432 from worktable 20 to view the distribution of entity sizes and to examine the member data of suspicious entities in depth to help identify errors. A report of entity sizes can be saved to disk (for example, computer-readable storage medium 56) for comparison after further tuning has been carried out.
The analysis tasks described above can be completed as the project nears its end or while other parts of the process are still being carried out. For example, in some cases, configuration tasks may still need to be completed through config editor 410 in worktable 20, such as configuring applications, setting up users/groups, creating composite views, and so on. After the necessary changes are made, they need to be deployed to the running server like other configuration data. When the project ends, a report about the configuration can be generated; this report can later be used to check the health of the system and to determine whether tuning effort may be needed to return the system to optimal performance. In addition, once the configuration is complete, it can easily be redeployed to other servers (test, production, etc.). After the configuration is deployed to a new server, the user can run the "Generate all configuration data" task on computer 40 to create the derivative data and run the necessary comparison and linking processes on the new server.
Returning to Fig. 11, as an example, one embodiment of a method for analyzing an identity hinge can include analyzing the attribute validity of a set of data records (step 101) using data analysis tool 434. In one embodiment, a method for analyzing an identity hinge can include analyzing entities (step 105) using entity analysis tool 432. In one embodiment, these entities are classified as having a particular entity type in identity hinge 32. In certain embodiments, analyzing the entities may involve analyzing the entity size distribution, analyzing entities by size, analyzing entities by composition, analyzing the score distribution associated with the entities, analyzing comparisons of the members associated with the entities, or combinations thereof. In certain embodiments, after analyzing the entities, the user may wish to run algorithm editor 420 and modify the algorithm associated with the entity type and/or change one or more parameter values in one or more of the algorithm components described above (step 102). In certain embodiments, such a modification or change can trigger a change in the grouping strategy, and new weights can be generated automatically through weight generation (step 103). The user may therefore wish to run grouping analysis tool 436 to review and analyze the groupings and the related statistics (step 104). In certain embodiments, through grouping analysis tool 436 of worktable 20, the user can analyze the grouping size distribution, analyze groupings by size, analyze groupings by composition, analyze the bulk cross-match comparison distribution, analyze members (records) by grouping count, analyze member grouping values, analyze member grouping frequency, analyze the member comparison distribution, or combinations thereof. In certain embodiments, the user can run link analysis tool 438 (step 106) to analyze member duplicates and member overlaps with respect to the CR and AL thresholds currently in use (step 107). Analysis data can be saved (step 108) during or after any of the above steps.
Fig. 12A and Fig. 12B show screenshots 120a and 120b of an embodiment of entity analysis tool 432. Specifically, screenshot 120a of Fig. 12A shows the result of an entity composition query, in which column 121 lists the four members found (that is, entity 26 has four candidate data records linked together), column 122 lists the values of a particular attribute (social security number, SSN) associated with these members, column 123 lists the values of another attribute (sex) associated with these members, and so on. Screenshot 120b of Fig. 12B shows the result of a member comparison query that compares proband member 27 with the members of entity 26, in which column 124 lists the candidate records compared and column 125 lists their corresponding scores.
The entity composition query and the member comparison query shown in Fig. 12A and Fig. 12B are examples of queries that can be performed through entity analysis tool 432. In certain embodiments, the queries that can be performed through entity analysis tool 432 can include entities by size, entity composition, entity size distribution, member comparison, member entity frequency, member entity value, members by entity count, and score distribution.
Entities by size
This query provides the ability to query for entities that match a specified size range (number of members in the entity). Specifying 0 as the minimum or maximum size means no limit (no minimum or no maximum).
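The size-range rule, with 0 meaning "no limit", can be sketched as a small Python filter. The function name and the dictionary layout (entity id mapped to a list of member record ids) are assumptions for illustration.

```python
def entities_by_size(entities, min_size=0, max_size=0):
    """Return the entities whose member count lies in [min_size, max_size].

    A bound of 0 is treated as unlimited, mirroring the query described above.
    """
    return {
        entity_id: members
        for entity_id, members in entities.items()
        if (min_size == 0 or len(members) >= min_size)
        and (max_size == 0 or len(members) <= max_size)
    }
```

Calling `entities_by_size(entities, min_size=3)` would, under these assumptions, return every entity with three or more members, while leaving both bounds at 0 returns all entities.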
Entity composition
This query shows the contents of a specified entity. As illustrated in Fig. 12A, the resulting table lists the member record IDs and source IDs in the specified entity, along with each member's comparison data. The comparison data can be divided among the columns of the table by comparison role.
Entity size distribution
This query provides an overall view of all entities in the hinge with respect to their size. The result can be filtered to show only entities from checked sources. If an entity contains members from both checked and unchecked sources, the displayed entity size is the count of member records in the checked sources only.
Member comparison
This query provides a mechanism for comparing one member record with all members of a specified entity (see Fig. 12B) or with a specified set of members.
Member entity frequency
This query shows how frequently members occur in entities; that is, the number of members that are in one entity, the number of members that are in two entities, the number of members that are in three entities, and so on.
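A minimal Python sketch of this frequency count follows. The function name and the input layout (entity id mapped to a list of member record ids) are hypothetical and chosen only to show the counting.

```python
from collections import Counter

def member_entity_frequency(entities):
    """Map 'number of entities a member is in' to 'number of such members'."""
    # First count how many entities each member appears in.
    per_member = Counter(
        member for members in entities.values() for member in set(members)
    )
    # Then count how many members share each entity count.
    return dict(sorted(Counter(per_member.values()).items()))
```

For example, if member "b" appears in three entities while "a" and "c" each appear in one, the result is `{1: 2, 3: 1}`.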
Member entity value
This query shows the entities to which a member belongs.
Members by entity count
This query lists the members that fall within a specified range of entity counts (for example, all members in 3 or more entities). If no maximum is specified, a value of 0 is shown in the maximum number of entities field. Otherwise, the maximum number of entities must be greater than or equal to the minimum number of entities.
Score distribution
This query shows the distribution of scores for all record pairs in the system. In certain embodiments, entities with a single member or with more than two member records may not be included in the result. In certain embodiments, the number of pairs shown for each score can be the sum of all counts within the given score range. For example, an x-axis score of 27 can represent all pairs with scores between 26.1 and 27.0. The result can be filtered to show only entities from checked sources. If an entity contains members from both checked and unchecked sources, the displayed entity size is the count of member records in the checked sources only. If no result is shown for a particular link type, there may be no entities that satisfy the criteria for that link type and/or the selected set of sources.
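The score-range example (an x-axis value of 27 covering scores from 26.1 to 27.0) corresponds to rounding each score up to the next whole number. A minimal Python sketch, assuming scores are recorded to one decimal place and using a hypothetical function name:

```python
import math
from collections import Counter

def score_distribution(scores):
    """Count pairs per whole-number score, where bin N covers (N-1, N]."""
    # Round to one decimal first so float noise cannot push a score
    # such as 27.0 into the next bin.
    return Counter(math.ceil(round(score, 1)) for score in scores)
```

Under this rule, scores of 26.1 and 27.0 both land in bin 27, while 26.0 lands in bin 26.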
Fig. 13 shows screenshot 130 of an embodiment of data analysis tool 434. In one embodiment, data analysis tool 434 can provide an attribute validity query, as shown in Fig. 13.
Attribute validity
This query shows, for all sources combined and for each individual source, the percentage of records in which a value is present for each member type attribute. Attributes whose values are present in a high percentage of records should be considered potential candidates for use in the algorithm. In certain embodiments, the results can be sorted by attribute name by default. In certain embodiments, the results can be sorted by column. In certain embodiments, the sources can be filtered so that the resulting table lists the percentages for the member type records contained in the specified sources.
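The percentages this query reports can be sketched in Python as follows. The function name, the record layout (dictionaries with a `source` key), the `ALL` label for the combined figure, and the rule that `None` or an empty string counts as "no value" are all assumptions made for the illustration.

```python
from collections import defaultdict

def attribute_validity(records, attributes):
    """Percentage of records with a value present, per source and overall.

    Returns {scope: {attribute: percent}}, where scope is a source name
    or 'ALL' for every source combined.
    """
    totals = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for record in records:
        for scope in ("ALL", record["source"]):
            totals[scope] += 1
            for attr in attributes:
                if record.get(attr) not in (None, ""):
                    present[scope][attr] += 1
    return {
        scope: {attr: 100.0 * present[scope][attr] / n for attr in attributes}
        for scope, n in totals.items()
    }
```

An attribute that comes out near 100% across sources would, per the description above, be a natural candidate for use in the algorithm, while one that is mostly empty would contribute little.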
Fig. 14 shows screenshot 140 of an embodiment of grouping analysis tool 436. In certain embodiments, if the number of records in the hinge exceeds 2 million, grouping analysis queries will not run unless the data is prepared first. In certain embodiments, data preparation involves precomputing, from the original member and grouping data, a set of intermediate data that can be queried quickly. This data preparation can be carried out by running "grouping analysis preparation" through config editor 410. In some cases, preparing the data for 2 to 5 million records may take about 10 minutes, and preparing the data for 500 million records may take about 5 hours. These estimates can vary widely with different hardware and database configurations. If the member data is modified, the prepared data should also be recalculated to avoid seeing stale results.
Screenshot 140 shows the result of a grouping analysis overview query, which is one of several queries available through grouping analysis tool 436. In certain embodiments, the queries that can be performed through grouping analysis tool 436 can include grouping analysis overview, grouping composition, grouping size distribution, groupings by size, bulk cross-match comparison distribution, member grouping frequency, member grouping values, member comparison distribution, and members by grouping count.
Grouping analysis overview
This query provides overall information about the health of the hub's grouping strategy. As illustrated in Figure 14, in one embodiment, the top half of this view is populated with information such as the number of large groupings and the number of ungrouped members. Large groupings within a particular range and/or ungrouped members may be viewed by clicking the appropriate button. More specifically, clicking the view groupings button selects the groupings by size view and runs the query with the desired grouping size range. Clicking the view members button selects the members by grouping count view and runs the query to show the members that are not in any grouping. In this example, the bottom portion of the view shown in Figure 14 shows the hash values of the ten largest groupings, the grouping role that produced each grouping, and the grouping value of one of the members in each grouping. The grouping value may be the same for all members in the same grouping. Selecting a grouping hash and clicking the view grouping button runs the grouping composition query and populates that view with the members of the grouping selected by that hash code and the grouping values of those members.
Grouping composition
This query shows the contents of a specified grouping. The resulting table lists the memrecno of each member in the specified grouping, together with the grouping role and the grouping value of each member. The grouping values shown are the actual grouping values most recently computed from the member data in the database. If different grouping values are shown for the same grouping hash, this indicates a grouping hash collision. This would be considered an anomaly and may explain why some members that would not normally be compared with each other are compared. However, this condition is generally not considered harmful to the health of the system. In some embodiments, the view for this query may include a view member button and a view algorithm button. Selecting a row in the resulting table and clicking the view member button runs the member grouping values query to show the groupings of the selected member, and clicking the view algorithm button opens the algorithm editor 420 and selects the grouping role that created the specified grouping (see Fig. 9A).
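The hash-collision check described above amounts to looking for a single hash that maps to more than one distinct value. A minimal sketch follows; the row layout `(memrecno, hash, value)` and the function name are assumed for illustration and are not taken from the described system.

```python
from collections import defaultdict

def find_hash_collisions(rows):
    """rows: iterable of (memrecno, grouping_hash, grouping_value).

    Returns {hash: sorted distinct values} for hashes with more than one value.
    """
    values_by_hash = defaultdict(set)
    for _memrecno, ghash, gvalue in rows:
        values_by_hash[ghash].add(gvalue)
    return {h: sorted(v) for h, v in values_by_hash.items() if len(v) > 1}

# Hypothetical sample rows.
rows = [
    (1, 0xA1, "SMITH:JOHN"),
    (2, 0xA1, "SMITH:JOHN"),
    (3, 0xB2, "LEE:ANN"),
    (4, 0xB2, "ROY:BOB"),  # same hash, different value: a collision
]
collisions = find_hash_collisions(rows)
```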
Grouping size distribution
This query provides a complete view of all groupings in the hub in terms of their size. In some embodiments, large groupings are shown on the right side of this view, and a color indicator ranging from green (smaller groupings) through yellow (medium-sized groupings) to red (large groupings) is used to mark them. The data points in the graph of the grouping size distribution may follow a downward curve from left (smaller groupings) to right (larger groupings). Accordingly, a large number of data points on the right side of the grouping size distribution graph may be an area of concern and may indicate missing anonymous values, incorrect thresholds, data problems, and so on. In some embodiments, clicking a data point selects the groupings by size view and runs the query to show the groupings of that size. In some embodiments, holding down the Control key while clicking a data point runs the query to show groupings of that size and larger.
Groupings by size
This query provides the ability to query for groupings that match a specified size range (the number of members in the grouping). For example, specifying a value of 0 for the minimum or maximum size means no limit (no minimum or no maximum). In some embodiments, the resulting table may show the member count of the grouping, the grouping hash, the grouping role, and a sample grouping value taken from one of the members in the grouping. The grouping value may be the same for all members in any given grouping. The exception is a hash collision that causes different grouping values to have the same grouping hash. To check for this, the user may select a grouping and click the view grouping button to see all members of that grouping and their grouping values. If a frequency-based problem with a particular grouping role is identified (missing groupings, etc.), the algorithm editor 420 may be opened by selecting a row of the table and clicking the view algorithm button. This brings up the algorithm editor 420 and selects the particular grouping role that created the selected grouping (see Fig. 9A).
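The size distribution and the size-range query described above can be sketched as follows. The input layout (pairs of member number and hash) and all function names are hypothetical; only the convention that 0 means "no limit" is taken from the text.

```python
from collections import Counter

def grouping_sizes(memberships):
    """memberships: iterable of (memrecno, grouping_hash) -> {hash: member count}."""
    return Counter(ghash for _memrecno, ghash in memberships)

def size_distribution(sizes):
    """Return {size: number of groupings having that size}."""
    return Counter(sizes.values())

def groupings_by_size(sizes, min_size=0, max_size=0):
    """Select groupings within [min_size, max_size]; 0 means no limit on that side."""
    return {
        h: n for h, n in sizes.items()
        if (min_size == 0 or n >= min_size) and (max_size == 0 or n <= max_size)
    }

# Hypothetical sample data.
memberships = [(1, "h1"), (2, "h1"), (3, "h1"), (4, "h2"), (5, "h3"), (6, "h3")]
sizes = grouping_sizes(memberships)
distribution = size_distribution(sizes)
```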
Bulk cross-match comparison distribution
This query calculates the number of comparisons required to perform a bulk cross-match, given the maximum grouping size parameter (the grouping size limit) specified for the mpxcomp operation. This number of comparisons may then be used, together with the number of threads and the number of comparisons per thread per second, to estimate the approximate completion time of the bulk cross-match.
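The arithmetic behind such an estimate might look like the sketch below. It assumes, purely for illustration, that a grouping of n members contributes n(n-1)/2 pairwise comparisons and that groupings above the size limit are skipped. The text does not state how the described query counts pairs that share several groupings, so this figure should be read as an upper bound.

```python
def bxm_comparisons(grouping_sizes, max_size):
    """Pairwise comparisons over groupings no larger than max_size (assumed model)."""
    return sum(n * (n - 1) // 2 for n in grouping_sizes if n <= max_size)

def estimated_seconds(comparisons, threads, comparisons_per_thread_per_second):
    """Completion time = comparisons / (threads * per-thread rate)."""
    return comparisons / (threads * comparisons_per_thread_per_second)

# Hypothetical grouping sizes; the grouping of 5000 exceeds the limit and is skipped.
sizes = [2, 3, 10, 5000]
total = bxm_comparisons(sizes, max_size=1000)  # 1 + 3 + 45
seconds = estimated_seconds(total, threads=7, comparisons_per_thread_per_second=7)
```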
Member grouping frequency
This view answers the question "how many members are in 1 grouping, 2 groupings, 3 groupings, and so on?" in a form such as a bar chart. The data point at 0 on the x-axis shows the number of ungrouped members. In some embodiments, clicking a bar in the chart selects the members by grouping count view and runs the query to show the members that are in that number of groupings.
Member grouping values
This view shows which groupings a specified member is in. The resulting table shows the grouping hash, the grouping value, and the grouping role that produced each grouping. In some embodiments, selecting a grouping and clicking the view grouping button selects the grouping composition view and runs the query to show the composition of the grouping with the selected grouping hash. Clicking the view algorithm button opens the algorithm editor 420 and selects the grouping role responsible for creating that grouping (see Fig. 9A).
Member comparison distribution
This view shows the expected performance of the system in terms of the number of comparisons performed. That is: when a search is performed, how many actual comparisons will be made? As an example, the member comparison distribution graph may show that three comparisons are made on average. More specifically, in some embodiments, 1 in 10 will result in about 6 comparisons, 1 in 100 will result in 7.5 comparisons, and 1 in 1000 will result in about 8 comparisons. These figures are based on 20,000 randomly sampled members from the system. If there are fewer than 20,000 members in the system, all members are used. On average, a target member is compared with all members that share a grouping with that target member.
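One way to approximate such a distribution is sketched below. For each sampled member it counts the distinct other members that share at least one grouping with it. The data layout, the function name, and the use of a seeded random sample are assumptions of this sketch. Only the sample size of 20,000 and the fallback to all members are taken from the text.

```python
import random

def comparison_counts(memberships, sample_size=20000, seed=0):
    """memberships: iterable of (memrecno, grouping_hash).

    Returns {memrecno: number of distinct other members sharing a grouping}.
    """
    members_by_group = {}
    groups_by_member = {}
    for m, g in memberships:
        members_by_group.setdefault(g, set()).add(m)
        groups_by_member.setdefault(m, set()).add(g)
    members = sorted(groups_by_member)
    if len(members) > sample_size:
        # Sample only when the population exceeds the sample size.
        members = random.Random(seed).sample(members, sample_size)
    counts = {}
    for m in members:
        candidates = set().union(*(members_by_group[g] for g in groups_by_member[m]))
        counts[m] = len(candidates - {m})
    return counts

# Hypothetical sample data; member 3 belongs to two groupings.
memberships = [(1, "h1"), (2, "h1"), (3, "h1"), (3, "h2"), (4, "h2"), (5, "h3")]
counts = comparison_counts(memberships)
average = sum(counts.values()) / len(counts)
```

Members that belong to no grouping do not appear in `memberships` and are therefore not represented in this sketch.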
Members by grouping count
This view provides the ability to query for members based on the number of groupings in which they are included. In some embodiments, specifying both the minimum and maximum values as 0 returns all ungrouped members. For a minimum value greater than 0, a maximum value of 0 means no limit. In some embodiments, the resulting table shows the memrecno, the number of groupings the member is in, and the cmpd string for that member. In some embodiments, selecting a member and clicking the view member button selects the member grouping values view to show all the groupings in which that member appears.
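The frequency view and the count-based member query can be sketched together. The names and input layout are hypothetical; the handling of 0 for the minimum and maximum follows the convention stated above.

```python
from collections import Counter

def grouping_counts(all_members, memberships):
    """Return {memrecno: number of distinct groupings containing that member}."""
    seen = {m: set() for m in all_members}
    for memrecno, ghash in memberships:
        seen[memrecno].add(ghash)
    return {m: len(g) for m, g in seen.items()}

def grouping_frequency(counts):
    """Return {number of groupings: number of members}; key 0 is the ungrouped count."""
    return Counter(counts.values())

def members_by_grouping_count(counts, minimum=0, maximum=0):
    """minimum == maximum == 0 returns ungrouped members; otherwise a maximum
    of 0 means no upper limit."""
    if minimum == 0 and maximum == 0:
        return sorted(m for m, c in counts.items() if c == 0)
    return sorted(
        m for m, c in counts.items()
        if c >= minimum and (maximum == 0 or c <= maximum)
    )

# Hypothetical sample data; member 4 is in no grouping.
all_members = [1, 2, 3, 4]
memberships = [(1, "h1"), (1, "h2"), (2, "h1"), (3, "h3")]
counts = grouping_counts(all_members, memberships)
frequency = grouping_frequency(counts)
```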
Figure 15 shows a screenshot 150 of one embodiment of the linkage analysis tool 438. In some embodiments, the queries provided by the linkage analysis tool may include a member duplicates query and a member overlap query.
Member duplicates
This query shows various error rates relating to duplicate members (member records from the same source that are linked to the same entity). As illustrated in Figure 15, in one embodiment, the first four columns of the resulting table may show raw data from the hub database, broken down by source: the number of members, the number of entities, the number of duplicate sets, and the number of members in those duplicate sets. The last three columns may list various error rates that can be calculated from these values:
Record error rate - indicates how many records must be examined to identify the duplicates, or how many records give an incomplete view of a member.
Entity duplication rate - indicates how many members have duplicate records, or the probability that a random member has a duplicate record.
Record duplication rate - indicates how many records are duplicates, or the percentage of records that could be eliminated.
Member overlap
This query provides information about the number of overlaps in the hub. An overlap may exist when an entity has records from multiple sources. For example, if there is an entity with three records, each in a separate source system, then each source is considered to have two overlaps (A with B, A with C, and so on). In some embodiments, the resulting table may show the number of unique entities represented in the specified source and the percentage of all entities that are represented by records in that source. In some embodiments, the resulting table may also show the count and percentage of those entities that overlap with at least one other source (that is, entities having at least one record in another source). An entity that overlaps with multiple other sources is counted only once in the resulting table. In some embodiments, the resulting table may also show each combination of sources, source by source. For example, when the row and column sources are the same, the percentage for the count is 100%. When the row and column sources differ, however, the count represents the number of overlaps that exist between the row source system and the column source system. The percentage value therefore represents the percentage of entities in the row source that have an overlap in the column source.
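The source-by-source table described above can be sketched as a matrix keyed by (row source, column source). The input layout of `(entity_id, source)` pairs and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def overlap_table(records):
    """records: iterable of (entity_id, source).

    Returns (counts, percents), both keyed by (row_source, column_source).
    counts[(r, c)] is the number of entities with records in both r and c;
    percents[(r, c)] is that count as a percentage of the entities in r.
    """
    sources_by_entity = defaultdict(set)
    for entity, source in records:
        sources_by_entity[entity].add(source)
    counts = defaultdict(int)
    for sources in sources_by_entity.values():
        # Each entity is counted once per ordered pair of its sources,
        # including the diagonal (r, r).
        for row, col in product(sources, repeat=2):
            counts[(row, col)] += 1
    percents = {(r, c): 100.0 * n / counts[(r, r)] for (r, c), n in counts.items()}
    return dict(counts), percents

# Hypothetical data: e1 spans three sources, e3 spans two, e2 only one.
records = [
    ("e1", "A"), ("e1", "B"), ("e1", "C"),
    ("e2", "A"),
    ("e3", "A"), ("e3", "B"),
]
counts, percents = overlap_table(records)
```

In this sample, the diagonal entries are 100%, and two of the three entities in source A also appear in source B.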
Accordingly, in one embodiment, a method for analyzing an identity hub may include analyzing error rates associated with a set of data records. In one embodiment, these error rates may include a record error rate and a person error rate. In one embodiment, the record error rate for duplicates is the number of records in duplicate sets divided by the total number of records. It represents the chance that a record drawn at random from the file gives a fragmented view. In one embodiment, the person error rate is the number of distinct individuals with multiple records divided by the total number of individuals represented in the file. Consider the simple case of five records A, B, C, D, and E, where A, B, and C represent the same person. The record error rate is then 3/5 and the person error rate is 1/3 (the file represents three distinct persons, A-B-C, D, and E, and one of them has multiple records).
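The five-record example above can be reproduced directly. The sketch below computes the rate over records and the rate over persons from a mapping of records to persons. The mapping layout, the person labels, and the function name are assumed for illustration.

```python
from fractions import Fraction

def duplicate_error_rates(person_of):
    """person_of: {record_id: person_id}.

    Returns (rate over records, rate over persons) as exact fractions.
    """
    records_by_person = {}
    for record, person in person_of.items():
        records_by_person.setdefault(person, []).append(record)
    # Persons represented by more than one record form the duplicate sets.
    duplicated = [recs for recs in records_by_person.values() if len(recs) > 1]
    record_rate = Fraction(sum(len(recs) for recs in duplicated), len(person_of))
    person_rate = Fraction(len(duplicated), len(records_by_person))
    return record_rate, person_rate

# Records A, B, and C represent the same person; D and E are distinct persons.
person_of = {"A": "p1", "B": "p1", "C": "p1", "D": "p2", "E": "p3"}
record_rate, person_rate = duplicate_error_rates(person_of)
print(record_rate, person_rate)  # 3/5 1/3
```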
In one embodiment, the error rates may include false positive and false negative rates. In one embodiment, the error rates are associated with the clerical review (CR) and AutoLink (AL) thresholds. In one embodiment, the CR and AL thresholds represent the tolerance of the identity hub 32 for false positive and false negative rates when matching a set of data records. Accordingly, one embodiment of a method for analyzing an identity hub may include analyzing the clerical review threshold and the AutoLink threshold. Figure 16 shows a screenshot of one embodiment of a graphical user interface that may be used to analyze the error rates and thresholds associated with member records in an identity hub.
One approach to estimating these thresholds involves processing a sample of the linkages produced by a bulk cross-match, scoring them, fitting the scored results to a model curve of the hit rate, and using the resulting curve to obtain thresholds based on the desired error rates. There are several potential difficulties with this approach. First, it requires people to review and score several thousand linked pairs across a very wide range of scores. This leads to inevitable variation, because individuals interpret differently what is or is not a match. Second, the hit rate combines the duplication rate inherent in the data with the file size (if the data sample used has no duplicates, the hit rate is zero for all scores). Third, this process produces thresholds that apply to the cross-match and need to be converted to search or query error rates.
In some embodiments, the new estimation procedure described below may address these problems. One advantage of this new approach is that it can be applied initially based on the data profile, or based on a new set of statistics produced during automatic weight generation.
False positive rate (AutoLink threshold value)
A benefit of using a likelihood ratio for scoring is that there is a theoretical expression that can be used to approximate the false positive rate for a fixed threshold. This also means that, if done correctly, the probability that a match is a false match depends only on the score and not on the actual data.
Let the vector $\bar{x}$ represent the result of comparing two records. The likelihood ratio, or score, of this comparison is
$$\lambda(\bar{x}) = \frac{f_M(\bar{x})}{f_U(\bar{x})}$$
where $f_M(\bar{x})$ is the probability density of this comparison under the hypothesis that the records refer to the same object (person, business, etc.). That is, it is the probability of observing this result if we know that the records should match. Similarly, $f_U(\bar{x})$ is the probability density of observing this result when the records do not refer to the same object (that is, it is the probability that this set of comparison results occurs at random).
In some embodiments, the hub may link two records when the logarithm of this score is greater than a certain threshold $T$, so the false positive probability is the probability that the comparison score exceeds the threshold when the records do not refer to the same object. Mathematically, this is
$$P_U\big(\log(\lambda(\bar{x})) > T\big) = \int_{\{\bar{x} \,:\, \log(\lambda(\bar{x})) > T\}} f_U(\bar{x})$$
Now, on the set $\{\bar{x} : \log(\lambda(\bar{x})) > T\}$, with the logarithm taken to base 10,
$$10^{T} < \frac{f_M(\bar{x})}{f_U(\bar{x})}$$
so $f_U(\bar{x}) < 10^{-T} f_M(\bar{x})$.
Therefore, for a single comparison, the probability of a false positive is bounded as follows:
$$P_U\big(\log(\lambda(\bar{x})) > T\big) = \int_{\{\bar{x} \,:\, \log(\lambda(\bar{x})) > T\}} f_U(\bar{x})$$
$$< \int_{\{\bar{x} \,:\, \log(\lambda(\bar{x})) > T\}} 10^{-T} f_M(\bar{x})$$
$$< 10^{-T}$$
If the threshold is relatively large, a single search against a database containing $n$ records can be expected to behave like $n$ independent comparisons. This means that the probability that a single search of the database returns a false positive above the threshold is the same as the probability that the maximum of $n$ independent single comparisons is above the threshold. Let $\{s_1, s_2, \ldots, s_n\}$ denote the scores obtained by comparing a single record against all records in the database. For large $T$, the probability that a search produces a false positive can be expressed as
$$P_{fp} = 1 - P(s_1 < T,\, s_2 < T,\, \ldots,\, s_n < T)$$
$$= 1 - \prod_{i=1}^{n} P(s_i < T)$$
$$= 1 - P\big(\log(\lambda(\bar{x})) < T\big)^{n}$$
$$< 1 - \left(1 - 10^{-T}\right)^{n}$$
$$\approx 1 - e^{-n 10^{-T}}$$
This can be further simplified to
$$P_{fp} \approx 1 - e^{-n 10^{-T}}$$
$$\approx n\, 10^{-T}$$
where $10^{T}$ is large relative to $n$.
As an example, if a threshold of 11 is used for a database with 1,000,000 records, then
$$P_{fp} \approx 1000000 \times 10^{-11}$$
$$\approx 10^{-5}$$
In other words, 1 in 100,000 searches.
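The approximations above can be checked numerically. The sketch below evaluates the bound $1 - (1 - 10^{-T})^n$, the exponential form, and the linear form $n\,10^{-T}$ for the example values. The function name is hypothetical, and `math.log1p` and `math.expm1` are used only to avoid rounding loss with such small quantities.

```python
import math

def search_false_positive_bound(n, threshold):
    """Return (1 - (1 - 10^-T)^n, 1 - exp(-n 10^-T), n 10^-T)."""
    per_comparison = 10.0 ** (-threshold)
    # 1 - (1 - p)^n, computed as -expm1(n * log1p(-p)) for numerical stability.
    exact_bound = -math.expm1(n * math.log1p(-per_comparison))
    exponential = -math.expm1(-n * per_comparison)
    linear = n * per_comparison
    return exact_bound, exponential, linear

# Example from the text: threshold 11, database of 1,000,000 records.
exact, expo, linear = search_false_positive_bound(1_000_000, 11)
```

For these values all three expressions agree to within a few parts per million, which is why the linear form is adequate when $10^{T}$ is large relative to $n$.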
Improving the AutoLink threshold based on scored sample pairs
Once the sample pairs have been scored (assuming the sampling is uniform), a new AutoLink (AL) threshold can be calculated. The necessary information may include:
A file containing the scored pairs. This file may include the score of each pair and an indicator of whether the two records in the pair represent the same person (SP), do not represent the same person (NSP), or do not carry enough information to decide (NEI). The values may be assigned by the scoring program accordingly. For example, 1 means SP, 0 means NSP, and -1 means NEI.
A count, by score, of the total pairs produced by the BXM (if sources were filtered when the random pairs were generated, this is the count of all pairs in which both members are in the filtered sources).
The number of records in the database (if sources were filtered when the random pairs were generated, this is the count of the records in those sources).
In some embodiments, the first step is to take the uniform sample and obtain the percentages of NSP and SP by score. Only the NSP percentages are needed to update the AL threshold. The next step is to obtain the total number of pairs by score. This may be produced in the step that creates the sample pairs before manual evaluation. The next step is to compute the false positive probability as a function of score. For this, one needs to know the size of the database in order to normalize between the bulk cross-match rate and the query rate. For each score bin, the NSP probability is multiplied by the total number of pairs at that score, divided by the size of the database minus 1, and the result is multiplied by 2. If the resulting distribution is not smooth, a linearized exponential function may be fitted to the sample data. That is, find coefficients $a$ and $b$ such that the function $p = e^{a + bs}$ is a least squares fit to the sample data, where $s$ is the score.
The new AL threshold can be calculated from the fitted coefficients as
$$AL = \frac{\ln\!\left(\dfrac{-\,fprate \cdot b}{0.1 \cdot e^{a}}\right)}{b}$$
The false positive rate can be defined as a function of score using the following formula:
$$fprate = -\frac{0.1}{b}\, e^{a + b \cdot s}$$
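The normalization, fit, and threshold formulas above can be sketched as follows. The function names and sample numbers are hypothetical. The fit is an ordinary least squares line through the points (s, ln p), which is one common reading of a "linearized" exponential fit. It assumes that every p is positive and that the fitted slope b is negative.

```python
import math

def search_fp_probability(nsp_fraction, total_pairs, db_size):
    """Per-bin normalization: NSP fraction * pairs at the score / (db_size - 1) * 2."""
    return 2.0 * nsp_fraction * total_pairs / (db_size - 1)

def fit_exponential(scores, probs):
    """Least squares fit of ln(p) = a + b*s; returns (a, b)."""
    n = len(scores)
    logs = [math.log(p) for p in probs]
    mean_s = sum(scores) / n
    mean_l = sum(logs) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, logs))
    b = sxy / sxx
    a = mean_l - b * mean_s
    return a, b

def fp_rate(a, b, s):
    """fprate = -(0.1 / b) * exp(a + b*s)."""
    return -0.1 / b * math.exp(a + b * s)

def al_threshold(a, b, target_fp_rate):
    """AL = ln(-fprate * b / (0.1 * exp(a))) / b, the inverse of fp_rate."""
    return math.log(-target_fp_rate * b / (0.1 * math.exp(a))) / b

# Synthetic sample generated from known coefficients, to check the fit.
true_a, true_b = 2.0, -1.5
scores = [8.0, 9.0, 10.0, 11.0, 12.0]
probs = [math.exp(true_a + true_b * s) for s in scores]
a, b = fit_exponential(scores, probs)
al = al_threshold(a, b, target_fp_rate=1e-5)
```

Substituting the computed threshold back into `fp_rate` returns the target rate, which confirms that the two formulas are inverses of each other.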
Updating the clerical review threshold
Once an appropriate AutoLink threshold has been determined, an estimate of the number of tasks can be determined as a function of the clerical review (CR) threshold. This may be obtained by summing the pair counts by score up to the AutoLink threshold. The user may adjust the CR threshold to produce a fixed number of tasks. Figure 17 illustrates the relationship between system performance and the tolerance for false positive and false negative rates associated with linking member records in an identity hub. In the example of Figure 17, the AL and CR thresholds produce 12 clerical review tasks.
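A sketch of the task estimate follows. It assumes that review tasks are the pairs whose scores fall at or above the lower review threshold and below the upper linking threshold. The text only says that counts by score are summed up to the upper threshold, so this interval is an assumption, as are the function names and the sample counts.

```python
def review_task_count(pair_counts_by_score, cr_threshold, al_threshold):
    """Sum pair counts for score bins in [cr_threshold, al_threshold)."""
    return sum(
        count for score, count in pair_counts_by_score.items()
        if cr_threshold <= score < al_threshold
    )

def cr_threshold_for_tasks(pair_counts_by_score, al_threshold, max_tasks):
    """Lower the review threshold bin by bin while the task count stays
    at or below max_tasks. Returns (threshold, task count)."""
    total = 0
    best = al_threshold
    below_al = sorted((s for s in pair_counts_by_score if s < al_threshold), reverse=True)
    for score in below_al:
        if total + pair_counts_by_score[score] > max_tasks:
            break
        total += pair_counts_by_score[score]
        best = score
    return best, total

# Hypothetical pair counts per score bin.
pair_counts = {7.0: 20, 8.0: 7, 9.0: 5, 10.0: 3, 11.0: 1}
tasks = review_task_count(pair_counts, cr_threshold=8.0, al_threshold=10.0)
threshold, budgeted = cr_threshold_for_tasks(pair_counts, al_threshold=10.0, max_tasks=12)
```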
In the foregoing description, the disclosure has been described with reference to specific embodiments. It should be understood, however, that this description is given by way of example only and is not to be construed in a limiting sense. It should therefore be further understood that numerous changes in the details of the embodiments of this disclosure, and additional embodiments of this disclosure, will be apparent to, and may be made by, persons of ordinary skill in the art having reference to this description. All such changes and additional embodiments are contemplated to be within the scope of the disclosure as detailed in the claims.

Claims (20)

1. A method for analyzing a system for matching data records, comprising:
generating a configuration of the system from an initial set of data records;
analyzing groupings created based on the initial set of data records or a subset thereof according to a grouping strategy associated with the configuration of the system, wherein each grouping comprises data records associated with an entity;
analyzing the effect of the groupings on the performance of the system; and
changing the grouping strategy accordingly so as to change the linking of data records to entities.
2. The method according to claim 1, wherein changing the grouping strategy further comprises editing an algorithm used in creating the groupings or changing one or more parameter values associated with the algorithm.
3. The method according to claim 2, wherein the algorithm is associated with an entity type.
4. The method according to claim 3, further comprising analyzing entities in the system that are classified as having the entity type.
5. The method according to claim 4, wherein analyzing the entities further comprises analyzing an entity size distribution, analyzing the entities by size, analyzing the entities by composition, analyzing a score distribution associated with the entities, analyzing member comparisons associated with the entities, or a combination thereof.
6. The method according to claim 1, wherein generating the configuration of the system from the initial set of data records further comprises analyzing the initial set of data records.
7. The method according to claim 6, wherein analyzing the initial set of data records further comprises analyzing the validity of attributes of the initial set of data records.
8. The method according to claim 1, wherein analyzing the groupings further comprises analyzing statistics associated with the groupings, analyzing a grouping size distribution, analyzing the groupings by size, analyzing the groupings by composition, analyzing a bulk cross-match comparison distribution, analyzing members by grouping count, analyzing member grouping values, analyzing member grouping frequency, analyzing a member comparison distribution, or a combination thereof.
9. The method according to claim 1, further comprising analyzing error rates associated with the initial set of data records, wherein the error rates comprise a record error rate and a person error rate.
10. The method according to claim 1, wherein the configuration of the system comprises a clerical review threshold and an AutoLink threshold, wherein the clerical review threshold and the AutoLink threshold represent the tolerance of the system for false positive and false negative rates when matching the initial set of data records, and wherein the method further comprises analyzing the clerical review threshold and the AutoLink threshold.
11. A system for analyzing a system for matching data records, comprising:
means for generating a configuration of the system using an initial set of data records;
means for analyzing groupings created based on the initial set of data records or a subset thereof according to a grouping strategy associated with the configuration of the system, wherein each grouping comprises data records associated with an entity;
means for analyzing the groupings and the effect of the groupings on the performance of the system; and
means for enabling a user to change the grouping strategy so as to change the linking of data records to entities.
12. The system according to claim 11, wherein the means for enabling a user to change the grouping strategy so as to change the linking of data records to entities further comprises means for editing an algorithm used in creating the groupings or changing one or more parameter values associated with the algorithm.
13. The system according to claim 12, wherein the algorithm is associated with an entity type.
14. The system according to claim 13, further comprising means for analyzing entities in the system that are classified as having the entity type.
15. The system according to claim 14, wherein the means for analyzing entities in the system that are classified as having the entity type further comprises means for analyzing an entity size distribution, analyzing the entities by size, analyzing the entities by composition, analyzing a score distribution associated with the entities, analyzing member comparisons associated with the entities, or a combination thereof.
16. The system according to claim 11, wherein the means for generating a configuration of the system using an initial set of data records further comprises means for analyzing the initial set of data records.
17. The system according to claim 16, wherein the means for analyzing the initial set of data records further comprises means for analyzing the validity of attributes of the initial set of data records.
18. The system according to claim 11, wherein the means for analyzing the groupings and the effect of the groupings on the performance of the system further comprises means for analyzing statistics associated with the groupings, analyzing a grouping size distribution, analyzing the groupings by size, analyzing the groupings by composition, analyzing a bulk cross-match comparison distribution, analyzing members by grouping count, analyzing member grouping values, analyzing member grouping frequency, analyzing a member comparison distribution, or a combination thereof.
19. The system according to claim 11, further comprising means for analyzing error rates associated with the initial set of data records, wherein the error rates comprise a record error rate and a person error rate.
20. The system according to claim 11, wherein the configuration of the system comprises a clerical review threshold and an AutoLink threshold, wherein the clerical review threshold and the AutoLink threshold represent the tolerance of the system for false positive and false negative rates when matching the initial set of data records, and wherein the system further analyzes the clerical review threshold and the AutoLink threshold.
CN200880117086.9A 2007-09-28 2008-09-26 Method and system for analysis of system for matching data records Active CN101878461B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US99703807P 2007-09-28 2007-09-28
US60/997,038 2007-09-28
PCT/US2008/077985 WO2009042941A1 (en) 2007-09-28 2008-09-26 Method and system for analysis of a system for matching data records

Publications (2)

Publication Number Publication Date
CN101878461A CN101878461A (en) 2010-11-03
CN101878461B true CN101878461B (en) 2014-03-12

Family

ID=40509776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200880117086.9A Active CN101878461B (en) 2007-09-28 2008-09-26 Method and system for analysis of system for matching data records

Country Status (8)

Country Link
US (2) US8799282B2 (en)
EP (1) EP2193415A4 (en)
JP (1) JP5306360B2 (en)
CN (1) CN101878461B (en)
AU (1) AU2008304265B2 (en)
BR (1) BRPI0817507B1 (en)
CA (1) CA2701046C (en)
WO (1) WO2009042941A1 (en)

Families Citing this family (83)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7657540B1 (en) 2003-02-04 2010-02-02 Seisint, Inc. Method and system for linking and delinking data records
US7526486B2 (en) 2006-05-22 2009-04-28 Initiate Systems, Inc. Method and system for indexing information about entities with respect to hierarchies
WO2007143157A2 (en) 2006-06-02 2007-12-13 Initiate Systems, Inc. Automatic weight generation for probabilistic matching
US8356009B2 (en) 2006-09-15 2013-01-15 International Business Machines Corporation Implementation defined segments for relational database systems
US7685093B1 (en) 2006-09-15 2010-03-23 Initiate Systems, Inc. Method and system for comparing attributes such as business names
US7698268B1 (en) 2006-09-15 2010-04-13 Initiate Systems, Inc. Method and system for filtering false positives
US8359339B2 (en) 2007-02-05 2013-01-22 International Business Machines Corporation Graphical user interface for configuration of an algorithm for the matching of data records
US8515926B2 (en) * 2007-03-22 2013-08-20 International Business Machines Corporation Processing related data from information sources
US8423514B2 (en) 2007-03-29 2013-04-16 International Business Machines Corporation Service provisioning
WO2008121824A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for data exchange among data sources
WO2008121700A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for managing entities
WO2008121170A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for parsing languages
CN101652775B (en) * 2007-04-13 2012-09-19 Gvbb控股股份有限公司 Systems and methods for mapping logical and physical assets in a user interface
US20110010214A1 (en) * 2007-06-29 2011-01-13 Carruth J Scott Method and system for project management
US8713434B2 (en) 2007-09-28 2014-04-29 International Business Machines Corporation Indexing, relating and managing information about entities
CN101884039B (en) 2007-09-28 2013-07-10 国际商业机器公司 Method and system for associating data records in multiple languages
AU2008304265B2 (en) 2007-09-28 2013-03-14 International Business Machines Corporation Method and system for analysis of a system for matching data records
US8266168B2 (en) 2008-04-24 2012-09-11 Lexisnexis Risk & Information Analytics Group Inc. Database systems and methods for linking records and entity representations with sufficiently high confidence
US9009358B1 (en) * 2008-09-23 2015-04-14 Western Digital Technologies, Inc. Configuring a data storage device with a parameter file interlocked with configuration code
US8082228B2 (en) 2008-10-31 2011-12-20 Netapp, Inc. Remote office duplication
KR20150042866A (en) * 2008-12-02 2015-04-21 아브 이니티오 테크놀로지 엘엘시 Mapping instances of a dataset within a data management system
US9411859B2 (en) 2009-12-14 2016-08-09 Lexisnexis Risk Solutions Fl Inc External linking based on hierarchical level weightings
US8352460B2 (en) * 2010-03-29 2013-01-08 International Business Machines Corporation Multiple candidate selection in an entity resolution system
US8918393B2 (en) 2010-09-29 2014-12-23 International Business Machines Corporation Identifying a set of candidate entities for an identity record
US8843501B2 (en) 2011-02-18 2014-09-23 International Business Machines Corporation Typed relevance scores in an identity resolution system
US20120324236A1 (en) * 2011-06-16 2012-12-20 Microsoft Corporation Trusted Snapshot Generation
WO2013023302A1 (en) * 2011-08-16 2013-02-21 Cirba Inc. System and method for determining and visualizing efficiencies and risks in computing environments
US10810218B2 (en) 2011-10-14 2020-10-20 Transunion, Llc System and method for matching of database records based on similarities to search queries
US9171158B2 (en) 2011-12-12 2015-10-27 International Business Machines Corporation Dynamic anomaly, association and clustering detection
US9104678B1 (en) 2011-12-31 2015-08-11 Richard Michael Nemes Methods and apparatus for information storage and retrieval using a caching technique with probe-limited open-address hashing
US9262469B1 (en) 2012-04-23 2016-02-16 Monsanto Technology Llc Intelligent data integration system
US9372903B1 (en) 2012-06-05 2016-06-21 Monsanto Technology Llc Data lineage in an intelligent data integration system
US20140129615A1 (en) * 2012-11-05 2014-05-08 Timest Ltd. System for automated data measurement and analysis
US9251133B2 (en) * 2012-12-12 2016-02-02 International Business Machines Corporation Approximate named-entity extraction
JP5971115B2 (en) * 2012-12-26 2016-08-17 富士通株式会社 Information processing program, information processing method and apparatus
US9336234B2 (en) * 2013-02-22 2016-05-10 Adobe Systems Incorporated Online content management system with undo and redo operations
US9485309B2 (en) * 2013-03-14 2016-11-01 Red Hat, Inc. Optimal fair distribution among buckets of different capacities
US10593003B2 (en) * 2013-03-14 2020-03-17 Securiport Llc Systems, methods and apparatuses for identifying person of interest
US10671629B1 (en) 2013-03-14 2020-06-02 Monsanto Technology Llc Intelligent data integration system with data lineage and visual rendering
US10803102B1 (en) * 2013-04-30 2020-10-13 Walmart Apollo, Llc Methods and systems for comparing customer records
US9767127B2 (en) 2013-05-02 2017-09-19 Outseeker Corp. Method for record linkage from multiple sources
US20130311233A1 (en) * 2013-05-13 2013-11-21 Twenga SA Method for predicting revenue to be generated by a webpage comprising a list of items having common properties
US9792658B1 (en) * 2013-06-27 2017-10-17 EMC IP Holding Company LLC HEALTHBOOK analysis
US9477934B2 (en) 2013-07-16 2016-10-25 Sap Portals Israel Ltd. Enterprise collaboration content governance framework
US10026114B2 (en) * 2014-01-10 2018-07-17 Betterdoctor, Inc. System for clustering and aggregating data from multiple sources
US9852049B2 (en) * 2014-05-27 2017-12-26 International Business Machines Corporation Screenshot validation testing
US10410225B1 (en) * 2014-06-30 2019-09-10 Groupon, Inc. Systems, apparatus, and methods of programmatically determining unique contacts based on crowdsourced error correction
US11593405B2 (en) * 2015-04-21 2023-02-28 International Business Machines Corporation Custodian disambiguation and data matching
US10474724B1 (en) * 2015-09-18 2019-11-12 Mpulse Mobile, Inc. Mobile content attribute recommendation engine
US10585893B2 (en) * 2016-03-30 2020-03-10 International Business Machines Corporation Data processing
US20190147988A1 (en) * 2016-04-19 2019-05-16 Koninklijke Philips N.V. Hospital matching of de-identified healthcare databases without obvious quasi-identifiers
US10452627B2 (en) 2016-06-02 2019-10-22 International Business Machines Corporation Column weight calculation for data deduplication
US10558669B2 (en) * 2016-07-22 2020-02-11 National Student Clearinghouse Record matching system
US10671626B2 (en) * 2016-09-27 2020-06-02 Salesforce.Com, Inc. Identity consolidation in heterogeneous data environment
US10621492B2 (en) 2016-10-21 2020-04-14 International Business Machines Corporation Multiple record linkage algorithm selector
US10061939B1 (en) * 2017-03-03 2018-08-28 Microsoft Technology Licensing, Llc Computing confidential data insight histograms and combining with smoothed posterior distribution based histograms
US10936572B2 (en) * 2017-05-17 2021-03-02 Change Healthcare Holdings, Llc Method, apparatus, and computer program product for improved tracking of state data
RU2667608C1 (en) * 2017-08-14 2018-09-21 Иван Александрович Баранов Method of ensuring the integrity of data
US11182394B2 (en) 2017-10-30 2021-11-23 Bank Of America Corporation Performing database file management using statistics maintenance and column similarity
US20230177029A1 (en) * 2017-11-13 2023-06-08 American Express Travel Related Services Company, Inc. Detecting and updating duplicate data records
US11341138B2 (en) * 2017-12-06 2022-05-24 International Business Machines Corporation Method and system for query performance prediction
US20190179924A1 (en) * 2017-12-08 2019-06-13 MHI Analytics, LLC Workflow automation on remote status updates for database records
CN108491460A (en) * 2018-03-05 2018-09-04 Beijing Institute for Cancer Research Personally identifiable information matching process, device, storage medium and computer equipment
US11556710B2 (en) * 2018-05-11 2023-01-17 International Business Machines Corporation Processing entity groups to generate analytics
US10936665B2 (en) 2018-08-09 2021-03-02 Sap Se Graphical match policy for identifying duplicative data
US11036479B2 (en) * 2018-08-27 2021-06-15 Georgia Tech Research Corporation Devices, systems, and methods of program identification, isolation, and profile attachment
US10901979B2 (en) * 2018-08-29 2021-01-26 International Business Machines Corporation Generating responses to queries based on selected value assignments
US11157528B2 (en) 2019-04-17 2021-10-26 International Business Machines Corporation Dependency-driven workflow management
US11256770B2 (en) * 2019-05-01 2022-02-22 Go Daddy Operating Company, LLC Data-driven online business name generator
US11397715B2 (en) 2019-07-31 2022-07-26 International Business Machines Corporation Defining indexing fields for matching data entities
US11409772B2 (en) 2019-08-05 2022-08-09 International Business Machines Corporation Active learning for data matching
US11663275B2 (en) 2019-08-05 2023-05-30 International Business Machines Corporation Method for dynamic data blocking in a database system
US11687828B2 (en) 2019-10-11 2023-06-27 International Business Machines Corporation Auto-tuning of comparison functions
US20220035777A1 (en) * 2020-07-29 2022-02-03 International Business Machines Corporation Pair selection for entity resolution analysis
US12038979B2 (en) * 2020-11-25 2024-07-16 International Business Machines Corporation Metadata indexing for information management using both data records and associated metadata records
US12026967B2 (en) 2020-12-31 2024-07-02 Securiport Llc Travel document validation using artificial intelligence and unsupervised learning
US20220374401A1 (en) * 2021-05-18 2022-11-24 International Business Machines Corporation Determining domain and matching algorithms for data systems
US11860909B2 (en) 2021-08-23 2024-01-02 U.S. Bancorp, National Association Entity household generation based on attribute-pair matching
US12105689B2 (en) 2021-08-23 2024-10-01 U.S. Bancorp, National Association Managing hierarchical data structures for entity matching
US11995733B2 (en) 2021-09-17 2024-05-28 Motorola Solutions, Inc. Method and system for linking unsolicited electronic tips to public-safety data
US20230418877A1 (en) * 2022-06-24 2023-12-28 International Business Machines Corporation Dynamic Threshold-Based Records Linking
US12026136B2 (en) * 2022-11-14 2024-07-02 Sap Se Consolidating resource data from disparate data sources
US20240411814A1 (en) * 2023-06-06 2024-12-12 International Business Machines Corporation Dynamic electronic data record resolution in electronic data environments

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460045B1 (en) * 1999-03-15 2002-10-01 Microsoft Corporation Self-tuning histogram and database modeling

Family Cites Families (252)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3568155A (en) * 1967-04-10 1971-03-02 Ibm Method of storing and retrieving records
US4531186A (en) 1983-01-21 1985-07-23 International Business Machines Corporation User friendly data base access
US5020019A (en) 1989-05-29 1991-05-28 Ricoh Company, Ltd. Document retrieval system
JPH03129472A (en) 1989-07-31 1991-06-03 Ricoh Co Ltd Processing method for document retrieving device
US5134564A (en) 1989-10-19 1992-07-28 Dunn Eric C W Computer aided reconfiliation method and apparatus
AU631276B2 (en) 1989-12-22 1992-11-19 Bull Hn Information Systems Inc. Name resolution in a directory database
US5321833A (en) 1990-08-29 1994-06-14 Gte Laboratories Incorporated Adaptive ranking system for information retrieval
JPH04111121A (en) * 1990-08-31 1992-04-13 Fujitsu Ltd Field-specific dictionary generation device, machine translation device, and machine translation system using these devices
US5247437A (en) 1990-10-01 1993-09-21 Xerox Corporation Method of managing index entries during creation revision and assembly of documents
US5555409A (en) 1990-12-04 1996-09-10 Applied Technical Systems, Inc. Data management systems and methods including creation of composite views of data
US5455903A (en) * 1991-05-31 1995-10-03 Edify Corp. Object oriented customer information exchange system and method
US5381332A (en) * 1991-12-09 1995-01-10 Motorola, Inc. Project management system with automated schedule and cost integration
FR2688611A1 (en) * 1992-03-12 1993-09-17 Bull Sa USE OF A LANGUAGE WHICH TYPES RELATES TO THE CONTENT OF THE VARIABLES AND ALLOWS TO HANDLE COMPLEX CONSTRUCTIONS.
US5535322A (en) 1992-10-27 1996-07-09 International Business Machines Corporation Data processing system with improved work flow system and method
US5774887A (en) * 1992-11-18 1998-06-30 U S West Advanced Technologies, Inc. Customer service electronic form generating system
US5721850A (en) * 1993-01-15 1998-02-24 Quotron Systems, Inc. Method and means for navigating user interfaces which support a plurality of executing applications
US6496793B1 (en) 1993-04-21 2002-12-17 Borland Software Corporation System and methods for national language support with embedded locale-specific language driver identifiers
US5615367A (en) * 1993-05-25 1997-03-25 Borland International, Inc. System and methods including automatic linking of tables for improved relational database modeling with interface
US5537590A (en) 1993-08-05 1996-07-16 Amado; Armando Apparatus for applying analysis rules to data sets in a relational database to generate a database of diagnostic records linked to the data sets
US5442782A (en) 1993-08-13 1995-08-15 Peoplesoft, Inc. Providing information from a multilingual database of language-independent and language-dependent items
US5606690A (en) 1993-08-20 1997-02-25 Canon Inc. Non-literal textual search using fuzzy finite non-deterministic automata
EP0639814B1 (en) 1993-08-20 2000-06-14 Canon Kabushiki Kaisha Adaptive non-literal textual search apparatus and method
US5583763A (en) 1993-09-09 1996-12-10 Mni Interactive Method and apparatus for recommending selections based on preferences in a multi-user system
US5487141A (en) 1994-01-21 1996-01-23 Borland International, Inc. Development system with methods for visual inheritance and improved object reusability
US5848271A (en) 1994-03-14 1998-12-08 Dun & Bradstreet Software Services, Inc. Process and apparatus for controlling the work flow in a multi-user computing system
US5862322A (en) * 1994-03-14 1999-01-19 Dun & Bradstreet Software Services, Inc. Method and apparatus for facilitating customer service communications in a computing environment
US5497486A (en) * 1994-03-15 1996-03-05 Salvatore J. Stolfo Method of merging large databases in parallel
US5561794A (en) 1994-04-28 1996-10-01 The United States Of America As Represented By The Secretary Of The Navy Early commit optimistic projection-based computer database protocol
US5704018A (en) * 1994-05-09 1997-12-30 Microsoft Corporation Generating improved belief networks
US5710916A (en) * 1994-05-24 1998-01-20 Panasonic Technologies, Inc. Method and apparatus for similarity matching of handwritten data objects
US5675752A (en) 1994-09-15 1997-10-07 Sony Corporation Interactive applications generator for an interactive presentation environment
US5694593A (en) 1994-10-05 1997-12-02 Northeastern University Distributed computer database system and method
US5694594A (en) 1994-11-14 1997-12-02 Chang; Daniel System for linking hypermedia data objects in accordance with associations of source and destination data objects and similarity threshold without using keywords or link-defining terms
US5819264A (en) 1995-04-03 1998-10-06 Dtl Data Technologies Ltd. Associative search method with navigation for heterogeneous databases including an integration mechanism configured to combine schema-free data models such as a hyperbase
US5774661A (en) * 1995-04-18 1998-06-30 Network Imaging Corporation Rule engine interface for a visual workflow builder
US5675753A (en) 1995-04-24 1997-10-07 U.S. West Technologies, Inc. Method and system for presenting an electronic user-interface specification
US5774883A (en) * 1995-05-25 1998-06-30 Andersen; Lloyd R. Method for selecting a seller's most profitable financing program
US5790173A (en) 1995-07-20 1998-08-04 Bell Atlantic Network Services, Inc. Advanced intelligent network having digital entertainment terminal or the like interacting with integrated service control point
US5778370A (en) 1995-08-25 1998-07-07 Emerson; Mark L. Data village system
US5640553A (en) 1995-09-15 1997-06-17 Infonautics Corporation Relevance normalization for documents retrieved from an information retrieval system in response to a query
US5805702A (en) 1995-09-29 1998-09-08 Dallas Semiconductor Corporation Method, apparatus, and system for transferring units of value
US5809499A (en) 1995-10-20 1998-09-15 Pattern Discovery Software Systems, Ltd. Computational method for discovering patterns in data sets
US5893074A (en) * 1996-01-29 1999-04-06 California Institute Of Technology Network based task management
US5930768A (en) 1996-02-06 1999-07-27 Supersonic Boom, Inc. Method and system for remote user controlled manufacturing
US5963915A (en) 1996-02-21 1999-10-05 Infoseek Corporation Secure, convenient and efficient system and method of performing trans-internet purchase transactions
US5862325A (en) * 1996-02-29 1999-01-19 Intermind Corporation Computer-based communication system and method using metadata defining a control structure
US5835712A (en) 1996-05-03 1998-11-10 Webmate Technologies, Inc. Client-server system using embedded hypertext tags for application and database development
US5878043A (en) 1996-05-09 1999-03-02 Northern Telecom Limited ATM LAN emulation
US5859972A (en) * 1996-05-10 1999-01-12 The Board Of Trustees Of The University Of Illinois Multiple server repository and multiple server remote application virtual client computer
US5905496A (en) * 1996-07-03 1999-05-18 Sun Microsystems, Inc. Workflow product navigation system
US5765150A (en) * 1996-08-09 1998-06-09 Digital Equipment Corporation Method for statistically projecting the ranking of information
US5893110A (en) * 1996-08-16 1999-04-06 Silicon Graphics, Inc. Browser driven user interface to a media asset database
US6049847A (en) * 1996-09-16 2000-04-11 Corollary, Inc. System and method for maintaining memory coherency in a computer system having multiple system buses
US5787470A (en) 1996-10-18 1998-07-28 At&T Corp Inter-cache protocol for improved WEB performance
US5796393A (en) 1996-11-08 1998-08-18 Compuserve Incorporated System for intergrating an on-line service community with a foreign service
US5787431A (en) 1996-12-16 1998-07-28 Borland International, Inc. Database development system with methods for java-string reference lookups of column names
US5835912A (en) 1997-03-13 1998-11-10 The United States Of America As Represented By The National Security Agency Method of efficiency and flexibility storing, retrieving, and modifying data in any language representation
US6026433A (en) * 1997-03-17 2000-02-15 Silicon Graphics, Inc. Method of creating and editing a web site in a client-server environment using customizable web site templates
US6385600B1 (en) * 1997-04-03 2002-05-07 At&T Corp. System and method for searching on a computer using an evidence set
US5987422A (en) 1997-05-29 1999-11-16 Oracle Corporation Method for executing a procedure that requires input from a role
US5999937A (en) 1997-06-06 1999-12-07 Madison Information Technologies, Inc. System and method for converting data between data sets
US5991758A (en) 1997-06-06 1999-11-23 Madison Information Technologies, Inc. System and method for indexing information about entities from different information sources
AU740146B2 (en) 1997-06-16 2001-11-01 Telefonaktiebolaget Lm Ericsson (Publ) A telecommunications performance management system
US6014664A (en) 1997-08-29 2000-01-11 International Business Machines Corporation Method and apparatus for incorporating weights into data combinational rules
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US5960411A (en) 1997-09-12 1999-09-28 Amazon.Com, Inc. Method and system for placing a purchase order via a communications network
US6621505B1 (en) 1997-09-30 2003-09-16 Journee Software Corp. Dynamic process-based enterprise computing system and method
US6356931B2 (en) * 1997-10-06 2002-03-12 Sun Microsystems, Inc. Method and system for remotely browsing objects
US6134581A (en) 1997-10-06 2000-10-17 Sun Microsystems, Inc. Method and system for remotely browsing objects
US6108004A (en) 1997-10-21 2000-08-22 International Business Machines Corporation GUI guide for data mining
US6327611B1 (en) 1997-11-12 2001-12-04 Netscape Communications Corporation Electronic document routing system
US6297824B1 (en) 1997-11-26 2001-10-02 Xerox Corporation Interactive interface for viewing retrieval results
US6223145B1 (en) * 1997-11-26 2001-04-24 Xerox Corporation Interactive interface for specifying searches
US6807537B1 (en) 1997-12-04 2004-10-19 Microsoft Corporation Mixtures of Bayesian networks
US6016489A (en) * 1997-12-18 2000-01-18 Sun Microsystems, Inc. Method and apparatus for constructing stable iterators in a shared data collection
US6963871B1 (en) * 1998-03-25 2005-11-08 Language Analysis Systems, Inc. System and method for adaptive multi-cultural searching and matching of personal names
US6185608B1 (en) * 1998-06-12 2001-02-06 International Business Machines Corporation Caching dynamic web pages
US6742003B2 (en) 2001-04-30 2004-05-25 Microsoft Corporation Apparatus and accompanying methods for visualizing clusters of data and hierarchical cluster classifications
US6018742A (en) * 1998-07-07 2000-01-25 Perigis Corporation Constructing a bifurcated database of context-dependent and context-independent data items
US6470436B1 (en) 1998-12-01 2002-10-22 Fast-Chip, Inc. Eliminating memory fragmentation and garbage collection from the process of managing dynamically allocated memory
US6067549A (en) * 1998-12-11 2000-05-23 American Management Systems, Inc. System for managing regulated entities
US6298478B1 (en) 1998-12-31 2001-10-02 International Business Machines Corporation Technique for managing enterprise JavaBeans (™) which are the target of multiple concurrent and/or nested transactions
US6457065B1 (en) 1999-01-05 2002-09-24 International Business Machines Corporation Transaction-scoped replication for distributed object systems
US6311190B1 (en) 1999-02-02 2001-10-30 Harris Interactive Inc. System for conducting surveys in different languages over a network with survey voter registration
US6269373B1 (en) 1999-02-26 2001-07-31 International Business Machines Corporation Method and system for persisting beans as container-managed fields
US6374241B1 (en) * 1999-03-31 2002-04-16 Verizon Laboratories Inc. Data merging techniques
US7181459B2 (en) * 1999-05-04 2007-02-20 Iconfind, Inc. Method of coding, categorizing, and retrieving network pages and sites
US6662180B1 (en) 1999-05-12 2003-12-09 Matsushita Electric Industrial Co., Ltd. Method for searching in large databases of automatically recognized text
US6957186B1 (en) 1999-05-27 2005-10-18 Accenture Llp System method and article of manufacture for building, managing, and supporting various components of a system
JP2000348042A (en) 1999-06-03 2000-12-15 Fujitsu Ltd Integrated thesaurus creation device, modified thesaurus creation device, information collection type thesaurus creation device, integrated thesaurus creation program storage medium, modified thesaurus creation program storage medium, and information collection type thesaurus creation program storage medium
US6330569B1 (en) 1999-06-30 2001-12-11 Unisys Corp. Method for versioning a UML model in a repository in accordance with an updated XML representation of the UML model
US6633878B1 (en) 1999-07-30 2003-10-14 Accenture Llp Initializing an ecommerce database framework
US6389429B1 (en) * 1999-07-30 2002-05-14 Aprimo, Inc. System and method for generating a target database from one or more source databases
US6718535B1 (en) 1999-07-30 2004-04-06 Accenture Llp System, method and article of manufacture for an activity framework design in an e-commerce based environment
US6529892B1 (en) 1999-08-04 2003-03-04 Illinois, University Of Apparatus, method and product for multi-attribute drug comparison
US6842906B1 (en) 1999-08-31 2005-01-11 Accenture Llp System and method for a refreshable proxy pool in a communication services patterns environment
US6523019B1 (en) * 1999-09-21 2003-02-18 Choicemaker Technologies, Inc. Probabilistic record linkage model derived from training data
US6557100B1 (en) * 1999-10-21 2003-04-29 International Business Machines Corporation Fastpath redeployment of EJBs
US20020007284A1 (en) * 1999-12-01 2002-01-17 Schurenberg Kurt B. System and method for implementing a global master patient index
US6502099B1 (en) 1999-12-16 2002-12-31 International Business Machines Corporation Method and system for extending the functionality of an application
US6633992B1 (en) 1999-12-30 2003-10-14 Intel Corporation Generalized pre-charge clock circuit for pulsed domino gates
US20040220926A1 (en) 2000-01-03 2004-11-04 Interactual Technologies, Inc., A California Corp. Personalization services for entities from multiple sources
US6556983B1 (en) * 2000-01-12 2003-04-29 Microsoft Corporation Methods and apparatus for finding semantic information, such as usage logs, similar to a query using a pattern lattice data space
CA2332401A1 (en) 2000-02-10 2001-08-10 Dwl Incorporated Work-flow system for web-based applications
US7330845B2 (en) * 2000-02-17 2008-02-12 International Business Machines Corporation System, method and program product for providing navigational information for facilitating navigation and user socialization at web sites
JP2001236358A (en) 2000-02-23 2001-08-31 Ricoh Co Ltd Document search method and apparatus
US6449620B1 (en) 2000-03-02 2002-09-10 Nimble Technology, Inc. Method and apparatus for generating information pages using semi-structured data stored in a structured manner
US6757708B1 (en) * 2000-03-03 2004-06-29 International Business Machines Corporation Caching dynamic content
US6879944B1 (en) * 2000-03-07 2005-04-12 Microsoft Corporation Variational relevance vector machine
WO2001075679A1 (en) 2000-04-04 2001-10-11 Metamatrix, Inc. A system and method for accessing data in disparate information sources
US6704805B1 (en) * 2000-04-13 2004-03-09 International Business Machines Corporation EJB adaption of MQ integration in componetbroker
AU6263101A (en) 2000-05-26 2001-12-03 Tzunami Inc. Method and system for organizing objects according to information categories
US6633882B1 (en) 2000-06-29 2003-10-14 Microsoft Corporation Multi-dimensional database record compression utilizing optimized cluster models
US20020178360A1 (en) 2001-02-25 2002-11-28 Storymail, Inc. System and method for communicating a secure unidirectional response message
US6647383B1 (en) 2000-09-01 2003-11-11 Lucent Technologies Inc. System and method for providing interactive dialogue and iterative search functions to find information
US20020080187A1 (en) * 2000-10-02 2002-06-27 Lawton Scott S. Enhanced method and system for category selection
US7287089B1 (en) * 2000-10-25 2007-10-23 Thomson Financial Inc. Electronic commerce infrastructure system
US20020103920A1 (en) * 2000-11-21 2002-08-01 Berkun Ken Alan Interpretive stream metadata extraction
EP1211610A1 (en) 2000-11-29 2002-06-05 Lafayette Software Inc. Methods of organising data and processing queries in a database system
US20020073099A1 (en) * 2000-12-08 2002-06-13 Gilbert Eric S. De-identification and linkage of data records
US7406443B1 (en) 2000-12-18 2008-07-29 Powerloom Method and system for multi-dimensional trading
US7685224B2 (en) 2001-01-11 2010-03-23 Truelocal Inc. Method for providing an attribute bounded network of computers
US7487182B2 (en) 2001-01-23 2009-02-03 Conformia Software, Inc. Systems and methods for managing the development and manufacturing of a drug
SE520533C2 (en) 2001-03-13 2003-07-22 Picsearch Ab Method, computer programs and systems for indexing digitized devices
US6877111B2 (en) 2001-03-26 2005-04-05 Sun Microsystems, Inc. Method and apparatus for managing replicated and migration capable session state for a Java platform
US20030105825A1 (en) * 2001-05-01 2003-06-05 Profluent, Inc. Method and system for policy based management of messages for mobile data networks
US6510505B1 (en) * 2001-05-09 2003-01-21 International Business Machines Corporation System and method for allocating storage space using bit-parallel search of bitmap
US7089193B2 (en) 2001-05-09 2006-08-08 Prochain Solutions, Inc. Multiple project scheduling system
US7865427B2 (en) 2001-05-30 2011-01-04 Cybersource Corporation Method and apparatus for evaluating fraud risk in an electronic commerce transaction
US7007039B2 (en) * 2001-06-14 2006-02-28 Microsoft Corporation Method of building multidimensional workload-aware histograms
US6687702B2 (en) * 2001-06-15 2004-02-03 Sybase, Inc. Methodology providing high-speed shared memory access between database middle tier and database server
US7100147B2 (en) * 2001-06-28 2006-08-29 International Business Machines Corporation Method, system, and program for generating a workflow
US7069536B2 (en) * 2001-06-28 2006-06-27 International Business Machines Corporation Method, system, and program for executing a workflow
US7047535B2 (en) * 2001-07-30 2006-05-16 International Business Machines Corporation Method, system, and program for performing workflow related operations using an application programming interface
WO2003021480A1 (en) * 2001-09-04 2003-03-13 International Limited Database management system
US6912549B2 (en) 2001-09-05 2005-06-28 Siemens Medical Solutions Health Services Corporation System for processing and consolidating records
US6922695B2 (en) * 2001-09-06 2005-07-26 Initiate Systems, Inc. System and method for dynamically securing dynamic-multi-sourced persisted EJBS
US6996565B2 (en) * 2001-09-06 2006-02-07 Initiate Systems, Inc. System and method for dynamically mapping dynamic multi-sourced persisted EJBs
US7249131B2 (en) * 2001-09-06 2007-07-24 Initiate Systems, Inc. System and method for dynamically caching dynamic multi-sourced persisted EJBs
US7035809B2 (en) * 2001-12-07 2006-04-25 Accenture Global Services Gmbh Accelerated process improvement framework
US6907422B1 (en) * 2001-12-18 2005-06-14 Siebel Systems, Inc. Method and system for access and display of data from large data sets
US20030120630A1 (en) * 2001-12-20 2003-06-26 Daniel Tunkelang Method and system for similarity search and clustering
US20030149631A1 (en) * 2001-12-27 2003-08-07 Manugistics, Inc. System and method for order planning with attribute based planning
AU2003207836A1 (en) 2002-02-04 2003-09-02 Cataphora, Inc. A method and apparatus for sociological data mining
US6829606B2 (en) * 2002-02-14 2004-12-07 Infoglide Software Corporation Similarity search engine for use with relational databases
US7031969B2 (en) 2002-02-20 2006-04-18 Lawrence Technologies, Llc System and method for identifying relationships between database records
US20030174179A1 (en) 2002-03-12 2003-09-18 Suermondt Henri Jacques Tool for visualizing data patterns of a hierarchical classification structure
US6970882B2 (en) 2002-04-04 2005-11-29 International Business Machines Corporation Unified relational database model for data mining selected model scoring results, model training results where selection is based on metadata included in mining model control table
US7287026B2 (en) 2002-04-05 2007-10-23 Oommen John B Method of comparing the closeness of a target tree to other trees using noisy sub-sequence tree processing
US7149730B2 (en) * 2002-05-03 2006-12-12 Ward Mullins Dynamic class inheritance and distributed caching with object relational mapping and cartesian model support in a database manipulation and mapping system
US20030220858A1 (en) 2002-05-24 2003-11-27 Duc Lam Method and system for collaborative vendor reconciliation
US7231395B2 (en) 2002-05-24 2007-06-12 Overture Services, Inc. Method and apparatus for categorizing and presenting documents of a distributed database
US20030227487A1 (en) 2002-06-01 2003-12-11 Hugh Harlan M. Method and apparatus for creating and accessing associative data structures under a shared model of categories, rules, triggers and data relationship permissions
US20040006500A1 (en) 2002-07-08 2004-01-08 Diego Guicciardi Method and apparatus for solution design, implementation, and support
US20040143477A1 (en) 2002-07-08 2004-07-22 Wolff Maryann Walsh Apparatus and methods for assisting with development management and/or deployment of products and services
US6795793B2 (en) 2002-07-19 2004-09-21 Med-Ed Innovations, Inc. Method and apparatus for evaluating data and implementing training based on the evaluation of the data
WO2004023345A1 (en) 2002-09-04 2004-03-18 Journee Software Corporation System and method for dynamically mapping dynamic multi-sourced persisted ejbs
AU2002332913A1 (en) 2002-09-05 2004-03-29 Journee Software Corporation System and method for dynamically securing dynamic multi-sourced persisted ejbs
WO2004023311A1 (en) 2002-09-05 2004-03-18 Journee Software Corporation System and method for dynamically caching dynamic multi-sourced persisted ejbs
US7043476B2 (en) * 2002-10-11 2006-05-09 International Business Machines Corporation Method and apparatus for data mining to discover associations and covariances associated with data
US7155427B1 (en) 2002-10-30 2006-12-26 Oracle International Corporation Configurable search tool for finding and scoring non-exact matches in a relational database
US20040107189A1 (en) * 2002-12-03 2004-06-03 Lockheed Martin Corporation System for identifying similarities in record fields
US20040107205A1 (en) * 2002-12-03 2004-06-03 Lockheed Martin Corporation Boolean rule-based system for clustering similar records
US7490085B2 (en) * 2002-12-18 2009-02-10 Ge Medical Systems Global Technology Company, Llc Computer-assisted data processing system and method incorporating automated learning
US8280894B2 (en) 2003-01-22 2012-10-02 Amazon Technologies, Inc. Method and system for maintaining item authority
US20040181526A1 (en) 2003-03-11 2004-09-16 Lockheed Martin Corporation Robust system for interactively learning a record similarity measurement
US7487173B2 (en) * 2003-05-22 2009-02-03 International Business Machines Corporation Self-generation of a data warehouse from an enterprise data model of an EAI/BPI infrastructure
US7296011B2 (en) 2003-06-20 2007-11-13 Microsoft Corporation Efficient fuzzy match for evaluating data records
WO2005003308A2 (en) * 2003-06-25 2005-01-13 Smithkline Beecham Corporation Biological data set comparison method
US7596778B2 (en) * 2003-07-03 2009-09-29 Parasoft Corporation Method and system for automatic error prevention for computer software
JP4451624B2 (en) * 2003-08-19 2010-04-14 富士通株式会社 Information system associating device and associating method
US20050228808A1 (en) 2003-08-27 2005-10-13 Ascential Software Corporation Real time data integration services for health care information data integration
US7739223B2 (en) * 2003-08-29 2010-06-15 Microsoft Corporation Mapping architecture for arbitrary data models
EP2261821B1 (en) * 2003-09-15 2022-12-07 Ab Initio Technology LLC Data profiling
US20050060286A1 (en) * 2003-09-15 2005-03-17 Microsoft Corporation Free text search within a relational database
US8825502B2 (en) * 2003-09-30 2014-09-02 Epic Systems Corporation System and method for providing patient record synchronization in a healthcare setting
US7685016B2 (en) * 2003-10-07 2010-03-23 International Business Machines Corporation Method and system for analyzing relationships between persons
US7249129B2 (en) 2003-12-29 2007-07-24 The Generations Network, Inc. Correlating genealogy records systems and methods
US7324998B2 (en) 2004-03-18 2008-01-29 Zd Acquisition, Llc Document search methods and systems
US9189568B2 (en) 2004-04-23 2015-11-17 Ebay Inc. Method and system to display and search in a language independent manner
CA2564307C (en) 2004-05-05 2015-04-28 Ims Health Incorporated Data record matching algorithms for longitudinal patient level databases
WO2005114381A2 (en) 2004-05-14 2005-12-01 Gt Software, Inc. Systems and methods for web service function, definition implementation and/or execution
US20050273452A1 (en) 2004-06-04 2005-12-08 Microsoft Corporation Matching database records
US7788274B1 (en) 2004-06-30 2010-08-31 Google Inc. Systems and methods for category-based search
US7567962B2 (en) 2004-08-13 2009-07-28 Microsoft Corporation Generating a labeled hierarchy of mutually disjoint categories from a set of query results
US7970639B2 (en) 2004-08-20 2011-06-28 Mark A Vucina Project management systems and methods
US20060044307A1 (en) 2004-08-24 2006-03-02 Kyuman Song System and method for visually representing project metrics on 3-dimensional building models
US8615731B2 (en) 2004-08-25 2013-12-24 Mohit Doshi System and method for automating the development of web services that incorporate business rules
US20060074836A1 (en) 2004-09-03 2006-04-06 Biowisdom Limited System and method for graphically displaying ontology data
US20060074832A1 (en) 2004-09-03 2006-04-06 Biowisdom Limited System and method for utilizing an upper ontology in the creation of one or more multi-relational ontologies
US20060053173A1 (en) * 2004-09-03 2006-03-09 Biowisdom Limited System and method for support of chemical data within multi-relational ontologies
US20060053382A1 (en) * 2004-09-03 2006-03-09 Biowisdom Limited System and method for facilitating user interaction with multi-relational ontologies
US7496593B2 (en) 2004-09-03 2009-02-24 Biowisdom Limited Creating a multi-relational ontology having a predetermined structure
US20060053172A1 (en) * 2004-09-03 2006-03-09 Biowisdom Limited System and method for creating, editing, and using multi-relational ontologies
US20060064429A1 (en) * 2004-09-18 2006-03-23 Chi Yao Method and apparatus for providing assets reports categorized by attribute
US8892571B2 (en) 2004-10-12 2014-11-18 International Business Machines Corporation Systems for associating records in healthcare database with individuals
US20060179050A1 (en) 2004-10-22 2006-08-10 Giang Phan H Probabilistic model for record linkage
US7844956B2 (en) 2004-11-24 2010-11-30 Rojer Alan S Object-oriented processing of markup
US20060116983A1 (en) * 2004-11-30 2006-06-01 International Business Machines Corporation System and method for ordering query results
US7539668B2 (en) * 2004-11-30 2009-05-26 International Business Machines Corporation System and method for sorting data records contained in a query result based on suitability score
US7707201B2 (en) 2004-12-06 2010-04-27 Yahoo! Inc. Systems and methods for managing and using multiple concept networks for assisted search processing
JP4687089B2 (en) * 2004-12-08 2011-05-25 日本電気株式会社 Duplicate record detection system and duplicate record detection program
US7509259B2 (en) 2004-12-21 2009-03-24 Motorola, Inc. Method of refining statistical pattern recognition models and statistical pattern recognizers
US7672971B2 (en) * 2006-02-17 2010-03-02 Google Inc. Modular architecture for entity normalization
US7689555B2 (en) 2005-01-14 2010-03-30 International Business Machines Corporation Context insensitive model entity searching
US20070073678A1 (en) 2005-09-23 2007-03-29 Applied Linguistics, Llc Semantic document profiling
US7739687B2 (en) 2005-02-28 2010-06-15 International Business Machines Corporation Application of attribute-set policies to managed resources in a distributed computing system
US20060195460A1 (en) 2005-02-28 2006-08-31 Microsoft Corporation Data model for object-relational data
JP2006277413A (en) 2005-03-29 2006-10-12 Toshiba Corp Document classification device and document classification method
US8095386B2 (en) 2005-05-03 2012-01-10 Medicity, Inc. System and method for using and maintaining a master matching index
US20060271549A1 (en) 2005-05-27 2006-11-30 Rayback Geoffrey P Method and apparatus for central master indexing
US20060287890A1 (en) 2005-06-15 2006-12-21 Vanderbilt University Method and apparatus for organizing and integrating structured and non-structured data across heterogeneous systems
US20070016450A1 (en) * 2005-07-14 2007-01-18 Krora, Llc Global health information system
US7672833B2 (en) * 2005-09-22 2010-03-02 Fair Isaac Corporation Method and apparatus for automatic entity disambiguation
US20070073745A1 (en) 2005-09-23 2007-03-29 Applied Linguistics, Llc Similarity metric for semantic profiling
WO2007048229A1 (en) * 2005-10-25 2007-05-03 Angoss Software Corporation Strategy trees for data mining
US20070150279A1 (en) 2005-12-27 2007-06-28 Oracle International Corporation Word matching with context sensitive character to sound correlating
US20070214179A1 (en) 2006-03-10 2007-09-13 Khanh Hoang Searching, filtering, creating, displaying, and managing entity relationships across multiple data hierarchies through a user interface
US8260785B2 (en) 2006-02-17 2012-09-04 Google Inc. Automatic object reference identification and linking in a browseable fact repository
US8700568B2 (en) 2006-02-17 2014-04-15 Google Inc. Entity normalization via name normalization
US7558737B2 (en) 2006-02-28 2009-07-07 Sap Ag Entity validation framework
US20070214129A1 (en) 2006-03-01 2007-09-13 Oracle International Corporation Flexible Authorization Model for Secure Search
US20070260492A1 (en) 2006-03-09 2007-11-08 Microsoft Corporation Master patient index
US7949186B2 (en) 2006-03-15 2011-05-24 Massachusetts Institute Of Technology Pyramid match kernel and related techniques
US7974984B2 (en) 2006-04-19 2011-07-05 Mobile Content Networks, Inc. Method and system for managing single and multiple taxonomies
US7542973B2 (en) 2006-05-01 2009-06-02 Sap, Aktiengesellschaft System and method for performing configurable matching of similar data in a data repository
US7526486B2 (en) 2006-05-22 2009-04-28 Initiate Systems, Inc. Method and system for indexing information about entities with respect to hierarchies
WO2007143157A2 (en) * 2006-06-02 2007-12-13 Initiate Systems, Inc. Automatic weight generation for probabilistic matching
US7548906B2 (en) * 2006-06-23 2009-06-16 Microsoft Corporation Bucket-based searching
US7792967B2 (en) * 2006-07-14 2010-09-07 Chacha Search, Inc. Method and system for sharing and accessing resources
US8010396B2 (en) 2006-08-10 2011-08-30 International Business Machines Corporation Method and system for validating tasks
JP4405500B2 (en) * 2006-12-08 2010-01-27 インターナショナル・ビジネス・マシーンズ・コーポレーション Evaluation method and apparatus for trend analysis system
US7627550B1 (en) 2006-09-15 2009-12-01 Initiate Systems, Inc. Method and system for comparing attributes such as personal names
US8356009B2 (en) * 2006-09-15 2013-01-15 International Business Machines Corporation Implementation defined segments for relational database systems
US7620647B2 (en) 2006-09-15 2009-11-17 Initiate Systems, Inc. Hierarchy global management system and user interface
US7698268B1 (en) * 2006-09-15 2010-04-13 Initiate Systems, Inc. Method and system for filtering false positives
US7685093B1 (en) * 2006-09-15 2010-03-23 Initiate Systems, Inc. Method and system for comparing attributes such as business names
US8359339B2 (en) * 2007-02-05 2013-01-22 International Business Machines Corporation Graphical user interface for configuration of an algorithm for the matching of data records
US20080201713A1 (en) 2007-02-16 2008-08-21 Pivotal Labs, Inc. Project Management System
US8515926B2 (en) * 2007-03-22 2013-08-20 International Business Machines Corporation Processing related data from information sources
WO2008121170A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for parsing languages
WO2008121824A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for data exchange among data sources
WO2008121700A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for managing entities
US8423514B2 (en) * 2007-03-29 2013-04-16 International Business Machines Corporation Service provisioning
US20080276221A1 (en) 2007-05-02 2008-11-06 Sap Ag. Method and apparatus for relations planning and validation
US20110010214A1 (en) * 2007-06-29 2011-01-13 Carruth J Scott Method and system for project management
US8713434B2 (en) 2007-09-28 2014-04-29 International Business Machines Corporation Indexing, relating and managing information about entities
AU2008304265B2 (en) 2007-09-28 2013-03-14 International Business Machines Corporation Method and system for analysis of a system for matching data records
CN101884039B (en) * 2007-09-28 2013-07-10 国际商业机器公司 Method and system for associating data records in multiple languages
US9058380B2 (en) 2012-02-06 2015-06-16 Fis Financial Compliance Solutions, Llc Methods and systems for list filtering based on known entity matching
US20140280274A1 (en) 2013-03-15 2014-09-18 Teradata Us, Inc. Probabilistic record linking
US9805081B2 (en) 2014-03-10 2017-10-31 Zephyr Health, Inc. Record linkage algorithm for multi-structured data

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6460045B1 (en) * 1999-03-15 2002-10-01 Microsoft Corporation Self-tuning histogram and database modeling

Also Published As

Publication number Publication date
CA2701046A1 (en) 2009-04-02
US20140281729A1 (en) 2014-09-18
CN101878461A (en) 2010-11-03
US8799282B2 (en) 2014-08-05
EP2193415A4 (en) 2013-08-28
AU2008304265A1 (en) 2009-04-02
BRPI0817507B1 (en) 2021-03-23
US20090089630A1 (en) 2009-04-02
BRPI0817507A2 (en) 2015-09-29
CA2701046C (en) 2016-07-19
JP5306360B2 (en) 2013-10-02
AU2008304265B2 (en) 2013-03-14
EP2193415A1 (en) 2010-06-09
WO2009042941A1 (en) 2009-04-02
US10698755B2 (en) 2020-06-30
JP2011503681A (en) 2011-01-27

Similar Documents

Publication Publication Date Title
CN101878461B (en) Method and system for analysis of system for matching data records
CN116097241B (en) Data preparation using semantic roles
Marco Building and managing the meta data repository
Stvilia et al. A framework for information quality assessment
US9123024B2 (en) System for analyzing security compliance requirements
US10346358B2 (en) Systems and methods for management of data platforms
US8341131B2 (en) Systems and methods for master data management using record and field based rules
US10504047B2 (en) Metadata-driven audit reporting system with dynamically created display names
US10095766B2 (en) Automated refinement and validation of data warehouse star schemas
US20050183002A1 (en) Data and metadata linking form mechanism and method
EP2396720A1 (en) Creation of a data store
US20160125025A1 (en) Most likely classification code
US11580479B2 (en) Master network techniques for a digital duplicate
Zuccala et al. Metric assessments of books as families of works
Fürber Data quality
WO2006036972A2 (en) Method for searching data elements on the web using a conceptual metadata and contextual metadata search engine
Serbout et al. From openapi fragments to api pattern primitives and design smells
Badovinac Defining data quality in bibliographic and authority records: a case study of the COBISS.SI system
Li Data quality and data cleaning in database applications
KR20210036613A (en) Data Standardization Management System
KR100796906B1 (en) Database quality control method
KR100792322B1 (en) Database Quality Management Framework
KR100796905B1 (en) Database Quality Management System
AU2014280991B2 (en) System for analyzing security compliance requirements
Beaver Survey Data Cleaning Guidelines: (SPSS and Stata)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: INTERNATIONAL BUSINESS MACHINES CORP.

Free format text: FORMER OWNER: INITIATE SYSTEMS INC.

Effective date: 20110223

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: ILLINOIS, USA TO: NEW YORK, USA

TA01 Transfer of patent application right

Effective date of registration: 20110223

Address after: New York, USA

Applicant after: International Business Machines Corp.

Address before: Illinois, USA

Applicant before: Initiate Systems Inc.

GR01 Patent grant