This application claims priority to U.S. Provisional Application No. 60/997,038, filed September 28, 2007, entitled "METHOD AND SYSTEM FOR ANALYSIS OF A SYSTEM FOR MATCHING DATA RECORDS," the entire content of which is incorporated herein by reference. This application also relates to the following U.S. patent applications: No. 12/056,720, filed March 27, 2008, entitled "METHOD AND SYSTEM FOR MANAGING ENTITIES"; No. 11/967,588, filed December 31, 2007, entitled "METHOD AND SYSTEM FOR PARSING LANGUAGES"; No. 11/904,750, filed September 28, 2007, entitled "METHOD AND SYSTEM FOR INDEXING, RELATING AND MANAGING INFORMATION ABOUT ENTITIES"; No. 11/901,040, filed September 14, 2007, entitled "HIERARCHY GLOBAL MANAGEMENT SYSTEM AND USER INTERFACE"; No. 11/900,769, filed September 13, 2007, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS"; No. 11/824,210, filed June 29, 2007, entitled "METHOD AND SYSTEM FOR PROJECT MANAGEMENT"; No. 11/809,792, filed June 1, 2007, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING"; No. 11/702,410, filed February 5, 2007, entitled "METHOD AND SYSTEM FOR A GRAPHICAL USER INTERFACE FOR CONFIGURATION OF AN ALGORITHM FOR THE MATCHING OF DATA RECORDS"; No. 11/656,111, filed January 22, 2007, entitled "METHOD AND SYSTEM FOR INDEXING INFORMATION ABOUT ENTITIES WITH RESPECT TO HIERARCHIES"; No. 11/522,223, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS PERSONAL NAMES"; and No. 11/521,928, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS BUSINESS NAMES." All applications cited in this paragraph are incorporated herein by reference for all purposes.
Detailed Description
The disclosure and its various features and advantageous details are explained more fully with reference to the exemplary, and therefore non-limiting, embodiments illustrated in the accompanying drawings and detailed in the following description. Descriptions of known programming techniques, computer software, hardware, operating platforms and protocols may be omitted so as not to unnecessarily obscure the disclosure in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will be apparent to those skilled in the art from this disclosure.
Software implementing embodiments disclosed herein may be implemented in suitable computer-executable instructions that may reside on a computer-readable storage medium. Within this disclosure, the term "computer-readable storage medium" encompasses all types of data storage media that can be read by a processor. Examples of computer-readable storage media include random access memories, read-only memories, hard drives, data cartridges, magnetic tapes, floppy diskettes, flash memory drives, optical data storage devices, compact-disc read-only memories, and other appropriate computer memories and data storage devices.
As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having" or any other variation thereof are intended to cover a non-exclusive inclusion. For example, a process, product, article or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, product, article or apparatus. Further, unless expressly stated to the contrary, "or" refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
Additionally, any examples or illustrations given herein are not to be regarded in any way as restrictions on, limits to, or express definitions of, any term or terms with which they are used. Instead, these examples or illustrations are to be regarded as being described with respect to one particular embodiment and as illustrative only. Those of ordinary skill in the art will appreciate that any term or terms with which these examples or illustrations are used encompass other embodiments, as well as implementations and adaptations thereof, which may or may not be given therewith or elsewhere in the specification, and all such embodiments are intended to be included within the scope of that term or terms. Language designating such non-limiting examples and illustrations includes, but is not limited to: "such as," "in one embodiment," and the like.
Reference is now made in detail to the exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like parts (elements).
Some embodiments disclosed herein can supplement embodiments of the systems and methods for indexing information about entities from different information sources described in U.S. Patent No. 5,991,758, issued November 23, 1999, which is incorporated herein by reference. Some embodiments disclosed herein can also supplement embodiments of the entity processing systems and methods for indexing information about entities with respect to hierarchies disclosed in the above-referenced U.S. Patent Application No. 11/656,111, filed January 22, 2007, entitled "METHOD AND SYSTEM FOR INDEXING INFORMATION ABOUT ENTITIES WITH RESPECT TO HIERARCHIES," which is also incorporated herein by reference.
Fig. 1 is a block diagram illustrating an exemplary architecture of one embodiment of an entity processing system 30. Entity processing system 30 may include an Identity Hub 32 that processes, updates or stores data pertaining to data records about one or more entities from one or more information sources 34, 36, 38 and that responds to commands or queries from a plurality of operators 40, 42, 44, where the operators may be human users and/or information systems. Identity Hub 32 may operate with data records from a single information source or, as shown, with data records from multiple information sources. The entities tracked using embodiments of Identity Hub 32 may include, for example, patients in a hospital, participants in a health care system, parts in a warehouse, or any other entity that may have data records, and information contained in data records, associated with it. Identity Hub 32 may be one or more computer systems with at least one central processing unit (CPU) 45 executing computer-readable instructions (e.g., a software application) stored on one or more computer-readable storage media to perform the functions of Identity Hub 32. As will be understood by those skilled in the art, Identity Hub 32 may also be implemented using hardware circuitry or a combination of software and hardware.
In the example of Fig. 1, Identity Hub 32 may receive data records from information sources 34, 36, 38 and write corrected data back into information sources 34, 36, 38. The corrected data communicated to information sources 34, 36, 38 may include information that was correct but has changed, information about fixing information in a data record, and/or information about links between data records.
In addition, one of the operators 40, 42, 44 may transmit a query to Identity Hub 32 and receive a response to the query from Identity Hub 32. Information sources 34, 36, 38 may be, for example, different databases that may hold data records about the same entities. For example, in the health care field, each information source 34, 36, 38 may be associated with a particular hospital in a health care organization, and the organization may use Identity Hub 32 to relate the data records associated with the plurality of hospitals, so that a data record for a patient in Los Angeles may be located when that patient is on vacation and enters a hospital in New York. Identity Hub 32 may be located at a central location, and information sources 34, 36, 38 and operators 40, 42, 44 may be located remotely from Identity Hub 32 and connected to it by, for example, a communications link such as the Internet or any other type of communications network, such as a wide area network, intranet, wireless network, leased network, etc.
In some embodiments, Identity Hub 32 may have its own database that stores complete data records in Identity Hub 32. In some embodiments, Identity Hub 32 may instead hold only enough data to identify a data record (e.g., an address in a particular information source 34, 36, 38), or any portion of the data fields that make up a complete data record, so that Identity Hub 32 can retrieve the entire data record from information source 34, 36, 38 when needed. Identity Hub 32 may link together data records containing information about the same entity using an entity identifier or an associative database that is separate from the actual data records. Thus, Identity Hub 32 may maintain links between data records in one or more information sources 34, 36, 38 without necessarily maintaining a single uniform data record for an entity.
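The idea of keeping links apart from the records themselves can be sketched in Python as follows. This is only an illustrative sketch: the class `LinkIndex`, its method names, and the source and record keys are hypothetical and do not come from the disclosure. It assumes each record is referenced by a (source, record key) pair and linked through an entity identifier.

```python
from collections import defaultdict


class LinkIndex:
    """Hypothetical link store: entity identifiers point at record references.

    Only (source, record_key) references are held here; the full data
    records are assumed to remain in their information sources.
    """

    def __init__(self):
        self._by_entity = defaultdict(set)  # entity id -> {(source, record_key)}
        self._entity_of = {}                # (source, record_key) -> entity id

    def link(self, entity_id, source, record_key):
        """Link a record reference to an entity, moving it if already linked."""
        ref = (source, record_key)
        previous = self._entity_of.get(ref)
        if previous is not None:
            self._by_entity[previous].discard(ref)
        self._by_entity[entity_id].add(ref)
        self._entity_of[ref] = entity_id

    def records_for(self, entity_id):
        """Return the sorted record references linked to an entity."""
        return sorted(self._by_entity.get(entity_id, ()))

    def entity_for(self, source, record_key):
        """Return the entity a record reference is linked to, or None."""
        return self._entity_of.get((source, record_key))
```

With such an index, a query for one entity returns references into several information sources, and the complete records can then be fetched from those sources on demand.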
In some embodiments, Identity Hub 32 may link data records in information sources 34, 36, 38 by comparing a data record (received from an operator or from an information source 34, 36, 38) with other data records in information sources 34, 36, 38 to identify data records that should be linked together. This identification process may entail comparing one or more attributes of the data record with like attributes of the other data records. For example, a name attribute associated with one record may be compared with the names of other data records, a social security number may be compared with the social security number of another record, and so on. In this manner, data records that should be linked may be identified.
It will be apparent to those skilled in the art that information sources 34, 36, 38 and operators 40, 42, 44 may be affiliated with similar or different organizations and/or owners and may be physically separate and/or remote from one another. For example, information source 34 may be affiliated with a hospital in Los Angeles run by one health care network, while information source 36 may be affiliated with a hospital in New York run by another health care network that may be owned by a French company. Thus, data records from information sources 34, 36, 38 may be in different formats, different languages, and so on.
This may be illustrated more clearly with reference to Figs. 2A and 2B, which depict two embodiments of example data records. Each of these data records 200, 202 has a set of fields 210 corresponding to a set of attributes of the data record. For example, one attribute of each of the records 200 may be a name, another attribute may be a taxpayer number, and so on. An attribute may comprise multiple fields 210 of a data record 200, 202. For example, the address attribute of data record 202 may comprise fields 210c, 210d and 210e, which are the street, city and state fields, respectively.
However, data records 200, 202 may each have a different format. For example, data record 202 may have a field 210 for the attribute "insurer," while data record 200 may have no such field. Similar attributes may also have different formats. For example, name field 210b in record 202 may accept the entry of a full name, while name field 210a in record 200 may be designed to allow entry of a name of limited length. Such discrepancies may be problematic when two or more data records (e.g., attributes of data records) are compared to identify data records that should be linked. For example, the names "Bobs Flower Shop" and "Bobs Very Pretty Flower Shoppe" are similar but not identical. In addition, typographical mistakes or errors made when data is entered into a data record may affect the comparison of data records and thus its results (e.g., comparing the names "Bobs Pretty Flower Shop" and "Bobs Pretty Glower Shop," where "Glower" resulted from a typo in entering the word "Flower").
Business names in data records may present a number of fairly specific problems as a result of their nature. Some business names can be very short (e.g., "Quick-E-Mart") while others can be very long (e.g., "San Francisco's Best Coffee Shop"). Additionally, business names frequently use similar words (e.g., "Shop," "Inc.," "Co.") which, when data records in the same language are compared, should carry relatively little weight in any heuristic for comparing these names. Furthermore, acronyms are frequently used in business names; for example, a business named "New York City Bagel" may frequently be entered into a data record as "NYC Bagel."
As will be described in further detail below, embodiments of Identity Hub 32 disclosed herein employ algorithms that can take these specific characteristics into account when comparing business names. Specifically, some algorithms employed by Identity Hub 32 support acronyms, take into account the frequency of certain words in business names, and consider the ordering of tokens within a business name (e.g., the name "Clinic of Austin" may be deemed virtually identical to "Austin Clinic"). Some algorithms use a variety of name comparison techniques to generate a weight based on the comparison (e.g., similarity) of names in different records, and this weight can then be used in determining whether two records should be linked. These techniques include various phonetic comparison methods, weighting based on the frequency of name tokens, prefix matching, alias matching, and so on. In some embodiments, the tokens of the name attribute of each record are compared against one another using a methodology for matching tokens (e.g., whether the tokens match exactly, phonetically, etc.). These matches may then be given a weight based on the type of match determined (e.g., an exact match is given a first weight, a certain type of prefix match is given a second weight, etc.). The weights may then be aggregated to determine an overall weight for the degree of match between the name attributes of the two data records. Exemplary embodiments of a suitable weight generation methodology are described in the above-referenced U.S. Patent Application No. 11/809,792, filed June 1, 2007, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING," which is incorporated herein by reference. Exemplary embodiments of suitable name comparison techniques are described in the above-referenced U.S. Patent Application No. 11/522,223, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS PERSONAL NAMES," and No. 11/521,928, filed September 15, 2006, entitled "METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS BUSINESS NAMES," both of which are incorporated herein by reference.
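The token-by-token weighting described above can be illustrated with a small Python sketch. It is not the weight generation method of the referenced applications: the weight values, the three-character minimum for a prefix match, and the function names `token_weight` and `name_score` are all assumptions made for illustration, and phonetic matching and frequency-based weighting are left out.

```python
EXACT_WEIGHT = 4.0    # hypothetical weight for an exact token match
PREFIX_WEIGHT = 2.0   # hypothetical weight for a prefix match
MISS_WEIGHT = -1.0    # hypothetical penalty for a token with no match


def token_weight(a, b):
    """Return the weight for one token pair, or None when they do not match."""
    if a == b:
        return EXACT_WEIGHT
    shorter, longer = sorted((a, b), key=len)
    # Assumed rule: a prefix match needs at least three shared leading characters.
    if len(shorter) >= 3 and longer.startswith(shorter):
        return PREFIX_WEIGHT
    return None


def name_score(tokens_a, tokens_b):
    """Sum the best available weight for each token, ignoring token order."""
    remaining = list(tokens_b)
    total = 0.0
    for tok in tokens_a:
        best = None
        for cand in remaining:
            w = token_weight(tok, cand)
            if w is not None and (best is None or w > best[0]):
                best = (w, cand)
        if best is None:
            total += MISS_WEIGHT
        else:
            total += best[0]
            remaining.remove(best[1])  # each token may be matched only once
    # Tokens left over in the second name also count against the pair.
    total += MISS_WEIGHT * len(remaining)
    return total
```

Because tokens are matched regardless of position, `name_score(["AUSTIN", "CLINIC"], ["CLINIC", "AUSTIN"])` scores the same as two identical names, which mirrors the order-insensitive comparison mentioned in the text.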
Fig. 3 depicts an example of a method for identifying records that pertain to the same entity. At step 310, a set of data records may be pushed to, or pulled by, Identity Hub 32 for evaluation. These data records may include, for example, one or more new data records to be compared with a set of existing data records (which may, for example, already exist in information sources 34, 36, 38 or may be provided to Identity Hub 32). At step 320, the data records to be compared may be standardized if they have not been standardized already. This standardization may comprise standardizing the attributes of a data record so that the data record is converted from its original format to a standard format. In this way, subsequent comparisons between like attributes of different data records may be performed according to the standard format of both the attributes and the data records. It will be apparent to those skilled in the art that each attribute of the data records to be compared may be standardized or tokenized according to a different format, different semantics, a different lexicon, etc., and that the standardization of each attribute into its corresponding standard form may be accomplished by a distinct function. Thus, each data record may be standardized into a standard format through the standardization of its various attributes, each attribute being standardized by a corresponding function (such attribute functions may, of course, be used to standardize multiple types of attributes).
For example, name attribute field 210a of data record 200 may be evaluated to produce a set of tokens for the name attribute (e.g., "Bobs," "Pretty," "Flower" and "Shop"), and these tokens may be concatenated according to a certain format to produce a standardized attribute (e.g., "BOBS:PRETTY:FLOWER:SHOP") so that the standardized attribute can subsequently be parsed to recover the tokens that make up the name attribute. As another example, when names are standardized, consecutive single-character tokens may be combined into one token (e.g., I.B.M. becomes IBM) and substitutions may be performed (e.g., "Co." is replaced by "Company," "Inc." by "Incorporated," etc.). An equivalence table comprising abbreviations and their equivalent substitutions may be stored in a database associated with Identity Hub 32. Pseudo-code for one embodiment of standardizing business names is as follows:
BusinessNameParse(inputString, equivalenceTable):
    STRING outputString
    for c in inputString:
        if c is a LETTER or a DIGIT:
            copy c to outputString
        else if c is one of the following characters [&, ’, `] (ampersand, single quote, back quote):
            skip c (do not replace with a space)
        else  // non-ALPHA-DIGIT, non-[&, ’, `] character
            if the last character in outputString is not a space, copy a space to outputString.
    // Now extract the tokens.
    tokenList = []
    for token in outputString:  // outputString is a list of tokens separated by spaces
        if (token is a single character and it is followed by one or more single characters)
            combine the single tokens into a single token
        if (equivalenceTable maps token)
            replace token with its equivalence.
        append token to tokenList.
    return tokenList
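For readers who want to try the parsing steps, the pseudo-code can be rendered roughly as the following Python sketch. It is one possible reading rather than the implementation used by Identity Hub 32: the function name `business_name_parse` is hypothetical, the equivalence table is assumed to be a plain dictionary keyed on the cleaned token, and Python's `str.isalnum` is used as a stand-in for the LETTER or DIGIT test.

```python
def business_name_parse(input_string, equivalence_table):
    """Standardize a business name into a list of tokens (illustrative sketch)."""
    out = []
    for c in input_string:
        if c.isalnum():                  # stand-in for "LETTER or DIGIT"
            out.append(c)
        elif c in "&'`\u2019":           # ampersand, single quotes, back quote
            continue                     # dropped without inserting a space
        elif out and out[-1] != " ":     # any other character becomes one space
            out.append(" ")
    raw_tokens = "".join(out).split()

    tokens = []
    i = 0
    while i < len(raw_tokens):
        token = raw_tokens[i]
        if len(token) == 1:
            # Merge a run of consecutive single-character tokens (I B M -> IBM).
            j = i
            while j < len(raw_tokens) and len(raw_tokens[j]) == 1:
                j += 1
            token = "".join(raw_tokens[i:j])
            i = j
        else:
            i += 1
        # Replace the token with its equivalent when the table maps it.
        tokens.append(equivalence_table.get(token, token))
    return tokens
```

For example, with an assumed table `{"Co": "Company"}`, the input `"I.B.M. Co."` yields `["IBM", "Company"]`. Note that the table keys carry no trailing period, because punctuation has already been turned into spaces by the time the lookup happens.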
Regardless of the technique used, once the attributes of the data records to be compared, and the data records themselves, have been standardized into a standard form at step 320, a set of candidates for comparison with the new or incoming data record may be selected from the existing data records at step 330. This candidate selection process (also referred to herein as bucketing) may comprise comparing one or more attributes of the new or incoming data record with those of the existing data records to determine which existing data records are similar enough to the new data record to warrant further comparison. Each set of candidates (bucket group) may be based on a comparison of a set of attributes between data records (e.g., between the incoming data record and the existing data records) using a candidate selection function (bucketing function) corresponding to each attribute. For example, one set of candidates (i.e., a bucket) may be selected based on a comparison of the name and address attributes, using a candidate selection function designed to compare names and another designed to compare addresses.
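A minimal Python sketch of candidate selection follows, assuming each bucketing function turns a record into a set of hashable keys and that records sharing any key become candidates for each other. The two bucketing functions, the record layout (dictionaries with `name` and `zip` entries), and the names `build_index` and `select_candidates` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def name_bucket_keys(record):
    """Hypothetical bucketing function: one key per upper-cased name token."""
    return {("name", tok) for tok in record["name"].upper().split()}


def zip_bucket_keys(record):
    """Hypothetical bucketing function: one key for the postal code, if any."""
    return {("zip", record["zip"])} if record.get("zip") else set()


BUCKETING_FUNCTIONS = [name_bucket_keys, zip_bucket_keys]


def build_index(existing_records):
    """Map every bucket key to the ids of the existing records that produce it."""
    index = defaultdict(set)
    for rec_id, rec in existing_records.items():
        for fn in BUCKETING_FUNCTIONS:
            for key in fn(rec):
                index[key].add(rec_id)
    return index


def select_candidates(new_record, index):
    """Return ids of existing records sharing at least one bucket key."""
    candidates = set()
    for fn in BUCKETING_FUNCTIONS:
        for key in fn(new_record):
            candidates |= index.get(key, set())
    return candidates
```

Only the records returned by `select_candidates` would go on to the detailed comparison, so the more expensive attribute scoring is not run against every existing record.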
At step 340, the data records comprising these sets of candidates may then undergo a more detailed comparison with the new or incoming record, in which a set of attributes is compared between the records to determine whether an existing data record should be linked or associated with the new data record. This more detailed comparison may entail comparing one or more attributes in the set of one record (e.g., an existing record) with the corresponding attributes of the other record (e.g., the new or incoming record) to generate a score for each attribute comparison. The scores for the set of attributes may then be summed to generate an overall score, which can then be compared with thresholds to determine whether the two records should be linked. For example, if the overall score is less than a first threshold (referred to as the soft link or review threshold), the records may not be linked; if the overall score is greater than a second threshold (referred to as the auto-link threshold), the records may be linked; and if the overall score falls between the two thresholds, the records may be linked and flagged for user review.
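The two-threshold decision can be sketched in Python as below. The function name `link_decision`, the outcome labels and the example scores are assumptions; the text does not say how a score exactly equal to a threshold is treated, so this sketch assumes such scores fall into the review band.

```python
def link_decision(attribute_scores, review_threshold, autolink_threshold):
    """Sum per-attribute scores and classify the pair (illustrative sketch).

    Returns (total, outcome), where outcome is one of "no_link",
    "link_and_review" or "link". Scores equal to either threshold are
    assumed to land in the review band.
    """
    if review_threshold > autolink_threshold:
        raise ValueError("review threshold must not exceed auto-link threshold")
    total = sum(attribute_scores.values())
    if total < review_threshold:
        return total, "no_link"
    if total > autolink_threshold:
        return total, "link"
    return total, "link_and_review"
```

For instance, with hypothetical attribute scores `{"name": 6.0, "address": 3.5, "phone": -1.0}` and thresholds of 5.0 and 10.0, the overall score is 8.5 and the pair would be linked and flagged for review.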
Fig. 4 depicts the architecture of one embodiment of a system 10 for configuring Identity Hub 32 and analyzing its configuration. In some embodiments, system 10 comprises computer 40 and workbench 20. Workbench 20 is a software program stored in a memory of computer 40 and comprising computer instructions readable by a processor of computer 40. Workbench 20 is installed and runs on computer 40, which communicates with Identity Hub 32 over network 15. Network 15 can be a representation of a public network, a private network, or a combination thereof. Workbench 20 comprises a plurality of functions, including configuration tools 400, that user 51 can access through graphical user interface 50. In some embodiments, user interface 50 is a representation of one or more user interfaces of workbench 20. In some embodiments, through user interface 50, workbench 20 enables user 51 to create, edit and/or validate an Identity Hub configuration, to store that configuration locally on computer-readable storage medium 56, and to deploy the validated configuration remotely over network 15 to an instance of Identity Hub 32. Computer-readable storage medium 56 may be internal or external to computer 40.
As one skilled in the art will appreciate, computer 40 is a representation of any networked computing device specially programmed with an embodiment of workbench 20 to configure and analyze an Identity Hub configuration locally and to deploy the (validated) configuration remotely over a network to an instance of the Identity Hub. One embodiment of a method of configuring Identity Hub 32 through workbench 20 is described below with reference to Fig. 5. Embodiments of user interface 50 of workbench 20 are described below with reference to Fig. 6.
In some embodiments, configuration tools 400 comprise configuration editor 410, algorithm editor 420 and analysis tools 430. In some embodiments, analysis tools 430 comprise data analysis tool 432, entity analysis tool 434, bucket analysis tool 436 and linkage analysis tool 438. In some embodiments, through configuration editor 410, workbench 20 gives user 51 the ability to create a new configuration of Identity Hub 32 or to load an existing configuration of Identity Hub 32 stored on computer-readable storage medium 56. In some embodiments, an Identity Hub configuration comprises a view of member records, attributes of the member records, and segments defined for a particular implementation of Identity Hub 32. For further teachings on implementation defined segments, readers are directed to U.S. Patent Application No. 11/900,769, filed September 13, 2007, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS," which is incorporated herein by reference. Details of configuring Identity Hub 32 are described below with reference to Figs. 7-8.
Identity Hub 32 uses a plurality of algorithms to compare and score the similarities and differences of member attributes. More specifically, Identity Hub 32 applies the algorithms to data to create tasks and to support search functionality. In some embodiments, through algorithm editor 420, workbench 20 gives user 51 the ability to define and customize algorithms for a particular implementation of Identity Hub 32. Embodiments of algorithm editor 420 are described below with reference to Figs. 9A-9B.
In some embodiments, through data analysis tool 432, user 51 can analyze the validity of the attributes of the data records in Identity Hub 32. In some embodiments, through entity analysis tool 434, user 51 can analyze the entities associated with the data records in Identity Hub 32. In some embodiments, through bucket analysis tool 436, user 51 can analyze buckets (groups of candidate records) and the effect of the bucketing strategy on Identity Hub 32. In some embodiments, through linkage analysis tool 438, user 51 can analyze the link error rates associated with member records and the thresholds used when deriving scores for those records. Some embodiments of analysis tools 430 are described below with reference to Figs. 10-17.
Fig. 5 is a flow diagram illustrating one embodiment of a method for configuring Identity Hub 32. With workbench 20 installed and running on computer 40, at step 510 user 51 can access workbench 20 and create a new project or open an existing project. In some embodiments, a project is a container for holding an Identity Hub configuration and the files associated with it. In some embodiments, a project comprises a plurality of artifacts. Examples of such artifacts include an Identity Hub configuration, the algorithms used by that configuration, and the results of analyses obtained from analysis tools 430. At step 520, user 51 can create a new configuration or open an existing configuration in the project created or opened at step 510. At step 530, through user interface 50, user 51 can analyze, modify and/or validate the configuration created or opened at step 520. At step 540, user 51 can save the configuration locally on computer 40. At step 540, user 51 can deploy the saved and validated configuration remotely to an instance of Identity Hub 32 over a network connection to the server running that instance. In some embodiments, the Identity Hub configuration and algorithms can be deployed directly to the instance of Identity Hub 32 in real time. In some embodiments, some tasks (jobs) may need to be executed directly by Identity Hub 32 outside of configuration deployment. In such cases, some embodiments of workbench 20 provide a means to run individual jobs from a job set, or grouped jobs, directly on Identity Hub 32 and to show user 51 the progress or status of job execution in a workbench view through user interface 50. In some embodiments, user 51 can retrieve or view job results from Identity Hub 32 at computer 40 through user interface 50. For some embodiments of user interface 50, readers are directed to U.S. Patent Application No. 11/901,040, filed September 14, 2007, entitled "HIERARCHY GLOBAL MANAGEMENT SYSTEM AND USER INTERFACE," which is incorporated herein by reference.
Fig. 6 depicts screenshot 60 of one embodiment of user interface 50. More specifically, screenshot 60 shows an example layout of configuration editor 410 of workbench 20, displayed on computer 40 through one embodiment of user interface 50. In this example, configuration editor 410 comprises menu 61, shortcuts 63, and a set of work areas 64, 65, 66 and 67 called views. Menu 61 provides access to various menu items, each of which provides a different set of functions. For example, through menu item "Initiate" 62, user 51 can create a new Initiate project, import an Identity Hub configuration, deploy an Identity Hub configuration, create a new job set, confirm partial weights, and so on. Shortcuts 63 provide quick access to the functions of workbench 20 currently in use. For example, user 51 can switch quickly between configuration editor 410 and analysis tools 430 via shortcuts 63. Views 64, 65, 66 and 67 are individual windows that contain particular types of data. Most views can be moved to a different area of user interface 50 on the screen by dragging and dropping their tabs. To change views, user 51 can select "Show View" under menu item "Window" of menu 61. The following is a brief description of the views included in an embodiment of user interface 50 of workbench 20. All of these views can be hidden and expanded within workbench 20.
Navigator view
The Navigator view provides a tree structure for browsing workbench artifacts.
The following functions can be accessed from the Navigator view:
Browse project directories
Open and view project files
Copy, paste, move, delete and rename project files
Import resources
Refresh imported resources
Select a working set of files (and hide the files not used in that working set)
Deselect a working set of files
Properties view
The Properties view enables a user to edit the property values of any component the user has created.
Problems view
The Problems view lists configuration and validation problems in the workbench. Most validation is performed when a file resource in a project is saved, so errors appear immediately.
Console view
The Console view shows progress messages and errors during extensive background tasks.
Job view
The Job view shows the progress or completion (execution) status of a job or job set.
Further details of the Job view are described below with reference to Figs. 8A and 8B.
Analytics view
The Analytics view displays the results of analytic queries. To obtain the data shown in this view, the workbench needs a connection to the hub so that the hub can process the query.
Search view
The Search view displays the results of searches within an existing configuration. A user can open a configuration object by double-clicking its row in the Search view.
In some embodiments, workbench 20 provides several specialized types of editors, such as configuration editor 410 and algorithm editor 420. In some embodiments, workbench 20 also supports other editor types, including standard text and Java editors. Figs. 7A and 7B depict screenshots 70a and 70b of one embodiment of configuration editor 410, through which hub configuration 71 of Identity Hub 32 can be modified.
More specifically, screenshot 70a shows a representation of hub configuration 71 imported into workbench 20. In some embodiments, configuration editor 410 may include navigation menu 72, which presents views for applications, attribute types, information sources, linkages, member types, relationship types, and so on. Referring to Fig. 7A, member types view 73 enables a user to add, edit and remove member types. In some embodiments, a member type identifies the "kind of object" (e.g., person, provider, guest or organization) to which data belongs. In some embodiments, five objects can be configured for a particular member type, each with its own tab (view): attributes, entity types, composite views, sources and algorithms.
In certain embodiments, the attribute type view enables a user to examine the attributes associated with a member type. For example, for member type "person" 74, the attributes tab can show the attributes associated with member type "person" 74, such as APPT and date of birth. In this example, attribute APPT has attribute type MEMAPPT, and attribute date of birth has attribute type MEMDATE. In certain embodiments, attribute types (segments) conform to the data schema and define hinge behavior and member information. In certain embodiments, attribute types include member attribute types and relationship attribute types. In certain embodiments, member attribute types include predefined ("fixed") attribute types and implementation-defined attribute types, which are described in the above-mentioned U.S. Patent Application No. 11/900,769, entitled "IMPLEMENTATION DEFINED SEGMENTS FOR RELATIONAL DATABASE SYSTEMS," filed September 13, 2007. Implementation-defined attribute types can be created when the identity hinge is implemented and are therefore not associated with a generated class. Relationship attribute types are attribute types dedicated to relationships. An attribute type cannot be both a member attribute type and a relationship attribute type.
In certain embodiments, the entity type view makes it possible to manage entity types, such as identity or threshold. For further description of entity management, the reader is referred to U.S. Patent Application No. 12/056,720, entitled "METHOD AND SYSTEM FOR MANAGING ENTITIES," filed March 27, 2008, and U.S. Patent Application No. 11/656,111, entitled "METHOD AND SYSTEM FOR INDEXING INFORMATION ABOUT ENTITIES WITH RESPECT TO HIERARCHIES," filed January 22, 2007, both of which are incorporated herein by reference.
In certain embodiments, a synthetic view represents a user-defined complete picture of a member. The configuration of a synthetic view can establish the rules that control the behavior and display of member attribute data in worktable 20. For example, the member attribute data of a particular member can consist of a name, address, phone number, and social security number (SSN).
In certain embodiments, the source view enables a user to add and manage information about the sources that interact with worktable 20. Examples of sources can include definition sources and information sources. Examples of information sources can include the above-mentioned sources 34, 36, 38. A definition source is a source in which members (records) are created and frequently updated. In certain embodiments, worktable 20 can send updates to a definition source.
In certain embodiments, the algorithm tab enables a user to create or identify the active algorithm that the hinge uses to process comparisons. In certain embodiments, only one algorithm can be active for each member type on a hinge instance. The algorithms (active and inactive) are based on the member types currently defined in the hinge configuration. Each newly created algorithm must be associated with a member type in the hinge configuration (see Figs. 9A and 9B).
In certain embodiments, links can be formed automatically (AutoLink) for records whose scores are above the AutoLink threshold, or formed manually by a user during task resolution (office worker reexamination). The purpose of links is to enable an accurate and comprehensive (enterprise-wide) understanding of a member (record). Referring to Fig. 7B, in certain embodiments, link view 76 of config editor 410 can provide link types 77 and link statuses 78. This function can be used to add or edit link types and associated statuses. In this example, link types 77 lists the link ID, link type, and kind that define valid entity relationships, and link statuses 78 lists the status ID, link status, and category of the workflow states that represent business relationships. In certain embodiments, these columns can be sorted in ascending or descending order by clicking a column heading.
Referring briefly to Fig. 7A, navigation menu 72 also shows an application view and a relationship type view. The application view can list several functions. In certain embodiments, a user can use the function in this component to mark an application as active or inactive. In certain embodiments, an enterprise user can add and remove, from the application view, applications implemented at the enterprise site. The relationship type view can show the available relationship types. A relationship type is a type of association that can exist between two different (or identical) entity types. For example, one person can manage another person, or one organization can legally own another organization. In certain embodiments, a user can use the function in this component to manage the relationships between entities. For further description of relationship information about entities, the reader is referred to U.S. Patent Application No. 11/904,750, entitled "METHOD AND SYSTEM FOR INDEXING, RELATING AND MANAGING INFORMATION ABOUT ENTITIES," filed September 28, 2007, which is incorporated herein by reference. For simplicity, not all available views are shown or described in this disclosure. However, those skilled in the art will appreciate that additional views, and additional functions provided through those views, are also possible. For example, a string view can enable a user to create rules or guidance that instruct the algorithm on how to process certain incoming data values. As another example, an audit view can enable a user to set up audit logging of interactions with identity hinge 32 and of the users who perform those interactions.
In some embodiments of worktable 20, the container that holds a hinge configuration and its associated files is called a project. Before a hinge configuration is imported into a project, the user needs to create a new project or import an existing one. To create a new project, the user can select "newly starting project ..." from "startup" menu 61 and enter a name for the new project. The new project can be created from a worktable template in the current workspace directory or in a user-specified location outside the current workspace (for example, another local drive or a network drive). For further description of some embodiments of project management, the reader is referred to U.S. Patent Application No. 11/824,210, entitled "METHOD AND SYSTEM FOR PROJECT MANAGEMENT," filed June 29, 2007, which is incorporated herein by reference.
Next, worktable 20 creates the project and adds the following directories under the workspace directory:
Flow: contains flow files (.iflow)
Function: contains any custom functions
Storehouse: contains any additional Java library files (.jar) required for deployment
Serve: contains all data source WSDL files (.wsdl) imported into the project
Src: contains any additional Java source files (.java) that are needed
Anonutil: contains sample default value files and filter files
Processor: contains scripting support for packaging Java processors
Operation: holds information related to hinge-project registration
A project is associated with identity hinge 32 through a connection to the server running an instance of identity hinge 32. There are several types of connections, including production and test. In certain embodiments, connections to instances of identity hinge 32 can be added, edited, or removed through the corresponding functions under menu item "startup" 62 of menu 61 (see Fig. 6). A hinge configuration can be imported into a project through the "input hub configuration ..." function of "startup" menu 62. In certain embodiments, a user name and password may be needed to retrieve hinge configuration information from identity hinge 32. In certain embodiments, the name of the imported hinge configuration can be displayed in navigator view 64 of config editor 410, and the components of the imported hinge configuration can be displayed in workspace 65.
Figs. 8A and 8B illustrate screenshots 80a and 80b of one embodiment of config editor 401, through which an operation configuration can be modified. In some embodiments of worktable 20, a task carried out by identity hinge 32 can be called an operation, and a group of one or more operations can be called an operation group. In certain embodiments, the available operations (tasks) can be categorized as configuration operations, data analysis operations, hinge management operations, and so on. In certain embodiments, operation results can be saved by project on the server running identity hinge 32 and in many cases can be retrieved from that server or examined at computer 40. In certain embodiments, the following non-exhaustive list of tasks can be performed through the operation view in config editor 410:
Configuration is set to hinge
Produce weight
Create Threshold Analysis pair
From hinge, fetch file
Hinge configuration is set
This function deploys (sets) the configuration project to the hinge. This operation can be used (instead of the startup menu option mentioned above) to perform the deployment together with another operation. When this operation is executed, the hinge is automatically stopped and restarted. When it is run from startup menu 62, the following options are available:
Set weight tables. When selected, this option causes the weight tables in the selected worktable project directory to be deployed to the hinge.
Create and/or drop database tables as needed. When selected, this option allows database table operations to be performed as required to support the configuration.
Check group synchronization. When selected, this option checks whether the locally listed operation groups are up to date with the groups defined in the hinge. In one embodiment, if this option is selected and the groups do not match, the deployment can be stopped.
Produce weight
This function performs the weight generation task. This operation requires derived data (comparison data and grouping data) as input. In certain embodiments, the derived data files can be generated during the above-described standardization and grouping steps 320 and 330 by using, for example, mpxdata, mpxprep, mpxfsdvd, or mpxredvd. As an example, Fig. 8A shows screenshot 80a, which illustrates how this operation can be configured through one embodiment of config editor 401. In particular, for entity type id 84, one embodiment of config editor 401 can show multiple tabs, including steps, inputs and outputs, performance tuning, options, and log tabs. In certain embodiments, the steps tab can enable the user to select which weight generation step to run and to indicate whether subsequent steps should be run until processing finishes. Examples of weight generation steps can include:
Delete artifacts from previous runs
Generate counts for all attribute values
Generate random pairs of members
Derive random data by comparing random members
Reduce matched candidate pairs
Generate matched sets, match statistics, and initial weights
Skip the final step because there are too few attributes
Iterate over the previous steps and check for weight convergence
Run all remaining steps until processing finishes
In certain embodiments, the inputs and outputs tab can allow the user to specify various input/output (I/O) directories. Examples of I/O directories can include:
BXM input directory: specifies the input directory from which the bulk cross-match results are read. This directory must match the output directory used by the mpx function that generated the derived data.
Working directory: specifies the directory in the worktable project in which the weight tables will be saved. In one embodiment, the default is the weight directory. All files are saved to a subdirectory of the specified working directory named after the entity type.
FRQ output directory: specifies the output directory to which the generated attribute frequency data is written.
UPAIRS output directory: specifies the output directory to which the generated random pair data is written.
USAMPS output directory: specifies the output directory to which the generated unmatched sample pair data is written.
MPAIRS output directory: specifies the output directory to which the generated matched pair data is written.
MSAMPS output directory: specifies the output directory to which the generated matched sample pair data is written.
RUN output directory: specifies the output directory to which the generated weights are written. An incremented number is appended to this directory for each iteration.
In certain embodiments, the performance tuning tab can allow the user to modify the following parameters:
Number of threads
Maximum number of iterations in the final step
Number of comparison grouping partitions
Number of random pair grouping partitions
Number of matched pair grouping partitions
Number of frequency partitions
Maximum number of I/O partitions
Audrecno used for auditing
Number of random pairs to generate
Interval for reporting processed records
Maximum grouping set size
Minimum weight for writing item records
In certain embodiments, the options tab can provide the following options to the user:
Encoding. In certain embodiments, worktable 20 supports LATIN1, UTF8, and UTF16 encodings. Other encodings can also be used. For further description of parsing data records in different languages, the reader is referred to U.S. Patent Application No. 11/967,588, entitled "METHOD AND SYSTEM FOR PARSING LANGUAGES," filed December 31, 2007, which is incorporated herein by reference.
Audit. In certain embodiments, worktable 20 supports auditing of a set of data records.
Comparison mode. In certain embodiments, this option can be used to restrict the comparison functions. For example, weights can be generated only for matching and linking, only for searching, or for matching, linking, and searching.
In certain embodiments, the following weight generation parameters can be found under the options tab in screenshot 80a of Fig. 8A. The data here includes thresholds specific to each source.
Attribute match comparison percentage threshold (wgtNRM): defines the threshold of the third filter used in attribute matching.
Attribute match comparison threshold (wgtABS): defines the threshold of the second filter used in attribute matching.
Convergence threshold (wgtCNV): defines the tolerance for weight generation convergence.
Data quality percentage (wgtQOD): defines the matched-set error rate assumed for the initial weight estimate.
False negative rate (wgtFNR): defines the false negative rate used for calculating the office worker reexamination and AutoLink thresholds.
False positive rate (wgtFPR): defines the false positive rate used for calculating the office worker reexamination and AutoLink thresholds.
Match comparison threshold (wgtMAT): defines the threshold of the first filter used in matching.
Minimum attribute count (wgtFLR): defines the lower limit of the attribute value frequency count.
In certain embodiments, the log tab can provide the following logging options to the user:
Trace log
Debugging log
Timer daily record
SQL daily record
When the weight generation operation completes, the results can be examined and the weights can be saved locally. In certain embodiments, the output of weight generation can be copied from the hinge to the project. For further description of weight generation, the reader is referred to U.S. Patent Application No. 11/809,792, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING," filed June 1, 2007, the entire contents of which are incorporated herein by reference.
As an example of a data analysis operation, Fig. 8B shows screenshot 80b, which illustrates how a Threshold Analysis pair generation operation can be configured through one embodiment of config editor 401. In particular, one embodiment of config editor 401 can allow the user to specify the entity type and the appropriate input directory and output file. The user can further specify the number of pairs per score value and the range of score values. In the example of Fig. 8B, the minimum score is 8.0 and the maximum score is 25.0. In this example, the sample generator takes 10 random pairs in each of 171 score bins (8.0 to 25.0 in increments of 0.1).
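As a non-limiting illustration, the bin arithmetic of this example can be sketched as follows. The function name score_bins and the variable names are hypothetical and are not part of worktable 20; the sketch only reproduces the count of 171 bins for a range of 8.0 to 25.0 with a step of 0.1 and the resulting number of requested sample pairs.

```python
def score_bins(min_score, max_score, step=0.1):
    """Return the score value of every bin from min_score to max_score inclusive.

    Hypothetical helper: the bin index is computed with integers so that
    floating-point drift does not add or drop a bin.
    """
    count = round((max_score - min_score) / step) + 1
    return [round(min_score + i * step, 1) for i in range(count)]


bins = score_bins(8.0, 25.0)            # values from the Fig. 8B example
pairs_per_bin = 10                      # random pairs requested per bin
requested_pairs = len(bins) * pairs_per_bin
```

Under these assumptions the operation requests at most 10 pairs for each of the 171 bins; fewer pairs may be available in sparsely populated bins.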
As described above with reference to Fig. 7A, a newly created algorithm must be associated with a member type in the hinge. Figs. 9A and 9B illustrate screenshots 90a and 90b of one embodiment of algorithm editor 420. In certain embodiments, algorithm editor 420 enables a user to edit the algorithm file that identity hinge 32 uses to apply comparison logic. In particular, when an algorithm is first created, it is empty. Algorithm editor 420 enables the user to add algorithm components from palette 91 and connect them to build the algorithm. In the example of Fig. 9A, screenshot 90a shows the algorithm associated with member type "person" 74. In certain embodiments, multiple algorithms can be associated with a particular member type, although only one algorithm can be set as the active algorithm at any given time. Algorithms are edited locally, so the database is not changed until they are put fully into effect.
As shown in Figs. 9A and 9B, an algorithm can include multiple components, including attribute components, standardization function components, comparison and inquiry role components, and grouping and comparison function components. A user can modify an algorithm by adding, modifying, or deleting one or more algorithm components. Attribute components allow the user to define the properties or fields of data elements. These attributes are filtered by the member type of the algorithm. Standardization function components contain functions that standardize or format incoming source data for comparison, grouping, and search (inquiry) purposes. This can mean capitalizing all first characters, removing punctuation, checking for anonymous values, and sorting data. After standardization, the data can be saved as the comparison component of the derived data and used in generating grouping data. In certain embodiments, standardized data is not saved in the hinge database and therefore does not change the member's data. For example, a telephone number can be entered in a source as 232-123-4567. Although a standardization routine can eliminate the dashes and area code and format the number as 1234567, the number kept in database 46 of identity hinge 32 remains 232-123-4567. Comparison and inquiry role components enable the user to define how comparison functions and/or inquiry functions are used in the algorithm. Grouping functions can be used to identify grouping data, which identifies groups that share information. For example, groupings can be defined as name (first name, last name, middle name), date of birth + last name, address, and social security number. This component also enables the user to define combinations of data elements within a grouping. For further description of embodiments of algorithm editor 420, the reader is referred to U.S. Patent Application No. 11/702,410, entitled "METHOD AND SYSTEM FOR A GRAPHICAL USER INTERFACE FOR CONFIGURATION OF AN ALGORITHM FOR THE MATCHING OF DATA RECORDS," filed February 5, 2007, which is incorporated herein by reference.
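As a non-limiting illustration of the telephone number example above, a standardization routine might be sketched as follows. The function name standardize_phone and the assumption that the local number is the last seven digits are hypothetical and are not taken from identity hinge 32; the point of the sketch is that the derived value is computed separately and the stored source value is left unchanged.

```python
import re


def standardize_phone(raw):
    """Derive a comparison value from a phone number.

    Assumption for illustration: non-digits are removed and only the last
    seven digits (the local number without area code) are kept.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-7:]


stored_value = "232-123-4567"                 # value as entered in the source
derived_value = standardize_phone(stored_value)
```

The derived value would be used only for comparison and grouping, while the stored value remains as it was entered.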
Therefore, in one embodiment, a method for analyzing an identity hinge can include generating a configuration of the identity hinge using a set of initial data records, analyzing the groupings created from the set of initial data records or a subset thereof according to the grouping strategy associated with the identity hinge configuration, analyzing the effect of these groupings on the performance of the identity hinge, and then changing the grouping strategy accordingly. In one embodiment, the grouping strategy is changed by editing the algorithm used in creating the groupings or by changing one or more parameter values associated with the algorithm. In one embodiment, the algorithm can be associated with an entity type.
In one embodiment, in addition to the core algorithm configuration functions described above, thresholds and automatic weight generation parameters can be configured through weight properties tab 92 of algorithm editor 420. Because weight properties are associated with an entity type, the user must first select an entity type in order to view them. In the present example, screenshot 90b shows the thresholds and weight properties of entity type id 84.
For further description of weight generation (including weight generation convergence), the reader is referred to U.S. Patent Application No. 11/809,792, entitled "SYSTEM AND METHOD FOR AUTOMATIC WEIGHT GENERATION FOR PROBABILISTIC MATCHING," filed June 1, 2007, which is incorporated herein by reference.
Referring to Fig. 9B, after weights are determined, the user can use threshold calculator 93 to manually set, or to calculate, suitable office worker reexamination and AutoLink thresholds for a particular hinge configuration. Threshold calculator 93 can use sample data from database 46 of the user's identity hinge 32 to calculate suitable office worker reexamination and AutoLink thresholds. In certain embodiments, the user can also use threshold calculator 93 to set the office worker reexamination threshold and the AutoLink threshold and thereby obtain estimates of the false positive rate, the false negative rate, and the number of tasks. In certain embodiments, the thresholds can be calculated using an estimated false positive rate (FPR) or a statistical FPR based on evaluated sample pair data. These values can apply to selected (or all) source pairs. The statistical option requires that the user first run the above-described Threshold Analysis pair generation operation and then perform the action of obtaining the operation results on the completed operation.
In certain embodiments, candidate thresholds are provided with worktable 20. The user can review the candidate thresholds, tasks, and links to determine the appropriate thresholds for a particular configuration. In certain embodiments, candidate thresholds can be calculated as follows:
AutoLink threshold value
The candidate AutoLink threshold depends on the file size and the allowable false positive rate. Let fpr be the allowable false positive rate (default value 10^(-5)) and let num be the number of records in the data set. The candidate AutoLink threshold is thresh_al = -ln[-ln(1 - fpr)/num] / ln(10), where ln is the natural logarithm (base e).
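As a non-limiting illustration, the candidate AutoLink threshold formula above can be evaluated as follows. The function name candidate_autolink_threshold and the example inputs (an allowable false positive rate of 1e-5 and one million records) are assumptions chosen for illustration only.

```python
import math


def candidate_autolink_threshold(fpr, num):
    """Evaluate thresh_al = -ln[-ln(1 - fpr) / num] / ln(10).

    math.log1p(-fpr) computes ln(1 - fpr) accurately when fpr is very small.
    """
    return -math.log(-math.log1p(-fpr) / num) / math.log(10)


# Illustrative inputs (assumed): fpr = 1e-5 and a data set of 1,000,000 records.
example_threshold = candidate_autolink_threshold(1e-5, 1_000_000)
```

Because -ln(1 - fpr) is approximately fpr when fpr is small, the result is close to log10(num / fpr), so the candidate threshold rises by about 1.0 for every tenfold increase in the number of records.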
Office worker reexamines threshold value
The candidate office worker reexamination threshold is set based on the desired false negative rate (fnr). For example, if 95% of duplicates are expected to score above the office worker reexamination threshold, the value is set to 0.05, which is the default. The actual fnr can depend on the distribution of the weights calculated for matches, the fraction of the time that each attribute has a valid value, and those values. A bootstrap procedure can be used to determine the empirical distribution of matched-set scores, and the office worker reexamination threshold is calculated from that distribution. For this bootstrap, a list of random members is generated, the information for each member is calculated, and an empirical distribution is formed from the sample as follows:
Select numebt random members from the database, potentially with repetition. Call them memrecno_1, memrecno_2, ..., memrecno_numebt. For each of these, score the member against itself (that is, calculate the member's information). Call these scores s_1, s_2, ..., s_numebt. Let s_min be the minimum of these scores and s_max the maximum, create a table from s_min to s_max in increments of 0.1, and bin the scores. This table will have the following n=(s_max-s_min)/0.1 rows:
Table 1: Matched-set score distribution
Value | Count | Frequency
s_min | c_1 = number of s_i equal to s_min | f_1 = c_1/numebt
s_min+0.1 | c_2 = number of s_i equal to s_min+0.1 | f_2 = c_2/numebt
s_min+0.2 | c_3 = number of s_i equal to s_min+0.2 | f_3 = c_3/numebt
... | ... | ...
s_max | c_n = number of s_i equal to s_max | f_n = c_n/numebt
Now let j be the first index such that
f_1+f_2+...+f_j > fnr
The candidate office worker reexamination threshold is then
thresh_cl = s_min+(j-1)*0.1.
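As a non-limiting illustration, the bootstrap calculation above can be sketched as follows. The function name candidate_review_threshold is hypothetical, and the self-scores are taken as an input list because the selection and self-scoring of random members is performed by the hinge and is not modeled here. The sketch allocates one extra bin so that s_max has its own row, which is an assumption about how the table boundaries are handled.

```python
def candidate_review_threshold(self_scores, fnr=0.05, step=0.1):
    """Return s_min + (j - 1) * step for the first j whose cumulative frequency exceeds fnr."""
    s_min = min(self_scores)
    s_max = max(self_scores)
    # Assumption: include a row for s_max itself.
    n_bins = round((s_max - s_min) / step) + 1
    counts = [0] * n_bins
    for s in self_scores:
        counts[round((s - s_min) / step)] += 1   # bin each self-score
    total = len(self_scores)
    cumulative = 0
    for j, c in enumerate(counts, start=1):
        cumulative += c
        if cumulative / total > fnr:             # f_1 + ... + f_j > fnr
            return s_min + (j - 1) * step
    return s_max


# Illustrative sample (assumed): ten self-scores binned at 10.0, 10.1 and 10.2.
sample_scores = [10.0, 10.1] + [10.2] * 8
review_threshold = candidate_review_threshold(sample_scores, fnr=0.15)
```

In this assumed sample the cumulative frequency first exceeds 0.15 at the second row, so the candidate threshold is the value of that row.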
In the embodiments disclosed herein, the above-described configuration tools are combined with a set of analysis tools for analyzing aspects of the configuration such as groupings and entities. These tools can evaluate the configuration and help find errors and potential performance problems related to it. Specifically, these tools can help a user seamlessly configure a hinge and confirm the correctness of the configuration.
Referring to Figs. 10A and 10B, some embodiments of worktable 20 can include an analysis view function that implements analysis tool 430. The analysis view function can provide a set of inquiry tools with which a configuring user can analyze a hinge configuration. To be supplied with data for analysis, the analysis view function needs to be associated with a hinge instance. Fig. 10A shows screenshot 100a of one embodiment of user interface 50, in which a hinge is selected as the analysis source for project demo 81, and hinge configuration 71, member type "person" 74, and entity type id 84 are selected for analysis. As shown in Fig. 10A, analysis data can be saved to a snapshot by selecting the "save analysis data to snapshot" option and providing a name in the analysis id field. In certain embodiments, snapshots are saved in XML format in the "snapshot" folder in the navigator view. In certain embodiments, with reference to Fig. 4, a snapshot can be saved locally in computer-readable storage medium 56 of computer 40. By saving data to snapshots, a user can compare analysis data from before and after a configuration change, or analysis data from different points in time. Multiple copies of the same inquiry can be saved to a single snapshot, provided that their input parameters differ.
Fig. 10B shows screenshot 100b of one embodiment of user interface 50, in which a snapshot is selected as the analysis source for project Alpha and snapshot main_hub_Bucket3-10-08 is selected from the available snapshots. In this example, member type "person" 74 and entity type id 84 are selected for analysis. When the analysis view has a data source associated with it, the user can load one or more inquiries and examine the results. Each inquiry displays a particular set of data. In certain embodiments, the available inquiries are classified into data analysis, entity analysis, grouping analysis, and link analysis types.
Fig. 11 shows a flow diagram of one embodiment of a method for analyzing an identity hinge configuration. As mentioned above, the tools in embodiments of worktable 20 are combined so that they can help a user seamlessly configure an instance of identity hinge 32 and confirm the correctness of the configuration in real time. Accordingly, the method steps shown in Fig. 11 are intended to illustrate an example process and should in no way be construed as limiting. For example, while member pairs are being sampled, comparison data and grouping data (derived data) are being created, weights are being determined, and suitable AutoLink (AL) and office worker reexamination (CR) thresholds are being determined, some analysis of the groupings, such as grouping size and grouping distribution, can be carried out early. This early analysis can help identify data anomalies at an early stage. Therefore, not all steps in Fig. 11 are necessary, and some embodiments of a method for analyzing a system for matching records can include one or more of the steps of Fig. 11. In addition, the steps in Fig. 11 need not be performed in a particular order. For example, as part of the weight generation process (step 103), a set of recommended thresholds (candidate thresholds) can be generated. At this point, the user can perform threshold analysis (step 107) and examine the estimated false positive and false negative rates for a range of thresholds. After setting the thresholds and completing a (possibly final) cross-match, the user can review entities for possible errors (missing values, etc.) (step 105). If a hinge is selected as the analysis source, the user can use entity analysis tool 432 from worktable 20 to examine the distribution of entity sizes and drill into the data of the members of suspicious entities to help identify errors. A report of entity sizes can be saved to disk (for example, computer-readable storage medium 56) for comparison after further adjustments have been made.
The above analysis tasks can be completed as the project nears its end or while other parts of the process are still being carried out. For example, in some cases, configuration tasks may still need to be completed through config editor 410 in worktable 20, such as configuring applications, setting up users/groups, creating synthetic views, and so on. After the necessary changes are made, they need to be deployed to the running server like other configuration data. When the project ends, a report on the configuration can be generated; this report can be used later to check the health of the system and to determine whether tuning efforts may be needed to return the system to optimal performance. In addition, when the configuration is complete, it can easily be redeployed to other servers (test, production, etc.). After the configuration is deployed to a new server, the user can, on computer 40, run the task "produce all configuration data" to create the derived data and run the necessary comparison and linking processes on the new server.
Returning to Fig. 11, as an example, one embodiment of a method for analyzing an identity hinge can include analyzing the attribute validity of a set of data through data analysis tool 434 (step 101). In one embodiment, a method for analyzing an identity hinge can include analyzing entities through entity analysis tool 432 (step 105). In one embodiment, these entities are classified as having a particular entity type in identity hinge 32. In certain embodiments, analyzing these entities may entail analyzing the entity size distribution, analyzing entities by size, analyzing entities by comparison, analyzing the score distribution associated with the entities, analyzing comparisons of the members associated with the entities, or a combination thereof. In certain embodiments, after analyzing entities, the user may wish to run algorithm editor 420 and modify the algorithm associated with the entity type and/or change one or more parameter values in one or more of the above-described algorithm components (step 102). In certain embodiments, this modification or change can trigger a change in the grouping strategy, and new weights can be generated automatically through weight generation (step 103). The user may then wish to run grouping analysis tool 436 to review and analyze the groupings and related statistics (step 104). In certain embodiments, through grouping analysis tool 436 of worktable 20, the user can analyze the grouping size distribution, analyze groupings by size, analyze groupings by comparison, analyze the grouping cross-match comparison distribution, analyze members (records) by grouping count, analyze member grouping values, analyze member grouping frequencies, analyze member comparison distributions, or a combination thereof. In certain embodiments, the user can run link analysis tool 438 (step 106) to analyze member duplicates and member overlaps with respect to the CR and AL thresholds currently in use (step 107). Analysis data can be saved during or after any of the above steps (step 108).
Figures 12A and 12B illustrate screenshots 120a and 120b of one embodiment of entity analysis tool 432. Specifically, screenshot 120a of Figure 12A shows the result of an entity composition query, in which column 121 lists the four members obtained (that is, entity 26 has four candidate data records linked together), column 122 lists the values of a particular attribute (social security number, SSN) associated with these members, column 123 lists the values of another particular attribute (sex) associated with these members, and so on. Screenshot 120b of Figure 12B shows the result of a member comparison query that compares source (Proband) member 27 with the members of entity 26, in which column 124 lists the candidate records compared and column 125 lists their corresponding scores.
The entity composition query and the member comparison query shown in Figures 12A and 12B are examples of queries available through entity analysis tool 432. In some embodiments, the queries available through entity analysis tool 432 can include entities by size, entity composition, entity size distribution, member comparison, member entity frequency, member entity values, members by entity count, and score distribution.
Entities by size
This query provides the ability to query for entities that match a specified size range (the number of members in the entity). Specifying 0 for the minimum or maximum size means no limit (no minimum or no maximum).
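The size-range rule above can be sketched as follows. This is a minimal illustration only: the function name `entities_by_size` and the dictionary of entity sizes are hypothetical and are not part of entity analysis tool 432; the only behavior taken from the description is that a value of 0 means "no limit".

```python
def entities_by_size(entity_sizes, min_size=0, max_size=0):
    """Return the IDs of entities whose member count lies in the given range.

    entity_sizes maps a (hypothetical) entity ID to its number of members.
    A min_size or max_size of 0 means "no limit", as in the query description.
    """
    matches = []
    for entity_id, size in entity_sizes.items():
        if min_size and size < min_size:
            continue  # below the requested minimum
        if max_size and size > max_size:
            continue  # above the requested maximum
        matches.append(entity_id)
    return sorted(matches)


# Hypothetical sample data: entity ID -> number of member records.
sizes = {"e1": 1, "e2": 4, "e3": 2, "e4": 7}
print(entities_by_size(sizes, min_size=2, max_size=0))  # no upper limit
```

Treating 0 as a sentinel keeps the call signature close to the two input fields described for the query, at the cost of not being able to ask for entities of size 0, which cannot exist anyway.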
Entity composition
This query shows the content of a specified entity. As illustrated in Figure 12A, the resulting table lists the member record ID and source ID of each member in the specified entity, together with each member's comparison data. The comparison data can be split into separate columns of the table by comparison role.
Entity size distribution
This query provides an overall view of all entities in the hub in terms of size. The view can be filtered to show only entities from the checked sources. If an entity contains members from both checked and unchecked sources, the displayed entity size counts only the member records in the checked sources.
Member comparison
This query provides a mechanism for comparing one member record with all members of a specified entity (see Figure 12B) or with a specified set of members.
Member entity frequency
This query shows how frequently members appear in entities; that is, the number of members in one entity, the number of members in two entities, the number of members in three entities, and so on.
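As a rough illustration of this frequency count, the sketch below tallies how many members belong to exactly one entity, exactly two entities, and so on. The input layout (a mapping from entity ID to a list of member IDs) and the function name are assumptions made for the example, not structures described for the hub.

```python
from collections import Counter


def member_entity_frequency(entity_members):
    """Map "number of entities a member is in" -> "number of such members".

    entity_members maps a (hypothetical) entity ID to the member IDs it holds.
    """
    entities_per_member = Counter()
    for members in entity_members.values():
        for member_id in set(members):  # count each member once per entity
            entities_per_member[member_id] += 1
    frequency = Counter(entities_per_member.values())
    return dict(sorted(frequency.items()))


# Hypothetical sample data: m2 appears in three entities, m1 and m3 in one each.
entities = {"e1": ["m1", "m2"], "e2": ["m2", "m3"], "e3": ["m2"]}
print(member_entity_frequency(entities))
```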
Member entity values
This query shows the entities to which a member belongs.
Members by entity count
This query shows a list of members that fall within a specified range of entity counts (for example, all members in 3 or more entities). If no maximum is specified, a value of 0 is shown in the maximum number of entities field. Otherwise, the maximum number of entities must be greater than or equal to the minimum number of entities.
Score distribution
This query shows the distribution of scores for all record pairs in the system. In some embodiments, single-member entities, or entities with more than two member records, may be excluded from the results. In some embodiments, the number of pairs shown for each score is the sum of all counts within a given score range. For example, an x-axis score of 27 can represent all pairs with scores between 26.1 and 27.0. The view can be filtered to show only entities from the checked sources. If an entity contains members from both checked and unchecked sources, the displayed entity size counts only the member records in the checked sources. If no results are shown for a particular link type, there may be no entities that satisfy the criteria for that link type and/or the selected set of sources.
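The binning in the example above (an x-axis value of 27 covering scores from 26.1 to 27.0) corresponds to rounding each score up to the next whole number. The sketch below assumes that scores are reported in tenths and that rounding up is the binning rule; both are inferences from the single example given, and the function name is hypothetical.

```python
import math
from collections import Counter


def score_histogram(pair_scores):
    """Count record pairs per whole-number score bin.

    Each score is rounded up, so bin 27 collects scores 26.1 through 27.0.
    """
    bins = Counter(math.ceil(score) for score in pair_scores)
    return dict(sorted(bins.items()))


# Hypothetical pair scores.
scores = [26.1, 26.5, 27.0, 27.1, 12.0]
print(score_histogram(scores))
```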
Figure 13 illustrates screenshot 130 of one embodiment of data analysis tool 434. In one embodiment, data analysis tool 434 can provide an attribute validity query, as shown in Figure 13.
Attribute validity
This query shows, for all sources together and for each individual source, the percentage of records in which a value is present for each member type attribute. Attributes whose values are present in a high percentage of records should be considered potential candidates for use in the algorithm. In some embodiments, the results are sorted by attribute name by default. In some embodiments, the results can be sorted by column. In some embodiments, the sources can be filtered so that the resulting table lists percentages only for the member type records contained in the specified sources.
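A minimal sketch of such a percentage calculation is shown below. The record layout (dictionaries with a `source` key), the `"ALL"` label for the combined row, and the rule that `None` or an empty string counts as "no value" are assumptions made for the example; the description does not specify how data analysis tool 434 represents these.

```python
from collections import defaultdict


def attribute_validity(records, attributes):
    """Percentage of records with a value present, per source and overall.

    Returns {scope: {attribute: percent}}, where scope is a source name or
    the assumed label "ALL" for every source combined.
    """
    totals = defaultdict(int)    # scope -> number of records
    present = defaultdict(int)   # (scope, attribute) -> records with a value
    for rec in records:
        for scope in ("ALL", rec["source"]):
            totals[scope] += 1
            for attr in attributes:
                if rec.get(attr) not in (None, ""):
                    present[(scope, attr)] += 1
    return {
        scope: {a: 100.0 * present[(scope, a)] / totals[scope] for a in attributes}
        for scope in totals
    }


# Hypothetical member records from two sources.
records = [
    {"source": "A", "ssn": "123", "sex": "F"},
    {"source": "A", "ssn": "", "sex": "M"},
    {"source": "B", "ssn": "456", "sex": None},
    {"source": "B", "ssn": "789", "sex": "M"},
]
validity = attribute_validity(records, ["ssn", "sex"])
print(validity["ALL"], validity["A"], validity["B"])
```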
Figure 14 illustrates screenshot 140 of one embodiment of grouping analysis tool 436. In some embodiments, if the number of records in the hub exceeds 2 million, grouping analysis queries will not run unless the data has first been prepared. In some embodiments, data preparation involves precomputing, from the raw member and grouping data, a set of intermediate data that can be queried quickly. This data preparation can be run through configuration editor 410 as "grouping analysis preparation". In some cases, preparing the data for 2 to 5 million records may take about 10 minutes, and preparing the data for 500 million records may take about 5 hours. These estimates may vary widely with different hardware and database configurations. If the member data is modified, the prepared data should also be recalculated to avoid seeing stale results.
Screenshot 140 shows the result of a grouping analysis overview query, which is one of several queries available through grouping analysis tool 436. In some embodiments, the queries available through grouping analysis tool 436 can include grouping analysis overview, grouping composition, grouping size distribution, groupings by size, batch cross-match comparison distribution, member grouping frequency, member grouping values, member comparison distribution, and members by grouping count.
Grouping analysis overview
This query provides overall information about the health of the hub's grouping strategy. As illustrated in Figure 14, in one embodiment the top half of the view is populated with information such as the number of large groupings and the number of ungrouped members. Large groupings in a particular range and/or ungrouped members can be inspected by clicking the appropriate button. More specifically, clicking the view groupings button selects the groupings by size view and runs the query with the desired grouping size range. Clicking the view members button selects the members by grouping count view and runs the query to show the members that are not in any grouping. In this example, the bottom section of the view shown in Figure 14 shows the ten largest groupings together with the hash value of each grouping, the grouping role that produced it, and the grouping value of one of its members. The grouping value may be the same for all members of a given grouping. Selecting a grouping hash and clicking the view grouping button runs the grouping composition query and populates the view with the members of the grouping selected by that hash code and those members' grouping values.
Grouping composition
This query shows the content of a specified grouping. The resulting table lists the memrecno of each member in the specified grouping, along with the grouping role and grouping value of each member. The grouping value shown is the actual grouping value most recently computed from the member data in the database. If different grouping values are shown for the same grouping hash, this indicates a grouping hash collision. A collision is considered an anomaly, and it may explain why some members that would not ordinarily be compared with each other are being compared. However, this situation is generally not considered harmful to system health. In some embodiments, the view for this query can include a view members button and a view algorithm button: selecting a row in the resulting table and clicking the view members button runs the member grouping values query to show the groupings of the selected member, and clicking the view algorithm button opens algorithm editor 420 and selects the grouping role that created the specified grouping (see Figure 9A).
Grouping size distribution
This query provides an overall view of all groupings in the hub in terms of size. In some embodiments, large groupings are shown on the right side of the view, and color indicators ranging from green (smaller groupings) through yellow (medium-sized groupings) to red (large groupings) are used to mark them. The data points in the grouping size distribution graph can follow a downward curve from left (smaller groupings) to right (larger groupings). A large number of data points on the right side of the graph can therefore be an area of concern and may indicate missing anonymous values, incorrect thresholds, data problems, and so on. In some embodiments, clicking a data point selects the groupings by size view and runs the query to show the groupings of that size. In some embodiments, holding down the control key while clicking a data point runs the query to show groupings of that size and larger.
Groupings by size
This query provides the ability to query for groupings that match a specified size range (the number of members in the grouping). For example, specifying 0 for the minimum or maximum size means no limit (no minimum or no maximum). In some embodiments, the resulting table can show the member count of the grouping, the grouping hash, the grouping role, and the grouping value of one sample member of the grouping. The grouping value may also be the same for all members of any given grouping. The exception is a hash collision that causes different grouping values to share the same grouping hash. To check for this, the user can select a grouping and click the view grouping button to see all members of the grouping and their grouping values. If a frequency-based problem with a particular grouping role is identified (a missing grouping, etc.), algorithm editor 420 can be opened by selecting a row of the table and clicking the view algorithm button. This brings up algorithm editor 420 and selects the particular grouping role that created the selected grouping (see Figure 9A).
Batch cross-match comparison distribution
This query calculates the number of comparisons required to perform a batch cross-match in terms of the maximum grouping size parameter (the grouping size limit) specified for the mpxcomp job. This comparison count can then be used, together with the number of threads and the number of comparisons per second per thread, to estimate an approximate completion time for the batch cross-match.
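The completion-time estimate described above can be sketched as a simple division. The comparison count itself is modeled here under two loudly stated assumptions that are not confirmed by the description: every pair of members inside a grouping is compared once, giving k(k-1)/2 comparisons for a grouping of k members, and groupings larger than the size limit are skipped. Pairs that share more than one grouping would be counted more than once, so the count is best read as an upper estimate. All function names are hypothetical.

```python
def cross_match_comparisons(grouping_sizes, max_grouping_size):
    """Estimate pairwise comparisons for groupings within the size limit.

    Assumes k*(k-1)/2 comparisons per grouping of k members and that
    groupings above max_grouping_size are skipped entirely.
    """
    return sum(k * (k - 1) // 2 for k in grouping_sizes if k <= max_grouping_size)


def estimated_seconds(comparisons, threads, comparisons_per_second_per_thread):
    """Approximate run time from total comparisons and throughput."""
    return comparisons / (threads * comparisons_per_second_per_thread)


# Hypothetical grouping sizes; the grouping of 500 exceeds the limit of 100.
total = cross_match_comparisons([2, 3, 10, 500], max_grouping_size=100)
print(total, estimated_seconds(total, threads=4, comparisons_per_second_per_thread=10))
```

Running the count for several candidate size limits shows how quickly the workload grows as larger groupings are admitted, which is the trade-off the grouping size limit controls.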
Member grouping frequency
This view answers the question "How many members are in 1 grouping, 2 groupings, 3 groupings, and so on?" in a form such as a bar graph. The x-axis data point 0 shows the number of ungrouped members. In some embodiments, clicking a bar in the graph selects the members by grouping count view and runs the query to show the members that are in that number of groupings.
Member grouping values
This view shows which groupings a specified member is in. The resulting table shows the grouping hash, the grouping value, and the grouping role that produced each grouping. In some embodiments, selecting a grouping and clicking the view grouping button selects the grouping composition view and runs the query to show the composition of the grouping with the selected hash. Clicking the view algorithm button opens algorithm editor 420 and selects the grouping role responsible for creating that grouping (see Figure 9A).
Member comparison distribution
This view shows estimated system performance in terms of the number of comparisons performed. That is: when a search is run, how many actual comparisons will be performed? As an example, the member comparison distribution graph may show that three comparisons are performed on average. More specifically, in some embodiments, 1 search in 10 will result in about 6 comparisons, 1 in 100 will result in 7.5 comparisons, and 1 in 1000 will result in about 8 comparisons. These figures are based on 20,000 randomly sampled members from the system. If the system has fewer than 20,000 members, all members are used. On average, a target member will be compared with all members that share a grouping with it.
Members by grouping count
This view provides the ability to query for members based on the number of groupings in which they are included. In some embodiments, specifying 0 for both the minimum and the maximum returns all ungrouped members. For a minimum greater than 0, a maximum of 0 means no limit. In some embodiments, the resulting table shows the memrecno, the number of groupings the member is in, and the cmpd string for the member. In some embodiments, selecting a member and clicking the view member button selects the member grouping values view to show all groupings in which that member appears.
Figure 15 illustrates screenshot 150 of one embodiment of link analysis tool 438. In some embodiments, the queries provided by the link analysis tool include a member duplicates query and a member overlap query.
Member duplicates
This query shows various error rates relating to duplicate members (member records from the same source that are linked to the same entity). As illustrated in Figure 15, in one embodiment, the first four columns of the resulting table show the raw data from the hub database (broken down by source): the number of members, the number of entities, the number of duplicate sets, and the number of members in those duplicate sets. The last three columns list the error rates that can be calculated from these values:
Record error rate: indicates how many records must be examined to identify the duplicates, or how many records give an incomplete view of a member.
Entity duplication rate: indicates how many members have duplicate records, or the probability that a random member has a duplicate record.
Record duplication rate: indicates how many records are duplicates, or the percentage of records that could be eliminated.
Member overlap
This query provides information about the number of overlaps in the hub. An overlap may exist when an entity has records from more than one source. For example, if an entity has three records and each record is in a separate source system, each source is considered to have two overlaps (A with B, A with C, and so on). In some embodiments, the resulting table can show the number of unique entities represented in the specified source and the percentage of all entities that are represented by records in that source. In some embodiments, the resulting table can also show the count and percentage of those entities that overlap with at least one other source (that is, entities that have at least one record in another source). An entity that overlaps with several other sources is counted only once in the resulting table. In some embodiments, the resulting table can also show each source-by-source combination. For example, where the row source and the column source are the same, the percentage is 100%. Where the row source and the column source differ, the count represents the number of overlaps that exist between the row source system and the column source system. The percentage therefore represents the percentage of entities in the row source that overlap with the column source.
Thus, in one embodiment, the method for analyzing the identity hub can comprise analyzing the error rates associated with a set of data records. In one embodiment, the error rates can include a record error rate and a person error rate. In one embodiment, the record error rate for duplicates is the number of records in duplicate sets divided by the total number of records. It represents the chance of drawing a fragmented record view at random from the file. In one embodiment, the person error rate is the number of distinct individuals having multiple records divided by the total number of individuals represented in the file. Take the simple case of five records A, B, C, D, and E, where A, B, and C represent the same person. The record error rate is then 3/5 and the person error rate is 1/3 (the file represents 3 distinct people, A-B-C, D, and E, and one of them has multiple records).
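The five-record example above can be reproduced with a short calculation. The sketch assumes that the true person behind each record is known and supplied as a mapping; the function name and the person labels are hypothetical. Exact fractions are used so that the results match the 3/5 and 1/3 stated in the text.

```python
from fractions import Fraction


def duplicate_error_rates(record_to_person):
    """Return (record error rate, person error rate) for a file of records.

    record error rate = records in duplicate sets / total records
    person error rate = persons with multiple records / total persons
    """
    records_by_person = {}
    for record, person in record_to_person.items():
        records_by_person.setdefault(person, []).append(record)
    duplicate_sets = [recs for recs in records_by_person.values() if len(recs) > 1]
    record_error_rate = Fraction(
        sum(len(recs) for recs in duplicate_sets), len(record_to_person)
    )
    person_error_rate = Fraction(len(duplicate_sets), len(records_by_person))
    return record_error_rate, person_error_rate


# Records A, B, and C represent the same person; D and E are distinct people.
example = {"A": "p1", "B": "p1", "C": "p1", "D": "p2", "E": "p3"}
print(duplicate_error_rates(example))  # (Fraction(3, 5), Fraction(1, 3))
```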
In one embodiment, the error rates can include false positive and false negative rates. In one embodiment, the error rates are associated with the clerical review (CR) and AutoLink (AL) thresholds. In one embodiment, the CR and AL thresholds represent the tolerance of identity hub 32 for false positive and false negative rates when matching a set of data records. Thus, one embodiment of the method for analyzing the identity hub can comprise analyzing the clerical review threshold and the AutoLink threshold. Figure 16 illustrates a screenshot of one embodiment of a graphical user interface that can be used to analyze the error rates and thresholds associated with member records in the identity hub.
One approach to estimating these thresholds involves scoring a sample of the links produced by a batch cross-match, fitting the scored results to a model curve of hit rate, and using the resulting curve to obtain thresholds based on the desired error rates. This approach has some potential difficulties. First, it requires people to review and score several thousand linked pairs across a very wide range of scores. This leads to inevitable variation, because individuals interpret what counts as a match or a non-match differently. Second, the hit rate combines the duplication rate inherent in the data with the file size (if the data sample used contains no duplicates, the hit rate is zero for all scores). Third, this process produces thresholds that apply to cross-matching and that need to be converted to search or query error rates.
In some embodiments, the new estimation procedure described below can address these problems. One advantage of the new approach is that it can be applied initially based on the data pattern, or based on the set of new statistics produced during automatic weight generation.
False positive rate (AutoLink threshold)
The benefit of using a likelihood ratio for scoring is that there is a theoretical expression that can be used to approximate the statistical false positive rate for a fixed threshold. It also means that, if done correctly, the probability that a match is a false match depends only on the score and not on the actual data.
Let the vector $\mathbf{x}$ represent the result of comparing two records. The likelihood ratio, or score, of this comparison is

$$\lambda(\mathbf{x}) = \frac{f_m(\mathbf{x})}{f_u(\mathbf{x})}$$

where $f_m(\mathbf{x})$ is the probability density of this comparison under the hypothesis that the records refer to the same object (person, business, etc.). That is, it is the probability of observing this result if we know that the records should match. Similarly, $f_u(\mathbf{x})$ is the probability density of observing this result when the records do not refer to the same object (that is, it is the probability that this set of comparison results occurs at random).
In some embodiments, the hub links two records when the logarithm of this score is greater than some threshold $T$, so the false positive probability is the probability that the comparison score exceeds the threshold when the records do not refer to the same object. Mathematically,

$$P_{fp} = \int_{\{\mathbf{x} \,:\, \log_{10}\lambda(\mathbf{x}) > T\}} f_u(\mathbf{x})\, d\mathbf{x}$$

Now, on the set $\{\mathbf{x} : \log_{10}(\lambda(\mathbf{x})) > T\}$, we have $f_u(\mathbf{x}) < 10^{-T} f_m(\mathbf{x})$.
Therefore, for a single comparison, the false positive probability is bounded as follows:

$$P_{fp} < 10^{-T} \int_{\{\mathbf{x} \,:\, \log_{10}\lambda(\mathbf{x}) > T\}} f_m(\mathbf{x})\, d\mathbf{x} \le 10^{-T}$$

If the threshold is relatively large, a single search against a database containing $n$ records can be expected to behave like $n$ independent comparisons. This means that the probability that a single search of the database returns a false positive above the threshold is the same as the probability that the maximum of $n$ independent single comparisons is above the threshold. Let $\{s_1, s_2, \ldots, s_n\}$ denote the scores obtained by comparing a single record against all records in the database. For large $T$, the probability that a search produces a false positive can be expressed as

$$P_{fp} = P\left(\max_i s_i > T\right) = 1 - \prod_{i=1}^{n} P(s_i \le T) \approx 1 - \left(1 - 10^{-T}\right)^{n}$$

This can be further simplified to

$$P_{fp} \approx n \times 10^{-T}$$

where $10^{T}$ is large relative to $n$.
As an example, if a threshold of 11 is used for a database with 1,000,000 records,

$$P_{fp} \approx 1000000 \times 10^{-11} \approx 10^{-5}$$

In other words, about 1 in 100,000 searches produces a false positive.
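The worked example can be checked numerically. The sketch below evaluates both the product form, 1 - (1 - 10^-T)^n, and the simplified form, n x 10^-T, for a database of 1,000,000 records and a threshold of 11. The function name is hypothetical, and the per-comparison probability is taken at its upper bound 10^-T, as in the derivation.

```python
import math


def search_false_positive_probability(n_records, threshold):
    """Return (product-form estimate, linear approximation) for one search.

    Uses the single-comparison bound 10**(-threshold) and treats the search
    as n_records independent comparisons.
    """
    p_single = 10.0 ** (-threshold)
    # 1 - (1 - p)^n, written with log1p/expm1 to avoid precision loss
    # when p is very small.
    product_form = -math.expm1(n_records * math.log1p(-p_single))
    linear_form = n_records * p_single
    return product_form, linear_form


product_form, linear_form = search_false_positive_probability(1_000_000, 11)
print(product_form, linear_form)
```

For these values the two forms agree to within a few parts per million, which is why the simpler linear expression is adequate whenever 10^T is large relative to n.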
Improving the AutoLink threshold based on a scored sample of pairs
Once a sample of pairs has been scored (assuming the sampling is uniform), a new AutoLink (AL) threshold can be calculated. The information required can include:
A file containing the scored pairs. This file can contain the score of each pair and an indicator of whether the two records in the pair represent the same person (SP), do not represent the same person (NSP), or there is not enough information to decide (NEI). The indicator values can be assigned by the scoring program accordingly. For example, 1 means SP, 0 means NSP, and -1 means NEI.
Counts of the total number of pairs by score produced by BXM (if sources were filtered when the random pairs were generated, this is the count of all pairs in which both members are in the filtered sources).
The number of records in the database (if sources were filtered when the random pairs were generated, this is the count of records in those sources).
In some embodiments, the first step is to take the uniform sample and obtain the percentages of NSP and SP by score. Only NSP is needed to update the AL threshold. The next step is to obtain the total number of pairs by score. This can be generated in the step that creates the sample pairs before manual evaluation. The next step is to calculate the false positive probability as a function of score. To do this, the size of the database must be known in order to normalize between the batch cross-match rate and the query rate. For each score bin, the NSP probability is multiplied by the total number of pairs at that score and divided by the size of the database minus 1, and the result is multiplied by 2. If the resulting distribution is not smooth, a linearized exponential function can be fitted to the sample data. That is, find coefficients $a$ and $b$ such that the function $p = e^{a+bs}$ is a least-squares fit to the sample data, where $s$ is the score.
The new AL threshold can be calculated from the fitted coefficients as

$$AL = \frac{\ln\left(-\mathrm{fprate} \cdot b \,/\, \left(0.1 \cdot e^{a}\right)\right)}{b}$$
The false positive rate can be expressed as a function of the score $s$ by solving the AL threshold expression above for fprate:

$$\mathrm{fprate}(s) = -\frac{0.1 \cdot e^{a+bs}}{b}$$
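The threshold update described above can be sketched end to end. The sketch makes several assumptions that the description leaves open: the "linearized exponential" fit is taken to mean an ordinary least-squares line through the natural logarithm of the per-score values, the fitted slope b is negative, and all function names and sample numbers are hypothetical. Only the per-bin normalization (multiply by 2, divide by database size minus 1) and the AL formula come from the text.

```python
import math


def per_query_false_positive(nsp_fraction, total_pairs, n_records):
    """Per-score false positive estimate: 2 * P(NSP) * pairs / (n - 1)."""
    return 2.0 * nsp_fraction * total_pairs / (n_records - 1)


def fit_log_linear(scores, values):
    """Least-squares fit of ln(values) = a + b * scores; returns (a, b).

    Assumes every value is positive so that the logarithm is defined.
    """
    logs = [math.log(v) for v in values]
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(logs) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, logs))
    b = sxy / sxx
    a = mean_y - b * mean_s
    return a, b


def autolink_threshold(fprate, a, b):
    """AL = ln(-fprate * b / (0.1 * exp(a))) / b, as given in the text."""
    return math.log(-fprate * b / (0.1 * math.exp(a))) / b


# Hypothetical, perfectly exponential sample with a = 2.0 and b = -1.5.
sample_scores = [8.0, 9.0, 10.0, 11.0]
sample_values = [math.exp(2.0 - 1.5 * s) for s in sample_scores]
a, b = fit_log_linear(sample_scores, sample_values)

# A target rate chosen so that the resulting threshold is 10.
target_rate = 0.1 * math.exp(2.0 - 1.5 * 10.0) / 1.5
print(a, b, autolink_threshold(target_rate, a, b))
```

With real sample data the per-score values would come from `per_query_false_positive` rather than from a synthetic curve, and score bins with no NSP pairs would need to be dropped before taking logarithms.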
Updating the clerical review threshold
Once a suitable AutoLink threshold has been determined, the number of tasks can be estimated as a function of the clerical review (CR) threshold. This can be obtained by summing the pair counts from the score in question up to the AutoLink threshold. The user can adjust the CR threshold to produce a fixed number of tasks. Figure 17 illustrates the relationship between system performance and the tolerance for false positive and false negative rates associated with linked member records in the identity hub. In the example of Figure 17, the AL and CR thresholds produce 12 clerical review tasks.
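The task estimate above amounts to a running sum over the pair counts by score. The sketch below assumes that a pair becomes a clerical review task when its score is at or above the CR threshold and below the AL threshold; the exact boundary handling is an assumption, as are the function names and the sample counts, which are invented and are not the data behind Figure 17.

```python
def clerical_review_tasks(pairs_by_score, cr_threshold, al_threshold):
    """Number of pairs scoring in [cr_threshold, al_threshold)."""
    return sum(
        count
        for score, count in pairs_by_score.items()
        if cr_threshold <= score < al_threshold
    )


def cr_threshold_for_task_budget(pairs_by_score, al_threshold, max_tasks):
    """Lowest CR threshold whose task count does not exceed max_tasks."""
    best = al_threshold
    running_total = 0
    below_al = sorted((s for s in pairs_by_score if s < al_threshold), reverse=True)
    for score in below_al:
        running_total += pairs_by_score[score]
        if running_total > max_tasks:
            break
        best = score
    return best


# Hypothetical pair counts by score bin.
pairs_by_score = {6: 40, 7: 6, 8: 3, 9: 2, 10: 1, 11: 5}
print(clerical_review_tasks(pairs_by_score, cr_threshold=7, al_threshold=11))
print(cr_threshold_for_task_budget(pairs_by_score, al_threshold=11, max_tasks=12))
```

Lowering the CR threshold by one score bin in this sample would add 40 tasks at once, which illustrates why the threshold is typically tuned against a fixed review workload.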
In the foregoing description, the disclosure has been described with reference to specific embodiments. It should be understood, however, that this description is by way of example only and is not to be construed as limiting. It should therefore also be understood that numerous changes in the details of the embodiments of the disclosure, and additional embodiments of the disclosure, will be apparent to, and may be made by, persons of ordinary skill in the art having reference to this description. All such changes and additional embodiments are contemplated to be within the scope of the disclosure as detailed in the claims.