HK1091916B

HK1091916B - Data integration method

Info

Publication number: HK1091916B
Application number: HK06112391.4A
Authority: HK
Inventors: Sandra L. Stoker; Ahmad Tariq Sharif; Michael E. Prevoznak; Christopher John Lucas; Charles R. Benke; Maria P. Seckler; Alan Duckworth
Original assignee: Dun & Bradstreet, Inc.
Priority date: 2003-02-18
Filing date: 2004-01-21
Publication date: 2014-11-21

Description

Data integration method

Technical Field

The invention relates to a data processing method, in particular to a method for processing data related to an enterprise.

Background

To succeed, businesses need to make informed decisions. In risk management, a business needs to understand and manage its total risk exposure. It needs to identify high-risk accounts and collect on them, and it needs to approve or grant credit quickly and consistently. In sales and marketing, businesses need to target the most profitable customers and prospects and to grow opportunities within their existing customer base. In supply management, businesses need to know the total amount they spend with a supplier in order to negotiate better. They also need to uncover risks and dependencies involving their suppliers to reduce the impact of supplier failure.

The success of these business decisions depends largely on the quality of the information behind them. Quality is determined by whether the information is accurate, complete, timely, and consistent. With thousands of data sources available, deciding which information is good enough to base a decision on is a challenge. This is especially true because businesses change very frequently. Within the next thirty minutes, 120 businesses will change addresses, 75 business phone numbers will change or be disconnected, 30 new businesses will open, 20 chief executive officers (CEOs) will leave their jobs, 15 companies will change names, and 10 businesses will close.

Conventional methods of providing business data are incomplete. Some providers collect incomplete data, fail to match entities accurately, have incomplete numbering systems that reuse numbers, fail to provide corporate family information or provide it only in part, and provide only limited value-added predictive data. It is an object of the present invention to provide more complete and accurate business data. This includes complete and accurate data collection, entity matching, identification number assignment, corporate linkage, and predictive indicators. This completeness and accuracy yields high quality business information that enterprises can trust and rely on to make business decisions.

Disclosure of Invention

The invention is a data integration method for providing quality information that enables an enterprise to make business decisions. In the method, business data is collected as raw data, also referred to as primary data. The primary data is tested for accuracy and processed to generate secondary data for completeness. Processing the raw data to form the secondary data includes performing corporate linkage and providing a predictive indicator. The combined primary and secondary data is then provided as enhanced business information. The primary and/or secondary data is periodically sampled and evaluated according to predetermined conditions, and the testing and/or processing is adjusted accordingly to ensure quality.

Testing the raw data includes determining whether the raw data matches previously stored data. If a match is found, corporate linkage is performed (i.e., the affiliations between companies are checked). If no match is found, then testing includes determining whether the raw data satisfies a first threshold condition, such as when at least two sources confirm the existence of a business associated with the raw data. If the raw data meets the first threshold condition, an identification number is assigned and secondary data is created and saved. The identification number uniquely identifies an enterprise, is used once, and is not reused. If the raw data does not satisfy the first threshold condition, the raw data is stored in a repository until new data becomes available. Upon receipt of new data, the testing includes determining whether the original data, together with the new data, satisfies the first threshold condition. If so, an identification number is assigned and the secondary data is stored.

Performing corporate linkage includes determining whether the raw data satisfies a second threshold condition, such as a predetermined sales volume. If so, the primary data is analyzed and processed, and the secondary data is created and stored to associate corporate families with the primary data. After merger or acquisition, the corporate family is updated. If the primary data does not satisfy a second threshold condition, a predictive indicator is created as additional secondary data.

Predictive indicators are created only when the raw data meets a third threshold condition, such as a predetermined level of customer inquiry. If so, the primary data is analyzed and processed, and additional secondary data is created and stored as a predictive indicator, such as a descriptive rating, score, or demand estimator.

Another embodiment of the invention is a system for data integration. The system includes a database, a data collection component, an identification number component, and a predictive indicator component. The database component stores information related to an enterprise. The data collection component collects raw data related to the enterprise. The identification number component applies an identification number to the primary data and stores secondary data in the database component. The predictive indicator component provides a predictive indicator associated with the business and also stores secondary data in the database component. The system can also include an entity matching component and a corporate linkage component. The entity matching component prevents duplicate business entities from appearing in the database component. The corporate linkage component associates corporate families with businesses in the database component.

Yet another embodiment of the invention is a machine-readable medium for storing executable instructions for data integration. The instructions include collecting raw data for a business, performing entity matching for the business, applying an identification number to the business, performing corporate linkage for the business, and providing the predictive indicator for the business.

Applying the identification number is a process that starts with receiving a request. The request has an identification number and raw data. If the identification number does not already exist, one is assigned. Otherwise, if the identification number is associated with other data, verification is performed and the identification number is provided.

Performing corporate linkage includes maintaining family trees, performing surveys, processing the family trees, and storing them. A family tree is maintained by checking and updating any standard industry classification codes, checking and standardizing trade styles, and resolving any duplicates. The survey collects information. The family tree is processed by examining and processing the collected information, examining and updating any matches, and resolving any look-alikes or unlinked extraneous data.

Providing the predictive indicator includes determining a model and an outcome of the prediction. Next, study samples (depth samples) are selected, profiles are created, and statistical analysis is performed. Finally, the predictive indicator is provided based on the model, results, samples, profiles, and statistical analysis.

These and other features, aspects, and advantages of the present invention will become better understood with regard to the drawings, description, and claims.

Drawings

FIG. 1 is a block diagram of a data integration method according to the present invention;

FIG. 2 is a block diagram of a system for data integration according to the present invention;

FIG. 3 is a block diagram of a system for data integration according to the present invention;

FIG. 4 is a logical block diagram illustrating a data integration method according to the present invention;

FIG. 5 is a block diagram of an exemplary source of data collection in accordance with the present invention;

FIG. 6 is a block diagram of further exemplary sources of data collection in accordance with the present invention;

FIGS. 7 and 8 are block diagrams of entity matching according to the present invention;

FIG. 9 is a block diagram of entity matching wherein matched data is transferred to a database and unmatched data is sent for assignment of a new company identification number in accordance with the present invention;

FIG. 10 is a block diagram of entity matching in which matched data is transferred to a database and unmatched data is either sent for assignment of a new company identification number or stored in the database until additional data can be collected in accordance with the present invention;

FIGS. 11 and 12 are block diagrams of an entity matching method according to the present invention;

FIGS. 13-16 are block diagrams of corporate linkage according to the present invention;

FIG. 17 is a logical block diagram of an exemplary method of performing corporate linkage in accordance with the present invention;

FIGS. 18A and 18B are block diagrams of exemplary methods of providing predictive indicators according to the present invention.

Detailed Description

In the following detailed description, reference is made to the accompanying drawings. The drawings constitute a part of this specification and illustrate by way of example specific preferred embodiments in which the invention may be practiced. These specific embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the spirit and scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

FIG. 1 shows an overview of a data processing method according to the invention. The method is based on quality assurance 102, which is continuous data auditing, validation, normalization, correction, and updating to ensure quality throughout the entire process. Five quality drivers operate sequentially to enhance the input data 104 into quality information 106. These five drivers are: data collection driver 108, entity matching driver 110, identification (ID) number driver 112, corporate linkage driver 114, and predictive indicators driver 116. The five drivers access the database 118. Database 118 is an organized collection of data and database management tools, such as a relational database, object-oriented database, or other type of database. The data in database 118 is continually refined and enhanced based on quality assurance and customer feedback in the global data collection.

The data collection driver 108 aggregates data from various sources around the world. The data is then integrated into database 118 via entity matching driver 110, resulting in a single, more accurate description of each business entity. The identification number driver 112 then uses the identification number as a unique means of identifying and tracking a business through any changes it undergoes. Corporate linkage driver 114 then establishes corporate families so that the full enterprise risk and opportunity can be viewed. Finally, predictive indicators driver 116 uses statistical analysis to evaluate the past performance of a business and indicate the likelihood that the business will perform the same way in the future.

FIGS. 2 and 3 show two exemplary embodiments of systems for data integration according to the present invention, but other systems may be suitable for implementing the present invention. FIG. 2 illustrates a network configuration, while FIG. 3 illustrates a computer system configuration. In FIG. 2, network 200 facilitates communication among other system components, including computer system 202. The five quality drivers (data collection driver 108, entity matching driver 110, identification number driver 112, corporate linkage driver 114, and predictive indicators driver 116) and quality assurance 102 work sequentially to enhance the input data 104 into quality information 106 that is stored in database 204. In FIG. 3, a computer system 300 has a processor 302 that accesses a memory 304 via a bus 306. Memory 304 stores operating system program 308, data integration program 310, and data 312.

FIG. 4 shows another embodiment of a data integration method according to the present invention. The method comprises five main parts of data integration: data collection 400, entity matching 402, identification number 404, corporate linkage 406, and predictive indicators processing 408 to produce high quality data 410. Data collection 400 collects raw data. The raw data is tested for accuracy and processed to generate secondary data. Processing the raw data includes performing corporate linkage 406 and providing predictive indicators 408. The combined primary and secondary data is then provided as enhanced business information or high quality data 410. The primary and secondary data are periodically sampled and evaluated according to predetermined conditions. As a result, the testing and processing is adjusted to ensure quality.

Testing the raw data includes determining 412 in entity matching 402 whether the raw data matches previously stored data. If so, corporate linkage 406 is performed. If not, then testing includes determining whether the raw data satisfies a first threshold condition 414, such as when at least two sources confirm that there is an enterprise associated with the raw data. If the primary data meets the first threshold condition, then control passes to an identification number component 404, where an identification number is assigned 420 and secondary data is stored 422. The identification number uniquely identifies an enterprise, is used only once, and is not reused. If the raw data does not satisfy the first threshold condition, the raw data is stored in the repository 416 until new data becomes available 418. Once new data is received, testing includes determining whether the original data, along with the new data, satisfies the first threshold condition. If so, an identification number is assigned and secondary data is stored.

Performing corporate linkage 406 includes determining whether the raw data satisfies a second threshold condition 424, such as a predetermined sales volume. If so, the primary data is analyzed and processed 426 and secondary data is stored 428 to associate a corporate family with the primary data. After merger or acquisition, the corporate family is updated. If the raw data does not satisfy the second threshold condition, control passes to a predictive indicator component 408.

Providing predictive indicators 408 includes determining whether the raw data satisfies a third threshold condition 430, such as a predetermined level of customer inquiry. If so, the raw data is analyzed and processed 432 and secondary data is stored 434 to produce predictive indicators, such as descriptive ratings, scores or demand estimators.
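The decision flow of FIG. 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the class and field names, the name-based matching, and the numeric values of the second and third thresholds are all hypothetical. It also assumes that the predictive indicator check runs whether or not corporate linkage was performed, which the text does not state explicitly.

```python
from dataclasses import dataclass, field

# Threshold values are hypothetical. The text names the kinds of thresholds
# but gives a concrete example only for the first one (at least two sources).
MIN_CONFIRMING_SOURCES = 2
MIN_SALES_FOR_LINKAGE = 1_000_000
MIN_INQUIRIES_FOR_INDICATOR = 10


@dataclass
class RawRecord:
    name: str
    sources: set
    sales: float = 0.0
    inquiries: int = 0


@dataclass
class Integrator:
    ids: dict = field(default_factory=dict)      # matched name -> identification number
    pending: dict = field(default_factory=dict)  # stand-in for repository 416
    next_id: int = 1

    def process(self, rec: RawRecord) -> list:
        key = rec.name.strip().lower()           # naive stand-in for entity matching 402
        steps = []
        if key in self.ids:
            steps.append("matched")
        else:
            held = self.pending.pop(key, None)
            if held is not None:                 # combine the original data with the new data
                rec = RawRecord(rec.name, rec.sources | held.sources,
                                max(rec.sales, held.sales),
                                max(rec.inquiries, held.inquiries))
            if len(rec.sources) < MIN_CONFIRMING_SOURCES:
                self.pending[key] = rec          # hold until more data arrives
                return ["held"]
            self.ids[key] = self.next_id         # each number is issued once, never reused
            self.next_id += 1
            steps.append("id_assigned")
        if rec.sales >= MIN_SALES_FOR_LINKAGE:           # second threshold 424
            steps.append("corporate_linkage")
        if rec.inquiries >= MIN_INQUIRIES_FOR_INDICATOR:  # third threshold 430
            steps.append("predictive_indicator")
        return steps
```

Calling `process` twice for the same unmatched business, each time with a different single source, returns `["held"]` the first time and assigns an identification number the second time, once two sources confirm the business.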

Thus, the five components or drivers work together to integrate the collected data into enhanced data that can be used to make business decisions. Each of the five drivers is examined in more detail below, starting with the data collection driver 108.

FIG. 5 illustrates some of the data sources used in the data collection driver 108. Data about customers, potential customers, and suppliers is collected with the goal of gathering the most complete data possible. In particular, some sources of data are direct surveys 502, transactional data 504, public records 506, and Web sources 508. Direct surveys 502 include telephone calls to the business. The transactional data 504 includes updated transaction records. Public records 506 include suits, liens, judgments, bankruptcy filings, and business registrations, among others. Web sources 508 include Uniform Resource Locators (URLs), updates from domains, online updates provided by customers, and other Web data from the Internet.

The Web data includes, among other data, information from "Whois" files and from a central repository of registered domains called the VeriSign Registry. Whois is a program that reports the owner of any second-level domain name registered with VeriSign. VeriSign, Inc. is a company headquartered in Mountain View, California, USA. Through data mining, a base reference file of domain names is matched with identification numbers and expanded. Some URLs are matched by manual assignment. Information from the "Whois" file and from data mining is matched to the data in database 118. The base reference file is enhanced by data mining of additional Web site data such as status, privacy data, authentication data, and other data.

The coverage of the file is then extended. All matches between identification numbers and URLs are rationalized. URL coverage is extended across family tree members using upward and downward linkage. The URLs are ordered according to status and match type. A set number of URLs, such as the top five, is included in the output file. Another output file is created with all URLs and matched identification numbers, without linkage.

The URL base file data elements include the URL or domain name, a match code, a status flag, a redirect flag, and the total number of URLs per identification number. The match code indicates whether the match is to the site itself or to an affiliate. The status flag indicates whether the site is live, under construction, and so on. The redirect flag lists the actual URL if the site redirects to another site.
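As a rough illustration, the URL base file elements and the ordering by status and match type could be modeled as follows. The field names, the flag values (`"live"`, `"site"`, and so on), and the specific rank orders are assumptions made for this sketch; the text says only that URLs are ordered by status and match type and that a limited number, such as five, is output.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rank orders: lower sorts first.
STATUS_RANK = {"live": 0, "under construction": 1}
MATCH_RANK = {"site": 0, "affiliate": 1}


@dataclass(frozen=True)
class UrlRecord:
    id_number: str                       # identification number the URL is matched to
    url: str                             # URL or domain name
    match_code: str                      # "site" or "affiliate"
    status: str                          # e.g. "live", "under construction"
    redirect_url: Optional[str] = None   # actual URL if the site redirects


def url_count(records, id_number):
    """Total number of URLs per identification number."""
    return sum(1 for r in records if r.id_number == id_number)


def top_urls(records, id_number, limit=5):
    """Order one business's URLs by status, then match type, and keep the first few."""
    mine = [r for r in records if r.id_number == id_number]
    mine.sort(key=lambda r: (STATUS_RANK.get(r.status, 99),
                             MATCH_RANK.get(r.match_code, 99),
                             r.url))
    return mine[:limit]
```

Unknown status or match values sort last here, which is a design choice of the sketch rather than something the text specifies.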

There are also URL plus file elements, which are kept in a file separate from the URL base file. This file includes all URLs and data from the URL base file, plus summary data on Web site integrity and security for valid, live URLs. It also includes the total number of external and internal links, meta tag flags, security indicators, encryption strength, such as the presence of Secure Sockets Layer (SSL), and authentication indicators.

The extended URL plus elements are held in a further file, separate from both the URL base file and the URL plus file. It includes all URL base and URL plus data for live URLs, along with detailed data on Web site integrity and security. The detailed data include the type of secure Web server, the certificate issuing company, an owner identification indicating the certificate owner or certificate user, a number of external URL links, e.g., 5, and metadata such as keywords, descriptions, authors, and creators.

FIG. 6 shows some other data sources used by the data collection driver 108 for increased accuracy, such as phone books or yellow pages 602, news and media 604, direct surveys 606, corporate financial information 608, payment data 610, courts and legal filing offices 612, and government registrations 614. The completeness of this information helps users make profitable business decisions. In risk management, the user uses the derived information to assess risk from non-U.S. companies. Risk from small business customers can be more fully identified. Users are able to make better-informed risk decisions when those decisions are based on more complete information. In sales and marketing, a user may identify new potential customers from data drawn from multiple sources. The user can reach international customers and potential customers and cherry-pick lists of potential customers using value-added information such as Standard Industrial Classification (SIC) codes and contact names. In supply management, the user may use the derived information to assess risk from foreign suppliers and more fully identify risk from suppliers. Since database 118 is updated every day, the user can get a fresh, more complete description of each customer, potential customer, and supplier.

FIG. 7 shows how multiple unmatched pieces of data 702 may be transformed into a complete single enterprise 704. Entity matching driver 110 checks incoming data 104 to see if it belongs to any existing business in database 118. In this example, ABC, Inc., Chuck's Mini-Mart, and Charles Smith appear to be independent companies, but after entity matching it is clear that they are all part of one enterprise, ABC, Inc. and Chuck's Mini-Mart. The differing addresses and other related information are also reconciled into the complete single enterprise 704.

FIG. 8 illustrates how input data 104 that matches a business in database 118 is added to the business via entity matching driver 110. FIG. 9 illustrates another scenario in which input data 104 that does not match any of the businesses in database 118 is either designated as a new business or, as shown in FIG. 10, is saved in a repository 1002 to await further data verification that it is a new business. Entity matching driver 110 is designed to match data to the correct business at a time, thereby increasing efficiency. The entity matching driver 110 provides a more complete and accurate profile of customers, potential customers, and suppliers and ensures that duplicate businesses are rare.

FIG. 11 illustrates an exemplary method of matching by the entity matching driver 110. The method includes cleaning and parsing 1102, performing candidate retrieval 1104, and making decisions 1106. Cleaning and parsing 1102 includes identifying key components of query data 1108, normalizing name, address, and city 1110, performing name consistency 1112, and performing address normalization 1114. Candidate retrieval 1104 includes collecting possible matching candidates from a reference database 1116, using keys to improve retrieval quality and speed 1118, and optimizing keys based on the data provided in the query data 1120. Decision making 1106 includes evaluating matches according to consistent criteria 1122, applying a match rating 1124, applying a confidence code 1126, and applying a confidence percentage 1128.

FIG. 12 shows a more detailed method of matching by the entity matching driver 110. The method includes Web services 1202, cleaning, parsing, and normalization 1204, candidate retrieval 1206, and measurement, evaluation, and decision 1208. In Web services 1202, an HTTP server receives a request and provides a response 1210 in XML over HTTP, and an application server converts the XML request into a Java object, processes the Java object, and converts the result back into XML 1212. In cleaning, parsing, and normalization 1204, name and address elements are parsed and extraneous words are removed 1214. The address is then verified to ensure that the street and city names are correct, and the ZIP+4 code and the latitude and longitude are assigned 1216. The reference table holds empty cities and empty street names 1218. In candidate retrieval 1206, keys are generated for retrieving candidates from a reference database 1220. Next, the keys are optimized in the search strategy and candidate index for efficient database retrieval. Reference tables are created and maintained for searching the reference database 1224. In measurement, evaluation, and decision 1208, a confidence score is derived that indicates the degree of match between the query and each candidate. Next, an order for presenting the candidates online is established, and in batch processing the best candidate is selected. Other methods of performing matching that occur to those of ordinary skill in the art can also be used to practice the present invention.
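The three stages described above (cleaning and parsing, candidate retrieval by key, and a scored decision) can be illustrated with a small Python sketch. Everything specific here is an assumption for illustration: the noise-word list, the retrieval key (first letter of the cleaned name plus the city), the 60/40 weighting of name and address similarity, and the 0.8 acceptance threshold. The similarity measure is Python's standard `difflib.SequenceMatcher`, used as a stand-in for whatever scoring the actual system applies.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of extraneous words removed during cleaning.
NOISE_WORDS = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd", "the"}


def clean(text):
    """Lower-case, strip punctuation, and drop extraneous words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in NOISE_WORDS)


def retrieval_key(name, city):
    """A coarse blocking key so that only plausible candidates are compared."""
    return (clean(name)[:1], clean(city))


def build_index(reference):
    """Group reference records by retrieval key."""
    index = {}
    for rec in reference:
        index.setdefault(retrieval_key(rec["name"], rec["city"]), []).append(rec)
    return index


def best_match(query, index, threshold=0.8):
    """Return (record, score) for the best candidate, or (None, score) if below threshold."""
    candidates = index.get(retrieval_key(query["name"], query["city"]), [])
    scored = []
    for cand in candidates:
        name_score = SequenceMatcher(None, clean(query["name"]), clean(cand["name"])).ratio()
        addr_score = SequenceMatcher(None, clean(query["address"]), clean(cand["address"])).ratio()
        scored.append((0.6 * name_score + 0.4 * addr_score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if scored and scored[0][0] >= threshold:
        return scored[0][1], scored[0][0]
    return None, (scored[0][0] if scored else 0.0)
```

A query that returns `None` corresponds to the unmatched case, where the data is either sent for a new identification number or held for further confirmation.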

An identification (ID) number driver 112 adds a unique identification number to each business so that the business can be easily and accurately identified. An example of a unique identification number is the D-U-N-S® Number, available from Dun & Bradstreet, Inc. (D&B), headquartered in Short Hills, NJ. The number, which is a nine-digit number, allows a business to be easily tracked through changes and updates. The identification number stays with a business for as long as the business exists. No two businesses will receive the same identification number, and an identification number will never be reused. The identification number is not assigned until multiple data sources confirm that the business exists. The identification number functions as an industry standard for business identification. It is recognized by the United Nations, the International Organization for Standardization (ISO), the European Commission, and over fifty industry organizations.

According to the invention, the identification number is a central concept in the data processing method. For quality assurance, the identification number allows information to be verified at each step of the process. For the data collection driver 108, if the data is not associated with an existing identification number, it indicates a potential new business. For entity matching driver 110, the identification number allows new data to be accurately matched to existing businesses. For corporate linkage driver 114, corporate families are combined based on the identification number of each business. For the predictive indicator driver 116, the identification number is used to construct a predictive tool.

In addition, the identification number opens new areas of opportunity for the user's business by helping to verify the existence of the business. The user is provided with a comprehensive understanding of potential customers, customers and suppliers. Existing data is clarified, duplication is eliminated, and related businesses are shown as being interrelated. When the identification number is appended to the user information, the user can more easily manage a large group of customers or suppliers. The identification number enables quick and easy data update when appended to user information.

FIG. 13 illustrates an exemplary method of the identification number driver 112. The process begins with an identification number request 1302 that includes input such as name, address, city, state, and so on. For example, an identification number is requested when a record is created for a new business that does not yet exist in database 118. At lookup operation 1304, the database 118 is searched for the identification number in the request. If the identification number is found 1306, it is made available to the customer 1308. Otherwise, input from the request is captured 1310 and an identification number is assigned, including Mod 10 validation 1312. Mod 10 validation appends a check digit to the end of the number to keep the number clean. At the associate to other identification numbers step 1314, any association is verified 1316 prior to front end verification. Then, duplicate authentication 1320 and host authentication 1322 are performed and the identification number is made available to the customer 1308. The association verification prevents errors such as a division being linked to another division.
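The text says only that "Mod 10 validation" adds a check digit at the end of the number. One widely used Mod 10 scheme is the Luhn algorithm, and the sketch below assumes that scheme purely for illustration; the patent does not say which Mod 10 variant is used. The function names and the eight-digit-payload-plus-check-digit layout are likewise assumptions.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn (Mod 10) check digit for a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the
    # rightmost payload digit, and summing the digits of each result.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def assign_id(payload8: str) -> str:
    """Append a check digit to an 8-digit payload to form a 9-digit number."""
    if len(payload8) != 8 or not payload8.isdigit():
        raise ValueError("payload must be exactly 8 digits")
    return payload8 + str(luhn_check_digit(payload8))


def is_valid(number: str) -> bool:
    """Check that a 9-digit number carries the expected check digit."""
    return (len(number) == 9 and number.isdigit()
            and luhn_check_digit(number[:-1]) == int(number[-1]))
```

A check digit of this kind catches any single mistyped digit, which is one way a number can be kept "clean" when it is keyed in or transmitted.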

FIGS. 14-16 illustrate how corporate linkage driver 114 constructs corporate linkages to reveal how companies are related. Without corporate linkage, the companies L Refinery Div. 1402, C Store Inc. 1404, and G Storage Div. 1406 in FIG. 14 appear to be unrelated.

However, as shown in FIG. 15, corporate linkage makes it possible to view an entire corporate family without limitation of depth or breadth. The parent, U Products Group Corp. 1502, has three subsidiaries: L Inc. 1504, C Inc. 1506, and G Inc. 1508. L Inc. 1504 has two divisions, L Storage Div. 1510 and L Refinery Div. 1402 (shown in FIG. 14). C Inc. 1506 has two divisions, Industrial Co. 1512 and Building Co. 1514, and a subsidiary, C Store Inc. 1404 (shown in FIG. 14). G Inc. 1508 has two divisions, G Storage Div. 1406 (shown in FIG. 14) and G Refinery Div. 1516. C Store Inc. 1404 has four units: North Store Inc. 1518, South Store Inc. 1520, West Store Inc. 1522, and East Store Inc. 1524. Establishing such broad corporate linkage, and providing this full detail, allows a business information provider to become an industry leader.
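The family tree of FIG. 15 can be represented as simple child-to-parent links, from which the ultimate parent and the whole corporate family of any member can be derived. The sketch below is illustrative only: the entity names follow the FIG. 15 description, while the data structure and function names are assumptions.

```python
from collections import defaultdict

# (child, parent) links transcribed from the FIG. 15 description.
LINKS = [
    ("L Inc.", "U Products Group Corp."),
    ("C Inc.", "U Products Group Corp."),
    ("G Inc.", "U Products Group Corp."),
    ("L Storage Div.", "L Inc."),
    ("L Refinery Div.", "L Inc."),
    ("Industrial Co.", "C Inc."),
    ("Building Co.", "C Inc."),
    ("C Store Inc.", "C Inc."),
    ("G Storage Div.", "G Inc."),
    ("G Refinery Div.", "G Inc."),
    ("North Store Inc.", "C Store Inc."),
    ("South Store Inc.", "C Store Inc."),
    ("West Store Inc.", "C Store Inc."),
    ("East Store Inc.", "C Store Inc."),
]

parent_of = dict(LINKS)
children_of = defaultdict(list)
for child, parent in LINKS:
    children_of[parent].append(child)


def ultimate_parent(name):
    """Follow parent links upward until the top of the family is reached."""
    while name in parent_of:
        name = parent_of[name]
    return name


def family(name):
    """Return every member of the corporate family that `name` belongs to."""
    members, stack = [], [ultimate_parent(name)]
    while stack:
        current = stack.pop()
        members.append(current)
        stack.extend(children_of[current])
    return members
```

With these links, `ultimate_parent("L Refinery Div.")`, `ultimate_parent("C Store Inc.")`, and `ultimate_parent("G Storage Div.")` all resolve to the same parent, which is exactly the relationship that is invisible without linkage.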

FIG. 16 shows how corporate linkage driver 114 updates family trees after mergers and acquisitions. In this embodiment, there are two separate businesses, ABC 1602 and XYZ 1604, before merging, and each has its own subsidiaries and branches. After merging, ABC XYZ 1606 has two subsidiaries, ABC subsidiaries 1608 and XYZ subsidiaries 1610, each of which has its own division and/or subsidiaries.

Corporate linkage driver 114 exposes profitable opportunities for users in risk management, sales and marketing, and supply management. It enables the user to understand the overall risk exposure of a corporate family. The user can see how a filing or financial stress at one company relates to other parts of its corporate family. Users are able to discover growth opportunities with new and existing customers within a corporate family and to know who the best customers and potential customers are. The user can determine the total amount spent with a corporate family in order to negotiate better.

FIG. 17 illustrates an exemplary method of executing corporate linkage driver 114. In general, it shows a method of updating family tree associations 1700, where the goal is to correctly associate all subsidiaries and branches of each entity with an identification number, with consistent names, trade styles, and correct employee numbers, while resolving all look-alikes (LALs).

For example, file builds and other activities may create initially unlinked records, such as duplicate records or look-alike records, that need to be resolved. For instance, if a record is created for LensCrafters, but the business is LensCrafters USA and is also known as LensCrafters Eyeglasses, a look-alike or duplicate record may result. To prevent this, method 1700 resolves look-alike records. There are three general rules for resolving look-alike records. First, if a look-alike record is on a directory, or can be verbally confirmed at headquarters, it is linked accordingly. Second, unconfirmed look-alike records require a telephone investigation. Third, regardless of the level of corroboration, all look-alike records must be resolved before the tree can be signed off.

At the start of method 1700, the company is contacted 1702 for a directory, preferably in electronic form. Possible contacts include former contacts, human resources, the legal department, administrators, investor relations, and so forth. If a directory is available, the potential for bulk processing of the tree and directory is evaluated, including external keying 1704. The tree is then updated accordingly. On the other hand, if a directory is not available, the Internet is searched for the company's Web site 1706. If the site is available, the potential for bulk processing of the Web site information is evaluated, including external keying 1708, and the tree is updated accordingly. If the site is not available, a determination is made as to whether the company is publicly traded 1710. If so, the most recent 10-K is checked. Otherwise, the subsidiaries are called to verbally verify the tree structure. Look-alike records are resolved and the tree is signed off.
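The source-selection order of method 1700 and the three look-alike rules can be written down as small decision functions. This is a sketch of the logic as described, with hypothetical function names, argument names, and return labels.

```python
def choose_tree_source(has_directory: bool, has_website: bool, publicly_traded: bool) -> str:
    """Pick the source used to update a family tree, in the order described for method 1700."""
    if has_directory:
        return "directory"
    if has_website:
        return "website"
    if publicly_traded:
        return "10-K"
    return "phone call to subsidiaries"


def resolve_look_alike(in_directory: bool, confirmed_by_headquarters: bool) -> str:
    """Rules one and two: link a confirmed look-alike, otherwise investigate by telephone."""
    if in_directory or confirmed_by_headquarters:
        return "link"
    return "phone investigation"


def can_sign_off(look_alikes: list) -> bool:
    """Rule three: the tree cannot be signed off while any look-alike is unresolved."""
    return all(item["resolved"] for item in look_alikes)
```

For example, a privately held company with no directory and no Web site falls through to the telephone call, and a tree with even one unresolved look-alike cannot be signed off.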

Predictive indicators driver 116 summarizes the collected information about a business and uses it to predict future performance. There are three types of predictive indicators: descriptive ratings, predictive scores, and demand estimators. A descriptive rating is an overall assessment of a company's past performance. A predictive score is a prediction of how a business is likely to perform in the future. A demand estimator estimates how much of a product a business is likely to purchase in total.

Predictive indicators help users improve various areas of their business. In risk management, descriptive ratings help users grant or approve credit. A rating indicates creditworthiness based on past financial performance, while a score indicates creditworthiness based on past payment history. Predictive scores may be applied across a user's entire portfolio to quickly identify high-risk accounts and begin collection immediately. The commercial credit score predicts the likelihood that a business will pay late (slow payment) during the next twelve months. The financial stress score predicts the likelihood that a business will fail during the next twelve months. In sales and marketing, demand estimators let users know who is likely to buy, so that opportunities among customers or potential customers can be prioritized. Examples of demand estimators include the number of personal computers and local or long-distance telephone expenses. In supply management, predictive scores may be applied across all of a user's suppliers to quickly understand their risk of future failure.

In addition, the predictive score may be customized according to the particular needs and criteria of the user. Such criteria include: (1) what behavior the user wants to predict; (2) what size of business the user wants to assess; and (3) what decision rule translates the risk assessment into a credit decision or risk management action, based on the user's risk tolerance.
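As a non-limiting sketch of criterion (3), a decision rule might map a predicted probability onto a credit decision using thresholds chosen to reflect the user's risk tolerance. The function name, the threshold parameters, and the three decision labels are assumptions made for illustration.

```python
def credit_decision(probability_bad: float,
                    approve_below: float,
                    decline_above: float) -> str:
    """Translate a predicted probability of a bad outcome into a decision.

    approve_below and decline_above are hypothetical thresholds that a user
    would set according to risk tolerance; values in between go to review.
    """
    if not 0.0 <= probability_bad <= 1.0:
        raise ValueError("probability_bad must be between 0 and 1")
    if approve_below > decline_above:
        raise ValueError("approve_below must not exceed decline_above")
    if probability_bad < approve_below:
        return "approve"
    if probability_bad > decline_above:
        return "decline"
    return "review"
```

Under these assumptions, a cautious user might call `credit_decision(p, 0.02, 0.10)` while a more tolerant user might call `credit_decision(p, 0.05, 0.25)`, so the same score can lead to different decisions.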

Predictive indicators are enabled by analytical capability and data capability. For example, a dedicated team of business-to-business (B2B) experts holding doctoral degrees may build the underlying predictive models, with access to industry-specific knowledge, financial and payment information, and a large amount of historical information for analysis.

FIGS. 18A and 18B illustrate an exemplary method of creating a predictive indicator. The method starts with market analysis 1802, followed by business decisions 1804 regarding model development. These decisions concern the type of score to be developed and the final output, such as a bankruptcy risk score, a default risk score, or an industry-specific score. The bankruptcy risk score is the likelihood that a company will cease operations. The default risk score is the likelihood that the company will pay late. The industry-specific score predicts the likelihood of something specialized, such as the use of copiers or truckers, or whether the company is a good credit risk. Input data 1806 is collected from a credit profile database 1808 and a transaction tape database 1810, which provide historical credit data. There are two time periods of interest: the activity period, during which the historical facts are observed, and the result period, which is the subsequent period during which the outcome is observed. For example, given last year's data, how did a company perform over a certain period this year? Next, a "bad definition" (the result to be predicted) is determined with reference to the risk to be evaluated, such as, for a financial stress score, the business failing within the next twelve months.
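The relationship between the two periods and the "bad definition" might be illustrated as follows. This is a simplified sketch: the function names, the `(date, value)` event shape, and the use of a failure date as the bad definition are assumptions, not details taken from the figures.

```python
from datetime import date
from typing import List, Optional, Tuple

Event = Tuple[date, float]  # hypothetical (observation date, value) pair


def activity_observations(events: List[Event],
                          activity_start: date,
                          activity_end: date) -> List[Event]:
    """Keep only the events observed during the activity period."""
    return [e for e in events if activity_start <= e[0] <= activity_end]


def is_bad(failure_date: Optional[date],
           result_start: date,
           result_end: date) -> bool:
    """Apply the bad definition: did the business fail within the result period?"""
    if failure_date is None:
        return False
    return result_start <= failure_date <= result_end
```

Model inputs would then be drawn only from the activity period, while the outcome to be predicted is read only from the result period, so that the model never sees information from the period it is asked to predict.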

A development sample is selected from the business universe 1814, a demographic profile of the business universe is created 1816, and exploratory data analysis is performed 1818 (univariate analysis of all variables). This includes tasks such as determining variable ranges and variable types, deciding whether to include or exclude variables, and other functions related to understanding what is put into the model. Variables may be selected based on the activity period and the result period, and weights may be assigned to indicate accuracy or representativeness. Quality assurance includes periodically checking whether anything in the business universe has affected the initial model, and taking a score and running it against a previous period to check whether it is still indicative or predictive. The sample may be flawed.
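One possible way to check whether a score run against a previous period is still predictive is a pairwise concordance measure, sketched below. The choice of this statistic is an assumption made for illustration; the description does not name a specific test.

```python
from typing import Sequence


def concordance(risk_scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Fraction of (bad, good) pairs in which the bad business had the higher risk score.

    outcomes uses 1 for a bad result and 0 for a good result. Ties count as half.
    A value near 1.0 suggests the score still separates bad from good;
    a value near 0.5 suggests it no longer does.
    """
    bads = [s for s, y in zip(risk_scores, outcomes) if y == 1]
    goods = [s for s, y in zip(risk_scores, outcomes) if y == 0]
    if not bads or not goods:
        raise ValueError("need at least one bad and one good outcome")
    wins = 0.0
    for b in bads:
        for g in goods:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bads) * len(goods))
```

A marked drop in this value between the development sample and a later period would be one signal that the model may need to be revisited.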

Continuing with FIG. 18B, a statistical analysis and model development process 1820 is performed, including logistic regression and other assessment techniques. This step involves the use of appropriate models, formulas, and statistics. The statistical coefficients are then converted into a scorecard 1822. The model is tested and verified 1824, and a technical specification is developed 1826. Finally, the model is implemented 1828 and tested 1830. Data is run through the model to produce a score. Periodically, a check is performed to verify that the score is still valid and to determine whether the scorecard needs to be updated.
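The conversion of logistic regression coefficients into scorecard points might be sketched as follows. The scaling shown (a base score at given base odds, with a fixed number of points to double the odds) is a common credit-scoring convention assumed here for illustration; the description does not specify the scaling, and all parameter values are hypothetical.

```python
import math
from typing import Dict


def probability_bad(intercept: float,
                    coefficients: Dict[str, float],
                    features: Dict[str, float]) -> float:
    """Logistic regression: probability of the bad outcome for one business."""
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def scorecard_points(p_bad: float,
                     base_points: float = 600.0,
                     base_odds: float = 50.0,
                     points_to_double_odds: float = 20.0) -> int:
    """Map a probability onto a points scale where higher points mean lower risk.

    base_points is awarded at good:bad odds of base_odds, and every doubling
    of the odds adds points_to_double_odds. All defaults are illustrative.
    """
    if not 0.0 < p_bad < 1.0:
        raise ValueError("p_bad must be strictly between 0 and 1")
    odds = (1.0 - p_bad) / p_bad
    factor = points_to_double_odds / math.log(2.0)
    offset = base_points - factor * math.log(base_odds)
    return round(offset + factor * math.log(odds))
```

With these illustrative defaults, a business whose good:bad odds are 50 to 1 receives 600 points, and one whose odds are 100 to 1 receives 620 points.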

It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. Various embodiments are described for performing data collection, performing entity matching, applying identification numbers, performing corporate linkage, and providing predictive indicators. The invention is also applicable to applications outside the business information industry. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

1. A method of data integration, comprising:

(a) collecting information including raw data associated with an enterprise collected from at least one data source;

(b) determining whether the raw data matches stored entity data, operating according to the following rules:

(i) performing step (d) on the raw data if the raw data matches the stored entity data;

(ii) performing step (c) if the raw data does not match the stored entity data and if the raw data satisfies a first threshold condition that at least two sources confirm that a business associated with the raw data exists; and

(iii) if the raw data does not match the stored entity data and the first threshold condition is not met, storing the raw data that does not match as first stored secondary data in a repository until new raw data becomes available, wherein the new raw data and the first stored secondary data are processed based on step (b) (ii);

(c) assigning an identification number to the raw data received from step (b) (ii), thereby creating and storing second stored secondary data;

(d) upon determining that the raw data from step (b) (i) meets a second threshold condition, associating corporate linkage data with the raw data and storing the result as third stored secondary data; and upon determining that the raw data from step (b) (i) does not satisfy the second threshold condition, sending the raw data from step (b) (i) to step (e);

(e) when it is determined that the raw data from step (d) meets a third threshold condition, analyzing, processing, and storing the raw data as fourth stored secondary data, thereby generating at least one predictive indicator;

(f) combining the raw data from step (e) and the fourth stored secondary data to produce enhanced information; and

(g) providing the enhanced information to a user.

2. The method of claim 1, further comprising:

periodically sampling any of the enhanced information or the first, second, third or fourth stored secondary data, thereby producing sampled data;

evaluating the sampled data against data obtained from a previous epoch; and

determining whether the sampled data is valid or needs to be updated.

3. The method of claim 1, wherein the assigned identification number is an entity identifier.

4. The method of claim 1, wherein the second threshold condition is that the business has a predetermined sales volume.

5. The method of claim 1, further comprising generating the corporate linkage data by examining a linkage between a corporate entity and the raw data.

6. The method of claim 1, wherein the third threshold condition is that the business has a predetermined level of customer inquiries.

7. The method of claim 1, wherein the predictive indicator is selected from the group consisting of:

(i) a descriptive rating of the business's past performance; (ii) a prediction of how likely the business is to be creditworthy in the future; and (iii) an estimate of how many products the business may purchase.

8. A data integration system, comprising:

a data generator;

an entity matching unit;

an identification number unit;

a corporate linkage unit; and

a predictive indicator unit,

wherein the data generator collects raw data associated with an enterprise from at least one data source,

wherein the entity matching unit determines whether the raw data matches stored entity data, operating based on the following rules:

(i) sending the raw data to the corporate linkage unit if the raw data matches the stored entity data;

(ii) sending the raw data to the identification number unit if the raw data does not match the stored entity data and if the raw data satisfies a first threshold condition that at least two sources confirm that a business associated with the raw data exists; and

(iii) if the raw data does not match the stored entity data and the first threshold condition is not met, storing the raw data that does not match as first stored secondary data in a repository until new raw data becomes available, wherein the new raw data and the first stored secondary data are processed based on step (ii),

wherein the identification number unit assigns an identification number to the raw data received from step (ii), thereby creating and storing second stored secondary data,

wherein the corporate linkage unit, upon determining that the raw data from step (i) meets a second threshold condition, associates corporate linkage data with the raw data and stores the result as third stored secondary data, and, upon determining that the raw data from step (i) does not meet the second threshold condition, sends the raw data from step (i) to the predictive indicator unit;

wherein the predictive indicator unit, upon determining that the raw data from the corporate linkage unit meets a third threshold condition, analyzes, processes, and stores the raw data as fourth stored secondary data, thereby generating at least one predictive indicator to form enhanced information, and wherein the data generator, the entity matching unit, the identification number unit, the corporate linkage unit, and the predictive indicator unit are the same or independent of each other.

9. The system of claim 8, further comprising a quality assurance unit that performs the following operations:

evaluating the sampled data against data obtained from a previous epoch; and

determining whether the sampled data is valid or needs to be updated.

10. The system of claim 8, wherein the assigned identification number is an entity identifier.

11. The system of claim 8, wherein the second threshold condition is that the business has a predetermined sales volume.

12. The system of claim 8, wherein the corporate linkage unit generates the corporate linkage data by examining a linkage between a corporate entity and the raw data.

13. The system of claim 8, wherein the third threshold condition is that the business has a predetermined level of customer inquiries.

14. The system of claim 8, wherein the predictive indicator is selected from the group consisting of: