US20140324908A1 - Method and system for increasing accuracy and completeness of acquired data - Google Patents
Method and system for increasing accuracy and completeness of acquired data
- Publication number
- US20140324908A1 (US 13/872,868)
- Authority
- US
- United States
- Prior art keywords
- data
- semantic
- fields
- instance
- data record
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G06F17/30684—
Definitions
- FIG. 2 is a flowchart depicting control logic for generating a semantic template, in accordance with aspects of the present disclosure
- FIG. 5 depicts a sample of n-tuples derived from an example of a semantic template, in accordance with aspects of the present disclosure
- FIG. 6 depicts a sample set of unstructured data, in accordance with aspects of the present disclosure
- FIG. 7 depicts an initial instance, in accordance with aspects of the present disclosure
- FIG. 9 depicts a subsequent set of n-tuples based on the present instance iteration, in accordance with aspects of the present disclosure
- FIG. 10 depicts an updated instance after a second round of text mining, in accordance with aspects of the present disclosure.
- Inaccurate and incomplete data can result in errors in the conclusions drawn from that data. That is, poor data quality can result in poor decision making, whether in an automated context (where a computer or other machine is provided the data and takes a corresponding action) or a human actor context. Such inaccurate and incomplete data may be present in data collected by “fixed choice” mechanisms (e.g., “check the box”) or “free text” data entry where a user types or writes a free form entry or record.
- “free text” data entry can introduce much more variability and uncertainty than data entry using fixed fields or options. As discussed herein, aspects of the present approach would drive the usability of free text fields to approach that of fields filled in by drop-down or radio box methods, and would facilitate and improve subsequent decision-making, automated or otherwise.
- inaccurate and incomplete data may be present in both historical data contexts, where data may be collected or translated, from paper or other media, as well as in contemporaneous or real-time data collection.
- aspects of the present approach may be used to improve the archived or existing data records as well as to facilitate or improve contemporaneous data collection. While certain examples and discussions within the present disclosure may relate to the processing of free text data fields for the reasons noted above, it should also be appreciated that the present approaches may also be used to improve the quality and accuracy of data acquired in non-free text contexts, such as where the data entry options are limited to specific values or choices.
- the use of semantic templates, as discussed herein may also be used to merge data sources that are not free text to improve the quality of data and data collection in these contexts as well.
- the presently disclosed approaches relate to cleaning and controlling the accuracy of electronic data surrounding transactions (e.g., business transactions) or other similarly structured events.
- the structure of the events is captured as semantic model templates, and instances of these models are created from the data.
- the content of fields with missing entries can be highlighted for acquisition or entry of the missing data.
- the content of fields with questionable or ambiguous data may be flagged for review and/or more suitable contents for the fields can be suggested.
- the electronic data to be processed can have been acquired at different times, i.e., may have varied temporality corresponding to archived events, historical events, finalized and completed current events, or currently occurring events with some components yet to happen in the future.
- the present approach addresses the generally low quality of data entered by humans in free text data input situations, such as may be found in customer service requests, for example.
- One instantiation of this approach would suggest wording improvements and sentence structure changes to people entering data into an otherwise free text field. This results in normalized data structures that flow smoothly into the event model discussed herein.
- such components may include: a display 10 for displaying data or processed data as discussed herein; I/O ports 12 suitable for receiving data for processing or routines for execution and/or for exporting processed data; input structures 14 for receiving data or command inputs or entered data; one or more special- or general-purpose processors 16 for executing routines or control logic as discussed herein and/or for processing data as discussed herein; a memory device 28 for storing data or routines for execution by the processor 16 ; and a non-volatile storage 20 for providing long-term storage of data or routines.
- the present disclosure relates to the generation and use of a set of semantic models (i.e., templates) that describe a typical or generic instance of an event and encompass the various possible data that may be entered with respect to the event.
- an iterative solution may be employed that begins with the creation of a set of semantic models (templates) describing a comprehensive instance of a generalized event and all the available data describing that event.
- the templates are semantic models describing the events (e.g., transactions) to which the data refers.
- a set of representative transactions 52 such as customer inquiries, repair records, and so forth, is provided.
- Analysis of the representative transactions 52 may be used to generate (block 54 ) one or more semantic templates 56 (e.g., semantic models) that describe the event or events represented by the transactions 52 .
- the semantic templates describe not only the fields that may be present to capture the data associated with an event, but also the inter-relationship between these fields and the rules or logic governing the content of the respective fields.
- n-tuples may be composed of pairs or triplets of adjacent words in the semantic network or instances of words related by specific relationships described in the semantic templates. For example, the word “paint” followed or preceded by a color within three words of the word “paint” may constitute a specific relationship that might be captured as an n-tuple.
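- By way of a non-limiting illustration, the proximity relationship in the “paint” example might be sketched as follows. The color vocabulary, the function name, and the tokenization are assumptions made for illustration only; the disclosure does not specify how such relationships are encoded.

```python
import re

# Hypothetical color vocabulary; a real system would presumably derive it
# from the semantic template rather than hard-code it.
COLORS = {"red", "blue", "green", "white", "black", "yellow"}


def paint_color_tuples(text, anchor="paint", window=3):
    """Return (anchor, color) pairs where a color word lies within
    `window` words before or after the anchor word."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = []
    for i, word in enumerate(words):
        if word != anchor:
            continue
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i and words[j] in COLORS:
                pairs.append((anchor, words[j]))
    return pairs


pairs = paint_color_tuples("Customer wants blue exterior paint for the red door")
```

In this sketch both “blue” (two words before “paint”) and “red” (three words after) fall inside the window, so both pairs are captured as candidate n-tuples.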
- Initial instances 74 based on the semantic templates 56 may then be created. For example, fields containing one of a fixed number of entries can be mapped to the instances 74 by copying the contents from the original data (or from another comparable set of data) into the instances 74 .
- text mining techniques may structure the free text field within the transactions 52 for later semantic processing and may use information from the free text field to populate individual fields of the instance.
- text mining techniques may also be guided by semantic templates 56 and may allow some degree of structure to be imposed on or implied from a set of unstructured data (e.g. free text fields, and so forth).
- word, word pairs, and/or n-tuple data structures 72 identified by analysis of a semantic template 56 may be used for text mining of data acquired in an unstructured form, such as to identify likely structured relationships within the data that can be leveraged in subsequent analysis.
- n-tuple data structures for use in text mining may be taken or derived from patterns within the semantic structure.
- These data structures derived from the semantic template 56 (or from instance 74 ) and used for text mining may be simple listings of paired word relationships or may be more complex patterns that may represent semantic structures themselves.
- a distribution of likely entries may be constructed, along with associated probabilities or rankings.
- the most likely entry exceeding a threshold likelihood may be entered into the respective field of the instance 74 .
- a confidence score or other likelihood indicator may be displayed in conjunction with a field entry determined in this manner.
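- By way of a non-limiting illustration, selecting a field entry from a distribution of likely entries might be sketched as follows. The function name, the dictionary representation of the distribution, and the 0.6 threshold are illustrative assumptions and are not taken from the disclosure.

```python
def pick_entry(candidates, threshold=0.6):
    """Pick the most likely candidate for a field.

    `candidates` maps each candidate value to its estimated likelihood.
    Returns (value, confidence); value is None when no candidate exceeds
    the threshold, so the field can be flagged for review instead.
    """
    if not candidates:
        return None, 0.0
    value, score = max(candidates.items(), key=lambda item: item[1])
    return (value, score) if score > threshold else (None, score)


# Hypothetical likelihoods for a "contact name" field.
entry, confidence = pick_entry({"Archer": 0.82, "microwave": 0.10})
```

The returned confidence could be displayed alongside the populated field, while a `None` value leaves the field empty for later acquisition or review.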
- FIGS. 4-12 graphically depict the concepts discussed herein in conjunction with an example.
- FIG. 4 depicts an example of a semantic template 56 comprising a multitude of related or interconnected fields 94 that encompass the various parameters that might be present for a given transaction.
- each field 94 of FIG. 4 is labeled with a type of data to facilitate explanation.
- fields 94 are depicted that relate to a “product” data structure 98 , a “contact” data structure 100 , and a “date” data structure 102 , any one of which may have related data in any given sample transaction. That is, in this example, any given transaction would presumably have at least some data related to a product, a date, and/or a contact.
- the “contact” data structure 100 and “date” data structure 102 comprise fields that define additional data related to the respective structure and the respective relationships between such fields.
- the data structures defined for the “contact” and “date” structures may in turn be referenced as fields in the “product” data structure.
- a payment or order data point in the product data structure 98 may include a date field that may in turn be defined by the date data structure 102 .
- the semantic template 56 defines both the data fields 94 and the semantic or logical relationships between respective data fields 94 that may be present in a representative transaction.
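- By way of a non-limiting illustration, nested data structures of this kind might be sketched with Python dataclasses. The class names and field names below (for example `order_date` and `customer`) are hypothetical; the actual fields of FIG. 4 are not reproduced here.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Optional


@dataclass
class DateStructure:
    month: Optional[str] = None
    day: Optional[int] = None
    year: Optional[int] = None


@dataclass
class ContactStructure:
    name: Optional[str] = None
    phone: Optional[str] = None
    address: Optional[str] = None


@dataclass
class ProductStructure:
    type: Optional[str] = None
    # The product structure references the date and contact structures.
    order_date: DateStructure = field(default_factory=DateStructure)
    customer: ContactStructure = field(default_factory=ContactStructure)


def missing_fields(obj, prefix=""):
    """List dotted names of unpopulated fields, recursing into nested structures."""
    missing = []
    for f in fields(obj):
        value = getattr(obj, f.name)
        path = prefix + f.name
        if is_dataclass(value):
            missing.extend(missing_fields(value, path + "."))
        elif value is None:
            missing.append(path)
    return missing


instance = ProductStructure(type="microwave")
instance.order_date.month = "February"
instance.order_date.day = 14
gaps = missing_fields(instance)
```

Listing the unpopulated fields in this way corresponds to highlighting missing entries for acquisition or entry of the missing data.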
- one or more of the fields 94 may be characterized by the type or structure of data that may be entered, such as text strings, numeric strings, e-mail addresses, number strings formatted as, or having characteristics of, a date or phone number, and so forth. Such constraints may be useful in parsing or generating the n-tuples for text mining and/or for parsing data into the template 56 or assigning probabilities to unstructured data for which a structure is being derived.
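- By way of a non-limiting illustration, format constraints of this kind might be sketched as regular expressions that narrow the fields a text fragment could plausibly occupy. The patterns, field names, and function name are simplified assumptions for illustration; they are not the constraints of any actual template.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical format constraints, one per field type.
FIELD_PATTERNS = {
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date": re.compile(rf"^(?:{MONTHS}) \d{{1,2}}$"),
}


def candidate_fields(fragment):
    """Return the field types whose format constraint the fragment satisfies."""
    return [name for name, pattern in FIELD_PATTERNS.items() if pattern.match(fragment)]


matches = candidate_fields("502-345-7899")
```

A fragment that satisfies no constraint, such as a plain word, yields an empty list and would have to be placed by other means, for example by the probabilities produced in text mining.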
- Such representative data 104 may be useful in generating n-tuples, as discussed herein.
- n-tuples 110 in the form of words, word pairs, word triplets, or other structured arrangements or sequences of data
- the n-tuples 110 used as examples in FIG. 5 may represent data 104 that has been observed in representative transactions as corresponding to one or more identified fields 94 within the semantic template 56 or which, based on structure or context, is believed to correspond to such fields.
- identified n-tuples 110 may be used as seeds for text mining performed on new or historic data, such as to derive a structure for data that is acquired in an unstructured format, such as free-text field data.
- FIG. 6 depicts a sample set of unstructured data.
- FIG. 6 depicts a hypothetical data record 120 for a 2012 customer call log.
- the data record reads: “February 14 Mr. Archer called to ask about ordering a new microwave oven. 502-345-7899.”
- an unstructured data record such as this may undergo text mining, such as using n-tuples 110 as seeds, to attempt to derive a structure for the data.
- an initial instance 130 is generated based on the unstructured data record 120, the semantic template 56 (particularly, the “date” data structure 102 and the “contact” data structure 100), and the results of an initial text mining operation based on the n-tuples 110.
- in the initial instance 130, data from the record 120 is parsed into fields 94 of the template 56 based on the assessed probabilities determined in the text mining operation. That is, data is assigned to the most likely field 94 of the template 56 based on the statistical probabilities generated as part of the text mining operation.
- the “date” data structure 102 and associated fields and the “contact” data structure 100 and associated fields are partially populated based on the call log record data to generate the example instance 130, which includes a date instance 132 and a contact instance 134.
- Data derived from the call record 120 used to populate the instance 130 is shown as entered data 136 .
- an improperly assigned data element 140 is also shown in the instance 130 .
- the data element “microwave” was incorrectly specified as a contact address based on the probabilities generated by the text mining operation.
- semantic rules derived based on the semantic template 56 may, subsequent to the text mining operation, evaluate the initial instance 130 to address such errors and to thereby generate an improved instance 144 .
- the data element “microwave” has been moved from the contact address field of the initial instance 130 to the “type” field of a product instance 142 of the initial instance. That is, in such an example, the probability derived based on the text mining operation may be discarded or ignored as the semantic rules specify that the term “microwave”, when encountered, is a product type.
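- By way of a non-limiting illustration, a semantic rule that overrides a text mining assignment might be sketched as follows. The product vocabulary, the nested-dictionary representation of an instance, and the function name are assumptions made for illustration only.

```python
# Hypothetical vocabulary of terms that the semantic rules treat as product types.
PRODUCT_TYPES = {"microwave", "refrigerator", "dishwasher"}


def apply_product_rule(instance):
    """Return a copy of the instance in which any known product term found in
    a non-product field is moved to the product "type" field."""
    fixed = {section: dict(values) for section, values in instance.items()}
    for section, values in instance.items():
        if section == "product":
            continue
        for name, value in values.items():
            if isinstance(value, str) and value.lower() in PRODUCT_TYPES:
                fixed[section][name] = None  # clear the incongruous entry
                fixed.setdefault("product", {})["type"] = value
    return fixed


# Initial instance with "microwave" wrongly placed in the contact address.
initial = {
    "date": {"month": "February", "day": "14"},
    "contact": {"name": "Archer", "phone": "502-345-7899", "address": "microwave"},
}
improved = apply_product_rule(initial)
```

The rule is applied regardless of the probability that originally placed the term, which mirrors the disregarding of the text mining probability described above.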
- an instance 150 may be generated that links the sub-instances previously generated together.
- the term “ask” in the call record being processed may be sufficient to probabilistically identify the record as relating to a product sale where the product was a microwave, the contact data related to a customer or potential customer, and the date data related to the date a sale inquiry was received.
- FIG. 11 depicts the current instance 150 within the larger context of the semantic template 56 .
- the present approach may be used to process and improve data sets which can be described by a semantic model.
- Typical of this class of data sets are business events (e.g., order to remittance), sales transactions, or inspection records.
- processing of this event data in accordance with the present approaches would also allow the content of free text fields to be normalized within the context of a larger semantic model.
- the present approach may be implemented as an iterative process, where the extent of the semantic instance grows with the information added by each iterative use of statistical text mining, and the power of the text mining extends through the quality of the n-tuple set and the reasoning biases extracted from the semantic instance.
- the present approach allows data improvement using both semantic reasoning as well as statistical modeling.
- the hypotheses generated as a result of such a hybrid technique are better than those derived using either method alone.
- by combining semantic reasoning with statistical modeling, a level of certainty is captured that covers cases in which few examples are available for training the statistical models.
- by combining statistical modeling with semantic reasoning, non-obvious relations can be identified by the statistical models.
- the iterative application of both approaches allows for hypotheses to be created, validated, and retracted.
- the present approach can be fully automated and embedded in other applications.
- the approach is data-agnostic, and through the use of text mining may process both fixed and free-text fields.
- the present approach may be implemented in various manners, such as by embedding in batch programs or GUIs.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure relates to the use of both semantic analysis and statistical text mining to process data records, improving the completeness and accuracy of records so processed. By way of example, a data record may be iteratively processed by text mining using seeds derived from a semantic template and by validating the results based on semantic reasoning based on the semantic template.
Description
- The subject matter disclosed herein relates to the accuracy and completeness of data records drawn from historical and current sources.
- Data may be collected and stored for numerous industrial, commercial, and personal applications. For example, routine transactions may generate various types of data or new data points in an ongoing sequence. Such data may in turn be reviewed, evaluated, and used in various decision making processes, such as maintenance or repair tracking or planning in a building or vehicle context, budgetary planning, financial forecasting, or regulatory compliance and planning.
- Inaccurate and incomplete data, however, may result in errors in these various processes or, more generally, may result in inaccurate conclusions being drawn, improper actions being taken, or proper action not being taken. Such data problems may result from various sources, such as a set of data being incomplete, data points being recorded inaccurately, or data points being improperly characterized or categorized. These types of errors may arise in historical data or data being collected currently or contemporaneously and may arise in both fixed choice and free text data collection methodologies.
- In one embodiment, a computer-implemented method is provided for processing data. The method includes the acts of accessing a data record and performing a text mining operation on the data record using seeds derived from a semantic template encompassing the data record. One or more fields of a data instance are populated using data elements derived from the analysis of the data record by the text mining operation. The data instance is based on the semantic template. The data instance is then updated based on semantic rules defined by the semantic template. The seeds are updated and the steps of: performing the text mining operation, populating one or more fields of the data instance, and updating the data instance based on semantic rules to generate a final data instance are iterated.
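- By way of a non-limiting illustration, the iteration described in this embodiment might be sketched as a loop that repeats until the instance stops changing. Every function below is a toy stand-in: regular expressions take the place of statistical text mining, a single digit-count check takes the place of the semantic rules, and the seed update is hypothetical. None of these details is taken from the disclosure.

```python
import re


def mine(record, seeds):
    """Toy stand-in for text mining: one regular-expression seed per field."""
    found = {}
    for name, pattern in seeds.items():
        match = re.search(pattern, record)
        if match:
            found[name] = match.group(0)
    return found


def apply_rules(instance):
    """Toy semantic rule: a phone number must contain exactly ten digits."""
    checked = dict(instance)
    phone = checked.get("phone")
    if phone is not None and len(re.sub(r"\D", "", phone)) != 10:
        del checked["phone"]
    return checked


def update_seeds(instance, seeds):
    """Toy seed update: once a contact phone is known, also look for a caller name."""
    updated = dict(seeds)
    if "phone" in instance:
        updated.setdefault("name", r"(?:Mr|Ms|Mrs)\. \w+")
    return updated


def process_record(record, seeds, max_iters=10):
    instance = {}
    for _ in range(max_iters):
        mined = mine(record, seeds)                      # text mining with current seeds
        candidate = apply_rules({**instance, **mined})   # populate, then validate
        if candidate == instance:                        # nothing changed: stop iterating
            break
        instance = candidate
        seeds = update_seeds(instance, seeds)            # derive seeds from the instance
    return instance


record = "February 14 Mr. Archer called to ask about ordering a new microwave oven. 502-345-7899."
final = process_record(record, {"phone": r"\d{3}-\d{3}-\d{4}"})
```

In this sketch the first pass finds only the phone number, the updated seeds then allow the second pass to find the caller name, and the third pass changes nothing, which ends the iteration.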
- In a further embodiment, a data processing system is provided. The data processing system comprises a memory storing one or more routines; and a processing component configured to communicate with the memory and to execute the one or more routines stored in the memory. The one or more routines, when executed by the processing component, cause acts to be performed comprising: accessing a data record related to a transaction; accessing a set of seeds derived from a semantic template that describes the transaction; text mining the data record using the set of seeds; populating one or more fields of a semantic instance using data elements identified in the data record by text mining, wherein the data instance is based on the semantic template and wherein the one or more fields are populated based upon probabilities generated by the text mining; and analyzing the data instance based on one or more semantic rules associated with the semantic instance to validate the populated one or more fields of the semantic instance.
- In an additional embodiment, one or more non-transitory computer-readable media are provided encoding one or more processor-executable routines. The one or more routines, when executed by a processor, cause acts to be performed comprising: accessing a data record related to a transaction; accessing a semantic template that is derived from a plurality of representative transactions and that describes the transaction; and generating a data instance corresponding to the data record by iteratively: performing statistical text mining of the data record using seeds derived from the semantic template; and analyzing the data instance using one or more semantic rules derived from the semantic template.
- These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
- FIG. 1 is a block diagram of an electronic device suitable for processing data, in accordance with aspects of the present disclosure;
- FIG. 2 is a flowchart depicting control logic for generating a semantic template, in accordance with aspects of the present disclosure;
- FIG. 3 is a flowchart depicting control logic for generating and processing data instances, in accordance with aspects of the present disclosure;
- FIG. 4 is a sample of a semantic template, in accordance with aspects of the present disclosure;
- FIG. 5 depicts a sample of n-tuples derived from an example of a semantic template, in accordance with aspects of the present disclosure;
- FIG. 6 depicts a sample set of unstructured data, in accordance with aspects of the present disclosure;
- FIG. 7 depicts an initial instance, in accordance with aspects of the present disclosure;
- FIG. 8 depicts a corrected instance after review based on semantic rules, in accordance with aspects of the present disclosure;
- FIG. 9 depicts a subsequent set of n-tuples based on the present instance iteration, in accordance with aspects of the present disclosure;
- FIG. 10 depicts an updated instance after a second round of text mining, in accordance with aspects of the present disclosure; and
- FIG. 11 depicts an updated instance within the context of a semantic template, in accordance with aspects of the present disclosure.
- Inaccurate and incomplete data can result in errors in the conclusions drawn from that data. That is, poor data quality can result in poor decision making, whether in an automated context (where a computer or other machine is provided the data and takes a corresponding action) or a human actor context. Such inaccurate and incomplete data may be present in data collected by “fixed choice” mechanisms (e.g., “check the box”) or “free text” data entry where a user types or writes a free form entry or record. However, as will be appreciated, “free text” data entry can introduce much more variability and uncertainty than data entry using fixed fields or options. As discussed herein, aspects of the present approach would drive the usability of free text fields to approach that of fields filled in by drop-down or radio box methods, and would facilitate and improve subsequent decision-making, automated or otherwise. Likewise, inaccurate and incomplete data may be present in both historical data contexts, where data may be collected or translated from paper or other media, as well as in contemporaneous or real-time data collection. Aspects of the present approach may be used to improve the archived or existing data records as well as to facilitate or improve contemporaneous data collection. While certain examples and discussions within the present disclosure may relate to the processing of free text data fields for the reasons noted above, it should also be appreciated that the present approaches may also be used to improve the quality and accuracy of data acquired in non-free text contexts, such as where the data entry options are limited to specific values or choices. For example, semantic templates, as discussed herein, may also be used to merge data sources that are not free text to improve the quality of data and data collection in these contexts as well.
- The presently disclosed approaches relate to cleaning and controlling the accuracy of electronic data surrounding transactions (e.g., business transactions) or other similarly structured events. In particular, as discussed herein, the structure of the events is captured as semantic model templates, and instances of these models are created from the data. The content of fields with missing entries can be highlighted for acquisition or entry of the missing data. The content of fields with questionable or ambiguous data may be flagged for review and/or more suitable contents for the fields can be suggested. The electronic data to be processed can have been acquired at different times, i.e., may have varied temporality corresponding to archived events, historical events, finalized and completed current events, or currently occurring events with some components yet to happen in the future. For cleaning and accuracy control of historical data, the approaches discussed herein may be exercised periodically as additional new data becomes available, more extensive semantic templates are generated, or improved algorithms become available. For data arriving in real time, such as by conversation or simultaneous data entry, repeated application of the approach creates a converging semantic solution that can be used immediately for reasoning or classification purposes.
- In addition, in one implementation the present approach addresses the generally low quality of data entered by humans in free text data input situations, such as may be found in customer service requests, for example. One instantiation of this approach would suggest wording improvements and sentence structure changes to people entering data into an otherwise free text field. This results in normalized data structures that flow smoothly into the event model discussed herein.
- With the foregoing in mind, a general description is provided below of suitable electronic devices that may be used in the implementation of the present approaches to improve the accuracy or completeness of acquired data. In particular,
FIG. 1 is a block diagram depicting various components that may be present in an electronic device (e.g., a general- or special-purpose computer system) suitable for executing routines for improving data completeness or accuracy as discussed herein. - As will be appreciated, the various functional blocks shown in
FIG. 1 may comprise hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should further be noted thatFIG. 1 is merely one example of a system capable of implementing the present approaches and merely illustrates the types of components that may be present in a suitableelectronic device 8. For example, in the presently illustrated embodiment, such components may include: adisplay 10 for displaying data or processed data as discussed herein; I/O ports 12 suitable for receiving data for processing or routines for execution and/or for exporting processed data;input structures 14 for receiving data or command inputs or entered data; one or more special- or general-purpose processors 16 for executing routines or control logic as discussed herein and/or for processing data as discussed herein; amemory device 28 for storing data or routines for execution by theprocessor 16; and anon-volatile storage 20 for providing long-term storage of data or routines. - With the foregoing discussion of suitable systems in mind, the present disclosure relates to the generation and use of a set of semantic models (i.e., templates) that describe a typical or generic instance of an event and encompass the various possible data that may be entered with respect to the event. For example, in one implementation, to construct a representation of a past, a present, or a future event an iterative solution may be employed that begins with the creation of a set of semantic models (templates) describing a comprehensive instance of an generalized event and all the available data describing that event. As used herein, the templates are semantic models describing the events (e.g., transactions) to which the data refers.
- This process is depicted graphically by the
flowchart 50 ofFIG. 2 . In this example, a set ofrepresentative transactions 52, such as customer inquiries, repair records, and so forth, is provided. Analysis of therepresentative transactions 52 may be used to generate (block 54) one or more semantic templates 56 (e.g., semantic models) that describe the event or events represented by thetransactions 52. In particular, the semantic templates describe not only the fields that may be present to capture the data associated with an event, but also the inter-relationship between these fields and the rules or logic governing the content of the respective fields. The process may be iterated to update or improve the semantic templates(s) 56 until it is determined (block 58) that there are noadditional transactions 52 to be processed or that thesemantic template 56 would be unchanged by processing or reprocessingadditional transactions 52. In addition, further iterations of the process might be triggered as newrepresentative transactions 52 become available over time. As discussed herein, thesemantic templates 56 may be created to accommodate a structured data format (such as might be created through automatic data input or drop down menus) or an unstructured data format expressed in free text fields, such as customer service requests. - Turning to
FIG. 3 , once thesemantic templates 56 have been constructed, a series of algorithms may be employed (block 70) to generate tables 72 of words and word relationships (n-tuples) from thesemantic templates 56 that will be used as seeds for subsequent text-mining operations. As used herein, n-tuples may be composed of pairs or triplets of adjacent words in the semantic network or instances of words related by specific relationships described in the semantic templates. For example, the word “paint” followed or preceded by a color within three words of the word “paint” may constitute a specific relationship that might be captured as an n-tuple. -
Initial instances 74 based on thesemantic templates 56 may then be created. For example, fields containing one of a fixed number of entries can be mapped to theinstances 74 by copying the contents from the original data (or from another comparable set of data) into theinstances 74. In the depicted embodiment, free text fields in the data (e.g., transactions 52) are mapped using statistical text mining techniques that use the n-tuples extracted from the originalsemantic templates 56. In such an implementation, text mining techniques may structure the free text field within thetransactions 52 for later semantic processing and may use information from the free text field to populate individual fields of the instance. - Next semantic reasoning (block 80) is applied to the
instance 74 populated using statistical text mining algorithms. The semantic reasoning uses known rules or logic to evaluate the data fields filled in by text mining to identify incongruous data fields and validate remaining data fields. The semantic reasoning may also suggest contents for other data fields of thesemantic instance 74 as constructed so far. Once semantic reasoning is done, a check may be performed to see if anything has changed (block 82) since the last semantic reasoning on theinstance 74. If so, the process may be iterated by extracting (block 70) an updated set of n-tuples from thesemantic instance 74. These new instances drive the biases that the statistical approach uses to generate the most likely matches, and the biases converge along with theinstance 74 to drive the best solution. In one such implementation, the text mining techniques are driven by n-tuple structures created from the evolvinginstance 74, or in the case of the first iteration, from the originalsemantic template 56. - In certain implementations, text mining techniques may also be guided by
semantic templates 56 and may allow some degree of structure to be imposed on or implied from a set of unstructured data (e.g., free text fields, and so forth). For example, words, word pairs, and/or n-tuple data structures 72 identified by analysis of a semantic template 56 may be used for text mining of data acquired in an unstructured form, such as to identify likely structured relationships within the data that can be leveraged in subsequent analysis. For example, n-tuple data structures for use in text mining may be taken or derived from patterns within the semantic structure. These data structures derived from the semantic template 56 (or from instance 74) and used for text mining may be simple listings of paired word relationships or may be more complex patterns that may represent semantic structures themselves. By way of example, for each field in an instance 74 generated based on a set of unstructured data, such as a free text field, a distribution of likely entries may be constructed, along with associated probabilities or rankings. The most likely entry exceeding a threshold likelihood may be entered into the respective field of the instance 74. In such an example, a confidence score or other likelihood indicator may be displayed in conjunction with a field entry determined in this manner. - With the foregoing in mind,
FIGS. 4-12 graphically depict the concepts discussed herein in conjunction with an example. For example, FIG. 4 depicts an example of a semantic template 56 comprising a multitude of related or interconnected fields 94 that encompass the various parameters that might be present for a given transaction. For convenience, each field 94 of FIG. 4 is labeled with a type of data to facilitate explanation. In the depicted example, fields 94 are depicted that relate to a “product” data structure 98, a “contact” data structure 100, and a “date” data structure 102, any one of which may have related data in any given sample transaction. That is, in this example, any given transaction would presumably have at least some data related to a product, a date, and/or a contact. - As depicted in
FIG. 4, not only are the various fields 94 defined for a representative transaction, but also the relationships (i.e., the logical or semantic relationships) between respective fields 94, denoted by lines 106. For example, with respect to a “product” data structure 98, for any given product, there may be related data regarding a sale, a shipper or shipment, a ship date, a price, a cost, a type or model, a packaging, a material, and/or labor. Likewise, for each of these fields there may be additional data or fields defined in a structural or semantic relationship that provide additional detail. Similarly, in this example, the “contact” data structure 100 and “date” data structure 102 comprise fields that define additional data related to the respective structure and the respective relationships between such fields. As depicted in the present example, the data structures defined for the “contact” and “date” structures may in turn be referenced as fields in the “product” data structure. For example, a payment or order data point in the product data structure 98 may include a date field that may in turn be defined by the date data structure 102. In this manner, the semantic template 56 defines both the data fields 94 and the semantic or logical relationships between respective data fields 94 that may be present in a representative transaction. - In one implementation, one or more of the
fields 94 may be characterized by the type or structure of data that may be entered, such as text strings, numeric strings, e-mail addresses, number strings formatted as or having characteristics of a date or phone number, and so forth. Such constraints may be useful in parsing or generating the n-tuples for text mining and/or for parsing data into the template 56 or assigning probabilities to unstructured data for which a structure is being derived. - In the depicted example,
sample data 104, such as from representative transactions 52, may be associated with particular fields 94. For example, under the “type” field within the product data structure 98, various examples are listed (i.e., microwave, dishwasher, refrigerator, range) which may be derived from representative transactions that have been structured in accordance with the template 56. Similarly, other fields 94 are shown having representative data 104 (e.g., contact title, date day, date year, contact name, and so forth). - Such
representative data 104 may be useful in generating n-tuples, as discussed herein. For example, turning to FIG. 5, an example is provided of n-tuples 110 (in the form of words, word pairs, word triplets, or other structured arrangements or sequences of data) that may be derived based on the defined semantic template 56 of FIG. 4 and from representative data or transactions 52 described by or categorized with respect to the template 56. For example, the n-tuples 110 used as examples in FIG. 5 may represent data 104 that has been observed in representative transactions as corresponding to one or more identified fields 94 within the semantic template 56 or which, based on structure or context, is believed to correspond to such fields. Thus, identified n-tuples 110, or data or data structures structurally similar to the identified n-tuples, may be used as seeds for text mining performed on new or historic data, such as to derive a structure for data that is acquired in an unstructured format, such as free-text field data. - With the foregoing in mind,
FIG. 6 depicts a sample set of unstructured data. In particular, FIG. 6 depicts a hypothetical data record 120 for a 2012 customer call log. In this example, the data record reads: “February 14 Mr. Archer called to ask about ordering a new microwave oven. 502-345-7899.” In one implementation, an unstructured data record such as this may undergo text mining, such as using n-tuples 110 as seeds, to attempt to derive a structure for the data. - Turning to
FIG. 7, an initial instance 130 is generated based on the unstructured data record 120, the semantic template 56 (particularly, the “date” data structure 102 and the “contact” data structure 100), and the results of an initial text mining operation based on the n-tuples 110. In the depicted example, for the initial instance 130, data from the record 120 is parsed into fields 94 of the template 56 based on the assessed probabilities determined in the text mining operation. That is, data is assigned to the most likely field 94 of the template 56 based on the statistical probabilities generated as part of the text mining operation. - In particular, the “date”
data structure 102 and associated fields and the “contact” data structure 100 and associated fields are partially populated based on the call log record data to generate the example instance 130, which includes a date instance 132 and contact instance 134. Data derived from the call record 120 used to populate the instance 130 is shown as entered data 136. In the depicted example, an improperly assigned data element 140 is also shown in the instance 130. In particular, the data element “microwave” was incorrectly specified as a contact address based on the probabilities generated by the text mining operation. - In certain embodiments, however, semantic rules derived based on the
semantic template 56 may, subsequent to the text mining operation, evaluate the initial instance 130 to address such errors and to thereby generate an improved instance 144. In this example, turning to FIG. 8, based on semantic analysis, the data element “microwave” has been moved from the contact address field of the initial instance 130 to the “type” field of a product instance 142 of the initial instance. That is, in such an example, the probability derived based on the text mining operation may be discarded or ignored as the semantic rules specify that the term “microwave”, when encountered, is a product type. - As will be appreciated, however, the relationship between the
respective date instance 132, contact instance 134, and product instance 142 is still undefined. In this example, turning to FIG. 9, additional text mining seeds (e.g., an additional round of n-tuples 110) are generated that address the current degree or level of ambiguity or uncertainty with respect to the instance 144. Based on the new n-tuples, an additional round of text mining may be performed on the record being analyzed and the results of the text mining operation may be used to modify the current instance 144 (FIG. 10). - Based on the results of the additional round of text mining, and turning to
FIG. 10, an instance 150 may be generated that links the sub-instances previously generated together. For example, the term “ask” in the call record being processed may be sufficient to probabilistically identify the record as relating to a product sale where the product was a microwave, the contact data related to a customer or potential customer, and the date data related to the date a sale inquiry was received. FIG. 11 depicts the current instance 150 within the larger context of the semantic template 56. - As will be appreciated from the preceding discussion and example, the present approach may be used to process and improve data sets which can be described by a semantic model. Typical of this class of data sets are business events (e.g., order to remittance), sales transactions, or inspection records. Further, processing of this event data in accordance with the present approaches would also include the content of free text fields to be normalized within the context of a larger semantic model. Additionally, as discussed herein, the present approach may be implemented as an iterative process, where the extent of the semantic instance grows with the information added by each iterative use of statistical text mining, and the power of the text mining extends through the quality of the n-tuple set and the reasoning biases extracted from the semantic instance.
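The iterative process walked through in the call log example (text mining, semantic correction, seed update, repeat until nothing changes) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the disclosed implementation: the field names, the `INITIAL_SEEDS`, `RULES`, and `FOLLOW_UP` tables, and the simple word-lookup used in place of statistical text mining are all hypothetical.

```python
from typing import Dict, Set

Instance = Dict[str, str]    # field name -> value
Seeds = Dict[str, Set[str]]  # field name -> seed words for that field

RECORD = ("February 14 Mr. Archer called to ask about ordering "
          "a new microwave oven. 502-345-7899.")

# Hypothetical starting seeds. "microwave" is deliberately filed under the
# wrong field to imitate the misassignment described for the initial instance.
INITIAL_SEEDS: Seeds = {
    "date.month": {"february"},
    "contact.name": {"archer"},
    "contact.address": {"microwave"},
}

# Hypothetical semantic rules: a value mapped to the field it must occupy.
RULES: Dict[str, str] = {"microwave": "product.type"}

# Hypothetical extra seeds that become relevant once a field is populated.
FOLLOW_UP: Dict[str, Seeds] = {
    "product.type": {"transaction.kind": {"ask", "ordering"}},
}


def text_mine(record: str, seeds: Seeds) -> Instance:
    """Toy stand-in for statistical text mining: look up seed words in the record."""
    words = {word.strip(".,").lower() for word in record.split()}
    instance: Instance = {}
    for field, vocabulary in seeds.items():
        hits = sorted(words & vocabulary)
        if hits:
            instance[field] = hits[0]
    return instance


def apply_semantic_rules(instance: Instance) -> Instance:
    """Move any value into the field that a semantic rule requires for it."""
    return {RULES.get(value, field): value for field, value in instance.items()}


def update_seeds(seeds: Seeds, instance: Instance) -> Seeds:
    """Derive the next round of seeds from the current instance."""
    updated = {field: set(vocabulary) for field, vocabulary in seeds.items()}
    for field, value in instance.items():
        for vocabulary in updated.values():
            vocabulary.discard(value)  # drop the value from any stale field
        updated.setdefault(field, set()).add(value)
        for new_field, vocabulary in FOLLOW_UP.get(field, {}).items():
            updated.setdefault(new_field, set()).update(vocabulary)
    return updated


def build_instance(record: str, seeds: Seeds, max_rounds: int = 10) -> Instance:
    """Alternate text mining and semantic reasoning until the instance stops changing."""
    instance: Instance = {}
    for _ in range(max_rounds):
        candidate = apply_semantic_rules(text_mine(record, seeds))
        if candidate == instance:
            break
        instance = candidate
        seeds = update_seeds(seeds, instance)
    return instance


result = build_instance(RECORD, INITIAL_SEEDS)
```

In this sketch the first round moves “microwave” out of the contact address field, the second round uses the newly unlocked seeds to pick up “ask”, and the third round detects no change and stops. The `max_rounds` cap is a design choice added for safety in case the instance never settles.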
- Of note, the present approach allows data improvement using both semantic reasoning as well as statistical modeling. The hypotheses generated as a result of such a hybrid technique are better than those derived using either method alone. In particular, by combining semantic reasoning with statistical modeling, a level of certainty is captured and covers cases when few examples are available for training the statistical models. Conversely, by combining statistical modeling with semantic reasoning, the non-obvious relations can be identified by statistical models. The iterative application of both approaches allows for hypotheses to be created, validated, and retracted.
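One way to picture hypotheses being created, validated, and retracted is to let a statistical step propose a field value only when its likelihood clears a threshold, and then let a semantic rule confirm or withdraw the proposal. The sketch below assumes a hypothetical `Hypothesis` record, a hypothetical `REQUIRED_FIELD` rule table, and made-up probabilities; none of these names or numbers appear in the disclosure.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Hypothesis:
    field: str
    value: str
    confidence: float
    status: str = "proposed"  # becomes "validated" or "retracted" after review


# Hypothetical semantic rule table: a value mapped to the only field it may occupy.
REQUIRED_FIELD: Dict[str, str] = {"microwave": "product.type"}


def propose(field: str, distribution: Dict[str, float],
            threshold: float = 0.5) -> Optional[Hypothesis]:
    """Create a hypothesis from the most likely candidate if it clears the threshold."""
    if not distribution:
        return None
    value, probability = max(distribution.items(), key=lambda item: item[1])
    if probability < threshold:
        return None
    return Hypothesis(field, value, probability)


def review(hypothesis: Hypothesis) -> Hypothesis:
    """Validate the hypothesis, or retract it when a semantic rule contradicts it."""
    required = REQUIRED_FIELD.get(hypothesis.value)
    if required is not None and required != hypothesis.field:
        return replace(hypothesis, status="retracted")
    return replace(hypothesis, status="validated")


# Illustrative (made-up) probability distributions for three fields.
wrong_field = review(propose("contact.address", {"microwave": 0.62, "oven": 0.21}))
right_field = review(propose("product.type", {"microwave": 0.90}))
too_uncertain = propose("contact.title", {"mr": 0.30, "dr": 0.20})
```

The stored `confidence` could be shown next to the field entry, and a retracted hypothesis could be fed back as a hint for the next round of seeds.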
- In practice, the present approach can be fully automated and embedded in other applications. The approach is data-agnostic, and through the use of text mining may process both fixed and free-text fields. Further, the present approach may be implemented in various manners, such as by embedding in batch programs or GUIs.
- Technical effects of the invention include the use of both semantic and statistical models to improve the completeness and accuracy of data instances.
- This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal languages of the claims.
Claims (20)
1. A computer-implemented method for processing data, comprising:
accessing a data record;
performing a text mining operation on the data record using seeds derived from a semantic template encompassing the data record;
populating one or more fields of a data instance using data elements derived from the analysis of the data record by the text mining operation, wherein the data instance is based on the semantic template;
updating the data instance based on semantic rules defined by the semantic template; and
updating the seeds and iterating the steps of: performing the text mining operation, populating one or more fields of the data instance, and updating the data instance based on semantic rules to generate a final data instance.
2. The computer-implemented method of claim 1, wherein the data record comprises a transaction record for a business or maintenance transaction.
3. The computer-implemented method of claim 1, wherein the seeds comprise n-tuples derived from the semantic template and a set of representative transactions.
4. The computer-implemented method of claim 1, wherein the one or more fields of the data instances are populated based upon probabilities generated by the text mining operation.
5. The computer-implemented method of claim 1, wherein the semantic template comprises:
a plurality of data fields associated with a generic transaction;
connections between the plurality of data fields; and
rules regarding potential content of the respective fields of the plurality of fields.
6. The computer-implemented method of claim 1, wherein the data record comprises an unstructured data record.
7. The computer-implemented method of claim 1, wherein the data record comprises a free text field.
8. A data processing system, comprising:
a memory storing one or more routines; and
a processing component configured to communicate with the controller and to execute the one or more routines stored in the memory, wherein the one or more routines, when executed by the processing component, cause acts to be performed comprising:
accessing a data record related to a transaction;
accessing a set of seeds derived from a semantic template that describes the transaction;
text mining the data record using the set of seeds;
populating one or more fields of a semantic instance using data elements identified in the data record by text mining, wherein the data instance is based on the semantic template and wherein the one or more fields are populated based upon probabilities generated by the text mining; and
analyzing the data instance based on one or more semantic rules associated with the semantic instance to validate the populated one or more fields of the semantic instance.
9. The data processing system of claim 8, wherein the one or more routines, when executed by the processing component, cause further acts to be performed comprising:
determining if additional processing of the data record is needed;
if additional processing is determined to be needed, deriving an additional set of seeds from the semantic template; performing a text mining based on the additional set of seeds, populating one or more additional fields of the semantic instance based on the results of the text mining operation; and reanalyzing the data instance based on the one or more semantic rules; and
if additional processing is determined to not be needed, ending the processing of the data record.
10. The data processing system of claim 8, wherein analyzing the data instance based on the one or more semantic rules comprises identifying data elements populating the wrong fields of the semantic instance.
11. The data processing system of claim 8, wherein analyzing the data instance based on the one or more semantic rules comprises suggesting content for fields of the semantic instance not populated by the text mining.
12. The data processing system of claim 8, wherein the semantic template is derived from a plurality of representative transactions.
13. The data processing system of claim 8, wherein the set of seeds comprise n-tuples derived from the semantic template and a plurality of representative transactions.
14. The data processing system of claim 8, wherein the semantic template comprises:
a plurality of data fields associated with a generic transaction;
inter-relationships between the plurality of data fields; and
rules regarding potential content of the respective fields of the plurality of fields.
15. The data processing system of claim 8, wherein the data record comprises an unstructured data record.
16. One or more non-transitory computer-readable media encoding one or more processor-executable routines, wherein the one or more routines, when executed by a processor, cause acts to be performed comprising:
accessing a data record related to a transaction;
accessing a semantic template derived from a plurality of representative transactions that described the transaction; and
generating a data instance corresponding to the data record by iteratively:
performing statistical text mining of the data record using seeds derived from the semantic template; and
analyzing the data instance using one or more semantic rules derived from the semantic template.
17. The one or more non-transitory computer-readable media of claim 16, wherein the semantic template comprises:
a plurality of data fields associated with a generic transaction;
connections between the plurality of data fields; and
rules regarding potential content of the respective fields of the plurality of fields.
18. The one or more non-transitory computer-readable media of claim 16, wherein the data record comprises an unstructured data record.
19. The one or more non-transitory computer-readable media of claim 16, wherein the data record comprises a free text field.
20. The one or more non-transitory computer-readable media of claim 16, wherein analyzing the data using the one or more semantic rules comprises one or both of:
identifying data elements populating the wrong fields of the semantic instance;
suggesting content for fields of the semantic instance not populated by the text mining.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/872,868 US20140324908A1 (en) | 2013-04-29 | 2013-04-29 | Method and system for increasing accuracy and completeness of acquired data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20140324908A1 (en) | 2014-10-30 |
Family
ID=51790199
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/872,868 Abandoned US20140324908A1 (en) | 2013-04-29 | 2013-04-29 | Method and system for increasing accuracy and completeness of acquired data |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US20140324908A1 (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070013967A1 (en) * | 2005-07-15 | 2007-01-18 | Indxit Systems, Inc. | Systems and methods for data indexing and processing |
| US20080059442A1 (en) * | 2006-08-31 | 2008-03-06 | International Business Machines Corporation | System and method for automatically expanding referenced data |
| US20090006282A1 (en) * | 2007-06-27 | 2009-01-01 | International Business Machines Corporation | Using a data mining algorithm to generate rules used to validate a selected region of a predicted column |
| US20100211609A1 (en) * | 2009-02-16 | 2010-08-19 | Wuzhen Xiong | Method and system to process unstructured data |
Cited By (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10671627B2 (en) * | 2014-05-23 | 2020-06-02 | International Business Machines Corporation | Processing a data set |
| US10210227B2 (en) * | 2014-05-23 | 2019-02-19 | International Business Machines Corporation | Processing a data set |
| US20150339360A1 (en) * | 2014-05-23 | 2015-11-26 | International Business Machines Corporation | Processing a data set |
| US10387667B2 (en) * | 2015-10-02 | 2019-08-20 | Dtex Systems, Inc. | Method and system for anonymizing activity records |
| US20180239918A1 (en) * | 2015-10-02 | 2018-08-23 | Dtex Systems Inc. | Method and System for Anonymizing Activity Records |
| US11650972B1 (en) * | 2015-12-02 | 2023-05-16 | Wells Fargo Bank, N.A. | Semantic compliance validation for blockchain |
| US11343352B1 (en) * | 2017-06-21 | 2022-05-24 | Amazon Technologies, Inc. | Customer-facing service for service coordination |
| US10984195B2 (en) * | 2017-06-23 | 2021-04-20 | General Electric Company | Methods and systems for using implied properties to make a controlled-english modelling language more natural |
| US11100286B2 (en) * | 2017-06-23 | 2021-08-24 | General Electric Company | Methods and systems for implied graph patterns in property chains |
| US20190073356A1 (en) * | 2017-06-23 | 2019-03-07 | General Electric Company | Methods and systems for implied graph patterns in property chains |
| US20180373698A1 (en) * | 2017-06-23 | 2018-12-27 | General Electric Company | Methods and systems for using implied properties to make a controlled-english modelling language more natural |
| CN107315739A (en) * | 2017-07-12 | 2017-11-03 | 安徽博约信息科技股份有限公司 | A kind of semantic analysis |
| CN110110078A (en) * | 2018-01-11 | 2019-08-09 | 北京搜狗科技发展有限公司 | Data processing method and device, the device for data processing |
| US11941413B2 (en) | 2020-06-29 | 2024-03-26 | Amazon Technologies, Inc. | Managed control plane service |
| US11948005B2 (en) | 2020-06-29 | 2024-04-02 | Amazon Technologies, Inc. | Managed integration of constituent services of multi-service applications |
| US12086141B1 (en) | 2021-12-10 | 2024-09-10 | Amazon Technologies, Inc. | Coordination of services using PartiQL queries |
| CN120124605A (en) * | 2025-05-15 | 2025-06-10 | 江西省公路科研设计院有限公司 | A method for automatically generating traffic safety facility statistics table |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: GENERAL ELECTRIC COMPANY, NEW YORK. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GRAHAM, MICHAEL EVANS;CRAPO, ANDREW WALTER;MOITRA, ABHA;AND OTHERS;SIGNING DATES FROM 20130325 TO 20130426;REEL/FRAME:030310/0412 |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |