WO2016060546A1

WO2016060546A1 - A system and method for managing a collection of data

Info

Publication number: WO2016060546A1
Application number: PCT/MY2015/050121
Authority: WO
Inventors: Nor Azlinayati ABDUL MANAF; Nurul Aida OSMAN; Khalil BOUZEKRI; Dickson Lukose
Original assignee: Mimos Bhd
Current assignee: Mimos Bhd
Priority date: 2014-10-14
Filing date: 2015-10-12
Publication date: 2016-04-21
Anticipated expiration: 2017-04-14
Also published as: MY187722A

Abstract

The present invention discloses a system (1) and method (10) for managing a collection of data (2) in table form and for identifying reference tables based on multi-modality approaches. The system (1) and method (10) disclose a module (5) that incorporates a key filter to determine reference tables and eliminate non-reference tables, as well as a ranking to rank reference tables, in order to identify reference tables with minimal user intervention.

Description

A SYSTEM AND METHOD FOR MANAGING A COLLECTION OF DATA

TECHNICAL FIELD OF THE INVENTION

The invention relates to data management and in particular, the invention provides a system and method for managing a collection of data in tables.

BACKGROUND OF THE INVENTION

Managing data in a database is known to be a key factor in business management. Managing a database requires categorizing data and information in various forms, commonly in table form, and then combining the details of the data and information for analytical processing that helps users make decisions. For example, a manufacturer produces a new product after analysing customer information such as census data, customer demographics, needs and lifestyles.

Managing data in a database has therefore been, and continues to be, a strong focus worldwide, especially in the business world. In order to minimize data redundancy and errors, improve data integrity and security, and provide data storage, a relational database is commonly used to support and manage a large volume of data in an organization or a company.

For managing and storing basic and transaction-oriented data and information, a relational database is constructed from multiple tables, each holding values in columns and rows and each associated with a primary key. Furthermore, the relational database has a characteristic which allows users to focus on a logical view of a data environment by listing the details of the data and its relationships. By implementing relational set theory, it defines the relationships between the tables and provides the users with a complete picture of the data and information stored.

One of the challenges faced in data management is that the data integrity and accuracy of a relational database depend on the keys that are shared and linked across the tables within the relational database. A relational database generally comprises numerous tables in order to cater for various processes and activities; hence it generates complex relations between the tables. For instance, a course information table in a relational database may consist of a list of course titles, each with a unique value in the table. The unique value in the table can be categorized as the primary key. The course table is also called a reference table. On the other hand, another table may contain the primary key from the reference table together with other data such as student demographics. Such a table is called a transactional table. The relational database is not able to link the reference table with other transactional tables if the primary key and the reference table are identified incorrectly.
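By way of illustration only, the course example above may be sketched with Python's standard sqlite3 module. The table names, column names and rows below are hypothetical and are not taken from any cited system; the sketch merely shows a reference table whose primary key is reused by a transactional table.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with one reference and one transactional table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the link below
    # Reference table: one unique value per course, used as the primary key.
    conn.execute(
        "CREATE TABLE course (course_id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE)"
    )
    # Transactional table: carries the primary key of the reference table
    # together with other data about a student.
    conn.execute(
        """CREATE TABLE enrolment (
               enrolment_id INTEGER PRIMARY KEY,
               student_name TEXT NOT NULL,
               course_id INTEGER NOT NULL REFERENCES course(course_id))"""
    )
    conn.executemany(
        "INSERT INTO course VALUES (?, ?)",
        [(1, "Databases"), (2, "Ontologies")],
    )
    conn.executemany(
        "INSERT INTO enrolment VALUES (?, ?, ?)",
        [(10, "Student A", 1), (11, "Student B", 1), (12, "Student C", 2)],
    )
    return conn


def titles_by_student(conn):
    """Link the transactional table back to the reference table through the shared key."""
    return conn.execute(
        "SELECT e.student_name, c.title "
        "FROM enrolment e JOIN course c ON c.course_id = e.course_id "
        "ORDER BY e.enrolment_id"
    ).fetchall()
```

If course were not recognised as the reference table, or course_id not recognised as its primary key, the join in titles_by_student could not be formed.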

Further compounding the issue, the performance of a relational database is affected significantly if the number of tables is large, owing to the relationships among the tables in the database. At the same time, the process becomes more complicated and time consuming. Furthermore, frequent changes of data in the tables have an adverse impact on performance. Therefore, table relation identification is crucial for managing a collection of data.

In view of the abovementioned concerns regarding data accuracy and database performance, several methods have been developed to identify reference tables in a database.

US Patent No. 8386529 B2 mentions a method for identifying a reference table automatically by adopting foreign key relationships in a system and then employing a score function on these foreign key relationships. However, US 8386529 B2 does not teach a method of comparing and ranking the table attributes from a database against semantic information, such as semantic structure based values and instance based values, from a knowledge base.

US Patent No. 8250101 B2 discloses techniques for mapping and translating reference data in a database system by using the Ontology-Guided Reference. However, US 8250101 B2 discloses only a method of identifying the reference table by using the size of the tables, and it may not be able to identify reference tables accurately. Furthermore, a reference table must always be identified correctly when the database is performing queries or analysing results in order to prevent the system from being ineffective.

US Patent No. 8583626 B2 teaches a method for identifying reference data tables in an Extract-Transform-Load (ETL) process, in which a reference table is identified with the aid of a score ranking of each reference table in the process. The method mentioned in US 8583626 B2 uses only table attributes from a database to identify a reference table, without analysing other multi-modality approaches such as the semantic structure of the table schema and the content of the database table.

A research paper entitled "An Empirical Study of Instance-based Ontology Matching" addresses the best instance-based ontology mapping technique that meets the Semantic Web standards. However, the research paper does not disclose a technique to identify reference tables accurately and automatically in a database.

In terms of data accuracy and database performance, the existing systems and methods have their limitations. Therefore, it is an aim of the present invention to provide a system and method that manages data based on reference tables and is capable of improving the accuracy of identifying reference tables in a database, more particularly in a relational database.

SUMMARY OF THE PRESENT INVENTION

The present invention aims to provide a system and a method for managing a collection of data that is organized in table form, comprising a knowledge database that applies reasoning capabilities to the collection of data; a list, covering reference tables and non-reference tables, that stores attributes of the collection of data and is used to link the reference table with the knowledge database; and a module that identifies reference tables from the collection of data and configures the collection of data in order to match the data of reference tables with semantic information from the knowledge database.

It is an object of the present invention to provide a system and a method that identifies primary keys and foreign keys in order to eliminate non-reference tables.

It is a further object of the present invention to provide a system and a method that determines a first confidence value for each reference table by using transaction information of the collection of data.

It is another object of the present invention to provide a system and a method that determines the relationship of the knowledge database with the attributes of reference tables and further identifies second and third confidence values for each reference table in the collection of data.

It is another object of the present invention to provide a system and a method to obtain an overall confidence value for each reference table based on the result of each confidence value in order to improve the accuracy of reference table identification in the collection of data.

It is another object of the present invention to provide a system and a method that identifies reference tables by using a ranking that records the relationship of attributes from the collection of data with the semantic information from the knowledge database and ranks the reference tables in order to improve the accuracy of reference table identification in the collection of data.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates the system for managing a collection of data, as described in the present invention.

Figure 2 illustrates the flowchart of the method for managing a collection of data, as described in the present invention.

Figure 3 illustrates the flowchart depicting the steps of identifying reference and non-reference tables with the aid of a key filter, as described in the present invention.

Figure 4 illustrates the flowchart depicting the steps of determining a first confidence value for each reference table by using transaction information of the collection of data, as described in the present invention.

Figure 5 illustrates the flowchart depicting the steps of determining the relationship of the knowledge database with the attributes of reference tables and identifying second and third confidence values for each reference table, as described in the present invention.

Figure 6 illustrates the flowchart depicting the steps of obtaining an overall confidence value for each reference table based on the result of each confidence value, as described in the present invention.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The above mentioned and other features and objects of this invention will become more apparent and better understood by reference to the following detailed description. It should be understood that the detailed description made known below is not intended to be exhaustive or limit the invention to the precise form disclosed as the invention may assume various alternative forms. On the contrary, the detailed description covers all the relevant modifications and alterations made to the present invention, unless the claims expressly state otherwise.

As management increasingly demands more analysis of the data in the database, the problems described above can be addressed by implementing the present invention.

The present invention relates to a system (1) and method (10) for identifying reference tables for managing a collection of data (2). More particularly, the present invention improves the accuracy of reference table identification in the collection of data (2). The system (1), as illustrated in Figure 1, comprises at least a collection of data (2) that is organized in tables, fields and forms as well as records, at least a knowledge database (3) that applies reasoning capabilities to the collection of data (2), at least a confidence value list (4) that stores attributes of the collection of data (2) as reference and at least a module (5) that identifies tables having the collection of data (2) and configures the data in order to match the data with semantic information from the knowledge database (3), wherein the system (1) utilizes a key filter that is incorporated with the module (5) to eliminate non-reference tables, and a ranking that records the relationship of attributes from the collection of data (2) with the semantic information from the knowledge database (3) and ranks reference tables from the collection of data (2), in order to improve the accuracy of reference table identification in the collection of data (2).

The collection of data (2) of the present invention is organized in table form and further comprises a series of logically related two-dimensional data. Moreover, the module (5) further identifies the reference table by ranking the tables in the collection of data (2) with a confidence value in order to eliminate non-reference tables. In addition, the ranking is incorporated with the module (5), which further rates the effectiveness of relationships with the knowledge database (3) by assigning a confidence value for each relationship. Each relationship is identified by comparing the information of the collection of data (2) with information of the knowledge database (3). This involves comparison by multi-modality approaches such as semantic structure based and semantic instance based comparison.

Additionally, the present invention further discloses the use of a knowledge database (3) which consists of a centralized repository of data and information submitted by the user. The knowledge database (3) is utilised to optimise data storage, data analysis and data reuse as well as transactional properties on data requirements. The knowledge database (3) comprises pre-determined logics, concepts, rules and requirements of the system to deduce new facts or inconsistencies.

Referring to Figure 2, the figure illustrates a method (10) for managing a collection of data (2) according to the present invention. The method (10) comprises a first step of identifying reference tables and non-reference tables with the aid of a key filter (11). Subsequently, a first confidence value for the reference table is identified based on the transaction information and updates of the reference table from the collection of data (2) (12). Next, the relationship of the knowledge database (3) with the attributes of the reference table is identified prior to the identification of second and third confidence values for the reference table (13). Lastly, an overall confidence value for the reference table is obtained (14). The overall confidence value of the reference table is obtained by comparing the results of each confidence value identified previously.

Figure 3 is a flowchart depicting further steps of identifying reference tables and non-reference tables with the aid of a key filter (11). It begins with the steps of identifying a name (111) and primary keys (112) for each table. Then, the primary keys are recorded and managed in a list that is named a primary key list (113). Next, one or more foreign keys are identified (114) if exactly one primary key is found in the table. Otherwise, if no primary key or more than one primary key is identified, the table is identified as a non-reference table. The foreign keys are recorded and managed in a list that is named a foreign key list (115). Subsequently, the table is identified as a reference table if no foreign key is identified in the table (116). The attributes of the table, such as table names, primary keys and foreign keys, are stored in the two lists mentioned previously for identifying a reference table. These two lists are used as a key filter list to identify whether the table is a reference table. Non-reference tables are eliminated from the collection of data (2) after a reference table is identified (117).
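By way of non-limiting illustration, the key filter steps (111) to (117) may be sketched in Python as follows. The TableSchema class, the key_filter function and the representation of the two lists as dictionaries are assumptions introduced for this sketch only; the description does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class TableSchema:
    """Hypothetical description of one table: its name and declared keys."""
    name: str                                          # step (111)
    primary_keys: list = field(default_factory=list)   # step (112)
    foreign_keys: list = field(default_factory=list)   # step (114)


def key_filter(tables):
    """Return (reference table names, primary key list, foreign key list)."""
    primary_key_list = {}   # step (113): table name -> primary keys
    foreign_key_list = {}   # step (115): table name -> foreign keys
    reference_tables = []
    for table in tables:
        primary_key_list[table.name] = list(table.primary_keys)
        # No primary key, or more than one: treated as a non-reference table.
        if len(table.primary_keys) != 1:
            continue
        foreign_key_list[table.name] = list(table.foreign_keys)
        # Step (116): exactly one primary key and no foreign key.
        if not table.foreign_keys:
            reference_tables.append(table.name)
    # Step (117): tables not returned here are eliminated as non-reference tables.
    return reference_tables, primary_key_list, foreign_key_list
```

Under these assumptions, a course table with a single primary key and no foreign key is retained, whereas an enrolment table that carries a foreign key to course is eliminated.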

Figure 4 is a flowchart depicting further steps of determining a first confidence value for the reference table by using transaction information of the collection of data (2) (12). The first step is to identify activities of the reference table (121), followed by identifying the frequency of each update for the reference table (122). Once the frequency of updates of the table has been identified, the total updates for the reference table in a period of time are identified (123). Based on the total updates of the reference table, the percentage of the updates of the reference table is calculated (124). Next, a first confidence value is determined based on the updates of the reference table (125). Lastly, the first confidence value of the reference table is recorded and managed in a list that is known as a confidence value list (126).
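A minimal sketch of steps (121) to (126) is given below in Python. The description does not state how the percentage of updates is converted into a confidence value, so two assumptions are made here and should be read as such: the percentage is taken as the table's share of all updates logged in the period, and a smaller share yields a higher confidence (on the premise that reference tables change rarely). The function name and the shape of the update log are likewise hypothetical.

```python
from collections import Counter


def first_confidence_values(update_log, reference_tables):
    """Map each candidate reference table to a first confidence value in [0, 1].

    update_log: one table name per update event observed in the period.
    reference_tables: candidate names that passed the key filter.
    """
    counts = Counter(update_log)      # steps (121)-(122): update activity per table
    total = len(update_log)           # step (123): total updates in the period
    confidence_value_list = {}
    for name in reference_tables:
        # Step (124): share of the period's updates that hit this table.
        share = counts[name] / total if total else 0.0
        # Step (125), ASSUMPTION: fewer updates means higher confidence.
        confidence_value_list[name] = 1.0 - share   # step (126): recorded in the list
    return confidence_value_list
```

For example, with ten logged updates of which one touches course, the sketch assigns course a first confidence value of 0.9, and a candidate table with no logged updates receives 1.0.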

After the first confidence value of the reference table is identified in step (12), the relationship of the knowledge database (3) with the attributes of the reference tables is determined in order to further identify second and third confidence values of the reference table (13), as illustrated in Figure 5. Prior to the determination of the relationship between the attributes of the reference table and the information in the knowledge database (3) (132), the information of the reference table is required to be converted into semantic information for further semantic analysis (131). Then, a relationship between the attributes of the reference table and the information of the knowledge database (3) can be identified effectively. In the step of identifying the relationship of the reference table with the knowledge database (132), as illustrated in Figure 5, the semantic information is utilised for building a relationship among the data. Semantic information further consists of the study of the rational, informational content of the collection of data (2). It focuses on the relation between concepts, words and values and what they stand for, including denotation. It further illustrates the degree of the relation between concepts and reference as well as representation. Hereafter, the semantic information is used for building a relationship among the collection of data (2) and linking the data of the reference tables with information from the knowledge database (3) in order to provide a visible and accurate result of data management.

Thereafter, a second confidence value of the reference table is determined (133) by comparing its table structure with the information of the knowledge database (3). Subsequently, a third confidence value of the reference table is determined (134) by comparing its table attributes with the semantic instances of the knowledge database (3). In one embodiment, the first, second and third confidence values are each recorded and stored in the confidence value list (4) (135). Each of the newly determined first, second and third confidence values is compared with the existing confidence values stored in the confidence value list (4), so that an updated list of confidence values is created (136); otherwise, the second and third confidence values determined in steps (133) and (134) are treated as the highest values in the confidence value list (4).
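Steps (133) to (136) may be sketched as follows. The description names an ontology mapping technique without specifying it, so this sketch substitutes a plain Jaccard set overlap as a stand-in; that choice, the dictionary layout of the knowledge database (hypothetical "properties" and "instances" entries per concept) and all function names are assumptions made for illustration only.

```python
def normalise(items):
    """Lower-case and trim values so that trivially different spellings compare equal."""
    return {str(item).strip().lower() for item in items}


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def structure_confidence(column_names, concepts):
    """Step (133), ASSUMPTION: best overlap between column names and concept properties."""
    cols = normalise(column_names)
    return max(
        (jaccard(cols, normalise(c["properties"])) for c in concepts.values()),
        default=0.0,
    )


def instance_confidence(column_values, concepts):
    """Step (134), ASSUMPTION: best overlap between table values and known instances."""
    vals = normalise(column_values)
    return max(
        (jaccard(vals, normalise(c["instances"])) for c in concepts.values()),
        default=0.0,
    )


def update_confidence_list(confidence_list, table, kind, value):
    """Steps (135)-(136): store the value, keeping the highest one seen so far."""
    key = (table, kind)
    confidence_list[key] = max(value, confidence_list.get(key, 0.0))
    return confidence_list[key]
```

When no value is yet stored for a table, the newly determined value is simply taken as the highest, which is one reading of the final clause of the paragraph above.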

The last step of the method (10) is identifying a reference table by determining an overall confidence value for the reference table (14), as shown in Figure 6. The overall confidence value is determined by obtaining the first, second and third confidence values that have been determined previously. The first, second and third confidence values are obtainable from the confidence value list (4) respectively (141, 142, 143). Next, the confidence value list (4), which comprises the first, second and third confidence values, is used to rank the reference table in the collection of data (2) (145). The highest value of each confidence value is maintained and updated in the confidence value list (4). Lastly, the reference table is determined based on the highest value of the overall confidence value.

The method (10) discloses an advantage of using the confidence value list (4) that is incorporated with the module (5) in order to identify reference tables based on multi-modality approaches such as semantic structure based and semantic instance based approaches. With the aid of semantic information, the method (10) further illustrates other advantages of determining the relationship of the reference table with the knowledge database (3) and generating the second and third confidence values, which further identify the reference table regardless of the size or complexity of the collection of data (2). Moreover, an ontology mapping technique is implemented to compare the semantic information of the knowledge database (3) with attributes of the tables in the collection of data (2). The system (1) and method (10) presented herein are able to improve the accuracy of identification of reference tables in a collection of data (2) by comparing multi-modality approaches such as the data log, the semantic structure of the table and the content of the table. Last but not least, the tables are managed and ranked with minimal user intervention in order to provide a visible and accurate result in managing the collection of data (2).
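The overall confidence step (14) of Figure 6 may be sketched as below. The description does not state how the three confidence values are combined into the overall confidence value, so an unweighted arithmetic mean is assumed here purely for illustration; a weighted combination would fit the description equally well. The function names and the tuple layout of the confidence value list are hypothetical.

```python
def overall_confidence(first, second, third):
    """ASSUMPTION: overall value is the unweighted mean of the three values."""
    return (first + second + third) / 3.0


def rank_reference_tables(confidence_value_list):
    """Rank candidate tables by overall confidence, highest first.

    confidence_value_list: table name -> (first, second, third) confidence values,
    as obtained from the confidence value list in steps (141)-(143).
    Returns a list of (table name, overall confidence) pairs.
    """
    scored = [
        (name, overall_confidence(*values))
        for name, values in confidence_value_list.items()
    ]
    # Sort by descending score; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

The first entry of the returned ranking corresponds to the table with the highest overall confidence value, which the method treats as the reference table.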

The invention described herein is susceptible to variations, modifications and/or additions other than those specifically described and it is to be understood that the invention includes all such variations, modifications and/or additions which fall within the scope of the following claims.

Claims

1. A system (1) for managing a collection of data comprising:

at least a collection of data (2) that is organized in table forms;

at least a knowledge database (3) that applies reasoning capabilities to the collection of data (2);

at least a confidence value list (4) that stores attributes of the collection of data (2);

at least a module (5) that identifies reference table having the collection of data (2) and configures the collection of data (2) in order to match the relationship of the reference table with the knowledge database (3);

wherein the system (1) utilizes a key filter that is incorporated with the module (5) to eliminate non-reference tables, and a ranking that records the relationship of attributes from the collection of data (2) with the semantic information from the knowledge database (3) and to rank reference tables from the collection of data (2), in order to improve the accuracy of reference table identification in the collection of data (2).

2. A system (1) according to claim 1, wherein the collection of data (2) that is organized in table forms comprises a series of logically related two dimensional data.

3. A system (1) according to claim 1, wherein the module (5) identifies the reference table by ranking the tables in a collection of data (2) with a confidence value in order to eliminate non-reference tables.

4. A system (1) according to claim 1, wherein the ranking incorporated with the module (5) further rates the effectiveness of relationships with the knowledge database (3) by assigning a confidence value for each relationship.

5. A method (10) for managing a collection of data (2) comprising the steps of:

identifying reference tables and non-reference tables with the aid of a key filter (11);

determining first confidence value for the reference table by using transaction information of the collection of data (2) (12);

determining the relationship of a knowledge database (3) with the attributes of reference table and further identifying second and third confidence values for the reference table (13); and

obtaining an overall confidence value for the reference table based on the result of each confidence value identified (14).

6. A method (10) according to claim 5, wherein the step of identifying reference and non-reference tables with the aid of a key filter (11) comprises the steps of:

identifying a name for each table (111);

identifying primary keys for each table (112);

recording and managing all the primary keys in a list that is named as primary key list (113);

identifying foreign keys for each table if only one primary key is identified in the table (114);

recording and managing all the foreign keys in a list that is named as foreign key list (115);

identifying the table as reference table if no foreign key is identified in the table (116); and

eliminating non-reference tables in the collection of data (2) (117).

7. A method (10) according to claim 5, wherein the step of determining first confidence values for the reference table by using transaction information of the collection of data (2) (12) comprises the steps of:

identifying activity of the reference table (121);

identifying frequency of the updates of the reference table (122);

identifying total updates of a period of time (123);

identifying percentage of the updates of the reference table (124);

determining a first confidence value of the reference table based on the updates of the reference table (125); and

recording and managing the first confidence value of the reference table in a confidence value list (126).

8. A method (10) according to claim 5, wherein the step of determining the relationship of the knowledge database (3) with the attributes of reference tables and further identifying second and third confidence values for the reference table (13) comprises the steps of:

transforming the information of reference tables into semantic information for further semantic analysis (131);

identifying the relationship between the attributes of reference table with information of the knowledge database (3) (132);

identifying a second confidence value by comparing the structure of the collection of data (2) with information of the knowledge database (3) (133);

identifying a third confidence value by comparing the attributes of reference table with semantic instance of the knowledge database (3) (134);

storing the second and third confidence value in the confidence value list (135); and

updating the confidence value list (4) with the highest value of each confidence value (136).

9. A method (10) according to claim 5, wherein the step of obtaining an overall confidence value for the reference table based on the result of each confidence value (14) comprises the steps of:

obtaining the first confidence value identified from the confidence value list (141);

obtaining the second confidence value identified from the confidence value list (142);

obtaining the third confidence value identified from the confidence value list (143);

determining the overall confidence value based on the value of each confidence value (144); and

ranking and determining the reference table based on the highest value of the overall confidence value (145).