
HK1174121B - Interpreting data sets using repetition of records, keys and/or data field values - Google Patents


Info

Publication number
HK1174121B
HK1174121B HK13101133.1A
Authority
HK
Hong Kong
Prior art keywords
data
record
file
records
key
Prior art date
Application number
HK13101133.1A
Other languages
Chinese (zh)
Other versions
HK1174121A1 (en)
Inventor
B. BARABAS Albert
D.A. VAN GULIK Mark
Raymond Terry
Original Assignee
Miosoft Corporation
Priority date
Filing date
Publication date
Priority claimed from US12/542,969 external-priority patent/US9092411B2/en
Application filed by Miosoft Corporation filed Critical Miosoft Corporation
Publication of HK1174121A1 publication Critical patent/HK1174121A1/en
Publication of HK1174121B publication Critical patent/HK1174121B/en


Description

Interpreting data sets using repetition of records, keys, and/or data field values
Technical Field
The description relates to data collections.
Background
Consider, for example, a table of a typical relational database, which represents a data set of records. Each record has a data value in each field that has been defined for the table. Each field may hold at most one value for the attribute that the field represents. The table has a unique key that unambiguously distinguishes each record from the others. The relationships among the tables of a database are normally defined in advance, and all of the data and tables are represented in a common, shared native format. In addition to performing transactions on the database, a user is typically able to view the records of each table, as well as combinations of data drawn from related tables, through an interface provided by the database application.
Sometimes an enterprise's relevant data is not kept in a predefined, well-organized database, but is instead generated as separate files, data sets, or data streams, possibly in different, unrelated formats. While the data of each of these sources may be structured as records, the delineation of the records into fields, for example, may not be defined within the sources. And although related, data in different sources may be inconsistent or duplicative.
United States Patent 7,512,610, issued on March 31, 2009, and owned by the same company as the present application, describes a way to process a source file, stream, or collection of data so that its data can easily be accessed and viewed as records that can be manipulated and analyzed by a user. The entire contents of that patent are incorporated herein by reference.
Disclosure of Invention
In general, in one aspect, there are two or more data sets. Each of the data sets contains data that can be interpreted as records, each record having data values for data fields. Each of the data sets contains at least some data related to data in at least one of the other data sets. The data in different ones of the data sets may be organized or expressed differently. Each of the data sets allows a key to be defined for the records of that data set. The data sets are characterized by repetition of at least one of (a) a record, (b) a portion of a key, or (c) a value instance of a data field. Information about at least one of the repetitions is provided to a user.
Implementations may be characterized by one or more of the following features. At least one of the data sets includes a file having a file format. At least two of the data sets include files having different file formats. Information is received from a user regarding the manner in which the data of at least one of the data sets may be interpreted as records, each record having data values of data fields. Information from which a key of each of the data sets can be determined is received from a user. The keys of one of the data sets have a defined hierarchical relationship with the keys of another of the data sets. The repetition of a record comprises a duplicated record in one of the data sets. The repetition of a portion of a key includes two different key values in one of the data sets that correspond to a single value of a portion of a key of another of the data sets. The repetition of value instances of a data field includes two or more value instances being included in a given field. The user can perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on the record values of at least one of the data sets.
Providing information to the user includes displaying the information. Displaying includes displaying records of the data sets, an identification of the fields of the records, and an indication of repetition in the data sets. Displaying information about the repetitions includes displaying repeated value instances of the data fields. Displaying information about the repetitions includes indicating that duplicate records exist in the data sets. Displaying information about the repetitions includes indicating that there is repetition of a portion of a key.
Providing information to the user includes enabling the user to create an integrated file that includes the data of the data sets and records of information about the repetitions. The integrated file contains records that are constrained by keys. The key of the integrated file includes a hierarchical concatenation of fields of the data sets. Repeated data values are included in given fields of the records of the integrated file. The records of the integrated file are displayed to the user. A view of data in the integrated file is displayed, the data corresponding to data in a data set from which the integrated file was created. The user is also enabled to perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on the record values of at least one of the data sets. The user is enabled to perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on the record values of the integrated file, the marking, unmarking, filtering, unfiltering, and frequency analysis being automatically applied to other views of the data.
In general, in one aspect, a data set is received that contains data that can be interpreted as records, each record having data values of data fields. The data set is characterized by some number of repetitions of value instances of at least one of the data fields. Information about at least one repetition is provided to a user. The data set includes a file having a file format. Information is received from a user regarding the manner in which the data of the data set may be interpreted as records, each record having data values of data fields. Information from which the key of the data set can be determined is received from the user. The user can perform marking, unmarking, filtering, unfiltering, and frequency analysis on the record values of the data set. Providing information to the user includes displaying the information. Displaying includes displaying records of the data set, an identification of the fields of the records, and an indication of repetition in the data set. Displaying information about the repetitions includes displaying repeated value instances of the data fields.
In general, in one aspect, a medium carries an integrated file of data records and keys of the records. Each record contains at least one data value of at least one data field. The data records contain information representing the data of at least two data sets. Each of the data sets contains data that can be interpreted as records, each record having data values of data fields. Each of the data sets contains at least some data related to data of at least one of the other data sets. The data in different ones of the data sets may be organized or expressed differently. Each of the data sets allows a key to be defined for the records of that data set. The data sets are characterized by repetition of at least one of (a) a record, (b) a portion of a key, or (c) a value instance of a data field. The integrated file includes information identifying the repetitions.
These and other aspects and features, and combinations thereof, may be expressed as methods, procedures, apparatus, program products, databases, business methods, systems, means for performing functions, and in other ways.
Other advantages and features will be apparent from the following description and from the claims.
Drawings
Fig. 1 is a block diagram.
Fig. 2 to 14 are screen shots.
Detailed Description
As shown in FIG. 1, we describe here a process 12 that can be applied to a data collection, file, or data stream 10 (a source) of various kinds, such as (but not limited to) flat file, IMS, MQ, ODBC, and XML. The data sources may (a) contain related data, (b) have different organizational schemes and formats, and (c) include duplicated data. The processing described here enables a user 14 to access, display, analyze, and manipulate displays of the data through a user interface 16. In some implementations, at least some of the processing is done based on information about the data sources provided by the user. In some cases, the process provides access to, display of, analysis of, and manipulation of data in the records of the data sources themselves. In some implementations, an integrated file 18 is created from the source files and provides additional access, display, analysis, and manipulation capabilities to the user. These features can be provided without requiring programming or adaptation by the user.
Embodiments of at least some of the features described here, and other embodiments, are found in a commercial product known as Business Data Tool(tm), available from MioSoft Corporation of Madison, WI. The product and its manual and description are incorporated herein by reference.
As an example of data sources to be processed, consider three separate but related data sets (in this case contained in three data files), at least portions of which are shown in figs. 2, 3, and 4, respectively. The records of the three source data files contain, respectively, information about the instructor of each session of a particular computer course, the name of each course, and the students who registered for the courses. When a data set is expressed in a predefined file format (such as .txt or .csv), we sometimes refer to the data set as a file. The techniques we describe here are broadly applicable to any type of data collection, file, or data stream in which the data can be structured as records. In some cases, the delimiters of the records are predefined in the data source. In other cases, the records and their delimiters are inferred from the data source, with or without user assistance.
We sometimes use the words document and data set (and other terms) interchangeably in a very broad sense to include any set of data of any type, source, format, length, size, content, or other characteristic. The data set may be an organized data array or a stream of unorganized data (or a combination of both) that can be parsed or analyzed to infer so-called records and delimiters of records. We intend the phrase "record of a data set" to very broadly include any data group of the data set that contains one or more values of an attribute associated with a field of the data set.
In this example, fig. 2 shows data of a plain text file named Sessions. The file can be interpreted as including records 50, each of which includes a value 52 in a field 54 that represents a combination of a course number (e.g., 69.102) and a session letter (such as a or b, which may correspond, for example, to a first session and a second session, or to two different segments, of a given course). The second value 56 of each record, in the second field 58, represents the name of the instructor (e.g., Chris Schulze).
The second file, named Courses.csv and shown in fig. 3, is expressed in comma-separated-values (.csv) format and has records 60, each record 60 comprising a value 61 identifying a course number in one field 62 and a corresponding value 63 representing the course name in a second field 64. For example, course 69.102 has the name Data Migrator.
The third file, named Students_with_addresses.csv, is also expressed in the .csv format. Many of its fields (not all of which are shown in fig. 4) contain values, including, among other things, a value 69 identifying the number 70 of each student (column A), a value for the first name 72 (column B), a value for the last name 74 (column C), a value for one or more addresses 76, 78 separated by tilde characters (column D), a value for an identifier 80 of a course segment (column E), and a value for a course number (column H, not shown).
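The three source files just described can be sketched in miniature. The following is a hypothetical, much-reduced reconstruction (the names and values are invented, not taken from the figures) showing how each file's data can be interpreted as records, including the tilde-separated address field of the third file:

```python
import csv
import io

# Hypothetical miniature versions of the three source data sets.
sessions_txt = "69.102a Chris Schulze\n69.208a Chris Schulze\n69.208b Pat Lee\n"
courses_csv = "69.102,Data Migrator\n69.208,Context Server\n"
students_csv = ("1001,Ann,Smith,12 Oak St.~77 Elm St.,a,69.102\n"
                "1002,Bob,Jones,9 Pine Rd.,b,69.208\n")

# File B: a course-number+session-letter field and an instructor field.
sessions = [line.split(maxsplit=1) for line in sessions_txt.splitlines()]

# File A: course number and course name.
courses = list(csv.reader(io.StringIO(courses_csv)))

# File C: the address field may pack repeated value instances, separated
# by a tilde; splitting on "~" exposes the field-level repetition.
students = [row[:3] + [row[3].split("~")] + row[4:]
            for row in csv.reader(io.StringIO(students_csv))]
```

Note that after parsing, one student record holds two address instances in a single field, which is the kind of within-field repetition the tool surfaces.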
The three source files contain related information. For example, a course number is used in all three files, and the segment identifier (a or b) is used in the Sessions file. However, the relationships among some kinds of information that appear in the three files, in particular the nature and number of repetitions of the information, are not readily understood by viewing the three separate files.
Here we describe, among other things, a tool that enables a user to have the three files analyzed and their records displayed in a manner that lets the user understand the nature and number of the repetitions of information appearing in the three files, without any programming or adaptation and without a detailed analysis of all of the relationships among the information contained in the three files.
Although the three files shown in this example contain only a relatively small number of records, the same approach can be applied to files containing very large numbers of records (millions or even billions) and taken from widely distributed sources, including sources that may not be entirely under the control of the same party. Different files or collections of records may be expressed in disparate file formats, or in some cases without any formal file format.
We first describe a tool that, among other things, analyzes and displays information about repetition in each file and enables a user to quickly view and navigate the properties of the files' records by invoking simple features of the user interface. An example of the user interface 16 is shown in fig. 5.
Fig. 5 shows the state of the software application 12 (fig. 1) and the user interface (here numbered 90) after (a) the three source data files 92, 94, 96 have been imported into the software application by the user, (b) each file has been parsed into fields 98 (with the assistance of the user, for example, in identifying the delimiters between fields), and (c) the key 100 of each file has been identified by the user in the key box 102. The records of file A are shown in fig. 5.
The three imported files are listed in pane 104. A ** note after a file name indicates that, when the name is highlighted and clicked, the records shown in the interface are records of that file. (In some cases discussed later, the absence of the ** note after an entry in the pane indicates that, when the entry is invoked, the information shown in the interface is a view of data that is not itself taken directly from an existing file.) Clicking on a file name or other entry in the pane causes its records to be displayed in the scrollable record pane 106.
The user interface illustrated in fig. 5 (and the other figures) includes display features, menu items, analysis tools, and other functions described and illustrated in greater detail in U.S. Patent 7,512,610, which is incorporated herein by reference in its entirety.
In pane 104, after the name of each file or other entry, the key 105 of that file and its relationship to the keys of at least one other file are shown. As each file is imported and parsed, its fields are given identifiers that are unique across all three files. For example, the file Courses is identified as file A and its two fields are labeled A1 and A2. In many cases, the parsing of the fields can be done automatically based on information contained in the data source. In other cases, the user participates in identifying the fields and their delimiters.
The key of a file may be one or more fields of the file, as identified by the user. For example, the user has here identified the unique course-number column A1 as the key for file A, which is reflected in the [key] label appearing at the head of the left column.
As shown in FIG. 6, the key of file B, Sessions, has been identified by the user in box 102 as a "concatenation" of column B1 (the course number) and column B2 (the session letter). For file B, the course number (column B1) cannot be used alone as a key, because there may be two records carrying a given course number, one for each session or segment of the course. By "concatenating" the session letter (column B2) with column B1, a unique key can be formed. Here, when we refer to concatenation, we mean that a 2-tuple is formed, for example, from the data in the two columns. Thus, "X" in column B1 and "YZ" in column B2 will result in a composite key that is different from "XY" in column B1 and "Z" in column B2.
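The distinction between tuple-style concatenation and naive string concatenation can be sketched as follows (a minimal illustration, not the tool's implementation):

```python
# A composite key is a tuple, not a joined string: the tuple keeps the
# field boundaries, so ("X", "YZ") and ("XY", "Z") remain distinct keys.
def composite_key(*fields):
    """Build a composite key as a tuple of field values."""
    return tuple(fields)

key1 = composite_key("X", "YZ")   # B1 = "X",  B2 = "YZ"
key2 = composite_key("XY", "Z")   # B1 = "XY", B2 = "Z"

assert key1 != key2               # the tuples differ...
assert "X" + "YZ" == "XY" + "Z"   # ...while string concatenation collides
```

A file-B key is then simply a 2-tuple such as `composite_key("69.102", "a")`.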
The concatenated key (B1, B2) of file B is related to the key (A1) of file A. This relationship 105 is shown in pane 104 and is represented by the notation B1, B2 = A1. The notation expresses the fact that key B1 is the same as key A1, and that records in file B whose keys have the same value of B1 but different values of B2 may repeat. In effect, B1, B2 is a hierarchical key. We refer to this hierarchical arrangement as key-level repetition. The key-level repetition is a structural feature of the three files taken together (it is not readily apparent by viewing file A, and not necessarily by viewing file C). The user indicates the relationship to the application by entering the relationship of B1, B2 to A1 in the key box 102 of file B.
In figs. 7A and 7B, the Students_with_addresses file C is shown. In this file, the concatenation of columns C8 (course number), C9 (session letter), and C1 (student ID) creates a unique three-level hierarchical key. C8 and C9 correspond to B1 and B2 in file B, so the relationship of key 110 can be represented as C8, C9, C1 = B1, B2, to indicate that C8 and C9 are the same as B1 and B2, respectively, and that the combination of course number and session letter in the concatenated key may also repeat, because multiple students are typically registered for a given session of a given course.
In field C4 of file C, there are repetitions 122, 124 of instances of a student's street address for some records 120, and such repetitions may also occur in the other address columns in this example. We refer to this as field-level repetition.
It is also possible for a source data set to have duplicate records with identical keys. For example, in file C, two records carrying the same key may have the same values in all fields (although such an example is not shown in figs. 7A and 7B). We refer to this as record-level repetition. Similarly, file B may contain multiple records with the same key, possibly indicating a session of a course taught by multiple instructors (an unusual arrangement). Files C and B are related by their keys rather than by their records, so a C record should not be considered to have a parent B record (or records); rather, a C record should be considered to have a C key with a parent B key, and a B record in turn has a B key.
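The three repetition levels described above can be sketched on hypothetical file-B keys (B1 = course number, B2 = session letter); this is an illustration of the concepts, not the tool's algorithm:

```python
from collections import Counter

b_keys = [("69.102", "a"),
          ("69.208", "a"),
          ("69.208", "b"),
          ("69.208", "b")]   # a duplicated full key: record-level repetition

# Record level: the same full key appears on more than one record.
record_level = [k for k, n in Counter(b_keys).items() if n > 1]

# Key level: distinct full keys share the same parent key (B1 alone),
# i.e. the course number repeats across sessions.
parents = Counter(k[0] for k in set(b_keys))
key_level = [p for p, n in parents.items() if n > 1]

# Field level: a single field holds more than one value instance.
address_field = ["10 Main St.", "10 Main St."]
field_level = len(address_field) > 1
```

Here course 69.208 exhibits key-level repetition (sessions a and b), and its session b exhibits record-level repetition.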
When a data set has been imported and parsed into fields and records, and the records, fields, and values are displayed, a hierarchical indication of the field relationships of the records appears in a header pane 127 located above the record pane. The uppermost header 129 spans all of the fields and represents the entire record. The headers at the level 131 directly above the displayed records identify the individual fields. Each intermediate header level hierarchically spans the groups of headers below it.
The repetition structure of the data sets is indicated to the user in the user interface. As mentioned earlier, key-level repetition is called out in pane 104 and in the key box 102. Repetition at the key level is also identified by displaying the word [repeat] in the key header. For example, the possibility that the course-number key A1 (of file A) may repeat, depending on the session letter, is indicated by the word [repeat] 116 appearing after the word [key] 118 in the header of the concatenated key column. The note indicates to the user that the course number is a key, but that the key may repeat for different sessions, and that B1 and B2 together form a unique key.
In figs. 7A and 7B, field-level repetition is indicated by the word [repeat] 117 in the header of the C4 column, which tells the viewer that there may be repeated instances of the street, city, and state address fields for a given student. The word [repeat] 119 in the top-level header (spanning all fields) indicates the possibility of record-level repetition, telling the viewer that an entire record of file C may be duplicated. That is, a given student may have registered more than once for a given session of a given course.
The repetition structure is also indicated by the groups of parentheses around the values in the records shown in figs. 5, 6, 7A, and 7B. For example, in fig. 6, in column B1 of each record, the value of the course number is enclosed in parentheses to indicate that there may be key-level repetition of the course number in file B (e.g., because a given course may have two sessions). In fig. 7, the values in column C2 are each surrounded by three sets of parentheses to indicate three possible cases of repetition of a student name: (1) the key-level repetition (117 in fig. 6) implied by the relationship B1, B2 = A1, indicating that a course may have multiple sessions; (2) the second-level key repetition 116, represented by the relationship C8, C9, C1 = B1, B2, indicating that multiple students may be enrolled in a session; and (3) the record-level repetition 119, which allows file C to have multiple records with the same key (C8, C9, C1). In column C4, each value in each record carries a fourth set of parentheses to indicate that there may be field-level repetition, because each record may include more than one instance of a student address.
Thus, the interfaces illustrated in figs. 4-7 (and related applications) enable a user to import separate data collections (e.g., contained in files that may have different file formats or no file format, or not contained in formal files at all), have the fields of the records of the data collections parsed, and identify keys and key relationships among the separate files that may contain related data. The application analyzes the data sets based on the keys and indicates to the user, in the display of the records of any of the data sets, repetition present at three levels: the key level, the field level, and the record level. The application does not form any new files and does not merge or join data across the data sets. It does, however, expose to the user the fields and records of each data set and the repetitions in them.
The application determines how and where to place repetition information in the displayed records and headers using information provided by the user through a "duplicate" command. This "duplicate" command (which can be reached by right-clicking on a header and selecting parsing, followed by duplicate) specifies which elements in the header reflect repeated data. In this example, on data page B there is a repeat indication on the key, because the data may have key-level repetition with respect to A, a fact that the user has indicated to the application with the duplicate command. Wherever repetition occurs in a field of a record, the displayed record data includes at least one set of parentheses. Parent/child parentheses, shown nested, indicate nesting of repetitions. Sibling parentheses, shown side by side rather than nested, are used when the data in a field of a record is actually repeated.
In the display of the headers and records, the parentheses are supplied by the application based on the key repetition that the user defined in the "parent" file. The "[repeat]" note appears everywhere the user has explicitly told the application to expect (and extract) repeated data, whether at the key, record, or field level. If the user does not specify a repetition, no parentheses are shown. However, all records are still shown, since no inter-record processing occurs during viewing.
The parentheses indicate how many repetitions the application knows to be "above" the element in the format-level decomposition of the file and in the file's key space and key-space ancestry. For performance reasons, the application typically does not combine information from multiple records when viewing an ordinary file, so key-level and record-level repetition are not rendered faithfully. The parentheses are still included as reminders of the declared repetition, but they always appear in the singular, because information from exactly one record is shown in each row. Field-level repetition, in contrast, shows reliable repetition information, as a sequence of zero or more parenthesized strings (e.g., "(foo)(bar)").
For convenience, the application uses dark horizontal separator lines to visually indicate consecutive records with the same parent key. For example, in fig. 6, line 119 separates record 69.208a from record 69.208b. Because the parent key of the two records is 69.208 (that is, just column B1), the solid separator lines 121, 123 help form a visual group of related records. When consecutive records have the same full key (e.g., both B1 and B2 are the same), a darker separator line is used, providing two levels of information about the grouping of the records.
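The grouping of consecutive records by parent key can be sketched with hypothetical file-B keys (an illustration only; the application draws separator lines between the resulting groups):

```python
from itertools import groupby

# Consecutive records that share a parent key (column B1) form one
# visual group; groupby only merges adjacent items, matching the
# "consecutive records" behavior described above.
records = [("69.102", "a"), ("69.208", "a"), ("69.208", "b")]

groups = [(parent, list(grp))
          for parent, grp in groupby(records, key=lambda r: r[0])]
# A separator line would be drawn between successive groups.
```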
Furthermore, using features of the application, the user can filter, mark, and perform frequency analysis and other analysis operations on the records of a data set, which provides additional insight into the properties of the data. Many of these operations are explained in the patent cited earlier.
The user can, among other things, perform analysis on the portions of the data records represented by any header at any level of the header hierarchy. For example, by right-clicking on the header C4 street 120 of the record display of file C, selecting the analysis option, and then selecting the analyze column option from the pop-up menu that appears, the user can cause the application to analyze the data in that column and display the results, for example, in the window 150 shown in fig. 8.
In FIG. 8, pane 152 displays data regarding the values of the street column in the records of the data set. Pane 154 shows information about the frequency of occurrence of the various values of street. And a third pane 156 shows the meta-frequency, that is, the frequency of the frequencies shown in pane 154.
For example, although file C contains only 174 records, the display pane 152 shows that the count 125 of distinct values is 295 different streets. This reflects the fact that an instance of a street address may be repeated (in this example, duplicated) in column C4 of a given record.
Pane 154 lists the frequency of occurrence of each street address in the file, in descending order of frequency, showing the number of occurrences, the percentage of all occurrences that it represents, and the associated street value. For example, the first entry 127 in pane 154 indicates that the street address 10300 W. Bluemound Rd., Apt 310 appears four times in file C, which is 1.149% of the total number of street-address occurrences in the file. In pane 156, the number of times a given frequency occurs in pane 154 is shown (in descending order), along with the total number of address occurrences those entries represent and the percentage of all occurrences that that number represents. For example, the first entry in pane 156 indicates that the number of street addresses that appear four times in file C (2 in this case), multiplied by four, equals 8 occurrences, which corresponds to 2.299% of all street-address occurrences in the file. At the other extreme, the last entry in the pane indicates that 250 street addresses, or 71.839% of all occurrences, appear only once in the file. A similar analysis can be done and displayed for any column of any file.
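The frequency and meta-frequency computation just described can be sketched on hypothetical street values (the data here is invented; the real panes operate on file C):

```python
from collections import Counter

# Pane 154 counts occurrences per value; pane 156 counts how many
# values share each occurrence count (the meta-frequency).
streets = ["Oak St.", "Oak St.", "Elm St.", "Elm St.", "Pine Rd.", "Ash Ave."]

freq = Counter(streets)            # value -> number of occurrences
meta = Counter(freq.values())      # occurrence count -> number of values

total = sum(freq.values())
# Occurrences accounted for by values that appear exactly twice:
share = 100.0 * meta[2] * 2 / total
```

With this toy data, two values each occur twice, so they account for 4 of the 6 occurrences, mirroring how the percentages in pane 156 are derived.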
When viewing FIG. 8, if the user is interested in seeing only the records represented by one row in pane 154, the user can double-click on that entry. In response, the record display for file C changes to show only the records containing the address represented by that row of pane 154. By reviewing those displayed records, the user can infer and understand information about the records in the file and the circumstances in which repetitions occur. By repeating this process while displaying the records of each data set, the user can learn the data values, repetitions, frequencies, and other information for that data set.
In the illustrated user interface, the user can derive an understanding of the relationships among the data in the three different files. For example, in fig. 6, the user can right-click on the header B3 instructor and then select the analysis option and Analyze Column in the pop-up menu. Frequency information for the instructors in those records is then displayed (not shown here). By double-clicking on the entry showing two records for which the instructor is Chris Schulze, the record pane is updated to show only those two records. Both of those records are marked by clicking in the mark box 140. Next, the user can invoke the mark across join feature by switching to the display of the Courses file A, right-clicking on file A in pane 104, invoking the mark across join option, and then selecting file B as the mark source. This marks the records of the two courses taught by Professor Schulze, Data Migrator and Context Server, in the record pane.
The user can repeat the process by marking the record for the Data Migrator course in the record display for file B, switching the display to file C, right-clicking on file C in pane 104, selecting the mark across join option, and selecting file B as the mark source. The record display for file C then shows as marked the records of the students who took the session taught by Chris Schulze. By clicking the filter button, the user is then shown only the records of students in Chris Schulze's session.
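The effect of mark across join can be sketched as follows (hypothetical data and a simplified model, not the application's implementation): marks placed on records of one data set propagate to another data set through the shared key prefix rather than through the records themselves.

```python
# File B keyed by (B1, B2) -> instructor; file C keys are (C8, C9, C1).
b_records = {("69.102", "a"): "Chris Schulze",
             ("69.208", "a"): "Pat Lee"}
c_records = [("69.102", "a", "1001"),
             ("69.208", "a", "1002"),
             ("69.102", "a", "1003")]

# Mark the file-B records taught by Chris Schulze.
marked_b = {k for k, instructor in b_records.items()
            if instructor == "Chris Schulze"}

# A file-C key relates to a file-B key by its (C8, C9) prefix, so the
# marks select the related C records via the key relationship.
marked_c = [r for r in c_records if r[:2] in marked_b]
```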
To summarize the specific examples described so far: a user can import files that contain related data, have different file formats, and come from different sources. The user can specify the record and field delimiters for each file and, as needed, the keys and key relationships of the files. The files may have data repetition at each of three levels: the key level, the record level, and the field level. The tool indicates the presence of repetition at each level. In addition, using the marking, filtering, and mark-across-join features of the application, the user can come to understand the relationships among the data elements in the three files. However, the process of navigating back and forth and using those functions to understand the relationships of the data elements is somewhat cumbersome.
For example, if entry 127 in pane 154 is double-clicked to reveal that two records together provide four instances containing the address 10300 W Bluemount, and if the same student identification number is associated with those two records, the user has no simple way to see more information about the course sections for which that student is registered, other than switching to the display of the course section file and using the joint-marking, filtering, marking, and analysis features of the interface to find the records showing the required information. Thus, a user may observe and understand the interrelationships of data in different data sets, but in some cases the process is cumbersome.
This process can be simplified and enriched by invoking a feature of the tool called join all. The join all feature processes the three files (or data sets) that have been parsed and whose keys and key relationships have been identified by the user to produce a new composite file. This new file captures the data of the three source files A, B, and C in a form that allows faster and easier analysis by the user. The user creates the composite file by clicking on the File menu item, selecting the export data option, and then selecting join all source files in the dialog box that appears.
For records that are joined, the join all operation creates corresponding hierarchical records that mirror the relationships between their keys. Because the keys are hierarchically related, the join can be simplified by first sorting each page's records by key and then processing them sequentially. By page we mean the portion of the joined file associated with one of the files that is joined when all of the files are joined. For example, original file B corresponds to page B of the joined file. Additional pages are added to the format when the final composite file is created from the pages derived from the original files. The additional pages decode the joined records and send each appropriate portion to the corresponding page. Thus, in effect, each of the original files has been replaced by a browsing page that receives the information it displays to the user (i.e., the information used to create the view) from another page of the joined file, rather than from one of the original files.
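The sequential processing of sorted, hierarchically keyed records amounts to a sorted-merge join. A minimal sketch, under the assumption that records are dictionaries sorted by key and that a parent's key is a string prefix of each of its children's keys; the names `join_hierarchical`, `parent_key`, and `child_key` are illustrative, and the tool's actual joined-record format is far more elaborate:

```python
def join_hierarchical(parents, children, parent_key, child_key):
    """Attach each child record to the parent whose key is a prefix of
    the child's key.  Both lists must already be sorted by key; each
    list is then consumed in a single sequential pass, so the join
    costs one scan rather than a nested lookup per child.
    """
    joined = [dict(p, joined=[]) for p in parents]
    i = 0
    for child in children:
        ck = child[child_key]
        # advance past parents whose keys sort at or before this child's key
        while i + 1 < len(joined) and joined[i + 1][parent_key] <= ck:
            i += 1
        if joined and ck.startswith(joined[i][parent_key]):
            joined[i]["joined"].append(child)
    return joined
```

For the course/session example, two session records whose keys extend "69.208" would land inside the single course record with key "69.208", mirroring the hierarchical records that join all produces.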
As shown in FIG. 14, the header shown in the decoded page (a file called D.Joined_Records) illustrates the format of the joined records. The decoding of the records associated with page B is shown in the partial header structure shown in the figure (by page B we mean the part of the joined file that represents the data derived from the original sessions file B).
Starting at the top element and continuing down, the header structure includes the elements "D1^JoinedRecords", "D1^", and "D14^". The representation of the file decoding begins with element "D1^", which is an element "with size", meaning that the record contains a size field in front of the data that specifies the length of the record's data in bytes. (The size field itself is 4 bytes long, in big-endian format.) The "D1^" elements are used to decode the records for data page A. The "D14^" elements are used to decode the records for data page B. The first element "D1^" decodes the field containing all the data related to the records joined to the corresponding record in data page A. This element in turn contains "D14^", which is marked "with repeat", meaning that there can be multiple records, each a record with size. Looking back at the "D1^" elements, we see that they follow the same pattern as the "D14^" elements, except that "D1^" has no repeat. This is because this element is used to decode the root joined record, which is not repeated. Strictly speaking, "D1^" is redundant, but its presence provides a repeating pattern that works for hierarchical joined data sets.
Under "D1^" are the elements used to decode the records of the joined data for page B. These elements include records from page C that are joined to a particular record of page B. This containment pattern repeats for all of the joined data pages.
The elements "D15^" decode the data related only to page B. If record-level repetition had been placed on the original data page B, "[repeat]" would appear on element D15. "D15^" is further composed of body and header elements indicating the actual data. In element D15, the notation "->B." indicates the source of the data.
The header extracted by element D16^ is further decoded by elements D16 through D23. As shown, the header contains the file name, file path, and number of records of the original data. It also contains the number of records of the joined data. Each header field is preceded by a single byte identifying the field.
Similar explanations apply to the other header elements illustrated in the other figures.
For the example we have discussed, FIG. 9 shows the seven records of the generated join all composite file, referred to as file D.
As shown in FIGS. 10A to 10H (segments), and as indicated in the left-side panes of those figures, the file being analyzed and viewed, indicated by two asterisks, is file D. The entries for files A, B, and C lack the two asterisks and instead carry the notation "from file D", which indicates that when one of those entries is invoked and records are displayed, the records are the ones that have been derived from the Joined_Records file D rather than a direct view of the records of the three source files. For example, by clicking on entry A in pane 104, the user is shown data that was derived from the original source file A and integrated into the composite file D.
File D contains exactly seven records created from the source files that were parsed and given keys. Each record in file D is associated with a course. So, for example, column D1 contains the numbers of the seven different courses and is a key for file D, as indicated by the word Key in the column header. Column D2 contains the course number and course name for each course, which have been derived from (joined with) the corresponding file A based on key D1. The source of the data in column D2 is indicated in the column header by the phrase "->A.Courses [Joined records]".
Column D4 identifies the name of the file from which the data for these records is derived, column D6 is the file path of that file, column D8 is the number of original records in the file, and column D10 is the number of joined records. Columns D2 through D10 together form the so-called course record portion of file D.
Column D14 contains a key instance for each course section. For example, in record 2 of file D, there are two session key instances for course 69.208. For clarity, the instances are separated in the display by the separator 00. For example, "69.20800a" is actually "69.208", followed by a null byte (shown generally as 00 in red), followed by the letter "a". A 00 byte present within a key component is encoded as a 01 byte followed by another 01 byte. A 01 byte present within a key component is encoded as a 01 byte followed by a 02 byte. This encoding preserves the sort order of the compound keys and lengthens the keys only when a 00 or 01 byte occurs within a key component (beyond the 00-byte separators), which is an unlikely scenario.
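The escape scheme just described is a standard order-preserving encoding and can be sketched as follows; the function names `encode_key` and `decode_key` are illustrative, not part of the tool:

```python
def encode_key(components):
    """Flatten a list of byte-string key components into one key.

    Components are separated by 00 bytes.  Within a component, a 0x00
    byte is escaped as 0x01 0x01 and a 0x01 byte as 0x01 0x02, so that
    byte-wise comparison of encoded keys matches component-wise
    comparison of the originals.
    """
    escaped = []
    for comp in components:
        out = bytearray()
        for b in comp:
            if b == 0x00:
                out += b"\x01\x01"
            elif b == 0x01:
                out += b"\x01\x02"
            else:
                out.append(b)
        escaped.append(bytes(out))
    return b"\x00".join(escaped)

def decode_key(encoded):
    """Invert encode_key, recovering the original component list."""
    components, out, i = [], bytearray(), 0
    while i < len(encoded):
        b = encoded[i]
        if b == 0x00:                       # component separator
            components.append(bytes(out))
            out = bytearray()
            i += 1
        elif b == 0x01:                     # escape sequence
            out.append(0x00 if encoded[i + 1] == 0x01 else 0x01)
            i += 2
        else:
            out.append(b)
            i += 1
    components.append(bytes(out))
    return components
```

Because the 00 separator sorts before any escaped byte (which always begins with 01), a key that is a prefix of another, such as the course key ("69.208") relative to the session key ("69.208", "a"), still sorts first, and comparing encoded keys byte by byte matches comparing their component lists.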
In the Body of column D15, the name of the instructor for each course and each session is shown. The session records under column D15 span D17, D19, D21, and D23 and capture information about the sources of the session information. In record 2, the names of two instructors appear because course 69.208 has two sessions with different instructors. The information in column D15 has been derived from (joined with, as indicated in the column header) file B, which lists the instructor's name for each session. While original file B has eight records, reflecting the fact that the courses have eight sessions in total, in file D all eight of those records are captured in only six records in column D15, because two of the courses have two sessions each, and each of those session pairs can be recorded in a single record.
Column D27 shows key information for all student instances in each course, as well as associated information about the course and session in which each student is registered. As with column D14, the key information for the course, course section, and student ID are separated by 00 separators. For example, record 6 holds key information for a large number of student instances registered for the course, each key including a course number, a session letter, and a student ID.
Similarly, column D28 captures the student address records by joining with file C. Columns D30, D32, D34, and D36 capture information about the original location of the joined information.
File D is a non-rectangular file in which, for each course record, each column of the record may have multiple instances (repetitions) at different repetition levels. For example, there may be the student IDs of a number of students who have registered for a course. Column D27 captures the multiple IDs and, for each, an associated course and session identifier. The information linking the data of the source files is thus fully captured and immediately available, allowing a user to view the related data in the different files.
This arrangement differs from a typical database table, in which each record is rectangular; that is, each column has space for a single value of the attribute of that column, and additional fields must be provided for additional values of the attribute. The lengths of all records, measured as one value entry per column, are all the same and are defined by the number of columns, which is why such a table is rectangular. By comparison, file D is non-rectangular in that the length of a record may be more than one value entry per column, and the lengths therefore need not all be the same.
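The contrast can be made concrete with a toy record. In the sketch below (the field names are illustrative, not the tool's actual column names), the rectangular record holds exactly one value per column, while the non-rectangular record, like a record of file D, may hold any number of value instances per column:

```python
# A rectangular record: exactly one value per column.
rectangular = {"course": "69.208", "session": "a", "student_id": "1001"}

# A non-rectangular record of the kind stored in file D: a column may
# hold any number of value instances, so one course record can carry
# all of its sessions and all of its enrolled students.
non_rectangular = {
    "course": "69.208",
    "sessions": ["a", "b"],
    "students": [
        {"course": "69.208", "session": "a", "student_id": "1001"},
        {"course": "69.208", "session": "b", "student_id": "1002"},
    ],
}

def value_counts(record):
    """Number of value instances per column; a rectangular record
    yields 1 for every column."""
    return {col: len(v) if isinstance(v, list) else 1
            for col, v in record.items()}
```

A rectangular table would need either extra columns or extra rows to hold the repeated sessions and students that a single non-rectangular record carries directly.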
The application generates not only the join all composite file in flat form, but also three views of the data of file D corresponding to the three source files A, B, and C. Having the data organized in non-rectangular records in file D, and having the three views (A, B, and C) associated with the data in D, gives the user the opportunity to view and use the data of the three source files more easily, quickly, and intuitively.
For example, as shown in FIG. 11, assume that the user is interested in the course Configurable Parser, represented by the second record of file D, and marks the record and filters accordingly, leaving the single record displayed. Now assume that the user is interested in details about the students who registered for the course. By simply clicking on the entry for file C (students) in pane 104, the user is immediately presented with the data shown in FIG. 12, which shows in a single record all of the detailed information about the students who registered for the course. Furthermore, when the user causes the application to perform a frequency analysis of the records for a given field, the analysis is immediately reflected in (carried over to) views A, B, and C, as well as the view of file D. For example, suppose that after marking and filtering record 2 in file D, the user performs a frequency analysis on the student information in that record and gets the results shown in FIG. 13.
Thus, the composite file generated by the join all feature enables users to more easily view, analyze, and understand the data, the data sets, and their relationships, including any duplications that may exist.
The tools described here can be implemented on a wide variety of software platforms running on a wide variety of hardware configurations, using data sets and composite files stored on a wide variety of storage devices.
Other implementations are within the scope of the following claims.
For example, a variety of user interface styles may be used to display the records or other data of the source data sets or of the composite file. Similarly, a variety of user interface devices may be provided to enable a user to mark and unmark records, filter and unfilter records, analyze and display frequency statistics, create and undo joins, create composite files, and view some or all of the data sets, records, and fields. The headers used to identify fields may be displayed in different ways. Information about repetitions may be presented to the user in various ways.

Claims (32)

1. A method of interpreting a data set, comprising:
receiving two or more data sets, each data set containing data that can be interpreted as a record, each record in each data set having data values for the data fields of the record, the records in each data set having at least one data field that is different from all data fields of records in another one of the data sets, each data field being identified by a data field identifier, the at least one data field of a record in each data set being related to the at least one data field of a record in at least one of the other data sets, and the data in different ones of the data sets being organized or expressed differently,
determining a key for each of the data sets based on one or more data field identifiers of the data fields of that data set, wherein the keys for different data sets are different, the data sets being characterized by a repetition of at least one of (a) a record, (b) a portion of a key, or (c) a value instance of a data field, and the keys of the two or more data sets containing information about at least one of the repetitions, and
providing the information about at least one of the repetitions based on a key in the set of data.
2. The method of claim 1, wherein at least one of the data sets comprises a file having a file format.
3. The method of claim 1, wherein at least two of the data sets comprise files having different file formats.
4. The method of claim 1, further comprising receiving information from a user regarding the manner in which data of at least one of the data sets may be interpreted as records, each of the records having data values of data fields.
5. The method of claim 1, wherein a key of a record of one of the data sets has a defined hierarchical relationship with a key of a record of another of the data sets.
6. The method of claim 1, wherein the repetition of a record comprises a duplicate record in one of the data sets.
7. The method of claim 1, wherein the repetition of the portion of a key comprises a value of a key in one of the data sets corresponding to two different values of a portion of keys in another of the data sets.
8. The method of claim 1, wherein the repetition of the value instance of the data field comprises two or more value instances being included in a given field.
9. The method of claim 1, further comprising enabling a user to perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on values of records of at least one of the data sets.
10. The method of claim 1, wherein providing the information about at least one of the repetitions based on a key in the set of data comprises displaying the information.
11. The method of claim 10, wherein the displaying comprises displaying records of the data set, data field identifiers of fields of the records, and indications of duplicates in a data set.
12. The method of claim 10, wherein displaying information about the repetitions comprises displaying repeated instances of values for data fields.
13. The method of claim 10, wherein displaying information about the duplication comprises indicating that duplicate records exist in a data set.
14. The method of claim 10, wherein displaying information about the repetitions includes indicating that there is a repetition of a portion of a key.
15. The method of claim 1, further comprising: enabling a user to create an integrated file of records, the integrated file including data of the data sets and information about the repetitions.
16. The method of claim 15, wherein the integrated file contains key-bound records.
17. The method of claim 16, wherein a key of the integrated file comprises a hierarchical concatenation of data fields of the data sets.
18. The method of claim 15, wherein duplicate data values are included in a given data field of a record of the integrated file.
19. The method of claim 15, further comprising displaying a record of the integrated file to the user.
20. The method of claim 15, further comprising displaying a view of data in the integrated file, the data corresponding to data of one of the data sets from which the integrated file was created.
21. The method of claim 15, further comprising enabling a user to perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on values of records of at least one of the data sets.
22. The method of claim 15, further comprising displaying a view of data in the integrated file, the data corresponding to data of the data sets from which the integrated file was created, and enabling a user to perform at least one of marking, unmarking, filtering, unfiltering, and frequency analysis on values of records of the integrated file, the marking, unmarking, filtering, and frequency analysis being automatically applied to other views of the data.
23. A method of interpreting a data set, comprising:
receiving a data set containing data that can be interpreted as records, each record of the data set having data values for data fields of the record, the data set characterized by any number of repetitions of value instances for at least one of the data fields, each data field identified by a data field identifier,
determining a key of the data set based on one or more data field identifiers of data fields of the data set, the key being different from any other key determined for any other data set and containing information about at least one of the repetitions, and
providing the information related to at least one of the repetitions based on the key.
24. The method of claim 23, wherein the data set comprises a file having a file format.
25. The method of claim 23, further comprising receiving information from a user regarding the manner in which data of the data set may be interpreted as records, each of the records having data values of data fields.
26. The method of claim 23, further comprising parsing data of the data set based on the data fields.
27. The method of claim 23, further comprising enabling a user to perform marking, unmarking, filtering, unfiltering, and frequency analysis on values of records of the data set.
28. The method of claim 23, wherein providing the information related to at least one of the repetitions based on the key comprises displaying the information.
29. The method of claim 28, wherein the displaying comprises displaying records of the data set, an identification of fields of the records, and an indication of a duplication in a data set.
30. The method of claim 28, wherein displaying information about the repetitions comprises displaying repeated instances of values for data fields.
31. A system for interpreting a data set, comprising:
means for receiving two or more data sets, each data set containing data that can be interpreted as records, each record in each data set having a data value for a data field of the record, the records in each data set having at least one data field that is different from all data fields of records in another one of the data sets, each data field identified by a data field identifier, the at least one data field of a record in each data set being related to the at least one data field of a record in at least one of the other data sets, and the data in different ones of the data sets being organized or expressed differently,
means for determining a key for each of the data sets based on one or more data field identifiers of the data fields for that data set, wherein the keys for different data sets are different, the data sets being characterized by a repetition of at least one of (a) a record, (b) a portion of a key, or (c) an instance of a value of a data field, and the keys in the two or more data sets containing information about at least one of the repetitions, and
means for providing the information regarding at least one of the repetitions based on a key in the set of data.
32. A method of interpreting a data set, comprising:
receiving two or more data files, each data file containing data that can be interpreted as a record, each record in each data file having data values for the data fields of the record, the records in each data file having at least one data field that is different from all data fields of records in another one of the data files, each data field being identified by a data field identifier, the at least one data field of a record in each data file being related to the at least one data field of a record in at least one of the other data files,
the data in at least two of the data files is expressed according to two different file formats,
determining a key for each of the data files based on one or more data field identifiers of data fields in the data file, the keys of different data files being different, the data file characterized by a repetition of at least one of (a) a record, (b) a portion of a key, or (c) an instance of a value of a data field, the key of a record in the two or more data files containing information related to at least one of the repetitions,
displaying to a user records of the data files, identifications of the fields of the records, and indications of the duplication in the data files, the duplication including repeated value instances of a data field, and
enabling the user to create an integrated file of records comprising the data of the data files and information about the repetitions.
HK13101133.1A 2009-08-18 2010-08-18 Interpreting data sets using repetition of records, keys and/or data field values HK1174121B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/542,969 US9092411B2 (en) 2009-08-18 2009-08-18 Understanding data in data sets
US12/542,969 2009-08-18
PCT/US2010/045823 WO2011022446A1 (en) 2009-08-18 2010-08-18 Interpreting data sets using repetition of records, keys and/or data field values

Publications (2)

Publication Number Publication Date
HK1174121A1 HK1174121A1 (en) 2013-05-31
HK1174121B true HK1174121B (en) 2016-08-26
