
HK1210295B - Mapping entities in data models - Google Patents

Mapping entities in data models

Info

Publication number
HK1210295B
Authority
HK
Hong Kong
Prior art keywords
entity
expression
node
entities
nodes
Prior art date
Application number
HK15110871.6A
Other languages
Chinese (zh)
Other versions
HK1210295A1 (en)
Inventor
Craig W. Stanfill
Original Assignee
Ab Initio Technology LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ab Initio Technology LLC
Priority claimed from PCT/US2013/051837 (WO2014018641A2)
Publication of HK1210295A1
Publication of HK1210295B


Description

Mapping entities in a data model
Cross Reference to Related Applications
This application claims priority to U.S. Application No. 61/675,053, filed July 24, 2012, which is incorporated herein by reference.
Background
The application relates to mapping entities in a data model.
In information systems, data models are used to describe data requirements, data types, and computations on data, such as data being processed in or stored in a database. A data model includes entities and relationships between the entities, defined by one or more schemas. In general, an entity is an abstraction of an item in an information domain that can exist independently and be uniquely identified. A relationship describes how two or more entities are related to each other; informally, relationships can be thought of as verbs and entities as nouns. A schema represents a particular set of entities and the relationships between those entities.
Complex operations involving data associated with a data model may be performed using various database operations, such as join operations or aggregation (or "rollup") operations. These operations may be represented as data flowing through a directed graph, with operational components associated with the vertices of the graph and data flows between those components corresponding to the links (arcs, edges) of the graph. One system for performing such graph-based computations is described in U.S. Patent No. 5,966,072, "Executing Computations Expressed as Graphs".
Disclosure of Invention
In one aspect, in general, a method for processing data in one or more data storage systems includes: receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and processing the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities. The processing includes generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and at least some of the sets forming a directed acyclic graph that includes links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by the relational expression of a node in the directed acyclic graph. The processing further includes merging at least two of the sets with each other to form a third set, based on comparing relational expressions of the merged nodes.
These aspects can include one or more of the following features.
The mapping information includes a first mapping rule defining an attribute value of the target entity in dependence on an attribute value of the first source entity and an attribute value of the second source entity.
The first set of nodes associated with the first mapping rule comprises: a first node representing a first relational expression comprising a relational algebra operation referencing the first source entity and the second source entity; a second node, connected to the first node, representing a relational expression comprising the first source entity; and a third node, connected to the first node, representing a relational expression comprising the second source entity.
The mapping information comprises a second mapping rule defining an attribute value of the target entity in dependence of an attribute value of the first source entity.
The merging includes merging the first set and a second set of one or more nodes associated with the second mapping rule, including merging the second node with a node of the second set that represents a relational expression that includes the first source entity.
The relational algebra operation is a join operation.
The relational algebra operation is an aggregation operation.
The first source entity and the second source entity are associated with each other according to a relationship defined by a schema.
The schema includes a plurality of entities, and relationships between the entities include one or more of: a one-to-one relationship, a one-to-many relationship, or a many-to-many relationship.
Generating the program specification includes generating, from the third set, a dataflow graph that includes components for performing operations corresponding to the relational expressions in the nodes of the third set, and links representing flows of records between output ports and input ports of the components.
Generating the program specification includes generating a query language specification from the third set, the query language specification including a query expression for performing an operation corresponding to a relational expression in each node of the third set.
Generating the program specification includes generating, from the third set, a computer program including a function or expression for performing an operation corresponding to a relational expression in each node of the third set.
The computer program is specified in at least one of the following programming languages: Java, C, or C++.
The method also includes processing records in the data storage system in accordance with the program specification to compute values corresponding to at least some of the one or more attributes of the one or more target entities.
In another aspect, in general, a computer-readable storage medium stores a computer program for processing data in one or more data storage systems. The computer program includes instructions for causing a computer system to: receive mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and process the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities. The processing includes generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and at least some of the sets forming a directed acyclic graph that includes links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by the relational expression of a node in the directed acyclic graph. The processing further includes merging at least two of the sets with each other to form a third set, based on comparing relational expressions of the merged nodes.
In another aspect, in general, a computer system includes: one or more data storage systems; an input device or port for receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and at least one processor configured to process the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities. The processing includes generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and at least some of the sets forming a directed acyclic graph that includes links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by the relational expression of a node in the directed acyclic graph. The processing further includes merging at least two of the sets with each other to form a third set, based on comparing relational expressions of the merged nodes.
In another aspect, in general, a computer system includes: one or more data storage systems; means for receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and means for processing the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities. The processing includes generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and at least some of the sets forming a directed acyclic graph that includes links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by the relational expression of a node in the directed acyclic graph. The processing further includes merging at least two of the sets with each other to form a third set, based on comparing relational expressions of the merged nodes.
These approaches may include one or more of the following advantages.
Techniques for generating a program specification make it possible to express data-processing problems in terms of a mapping between a source schema and a target schema, and to solve them, e.g., in the form of executable modules such as dataflow graphs. User input is specified at a higher level of abstraction by expressing attributes of entities in a desired target schema in terms of attributes of entities in an existing source schema. For example, a user can express a rule based on the data of a dataset that also references auxiliary information from other sources, without manually creating a dataflow graph to extract the auxiliary information from those sources. The user input defining the target schema is processed to provide a program specification that extracts all required information from any entity in one or more source schemas.
Other features and advantages of the invention will become apparent from the following description and claims.
Drawings
FIG. 1A is a block diagram of an exemplary computing system for controlling graph-based operations.
FIG. 1B illustrates an example of a dataflow graph.
FIGS. 2A and 2B are illustrative schemas represented as entity-relationship diagrams.
FIG. 3 illustrates an exemplary mapping from a source schema to a target entity.
FIG. 4 illustrates an example process for converting source schema and mapping information into graph components.
FIG. 5 illustrates a more complex example of mapping from a source schema to a target entity.
FIG. 6 illustrates exemplary expression nodes.
FIGS. 7 and 8 illustrate an exemplary mapping from a source schema to one or more target entities.
FIGS. 9A and 9B show examples of schemas.
FIGS. 10A-10D illustrate a process for generating a dataflow graph from expression nodes.
FIGS. 11A-11C illustrate example configurations of generated dataflow graphs.
FIGS. 12A and 12B are examples of data models.
Detailed Description
In an information system, a statement of a problem to be solved may be received in various forms that use data and operations on that data. The data and its operations may correspond to entities and relationships between entities, such as in an ER (entity-relationship) diagram. Typically, the problem statement is decomposed or translated by a user into raw data and the expressions used, e.g., in a complex query language. In some embodiments, the system can generate a dataflow graph or query-language expression directly from the initial problem statement. To this end, systems and methods are described for automatically decomposing problem statements into, for example, joins, rollups, and other data transformations that become part of a generated graph or expression. Further, the systems and methods may reference auxiliary information stored, for example, in an external database or an external file.
FIG. 1A is a block diagram showing the interrelationship of components of a computing system 100 for developing, executing, and controlling graph-based operations. Graph-based operations are performed on a "dataflow graph", represented by a directed graph whose vertices represent components (e.g., components having a source or sink operator type, such as a dataset, or components performing data operations according to a specified operator type), and whose directed links or "edges" represent flows of data between the components. A Graphical Development Environment (GDE) 102 provides a user interface for specifying executable dataflow graphs and defining parameters for dataflow graph components. The GDE 102 communicates with a repository 104 and a parallel operating environment 106. Also coupled to the repository 104 and the parallel operating environment 106 are a user interface module 108 and an executable program (executable) 110. The executable 110 controls the execution of dataflow graphs within the parallel operating environment 106. Dataflow graphs may also be generated through a software Application Program Interface (API) without interacting with the GDE 102.
The repository 104 is an extensible, object-oriented database system designed to support the development and execution of graph-based applications and the exchange of metadata between graph-based applications and other systems. The repository 104 is a storage system for various kinds of metadata, including documents, record formats (such as the fields and data types of records in a table), transform functions, dataflow graph specifications, and monitoring information. The repository 104 also stores data objects and entities representing the actual data to be processed by the computing system 100 (including data stored in an external data store 112).
The parallel operating environment 106 accepts specifications of dataflow graphs generated in the GDE 102 and executes computer instructions corresponding to the processing logic and resources defined by those dataflow graphs. The operating environment 106 may be hosted on one or more general-purpose computers under the control of a suitable operating system, such as the UNIX operating system. For example, the operating environment 106 may run on a multi-node parallel computer system that includes a configuration of computer systems using multiple Central Processing Units (CPUs), which may be local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remote, or remotely distributed (e.g., multiple processors coupled through a local-area or wide-area network), or any combination thereof.
The user interface module 108 provides web browser-based viewing of content in the repository 104. Through the user interface module 108, a user may browse objects, create new objects, and modify existing objects, among others. Through the user interface module 108, a user may browse and optionally edit information contained in and/or associated with stored data objects.
The executable program 110 is a repository-based job scheduling system accessible through the user interface module 108. The executable 110 maintains jobs and job queues as objects within the repository 104, and the user interface module 108 provides views of, and operational controls over, jobs and job queues.
One example of the system 100 for generating and executing dataflow graphs has the following features. (Other examples of systems having fewer than all of these features can also be used.)
Each vertex in the dataflow graph represents a component that applies a specified operator to data, and the vertices are connected by directed links that represent the flow of data between components (referred to as a "dataflow").
Each component has at least one named connection point (called a "port"), and the operators of the component operate on input data flowing into one or more "input ports" and/or provide output data flowing from one or more "output ports".
Each port is associated with a data format that defines the format of data items flowing into or out of the port (e.g., for a record stream, the data format includes a record format that defines the format of a record as an independent data item).
Each port has a permitted cardinality, which is a constraint on the number of data flows that can be connected to the port. For example, the permitted cardinality of a port may allow only a single data flow to be connected to the port, or may allow multiple data flows to be connected to the port.
Each data flow in the graph connects an output port of one component to an input port of another component. The two ports are associated with the same data format.
The system 100 can prevent looping of dataflow graphs. If a path is followed from a vertex in the dataflow graph back to the vertex by traversing the dataflow, this indicates that the dataflow graph has a loop and execution may be inhibited.
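The loop check described above can be sketched as a standard depth-first search: follow dataflow links from each vertex, and if a path returns to a vertex already on the current path, the graph has a loop and execution can be inhibited. The dictionary-of-lists graph representation below is an assumption made for illustration.

```python
def has_cycle(graph):
    """Return True if following dataflow links from any vertex leads back to it.
    graph: dict mapping each component to the components its outputs feed."""
    visiting, done = set(), set()

    def visit(v):
        if v in visiting:
            return True          # the path returned to v: a loop exists
        if v in done:
            return False
        visiting.add(v)
        if any(visit(w) for w in graph.get(v, ())):
            return True
        visiting.discard(v)
        done.add(v)
        return False

    return any(visit(v) for v in graph)

acyclic = {"input": ["reformat"], "reformat": ["output"]}
cyclic = {"a": ["b"], "b": ["a"]}
```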
The operator of a component has an operator type, which specifies the kind of operation the component applies; e.g., an "input file" operator type corresponds to components that read input files, and an "aggregate" operator type corresponds to components that apply an aggregation operator.
An operator is also associated with parameters for configuring the operator. For example, the "input file" operator has a parameter that provides the name of the input file to be read. The set of supported parameters depends on the operator type.
The set of input ports and output ports supported by a component depends on the component's operator type. The operator type may also determine the permitted cardinality of the supported ports.
A component has a customizable label that identifies it and appears on the vertex representing the component in a visual representation of a dataflow graph. Unlike the operator type, the label uniquely identifies a component within a dataflow graph.
A port has a customizable label that identifies it and appears next to the visual representation of the port on the visual representation of its component. The label uniquely identifies the port within the component.
Referring to FIG. 1B, one example of a visual representation of a dataflow graph 120 includes two components (with the "input file" operator type) labeled "first input file" and "second input file", respectively. Each has an output port labeled "read". The dataflow graph 120 includes two instances of components (with the "reformat" operator type) labeled "first reformat" and "second reformat", respectively. Each reformat component has an input port labeled "in" and an output port labeled "out". The dataflow graph 120 includes an instance of a component with the "join" operator type, labeled "A-join". The join component has two input ports labeled "in0" and "in1", respectively, and one output port labeled "out". The dataflow graph 120 also includes an instance of a component with the "output file" operator type, labeled "output file". Five data flows connect these ports. Giving components different visual appearances (e.g., based on operator type) helps developers and users distinguish them.
When executing a dataflow graph, the system 100 performs actions associated with the semantics of the dataflow graph. For example, a data flow represents an ordered set of data items, each conforming to the data format associated with the ports the data flow connects. A component (e.g., executed by a process or thread) reads data items from its input port(s), if any, performs the operation associated with applying its operator, and writes data items to its output port(s), if any. In some cases, a component may also access external data (e.g., read or write a file, or access data in a database). The operation associated with applying an operator depends on the data items (if any) read from the input port(s), the parameters associated with the operator, and/or any external data (e.g., a file or a database table) accessed by the component. If a component writes data items to an output port, those data items are transferred to any connected input port as inputs to the operators of downstream components. The order of the data items represented by a data flow corresponds to the order in which the data items are transferred between the output port and the input port, and may (but need not) correspond to the order in which the data items are computed.
One example of a port's data format is a record format, which defines the data types of individual values within a given record, where each value is identified by an identifier. The syntax of the record format conforms to a particular language. The following is an example of record-format syntax in a Data Manipulation Language (DML).
record
  <data type 1> <identifier 1>;
  <data type 2> <identifier 2>;
end;
One example of a data type is "int", which defines a value, identified by its identifier, that is treated as an integer in a native binary format. The system 100 may support a variety of data types; however, for purposes of illustration, the examples here use the "int" data type. An example of a record format using this DML syntax is as follows.
record
  int a1;
  int a2;
  int a3;
end;
This record format corresponds to a record consisting of three consecutively stored binary integers identified as a1, a2, and a3.
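To make the layout concrete, such a record can be packed and unpacked with Python's standard struct module. The choice of a 4-byte little-endian "int" is an assumption about the native binary format, made only for this illustration.

```python
import struct

# Three consecutive binary integers, as in the record format above.
# "<iii" assumes 4-byte little-endian ints (one plausible native format).
record_format = "<iii"

raw = struct.pack(record_format, 1, 2, 3)      # serialize a record
a1, a2, a3 = struct.unpack(record_format, raw)  # parse it back into fields
```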
The system 100 may also support components with a variety of operator types. For purposes of illustration, the following operator types are described: input file, output file, input table, output table, reformat, copy, join, and rollup.
The "input file" component has an output port labeled (by default) "read" and a parameter that determines the file to be read (a "filename" parameter). The read port may be connected to one or more data flows. The record format associated with the read port (the read format) is selected (e.g., by a developer, or automatically determined) to correspond to the actual format of the records in the determined file. The operator applied by the "input file" component reads successive records of the determined file (e.g., records having the record format associated with the read port) and provides them to the read port (e.g., by writing them to a particular buffer, or through some other transfer mechanism).
For example, the "input file" component can be configured as follows.
operator type: input file
filename: a.dat
read format:
  record
    int a1;
    int a2;
    int a3;
  end;
To ensure that the "input file" component can successfully read the determined file and provide the records contained in the file to the read port, the physical record format actually used by the records in the file a.dat should match the record format given as the read format.
The "output file" component has an input port labeled (by default) "write" and a parameter that determines the file to be written (a "filename" parameter). The write port may be connected to no more than one data flow. The record format associated with the write port (the write format) is selected (e.g., by a developer, or automatically determined) to correspond to the actual format of the records in the determined file. The operator applied by the "output file" component receives the successive records provided to the write port (e.g., by reading from a particular buffer or through some other transfer mechanism) and writes those records to the determined file.
There are other operator types for accessing data stored in media other than files, such as the "input table" and "output table" operator types, which may be used, for example, to access data stored in relational database tables. The configuration and operator functionality of these operator types are similar to those of the "input file" and "output file" operator types described above.
The "reformat" component has an input port labeled (by default) "in" and an output port labeled (by default) "out", along with a parameter that determines the component's operator: a transform function performed on each input record received at the input port, producing an output record provided to the output port.
The transform function may be defined by a parameter of the "reformat" component, using a syntax of multiple entries that defines an expression for each of a number of output fields.
out :: reformat(in) =
begin
  out.<field 1> :: <expression 1>;
  out.<field 2> :: <expression 2>;
end;
Each field is the name of a field in the record format of the output port. Each expression is built from literal values (e.g., the values 1, 2, and 3), input fields (e.g., in.a1), and any of a variety of algebraic operators (e.g., addition (+) and multiplication (*)). Expressions may reference only fields that exist in the input record format. The transform function typically provides one expression for each field of the output record format; however, there is a procedure for determining default values for fields of the output record format that are not included in the transform function.
For example, the "reformat" component can be configured as follows.
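The configuration listing itself appears to have been lost from this text. A plausible reconstruction, following the transform-function syntax shown above, is given below; the output field names and expressions are illustrative assumptions, not the original example.

```text
operator type: reformat
transform:
  out :: reformat(in) =
  begin
    out.b1 :: in.a1 + 1;
    out.b2 :: in.a2 * in.a3;
  end;
```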
The "copy" component generates multiple copies of each input record and has an input port labeled (by default) "in" and an output port labeled (by default) "out". Multiple data flows may be connected to the output port.
The join component, which applies the join operator, has two input ports labeled (by default) "in0" and "in1", respectively, and one output port labeled (by default) "out". The join component has two parameters, key0 and key1, which specify the fields used as the join operator's key fields. The join operator is associated with a transform function that maps fields in the records of input ports in0 and in1 to fields in the records of the output port. Whereas the transform function of the "reformat" component takes one argument, the transform function of the join component takes two arguments; otherwise, its syntax is similar to that of the "reformat" component's transform function.
The join component applies the join operator by performing a relational inner join on its two inputs. For example, for each pair of records R0 and R1 from the in0 and in1 ports, respectively, in which the values of the key fields R0.key0 and R1.key1 are equal, the join component calls the transform function on R0 and R1 and provides the result to the output port.
For example, the join component may be configured as follows.
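The configuration listing did not survive in this text. Based on the behavior described next (matching records where R0.a3 = R1.b3), a plausible reconstruction in the syntax shown earlier is given below; the transform body is an illustrative assumption.

```text
operator type: join
key0: a3
key1: b3
transform:
  out :: join(in0, in1) =
  begin
    out.a1 :: in0.a1;
    out.b1 :: in1.b1;
  end;
```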
The join component finds each pair of records R0 and R1 from in0 and in1, respectively, for which R0.a3 = R1.b3, passes the pair of records to the transform function, and provides the result of the transform function to the output port.
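The behavior just described can be sketched in Python as a nested-loop inner join, with dicts standing in for records; the particular transform shown is an illustrative assumption.

```python
def inner_join(in0, in1, key0, key1, transform):
    """For each pair (r0, r1) with r0[key0] == r1[key1], emit transform(r0, r1)."""
    out = []
    for r0 in in0:
        for r1 in in1:
            if r0[key0] == r1[key1]:
                out.append(transform(r0, r1))
    return out

in0 = [{"a1": 1, "a3": 10}, {"a1": 2, "a3": 20}]
in1 = [{"b1": 7, "b3": 10}]
result = inner_join(in0, in1, "a3", "b3",
                    lambda r0, r1: {"a1": r0["a1"], "b1": r1["b1"]})
```

Only the first record of in0 has a key match (a3 = b3 = 10), so a single combined record reaches the output.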
The rollup component has an input port labeled (by default) "in" and an output port labeled (by default) "out". The rollup component has a key parameter that specifies the key field(s) the rollup operator uses to group records, and a transform function the rollup operator uses to compute the aggregate result for each group of records. The rollup component applies the rollup operator by dividing the input records into subsets in which each subset has the same key field value, applying the transform function to each subset, and providing the results to the output port. Thus, the rollup component generates one output record for each distinct value that appears in the key field.
The transform function of the rollup component is similar to that of the "reformat" component, except that the rollup component's transform function may include aggregation functions (e.g., sum, min, and max), used in expressions to compute the cumulative sum, minimum, and so on, of values taken from the input records in an aggregated set. For example, if an input record has an in.a1 field, the expression sum(in.a1) in the transform function causes the rollup component to add up the values appearing in in.a1 across all records with the same key field value. If the expression in an entry of the transform function is not an aggregation function, the expression is evaluated against any one input record within each set of records sharing the same key field value.
For example, the rollup component can be configured as follows.
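The configuration listing did not survive in this text. Based on the behavior described next (key field a3, minimum of a1, sum of a2), a plausible reconstruction in the syntax shown earlier is:

```text
operator type: rollup
key: a3
transform:
  out :: rollup(in) =
  begin
    out.a1 :: min(in.a1);
    out.a2 :: sum(in.a2);
    out.a3 :: in.a3;
  end;
```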
This configures the rollup component to divide the data from the input port into aggregated sets having the same value of the key field a3. Within each set, the rollup component takes the minimum value of a1; the value of a3 from any record (in this case the choice of record is immaterial, since a3 is the key field and therefore has the same value in every record of the set); and the sum of the a2 values. One output record is generated for each distinct key field value.
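The rollup semantics just described can be sketched in Python: partition the input records by the key field, then aggregate each partition (minimum of a1, sum of a2, with a3 as the key). Dicts stand in for records; the field names follow the text.

```python
from collections import defaultdict

def rollup(records, key):
    """Group records by the key field, then aggregate each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return [{"a1": min(r["a1"] for r in g),   # minimum of a1 within the set
             "a2": sum(r["a2"] for r in g),   # sum of a2 within the set
             "a3": k}                          # the shared key field value
            for k, g in groups.items()]

records = [{"a1": 5, "a2": 1, "a3": 10},
           {"a1": 3, "a2": 2, "a3": 10},
           {"a1": 9, "a2": 4, "a3": 20}]
result = rollup(records, "a3")   # one output record per distinct a3 value
```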
The GDE 102, the API, or a combination of both may be used to generate data structures corresponding to any dataflow graph element, including components along with their ports, operator types, and parameters; the data formats and parameter values associated with the ports; and the data flows that connect output ports to input ports. Repeated calls to API functions can be used to generate a dataflow graph programmatically, without user interaction through the GDE 102.
The data objects, and the metadata corresponding to entities and the relationships between entities, stored in the repository 104 may be represented by an entity-relationship (ER) diagram. FIG. 2A shows an example of an entity-relationship diagram 200. The entity-relationship diagram 200 illustrates the interrelationships between entities. Entities in an entity-relationship diagram may represent items in a domain, such as medical insurance items, trade items, or accounting items, having independent, unique characteristics. These entities correspond to real-world items from the domain. For example, entity 202 represents a person and includes "attributes" associated with the person, such as the person's name (an attribute called "person_name") and age (an attribute called "person_age"). Similarly, entity 204 represents a city and includes "attributes" associated with the city, such as the city's name (an attribute called "city_name") and its population (an attribute called "city_population"). In some examples, an entity represents a physical object, such as a building or a vehicle. In other examples, an entity represents an event, such as a sale in a mortgage transaction or the expiration of a service agreement term. In a real-world example, entity 202 and entity 204 may each have many attributes.
In some implementations, a reference to an "entity" can be associated with an entity-type "category". The data objects stored in the repository 104 may be viewed as instances of entities associated with a given entity-type category. For example, an entity-type category may be "person", and a particular instance (i.e., a data object associated with that entity-type category) may be a person named "Wade" who is 32 years old. Thus, each of entities 202 and 204 in schema 200 can represent a class of data objects, each object holding the details of a particular instance of entity 202 or entity 204. In schema 200, entity 202 and entity 204 are related by a "lives in" relationship 206, i.e., there is a "lives in" relationship between an object of entity type "person" and an object of entity type "city".
"Relationships" are described in more detail below. Referring to FIG. 2A, in schema 210, entities 212, 214, 216, and 218 are related to each other through relationships 213a-e. In some examples, a relationship between two entities (e.g., entity 212 and entity 218) is established through a "primary key/foreign key" relationship 213a. The "primary key" of entity 212 is an attribute whose value uniquely identifies each instance of entity 212 of a given entity-type category. For example, "employee ID" is the primary key attribute of the "employee" entity 212. When the second entity 218 has an attribute that references the primary key attribute of the first entity 212, there is a "primary key/foreign key" relationship 213a between the first entity 212 and the second entity 218. That attribute of the second entity 218 is referred to as a "foreign key". For example, each instance of a project in the "project" entity 218 is associated with an "employee ID" that serves as a foreign key attribute.
Schema 210 can describe other kinds of relationships 213b-e between entities 212, 214, 216, and 218. In one embodiment, the relationships 213a-d may be represented as lines connecting the entities. For example, the relationships between entity 212 and entity 214, between entity 212 and entity 218, between entity 212 and entity 216, and between entity 212 and itself can be represented as shown in schema 210, and include the following three basic types of "connectivity" between entities 212, 214, 216, and 218: one-to-one relationships, one-to-many relationships, and many-to-many relationships.
In one embodiment, a one-to-one connectivity relationship 213b exists when at most one data object in, for example, the "employee" entity 212 is related to one data object in, for example, the "office" entity 214. The "employee" entity 212 represents the employees of a company, i.e., each data object in entity 212 represents an employee. The "office" entity 214 represents the offices in a building, i.e., each data object in entity 214 represents an office. If each employee is assigned their own office, the corresponding data objects have a one-to-one foreign key relationship. One-to-one connectivity is depicted as a plain line in schema 210.
In one embodiment, a one-to-many connectivity relationship 213d exists when one data object in, for example, the "department" entity 216 has zero, one, or more related data objects in, for example, the "employee" entity 212, while each data object in the "employee" entity 212 has exactly one related data object in the "department" entity 216. As described above, the "employee" entity 212 represents the employees of a company. The "department" entity 216 represents the departments of the company, i.e., each data object in entity 216 represents a department. Each employee is associated with one department, and each department is associated with multiple employees. Thus, the "employee" entity 212 and the "department" entity 216 have a one-to-many foreign key relationship. One-to-many connectivity is depicted in schema 210 as a line ending in a crow's foot.
In one embodiment, a many-to-many connectivity relationship 213e exists when one data object in, for example, the "employee" entity 212 has zero, one, or more related data objects in, for example, the "project" entity 218, and one data object in the "project" entity 218 has zero, one, or more related data objects in the "employee" entity 212. For this example, assume that an employee can be assigned to any number of projects at the same time, and that a project (i.e., a data object in the "project" entity 218) can have any number of employees assigned to it. Thus, the "employee" entity 212 and the "project" entity 218 have a many-to-many foreign key relationship between them. Many-to-many connectivity is depicted in schema 210 as a line that begins and ends with a crow's foot.
In some examples, there is a relationship 213c between data objects of the same entity 212. For example, there is a one-to-many relationship between data objects in the "employee" entity 212 and other data objects in the same entity: one employee has a "supervises" relationship with other employees, represented as a one-to-many foreign key relationship.
In some examples, entities 212, 214, 216, and 218 may be represented in tabular form, with the data object instances of each entity represented as records in a table. In some embodiments, two or more tables may be combined, or "joined", in a predetermined manner. For example, an inner join of two entity tables requires a matching record in each of the two tables. The inner join combines records from the two tables based on a given join predicate (i.e., the join key), which specifies the join condition. For example, the join key may specify that a first attribute value of a record in the first entity table equal a second attribute value of a record in the second entity table. An outer join, in contrast, does not require that each record in the two joined entity tables have a matching record; each record is retained in the joined table even when it has no match in the other table.
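The difference between the two join kinds can be sketched in Python over lists of dictionaries standing in for entity tables; the table and attribute names (emp_id, dept_id, dept_name) are invented for illustration:

```python
def inner_join(left, right, key):
    """Keep only pairs of records whose key values match."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def left_outer_join(left, right, key):
    """Keep every left record, even when it has no match on the right."""
    right_cols = set().union(*(r.keys() for r in right)) - {key} if right else set()
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            # no match: retain the record, filling right-side attributes with None
            out.append({**l, **{c: None for c in right_cols}})
    return out

employees = [{"emp_id": 1, "dept_id": 10}, {"emp_id": 2, "dept_id": 99}]
departments = [{"dept_id": 10, "dept_name": "Sales"}]
inner = inner_join(employees, departments, "dept_id")        # drops employee 2
outer = left_outer_join(employees, departments, "dept_id")   # keeps employee 2
```
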
The attributes of entities 212, 214, 216, and 218 are characterized by the granularity of their data. Data granularity refers to the degree to which a data field is subdivided. For example, at a coarser level of granularity, an entity representing a person's account may include a single attribute holding information about the person's address; at a finer level of granularity, the entity may represent the same address information with multiple attributes (e.g., house number, street name, city or town name, and country name).
As another example, a coarser-grained attribute in an entity may be household information representing many users within a household, while a finer-grained attribute may be a line item in a purchase order: a product identifier (or other product-level information), as the coarser-grained attribute, may have multiple related line items.
In some cases, it is desirable to obtain more detailed information from coarser-grained entities. Similarly, it may be desirable to present the information contained in finer-grained entities in a summarized, less complex form. To this end, a source entity may be defined as having a predetermined level of granularity, and a source entity in one or more source schemas may be mapped to a target entity in one or more target schemas, where a preconfigured expression relates the data at the predetermined granularity to result attributes in the target entity. For example, source data at a finer level of granularity is aggregated (e.g., by an expression that includes an aggregation operator) to produce an attribute in a target entity.
In some examples, a rule generated by a user based on data from a dataset or dataflow graph needs to reference auxiliary information contained in an external database or external file. Such auxiliary information may exist at a coarser or finer level of granularity, and is processed using the preconfigured expressions to produce the attribute form desired by the application.
In one example scenario, a source entity may be mapped to a target entity using preconfigured expressions as follows. Referring to FIG. 3, a source schema 300 includes three source entities 302, 304, and 306 (entity A, entity B, and entity C), which include the attributes a1, ab, ac; b1, ba; and c1, ca, respectively. For the purposes of this specification, the convention adopted is that the first letter of an attribute matches the name of the entity to which it belongs; for example, each attribute of entity A begins with the letter "a". Also, the second character of an attribute is set to "1" to indicate that the attribute is a primary key attribute. The relationships between entities are as follows: n:1 (many-to-one) from entity A to entity B, and 1:n (one-to-many) from entity A to entity C.
The join key associated with each relationship is displayed on that relationship, e.g., the join of entity A and entity B is displayed on the relationship as ba/ab, and the join of entity A and entity C as ca/ac. A further convention adopted in this specification is that the second letter of an attribute used as a join key matches the name of the partner entity.
As shown, entity A (e.g., an intermediate/source entity) is mapped 1:1 (one-to-one) to entity 308, entity D (e.g., a target entity). The mapping is represented by an arrow pointing from entity A to entity D.
The mapping from entity A to entity D is defined by individual preconfigured logical expressions over the entity attributes. In the first part of the mapping, attribute d1 is assigned the value of attribute a1; in some examples, the value of attribute a1 may be copied directly to attribute d1.
In the second part of the mapping, attribute d2 is assigned the value of attribute b1. However, b1 is not in the source entity A but in entity B; therefore, entity A and entity B must be joined to obtain attribute b1 when computing the value of attribute d2.
In the third part of the mapping, attribute d3 is assigned the value of the aggregation expression "sum" applied to attribute c1. Thus, before entity C is joined with entity A (to obtain the value of sum(c1) when calculating attribute d3), the aggregation of attribute c1 must first be performed on entity C.
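A minimal sketch of this three-part mapping follows, with Python dictionaries standing in for entity tables; the record contents are invented, and the attribute names follow the convention of FIG. 3:

```python
from collections import defaultdict

def map_to_d(a_records, b_records, c_records):
    """Compute target entity D: d1 = a1 (direct copy), d2 = b1 (via an n:1
    join on ab/ba), d3 = sum(c1) (rollup of C on ca, then join on ac)."""
    b_by_key = {b["ba"]: b for b in b_records}  # n:1 join from A to B
    c1_sums = defaultdict(int)                  # rollup of C on key ca
    for c in c_records:
        c1_sums[c["ca"]] += c["c1"]
    return [{"d1": a["a1"],
             "d2": b_by_key[a["ab"]]["b1"],
             "d3": c1_sums[a["ac"]]}
            for a in a_records]

A = [{"a1": 1, "ab": "b-key", "ac": "c-key"}]
B = [{"b1": 42, "ba": "b-key"}]
C = [{"c1": 10, "ca": "c-key"}, {"c1": 5, "ca": "c-key"}]
D = map_to_d(A, B, C)
```
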
In one embodiment, the joins and rollups described above define a path from the source attributes to the target attributes. Thus, joins and rollups are used to map an attribute in the target entity to the entities storing the source attributes (e.g., attributes a1, b1, and c1) based on a path through a root entity (e.g., entity A in the example above). In some examples, the paths from the root entity to the entities containing the source attributes define a schema in the form of a tree structure.
In some examples, a "simple" mapping from a first entity to a second entity consists of a "tree"-type source schema having one designated root entity and any number of other entities (related to the root entity or to each other), together with one or more target entities (not part of the source schema, but possibly forming a target schema). For example, in FIG. 3, the source schema 300 includes entities 302, 304, and 306; the target entity 308 is not part of the source schema 300.
Referring to FIG. 4, a process 400 is shown for transforming a source schema and mapping information into a program specification (e.g., a query language expression, a dataflow graph, or a computer program) for computing the attributes of mapped target entities. The source schema and mapping information include a mapping (in the form of expressions) from one or more source entities in the source schema to a target entity; the expressions define attributes of the target entity in terms of attributes of the source entities through various operators or functions, some elements of which may correspond to procedures in the program specification. In some examples, the initial information includes a source schema having a directed acyclic graph structure (classified as having a tree structure as shown in FIG. 4; since the relationship links in the graph are non-directional, any node can be considered the root of the tree) and a set of mapping rules detailing how the attributes of one or more target entities of the target schema are generated (step 402). The mapping rules and source schema may be used to generate a set of expression nodes (step 404). The expression nodes (described in detail below) are represented in an intermediate data structure. Each attribute in the target entity may be defined by a mapping rule, which defines an expression constructed from entity attributes in the source schema along with a set of arbitrary scalar operators (e.g., +, -, and function calls) and aggregation operators (e.g., sum, min, and max). Any language may be used for the expressions of the mapping rules. For example, expressions may be written in a Data Manipulation Language (DML) by a user and provided as user input. Various input interfaces may be used to receive the mapping rules, including, for example, a business rule interface (such as the interface described in U.S. Patent No. 8,069,129, which is incorporated herein by reference). The initial information also includes metadata describing the physical representation of the data corresponding to the entities (e.g., dataset filenames and record formats).
In some implementations, a mapping module (e.g., executing in the operating environment 106) translates a schema-to-schema mapping, given by mapping rules over attributes of the source schema, into a set of expressions (e.g., relational-algebra expressions organized as expression nodes). In some examples, any known method for storing data structures may be used to store the expression nodes in the form of one or more "query plans", where each query plan is a set of expression nodes connected by directed links, usable to generate a specified value for a query, and having a directed acyclic graph structure. The directed links in the query plan represent dependencies (e.g., the identifier of one expression node referenced by another expression node), described in more detail below. The relational expressions represented by the expression nodes of a query plan can be defined in terms of abstract relational operators (e.g., "join" or "rollup") having various implementations as concrete algorithms. A separate query plan is generated for each output attribute of the target entities in the target schema. An output attribute defined by a mapping expression that does not require data from more than one entity in the source schema corresponds to a query plan having only one expression node. An output attribute defined by a mapping expression requiring data from multiple entities in the source schema corresponds to a query plan having a tree topology, with the value of the output attribute represented by the root expression node of the tree. From the query plans, the mapping module can generate a dataflow graph or another form of program specification that specifies some or all of the procedures that will be used to compute the attribute values of the target schema.
In some examples, query plans may be merged to combine one or more expressions, resulting in a new query plan having a new node structure (step 406). The merged query plan may have a more complex organization of intermediate data structures; for example, the merged set of expression nodes may no longer be in the form of a tree (e.g., there are multiple tree roots, or links from one node merge again at another node), although the graph is still acyclic. The set of expression nodes resulting from the merging may be processed as described below to generate a corresponding program specification (step 408). In this manner, the initial source schema and the mapping to the target schema may be converted into a dataflow graph, as described in more detail below. In some embodiments, additional techniques may be used to generate dataflow graphs from query plans (as described in U.S. Published Application No. 2011/0179014, which is incorporated herein by reference).
Referring to FIG. 5, a source schema 500 is represented as an entity relationship diagram that includes entity A, entity B, entity C, entity D, entity E, entity F, and entity G. In this example, the target schema includes one entity, Z, as the target entity. As described above, in one embodiment, the graph formed by the entities and relationships in the source schema is a tree with one root entity (entity A). Each relationship between two entities is associated with a join key on each side, e.g., join key "ab" on the entity A side and join key "ba" on the entity B side.
The expressions in the mapping rules (defining the attributes of the target entity Z) can be converted into a parse tree, in which the leaves of the parse tree are attributes and the internal nodes of the parse tree are scalar and aggregation operators. In some examples, an expression may be a "raw" expression, having a parse tree that does not include any aggregation operators.
In some examples, the expressions in the mapping rules are represented using relational algebra. The relational algebra includes relations, such as tables, and transformations. In some examples, a relation may be a table having rows and columns (e.g., entities, joins, or rollups, each represented in tabular form). A transformation specifies the columns of an output relation.
Using the conventions described above, in one embodiment, the transformation notation is given by:
<transformation> ::= <element> [, <element>]...
<element> ::= <identifier> | <identifier>=<expression>
The following exemplary conversion expressions translate into the language as shown:
In one embodiment, the two-way join syntax is as follows:
join(<rel1>/<k12>, <rel2>/<k21>)
where rel1 and rel2 are two relations and k12 and k21 are the join keys. For a multi-way join, the first relation is the "central" relation; the second and subsequent relations are joined to the central relation. The syntax is as follows:
join(<rel1>/<k12>/<k13>..., <rel2>/<k21>, <rel3>/<k31>...)
As an illustration, the three-way join of entity A, entity B, and entity D is written as follows:
join(A/ab/ad,B/ba,D/da)
In some embodiments, the above expression is equivalent to either of the following two cascades of two-way joins:
join(join(A/ad, D/da)/ab, B/ba)
join(join(A/ab, B/ba)/ad, D/da)
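The equivalence between the three-way join and a cascade of two-way joins can be checked with a small Python sketch in which join(rel/key, ...) is emulated over lists of dictionaries; the record contents are invented:

```python
def join2(rel1, k12, rel2, k21):
    """Two-way join: match rel1.k12 against rel2.k21."""
    return [{**r1, **r2} for r1 in rel1 for r2 in rel2 if r1[k12] == r2[k21]]

def join3(rel1, k12, k13, rel2, k21, rel3, k31):
    """Three-way join with rel1 as the central relation."""
    return [{**r1, **r2, **r3}
            for r1 in rel1
            for r2 in rel2
            for r3 in rel3
            if r1[k12] == r2[k21] and r1[k13] == r3[k31]]

A = [{"a1": 1, "ab": "b", "ad": "d"}]
B = [{"b1": 2, "ba": "b"}]
D = [{"d1": 3, "da": "d"}]

three_way = join3(A, "ab", "ad", B, "ba", D, "da")        # join(A/ab/ad, B/ba, D/da)
cascade = join2(join2(A, "ad", D, "da"), "ab", B, "ba")   # A joined to D first, then B
# both yield the same joined record
```
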
The rollup syntax is as follows:
rollup(<rel>/<key>)
As an illustration, a rollup of D on attribute da is written as follows:
rollup(D/da)
in some examples, the expressions in the mapping rule are represented by expression nodes. In one embodiment, the expression node comprises: (1) an identifier that allows the expression node to be referenced by other expression nodes; (2) an expression identifying an expression corresponding to the node; (3) a context entity that identifies a context in which an expression is evaluated and a join/summarization in which a value resulting from evaluating the expression participates; (4) a transformation specifying a column in an output relationship; (5) a relational expression (also called relational attributes of the nodes) with relationships and transformations as described above; and (6) a set of child nodes corresponding to the node input relationships.
An example of a query plan including expression nodes corresponding to an expression in a mapping rule is now provided. Referring to FIG. 6, a query plan 600 includes a collection of exemplary expression nodes 602, 604, 606, and 608 corresponding to the expression "sum(d1)" of a defined mapping rule. The expression "sum(d1)" specified in node #1 (i.e., node 602) can be evaluated by performing a rollup (specified by node #2, i.e., node 604) of relation D (specified by node #3, i.e., node 608) on the key "da". As specified in node #1, the rollup result specified in node #2 is joined with relation A (specified by node #4, i.e., node 606) using the join keys da and ad.
The expression nodes of the query plan 600 in FIG. 6 may be represented in the tabular form shown below in Table 1. Indentation in the "expression" column represents a parent/child relationship between nodes. The "-" symbol in the "context entity" field of node #2 is explained below.
TABLE 1
In one example, the expression sum(c1)*a1*b1 may be evaluated by constructing expression nodes as follows. An aggregation expression may be compiled into an expression node tree (see, e.g., FIG. 6, corresponding to the expression "sum(d1)"). When an expression node of an aggregation expression is created, the evaluation context of the node is not known until the node is subsequently processed; accordingly, all columns other than the "expression" column are left unpopulated. Table 2, shown below, represents the initial expression node:
TABLE 2
A separate expression node is created for each aggregation operator in the expression; the node for "sum(c1)" is shown in Table 3 below.
ID | Expression | Context entity | Conversion | Relational expression
#1 | sum(c1) | | |
TABLE 3
As shown in Table 4, the aggregated expression is assigned to a child node, which includes the attributes of the expression being aggregated. In one embodiment, for example, if the aggregation expression is "sum(c1, c2 != 0)", the child node includes (c1, c2).
ID | Expression | Context entity | Conversion | Relational expression
#2 | sum(c1) | | |
#3 | c1 | | |
TABLE 4
Referring to Table 4, the #3 child node includes only the expression c1 and no aggregation operation; therefore, c1 is a raw expression, and the algorithm for compiling raw expressions into an expression node tree applies to it. In one embodiment, the algorithm starts by determining an "aggregation set" for each entity in the expression. Informally, an entity's aggregation set indicates the rollups needed to bring the given attributes up to the level of the schema's root entity.
Referring to FIG. 5, an example of an aggregate set of entities in source schema 500 is as follows:
entity a { } entity a is the root entity.
Entity B { } the relationship of entity B and entity a is a one-to-many relationship.
Entity C { C } the relationship of entity C and entity B is a many-to-one relationship.
Entity D { D } the relationship of entity D and entity A is a many-to-one relationship.
Entity E { E } the relationship of entity E and entity A is a many-to-one relationship.
Entity F { E } the relationship of entity F and entity E is a one-to-many relationship.
Entity G { E, G } the relationship of entity G and entity E is a many-to-one relationship.
Given an expression, its Maximum Aggregation Set (MAS) is the largest aggregation set of any entity mentioned in the expression. In some examples, the MAS is unique. For example, consider the expression b1+e1+g1. The expression references entity B, entity E, and entity G, whose aggregation sets are {}, {E}, and {E, G}, respectively. Therefore, the MAS of this expression is {E, G}.
In some examples, a non-unique MAS is not allowed. For example, the expression c1*e1 mentions entity C and entity E, whose aggregation sets are {C} and {E}, respectively; in this case, the expression c1*e1 has no unique MAS. The expression e1*f1 mentions entity E and entity F, whose aggregation sets are {E} and {E}, respectively; in this case, the MAS {E} is unique even though multiple entities have the same aggregation set.
The computational context of an expression is the entity, among those whose aggregation set equals the MAS, that is closest to the root entity of the schema. Typically, the context in which a top-level expression is computed is the root entity. For example, consider the expression a1+b1. Its MAS is {}. The aggregation sets of entity A and entity B are both {}; entity A, however, is closest to the root entity and therefore serves as the computational context. As another example, consider the expression e1+g1. Its MAS is {E, G}; thus, the computational context is G.
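The aggregation-set and context computations above can be sketched as follows. The SCHEMA table encodes the FIG. 5 tree as parent links, with a flag marking hops that multiply records (this encoding, and the helper names, are assumptions for illustration):

```python
# Each entity maps to (parent, adds_self), where adds_self is True when the
# hop from the parent is one-to-many (records multiply, so a rollup is needed).
SCHEMA = {
    "A": (None, False),  # root entity
    "B": ("A", False),
    "C": ("B", True),
    "D": ("A", True),
    "E": ("A", True),
    "F": ("E", False),
    "G": ("E", True),
}

def aggregation_set(entity):
    """Collect the entities that multiply records on the path to the root."""
    out = set()
    while entity is not None:
        parent, adds_self = SCHEMA[entity]
        if adds_self:
            out.add(entity)
        entity = parent
    return out

def depth(entity):
    d = 0
    while SCHEMA[entity][0] is not None:
        entity = SCHEMA[entity][0]
        d += 1
    return d

def context(entities):
    """Compute the MAS of an expression's entities and pick as the context
    the entity whose aggregation set equals the MAS and is closest to the root."""
    sets = {e: aggregation_set(e) for e in entities}
    mas = max(sets.values(), key=len)
    if not all(s <= mas for s in sets.values()):
        raise ValueError("no unique maximum aggregation set")
    candidates = [e for e, s in sets.items() if s == mas]
    return min(candidates, key=depth)
```
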
Upon selection of a computational context for an expression node, the node is populated with information about the selected context. Referring again to Table 4, the computational context of c1 is C. The updated expression nodes are shown in Table 5 below.
ID | Expression | Context entity | Conversion | Relational expression
#2 | sum(c1) | | |
#3 | c1 | C | |
TABLE 5
In one embodiment, an "aggregation relationship" of the expression node is computed. The aggregation relationship of an aggregation is the attribute that connects the computational context toward the root entity. In the current example, for the expression c1, the computational context is C and the aggregation relationship is cb. For the expression e1*g1, whose computational context is G, the aggregation relationship is ge. For the expression f1, the computational context is E and the aggregation relationship is ea.
To roll the intermediate results up along the aggregation relationship computed above, the key of the aggregation relationship is added to the expression node. As mentioned above, the aggregation relationship of c1 is cb. The expression node is therefore updated as shown in Table 6 below.
ID | Expression | Context entity | Conversion | Relational expression
#2 | sum(c1) | | |
#3 | c1,cb | C | |
TABLE 6
At this point, the child node (i.e., node #3) may be compiled. If the child node includes nested aggregations, the nested-aggregation compilation algorithm is invoked recursively, as described below. The expression nodes are updated as shown in Table 7 below.
ID | Expression | Context entity | Conversion | Relational expression
#2 | sum(c1) | | |
#3 | c1,cb | C | c1,cb | C
TABLE 7
Based on the results of the calculations described above in relation to tables 1-8, the expression nodes may be populated as shown below in table 8.
ID | Expression | Context entity | Conversion | Relational expression
#2 | sum(c1) | C- | t1=sum(c1),cb | rollup(#3/cb)
#3 | c1,cb | C | c1,cb | C
TABLE 8
The context entity of node #2 is the computational context of the expression, followed by a "-" symbol. The "-" symbol indicates that one aggregation level has been removed: the aggregation set of C is {C}, while the aggregation set of C- is {}. Without this adjustment, the MAS of b1+sum(c1) would be {C}, and the computational context of that expression would be C instead of B.
The relational expression is a rollup of the child node on the outgoing key (cb) of the rolled-up relation. The conversion assigns the aggregate value to a temporary variable t1 (temporary variables are introduced in these examples using the character "t" followed by a distinguishing number) and passes the rollup key (cb) through.
The original node (i.e., the #1 node in Table 2) may now be compiled as follows. The aggregation node is replaced by the variable t1, so the original expression sum(c1)*a1*b1 is treated as t1*a1*b1. These terms are associated with the context entities C-, A, and B, respectively. The aggregation set of each of these entities is {}, so the MAS is {} and the computational context is A. The updated table is shown in Table 9 below.
ID | Expression | Context entity | Conversion | Relational expression
#1 | sum(c1)*a1*b1 | A | |
TABLE 9
The paths to the two attributes sum(c1) and b1 cross the relationship ba. Therefore, a child node is created for B and a join is inserted, as shown in Table 10 below.
ID | Expression | Context entity | Conversion | Relational expression
#1 | sum(c1)*a1*b1 | A | t1*a1*b1 | join(#4/ab,#5/ba)
#4 | a1 | A | a1,ab | A
#5 | (sum(c1),b1) | B | |
TABLE 10
The computational context of the expression referenced by the child node, (sum(c1), b1), is B, and its attributes come from B and C. Thus, a node is created for b1, and the expression nodes are updated as shown in Table 11 below.
ID | Expression | Context entity | Conversion | Relational expression
#1 | sum(c1)*a1*b1 | A | t1*a1*b1 | join(#4/ab,#5/ba)
#4 | a1 | A | a1,ab | A
#5 | (sum(c1),b1) | B | t1,b1,ba | join(#6/bc,#2/cb)
#6 | b1 | B | b1 | B
#2 | sum(c1) | C- | t1=sum(c1) | rollup(#3/cb)
#3 | c1 | C | c1 | cb
TABLE 11
As briefly described above, aggregations can be nested explicitly or implicitly. Typically, the depth of nesting matches the cardinality of the aggregation set of the value being aggregated. For example, g1 has the aggregation set {E, G}. Thus, an aggregation involving g1 has two aggregation levels, e.g., max(sum(g1)).
In some examples, a nested-aggregation compilation algorithm can convert non-nested aggregations to nested aggregations at compile time. For example, sum(g1) becomes sum(sum(g1)), and count(g1) becomes sum(count(g1)). In this scheme, in one embodiment, directly specifying a nested aggregation is a rule violation.
In some examples, nested aggregations may be specified explicitly. In this scheme, aggregations such as max(sum(g1)) and sum(sum(g1)*e1) may be specified.
In some examples, a nested-aggregation compilation algorithm converts non-nested aggregations to nested aggregations at compile time and passes aggregations that are already nested through unchanged. For example, sum(g1) is converted to sum(sum(g1)) at compile time, while max(sum(g1)) passes through the nested-aggregation compilation algorithm unchanged.
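The two aggregation levels of max(sum(g1)) can be checked against invented sample data. The join keys ge, eg, and ea follow the naming convention of the example; the record contents are made up for illustration:

```python
from collections import defaultdict

def max_sum_g1(e_records, g_records):
    """Inner level: roll up G on key ge, computing t2 = sum(g1) per E record.
    Outer level: group E on key ea and compute t1 = max(t2) per A-side key."""
    t2 = defaultdict(int)
    for g in g_records:
        t2[g["ge"]] += g["g1"]          # sum(g1) within each E record's group
    by_a = defaultdict(list)
    for e in e_records:
        by_a[e["ea"]].append(t2[e["eg"]])
    return {a_key: max(vals) for a_key, vals in by_a.items()}

E = [{"ea": "a", "eg": "e1"}, {"ea": "a", "eg": "e2"}]
G = [{"g1": 1, "ge": "e1"}, {"g1": 2, "ge": "e1"}, {"g1": 10, "ge": "e2"}]
result = max_sum_g1(E, G)  # one t1 value per A-side key
```
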
Consider the example of the nested aggregation expression max(sum(g1)). Tables 12-15 below may be constructed using the expression node techniques described above.
ID | Expression | Context entity | Conversion | Relational expression
#1 | max(sum(g1)) | A | t1 | join(#2/ae,#3/ea)
TABLE 12
As previously described, separate nodes are created for the aggregation:
ID | Expression | Context entity | Conversion | Relational expression
#3 | max(sum(g1)) | | |
#4 | sum(g1) | | |
TABLE 13
Second-level nodes are created for the inner aggregation:
ID | Expression | Context entity | Conversion | Relational expression
#6 | sum(g1) | | |
#7 | g1 | | |
TABLE 14
The #7 node and the #6 node can be compiled directly.
ID | Expression | Context entity | Conversion | Relational expression
#6 | sum(g1) | G- | t2=sum(g1),ge | rollup(#7/ge)
#7 | g1 | G | g1,ge | G
TABLE 15
From tables 12-15, node #3 can be compiled:
ID | Expression | Context entity | Conversion | Relational expression
#3 | max(sum(g1)) | E- | t1=max(t2),ea | rollup(#4/ea)
#4 | sum(g1) | E | t2,ea | join(#5/eg,#6/ge)
#5 | (empty) | E | ea,eg | E
#6 | sum(g1) | G- | t2=sum(g1),ge | rollup(#7/ge)
#7 | g1 | G | g1,ge | G
TABLE 16
Finally, from Table 16, node #1 can be compiled:
identifier (ID) Expression formula Context entity Conversion Relational expression
#1 max(sum(g1)) A t1 join(#2/ae,#3/ea)
#2 Air conditioner A ae A
#3 max(sum(g1)) E- t1=max(t2),ea rollup(#4/ea)
#4 sum(g1) E t2,ea join(#5/eg,#6/ge)
#5 Air conditioner E ea,eg E
#6 sum(g1) G- t2=sum(g1),ge rollup(#7/ge)
#7 g1 G g1,ge G
Table 17
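The rows of the expression-node tables above can be modeled as simple records; a minimal sketch, where the field names and class are illustrative rather than from the patent:

```python
from dataclasses import dataclass

@dataclass
class ExprNode:
    """One row of an expression-node table (Tables 12-17)."""
    node_id: str          # identifier (ID)
    expression: str       # expression formula
    context_entity: str   # context entity
    conversion: str       # conversion (transform)
    relation: str         # relational expression

# A few rows of the final compiled plan (Table 17), transcribed as data.
plan = [
    ExprNode("#1", "max(sum(g1))", "A", "t1", "join(#2/ae,#3/ea)"),
    ExprNode("#3", "max(sum(g1))", "E-", "t1=max(t2),ea", "rollup(#4/ea)"),
    ExprNode("#6", "sum(g1)", "G-", "t2=sum(g1),ge", "rollup(#7/ge)"),
    ExprNode("#7", "g1", "G", "g1,ge", "G"),
]
by_id = {n.node_id: n for n in plan}
assert by_id["#3"].relation == "rollup(#4/ea)"
```

Indexing the plan by identifier, as `by_id` does, is what later steps (merging, reference rewriting) need.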
In some examples, more than one entity may be mapped simultaneously. Referring to FIG. 7, a target schema is provided that includes two target entities (X and Z). The expressions in the mapping rules for entity X may be interpreted with entity E as the root entity, and the expressions in the mapping rules for entity Z may be interpreted with entity A as the root entity. In some examples, the mapping rules include attribute mappings, e.g., ae maps to zx and ea maps to x7.
In some examples, the mapping rules include relational operations such as selection and aggregation. Referring to FIG. 8, mapping rules are shown for the same source schema and target schema as in FIG. 7, including a selection (e.g., a1>0) and a rollup (e.g., rollup(e1)). When the mapping rules are compiled together, entity E and a virtual entity E' are in a many-to-one relationship, and the virtual entity E' can be treated as a hypothetical root.
In some examples, the source schema may not have the form of a tree. For example, the source schema may be cyclic. Referring to FIG. 9A, a cyclic schema is shown in which both the "customer" entity and the "store" entity reference the same "zip code" entity. The cycle can be broken by an aliasing algorithm. For example, as shown in FIG. 9B, the aliasing algorithm generates two alias copies of the relationship, one between "zip code" and "customer" and one between "zip code" and "store". In this way, the cycle is eliminated. The schema in FIG. 9B enables the end user to specify attributes in the "zip code" entity more precisely (e.g., the state of a customer versus the state of a store).
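The aliasing idea can be sketched as follows; the edge representation and the alias naming scheme are assumptions made for illustration:

```python
# Sketch: when several entities reference the same shared entity
# (creating a non-tree schema), give each referrer its own alias copy.
def break_shared_entity(relations):
    """relations: list of (source_entity, target_entity) edges."""
    referrers = {}
    for src, tgt in relations:
        referrers.setdefault(tgt, []).append(src)
    out = []
    for src, tgt in relations:
        if len(referrers[tgt]) > 1:
            out.append((src, f"{tgt}@{src}"))   # one alias copy per referrer
        else:
            out.append((src, tgt))
    return out

aliased = break_shared_entity([("customer", "zipcode"), ("store", "zipcode")])
assert aliased == [("customer", "zipcode@customer"),
                   ("store", "zipcode@store")]
```

After aliasing, "zipcode@customer" and "zipcode@store" are distinct nodes, so an attribute such as state is unambiguously the customer's or the store's.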
Once the expressions are converted to a query plan, the query plan may be converted to a dataflow graph. In one embodiment, the step of generating the dataflow graph includes: 1) merging expressions to avoid redundant joins, redundant aggregations, or redundant scans of data corresponding to the same entity; 2) optionally optimizing the set of relational operators; and 3) transcribing the merged or optimized results into a dataflow graph.
After all expressions in the system are compiled into the query plan, the accumulated expressions can be merged by eliminating common subexpressions across the relational expression fields of all expression nodes. An example of such merging follows. Two nodes are provided in Table 18:
identifier (ID) Expression formula Context entity Conversion Relational expression
#1 a1 A a1 A
#2 a2 A a2 A
Table 18
The two nodes in Table 18 may be merged into node #3, as shown in Table 19 below.
Identifier (ID) Expression formula Context entity Conversion Relational expression
#3 * A a1,a2 A
Table 19
The new expression node has the same relational expression, and its transformation field is the combined list of the transformations of the merged nodes. The merged query plan may be transformed into one or more dataflow graphs by a mapping module running in the parallel operating environment 106. For example, appropriate graph components may be inserted, transformations may be added to these components according to the expression nodes, and DML code may be generated for intermediate results. A graph component of a particular type is used to perform the operation of an expression node of a particular type. An expression node that scans entity data may be implemented using a component that scans the records of a data set (e.g., a file, or a table in a database) corresponding to a particular entity. The component accesses the records by, for example, reading the file in which they are stored or by querying a table in a database. The attributes of the entity specified by the expression node correspond to fields of the records. An aggregation expression node may be implemented by a component that performs a rollup operation. A join expression node may be implemented by a component that performs a join operation.
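The merge illustrated by Tables 18 and 19 can be sketched as follows. This is a simplified, hypothetical rendering (the kept identifier and the tuple layout are assumptions), not the patent's algorithm:

```python
def merge_nodes(nodes):
    """Merge expression nodes that share a context entity and relational
    expression; the transform field becomes a combined list.

    nodes: iterable of (node_id, context_entity, transforms, relation).
    """
    merged = {}
    for node_id, context, transforms, relation in nodes:
        key = (context, relation)
        if key in merged:
            kept_id, kept_transforms = merged[key]
            kept_transforms.extend(t for t in transforms
                                   if t not in kept_transforms)
        else:
            merged[key] = (node_id, list(transforms))
    return {kept_id: (ctx, ts, rel)
            for (ctx, rel), (kept_id, ts) in merged.items()}

# The two nodes of Table 18: same entity A, transforms a1 and a2.
result = merge_nodes([("#1", "A", ["a1"], "A"),
                      ("#2", "A", ["a2"], "A")])
assert result == {"#1": ("A", ["a1", "a2"], "A")}
```

This sketch keeps the first node's identifier; the text instead shows a fresh identifier (#3) in Table 19 and, later, reuse of one of the previous identifiers. Either convention works as long as references are rewritten consistently.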
The following is an example of generating query plans for different mapping rules, merging the query plans, and generating a dataflow graph from the merged query plan. In this example, two mapping rules are received, each consisting of a single expression that defines a single output attribute (x0 and x1, respectively) in terms of attributes (a1, b1, and e1) of entities in the source schema.
x0=a1*b1
x1=sum(b1*e1)
Each of these expressions corresponds to an expression tree of expression nodes. The expression tree for the input expression for attribute value x0 is as follows:
the expression tree for the input expression for attribute value x1 is as follows:
the expression nodes from the two trees are combined (as described below), and an additional expression node (labeled with an "END" label) is added to represent the result after the mapping rules are combined. The additional expression node corresponds to connecting the two top level expressions (#1 and #4) together on the primary key of A (e.g., aa).
There are three instances of a relational expression that identifies entity A, which the mapping module merges, updating references to the node identifiers. The #2, #5, and #10 expression nodes are deleted and replaced with a new merged node that takes one of the previous identifiers (in this example, #2). At each node merge, the new node has the same context and relational expression, and its conversion value is the combined list of the conversion values of the merged nodes.
Since there are two relational expressions listing entity B, expression nodes #3 and #11 are merged. In this case, the expression nodes are identical (except for the hierarchical level of their original trees), so node #11 is deleted and node #3 represents the new merged node.
After the node identifier references are updated, two nodes list the relational expression join(#2/ab,#3/ba), so the mapping module merges nodes #1 and #9.
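Rewriting identifier references after a merge can be sketched as a textual substitution over the relational expression field; the `#N` syntax follows the tables above, while the mapping itself is hypothetical:

```python
import re

def rewrite_refs(relation, id_map):
    """Rewrite node identifiers inside a relational expression after a
    merge, e.g. '#10' -> '#2' (sketch; real plans would use tree nodes)."""
    return re.sub(r"#\d+",
                  lambda m: id_map.get(m.group(0), m.group(0)),
                  relation)

# After #5, #10 merge into #2 and #11 merges into #3:
id_map = {"#5": "#2", "#10": "#2", "#11": "#3"}
assert rewrite_refs("join(#10/ab,#11/ba)", id_map) == "join(#2/ab,#3/ba)"
```

Once references are rewritten, duplicate relational expressions such as join(#2/ab,#3/ba) become textually identical, which is what lets nodes #1 and #9 merge.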
The mapping module may generate a dataflow graph from the merged expression nodes (merged query plans) as follows. For each expression node in the query plan, the mapping module may generate a component configured to perform expression node specific operations. The configuration of these components may include any of: selecting a component type, generating a transformation function, selecting key fields for joins or summaries, and generating an output format. For example, the output format includes a record structure with corresponding fields for each attribute generated by the node.
Dependency connections between expression nodes in the query plan correspond to data flow connections in the dataflow graph. That is, a component receives its inputs from the components corresponding to the children of its expression node. For example, if the relational attribute of expression node E0 references expression node E1 as an input, the mapping module generates a data stream from the output port of the component corresponding to E1 to the appropriate input port of the component corresponding to E0. Where multiple expression nodes use output from the same child node, a replication component (or equivalent) may be inserted into the dataflow graph, with copies of the data from the child node provided on multiple outputs. Each output of the replication component is then used as an input to one of the multiple expression nodes. The output of the END expression node in the above example would be connected by a data stream to a component that stores the results of the computation in an accessible location (e.g., in a file stored in the repository 104 or external data store 112).
An example procedure for generating a dataflow graph corresponding to an expression-node query plan has three distinct phases: component generation, data stream generation, and data format generation. However, in other examples, the steps involved in these three phases may be interleaved as needed to produce equivalent results. The procedure uses working memory to store correspondences between expression nodes and the input and output ports of components in the dataflow graph being built, and to store the data formats associated with those ports. The information stored in the working memory may be organized in any of a number of ways. For example, a record set having the following attributes may be stored.
Expression node ID: an identifier of the expression node.
Output port: the output port corresponding to the value produced by the identified expression node (if any). The output port may be identified by a string of the form <component-label>.<port-name>.
Input port: an input port corresponding to each expression node referenced by the identified expression node (if any). In some embodiments, the number of input ports (e.g., 0, 1, or 2 input ports in the examples below) depends on the type of the expression node.
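A minimal sketch of the working-memory record set described above; the attribute names follow the description, and the exact types are assumptions:

```python
# Working memory: one record per expression node in the plan.
records = []

def remember(node_id, output_port, input_ports=()):
    """Record the ports assigned to an expression node's component."""
    records.append({"expression_node_id": node_id,
                    "output_port": output_port,
                    "input_ports": list(input_ports)})

# A source node has only an output port; a join node has two inputs.
remember("#2", "#2.read")
remember("#1", "#1.out", ["#1.in0", "#1.in1"])
assert records[1]["input_ports"] == ["#1.in0", "#1.in1"]
```

The later data stream generation phase looks records up by expression node identifier to find which ports to wire together.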
Components (with associated operators corresponding to each expression node in the query plan) are generated during the component generation phase. This stage involves the mapping module traversing the list of expression nodes in the query plan and transferring the relevant information for each expression node to the corresponding components of the generated dataflow graph (i.e., stored within the data structure implementing those components).
Some expression nodes include a relational expression consisting of a single entity, which is classified as a "primitive relationship". If the entity's data is stored in, for example, an input file, the mapping module determines from a source (e.g., a metadata repository) the name of the file and the record format associated with the file. There are two different cases when generating a component for an expression node having a primitive relationship.
In the first case of generating a component for an expression node having a primitive relationship, the translation of the expression node does not define any temporary variables. In this case, the expression node is represented in the dataflow graph by an input file component (i.e., a component with an "input file" operator type). The filename parameter specifies the input filename in the metadata repository as its value. The label of the component is <expression node ID>. The record format of the read port is specified as the record format in the metadata repository. A record is stored in the working memory (with expression node ID equal to <expression node ID>), and the read port of the input file component is recorded as the output port.
In the second case of generating a component for an expression node having a primitive relationship, the translation of the expression node defines one or more temporary variables. In this case, the expression node is represented in the dataflow graph by an input file component and a reformatting component. The label of the reformatting component is <expression node ID>.Reformat, and the transformation function of the reformatting component is generated from the transformation of the expression node. A data stream connects the read port of the input file component to the input port of the reformatting component. A record is stored in the working memory (with expression node ID equal to <expression node ID>), and the output port of the reformatting component is recorded as the output port.
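The two cases can be sketched as a small dispatch; the component field names and the ".Reformat" label convention are illustrative assumptions:

```python
def components_for_primitive(node_id, defines_temporaries):
    """Sketch: a primitive-relationship node becomes an input file
    component alone, or an input file feeding a reformat component
    when the node's translation defines temporary variables."""
    components = [{"label": node_id,
                   "operator": "input file",
                   "output_port": f"{node_id}.read"}]
    if defines_temporaries:
        components.append({"label": f"{node_id}.Reformat",
                           "operator": "reformat",
                           "output_port": f"{node_id}.Reformat.out"})
    return components

simple = components_for_primitive("#2", defines_temporaries=False)
assert simple[-1]["output_port"] == "#2.read"

with_temp = components_for_primitive("#2", defines_temporaries=True)
assert with_temp[-1]["output_port"] == "#2.Reformat.out"
```

In both cases the last component's output port is what gets stored in the working memory record for the node.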
The query plan in the above example has the following final merged set of expression nodes.
Taking the first case as an example, there is a primitive relationship between the #2 expression node and the single entity a in this example. For entity a, the metadata repository provides the following filenames and record formats.
The input file components generated by the mapping module for the #2 expression node are configured as follows.
The record stored in the working memory includes the following information.
Expression node   Output port   Input ports
#2                #2.read
Taking the second case as an example, the following is an alternative expression node (not an expression node from the above example).
It is assumed in this example that entity a has the same file name and record format. The input file component and reformatting component generated by the mapping module for the expression node are configured as follows.
The record stored in the working memory includes the following information.
Expression node   Output port
#2                #2.Reformat.out
The record format of the reformatting component inputs and outputs will be provided during the data format generation phase, as described below.
Some expression nodes include a relational expression comprising a join operation or a summarization operation. For these expression nodes, the mapping module generates a join component or a summarization component, respectively. The label of either type of component is <expression node ID>. The key fields of the join operator or of the summarization operator are determined from the operation parameters in the relational expression. The transformation function for either type of component is generated from the transformation of the expression node. For a join component, the mapping module determines which parameter of the join operation each term in the expression node's transformation comes from. A record is stored in the working memory (with expression node ID equal to <expression node ID>), and the inputs and output of the join or summarization operation are recorded as the record's input and output ports.
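The assignment of join parameters to input ports can be sketched as follows; the string parsing and port-naming convention are illustrative, not the patent's representation:

```python
def join_ports(node_id, relation):
    """Map the parameters of join(#X/k0,#Y/k1) to the in0/in1 ports
    of the generated join component, in parameter order (sketch)."""
    inner = relation[len("join("):-1]          # "#X/k0,#Y/k1"
    children = [p.split("/")[0] for p in inner.split(",")]
    return {child: f"{node_id}.in{i}" for i, child in enumerate(children)}

# The #1 node's relational expression joins nodes #2 and #3:
ports = join_ports("#1", "join(#2/ab,#3/ba)")
assert ports == {"#2": "#1.in0", "#3": "#1.in1"}
```

This parameter-position rule is what makes the #2 node feed in0 and the #3 node feed in1 in the join example below.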
An example of the join component generated for the #1 expression node in the above example is shown below, and an example of the summarization component generated for the #6 expression node is shown below.
In the join example, the final merged translation of the #1 expression node requires the values of a1, b1, aa, and ae. These values are provided by the #2 and #3 expression nodes. The merged translation of the #2 expression node provides the values of a1, aa, and ae, and the merged translation of the #3 expression node provides the value of b1. Based on the parameter positions of the join operation in the relational attribute of the #1 expression node, the #2 expression node corresponds to the in0 port of the generated join component, and the #3 expression node corresponds to the in1 port. This port assignment determines the identifiers used to configure the transformation function of the join component.
For the #1 expression node, the join component generated by the mapping module is configured as follows.
The record stored in the working memory includes the following information.
Expression node   Output port   Input ports
#1                #1.out        #1.in0, #1.in1
In the summarization example, the final merged translation of the #6 expression node requires the values of t3 and ea. These values are provided by the merged translation of the #7 expression node. There is an input port labeled "in" and an output port labeled "out". For the #6 expression node, the summarization component generated by the mapping module is configured as follows.
The record stored in the working memory includes the following information.
Expression node   Output port   Input ports
#6                #6.out        #6.in
In addition to the input file component for the #2 expression node, the join component for the #1 expression node, and the summarization component for the #6 expression node, components for the other expression nodes are generated in a manner similar to that described above, using the following operator types.
ID    Operator type
#1    join
#2    input file
#3    input file
#4    join
#6    summarization
#7    join
#8    input file
END   join
After each of these components is generated, the working memory contains the following records.
FIG. 10A shows a visual representation of the components corresponding to each of the remaining expression nodes in the merged query plan (i.e., the nodes with identifiers #1, #2, #3, #4, #6, #7, #8, and END), before these components are connected to each other by data streams.
The data stream generation phase comprises two steps performed by the mapping module. In the first step, the mapping module inserts replicate components at particular locations, if needed. If the identifier of a particular expression node is referenced in the relational attributes of more than one expression node, the mapping module connects a replicate component to the output port of the component corresponding to that expression node to provide a corresponding number of copies of the data flowing through the component. For example, if more than one other expression node references expression node E, a replicate component labeled "E.Replicate" is added. The mapping module also stores the output port of the new replicate component ("E.Replicate.out") in the working memory record for E. In the above example, two expression nodes (#1 and #2) are each referenced by more than one other expression node, so the mapping module applies this step to both. This results in two replicate components (#1.Replicate and #2.Replicate) being added to the dataflow graph and connected to the respective components (#1 and #2), as shown in FIG. 10B. After this step, the working memory contains the following records.
In the second step of the data stream generation phase, the mapping module connects ports with data streams by traversing the set of expression nodes. If the relational attribute of expression node E1 references the identifier of another expression node E2, the mapping module generates a data stream from the output port of E2 (recorded in E2's working memory record) to the corresponding input port of E1 (recorded in E1's working memory record). For example, the relational attribute of the #4 expression node has a join operation over the #2 and #6 expression nodes. Thus, the first input port (#4.in0) of the #4 node is connected to the output port (#2.Replicate.out) of the #2 node, and the second input port (#4.in1) of the #4 node is connected to the output port (#6.out) of the #6 node. The mapping module continues this step, producing the dataflow graph shown in FIG. 10C. It is noted that while a visual representation of a dataflow graph may be presented to a user, the functionality of the dataflow graph depends on its connectivity rather than on the visual representation of its components.
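The replicate-insertion step described above can be sketched with hypothetical dependency edges; the ".Replicate" naming and the edge list are illustrative assumptions:

```python
from collections import Counter

def insert_replicates(edges):
    """edges: (consumer, producer) dependencies. Any producer that is
    referenced more than once gets a replicate component spliced onto
    its output, and consumers are rewired to the replicate (sketch)."""
    fanout = Counter(producer for _, producer in edges)
    replicated = {p for p, n in fanout.items() if n > 1}
    new_edges = [(c, f"{p}.Replicate" if p in replicated else p)
                 for c, p in edges]
    return replicated, new_edges

# Hypothetical fan-out: #1 and #2 are each consumed twice.
replicated, wired = insert_replicates(
    [("#1", "#2"), ("#4", "#2"), ("#4", "#6"),
     ("END", "#1"), ("#7", "#1")])
assert replicated == {"#1", "#2"}
```

After the rewrite, every producer has exactly one direct consumer (the replicate component when fan-out was greater than one), so the second step can wire ports one-to-one.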
After the second step of the data stream generation phase, the mapping module determines whether additional components should be connected to any existing component by a data stream. For example, in some implementations, the mapping module adds a component to deliver the final output produced by the dataflow graph to the intended target (e.g., a storage medium). FIG. 10D shows a dataflow graph in which an output file component is connected by a link to the output port of the END component (labeled "END.out"). The name of the file written by the output file component may be obtained from a user or from an external source (e.g., a metadata repository).
In the record format generation phase, the mapping module generates a record format for any port that has not yet provided the record format. In this example, the record formats for the input file components (for the #2, #3, and #8 expression nodes) have been obtained from the metadata repository, as shown below.
The mapping module generates record formats for the other components by traversing the dataflow graph from the source components (i.e., components without any input ports) to the target components (i.e., components without any output ports), generating an appropriate record format for the output port of each component, as described in more detail below. The record format at the output port of each component is then propagated (i.e., copied) to the input ports connected to that output port by a data stream. For example, for the #1 join component, the record formats of its input ports in0 and in1 are the same as the record formats propagated from the two connected input file components: in0 takes the format of the #2.read port, and in1 takes the format of the #3.read port.
The output port of a replicate component has the same record format as its input port, and hence as the output port of the component connected to the replicate component. Thus, in this example, the record format of the output port of the #2.Replicate component is as follows:
the record format of the output port of the join component, reformat component, or rollup component depends on the examination of the component's transformation function. If the transformation function copies an input field to an output field, the record format will include the type of the output field (which is the same as the type of the input field). If the transformation function determines an output field based on an expression, the record format will include an output field adapted to hold the value returned by the expression. For example, for the transform function of the #1 linkage assembly shown above, the output fields out.aa, out.b1 and out.ae are copied from the input fields in0.aa, in1.b2 and in0.ae, respectively, so that the type of the output field is the same as the type of each input field. The remaining one output field out.t1 is defined as expression in0.a 1. in1.b 1. The product of two integers is also an integer, and thus the type of the output field out.t1 is an integer. These record formats determined for the input and output ports yield the following complete configuration information for the #1 link assembly.
This process is repeated component by component, from input ports to output ports, until all record formats are determined. In this example, the record format generation phase ends by determining the record format of the output port of the END join component. In some embodiments, the record format generation stage includes deriving record formats or other metadata for components using techniques described in more detail in U.S. Patent No. 7,877,350, which is hereby incorporated by reference.
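The type-propagation rule just described (copied fields keep their input type; the product of two integer fields is an integer) can be sketched as follows; the transform encoding is a made-up representation for illustration:

```python
def field_type(ref, input_formats):
    """Look up the type of a field reference like 'in0.a1'."""
    port, field = ref.split(".")
    return input_formats[port][field]

def output_format(transform, input_formats):
    """Infer an output record format from a transformation function:
    copied fields keep their source type; int * int stays int (sketch)."""
    fmt = {}
    for out_field, rule in transform.items():
        op, *operands = rule
        if op == "copy":
            fmt[out_field] = field_type(operands[0], input_formats)
        elif op == "multiply":
            types = {field_type(r, input_formats) for r in operands}
            fmt[out_field] = "int" if types == {"int"} else "real"
    return fmt

formats = {"in0": {"aa": "int", "a1": "int", "ae": "int"},
           "in1": {"b1": "int"}}
transform = {"aa": ("copy", "in0.aa"),
             "b1": ("copy", "in1.b1"),
             "t1": ("multiply", "in0.a1", "in1.b1")}
assert output_format(transform, formats) == {"aa": "int", "b1": "int",
                                             "t1": "int"}
```

The inferred output format is then propagated downstream as the input format of whichever port this output is connected to.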
The specific character of the generated dataflow graph depends on various features of the system 100, including where the data represented by the entities is stored. For example, if the data represented by entity B and entity E is stored in a relational database and the final output is also to be stored in the relational database, then the operator type for the #3 and #8 source components is "input table" and the operator type for the target component END is "output table". The generated dataflow graph then includes a mixture of components representing files and tables, as shown in FIG. 11A.
In another example, a mapping from a source to a target may appear as a "subgraph" connected through input and output ports to components within another dataflow graph. When such a subgraph is generated from a query plan, instead of a file or table serving as a source component or target component, the source or target is provided as an input port or output port (also referred to as a "boundary port") at the boundary of the subgraph, which is connected to other components by a data stream. Therefore, in the above example, before the component generation phase, the expression nodes having a primitive relationship (i.e., #2, #3, and #8) become input boundary ports labeled "in_2", "in_3", and "in_8", connected to other components as shown in FIG. 11B. As in the previous example, the record formats of these input boundary ports can be obtained from an external source; however, the data storage locations of the corresponding entities need not be obtained. Also, instead of adding an output file component or output table component, the mapping module generates an output boundary port (labeled "out" in FIG. 11B) that is connected to the output port of the END component by a link. The steps of generating a subgraph corresponding to the expression-node query plan are otherwise identical to the steps described above. The mapping subgraph can be connected to specific data source and target components to form a complete dataflow graph, as shown in FIG. 11C. An advantage of the subgraph approach is that developers can manually configure the data sources, which makes the approach more flexible. Furthermore, the mapping subgraph can be reused within several different dataflow graphs.
In some embodiments, the number of components used in a generated dataflow graph or subgraph can be reduced by an optional optimization step. For example, some optimization schemes include merging components whose operations read data from the same dataset (e.g., two different fields of records in the same file).
FIG. 12A illustrates an example of a portion of a data model 1200, the entities of which include "Account," "transaction," "product," "Bill," "line item," and "product update," the relationships of which are to be defined according to a source schema, a target schema, and mapping rules between the source schema and the target schema provided by a developer. The attributes listed for an entity in the data model 1200 correspond to fields of a formatted record according to a defined record format. Examples of record formats for entities such as "account", "transaction", "product", "bill", and "line item" are shown below. In some embodiments, the developer provides the following record formats from which the system 100 automatically generates the partial data model 1200.
Based on these record formats, there is a primary key/foreign key relationship between the "bill" entity and the "line item" entity. The "line item" entity represents a vector of sub-records containing the details of purchases that appear as line items in a billing record, which is an instance of the "bill" entity. Each billing record is assigned a unique serial number called "ParentBill" that can be used as a linkage key between a line item instance and the corresponding billing instance.
FIG. 12B illustrates an example of a data model 1220 after a developer has defined additional relationships and provided mapping rules between source and target entities. In particular, the developer has defined a primary key/foreign key relationship 1222 between the "transaction" entity and the "account" entity through the foreign key field acctnum (account number) in the "transaction" entity, which references the primary key field acctnum of the "account" entity. The developer has defined a primary key/foreign key relationship 1224 between the "transaction" entity and the "product" entity through the foreign key field SKU in the "transaction" entity, which references the primary key field SKU of the "product" entity. The "account," "transaction," and "product" entities collectively represent a source schema that can be used to define attributes of target entities in a target schema. The mapping module interprets the following attribute definitions (i.e., fields) of the target entities in the data model 1220 as mapping rules, from which it generates a program specification, such as a dataflow graph, capable of generating the records associated with the mapped entities that store the information defined in the mapping rules.
Bill-account mapping rules:
line item-transaction mapping rules:
product update-product mapping rules:
SKU = SKU;
new inventory = inventory - sum(quantity);
The bill-to-account mapping rules 1226 provide expressions for the fields of the "bill" target entity in terms of the fields of the "account" source entity. The account field of the "bill" entity is specified as the value of the account field of the "account" entity. By default, the name of a field in an expression refers to the name of a field in the source entity, so the "name," "address," and "balance" fields also refer to fields of the "account" entity. When the name of a field in an expression is not present in the source entity, it refers to the name of a field in the target entity or in an entity related to the target entity. Thus, the "old balance," "total cost," and "new balance" fields refer to fields of the "bill" entity, and the "quantity" and "price" fields refer to fields of the "line item" entity. For a particular record of the "bill" entity, the aggregation function sum() computes the sum over all line-item sub-records whose ParentBill foreign key value matches the ParentBill key value of that particular "bill" record.
The line item-to-transaction mapping rules 1228 provide expressions for the fields of the "line item" target entity based on the fields of the "transaction" source entity. The "date," "quantity," and "price" fields of a particular line item sub-record are specified as the values of the corresponding fields in the corresponding transaction record of the "transaction" entity. The "total" field of a particular line item is specified as the value of "quantity" multiplied by "price" in the corresponding transaction.
The product update-product mapping rules 1230 provide an expression for the "product update" target entity field based on the "product" source entity field. The (primary key) SKU field of the "product update" entity record is assigned the value of the (primary key) SKU field of the corresponding record for the "product" entity. The "new inventory" field is specified as an expression that depends on the "inventory" field of the "product" entity and the "quantity" field of the "transaction" entity associated with the "product" entity.
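The sum() aggregation used by the bill-to-account rules can be illustrated with hypothetical line-item data (the values below are invented for the sketch):

```python
# Illustrative data: line-item sub-records keyed by ParentBill.
line_items = [
    {"ParentBill": 1, "quantity": 2, "price": 5.0},
    {"ParentBill": 1, "quantity": 1, "price": 3.0},
    {"ParentBill": 2, "quantity": 4, "price": 1.0},
]

def total_cost(bill_key):
    """sum(quantity * price) over the sub-records whose ParentBill
    foreign key matches the given bill's key."""
    return sum(item["quantity"] * item["price"]
               for item in line_items if item["ParentBill"] == bill_key)

assert total_cost(1) == 13.0
assert total_cost(2) == 4.0
```

In the generated dataflow graph this aggregation would be carried out by a rollup component keyed on ParentBill rather than by a per-bill scan as in this sketch.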
Based on these mapping rules, the mapping module can generate a dataflow graph from the merged query plan generated as described above that is used to generate records corresponding to the determined mapping rules. The mapping module can also generate other forms of program specifications based on the merged query plan. For example, relational expressions in expression nodes of the query plan may be used to generate query expressions (e.g., SQL query expressions). The result of a relational expression of an expression node corresponds to a view definition, and if the result is associated with multiple expression nodes, a temporary table corresponding to the result is generated. In some implementations, one combination of query expressions and dataflow graph components may be generated for each different expression node.
The techniques described above may be implemented using a computer system executing appropriate software. For example, software includes procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may have various architectures such as distributed, client/server, or grid) each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port and for providing output using at least one output device or port). The software may include one or more modules of a mainframe program that provides other services related to the design, configuration, and execution of dataflow graphs, for example. The modules of the program (e.g., elements of a dataflow graph) may be implemented as data structures or other organized data that conforms to a data model stored in a database.
The software may be provided on a tangible, persistent storage medium such as a CD-ROM or other computer-readable medium (e.g., a medium readable by a general-purpose or special-purpose computer system or device) or delivered (e.g., encoded into a propagated signal) over a communication medium of a network to a tangible, persistent medium of a computer system executing the software. Some or all of the processing may be performed on a special purpose computer or using special purpose hardware, such as a coprocessor or a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). The process may be implemented in a distributed fashion where different portions of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer readable storage medium (e.g., solid state memory or media, or magnetic or optical media) of a storage device readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the processes described herein. The system of the present invention may also be considered to be implemented as a tangible, persistent storage medium configured with a computer program, wherein the storage medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the process steps described herein.
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order-independent, and thus can be performed in an order different from that described.
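The merging of per-rule node sets described above can be illustrated with a minimal sketch. The entity names, expression strings, and helper functions below are hypothetical and are not taken from the patent; the sketch only shows the general idea of unifying nodes whose relational expressions compare equal, so that a sub-expression shared by two mapping rules is represented by a single node and computed once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    # A node holding a relational expression; `children` link to nodes for the
    # sub-expressions over the source entities that this expression references.
    expr: str
    children: tuple = ()

def merge_sets(roots):
    """Merge the per-rule node sets into a third set by comparing the
    relational expressions of nodes: nodes whose expression (and whole
    sub-graph of children) compare equal are unified into one shared node."""
    seen = {}
    def intern(node):
        kids = tuple(intern(c) for c in node.children)
        key = (node.expr, kids)
        if key not in seen:
            seen[key] = Node(node.expr, kids)
        return seen[key]
    merged_roots = [intern(r) for r in roots]
    return merged_roots, list(seen.values())

# Two hypothetical mapping rules over source entities A and B.
rule1 = Node("project(j.x, j.y)",
             (Node("join(A, B)", (Node("scan(A)"), Node("scan(B)"))),))
rule2 = Node("project(A.z)", (Node("scan(A)"),))  # also references entity A

roots, nodes = merge_sets([rule1, rule2])
print(len(nodes))  # → 5: scan(A) is shared rather than duplicated
```

After merging, the node for `scan(A)` is one and the same object in both rules' graphs, which is what allows a generated program specification to compute that sub-expression a single time.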

Claims (17)

1. A method for processing data in one or more data storage systems, the method comprising:
receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and
processing the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities, the program specification, when executed, converting between database formats defined by different schemas, the processing comprising:
generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and each of one or more of the sets forming a corresponding directed acyclic graph including links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by a relational expression of a node in the directed acyclic graph; and
merging at least two of the sets with each other to form a third set, based on comparing the relational expressions of the merged nodes.
2. The method of claim 1, wherein the mapping information comprises a first mapping rule defining an attribute value of the target entity according to an attribute value of the first source entity and an attribute value of the second source entity.
3. The method of claim 2, wherein the first set of nodes associated with the first mapping rule comprises: a first node representing a first relational expression comprising a relational algebra operation referencing the first source entity and the second source entity; a second node, linked to the first node, representing a relational expression comprising the first source entity; and a third node, linked to the first node, representing a relational expression comprising the second source entity.
4. The method of claim 3, wherein the mapping information comprises a second mapping rule defining an attribute value of the target entity according to an attribute value of the first source entity.
5. The method of claim 4, wherein the merging comprises merging the first set and a second set of one or more nodes associated with the second mapping rule, including merging the second node with a node of the second set that represents a relational expression that includes the first source entity.
6. The method of claim 3, wherein the relational algebra operation is a join operation.
7. The method of claim 3, wherein the relational algebra operation is an aggregation operation.
8. The method of claim 2, wherein the first source entity and the second source entity are associated with each other according to a relationship defined in a schema.
9. The method of claim 8, wherein the schema includes a plurality of entities, the relationships between the entities including one or more of: a one-to-one relationship, a one-to-many relationship, or a many-to-many relationship.
10. The method of claim 1, wherein generating the program specification comprises generating a dataflow graph from the third set that includes components for performing operations corresponding to relational expressions in nodes of the third set, and links representing flows of records between output ports and input ports of the components.
11. The method of claim 1, wherein generating the program specification comprises generating a query language specification from the third set, the query language specification comprising a query expression for performing operations corresponding to relational expressions in nodes of the third set.
12. The method of claim 1, wherein generating the program specification comprises generating a computer program from the third set, the computer program comprising a function or expression for performing an operation corresponding to a relational expression in each node of the third set.
13. The method of claim 12, wherein the computer program is specified in at least one of the following programming languages: Java, C++.
14. The method of claim 1, further comprising processing the records in the data storage system in accordance with the program specification to compute values corresponding to at least some of the one or more attributes of one or more target entities.
15. A computer-readable storage medium storing a computer program for processing data in one or more data storage systems, the computer program comprising instructions for causing a computer system to perform the method of any one of claims 1 to 14.
16. A computer system comprising:
one or more data storage systems;
an input device or port for receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and
at least one processor configured to perform the method of any one of claims 1 to 14.
17. A computer system comprising:
one or more data storage systems;
mapping information receiving means for receiving mapping information specifying one or more attributes of one or more target entities in terms of one or more attributes of one or more source entities, at least some of the one or more source entities corresponding to respective sets of records in the one or more data storage systems; and
mapping information processing apparatus for processing the mapping information to generate a program specification for computing values corresponding to at least some of the one or more attributes of the one or more target entities, the program specification, when executed, converting between database formats defined by different schemas, the processing comprising:
generating a plurality of sets of nodes, each set including a first node representing a first relational expression associated with an attribute specified by the mapping information, and each of one or more of the sets forming a corresponding directed acyclic graph including links to one or more other nodes representing relational expressions associated with at least one attribute of at least one source entity referenced by a relational expression of a node in the directed acyclic graph; and
merging at least two of the sets with each other to form a third set, based on comparing the relational expressions of the merged nodes.

Applications Claiming Priority (3)

US201261675053P — priority and filing date 2012-07-24
US 61/675,053 — priority date 2012-07-24
PCT/US2013/051837 (published as WO2014018641A2) — priority date 2012-07-24, filing date 2013-07-24 — Mapping entities in data models

Publications (2)

HK1210295A1 — published 2016-04-15
HK1210295B — published 2019-08-09

