HK1229918B - Managing data profiling operations related to data type - Google Patents
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Application Serial No. 61/949,477, filed on March 7, 2014.
TECHNICAL FIELD
This description relates to managing data profiling operations related to data type.
BACKGROUND
Databases or other information management systems often include data sets for which various characteristics may not be known. For example, the range of values or typical values within a data set, relationships between different fields within the data set, or functional dependencies among values in different fields may be unknown. Data profiling can involve examining a data set in order to determine such characteristics. Some data profiling techniques include receiving information about a data profiling job, running that job, and returning results after a delay that depends on how long the various processing steps involved in the profiling take to perform. One step that can involve significant processing time is "canonicalization," which involves changing the data type of values appearing within records of the data set into a predetermined or "canonical" data type to facilitate other processing. For example, canonicalization may include converting values into a human-readable string representation.
SUMMARY
In one aspect, in general, a method for processing data in a computer system includes: receiving, over an input device or port of the computer system, multiple records, each record including one or more values corresponding to respective fields of multiple fields. The method also includes storing data type information in a storage medium of the computer system, the data type information associating each of one or more data types with at least one identifier. The method also includes processing, using at least one processor of the computer system, multiple data values from the records.
The processing includes: generating multiple data units from the records, each data unit including a field identifier that uniquely identifies one of the fields and a binary value extracted from the field, identified by that field identifier, of one of the records; gathering information about the binary values from multiple data units; forming, for each of one or more of the fields, a list of entries, at least some of the entries each including one of the binary values and information about that binary value gathered from multiple data units; retrieving, from the data type information, a data type associated with a first identifier, and associating the retrieved data type with at least one binary value included in an entry of one of the lists; and, after the information about the binary values has been gathered from the data units, generating profile information for at least one of the fields based at least in part on the retrieved data type of particular binary values occurring within the field.
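The claimed flow can be illustrated with a minimal sketch, assuming Python; the sample records, the `field_ids` mapping, and the `type_info` table are hypothetical stand-ins for the records, field identifiers, and data type information of the claim:

```python
from collections import Counter

# Hypothetical records: each maps field names to raw values.
records = [
    {"FIELD1": "A", "FIELD2": "M", "FIELD3": "X"},
    {"FIELD1": "A", "FIELD2": "N", "FIELD3": "X"},
]
field_ids = {"FIELD1": 1, "FIELD2": 2, "FIELD3": 3}
# Data type information: identifier -> data type (illustrative).
type_info = {1: str, 2: str, 3: str}

# Generate data units: (field identifier, binary value) pairs.
units = [(field_ids[f], v.encode()) for r in records for f, v in r.items()]

# Gather information about the binary values: per-field entry lists
# mapping each distinct binary value to its occurrence count.
census = {}
for fid, raw in units:
    census.setdefault(fid, Counter())[raw] += 1

# Only after gathering, associate the retrieved data type with the
# distinct binary values and build profile information per field.
profile = {
    fid: {type_info[fid](raw.decode()): n for raw, n in entries.items()}
    for fid, entries in census.items()
}
```

Note that the data type is applied per distinct value, after the gathering step, matching the ordering the claim requires.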
Aspects can include one or more of the following features.
The binary value extracted from the field of the record identified by the field identifier is extracted as an untyped sequence of bits. Generating profile information for at least one of the fields based at least in part on the retrieved data type of particular binary values occurring within the field includes reinterpreting the untyped sequence of bits as a typed data value having the retrieved data type.
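Reinterpreting an untyped sequence of bits under a retrieved data type can be sketched with the standard `struct` module (a Python illustration; the byte values are hypothetical):

```python
import struct

# An untyped bit sequence extracted from a record field
# (4 bytes, hypothetical value).
raw = b"\xff\xff\xff\xff"

# Reinterpreting the same bits under different retrieved data types
# yields different typed data values:
(as_uint32,) = struct.unpack("<I", raw)  # unsigned 32-bit integer
(as_int32,) = struct.unpack("<i", raw)   # signed 32-bit integer
```

The same four bytes yield 4294967295 as an unsigned integer but -1 as a signed integer, which is why the retrieved data type matters to the profiling results.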
The profile information includes type-dependent profiling results that depend on the original data types of the data values of the records.
Gathering information about the binary values from multiple data units includes comparing binary values from the data units with binary values in the lists of entries to determine whether a match exists between the binary values.
The information about the binary values gathered from the data units includes a total count of matching binary values, which is incremented each time a match is determined to exist when comparing binary values.
A match between a first binary value and a second binary value corresponds to the sequence of bits comprising the first binary value being identical to the sequence of bits comprising the second binary value.
The data type information associates each of the one or more data types with at least one of the field identifiers.
Retrieving the data type associated with the first identifier from the data type information includes retrieving the data type associated with a first field identifier.
Each data unit includes one of the field identifiers, a binary value from one of the records, and a data type identifier that uniquely identifies one of the data types.
The data type information associates each of the one or more data types with at least one of the data type identifiers.
Retrieving the data type associated with the first identifier from the data type information includes retrieving the data type associated with a first data type identifier.
Associating the retrieved data type with at least one binary value included in an entry of one of the lists includes instantiating a local variable having the retrieved data type and initializing the instantiated variable to a value based on the binary value included in the entry.
Associating the retrieved data type with at least one binary value included in an entry of one of the lists includes setting a pointer associated with the retrieved data type to point to a storage location of the binary value included in the entry.
Each data unit includes one of the field identifiers, a binary value from one of the records, and a length indicator of the binary value.
The length indicator is stored as a prefix of the binary value.
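One illustrative layout for such a data unit (the specification does not fix field widths; the 2-byte identifier and 4-byte length chosen here are assumptions) can be sketched in Python:

```python
import struct

HEADER = "<HI"  # 2-byte field identifier, then 4-byte length indicator

def pack_unit(field_id: int, value: bytes) -> bytes:
    # The length indicator is stored as a prefix of the binary value.
    return struct.pack(HEADER, field_id, len(value)) + value

def unpack_unit(buf: bytes):
    # Read the header, then slice out exactly `length` bytes of value.
    field_id, length = struct.unpack_from(HEADER, buf, 0)
    start = struct.calcsize(HEADER)
    return field_id, buf[start:start + length]
```

The length prefix lets a reader recover the untyped bit sequence without knowing its data type, which is what allows canonicalization to be deferred.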
The method further includes receiving, over the input device or port of the computer system, record format information associated with the multiple records.
The data type information is generated based at least in part on the received record format information.
The processing further includes: for a first field, converting the binary values in respective different entries of a first list, associated with the retrieved data type, into a target data type; and, for a second field, converting the binary values in respective different entries of a second list, associated with the retrieved data type, into the same target data type.
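A sketch of converting distinct entries of two fields' lists, each with a different original data type, into the same target data type (here a human-readable string; the sample byte values are hypothetical):

```python
import struct

# Distinct census entries for two fields with different original data
# types (illustrative): FIELD1 holds little-endian 32-bit integers,
# FIELD2 holds ASCII text.
first_list = [b"\x2a\x00\x00\x00"]
second_list = [b"42"]

# Convert each field's distinct entries, once per distinct value,
# into the same target data type: a string representation.
first_converted = [str(struct.unpack("<i", v)[0]) for v in first_list]
second_converted = [v.decode("ascii") for v in second_list]
```

After conversion, values from the two fields can be compared or aggregated in type-independent processing, even though their original types differed.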
In another aspect, in general, software stored in a non-transitory form on a computer-readable medium includes instructions for causing a computer system to: receive, over an input device or port of the computer system, multiple records, each record including one or more values corresponding to respective fields of multiple fields; store data type information in a storage medium of the computer system, the data type information associating each of one or more data types with at least one identifier; and process, using at least one processor of the computer system, multiple data values from the records.
The processing includes: generating multiple data units from the records, each data unit including a field identifier that uniquely identifies one of the fields and a binary value extracted from the field, identified by that field identifier, of one of the records; gathering information about the binary values from multiple data units; forming, for each of one or more of the fields, a list of entries, at least some of the entries each including one of the binary values and information about that binary value gathered from multiple data units; retrieving, from the data type information, a data type associated with a first identifier, and associating the retrieved data type with at least one binary value included in an entry of one of the lists; and, after the information about the binary values has been gathered from the data units, generating profile information for at least one of the fields based at least in part on the retrieved data type of particular binary values occurring within the field.
In another aspect, in general, a computer system includes: an input device or port of the computer system configured to receive multiple records, each record including one or more values corresponding to respective fields of multiple fields; a storage medium of the computer system configured to store data type information, the data type information associating each of one or more data types with at least one identifier; and at least one processor of the computer system configured to process multiple data values from the records.
The processing includes: generating multiple data units from the records, each data unit including a field identifier that uniquely identifies one of the fields and a binary value extracted from the field, identified by that field identifier, of one of the records; gathering information about the binary values from multiple data units; forming, for each of one or more of the fields, a list of entries, at least some of the entries each including one of the binary values and information about that binary value gathered from multiple data units; retrieving, from the data type information, a data type associated with a first identifier, and associating the retrieved data type with at least one binary value included in an entry of one of the lists; and, after the information about the binary values has been gathered from the data units, generating profile information for at least one of the fields based at least in part on the retrieved data type of particular binary values occurring within the field.
In another aspect, in general, a computer system includes: means for receiving multiple records, each record including one or more values corresponding to respective fields of multiple fields; means for storing data type information that associates each of one or more data types with at least one identifier; and means for processing multiple data values from the records. The processing includes: generating multiple data units from the records, each data unit including a field identifier that uniquely identifies one of the fields and a binary value extracted from the field, identified by that field identifier, of one of the records; gathering information about the binary values from multiple data units; forming, for each of one or more of the fields, a list of entries, at least some of the entries each including one of the binary values and information about that binary value gathered from multiple data units; retrieving, from the data type information, a data type associated with a first identifier, and associating the retrieved data type with at least one binary value included in an entry of one of the lists; and, after the information about the binary values has been gathered from the data units, generating profile information for at least one of the fields based at least in part on the retrieved data type of particular binary values occurring within the field.
Aspects can have one or more of the following advantages.
Data profiling is sometimes performed by a program that provides a user interface dedicated to managing data profiling jobs. In that case, when profiling large amounts of data (e.g., larger data sets and/or a larger number of data sets), users may be expected to tolerate relatively long delays. In some cases, it is useful to incorporate data profiling functionality into another user interface, such as a user interface for developing data processing programs (e.g., expressed as dataflow graphs). However, if a user is in the midst of developing a data processing program, even though a request from within that development user interface for data profiling results for a particular data set would be useful, it may not be acceptable for the user to endure a long delay before those results are available.
Using the techniques described herein, certain delays of particular data profiling procedures can be shortened (e.g., especially for field-level data profiling). For example, instead of waiting three minutes for data profiling results, a user may only need to wait 30 seconds. One technique that yields at least some of this speedup is based on the recognition that canonicalization and other type-dependent processing can be delayed until after other data profiling steps that gather the data values to be canonicalized, and then performed more efficiently. The processing is more efficient because only the distinct values of a field are processed, rather than every occurrence of those values, which generally reduces the number of operations substantially (unless the field has an unusually large number of distinct values). The technique avoids redundant processing, which for a large collection of records could otherwise take a significant amount of time. One consequence of this technique is the need to properly manage the data type information that will later be used for canonicalization and type-dependent validation, as described in more detail below.
The shortened processing delays make it feasible to perform data profiling from within the development user interface. User interface elements (e.g., context menus) can be added to the development user interface and displayed when a user interacts (e.g., with a right-click) with an icon representing a data set or a link representing a data flow. The user interface elements can present the user with operations for initiating one or more data profiling procedures to be performed on the associated data set or data flow. After a relatively short delay, the results can be displayed in a window of that user interface. During the delay, a progress bar can be displayed to show the user that the delay will be relatively short. In one example scenario, if a user expects a particular field of a data set to have only a fixed set of values, the data profiling results can show the user whether any unexpected values appear in that data set. In another example scenario, the user can check whether a field that is expected to be fully populated has any blank or null values. The user can then add "defensive logic" to the program to appropriately handle any unexpected cases that were observed. In another example scenario, the user may wish to view a summary of all data that flowed along a particular data flow during the most recent run of a dataflow graph.
Other features and advantages of the invention will become apparent from the following description and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a data processing system.
FIG. 2 is a schematic diagram of a data profiling procedure.
FIG. 3 is a schematic diagram of extraction of field-value pairs and data type information.
FIG. 4 is a flowchart of a data profiling procedure.
DETAILED DESCRIPTION
FIG. 1 shows an example of a data processing system 100 in which techniques for managing data type information for efficient data profiling can be used. The system 100 includes a data source 102, which may include one or more sources of data such as storage devices or connections to online data streams, each of which may store or provide data in any of a variety of formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a mainframe). An execution environment 104 includes a profiling module 106 and an execution module 112. The profiling module 106 performs data profiling procedures on data from the data source 102 or on intermediate or output data generated by data processing programs executed by the execution module 112. Storage devices providing the data source 102 may be local to the execution environment 104, for example, being stored on a storage medium connected to a computer hosting the execution environment 104 (e.g., hard drive 108), or may be remote to the execution environment 104, for example, being hosted on a remote system (e.g., mainframe 110) in communication with the computer hosting the execution environment 104 over a remote connection (e.g., provided by a cloud computing infrastructure).
The execution environment 104 may be hosted, for example, on one or more general-purpose computers under the control of a suitable operating system, such as a version of the UNIX operating system. For example, the execution environment 104 can include a multiple-node parallel computing environment including a configuration of computer systems using multiple central processing units (CPUs) or processor cores, either local (e.g., multiprocessor systems such as symmetric multi-processing (SMP) computers), or locally distributed (e.g., multiple processors coupled as clusters or massively parallel processing (MPP) systems), or remote, or remotely distributed (e.g., multiple processors coupled via a local area network (LAN) and/or wide-area network (WAN)), or any combination thereof.
The profiling module 106 reads data from the data source 102 and stores profile information 114 generated by the data profiling procedures that the profiling module 106 performs. The profiling module 106 may run on the same host(s) as the execution module 112 within the execution environment 104, or may use additional resources, for example, a dedicated data profiling server in communication with the execution environment 104. The profile information 114 includes the results of the data profiling procedures and intermediate data compiled in the process of generating those results. The profile information 114 may be stored back in the data source 102, stored in a data storage system 116 accessible to the execution environment 104, or otherwise used.
The execution environment 104 also provides a development user interface 118 that a developer 120 uses both to develop data processing programs and to initiate data profiling procedures. In some implementations, the development user interface 118 facilitates development of data processing programs as dataflow graphs that include vertices (representing data processing components or data sets) connected by directed links (representing flows of work elements, i.e., data) between the vertices. For example, such a user interface is described in more detail in U.S. Publication No. 2007/0011668, titled "Managing Parameters for Graph-Based Applications," incorporated herein by reference. A system for executing such graph-based computations is described in U.S. Pat. No. 5,966,072, titled "EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS," incorporated herein by reference. Dataflow graphs made in accordance with this system provide methods for getting information into and out of the individual processes represented by the graph components, for moving information between the processes, and for defining a running order for the processes. The system includes algorithms that choose interprocess communication methods from any available methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes). In addition to data profiling procedures initiated from within the development user interface 118, a data profiling procedure can also be performed by a profiler component in a dataflow graph, which has an input port connected by a dataflow link to an input data set and an output port connected by a dataflow link to a downstream component configured to perform tasks using the results of the data profiling.
The profiling module 106 can receive data from a variety of types of systems that may embody the data source 102, including different forms of database systems. The data may be organized as data sets representing collections of records having values for respective fields (also called "attributes" or "columns"), including possibly null values. When first reading data from a data source, the profiling module 106 typically starts with some initial format information about the records of that data source. In some circumstances, the record structure of the data source may not be known initially and may instead be determined after analysis of the data source or the data. The initial information about records can include, for example, the number of bits that represent a distinct value, the order of fields within a record, and the data type of values appearing within particular fields (e.g., string, signed/unsigned integer).
Typically, the records of a particular data set all have the same record format, and all of the values within a particular field have the same data type (i.e., the record format is "static"). It is possible to have a "dynamic" record format in which different subsets of the records have different formats, and/or one or more fields can have values of different data types. Some of the examples described herein assume a static record format, but various modifications can be made to support a dynamic record format. For example, processing can be re-initialized at the start of each subset of records within a data set that has a change in record format. Alternatively, different subsets of records having the same record format can be treated as different virtual data sets with static record formats, and the results for those virtual data sets can be merged at a later stage if necessary.
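The "virtual data set" alternative can be sketched as a simple grouping of records by a hypothetical record-format identifier (the format tags and record contents below are illustrative):

```python
from collections import defaultdict

# Records of a dynamic data set, each tagged with a hypothetical
# record-format identifier.
records = [
    ("fmt_a", {"FIELD1": "A"}),
    ("fmt_b", {"F1": "A", "F2": "B"}),
    ("fmt_a", {"FIELD1": "C"}),
]

# Treat each subset sharing a record format as its own virtual data
# set with a static format; profiling results for the virtual data
# sets can be merged later if necessary.
virtual_datasets = defaultdict(list)
for fmt, rec in records:
    virtual_datasets[fmt].append(rec)
```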
When performing data profiling, the profiling module 106 reads data from the data source 102 and stores profile information 114, which can be used to perform various kinds of analysis to characterize different data sets and different fields within those data sets. In some implementations, the profile information 114 includes a census of the values appearing within particular fields (e.g., selected fields of selected data sets, or all fields of all data sets). The census lists all of the distinct values within a field and tallies the number of times each distinct value appears. In some implementations, the census data is stored as a single data structure, optionally indexed by field; in other implementations, the census data is stored as multiple data structures, for example, one data structure for each field.
The census data for a particular profiled field can be organized as a list of entries, with each entry including an identifier of the field, a value appearing in the field, and a count of the number of times that value appears in the field over the data set. For some data sets, the count corresponds to the number of records in which that value appears in the field. For other data sets (e.g., hierarchical data sets that include nested vectors as values of some fields), the count may differ from the number of records. In some implementations, a census entry may also indicate whether the value is a null value. There is an entry for each distinct value, so the value of any particular entry is different from the values of all other entries, and the number of entries is equal to the number of distinct values appearing in the field. The identifier of the field can be any value that uniquely identifies the field being profiled. For example, the profiled fields can be enumerated by assigning each field an integer index ranging from 1 to the number of profiled fields. Such an index can be stored compactly within the census data structure. Even if the census data for different fields is stored in separate data structures, it may still be useful to include a particular field identifier within each entry of a data structure (e.g., to distinguish entries from different data structures). Alternatively, in some implementations, if the census data for different fields is stored in separate data structures, the field identifier only needs to be stored once per data structure, with each entry implicitly associated with that field and including only the value and the count.
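A census list for a single field might be sketched as follows (Python; the field identifier and sample values are illustrative):

```python
from collections import Counter

# Values observed in one profiled field (field identifier 2,
# illustrative data).
values = ["M", "N", "M", "M"]
field_id = 2

# Census: one entry per distinct value, each entry holding the field
# identifier, the value, and its occurrence count.
census_entries = [(field_id, v, n) for v, n in Counter(values).items()]
```

The number of entries equals the number of distinct values, not the number of records, which is the property the later type-dependent processing exploits.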
FIG. 2 illustrates an example of a statistical-survey-based data profiling procedure performed by the profiling module 106. The extraction module 200 executes an extraction procedure that generates a data stream 203 of extracted field-value pairs from the dataset being profiled, such as table 201. In this example, table 201 has three fields, named FIELD1, FIELD2, and FIELD3, and the first few data records in table 201 (i.e., the first three rows) show values corresponding to each of these three fields. The statistical survey generation module 202 processes the data stream 203 of field-value pairs to generate one or more statistical survey files 205 for the respective fields. The type-dependent processing module 204 associates the various data types from the retained data type information 208 with the respective fields, so that type-dependent profiling results are included in the profiling information 114. The type-independent processing module 206 canonicalizes the typed data values recovered by the type-dependent processing module 204, thereby providing data values of a predetermined type to facilitate additional processing that does not depend on the initial data type. By delaying type-dependent processing and canonicalization until after the statistical survey has been generated, a potentially large speedup is realized, because multiple instances of the same data value can be processed together (i.e., a common data value is processed once) rather than separately.
The extraction module 200 generates the field-value pairs by decomposing a particular data record into a series of field-value pairs, each of which includes a field index and a binary data value. The field index is an index value assigned to a particular field to identify that field uniquely (and efficiently) (e.g., 1 = FIELD1, 2 = FIELD2, 3 = FIELD3), and the binary data value is an untyped (or "raw") bit sequence representing the corresponding data value contained in that field of the data record. In this example, the first data record in table 201 yields the following field-value (i.e., field index, binary data value) pairs: (1, bin(A)), (2, bin(M)), (3, bin(X)) (where, for purposes of illustration, "bin(A)" denotes the binary data value (i.e., the bit sequence) representing the data value "A").
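The decomposition just described can be sketched as below; the UTF-8 encoding used to produce the raw bit sequences is an assumption for illustration.

```python
# Illustrative sketch of the extraction step: decompose one record into
# (field index, binary data value) pairs, with 1-based field indexes.
def extract_pairs(record):
    """record: tuple of per-field values, in field order."""
    return [(i, str(v).encode("utf-8"))   # bin(v): the raw byte sequence
            for i, v in enumerate(record, start=1)]

pairs = extract_pairs(("A", "M", "X"))    # first record of a table like 201
assert pairs == [(1, b"A"), (2, b"M"), (3, b"X")]
```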
The statistical survey generation module 202 aggregates the binary data values from the field-value pairs in the data stream 203 to produce the statistical survey files 205. This aggregation, which is part of the statistical survey, only needs to be performed with enough information to know whether a particular data value is the same as another data value. Such matching can be performed on the raw binary data values, so no valuable processing time needs to be spent providing data types to the statistical survey generation module 202. (In FIG. 2, the values shown in the entries of the statistical survey file 205 correspond to the first three data records in table 201; these entries are updated as field-value pairs from additional data records in table 201 are processed by the statistical survey generation module 202.)
For a particular dataset, the field-value pairs can be inserted into the data stream 203 in any order. In this example, as each record appears in table 201, the data stream 203 includes all the field-value pairs for that data record, followed by all the field-value pairs for the next data record. Alternatively, table 201 can be processed field by field, so that the data stream includes all the field-value pairs for a particular field, followed by all the field-value pairs for the next field. High-dimensional datasets can also be handled this way, with field-value pairs added to the data stream 203 in whatever order is most efficient, for example, for reading the dataset or for generating the statistical survey files from the resulting data stream 203. The data stream 203 of field-value pairs can be written to a file after all the field-value pairs have been generated and then processed by the downstream statistical survey generation module 202, or it can be provided to the downstream statistical survey generation module 202 while it is being generated (e.g., to take advantage of the resulting pipeline parallelism).
The statistical survey generation module 202 processes the field-value pairs until the end of the data stream 203 is reached (e.g., the end is indicated by an end-of-stream record if the data stream corresponds to a finite batch of data records, or by a marker delimiting a unit of work if the data stream corresponds to a continuous stream of data records). The module 202 performs a data operation on each field-value pair, called a "statistical survey match operation", to determine whether the binary data value in that field-value pair matches a previous binary data value from a previously processed field-value pair. The module 202 performs the statistical survey match operation at least once for every field-value pair in the data stream 203. The module 202 stores the results of the statistical survey match operations in a data structure held in a working storage space of a storage device. If the statistical survey match operation finds a match with a previous data value, the stored count associated with that data value is incremented. Otherwise, if the statistical survey match operation finds no match with a previous data value, a new entry is stored in the data structure.
For example, the data structure can be an associative array capable of storing key-value pairs with unique keys, where a key is used to look up its associated value within the array. In this example, the key is the binary data value from a field-value pair, and the value is a count to be incremented toward the corresponding total in the statistical survey data. When a key-value pair is created for a field-value pair with a particular binary data value as its key (a key that does not match any key already in the associative array), the count starts at 1, and it is incremented by 1 each time another field-value pair carries a binary data value that matches the existing key. The module 202 looks up the binary data values of field-value pairs belonging to different fields (determined from the field index in each field-value pair) in different associative arrays, with one associative array allocated to each field being profiled. In some implementations, the number of fields being profiled is known in advance, and an empty associative array (which uses only a minimal amount of storage space) is allocated for each field at the start of the profiling procedure.
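The match operation over per-field associative arrays can be sketched as follows, using a Python dict per field as the associative array; this is a simplified model, not the patented implementation.

```python
# Sketch of the statistical survey match operation: one associative array
# (dict) per profiled field, keyed by the raw binary value, holding a count.
def survey(pairs, num_fields):
    # Pre-allocate an empty associative array for each profiled field.
    arrays = {i: {} for i in range(1, num_fields + 1)}
    for field_index, binary_value in pairs:
        counts = arrays[field_index]
        if binary_value in counts:
            counts[binary_value] += 1     # match found: increment the count
        else:
            counts[binary_value] = 1      # no match: new entry starting at 1
    return arrays

stream = [(1, b"A"), (2, b"M"), (1, b"A"), (1, b"B")]
arrays = survey(stream, num_fields=3)
assert arrays[1] == {b"A": 2, b"B": 1}
assert arrays[2] == {b"M": 1}
assert arrays[3] == {}                    # field 3 saw no pairs yet
```

Because the keys are raw byte sequences, no data type is needed to decide whether two values match, which is the point made above.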
An associative array can be implemented, for example, using a hash table or another data structure that provides efficient lookup of keys and modification of their associated values. The binary data value serving as the key of a key-value pair can be stored as a copy of the binary data value itself, or as a pointer to a binary data value stored at a different location in working memory (e.g., within a stored copy of a field-value pair). The associative array, together with the stored copies of the binary data values from the field-value pairs, or even the entire field-value pairs themselves, can collectively be considered the data structure that stores the results of the statistical survey match operations. In implementations in which pointers to the binary data values of field-value pairs are stored in the associative array, only the first field-value pair containing a particular key needs to be kept in working memory; subsequent field-value pairs containing that key can be removed from working memory after the statistical survey match operation.
In the examples below, the associative arrays for the fields being profiled are called "statistical survey arrays", and the key-value pairs are called the "statistical survey entries" within a statistical survey array. At the end of statistical survey generation, the statistical survey arrays produced by the statistical survey generation module 202 store, in the respective statistical survey entries, all the unique binary data values appearing in table 201, along with the total number of times each binary data value appears in the rows of table 201, which represent the data records being profiled. Optionally, perhaps as part of the type-dependent processing, the statistical survey arrays can be updated to store not only the raw binary data values but also the data types associated with those binary data values, so that the statistical survey entries store all the distinct typed data values appearing in table 201. For a static record format, in which all the data values in a field have the same data type, data values that were distinct before being typed remain distinct after being typed.
The type-dependent processing module 204 runs data profiling procedures to determine whether a particular data value is valid with respect to its original data type. For example, if the original data type of the values appearing in a particular field is defined (in the record format) as the data type "date", then a valid data value in that field may be required to have a particular string format and ranges of values for the different portions of the string. For example, the string format may be specified as YYYY-MM-DD, where YYYY is any four-digit integer representing the year, MM is any two-digit integer between 1 and 12 representing the month, and DD is any two-digit integer between 1 and 31 representing the day. Additional restrictions on valid values of the type "date" may further specify that certain months allow only, for example, days between 1 and 30. In another example, validating data values in a field with a 'UTF-8' data type may include checking each UTF-8 character of a data value's bit sequence for the appearance of bytes not allowed in valid UTF-8 characters. In another example, validating data values in a field whose data type has a particular host format may include checking specific characteristics defined by that host format. Type-dependent validity checks can also include applying user-defined validation rules that depend on a data value's original data type. If a data value is found to be invalid for its data type, the type-dependent processing module 204 can mark it as invalid, which avoids additional processing that would otherwise be performed on that data value by the type-independent processing module 206. Before performing such validity checks, the type-dependent processing module 204 retrieves the data types from the retained data type information 208 and associates each retrieved data type with the corresponding binary values included in the statistical survey for the given field, as described in more detail below.
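A type-dependent validity check for the "date" format described above might look like the following simplified sketch (it applies only the constraints stated in the text, ignoring, e.g., leap-year rules).

```python
# Sketch of a type-dependent validity check for a YYYY-MM-DD "date" value.
import re

DAYS_IN_MONTH = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
                 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}

def is_valid_date(value):
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", value)
    if not m:
        return False                          # wrong string format
    year, month, day = (int(g) for g in m.groups())
    if not 1 <= month <= 12:
        return False                          # month out of range
    return 1 <= day <= DAYS_IN_MONTH[month]   # some months allow fewer days

assert is_valid_date("2014-03-07")
assert not is_valid_date("2014-13-01")        # month 13 is invalid
assert not is_valid_date("2014-04-31")        # April has only 30 days
```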
An example of a technique for extracting the binary data values and separately preserving the data type information 208 for later use is shown in FIG. 3. Table 201 is associated with a record format that describes the data types of the values appearing in the three fields of table 201. In this example, the record format 300 is defined using a list of field declarations, as follows:
T1 L1 FIELD1
T2 L2 FIELD2
T3 L3 FIELD3
The record format 300 includes a field declaration for each of the three fields of table 201: FIELD1, FIELD2, and FIELD3. For FIELDi, the field declaration includes: an identifier Ti for the data type of the data values in the field, an identifier Li for the length of the data values in the field, and the field name. The data type and length identifiers (Ti and Li) are represented symbolically in this example (from i = 1 to i = the number of fields), but an actual record format may use any of a variety of keywords, punctuation marks, and other syntactic elements to identify data types and lengths.
The following example shows how the data type and length identifiers may be specified using a particular kind of syntax (namely, Data Manipulation Language (DML) syntax), in which the field declarations are delimited by the terms "record" and "end", and each field declaration ends with a semicolon:
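The DML listing itself does not survive in this text. A plausible reconstruction, consistent with the description (record/end delimiters, trailing semicolons, string/int/decimal keywords, byte lengths in parentheses), is sketched below; the specific type and length assignments per field are assumptions for illustration only.

```
record
  string(4)  FIELD1;
  int(4)     FIELD2;
  decimal(8) FIELD3;
end;
```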
In this example, there are three different data types: a string data type identified by the keyword "string", an integer data type identified by the keyword "int", and a floating-point decimal data type identified by the keyword "decimal". The example also includes length identifiers (in bytes) in parentheses following the data type keywords.
Other kinds of record formats can also be used. For example, in some record formats, the length of a data value is not explicitly specified in the field declaration (e.g., for variable-length or delimited data values). Some record formats have a potentially complex structure (e.g., a hierarchical or nested structure), which can be used, for example, to specify conditional record formats or to specify record formats for a particular kind of storage system (e.g., a COBOL copybook). Such a complex structure can be parsed (or "walked"), and each field can be assigned a unique field identifier.
The extraction module 200 stores the data type information 208, which includes enough information about the original data types of the fields, as defined by the record format 300, to enable the type-dependent processing module 204 to recover the original data types for type-dependent processing. For example, the data type identifiers Ti and the length identifiers Li can be stored in an associative array that associates them with the corresponding field identifiers of the respective fields, as shown in FIG. 3. Optionally, additional information, such as the field names, can also be stored in the data type information 208. A data type identifier Ti can be stored in its original form, or in a different form that preserves enough information to recover the original data type. The data type information 208 only needs to be generated once per dataset, sparing the system the potentially substantial work of extracting the data type along with each data value when generating the field-value pairs for statistical survey generation. Even if the generation of the data type information 208 is repeated (e.g., once by the extraction module 200 and again by the type-dependent processing module 204), the potentially substantial work is still avoided by limiting the data type extraction to a constant number of times rather than a number close to the number of records in the dataset.
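The retained data type information can be modeled as below; the particular types, lengths, and field names stored are hypothetical values chosen only to illustrate the shape of the associative array.

```python
# Sketch of data type information 208: an associative array mapping each
# field index to the (type identifier, length) from the record format.
# Field names are the optional extra information mentioned in the text.
DATA_TYPE_INFO = {
    1: {"type": "string",  "length": 4, "name": "FIELD1"},
    2: {"type": "int",     "length": 4, "name": "FIELD2"},
    3: {"type": "decimal", "length": 8, "name": "FIELD3"},
}

def lookup_type(field_index):
    """Recover the original data type and length for a field index."""
    info = DATA_TYPE_INFO[field_index]
    return info["type"], info["length"]

assert lookup_type(2) == ("int", 4)
```

This lookup is performed a constant number of times per dataset, not once per record, which is the efficiency argument made above.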
If the data type information 208 is generated once by the extraction module 200, it can then be stored for retrieval by (or otherwise communicated to) the type-dependent processing module 204. Alternatively, if the data type information 208 is generated by the extraction module 200 and generated separately by the type-dependent processing module 204 (e.g., just before it is needed), the same data type information 208 is obtained in both cases, as long as the same procedure is used to assign index values to the fields defined by the record format. Whichever module generates the data type information 208 uses the same field index values 302, mapped to particular fields, that the extraction module 200 uses when inserting field index values into the field-value pairs.
The output of the extraction module 200 is a stream of elements that optionally includes information besides the pairs of field indexes and binary data values. Still referring to FIG. 3, the stream 203' of optional field-value pairs extracted from table 201 by the extraction module 200 consists of elements that also include a record index and the length of each binary data value. One element 306 in the stream 203' includes the field index value "2" (corresponding to FIELD2), a length value len(M), the binary data value bin(M), and the record index value "1" (corresponding to the first record in table 201). The field index and the binary data value are used to compile the statistical survey entries, as described above. The record index can be used to compile location information, as described below. The length value len(M) can, for example, be the value 4 (e.g., encoded as a fixed-length integer), indicating a 4-byte length for the binary data value bin(M) that follows it. The data stream 203 of FIG. 2, which does not carry such lengths, may be sufficient for data values of fixed-length fields, whose lengths (Li) are specified in the record format and accessible to the statistical survey generation module 202 when reading the elements of the stream 203. However, if the record format does not specify a fixed length but instead allows data values of variable length, each element can include a length prefix, as in the stream 203'. For a blank data value, a length prefix of 0 can be used, followed by no corresponding binary data value.
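One way to encode such length-prefixed stream elements is sketched below; the use of little-endian 4-byte integers for the prefix fields is an assumption, not a detail from the specification.

```python
# Sketch of a stream element (field index, length, binary value, record
# index) with a length prefix for variable-length values. A blank value is
# encoded with length 0 and no value bytes.
import struct

def encode_element(field_index, binary_value, record_index):
    header = struct.pack("<II", field_index, len(binary_value))
    return header + binary_value + struct.pack("<I", record_index)

def decode_element(buf):
    field_index, length = struct.unpack_from("<II", buf, 0)
    value = buf[8:8 + length]
    (record_index,) = struct.unpack_from("<I", buf, 8 + length)
    return field_index, length, value, record_index

elem = encode_element(2, b"M", 1)         # like element 306 in stream 203'
assert decode_element(elem) == (2, 1, b"M", 1)
assert decode_element(encode_element(3, b"", 7)) == (3, 0, b"", 7)
```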
Statistical survey generation is generally performed to enable "field-level" profiling, in which the profiling information characterizes the values appearing in a field. In some implementations, for each unique data value, the statistical survey generation module 202 also adds to each statistical survey entry location information identifying all the records of the dataset in which that data value appears, which is useful for "record-level" profiling. In this example, the location information can be compiled based on the record index values 304 that map to particular records in table 201. For example, during generation of the statistical survey entry for a particular data value of a particular field, a bit vector has bits set at the positions corresponding to the integer values of the record indexes of the records having that data value in that field. The bit vector can be compressed to reduce the storage space required. The record index values 304 can be generated, for example, by assigning a consecutive sequence of integers to the records. The processing modules 204 and 206, or other data profiling procedures, can then use this location information to compile record-level statistics and to locate (or "drill down" to) records having particular properties discovered by the data profiling (e.g., all records having a particular data value that was not expected to appear in a certain field). Compiling the location information may consume additional processing time, but not as much as compiling record-level statistics by processing each record directly before the statistical survey has been generated.
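The bit-vector bookkeeping can be sketched as follows; a Python integer stands in for the bit vector here, whereas a real implementation might use a compressed representation, as the text notes.

```python
# Sketch of per-value location information: a bit vector with one bit set
# for each record index at which the value appears in the field.
def add_location(bit_vector, record_index):
    return bit_vector | (1 << record_index)

def locations(bit_vector):
    """Recover the set record indices (for record-level drill-down)."""
    return [i for i in range(bit_vector.bit_length()) if bit_vector >> i & 1]

bv = 0
for record_index in (1, 3, 3, 6):      # value seen in records 1, 3, and 6
    bv = add_location(bv, record_index)
assert locations(bv) == [1, 3, 6]      # duplicates collapse to one set bit
```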
The type-dependent processing module 204 can use a variety of techniques to associate the original data types with the corresponding binary data values in the statistical survey arrays for type-dependent processing and eventual canonicalization. In some implementations, the data type of the binary data value from a statistical survey entry is recovered by using the field index from that statistical survey entry to instantiate a local variable having the data type retrieved from the data type information 208. The local variable is initialized to the value corresponding to the binary data value included in the statistical survey entry (e.g., by parsing the binary data value according to the data type of the local variable). For example, the local variable can be a variable in the programming language in which the type-dependent processing module 204 is implemented (e.g., a C or C++ variable). The instantiated and initialized local variable can then be used in the processing performed by the type-dependent processing module 204. Alternatively, in some implementations, instead of instantiating a new local variable, a pointer associated with the retrieved data type can be set to point to the storage location at which the binary data value is stored (e.g., within the statistical survey entry). Recovery of the original data type can also be performed using a function reference that invokes a processing module (e.g., a dedicated engine for interpreting DML code) to generate a line of code that reinterprets a binary data value as a typed data value. For example, the function reference can be expressed as follows.
reinterpret_as(<data type identifier>, <binary data value>)
The function "reinterpret_as" invokes the processing necessary to reinterpret the second "binary data value" argument as a typed data value having the data type corresponding to the first "data type identifier" argument.
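A helper in the spirit of "reinterpret_as" might be sketched as below. The type identifiers and the byte encodings (UTF-8 strings, 4-byte little-endian integers, ASCII decimals) are illustrative assumptions; the actual engine interprets DML code.

```python
# Sketch of a reinterpret_as-style helper: parse a raw bit sequence as a
# typed value according to a data type identifier.
import struct
from decimal import Decimal

def reinterpret_as(data_type, binary_value):
    if data_type == "string":
        return binary_value.decode("utf-8")
    if data_type == "int":
        return struct.unpack("<i", binary_value)[0]  # 4-byte little-endian
    if data_type == "decimal":
        return Decimal(binary_value.decode("ascii"))
    raise ValueError("unknown data type: " + data_type)

assert reinterpret_as("string", b"ABC") == "ABC"
assert reinterpret_as("int", struct.pack("<i", 42)) == 42
assert reinterpret_as("decimal", b"3.14") == Decimal("3.14")
```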
As described above, some of the type-dependent processing performed may include checks to determine whether a data value is valid for its type. Before determining whether a data value is valid or invalid for its type, there can be a check to determine whether the data value is null or missing (indicating that the field is empty for at least one record). There may be a predetermined null value (or values, such as any number of 'space' characters), which can be defined in the record format. A missing value can be indicated, for example, by a length prefix indicating a zero-length binary data value. In some cases, a missing value may be treated differently from a null value. Which data values are considered null values may depend on the data type.
Canonicalization can be performed at the start of the type-independent processing (by the module 206) or at the end of the type-dependent processing (by the module 204). Canonicalization can include, for example, converting all data values into data values having the target data type "string". If a data value already has the (recovered) data type "string", the canonicalization procedure may not perform any operation on that data value. In some implementations, the canonicalization procedure may still perform certain operations (e.g., removing leading or trailing 'space' characters) even if the data value already has the target data type. It is possible for canonicalization to map two different data values to the same canonicalized data value. For example, a data value "3.14 " with the data type "string" (from a first field) may have its trailing 'space' character removed to yield the "string" value "3.14", while a data value 3.14 with the data type "decimal" (from a second field) may be converted into the same "string" value "3.14". If two different data values from the same field (and therefore having the same original data type) are converted into the same canonicalized data value, then, in some implementations, the type-independent processing module 206 can optionally update the appropriate statistical survey array to aggregate the statistical survey entries for those two data values, so that the count for the new value is the sum of the individual counts for the old values.
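Canonicalization to a target "string" type, with aggregation of entries whose canonical forms collide, can be sketched as follows; the whitespace-stripping rule is one example of the optional operations mentioned above.

```python
# Sketch of canonicalization to a target "string" type, merging statistical
# survey entries whose canonical forms collide (counts are summed).
from decimal import Decimal

def canonicalize(value):
    if isinstance(value, str):
        return value.strip()          # e.g. drop trailing 'space' characters
    return str(value)                 # e.g. decimal -> human-readable string

def canonicalize_entries(entries):
    """entries: {typed value: count} -> {canonical string: summed count}"""
    out = {}
    for value, count in entries.items():
        key = canonicalize(value)
        out[key] = out.get(key, 0) + count
    return out

merged = canonicalize_entries({"3.14 ": 5, Decimal("3.14"): 2})
assert merged == {"3.14": 7}          # two old entries collapse into one
```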
FIG. 4 shows a flowchart 400 of an example of a data profiling procedure that uses data type management techniques for delayed type-dependent processing and canonicalization. The flowchart represents one possible algorithm for data profiling, but is not meant to limit the order in which particular steps are performed (e.g., various forms of parallelism are permitted). In an outer loop, the system 100 receives (402) a dataset to be profiled and stores (404) the corresponding data type information, which associates a data type with a corresponding identifier for each field being profiled. In an inner loop, the system 100 generates (406) data units (i.e., field-value pairs), each data unit including a field identifier that uniquely identifies a field and a binary value from a record. The binary value is extracted from the field of the record identified by the field identifier. The system 100 checks (408) whether there are any additional records to process, as the condition for ending the inner loop, and checks (410) whether there are any additional datasets to process, as the condition for ending the outer loop.
Potentially in parallel with the inner and outer loops (e.g., using pipeline parallelism among the modules of FIG. 2), the system 100 aggregates (412) information related to the binary values from a group of the data units (e.g., the data units of a particular field, based on the field identifier). In some implementations, this aggregation takes the form of a statistical survey procedure, and a list of entries is generated (414) by that procedure for each field being profiled. Each statistical survey entry includes a particular one of the binary values and information related to that binary value aggregated over multiple data units (e.g., a total count). In a type-dependent processing phase, the system 100 retrieves (416) the data type associated with each field identifier from the data type information, and associates each retrieved data type with the binary values included in the entries of the appropriate list (based on the field identifier). This enables the system 100, after aggregating the information related to the binary values from multiple data units, to efficiently generate (418) profiling information for one or more of the fields, based at least in part on the retrieved data types of the particular binary values appearing in those fields.
Along with delaying type-dependent processing and canonicalization until after the statistical-survey-based aggregation, other techniques can be used in combination to further improve the efficiency of data profiling (reducing potential delays). For example, techniques can be used to efficiently spill a working memory space to an overflow storage space. In some implementations, the program performing the data profiling procedure, or a portion of that program (e.g., the statistical survey generation module 202), may be given a memory limit, i.e., a maximum amount of working memory space within a storage device that the program is allowed to use. The program may use the working memory space to store the statistical survey arrays (which may require most of the allowed working memory space) and to store other temporary values (which may require significantly less space than the statistical survey arrays). An overflow condition of the working memory space is satisfied when the module 202 determines that there may not be enough available working memory space to add additional entries to the statistical survey arrays, or that there is no longer any available working memory space for additional entries (e.g., due to the addition of the last entry). The module 202 may make this determination by measuring the total size of the statistical survey arrays (including any data values or field-value pairs referenced by pointers within them) and comparing that size to the memory limit (or another threshold), or by determining the amount of available working memory space remaining without directly measuring the total size of the statistical survey arrays (e.g., from the range of memory addresses remaining in an allocated block of memory addresses).
In some implementations, the program sets an overflow threshold to detect when the total size of the statistical survey arrays is approaching the memory limit. The total size of the statistical survey arrays can be measured directly, for example, by computing the sum of the sizes of the individual statistical survey arrays, where the size of an individual array is measured as the number of bits of working memory space it occupies. Alternatively, the total size can be measured indirectly, for example, by computing the amount of free space remaining in the working memory space. In some implementations, the program sets the overflow threshold just below the memory limit to reserve some space for other values. In some implementations, the overflow threshold may be equal to the memory limit, for example, if the space needed for the other values is negligible and/or the profiling module 106 does not impose a strict memory limit, allowing the memory limit to be exceeded by small amounts for relatively short periods of time.
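The direct-measurement variant of the overflow threshold check could be sketched as follows. The sizing rule and the reserve held back for other temporary values are illustrative assumptions, not part of the described implementation; a real implementation would also need to account for any values referenced by pointers within the arrays.

```python
def survey_size_in_bytes(survey):
    # crude hypothetical sizing of one {binary value: count} survey:
    # the value's bytes plus a fixed 8 bytes per count
    return sum(len(value) + 8 for value in survey)

def overflow_condition_met(survey_arrays, memory_limit, reserve=1024):
    """Direct measurement: sum the sizes of the statistical survey
    arrays and compare against an overflow threshold set just below
    the memory limit, reserving some space for other values."""
    total = sum(survey_size_in_bytes(s) for s in survey_arrays)
    threshold = memory_limit - reserve
    return total >= threshold
```

The indirect variant described in the text would instead query the remaining free space in the working memory region, avoiding a scan of the arrays entirely.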
After the overflow condition has been triggered, the program uses an overflow handling procedure to store the data needed to generate the complete statistical survey arrays in an overflow storage space within a storage device (e.g., data storage system 116). Exactly what is stored in the overflow storage space depends on the type of overflow handling procedure used. U.S. Publication No. 2014/0344508, entitled "MANAGING MEMORY AND STORAGE SPACE FOR A DATA OPERATION," incorporated herein by reference, describes examples of overflow handling procedures in which, after the overflow condition is triggered, the program continues to perform a statistical survey matching operation on each processed field-value pair and stores information associated with the result of that data operation (i.e., an incremented count in a statistical survey entry, or a new statistical survey entry) either in the same set of statistical survey arrays in working memory or in a new set of statistical survey arrays in working memory. If the overflow condition is triggered at some point during the processing of the field-value pairs of the data stream 203, some of the data will be stored in the working memory space and some will be stored in the overflow storage space. In some cases, the data in these two locations is merged in some manner to produce the complete statistical survey arrays. Each statistical survey array is output in its own statistical survey file 205 for processing by the type-dependent processing module 204. Because each binary data value can be extracted and stored in a statistical survey array without associated intermediate data indicating its data type, the storage size of the statistical survey data can be kept small, which also reduces the chance that an overflow will occur.
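Merging the partial statistical surveys held in working memory and in overflow storage can be as simple as summing the counts of identical binary values, as in this hypothetical sketch (assuming, as above, that a survey is represented as a map from binary value to count):

```python
from collections import Counter

def merge_surveys(in_memory, spilled):
    """Combine two partial statistical surveys for the same field,
    produced before and after an overflow condition, by summing the
    counts of identical binary values."""
    merged = Counter(in_memory)
    merged.update(spilled)   # Counter.update adds counts, not replaces
    return merged
```

Because counting is associative, the split point at which the overflow condition was triggered does not affect the merged result.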
The techniques described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or in suitable hardware such as a field-programmable gate array (FPGA), or in some hybrid form. For example, in a programmed approach, the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures, such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program, for example, one that provides services related to the design, configuration, and execution of dataflow graphs. The modules of the program (e.g., elements of a dataflow graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.
The software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general-purpose or special-purpose computing system or device), or delivered (e.g., encoded in a propagated signal) over a network communication medium to a tangible, non-transitory medium of the computing system where it is executed. Some or all of the processing may be performed on a special-purpose computer, or using special-purpose hardware, such as coprocessors or field-programmable gate arrays (FPGAs) or dedicated application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid-state memory or media, or magnetic or optical media) of a storage device accessible by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the storage medium is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.
A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order-independent, and thus can be performed in an order different from that described.
Claims (23)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US61/949,477 | 2014-03-07 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1229918A1 (en) | 2017-11-24 |
| HK1229918B (en) | 2021-03-26 |