HK1224050B

HK1224050B - Managing memory and storage space for a data operation

Info

Publication number: HK1224050B
Application number: HK16112228.1A
Authority: HK
Inventors: M.A.可汗; S．G．雷比茨基; 乔尔.古尔德
Original assignee: 起元科技有限公司
Priority date: 2013-05-17
Filing date: 2014-05-16
Publication date: 2020-06-12

Description

Manage memory and storage space for data operations

相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS

本申请要求享有2013年5月17日提交的61/824,686号美国专利申请的优先权。This application claims priority to U.S. Patent Application No. 61/824,686, filed May 17, 2013.

技术领域Technical Field

本申请涉及一种管理数据操作的存储器和存储空间。The present application relates to a memory and storage space for managing data operations.

背景技术Background Art

一些计算系统使用虚拟存储器方案来管理在操作系统内执行的程序所使用的存储器设备。例如，使用从存储设备换入和换出的存储器页面，操作系统可以处理存储器设备(也称为“主存储器”)的较大虚拟地址空间和较小实地址空间之间的转译，其中上述存储设备用作具有比存储器设备更大存储容量的后备存储。因此，程序可访问的工作存储器的量不受限于主存储器的大小。在虚拟存储器方案中，程序的工作存储器中地址页面在存储器设备和后备存储之间的往返移动对使用该工作存储器的程序而言通常是透明的。一些计算系统可能具有对虚拟存储器的硬件支持，诸如内置于中央处理单元(CPU)中的存储器管理单元(MMU)。一些计算系统还可以使用具有一个或多个级别的缓存系统来在相对更快速的高速缓冲存储器中存储有限量的主存储器地址的副本，以加速对存储器地址的重复访问。Some computing systems use a virtual memory scheme to manage the memory devices used by programs executed within the operating system. For example, using memory pages swapped in and out of a storage device, the operating system can handle the translation between the larger virtual address space and the smaller real address space of a memory device (also referred to as "main memory"), wherein the above-mentioned storage device is used as a backup storage with a larger storage capacity than the memory device. Therefore, the amount of working memory accessible to the program is not limited to the size of the main memory. In a virtual memory scheme, the round-trip movement of address pages in the working memory of the program between the memory device and the backup storage is usually transparent to the program using the working memory. Some computing systems may have hardware support for virtual memory, such as a memory management unit (MMU) built into a central processing unit (CPU). Some computing systems may also use a cache system with one or more levels to store a limited number of copies of main memory addresses in a relatively faster cache memory to speed up repeated access to memory addresses.

发明内容Summary of the Invention

在一个方案中，一般而言，一种计算系统，包括：存储器设备，用于提供工作存储器空间；存储设备，用于提供溢出存储空间；以及至少一个处理器，被配置为处理多个数据单元以产生结果信息。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中并且释放至少一些所述工作存储器空间，并为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第二集合中；将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。将与数据操作的结果相关联的所述信息存储在一个或多个数据结构的集合中包括，对于至少一个数据单元，执行改变一个或多个数据结构的所述集合中信息的操作而免于增大一个或多个数据结构的所述集合所使用的所述工作存储器空间的量。In one embodiment, a computing system includes a memory device for providing working memory space, a storage device for providing overflow storage space, and at least one processor configured to process a plurality of data units to generate result information. The processing includes performing a data operation on each data unit of a first subset of data units of the plurality of data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in the working memory space; after an overflow condition for the working memory space is satisfied, storing the information in the overflow storage space and freeing at least some of the working memory space, performing the data operation on each data unit of a second subset of data units of the plurality of data units and storing information associated with the result of the data operation in a second set of one or more data structures stored in the working memory space; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information. Storing the information associated with the result of the data operation in the set of one or more data structures includes, for at least one data unit, performing an operation that changes the information in the set of one or more data structures without increasing the amount of working memory space used by the set of one or more data structures.

这些方案可包括一个或多个以下特征。These aspects may include one or more of the following features.

如果一个或多个数据结构的所述第一集合所使用的所述工作存储器空间的所述量大于等于预定阈值，则所述工作存储器空间的所述溢出条件满足。The overflow condition of the working memory space is satisfied if the amount of the working memory space used by the first set of one or more data structures is greater than or equal to a predetermined threshold.

所述处理还包括，在所述溢出条件满足之后并且在为所述数据单元第二子集的每个数据单元执行所述数据操作之前，将一个或多个数据结构的所述第一集合存储在所述溢出存储空间中，并且从所述工作存储器空间移除一个或多个数据结构的所述第一集合。The processing also includes, after the overflow condition is met and before performing the data operation for each data unit of the second subset of data units, storing the first set of one or more data structures in the overflow storage space and removing the first set of one or more data structures from the working memory space.

将一个或多个数据结构的多个集合结合包括将所述第一集合的至少一个数据结构与所述第二集合的至少一个数据结构合并。Combining the multiple sets of one or more data structures includes merging at least one data structure of the first set with at least one data structure of the second set.

将所述第一集合的至少一个数据结构与所述第二集合的至少一个数据结构合并包括将一个或多个数据结构的所述第一集合的所述数据结构中的第一键与一个或多个数据结构的所述第二集合的所述数据结构中的第二键相匹配，并且对与所述第一键相关联的值和与所述第二键相关联的值执行聚合操作。Merging at least one data structure of the first set with at least one data structure of the second set includes matching a first key in the data structure of the first set of one or more data structures with a second key in the data structure of the second set of one or more data structures, and performing an aggregation operation on a value associated with the first key and a value associated with the second key.

所述处理还包括，在所述溢出条件满足之后并且在为所述数据单元第二子集的每个数据单元执行所述数据操作之前，为数据单元的所述第二子集的每个数据单元执行所述数据操作，为所述多个数据单元的数据单元第三子集的每个数据单元执行所述数据操作，并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中。The processing also includes, after the overflow condition is met and before performing the data operation for each data unit of the second subset of data units, performing the data operation for each data unit of a third subset of data units of the plurality of data units, and storing information associated with a result of the data operation in the first set of one or more data structures stored in the working memory space.

所述数据单元第二子集是所述数据单元第三子集的数据单元的子集。The second subset of data units is a subset of the data units of the third subset of data units.

所述处理还包括，在为所述数据单元第三子集的第一数据单元执行所述数据操作之后，确定将与该数据操作的结果相关联的信息存储在：(1)所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中，或者(2)所述溢出存储空间中。The processing also includes, after performing the data operation for the first data unit of the third subset of the data units, determining to store information associated with the result of the data operation in: (1) the first set of one or more data structures stored in the working memory space, or (2) the overflow storage space.

改变一个或多个数据结构的所述集合中的信息的所述操作包括就地存储器操作，所述就地存储器操作将所述工作存储器空间内一位置中存储的值覆写为不同值并存储在所述工作存储器空间内的相同位置中。The operation to change information in the set of one or more data structures includes an in-place memory operation that overwrites a value stored in a location within the working memory space with a different value and stores it in the same location within the working memory space.

将与所述数据操作的结果相关联的信息存储在所述溢出存储空间中包括将所述第一数据单元的至少一些内容存储在所述溢出存储空间中。Storing information associated with a result of the data operation in the overflow storage space includes storing at least some contents of the first data unit in the overflow storage space.

为所述第一数据单元执行所述数据操作包括将所述第一数据单元中的键与一个或多个数据结构的所述第一集合中的一个或多个键相比较，并且如果所述比较结果为匹配，则将与该数据操作的结果相关联的所述信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中，而如果所述比较结果为不匹配，则将与该数据操作的结果相关联的所述信息存储在所述溢出存储空间中。Performing the data operation for the first data unit includes comparing a key in the first data unit with one or more keys in the first set of one or more data structures, and if the comparison result is a match, storing the information associated with the result of the data operation in the first set of one or more data structures stored in the working memory space, and if the comparison result is a mismatch, storing the information associated with the result of the data operation in the overflow storage space.

所述处理还包括根据数据源产生所述多个数据单元，每个数据单元包括数据源的字段的标识符以及出现在所述数据源的记录内的该字段中的值。The processing also includes generating the plurality of data units based on a data source, each data unit including an identifier of a field of the data source and a value appearing in the field within a record of the data source.

所述数据操作包括使用所述数据单元中包括的所述值作为键来对多个数据单元的信息进行聚合，所述键用来选择信息经过聚合的匹配数据单元。The data operation includes aggregating information of a plurality of data units using the value included in the data unit as a key, the key being used to select a matching data unit whose information is aggregated.

所述存储器设备包括易失性存储器设备。The memory device includes a volatile memory device.

所述存储设备包括非易失性存储设备。The storage device includes a non-volatile storage device.

在另一个方案中，一般而言，一种计算系统，包括：用于提供工作存储器空间的装置；用于提供溢出存储空间的装置；以及用于处理多个数据单元以产生结果信息的装置。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中并且释放至少一些所述工作存储器空间，并为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第二集合中；将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。将与所述数据操作的结果相关联的所述信息存储在一个或多个数据结构的集合中包括，对于至少一个数据单元，执行改变一个或多个数据结构的所述集合中信息的操作而免于增大一个或多个数据结构的所述集合所使用的所述工作存储器空间的量。In another embodiment, in general, a computing system includes: means for providing working memory space; means for providing overflow storage space; and means for processing a plurality of data units to generate result information. The processing includes: performing a data operation for each data unit of a first subset of data units of the plurality of data units and storing information associated with the results of the data operation in a first set of one or more data structures stored in the working memory space; after an overflow condition for the working memory space is satisfied, storing the information in the overflow storage space and freeing at least some of the working memory space, and performing the data operation for each data unit of a second subset of data units of the plurality of data units and storing information associated with the results of the data operation in a second set of one or more data structures stored in the working memory space; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information. Storing the information associated with the results of the data operation in the set of one or more data structures includes, for at least one data unit, performing an operation that changes the information in the set of one or more data structures without increasing the amount of working memory space used by the set of one or more data structures.

在另一个方案中，一般而言，一种用于处理多个数据单元以产生结果信息的方法，包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在存储设备的溢出存储空间中并且释放至少一些所述工作存储器空间，并为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第二集合中；以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。将与所述数据操作的结果相关联的所述信息存储在一个或多个数据结构的集合中包括，对于至少一个数据单元，执行改变一个或多个数据结构的所述集合中信息的操作而免于增大一个或多个数据结构的所述集合所使用的所述工作存储器空间的量。In another embodiment, in general, a method for processing a plurality of data units to generate result information includes: performing a data operation for each data unit of a first subset of data units of the plurality of data units and storing information associated with the results of the data operation in a first set of one or more data structures stored in working memory space of a memory device; after an overflow condition of the working memory space is satisfied, storing the information in overflow storage space of the memory device and freeing at least some of the working memory space, performing the data operation for each data unit of a second subset of data units of the plurality of data units and storing information associated with the results of the data operation in a second set of one or more data structures stored in the working memory space; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information. Storing the information associated with the results of the data operation in the set of one or more data structures includes, for at least one data unit, performing an operation that changes the information in the set of one or more data structures without increasing the amount of the working memory space used by the set of one or more data structures.

在另一个方案中，一般而言，软件存储在计算机可读介质上用于处理多个数据单元以产生结果信息。所述软件包括指令用于使得计算系统执行：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在存储设备的溢出存储空间中并且释放至少一些所述工作存储器空间，并为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第二集合中；以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。将与所述数据操作的结果相关联的所述信息存储在一个或多个数据结构的集合中包括，对于至少一个数据单元，执行改变一个或多个数据结构的所述集合中信息的操作而免于增大一个或多个数据结构的所述集合所使用的所述工作存储器空间的量。In another embodiment, generally, software is stored on a computer-readable medium for processing a plurality of data units to generate result information. The software includes instructions for causing a computing system to: perform a data operation on each data unit of a first subset of data units of the plurality of data units and store information associated with the result of the data operation in a first set of one or more data structures stored in working memory space of a memory device; after an overflow condition of the working memory space is satisfied, store the information in overflow storage space of the memory device and free up at least some of the working memory space; perform the data operation on each data unit of a second subset of data units of the plurality of data units and store information associated with the result of the data operation in a second set of one or more data structures stored in the working memory space; and combine a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information. Storing the information associated with the result of the data operation in the set of one or more data structures includes, for at least one data unit, performing an operation that changes the information in the set of one or more data structures without increasing the amount of working memory space used by the set of one or more data structures.

在另一个方案中，一般而言，一种计算系统，包括：存储器设备，用于提供工作存储器空间；存储设备，用于提供溢出存储空间；以及至少一个处理器，被配置为处理多个数据单元以产生结果信息。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，为所述多个数据单元的数据单元的第二子集的每个数据单元执行所述数据操作，并且确定将与所述数据操作的结果相关联的信息存储在：(1)所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中，或者(2)所述溢出存储空间中。In another embodiment, in general, a computing system includes: a memory device for providing a working memory space; a storage device for providing an overflow storage space; and at least one processor configured to process a plurality of data units to generate result information. The processing includes: performing a data operation for each data unit of a first subset of data units of the plurality of data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in the working memory space; and after an overflow condition of the working memory space is satisfied, performing the data operation for each data unit of a second subset of data units of the plurality of data units and determining to store information associated with the result of the data operation in: (1) the first set of one or more data structures stored in the working memory space, or (2) the overflow storage space.

所述数据操作至少部分基于每个数据单元中的键值，并且所述确定包括在所述第一集合的至少一个所述数据结构中搜索至少一个键值以确定：(1)更新与所述工作存储器空间中一个或多个数据结构的所述第一集合的数据结构中的该键值相关联的信息，或者(2)存储与所述溢出存储空间中的该键值相关联的信息。The data operation is based at least in part on a key value in each data unit, and the determining includes searching for at least one key value in at least one of the data structures in the first set to determine: (1) updating information associated with the key value in a data structure of the first set of one or more data structures in the working memory space, or (2) storing information associated with the key value in the overflow storage space.

所述数据操作包括就地存储器操作，所述就地存储器操作将所述工作存储器空间内一位置中存储的值覆写为不同值并存储在所述工作存储器空间内的相同位置处。The data operation includes an in-place memory operation that overwrites a value stored in a location within the working memory space with a different value and stores it at the same location within the working memory space.

将与所述数据操作的结果相关联的信息存储在所述溢出存储空间中包括将执行了所述数据操作的数据单元的至少一些内容存储在所述溢出存储空间中。Storing information associated with a result of the data operation in the overflow storage space includes storing at least some contents of a data unit on which the data operation was performed in the overflow storage space.

为所述第一数据单元执行所述数据操作包括将所述第一数据单元中的键与一个或多个数据结构的所述第一集合中的一个或多个键相比较，并且如果所述比较结果为匹配，则将与该数据操作的结果相关联的所述信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中，并且如果所述比较结果为不匹配，则将与该数据操作的结果相关联的所述信息存储在所述溢出存储空间中。Performing the data operation for the first data unit includes comparing a key in the first data unit with one or more keys in the first set of one or more data structures, and if the comparison result is a match, storing the information associated with the result of the data operation in the first set of one or more data structures stored in the working memory space, and if the comparison result is a mismatch, storing the information associated with the result of the data operation in the overflow storage space.

产生所述多个数据单元包括为所述数据源的至少第一字段和所述数据源的至少第二字段产生数据单元。Generating the plurality of data units includes generating data units for at least a first field of the data source and at least a second field of the data source.

为所述第二子集的每个数据单元执行所述数据操作包括：将与在第一数据单元上执行的所述数据操作的结果相关联的信息存储在一个或多个数据结构的所述第一集合中，并且将与在第二数据单元上执行的所述数据操作的结果相关联的信息存储在所述溢出存储空间中。Performing the data operation for each data unit of the second subset includes storing information associated with a result of the data operation performed on the first data unit in the first set of one or more data structures, and storing information associated with a result of the data operation performed on the second data unit in the overflow storage space.

所述第一数据单元和所述第二数据单元对于所述数据源的相同字段分别包括各标识符。The first data unit and the second data unit respectively include identifiers for the same field of the data source.

在另一个方案中，一般而言，一种计算系统，包括：用于提供工作存储器空间的装置；用于提供溢出存储空间的装置；以及用于处理多个数据单元以产生结果信息的装置，所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；以及在所述工作存储器空间的溢出条件满足之后，为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，并且确定将与该数据操作的结果相关联的信息存储在：(1)所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中，或者(2)所述溢出存储空间中。In another embodiment, in general, a computing system includes: a device for providing a working memory space; a device for providing an overflow storage space; and a device for processing a plurality of data units to generate result information, the processing including: performing a data operation for each data unit of a first subset of data units of the plurality of data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in the working memory space; and after an overflow condition of the working memory space is satisfied, performing the data operation for each data unit of a second subset of data units of the plurality of data units and determining to store information associated with the result of the data operation in: (1) the first set of one or more data structures stored in the working memory space, or (2) the overflow storage space.

在另一个方案中，一般而言，一种用于处理多个数据单元以产生结果信息的方法，包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；以及在所述工作存储器空间的溢出条件满足之后，为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，并且确定是否将与该数据操作的结果相关联的信息存储在：(1)所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中，或者(2)存储设备的溢出存储空间中。In another embodiment, in general, a method for processing multiple data units to generate result information includes: performing a data operation for each data unit of a first subset of data units of the multiple data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in a working memory space of a memory device; and after an overflow condition of the working memory space is met, performing the data operation for each data unit of a second subset of data units of the multiple data units, and determining whether to store the information associated with the result of the data operation in: (1) the first set of one or more data structures stored in the working memory space, or (2) the overflow storage space of the storage device.

在另一个方案中，一般而言，软件存储在计算机可读介质上用于处理多个数据单元以产生结果信息。所述软件包括指令用于使得计算系统执行：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作并且将与该数据操作的结果相关联的信息存储在存储设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；以及在所述工作存储器空间的溢出条件满足之后，为所述多个数据单元的数据单元第二子集的每个数据单元执行所述数据操作，并且确定是否将与该数据操作的结果相关联的信息存储在：(1)所述工作存储器空间中存储的一个或多个数据结构的所述第一集合中，或者(2)存储设备的溢出存储空间中。In another embodiment, generally, software is stored on a computer-readable medium for processing a plurality of data units to generate result information. The software includes instructions for causing a computing system to: perform a data operation for each data unit of a first subset of data units of the plurality of data units and store information associated with the result of the data operation in a first set of one or more data structures stored in a working memory space of a storage device; and, after an overflow condition of the working memory space is satisfied, perform the data operation for each data unit of a second subset of data units of the plurality of data units and determine whether to store information associated with the result of the data operation in: (1) the first set of one or more data structures stored in the working memory space, or (2) an overflow storage space of the storage device.

在另一个方案中，一般而言，一种计算系统，包括：存储器设备，用于提供工作存储器空间；存储设备，用于提供溢出存储空间；以及至少一个处理器，被配置为处理多个数据单元以产生结果信息。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第一集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第一集合的该至少一个数据结构中的信息，而如果未找到该值，则将信息添加至所述第一集合的至少一个数据结构；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中并且释放至少一些所述工作存储器空间，并且对于所述多个数据单元的数据单元第二子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第二集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第二集合的该至少一个数据结构中的信息，以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。In another embodiment, in general, a computing system includes a memory device for providing working memory space, a storage device for providing overflow storage space, and at least one processor configured to process a plurality of data units to generate result information. The processing includes performing a data operation for each data unit of a first subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in the working memory space and, if the value is found, modifying information in the at least one data structure of the first set, and, if the value is not found, adding information to the at least one data structure of the first set; storing information in the overflow storage space and freeing at least some of the working memory space after an overflow condition of the working memory space is satisfied; performing a data operation for each data unit of a second subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a second set of one or more data structures stored in the working memory space and, if the value is found, modifying information in the at least one data structure of the second set; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information.

所述处理还包括，在所述溢出条件满足之后并且在为所述数据单元第二子集的每个数据单元执行所述搜索之前，将一个或多个数据结构的所述第一集合存储在所述溢出存储空间中，并且从所述工作存储器空间移除一个或多个数据结构的所述第一集合。The processing also includes, after the overflow condition is met and before performing the search for each data unit of the second subset of data units, storing the first set of one or more data structures in the overflow storage space and removing the first set of one or more data structures from the working memory space.

所述处理还包括，在所述溢出条件满足之后并且在为所述数据单元第二子集的每个数据单元执行所述搜索之前，对于所述多个数据单元的数据单元第三子集的每个数据单元，在所述工作存储器空间中存储的一个或多个数据结构的所述第一集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第一集合的该至少一个数据结构中的信息。The processing also includes, after the overflow condition is met and before performing the search for each data unit of the second subset of data units, searching for a value in a data unit within at least one data structure of the first set of one or more data structures stored in the working memory space for each data unit of a third subset of data units of the plurality of data units, and if the value is found, modifying information in the at least one data structure of the first set.

修改所述信息包括执行就地存储器操作，所述就地存储器操作将所述工作存储器空间内一位置处存储的值覆写为不同值存储在所述工作存储器空间内相同位置处。Modifying the information includes performing an in-place memory operation that overwrites a value stored at a location in the working memory space with a different value stored at the same location in the working memory space.

一个或多个数据结构的所述第一集合包括键值配对条目的多个关联阵列。The first set of one or more data structures includes a plurality of associative arrays of key-value paired entries.

在所述工作存储器空间中存储的一个或多个数据结构的第一集合的至少一个数据结构内的所述数据单元搜索所述值包括在选定的其中一个键值配对条目关联阵列内搜索作为一条目的键的所述值。Searching the data unit within at least one data structure of the first set of one or more data structures stored in the working memory space for the value includes searching for the value as a key of an entry within a selected one of the key-value pair entry associative arrays.

所述选定的其中一个键值配对条目关联阵列对应于所述数据单元中的所述标识符。The selected one of the key-value pair entry associative arrays corresponds to the identifier in the data unit.

修改所述第一集合的至少一个数据结构中的信息包括增加找到的所述键值配对条目的所述值。Modifying information in at least one data structure of the first set includes increasing the value of the found key-value pair entry.

添加信息至所述第一集合的至少一个数据结构包括添加新的键值配对条目至所述选定的阵列，使得该条目的键为所述数据单元中的所述值且该条目的值为计数1。Adding information to at least one data structure of the first set includes adding a new key-value pair entry to the selected array such that the key of the entry is the value in the data unit and the value of the entry is a count of one.

在另一个方案中，一般而言，一种计算系统，包括：用于提供工作存储器空间的装置；用于提供溢出存储空间的装置；以及用于处理多个数据单元以产生结果信息的装置。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第一集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第一集合的该至少一个数据结构中的信息，而如果未找到该值，则将信息添加至所述第一集合的至少一个数据结构；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中并且释放至少一些所述工作存储器空间，并且对于所述多个数据单元的数据单元第二子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第二集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第二集合的该至少一个数据结构中的信息，以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。In another embodiment, in general, a computing system includes: means for providing working memory space; means for providing overflow storage space; and means for processing a plurality of data units to generate result information. The processing includes: performing a data operation for each data unit of a first subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in the working memory space and, if the value is found, modifying information in the at least one data structure of the first set, and, if the value is not found, adding information to the at least one data structure of the first set; storing information in the overflow storage space and freeing at least some of the working memory space after an overflow condition of the working memory space is satisfied; performing a data operation for each data unit of a second subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a second set of one or more data structures stored in the working memory space and, if the value is found, modifying information in the at least one data structure of the second set; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information.

在另一个方案中，一般而言，一种用于处理多个数据单元以产生结果信息的方法，包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，该数据操作包括在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第一集合的该至少一个数据结构中的信息，而如果未找到该值，则将信息添加至所述第一集合的至少一个数据结构；在所述工作存储器空间的溢出条件满足之后，将信息存储在存储设备的溢出存储空间中并且释放至少一些所述工作存储器空间，并且对于所述多个数据单元的数据单元第二子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第二集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第二集合的该至少一个数据结构中的信息，以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。In another embodiment, in general, a method for processing multiple data units to generate result information includes: performing a data operation for each data unit of a first subset of data units of the multiple data units, the data operation including searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in a working memory space of a memory device, and if the value is found, modifying information in the at least one data structure of the first set, and if the value is not found, adding information to the at least one data structure of the first set; after an overflow condition of the working memory space is met, storing information in an overflow storage space of a storage device and releasing at least some of the working memory space, and performing a data operation for each data unit of a second subset of data units of the multiple data units, the data operation including searching for a value in a data unit within at least one data structure of a second set of one or more data structures stored in the working memory space, and if the value is found, modifying information in the at least one data structure of the second set, and combining multiple sets of one or more data structures, including the first set and the second set, to generate the result information.

在另一个方案中，一般而言，软件存储在计算机可读介质上用于处理多个数据单元以产生结果信息。所述软件包括指令用于使得计算系统执行：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，该数据操作包括在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第一集合的该至少一个数据结构中的信息，而如果未找到该值，则将信息添加至所述第一集合的至少一个数据结构；在所述工作存储器空间的溢出条件满足之后，将信息存储在存储设备的溢出存储空间中并且释放至少一些所述工作存储器空间，并且对于所述多个数据单元的数据单元第二子集的每个数据单元执行数据操作，该数据操作包括在所述工作存储器空间中存储的一个或多个数据结构的第二集合的至少一个数据结构内的数据单元中搜索一值，并且如果找到该值，则修改所述第二集合的至少一个数据结构中的信息，以及将一个或多个数据结构的多个集合结合，包括所述第一集合和第二集合，以产生所述结果信息。In another embodiment, generally, software is stored on a computer-readable medium for processing a plurality of data units to generate result information. The software includes instructions for causing a computing system to: perform a data operation for each data unit of a first subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in working memory space of a memory device, and if the value is found, modifying information in the at least one data structure of the first set, and if the value is not found, adding information to the at least one data structure of the first set; storing information in overflow storage space of the memory device and freeing at least some of the working memory space after an overflow condition of the working memory space is satisfied; performing a data operation for each data unit of a second subset of data units of the plurality of data units, the data operation comprising searching for a value in a data unit within at least one data structure of a second set of one or more data structures stored in the working memory space, and if the value is found, modifying information in the at least one data structure of the second set; and combining a plurality of sets of one or more data structures, including the first set and the second set, to generate the result information.

在另一个方案中，一般而言，一种计算系统，包括：存储器设备，用于提供工作存储器空间；存储设备，用于提供溢出存储空间；以及至少一个处理器，被配置为处理多个数据单元以产生结果信息。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中；以及在所述处理所述多个数据单元期间重复多次溢出处理过程，所述溢出处理过程包括：使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合。In another aspect, in general, a computing system includes: a memory device for providing a working memory space; a storage device for providing an overflow storage space; and at least one processor configured to process a plurality of data units to generate result information. The processing includes performing a data operation for each data unit of a first subset of the data units of the plurality of data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in the working memory space; storing the information in the overflow storage space after an overflow condition of the working memory space is satisfied; and repeating a plurality of overflow processing steps during the processing of the plurality of data units, the overflow processing including updating a new set of one or more data structures stored in the working memory space using at least some of the information stored in the overflow storage space.

所述处理还包括，在所述溢出条件满足之后并且在为所述数据单元第二子集的每个数据单元执行所述数据操作之前，将一个或多个数据结构的所述第一集合存储在所述溢出存储空间中作为被移动集合，并且从所述工作存储器空间移除一个或多个数据结构的所述第一集合。The processing also includes, after the overflow condition is met and before performing the data operation for each data unit of the second subset of data units, storing the first set of one or more data structures in the overflow storage space as a moved set, and removing the first set of one or more data structures from the working memory space.

使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合包括将所述溢出存储空间中存储的一个或多个数据结构的所述被移动集合的至少一个数据结构的信息与所述工作存储器空间中存储的一个或多个数据结构的所述新集合的至少一个数据结构的信息合并。Using at least some of the information stored in the overflow storage space to update a new set of one or more data structures stored in the working memory space includes merging information of at least one data structure of the moved set of one or more data structures stored in the overflow storage space with information of at least one data structure of the new set of one or more data structures stored in the working memory space.

所述合并包括将一个或多个数据结构的所述被移动集合的所述数据结构中的第一键与一个或多个数据结构的所述新集合的所述数据结构中的第二键相匹配，并且对与所述第一键相关联的值和与所述第二键相关联的值执行聚合操作。The merging includes matching a first key in the data structure of the moved set of one or more data structures with a second key in the data structure of the new set of one or more data structures, and performing an aggregation operation on a value associated with the first key and a value associated with the second key.

使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合包括将所述溢出存储空间中存储的数据单元中的第一键与一个或多个数据结构的所述新集合的数据结构中的第二键相匹配，并且增加与所述第二键相关联的值。Using at least some of the information stored in the overflow storage space to update a new set of one or more data structures stored in the working memory space includes matching a first key in a data unit stored in the overflow storage space with a second key in a data structure of the new set of one or more data structures, and increasing a value associated with the second key.

使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合包括执行就地存储器操作，所述就地存储器操作将所述工作存储器空间内一位置处存储的值覆写为不同值存储在所述工作存储器空间内相同位置处。Using at least some of the information stored in the overflow storage space to update a new set of one or more data structures stored in the working memory space includes performing an in-place memory operation that overwrites a value stored at a location in the working memory space with a different value stored at the same location in the working memory space.

第一数据单元的所述数据操作包括使用所述第一数据单元中的值作为键以在选定的其中一个键值配对条目关联阵列内搜索。The data operation of the first data unit includes using a value in the first data unit as a key to search within a selected one of an associative array of key-value pair entries.

所述选定的其中一个键值配对条目关联阵列对应于所述第一数据单元中的所述标识符。The selected one of the key-value pair entry associative arrays corresponds to the identifier in the first data unit.

在另一个方案中，一般而言，一种计算系统，包括：用于提供工作存储器空间的装置；用于提供溢出存储空间的装置；以及用于处理多个数据单元以产生结果信息装置。所述处理包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，并且将与该数据操作的结果相关联的信息存储在所述工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在所述溢出存储空间中；以及在所述处理所述多个数据单元期间重复多次溢出处理过程，所述溢出处理过程包括：使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合。In another aspect, in general, a computing system includes: means for providing a working memory space; means for providing an overflow storage space; and means for processing a plurality of data units to generate result information. The processing includes performing a data operation for each data unit of a first subset of the data units of the plurality of data units and storing information associated with the result of the data operation in a first set of one or more data structures stored in the working memory space; storing the information in the overflow storage space after an overflow condition of the working memory space is satisfied; and repeating a plurality of overflow processing steps during the processing of the plurality of data units, the overflow processing including updating a new set of one or more data structures stored in the working memory space using at least some of the information stored in the overflow storage space.

在另一个方案中，一般而言，一种用于处理多个数据单元以产生结果信息的方法，包括：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，并且将与该数据操作的结果相关联的信息存储在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在储存设备的溢出存储空间中；以及在所述处理所述多个数据单元期间重复多次溢出处理过程，所述溢出处理过程包括：使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合。In another embodiment, in general, a method for processing multiple data units to generate result information includes: performing a data operation for each data unit of a first subset of data units of the multiple data units, and storing information associated with the result of the data operation in a first set of one or more data structures stored in a working memory space of a memory device; storing the information in an overflow storage space of a storage device after an overflow condition of the working memory space is met; and repeating an overflow processing process multiple times during the processing of the multiple data units, the overflow processing process including: using at least some of the information stored in the overflow storage space to update a new set of one or more data structures stored in the working memory space.

在另一个方案中，一般而言，软件被存储在计算机可读介质上用于处理多个数据单元以产生结果信息。所述软件包括指令用于使得计算系统执行：为多个数据单元的数据单元第一子集的每个数据单元执行数据操作，并且将与该数据操作的结果相关联的信息存储在存储器设备的工作存储器空间中存储的一个或多个数据结构的第一集合中；在所述工作存储器空间的溢出条件满足之后，将信息存储在存储设备的溢出存储空间中；以及在所述处理所述多个数据单元期间重复多次溢出处理过程，所述溢出处理过程包括：使用所述溢出存储空间中存储的至少一些信息来更新所述工作存储器空间中存储的一个或多个数据结构的新集合。In another embodiment, generally, software is stored on a computer-readable medium for processing a plurality of data units to generate result information. The software includes instructions for causing a computing system to: perform a data operation for each data unit of a first subset of the plurality of data units, and store information associated with the result of the data operation in a first set of one or more data structures stored in a working memory space of a memory device; store the information in an overflow storage space of the memory device after an overflow condition of the working memory space is satisfied; and repeat a plurality of overflow handling procedures during the processing of the plurality of data units, the overflow handling procedure including updating a new set of one or more data structures stored in the working memory space using at least some of the information stored in the overflow storage space.

这些方案可包括一个或多个以下优点。These aspects may include one or more of the following advantages.

一些计算系统(例如，一些数据库管理系统)不仅仅依靠虚拟存储器来管理工作存储器，而是能够直接控制将正在被处理的数据存储在存储设备的主存储器中还是溢出存储空间中。例如，一些系统对程序可用的工作存储器大小施加明确限制，由此工作存储器限制小于主存储器大小(例如，因为该程序可以与其他程序共享工作存储器)。如果程序接近工作存储器限制，则程序具有使用“溢出至磁盘(spill to disk)”技术的选项以将一些数据临时地存储在溢出存储空间内，并且在可得到足够的工作存储器之后，稍后结束处理该数据。在一些情况下，溢出数据到磁盘，或者依靠操作系统来交换存储器页面会大大影响数据操作的性能。Some computing systems (e.g., some database management systems) do not rely solely on virtual memory to manage working memory, but rather are able to directly control whether the data being processed is stored in the main memory of a storage device or in overflow storage space. For example, some systems impose explicit limits on the amount of working memory available to a program, whereby the working memory limit is smaller than the main memory size (e.g., because the program can share working memory with other programs). If a program approaches its working memory limit, the program has the option of using a "spill to disk" technique to temporarily store some data in overflow storage space, and later terminate processing of the data after sufficient working memory is available. In some cases, spilling data to disk, or relying on the operating system to swap memory pages, can significantly impact the performance of data operations.

对于数据处理系统的一些应用程序，诸如以下更详细描述的数据剖析，如果将对潜在的大量数据(例如，大的数据集和/或大量数据集)执行数据操作，则系统应该被配置为以有效方式来管理工作存储器和溢出储存空间，以确保数据处理应用程序可以提供足够的性能。管理工作存储器和溢出储存空间的一种方法是基于以下认知，对于一些数据操作，取代将输入数据溢出至磁盘而不处理该数据，该系统可以对该输入数据至少部分地执行数据操作，并且在一些情况下，无需将该输入数据溢出至磁盘。For some applications of a data processing system, such as data profiling, which is described in more detail below, if data operations are to be performed on potentially large amounts of data (e.g., large data sets and/or a large number of data sets), the system should be configured to manage working memory and overflow storage in an efficient manner to ensure that the data processing application can provide adequate performance. One approach to managing working memory and overflow storage is based on the recognition that for some data operations, instead of overflowing input data to disk without processing the data, the system can perform the data operation at least partially on the input data, and in some cases, without overflowing the input data to disk.

例如，一些数据操作处理每个与键值相关联的输入记录的流，并且对于键值与之前键值匹配的记录，数据操作更新被存储器的数据结构中存储的结果。在一些实现方式中，使用“溢出处理”过程，即使在已经达到工作存储器限制之后，本文描述的计算系统能够保持处理带有使用任意键的新记录。描述溢出处理过程的两个具体示例。两个过程使得待匹配和使用的一些记录能够更新结果数据结构，而不使那些记录溢出。对于匹配的记录，结果数据结构可以被就地更新，而不使用更多的存储器。一个溢出处理过程通过仅仅将结果数据结构移动至溢出存储并且继续处理所有的新记录(在待与被移动结果数据结构合并的新结果数据结构中)来处理溢出，并且其他溢出处理过程通过仅仅将不匹配记录移动至溢出存储并且继续处理所有的新记录(在相同的结果数据结构中)来处理溢出。For example, some data operations process a stream of input records associated with each key value, and for records whose key values match the previous key values, the data operations update the results stored in the data structure of the memory. In some implementations, an "overflow handling" process is used so that the computing system described herein can keep processing new records with any key used even after the working memory limit has been reached. Two specific examples of overflow handling processes are described. The two processes enable some records to be matched and used to update the result data structure without overflowing those records. For matching records, the result data structure can be updated in place without using more memory. One overflow handling process handles overflow by simply moving the result data structure to overflow storage and continuing to process all new records (in a new result data structure to be merged with the moved result data structure), and the other overflow handling process handles overflow by simply moving unmatched records to overflow storage and continuing to process all new records (in the same result data structure).

使用本文描述的技术可以执行的数据操作的一个示例是普查操作，其用于在大数据集内产生数据值(包括出现在数据集中的值和每个值的计数)的普查，用于剖析数据集。数据剖析操作可以包括对数据(在执行数据剖析过程中来处理该数据)执行的任意操作，诸如普查操作。还可以对在其他环境中(诸如在随着时间追踪数据特性的数据质量系统中)处理的数据执行普查操作。可以应用该技术的其他数据操作包括允许合并不完整结果的数据操作、以及在存储器中的数据结构内可以就地处理至少一些情形的数据操作，如下更详细描述的。技术以被用于处理诸如本文描述的标准化记录的数据单元、或者表示数据流内单独数据部分的任何其它数据单元。An example of a data operation that can be performed using the techniques described herein is a census operation, which is used to generate a census of data values (including the values that appear in the data set and the count of each value) within a large data set for the purpose of profiling the data set. Data profiling operations can include any operation performed on data (processed in the process of performing data profiling), such as a census operation. Census operations can also be performed on data processed in other environments (such as in a data quality system that tracks data characteristics over time). Other data operations to which the technology can be applied include data operations that allow for the merging of incomplete results, and data operations that can handle at least some situations in situ within a data structure in memory, as described in more detail below. The technology can be used to process data units such as the standardized records described herein, or any other data units that represent separate data portions within a data stream.

通过以下说明书和权利要求书，本发明的其它特征和优点将变得显而易见。Other features and advantages of the invention will become apparent from the following description and from the claims.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是数据处理系统的方框图。FIG1 is a block diagram of a data processing system.

图2是数据剖析过程的示意图。Figure 2 is a schematic diagram of the data analysis process.

图3和图5是普查产生过程的示意图。Figures 3 and 5 are schematic diagrams of the census generation process.

图4A-图4C和图6是普查产生过程的流程图。4A-4C and 6 are flow charts of the census generation process.

具体实现方式Specific implementation method

图1示出数据处理系统100的示例，其中可以使用管理工作存储器和溢出存储空间的技术。数据处理系统100包括数据源102，数据源102可以包括数据的一个或多个来源，诸如存储设备或到在线数据流的连接，每个这样的来源可以存储或提供任意各种格式的数据(例如，数据库表、电子表格文件、平面文本文件或主机使用的本机格式)。数据处理系统100包括计算系统104，计算系统104包括至少一个处理器106、耦合至处理器106的至少一个存储器设备108(例如易失性存储器，诸如动态随机存取存储器)以及耦合至处理器106的至少一个存储设备110(例如非易失性存储，诸如磁硬盘驱动器)。在计算系统100处理数据源102的数据之后，可以在用户界面(UI)112中提供该处理的结果，包括向用户自动地提供关于普遍存在于数据源102中的条件、或者关于将普遍存在于待接收和处理数据源102的数据的目的地中的条件的可见指示。FIG1 illustrates an example of a data processing system 100 in which techniques for managing working memory and overflow storage space may be used. Data processing system 100 includes a data source 102, which may include one or more sources of data, such as storage devices or connections to online data streams. Each such source may store or provide data in any of a variety of formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a host computer). Data processing system 100 includes a computing system 104, which includes at least one processor 106, at least one memory device 108 (e.g., volatile memory, such as dynamic random access memory) coupled to processor 106, and at least one storage device 110 (e.g., non-volatile storage, such as a magnetic hard drive) coupled to processor 106. After computing system 100 processes data from data source 102, the results of the processing may be provided in a user interface (UI) 112, including automatically providing a user with visual indications of conditions prevalent in data source 102 or conditions that will be prevalent in a destination that is to receive and process data from data source 102.

数据处理系统100的不同组件的多种配置是可能的。处理器106可以被配置为托管例如由合适的操作系统(诸如，UNIX操作系统版本)控制的执行环境。在一些实现方式中，计算系统104可以是多节点并行计算环境的一部分，包括多个中央处理器(CPU)或处理器内核的配置，这些CPU或处理器内核可以是本地(例如多处理器系统，诸如对称多处理(symmetric multi-processing,SMP)计算机)，或本地分布式(例如多个处理器耦合为集群或大规模并行处理(MPP)系统)，或者远程或远程分布式(例如通过局域网(LAN)和/或广域网(WAN)来耦合的多个处理器)，或者上述的任意组合。提供数据源102的存储设备可能是计算系统104本地的存储设备，例如，存储在被连接到计算系统104(包括存储设备110)的存储介质上；或者可能是计算系统104远程的存储设备，例如，被托管在通过远程连接(例如，由云计算基础架构提供)与计算系统104进行通讯的远程系统(例如，主机)上。A variety of configurations of the different components of data processing system 100 are possible. Processor 106 can be configured to host an execution environment controlled, for example, by a suitable operating system (such as a version of the UNIX operating system). In some implementations, computing system 104 can be part of a multi-node parallel computing environment, including a configuration of multiple central processing units (CPUs) or processor cores, which can be local (e.g., a multiprocessor system such as a symmetric multi-processing (SMP) computer), locally distributed (e.g., multiple processors coupled as a cluster or massively parallel processing (MPP) system), remote or remotely distributed (e.g., multiple processors coupled via a local area network (LAN) and/or a wide area network (WAN)), or any combination thereof. The storage device providing data source 102 may be a storage device local to computing system 104, for example, stored on a storage medium connected to computing system 104 (including storage device 110); or it may be a storage device remote from computing system 104, for example, hosted on a remote system (e.g., a host) that communicates with computing system 104 via a remote connection (e.g., provided by a cloud computing infrastructure).

在一些实现方式中，计算系统104被配置为执行作为数据流图的应用程序，数据流图包括由节点之间的直接链路(表示工作元素的流，即，数据)连接的节点(表示处理组件或数据集)。例如，在美国公开号2007/0011668、标题为“Managing Parameters for Graph-Based Applications(管理基于图形应用程序的参数)”中将更详细地描述这种环境，其通过引用合并于此。美国专利5,966,072、标题为“EXECUTING COMPUTATIONS EXPRESSED ASGRAPHS(执行被表示为图表的计算)”中描述一种用于执行这种基于图表计算的系统，其通过引用合并于此。根据这个系统制成的数据流图提供用于将信息带入和带出由图表组件表示的各个进程、用于移动进程之间的信息、并且用于限定进程的运行顺序的方法。这个系统包括从任何可得方法中选择进程间通信方法的算法(例如，通信路径视图表链路可以使用TCP/IP或UNIX域套接字(domain sockets)，或者使用共享的存储器以在进程之间传递数据)。In some implementations, the computing system 104 is configured to execute an application as a dataflow graph, which includes nodes (representing processing components or data sets) connected by direct links between the nodes (representing flows of work elements, i.e., data). For example, such an environment is described in more detail in U.S. Publication No. 2007/0011668, entitled "Managing Parameters for Graph-Based Applications," which is incorporated herein by reference. A system for executing such graph-based computations is described in U.S. Patent No. 5,966,072, entitled "Executing Computations Expressed As Graphs," which is incorporated herein by reference. Dataflow graphs created according to this system provide methods for bringing information into and out of various processes represented by graph components, for moving information between processes, and for defining the order in which processes are run. This system includes algorithms that select an inter-process communication method from any available methods (for example, the communication path view table links can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between processes).

计算系统104可以从可实施数据源102的各种类型系统接收数据，包括不同形式的数据库系统。数据可以被组织为表示一些记录的数据集，该记录具有各个字段的值(也称为“属性”或“列”)，可能包括空值。当第一次从数据源读取数据时，计算系统104通常从关于该数据源的记录的一些初始格式信息开始。在一些情况下，数据源的记录结构最初可能是未知的，代替地，可以在分析数据源或数据之后确定。关于记录的初始信息可以包括，例如，表示独特值的位的数目、记录内字段的顺序、以及位所表示的值的类型(例如，字符串、有符号/无符号整数)。The computing system 104 can receive data from various types of systems that can implement the data source 102, including different forms of database systems. The data can be organized into data sets representing a number of records, with values for various fields (also called "attributes" or "columns"), which may include null values. When first reading data from a data source, the computing system 104 typically starts with some initial formatting information about the records of the data source. In some cases, the record structure of the data source may not be initially known, but instead may be determined after analyzing the data source or the data. The initial information about the record may include, for example, the number of bits representing a unique value, the order of the fields within the record, and the type of value represented by the bits (e.g., string, signed/unsigned integer).

数据处理系统100能够对数据源102内的数据执行的一类处理的一个示例是数据剖析(data profiling)。存储的数据集可以包括其各种特性预先未知的数据。例如，数据集的值或典型值的范围、数据集内不同字段之间的关系、或者不同字段中值之间的函数依赖可能是未知的。数据剖析可以涉及检查许多潜在相关数据集以便确定这种特性。计算系统104还可以执行各种任务，诸如清洗数据源102中的数据或者管理存储在数据源102中数据集的元数据。在计算系统104被配置为执行作为数据流图的应用程序的实现方式中，可以由例如数据流图中的剖析器组件节点来执行数据剖析，该剖析器组件节点具有由数据流链路连接至输入数据集的输入端口和由数据流链路连接至下游组件(其被配置为使用数据剖析的结果来执行任务)的输出端口。An example of a type of processing that the data processing system 100 can perform on data within the data source 102 is data profiling. Stored data sets may include data whose various characteristics are unknown in advance. For example, the range of values or typical values of the data set, the relationship between different fields in the data set, or the functional dependencies between values in different fields may be unknown. Data profiling can involve examining many potentially related data sets to determine such characteristics. The computing system 104 can also perform various tasks, such as cleaning the data in the data source 102 or managing metadata for the data sets stored in the data source 102. In an implementation in which the computing system 104 is configured to execute an application as a dataflow graph, data profiling can be performed, for example, by a profiler component node in the dataflow graph, the profiler component node having an input port connected by a dataflow link to the input data set and an output port connected by a dataflow link to a downstream component (which is configured to use the results of the data profiling to perform a task).

当执行数据剖析时，计算系统104从数据源102读取数据并且存储剖析概要信息，剖析概要信息可以被用于执行各种类型的分析以描绘不同数据集和不同数据集内不同字段的特性。在一些实现方式中，剖析概要信息包括在特定字段(例如，选定数据集的选定字段、或者所有数据集的所有字段)内出现的值的普查(census)。该普查列出了字段内的所有独特值并且量化了每个独特值出现的次数。在一些实现方式中，普查数据被存储在可选地通过字段来索引的单个数据结构中，而在其他实现方式中，普查数据被存储在多个数据结构中，例如，每个字段被存储在一个数据结构中。When performing data profiling, the computing system 104 reads data from the data source 102 and stores profile information that can be used to perform various types of analysis to characterize different data sets and different fields within different data sets. In some implementations, the profile information includes a census of values that appear in a particular field (e.g., selected fields of a selected data set, or all fields of all data sets). The census lists all unique values within the field and quantifies the number of times each unique value appears. In some implementations, the census data is stored in a single data structure that is optionally indexed by field, while in other implementations, the census data is stored in multiple data structures, for example, one data structure for each field.

被剖析的特定字段的普查数据可以被组织为条目的列表，每个条目包括：字段的标识符、出现在字段内的值、以及该值在该字段中出现的记录数目的计数。每个独特值具有一个条目，所以条目中的每个值与其他条目中的值不同，并且条目的数目等于字段内出现的独特值的数目。字段的标识符可以是唯一地识别被剖析字段的任意值。例如，通过为每个字段分配范围为从1到被剖析字段的数目的整数索引，可以来枚举被剖析的字段。这种索引可以被紧凑地存储在普查数据结构内。即使不同字段的普查数据被存储在单独的数据结构中，将该字段的特定字段标识符包括在数据结构的每个条目内可能仍然是有用的(例如，以区分条目和流入处理模块的不同数据结构)。可替代地，在一些实现方式中，如果不同字段的普查数据被存储在单独的数据结构中，则字段仅需要被该数据结构存储一次，并且每个条目与该字段隐性相关联并且仅包括值和计数。Census data for a particular field being profiled can be organized as a list of entries, each entry consisting of: a field identifier, a value that occurs within the field, and a count of the number of records in which that value occurs within the field. Each unique value has one entry, so each value in an entry is distinct from the values in other entries, and the number of entries is equal to the number of unique values that occur within the field. A field identifier can be any value that uniquely identifies the profiled field. For example, the profiled fields can be enumerated by assigning each field an integer index ranging from 1 to the number of profiled fields. This index can be compactly stored within the census data structure. Even if census data for different fields is stored in separate data structures, it may still be useful to include a specific field identifier for that field in each entry in the data structure (e.g., to distinguish entries from different data structures flowing into the processing module). Alternatively, in some implementations, if census data for different fields is stored in separate data structures, the fields need only be stored once by the data structure, and each entry is implicitly associated with the field and includes only the value and count.

图2示出基于普查的数据剖析过程的示例，该数据剖析过程由在计算系统104上运行的程序来执行，该计算系统104包括：标准化模块200，用于产生标准化记录的流208；普查产生模块202，用于将标准化记录的流208处理成普查文件210；以及普查处理模块204，用于分析普查文件210来计算剖析结果。标准化模块200读取待剖析的一个或多个数据集，诸如表206。表206具有三个字段，被命名为字段1、字段2和字段3，并且表206中的前几个数据记录(前三行)示出了三个字段中每个的各个值。标准化模块200通过将特定数据记录分割为每个包括字段索引和数据值的一系列标准化记录来产生标准化记录。字段索引是被分配给特定字段以唯一地(且有效地)标识该字段的索引值(例如，1＝字段1、2＝字段2、3＝字段3)，数据值是被包含在该字段的数据记录中的对应值。在这个示例中，表206中的第一数据记录将分别在三个标准化记录内产生如下(字段索引，数据值)对：(1，A)、(2，M)和(3，X)。普查产生模块202将流208中标准化记录的数据值聚合，以制作普查文件210。(在图2中，普查文件210的条目中示出的值对应于表206中的前三个数据记录，其随着来自表206中额外数据记录的标准化记录被普查产生模块202处理而进行更新。)FIG2 illustrates an example of a census-based data profiling process performed by a program running on computing system 104, including a standardization module 200 for generating a stream 208 of standardized records; a census generation module 202 for processing stream 208 of standardized records into a census file 210; and a census processing module 204 for analyzing census file 210 to compute profiling results. Standardization module 200 reads one or more data sets to be profiled, such as table 206. Table 206 has three fields, designated Field 1, Field 2, and Field 3, and the first few data records (the first three rows) in table 206 show the values for each of the three fields. Standardization module 200 generates standardized records by splitting a particular data record into a series of standardized records, each including a field index and a data value. A field index is an index value assigned to a particular field to uniquely (and efficiently) identify that field (e.g., 1 = Field 1, 2 = Field 2, 3 = Field 3), and a data value is the corresponding value contained in the data record for that field. In this example, the first data record in table 206 will generate the following (field index, data value) pairs within the three standardized records: (1, A), (2, M), and (3, X), respectively. Census generation module 202 aggregates the data values of the standardized records in stream 208 to produce census file 210. (In FIG. 2 , the values shown in the entries of census file 210 correspond to the first three data records in table 206, which are updated as standardized records from additional data records in table 206 are processed by census generation module 202.)

对于特定数据集，标准化记录可以以任意顺序被插入到流208。在这个示例中，随着数据记录出现在表206中，流208包括特定数据记录的所有标准化记录，随后是下一个数据记录的所有标准化记录。可替代地，表206可以被字段处理，使得，随着数据记录出现在表206中，该流包括特定字段的所有标准化记录，随后是下一个字段的所有标准化记录。还可以以这种方式来标准化更高维数的数据集合，例如基于对于读取数据集或者根据生成的流来产生普查文件将是最有效的顺序将标准化记录添加至输出流。在产生所有的标准化记录之后，标准化记录的流208可以被写入将由下游普查产生模块202处理的文件，或者当产生标准化记录的流208时，可以将其提供给下游普查产生模块202(例如，以利用生成的流水线并行(pipeline parallelism))。For a particular data set, normalized records can be inserted into stream 208 in any order. In this example, as a data record appears in table 206, stream 208 includes all normalized records for the particular data record, followed by all normalized records for the next data record. Alternatively, table 206 can be processed by fields such that, as a data record appears in table 206, the stream includes all normalized records for a particular field, followed by all normalized records for the next field. Higher-dimensional data sets can also be normalized in this manner, for example, by adding normalized records to the output stream in the order that is most efficient for reading the data set or generating a census file based on the generated stream. After all normalized records are generated, stream 208 of normalized records can be written to a file to be processed by downstream census generation module 202, or, as the stream 208 of normalized records is generated, it can be provided to downstream census generation module 202 (e.g., to exploit generated pipeline parallelism).

普查产生模块202处理标准化记录，直到到达流208的结尾(例如，由结尾流记录所示的)。模块202对标准化记录执行一类普查操作(被称为“普查匹配操作”)，以确定标准化记录中的数据值是否与之前处理的标准化记录的之前数据值相匹配。模块202为流208内的每个标准化记录执行至少一次普查匹配操作。模块202将与普查匹配操作的结果相关联的信息存储在至少一个数据结构中，该数据结构被存储在存储器设备108的工作存储器空间中。数据结构所使用的工作存储器空间包括用于数据结构任意开销(overhead)的存储器和数据结构中所有信息的存储器，包括指针所引用的数据的任意存储器。如果普查匹配操作发现与之前数据值的匹配，则增加与该数据值相关联的被存储计数。否则，如果普查匹配操作没有发现与之前数据值的匹配，则将新的条目存储在数据结构中。Census generation module 202 processes standardized records until it reaches the end of stream 208 (e.g., as indicated by an end stream record). Module 202 performs a type of census operation (referred to as a "census match operation") on the standardized records to determine whether a data value in the standardized record matches a previous data value in a previously processed standardized record. Module 202 performs at least one census match operation for each standardized record in stream 208. Module 202 stores information associated with the results of the census match operation in at least one data structure, which is stored in working memory space of memory device 108. The working memory space used by the data structure includes memory for any overhead of the data structure and memory for all information in the data structure, including any memory for data referenced by pointers. If a census match operation finds a match with a previous data value, the stored count associated with the data value is incremented. Otherwise, if the census match operation does not find a match with a previous data value, a new entry is stored in the data structure.

例如，数据结构可以是能够存储键值(key-value)对的关联阵列，键值对具有用于查找阵列内关联值的唯一键。在这个示例中，键是来自标准化记录的数据值，并且该值是将会被增加至普查数据总计数的计数。当为具有特定数据值的标准化记录创建键值对时，由于其键不与存在于关联阵列中的任意键相匹配，计数从1开始，并且每次另一个标准化记录具有与现存键相匹配的数据值时，计数增加1。模块202为不同关联阵列内的不同字段(由每个标准化记录内的字段索引所确定)查找标准化记录的数据值，为被剖析的每个字段分配一个关联阵列。在一些实现方式中，预先已知被剖析字段的数目，并且在剖析过程开始时为每个字段分配空关联阵列(其仅仅使用最小量的存储空间)。For example, the data structure can be an associative array capable of storing key-value pairs, each with a unique key used to look up an associated value within the array. In this example, the key is a data value from a standardized record, and the value is a count that will be added to the total count of the census data. When a key-value pair is created for a standardized record with a particular data value, since its key does not match any key existing in the associative array, the count starts at 1, and each time another standardized record has a data value that matches the existing key, the count is increased by 1. Module 202 looks up the data value of the standardized record for different fields within different associative arrays (determined by the field index within each standardized record), allocating one associative array for each field being parsed. In some implementations, the number of fields to be parsed is known in advance, and at the beginning of the parsing process, an empty associative array (which uses only a minimum amount of storage space) is allocated for each field.

例如使用哈希表或其他数据结构(其提供有效的键查找和关联值修改)可以实施关联阵列。用作键值对的键的数据值可以存储数据值本身的副本或者存储指向被存储在工作存储器中不同位置(例如，被存储在标准化记录的副本中)的数据值的指针。关联阵列与被存储的标准化记录的数据值的副本、或者甚至与整个标准化记录本身一起可以被共同地看作存储普查匹配结果的数据结构。在标准化记录中数据值的指针被存储在关联阵列的实现方式中，仅仅包含特定键的第一标准化记录需要被存储在工作存储器中，而在普查匹配操作之后，可以将包含该特定键的后续标准化记录从工作存储器移除。For example, an associative array can be implemented using a hash table or other data structure that provides efficient key lookup and associated value modification. The data value used as the key of a key-value pair can store a copy of the data value itself or a pointer to a data value stored at a different location in the working memory (for example, stored in a copy of the standardized record). The associative array, together with the stored copies of the data values of the standardized record, or even the entire standardized record itself, can be viewed as a data structure for storing census matching results. In an implementation in which pointers to the data values in the standardized record are stored in the associative array, only the first standardized record containing a particular key needs to be stored in the working memory, and subsequent standardized records containing the particular key can be removed from the working memory after the census matching operation.

在以下示例中，被剖析的字段的这些关联阵列被称为“普查阵列”并且键值对被称为普查阵列内的“普查条目”。在数据剖析过程结束时，由普查产生模块202产生的普查阵列将存储在单独普查条目内表206中出现的所有独特数据值，以及在表206行内出现的数据值的次数的总计数，该普查阵列表示正在被剖析的数据记录。In the following examples, these associative arrays of profiled fields are referred to as "census arrays" and the key-value pairs are referred to as "census entries" within the census arrays. At the end of the data profiling process, the census arrays generated by the census generation module 202 will store all unique data values that appear in the table 206 within individual census entries, as well as a total count of the number of times that data value appears within the row of the table 206 that represents the data record being profiled.

对执行数据剖析过程的程序或该程序的一部分(例如，普查产生模块202)可以给予存储器限制，该存储器限制设定了在允许使用该程序的存储器设备108内工作存储器空间的最大量。程序可以使用工作存储器空间用于存储普查阵列，这可能需要大部分的容许工作存储器空间，以及用于存储其他临时值，这可能需要明显少于普查阵列的空间。当模块202确定可能存在不足的可用工作存储器空间以将额外条目添加至普查阵列时，或者当不再具有任何可用工作存储器空间来添加额外条目时(例如，由于添加的最后一个条目)，工作存储器的溢出条件满足。模块202可以通过测量普查阵列的存储器大小来做这个决定。存储器大小表示普查阵列所使用的工作存储器空间的量，其包括表示普查阵列的数据结构和这些数据结构中所有信息所使用的任何系统开销所占用的存储器总量(包括由普查阵列内的指针所引用的任何数据值或标准化记录)。然后模块202将这个存储器大小与存储器限制(或其他阈值)相比较。A program or portion of a program that performs a data profiling process (e.g., census generation module 202) can be assigned a memory limit that sets a maximum amount of working memory space within the memory device 108 that the program is allowed to use. The program can use the working memory space for storing the census array, which may require a large portion of the allowed working memory space, and for storing other temporary values, which may require significantly less space than the census array. A working memory overflow condition is met when module 202 determines that insufficient working memory space may be available to add additional entries to the census array, or when there is no longer any working memory space available to add additional entries (e.g., due to the last entry being added). Module 202 can make this determination by measuring the memory size of the census array. The memory size represents the amount of working memory space used by the census array, including the total amount of memory occupied by the data structures representing the census array and any system overhead used by all information in these data structures (including any data values or normalized records referenced by pointers within the census array). Module 202 then compares this memory size to the memory limit (or other threshold).

在一些实现方式中，程序设置溢出阈值来检测普查阵列的存储器大小何时接近存储器限制。例如，通过计算各个普查阵列的大小的总和可以直接地测量普查阵列的存储器大小，其中单个普查阵列的大小被测量为该普查阵列所占用的工作存储器空间内的位数目。可替代地，例如，通过计算工作存储器内剩下的可用空间的量而不直接测量普查阵列的存储器大小(例如，存储器地址的分配块所遗留的存储器地址的范围)，可以间接地测量普查阵列的存储器大小。在一些实现方式中，程序设置刚好低于存储器限制的溢出阈值，以为其他值保留一些空间。在一些实现方式中，溢出阈值可以等于存储器限制，例如，如果其他值所需的空间是可以忽略的和/或计算系统104没有强加严格的存储器限制，则允许在相对短的时间内少量地超过存储器限制。In some implementations, the program sets an overflow threshold to detect when the memory size of the census array approaches the memory limit. For example, the memory size of the census array can be measured directly by summing the sizes of the individual census arrays, where the size of an individual census array is measured as the number of bits in the working memory space occupied by the census array. Alternatively, the memory size of the census array can be measured indirectly, for example, by calculating the amount of free space left in the working memory without directly measuring the memory size of the census array (e.g., the range of memory addresses left over from an allocated block of memory addresses). In some implementations, the program sets the overflow threshold just below the memory limit to reserve some space for other values. In some implementations, the overflow threshold can be equal to the memory limit, for example, to allow for a relatively short period of time when the memory limit is slightly exceeded if the space required for other values is negligible and/or if computing system 104 does not impose strict memory limits.

在溢出条件被触发之后，程序使用溢出处理过程来将产生完整普查阵列所需的一些数据存储在存储设备110内的溢出存储空间中。溢出存储空间中存储什么样的数据完全取决于所使用的溢出处理过程的类型。在以下描述的溢出处理过程的示例中，程序继续为在溢出条件被触发之后处理的每个标准化记录执行普查匹配操作，并且将与普查匹配操作的结果相关联的信息(即，普查条目中增加的计数、或者新的普查条目)存储在工作存储器中相同的普查阵列集合中或者存储在工作存储器中新的普查阵列集合中，如下详细所述。如果在处理流208中标准化记录过程中的某个时刻触发溢出条件，则一些数据将被存储在工作存储器空间中，并且一些数据将被存储在溢出存储空间中。在以下描述的溢出处理过程的示例中，以一些方式来结合两个位置中的数据以产生完整的普查阵列。在每个普查阵列自身的普查文件210内输出每个普查阵列，用于普查处理模块204的处理。此外，所使用的结合过程完全取决于所使用的溢出处理过程的类型。在当普查阵列或普查阵列条目的集合是完整的以被发送至输出端口的阶段，普查文件210可以可选地从普查产生模块202输出。After an overflow condition is triggered, the program uses an overflow handling process to store some of the data needed to generate a complete census array in overflow storage space within storage device 110. The type of data stored in the overflow storage space depends entirely on the type of overflow handling process used. In the example overflow handling process described below, the program continues to perform a census matching operation for each standardized record processed after the overflow condition is triggered, and stores information associated with the results of the census matching operation (i.e., incremented counts in census entries, or new census entries) in the same set of census arrays in working memory or in a new set of census arrays in working memory, as described in detail below. If an overflow condition is triggered at some point during the normalization of a record in processing flow 208, some data will be stored in working memory space and some data will be stored in overflow storage space. In the example overflow handling process described below, the data in the two locations is combined in some manner to generate a complete census array. Each census array is output in its own census file 210 for processing by census handling module 204. Furthermore, the combining process used depends entirely on the type of overflow handling process used. A census file 210 may optionally be output from the census generation module 202 at a stage when the census array or set of census array entries is complete to be sent to an output port.

以下描述的溢出处理过程的两个示例可以被相同的普查产生模块202使用。在一种模式中，可以使用一个过程，在另一个模式中，可以使用另一个过程。可以由用户例如通过一些初始分析(例如，对被剖析的数据集的子集执行的初始分析、或者对相同或相似数据集的历史剖析信息执行的初始分析)来确定模式，以估计哪一个过程将是最有效的。这些溢出处理过程除了被应用于普查匹配操作，还可以被应用于其他数据操作。允许合并不完整结果的数据操作将与存储在工作存储器空间中的结果和存储在溢出存储空间中的结果的结合是兼容的，如在下述溢出处理过程中所执行的。至少一些情形可以就地处理的数据操作将与存储器中数据结构内数据结构的就地更新是兼容的，如下述溢出处理过程中所执行的。在允许用户观察该处理的结果或者执行取决于该处理的额外交互之前，通过避免将一些数据存储在溢出存储空间所花费的时间，溢出处理过程的效率对于数据操作(诸如普查操作、或者用于处理潜在的大量输入数据的其他数据剖析操作)是尤其有用的。The two examples of overflow handling procedures described below can be used by the same survey generation module 202. In one mode, one procedure can be used, and in another mode, the other procedure can be used. The mode can be determined by the user, for example, through some initial analysis (e.g., an initial analysis performed on a subset of the profiled data set, or an initial analysis performed on historical profile information of the same or similar data set) to estimate which procedure will be most effective. These overflow handling procedures can be applied to other data operations in addition to survey matching operations. Data operations that allow for the merging of incomplete results will be compatible with combining results stored in the working memory space with results stored in the overflow storage space, as performed in the overflow handling procedures described below. Data operations that can be processed in-place in at least some cases will be compatible with in-place updates of data structures within the in-memory data structures, as performed in the overflow handling procedures described below. The efficiency of the overflow handling procedures is particularly useful for data operations (such as survey operations or other data profiling operations used to process potentially large amounts of input data) by avoiding the time spent storing some data in the overflow storage space before allowing the user to observe the results of the process or perform additional interactions dependent on the process.

图3示出在普查产生模块202产生普查阵列的环境下所使用的具有第一溢出处理过程的普查产生。图4A-图4C对应于具有第一溢出处理过程的普查产生的流程图。参照图3和图4A，普查产生模块202接收标准化记录的流300，并且对于被剖析的每个字段，到溢出处理过程结束时产生完整的普查阵列302。当模块202对流300中的每个标准化记录循环迭代时(开始于第一次迭代的第一标准化记录)，模块202读取(400)下一个标准化记录。模块202检查(402)在工作存储器空间306中产生的普查阵列304的存储器大小，以确定是否达到溢出阈值。FIG3 illustrates census generation with a first overflow handling process used in the context of census generation module 202 generating a census array. FIG4A-4C corresponds to a flow chart of census generation with a first overflow handling process. Referring to FIG3 and FIG4A, census generation module 202 receives a stream 300 of standardized records and, for each field parsed, generates a complete census array 302 at the end of the overflow handling process. As module 202 loops through each standardized record in stream 300 (starting with the first standardized record of the first iteration), module 202 reads 400 the next standardized record. Module 202 checks 402 the memory size of census array 304 generated in working memory space 306 to determine if an overflow threshold has been reached.

如果没有达到溢出阈值，则模块202执行关于该标准化记录的普查匹配操作404。普查匹配操作404包括搜索(406)一个适当普查阵列304(对于标准化记录中字段索引的普查阵列)的键来匹配标准化记录中的数据值。如果存在对键的匹配(其是来自之前标准化记录的数据值)，则增加(408)对应于该键的计数。如果不存在对键的匹配，则将新的条目添加(409)至该适当普查阵列304，其中键被设置为数据值且计数被设置为1。If the overflow threshold is not reached, the module 202 performs a census matching operation 404 on the normalized record. The census matching operation 404 includes searching (406) a key of an appropriate census array 304 (the census array for the field index in the normalized record) to match the data value in the normalized record. If there is a match for the key (which is the data value from the previous normalized record), the count corresponding to the key is incremented (408). If there is no match for the key, a new entry is added (409) to the appropriate census array 304, where the key is set to the data value and the count is set to 1.

如果已经达到溢出阈值，则模块202对普查阵列304和存储于溢出存储空间310中的之前部分普查阵列308(在之前迭代中)执行合并操作412。合并操作412(以下进行更详细的描述)的结果是部分普查阵列308的新集合，对于连同每个键的计数总和的给定字段，每个包含对应于被合并普查阵列中键(即，数据值)的并集的条目。因此，工作存储器空间306中部分普查阵列304中的信息被安全地存储在溢出存储空间310中，并且部分普查阵列304此时可以被从工作存储器空间306移除(414)，释放更多的工作存储器空间306，以对下一个标准化记录执行普查匹配操作404。If the overflow threshold has been reached, the module 202 performs a merge operation 412 on the census array 304 and the previous partial census array 308 (in the previous iteration) stored in the overflow storage space 310. The result of the merge operation 412 (described in more detail below) is a new set of partial census arrays 308, each containing entries corresponding to the union of the keys (i.e., data values) in the merged census arrays for a given field along with the sum of the counts for each key. Thus, the information in the partial census array 304 in the working memory space 306 is safely stored in the overflow storage space 310, and the partial census array 304 can now be removed from the working memory space 306 (414), freeing up more working memory space 306 to perform the census matching operation 404 on the next standardized record.

在迭代结束时，模块202确定(416)是否已经达到流300的结尾(其结束流中每个记录的循环迭代)。如果未到达结尾，则通过读取(400)下一个标准化记录来开始另一次迭代。如果已到达结尾，则模块202确定(418)在任何迭代过程中是否发生溢出。如果没有发生过溢出，则模块202将工作存储器空间306中此时每个完整的普查阵列304发送(419)到输出端口。如果曾经发生溢出，则模块202对存储在工作存储器空间306中的部分普查阵列304和存储在溢出存储空间310中的部分普查阵列308执行修改版本的合并操作412'，以将得到的合并普查阵列发送至输出端口。参照图4B更详细地描述合并操作412，并且参照图4C更详细地描述合并操作412'。At the end of the iteration, module 202 determines (416) whether the end of stream 300 has been reached (which ends the loop iteration for each record in the stream). If the end has not been reached, another iteration begins by reading (400) the next normalized record. If the end has been reached, module 202 determines (418) whether an overflow has occurred during any iteration. If no overflow has occurred, module 202 sends (419) each complete census array 304 in working memory space 306 at that time to the output port. If an overflow has occurred, module 202 performs a modified version of a merge operation 412' on the partial census arrays 304 stored in working memory space 306 and the partial census arrays 308 stored in overflow storage space 310 to send the resulting merged census array to the output port. The merge operation 412 is described in more detail with reference to FIG. 4B and the merge operation 412' is described in more detail with reference to FIG. 4C.

参照图4B，其示出合并操作412的示例，合并操作412用于将工作存储器空间306(被称为“存储器阵列”)中的部分普查阵列的集合和溢出储存空间310(被称为“被存储阵列”)中的部分普查阵列的集合合并。合并操作412包括外循环和内循环，其中外循环在字段上进行迭代(即，利用参引“当前字段”的从1至字段数目的循环计数器)，内循环为当前字段迭代的字段对被存储阵列中的条目进行迭代(即，利用参引“当前条目”的从1至条目数目的循环计数器)。对于当前字段迭代的字段，通过在存储器阵列内的被存储阵列中搜索(420)当前条目的数据值，内循环开始。如果发现匹配，则内循环将被存储阵列中当前条目的计数和存储器阵列中匹配条目的计数求和(422)，并且将得到的总计数存储在存储器阵列中的匹配条目中(覆盖之前的计数)。因为这个新的总计数无需工作存储器中的任何额外空间，所以这个操作将不会引起所用工作存储器空间量的增长。在不同的实现方式中，可以搜索存储器阵列或被存储阵列二者之一并且其可以被用于累积总计数，但是通过迭代被存储阵列和搜索存储器阵列，可以更有效地执行搜索(因为相对于存储设备110，存储器设备108可以被更高效地访问)。如果未发现匹配，则内循环将被存储阵列中的当前条目添加(424)至溢出存储空间310中的新普查阵列中(其是在合并操作412之后将替代之前被存储阵列的新普查阵列集合的一部分)，并且通过在存储器阵列内的被存储阵列中搜索(420)下一个条目的数据值来开始新的内循环迭代。在到达被存储阵列中的最后一个条目之后，内循环结束(426)。外循环包括将更新后的存储器阵列附加(428)至新的普查阵列，使得新普查阵列的全部集合将表示合并操作412的匹配条目和不匹配条目。在到达最后一个字段之后，外循环结束(430)。4B , an example of a merge operation 412 is shown for merging a set of partial census arrays in the working memory space 306 (referred to as the "memory array") with a set of partial census arrays in the overflow storage space 310 (referred to as the "stored array"). The merge operation 412 includes an outer loop that iterates over the fields (i.e., using a loop counter that references the "current field" and runs from 1 to the number of fields), and an inner loop that iterates over the entries in the stored array for the fields of the current field iteration (i.e., using a loop counter that references the "current entry" and runs from 1 to the number of entries). The inner loop begins by searching (420) the stored array within the memory array for the data value of the current entry for the field of the current field iteration. If a match is found, the inner loop sums (422) the count of the current entry in the memory array and the count of the matching entry in the memory array, and stores the resulting total count in the matching entry in the memory array (overwriting the previous count). Because this new total count does not require any additional space in working memory, this operation will not cause an increase in the amount of working memory space used. In various implementations, either the memory array or the stored array can be searched and used to accumulate the total count, but by iterating over the stored array and searching the memory array, the search can be performed more efficiently (because memory device 108 can be accessed more efficiently than storage device 110). If no match is found, the inner loop adds (424) the current entry in the memory array to a new census array in overflow storage space 310 (which is part of the new census array set that will replace the previous stored array after the merge operation 412) and begins a new inner loop iteration by searching (420) the stored array within the memory array for the data value of the next entry. After reaching the last entry in the stored array, the inner loop ends (426). The outer loop includes appending (428) the updated memory array to the new census array so that the entire set of new census arrays will represent matching entries and non-matching entries for the merge operation 412. After reaching the last field, the outer loop ends (430).

图4C示出修改版本的合并操作412'的示例，其被执行为将存储器阵列的最后集合与被存储阵列合并。在这个示例中，唯一的区别如下。代替将被存储阵列中的当前条目添加(424)至溢出存储空间310中的新普查阵列，操作412'将被存储阵列中的当前条目输出(424')至模块202的输出端口(例如，通过写入至输出普查文件，其也可以被存储在溢出存储空间310中)。代替将更新后的存储器阵列附加(428)至被存储的阵列，操作412'将更新后的存储器阵列发送(428')至模块202的输出端口(例如，通过写入至输出普查文件，其也可以被存储在溢出存储空间310中)。FIG4C shows an example of a modified version of the merge operation 412′ that is performed to merge the last set of memory arrays with the stored array. In this example, the only difference is as follows. Instead of adding (424) the current entry in the stored array to a new census array in the overflow storage space 310, operation 412′ outputs (424′) the current entry in the stored array to the output port of module 202 (e.g., by writing to an output census file, which may also be stored in the overflow storage space 310). Instead of appending (428) the updated memory array to the stored array, operation 412′ sends (428′) the updated memory array to the output port of module 202 (e.g., by writing to an output census file, which may also be stored in the overflow storage space 310).

简而言之，第一溢出处理过程通过将部分普查阵列移动至溢出存储并且继续处理所有的新记录(在将与被移动至溢出存储的部分普查阵列合并的工作存储器中的新部分普查阵列)来处理工作存储器的溢出条件。第一溢出处理过程有效地管理部分普查阵列到溢出存储的溢出。第二溢出处理过程将类似地处理溢出条件，同时继续处理所有的新记录，但是第二溢出处理过程将被配置为有效地管理不匹配记录而不是部分普查阵列到溢出存储的溢出。In short, the first overflow handling process handles overflow conditions of the working memory by moving the partial census array to overflow storage and continuing to process all new records (the new partial census array in the working memory will be merged with the partial census array moved to overflow storage). The first overflow handling process effectively manages overflows of the partial census array to overflow storage. The second overflow handling process will similarly handle overflow conditions while continuing to process all new records, but the second overflow handling process will be configured to effectively manage overflows of unmatched records rather than partial census arrays to overflow storage.

图5示出在普查产生模块202产生普查阵列的环境下所使用的具有第二溢出处理过程的普查产生。图6对应于具有第二溢出处理过程的普查产生的流程图。参照图5和图6，普查产生模块202接收标准化记录的流500，并且对于被剖析的每个字段，到溢出处理过程结束时产生完整的普查阵列502。当模块202在第一通路中对流500中的每个标准化记录迭代循环时，或者当模块202在一个或多个额外通路中对标准化记录(其被临时地存储在溢出存储空间504的临时记录存储503中)的每个集合循环迭代时，模块202读取(600)下一个标准化记录。以下更详细地描述在一个或多个通路中执行的步骤，但是基本上，每个通路涉及填补工作存储器空间306直到阈值条件满足，以及在阈值条件满足之后就地处理所有的匹配记录以及溢出所有的不匹配记录。模块202对该标准化记录执行普查匹配操作602。FIG5 illustrates census generation with a second overflow handling process used in the context of census generation module 202 generating a census array. FIG6 corresponds to a flow chart for census generation with a second overflow handling process. Referring to FIG5 and FIG6, census generation module 202 receives a stream 500 of standardized records and, for each field parsed, generates a complete census array 502 at the end of the overflow handling process. As module 202 iterates through each standardized record in stream 500 in a first pass, or as module 202 iterates through each set of standardized records (which are temporarily stored in temporary record storage 503 in overflow storage space 504) in one or more additional passes, module 202 reads (600) the next standardized record. The steps performed in one or more passes are described in more detail below, but essentially, each pass involves filling working memory space 306 until a threshold condition is met, and, after the threshold condition is met, processing all matching records in place and overflowing all non-matching records. Module 202 performs a census matching operation 602 on the standardized record.

普查匹配操作602包括搜索(604)在工作存储器空间508中产生的一个适当普查阵列506(对于标准化记录中字段索引的普查阵列)的键来匹配(605)标准化记录中的数据值。如果存在对键的匹配(其是来自之前标准化记录的数据值)，则增加(606)对应于该键的计数。该增加(606)可以发生而不使用更多的工作存储器空间508(例如，使用就地操作来增加匹配条目的计数)，并且因此不取决于是否已经达到溢出阈值。如果不存在对键的匹配，则下一个动作取决于检查(607)普查阵列506的存储器大小的结果，以确定是否已经达到溢出阈值。如果未达到溢出阈值，则将新的条目添加(608)至该适当普查阵列506，其中键被设置为数据值且计数被设置为1。如果已经达到溢出阈值，则模块202将标准化记录存储(609)于溢出存储空间504的新临时记录存储503中。临时记录存储503可以是存储标准化记录的单个文件(或其他数据结构)，或者可以是提供字段索引(或其他特性)对标准化记录访问的多个文件(或其他数据结构)。存在不同的临时记录存储503，对于不同的通路其具有不同的标准化记录的集合。The census matching operation 602 includes searching (604) a key of an appropriate census array 506 (the census array for the field index in the normalized record) generated in the working memory space 508 to match (605) the data value in the normalized record. If there is a match for the key (which is a data value from a previous normalized record), the count corresponding to the key is incremented (606). This increment (606) can occur without using more working memory space 508 (e.g., using an in-place operation to increment the count of the matching entry) and is therefore not dependent on whether an overflow threshold has been reached. If there is no match for the key, the next action depends on the result of checking (607) the memory size of the census array 506 to determine whether the overflow threshold has been reached. If the overflow threshold has not been reached, a new entry is added (608) to the appropriate census array 506 with the key set to the data value and the count set to 1. If the overflow threshold has been reached, the module 202 stores (609) the normalized record in a new temporary record storage 503 in the overflow storage space 504. Temporary record storage 503 can be a single file (or other data structure) storing standardized records, or can be multiple files (or other data structures) providing field index (or other features) to access standardized records. There are different temporary record stores 503, which have different sets of standardized records for different channels.

模块202确定(610)是否已经到达流500的第一通路的结尾(其结束流中每个记录的循环迭代)，或者确定是否已经到达一个临时记录存储503的通路的结尾。如果还未到达通路的结尾，通过读取(600)下一个标准化记录来开始另一次迭代。如果已经到达通路的结尾，则模块202确定(611)在当前通路的任意之前迭代中是否发生溢出。如果没有发生过溢出，则模块202将工作存储器空间508中此时每个完整的普查阵列506发送(613)至输出端口。如果曾经发生溢出，则模块202检查(612)以确定是否保留有任何临时记录存储503。Module 202 determines (610) whether the end of the first pass of stream 500 has been reached (which ends the loop iteration for each record in the stream), or whether the end of the pass for a temporary record storage 503 has been reached. If the end of the pass has not been reached, another iteration begins by reading (600) the next normalized record. If the end of the pass has been reached, module 202 determines (611) whether an overflow has occurred in any previous iteration of the current pass. If no overflow has occurred, module 202 sends (613) each complete census array 506 in the working memory space 508 at this time to the output port. If an overflow has occurred, module 202 checks (612) to determine whether any temporary record storage 503 remains.

如果保留有至少一个临时记录存储503，则通过在字段上进行迭代(即，利用参引“当前字段”的从1至字段数目的循环计数器)，并且将一个适当部分普查阵列506附加(614)至存储在溢出存储空间504(在之前迭代中)中任意部分普查阵列510的对应一个(具有相同的字段索引)，模块202开始释放工作存储器中的空间，以处理该存储503中的标准化记录。在已经到达最后的字段之后，循环结束(615)。因此，工作存储器空间508中部分普查阵列506中的信息被安全地存储在溢出存储空间504中，并且部分普查阵列506可以被从工作存储器空间508移除(616)，释放更多的工作存储器空间508，以从剩余临时记录存储503中读取(600)下一个标准化记录并且对该标准化记录执行普查匹配操作602。If at least one temporary record store 503 remains, the module 202 begins freeing space in the working memory to process the normalized record in that store 503 by iterating over the fields (i.e., using a loop counter from 1 to the number of fields that references the "current field") and appending (614) an appropriate partial census array 506 to the corresponding one (having the same field index) of any partial census array 510 stored in the overflow storage space 504 (in the previous iteration). After the last field has been reached, the loop ends (615). Thus, the information in the partial census array 506 in the working memory space 508 is safely stored in the overflow storage space 504, and the partial census array 506 can be removed (616) from the working memory space 508, freeing up more working memory space 508 to read (600) the next normalized record from the remaining temporary record store 503 and perform the census matching operation 602 on the normalized record.

因为当在工作存储器空间508中产生这些普查阵列510时，将处理带有数据值(该数据值将匹配普查阵列510中的键)的任意标准化记录，所以部分普查阵列506和部分普查阵列510可以被简单地附加，而无需执行合并操作，并且因此这些数据值都不存在于工作存储器空间508的任意一个当前普查阵列506中。如果是有帮助的(例如，为了访问这些条目的效率)，则部分普查阵列506和部分普查阵列510中的条目可以可选地被分类或重新排列为一个或多个其他数据结构，但是无需结合各个条目以巩固特定数据值的信息。Because any normalized records with data values that will match keys in the census arrays 510 will be processed when these census arrays 510 are generated in the working memory space 508, the partial census arrays 506 and partial census arrays 510 can be simply appended without performing a merge operation, and therefore none of these data values are present in any current census array 506 in the working memory space 508. The entries in the partial census arrays 506 and partial census arrays 510 can optionally be sorted or rearranged into one or more other data structures if helpful (e.g., for efficiency in accessing the entries), but the individual entries need not be joined to consolidate information about a particular data value.

在模块202如何通过流500的进程记录的示例中，在已经处理流500的标准化记录的第一子集512而使普查阵列506的存储器大小扩大至溢出阈值之后，处理标准化记录的第二子集514以对于匹配键的数据值(来自示出为无阴影方框的记录)继续增加普查阵列506的计数，或者处理标准化记录的第二子集514以将标准化记录的第三子集516(被示出为带阴影方框)存储为临时记录存储503。注意到第三子集516还是第二子集514的子集。这个进程继续，由临时记录存储503代替流500，产生潜在的新(更小的)临时记录存储503'，同时对当前临时记录存储503迭代。In an example of how module 202 processes records through stream 500, after a first subset 512 of the normalized records of stream 500 has been processed, causing the memory size of census array 506 to expand to an overflow threshold, a second subset 514 of normalized records is processed to continue incrementing the count of data values for matching keys (from records shown as unshaded boxes) in census array 506, or the second subset 514 of normalized records is processed to store a third subset 516 of normalized records (shown as shaded boxes) as temporary record store 503. Note that third subset 516 is also a subset of second subset 514. This process continues, replacing stream 500 with temporary record store 503, creating a potentially new (smaller) temporary record store 503', while iterating over the current temporary record store 503.

如果在检查(612)之后，不存在保留的临时记录存储503，则模块202在字段上迭代(即，从1到字段的数目)，将一个适当普查阵列506从工作存储器空间508发送(618)至输出端口，并且将该适当普查阵列510从溢出存储空间504发送(620)至输出端口。在到达最后一个字段之后，循环结束(622)。这部分普查产生中将普查阵列506和510发送至输出端口，这使得以字段索引顺序来输出普查阵列(与一旦不同字段的部分普查阵列准备好就将其输出的情况相反)。例如，如果在普查处理模块204之后还存在下游计算需要以字段索引顺序提供普查阵列，这可能是有用的。可替代地，无需有序输出的普查产生的其他实现方式可以避免将工作存储器空间508中的部分普查阵列506存储在溢出存储空间504中，并且代替地可以简单地将部分普查阵列506从工作存储器空间508直接输出至输出端口(与当附加普查阵列506和510时无需执行合并操作的原因相同)。If, after checking (612), no temporary record storage 503 remains, the module 202 iterates over the fields (i.e., from 1 to the number of fields), sending (618) the appropriate census array 506 from the working memory space 508 to the output port, and sending (620) the appropriate census array 510 from the overflow storage space 504 to the output port. After reaching the last field, the loop ends (622). This partial census generation sends the census arrays 506 and 510 to the output port, which allows the census arrays to be output in field index order (as opposed to outputting partial census arrays for different fields as soon as they are ready). This can be useful, for example, if there are downstream computations after the census processing module 204 that require the census arrays to be provided in field index order. Alternatively, other implementations of census generation that do not require ordered output may avoid storing the partial census array 506 in the working memory space 508 in the overflow storage space 504, and instead may simply output the partial census array 506 from the working memory space 508 directly to the output port (for the same reason that a merge operation need not be performed when appending the census arrays 506 and 510).

简而言之，第二溢出处理过程通过将不匹配记录移动至溢出存储并且继续处理所有的新记录(对于任何匹配记录更新工作存储器中的普查阵列)来处理工作存储器的溢出条件。第二溢出处理过程有效地管理不匹配记录到溢出存储的溢出。In short, the second overflow handling process handles overflow conditions of the working memory by moving unmatched records to the overflow storage and continuing to process all new records (updating the census array in the working memory for any matching records). The second overflow handling process effectively manages the overflow of unmatched records to the overflow storage.

在第一溢出处理过程和第二溢出处理过程之间存在一些差异，这可能使得在一些情形下其中一个或另一个溢出处理过程更适当(即，更有效)。第一溢出处理过程无需将记录溢出至溢出存储，而第二溢出处理过程需要将记录溢出至溢出存储。但是，第二溢出处理过程无需合并普查阵列，而第一溢出处理过程需要合并普查阵列。另外，因为第二溢出处理过程持有工作存储器中的相同普查阵列，在溢出条件之后，直到读取输入流中记录的所有通路，值的初始分布确定哪个稍后值将是匹配的或不匹配的。因此，对于给定字段中重复值(如果有)的分布在输入流的所有记录上相对均匀的情形，第二溢出处理过程可能具有有效的溢出。每次普查阵列被溢出至溢出存储时，第一溢出处理进程使得能够匹配完全新的值的集合。所以，对于重复值的分布在输入流的所有记录的一个或多个字段中具有明显变化的情形，第一溢出处理过程可能具有有效的溢出。There are some differences between the first and second overflow handling processes that may make one or the other more appropriate (i.e., more efficient) in some situations. The first overflow handling process does not require overflowing records to overflow storage, while the second overflow handling process does. However, the second overflow handling process does not require merging census arrays, while the first overflow handling process does. In addition, because the second overflow handling process holds the same census array in working memory, after the overflow condition, until all paths of records in the input stream have been read, the initial distribution of values determines which later values will be matched or not matched. Therefore, for situations where the distribution of duplicate values (if any) in a given field is relatively uniform across all records in the input stream, the second overflow handling process may have an efficient overflow. Each time the census array is overflowed to overflow storage, the first overflow handling process enables matching of a completely new set of values. Therefore, for situations where the distribution of duplicate values in one or more fields across all records in the input stream varies significantly, the first overflow handling process may have an efficient overflow.

其他溢出处理过程也是可能的。例如，上述第一过程和第二过程之间的混合可以通过存储不匹配溢出存储空间中普查阵列的标准化记录(即，根据第二过程)来开始处理溢出条件。然后，如果不匹配的被存储标准化记录的小部分变得大于特定阈值，则该过程可将工作存储器空间中的部分普查阵列移动至溢出存储空间(与任意之前被存储部分普查阵列合并)，并且继续处理当前通路中的标准化记录(即，根据第一过程)。在第一过程的通路完成之后，通过处理存储在溢出存储空间中的标准化记录，使用相同的混合进程，过程将继续(即，当到达溢出条件时，开始存储不匹配标准化记录，直到达到阈值)。Other overflow handling procedures are also possible. For example, a hybrid between the first and second procedures described above could begin handling an overflow condition by storing normalized records that do not match the census array in the overflow storage space (i.e., according to the second procedure). Then, if the fraction of stored normalized records that do not match becomes greater than a certain threshold, the procedure could move the portion of the census array in the working memory space to the overflow storage space (merging it with any previously stored portions of the census array) and continue processing normalized records in the current pass (i.e., according to the first procedure). After the first pass is complete, the procedure would continue by processing the normalized records stored in the overflow storage space using the same hybrid process (i.e., starting to store non-matching normalized records when an overflow condition is reached until the threshold is reached).

上述技术可以使用执行适当软件的计算系统来实施。例如，软件可以包括在一个或多个已编程或可编程计算机系统(可以具有各种架构，诸如分布式、客户端/服务器、或网格式)上执行的一个或多个计算机程序中的程序，每个计算机系统包括至少一个处理器、至少一个数据存储系统(包括易失性和/或非易失性存储器和/或存储元件)、以及至少一个用户接口(用于使用至少一个输入设备或端口来接收输入，以及用于使用至少一个输出设备或端口来提供输出)。该软件可以包括大型程序的一个或多个模块，例如，该大型程序提供与数据流图的设计、配置和执行相关的服务。该程序的模块(例如，数据流图的元件)可以被实施为数据结构或者符合在数据库中存储的数据模型的其它经过组织的数据。程序的模块可以存储任意各种数据结构的阵列数据，诸如哈希表或平面文件，其可以被可选地索引和/或压缩。The above technology can be implemented using a computing system that performs appropriate software. For example, the software can include a program in one or more computer programs that are executed on one or more programmed or programmable computer systems (can have various architectures, such as distributed, client/server, or network format), each computer system including at least one processor, at least one data storage system (including volatility and/or non-volatile memory and/or storage element) and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software can include one or more modules of a large program, for example, the large program provides the design, configuration, and execution-related services of a data flow graph. The module of the program (for example, the element of the data flow graph) can be implemented as a data structure or other organized data that conforms to the data model stored in the database. The module of the program can store array data of any various data structures, such as a hash table or a flat file, which can be optionally indexed and/or compressed.

该软件可以被提供在诸如CD-ROM或其他计算机可读介质之类的有形永久存储介质(例如，其可以被通用或专用计算机系统或设备读取)上，或者通过网络的通信介质递送(例如，被编码成传播信号)到执行该软件的计算机系统的有形永久介质。一些或全部处理可以在专用计算机上执行，或者使用诸如协处理器或现场可编程门阵列(FPGA)或专用集成电路(ASIC)之类的专用硬件来执行。该处理可以以分布方式实施，在该分布方式中，由该软件指定的不同的计算部分由不同的计算元件执行。每个这样的计算机程序被优选地存储在或下载到可由通用或专用可编程计算机读取的存储设备的计算机可读存储介质(例如，固态存储器或介质、或者磁或光介质)，用于在计算机读取该存储介质或设备时配置和操作该计算机，以执行此处所描述的处理。也可以考虑将本发明的系统实施为有形永久存储介质，其配置有计算机程序，其中，如此配置的存储介质使得计算机以特定和预定义的方式操作以执行此处所描述的一个或多个处理步骤。The software may be provided on a tangible, permanent storage medium (e.g., which can be read by a general-purpose or special-purpose computer system or device) such as a CD-ROM or other computer-readable medium, or on a tangible, permanent medium delivered (e.g., encoded into a propagated signal) to a computer system executing the software via a communication medium of a network. Some or all of the processing may be performed on a dedicated computer, or using dedicated hardware such as a coprocessor or field programmable gate array (FPGA) or application-specific integrated circuit (ASIC). The processing may be implemented in a distributed manner in which different computational components specified by the software are performed by different computing elements. Each such computer program is preferably stored in or downloaded to a computer-readable storage medium (e.g., a solid-state memory or medium, or a magnetic or optical medium) of a storage device that can be read by a general-purpose or special-purpose programmable computer, for configuring and operating the computer when the computer reads the storage medium or device to perform the processing described herein. It is also contemplated that the system of the present invention may be implemented as a tangible, permanent storage medium configured with a computer program, wherein the storage medium so configured causes the computer to operate in a specific and predefined manner to perform one or more processing steps described herein.

已经对本发明的多个实施例进行了描述。然而，应当理解，前面的描述旨在说明而非限制本发明的范围，本发明的范围由以下权利要求书的范围来限定。因此，其它实施例也落在以下权利要求书的范围内。例如，在不脱离本发明的范围的情况下可进行各种修改。此外，上述的一些步骤可以是顺序独立的，因此可以以不同于所述的顺序来执行。Several embodiments of the present invention have been described. However, it should be understood that the foregoing description is intended to illustrate, not to limit, the scope of the present invention, which is defined by the following claims. Therefore, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the present invention. Furthermore, some of the steps described above may be sequence-independent and, therefore, may be performed in an order different from that described.

Claims

1. A computing system, comprising:

Memory devices are used to provide working memory space;

Storage devices used to provide overflow storage space; and

At least one processor is configured to process multiple data units to produce result information, said processing including:

Perform a data operation for each data unit of a first subset of data units of multiple data units. The data operation includes searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in the working memory space, and if the value is found, modifying the information in the at least one data structure of the first set, and if the value is not found, adding the information to the at least one data structure of the first set.

After the overflow condition of the working memory space is met, information is stored in the overflow memory space and at least some of the working memory space is released. A data operation is performed on each data unit of the second subset of the plurality of data units. This data operation includes searching for a value within a data unit of at least one data structure in a second set of one or more data structures stored in the working memory space, and if the value is found, modifying the information in that at least one data structure of the second set.

Multiple sets of one or more data structures, including the first set and the second set, are combined to produce the resulting information.

2. The computing system of claim 1, wherein the overflow condition of the working memory space is satisfied if the amount of the working memory space used by the first set of one or more data structures is greater than or equal to a predetermined threshold.

3. The computing system of claim 1, wherein the processing further comprises, after the overflow condition is satisfied and before performing the search for each data unit of the second subset of the data units, storing the first set of one or more data structures in the overflow storage space, and removing the first set of one or more data structures from the working memory space.

4. The computing system of claim 1, wherein combining multiple sets of one or more data structures includes merging at least one data structure of the first set with at least one data structure of the second set.

5. The computing system of claim 4, wherein merging at least one data structure of the first set with at least one data structure of the second set comprises matching a first key in the data structure of the first set of one or more data structures with a second key in the data structure of the second set of one or more data structures, and performing an aggregation operation on the value associated with the first key and the value associated with the second key.

6. The computing system of claim 1, wherein the processing further comprises, after the overflow condition is satisfied and before performing the search for each data unit of the second subset of data units, searching for a value in at least one data structure within the first set of one or more data structures stored in the working memory space for each data unit of the third subset of data units of the plurality of data units, and if the value is found, modifying the information in the at least one data structure of the first set.

7. The computing system according to claim 6, wherein the second subset of data units is a subset of the data units of the third subset of data units.

8. The computing system of claim 1, wherein modifying the information comprises performing an in-place memory operation, the in-place memory operation overwriting a value stored at a location in the working memory space with a different value stored at the same location in the working memory space.

9. The computing system of claim 1, wherein the processing further comprises generating the plurality of data units according to a data source, each data unit including an identifier of a field of the data source and a value appearing in the field within a record of the data source.

10. The computing system of claim 9, wherein the first set of one or more data structures comprises a plurality of associative arrays of key-value pairing entries.

11. The computing system of claim 10, wherein searching for the value within a data cell of at least one data structure of a first set of one or more data structures stored in the working memory space comprises searching for the value as a destination key within a selected key-value pairing entry associative array.

12. The computing system of claim 11, wherein the selected key-value pairing entry associative array corresponds to the identifier in the data unit.

13. The computing system of claim 11, wherein modifying information in at least one data structure of the first set includes adding the value of the found key-value pair entry.

14. The computing system of claim 11, wherein adding information to at least one data structure of the first set comprises adding a new key-value pair entry to the selected array, such that the key of the entry is the value in the data unit and the value of the entry is a count of 1.

15. The computing system of claim 1, wherein the memory device includes a volatile memory device and the storage device includes a non-volatile memory device.

16. A method for processing multiple data units to produce result information, comprising:

Perform a data operation for each data unit of a first subset of data units of multiple data units, the data operation comprising searching for a value in a data unit within at least one data structure of a first set of one or more data structures stored in the working memory space of a memory device, and if the value is found, modifying information in the at least one data structure of the first set, and if the value is not found, adding information to the at least one data structure of the first set.

After the overflow condition of the working memory space is met, information is stored in the overflow storage space of the storage device and at least some of the working memory space is released. A data operation is performed on each data unit of the second subset of data units of the plurality of data units. This data operation includes searching for a value in a data unit within at least one data structure of the second set of one or more data structures stored in the working memory space, and if the value is found, modifying the information in that at least one data structure of the second set.

17. A computer-readable medium storing software for processing a plurality of data units to produce result information, the software including instructions for causing a computing system to execute:

After the overflow condition of the working memory space is met, information is stored in the overflow storage space of the storage device and at least some of the working memory space is released. A data operation is performed on each data unit of the second subset of data units of the plurality of data units. This data operation includes searching for a value in a data unit within at least one data structure of the second set of one or more data structures stored in the working memory space, and if the value is found, modifying the information in at least one data structure of the second set.