
HK1225119B - Checkpointing a collection of data units - Google Patents

Checkpointing a collection of data units

Info

Publication number
HK1225119B
Authority
HK
Hong Kong
Prior art keywords
data
data units
computing system
units
working
Prior art date
Application number
HK16113138.8A
Other languages
Chinese (zh)
Other versions
HK1225119A1 (en)
Inventor
约瑟夫.斯凯芬顿.沃莱三世 (Joseph Skeffington Wholey III)
Original Assignee
起元科技有限公司 (Ab Initio Technology LLC)
Priority date
Filing date
Publication date
Application filed by 起元科技有限公司 (Ab Initio Technology LLC)
Publication of HK1225119A1
Publication of HK1225119B


Description

Checkpointing a collection of data units

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Application No. 61/893,439, filed on October 21, 2013.

TECHNICAL FIELD

The present application relates to checkpointing a collection of data units.

BACKGROUND

There are various types of data processing systems for which an ability to recover or restart in response to a failure or other unexpected event is useful. For example, in real-time stream processing or complex event processing systems, it is useful to save system information such as input data and/or state information for computations performed on that input data. Checkpointing is one example of a way to periodically save system information so that the system can be restored from the most recently saved consistent state. Checkpointing techniques for data processing systems that operate on continuous streams of data are described in U.S. Patent No. 6,584,581, which is incorporated herein by reference.

SUMMARY OF THE INVENTION

In one aspect, in general, a computing system for managing stored data includes: a memory module configured to store working data that includes a plurality of data units; a storage system configured to store recovery data that includes a plurality of groups of one or more data units; and at least one processor configured to manage transfer of data units between the memory module and the storage system. The managing includes: maintaining an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and, for each of a plurality of time intervals, identifying any data units accessed from the working data during that time interval, and adding to the recovery data a group of two or more data units that includes: one or more data units from the first contiguous portion, including any accessed data units, and one or more data units from the second contiguous portion, including at least one data unit previously added to the recovery data.

These aspects may include one or more of the following features.

The managing further includes, for each of the plurality of time intervals, removing from the recovery data at least one group of one or more data units for which any data units still included in the working data are stored in at least one other group of one or more data units.

The managing further includes, for each of the plurality of time intervals, identifying any data units removed from the working data during that time interval.

The second contiguous portion does not include any removed data units.

The managing further includes, for each of the plurality of time intervals, adding to the recovery data information identifying any removed data units.

Identifying any data units accessed from the working data during that time interval includes moving any data units accessed from the working data during that time interval into the first contiguous portion.

The order among the plurality of data units included in the working data is based on how recently the data units were accessed.

The first contiguous portion includes: the most recently accessed data unit of all of the plurality of data units included in the working data, and the least recently accessed data unit of a subset of the plurality of data units that have not been added to the recovery data since they were most recently accessed.

The second contiguous portion does not overlap with the first contiguous portion.

The second contiguous portion includes at least one data unit that has been added to the recovery data since it was most recently accessed.

The one or more data units from the second contiguous portion are limited to a number of data units that is between approximately half the number of data units in the one or more data units from the first contiguous portion and approximately twice that number.

The first contiguous portion includes data units that were more recently accessed than any data units in the second contiguous portion.

A time indicating how recently a data unit was accessed corresponds to a time at which exclusive access to that data unit began.

A time indicating how recently a data unit was accessed corresponds to a time at which exclusive access to that data unit ended.

The managing further includes restoring a state of the working data using the recovery data in response to a failure.

Each of the plurality of data units included in the working data is associated with a respective key value.

At least one data unit included in the working data includes one or more values accessible based on the key value associated with that data unit.

A total number of different key values associated with different data units included in the working data is greater than about 1,000.

The time intervals do not overlap with one another.

Identifying any data units accessed from the working data during that time interval includes identifying at least one of: any data units added to the working data during that time interval, any data units read from the working data during that time interval, or any data units updated within the working data during that time interval.

The memory module includes a volatile storage device.

The storage system includes a non-volatile storage device.

In another aspect, in general, a method for managing transfer of data units between a memory module of a computing system and a storage system of the computing system includes: storing working data that includes a plurality of data units in the memory module; storing recovery data that includes a plurality of groups of one or more data units in the storage system; maintaining an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and, for each of a plurality of time intervals, identifying any data units accessed from the working data during that time interval, and adding to the recovery data a group of two or more data units that includes: one or more data units from the first contiguous portion, including any accessed data units, and one or more data units from the second contiguous portion, including at least one data unit previously added to the recovery data.

In another aspect, in general, software is stored in a non-transitory form on a computer-readable medium, for managing transfer of data units between a memory module of a computing system and a storage system of the computing system. The software includes instructions for causing the computing system to: store working data that includes a plurality of data units in the memory module; store recovery data that includes a plurality of groups of one or more data units in the storage system; maintain an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and, for each of a plurality of time intervals, identify any data units accessed from the working data during that time interval, and add to the recovery data a group of two or more data units that includes: one or more data units from the first contiguous portion, including any accessed data units, and one or more data units from the second contiguous portion, including at least one data unit previously added to the recovery data.

These aspects may include one or more of the following advantages.

In some data processing systems, the information maintained by the system includes working data that comprises a collection of data units, which are updated periodically as processing proceeds. In a complex event processing (CEP) system, for example, streams of events are processed and aggregated, and actions are taken based on the results. A collection of working data for a CEP system may include the states of multiple data units, each representing an item for which a respective data stream is being received, such as prices associated with different stock ticker symbols. The state represented by a particular data unit may be stored as a state object that is accessed by a unique key (e.g., a numeric value). In the ticker symbol example, the system would maintain one state object for each ticker symbol. Each state object may have one or more fields storing values such as scalars (e.g., a value associated with the stock) and vectors (e.g., historical price data). In some examples, a field of a state object may store a computed value representing the result of applying a function, such as an aggregation function (e.g., a sum, count, maximum, or average), that is incrementally updated for each new price value received for the corresponding ticker symbol. There may be multiple collections of state objects within the system, and each collection may include state objects having a particular set of fields. The size of a collection may change as new state objects are added and old state objects are removed, but most of the changing data in the collection may be attributable to the actual data stored within the portion of the state objects that are being updated.
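Purely as an illustration (not taken from the original text), such a keyed state object and its incremental updates might be sketched in pseudocode like the following; the structure, field names, and helper calls are assumptions:

// Hypothetical sketch: state object kept for one stock ticker symbol,
// holding incrementally maintained aggregates over its price stream.
struct ticker_state {
    count;    // number of price values received so far
    sum;      // running sum of the prices (average = sum / count)
    maximum;  // largest price seen so far
    history;  // vector of recent prices
};

// Incrementally update the aggregates when a new price arrives for
// the state object associated with one key (ticker symbol).
update_ticker_state(state, price) {
    state.count = state.count + 1;
    state.sum = state.sum + price;
    if (price > state.maximum)
        state.maximum = price;
    state.history.append(price);
}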

In some implementations, the working data for the system is stored in a large amount of relatively fast memory, which may be volatile memory such as dynamic random access memory (DRAM). To ensure durability of the working data, it can be particularly useful to employ a checkpointing scheme in which the latest version of each state object is periodically stored on a more stable and reliable storage device (e.g., a hard disk drive, solid-state drive, or other non-volatile storage medium), while ensuring that the costs in terms of required resources (e.g., data transfer time and data storage space) are managed efficiently. Although the working data may include a large number of data units (e.g., the state objects described above), within any given period between checkpoint operations (called a "checkpoint interval") only a small fraction of the data units may change. The checkpointing scheme should enable the latest state of every data unit to be recovered in the event of a failure, including both the data units that have recently changed and those that have not.

Efficient checkpointing in such a system is challenging, especially when the number of data units being managed is particularly large. For example, in one scenario the working data includes a collection of about one billion data units of a few bytes each, and during each checkpoint interval roughly one in a thousand of the data units (one million data units) changes. Of course, the set of data units that have changed during each checkpoint interval may be a different (though possibly overlapping) set. For recovery, it can also be assumed that only the latest state of any particular data unit is needed.

A first approach is to store the entire collection of data units in one or more checkpoint files on the storage device at every checkpoint interval. To recover the latest state of every data unit in the collection, the system can simply read back the checkpoint file(s). For gigabytes of working data and checkpoint intervals on the order of seconds, this first approach is expensive in terms of data transfer cost, and there may not even be enough time to write all of the data units to the storage device.

A second approach is to store, at each checkpoint interval, only the data units that have changed since the last checkpoint interval in a respective checkpoint file on the storage device. Although this second approach reduces the transfer time at each checkpoint, recovery becomes increasingly expensive over time. This is because, to recover the latest state of every data unit, the system needs to read every checkpoint file from the beginning of the checkpointing process, to ensure that it recovers the latest state of any data units that have not changed since the first checkpoint file was written.

A possible improvement to this second approach is for the system to perform an offline process that scans the checkpoint files on the storage device and deletes any checkpoint file that consists entirely of old copies of data units whose updated states are represented in more recently stored checkpoint files. Another improvement to this second approach is to scan the checkpoint files and rewrite them to remove old copies of data units, retaining at least one more recent copy of each removed data unit in at least one newer checkpoint file. When the number of data units in a checkpoint file drops to zero, that checkpoint file can be deleted from the storage device to free storage space. Some potential challenges of these improvements are: (a) the consolidation process introduces another process to be managed, which generally makes the system less reliable; and (b) the consolidation process has a cost in computation time, including at least one read of the checkpoint data represented by the checkpoint files, and potentially several reads and writes of that checkpoint data.

A third approach is for the system to store a separate checkpoint file for each data unit. At each checkpoint interval, a new checkpoint file written for a particular updated data unit can replace the previous checkpoint file for that data unit. This avoids high data transfer costs (since only changed items need to be stored), avoids high data storage costs (since checkpoint files for stale data do not accumulate indefinitely), and reduces complexity (since no cleanup process is needed). However, this third approach is potentially expensive because files are relatively costly in terms of storage overhead. If the working data consists of billions of data units of only a few bytes each, the file system may not be able to sustain the file creation and management operations, and the storage overhead for file system metadata may end up consuming more storage space than the actual contents of the data units.

Some of the checkpointing techniques described below have at least some of the advantages of the approaches above and of their improvements. For example, by recognizing that the recovery data can include, within the same checkpoint file, new copies of data units that may have changed and old copies of data units that were previously added to (different) checkpoint files, both data storage cost (including file system overhead) and data transfer time can be limited. The process of incrementally migrating old copies into newer checkpoint files can be performed in a way that limits the amount of extra work done in a checkpoint interval, and that gradually consolidates old copies of the least recently accessed data units until an old checkpoint file can be discarded because it stores only redundant backup copies of data units. This consolidation can be performed by the same checkpointing procedure that stores the checkpoint files, and can therefore be simpler and more reliable than an offline consolidation process. Even if old checkpoint files are not deleted, storing the latest states of multiple data units (i.e., data units with different keys) in the same checkpoint file reduces the potential data storage cost due to file system overhead, especially when the number of different keys assigned to data units is large (e.g., greater than about 1,000, or greater than about 1,000,000, or greater than about 1,000,000,000).

Other features and advantages of the invention will become apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a data processing system.

FIGS. 2A and 2B are timelines of checkpointing and data processing operations.

FIGS. 3A and 3B are flowcharts of a checkpointing algorithm and a recovery algorithm, respectively.

FIGS. 4A, 4E, and 4F are illustrations of states of the working data.

FIGS. 4B, 4C, and 4D are illustrations of checkpoint files.

FIG. 5 is an illustration of pointer movement over time.

DETAILED DESCRIPTION

FIG. 1 shows an example of a data processing system 100 in which the checkpointing techniques can be used. The system 100 includes a data source 102 that may include one or more sources of data, such as storage devices or online data sources, each of which may store or provide data in any of a variety of formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a mainframe). An execution environment 104 includes a processing module 106 (e.g., at least one central processing unit having one or more processing cores) and a memory module 108 (e.g., one or more memory devices, such as DRAM or other forms of relatively fast storage media). The execution environment 104 may be hosted on one or more general-purpose computers under the control of a suitable operating system, such as a version of the UNIX operating system. For example, the execution environment 104 can include a multiple-node parallel computing environment including a configuration of computer systems using multiple central processing units (CPUs) or processor cores, either local (e.g., multiprocessor systems such as symmetric multi-processing (SMP) computers), or locally distributed (e.g., multiple processors coupled as clusters or massively parallel processing (MPP) systems), or remote, or remotely distributed (e.g., multiple processors coupled via a local area network (LAN) and/or wide-area network (WAN)), or any combination thereof.

Storage devices providing the data source 102 may be local to the execution environment 104, for example, being stored on a storage medium 110 connected to a computer hosting the execution environment 104, or may be remote to the execution environment 104, for example, being hosted on a remote system (e.g., a mainframe 112) that communicates with the computer hosting the execution environment 104 over a remote connection (e.g., a server connection providing a stream processing data feed). Output data generated by the data processing within the execution environment 104 may be stored back in the data source 102 or in another storage medium, or otherwise used.

The processing module 106 processes data from the data source 102 for any of a variety of applications (e.g., complex event processing), and during that processing accesses working data 114 stored in the memory module 108. The processing module 106 also periodically executes a checkpointing procedure that stores some of the working data 114 in a data storage system 116 accessible within the execution environment 104 (e.g., a hard drive of a computer hosting the execution environment 104). The checkpointing procedure may store certain portions of the working data 114 in their entirety, and only selectively store other portions to avoid redundantly backing up data that has not changed since the last checkpoint interval. For example, the working data 114 may include a collection of data units that the checkpointing procedure selectively stores within a set of checkpoint files 120, according to an order maintained over the data units. Other portions of the working data 114, such as other in-memory state associated with the data processing, may be stored in separate checkpoint files.

The order of the data units is maintained by a management program, which may be part of a larger data processing program or may be a separate process that manages the working data 114. In some implementations, the data units are stored in the memory module 108 in an associative array of key/data pair entries organized in order from least-recently-accessed (LRA) to most-recently-accessed (MRA). For example, the associative array may be implemented as a hash table whose entries are strung together, according to the maintained order, using a doubly linked list arrangement of pointers. Each entry in the hash table is accessed based on a unique key, and can store the data corresponding to that key (representing any number of individual values of variables or other state information) in an enclosed data object, such as the keyed state objects described above. The management program maintains an MRA pointer that points to the MRA entry in the hash table, and an LRA pointer that points to the LRA entry in the hash table. The management program also maintains a least-recently-checkpointed (LRC) pointer and a most-recently-checkpointed (MRC) pointer. Each entry in the hash table also has an associated attribute that stores the checkpoint number (CPN) corresponding to the checkpoint file in which that entry was last saved (if any). These pointers and fields are used to selectively determine which entries will be copied to a new checkpoint file in any given checkpoint interval, as described in more detail below.
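A minimal sketch of such an entry and of the table-wide pointers, written in the same C-style pseudocode as the listings below, might look like the following; the structure layout is an illustrative assumption rather than a required implementation:

// Hypothetical sketch of a hash-table entry in the doubly linked
// LRA-to-MRA order described above.
struct entry {
    key;                // unique key used to look the entry up in the hash table
    data;               // the keyed state object stored in the entry
    checkpoint_number;  // CPN of the checkpoint file this entry was last saved in
    prev;               // neighboring entry one step toward the MRA end
    next;               // neighboring entry one step toward the LRA end
};

// Table-wide pointers maintained by the management program:
//   mra -- most-recently-accessed entry
//   lra -- least-recently-accessed entry
//   mrc -- most-recently-checkpointed entry
//   lrc -- least-recently-checkpointed entry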

Each time an entry is accessed, it becomes the most recently accessed entry; that is, the MRA pointer is assigned the memory address of that entry, and the other pointers within the linked list are adjusted appropriately (e.g., for the neighboring entries at the entry's old position in the table). In some implementations, an entry is considered to have been accessed in any case in which its key is used to retrieve the data stored in the entry, or when the entry has been added to the table. In such implementations, an entry is still considered to have been accessed when a program retrieves its data in order to read it without changing it. In other implementations, the management program maintains the order based on when entries have actually been changed (e.g., most recently changed or least recently changed), and reading an entry's data without changing it is not considered to affect the order. Such implementations may, for example, use a "dirty bit" for each entry that has changed due to an access. In the examples below, the simpler scheme is used, in which it is assumed that any access may have changed the data stored in the entry. The management program also determines when entries are no longer needed and should be removed from the table. For example, in some implementations, when memory or time constraints dictate, entries are removed from the table starting with the least recently accessed entry.

The data processing system 100 provides an incremental checkpointing scheme by executing a checkpointing procedure at checkpoint intervals. For example, the checkpointing procedure can be triggered after a predetermined amount of time, or after the system has received a predetermined number of input records. When the checkpointing procedure is triggered, it stores a checkpoint file with "new" copies of the data units that have been accessed during the most recent checkpoint interval and a similar number of "old" copies of data units that have not been accessed during the most recent checkpoint interval but have already been stored in a checkpoint file, adjusting the MRC pointer as described in more detail below. The checkpointing operations performed by the checkpointing procedure can occur while the management program continues to manage access to the working data 114 for data processing operations, or access to the working data 114 can be temporarily blocked while the checkpointing procedure executes. FIGS. 2A and 2B show example timelines (with time increasing to the right) of the timing of checkpointing operations for the most recent checkpoint interval. In the example of FIG. 2A, the checkpointing procedure performs checkpointing operations 200 to save the state of working data for activity that occurred during a previous checkpoint interval 202A, while data processing operations 204 continue during checkpoint interval 202B. Alternatively, in the example of FIG. 2B, the checkpointing procedure completes checkpointing operations 200' for activity that occurred during the previous checkpoint interval 202A before data processing operations 204' resume during checkpoint interval 202B. To enable the working data 114 to be properly reconstructed during recovery, the checkpointing procedure stores the data units in a manner that enables the new copies to be distinguished from the old copies (e.g., using a flag for each data unit, or writing the data units into separate sections of the checkpoint file). As the checkpointing procedure writes old data units into newer checkpoint files, checkpoint files that previously stored those data units (now stored in the newer checkpoint files) can be deleted when certain conditions are met.

As the management program manages access to the working data 114 during normal data processing, it updates the MRA and MRC pointers appropriately, in preparation for the checkpointing procedure that will occur at the next checkpoint interval. For example, for a table of entries organized as a linked list as described above, if the MRC entry is accessed, the MRC entry becomes the MRA entry at one end of the table, and the MRC pointer is adjusted to reference the entry one step toward the LRA end of the table. When the next checkpointing procedure occurs, only the entries between the MRA pointer (inclusive) and the MRC pointer (exclusive) need to be stored. After the entries between the MRC pointer and the MRA pointer have been stored in a checkpoint file, the checkpointing procedure sets the MRC pointer to the entry identified by the MRA pointer, in preparation for the next checkpoint interval.
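The following is an assumed sketch, in the same pseudocode style, of how these pointers might be updated when an entry is accessed; it illustrates the behavior described in the two preceding paragraphs and omits boundary cases (e.g., the entry already being the MRA entry, being the LRA or LRC entry, or being newly added):

// Hypothetical sketch: called when an existing entry is accessed.
accessed(entry) {
    // If the accessed entry is the MRC entry, the MRC pointer steps one
    // entry toward the LRA end, so that the entries between the MRA
    // pointer (inclusive) and the MRC pointer (exclusive) remain exactly
    // the entries accessed since the last checkpoint.
    if (entry == mrc)
        mrc = mrc.next;
    // Unlink the entry from its current position, adjusting the
    // prev/next pointers of its neighbors.
    unlink(entry);
    // Relink the entry at the MRA end of the list.
    entry.prev = none;
    entry.next = mra;
    mra.prev = entry;
    mra = entry;
}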

Another aspect of managing the working data 114 and the checkpoint files 120 that is relevant to proper reconstruction of the table during recovery is keeping track of data units (e.g., table entries) that have been removed from (or marked as no longer in use in) the working data 114. To unambiguously identify every data unit that has existed since the start of data processing, including those that have been removed, each data unit is assigned a unique identifier (ID). Since the keys of the entries in the table are unique, the keys can serve as the unique IDs as long as the keys of removed entries are never reused. Otherwise, if keys are to be reused, a separate unique ID can be assigned to each entry. In the examples below, an 8-byte integer that is incremented each time a new entry is added to the table serves as both this unique ID and the entry's key. When an entry is removed from the table during a checkpoint interval, the management program adds its ID to a list of removed entries for that checkpoint interval, which is stored as part of the checkpoint file.
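An assumed sketch of how a removal might be recorded follows; 'removed_keys' is the list consumed by the checkpointing listing below, while the other helper names are illustrative:

// Hypothetical sketch: remove an entry from the working data and
// remember its key so that the removal can be written to the next
// checkpoint file.
remove_entry(entry) {
    // Unlink the entry from the doubly linked LRA-to-MRA list
    // (fix-ups for the MRA/LRA/MRC/LRC pointers are omitted here).
    unlink(entry);
    // Record the key in the list of entries removed during this
    // checkpoint interval; keys are never reused, so the key also
    // serves as the entry's unique ID.
    removed_keys.add(entry.key);
}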

Thus, the checkpoint file for a particular checkpoint interval can include a data structure (e.g., a table) storing the following two types of items (one possible representation is sketched after the list):

(1) items storing the keys of any entries that have been removed since the last checkpoint interval, and

(2) items storing copies of the entries (key and corresponding data) that have been accessed since the last checkpoint interval.
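The sketch below is one assumed representation of these items; the constructor names match those used in the checkpointing listing later in this description, while the field layout itself is illustrative (a third, LRC-marker item type is introduced further below):

// Hypothetical sketch of the items stored in a checkpoint file.
struct checkpoint_item {
    type;    // R for a removal item, E for an entry item
    key;     // key of the entry the item refers to
    data;    // copy of the entry's data (E items only)
    is_new;  // 'New?' flag (E items only), described further below
};

// Constructors matching the names used in the checkpointing listing below.
checkpointed_removal(key) {
    return checkpoint_item(R, key, none, none);
}
checkpointed_entry(entry, is_new) {
    return checkpoint_item(E, entry.key, entry.data, is_new);
}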

One possible recovery procedure reads each checkpoint file in the order in which it was created. For each checkpoint file, the recovery procedure executed by the processing module 106 performs the following steps (sketched briefly after the list):

(1) remove any entries whose keys are stored in items of type (1) in the checkpoint file, and then

(2) add or update the entries stored in items of type (2) in the checkpoint file.
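A brief assumed sketch of these two steps for a single checkpoint file, using the illustrative item representation above and hypothetical helpers ('find_entry', 'delete_entry', 'add_entry'):

// Hypothetical sketch: apply one checkpoint file to the table being rebuilt.
apply_checkpoint_file(checkpoint_file) {
    // Step (1): remove any entries named by removal items.
    for (item in checkpoint_file)
        if (item.type == R) {
            entry = find_entry(item.key);
            if (entry != none)
                delete_entry(entry);
        }
    // Step (2): then add or update entries from entry items.
    for (item in checkpoint_file)
        if (item.type == E) {
            entry = find_entry(item.key);
            if (entry == none)
                add_entry(item.key, item.data);
            else
                entry.data = item.data;
        }
}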

Although reading every checkpoint file starting from the first checkpoint file that was stored would result in correct behavior, an improvement can be made so that, on recovery, the recovery procedure does not need to read every checkpoint file from the beginning of checkpointing and replay every change to the table. By incrementally copying entries that were previously copied into old checkpoint files into new checkpoint files, the checkpointing procedure is eventually able to delete older checkpoint files that are no longer needed. This both speeds up the recovery procedure and reduces data storage cost.

To ensure that old checkpoint files can be deleted safely, without losing any saved state that is necessary for recovering the latest state of every entry in the table (as of the most recently completed checkpoint interval), the management program and the checkpointing procedure together make the LRC pointer incrementally sweep over the entries from the LRA end toward the MRA end, in order to record, in newer checkpoint files, old entries that were previously checkpointed. For each checkpoint interval, the checkpointing procedure saves as many old entries from the LRA end as the new entries it saves from the MRA end. In this way, the checkpointing procedure limits the data transfer cost to being proportional to the number of entries accessed since the last checkpoint interval. As the checkpointing procedure writes old entries from the LRA end into newer checkpoint files, checkpoint files that those entries were previously part of can be deleted, as long as those checkpoint files do not store any old entries that have not yet been copied into a newer checkpoint file. Recovery can then restore the latest table by reading the remaining checkpoint files in order from oldest to newest.

When the data processing system 100 begins processing data and managing the working data 114 stored in the memory module 108, there may be an initial number of checkpoint intervals in which an initial checkpointing procedure builds up an initial set of one or more checkpoint files 120 with only new data units. For example, this initial checkpointing procedure stores new entries at the MRA end of the table, between the MRC pointer (exclusive) and the MRA pointer (inclusive). The LRC pointer is not used for this initial checkpointing procedure. After some number of initial checkpoint files 120 have been stored, a normal (steady-state) checkpointing procedure begins storing old data units as well as new data units (i.e., table entries). When the normal checkpointing procedure starts, the LRC pointer is initially set to the LRA pointer. The checkpointing procedure then stores a limited number of old entries from the LRA end of the table, between the LRC pointer (inclusive) and the MRC pointer (exclusive): as many old entries as remain up to the MRC pointer, or as many as the number of new entries being checkpointed, whichever is fewer.

The following example of the algorithm used by the checkpointing procedure is written in pseudocode. The pseudocode includes functional comments that describe what the pseudocode statements and functions do. The pseudocode uses standard C programming language syntax for conditional statements (e.g., 'if' statements) and loops (e.g., 'while' loops and 'for' loops), and for comments (prefixed by '//'). In this pseudocode listing, the MRA pointer is in the variable 'mra', the LRA pointer is in the variable 'lra', the MRC pointer is in the variable 'mrc', and the LRC pointer is in the variable 'lrc'. Dot notation is used with these variables to refer to portions of the entries identified by the pointers. In particular, the dot notation 'pointer.prev' and 'pointer.next' refers to the position in the table one step closer to the MRA end and to the LRA end, respectively, from the entry pointed to by 'pointer'; and the dot notation 'pointer.checkpoint_number', 'pointer.key', and 'pointer.data' refers to the CPN, key, and data, respectively, of the entry pointed to by 'pointer'. Dot notation is also used to indicate calling certain functions associated with a variable, such as 'item.is_<property>()' to test whether the item represented by the variable 'item' has the property '<property>'. The following algorithm can be executed by the checkpointing procedure for each checkpoint interval.

// Start with an empty list of checkpoint files to remove
files_to_remove = empty_list();

// Open a new (empty) checkpoint file
// (named after the current checkpoint number)
checkpoint_file = open_checkpoint_file(checkpoint_number);

// Advance the MRC pointer (to find the first new entry)
mrc = mrc.prev;

// while loop to copy all entries accessed
// during the most recent checkpoint interval,
// and an equal number of old entries:
while (mrc != mra) {

    // Write the current new entry to the current checkpoint file,
    // with 'New?' set to true
    write(checkpoint_file, checkpointed_entry(mrc, true));

    // Record the checkpoint number in the current new entry
    mrc.checkpoint_number = checkpoint_number;

    // Advance the MRC pointer
    mrc = mrc.prev;

    // If the LRC has not yet caught up with the MRC...
    if (lrc != mrc) {

        // ...write the LRC entry to the current checkpoint file,
        // with 'New?' set to false
        write(checkpoint_file, checkpointed_entry(lrc, false));

        // If the LRC entry was the newest entry in its old checkpoint file,
        // then remove that file
        if (lrc.checkpoint_number != lrc.prev.checkpoint_number)
            files_to_remove.add(lrc.checkpoint_number);

        // Record the new checkpoint number
        lrc.checkpoint_number = checkpoint_number;

        // Advance the LRC pointer
        lrc = lrc.prev;
    }
}

// MRC is now MRA (since the while loop has exited)

// If the LRC has caught up with the MRC, set it back to the LRA
if (lrc == mrc)
    lrc = lra;

// Record removal items carrying the keys of all entries removed during this interval:
for (key in removed_keys)
    write(checkpoint_file, checkpointed_removal(key));

// Record an LRC item with the key of the LRC entry:
write(checkpoint_file, checkpointed_lrc(lrc.key));

// Advance the checkpoint number
checkpoint_number++;

// Remove the files listed for removal
for (file in files_to_remove)
    remove_checkpoint_file(file);

The algorithm above is also represented in the flowchart of FIG. 3A. The checkpointing procedure initializes (300) the checkpointing state, with an empty list of checkpoint files to remove, a new empty checkpoint file, and the MRC pointer advanced to the first new entry. The procedure executes a while loop (302) whose loop condition keeps the loop going until the MRC pointer equals the MRA pointer. Within the while loop (302), the procedure writes (304) the current new entry to the current checkpoint file, records (306) the current CPN in the current MRC entry, and advances (308) the MRC pointer. The procedure checks (310) whether the LRC has reached the MRC, returning for the next while loop iteration if so, and continuing if not. If the while loop (302) continues, the procedure writes (312) the LRC entry to the current checkpoint file. The procedure then checks (314) whether the LRC entry was the newest entry in its old checkpoint file, and if so, marks (316) that checkpoint file for removal by adding it to the list of checkpoint files to be removed. The procedure then records (318) the current CPN in the current LRC entry, and advances (320) the LRC pointer before returning for the next while loop iteration. After the while loop (302) has exited, the procedure checks (322) whether the LRC has reached the MRC, and if so, sets the LRC back to the LRA. The procedure then records (326) removal items with the keys of all entries removed during this checkpoint interval, records (328) an LRC item, advances (330) the CPN, and removes (332) the checkpoint files that were marked for removal.

The following is an example of the algorithm used by the recovery procedure, also written in pseudocode. The algorithm can be implemented to recover the latest consistent state of the table of entries (i.e., the most recently checkpointed state) after a failure.
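The sketch below is a reconstruction, offered as an assumption based on the flowchart description of FIG. 3B that follows, of how such a recovery listing might look; the 'item.is_<property>()' tests follow the notation described above, while 'checkpoint_files_oldest_to_newest', 'find_entry', 'create_entry', 'insert_entry', and 'delete_entry' are illustrative helper names:

// Hypothetical reconstruction of the recovery procedure (see FIG. 3B).
// Start with an empty table: the MRA and LRA pointers are null.
mra = none;
lra = none;

// Process the remaining checkpoint files from oldest to newest...
for (checkpoint_file in checkpoint_files_oldest_to_newest()) {

    // ...and the items within each checkpoint file, in order.
    for (item in checkpoint_file) {

        // Removal item: delete the entry with that key, if present.
        if (item.is_removal()) {
            entry = find_entry(item.key);
            if (entry != none)
                delete_entry(entry);

        // Entry item: update the entry with that key, or create it if it
        // does not exist, and link it into the table being rebuilt
        // (insert_entry sets the prev/next pointers and updates the
        // mra/lra pointers as needed).
        } else if (item.is_entry()) {
            entry = find_entry(item.key);
            if (entry == none)
                entry = create_entry(item.key, item.data);
            else
                entry.data = item.data;
            insert_entry(entry);

        // LRC item: set the LRC pointer to the entry with that key.
        } else if (item.is_lrc()) {
            lrc = find_entry(item.key);
        }
    }
}

// After all checkpoint files have been processed,
// the MRC pointer is set to the MRA entry.
mrc = mra;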

The recovery algorithm is also represented in the flowchart of FIG. 3B. The recovery procedure initializes (340) the MRA and LRA pointers to null values. The recovery procedure then executes nested 'for' loops, with an outer for loop (344) iterating over the checkpoint files (oldest to newest) and an inner for loop (346) iterating over the items within each checkpoint file. Within the inner for loop (346), if an item is a removal item, the procedure removes the corresponding entry. The procedure then checks (350) whether the item is an entry item, and if so, finds (352) the entry with the specified key and updates it with the specified data or, if it is not found, creates an entry with the specified key and data. The updated or created entry is inserted (354) into the appropriate position in the table by setting pointers. After the insertion, or if the item is not an entry item, if the item is an LRC item the procedure sets (356) the LRC to the entry with the specified key. Both for loops then return for their next iterations. After both for loops have exited, the procedure sets the MRC to the MRA.

Other examples of algorithms that can be used by the checkpointing procedure can include other steps. For example, the checkpoint files can be compressed. Checkpoint files can be merged into fewer physical files than one per checkpoint interval. There can, for example, be a periodic rollover that allows the checkpointing procedure to remove, in a constant-time operation, older copies of table entries for which updated copies have already been stored. The list of older entries can be maintained on the basis of entry modification rather than simple access. Instead of recording the checkpoint number of each checkpoint file to be removed as the LRC pointer advances, the procedure can simply remove all checkpoint files associated with checkpoint numbers lower than the checkpoint number of the last LRC entry. Instead of interleaving new and old entries within a checkpoint file and distinguishing them with a flag, the procedure can count the new entries as they are written and then write the same (or a similar) number of old entries after the new entries have been written, into separate sections of the checkpoint file (e.g., a section for new entries, a section for old entries, a section for keys of removed entries, and a section for the key of the LRC entry).

FIGS. 4A to 4F show illustrations of examples of different states of the table of entries and of other in-memory state within the working data 114, and of different states of the checkpoint files 120. FIG. 4A shows a portion of a table 400 storing entries with key/data pairs and a CPN identifying the most recent checkpoint file in which each key/data pair was stored, along with the positions of the MRA, LRA, MRC, and LRC pointers. FIG. 4A also shows a list 402 of removed keys. The table 400 and list 402 are shown in a state just before the checkpointing procedure generates the checkpoint file with CPN 201. For simplicity, the content of the data for a given key is represented in the example by a single letter.

FIGS. 4B and 4C show some of the set of existing checkpoint files (CPN 193 to CPN 200) before the checkpoint file with CPN 201 is generated. FIG. 4B shows the older checkpoint files 193-194, and FIG. 4C shows the newest checkpoint files 198-200. In this example, each checkpoint file is shown as a table, with the items as rows of the table. Each item has one of three possible values for a 'Type' field: E (entry item), R (removal item), or L (LRC item). An entry item has a key value for a 'Key' field, a data value for a 'Data' field, and a T (true) or F (false) for a 'New?' field, indicating whether the entry is a "new" entry or an "old" entry, respectively. As described above, an "old" entry is one that was already stored in a previous checkpoint file and is now being stored again in a newer checkpoint file. A removal item or an LRC item has a key value for the 'Key' field, but no value for the 'Data' or 'New?' fields.

Executing the illustrative algorithm described in the pseudocode above for a checkpoint interval that ends with the table 400 and list 402 as shown in FIG. 4A and the checkpoint files as shown in FIGS. 4B-4C, the checkpointing procedure will store a new checkpoint file with CPN 201, as shown in FIG. 4D. The checkpointing procedure will also remove the checkpoint file with CPN 193, because the transition of the LRC pointer from the last checkpointed entry in that checkpoint file to the last checkpointed entry in the checkpoint file with CPN 194 indicates that no copies of the entries in the checkpoint file with CPN 193 are needed any longer. After the checkpointing procedure has stored the checkpoint file with CPN 201, and before any entries have had a chance to be accessed since the state of FIG. 4A, the table 400 and list 402 have the state shown in FIG. 4E.

If there were a system failure after the checkpoint file with CPN 201 was stored, the system would be able to perform the recovery procedure to recover the state of the table 400 and list 402 (as shown in FIG. 4E) by processing only the remaining checkpoint files (with CPN 194 to CPN 201), in order of their CPNs. FIG. 4F shows a series of snapshots of the state of the table 400 as it is rebuilt. Portions of the state of the table 400 are shown after the first checkpoint file 194 has been processed, and after each of the checkpoint files 198-200 has been processed. Then, after checkpoint file 201 has been processed, the fully recovered state of the table 400 is restored, including the pointers into the table 400 (as shown in FIG. 4E). The MRA and MRC pointers are set to the first entry in the table 400, and the LRA pointer is set to the last entry in the table 400. The LRC pointer value is recovered using the LRC item of the last checkpoint file, which in this example is checkpoint file 201.

FIG. 5 shows a depiction of how the different pointers move over time for a series of nine different checkpoint intervals, with a textured bar for each checkpoint interval representing different portions of the table. The changes in the sizes of the different portions over time illustrate the LRC pointer catching up with the MRC pointer and cycling back to the LRA pointer. In this example, a steady number of data units is accessed during the series of checkpoint intervals, and the size of the table is steady. Below each bar is the current CPN of the checkpoint file being written for that checkpoint interval, and below that is a list of the CPNs of the remaining valid checkpoint files that store the state of the table (after any old checkpoint files have been deleted). In most use cases, including this example, the set of remaining checkpoint files is fairly stable, with on average one checkpoint file being deleted per checkpoint interval.

The checkpointing approach described above can be implemented, for example, using a programmable computing system executing suitable software instructions, or it can be implemented in suitable hardware such as a field-programmable gate array (FPGA), or in some hybrid form. For example, in a programmed approach, the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures, such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program, for example, a program that provides other services related to the design, configuration, and execution of dataflow graphs. The modules of the program (e.g., elements of a dataflow graph) can be implemented as data structures or other organized data conforming to a data model stored in a database.

The software can be provided on a tangible, persistent storage medium such as a CD-ROM or other computer-readable medium (readable by a general-purpose or special-purpose computing system or device), or it can be delivered (encoded in a propagated signal) over a communication medium of a network to a tangible, persistent medium of the computing system that executes it. Some or all of the processing can be performed on a special-purpose computer, or using special-purpose hardware such as coprocessors, field-programmable gate arrays (FPGAs), or application-specific integrated circuits (ASICs). The processing can be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (for example, solid-state memory or media, or magnetic or optical media) of a storage device readable by a general-purpose or special-purpose programmable computer, so as to configure and operate the computer when the storage medium is read by the computer to perform the processing described herein. The inventive system can also be considered to be implemented as a tangible, persistent storage medium configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.

A number of embodiments of the invention have been described. It is to be understood, however, that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order-independent, and thus can be performed in an order different from that described.

Claims (25)

1. A computing system for managing stored data, the computing system comprising:
a memory module configured to store working data that includes a plurality of data units;
a storage system configured to store recovery data that includes multiple sets of one or more data units; and
at least one processor configured to manage transfer of data units between the memory module and the storage system, the managing including:
maintaining an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and
for each of a plurality of time intervals, identifying any data units accessed from the working data during that time interval, and adding to the recovery data a set of two or more data units that includes: one or more data units from the first contiguous portion, the one or more data units including any accessed data units, and one or more data units from the second contiguous portion, the one or more data units including at least one data unit that was previously added to the recovery data.

2. The computing system of claim 1, wherein the managing further includes: for each of the plurality of time intervals, removing from the recovery data at least one set of one or more data units for which any data units still included in the working data are stored in at least one other set of one or more data units.

3. The computing system of claim 1, wherein the managing further includes: for each of the plurality of time intervals, identifying any data units removed from the working data during that time interval.

4. The computing system of claim 3, wherein the second contiguous portion does not include any removed data units.

5. The computing system of claim 3, wherein the managing further includes: for each of the plurality of time intervals, adding to the recovery data information that identifies any removed data units.

6. The computing system of claim 1, wherein identifying any data units accessed from the working data during that time interval includes moving any data units accessed from the working data during that time interval into the first contiguous portion.

7. The computing system of claim 1, wherein the order among the plurality of data units included in the working data is based on how recently the data units were accessed.

8. The computing system of claim 7, wherein the first contiguous portion includes: a most recently accessed data unit of all of the plurality of data units included in the working data, and a least recently accessed data unit of a subset of the plurality of data units that have not been added to the recovery data since being most recently accessed.

9. The computing system of claim 8, wherein the second contiguous portion does not overlap the first contiguous portion.

10. The computing system of claim 9, wherein the second contiguous portion includes at least one data unit that has been added to the recovery data since being most recently accessed.

11. The computing system of claim 10, wherein the one or more data units from the second contiguous portion are limited to a number of data units between approximately half the number of data units in the one or more data units from the first contiguous portion and approximately twice the number of data units in the one or more data units from the first contiguous portion.

12. The computing system of claim 11, wherein the first contiguous portion includes data units that were accessed more recently than any data unit in the second contiguous portion.

13. The computing system of claim 1, wherein a time indicating how recently a data unit was accessed corresponds to a time at which exclusive access to that data unit began.

14. The computing system of claim 1, wherein a time indicating how recently a data unit was accessed corresponds to a time at which exclusive access to that data unit ended.

15. The computing system of claim 1, wherein the managing further includes using the recovery data to restore a state of the working data in response to a failure.

16. The computing system of claim 1, wherein each of the plurality of data units included in the working data is associated with a respective key value.

17. The computing system of claim 16, wherein at least one of the data units included in the working data includes one or more values accessible based on the key value associated with that data unit.

18. The computing system of claim 16, wherein a total number of different key values associated with different data units included in the working data is greater than about 1000.

19. The computing system of claim 1, wherein the time intervals do not overlap one another.

20. The computing system of claim 1, wherein identifying any data units accessed from the working data during that time interval includes identifying at least one of: any data units added to the working data during that time interval, any data units read from the working data during that time interval, or any data units updated within the working data during that time interval.

21. The computing system of claim 1, wherein the memory module includes a volatile storage device.

22. The computing system of claim 1, wherein the storage system includes a non-volatile storage device.

23. A computing system for managing stored data, the computing system comprising:
a memory module configured to store working data that includes a plurality of data units;
a storage system configured to store recovery data that includes multiple sets of one or more data units; and
means for managing transfer of data units between the memory module and the storage system, the managing including:
maintaining an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and
for each of a plurality of time intervals, identifying any data units accessed from the working data during that time interval, and adding to the recovery data a set of two or more data units that includes: one or more data units from the first contiguous portion, the one or more data units including any accessed data units, and one or more data units from the second contiguous portion, the one or more data units including at least one data unit that was previously added to the recovery data.

24. A method for managing transfer of data units between a memory module of a computing system and a storage system of the computing system, the method including:
storing working data that includes a plurality of data units in the memory module;
storing recovery data that includes multiple sets of one or more data units in the storage system;
maintaining an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and
for each of a plurality of time intervals, identifying any data units accessed from the working data during that time interval, and adding to the recovery data a set of two or more data units that includes: one or more data units from the first contiguous portion, the one or more data units including any accessed data units, and one or more data units from the second contiguous portion, the one or more data units including at least one data unit that was previously added to the recovery data.

25. A computer-readable medium storing a computer program for managing transfer of data units between a memory module of a computing system and a storage system of the computing system, the computer program including instructions for causing the computing system to:
store working data that includes a plurality of data units in the memory module;
store recovery data that includes multiple sets of one or more data units in the storage system;
maintain an order of the plurality of data units included in the working data, the order defining a first contiguous portion that includes one or more of the plurality of data units and a second contiguous portion that includes one or more of the plurality of data units; and
for each of a plurality of time intervals, identify any data units accessed from the working data during that time interval, and add to the recovery data a set of two or more data units that includes: one or more data units from the first contiguous portion, the one or more data units including any accessed data units, and one or more data units from the second contiguous portion, the one or more data units including at least one data unit that was previously added to the recovery data.
