WO2025123377A1 - System and methods for fast iterative rewinding of database to point of time in past - Google Patents
- Publication number
- WO2025123377A1 (PCT/CN2023/139312)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- anchor
- lsn
- database
- pages
- slices
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/219—Managing data history or versioning
Definitions
- the present disclosure pertains to the field of database management, and in particular to systems and methods for (e.g., fast iterative) rewinding of a database to a point of time in the past.
- Database restoration is a fundamental process in data management. It involves reverting a database to a previous state, often essential when errors or data corruption occur.
- restoration can only take place if the database is up and running, which may be challenging during system failures.
- restoration to later points in time cannot be achieved, as log data is routinely overwritten once the desired (e.g., first) time point is reached.
- another constraint is the limited available storage space, as change log records are stored in a common storage shared among various clients, potentially leading to accumulated changes that might disable backtracking or reduce the set of available times which can be backtracked to.
- the restoration process necessitates the copying of database data from one location to another, which can be a time-consuming and complex endeavor. During this procedure, the database remains unavailable, potentially causing significant disruptions. Furthermore, user applications may require reconfiguration as a consequence of the restoration, adding an extra layer of complexity to the process.
- a method for managing a database includes managing a plurality of data pages of the database.
- the method may further include generating one or more anchor checkpoints with corresponding one or more log sequence numbers (LSNs) .
- Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN of the one or more LSNs.
- the page directory may record a position of each data page of the plurality of data pages in an append-only storage.
- the corresponding LSN may indicate when each anchor checkpoint was generated.
- Each anchor checkpoint may further indicate a set of anchor pages. For each anchor page of the set of anchor pages, each anchor checkpoint may record one or more positions of one or more versions of that anchor page based on the corresponding LSN of the anchor checkpoint.
- the set of anchor pages may correspond to a set of data pages of the plurality of data pages. The set of anchor pages may be exempt from garbage collection, by which data that is deemed no longer required is deleted.
- Managing the plurality of data pages of the database may include generating change log records describing changes to the plurality of data pages.
- the database may include a plurality of slices.
- the database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices.
- Each slice of the one or more slices may include a set of data pages of the plurality of data pages for storing the data.
- Managing the plurality of data pages of the database may further include writing the change log records to one or more log stores.
- Managing the plurality of data pages of the database may further include maintaining the page directory to record the position of each data page of the plurality of data pages in the append-only storage.
- the method may further include maintaining, at each slice of the plurality of slices, at least one anchor checkpoint of the one or more anchor checkpoints, according to an order determined by at least one corresponding LSN of the at least one anchor checkpoint.
- Generating the one or more anchor checkpoints may include generating, at the plurality of slices, the one or more anchor checkpoints at one or more time intervals.
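The slice, page directory, and anchor checkpoint concepts above can be sketched as follows. This is a minimal illustrative model, not the patented implementation; all class and field names (`Slice`, `page_directory`, `take_anchor_checkpoint`) are hypothetical.

```python
class Slice:
    """Illustrative slice: a container of data pages plus a page directory
    mapping page_id -> (position in append-only storage, page LSN)."""

    def __init__(self, slice_id):
        self.slice_id = slice_id
        self.page_directory = {}      # page_id -> (position, page_lsn)
        self.anchor_checkpoints = []  # kept in order of checkpoint LSN

    def write_page(self, page_id, position, lsn):
        # Pages are never updated in place; each change appends a new
        # version to storage, and the directory records its position.
        self.page_directory[page_id] = (position, lsn)

    def take_anchor_checkpoint(self, current_lsn):
        # An anchor checkpoint snapshots the page directory content as of
        # current_lsn; the page versions it references become anchor pages
        # and are exempt from garbage collection.
        checkpoint = {"lsn": current_lsn,
                      "anchor_pages": dict(self.page_directory)}
        self.anchor_checkpoints.append(checkpoint)
        return checkpoint

s = Slice(slice_id=1)
s.write_page(page_id=3, position=1024, lsn=6)
cp = s.take_anchor_checkpoint(current_lsn=7)
```

Taking checkpoints at intervals (as described above) would simply call `take_anchor_checkpoint` periodically with the latest LSN.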
- a method may be provided for rewinding a database.
- the method may include receiving a request to rewind the database to a target time.
- the database may include a plurality of slices and a plurality of data pages for storing data.
- the database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices.
- Each slice of the plurality of slices may include one or more data pages of the plurality of data pages.
- Each slice of the plurality of slices may further include a set of anchor checkpoints associated with a corresponding set of log sequence numbers (LSNs) .
- Each anchor checkpoint may indicate a set of anchor pages corresponding to a set of data pages of the plurality of data pages.
- the method may further include obtaining a target LSN corresponding to a state of the database at the target time.
- the method may further include obtaining a set of persisted LSNs corresponding to a set of slices of the plurality of slices.
- Each slice of the set of slices may include a valid anchor checkpoint having an LSN smaller than the target LSN.
- the LSN of the valid anchor checkpoint may be closest to the target LSN among the set of LSNs of the set of anchor checkpoints corresponding to said each slice.
- the method may further include obtaining a minimum persisted LSN from the set of persisted LSNs corresponding to the set of slices.
- the method may further include obtaining change log records from one or more log stores based on the minimum persisted LSN and the target LSN.
- the method may further include applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time.
- the valid anchor checkpoint may include a set of anchor pages corresponding to a set of data pages of the plurality of data pages, the method may further include, for said each slice of the set of slices, obtaining the valid anchor checkpoint, wherein the set of anchor pages corresponds to a set of LSNs.
- the method may further include, for said each slice, obtaining a largest LSN among the set of LSNs corresponding to the set of anchor pages.
- the method may further include, for said each slice, setting a persisted LSN of the set of persisted LSNs for said each slice based on the largest LSN.
- Each LSN of the set of LSNs corresponding to the set of anchor checkpoints may indicate when the corresponding anchor checkpoint was generated.
- the change log records may include a starting change log record, an ending change log record and all change log records between the starting and ending change log records.
- Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may include obtaining the starting change log record based on the minimum persisted LSN.
- Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining the ending change log record based on the target LSN.
- Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining all change log records between the starting change log record and ending change log record.
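The rewind flow above (valid checkpoint per slice, persisted LSN, minimum persisted LSN, log record range) can be sketched in a few lines. This is a simplified reading of the described method with hypothetical data structures; the actual apply step is elided.

```python
def rewind(slices, log_store, target_lsn):
    # 1. For each slice, pick its valid anchor checkpoint: the checkpoint
    #    whose LSN is smaller than, and closest to, target_lsn.
    persisted_lsns = {}
    for s in slices:
        valid = max((cp for cp in s["anchor_checkpoints"]
                     if cp["lsn"] < target_lsn),
                    key=lambda cp: cp["lsn"])
        # 2. The slice's persisted LSN is the largest LSN among the anchor
        #    pages recorded in the valid checkpoint.
        persisted_lsns[s["id"]] = max(lsn for _, lsn
                                      in valid["anchor_pages"].values())
    # 3. Take the minimum persisted LSN across slices and request all
    #    change log records between it and target_lsn from the log stores.
    min_persisted = min(persisted_lsns.values())
    records = [r for r in log_store
               if min_persisted < r["lsn"] <= target_lsn]
    # 4. Applying `records` to the anchor pages (elided here) would restore
    #    the database to its state at target_lsn.
    return min_persisted, records

slices = [
    {"id": 1, "anchor_checkpoints": [
        {"lsn": 5, "anchor_pages": {1: (0, 4), 2: (8, 5)}},
        {"lsn": 12, "anchor_pages": {1: (32, 11)}}]},
    {"id": 2, "anchor_checkpoints": [
        {"lsn": 8, "anchor_pages": {3: (16, 7)}}]},
]
log_store = [{"lsn": n} for n in range(1, 13)]
min_persisted, records = rewind(slices, log_store, target_lsn=10)
```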
- a method may be provided for managing a storage of a database.
- the method may allow for auto scaling back of restorable time window size.
- the method may include determining that a threshold for using the storage has been exceeded.
- the storage may include a plurality of slices and a plurality of data pages for storing data.
- the storage may further include one or more page stores.
- Each page store of the one or more page stores may further include one or more slices of the plurality of slices.
- Each slice of the plurality of slices may further include one or more data pages of the plurality of data pages.
- the database may have a restorable time window size indicating an earliest-restorable-time for restoring the database.
- the method may further include obtaining a plurality of usage levels of an append-only storage of the database.
- the plurality of usage levels may correspond to the plurality of slices.
- Each usage level of the plurality of usage levels may indicate a total size of the append-only storage used by a corresponding slice of the plurality of slices.
- the method may further include selecting a set of candidate slices of the plurality of slices based on a highest usage level among the one or more usage levels.
- the method may further include releasing a portion of the plurality of data pages for garbage collection, the portion of the plurality of data pages being based on the set of candidate slices.
- the method may further include obtaining a truncated log sequence number (LSN) corresponding to an updated earliest-restorable-time based on a reduction of the restorable time window size.
- the portion of the plurality of data pages may further be based on the truncated LSN.
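The storage-management steps above (threshold check, usage-based candidate selection, release based on the truncated LSN) can be sketched as follows. All names and the flat data layout are illustrative assumptions, not the patented design.

```python
def scale_back(slices, total_usage, threshold, truncated_lsn, n_candidates=1):
    """If storage use exceeds the threshold, pick the slices with the
    highest append-only storage usage as candidates and release their
    data page versions older than truncated_lsn for garbage collection."""
    if total_usage <= threshold:
        return []
    # Rank slices by how much append-only storage each one uses.
    candidates = sorted(slices, key=lambda s: s["usage"],
                        reverse=True)[:n_candidates]
    released = []
    for s in candidates:
        for page in s["pages"]:
            if page["lsn"] < truncated_lsn:
                released.append(page)  # now eligible for garbage collection
    return released

slices = [
    {"id": 1, "usage": 700, "pages": [{"lsn": 3}, {"lsn": 9}]},
    {"id": 2, "usage": 300, "pages": [{"lsn": 2}]},
]
released = scale_back(slices, total_usage=1000, threshold=800,
                      truncated_lsn=5)
```

Here `truncated_lsn` stands in for the LSN corresponding to the updated earliest-restorable-time after the window is reduced.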
- the one or more anchor checkpoints may be generated at one or more time intervals.
- Each anchor checkpoint of the one or more anchor checkpoints may correspond to a slice of the plurality of slices.
- Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN.
- the page directory may record a position of the plurality of data pages in the append-only storage.
- Each anchor checkpoint may indicate a corresponding set of anchor pages. For each anchor page, said each anchor checkpoint records one or more positions of one or more versions of said each anchor page based on the corresponding LSN.
- the corresponding set of anchor pages may further correspond to a set of data pages of the plurality of data pages.
- an apparatus includes modules configured to perform one or more of the methods and systems described herein.
- the apparatus is generally an electronic apparatus, such as a computer, collection of computers, or other electronic device configured to process data.
- an apparatus configured to perform one or more of the methods and systems described herein.
- a (e.g., non-transitory) computer readable medium stores program code executed by a device, and the program code is used to perform one or more of the methods and systems described herein.
- a chip includes a processor and a data interface, and the processor reads, by using the data interface, an instruction stored in a memory, to perform one or more of the methods and systems described herein.
- wireless stations and access points can be configured with machine readable memory containing instructions, which when executed by the processors of these devices, configures the device to perform one or more of the methods and systems described herein.
- Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
- FIG. 1 illustrates an architecture of a cloud-native database.
- FIG. 2 illustrates a common procedure for backtracking.
- FIG. 3 illustrates a point-in-time-recovery (PITR) workflow.
- FIG. 4 illustrates organization and management of data at a storage layer, according to an embodiment.
- FIG. 5 illustrates data flow for change log records in one or more log stores, according to an embodiment.
- FIG. 6 illustrates a workflow of database backtracking involving page directory, according to an embodiment.
- FIG. 7 illustrates a function of anchor checkpoint preventing anchor pages from being recycled, according to an embodiment.
- FIG. 8 illustrates a method for rewinding a database, according to an embodiment.
- FIG. 9 illustrates performance results for small database instances and large database instances, according to an embodiment.
- FIG. 10 illustrates a method for automatically scaling back restorable time window size, according to an embodiment.
- FIG. 11 illustrates an apparatus that may perform any or all of the operations of the methods, systems and features explicitly or implicitly described herein, according to different aspects of the present disclosure.
- FIG. 12 illustrates a method for managing a database, according to an embodiment.
- FIG. 13 illustrates a method for rewinding a database, according to an embodiment.
- FIG. 14 illustrates a method for managing a storage of a database, according to an embodiment.
- One or more aspects of the disclosure may provide for systems and methods for (e.g. fast iterative) rewinding of a database to a point of time in the past.
- one or more methods for managing a database may be provided.
- the one or more methods for managing a database may involve generating, at one or more intervals, anchor checkpoints that record content of a page directory.
- Each anchor checkpoint may indicate a set of anchor pages, by recording one or more positions of one or more versions of the set of anchor pages based on a corresponding LSN.
- the corresponding LSN for each such anchor checkpoint indicates when the anchor checkpoint was generated.
- the set of anchor pages may be exempt from garbage collection.
- one or more methods may be provided for rewinding the database.
- the one or more methods may leverage the generated anchor checkpoints to allow for rewinding (or iterative rewinding) of the database as described herein.
- the one or more methods involve obtaining a corresponding target LSN of a target time requested for rewinding the database.
- the one or more methods may further involve selecting a set of slices based on having a valid anchor checkpoint that has an LSN smaller than and closest to the target LSN.
- the one or more methods may further involve, for each slice, determining a persisted LSN based on the LSNs of anchor pages in the corresponding valid anchor checkpoint.
- the one or more methods may further involve, selecting the smallest LSN, and further sending this smallest LSN and the corresponding target LSN to log stores requesting for change log records.
- the one or more methods may further involve obtaining the change log records corresponding to the smallest LSN and target LSN and applying the change log records to the anchor pages.
- a data page or a page may refer to a unit of data storage within a database management system. Data of a database may be organized based on smaller-sized items, e.g., pages. A page may have different sizes, such as 4K, 8K, and 16K.
- a change log record refers to a log entry that documents a change made to the data (or a page) within the database. All changes to a database are done by generating change log records for each individual item (page) .
- An example of a change log record may be a redo log for MySQL (a relational database management system) . Change log records may also be referred to as page modification logs.
- LSN may refer to a unique and incremental identifier assigned to each record of the change log records.
- Change log records may be ordered using LSNs.
- LSN order may correspond to the order of change log record creation.
- Log apply may refer to the process of applying changes recorded in a change log record.
- Change log records can be applied on one or more pages.
- a change log record can, for example, indicate writing 10 bytes of content on page #20 at offset 300.
- After a change log record is applied on a page, a new version of the page is generated.
- Each page may have a corresponding page LSN, which is updated according to the LSN of the last change log record applied.
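The log apply step can be sketched as below: write the record's payload at the indicated offset, producing a new page version whose page LSN is advanced to the record's LSN. The page/record layout is a hypothetical illustration.

```python
def apply_log_record(page, record):
    # Write the record's payload into the page at the given offset,
    # producing a new page version, and advance the page LSN to the
    # LSN of the applied record.
    data = bytearray(page["data"])
    off = record["offset"]
    data[off:off + len(record["payload"])] = record["payload"]
    return {"data": bytes(data), "page_lsn": record["lsn"]}

page = {"data": bytes(512), "page_lsn": 6}
# e.g., a record indicating a 10-byte write on the page at offset 300
rec = {"lsn": 7, "offset": 300, "payload": b"0123456789"}
new_page = apply_log_record(page, rec)
```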
- Database state may refer to status of a database at a specific point in time.
- Database state may represent how the data is organized and what values are stored in the database tables and other structures.
- the database state can change as data is inserted, updated, or deleted, and it can be influenced by various database operations and transactions.
- Database state at a particular time may be identified by an LSN, which includes all data page changes up to this LSN. This LSN corresponds to the LSN of the last change log record (corresponding to the last database change) at the particular time.
- FIG. 1 illustrates an architecture of a cloud-native database.
- the architecture 100 may be an architecture of a cloud-native database designed for cloud infrastructure with a separate compute layer 102 and a storage layer 104.
- the compute layer 102 can accept or use various query languages to operate, for example, structured query language (SQL) .
- the compute layer 102 may comprise a slice abstract layer (SAL) 103.
- SAL 103 may serve as the middle layer between the compute layer 102 and storage layer 104 and provide a set of high-level application programing interfaces (APIs) for the compute layer 102 to interact with storage layer 104.
- SAL 103 may be understood to act as a coordinator to guide one or more log stores 106 to ship change log records to one or more page stores 110.
- the database architecture 100 may be a database architecture to which one or more aspects of the disclosure may apply.
- the compute layer 102 may comprise a master component (e.g., a DB master 121) and one or more replica components (e.g., DB replicas 122 and 123) responsible for accepting and processing user queries. All update transactions may be handled by the master component, which ships change log records to the storage layer over a network (e.g., a low-latency remote direct memory access (RDMA) storage network 108) .
- the storage layer 104 may write change log records to a reliable, shared storage (e.g., one or more log stores 106) .
- the storage layer 104 may store database pages, apply changes in accordance with received change log records to keep pages up to date, and respond to read page requests.
- the storage layer may partition the database pages across many storage nodes.
- a storage node may refer to a component or server within the distributed architecture that is responsible for storing and managing a portion of the database's data.
- the storage layer 104 may comprise a shared, reliable, scale-out append-only storage.
- the storage layer 104 may comprise one or more of: log stores 106 and page stores 110.
- the one or more log stores 106 may persist change log records generated by DB master 121.
- the one or more page stores 110 may serve page read requests coming from one or more nodes of the DB master 121 and DB replica (s) 122 and 123.
- Each page store of the one or more page stores 110 may comprise one or more slices.
- Each slice of the one or more slices may comprise one or more data pages for storing data.
- each slice may serve as or operate like a container storing data pages and some meta information to describe these data pages, such as a page directory, which indicates the location of each page on the storage.
- a typical example database may comprise tens of log store nodes and tens of page store nodes.
- a database instance may not restart.
- the only way to bring the system back may be to restore from backup images, but since the instance may have a very large dataset, restoration may take hours to complete.
- during restoration, a new instance with a changed virtual internet protocol (IP) address is created, which further requires the configuration of the original client application to be modified or adjusted to ensure that the client application functions correctly.
- customers need a functionality to rewind their database cluster back and forth to some desired point in time, to determine when a particular data change occurred, or to erase any egregious error, or even to bring the database back to the state prior to when corruption occurred.
- FIG. 2 illustrates a common procedure for backtracking.
- the backtracking procedure 200 may allow a current database cluster to be brought back to a particular point in time.
- This backtracking or rewinding procedure may be used to quickly recover from unintentional data manipulation language (DML) or data definition language (DDL) operations or unrecovered points. In some cases, rewinding may occur multiple times to determine the desired point in time in the database state.
- Operations 210 may refer to operations performed over time on a database that change the state of the database.
- at T0, none of the operations 210 have been performed; T0 is the initial starting point for the database.
- at T1, operations 201 are performed.
- at T2, operations 201, 202 are performed.
- a backtrack operation to T1 is done, where operations 202 become invisible, as if operations 202 never occurred.
- operations 203 are performed (in total, operations 201+203 are performed until T3) .
- operations 204 are performed (in total, operations 201+203+204 are performed until T4) .
- a second backtrack operation to T3 is done, where operations 204 become invisible, as if 204 never occurred. Now the effect after rewinding is that only operations 201+203 are performed, where operations 202 and 204 become invisible.
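The visibility effect of the two backtracks in FIG. 2 can be simulated in a few lines. This is purely an illustration of the semantics (operations after the backtrack target become invisible), not the backtracking mechanism itself.

```python
history = []  # visible operations: (time, operation) pairs

def perform(time, op):
    history.append((time, op))

def backtrack(to_time):
    # Operations after to_time become invisible, as if never performed.
    global history
    history = [(t, op) for (t, op) in history if t <= to_time]

perform(1, "201")
perform(2, "202")
backtrack(to_time=1)   # 202 becomes invisible
perform(3, "203")
perform(4, "204")
backtrack(to_time=3)   # 204 becomes invisible
visible = [op for _, op in history]
```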
- FIG. 3 illustrates a PITR workflow.
- the PITR workflow 300 involves periodically creating 301 backup images for the database and copying them to a separate remote storage 310. When restoration is needed, full backup images may be first shipped 302 from the separate storage to a local storage.
- the PITR workflow 300 may further involve starting 303 a new instance to bring the database to its state as of the time the backup image was generated, and further bring the database up to date incrementally from the generation time of the full backup image to a more recent time, by retrieving the transaction logs from the separate remote storage 310.
- client application may switch 304 to connect to the new master instance.
- Flashback supports viewing past state of the database and rewinding a database by logical flashback on tables or flashback on the whole database.
- Logical flashback relies on undo data, which are records of the effects of each database update and the values overwritten in the update.
- Oracle Flashback relies on Oracle-generated logs to perform flashback database operations.
- Oracle Flashback uses flashback logs to access past versions of data blocks and some information from redo logs.
- the database may only write flashback logs sequentially to an in-memory fast recovery area, without persisting them to disk.
- Oracle flashback can only be performed when the database instance is up and running. If the database is down due to any reason, such as a physical disk problem or a data corruption, flashback cannot rewind the database to any point of time anymore. Also, Oracle Flashback log files are used with an in-memory circular buffer and are not persisted to disk. Further, historical flashback log data will be overwritten after the desired time point due to limited buffer size. Thus, Oracle Flashback can only rewind backward, but not rewind forward. For example, after a flashback from T2 to T1 (where T1 is prior to T2) , flashback logs between T1 and T2 will be overwritten by new flashback logs, and thus any time point between T1 and T2 is not achievable later.
- Aurora backtrack is based on retaining change log records in a first-in-first-out (FIFO) buffer in the storage and moving the database to the desired point of time by replaying sufficient change log records.
- a disadvantage of the Amazon Aurora backtrack is that change log records are retained in the common storage shared between multiple different clients. If the change log records take up too much space, the resulting lack of remaining space may impact other unrelated peer clients. Any hot-spot or imbalance on a single storage node due to change log records accumulation can result in backtracking being turned off or reduction in the actual restorable time window size contrary to customer expectations. For example, an experiment was conducted using the Amazon Aurora backtrack for a large dataset (1TB) with heavy write-only workload. In this experiment, the actual restorable time window size was determined to be around 3 hours, despite a user being able to define a longer target restorable time window size (e.g., 72 hours) .
- PITR may not provide for a quick recovery of a database, especially for large databases.
- system downtime may be too long to allow PITR to be performed multiple times during a short period of time.
- Oracle Flashback may be unable to bring the system back from a crash and there may be a time gap that does not allow user (s) to revert to some point in time.
- backtracking methods described herein may allow bringing a current database cluster to a particular point in time in less time (faster) than existing solutions.
- backtracking methods described herein may support rewinding the database to quickly (compared to existing solutions) recover from unintentional DML or DDL operations or unrecovered points.
- backtracking according to one or more methods can be completed relatively faster than existing solutions, multiple backtracking may be performed with limited or minimal downtime.
- a method may be provided to backtrack or rewind the current database cluster to any point in time at an improved recovery speed, which can be performed iteratively with reduced downtime.
- methods, systems, and apparatus may be provided to support iterative rewinding of a database to a point of time in the past at an improved speed.
- an improved approach to organization and management of data may be provided that obviates one or more limitations of the prior art.
- the improved approach to organization and management of data may allow maintenance of all data necessary to re-create a point in time database snapshot for the past.
- a method may be provided that allows automatic adjustment (scaling back) of the restorable time window size to limit space overhead.
- methods, systems, and apparatus described herein may be applied to or relate to the storage layer 104.
- a method applied at the storage layer may be provided for data organization and management.
- the method may be applied to the one or more log stores 106 and one or more page stores 110 to support reconstructing a database snapshot for a particular point in time for an existing database instance.
- a method may be provided to automatically scale back the restorable time window size at the one or more page stores to effectively control storage overhead.
- FIG. 4 illustrates organization and management of data at a storage layer, according to an embodiment.
- the storage layer 404 may be similar to the storage layer 104.
- one or more log stores 406 may be used to persist all change log records generated within the restorable time window size 420. For example, if the restorable time window size is set as 24 hours, then all change log records generated during the past 24 hours are kept in the one or more log stores 406, while change log records generated prior to the past 24 hours are not kept.
- One or more page stores 410 may selectively retain historical data pages based on a load balancing scheme.
- a group of historical data pages may be chosen from the one or more page stores 410 as the starting point. Complementary change log records are then shipped from the one or more log stores 406 and applied on those chosen group of historical data pages until all data page changes are moved up to an LSN corresponding to the particular point in time, e.g., t1.
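The replay step described above can be sketched as a loop over change log records in LSN order: starting from the chosen historical page versions, complementary records are applied until every page reflects all changes up to the target LSN. The structures below are hypothetical, and the per-record page mutation is reduced to advancing the page LSN.

```python
def replay_to_point_in_time(anchor_pages, change_logs, target_lsn):
    # Start from the chosen historical (anchor) page versions and apply
    # complementary change log records, in LSN order, until every page
    # reflects all changes up to target_lsn.
    pages = {pid: dict(p) for pid, p in anchor_pages.items()}
    for rec in sorted(change_logs, key=lambda r: r["lsn"]):
        if rec["lsn"] > target_lsn:
            break
        page = pages[rec["page_id"]]
        if rec["lsn"] > page["page_lsn"]:  # skip already-reflected changes
            page["page_lsn"] = rec["lsn"]
    return pages

anchor_pages = {1: {"page_lsn": 5}, 2: {"page_lsn": 4}}
change_logs = [{"lsn": 5, "page_id": 1}, {"lsn": 6, "page_id": 2},
               {"lsn": 7, "page_id": 1}, {"lsn": 11, "page_id": 2}]
pages = replay_to_point_in_time(anchor_pages, change_logs, target_lsn=8)
```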
- FIG. 5 illustrates data flow for change log records in one or more log stores, according to an embodiment.
- the top compute layer 502 corresponding to the DB master 501 may accept user transactions to make changes to database pages, which generates change log records describing the page changes.
- user transaction can be read-write transactions (including update, insert, delete operations) which may update the database. These types of read-write transactions will generate change log records and can only be issued by DB master 501.
- Another type of transaction may be read-only transaction which queries the data rather than updates the data. The read-only transactions do not generate change log records and can be accepted by DB replicas. DB replicas can only accept read-only transactions.
- the change log records may persist in the one or more log stores 506 until they fall out of the restorable time window size 420. Change log records may be persisted sequentially into the one or more log stores 506 in strictly increasing LSN order.
- the one or more log stores 506 may provide strong consistency to guarantee storing change log records durably.
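The two log store invariants above (append in strictly increasing LSN order; drop records that fall out of the restorable time window) can be sketched as below. The class and its interface are illustrative assumptions.

```python
class LogStore:
    def __init__(self):
        self.records = []  # (lsn, timestamp, payload), append-only

    def append(self, lsn, timestamp, payload):
        # Change log records are persisted in strictly increasing LSN order.
        if self.records and lsn <= self.records[-1][0]:
            raise ValueError("LSNs must be strictly increasing")
        self.records.append((lsn, timestamp, payload))

    def trim(self, window_start_time):
        # Drop records that have fallen out of the restorable time window.
        self.records = [r for r in self.records if r[1] >= window_start_time]

store = LogStore()
store.append(1, timestamp=100, payload=b"a")
store.append(2, timestamp=200, payload=b"b")
store.trim(window_start_time=150)
```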
- the storage layer 504 may allow for workload and space balancing on a cluster of nodes. The storage layer 504 may further allow for improved service availability (e.g., close to 100%) .
- the compute layer 502 via DB master or a DB replica, may read 515 and 516 data pages at the one or more page stores.
- the DB master 501 may send 514 synchronization information (e.g., log head updates) to the one or more DB replicas 503.
- the synchronization information may include information for synchronization between DB master 501 and DB replica (s) 503.
- the DB master 501 may apply the change log records by modifying or updating 513 one or more data pages in the one or more page stores 510.
- the one or more DB replicas 503 may receive 512 change log records from the one or more log stores 506 to perform further operations such as applying the change log records.
- the one or more page stores 510, 410, and 110 may comprise one or more of: data pages, index pages and page redo deltas.
- Data pages are units of storage within a database that store actual data records. In a database, data is organized into pages, and each page typically contains multiple rows or records of data.
- Index pages may refer to structures used to speed up data retrieval operations in a database. They may contain metadata and pointers to the actual data pages where specific data records are stored. Index pages allow for efficient querying of data by providing a quick way to locate the desired records without scanning the entire dataset.
- Page redo deltas refer to records or logs that document modifications made to one or more data pages within the database. These records capture the details of what has been altered or updated on a page, ensuring that the page can be brought up to date with the most recent version.
- one or more slices at one or more page stores may comprise or correspond to one or more anchor checkpoints.
- each slice of the one or more slices may comprise or correspond to a set of anchor checkpoints with corresponding LSNs.
- Each anchor checkpoint may further comprise a set of anchor pages corresponding to a set of data pages of the plurality of data pages of the database.
- the set of anchor pages for each anchor checkpoint may be associated with a set of LSNs.
- each anchor checkpoint may comprise or indicate the corresponding set of anchor pages by recording the relevant content of the page directory that indicates the set of anchor pages.
- each anchor checkpoint may record content of the page directory relating to full page versions of the corresponding set of anchor pages, without recording the recent redo deltas.
- the page directory may record the position of the full page version 6 of page#3.
- the page directory 602 may record the position of the redo delta version 7, which has not been consolidated (e.g., not yet been applied to the current full page version 6 of page#3 to generate the full page version 7 of page#3) .
- the redo delta version 7 may be applied on full page version 6 on demand to generate the full page version 7.
- recording redo delta version 7 in the page directory may allow for responding to a page reader requesting full page versions that have not yet been generated. This may be one reason for recording the recent redo deltas in the page directory. After the consolidation process has successfully generated full page version 7, there is no longer a need to keep redo delta version 7 in the page directory, as full page version 7 can then be returned directly.
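The on-demand consolidation described above can be illustrated with a small sketch. All class and method names here are hypothetical, and pages are modeled as plain dicts rather than binary pages:

```python
class PageDirectory:
    """Hypothetical sketch: records the latest full version of each page plus
    any unconsolidated redo deltas, and serves newer versions on demand."""

    def __init__(self):
        # page_id -> (full_version, page content as a dict)
        self.full_pages = {}
        # page_id -> list of (version, delta) not yet consolidated
        self.redo_deltas = {}

    def record_full_page(self, page_id, version, content):
        self.full_pages[page_id] = (version, dict(content))
        # consolidation: drop deltas at or below the new full version
        self.redo_deltas[page_id] = [
            (v, d) for (v, d) in self.redo_deltas.get(page_id, []) if v > version
        ]

    def record_redo_delta(self, page_id, version, delta):
        self.redo_deltas.setdefault(page_id, []).append((version, delta))

    def read_page(self, page_id, version):
        base_version, content = self.full_pages[page_id]
        page = dict(content)
        # apply unconsolidated deltas up to the requested version on demand
        for v, delta in sorted(self.redo_deltas.get(page_id, [])):
            if base_version < v <= version:
                page.update(delta)
        return page

# Usage mirroring the example: full page version 6 of page#3 exists, and
# redo delta version 7 has not yet been consolidated.
d = PageDirectory()
d.record_full_page("page#3", 6, {"row1": "a"})
d.record_redo_delta("page#3", 7, {"row1": "b"})
print(d.read_page("page#3", 7))  # → {'row1': 'b'}
d.record_full_page("page#3", 7, {"row1": "b"})  # consolidation drops the delta
```

Once the consolidated full page version 7 is recorded, the delta is removed from the directory and version 7 is returned directly.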
- FIG. 7 illustrates a function of anchor checkpoints preventing anchor pages from being recycled, according to an embodiment.
- An anchor checkpoint (e.g., anchor checkpoint 702) may refer to a snapshot that records the content of the page directory for a particular moment indicated by a corresponding LSN. The LSN corresponding to an anchor checkpoint may further indicate when the anchor checkpoint was generated, which also corresponds to the particular point in time when the content of the page directory was recorded.
- An anchor checkpoint may record the positions of the latest versions (e.g., full versions) of all data pages (also called anchor pages) for the particular moment (indicated by the LSN of the anchor checkpoint) in the append-only storage (i.e., the storage of the page store) .
- an anchor checkpoint may record a position of one or more older versions of anchor pages for backtracking purposes, the one or more older versions being based on the LSN of the anchor checkpoint.
- all anchor pages included in any anchor checkpoint may be inhibited from being recycled by a page cleaner or log cleaner 704 during garbage collection process.
- the anchor pages may not be recycled even though these anchor pages may currently not be needed by any page reader.
- the one or more anchor pages in one or more anchor checkpoints may be exempt from garbage collection.
- an anchor checkpoint may be generated at one or more intervals (e.g., regularly, such as every 20 minutes) .
- garbage collection may refer to a process in which the database management system identifies and handles data pages that are no longer in use or are no longer needed. This process aims to reclaim storage space and ensure that the database remains efficient.
- the page or log cleaner is a component within the database management system responsible for identifying and cleaning up data pages or change log records that are no longer needed. During garbage collection, portions of physical storage containing data that no longer needs to be retained are identified and recycled.
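The pinning behavior of FIG. 7 can be sketched as a filter applied by the cleaner. This is a hypothetical simplification in which storage positions are opaque identifiers and each anchor checkpoint is represented only by the set of positions of its anchor pages:

```python
def recyclable_positions(all_positions, anchor_checkpoints, live_positions):
    """Sketch of the cleaner's filter: a storage position may be recycled only
    if it is neither referenced by the current page directory (live) nor
    pinned as an anchor page by any anchor checkpoint."""
    pinned = set()
    for anchor_page_positions in anchor_checkpoints:
        pinned.update(anchor_page_positions)
    return [p for p in all_positions
            if p not in live_positions and p not in pinned]

# Positions 1-3 are pinned by anchor checkpoints, 5 is live, so only 4
# may be recycled even though positions 1-3 serve no current page reader.
print(recyclable_positions([1, 2, 3, 4, 5], [{1, 2}, {3}], {5}))  # → [4]
```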
- a method 800 may be provided for rewinding a database to a point of time in the past.
- FIG. 8 illustrates a method for rewinding a database, according to an embodiment.
- the method 800 may allow for improved iterative rewinding of a database to a point of time.
- Method 800 may be triggered by a DB master (e.g., DB master 501) and one or more components of the database architecture 100 may be involved in performing operations in the method as described herein.
- DB replica (s) may not be actively involved in the backtracking process, as the DB replica (s) may be inactive during the backtracking period.
- method 800 may occur in the storage layer 104, 404 or 504.
- Method 800 may be used to rewind a database whether the database is up or down. In some embodiments, depending on whether the database is up or down, the compute layer 102 or 502 may determine how method 800 is triggered.
- a database state at a particular time may be identified by an LSN, which includes all data page changes up to this LSN.
- Method 800 may include maintaining 801 a mapping of timescale items, where each timescale item comprises a pair of a current wall-clock time and a current database LSN.
- each timescale item of the pair <current wall-clock time, current database LSN> may be recorded in the mapping at regular intervals (e.g., in a heartbeat manner) .
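The timescale mapping and its lookup can be sketched as follows. The class name and method names are hypothetical; wall-clock times and LSNs are modeled as plain integers:

```python
import bisect

class TimescaleMapping:
    """Heartbeat recordings of <wall-clock time, database LSN> pairs, with
    lookup of the LSN corresponding to a past target time (sketch)."""

    def __init__(self):
        self.times = []  # strictly increasing wall-clock timestamps
        self.lsns = []   # corresponding database LSNs

    def record(self, wall_clock, lsn):
        # called at a regular interval (heartbeat manner)
        self.times.append(wall_clock)
        self.lsns.append(lsn)

    def lsn_for(self, target_time):
        # latest recorded item whose wall-clock time is <= target_time
        i = bisect.bisect_right(self.times, target_time) - 1
        if i < 0:
            raise ValueError("target time precedes the recorded window")
        return self.lsns[i]

m = TimescaleMapping()
m.record(100, 1000)
m.record(160, 2000)
print(m.lsn_for(150))  # → 1000
```

The same lookup serves both the target LSN for a rewind request and the truncated-LSN for window scale-back, since both map a time to the closest earlier recorded LSN.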
- a slice manager may maintain a mapping between the data pages and the slices, and the slice manager may check the mapping to determine what slices are relevant at a particular moment.
- a slice is relevant at a target time when the slice holds data at the target time.
- the slice manager may communicate 803 the target LSN to the one or more relevant slices.
- one or more anchor checkpoints may be generated at one or more intervals (e.g., generated regularly) .
- Each slice may maintain a list of anchor checkpoints according to an order determined by corresponding respective LSNs of the anchor checkpoints. For example, each slice may maintain the list of anchor checkpoints according to the order of the anchor checkpoints’ generation LSN (the database LSN when the anchor checkpoint is generated) .
- the slice may search 804 for an anchor checkpoint (a valid anchor checkpoint) whose generation-LSN is smaller than and closest to the target LSN.
- the search criterion may be an anchor checkpoint that was generated right before the target time that the user specifies.
- ckpt1 (generate-LSN, 100)
- ckpt2 (generate-LSN, 200)
- ckpt3 (generate-LSN, 300)
- for a target LSN between 201 and 300 (e.g., 250) , ckpt2 is the valid anchor checkpoint for the search.
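This search over a slice's ordered anchor checkpoint list can be sketched as (hypothetical function name; checkpoints modeled as (name, generation-LSN) pairs):

```python
def find_valid_checkpoint(checkpoints, target_lsn):
    """checkpoints: (name, generation_lsn) pairs kept in increasing LSN order,
    as in each slice's anchor checkpoint list. Returns the checkpoint whose
    generation-LSN is smaller than and closest to the target LSN, or None."""
    valid = None
    for ckpt in checkpoints:
        if ckpt[1] < target_lsn:
            valid = ckpt  # still below the target; keep the latest such one
        else:
            break  # list is ordered, so no later checkpoint can qualify
    return valid

ckpts = [("ckpt1", 100), ("ckpt2", 200), ("ckpt3", 300)]
print(find_valid_checkpoint(ckpts, 250))  # → ('ckpt2', 200)
```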
- all anchor pages stored in this anchor checkpoint may be loaded, for example, via the relevant slice.
- the largest LSN of these anchor pages may be obtained and set as the persisted LSN of the corresponding slice.
- Method 800 may further include obtaining or collecting 808 persisted LSNs from all relevant slices in the one or more page stores.
- the method 800 may further include obtaining the minimal persisted LSN among the collected persisted LSNs. Considering the relevant slices in the one or more page stores as a whole, the relevant slices would be restored up to at least this minimal persisted LSN. There may still be a gap between the minimal persisted LSN and the target LSN, and this gap may be filled by change log records with the help from the one or more log stores 106, 406 and 506.
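The persisted-LSN computation per slice and the collection step 808 can be sketched as follows. The function names are hypothetical; each slice is represented only by the LSNs of the anchor pages loaded from its valid anchor checkpoint:

```python
def persisted_lsn(anchor_page_lsns):
    """Persisted LSN of a slice: the largest LSN among the anchor pages
    loaded from its valid anchor checkpoint."""
    return max(anchor_page_lsns)

def minimal_persisted_lsn(per_slice_anchor_page_lsns):
    """Collect persisted LSNs from all relevant slices and take the minimum;
    the gap between this LSN and the target LSN is later filled by change
    log records from the log stores."""
    return min(persisted_lsn(lsns) for lsns in per_slice_anchor_page_lsns)

# Two relevant slices: persisted LSNs are 180 and 95, so as a whole the
# slices are restored up to at least LSN 95.
print(minimal_persisted_lsn([[180, 150], [95, 90]]))  # → 95
```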
- the minimal persisted-LSN may be set as the replay start-LSN, and both the replay start-LSN and target-LSN may be sent to the one or more log stores requesting for corresponding change log records.
- the compute layer 502 (e.g., the SAL) may collect 808 persisted LSNs from the relevant slices and select or obtain the minimal persisted LSN among the collected persisted LSNs.
- the one or more log stores may search for valid change log records to ship to the one or more page stores.
- Change log records may persist sequentially in the one-dimensional linear growth log structure of the one or more log stores.
- the change log records may obey continuously increasing LSN order.
- startLSN and endLSN refer to LSNs used to identify and manage changes within the database's transaction log. Together, startLSN and endLSN identify a range of change log records associated with a specific change or transaction. startLSN represents the point in the log where the change or transaction started.
- the change log records include the following: log1 (startLSN: 101, endLSN: 200) , log2 (startLSN: 201, endLSN: 300) , log3 (startLSN: 301, endLSN: 400) , log4 (startLSN: 401, endLSN: 500) , log5 (startLSN: 501, endLSN: 600) .
- the replay start-LSN may be updated to the startLSN of the change log record whose startLSN is smaller than and closest to the replay start-LSN.
- for example, for a replay start-LSN of 250, log 2 fits these criteria and may be selected as the starting change log record. Accordingly, the replay start-LSN may be updated to 201 (the startLSN of log 2) .
- the LSN values between a log record's startLSN and endLSN may not be exposed, per the atomicity guarantee.
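The selection of the change-log range in the example above can be sketched as follows (hypothetical function name; log records modeled as (name, startLSN, endLSN) tuples):

```python
def select_log_records(logs, replay_start_lsn, target_lsn):
    """Starting record: the one whose startLSN is smaller than and closest to
    the replay start-LSN; the replay start-LSN is updated to that startLSN.
    Records are then taken up through the one whose range covers the
    target LSN (sketch; logs are assumed to be in increasing LSN order)."""
    start_rec = max((r for r in logs if r[1] < replay_start_lsn),
                    key=lambda r: r[1])
    new_start = start_rec[1]
    selected = [r for r in logs if new_start <= r[1] <= target_lsn]
    return new_start, selected

logs = [("log1", 101, 200), ("log2", 201, 300), ("log3", 301, 400),
        ("log4", 401, 500), ("log5", 501, 600)]
# e.g., replay start-LSN 250 and target LSN 450: the replay start-LSN is
# updated to 201, and log2 through log4 are shipped to the page stores
print(select_log_records(logs, 250, 450))
```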
- a method 1000 for automatically scaling back restorable time window size for storage overhead control may be provided.
- FIG. 10 illustrates a method for automatically scaling back restorable time window size, according to an embodiment.
- method 1000 may be triggered by one or more page stores.
- the SAL 103 corresponding to the DB master may be involved in performing one or more operations of method 1000 as described herein.
- the SAL 103 via the master SAL (e.g., SAL SQL module corresponding to the DB master) , may be involved in performing one or more operations in reference to 1005, 1006, and 1007.
- one or more of DB replica (s) and log store may be inactive in the process.
- method 1000 may include one or more page stores periodically checking 1001 the total space usage.
- the one or more page stores may be required to have a total space usage lower than a watermark or threshold (e.g., a maximum allowable limit for the total space usage) . Passing this threshold may indicate that the system has reached a critical level of resource utilization, which could potentially impact the performance and availability of the database. Accordingly, the one or more page stores may determine 1002 whether the total space usage is higher than the threshold. If the total space usage is below the threshold, then the one or more page stores may continue to periodically check 1001 the space usage.
- when the total space usage is higher than the threshold, an automatic restorable time window size scaling back mechanism, according to method 1000, will operate to free up more space in storage.
- the one or more page stores may iterate 1003 over all slices and obtain statistics on the total amount of append-only storage used by each slice.
- Obtaining these statistics may include obtaining a total usage level of the append-only storage for each slice, where the usage level indicates the total size of the append-only storage used by the corresponding slice.
- Method 1000 may further include selecting or choosing a set of candidate slices based on the highest usage levels among the obtained usage levels. For example, a number, N, of slices using the largest amount of space in the append-only storage may be chosen as candidates or candidate slices for scaling back the restorable time window size.
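The candidate selection can be sketched as a simple top-N ranking (hypothetical function name; usage levels modeled as bytes per slice):

```python
def candidate_slices(usage_bytes_by_slice, n):
    """Pick the N slices with the highest append-only storage usage levels
    as candidates for scaling back the restorable time window size."""
    ranked = sorted(usage_bytes_by_slice, key=usage_bytes_by_slice.get,
                    reverse=True)
    return ranked[:n]

usage = {"slice1": 900, "slice2": 700, "slice3": 100, "slice4": 800}
print(candidate_slices(usage, 3))  # → ['slice1', 'slice4', 'slice2']
```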
- the one or more page stores may be shared among multiple database instances.
- a candidate slice may belong to a single database instance.
- the one or more page stores may be capable of auto workload balancing and the space overhead may be balanced across all nodes of the slices.
- if a DB instance has a slice (e.g., slice x) occupying a large amount of space, its other slices are likely to occupy a similarly large amount of space as well, since the slices are managed for auto workload balancing. So, the slice (or the slice level) can be used as the granularity for candidate selection, which is expected (or likely) to maintain fairness of space usage among instances in the one or more page stores.
- auto workload balancing may be understood as a best effort mechanism.
- all candidate slices may belong to a single DB instance, where the usage level of each slice of the DB instance is the same (and the number of candidate slices is the same or less than the total number of slices of the DB instance) .
- the window size scale back may affect one DB instance.
- achieving a totally evenly balanced usage level among all slices of a DB instance may be complex, as the user data generated by the workload can be biased.
- the candidate slices may belong to one or more DB instances, where the user workload data is not distributed equally among the slices of a DB instance, or where the number of candidate slices is more than the number of slices of the DB instance with the highest usage level.
- page stores may be shared among DB instances, and each DB instance may own different slices (i.e., each slice may belong to one database instance) .
- page stores may be shared among DB instance 1, DB instance 2, and DB instance 3.
- DB instance 1 may own slice 1, slice 2, and slice 3.
- DB instance 2 may own slice 4, slice 5, and slice 6.
- DB instance 3 may own slice 7, slice 8, and slice 9.
- the one or more slices of the set of candidate slices may belong to one or more DB instances.
- the set of candidate slices (referring to the slices with the highest usage levels) may include slice 1, slice 2 and slice 4. Then the DB instances corresponding to the set of candidate slices may be identified.
- DB instance 1 (to which slice 1 and slice 2 belong) and DB instance 2 (to which slice 4 belongs) may be identified as the instance owners of the set of candidate slices. Thereafter, the slices of the identified DB instances may undergo scaling back operations as described herein. Thus, slices 1 to 6, belonging to the identified DB instances 1 and 2, may undergo scaling back operations.
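The expansion from candidate slices to all slices of their owning instances can be sketched as (hypothetical function name; ownership modeled as plain dicts):

```python
def slices_to_scale_back(candidates, owner_of, slices_of):
    """Expand the candidate slices to all slices of their owning DB
    instances; those slices then undergo the scale-back operations."""
    owners = {owner_of[s] for s in candidates}
    result = set()
    for instance in owners:
        result.update(slices_of[instance])
    return sorted(result)

# Mirrors the example: instance 1 owns slices 1-3, instance 2 owns 4-6,
# instance 3 owns 7-9; candidates are slice1, slice2 and slice4.
slices_of = {"instance1": ["slice1", "slice2", "slice3"],
             "instance2": ["slice4", "slice5", "slice6"],
             "instance3": ["slice7", "slice8", "slice9"]}
owner_of = {s: inst for inst, ss in slices_of.items() for s in ss}
print(slices_to_scale_back(["slice1", "slice2", "slice4"], owner_of, slices_of))
# → ['slice1', 'slice2', 'slice3', 'slice4', 'slice5', 'slice6']
```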
- each candidate slice may notify 1004 its own database instance owner, which has data residing on the slice.
- Method 1000 may further include checking 1006 the timescale mapping to find the corresponding LSN for the new advanced or updated earliest-restorable-time and set it as a truncated-LSN.
- Method 1000 may further include distributing 1007 the truncated-LSN to all relevant slices in the one or more page stores.
- the relevant slices may refer to all slices of the database instance owner (s) of the candidate slices.
- Method 1000 may further include, each relevant slice traversing its anchor checkpoint list to identify anchor checkpoints with LSN smaller than the truncated LSN. Any anchor checkpoint with LSN smaller than the truncated-LSN can be removed 1008, as these anchor checkpoints may no longer be needed for constructing any valid database state. All data pages stored in these deleted anchor checkpoints (including all anchor page versions indicated by the deleted anchor checkpoints) may be released 1009 to page cleaner for garbage collection. Thus, more space is freed up.
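Steps 1008 and 1009 can be sketched as follows (hypothetical function name; each checkpoint modeled as its generation LSN plus the anchor pages it pins):

```python
def truncate_checkpoints(checkpoints, truncated_lsn):
    """Remove anchor checkpoints with an LSN smaller than the truncated-LSN
    and release their anchor pages to the page cleaner for garbage
    collection, freeing up space."""
    kept, released_pages = [], []
    for lsn, anchor_pages in checkpoints:
        if lsn < truncated_lsn:
            released_pages.extend(anchor_pages)  # no longer needed for any
        else:                                    # valid database state
            kept.append((lsn, anchor_pages))
    return kept, released_pages

checkpoints = [(100, ["p1@90"]), (200, ["p1@180", "p2@150"]), (300, ["p2@290"])]
kept, released = truncate_checkpoints(checkpoints, 250)
print(kept)      # → [(300, ['p2@290'])]
print(released)  # → ['p1@90', 'p1@180', 'p2@150']
```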
- the method of rewinding a database described herein may allow for improvements, including a faster rate of rewinding. Because the method is independent of the size of the database, it may provide comparatively faster rewinding for large databases, allowing rewinding to complete in minutes rather than hours.
- managing the plurality of data pages of the database may include generating change log records describing changes to the plurality of data pages.
- the database may include a plurality of slices.
- the database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices.
- Each slice of the one or more slices may include a set of data pages of the plurality of data pages for storing the data.
- Managing the plurality of data pages of the database may further include writing the change log records to one or more log stores.
- Managing the plurality of data pages of the database may further include maintaining the page directory to record the position of each data page of the plurality of data pages in the append-only storage.
- the valid anchor checkpoint for each slice may include a set of anchor pages corresponding to a set of data pages of the plurality of data pages.
- the method may further include, for said each slice of the set of slices, obtaining the valid anchor checkpoint, wherein the set of anchor pages corresponds to a set of LSNs.
- the method may further include, for said each slice, obtaining a largest LSN among the set of LSNs corresponding to the set of anchor pages.
- the method may further include, for said each slice, setting a persisted LSN of the set of persisted LSNs for said each slice based on the largest LSN.
- FIG. 14 illustrates a method for managing a storage of a database, according to an embodiment.
- Method 1400 may be performed by or at one or more nodes of the database.
- the method 1400 may allow for auto scaling back of restorable time window size.
- Method 1400 may include determining 1401 that a threshold for using the storage has been exceeded.
- the storage may include a plurality of slices and a plurality of data pages for storing data.
- the storage may further include one or more page stores.
- Each page store of the one or more page stores may further include one or more slices of the plurality of slices.
- Each slice of the plurality of slices may further include one or more data pages of the plurality of data pages.
- the database may have a restorable time window size indicating an earliest-restorable-time for restoring the database.
- the method may further include obtaining a truncated log sequence number (LSN) corresponding to an updated earliest-restorable-time based on a reduction of the restorable time window size.
- LSN log sequence number
- the portion of the plurality of data pages may further be based on the truncated LSN.
- the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product.
- the software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM) , USB flash disk, or a removable hard disk.
- the software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the aspects of the present invention. For example, such an execution may correspond to a simulation of the logical operations as described herein.
- the software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with aspects of the present invention.
Abstract
Fast iterative rewinding of a database to a past time point is supported. Responsive to a database rewind request, a target LSN for a target time is obtained, along with a set of persisted LSNs corresponding to a set of slices. Each slice includes a valid anchor checkpoint whose LSN is smaller than and closest to the target LSN. A minimum persisted LSN is obtained, along with change log records based on the minimum persisted LSN and the target LSN. The change log records are applied to anchor pages of the slices to restore the database to the target time. Also provided for is database management, including generating anchor checkpoints for recording page directory content corresponding to LSNs and indicating anchor pages. Also provided for is scaling back of restorable time window size, including, when a storage threshold is exceeded, releasing stored data pages for deletion based on usage levels.
Description
The present disclosure pertains to the field of database management, and in particular to systems and methods for (e.g., fast iterative) rewinding of a database to a point of time in the past.
Database restoration is a fundamental process in data management. It involves reverting a database to a previous state, often essential when errors or data corruption occur. However, there are several limitations associated with existing restoration solutions. In some existing solutions, restoration can only take place if the database is up and running, which may be challenging during system failures. Additionally, using these techniques, once the database has been rewound to a first point in time, restoration to later points in time cannot be achieved, as log data is routinely overwritten once the desired (e.g., first) time point is reached. In some existing solutions, another constraint is the limited available storage space, as change log records are stored in a common storage shared among various clients, potentially leading to accumulated changes that might disable backtracking or reduce the set of available times which can be backtracked to. Moreover, in some existing solutions the restoration process necessitates the copying of database data from one location to another, which can be a time-consuming and complex endeavor. During this procedure, the database remains unavailable, potentially causing significant disruptions. Furthermore, user applications may require reconfiguration as a consequence of the restoration, adding an extra layer of complexity to the process.
Therefore, there is a need for systems and methods for rewinding of a database to a point of time in the past that obviates or mitigates one or more limitations of the prior art.
This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present invention.
One or more aspects of the disclosure provide for systems and methods for (e.g., fast iterative) rewinding of a database to a point of time in the past. According to an aspect, a method for managing a database may be provided. The method includes managing a plurality of data pages of the database. The method may further include generating one or more anchor checkpoints with corresponding one or more log sequence numbers (LSNs) . Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN of the one or more LSNs. The page directory may record a position of each data page of the plurality of data pages in an append-only storage. The corresponding LSN may indicate when each anchor checkpoint was generated. Each anchor checkpoint may further indicate a set of anchor pages. For each anchor page of the set of anchor pages, each anchor checkpoint may record one or more positions of one or more versions of that anchor page based on the corresponding LSN of the anchor checkpoint. The set of anchor pages may correspond to a set of data pages of the plurality of data pages. The set of anchor pages may be exempt from garbage collection, by which data that is deemed no longer required is deleted.
Managing the plurality of data pages of the database may include generating change log records describing changes to the plurality of data pages. The database may include a plurality of slices. The database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices. Each slice of the one or more slices may include a set of data pages of the plurality of data pages for storing the data. Managing the plurality of data pages of the database may further include writing the change log records to one or more log stores. Managing the plurality of data pages of the database may further include maintaining the page directory to record the position of each data page of the plurality of data pages in the append-only storage.
The method may further include maintaining, at each slice of the plurality of slices, at least one anchor checkpoint of the one or more anchor checkpoints, according to an order determined by at least one corresponding LSN of the at least one anchor checkpoint. Generating the one or more anchor checkpoints may include generating, at the plurality of slices, the one or more anchor checkpoints at one or more time intervals.
According to another aspect, a method may be provided for rewinding a database. The method may include receiving a request to rewind the database to a target time. The database may include a plurality of slices and a plurality of data pages for storing data. The database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices. Each slice of the plurality of slices may include one or more data pages of the plurality of data pages. Each slice of the plurality of slices may further include a set of anchor checkpoints associated with a corresponding set of log sequence numbers (LSNs) . Each anchor checkpoint may indicate a set of anchor pages corresponding to a set of data pages of the plurality of data pages.
The method may further include obtaining a target LSN corresponding to a state of the database at the target time. The method may further include obtaining a set of persisted LSNs corresponding to a set of slices of the plurality of slices. Each slice of the set of slices may include a valid anchor checkpoint having an LSN smaller than the target LSN. The LSN of the valid anchor checkpoint may be closest to the target LSN among the set of LSNs of the set of anchor checkpoints corresponding to said each slice. The method may further include obtaining a minimum persisted LSN from the set of persisted LSNs corresponding to the set of slices. The method may further include obtaining change log records from one or more log stores based on the minimum persisted LSN and the target LSN. The method may further include applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time.
The valid anchor checkpoint may include a set of anchor pages corresponding to a set of data pages of the plurality of data pages, the method may further include, for said each slice of the set of slices, obtaining the valid anchor checkpoint, wherein the set of anchor pages corresponds to a set of LSNs. The method may further include, for said each slice, obtaining a largest LSN among the set of LSNs corresponding to the set of anchor pages. The method may further include, for said each slice, setting a persisted LSN of the set of persisted LSNs for said each slice based on the largest LSN.
Each LSN of the set of LSNs corresponding to the set of anchor checkpoints may indicate when the corresponding anchor checkpoint was generated. The change log records may include a starting change log record, an ending change log record and all change log records between the starting and ending change log records. Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may include obtaining the starting change log record based on the minimum persisted LSN. Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining the ending change log record based on the target LSN. Obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining all change log records between the starting change log record and ending change log record.
Applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time may include, for each slice of the set of slices, loading the set of anchor pages of the valid anchor checkpoint corresponding to said each slice. Applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time may further include, for each slice of the set of slices, applying the change log records to the loaded set of anchor pages to generate an updated version of the set of anchor pages.
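The rewind flow recapped in this aspect can be sketched end-to-end under a heavily simplified model. All names here are hypothetical; anchor pages are modeled as dicts, change log records as page-content updates, and the replay step ignores per-page LSN checks that a real system would perform:

```python
def rewind_database(slices, logs, target_lsn):
    """slices: per-slice lists of anchor checkpoints, each checkpoint being
    (generation_lsn, {page_id: (page_lsn, content_dict)}).
    logs: list of (start_lsn, end_lsn, {page_id: content_updates}) in
    increasing LSN order.
    Returns the restored pages of each slice at the target LSN (sketch)."""
    restored, persisted_lsns = [], []
    for checkpoints in slices:
        # valid anchor checkpoint: generation-LSN smaller than and closest
        # to the target LSN
        valid = max((c for c in checkpoints if c[0] < target_lsn),
                    key=lambda c: c[0])
        # load the anchor pages of the valid anchor checkpoint
        pages = {pid: dict(content) for pid, (_, content) in valid[1].items()}
        restored.append(pages)
        # persisted LSN: largest LSN among the loaded anchor pages
        persisted_lsns.append(max(lsn for lsn, _ in valid[1].values()))
    replay_start = min(persisted_lsns)  # minimal persisted LSN
    for start_lsn, end_lsn, changes in logs:
        # replay records that overlap the gap between the replay start-LSN
        # and the target LSN
        if end_lsn <= replay_start or start_lsn > target_lsn:
            continue
        for pages in restored:
            for pid, updates in changes.items():
                if pid in pages:
                    pages[pid].update(updates)
    return restored

slices = [
    [(100, {"p1": (90, {"v": 1})}), (200, {"p1": (180, {"v": 2})})],
    [(100, {"p2": (95, {"v": 10})}), (300, {"p2": (290, {"v": 30})})],
]
logs = [(101, 200, {"p2": {"v": 20}}),
        (201, 300, {"p1": {"v": 3}, "p2": {"v": 25}})]
print(rewind_database(slices, logs, 250))
# → [{'p1': {'v': 3}}, {'p2': {'v': 25}}]
```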
According to another aspect of the disclosure a method may be provided for managing a storage of a database. The method may allow for auto scaling back of restorable time window size. The method may include determining that a threshold for using the storage has been exceeded. The storage may include a plurality of slices and a plurality of data pages for storing data. The storage may further include one or more page stores. Each page store of the one or more page stores may further include one or more slices of the plurality of slices. Each slice of the plurality of slices may further include one or more data pages of the plurality of data pages. The database may have a restorable time window size indicating an earliest-restorable-time for restoring the database. The method may further include obtaining a plurality of usage levels of an append-only storage of the database. The plurality of usage levels may correspond to the plurality of slices. Each usage level of the plurality of usage levels may indicate a total size of the append-only storage used by a corresponding slice of the plurality of slices.
The method may further include selecting a set of candidate slices of the plurality of slices based on a highest usage level among the one or more usage levels. The method may further include releasing a portion of the plurality of data pages for garbage collection, the portion of the plurality of data pages being based on the set of candidate slices.
The method may further include obtaining a truncated log sequence number (LSN) corresponding to an updated earliest-restorable-time based on a reduction of the restorable time window size. The portion of the plurality of data pages may further be based on the truncated LSN.
The set of candidate slices may belong to one or more database instances. Each slice of the set of candidate slices may belong to a database instance of the one or more database instances. Releasing the portion of the plurality of data pages for garbage collection may include, for each slice of the plurality of slices belonging to a database instance of the one or more database instances, selecting at least one anchor checkpoint of one or more anchor checkpoints based on the truncated LSN. Each anchor checkpoint may correspond to said each slice of the plurality of slices belonging to the database instance of the one or more database instances. Each anchor checkpoint may further include a set of anchor pages corresponding to a set of data pages of the plurality of data pages. Each anchor checkpoint may further be associated with an LSN indicative of when said anchor checkpoint was generated. The LSN of said each anchor checkpoint may be smaller than the truncated LSN. The method may further include releasing the at least one anchor checkpoint for garbage collection.
The one or more anchor checkpoints may be generated at one or more time intervals. Each anchor checkpoint of the one or more anchor checkpoints may correspond to a slice of the plurality of slices. Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN. The page directory may record a position of the plurality of data pages in the append-only storage. Each anchor checkpoint may indicate a corresponding set of anchor pages. For each anchor page, said each anchor checkpoint records one or more positions of one or more versions of said each anchor page based on the corresponding LSN. The corresponding set of anchor pages may further correspond to a set of data pages of the plurality of data pages.
According to another aspect, an apparatus is provided. The apparatus includes modules configured to perform one or more of the methods and systems described herein. The apparatus is generally an electronic apparatus, such as a computer, collection of computers, or other electronic device configured to process data.
According to one aspect, an apparatus is provided, where the apparatus includes: a memory, configured to store a program; a processor, configured to execute the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform one or more of the methods and systems described herein.
According to another aspect, a (e.g. non-transitory) computer readable medium is provided, where the computer readable medium stores program code executed by a device
and the program code is used to perform one or more of the methods and systems described herein.
According to one aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads, by using the data interface, an instruction stored in a memory, to perform one or more of the methods and systems described herein.
Other aspects of the disclosure provide for apparatus and systems configured to implement the methods according to the first aspect disclosed herein. For example, wireless stations and access points can be configured with machine-readable memory containing instructions which, when executed by the processors of these devices, configure the devices to perform one or more of the methods and systems described herein.
Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described but may also be implemented with other embodiments of that aspect. When embodiments are mutually exclusive, or are incompatible with each other, it will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
FIG. 1 illustrates an architecture of a cloud-native database.
FIG. 2 illustrates a common procedure for backtracking.
FIG. 3 illustrates a point-in-time-recovery (PITR) workflow.
FIG. 4 illustrates organization and management of data at a storage layer, according to an embodiment.
FIG. 5 illustrates data flow for change log records in one or more log stores, according to an embodiment.
FIG. 6 illustrates a workflow of database backtracking involving page directory, according to an embodiment.
FIG. 7 illustrates a function of anchor checkpoints preventing anchor pages from being recycled, according to an embodiment.
FIG. 8 illustrates a method for rewinding a database, according to an embodiment.
FIG. 9 illustrates performance results for small database instances and large database instances, according to an embodiment.
FIG. 10 illustrates a method for automatically scaling back restorable time window size, according to an embodiment.
FIG. 11 illustrates an apparatus that may perform any or all of the operations of the methods, systems and features explicitly or implicitly described herein, according to different aspects of the present disclosure.
FIG. 12 illustrates a method for managing a database, according to an embodiment.
FIG. 13 illustrates a method for rewinding a database, according to an embodiment.
FIG. 14 illustrates a method for managing a storage of a database, according to an embodiment.
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
One or more aspects of the disclosure may provide for systems and methods for (e.g. fast iterative) rewinding of a database to a point of time in the past. According to an aspect, one or more methods (e.g., 1200) for managing a database may be provided. The one or more methods for managing a database may involve generating, at one or more intervals, anchor checkpoints that record content of a page directory. Each anchor checkpoint may indicate a set of anchor pages, by recording one or more positions of one or more versions of the set of anchor pages based on a corresponding LSN. The corresponding LSN for each such anchor checkpoint indicates when the anchor checkpoint was generated. The set of anchor pages may be exempt from garbage collection.
According to another aspect, one or more methods (e.g., methods 800 and 1300) may be provided for rewinding the database. The one or more methods may leverage the
generated anchor checkpoints to allow for rewinding (or iterative rewinding) of the database as described herein. The one or more methods involve obtaining a target LSN corresponding to a target time requested for rewinding the database. The one or more methods may further involve selecting a set of slices based on each having a valid anchor checkpoint with an LSN smaller than and closest to the target LSN. The one or more methods may further involve, for each slice, determining a persisted LSN based on the LSNs of anchor pages in the corresponding valid anchor checkpoint. From the determined persisted LSNs corresponding to the set of slices, the one or more methods may further involve selecting the smallest LSN, and sending this smallest LSN and the target LSN to the log stores to request change log records. The one or more methods may further involve obtaining the change log records between the smallest LSN and the target LSN and applying them to the anchor pages.
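The selection steps above may be illustrated with a minimal sketch (the names `pick_anchor` and `plan_rewind` are hypothetical, and the slice and checkpoint structures are simplified assumptions, not the actual storage-layer data structures):

```python
# Minimal sketch of rewind planning: each slice is modeled as a mapping
# {checkpoint LSN: [anchor page LSNs]} (a simplified assumption).
def pick_anchor(slice_checkpoints, target_lsn):
    """Pick the anchor checkpoint with LSN smaller than and closest to target_lsn."""
    candidates = [lsn for lsn in slice_checkpoints if lsn < target_lsn]
    return max(candidates) if candidates else None

def plan_rewind(slices, target_lsn):
    """Return the (start LSN, target LSN) range to request from the log stores."""
    persisted_lsns = []
    for checkpoints in slices:
        anchor_lsn = pick_anchor(checkpoints, target_lsn)
        if anchor_lsn is None:
            continue  # slice has no usable anchor checkpoint for this target
        # Persisted LSN of the slice: derived from the anchor-page LSNs
        # recorded in the selected checkpoint (here, their minimum).
        persisted_lsns.append(min(checkpoints[anchor_lsn]))
    # The smallest persisted LSN and the target LSN bound the change log
    # records that must be applied to the anchor pages.
    return min(persisted_lsns), target_lsn

slice_a = {100: [90, 95], 200: [180, 190]}
slice_b = {150: [140, 145]}
print(plan_rewind([slice_a, slice_b], 170))  # → (90, 170)
```

For a target LSN of 170, slice_a's usable checkpoint is the one at LSN 100 (200 is too new), slice_b's is at 150; the smallest persisted LSN across both (90) and the target (170) bound the log records to fetch.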
According to another aspect, one or more methods (e.g., methods 1000 and 1400) may be provided for managing a storage of a database. The one or more methods may involve determining that a threshold for using the storage has been exceeded. The one or more methods may further involve obtaining one or more usage levels of the one or more slices of the database. The one or more methods may further involve selecting a set of slices that use the most storage based on the obtained usage levels. The one or more methods may further involve releasing a portion of data pages of the selected set of slices, thereby freeing space. The portion of the data pages that are released may be based on an updated earliest-restorable-time obtained from reducing the restorable time window size of the database. The portion of the data pages may refer to anchor pages of anchor checkpoints that correspond to the updated earliest-restorable-time.
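A minimal sketch of this scale-back procedure follows (the name `scale_back` and the `top_n` parameter are hypothetical, and the slice, usage, and checkpoint structures are simplified assumptions):

```python
# Hypothetical sketch of scaling back the restorable time window: when a
# usage threshold is exceeded, release anchor checkpoints older than the
# truncated LSN on the slices that use the most storage.
def scale_back(slices, usage, threshold, truncated_lsn, top_n=2):
    """Release old anchor checkpoints on the heaviest slices; return what was released."""
    if sum(usage.values()) <= threshold:
        return []  # under threshold: nothing to release
    # Select the slices using the most storage.
    heaviest = sorted(usage, key=usage.get, reverse=True)[:top_n]
    released = []
    for name in heaviest:
        for ckpt_lsn in list(slices[name]):
            if ckpt_lsn < truncated_lsn:  # older than the updated earliest-restorable-time
                released.append((name, ckpt_lsn))
                del slices[name][ckpt_lsn]  # its anchor pages become garbage-collectable
    return released

slices = {"s1": {100: ["pages"], 300: ["pages"]}, "s2": {120: ["pages"]}}
usage = {"s1": 500, "s2": 80}
print(scale_back(slices, usage, threshold=400, truncated_lsn=200, top_n=1))  # → [('s1', 100)]
```

Only slice s1 (the heaviest) is touched, and only its checkpoint at LSN 100 falls before the truncated LSN of 200, so only that checkpoint's anchor pages are released.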
A data page or a page may refer to a unit of data storage within a database management system. Data of a database may be organized based on smaller-sized items, e.g., pages. A page may have different sizes, such as 4K, 8K, and 16K.
A change log record refers to a log entry that documents a change made to the data (or a page) within the database. All changes to a database are made by generating change log records for each individual item (page). An example of a change log record is the redo log in MySQL (a relational database management system). Change log records may also be referred to as page modification logs.
An LSN may refer to a unique and incremental identifier assigned to each record of the change log records. Change log records may be ordered using LSNs. LSN order may correspond to the order of change log record creation. When applying change log records to a page to derive a later version of that page, LSN order must be strictly obeyed; otherwise, the data may be incorrect.
Log apply may refer to the process of applying changes recorded in a change log record. Change log records can be applied to one or more pages. For example, a change log record can indicate writing 10 bytes of content on page #20 at offset 300. After a change log record is applied to a page, a new version of the page is generated. Each page may have a corresponding page LSN, which is updated according to the LSN of the last change log record applied.
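The log apply process described above may be sketched as follows (a simplified illustration; the `Page` class and record layout are hypothetical, not the actual page store implementation):

```python
# Hypothetical sketch of log apply: each change log record writes bytes
# at an offset on a page, and records must arrive in strict LSN order.
class Page:
    def __init__(self, size=16 * 1024):
        self.data = bytearray(size)
        self.page_lsn = 0  # LSN of the last change log record applied

    def apply(self, record):
        # Enforce the strict LSN ordering described above.
        if record["lsn"] <= self.page_lsn:
            raise ValueError("change log records must be applied in increasing LSN order")
        start = record["offset"]
        self.data[start:start + len(record["content"])] = record["content"]
        self.page_lsn = record["lsn"]  # page LSN tracks the last applied record

# The example from the text: write 10 bytes on page #20 at offset 300.
page20 = Page()
page20.apply({"lsn": 101, "offset": 300, "content": b"0123456789"})
print(page20.page_lsn)  # → 101
```

Applying a record with an LSN at or below the current page LSN raises an error, mirroring the rule that out-of-order application may produce incorrect data.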
Database state may refer to the status of a database at a specific point in time. Database state may represent how the data is organized and what values are stored in the database tables and other structures. The database state can change as data is inserted, updated, or deleted, and it can be influenced by various database operations and transactions. Database state at a particular time may be identified by an LSN, which includes all data page changes up to this LSN. This LSN corresponds to the LSN of the last change log record (corresponding to the last database change) at that particular time.
FIG. 1 illustrates an architecture of a cloud-native database. The architecture 100 may be an architecture of a cloud-native database designed for cloud infrastructure with a separate compute layer 102 and a storage layer 104. The compute layer 102 can accept or use various query languages to operate, for example, structured query language (SQL). The compute layer 102 may comprise a slice abstract layer (SAL) 103. SAL 103 may serve as the middle layer between the compute layer 102 and the storage layer 104 and provide a set of high-level application programming interfaces (APIs) for the compute layer 102 to interact with the storage layer 104. In some embodiments, SAL 103 may be understood to act as a coordinator that guides one or more log stores 106 to ship change log records to one or more page stores 110.
An example of such a database may be the Huawei™ Cloud GaussDB™ (for MySQL™). The database architecture 100 may be a database architecture to which one or more aspects of the disclosure may apply.
In an embodiment, the compute layer 102 may comprise a master component (e.g., a DB master 121) and one or more replica components (e.g., DB replicas 122 and 123)
responsible for accepting and processing user queries. All update transactions may be handled by the master component, which ships change log records to the storage layer over a network (e.g., a low-latency remote direct memory access (RDMA) storage network 108) .
The storage layer 104 may write change log records to a reliable, shared storage (e.g., one or more log stores 106) . The storage layer 104 may store database pages, apply changes in accordance with received change log records to keep pages up to date, and respond to read page requests. The storage layer may partition the database pages across many storage nodes. A storage node may refer to a component or server within the distributed architecture that is responsible for storing and managing a portion of the database's data.
The storage layer 104 may comprise a shared, reliable, scale-out append-only storage. The storage layer 104 may comprise one or more of: log stores 106 and page stores 110. The one or more log stores 106 may persist change log records generated by the DB master 121. The one or more page stores 110 may serve page read requests coming from the DB master 121 and DB replicas 122 and 123. Each page store of the one or more page stores 110 may comprise one or more slices. Each slice may comprise one or more data pages for storing data. For example, each slice may serve as or operate like a container storing data pages and some meta-information describing these data pages, such as a page directory which indicates the location of each page in the storage. A typical example database may comprise tens of log store nodes and tens of page store nodes.
In some instances of a database crash, a database instance may not restart. The only way to bring the system back may be to restore from backup images, but since the instance may have a very large dataset, restoration may take hours to complete. As part of restoration, a new instance with changed virtual internet protocol (IP) address is created, which further requires the configuration of the original client application to be modified or adjusted to ensure that the client application functions correctly. As a result, the client business services may be interrupted for quite a long time.
Similarly, for any database product, various mistakes or problems can cause failure. For example, data can be lost or corrupted due to any software or hardware error or failure. Further, erroneous user actions can occur occasionally. For instance, a user may mistakenly perform a destructive action, such as a DELETE without proper WHERE clause. In addition, malicious actions may be taken to break into, alter or damage the database on purpose.
Thus, customers need a functionality to rewind their database cluster back and forth to some desired point in time, to determine when a particular data change occurred, or to erase any egregious error, or even to bring the database back to the state prior to when corruption occurred.
FIG. 2 illustrates a common procedure for backtracking. The backtracking procedure 200 may allow a current database cluster to be brought back to a particular point in time. This backtracking or rewinding procedure may be used to quickly recover from unintentional data manipulation language (DML) or data definition language (DDL) operations or unrecovered points. In some cases, rewinding may occur multiple times to determine the desired point in time in the database state.
Referring to FIG. 2, a common procedure for backtracking may be as follows. Operations 210 may refer to operations performed over time on a database that change the state of the database. At T0, none of the operations 210 have been performed, as it is just the initial starting point for the database. At T1, operations 201 are performed. At T2, operations 201 and 202 have been performed. Then, at T2, a backtrack operation to T1 is done, where operations 202 become invisible, as if operations 202 never occurred. Continuing to T3, operations 203 are performed (in total, operations 201+203 have been performed as of T3). Continuing to T4, operations 204 are performed (in total, operations 201+203+204 have been performed as of T4). Then, at T4, a second backtrack operation to T3 is done, where operations 204 become invisible, as if operations 204 never occurred. The net effect after rewinding is that only operations 201+203 remain, while operations 202 and 204 are invisible.
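The visibility effect of the two backtrack operations in FIG. 2 may be modeled with a simple sketch (the `perform` and `backtrack` helpers are hypothetical; real backtracking operates on change log records, not lists of operation labels):

```python
# Hypothetical model of the FIG. 2 timeline: backtracking makes the
# operation batches performed after the target time invisible.
history = []  # list of (time, operations) batches, in order

def perform(time, ops):
    history.append((time, ops))

def backtrack(target_time):
    # Drop every batch performed after the target time.
    history[:] = [(t, ops) for (t, ops) in history if t <= target_time]

perform("T1", "201")
perform("T2", "202")
backtrack("T1")            # operations 202 become invisible
perform("T3", "203")
perform("T4", "204")
backtrack("T3")            # operations 204 become invisible
print([ops for _, ops in history])  # → ['201', '203']
```

After both backtracks, only operations 201 and 203 remain visible, matching the net effect described above.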
An existing solution to database recovery is PITR, which recovers or restores a database to a specific moment in time. FIG. 3 illustrates a PITR workflow. The PITR workflow 300 involves periodically creating 301 backup images for the database and copying them to a separate remote storage 310. When restoration is needed, full backup images may first be shipped 302 from the separate storage to a local storage. The PITR workflow 300 may further involve starting 303 a new instance to bring the database to its state as of the time the backup image was generated, and further bringing the database up to date incrementally from the generation time of the full backup image to a more recent time, by retrieving the transaction logs from the separate remote storage 310. After restoration is done and the new master instance is up, the client application may switch 304 to connect to the new master instance.
A disadvantage of PITR is that all database data needs to be copied from one location to another for backup restoration. The database will be unavailable during the whole restoration procedure, and if the database backups are located remotely with limited network bandwidth, restoring all data files together with the transaction logs will be a time-consuming process. The PITR process may be complicated and require a recovery time that is proportional to the size of the tablespace hosting the table.
For example, an experiment conducted using PITR on a database with 500GB of data and 100GB of incremental delta data (redo logs) resulted in the following total restoration time: Total Restoration Time = full backup recovery time (30 minutes) + incremental backup restoration time (100GB at ~1.2GB/min, ~90 minutes) ≈ 2 hours. A further experiment was conducted for an even larger database with 1TB of data and 300GB of incremental delta data (redo logs): Total Restoration Time = full backup recovery time (60 minutes) + incremental backup restoration time (300GB at ~1.2GB/min, ~240 minutes) ≈ 5 hours. These experiments illustrate the potential for hours of downtime during restoration, which may be too long for customers; thus, recovery using PITR cannot be performed iteratively.
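The totals above can be reproduced arithmetically, assuming the incremental restoration rate implied by the figures is roughly 1.2 GB per minute (`pitr_total_minutes` is a hypothetical helper, not part of any PITR tool):

```python
# Reproducing the PITR restoration-time arithmetic, assuming an
# incremental restore rate of roughly 1.2 GB per minute.
def pitr_total_minutes(full_backup_minutes, delta_gb, rate_gb_per_min=1.2):
    return full_backup_minutes + delta_gb / rate_gb_per_min

print(round(pitr_total_minutes(30, 100)))   # 500GB case: → 113 minutes (~2 hours)
print(round(pitr_total_minutes(60, 300)))   # 1TB case: → 310 minutes (~5 hours)
```

The key point is that the incremental term grows linearly with the volume of delta data, so total downtime scales with database activity as well as database size.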
Another existing solution to database recovery is Oracle Flashback, which supports viewing past states of the database and rewinding a database by logical flashback on tables or flashback on the whole database. Logical flashback relies on undo data, which are records of the effects of each database update and the values overwritten in the update.
Oracle Flashback relies on Oracle-generated logs to perform flashback database operations. It uses flashback logs to access past versions of data blocks, along with some information from redo logs. The database may only write flashback logs sequentially to an in-memory fast recovery area, without persisting them to disk.
A disadvantage of Oracle Flashback is that flashback can only be performed when the database instance is up and running. If the database is down for any reason, such as a physical disk problem or data corruption, flashback can no longer rewind the database to any point in time. Also, Oracle Flashback log files are used with an in-memory circular buffer and are not persisted to disk. Further, historical flashback log data will be overwritten after the desired time point due to the limited buffer size. Thus, Oracle Flashback can only rewind backward, not forward. For example, after a flashback from T2 to T1 (where T1 is
prior to T2), flashback logs between T1 and T2 will be overwritten by new flashback logs, and thus any time point between T1 and T2 cannot be reached later.
Another existing solution to database recovery is Amazon Aurora backtrack, which is based on retaining change log records in a first-in-first-out (FIFO) buffer in the storage and moving the database to the desired point in time by replaying sufficient change log records.
A disadvantage of Amazon Aurora backtrack is that change log records are retained in a common storage shared between multiple different clients. If the change log records take up too much space, the resulting lack of remaining space may impact other unrelated peer clients. Any hot spot or imbalance on a single storage node due to change log record accumulation can result in backtracking being turned off or a reduction in the actual restorable time window size, contrary to customer expectations. For example, an experiment was conducted using Amazon Aurora backtrack on a large dataset (1TB) with a heavy write-only workload. In this experiment, the actual restorable time window size was determined to be around 3 hours, despite a user being able to define a longer target restorable time window size (e.g., 72 hours).
Existing solutions, such as PITR and Oracle Flashback, may be inadequate to meet customer requirements such as fast recovery of the database cluster to a particular point in time when errors or failures occur in the database. PITR may not provide for a quick recovery of a database, especially for large databases. As a result, the system downtime may be too long to allow PITR to be performed multiple times during a short period of time. Oracle Flashback may be unable to bring the system back from a crash, and there may be a time gap that does not allow users to revert to some point in time.
According to an aspect, backtracking methods described herein may allow bringing a current database cluster to a particular point in time faster than existing solutions. In some embodiments, backtracking methods described herein may support rewinding the database to recover quickly (compared to existing solutions) from unintentional DML or DDL operations or unrecovered points. As backtracking according to one or more methods can be completed relatively faster than existing solutions, multiple backtracks may be performed with limited or minimal downtime.
According to an aspect, a method may be provided to backtrack or rewind the current database cluster to any point in time at an improved recovery speed, and the method can be performed iteratively with reduced downtime.
According to an aspect, methods, systems, and apparatus may be provided to support iterative rewinding of a database to a point of time in the past at an improved speed.
As described above, the PITR backup restoration procedure needs to copy database snapshots from one location to another, adding extra data transfer lag, extra burden on the network, and extra backup storage space cost. In contrast, according to an aspect, an improved approach to the organization and management of data may be provided that obviates one or more limitations of the prior art. For example, the improved approach may allow maintenance of all data necessary to re-create a point-in-time database snapshot of the past. According to another aspect, a method may be provided that allows automatic adjustment (scaling back) of the restorable time window size to limit space overhead.
As will be apparent in view of the above, the restoration time, in existing solutions, is proportional to the size of the database. The larger the database, the longer the time cost for restoration. In contrast, according to an aspect of the present disclosure, a method may be provided that improves the restoration or recovery speed. The method may be capable of rewinding a database, both backward and forward, to any point in time in the past within a specified window with limited amount of data movement, independent of the actual database size. The amount of data movement may be customized, for example, at most 20 minutes of change log records.
According to an aspect, methods, systems, and apparatus described herein may be applied to or relate to the storage layer 104. A method applied at the storage layer may be provided for data organization and management. The method may be applied to the one or more log stores 106 and one or more page stores 110 to support reconstructing a database snapshot for a particular point in time for an existing database instance. According to an aspect, a method may be provided to automatically scale back the restorable time window size at the one or more page stores to effectively control storage overhead.
FIG. 4 illustrates organization and management of data at a storage layer, according to an embodiment. The storage layer 404 may be similar to the storage layer 104. Referring to FIG. 4, one or more log stores 406 may be used to persist all change log records
generated within the restorable time window size 420. For example, if the restorable time window size is set to 24 hours, then all change log records generated during the past 24 hours are kept in the one or more log stores 406, while change log records generated more than 24 hours ago are not kept. One or more page stores 410 may selectively retain historical data pages based on a load balancing scheme.
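Retention within the restorable time window may be sketched as follows (the `prune` helper is hypothetical; real log stores prune by LSN ranges and persisted metadata rather than in-memory lists):

```python
# Hypothetical sketch of log store retention: change log records within
# the restorable time window are kept, older records are dropped.
WINDOW_HOURS = 24

def prune(records, now_hours):
    """Keep only change log records generated within the past WINDOW_HOURS."""
    return [r for r in records if now_hours - r["t"] <= WINDOW_HOURS]

# Records tagged with their generation time in hours.
records = [{"lsn": 1, "t": 0}, {"lsn": 2, "t": 10}, {"lsn": 3, "t": 30}]
print([r["lsn"] for r in prune(records, now_hours=30)])  # → [2, 3]
```

At hour 30 with a 24-hour window, the record generated at hour 0 falls outside the window and is dropped, while the records from hours 10 and 30 are retained.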
To rebuild a past database state at a particular point in time (e.g., rebuild to t1 in FIG. 4), a group of historical data pages may be chosen from the one or more page stores 410 as the starting point. Complementary change log records are then shipped from the one or more log stores 406 and applied to the chosen group of historical data pages until all data page changes are brought up to an LSN corresponding to the particular point in time, e.g., t1.
According to an aspect, a method for storing and managing data at a storage layer of a database may be provided. FIG. 5 illustrates data flow for change log records in one or more log stores, according to an embodiment. In an embodiment, the top compute layer 502 corresponding to the DB master 501 may accept user transactions to make changes to database pages, which generates change log records describing the page changes. In some embodiments, user transactions can be read-write transactions (including update, insert, and delete operations) which may update the database. These read-write transactions generate change log records and can only be issued by the DB master 501. Another type of transaction is the read-only transaction, which queries the data rather than updating it. Read-only transactions do not generate change log records and can be accepted by DB replicas. DB replicas can only accept read-only transactions.
The compute layer 502 may be similar to the compute layer 102. To make the change log records durable, the top compute layer 502 (e.g., DB master) may write 511 the change log records to one or more log stores 506 (similar to the one or more log stores 406) at the storage layer 504. The storage layer 504 may be similar to the storage layer 104. A log store may be a service responsible for storing log data.
Once all change log records belonging to a transaction have been made durable on the one or more log stores 506, transaction completion can be acknowledged to the compute layer 502. This acknowledgement may be done immediately. Durability means that once a transaction is committed (i.e., its changes are saved to the database) , those changes persist even in the case of system failures such as power outages or crashes. This property ensures
that the data is safe and can be relied upon even in adverse conditions. The change log records may persist in the one or more log stores 506 until they fall outside the restorable time window size 420. Change log records persist sequentially in the one or more log stores 506 in strictly increasing LSN order. The one or more log stores 506 may provide strong consistency to guarantee storing change log records durably. The storage layer 504 may allow for workload and space balancing on a cluster of nodes. The storage layer 504 may further allow for improved service availability (e.g., close to 100%).
The compute layer 502, via DB master or a DB replica, may read 515 and 516 data pages at the one or more page stores. The DB master 501 may send 514 synchronization information (e.g., log head updates) to the one or more DB replicas 503. The synchronization information may include information for synchronization between DB master 501 and DB replica (s) 503. The DB master 501 may apply the change log records by modifying or updating 513 one or more data pages in the one or more page stores 510. The one or more DB replicas 503 may receive 512 change log records from the one or more log stores 506 to perform further operations such as applying the change log records.
The one or more page stores 510, 410, and 110 may comprise one or more of: data pages, index pages, and page redo deltas. Data pages are units of storage within a database that store actual data records. In a database, data is organized into pages, and each page typically contains multiple rows or records of data. Index pages may refer to structures used to speed up data retrieval operations in a database. They may contain metadata and pointers to the actual data pages where specific data records are stored. Index pages allow for efficient querying of data by providing a quick way to locate the desired records without scanning the entire dataset. Page redo deltas refer to records or logs that document changes or modifications made to one or more data pages within the database. These records capture the details of what has been altered or updated on a page, ensuring that the page can be brought up to date with the most recent version.
Complete full pages may be reconstructed from an append-only storage on demand. Append-only storage may refer to the real physical storage that is used to store data. In some embodiments, the storage used to store the data pages (e.g., page stores 110) and the change log records (e.g., log store 106) may be an append-only storage. Append-only storage refers to storage that allows writing or adding data (typically at the end of the data storage or physical storage) without changing or removing existing data once it is added.
A page directory may be maintained to track or record the position of all the data pages plus recent redo deltas in the append-only storage. For example, the page directory may record the position of data pages as follows: page1 is in the append-only storage starting from offset=1000, page2 starts from offset=1200, etc. Thus, via the page directory, the location of a particular data page in the append-only storage may be determined. By leveraging the append-only storage of the one or more page stores, old versions of pages are never overwritten. The page directory may provide multiple versions of the same page on demand.
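The append-only property and the page directory's role may be sketched as follows (the `append_page` helper and the version numbering are hypothetical simplifications; a real page store tracks versions by LSN and slice, which is omitted here):

```python
# Hypothetical sketch: the append-only storage never overwrites old page
# versions; the page directory records where each version starts.
storage = bytearray()
directory = {}  # page id -> list of (version, starting offset)

def append_page(page_id, version, payload):
    offset = len(storage)
    storage.extend(payload)  # new data is always appended at the end
    directory.setdefault(page_id, []).append((version, offset))

append_page("page1", 1, b"v1-bytes")
append_page("page1", 2, b"v2-bytes")  # version 1 remains intact in storage
print(directory["page1"])  # → [(1, 0), (2, 8)]
```

Because writing version 2 only appends, the bytes of version 1 stay in place, and the directory can serve either version of page1 on demand.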
FIG. 6 illustrates a workflow of database backtracking involving a page directory, according to an embodiment. In an embodiment, the page directory 602 may record the position of one or more data pages in an append-only storage 604. In some embodiments, for each data page, the page directory may further record the position of recent redo deltas in the append-only storage 604. For example, the page directory 602 may track the position of full page version 6 of page#3 and delta versions 7 and 8 of page#3 in the append-only storage 604. Thus, if the compute layer requests version 6 of page#3, the page directory 602 can return full page version 6 directly. If the compute layer requests the newer version 8 of page#3, the page directory can construct full page version 8 on demand by combining delta versions 7 and 8 with full page version 6. To arrive at full version 8 of page#3, the page directory may apply delta version 7 to full page version 6 to obtain full page version 7, and further apply delta version 8 to full page version 7 to obtain full page version 8. The compute layer may further comprise a buffer pool serving as a cache for frequently accessed data pages.
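The on-demand construction of full page version 8 from FIG. 6 may be sketched as follows (the `build_version` helper is hypothetical, and deltas are simplified to (version, offset, content) triples):

```python
# Hypothetical sketch of on-demand page construction: apply delta
# versions, in order, on top of a full page version (FIG. 6 example).
def build_version(full_page, deltas, wanted_version):
    """Return (version, bytes) after applying deltas up to wanted_version."""
    version, page = full_page
    page = bytearray(page)
    for delta_version, offset, content in deltas:
        if delta_version > wanted_version:
            break  # stop once we reach the requested version
        page[offset:offset + len(content)] = content  # apply delta in version order
        version = delta_version
    return version, bytes(page)

# Full page version 6 of page#3 plus delta versions 7 and 8.
full_v6 = (6, b"AAAAAAAA")
deltas = [(7, 0, b"B"), (8, 4, b"CC")]
print(build_version(full_v6, deltas, 8))  # → (8, b'BAAACCAA')
```

Requesting version 7 instead would apply only the first delta, mirroring the stepwise derivation (6 → 7 → 8) described above.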
According to an aspect, one or more slices at one or more page stores may comprise or correspond to one or more anchor checkpoints. In some embodiments, each slice of the one or more slices may comprise or correspond to a set of anchor checkpoints with corresponding LSNs. Each anchor checkpoint may further comprise a set of anchor pages corresponding to a set of data pages of the plurality of data pages of the database. The set of anchor pages for each anchor checkpoint may be associated with a set of LSNs. In some embodiments, each anchor checkpoint may comprise or indicate the corresponding set of anchor pages by recording the relevant content of the page directory that indicates the set of anchor pages. In some embodiments, while the page directory may track the position of full-page versions and recent redo deltas at runtime, each anchor checkpoint may record content of the page directory relating to full page versions of the corresponding set of anchor pages, without recording the recent redo deltas.
In some embodiments, how the redo deltas are recorded in the page directory may be a dynamic process. When redo logs (e.g., redo deltas) are newly generated, they will be recorded in the page directory 602, as they may be used to generate newer versions of full pages for page readers. Meanwhile, there may be a background consolidation process, which continuously applies redo deltas to the current full page versions in order to advance them. After the redo deltas are consolidated, they are no longer recorded by the page directory. So, normally the page directory may only include recent redo deltas, i.e., redo deltas which have not been consolidated yet.
For example, where the full page version 6 of page#3 has been consolidated, the page directory may record the position of the full page version 6 of page#3. When the redo delta version 7 arrives, the page directory 602 may record the position of the redo delta version 7, which has not been consolidated (e.g., not yet applied to the current full page version 6 of page#3 to generate the full page version 7 of page#3). In such circumstances, when the consolidation process has not yet caught up to generate the full page version 7 of page#3 (e.g., by applying the redo delta version 7 to full page version 6), if a page reader requests full page version 7, then the redo delta version 7 may be applied to full page version 6 on demand to generate the full page version 7. As such, recording the redo delta version 7 in the page directory may allow for responding to a page reader requesting full page versions that have not yet been generated. This may be one reason for recording the recent redo deltas in the page directory. After the consolidation process has successfully generated full page version 7, there is no more need to keep redo delta version 7 in the page directory, as full page version 7 can then be returned directly.
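The background consolidation described above can be sketched as follows. This is a hypothetical Python sketch (the directory entry layout and field names are assumptions): it advances each entry's full page version by applying the recorded deltas in order and drops each delta once it has been consolidated.

```python
# Illustrative sketch of background consolidation: apply redo deltas to
# advance the full page version, then stop tracking the applied deltas.

def consolidate(directory):
    """Advance each entry's full page version by applying all recorded
    deltas in order, removing each delta once it is consolidated."""
    for entry in directory.values():
        while entry["deltas"]:
            v = entry["full_version"] + 1
            entry["page"].update(entry["deltas"].pop(v))
            entry["full_version"] = v

directory = {
    "page#3": {"full_version": 6, "page": {"a": 1}, "deltas": {7: {"b": 2}}}
}
consolidate(directory)
assert directory["page#3"]["full_version"] == 7  # full version advanced
assert directory["page#3"]["deltas"] == {}       # delta no longer tracked
assert directory["page#3"]["page"] == {"a": 1, "b": 2}
```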
FIG. 7 illustrates a function of anchor checkpoints preventing anchor pages from being recycled, according to an embodiment. An anchor checkpoint (e.g., anchor checkpoint 702) may refer to a snapshot that records the content of the page directory at a particular moment indicated by a corresponding LSN. The LSN corresponding to an anchor checkpoint may further indicate when the anchor checkpoint was generated, which also corresponds to the particular point in time when the content of the page directory was recorded. An anchor checkpoint may record the position of one or more latest versions (e.g., full versions) of all data pages (also called anchor pages) for the particular moment (referring to the LSN of the anchor checkpoint) in the append-only storage (referring to the storage of the page store). For example, an anchor checkpoint may record a position of one or more older versions of anchor pages for backtracking purposes, the one or more older versions being based on the LSN of the anchor checkpoint. As described herein, the one or more anchor checkpoints may be generated at one or more intervals. As an example, assume that at time T1, page1 has an LSN=1000, at time T2, page1 has an LSN=2000, and at time T3, page1 has an LSN=3000. If an anchor checkpoint is generated at T2, then page1 with LSN=2000 is an anchor page of the anchor checkpoint.
In an embodiment, all anchor pages included in any anchor checkpoint may be inhibited from being recycled by a page cleaner or log cleaner 704 during the garbage collection process. The anchor pages may not be recycled even though these anchor pages may currently not be needed by any page reader. Thus, the one or more anchor pages in one or more anchor checkpoints may be exempt from garbage collection. In some embodiments, an anchor checkpoint may be generated at one or more intervals (e.g., regularly, such as every 20 minutes).
In the context of database management, garbage collection may refer to a process in which the database management system identifies and handles data pages that are no longer in use or no longer needed. This process aims to reclaim storage space and ensure that the database remains efficient. The page or log cleaner is a component within the database management system responsible for identifying and cleaning up data pages or change log records that are no longer needed. During garbage collection, portions of physical storage containing data that no longer needs to be retained are identified and recycled.
According to an aspect, a method 800 may be provided for rewinding a database to a point of time in the past. FIG. 8 illustrates a method for rewinding a database, according to an embodiment. The method 800 may allow for improved iterative rewinding of a database to a point of time. Method 800 may be triggered by a DB master (e.g., DB master 501), and one or more components of the database architecture 100 may be involved in performing operations in the method as described herein. In some embodiments, the DB replica(s) may not be actively involved in the backtracking process, as the DB replica(s) may be inactive during the backtracking period. In some embodiments, method 800 may occur in the storage layer 104, 404 or 504. Method 800 may be used to rewind a database whether the database is up or down. In some embodiments, depending on whether the database is up or down, the compute layer 102 or 502 may determine how method 800 is triggered.
A database state at a particular time may be identified by an LSN, which includes all data page changes up to this LSN. Method 800 may include maintaining 801 a mapping of timescale items, where each timescale item comprises a pair of a current wall-clock time and a current database LSN. In an embodiment, each timescale item, i.e., the pair <current wall-clock time, current database LSN>, may be recorded in the mapping at a regular interval (e.g., in a heartbeat manner).
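The conversion from a user-specified target time to a target LSN via the timescale mapping can be sketched as follows. This is a hypothetical Python sketch (the mapping layout and units are assumptions): it picks the LSN of the last heartbeat-recorded timescale item at or before the target time.

```python
import bisect

# Hypothetical timescale mapping: heartbeat-recorded
# (wall-clock time, database LSN) pairs, sorted by time.
timescale = [(100, 1000), (160, 2000), (220, 3000)]  # (seconds, LSN)

def target_lsn_for(target_time):
    """Return the LSN of the last timescale item at or before target_time."""
    times = [t for t, _ in timescale]
    i = bisect.bisect_right(times, target_time) - 1
    if i < 0:
        raise ValueError("target time precedes the recorded timescale")
    return timescale[i][1]

assert target_lsn_for(200) == 2000  # last heartbeat before time 200
assert target_lsn_for(220) == 3000  # exact heartbeat time matches
```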
In some embodiments, when the database is up (online and/or operational), the compute layer may accept and receive an SQL command such as "select backtrack time." This command may allow a user to initiate a database rewind. When a user specifies, at 802, a particular time as the target time to rewind the database to, the target time may need to be converted to a target-LSN which directly indicates the actual database state at the target time. To support such conversion, the mapping comprising the timescale items may be used to determine the target LSN. After the corresponding target-LSN is retrieved from the timescale mapping for the target time, the target LSN may be communicated 803 to all relevant slices in the one or more page stores 110, 410, 510. When the database is down (offline and/or non-operational), another (e.g., non-SQL) approach may be used to initiate a database rewind.
In some embodiments, a slice manager may maintain a mapping between the data pages and the slices, and the slice manager may check the mapping to determine what slices are relevant at a particular moment. In some embodiments, a slice is relevant at a target time when the slice holds data at the target time. In some embodiments, after retrieving the target LSN, the slice manager may communicate 803 the target LSN to the one or more relevant slices.
In an embodiment, one or more anchor checkpoints may be generated at one or more intervals (e.g., generated regularly). Each slice may maintain a list of anchor checkpoints according to an order determined by the corresponding respective LSNs of the anchor checkpoints. For example, each slice may maintain the list of anchor checkpoints according to the order of the anchor checkpoints’ generation LSN (the database LSN when the anchor checkpoint is generated). When a relevant slice receives the target LSN, the slice may search 804 for an anchor checkpoint (a valid anchor checkpoint) whose generation-LSN is smaller than and closest to the target LSN. The search criterion may be an anchor checkpoint that was generated right before the target time that the user specifies. For example, if the list of anchor checkpoints at a relevant slice is as follows: ckpt1 (generate-LSN, 100), ckpt2 (generate-LSN, 200), ckpt3 (generate-LSN, 300); and the given target LSN=210, then ckpt2 is the valid anchor checkpoint for the search. Given a target LSN=50, no anchor checkpoint is valid, as ckpt1, ckpt2, and ckpt3 had not yet been generated at that time.
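The search for a valid anchor checkpoint can be sketched with a binary search over the ordered generation-LSN list. This is an illustrative Python sketch under the assumption that each slice keeps its checkpoint LSNs sorted; it reproduces the ckpt1/ckpt2/ckpt3 example above.

```python
import bisect

def find_valid_checkpoint(checkpoints, target_lsn):
    """checkpoints: generation-LSNs in increasing order. Return the LSN
    smaller than (and closest to) target_lsn, or None if no checkpoint
    had been generated yet at the target moment (empty slice)."""
    i = bisect.bisect_left(checkpoints, target_lsn) - 1
    return checkpoints[i] if i >= 0 else None

ckpts = [100, 200, 300]
assert find_valid_checkpoint(ckpts, 210) == 200  # ckpt2 is the valid checkpoint
assert find_valid_checkpoint(ckpts, 50) is None  # slice empty at that moment
```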
Thus, each relevant slice may determine 805 whether a valid anchor checkpoint (the anchor checkpoint having an LSN smaller than and closest to the target LSN) is present in the maintained list of anchor checkpoints. In some embodiments, if no valid anchor checkpoint can be found on the slice, this determination 806 may indicate that the slice is empty at that target moment in the past. The slice may be directly reused to write data just as a new empty slice.
In some embodiments, if a valid anchor checkpoint is found on the slice, then, at 807, all anchor pages stored in this anchor checkpoint may be loaded, for example, via the relevant slice. Among the loaded anchor pages, the largest LSN of these anchor pages may be obtained and set as the persisted LSN of the corresponding slice. Once the anchor pages are loaded, the slice is considered to be restored to the persisted LSN, as it includes all data page changes up to the persisted LSN.
Method 800 may further include obtaining or collecting 808 persisted LSNs from all relevant slices in the one or more page stores. The method 800 may further include obtaining the minimal persisted LSN among the collected persisted LSNs. Considering the relevant slices in the one or more page stores as a whole, the relevant slices would be restored up to at least this minimal persisted LSN. There may still be a gap between the minimal persisted LSN and the target LSN, and this gap may be filled by change log records with help from the one or more log stores 106, 406 and 506. In an embodiment, the minimal persisted-LSN may be set as the replay start-LSN, and both the replay start-LSN and target-LSN may be sent to the one or more log stores to request the corresponding change log records. In an embodiment, the compute layer 502 (e.g., the SAL) may collect 808 persisted LSNs from the relevant slices and select or obtain the minimal persisted LSN among the collected persisted LSNs. The compute layer 502 (e.g., the SAL) may further set the replay start-LSN to the selected minimal persisted LSN and send both the replay start-LSN and target-LSN to the one or more log stores to request the corresponding change log records.
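Steps 807-808 can be sketched as follows. This is an illustrative Python sketch with hypothetical slice and page names: each slice's persisted LSN is the largest LSN among its loaded anchor pages, and the replay start-LSN is the minimum persisted LSN across the relevant slices.

```python
# Hypothetical anchor pages loaded per relevant slice: page -> LSN.
slices = {
    "slice1": {"page1": 180, "page2": 195},
    "slice2": {"page3": 210, "page4": 170},
}

# Per-slice persisted LSN = largest LSN among the loaded anchor pages (807).
persisted = {s: max(pages.values()) for s, pages in slices.items()}
assert persisted == {"slice1": 195, "slice2": 210}

# Replay start-LSN = minimal persisted LSN across the relevant slices (808);
# the gap up to the target LSN is then filled by change log records.
replay_start_lsn = min(persisted.values())
assert replay_start_lsn == 195
```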
After the one or more log stores receive the replay start-LSN and target-LSN from the one or more page stores (comprising the relevant slices), at 809, the one or more log stores
may search for valid change log records to ship to the one or more page stores. Change log records may persist sequentially in the one-dimensional, linearly growing log structure of the one or more log stores. The change log records may obey continuously increasing LSN order. In the context of a change log record, "startLSN" and "endLSN" refer to LSNs which are used to identify and manage changes within the database's transaction log. Together, the startLSN and endLSN help identify a range of change log records associated with a specific change or transaction. The startLSN represents the point in the log where the change or transaction started. The endLSN represents the point in the log where the change or transaction concluded. The startLSN of each change log record may be equal to its preceding change log record's endLSN plus 1. Thus, for a change log record, its startLSN = previous change log record's endLSN + 1. For example, if a first change log record has an endLSN of 200, then the second change log record may have a startLSN of 201 (being equal to the endLSN, 200, of the first change log record + 1). Given the replay start-LSN and the target-LSN, the starting change log record, with startLSN smaller than and closest to the replay start-LSN, may be selected. Further, the ending change log record, with endLSN larger than and closest to the target-LSN, may be selected. All continuous change log records, from the starting change log record to the ending change log record inclusively, may be redistributed to the one or more page stores.
For example, assume that the change log records include the following: log1 (startLSN: 101, endLSN: 200), log2 (startLSN: 201, endLSN: 300), log3 (startLSN: 301, endLSN: 400), log4 (startLSN: 401, endLSN: 500), log5 (startLSN: 501, endLSN: 600). Given a replay start-LSN=250 and a target-LSN=450, log2 may be chosen as the starting change log record (since log2 is the last change log record whose startLSN 201 is smaller than and closest to the replay start-LSN 250) and log4 may be chosen as the ending change log record (since log4 is the first change log record whose endLSN 500 is greater than and closest to the target-LSN 450). Accordingly, log2, log3, and log4 will be redistributed to the one or more page stores.
In some embodiments, a database itself may have an atomicity guarantee for change log record application. This means that if a change log record is applied, then all changes in the change log record take effect. With this guarantee, the database LSN may equal the high boundary (endLSN) of some change log record. So, in the example of having a target-LSN=450, the target LSN may be updated based on the endLSN of the change log record that is greater than and closest to the target-LSN. log4 may fit these criteria and may be selected as the ending change log record. Accordingly, the target-LSN may be updated to 500 (being the endLSN of log4). Similarly, the replay start-LSN may equal the low boundary of a change log record. Given the replay start-LSN of 250, the replay start-LSN may be updated based on the startLSN of the change log record that is smaller than and closest to the replay start-LSN. log2 may fit these criteria and may be selected as the starting change log record. Accordingly, the replay start-LSN may be updated to 201 (being the startLSN of log2). Thus, by the atomicity guarantee, LSN values strictly between a log record's startLSN and endLSN may not be exposed.
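The change log record selection and the boundary snapping implied by the atomicity guarantee can be sketched together as follows. This is an illustrative Python sketch (record layout is an assumption) reproducing the log1-log5 example: it selects the continuous range covering [replay start-LSN, target-LSN] and snaps both LSNs to record boundaries.

```python
# Change log records as (startLSN, endLSN) tuples, continuous and ordered.
logs = [(101, 200), (201, 300), (301, 400), (401, 500), (501, 600)]

def select_logs(logs, replay_start, target):
    """Select the continuous range of change log records covering
    [replay_start, target], and snap both LSNs to record boundaries
    per the atomicity guarantee."""
    start_i = max(i for i, (s, _) in enumerate(logs) if s <= replay_start)
    end_i = min(i for i, (_, e) in enumerate(logs) if e >= target)
    selected = logs[start_i:end_i + 1]
    return selected, logs[start_i][0], logs[end_i][1]

selected, new_start, new_target = select_logs(logs, 250, 450)
assert selected == [(201, 300), (301, 400), (401, 500)]  # log2..log4
assert new_start == 201   # replay start-LSN snapped to log2's startLSN
assert new_target == 500  # target-LSN snapped to log4's endLSN
```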
At 810, the relevant slices in the one or more page stores may receive the compensatory change log records (e.g., valid change log records) shipped from the one or more log stores. Each relevant slice may further apply the compensatory change log records to the anchor pages (which were restored from the anchor checkpoint at 807) to generate new versions of the anchor pages. After all change log records in the range of [replay start-LSN, target-LSN] are applied, all data page changes would be updated up to the target-LSN, corresponding to the database state that the user requested to revert to at 802.
The database has now been reverted to the requested point in time in the past. If the user is satisfied with the database state, the user may operate 812 on it subsequently. If the user is not satisfied with the database state, the user may specify 802 another target time and repeat the procedure to rewind database to another point in time.
FIG. 9 illustrates performance results of backtracking for small database instances and large database instances, according to an example embodiment. Backtracking was performed for small database instances with increasing database sizes (e.g., 6GB, 26GB and 200GB) and relatively close total amounts of change log record movement (e.g., 20GB, 22GB and 24GB). The results are shown in table 900. The measured time spent for backtracking indicated slight variability in duration, ranging from 1 minute 40 seconds to 1 minute 50 seconds as illustrated. Backtracking was also performed for large database instances with the database size maintained at 200GB and increasing total amounts of change log record movement (e.g., 20GB, 250GB and 300GB). The results are shown in table 950. The measured time spent for backtracking ranged from 1 minute 5 seconds to 4 minutes 35 seconds as illustrated. Compared to existing solutions, the measured time spent for backtracking illustrates improved performance. Further, the performance results, as per one or more embodiments, may indicate that the time spent for backtracking may be independent of the database size.
According to an aspect, a method 1000 for automatically scaling back the restorable time window size for storage overhead control may be provided. FIG. 10 illustrates a method for automatically scaling back restorable time window size, according to an embodiment. In some embodiments, method 1000 may be triggered by one or more page stores. The SAL 103 corresponding to the DB master may be involved in performing one or more operations of method 1000 as described herein. For example, the SAL 103, via the master SAL (e.g., the SAL SQL module corresponding to the DB master), may be involved in performing one or more operations in reference to 1005, 1006, and 1007. In some embodiments, one or more of the DB replica(s) and the log store may be inactive in the process.
In some embodiments, method 1000 may include one or more page stores periodically checking 1001 the total space usage. To ensure or guarantee that there is always a certain level of free resources available for the database to operate smoothly and to maintain its fundamental functionalities, the one or more page stores may be required to have a total space usage lower than a watermark or threshold (e.g., a maximum allowable limit for the total space usage). Exceeding this threshold may indicate that the system has reached a critical level of resource utilization, which could potentially impact the performance and availability of the database. Accordingly, the one or more page stores may determine 1002 whether the total space usage is higher than the threshold. If the total space usage is below the threshold, then the one or more page stores may continue to periodically check 1001 the space usage.
In an embodiment, when the total space usage is higher than the threshold, an automatic restorable time window size scale-back mechanism, according to method 1000, will operate to free up more space in storage. The one or more page stores may iterate 1003 over all slices and obtain statistics on the total amount of append-only storage used by each slice. Obtaining these statistics may include obtaining a total usage level of the append-only storage for each slice, where the usage level indicates the total size of the append-only storage used by the corresponding slice.
Method 1000 may further include selecting or choosing a set of candidate slices based on the highest usage levels among the obtained one or more usage levels. For example, a number, N, of slices which use the largest amount of space of the append-only storage may be chosen as candidates or candidate slices for scaling back the restorable time window size.
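The candidate selection at 1003 can be sketched as a top-N pick by usage level. This is an illustrative Python sketch with hypothetical slice names and usage figures, not the claimed implementation.

```python
# Hypothetical per-slice append-only storage usage (in GB).
usage = {"slice1": 40, "slice2": 35, "slice3": 12, "slice4": 38, "slice5": 9}

def pick_candidates(usage, n):
    """Choose the N slices with the highest append-only storage usage
    as candidates for restorable time window scale-back."""
    return sorted(usage, key=usage.get, reverse=True)[:n]

assert pick_candidates(usage, 3) == ["slice1", "slice4", "slice2"]
```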
In some embodiments, the one or more page stores may be shared among multiple database instances. A candidate slice may belong to a single database instance. In some embodiments, the one or more page stores may be capable of auto workload balancing, and the space overhead may be balanced across all nodes of the slices. Thus, if one database instance has one slice (e.g., slice x) occupying a large amount of space, then for the same database instance, its other slices are likely to occupy a large amount (a similar amount as that of slice x) of space as well, since the slices are managed for auto workload balancing. So, the slice (or the slice level) can be used as the granularity for candidate selection, which is expected (or likely) to maintain fairness of space usage among instances in the one or more page stores. In various embodiments, auto workload balancing may be understood as a best-effort mechanism. Thus, in some cases, all candidate slices may belong to a single DB instance, where the usage level of each slice of the DB instance is the same (and the number of candidate slices is the same as or less than the total number of slices of the DB instance). In such ideal scenarios, the window size scale-back may affect one DB instance. However, achieving a totally evenly balanced usage level among all slices of a DB instance may be complex, as the user data generated by the workload can be biased. Thus, in some cases, the candidate slices may belong to one or more DB instances, where the user workload data is not distributed equally among the slices of a DB instance, or where the number of candidate slices is more than the number of slices of the DB instance with the highest usage level.
In some embodiments, page stores may be shared among DB instances, and each DB instance may own different slices (i.e., each slice may belong to one database instance). For example, page stores may be shared among DB instance 1, DB instance 2, and DB instance 3. DB instance 1 may own slice 1, slice 2, and slice 3. DB instance 2 may own slice 4, slice 5, and slice 6. And DB instance 3 may own slice 7, slice 8, and slice 9. The one or more slices of the set of candidate slices may belong to one or more DB instances. For example, the set of candidate slices (referring to the slices with the highest usage levels) may include slice 1, slice 2 and slice 4. Then the DB instances corresponding to the set of candidate slices may be identified. For example, DB instance 1 (to which slice 1 and slice 2 belong) and DB instance 2 (to which slice 4 belongs) may be identified as the instance owners of the set of candidate slices. Thereafter, the slices of the identified DB instances may undergo scale-back operations as described herein. Thus, slices 1 to 6, belonging to the identified DB instances 1 and 2, may undergo scale-back operations.
After selecting the set of candidate slices, each candidate slice may notify 1004 its own database instance owner, which has data residing on the slice. Each instance may maintain an earliest-restorable-time, which may be calculated as: earliest-restorable-time = current time - restorable time window size. Any database state prior to the earliest-restorable-time cannot be reverted to. Therefore, any page data and change log records that are only needed for reconstruction of database states prior to the earliest-restorable-time may be recycled. To free up space promptly, the restorable time window size may shrink 1005 by a certain length of time (e.g., 20 minutes). Thereafter, the earliest-restorable-time may be advanced or updated according to the same calculation (earliest-restorable-time = current time - restorable time window size).
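The window shrink and the resulting advance of the earliest-restorable-time can be worked through numerically. This is an illustrative Python sketch; the time unit and the 20-minute shrink step are example values, not fixed by the disclosure.

```python
# All quantities in minutes; values are hypothetical.
current_time = 1000
window = 180                      # restorable time window size

earliest = current_time - window  # earliest-restorable-time
assert earliest == 820

window -= 20                      # shrink the window (1005) to free space
earliest = current_time - window  # earliest-restorable-time advances
assert earliest == 840            # states before 840 may now be recycled
```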
Method 1000 may further include checking 1006 the timescale mapping to find the corresponding LSN for the newly advanced or updated earliest-restorable-time and setting it as a truncated-LSN. Method 1000 may further include distributing 1007 the truncated-LSN to all relevant slices in the one or more page stores. The relevant slices may refer to all slices of the database instance owner(s) of the candidate slices.
Method 1000 may further include each relevant slice traversing its anchor checkpoint list to identify anchor checkpoints with an LSN smaller than the truncated-LSN. Any anchor checkpoint with an LSN smaller than the truncated-LSN can be removed 1008, as these anchor checkpoints may no longer be needed for constructing any valid database state. All data pages stored in these deleted anchor checkpoints (including all anchor page versions indicated by the deleted anchor checkpoints) may be released 1009 to the page cleaner for garbage collection. Thus, more space is freed up.
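The checkpoint pruning at 1008-1009 can be sketched as follows. This is an illustrative Python sketch (the list-of-LSNs representation is an assumption): checkpoints below the truncated-LSN are removed, and their anchor pages become eligible for garbage collection.

```python
def prune_checkpoints(checkpoints, truncated_lsn):
    """Split a slice's anchor checkpoint list (LSNs in increasing order)
    into checkpoints to remove (LSN < truncated-LSN) and to keep.
    Anchor pages of removed checkpoints become eligible for GC."""
    removed = [c for c in checkpoints if c < truncated_lsn]
    kept = [c for c in checkpoints if c >= truncated_lsn]
    return removed, kept

removed, kept = prune_checkpoints([100, 200, 300, 400], 250)
assert removed == [100, 200]  # released to the page cleaner (1009)
assert kept == [300, 400]     # still needed for valid database states
```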
The total space usage may then be calculated to determine 1002 whether the total space usage is lower than the threshold. If the total space usage is still higher than the threshold, then the restorable time window size scale-back mechanism, referring to the operations described in reference to 1003 to 1009, may be triggered iteratively until the total space usage is lower than the threshold for the one or more page stores.
According to an aspect, data may be organized and managed in a manner described herein that allows maintaining all data necessary to re-create a point-in-time snapshot of a database in the past. Compared to existing techniques for restoring a database, such as the PITR technique, one or more aspects of the disclosure may provide improvements including obviating the need for one or more of: copying data files from backup storage, extra backup storage space, extra burden on the network, creating a new instance, and data transfer lags.
Compared to existing techniques for restoring a database, such as Flashback, one or more aspects described herein may provide improvements including allowing a database to be restored or reverted to a point in time if the database is down or if data of the database has been corrupted (e.g., by a software or a hardware corruption), as long as the corruption occurred within the restorable time window size. One or more aspects described herein may further provide improvements over existing techniques, such as Oracle Flashback, including allowing or supporting a database to rewind to any point of time iteratively (e.g., with no time gap that cannot be rewound to).
One or more aspects described herein may further provide improvements over existing techniques, such as Aurora backtrack, including providing auto workload balancing, and thereby removing or reducing hotspots that may affect backtrack capability. One or more aspects described herein may further provide improvements over existing techniques, such as Aurora backtrack, including reduced space overhead of the one or more page stores (compared to that of Aurora backtrack), which may further allow for maintaining an improved restorable time window size even under heavy write workload.
One or more aspects described herein may further provide improvements over existing techniques, such as Aurora backtrack, including balancing the space overhead across all nodes of the cluster since most of the space overhead may not be tied to any specific hardware node.
According to another aspect, a method of rewinding a database may be provided. The method may allow for rewinding the database both backward and forward. The method may further allow for rewinding the database to any point in time in the past within a specified restorable time window size with a limited amount of data movement (i.e., at most 20 minutes of change log records), independent of the actual database size. Compared to existing solutions for restoring a database, such as PITR, the method of rewinding a database described herein may allow for improvements, including a faster or improved rate of rewinding the database: since the method is independent of the size of the database, it may provide comparatively faster rewinding for large database sizes (allowing rewinding to complete in minutes rather than hours).
According to another aspect, a method or a mechanism may be provided that allows automatically scaling back of the restorable time window size to limit space overhead. The method for automatically scaling back of the restorable time window size described herein may provide for improvements over existing techniques (such as PITR) including allowing for the storage cost to be manageable.
The method for automatically scaling back of the restorable time window size described herein may provide for improvements over existing techniques, such as Aurora Backtrack, including providing support to revert a database back and forth iteratively without dramatically reducing the restorable time window size, even for large databases.
FIG. 11 illustrates an apparatus 1100 that may perform any or all of the operations of the methods, systems and features explicitly or implicitly described herein, according to different aspects of the present disclosure. For example, a computer equipped with network functionality may be configured as the apparatus 1100. In some aspects, apparatus 1100 can be a device that connects to the network infrastructure over a radio interface, such as a mobile phone, smart phone or other such device that may be classified as user equipment (UE). In some embodiments, the apparatus 1100 may be a Machine Type Communications (MTC) device (also referred to as a machine-to-machine (m2m) device), or another such device that may be categorized as a UE despite not providing a direct service to a user. In some embodiments, apparatus 1100 may perform one or more operations in one or more methods described herein. For example, the apparatus 1100 may be one or more of: a database or a storage node, a master node, a replica node, a log store node, a page store node, a database module or component, an SAL or SAL node, an SAL query language module, a slice manager, and the like, according to one or more aspects described herein.
As shown, the apparatus 1100 may include a processor 1110, such as a Central Processing Unit (CPU) or specialized processors such as a Graphics Processing Unit (GPU) or other such processor unit, memory 1120, non-transitory mass storage 1130, input-output interface 1140, network interface 1150, and a transceiver 1160, all of which are communicatively coupled via bi-directional bus 1170. Transceiver 1160 may include one or multiple antennas. According to certain aspects, any or all of the depicted elements may be utilized, or only a subset of the elements. Further, apparatus 1100 may contain multiple instances of certain elements, such as multiple processors, memories, or transceivers. Also, elements of the hardware device may be directly coupled to other elements without the bi-directional bus. Additionally or alternatively to a processor and memory, other electronics or processing electronics, such as integrated circuits, application specific integrated circuits, field programmable gate arrays, digital circuitry, analog circuitry, chips, dies, multichip modules, substrates or the like, or a combination thereof, may be employed for performing the required logical operations.
Memory 1120 may include any type of non-transitory memory such as static random-access memory (SRAM) , dynamic random-access memory (DRAM) , synchronous DRAM (SDRAM) , read-only memory (ROM) , any combination of such, or the like. The mass storage element 1130 may include any type of non-transitory storage device, such as a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain aspects, memory 1120 or mass storage 1130 may have recorded thereon statements and instructions executable by the processor 1110 for performing any method operations described herein.
The processor 1110 and memory 1120 may function together as a chipset which may be provided together for installation into wireless communication apparatus 1100 to implement WLAN functionality. The chipset may be configured to receive as input data including but not limited to PPDUs from the network interface 1150. The chipset may be configured to output data including but not limited to PPDUs to the network interface 1150.
Aspects of the present disclosure can be implemented using electronic hardware, software, or a combination thereof. In some embodiments, this may be implemented by one or multiple computer processors executing program instructions stored in memory. In some embodiments, the invention is implemented partially or fully in hardware, for example using one or more field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) to rapidly perform processing operations.
FIG. 12 illustrates a method for managing a database, according to an embodiment. Method 1200 may be performed by or at one or more nodes of the database. Method 1200 includes managing 1201 a plurality of data pages of the database. The method may further include generating 1202 one or more anchor checkpoints with corresponding one or more log sequence numbers (LSNs) . Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN of the one or more LSNs. The page directory may record a position of each data page of the plurality of data pages in an append-only storage. The corresponding LSN may indicate when said each anchor checkpoint was
generated. Each anchor checkpoint may further indicate a set of anchor pages. For each anchor page of the set of anchor pages, said each anchor checkpoint may record one or more positions of one or more versions of said anchor page based on the corresponding LSN of said each anchor checkpoint. In some embodiments, each anchor checkpoint may indicate the corresponding set of anchor pages by recording the relevant content of the page directory that indicates the set of anchor pages. For example, each anchor checkpoint may record content of the page directory (e.g. positions) relating to full page versions of the corresponding set of anchor pages, without recording the recent redo deltas. The set of anchor pages may correspond to a set of data pages of the plurality of data pages. The set of anchor pages may be exempt from garbage collection.
In some embodiments, managing the plurality of data pages of the database may include generating change log records describing changes to the plurality of data pages. The database may include a plurality of slices. The database may further include one or more page stores, each page store of the one or more page stores may include one or more slices of the plurality of slices. Each slice of the one or more slices may include a set of data pages of the plurality of data pages for storing the data. Managing the plurality of data pages of the database may further include writing the change log records to one or more log stores. Managing the plurality of data pages of the database may further include maintaining the page directory to record the position of each data page of the plurality of data pages in the append-only storage.
In some embodiments, the method may further include maintaining, at each slice of the plurality of slices, at least one anchor checkpoint of the one or more anchor checkpoints, according to an order determined by at least one corresponding LSN of the at least one anchor checkpoint. Generating the one or more anchor checkpoints may include generating, at the plurality of slices, the one or more anchor checkpoints at one or more time intervals.
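For illustration only, the checkpoint-generation step of method 1200 might be sketched as follows; the `PageDirectory` and `AnchorCheckpoint` structures, their fields, and the `generate_anchor_checkpoint` function are hypothetical simplifications invented for this sketch, not the claimed implementation:

```python
from dataclasses import dataclass, field


@dataclass
class PageDirectory:
    # page_id -> list of (lsn, position) entries for full page versions in
    # the append-only storage; recent redo deltas are tracked elsewhere and
    # deliberately omitted, mirroring the description above.
    full_page_versions: dict = field(default_factory=dict)


@dataclass
class AnchorCheckpoint:
    lsn: int           # LSN at which this checkpoint was generated
    anchor_pages: dict  # page_id -> positions of full versions at or below lsn


def generate_anchor_checkpoint(directory: PageDirectory,
                               current_lsn: int) -> AnchorCheckpoint:
    """Record, per page, the positions of full page versions up to
    current_lsn; the pages captured this way become anchor pages."""
    anchor_pages = {}
    for page_id, versions in directory.full_page_versions.items():
        kept = [(lsn, pos) for (lsn, pos) in versions if lsn <= current_lsn]
        if kept:
            anchor_pages[page_id] = kept
    return AnchorCheckpoint(lsn=current_lsn, anchor_pages=anchor_pages)


# Example: two pages with full versions at various LSNs/offsets.
directory = PageDirectory(full_page_versions={
    "p1": [(10, 0), (40, 4096)],
    "p2": [(25, 8192)],
})
ckpt = generate_anchor_checkpoint(directory, current_lsn=30)
print(ckpt.anchor_pages)  # p1 keeps only the LSN-10 version; p2 keeps LSN 25
```

In such a scheme, the recorded positions are offsets into the append-only storage, and pages referenced by a live anchor checkpoint would be marked exempt from garbage collection.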
FIG. 13 illustrates a method for rewinding a database, according to an embodiment. Method 1300 may be performed by or at one or more nodes of the database. Method 1300 may include receiving 1301 a request to rewind the database to a target time. The database may include a plurality of slices and a plurality of data pages for storing data. The database may further include one or more page stores. Each page store of the one or more page stores may further include one or more slices of the plurality of slices. Each slice of the plurality of slices may further include one or more data pages of the plurality of data
pages. Each slice of the plurality of slices may further include a set of anchor checkpoints associated with a corresponding set of log sequence numbers (LSNs). Each anchor checkpoint may indicate a set of anchor pages corresponding to a set of data pages of the plurality of data pages. In some embodiments, each anchor checkpoint may indicate the corresponding set of anchor pages by recording the relevant content of the page directory that indicates the set of anchor pages. For example, each anchor checkpoint may record content of the page directory relating to full page versions of the corresponding set of anchor pages, without recording the recent redo deltas. In some embodiments, each anchor checkpoint may record the position of full-page versions of the corresponding set of anchor pages without recording the recent redo deltas, since generating full pages based on full page versions may be sufficient for backtracking.
The method may further include obtaining 1302 a target LSN corresponding to a state of the database at the target time. The method may further include obtaining 1303 a set of persisted LSNs corresponding to a set of slices of the plurality of slices. Each slice of the set of slices may include a valid anchor checkpoint having an LSN smaller than the target LSN. The LSN of the valid anchor checkpoint may be closest to the target LSN among the set of LSNs of the set of anchor checkpoints corresponding to said each slice. The method may further include obtaining 1304 a minimum persisted LSN from the set of persisted LSNs corresponding to the set of slices. The method may further include obtaining 1305 change log records from one or more log stores based on the minimum persisted LSN and the target LSN. The method may further include applying 1306 the change log records to one or more anchor pages of the set of slices to restore the database to the target time.
In some embodiments, the valid anchor checkpoint for each slice (of the set of slices) may include a set of anchor pages corresponding to a set of data pages of the plurality of data pages. In such embodiments, the method may further include, for said each slice of the set of slices, obtaining the valid anchor checkpoint, wherein the set of anchor pages corresponds to a set of LSNs. The method may further include, for said each slice, obtaining a largest LSN among the set of LSNs corresponding to the set of anchor pages. The method may further include, for said each slice, setting a persisted LSN of the set of persisted LSNs for said each slice based on the largest LSN.
In some embodiments, each LSN of the set of LSNs corresponding to the set of anchor checkpoints may indicate when the corresponding anchor checkpoint was generated. In some embodiments, the change log records may include a starting change log record, an
ending change log record and all change log records between the starting and ending change log records. In some embodiments, obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may include obtaining the starting change log record based on the minimum persisted LSN. In some embodiments, obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining the ending change log record based on the target LSN. In some embodiments, obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN may further include obtaining all change log records between the starting change log record and ending change log record.
In some embodiments, applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time may include, for each slice of the set of slices, loading the set of anchor pages of the valid anchor checkpoint corresponding to said each slice. Applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time may further include, for each slice of the set of slices, applying the change log records to the loaded set of anchor pages to generate an updated version of the set of anchor pages.
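As a non-authoritative sketch of method 1300, the per-slice selection of a valid anchor checkpoint, the derivation of the minimum persisted LSN, and the replay of change log records might look like the following; all data shapes here (checkpoint dictionaries, `(lsn, value)` page tuples, additive deltas) are invented for illustration:

```python
def pick_valid_anchor(checkpoints, target_lsn):
    """Valid anchor checkpoint: the one whose LSN is largest among those
    strictly below the target LSN."""
    below = [c for c in checkpoints if c["lsn"] < target_lsn]
    return max(below, key=lambda c: c["lsn"]) if below else None


def rewind(slices, target_lsn, log_records):
    persisted, loaded = {}, {}
    for sname, ckpts in slices.items():
        ckpt = pick_valid_anchor(ckpts, target_lsn)
        # Load the checkpoint's anchor pages: page -> (page_lsn, value).
        loaded[sname] = dict(ckpt["anchor_pages"])
        # Persisted LSN: largest LSN among the checkpoint's anchor pages.
        persisted[sname] = max(lsn for (lsn, _v) in ckpt["anchor_pages"].values())
    min_persisted = min(persisted.values())
    # Replay change log records in the interval (min_persisted, target_lsn].
    for rec_lsn, sname, page, delta in sorted(log_records):
        if min_persisted < rec_lsn <= target_lsn:
            page_lsn, val = loaded[sname][page]
            if rec_lsn > page_lsn:  # skip changes already in the anchor page
                loaded[sname][page] = (rec_lsn, val + delta)
    return loaded


# Example: two slices, each with its own anchor checkpoints.
slices = {
    "s1": [{"lsn": 10, "anchor_pages": {"a": (8, 100)}},
           {"lsn": 30, "anchor_pages": {"a": (28, 130)}}],
    "s2": [{"lsn": 20, "anchor_pages": {"b": (15, 200)}}],
}
log_records = [(18, "s2", "b", 5), (32, "s1", "a", 7),
               (33, "s2", "b", 1), (40, "s2", "b", 9)]
restored = rewind(slices, target_lsn=35, log_records=log_records)
print(restored)  # {'s1': {'a': (32, 137)}, 's2': {'b': (33, 206)}}
```

Note that the record at LSN 40 is ignored because it lies past the target LSN, which is what makes the result a consistent image of the database at the target time.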
FIG. 14 illustrates a method for managing a storage of a database, according to an embodiment. Method 1400 may be performed by or at one or more nodes of the database. The method 1400 may allow for auto scaling back of restorable time window size. Method 1400 may include determining 1401 that a threshold for using the storage has been exceeded. The storage may include a plurality of slices and a plurality of data pages for storing data. The storage may further include one or more page stores. Each page store of the one or more page stores may further include one or more slices of the plurality of slices. Each slice of the plurality of slices may further include one or more data pages of the plurality of data pages. The database may have a restorable time window size indicating an earliest-restorable-time for restoring the database.
The method may further include obtaining 1402 a plurality of usage levels of an append-only storage of the database. The plurality of usage levels may correspond to the plurality of slices. Each usage level of the plurality of usage levels may indicate a total size of the append-only storage used by a corresponding slice of the plurality of slices.
The method may further include selecting 1403 a set of candidate slices of the plurality of slices based on a highest usage level among the plurality of usage levels. The method may further include releasing 1404 a portion of the plurality of data pages for garbage collection, the portion of the plurality of data pages being based on the set of candidate slices.
In some embodiments, the method may further include obtaining a truncated log sequence number (LSN) corresponding to an updated earliest-restorable-time based on a reduction of the restorable time window size. The portion of the plurality of data pages may further be based on the truncated LSN.
In some embodiments, the set of candidate slices may belong to one or more database instances. Each slice of the set of candidate slices may belong to a database instance of the one or more database instances. In some embodiments, releasing the portion of the plurality of data pages for garbage collection may include, for each slice of the plurality of slices belonging to a database instance of the one or more database instances, selecting at least one anchor checkpoint of one or more anchor checkpoints based on the truncated LSN. Each anchor checkpoint may correspond to said each slice of the plurality of slices belonging to the database instance of the one or more database instances. Each anchor checkpoint may further include a set of anchor pages corresponding to a set of data pages of the plurality of data pages. In some embodiments, each anchor checkpoint may indicate the corresponding set of anchor pages by recording the relevant content of the page directory that indicates the set of anchor pages. For example, each anchor checkpoint may record content of the page directory (e.g., positions) relating to full page versions of the corresponding set of anchor pages, without recording the recent redo deltas. Each anchor checkpoint may further be associated with an LSN indicative of when said anchor checkpoint was generated. The LSN of said each anchor checkpoint may be smaller than the truncated LSN. The method may further include releasing the at least one anchor checkpoint for garbage collection.
In some embodiments, the one or more anchor checkpoints may be generated at one or more time intervals. Each anchor checkpoint of the one or more anchor checkpoints may correspond to a slice of the plurality of slices. Each anchor checkpoint may record content of a page directory at a moment in time indicated by a corresponding LSN. The page directory may record a position of the plurality of data pages in the append-only storage. Each anchor checkpoint may indicate a corresponding set of anchor pages. For each anchor page, said each anchor checkpoint records one or more positions of one or more versions of
said each anchor page based on the corresponding LSN. The corresponding set of anchor pages may further correspond to a set of data pages of the plurality of data pages.
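A minimal sketch of the storage-management flow of method 1400 follows, under the assumption (not stated in the claims) that the newest anchor checkpoint below the truncated LSN is retained so that the shortened window remains restorable; the names `select_candidates` and `release_for_gc` are hypothetical:

```python
def select_candidates(usage_by_slice, top_n=1):
    """Pick the candidate slices with the highest append-only storage usage."""
    ranked = sorted(usage_by_slice, key=usage_by_slice.get, reverse=True)
    return ranked[:top_n]


def release_for_gc(ckpt_lsns, truncated_lsn):
    """Release anchor checkpoints with LSNs below the truncated LSN,
    keeping the newest of them so the shrunk window stays restorable."""
    below = [lsn for lsn in ckpt_lsns if lsn < truncated_lsn]
    keep = max(below) if below else None
    released = [lsn for lsn in below if lsn != keep]
    kept = [lsn for lsn in ckpt_lsns if lsn not in released]
    return released, kept


# Example: slice s2 uses the most storage, so it is picked first.
cands = select_candidates({"s1": 500, "s2": 900, "s3": 300}, top_n=2)
print(cands)  # ['s2', 's1']

# Checkpoints at LSNs 5, 12, 20, 31; window truncated at LSN 25:
# 20 is kept as the newest checkpoint below the cut, 5 and 12 are released.
released, kept = release_for_gc([5, 12, 20, 31], truncated_lsn=25)
print(released, kept)  # [5, 12] [20, 31]
```

Releasing a checkpoint in this sense would remove its exemption, letting the garbage collector reclaim anchor pages that no surviving checkpoint references.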
It will be appreciated that, although specific aspects of the technology have been described herein for purposes of illustration, various modifications may be made without departing from the scope of the technology. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present invention. In particular, it is within the scope of the technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine, for controlling the operation of a computer according to the method of the technology and/or to structure some or all of its components in accordance with the system of the technology.
Acts associated with the method described herein can be implemented as coded instructions in a computer program product. In other words, the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the wireless communication device.
Further, each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like. In addition, each operation, or a file or object or the like implementing each said operation, may be executed by special purpose hardware or a circuit module designed for that purpose.
Through the descriptions of the preceding aspects, the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM), USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the aspects of the present invention.
For example, such an execution may correspond to a simulation of the logical operations as described herein. The software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with aspects of the present invention.
Although the present invention has been described with reference to specific features and aspects thereof, it is evident that various modifications and combinations can be made thereto without departing from the invention. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present invention.
Claims (16)
- A method of managing a database, the method comprising:
  managing a plurality of data pages of the database,
  generating one or more anchor checkpoints with corresponding one or more log sequence numbers (LSNs), each anchor checkpoint recording content of a page directory at a moment in time indicated by a corresponding LSN of the one or more LSNs, the page directory recording a position of each data page of the plurality of data pages in an append-only storage, the corresponding LSN indicating when said each anchor checkpoint was generated, each anchor checkpoint further indicating a set of anchor pages, where for each anchor page, said each anchor checkpoint records one or more positions of one or more versions of said each anchor page based on the corresponding LSN of said each anchor checkpoint, the set of anchor pages corresponding to a set of data pages of the plurality of data pages, the set of anchor pages being exempt from garbage collection.
- The method of claim 1 wherein managing the plurality of data pages of the database comprises:
  generating change log records describing changes to the plurality of data pages, the database comprising a plurality of slices, each slice of the plurality of slices comprising a set of data pages of the plurality of data pages for storing data;
  writing the change log records to one or more log stores;
  maintaining the page directory to record the position of each data page of the plurality of data pages in the append-only storage.
- The method of claim 2 further comprising: maintaining, at each slice of the plurality of slices, at least one anchor checkpoint of the one or more anchor checkpoints according to an order determined by at least one corresponding LSN of the at least one anchor checkpoint.
- The method of claim 2 or 3 wherein generating the one or more anchor checkpoints comprises: generating, at the plurality of slices, the one or more anchor checkpoints at one or more time intervals.
- A method for rewinding a database, the method comprising:
  receiving a request to rewind the database to a target time, the database comprising a plurality of slices and a plurality of data pages for storing data, each slice of the plurality of slices comprising one or more data pages of the plurality of data pages, each slice of the plurality of slices further comprising a set of anchor checkpoints associated with a corresponding set of log sequence numbers (LSNs), each anchor checkpoint indicating a set of anchor pages corresponding to a set of data pages of the plurality of data pages;
  obtaining a target LSN corresponding to a state of the database at the target time;
  obtaining a set of persisted LSNs corresponding to a set of slices of the plurality of slices, each slice of the set of slices comprising a valid anchor checkpoint having an LSN smaller than the target LSN, the LSN of the valid anchor checkpoint being closest to the target LSN among the set of LSNs of the set of anchor checkpoints corresponding to said each slice;
  obtaining a minimum persisted LSN from the set of persisted LSNs corresponding to the set of slices;
  obtaining change log records from one or more log stores based on the minimum persisted LSN and the target LSN; and
  applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time.
- The method of claim 5, wherein the valid anchor checkpoint comprises a set of anchor pages corresponding to a set of data pages of the plurality of data pages, the method further comprising:
  for said each slice of the set of slices:
    obtaining the valid anchor checkpoint, wherein the set of anchor pages corresponds to a set of LSNs;
    obtaining a largest LSN among the set of LSNs corresponding to the set of anchor pages; and
    setting a persisted LSN of the set of persisted LSNs for said each slice based on the largest LSN.
- The method of any one of claims 5 to 6 wherein each LSN of the set of LSNs corresponding to the set of anchor checkpoints indicates when the corresponding anchor checkpoint was generated.
- The method of any one of claims 5 to 7 wherein:
  the change log records comprise a starting change log record, an ending change log record, and all change log records between the starting change log record and the ending change log record; and
  obtaining the change log records from the one or more log stores based on the minimum persisted LSN and the target LSN comprises:
    obtaining the starting change log record based on the minimum persisted LSN;
    obtaining the ending change log record based on the target LSN; and
    obtaining all change log records between the starting change log record and the ending change log record.
- The method of any one of claims 5 to 8, wherein applying the change log records to one or more anchor pages of the set of slices to restore the database to the target time comprises:
  for each slice of the set of slices:
    loading the set of anchor pages of the valid anchor checkpoint corresponding to said each slice; and
    applying the change log records to the loaded set of anchor pages to generate an updated version of the set of anchor pages.
- A method for managing a storage of a database, the method comprising:
  determining that a threshold for using the storage has been exceeded, the storage comprising a plurality of slices and a plurality of data pages for storing data, each slice of the plurality of slices comprising one or more data pages of the plurality of data pages, the database having a restorable time window size indicating an earliest-restorable-time for restoring the database;
  obtaining a plurality of usage levels of an append-only storage of the database, the plurality of usage levels corresponding to the plurality of slices, each usage level of the plurality of usage levels indicating a total size of the append-only storage used by a corresponding slice of the plurality of slices;
  selecting a set of candidate slices of the plurality of slices based on a highest usage level among the plurality of usage levels; and
  releasing a portion of the plurality of data pages for garbage collection, the portion of the plurality of data pages being based on the set of candidate slices.
- The method of claim 10 further comprising:
  obtaining a truncated log sequence number (LSN) corresponding to an updated earliest-restorable-time based on a reduction of the restorable time window size, wherein the portion of the plurality of data pages is further based on the truncated LSN.
- The method of claim 11 wherein:
  the set of candidate slices belongs to one or more database instances, and each slice of the set of candidate slices belongs to a database instance of the one or more database instances;
  releasing the portion of the plurality of data pages for garbage collection comprises:
    for each slice of the plurality of slices belonging to a database instance of the one or more database instances:
      selecting at least one anchor checkpoint of one or more anchor checkpoints based on the truncated LSN, each anchor checkpoint:
        corresponding to said each slice;
        comprising a set of anchor pages corresponding to a set of data pages of the plurality of data pages; and
        associated with an LSN indicative of when said anchor checkpoint was generated, the LSN being smaller than the truncated LSN; and
      releasing the at least one anchor checkpoint for garbage collection.
- The method of claim 12 wherein the one or more anchor checkpoints are generated at one or more time intervals, each anchor checkpoint of the one or more anchor checkpoints:
  corresponding to a slice of the plurality of slices;
  recording content of a page directory at a moment in time indicated by a corresponding LSN, the page directory recording a position of the plurality of data pages in the append-only storage; and
  indicating a corresponding set of anchor pages, wherein for each anchor page, said each anchor checkpoint records one or more positions of one or more versions of said each anchor page based on the corresponding LSN, the corresponding set of anchor pages further corresponding to a set of data pages of the plurality of data pages.
- An apparatus comprising processing electronics configured to perform the method of any one of claims 1 to 13.
- An apparatus comprising:
  at least one processor; and
  at least one machine-readable medium storing executable instructions which, when executed by the at least one processor, configure the apparatus to perform any one of method claims 1 to 13.
- A computer device comprising a non-transitory computer readable medium having instructions stored thereon which, when executed by a computer processor, cause the computer device to perform the method as described in any one of claims 1 to 13.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/139312 WO2025123377A1 (en) | 2023-12-16 | 2023-12-16 | System and methods for fast iterative rewinding of database to point of time in past |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/139312 WO2025123377A1 (en) | 2023-12-16 | 2023-12-16 | System and methods for fast iterative rewinding of database to point of time in past |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025123377A1 true WO2025123377A1 (en) | 2025-06-19 |
Family
ID=96056308
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/139312 Pending WO2025123377A1 (en) | 2023-12-16 | 2023-12-16 | System and methods for fast iterative rewinding of database to point of time in past |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025123377A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120072394A1 (en) * | 2010-09-16 | 2012-03-22 | Mimosa Systems, Inc. | Determining database record content changes |
| CN107077404A (en) * | 2014-10-28 | 2017-08-18 | 微软技术许可有限责任公司 | Restoring from a point-in-time database where snapshots are stored |
| CN109885427A (en) * | 2019-01-31 | 2019-06-14 | 郑州云海信息技术有限公司 | A kind of database short-term data protection method, device, memory and equipment |
- 2023-12-16: WO PCT/CN2023/139312 patent/WO2025123377A1/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20120072394A1 (en) * | 2010-09-16 | 2012-03-22 | Mimosa Systems, Inc. | Determining database record content changes |
| CN107077404A (en) * | 2014-10-28 | 2017-08-18 | 微软技术许可有限责任公司 | Restoring from a point-in-time database where snapshots are stored |
| CN109885427A (en) * | 2019-01-31 | 2019-06-14 | 郑州云海信息技术有限公司 | A kind of database short-term data protection method, device, memory and equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11237864B2 (en) | Distributed job scheduler with job stealing | |
| US10944807B2 (en) | Organizing present and future reads from a tiered streaming data storage layer | |
| US11132350B2 (en) | Replicable differential store data structure | |
| CN104813276B (en) | Streaming restore of database from backup system | |
| US10990440B2 (en) | Real-time distributed job scheduler with job self-scheduling | |
| US12298941B2 (en) | Adaptable multi-layered storage for deduplicating electronic messages | |
| US11093387B1 (en) | Garbage collection based on transmission object models | |
| US11681631B2 (en) | Write-behind optimization of covering cache | |
| US11755417B2 (en) | Scaling single file snapshot performance across clustered system | |
| US12229162B2 (en) | Asynchronous data replication in a multiple availability zone cloud platform | |
| US12271269B2 (en) | Data management system with limited control of external compute and storage resources | |
| JP5722962B2 (en) | Optimize storage performance | |
| DK3206128T3 (en) | DATA STORAGE PROCEDURE, DATA STORAGE AND STORAGE DEVICE | |
| EP2590078B1 (en) | Shadow paging based log segment directory | |
| US11003364B2 (en) | Write-once read-many compliant data storage cluster | |
| CN104040481A (en) | Method Of And System For Merging, Storing And Retrieving Incremental Backup Data | |
| US20250045279A1 (en) | Adaptive page rendering for a data management system | |
| CN117971132A (en) | Sharding method and sharding system in distributed object storage system | |
| US12481639B2 (en) | Transactionally consistent database exports | |
| WO2025123377A1 (en) | System and methods for fast iterative rewinding of database to point of time in past | |
| US20210406243A1 (en) | Non-transitory computer-readable storage medium for storing information processing program, information processing method, and information processing apparatus | |
| CN114168572A (en) | Method and device for managing database | |
| US20250348386A1 (en) | System and methods for managing time-based events to support rewinding of a database | |
| US12321328B2 (en) | Autonomous table partition management | |
| CN121209772A (en) | Memory management methods, streaming data systems, computing devices, and computer-readable storage media |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23961233; Country of ref document: EP; Kind code of ref document: A1 |