JP2006012142A - Checkpoint method and system utilizing non-disk persistent memory

Info

Publication number: JP2006012142A
Application number: JP2005166402A
Authority: JP
Inventors: Gary S. Smith; Sam A. Fineberg; Pankaj Mehra; Roger Hansen
Original assignee: Hewlett Packard Development Co LP
Current assignee: Hewlett Packard Development Co LP
Priority date: 2004-06-09
Filing date: 2005-06-07
Publication date: 2006-01-12

Abstract

PROBLEM TO BE SOLVED: To provide a system and method for reducing the time required to commit a transaction.

SOLUTION: A transaction processing system 800 comprises a database writer 802 configured to process data in accordance with one or more transactions in the transaction processing system, a transaction monitor 804 that monitors the transactions in the transaction processing system, a log writer 802 that maintains transaction audit trail data associated with the transactions in the transaction processing system, and one or more non-disk persistent memory units 812, 814 that are used for checkpointing.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to transaction processing systems.

(Related applications)
This application is a continuation-in-part of, and claims priority to, U.S. patent application Ser. No. 10/797,258, filed March 9, 2004, which is incorporated herein by reference.
This application is also related to U.S. patent application Ser. Nos. 10/351,194 and 10/737,374.

A transaction processing system is a computer hardware and software system that supports the concurrent execution of multiple transaction programs while guaranteeing that the so-called ACID properties (atomicity, consistency, isolation, and durability) are maintained.
A transaction program is a specification of the operations to be applied to application state, including the order in which the operations must be applied and any concurrency control that must be exercised for the transaction to execute correctly.
The most common concurrency control operation is locking, whereby the process executing a transaction program acquires a shared or an exclusive lock on the data it reads or writes.
A transaction usually refers to a collection of operations on the physical and abstract application state represented in a database.
A transaction represents one execution of a transaction program.
The operations include reading and writing shared state.

With respect to the ACID properties, atomicity refers to the all-or-nothing behavior of a transaction: it either executes completely or does not execute at all.
A transaction that has completed is said to be committed, one that was abandoned during execution is said to be aborted, and one that has begun executing but has neither committed nor aborted is said to be in flight.

Consistency refers to the successful completion of a transaction leaving the application state consistent with every specified integrity constraint.

Isolation, also known as serializability, guarantees that any correct concurrent execution of a stream of transactions is equivalent to executing the transactions of that stream in some total order.
In this sense, from the point of view of any executed transaction, the effect of every other transaction in the stream is the same as if that other transaction had executed either entirely before or entirely after it.
The strength of serializability refers to the degree to which the concurrent execution of transactions is constrained, and gives rise to different isolation levels in transaction processing systems.
This document is primarily concerned with transaction processing systems that exhibit the strongest form of isolation, in which updates made by a transaction are never lost and repeated reads within a transaction return the same result.

Durability refers to the property that, once a transaction has committed, its changes to the application state survive failures affecting the transaction processing system.

One problem with transaction processing systems concerns reducing the time required to commit a transaction.
Accordingly, the present invention arose out of concerns associated with providing systems and methods that reduce the time required to commit a transaction.

(Overview)
The embodiments described herein use non-disk persistent memory in conjunction with a transaction processing system.
Using non-disk persistent memory to commit transactions can reduce the time associated with transaction commits, which in turn lowers the demand for resources within the transaction processing system and increases transaction processing throughput.
Various embodiments provide a unified buffering scheme that uses non-disk persistent memory for both the checkpointing process and the write-aside buffering process.

(Exemplary general transaction processing system)
FIG. 1 shows an exemplary transaction processing system 100 whose components can be used to implement the inventive principles described herein.
In the described embodiment, the transaction processing system 100 comprises a database writer 102, a transaction monitor 104, and a log writer 106.

The database writer 102 is configured to mutate the data stored on data volumes (that is, a disk or a collection of disks) as it carries out the operations specified by a transaction program.
How the ACID properties other than durability are maintained is of little relevance to the discussion herein.

With respect to durability, the database writer ensures that the changes it makes to the database are recorded on durable media.
In online transaction processing, the data affected by these changes tends to be randomly scattered across the data volumes.
Because random access to multiple disk drives is quite inefficient, these changes are not written to disk immediately.
Instead, the database writer 102 sends each change to the log writer 106, described below, which makes the changes durable in time for the transaction to commit.

The transaction monitor 104 keeps track of transactions as they enter and leave the system.
It keeps track of which database writers 102 mutate the database on behalf of each transaction, and it ensures that every data volume change related to that transaction, sent by a database writer 102 to the log writer 106, is flushed to permanent media before the transaction commits.
The transaction monitor 104 also records the outcome of each transaction (for example, commit or abort) in the transaction log.

The log writer 106 maintains the database audit trail, which explicitly records the changes that each transaction makes to the database and implicitly records the serial order in which the transactions committed.
Again with respect to durability, the changes made by a transaction must be recorded on durable media before that transaction is allowed to commit.
The log writer 106 enforces this constraint: it receives audit records describing state changes from the database writer 102 and coordinates its write operations with the rest of the transaction commitment infrastructure, as described below.

As those skilled in the art will appreciate, one or more of the above entities can be implemented using multiple processes or threads for scalability or fault tolerance.
For example, the database can be divided into multiple disk “volumes”, each managed by one or more database writer entities.
Similarly, the task of writing the audit trail can be divided among multiple log writers, each dedicated to recording the changes made by a particular subset of the database writers.
To ensure continuous operation of the transaction processing system, each database writer can also be implemented as a set of two or more redundant entities whose states are kept synchronized at all times, so that if one entity in the set fails, a surviving entity can “take over” without disrupting the processing of the transaction stream.

FIG. 2 shows, generally at 200, an implementation of the transaction processing system of FIG. 1, which includes a database writer 202, a transaction monitor 204, and a log writer 206.

In this example, each of the entities described above is implemented using a pair of processes.
Each process pair thus includes a primary process (labeled “pri”) and a backup process (labeled “bak”).
Before communicating with any other component of the transaction processing system, each primary process checkpoints the relevant portion of its state to its backup process, so that the backup can take over quickly if the primary fails.
The takeover interval is quite short (lasting from a few milliseconds to a few seconds); during it, in-flight transactions are aborted and possibly restarted.
The processes and libraries that implement the elements of the transaction processing architecture of FIG. 2 can be distributed across multiple CPUs.
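The checkpoint-before-communicate rule for a process pair can be sketched as follows. This is a minimal single-process illustration, not the mechanism of the described system: the class names, the `apply_and_send` and `take_over` helpers, and the dictionary-based state are all hypothetical, and real process pairs exchange checkpoints by message passing between separate processes.

```python
class BackupProcess:
    """Holds the most recent state checkpointed by the primary."""

    def __init__(self):
        self.state = {}

    def receive_checkpoint(self, delta):
        self.state.update(delta)


class PrimaryProcess:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup
        self.outbox = []  # messages sent to other components

    def apply_and_send(self, delta, message):
        self.state.update(delta)
        # Checkpoint first, so the backup never lags behind anything
        # that the rest of the system has already been told.
        self.backup.receive_checkpoint(delta)
        self.outbox.append(message)


def take_over(backup):
    """Promote the backup to primary, carrying the checkpointed state."""
    new_primary = PrimaryProcess(BackupProcess())
    new_primary.state = dict(backup.state)
    return new_primary


backup = BackupProcess()
primary = PrimaryProcess(backup)
primary.apply_and_send({"row:42": "balance=100"}, "audit record for row:42")

# Simulate a primary failure: the survivor starts from the checkpoint.
survivor = take_over(backup)
print(survivor.state)
```

Because the checkpoint always precedes the outgoing message, the survivor holds every state change that another component may have observed.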

In this example, the database writer 202 is labeled “DP2” (for “disk process 2”) and the log writer 206 is labeled “ADP” (for “audit disk process”).
The transaction monitor 204 is implemented using a distributed collection of processes and system libraries known as TMF (for “Transaction Monitoring Facility”).
The transaction monitor 204, labeled “TMP” (for “transaction monitor process”), coordinates the beginning and the commit of transactions.
TMF uses an operating system facility called TMFlib (the TMF library) on each CPU in the cluster.
This library allows DP2 processes to register with TMF when they begin and finish their work on behalf of any given transaction.
The TMFlib instances communicate among themselves and with the TMP process pair to coordinate transaction commitment, as described below in the section titled “Transaction commit”.

(Transaction commit)
With reference to FIG. 2, the following describes exemplary steps involved in committing a transaction when the transaction processing system has a single log volume (or audit trail disk volume) 208.
Typically, a client 210 encounters a Begin_Transaction or similar operation that marks the beginning of a transaction.
The TMFlib on the CPU running the client process assigns a transaction ID (TID) to the new transaction.
The TID is then propagated to the database writer 202, specifically to every disk process DP2 that works on behalf of the transaction.
In the simple case shown in FIG. 2, only a single database writer (comprising one pair of DP2 processes) is involved.
In the more general case, multiple DP2 process pairs may take part in processing a transaction.

When the database writer 202 changes the state of the database, the primary DP2 process 210 first checkpoints the state change to its backup process 212 and then propagates a record of the state change to the log writer 206, specifically to the ADP primary process 218.
The ADP primary process buffers state changes in memory until a threshold on the amount of buffered audit data is exceeded (resulting in a so-called courtesy write), or sooner if a message from the transaction monitor process (TMP) 204 forces it to write (resulting in a so-called forced write).
Just like DP2, the ADP primary process 218 checkpoints its state changes to its backup process 220 before issuing any disk operation or message.
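The difference between a courtesy write and a forced write can be sketched with a small in-memory buffer. This is an illustrative model only: the `AuditBuffer` class, its method names, and the byte threshold are assumptions made for the example, and the list named `disk` merely stands in for the audit trail volume.

```python
class AuditBuffer:
    """Buffers audit records and writes them out in batches."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes
        self.pending = []        # records not yet written
        self.pending_bytes = 0
        self.disk = []           # stands in for the audit trail volume
        self.writes = []         # kind of each write, in order

    def append(self, record):
        self.pending.append(record)
        self.pending_bytes += len(record)
        # Courtesy write: triggered only by the amount of buffered data.
        if self.pending_bytes > self.threshold_bytes:
            self._write("courtesy")

    def force(self):
        # Forced write: triggered by a commit-time request, even if the
        # threshold has not been reached.
        if self.pending:
            self._write("forced")

    def _write(self, kind):
        self.disk.extend(self.pending)
        self.pending.clear()
        self.pending_bytes = 0
        self.writes.append(kind)


buf = AuditBuffer(threshold_bytes=8)
buf.append(b"aaaa")     # 4 bytes buffered, below the threshold
buf.append(b"bbbbbb")   # 10 bytes buffered, exceeds the threshold
buf.append(b"cc")       # buffered again
buf.force()             # commit-time flush
print(buf.writes)
```

Batching records until the threshold is reached amortizes the cost of each disk write, while the forced path bounds how long a committing transaction waits.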

At some later point, the client encounters an End_Transaction or similar operation that marks the end of the particular transaction.
Before committing the transaction, the transaction monitor process 204 must ensure that the database state changes sent to the log writer 206 on behalf of the transaction, as described above, have been made durable.
To do so, the TMP sends a two-phase flush message for that transaction to each log writer (ADP) in the system and then waits for a reply from each ADP confirming that the requested state changes have been written from non-durable system buffers to the disk drive.
When the transaction monitor 204 has received all of the replies, it sends a transaction commit record to a specially designated master ADP.
Once the master ADP acknowledges that it has written the commit record to durable media, the TMP notifies the client that the transaction has committed.
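The ordering of the commit steps above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: the `LogWriter` class and the `commit` function are hypothetical names, replies are modeled as return values rather than messages, and the `durable` list stands in for the disk.

```python
class LogWriter:
    """Stands in for one ADP: buffers audit records, then makes them durable."""

    def __init__(self, name):
        self.name = name
        self.buffered = {}   # tid -> audit records not yet durable
        self.durable = []    # stands in for the audit trail on disk

    def record(self, tid, change):
        self.buffered.setdefault(tid, []).append(change)

    def flush(self, tid):
        # Make every buffered change of this transaction durable, then reply.
        self.durable.extend((tid, c) for c in self.buffered.pop(tid, []))
        return True

    def write_commit_record(self, tid):
        self.durable.append((tid, "COMMIT"))
        return True


def commit(tid, log_writers, master):
    # Step 1: ask every log writer to flush and wait for all replies.
    replies = [lw.flush(tid) for lw in log_writers]
    if not all(replies):
        return False
    # Step 2: only then write the commit record at the master log writer.
    return master.write_commit_record(tid)


adp1, adp2 = LogWriter("adp1"), LogWriter("adp2")
adp1.record(7, "update A")
adp2.record(7, "update B")

committed = commit(7, [adp1, adp2], master=adp1)
print(committed, adp1.durable, adp2.durable)
```

The commit record is written only after every flush reply has arrived, so a durable commit record implies that all of the transaction's changes are durable as well.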

As those skilled in the art will appreciate from the above, waiting for state changes and commit records to be flushed to disk accounts for most of the delay in committing a TMF transaction.
In addition, checkpointing by the database writer, the transaction monitor, and the log writer adds to the overhead of committing a transaction.
Because disks are rotating mechanical media, disk latency has not improved as rapidly as processor and memory speeds.
Moreover, reliable checkpointing by message passing incurs not only the overhead of transferring the data but also the overhead of fully synchronizing a pair of processes.
Because of these factors, TMF transaction commit times often range from a few milliseconds to a few seconds.

Some applications can tolerate slow response times, but many cannot.
When TMF transaction response times are slow, the average transaction stays in the system longer, which raises the demand for resources within the transaction processing system. With finite resources, this indirectly limits transaction processing throughput, so slow response times have a secondary adverse effect on throughput.
Examples of resources whose capacity may be exceeded when transactions stay in the system longer include database locks, sockets, and other connection resources.
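The effect of longer residence times on resource demand can be illustrated with Little's law, which states that the average number of items in a stable system equals the arrival rate multiplied by the average time each item spends in the system. The law is not cited in this description, and the arrival rate and residence times below are hypothetical figures chosen only for illustration.

```python
def average_in_flight(arrival_rate_per_s, residence_time_s):
    """Little's law: mean number of transactions in the system."""
    return arrival_rate_per_s * residence_time_s


# Hypothetical workload: 1000 transactions arrive per second.
fast = average_in_flight(1000, 0.005)  # 5 ms residence time
slow = average_in_flight(1000, 0.5)    # 500 ms residence time

print(f"in flight at 5 ms: {fast:.0f}, at 500 ms: {slow:.0f}")
```

At the same arrival rate, a hundredfold increase in residence time means roughly a hundred times as many transactions holding locks and connections at once.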

(Persistent memory in general)
In accordance with the embodiments described herein, non-disk persistent memory is employed in conjunction with a transaction processing system to reduce transaction response times in general, and transaction commit times and process-pair checkpointing times in particular.
In the various embodiments described below, non-disk persistent memory can be used both as a write-aside buffer for disk writes and as a buffer for process state checkpoints.

Persistent memory is an architectural concept that will be understood by those skilled in the art.
In accordance with the described embodiments, there are many possible implementations of non-disk persistent memory that can be used.
Accordingly, it is not the intent herein to be limited to any one particular implementation of non-disk persistent memory.

To help the reader understand the architectural principles associated with non-disk persistent memory, the following discussion describes characteristics that a non-disk persistent memory system can have in order to facilitate its use in the inventive transaction processing systems.
A few non-limiting examples of non-disk persistent memory systems are provided throughout this discussion.
It is to be appreciated and understood that the principles described herein can be employed with other non-disk persistent memory architectures without departing from the spirit and scope of the claimed subject matter.

Non-disk persistent memory, as defined herein, should exhibit the following properties: durability, connectivity, and access.

Durability means that the non-disk persistent memory is durable without refresh and survives the loss of system power.
In addition, it should provide durable, self-consistent metadata in order to ensure continued access to the data stored in the non-disk persistent memory after a power loss or soft failure.

With respect to connectivity, consider the following.
Non-disk persistent memory can be attached to memory controllers available in commodity chipsets.
Special-purpose memory controllers may eventually be designed to exploit the durability of non-disk persistent memory in unique ways, but their existence is not required.
Where direct attachment to a CPU's memory controller is undesirable, possibly because of fault tolerance ramifications, packaging considerations, physical slot limitations, or electrical load limits, first-level I/O attachment of non-disk persistent memory is permitted.
For example, non-disk persistent memory can be attached to PCI and to other first-level I/O interconnects, such as PCI Express, RDMA over IP, InfiniBand, Virtual Interface over Fibre Channel (FC-VI), or ServerNet.
Such interconnects support both memory mapping and memory-semantic access.
This embodiment of non-disk persistent memory is referred to as a communication-link-attached persistent memory unit (CPMU).

Storage connectivity (for example, SCSI), or indeed any other second-level I/O connectivity, is undesirable for persistent memory because of the performance considerations that will become apparent below.

With respect to access, consider the following.
Non-disk persistent memory can be accessed from user programs like ordinary virtual memory, using the CPU's memory instructions (Load and Store), albeit at specially designated process virtual addresses.
On certain system area networks (SANs) that support memory-semantic operations, non-disk persistent memory can be implemented as a network resource that is accessed using remote DMA (RDMA) or similar semantics.
For example, FIG. 3 shows a system 300 that uses network-attached non-disk persistent memory, including a communication-link-attached non-disk persistent memory unit (CPMU) 310 that one or more processor nodes 302 can access through an RDMA-enabled system area network (SAN) 306.
To access the non-disk persistent memory of the CPMU 310, software running on a processor node 302 initiates a remote read or remote write operation through the processor node's network interface (NI) 304.
The read or write command is thereby carried over the RDMA-enabled SAN 306 to the CPMU's network interface (NI) 308.
After processing, the appropriate data is communicated over the RDMA-enabled SAN 306.

In addition to RDMA data movement operations, the CPMU 310 can be configured to respond to various management commands.
For a write operation initiated by a processor node 302, for example, once the data has been successfully stored in the CPMU, it is durable and will survive a power outage or a failure of the processor node 302.
In particular, the memory contents are retained as long as the CPMU continues to function correctly, even after power has been disconnected for an extended period or the operating system on the processor node 302 has been rebooted.
In this example, the processor node 302 is a computer system comprising at least one central processing unit (CPU) and memory, with the CPU configured to run an operating system.
The processor node 302 is further configured to run application software such as a database.
The processor node 302 uses the SAN 306 to communicate with other processor nodes 302 and with devices such as the CPMU 310 and I/O controllers (not shown).

In one implementation of this example, the RDMA-enabled SAN is a network capable of performing byte-level memory operations, such as copy operations, between an initiator processor node 302 and a target processor node 302, or between an initiator processor node 302 and a device 310, without notifying the CPU of the target processor node 302.
In this case, the SAN 306 is configured to perform virtual-to-physical address translation so that a contiguous network virtual address space can be mapped onto a non-contiguous physical address space.
This type of address translation allows the CPMU 310 to be managed dynamically.
Commercially available SANs 306 with RDMA capability include, but are not limited to, ServerNet, RDMA over IP, InfiniBand, and all SANs that comply with the Virtual Interface Architecture.
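Mapping a contiguous network virtual address space onto non-contiguous physical pages can be sketched with a simple page table. This is an assumption-laden illustration: the page size, the `TranslationTable` class, and the page numbers are hypothetical, and the description does not specify how the translation is actually implemented.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes


class TranslationTable:
    """Maps contiguous virtual pages to arbitrary physical pages."""

    def __init__(self, physical_pages):
        # Index = virtual page number, value = physical page number.
        self.physical_pages = list(physical_pages)

    def translate(self, virtual_addr):
        vpn, offset = divmod(virtual_addr, PAGE_SIZE)
        if not 0 <= vpn < len(self.physical_pages):
            raise ValueError("address outside the mapped region")
        return self.physical_pages[vpn] * PAGE_SIZE + offset


# Three contiguous virtual pages backed by scattered physical pages 7, 2, 9.
table = TranslationTable([7, 2, 9])
print(table.translate(0), table.translate(PAGE_SIZE + 10))
```

Because clients see only the contiguous virtual range, the physical pages behind a region can be chosen or rearranged without changing the addresses that clients use.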

The processor nodes 302 are generally attached to the SAN 306 through the NI 304, although many variations are possible.
More generally, a processor node need only be connected to an apparatus that communicates read and write operations.
For example, in another implementation of this example, the processor nodes 302 are various CPUs on a motherboard, and an input/output bus, such as a PCI bus, is used instead of a SAN.
Note that the present teachings can be scaled up or down to accommodate larger or smaller implementations as needed.

The network interface (NI) 308 is communicatively coupled to the CPMU 310 to provide access to the non-disk persistent memory contained in the CPMU 310.
Many technologies are available for the various components of FIG. 3, including the type of memory technology used in the CPMU 310.
Accordingly, the embodiment of FIG. 3, like the other embodiments described herein, is not limited to any specific technology for realizing the non-disk persistent memory.
Indeed, many memory technologies are suitable, including various forms of magnetic random access memory (MRAM), magnetoresistive random access memory (MRRAM), polymer ferroelectric random access memory (PFRAM), Ovonic unified memory (OUM), battery-backed dynamic random access memory (BBDRAM), and flash memory.

Where the SAN 306 is used, the memory should be fast enough for RDMA access.
RDMA read and write operations are thereby made possible over the SAN 306.
Where another type of communication apparatus is used, the access speed of the memory should likewise be fast enough to accommodate that apparatus.
Note that persistence is provided only to the extent that the non-disk persistent memory in use can retain data.
For example, in many applications the non-disk persistent memory may be required to retain data regardless of how long power is lost, whereas in other applications it may need to retain data for only a few minutes or hours.

In conjunction with this approach, memory management functionality is provided for creating one or more independent, indirectly addressed memory regions.
In addition, CPMU metadata is provided for memory recovery after a loss of power or a processor failure.
The metadata includes, for example, the contents and the layout of the protected memory regions within the CPMU.
In this way, the CPMU stores both the data and the manner in which the data is used.
When the need arises, the CPMU can then allow for recovery from a power or system failure.

In FIG. 4, CPMU 400 comprises a non-disk non-volatile memory 402 and a network interface (NI) 404, coupled together via a data communication link such as a bus.
Here, the non-disk non-volatile memory 402 can be, for example, MRAM or flash memory.
NI 404 does not itself initiate RDMA requests; instead, it receives management commands from the network and performs the requested management operations.
Specifically, CPMU 400 translates the address on each incoming memory access request and then internally initiates the requested memory operation over the data communication link between NI 404 and non-volatile memory 402.
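As a rough illustration of the address translation step just described, the following Python sketch maps the region-relative address carried by an incoming request onto a location in memory before performing the operation. The class name `CpmuSketch`, the region table, the `define_region` management call, and the use of a `bytearray` to stand in for non-volatile memory 402 are all illustrative assumptions and are not taken from this description.

```python
class CpmuSketch:
    """Toy model of a CPMU: a flat memory plus a region translation table."""

    def __init__(self, size):
        self.memory = bytearray(size)   # stands in for non-volatile memory 402
        self.regions = {}               # region id -> (base offset, length)

    def define_region(self, region_id, base, length):
        # Hypothetical management command that sets up one translation entry.
        if base < 0 or length <= 0 or base + length > len(self.memory):
            raise ValueError("region does not fit in memory")
        self.regions[region_id] = (base, length)

    def _translate(self, region_id, offset, nbytes):
        # Turn a region-relative address into a physical offset,
        # rejecting accesses that fall outside the region.
        base, length = self.regions[region_id]
        if offset < 0 or nbytes < 0 or offset + nbytes > length:
            raise IndexError("access outside region")
        return base + offset

    def write(self, region_id, offset, data):
        start = self._translate(region_id, offset, len(data))
        self.memory[start:start + len(data)] = data

    def read(self, region_id, offset, nbytes):
        start = self._translate(region_id, offset, nbytes)
        return bytes(self.memory[start:start + nbytes])


cpmu = CpmuSketch(1024)
cpmu.define_region("audit", base=256, length=128)
cpmu.write("audit", 4, b"abc")   # lands at physical offset 260
```

Checking bounds at translation time is one simple way to keep a request addressed to one region from touching another; an actual CPMU may enforce protection differently.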

In FIG. 5, another embodiment, CPMU 500, uses a combination of non-disk volatile memory 502 backed by a battery 510 and a non-volatile auxiliary storage device 508.
In this embodiment, when power fails, the data in non-disk volatile memory 502 is preserved using power from battery 510 until that data can be saved to non-volatile auxiliary storage device 508.
The non-volatile auxiliary storage device can be, for example, a magnetic disk or slow flash memory.
For CPMU 500 to operate properly, the transfer of data from volatile memory 502 to non-volatile auxiliary storage device 508 should take place without external intervention and without any power source other than battery 510.
Accordingly, any requested task should be completed before battery 510 is exhausted.
As shown, CPMU 500 includes an optional CPU 504 that runs an embedded operating system.

Thus, the backup task (i.e., the transfer of data from non-disk volatile memory 502 to non-volatile auxiliary storage device 508) can be performed by software running on CPU 504.
Software running on CPU 504 can use the included NI 506 to initiate RDMA requests or to send messages to other entities on SAN 306.
Here again, CPU 504 receives management commands from the network through NI 506 and performs the requested management operations.

Any embodiment of the CPMU, such as CPMU 400 or 500, needs to be managed for the purposes of persistent memory allocation and sharing.
In this example, CPMU management is performed by a persistent memory manager (PMM).
The PMM may reside within the CPMU or outside it, for example on one of the processor nodes 302 described above.
When a processor node 302 needs to allocate or deallocate non-disk persistent memory in CPMU 310, or needs to start or stop using an existing region of non-disk persistent memory, the processor node should first communicate with the PMM to have the requested management task performed.
Note that because the memory contents of CPMU 310 are durable (just as with a disk drive), the metadata associated with the non-disk persistent memory regions within that CPMU must also be durable, must be kept consistent with those regions, and is preferably stored within the CPMU itself (just as with file system metadata on a disk drive).
The PMM must therefore perform its management tasks in a way that always keeps the metadata of CPMU 310 consistent with the contents of the non-disk persistent memory.
As a result, the data stored in CPMU 310 can be meaningfully retrieved using the stored metadata even after a power loss, a system shutdown, or any other failure affecting one or more of the PMM, CPMU 310, and processor nodes 302.
When recovery is needed, system 300 using CPMU 310 can thus recover and resume operation from the memory state that existed when the power failure or operating system crash occurred.
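The idea of keeping region metadata inside the persistent memory itself, so that regions can be found again after a failure, can be sketched as follows. This is a minimal Python sketch under stated assumptions: the class `PmmSketch`, the fixed 256-byte metadata area, the JSON encoding, and the simplistic placement policy are all hypothetical, and a `bytearray` stands in for the CPMU.

```python
import json

META_SIZE = 256  # bytes reserved at the start of memory for metadata (assumed)


class PmmSketch:
    def __init__(self, memory):
        self.memory = memory                  # bytearray standing in for the CPMU
        self.regions = self._load_metadata()  # name -> [base, length]

    def _load_metadata(self):
        # Read back whatever metadata survives in the memory itself.
        raw = bytes(self.memory[:META_SIZE]).rstrip(b"\x00")
        return json.loads(raw) if raw else {}

    def _store_metadata(self):
        raw = json.dumps(self.regions).encode()
        if len(raw) > META_SIZE:
            raise MemoryError("metadata area full")
        self.memory[:META_SIZE] = raw.ljust(META_SIZE, b"\x00")

    def allocate(self, name, length):
        # Place the new region after the highest existing one
        # (simplistic; freed space is not reused).
        base = max([META_SIZE] + [b + n for b, n in self.regions.values()])
        if base + length > len(self.memory):
            raise MemoryError("out of persistent memory")
        self.regions[name] = [base, length]
        self._store_metadata()   # metadata is rewritten on every change
        return base

    def deallocate(self, name):
        del self.regions[name]
        self._store_metadata()


memory = bytearray(4096)
pmm = PmmSketch(memory)
base = pmm.allocate("audit_trail_0", 1024)
memory[base:base + 5] = b"hello"

# Simulate a restart: a fresh manager rebuilds its view from the memory alone.
recovered = PmmSketch(memory)
```

The sketch rewrites the metadata area in one step; a real manager would also need the metadata update itself to be atomic with respect to failures, which is not modeled here.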

In a system 300 in which it is not feasible to initiate RDMA data transfers over SAN 306, directly or indirectly, using the Load and Store memory instructions of processing node 302, reading and writing the contents of CPMU 310 requires an application running on processing node 302 to initiate RDMA through an application programming interface (API).

As should be apparent, one reason non-disk persistent memory is attractive is that it supports read and write operations on durably stored data at a finer granularity (meaning smaller access sizes) than disk drives do.
This fine granularity applies both to access size (the number of bytes read or written) and to access alignment (the offset, within the non-disk persistent memory region, of the first byte read or written).
Because data structures within a non-disk persistent memory region can be laid out freely, capacity can be used more efficiently and effectively than with disk.
Another advantage over block-oriented disk storage and flash memory is that modifying a small amount of data does not require first reading a large block of data and then writing it back; instead, a write operation can simply change only the bytes that need to change.
The raw speed of non-disk persistent memory is also attractive.
Its access latency is an order of magnitude better than that of disk drives.
Non-disk persistent memory is also considerably easier to use than disk drives, because pointer-rich data structures can be stored in it without first converting every pointer to a relative byte address on write and then converting the relative byte addresses back to pointers on read.
This so-called marshalling-unmarshalling overhead can be quite substantial for complex data structures.
All of the above factors not only let application programmers speed up access to and manipulation of data structures that they have already made persistent, but also let them consider making persistent certain data structures that they would not have considered making persistent on slower storage devices such as disk drives and flash memory.
The greater the degree of persistence in an information processing system, the less information is lost, and so the easier and faster recovery from a failure becomes.
Faster recovery implies higher system availability.
Thus, the net benefit of the described embodiments lies not only in performance but also in increased availability.
In mission-critical transaction processing systems, where the cost of system unavailability is high, the availability benefits of the described embodiments may in fact be even more valuable than the performance benefits.
Furthermore, the described embodiments can be used to enable new or improved database features, such as in-memory operation.
Applications other than databases can also take advantage of the improved performance and availability of system 300 to deliver new customer capabilities.
They are too numerous to list here, but many such applications will be apparent to those skilled in the art.
One such application is described next.

(Reducing transaction commit time using non-disk persistent memory)
With respect to the transaction processing system 100 of FIG. 1, in the absence of non-disk persistent memory, two things must happen before transaction monitor 104 can commit a database transaction.
First, log writer 106 must flush (i.e., completely write) to durable media all audit information associated with that transaction that it has received from database writer 102.
Then, transaction monitor 104 must also write a commit record for the transaction to durable media.
According to the described embodiments, using non-disk persistent memory allows a transaction to commit after these items of information have been written only to non-disk persistent memory.
To the extent that non-disk persistent memory exhibits lower write latency than disk storage, transaction latency can be shortened accordingly (and TMF transaction throughput possibly improved).

In one embodiment of transaction processing, persistent memory is used to accelerate transaction commit.
Specifically, with reference to FIG. 2, as soon as the primary ADP 218 receives state change messages from DP2 202, it synchronously records those state changes in persistent memory before sending an acknowledgment message to DP2.
The two-phase flush messages from TMP 204 can be reduced or eliminated entirely, because ADP 206 is “always flushed”.
With several inter-process communication steps and disk operations thus removed from the critical path of transaction commit, TMP 204 can commit transactions much faster when ADP 206 uses non-disk persistent memory than when it uses disk.
In one experiment conducted by the inventors, transaction response time was found to be 3.5 times faster with persistent memory than without it.

As an example, consider FIG. 6, which shows a transaction processing system according to one embodiment, generally at 600.
System 600 comprises a database writer 602, a transaction monitor 604, and a log writer 606.

According to the described embodiment, log writer 606 includes a primary audit disk process 608 and a backup disk process 610.
A pair of non-disk persistent memory units is provided: a primary non-disk persistent memory unit 612 and a mirror non-disk persistent memory unit 614 (each also referred to as a “CPMU”).
A primary audit log disk 616 and a mirror audit disk log 618 are provided for purposes that will become apparent below.

In the illustrated and described embodiment, data is written to both the primary non-disk persistent memory unit 612 and the mirror non-disk persistent memory unit 614.
In some embodiments, data can be written to the primary and mirror units simultaneously.
Alternatively, in some embodiments, it is not necessary to write data to the primary unit and mirror unit simultaneously.
If the system is fully functional, in some embodiments the information is read from the primary non-disk persistent memory unit 612 or the mirrored non-disk persistent memory unit 614.
Should only one of the non-disk persistent memory units (612, 614) fail, data will be read from the surviving non-disk persistent memory unit.
When the failed primary non-disk persistent memory unit is ready to function properly, the contents can be restored from the surviving non-disk persistent memory unit.
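The mirrored write, fail-over read, and later restoration described above might look roughly like the following Python sketch. The names `MemoryUnit`, `MirroredPair`, `UnitFailed`, and `resync` are hypothetical, and the choice to keep accepting writes while only one unit survives is an assumption rather than something stated in this description.

```python
class UnitFailed(Exception):
    pass


class MemoryUnit:
    """Stands in for one non-disk persistent memory unit (612 or 614)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.failed = False

    def write(self, offset, payload):
        if self.failed:
            raise UnitFailed()
        self.data[offset:offset + len(payload)] = payload

    def read(self, offset, nbytes):
        if self.failed:
            raise UnitFailed()
        return bytes(self.data[offset:offset + nbytes])


class MirroredPair:
    def __init__(self, primary, mirror):
        self.units = [primary, mirror]

    def write(self, offset, payload):
        # Write to every working unit; succeed if at least one accepts the data.
        ok = 0
        for unit in self.units:
            try:
                unit.write(offset, payload)
                ok += 1
            except UnitFailed:
                pass
        if ok == 0:
            raise UnitFailed("both units failed")

    def read(self, offset, nbytes):
        # Read from the first unit that is still working.
        for unit in self.units:
            try:
                return unit.read(offset, nbytes)
            except UnitFailed:
                continue
        raise UnitFailed("both units failed")

    def resync(self, repaired, survivor):
        # Restore a repaired unit's contents from the surviving unit.
        repaired.data[:] = survivor.data
        repaired.failed = False


primary, mirror = MemoryUnit(64), MemoryUnit(64)
pair = MirroredPair(primary, mirror)
pair.write(0, b"audit-1")
primary.failed = True            # the primary unit fails
pair.write(7, b"audit-2")        # only the mirror receives this write
recovered = pair.read(0, 14)     # served by the surviving mirror
pair.resync(primary, mirror)     # primary is repaired and restored
```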

In the illustrated and described embodiment, one region per audit trail is allocated within each non-disk persistent memory unit 612, 614, and the audit disk process pair 608, 610 maintains a write-aside buffer (shown as “WAB”) within each region.
Any suitable write-aside buffer configuration can be used, but in this example the write-aside buffer is configured as a circular buffer, as will be understood by those skilled in the art.
When primary audit disk process 608 receives a set of changes from database writer 602, it uses non-disk persistent memory units 612, 614 to commit those changes very quickly.
Specifically, in the illustrated example, when ADP 608 receives a set of changes, it appends that information to the WAB of CPMU 612 at the WAB's tail address and then advances the tail address to point just past the end of the most recently written information.
The operation is then repeated for the WAB of CPMU 614.
Those skilled in the art can vary the degree of concurrency between the write operations to CPMUs 612, 614.

According to the described embodiment, if completing a requested write operation would cause the tail pointer to advance past the head address of the WAB, write operations to the non-disk persistent memory region are suspended and the WAB is marked full.
The algorithm for advancing the head and tail addresses of the WAB resembles, as will be understood by those skilled in the art, the typical approach to implementing a circular queue data structure, except that both the head and tail addresses of the WAB are stored and updated within the same non-disk persistent memory region that contains the WAB's circular buffer.
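A circular write-aside buffer whose head and tail live in the same region as the data can be sketched as follows. This is only an illustration under stated assumptions: the class `WriteAsideBuffer`, the 16-byte header layout, the representation of head and tail as ever-increasing byte counts (reduced modulo the capacity when indexing), and the `release` method that frees space are all hypothetical choices, and a `bytearray` stands in for the persistent memory region.

```python
import struct

HEADER = struct.Struct("<QQ")   # head and tail, kept in the region itself


class WriteAsideBuffer:
    def __init__(self, region):
        self.region = region                       # stands in for a CPMU region
        self.capacity = len(region) - HEADER.size  # bytes available for data
        self.full = False

    def _pointers(self):
        return HEADER.unpack_from(self.region, 0)

    def _set_pointers(self, head, tail):
        HEADER.pack_into(self.region, 0, head, tail)

    def append(self, payload):
        head, tail = self._pointers()
        if (tail - head) + len(payload) > self.capacity:
            self.full = True          # tail would pass head: suspend writes
            return False
        for i, byte in enumerate(payload):
            self.region[HEADER.size + (tail + i) % self.capacity] = byte
        self._set_pointers(head, tail + len(payload))   # advance tail last
        return True

    def release(self, nbytes):
        # Free nbytes at the head once they no longer need to be retained.
        head, tail = self._pointers()
        self._set_pointers(min(head + nbytes, tail), tail)
        self.full = False

    def pending(self):
        # Everything between head and tail, in write order.
        head, tail = self._pointers()
        return bytes(self.region[HEADER.size + i % self.capacity]
                     for i in range(head, tail))


wab = WriteAsideBuffer(bytearray(HEADER.size + 8))   # 8 data bytes
first = wab.append(b"abcde")     # fits
second = wab.append(b"fghi")     # would overrun the head: refused, marked full
was_full = wab.full
wab.release(5)                   # the first five bytes are no longer needed
third = wab.append(b"fghi")      # now fits, wrapping around the end
```

Because the pointers are stored in the region, a new `WriteAsideBuffer` constructed over the same bytes sees the same pending data, which is the property the text relies on for recovery.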

When cost-effective non-disk persistent memory is used, slower disk I/O can be removed from the transaction commit process entirely, because the whole log volume is implemented using faster persistent memory devices rather than disks.
For now and for the near future, however, disk capacity is likely to remain much larger than non-disk persistent memory capacity, and the cost per byte of disk storage is likely to remain considerably lower than that of non-disk persistent memory.
In that case, a relatively smaller-capacity non-disk persistent memory unit is used to implement a WAB for a relatively larger-capacity disk drive.
In this and similar designs, audit information written synchronously to non-disk persistent memory is subsequently written to disk lazily and asynchronously.
Various techniques for choosing which information to retain in non-disk persistent memory and which to flush to disk will be apparent to those skilled in the art of memory hierarchy design.
For example, under a fuzzy control point scheme for transaction recovery, two “control points” worth of log information can be retained in persistent memory so that, when the system crashes, the usual recovery process of undoing and then redoing in-flight transactions can be carried out entirely from persistent memory.

Furthermore, when such cost-constrained non-disk persistent memory technology is used, ADP 608 can continue to use disk write operations, but it need not wait for those disk operations to complete before allowing TMP 604 to commit a transaction.
Instead, ADP 608 eagerly and synchronously writes all audit information received from database writer 602 to non-disk persistent memory devices 612, 614, while exercising the option of combining information from multiple messages received within a suitably chosen time interval.
Because disk operations are no longer waited on as described above, ADP 608 can now write more data per disk operation, and thus incurs less I/O-related overhead because fewer operations in total are performed for the same amount of audit trail data.
This improves disk throughput for audit disks 616, 618 and CPU utilization on the CPU that runs ADP 608.
Thus, when ADP 608 receives an audit trail flush request from TMP 604, it flushes any pending changes to the WABs in CPMUs 612, 614.
ADP 608 also buffers that information so that it can be written lazily to audit log disks 616, 618.
When a predetermined condition is met, for example when the buffered information exceeds a particular threshold or a fixed maximum time interval elapses, ADP 608 issues a disk write operation regardless of whether it has received a flush message from TMP 604.
Unlike the conventional case, however, a transaction in the present system can commit before the audit information has been written to disks 616, 618, as long as it has been written to CPMUs 612, 614.
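The lazy disk-write policy described above (write synchronously to persistent memory, then issue a disk write once a size threshold or a maximum interval is reached) can be sketched in Python as follows. The class `LazyAuditFlusher`, its parameter names, the default values, and the injected `persist`, `disk_write`, and `clock` callables are hypothetical and serve only to make the policy concrete.

```python
import time


class LazyAuditFlusher:
    def __init__(self, persist, disk_write, max_bytes=512 * 1024, max_age=0.5,
                 clock=time.monotonic):
        self.persist = persist          # synchronous write to persistent memory
        self.disk_write = disk_write    # issues one (large) disk write
        self.max_bytes = max_bytes      # size threshold in bytes
        self.max_age = max_age          # maximum buffering interval in seconds
        self.clock = clock
        self.buffer = bytearray()
        self.oldest = None              # arrival time of the oldest buffered data

    def record(self, audit):
        self.persist(audit)             # durable now; the commit need not wait for disk
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer += audit
        self.maybe_flush()

    def maybe_flush(self):
        if not self.buffer:
            return
        too_big = len(self.buffer) >= self.max_bytes
        too_old = self.clock() - self.oldest >= self.max_age
        if too_big or too_old:
            self.disk_write(bytes(self.buffer))   # one write for many messages
            self.buffer.clear()
            self.oldest = None


# Example with a fake clock so the behavior is deterministic.
persisted, disk, now = [], [], [0.0]
flusher = LazyAuditFlusher(persisted.append, disk.append,
                           max_bytes=10, max_age=5.0, clock=lambda: now[0])
flusher.record(b"aaaa")       # persisted, but below both thresholds
flusher.record(b"bbbbbbb")    # 11 buffered bytes: size threshold triggers a disk write
flusher.record(b"cc")
now[0] = 6.0                  # time passes
flusher.maybe_flush()         # age threshold triggers a disk write
```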

When the sequential write operations lazily issued to audit disks 616, 618 complete, some of the information stored in the CPMU WABs becomes eligible to be overwritten.
The head address in the appropriate region of CPMUs 612, 614 is then advanced past the last byte that was successfully written back.
If the non-disk persistent memory unit had been marked “full” before ADP 608 received the disk I/O completion, it is now marked “not full”.
ADP 608 can then resume using the WAB.
Whenever ADP 608 suspends use of the WAB, it reverts to waiting for outstanding disk I/O to log volumes 616, 618 before committing transactions.
In that situation, the size of write I/O operations is typically set to a value (e.g., 4 KB to 128 KB, depending on the amount of audit collected between commit records) smaller than the value that yields optimal disk throughput (e.g., 128 KB to 1 MB), in order to limit the effect of disk I/O latency on transaction response time.
When non-disk persistent memory is used, transactions do not wait for disk I/O to complete, so ADP 608 can wait until more audit data has been buffered for writing.
A larger disk write I/O size (e.g., 512 KB) can therefore be used to obtain near-optimal throughput to log volumes 616, 618 and to greatly reduce the disk I/O overhead on the CPU that runs ADP 608.

With a comparable amount of design change, a write-aside buffer using non-disk persistent memory can be created for any other application whose performance is adversely affected by waiting for disk write operations.

Other variations on the design will be apparent to those skilled in the art.
For example, rather than writing to the primary and mirror non-disk persistent memory units sequentially, one after the other, an application can choose to write to them concurrently.

(Example method)
FIG. 7 illustrates steps in a method according to one embodiment.
In the illustrated and described embodiment, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
Further, the method can be implemented using any suitably configured non-disk persistent memory structure.
Specific, non-limiting examples of non-disk persistent memory structures are shown and described throughout this specification.

Step 700 receives data associated with a state change induced by a transaction.
Such data can describe database state changes resulting from transactions.
In the illustrated and described embodiment, this data is received from a database writer component such as those described above.
Step 702 writes the data to non-disk persistent memory.
As noted, any suitable non-disk persistent memory structure can be utilized.
For example, in the example of FIG. 6, a primary non-disk persistent memory unit and a mirror non-disk persistent memory unit are utilized.
Step 704 checks to see if a non-disk persistent memory unit threshold has been reached.
If not, the method returns to step 700.
On the other hand, if the non-disk persistent memory unit threshold is reached, step 706 writes the data in the non-disk persistent memory to an audit log disk such as the audit log disk 616 in FIG.
In the example of FIG. 6, writing the audit log in this way is referred to as being done lazily.
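Steps 700 through 706 can be read as a simple loop, sketched below in Python. The function name `run_method`, the use of plain lists to stand in for the non-disk persistent memory and the audit log disk, and the interpretation of the threshold as a byte count are assumptions made for illustration.

```python
def run_method(changes, persistent_memory, audit_log_disk, threshold):
    """Steps 700-706 of FIG. 7 as a loop over incoming state changes."""
    for change in changes:                         # step 700: receive data
        persistent_memory.append(change)           # step 702: write to persistent memory
        buffered = sum(len(c) for c in persistent_memory)
        if buffered >= threshold:                  # step 704: threshold reached?
            # step 706: write the accumulated data to the audit log disk
            audit_log_disk.append(b"".join(persistent_memory))
            persistent_memory.clear()              # space can be reused (assumption)


pm, disk = [], []
run_method([b"aa", b"bbb", b"c", b"dddd"], pm, disk, threshold=5)
```

With a threshold of 5 bytes, the four changes above result in two disk writes, each carrying the changes accumulated since the previous write.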

(Use of non-disk persistent memory for log writer checkpointing)
Consider now FIG. 8, which shows a transaction processing system according to one embodiment, generally at 800.
The system 800 includes a database writer 802, a transaction monitor 804, and a log writer 806.
According to at least one embodiment, the system 800 can be utilized in a transaction commit process as described immediately above.
Further, the system 800 can be used for log writer checkpointing as described below.
In this example, the components of system 800 are the same as or similar to the components of FIG.
Therefore, for the sake of brevity, these components will not be described again here.

In the system of FIG. 8, database writer 802 sends audit records, or state changes, to log writer 806.
Log writer 806 receives the audit records, writes them to persistent memory 812, 814, and buffers the audit data in memory.
At commit time, transaction monitor 804 indicates to log writer 806 that the transaction should be committed.
Log writer 806 accordingly receives the commit record, writes it to persistent memory 812, 814, and buffers it in memory.
The final step in committing a transaction using system 800 is for log writer 806 to “lazily” write the audit records and commit record to audit log disks 816, 818.
In this example, the write operation by log writer 806 is called “lazy” because it takes place asynchronously with respect to the commit process.
In this embodiment, ADP backup process 810 needs to read persistent memory only if the ADP primary process fails (shown diagrammatically by the dotted lines from the ADP backup process to persistent memory 812, 814).

In this particular example, the same persistent memory units that are employed as the write-aside buffer are also employed for log writer checkpointing.
Thus, the exact same circular buffer in persistent memory can be used for both purposes.
Furthermore, an advantage is obtained in that the ADP backup process does not have to read persistent memory checkpoint information until the ADP primary process fails.

As will be appreciated by those skilled in the art, this approach reduces processing overhead and also reduces data copying.
Specifically, checkpointing to the ADP backup process conventionally required sending a message containing the state changes to the ADP backup process, after which those changes were written to audit log disks 816, 818.
With the approach above, the ADP backup process is removed from the checkpointing loop, which reduces processing overhead.
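The following Python sketch illustrates the checkpointing arrangement described above, in which the primary process records state only in persistent memory and the backup process reads that memory only when it takes over. The class `AdpProcess`, the dictionary standing in for persistent memory 812, 814, the `flushed` counter kept alongside the records, and the method names are hypothetical.

```python
class AdpProcess:
    def __init__(self, pm, disk):
        self.pm = pm          # dict standing in for persistent memory 812, 814
        self.disk = disk      # list standing in for audit log disks 816, 818
        self.buffer = []      # volatile in-memory buffer, lost if the process fails

    def record(self, audit):
        # Primary path: no checkpoint message is sent to the backup process.
        self.pm["records"].append(audit)
        self.buffer.append(audit)

    def lazy_flush(self):
        # Write buffered records to disk and note how many are now on disk.
        self.disk.extend(self.buffer)
        self.pm["flushed"] += len(self.buffer)
        self.buffer.clear()

    def take_over(self):
        # Backup path: read persistent memory only after the primary has failed,
        # rebuilding the buffer of records not yet written to disk.
        self.buffer = list(self.pm["records"][self.pm["flushed"]:])


pm = {"records": [], "flushed": 0}
disk = []
primary = AdpProcess(pm, disk)
primary.record(b"r1")
primary.lazy_flush()
primary.record(b"r2")
primary.record(b"r3")          # the primary fails before these reach disk

backup = AdpProcess(pm, disk)  # the backup has received no messages so far
backup.take_over()
backup.lazy_flush()
```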

(Use of non-disk persistent memory for all checkpointing)
According to one embodiment, non-disk persistent memory can be used for transaction commit, as described above, and also for both database writer checkpointing and log writer checkpointing.
In this approach, the transaction commit process is simplified by eliminating two steps in the data flow: the steps that checkpoint to the DP2 backup process and to the ADP backup process, respectively.

As an example, consider FIG. 9, which shows a transaction processing system according to one embodiment, generally at 900.
The system 900 includes a database writer 902, a transaction monitor 904, and a log writer 906.
In this example, the components of system 900 are the same as or similar to the components of FIG.
Therefore, for the sake of brevity, these components will not be described again here.

In this embodiment, rather than the database writer sending audit information or state changes to the log writer 906, the audit information or state changes are written to persistent memory 912, 914.
This effectively eliminates checkpointing of audit information to the DP2 backup process.
Thus, the DP2 backup process (not specifically shown) needs to read this information from persistent memory only if the DP2 primary process fails.
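The following minimal Python sketch illustrates the idea of the preceding paragraph: the primary process checkpoints to persistent memory instead of to its backup, and the backup touches persistent memory only on takeover. The class names, the record layout, and the in-memory list standing in for persistent memory 912, 914 are all hypothetical and chosen for illustration; they are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class PersistentMemory:
    """Hypothetical stand-in for a non-disk persistent memory unit."""
    records: list = field(default_factory=list)

    def append(self, record: dict) -> None:
        self.records.append(record)


class DatabaseWriterPrimary:
    """Primary process: checkpoints each state change to persistent
    memory rather than sending a message to its backup process."""

    def __init__(self, pm: PersistentMemory) -> None:
        self.pm = pm
        self.state: dict = {}

    def apply(self, txn_id: int, key: str, value: str) -> None:
        self.state[key] = value
        # The checkpoint is a write to persistent memory only.
        self.pm.append({"txn": txn_id, "key": key, "value": value})


class DatabaseWriterBackup:
    """Backup process: idle during normal operation; reads persistent
    memory only when it takes over from a failed primary."""

    def __init__(self, pm: PersistentMemory) -> None:
        self.pm = pm
        self.state: dict = {}
        self.reads = 0  # counts how often persistent memory was read

    def take_over(self) -> dict:
        self.reads += 1
        # Replay the checkpointed changes in order to rebuild state.
        for rec in self.pm.records:
            self.state[rec["key"]] = rec["value"]
        return self.state
```

In this sketch the backup performs no work per transaction; its cost is paid once, at failover, when it replays the checkpointed records.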

Subsequently, once audit information has been written to persistent memory 912, 914, the log writer 906 can read the audit information and buffer it in memory in preparation for committing the information to disk.
When this stage of the process is complete, the transaction monitor 904 can begin the next stage of the commit process, which causes the information to be committed to disk.
To this end, the transaction monitor writes a commit record to persistent memory 912, 914, and the log writer 906 reads the commit record from persistent memory 912, 914, buffers it in memory, and writes it to disk.
Advantageously, the respective backup processes of the database writer (DP2) and the log writer (ADP) need to read persistent memory 912, 914 only if their respective primary process fails.

According to this embodiment, the log writer 906 can now write audit information and commit records to disk lazily and asynchronously.
This effectively decouples the log writer from the commit process.
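As a rough illustration of this decoupling, the sketch below separates the commit path (a write to persistent memory) from a later drain step that buffers the records and writes them to disk. The `LazyLogWriter` class, its method names, the record tuples, and the `batch_size` parameter are assumptions made for this example; plain Python lists stand in for persistent memory and for the audit log disk.

```python
from collections import deque


class LazyLogWriter:
    """Hypothetical log writer whose disk writes are off the commit path."""

    def __init__(self) -> None:
        self.persistent_memory: list = []  # stand-in for units 912, 914
        self.buffer: deque = deque()       # volatile in-memory buffer
        self.disk: list = []               # stand-in for the audit log disk
        self._next_unread = 0              # first PM index not yet buffered

    def commit(self, txn_id: int, audit_records: list) -> bool:
        """Commit path: finishes once the records are in persistent memory."""
        self.persistent_memory.extend(audit_records)
        self.persistent_memory.append(("COMMIT", txn_id))
        return True  # no disk write has happened yet

    def drain(self, batch_size: int = 64) -> int:
        """Run later: read new records from persistent memory, buffer
        them, and write up to batch_size of them to disk."""
        new = self.persistent_memory[self._next_unread:]
        self.buffer.extend(new)
        self._next_unread += len(new)
        written = 0
        while self.buffer and written < batch_size:
            self.disk.append(self.buffer.popleft())
            written += 1
        return written
```

Here `commit` returns before anything reaches `disk`, and `drain` can be scheduled whenever convenient, which is the sense in which the log writer is removed from the commit path.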

Advantageously, by using persistent memory for the checkpointing process as well as for the transaction commit process, as described above in connection with FIG. 6, the data associated with a transaction can be made persistent much more quickly than before.
Thus, an update becomes durable as soon as its record is written to persistent memory, and the transaction commit process is much faster because it does not have to wait for any other step to complete.
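The claims describe a primary non-disk persistent memory unit and a mirror unit. The sketch below shows one way a write could be treated as durable once both units hold the record. This acknowledgement policy, the class name, and the `primary_available` flag are assumptions made for illustration; the text does not specify them.

```python
class MirroredPersistentMemory:
    """Hypothetical pair of persistent memory units (primary and mirror)."""

    def __init__(self) -> None:
        self.primary: list = []
        self.mirror: list = []

    def write(self, record: dict) -> int:
        """Store the record in both units and return its position.
        In this sketch, returning is the point at which the update is
        considered durable; no disk write is involved."""
        self.primary.append(record)
        self.mirror.append(record)
        return len(self.primary) - 1

    def read_all(self, primary_available: bool = True) -> list:
        """Read every record, falling back to the mirror if the
        primary unit is unavailable."""
        source = self.primary if primary_available else self.mirror
        return list(source)
```

Because both units hold identical contents after each `write`, a reader can recover the full record sequence from either one.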

(Exemplary computer system)
In one embodiment, the systems described above can be implemented in a computer system 1000 such as that shown in FIG. 10.
The computer system 1000, or various combinations of its constituent components, can be used to implement the systems described above, including the processor nodes and the various non-disk persistent memory units.

Referring to FIG. 10, an exemplary computer system 1000 (e.g., a personal computer, workstation, or mainframe) includes a data bus 1014 that communicatively couples its various components.
As shown in FIG. 10, a processor 1002 is coupled to the bus 1014 for processing information and instructions.
Computer-readable volatile memory, such as RAM 1004, is also coupled to the bus 1014 and stores information and instructions for the processor 1002.
In addition, a computer-readable read-only memory (ROM) 1006 is coupled to the bus 1014 and stores static information and instructions for the processor 1002.
A data storage device 1008, such as a magnetic disk medium or optical disk medium, is also coupled to the bus 1014.
The data storage device 1008 is used to store large amounts of information and instructions.
An alphanumeric input device 1010 with alphanumeric and function keys is coupled to the bus 1014 and communicates information and command selections to the processor 1002.
A cursor control device 1012, such as a mouse, is coupled to the bus 1014 and communicates user input information and command selections to the central processor 1002.
An input/output communication port 1016 is coupled to the bus 1014 and communicates with, for example, a network, other computers, or other processors.
A display 1018 is coupled to the bus 1014 and displays information to the computer user.
The display device 1018 can be a liquid crystal device, a cathode ray tube, or another display device suitable for creating graphic images and alphanumeric characters recognizable by the user.
The alphanumeric input device 1010 and the cursor control device 1012 allow the computer user to dynamically signal the two-dimensional movement of a visible symbol (pointer) on the display 1018.
A non-disk persistent memory unit 1020 is also provided. As those skilled in the art will appreciate, the non-disk persistent memory unit 1020 may comprise any of the embodiments described above, or any other non-disk persistent memory structure that exhibits the behavior described above.

(Conclusion)
The various embodiments described above use non-disk persistent memory in conjunction with a transaction processing system.
Committing transactions using non-disk persistent memory reduces the time associated with transaction commit.
This in turn reduces the demand on resources within the transaction processing system, which can improve the throughput of the transaction processing system.

Although the invention has been described in language specific to structural features and/or method steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described.
Rather, the specific features and steps are disclosed as preferred forms of implementing the claimed invention.

FIG. 1 illustrates an exemplary transaction processing system whose components can be used in conjunction with one or more embodiments.
FIG. 2 illustrates one implementation of the transaction processing system of FIG. 1.
FIG. 3 illustrates an exemplary embodiment of a non-disk persistent memory unit.
FIG. 4 illustrates an exemplary embodiment of a non-disk persistent memory unit.
FIG. 5 illustrates another exemplary embodiment of a non-disk persistent memory unit.
FIG. 6 illustrates an exemplary transaction processing system that uses a non-disk persistent memory unit in accordance with one or more embodiments.
FIG. 7 is a flow diagram that describes steps in a method in accordance with one embodiment.
FIG. 8 illustrates an exemplary transaction processing system that uses a non-disk persistent memory unit in accordance with one or more embodiments.
FIG. 9 illustrates an exemplary transaction processing system that uses a non-disk persistent memory unit in accordance with one or more embodiments.
FIG. 10 illustrates an exemplary computer system that can be used in one or more of the embodiments described herein.

Explanation of symbols

100 ... Transaction processing system
102 ... Database writer
104 ... Transaction monitor
106 ... Log writer
202 ... Database writer
204 ... Transaction monitor
206 ... Log writer
208 ... Audit log disk
210 ... Client
222 ... Database disk
302 ... Processor node
306 ... RDMA-enabled system area network (SAN)
310 ... Communication-link-attached persistent memory unit (CPMU)
400 ... Communication-link-attached persistent memory unit
402 ... Non-volatile memory
500 ... Communication-link-attached persistent memory unit
502 ... Volatile memory
508 ... Non-volatile auxiliary storage device
510 ... Battery
600 ... Transaction processing system
602 ... Database writer
604 ... Transaction monitor
606 ... Log writer
608 ... Audit disk process (primary)
610 ... Audit disk process (backup)
612, 614 ... Persistent memory
616, 618 ... Audit log disk
800 ... Transaction processing system
802 ... Database writer
804 ... Transaction monitor
806 ... Log writer
808 ... Audit disk process (primary)
810 ... Audit disk process (backup)
812, 814 ... Persistent memory
816, 818 ... Audit log disk
900 ... Transaction processing system
902 ... Database writer
904 ... Transaction monitor
906 ... Log writer
912, 914 ... Persistent memory
1002 ... Processor
1008 ... Data storage device
1010 ... Alphanumeric input device
1012 ... Cursor control device
1016 ... Communication port
1018 ... Display
1020 ... Non-disk persistent memory unit

Claims

A transaction processing system (800) comprising:
a database writer (802) configured to process data in accordance with one or more transactions in the transaction processing system;
a transaction monitor (804) for monitoring transactions in the transaction processing system;
a log writer (806) for maintaining audit trail data relating to transactions in the transaction processing system; and
one or more non-disk persistent memory units (812, 814),
wherein the transaction processing system is configured to use the one or more non-disk persistent memory units (812, 814) for checkpointing.

The transaction processing system of claim 1, wherein the one or more non-disk persistent memory units comprise a primary non-disk persistent memory unit (812) and a mirror non-disk persistent memory unit (814).

The transaction processing system of claim 1, wherein the log writer (806) is configured to use the one or more non-disk persistent memory units (812, 814) for checkpointing.

The transaction processing system of claim 1, wherein the log writer (906) and the database writer (902) are configured to use the one or more non-disk persistent memory units for checkpointing.

The transaction processing system of claim 1, wherein the one or more non-disk persistent memory units (812, 814) include a write-aside buffer configured as a circular buffer, and wherein the log writer (906) and the database writer (902) are configured to use the circular buffer for checkpointing.

A computer-implemented method comprising:
receiving data related to a state change induced by a transaction (700); and
performing a checkpoint by writing the received data to non-disk persistent memory (702).

The computer-implemented method of claim 6, wherein performing the checkpoint is performed by a log writer.

The computer-implemented method of claim 6, wherein performing the checkpoint is performed by a database writer.

The computer-implemented method of claim 6, wherein performing the checkpoint is performed by at least a log writer and a database writer.

The computer-implemented method of claim 6, wherein performing the checkpoint comprises writing the received data to first and second non-disk persistent memory units,
the first non-disk persistent memory unit comprising a primary non-disk persistent memory unit, and
the second non-disk persistent memory unit comprising a mirror non-disk persistent memory unit.