[go: up one dir, main page]

CN105242979B - It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature - Google Patents

It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature Download PDF

Info

Publication number
CN105242979B
CN105242979B CN201510571405.5A CN201510571405A CN105242979B CN 105242979 B CN105242979 B CN 105242979B CN 201510571405 A CN201510571405 A CN 201510571405A CN 105242979 B CN105242979 B CN 105242979B
Authority
CN
China
Prior art keywords
message
fault
messages
sending
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510571405.5A
Other languages
Chinese (zh)
Other versions
CN105242979A (en
Inventor
高胜法
邵春阳
高雅娴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201510571405.5A priority Critical patent/CN105242979B/en
Publication of CN105242979A publication Critical patent/CN105242979A/en
Application granted granted Critical
Publication of CN105242979B publication Critical patent/CN105242979B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Retry When Errors Occur (AREA)
  • Maintenance And Management Of Digital Transmission (AREA)

Abstract

本发明公开了一种具有前向恢复特征的后向恢复容错方法。该方法以故障位控制消息的发送;若发送和接收进程均无故障,发送进程发送消息前先把消息和其对应的逻辑时钟存放在消息日志然后发送消息;若进程处于故障恢复阶段禁止其发送消息;若接收进程有故障,发送进程等待直至其恢复后才发送消息至此进程;从而达到无故障进程在系统故障恢复阶段有条件执行的效果。当进程发生故障时,在恢复线程控制下首先从消息日志获取重演消息和对应逻辑时钟,然后根据消息的逻辑时钟对重演消息重新排序;最后把排序后的消息重新发送至故障进程,故障进程重新接收消息、处理消息,从而实现消息的重演。

The invention discloses a backward recovery fault-tolerant method with forward recovery features. This method uses fault bits to control the sending of messages; if both the sending and receiving processes are faultless, the sending process stores the message and its corresponding logical clock in the message log before sending the message; if the process is in the fault recovery stage, its sending is prohibited message; if the receiving process is faulty, the sending process waits until it recovers before sending a message to this process; thereby achieving the effect of conditional execution of a fault-free process in the system fault recovery phase. When a process fails, under the control of the recovery thread, the replay message and the corresponding logical clock are first obtained from the message log, and then the replay message is reordered according to the logical clock of the message; finally, the sorted message is resent to the faulty process, and the faulty process restarts Receive messages, process messages, so as to realize the replay of messages.

Description

一种具有前向恢复特征的后向恢复容错方法A Fault Tolerant Method for Backward Recovery with Forward Recovery Characteristic

技术领域technical field

本发明涉及分布式系统基于消息重排序消息日志故障恢复方法。The invention relates to a fault recovery method of a distributed system based on message reordering message logs.

背景技术Background technique

在分布式系统后向恢复容错研究领域,按保存信息的类型,回卷恢复协议可分为基于检查点的回卷恢复协议和基于消息日志的回卷恢复协议。基于消息日志回卷恢复的协议又细分为悲观消息日志和乐观消息日志。In the research field of backward recovery fault tolerance in distributed systems, according to the type of stored information, rollback recovery protocols can be divided into checkpoint-based rollback recovery protocols and message log-based rollback recovery protocols. Protocols based on message log rollback recovery are subdivided into pessimistic message logs and optimistic message logs.

采用悲观消息日志[1]的系统中,进程接收消息后总是先将其保存至消息日志中,然后才提交给应用程序处理。此特征确保进程所接收消息及其接收次序完整保存在消息日志文件,使得进程的故障恢复较为简单;但是由于接收消息后须先保存消息信息至消息日志因此严重影响了系统正常运行时的性能。按由何进程保存消息信息至消息日志分,悲观消息日志协议分为基于发送者的消息日志协议(sender-based)[2][3][4]和基于接收者的消息日志协议(receiver-based)。In the system using pessimistic message log [1], after the process receives the message, it always saves it in the message log first, and then submits it to the application program for processing. This feature ensures that the messages received by the process and their receiving order are completely stored in the message log file, making the recovery of the process easier; but because the message information must be saved to the message log after receiving the message, it seriously affects the performance of the system during normal operation. According to which process saves the message information to the message log, the pessimistic message log protocol is divided into sender-based message log protocol (sender-based)[2][3][4] and receiver-based message log protocol (receiver-based based).

在基于接收者的消息日志协议中,进程接收消息后首先把消息的接收次序等信息存入消息日志,然后再提交此消息给应用程序处理。In the receiver-based message log protocol, after a process receives a message, it first stores information such as the receiving order of the message into the message log, and then submits the message to the application program for processing.

在基于发送者的消息日志协议[2][3][4]中,消息的发送进程发送一个消息前首先把消息内容保存在内存缓冲区中,然后发送此消息至接收进程。接收进程接收消息后先把该消息的接收次序传送给消息的发送进程,然后再提交消息至应用进程。消息的发送进程接收到消息的接收次序后把消息的内容及其接收次序统一存入消息日志。In the sender-based message log protocol [2][3][4], the message sending process first saves the message content in the memory buffer before sending a message, and then sends the message to the receiving process. After receiving the message, the receiving process first transmits the receiving sequence of the message to the sending process of the message, and then submits the message to the application process. After the sending process of the message receives the receiving order of the message, the content of the message and the receiving order are uniformly stored in the message log.

采用乐观消息日志[1]的系统中,进程接收消息后即刻提交给应用程序处理,而后在进程空闲时才保存所接收消息至消息日志。此特征使得系统进程在正常运行时具有良好的性能;然而发生故障时,若存在进程已接收且未保存至消息日志的消息,此类消息的接收次序必然丢失,使得进程的故障恢复极为复杂。In the system using optimistic message log [1], the process submits the message to the application program immediately after receiving the message, and then saves the received message to the message log when the process is idle. This feature enables the system process to have good performance during normal operation; however, when a failure occurs, if there are messages received by the process but not saved in the message log, the receiving order of such messages will inevitably be lost, making the recovery of the process extremely complicated.

基于消息重排序和消息数目检验的日志恢复协议[5][6],采用消息重排序方法,较好解决了乐观日志中由于进程故障所导致的消息接收次序丢失问题。该协议下的发送进程在发送消息时以逻辑时钟间接标记此消息的接收次序并将消息及其逻辑时钟保存在发送进程本地存储中。接收进程接收消息后,首先把消息提交给应用程序处理,然后在进程空闲时间把消息及其接收次序保存至消息日志。当消息接收进程发生故障时,恢复进程首先从消息日志取得已保存的消息及其逻辑时钟,然后根据发送进程和接收进程所保存消息数目的差值从发送进程的本地存储中取得因进程故障未能保存至消息日志的消息及其逻辑时钟。然后根据消息的 逻辑时钟对消息重新排序,并实现消息的重演。此协议下的进程在正常执行时,接收到消息后即刻提交给应用程序处理,仅仅在进程空闲时间保存消息信息至消息日志。上述特征使得此恢复协议实质上仍属于乐观消息日志协议;在此协议下进程正常执行时具有良好性能,且在某个进程发生故障时其恢复算法较为简单。The log recovery protocol [5][6] based on message reordering and message number verification uses the message reordering method to better solve the problem of losing the order of message reception caused by process failures in optimistic logs. The sending process under this protocol uses the logical clock to indirectly mark the receiving order of the message when sending the message and saves the message and its logical clock in the local storage of the sending process. After the receiving process receives the message, it first submits the message to the application program for processing, and then saves the message and its receiving order to the message log when the process is idle. When the message receiving process fails, the recovery process first obtains the saved messages and their logical clocks from the message log, and then obtains the messages that have not been saved due to process failure from the local storage of the sending process according to the difference between the number of messages saved by the sending process and the receiving process. Messages and their logical clocks that can be saved to the message log. Messages are then reordered according to their logical clock and replay of messages is realized. When the process under this protocol is running normally, it will immediately submit the message to the application for processing after receiving the message, and only save the message information to the message log when the process is idle. The above characteristics make this recovery protocol still belong to the optimistic message log protocol in essence; under this protocol, the process has good performance when the process is running normally, and its recovery algorithm is relatively simple when a process fails.

消息重排序和消息数目检验的日志恢复协议的主要缺点是在程序正常运行时,进程接收到消息后仍需在进程空闲时间把消息保存至消息日志。然而该协议下发送进程发送消息前已把消息的内容及其逻辑时钟值保存至本地存储,接收进程端的消息日志是冗余的,故该协议算法还可进一步优化。The main disadvantage of the log recovery protocol for message reordering and message number verification is that when the program is running normally, the process still needs to save the message to the message log when the process is idle after receiving the message. However, under this protocol, the sending process has saved the content of the message and its logical clock value to the local storage before sending the message, and the message log at the receiving process end is redundant, so the protocol algorithm can be further optimized.

修改后的重排序恢复协议[7]仍然基于重排序基本理论进行算法设计,但对恢复协议作了较大修改,主要是去除了消息接收进程端的消息日志。在此协议中,消息发送进程发送消息前先保存消息的内容及其逻辑时钟至消息日志,消息接收进程接收消息后,即刻提交应用程序处理,且此后不再保存任何消息信息。与基于发送者的消息日志协议(sender-based)[2][3][4]相比较,修改后的重排序恢复协议中的接收进程接收消息后无需转送消息接收次序至发送进程,从而使得进程正常执行时具有良好的性能且确保在进程故障恢复阶段系统不存在任何孤儿进程。The modified reordering recovery protocol [7] is still based on the basic theory of reordering for algorithm design, but the recovery protocol has been greatly modified, mainly removing the message log at the message receiving process end. In this protocol, the message sending process saves the content of the message and its logical clock to the message log before sending the message. After the message receiving process receives the message, it immediately submits it to the application program for processing, and thereafter no longer saves any message information. Compared with the sender-based message log protocol (sender-based)[2][3][4], the receiving process in the modified reordering recovery protocol does not need to transfer the order of message reception to the sending process after receiving the message, so that The process has good performance during normal execution and ensures that there are no orphan processes in the system during the process failure recovery phase.

修改后的重排序恢复协议[7]彻底摆脱了悲观、乐观协议的束缚,形成了一种同时具备悲观和乐观协议优点的全新容错恢复协议。此协议具有以下主要特点:The modified reordering recovery protocol [7] completely got rid of the shackles of pessimistic and optimistic protocols, and formed a new fault-tolerant recovery protocol with the advantages of both pessimistic and optimistic protocols. This protocol has the following main features:

1)进程发送消息前一次性保存消息的内容及指示消息接受次序的逻辑时钟至消息日志,然后发送消息至接收进程。此特点使得该协议具有悲观消息日志协议特征,即任一时刻系统不存在孤儿进程。1) Before the process sends the message, the content of the message and the logical clock indicating the order of receiving the message are saved to the message log at one time, and then the message is sent to the receiving process. This feature makes the protocol have the characteristics of a pessimistic message log protocol, that is, there is no orphan process in the system at any time.

2)进程接收消息后不保存任何消息信息,因而该协议下的进程在正常执行时较乐观日志协议下的进程具有更优良的性能。此特点使得该协议完全不同于悲观和乐观协议,且其系统在正常执行时的性能优于乐观协议。2) The process does not save any message information after receiving the message, so the process under this protocol has better performance than the process under the optimistic log protocol during normal execution. This feature makes the protocol completely different from the pessimistic and optimistic protocols, and the performance of its system is better than the optimistic protocol in normal execution.

3)某进程发生故障时,无故障进程有条件继续执行;因而在进程故障恢复阶段,重排序恢复协议下的进程较悲观日志协议下的进程具有更优良的故障恢复性能。3) When a process fails, the non-faulty process continues to execute conditionally; therefore, in the process recovery phase, the process under the reordering recovery protocol has better recovery performance than the process under the pessimistic log protocol.

4)任何进程可在任意时刻独立保存其进程状态(检查点)。4) Any process can independently save its process state (checkpoint) at any time.

修改后的重排序恢复协议的上述特点使得其各项性能指标均优于现有的所有消息日志协议。然而该协议仅仅给出了无故障进程运行的条件(限制一个故障进程向另一个故障进程发送消息等条件),却没有给出无故障进程有条件执行的具体实现方法,故限制了协议的应用。The above characteristics of the modified reordering recovery protocol make its performance indicators better than all existing message log protocols. However, the protocol only provides the conditions for the running of a fault-free process (conditions such as restricting a faulty process to send a message to another faulty process), but does not provide a specific implementation method for the conditional execution of a fault-free process, thus limiting the application of the protocol .

本发明的目的及所能达到的效果:Purpose of the present invention and the effect that can be achieved:

完善重排序恢复协议,提出一种便于以程序设计员语言实现的基于重排序理论的后向恢复容错方法。该方法具有类似修改后的重排序恢复协议[7]的基本特征,并具有如下独特之处:1、对无故障进程故障状态下条件运行的理论和方法进一步完善,2、增加进程故障标志位,并通过故障位控制无故障进程在系统故障恢复阶段有条件运行。Improve the reordering recovery protocol, and propose a backward recovery fault-tolerant method based on reordering theory that is easy to implement in programmer language. This method has the basic features similar to the modified reordering recovery protocol [7], and has the following unique features: 1. The theory and method of conditional operation under the fault state of the fault-free process are further improved; 2. The process fault flag is added , and control the conditional operation of the non-fault process in the system fault recovery phase through the fault bit.

消息重排序基本原理Fundamentals of message reordering

在分段确定(PWD)假设下,进程的消息接收事件具有随机性,即消息的接收在时间和次序上具有不确定性,但是消息的发送事件确是确定事件。Under the PWD assumption, the message reception event of a process is random, that is, the message reception is uncertain in time and order, but the message sending event is indeed a definite event.

如图1所示,假设分布式系统由进程p、q和r组成。其中,δp,0、δq,0和δr,0分别表示p、q和r的初始状态间隔;δq,1和δq,2分别表示进程q接收消息m1和m2后的状态间隔;tpq和trq分别表示进程p和q之间以及q和r之间的通信信道延时。在乐观消息日志协议下,若进程q在将消息m1和m2的必要信息保存至日志文件之前于“x”处发生故障。进程q发生故障后,进程p、q和r必须重新启动以重新发送和接收m1和m2。显然,进程q重放(replay)消息的次序应为m1、m2,然而由于信道延时tpq和trq不是固定常数,若tpq>trq则进程q接收消息的次序可能变为m2、m1。图1的实例说明,尽管乐观消息日志协议要求故障进程在重演时准确发送、接收未保存至日志文件的消息,但在某些情况下(例,进程信道延时变化,系统中的进程重新启动时间参差不齐等)实际进程收发消息的次序可能与故障前不一致。然而每次重复执行,系统的最终结果应该是一致的,这说明在PWD假设下系统的执行结果与某些消息的接收次序无关。As shown in Figure 1, it is assumed that the distributed system consists of processes p, q and r. Among them, δ p,0 , δ q,0 and δ r,0 represent the initial state intervals of p , q and r respectively; State interval; t pq and t rq represent the communication channel delays between processes p and q and between q and r, respectively. Under the optimistic message logging protocol, if process q fails at " x " before saving the necessary information for messages m1 and m2 to the log file. After process q fails, processes p, q, and r must be restarted to resend and receive m 1 and m 2 . Obviously, the order of process q's replay (replay) messages should be m 1 , m 2 , but because the channel delay t pq and t rq are not fixed constants, if t pq >t rq , the order of process q receiving messages may become m 2 , m 1 . The example in Figure 1 shows that although the optimistic message log protocol requires that the faulty process accurately send and receive messages that are not saved to the log file when it is replayed, in some cases (for example, the process channel delay changes, the process in the system is restarted) The time is uneven, etc.) The order in which the actual process sends and receives messages may be inconsistent with that before the failure. However, the final result of the system should be consistent every time it is repeated, which means that under the assumption of PWD, the execution result of the system has nothing to do with the order in which certain messages are received.

总是在先发生关系(always happen before)关系:Always happens before relationship (always happens before) relationship:

假设进程的信道为FIFO可靠信道,ei和ej分别表示消息mi和mj的发送或接收事件。如果在系统任何一次执行中ei的发生总是先于ej的发生而与信道的延时、cpu的速度等因素无关,则称ei总是在先发生于ej,记为: Assuming that the channel of the process is a FIFO reliable channel, e i and e j represent the sending or receiving events of messages m i and m j respectively. If the occurrence of ei is always prior to the occurrence of e j in any execution of the system and has nothing to do with factors such as channel delay and CPU speed, then it is said that e i always occurs prior to e j , which is recorded as:

在图2中,R(m1)、R(m2)、R(m3)和R(m4)分别表示消息m1、m2、m3和m4的接收事件,S(m1)、S(m2)、S(m3)和S(m4)分别表示消息m1、m2、m3和m4的发送事件。在分段确定(PWD)假设下,由于q接收m1后必然发送m2,r接收m2后必然发送m3,即 因此R(m1)总是在先发生于R(m3)表明R(m3)逻辑上依赖于R(m1),R(m3)和R(m1)之间的关系是一种逻辑依赖 关系与系统的其他因素无关。由于消息的接收事件是确定性事件,因此在FIFO信道的假设下, In Fig. 2, R(m 1 ), R(m 2 ), R(m 3 ) and R(m 4 ) denote the reception events of messages m 1 , m 2 , m 3 and m 4 respectively, and S(m 1 ), S(m 2 ), S(m 3 ) and S(m 4 ) denote sending events of messages m 1 , m 2 , m 3 and m 4 , respectively. Under the PWD assumption, since q must send m 2 after receiving m 1 , r must send m 3 after receiving m 2 , namely therefore R(m 1 ) always precedes R(m 3 ), indicating that R(m 3 ) is logically dependent on R(m 1 ), and the relationship between R(m 3 ) and R(m 1 ) is a Logical dependencies are independent of other factors of the system. Since the receive event of a message is a deterministic event, the Under the assumption of FIFO channels,

图2中由于m3和m4的发送经过不同的信道延时到达进程q,因此R(m3)不一定总是在先发生于R(m4)。若事件ei的发生不一定总是先于事件ej的发生,而是与信道延时、cpu速度等因素有关,则称ei不总是在先发生于ej,记为图2中, 因此进程q实际的消息接收序列为m1、m3、m4或m1、m4、m3。In Fig. 2, since the transmissions of m 3 and m 4 arrive at process q through different channel delays, R(m 3 ) may not always occur prior to R(m 4 ). If the occurrence of event e i is not necessarily always prior to the occurrence of event e j , but is related to factors such as channel delay and cpu speed, it is said that e i does not always occur before ej, denoted as In Figure 2, Therefore, the actual message receiving sequence of process q is m1, m3, m4 or m1, m4, m3.

消息等效序列定理:Message Equivalent Sequence Theorem:

假设S是进程p的一消息序列,S’是一将S中的消息重新排列后形成的新序列。S’中的元素满足:1、所有存在于S中的消息在S’中仍然存在;2、若某些消息的接收事件在S中具有总是在先发生关系,则这种关系在S’中仍然保持。在进程信道FIFO和可靠信道假设下,S和S’在进程p重放过程中是等效序列(进程分别接收两个序列后所完成的计算相同)。Suppose S is a message sequence of process p, and S' is a new sequence formed by rearranging the messages in S. The elements in S' satisfy: 1. All messages that exist in S still exist in S'; 2. If the receiving events of some messages have a relationship that always occurs first in S, then this relationship is in S' is still maintained. Under the assumption of process channel FIFO and reliable channel, S and S' are equivalent sequences during the replay process of process p (the calculations completed by the processes after receiving the two sequences respectively are the same).

证明:在定理假设条件下,尽管S中的某些消息在S’中进行了重新排序,但是这些消息间的总是在先发生关系在S’中维持不变;因此S’中消息的接收次序是进程重演(replay)中可能出现的实际次序。若S和S’在进程p重放过程中不是等效序列,则S中的消息接收事件在进程p的执行与S’中的消息接收事件在进程p的执行不等效,即同一个进程的每次执行是不一致的,这与进程执行的一致性属性相矛盾。Proof: Under the assumption of the theorem, although some messages in S are reordered in S', the always-before relationship between these messages remains unchanged in S'; therefore, the reception of messages in S' The order is the actual order in which the process might occur in a replay. If S and S' are not equivalent sequences during the replay process of process p, then the execution of the message receiving event in S in process p is not equivalent to the execution of the message receiving event in S' in process p, that is, the same process Each execution of is inconsistent, which contradicts the consistency property of process execution.

例在图2中,m1、m3和m4与m1、m4和m3是等效序列,即进程p重演(replay)m1、m3和m4与重演m1、m4和m3后进程所完成的计算是相同的。For example, in Figure 2, m1, m3 and m4 are equivalent sequences to m1, m4 and m3, that is, the calculations completed by the process p replaying m1, m3 and m4 and replaying m1, m4 and m3 are the same .

改善的逻辑时钟:Improved Logic Clock:

进程p改善的逻辑时钟LCp是一个整型变量,用于对消息的发送事件和接收事件计数。LCp满足:The improved logical clock LCp of process p is an integer variable used to count the send event and receive event of the message. LCp satisfies:

1、其初值是零;1. Its initial value is zero;

2、每发送一个消息,LCp加一;2. Every time a message is sent, LCp is incremented by one;

3、每接收到进程q一个消息,LCp←max(LCp+1,LCq+1),其中LCp和LCq分别表示进程p3. Every time a message from process q is received, LCp←max(LCp+1,LCq+1), where LCp and LCq represent process p respectively

和q的逻辑时钟,max表示取LCp+1和LCq+1中的最大值。and the logic clock of q, max means to take the maximum value among LCp+1 and LCq+1.

如图3所示,p发送m1后LCp=1,发送m4后LCp=2。q接收且保存m1后,LCq=2,发送m2后LCq=3,接收且保存m3后,LCq=6,接收且保存m4后,LCq=7。As shown in FIG. 3 , LCp=1 after p sends m 1 , and LCp=2 after sending m 4 . q After receiving and storing m 1 , LCq=2, after sending m 2 , LCq=3, after receiving and storing m 3 , LCq=6, after receiving and storing m 4 , LCq=7.

为便于表达,将改善的逻辑时钟简记为逻辑时钟。For ease of expression, the improved logical clock is abbreviated as logical clock.

消息重排序逻辑时钟定理:Message reordering logic clock theorem:

若分段确定假设PWD成立且则LCp(S(mi))<LCq(S(mj))。其中,R(mi)和R(mj)分别表示进程k的消息mi和mj的接收事件.LCp(S(mi))表示进程p发送消息mi后的逻辑时钟,LCq(S(mj))表示进程q发送消息mj后的逻辑时钟.If the segmentation is determined assuming that PWD holds and Then LCp(S(mi))<LCq(S(mj)). Among them, R(mi) and R(mj) represent the receiving events of messages mi and mj of process k respectively. LCp(S(mi)) represents the logical clock after process p sends message mi, and LCq(S(mj)) represents The logical clock after process q sends message mj.

证明:由于R(mj)逻辑依赖于R(mi),S(mj)是确定事件,因此 否则,假设R(mi)不总是在先发生于S(mj),这意味着R(mi)和S(mj)可以发生在任意次序,或者R(mi)在先发生于S(mj),或者S(mj)在先发生于R(mi)。若S(mj)在先发生于R(mi),由于S(mj)总是在先发生于R(mj),因此S(mj)只能间接地在先发生于R(mi)。如图4所示,必然至少存在一个消息mk位于mj和mi之间,使得S(mj)→S(mk),S(mk)→R(mk),R(mk)→S(mi),S(mi)→R(mi),其中“→”表示在先发生关系。在这种情况下,mj和mi必然经过不同的传输信道到达进程k,mj和mi的传输可能具有不同的信道延时,故R(mi)不可能总是在先发生于R(mj),这与定理的假设矛盾,因此根据改善逻辑时钟的定义,LCp(R(mi))必然小于LCq(S(mj)),即LCp(R(mi)<LCq(S(mj))。又因为所以LCp(S(mi))<LCp(R(mi))。由此可得LCp(S(mi))<LCp(R(mi))<LCq(S(mj)),LCp(S(mi))<LCq(S(mj))。Proof: due to R(mj) logic depends on R(mi), S(mj) is a definite event, so Otherwise, assume that R(mi) does not always happen-before S(mj), which means that R(mi) and S(mj) can happen in any order, or that R(mi) happens-before S(mj) , or S(mj) happens-before R(mi). If S(mj) precedes R(mi), since S(mj) always precedes R(mj), S(mj) can only indirectly precede R(mi). As shown in Figure 4, there must be at least one message mk located between mj and mi, so that S(mj)→S(mk), S(mk)→R(mk), R(mk)→S(mi), S(mi)→R(mi), where “→” indicates the occurrence-before relationship. In this case, mj and mi must arrive at process k through different transmission channels, and the transmission of mj and mi may have different channel delays, so R(mi) cannot always happen before R(mj), This contradicts the assumption of the theorem, so According to the definition of improved logic clock, LCp(R(mi)) must be smaller than LCq(S(mj)), that is, LCp(R(mi)<LCq(S(mj)). And because So LCp(S(mi))<LCp(R(mi)). From this, LCp(S(mi))<LCp(R(mi))<LCq(S(mj)), LCp(S(mi))<LCq(S(mj)) can be obtained.

例,在图3中,LCp(S(m1))=1,LCr(S(m3))=5,LCp(S(m1))<LCr(S(m3))。For example, in Figure 3, LCp(S(m1))=1, LCr(S(m3))=5, LCp(S(m1))<LCr(S(m3)).

上述定理表明,任一进程的消息接收序列中的消息接收次序可由消息发送进程的逻辑时钟确定。若消息的发送进程在发送消息的同时把与此消息相关的逻辑时钟及消息的内容保存至稳固存储或消息日志;则消息的接收进程发生故障后,恢复进程可把预先保存的消息按其逻辑时钟重新排序并由此得到所需重演消息序列。例,在图3中,p进程发送m1后保存<m1,LCp=1>,p进程发送m4后保存<m4,LCp=2>;r进程发送m3后保存<m3,LCr=5>;若进程p在”X”处发生故障,可根据二元组<m1,LCp=1>、<m3,LCr=5>和<m4,LCp=2>中的逻辑时钟将进程p重演消息序列重新排序为m1、m4和m3。在图3中,尽管进程q在故障恢复阶段所接收的消息序列是:m1、m4和m3,与原消息接收序列m1、m3和m4不同;但是消息序列m1、m4和m3确是系统运行时进程q的实际消息接收序列之一。由于一个系统所实现的功能(除随机系统外)是确定的,因此图3中q接收上述两种不同消息序列后所完成的计算是相同的。The above theorem shows that the order of receiving messages in the message receiving sequence of any process can be determined by the logical clock of the message sending process. If the sending process of the message saves the logical clock related to the message and the content of the message to the stable storage or the message log while sending the message; after the receiving process of the message fails, the recovery process can save the pre-saved message according to its logic The clocks are reordered and thus the desired sequence of replayed messages is obtained. For example, in Fig. 3, p process saves <m1, LCp=1> after sending m1, saves <m4, LCp=2> after p process sends m4; r saves after sending m3 <m3, LCr=5>; if Process p fails at "X", process p replays the sequence of messages according to the logical clocks in the tuples <m1,LCp=1>, <m3,LCr=5> and <m4,LCp=2> for m1, m4 and m3. In Fig. 3, although the message sequence received by process q in the fault recovery phase is: m1, m4 and m3, which is different from the original message receiving sequence m1, m3 and m4; but the message sequence m1, m4 and m3 are indeed system runtime One of the actual message reception sequences for process q. Since the functions realized by a system (except the random system) are deterministic, the calculations completed by q in Fig. 3 after receiving the above two different message sequences are the same.

恢复阶段无故障进程继续执行条件Conditions for continuation of process execution without failure during the recovery phase

在重排序理论推出之前,在进程恢复阶段任何协议下的无故障进程或回退以消除系统中 存在的孤儿进程(乐观消息日志协议),或停止于其当前状态以等待故障进程的恢复(悲观消息日志)。Before the introduction of reordering theory, in the process recovery phase, the fault-free process under any protocol either fell back to eliminate the orphan process existing in the system (optimistic message log protocol), or stopped in its current state to wait for the recovery of the faulty process (pessimistic message log).

例在图5中,设p、q和r分别表示系统中的进程;Cx,y表示进程x的第y个检查点,x=p,q,r;y=0,1;进程q和r分别在“x”处发生故障。在悲观日志协议下,系统的最大可恢复状态[1]如图虚线所示。在系统的恢复阶段,进程q需退至检查点Cq,0并重演消息m2、m4和m6,进程r需退至检查点Cr,1并重演消息m3和m5,进程p停止于其当前状态等待其他进程从其故障状态恢复。Example In Fig. 5, let p, q and r denote processes in the system respectively; C x, y denote the yth checkpoint of process x, x=p, q, r; y=0, 1; process q and r fails at 'x' respectively. Under the pessimistic logging protocol, the maximum recoverable state of the system [1] is shown by the dotted line in the figure. In the recovery phase of the system, process q needs to retreat to checkpoint C q,0 and replay messages m2, m4 and m6, process r needs to retreat to checkpoint C r,1 and replay messages m3 and m5, process p stops at its current The state waits for other processes to recover from their failed state.

再如在图6中,设q、p和r分别表示系统中的进程;Cx,y表示进程x的第y个检查点,x=p,q,r;y=0,1;进程r在接收m5但保存m5至消息日志前于“x”处发生故障。在乐观日志协议下,系统的最大恢复状态如图虚线所示。在系统恢复阶段,进程r须回退至检查点Cr,1,由于进程q和p的状态依赖于或间接依赖于消息m5的接收事件,因此为避免孤儿进程的出现进程q和p须分别回退至其检查点Cq,0和Cp,1As another example in Fig. 6, let q, p and r represent processes in the system respectively; C x, y represent the yth checkpoint of process x, x=p, q, r; y=0, 1; process r Failure at 'x' before receiving m5 but saving m5 to message log. Under the optimistic logging protocol, the maximum recovery state of the system is shown by the dotted line in the figure. In the system recovery phase, the process r must fall back to the checkpoint C r,1 . Since the states of the processes q and p depend on or indirectly depend on the receiving event of the message m5, in order to avoid the occurrence of orphan processes, the processes q and p must respectively Fall back to its checkpoints C q,0 and C p,1 .

在传统的消息日志协议中进程的执行可分为两个阶段[7],正常执行阶段和故障恢复阶段。在正常执行阶段,进程接收消息后或先保存消息信息至消息日志然后提交应用程序处理,或直接提交应用程序处理。在故障恢复阶段,有故障进程只能接收故障恢复进程所发送的重演消息[7](replay message),否则故障进程可能重复接收同一消息;对于故障进程所发送的消息,因为此类消息已被其他进程接收处理,所以此类消息必须被屏蔽,或者由接收进程接收后舍弃之,或者由通信系统禁止此类消息的转送[7]。The execution of a process in a traditional message log protocol can be divided into two phases [7], the normal execution phase and the failure recovery phase. In the normal execution stage, after the process receives the message, it either saves the message information to the message log and then submits it to the application program for processing, or directly submits it to the application program for processing. In the fault recovery phase, the faulty process can only receive the replay message [7] (replay message) sent by the faulty recovery process, otherwise the faulty process may receive the same message repeatedly; for the messages sent by the faulty process, because such messages have been Other processes receive and process, so such messages must be shielded, or discarded after being received by the receiving process, or the communication system prohibits the transfer of such messages [7].

一个系统的最大可恢复状态(maximum recoverable state)是指在出现进程故障时系统所能恢复到的状态,更确切地,是指在进程恢复阶段系统所能达到的最大全局一致性状态。The maximum recoverable state of a system refers to the state that the system can recover to when a process failure occurs, more precisely, it refers to the maximum globally consistent state that the system can achieve during the process recovery phase.

在传统的悲观消息日志协议中,在故障恢复阶段由于系统最终必须恢复至一个最大全局一致状态,因此在此阶段无故障进程必须暂停于其最大恢复状态,以确保恢复阶段结束后系统从一个一致的全局状态继续执行。否则,若无故障进程在故障恢复阶段继续执行尤其是此类进程发送和接收新的消息后其所处的状态有可能与故障进程被恢复后的状态不一致。例,在图5中,假设进程q和r发生故障后,若进程p继续执行并发送消息至进程q或r,则此类消息的发送和接收势必改变了进程q或r的消息接收序列,从而使得系统处于一个错误的全局状态。In the traditional pessimistic message log protocol, because the system must eventually recover to a maximum globally consistent state during the recovery phase, the fault-free process must be suspended at its maximum recovery state during the recovery phase to ensure that the system recovers from a consistent state after the recovery phase. The global state of the execution continues. Otherwise, if the non-faulty process continues to execute during the fault recovery phase, especially after the process sends and receives new messages, the state it is in may be inconsistent with the state after the faulty process is recovered. For example, in Figure 5, assuming that processes q and r fail, if process p continues to execute and sends messages to process q or r, the sending and receiving of such messages will inevitably change the message receiving sequence of process q or r, This leaves the system in a wrong global state.

然而在某些条件下,无故障进程在故障恢复阶段的继续执行并不影响故障进程的恢复和系统一致性全局状态的最终达成。However, under certain conditions, the continuation of the fault-free process in the fault recovery phase does not affect the recovery of the faulty process and the final achievement of the system's consistent global state.

例在图7中,设p、q、r和s分别表示系统中的进程;Cx,y表示进程x的第y个检查点,x=p,q,r,s;y=0,1;进程q和r分别在“x”处发生故障。在悲观日志协议下,进程q和r发生故障后进程p和s停止执行以等待故障进程的恢复,系统的最大可恢复状态如图maximumrecoverable state所示。假设在进程q和r的故障恢复阶段,若进程p继续执行并发送消息m8至进程s,且进程q和r恢复至最大可恢复状态maximum recoverable state后不再发送任何消息,则此后系统达到一个新的全局状态,如图New maximum recoverable state所示。由于在New maximum recoverable state状态中不包含孤儿进程,因此该状态是一个全局一致性状态。然而,若故障进程被恢复至最大恢复状态maximum recoverable state后发送消息至其他进程,则无故障进程的继续执行所导致的消息发送可能产生无故障所发消息和有故障进程被恢复后所发消息的接收次序不一致问题。在图8中,在故障恢复阶段,若p进程暂停等待故障进程恢复,则系统恢复后进程s或先接收消息m9然后接收m8;若p进程继续执行并发送m8,故障进程恢复后进程r发送m9,则进程s先接收m8后接收m9(如图中虚线所示)。显然,进程s接收消息m8和m9的次序在上述情况下可能出现不一致;然而这种不一致是系统实际执行时的一种客观存在。事实上,消息m8和m9的传送经过了不同进程信道传输,当进程信道的时延发生变化,对于进程s两种消息接收次序均为可能。作为一个确定性系统,系统所完成的计算不可能因系统参数的变化而变化,故按两种不同的次序接收消息后,进程s及整个系统所完成的计算是相同的。Example In Fig. 7, let p, q, r and s represent processes in the system respectively; C x, y represents the yth checkpoint of process x, x=p, q, r, s; y=0, 1 ; Processes q and r each failed at 'x'. Under the pessimistic log protocol, after the process q and r fails, the process p and s stop executing to wait for the recovery of the failed process. The maximum recoverable state of the system is shown in the figure maximum recoverable state. Assume that in the recovery phase of process q and r, if process p continues to execute and sends message m8 to process s, and process q and r recovers to the maximum recoverable state and no longer sends any messages, then the system reaches a The new global state, as shown in Figure New maximum recoverable state. Since no orphan processes are included in the New maximum recoverable state state, this state is a globally consistent state. However, if the faulty process is restored to the maximum recoverable state and then sends messages to other processes, the message sent by the continued execution of the non-faulty process may produce messages sent by the faulty process and messages sent by the faulty process after recovery Inconsistencies in the order of receipt. In Figure 8, in the fault recovery phase, if process p is suspended and waits for the faulty process to recover, process s will receive message m9 and then m8 after system recovery; if process p continues to execute and send m8, process r will send message m8 after the faulty process recovers. m9, the process s first receives m8 and then m9 (as shown by the dotted line in the figure). Obviously, the order in which process s receives messages m8 and m9 may be inconsistent under the above circumstances; however, this inconsistency is an objective existence in the actual execution of the system. In fact, messages m8 and m9 are transmitted through different process channels. When the delay of the process channel changes, both messages are possible for process s. As a deterministic system, the calculations completed by the system cannot be changed due to changes in system parameters, so after receiving messages in two different orders, the calculations completed by the process s and the entire system are the same.

综上所述,得出以下结论:In summary, the following conclusions are drawn:

假设分布式系统是松耦合系统,即进程间无共享内存,正常运行时任何进程所发送消息的信息,包括消息内容及消息接收次序,均已保存至稳固存储。Assume that the distributed system is a loosely coupled system, that is, there is no shared memory between processes, and the information of messages sent by any process during normal operation, including message content and message receiving order, have been saved to stable storage.

无故障进程条件执行定理:Faultless process conditional execution theorem:

在进程故障恢复阶段,设无故障进程继续执行;若满足以下条件,则恢复阶段结束后系统处于一个一致的全局状态。In the process failure recovery phase, assume that no faulty process continues to execute; if the following conditions are met, the system is in a consistent global state after the recovery phase.

1、禁止故障进程所发送消息发送。1. Disable the sending of messages sent by the faulty process.

2、禁止其他进程在故障进程的恢复阶段向其发送消息,即故障进程仅接收故障恢复进程所发的重演消息。2. Forbid other processes to send messages to it during the recovery phase of the faulty process, that is, the faulty process only receives replay messages sent by the faulty recovery process.

3、无故障进程继续执行并发送和接收消息,但是当向故障进程发送消息时,须停止等待直到故障进程被恢复。3. The non-faulty process continues to execute and send and receive messages, but when sending a message to the faulty process, it must stop waiting until the faulty process is recovered.

证明:prove:

根据题设条件,由于在进程故障恢复阶段无故障进程与故障进程完全被隔离,无故障进程运行时所发送消息必然在无故障进程之间传送,且无故障进程所发送消息的数目必然大于等于此类消息被接收的数目[7](可能存在中途消息),因此在任意时刻不可能存在孤儿消息和孤儿进程。故,若无故障进程在故障恢复阶段条件执行,则故障进程被恢复后系统必然处于一个全局一致性的状态。According to the conditions of the problem, since the fault-free process and the faulty process are completely isolated during the process fault recovery phase, the messages sent by the fault-free process must be transmitted between the fault-free processes, and the number of messages sent by the fault-free process must be greater than or equal to The number of such messages received [7] (there may be interim messages), so it is impossible to have orphan messages and orphan processes at any moment. Therefore, if the fault-free process is conditionally executed in the fault recovery phase, the system must be in a globally consistent state after the faulty process is recovered.

根据上述定理,满足定理所设条件下,无故障进程在其他进程的故障恢复阶段继续运行其最终完成的计算与系统无故障情况下所完成的计算是相同的。According to the above theorem, under the conditions set by the theorem, the non-faulty process continues to run in the fault recovery stage of other processes, and its final calculation is the same as the calculation completed by the system without faults.

本发明按照定理的题设条件设计了故障恢复协议,此协议具备了前向恢复的某些特征,某些进程发生故障仅对故障进程进行恢复处理,无故障进程有条件执行,系统不停顿直至完成所需计算。The present invention designs a fault recovery protocol according to the theorem's problem setting conditions. This protocol has some characteristics of forward recovery. When some processes fail, only the faulty process is recovered. No faulty processes are conditionally executed, and the system does not stop until Complete the required calculations.

发明内容Contents of the invention

本发明的目的是针对悲观消息日志中消息接收进程保存消息接收次序所导致的进程正常执行时的性能下降和故障恢复阶段无故障进程停止等待所造成的系统运行效率低下问题,达到提高进程正常执行阶段和故障恢复阶段系统性能的目的。该发明以逻辑时钟间接标识消息的接收次序,进程发送消息前先把消息内容及对应逻辑时钟保存至消息日志。当消息接收进程发生故障时,恢复进程首先从消息日志取得所有自接收进程保存检查点以来所接收的消息及对应逻辑时钟,并根据消息的对应逻辑时钟对消息重新排序,最后把排序后的消息重新发送至故障进程。本发明下的恢复容错协议允许多个进程发生故障;在故障恢复阶段,无故障进程条件执行,故障进程独自被恢复,从而使得协议具有前向恢复的特征。The purpose of the present invention is to solve the problem of low system operation efficiency caused by the process of normal execution caused by the message receiving process saving the message receiving order in the pessimistic message log and the problem of low system operation efficiency caused by the non-faulty process stopping and waiting in the fault recovery stage, so as to improve the normal execution of the process phase and failure recovery phase for the purpose of system performance. The invention uses logical clocks to indirectly mark the receiving order of messages, and the process saves the message content and corresponding logical clocks to the message log before sending messages. When the message receiving process fails, the recovery process first obtains all the messages and corresponding logical clocks received since the receiving process saved the checkpoint from the message log, and reorders the messages according to the corresponding logical clocks of the messages, and finally sorts the sorted messages Resend to the failed process. The recovery fault-tolerant protocol of the present invention allows multiple processes to fail; in the fault recovery phase, no faulty process is executed conditionally, and the faulty process is recovered independently, so that the protocol has the feature of forward recovery.

为实现上述目的,本发明采用如下技术方案:To achieve the above object, the present invention adopts the following technical solutions:

分布式系统由普通进程Pi,i=1,2…n,状态控制进程Control i,i=1,2…n,Recover-manager故障恢复管理进程和消息日志系统组成,如图9所示。The distributed system consists of ordinary process Pi, i=1, 2...n, state control process Control i, i=1, 2...n, Recover-manager fault recovery management process and message log system, as shown in Figure 9.

系统分别定义了如下数据结构:The system defines the following data structures respectively:

1、向量Ti,Ti=[Ti1,Ti1,…Tin],其中Tik表示进程Pi所接收Pk所发送的消息数目。为了便于在有故障和无故障情况下对Ti向量的访问,Control-i进程和普通进程Pi定义和使用同一个Ti向量,此向量存放在两个进程的共享内存空间(或本机硬盘)。对于普通进程,Ti向量只写;对于Control-i进程,Ti向量只读。1. Vector Ti, Ti=[T i1 , T i1 , . . . T in ], where T ik represents the number of messages received by process Pi and sent by Pk. In order to facilitate the access to the Ti vector under faulty and non-faulty conditions, the Control-i process and the ordinary process Pi define and use the same Ti vector, which is stored in the shared memory space of the two processes (or the local hard disk). For ordinary processes, the Ti vector is write-only; for Control-i processes, the Ti vector is read-only.

2、故障向量F,F=[F1,F2,…Fn],Fk表示进程Pk的故障状态,Fk=1表示Pk存在故障,Fk=0表示Pk无故障。与上述类似理由,Control-i进程和普通进程Pi使用同一个F向量并将其保存在Control-i进程和普通进程Pi的共享内存空间(或 本机硬盘)。对于普通进程,F向量只读;对于Control-i进程,F向量只写。2. Fault vector F, F = [F 1 , F 2 , . Similar to the above reasons, the Control-i process and the common process Pi use the same F vector and save it in the shared memory space (or the local hard disk) of the Control-i process and the common process Pi. For normal processes, the F vector is read-only; for Control-i processes, the F vector is write-only.

3、改善的逻辑时钟,LCi(简称为逻辑时钟变量,其定义如前所述),此变量保存在普通进程Pi的内存空间。3. The improved logical clock, LCi (abbreviated as the logical clock variable, whose definition is as mentioned above), this variable is stored in the memory space of the common process Pi.

系统各部件工作原理:The working principle of each component of the system:

1、Recover-manager进程负责系统中普通进程的故障检测和故障恢复,主要由主线程和故障恢复线程组成。主线程定期通过可靠信道(例tcp信道)向进程Pi,i=1,…n,发送心跳信息,并接收Pi的应答消息。1. The Recover-manager process is responsible for fault detection and fault recovery of common processes in the system, and is mainly composed of the main thread and the fault recovery thread. The main thread periodically sends heartbeat information to the process Pi, i=1,...n, through a reliable channel (eg tcp channel), and receives Pi's response message.

若主线程未收到Pi进程的应答消息,说明Pi已发生故障(设Pi所在机器cpu满足fail-stop假设[8]),则通过状态控制进程Control-i杀死Pi进程,从Pi进程的永久检查点重启Pi。然后通过Recover-manager的故障恢复线程对Pi进行恢复操作。故障恢复线程首先从消息日志Mlog获取重演消息,把重演消息重排序,最后把已排序的消息发送至Pi进程。If the main thread does not receive the response message from the Pi process, it means that Pi has failed (assuming that the cpu of the machine where Pi is located satisfies the fail-stop assumption [8]), then kill the Pi process through the state control process Control-i, and start from the Pi process. A permanent checkpoint restarts the Pi. Then restore the Pi through the fault recovery thread of Recover-manager. The fault recovery thread first obtains the replay messages from the message log Mlog, reorders the replay messages, and finally sends the sorted messages to the Pi process.

若主线程收到Pi的应答消息且预定时间已到,则发送控制消息至状态控制进程Control-i,Control-i进程收到控制消息后保存Pi的临时检查点并将其持久化(保存为永久检查点)。If the main thread receives the response message from Pi and the scheduled time has come, it sends a control message to the state control process Control-i, and the Control-i process saves the temporary checkpoint of Pi after receiving the control message and persists it (saved as permanent checkpoint).

2、状态控制进程Control-i,i=1,…n,与进程Pi部署于同一台机器中,负责对Pi的进程状态进行控制,包括:保存Pi的当前状态为临时检查点,把Pi的临时检查点转换为永久检查点,杀死Pi进程,从永久检查点重启Pi进程。2. The state control process Control-i, i=1,...n, is deployed in the same machine as the process Pi, and is responsible for controlling the process state of Pi, including: saving the current state of Pi as a temporary checkpoint, and saving the current state of Pi as a temporary checkpoint. Transition from temporary checkpoint to permanent checkpoint, kill Pi process, restart Pi process from permanent checkpoint.

3、普通进程Pi负责系统中应用消息的传输,并提交应用消息至系统的应用程序。3. The ordinary process Pi is responsible for the transmission of application messages in the system, and submits the application messages to the application programs of the system.

4、为减少系统通信的瓶颈,消息日志系统中的日志文件采用分布式部署方式,即一个系统部署多个消息日志文件Mlog;每个Mlog文件负责不同进程所发消息信息的存储。每个Mlog由若干个记录组成,是一个保存进程所发消息信息的顺序文件。每个记录是一个四元组:<i,j,LCi,m>,其中i和j分别表示发送和接收进程的进程标识,LCi表示进程Pi发送消息m后的逻辑时钟,m表示消息的内容。每个日志文件由一个守护进程维护,多个日志守护进程共同完成Mlog中消息信息的存取和Uij的计算等操作。Uij是一个变量,表示Pi进程所发送Pj进程所接收的消息数目。Uij的初值是零;守护进程每接收一个消息的信息<i,j,LCi,m>,Uij的值加一。日志系统的所有守护进程共同维护了共计(n-1)×(n-1)个Uij变量,其中i=1,2…n,j=1,2…n,i≠j,n表示系统中普通进程的数目。所有的Uij变量构成了一个系统的U矩阵[7],U矩阵记录了系统中任何一个进程发送至其它进程的消息数目。4. In order to reduce the bottleneck of system communication, the log files in the message log system adopt a distributed deployment method, that is, one system deploys multiple message log files Mlog; each Mlog file is responsible for the storage of message information sent by different processes. Each Mlog consists of several records, and is a sequential file that saves the message information sent by the process. Each record is a four-tuple: <i, j, LCi, m>, where i and j represent the process IDs of the sending and receiving processes respectively, LCi represents the logical clock after the process Pi sends message m, and m represents the content of the message . Each log file is maintained by a daemon process, and multiple log daemons jointly complete operations such as accessing message information in Mlog and calculating U ij . U ij is a variable indicating the number of messages sent by the Pi process and received by the Pj process. The initial value of U ij is zero; every time the daemon process receives a message <i, j, LCi, m>, the value of U ij is increased by one. All the daemon processes of the log system maintain a total of (n-1)×(n-1) U ij variables, where i=1,2...n,j=1,2...n, i≠j,n means the system The number of common processes in . All U ij variables constitute a system U matrix [7], and the U matrix records the number of messages sent by any process in the system to other processes.

系统正常运行时,进程Pi和Pj间的消息传送如图10所示。进程Pi发送消息m前,首先通过日志守护进程将<i,j,LCi,m>存入Mlog,然后Pi发送消息<i,j,LCi,m>至接收进程Pj,Pj接收m后即刻提交给应用程序APP处理。When the system is running normally, the message transmission between processes Pi and Pj is shown in Figure 10. Before process Pi sends message m, it first stores <i, j, LCi, m> into Mlog through the log daemon, then Pi sends message <i, j, LCi, m> to receiving process Pj, and Pj submits it immediately after receiving m Give the application APP processing.

当某进程Pi发生故障时,如图11所示,Pi的恢复过程如下所示。When a process Pi fails, as shown in Figure 11, the recovery process of Pi is as follows.

1、Recover-manager发送Pi故障消息(Fi=1)至Control-i进程;Control-i进程收到消息后杀死Pi并从Pi的永久检查点重启Pi。1. Recover-manager sends Pi failure message (Fi=1) to Control-i process; Control-i process kills Pi after receiving the message and restarts Pi from Pi's permanent checkpoint.

2、Control-i进程发送Ti向量(如前所述)至Recover-manager。由于Ti向量分配在Control-i进程和普通进程Pi的共享内存空间,因此Pi从永久检查点重启后,Ti向量被恢复至保存检查点前的值,此值为保存永久检查点前Pi进程所接收、其他进程所发送消息的数目。2. The Control-i process sends the Ti vector (as described above) to the Recover-manager. Since the Ti vector is allocated in the shared memory space of the Control-i process and the ordinary process Pi, after Pi restarts from the permanent checkpoint, the Ti vector is restored to the value before saving the checkpoint, which is the value of the Pi process before saving the permanent checkpoint The number of messages received, sent by other processes.

3、Recover-manager访问日志守护进程取得向量{U1i,U2i…Uni}。计算Pk进程(k=1,2…n,k≠i)发送至Pi进程的消息数目Uki-Tik,并通过相应日志守护进程按从后至前的顺序从Mlog取得Uki-Tik个Pk发送至Pi进程消息(重演消息),k=1,2,…n,k≠i。3. The Recover-manager access log daemon obtains the vector {U 1i , U 2i ... U ni }. Calculate the number of messages U ki -T ik sent by the Pk process (k=1,2...n,k≠i) to the Pi process, and obtain U ki -T ik from Mlog through the corresponding log daemon process in order from back to front Pk are sent to Pi process messages (replay messages), k=1, 2,...n, k≠i.

4、把所有重演消息按其逻辑时钟LCi重新排序,依次发送排序后的消息至Pi进程。4. Reorder all the replay messages according to their logical clock LCi, and send the sorted messages to the Pi process in turn.

详细处理流程:Detailed processing flow:

一种具有前向恢复特征的后向恢复容错方法,它的步骤为:A backward recovery fault-tolerant method with forward recovery features, its steps are:

普通进程Pi消息接收线程及发送和接收函数运行流程如图12所示。The normal process Pi message receiving thread and the running flow of sending and receiving functions are shown in Figure 12.

普通进程Pi接收线程操作流程如下所示:The normal process Pi receiving thread operation flow is as follows:

1、对于k为整型变量,k=1,2…n,初始化Tik和LCi为0,Tik表示进程Pi所接收进程Pk的消息数目,LCi表示进程Pi的逻辑时钟变量。1. For k is an integer variable, k=1, 2...n, initialize Tik and LCi to 0, Tik indicates the number of messages received by process Pk by process Pi, and LCi indicates the logical clock variable of process Pi.

2、若从Pj接收消息,则转入3;否则转入4。2. If a message is received from Pj, go to 3; otherwise, go to 4.

3、调用ReadM(j,i,LCj,m)函数从系统消息接收缓冲区读取消息,其中j和i分别表示消息的发送和接收进程,LCj表示消息m对应的逻辑时钟。3. Call the ReadM(j,i,LCj,m) function to read the message from the system message receiving buffer, where j and i represent the sending and receiving process of the message respectively, and LCj represents the logical clock corresponding to the message m.

4、把步骤3所读消息提交应用程序处理,然后转入5。4. Submit the message read in step 3 to the application program for processing, and then turn to step 5.

5、若接收到Recover-manager主线程发送的心跳消息,转入6;否则转入2。5. If the heartbeat message sent by the main thread of Recover-manager is received, go to 6; otherwise, go to 2.

6、发送应答消息至Recover-manager主线程,然后转入2。6. Send a response message to the main thread of Recover-manager, and then go to 2.

假设系统中的计算机满足fail-stop故障停机模型,由于Pi出现故障,Pi不可能接收到Recover-manager主线程所发心跳消息,更不可能发送应答消息至Recover进程,因此步骤6的执行表示Pi进程未出现故障。Recover-manager主线程可根据是否收到Pi的应答判断Pi 是否处于正常运行状态。Assuming that the computer in the system satisfies the fail-stop failure downtime model, because Pi fails, it is impossible for Pi to receive the heartbeat message sent by the main thread of Recover-manager, let alone send a reply message to the Recover process, so the execution of step 6 means that Pi The process is not faulted. The main thread of Recover-manager can judge whether Pi is in a normal operating state according to whether it receives the response from Pi.

普通进程Pi发送函数SendM(i,j,LCi,m)操作流程如下所示:The operation flow of the common process Pi sending function SendM(i,j,LCi,m) is as follows:

1、若Fi=0,表明消息发送进程Pi无故障,则转入2;否则,表明Fi=1,Pi存在故障,转入4。1. If Fi=0, it indicates that the message sending process Pi has no fault, then go to 2; otherwise, it shows that Fi=1, Pi has a fault, and go to 4.

2、若Fj=0,表明消息接收进程Pj无故障,转入3;否则Fj=1,表明Pj存在故障,转入2,等待Pj从故障状态恢复。2. If Fj=0, it indicates that the message receiving process Pj is not faulty, and then go to 3; otherwise, Fj=1, it shows that Pj has a fault, and then go to 2, and wait for Pj to recover from the faulty state.

3、发送和接收进程均无故障情况下,发送进程的逻辑时钟加一:LCi←LCi+1;添加消息信息<i,j,LCi,m>至消息日志文件的末尾。其中,i和j分别表示发送和接收进程,LCi表示Pi进程的逻辑时钟变量,m表示消息的负载(payload)。转入5。3. When both the sending and receiving processes have no faults, add one to the logical clock of the sending process: LCi←LCi+1; add message information <i, j, LCi, m> to the end of the message log file. Among them, i and j represent the sending and receiving process respectively, LCi represents the logic clock variable of the Pi process, and m represents the load of the message (payload). Go to 5.

4、发送进程的逻辑时钟加一:LCi←LCi+1;结束消息发送过程。4. Add one to the logical clock of the sending process: LCi←LCi+1; end the message sending process.

5、发送应用消息AM<i,j,LCi,m>至接收进程Pj,结束消息发送过程。5. Send the application message AM<i,j,LCi,m> to the receiving process Pj, and end the message sending process.

步骤4对应于Pi进程出现故障的情况,在Pi出现故障时Pi不发送消息AM<i,j,LCi,m>,而仅更新LCi变量。Step 4 corresponds to the failure of the Pi process. When Pi fails, Pi does not send the message AM<i,j,LCi,m>, but only updates the LCi variable.

为了确保进程在故障状态下不发送任何消息,在发送函数中使用了故障标志位Fi和Fj表示发送和接收进程的故障状态,若故障位为1表示进程有故障,否则表示进程无故障。为实现上述故障识别机制,每个普通进程Pk,k=1,2,…n,均维护了一个故障向量F,F={F1,F2,…Fn}。进程Pk所使用的故障向量F定义在Control k进程和普通线程Pk的共享内存中,并由Control k进程维护更新。当某个进程出现故障时,Control k设置故障向量中的对应位为1,当某个进程从故障中被恢复后,Control k设置故障向量中的对应位为0。In order to ensure that the process does not send any messages in the fault state, the fault flag bits Fi and Fj are used in the sending function to indicate the fault state of the sending and receiving process. If the fault bit is 1, it means that the process is faulty, otherwise it means that the process is not faulty. In order to realize the above fault identification mechanism, each common process Pk, k=1, 2,...n, maintains a fault vector F, F={F1, F2,...Fn}. The fault vector F used by the process Pk is defined in the shared memory of the Control k process and the ordinary thread Pk, and is maintained and updated by the Control k process. When a process fails, Control k sets the corresponding bit in the fault vector to 1, and when a certain process recovers from the fault, Control k sets the corresponding bit in the fault vector to 0.

普通进程Pi接收函数ReadM(j,i,LCj,m)操作流程如下所示:The operation flow of the ordinary process Pi receiving function ReadM(j,i,LCj,m) is as follows:

1、从进程Pj接收应用消息AM<j,i,LCj,m>。1. Receive application message AM<j,i,LCj,m> from process Pj.

2、Tij变量加一:Tij←Tij+1;Tij表示Pi进程所接收Pj进程的消息数目。2. Add one to the Tij variable: Tij←Tij+1; Tij represents the number of messages received by the Pi process from the Pj process.

3、LCi变量加一:LCi←LCi+1;LCi表示Pi的逻辑时钟。3. Add one to the LCi variable: LCi←LCi+1; LCi represents the logic clock of Pi.

4、若LCi大于等于AM.LCj+1,转向结束;否则,转向5。AM.LCj表示四元组<j,i,LCj,m>中的分量LCj,即发送进程Pj的逻辑时钟。4. If LCi is greater than or equal to AM.LCj+1, turn to end; otherwise, turn to 5. AM.LCj represents the component LCj in the quaternion <j,i,LCj,m>, that is, the logical clock of the sending process Pj.

5、LCi←AM.LCj+1,转向结束。5. LCi←AM.LCj+1, the steering is over.

上述步骤3、4和5实际上是计算Pi接收消息后的逻辑时钟LCi,即把Pi和Pj的逻辑时钟的较大者加一后作为Pi的逻辑时钟LCi。The above steps 3, 4 and 5 are actually calculating the logical clock LCi of Pi after receiving the message, that is, adding one to the larger of the logical clocks of Pi and Pj as the logical clock LCi of Pi.

Recover-manager进程的主线程、消息接收线程和故障恢复线程Recover(k)运行流程如图13所示。Figure 13 shows the running process of the main thread, message receiving thread and fault recovery thread Recover(k) of the Recover-manager process.

主线程操作流程如下所示:The main thread operation flow is as follows:

1、初始化内存变量,Fi←0,i=1,2…n,num←0。其中,Fi是故障标志,用于标记进程Pi的故障状态,Fi=0表示Pi无故障,Fi=1表示Pi存在故障;num是一循环控制变量。1. Initialize memory variables, Fi←0, i=1, 2...n, num←0. Wherein, Fi is a fault flag, which is used to mark the fault state of the process Pi, Fi=0 indicates that Pi has no fault, and Fi=1 indicates that Pi has a fault; num is a loop control variable.

2、若num大于COUNT转入3,否则转入5。2. If num is greater than COUNT, transfer to 3, otherwise transfer to 5.

COUNT为一常量,可根据保存临时检查点的时间间隔设置。保存检查点的时间间隔为:COUNT×DELAY秒,例若COUNT=60,DELAY=2,则每隔120秒保存一次临时检查点。COUNT is a constant, which can be set according to the time interval for saving temporary checkpoints. The time interval for saving checkpoints is: COUNT×DELAY seconds, for example, if COUNT=60, DELAY=2, save a temporary checkpoint every 120 seconds.

3、若Fi=0,表明Pi无故障,则发送消息通知Control-i进程保存Pi临时检查点,Newck[i]←1。Newck[i]是一个标志位,Newck[i]=1表示Pi存在临时检查点。3. If Fi=0, indicating that Pi is not faulty, then send a message to notify the Control-i process to save the Pi temporary checkpoint, Newck[i]←1. Newck[i] is a flag bit, and Newck[i]=1 indicates that Pi has a temporary checkpoint.

4、num←0,对num变量重新赋0值。4. num←0, reassign 0 to the num variable.

5、num←num+1,num变量的值加一;延时DELAY秒,DELAY为一常量。5. num←num+1, the value of the num variable is increased by one; the delay is DELAY seconds, and DELAY is a constant.

6、k←0,k赋值0,k是一变量。6. k←0, k is assigned 0, and k is a variable.

7、若k大于n转入2,否则转入8;n表所示系统中所有普通进程Pi的个数。7. If k is greater than n, transfer to 2, otherwise transfer to 8; n represents the number of all common processes Pi in the system.

8、k←k+1,k变量的值加一。8. k←k+1, the value of variable k is incremented by one.

9、若Fk=0,表示Pk进程无故障,转入10;否则Fk=1,表示Pk有故障且已处于被恢复状态,转入7。9. If Fk=0, it means that the Pk process is not faulty, and then go to 10; otherwise, Fk=1, it means that Pk is faulty and has been restored, and then go to 7.

10、以可靠信道发送心跳消息至Pk。10. Send a heartbeat message to Pk through a reliable channel.

11、若接收到Pk进程对心跳消息的应答消息,说明Pk无故障,转入13;否则未接收到Pk进程对心跳消息的应答消息,说明Pk存在故障,转入12。11. If the response message of the Pk process to the heartbeat message is received, it means that Pk has no failure, and then proceed to 13; otherwise, the response message of the Pk process to the heartbeat message is not received, indicating that there is a fault in Pk, and then proceed to 12.

12、发送消息通知Control-k删除临时检查点;Newck[k]←0;启动恢复Pk的线程Recover(k);Fk赋值1,以控制在下一循环不再向Pk发送心跳消息,转入7。12. Send a message to notify Control-k to delete the temporary checkpoint; Newck[k]←0; start the thread Recover(k) to restore Pk; Fk is assigned a value of 1, so as to control the heartbeat message no longer sent to Pk in the next cycle, and turn to 7 .

13、若Newck[k]=1,说明存在Pk临时检查点,转入14;否则,说明不存在Pk临时检查点,转入7。13. If Newck[k]=1, it means that there is a Pk temporary checkpoint, and go to 14; otherwise, it means that there is no Pk temporary checkpoint, and go to 7.

14、发送消息通知Control-k进程把进程Pk的临时检查点转换成永久检查点;Newck[k]←0,Newck[k]置零,转入7。14. Send a message to notify the Control-k process to convert the temporary checkpoint of the process Pk into a permanent checkpoint; Newck[k]←0, Newck[k] is set to zero, and transfers to 7.

故障恢复Recover(k)线程操作流程如下所示:The fault recovery Recover(k) thread operation process is as follows:

1、以可靠信道发送Fk=1消息至Control-i,i=1,2,...n,通知其它进程Pk进程已发生故障。1. Send Fk=1 message to Control-i,i=1,2,...n through a reliable channel to notify other processes that Pk process has failed.

2、若接收到Control-k所发的Tk向量,转入3;否则转入2,等待接收Tk向量。2. If the Tk vector sent by Control-k is received, go to 3; otherwise, go to 2 and wait to receive the Tk vector.

注:Recover(k)线程通过步骤1发送Fi=1的消息后,故障进程Pk所对应的进程Control-k进程接收后将杀死Pk并从其永久检查点重启Pk,然后发送Tk向量至Recover(k)线程。Note: After the Recover (k) thread sends the message of Fi=1 through step 1, the process Control-k process corresponding to the faulty process Pk will kill Pk after receiving it and restart Pk from its permanent checkpoint, and then send the Tk vector to Recover (k) Threads.

3、把所接收Tk向量保存至内存。3. Save the received Tk vector to memory.

4、访问守护进程取得Uik。4. Access daemon to get Uik.

5、访问日志守护进程,按从后至前的次序从日志文件取得Uik-Tki个Pi发送至Pk的消息。5. The access log daemon process obtains Uik-Tki messages sent from Pi to Pk from the log file in the order from back to front.

6、把所有消息按LCk变量值从小到大排序。6. Sort all the messages according to the value of the LCk variable from small to large.

7、发送排序后消息至Pk,即重演消息。7. Send the sorted message to Pk, that is, replay the message.

8、以可靠信道发送Fk=0的消息至Control-i,i=1,2,…n;表示Pk进程已从故障中恢复。Fk←0,Fk置零,转向结束。8. Send a message of Fk=0 to Control-i,i=1,2,...n through a reliable channel; it indicates that the Pk process has recovered from the failure. Fk←0, Fk is set to zero, and the steering is over.

Control-i进程的运行流程如图14所示。Figure 14 shows the running process of the Control-i process.

Contril-i的操作流程如下所示:The operation process of Contril-i is as follows:

1、若接收到保存临时检查点的消息转入2,否则转入3。1. If the message of saving the temporary checkpoint is received, go to 2, otherwise go to 3.

2、保存新的临时检查点,删除旧临时检查点。2. Save the new temporary checkpoint and delete the old temporary checkpoint.

3、若接收到保存永久检查点的消息转入4,否则转入5。3. If the message of saving the permanent checkpoint is received, go to 4, otherwise go to 5.

4、删除原永久检查点,把Pi临时检查点转换为永久检查点。4. Delete the original permanent checkpoint and convert the Pi temporary checkpoint into a permanent checkpoint.

5、若未接到Fk=0的消息,转入7;否则转入6。5. If the message of Fk=0 is not received, go to 7; otherwise, go to 6.

6、Fk←0。6. Fk←0.

Fk置零,表示Pk进程已恢复。Fk is set to zero, indicating that the Pk process has resumed.

7、若未接到Fk=1的消息,转入1;否则转入8。7. If the message of Fk=1 is not received, go to 1; otherwise, go to 8.

8、Fk←1。Fk置1,表示Pk进程发生故障。8. Fk←1. Fk is set to 1, indicating that the Pk process fails.

9、若i不等于k,表示Pi进程未发生故障,转入1;否则i等于k,表明Pi进程发生故障,转入10。9. If i is not equal to k, it means that the Pi process has not failed, and turn to 1; otherwise, i is equal to k, indicating that the Pi process has failed, and turn to 10.

10、杀死故障进程Pi;然后从Pi永久检查点重启Pi。10. Kill the faulty process Pi; then restart the Pi from the Pi permanent checkpoint.

11、发送Ti向量至Recover(k),转入1。11. Send the Ti vector to Recover(k), go to 1.

注:此后,将由Recover(k)线程恢复Pk,即Pi(i=k)。Note: Thereafter, Pk, ie Pi(i=k), will be recovered by the Recover(k) thread.

恢复线程Recover(i)恢复算法正确性证明:Proof of the correctness of recovery thread Recover(i) recovery algorithm:

首先说明进程状态间隔的可恢复性原理:一个进程状态间隔是可恢复的,如果无论将来此进程出现任何故障,进程的重新执行总可以达到此间隔。Firstly, explain the recoverability principle of the process state interval: a process state interval is recoverable, if any failure occurs in the process in the future, the re-execution of the process can always reach this interval.

定理1假设无故障进程在故障进程被恢复时处于暂停状态,若一个或多个进程发生故障,则在Recovery(i)恢复进程作用下故障进程必将被恢复至发生故障前的状态。Theorem 1 assumes that the non-faulty process is suspended when the faulty process is restored, and if one or more processes fail, the faulty process will be restored to the state before the fault under the recovery process of Recovery (i).

证明:由于一个进程发送消息前先把消息的信息(决定因子)保存在消息日志文件中,因此根据上述进程状态间隔可恢复性原理此进程是可恢复的。以下分两种情况分别详细予以证明。Proof: Since a process saves the message information (determinant) in the message log file before sending a message, the process is recoverable according to the above-mentioned process state interval recoverability principle. The following two cases are respectively proved in detail.

1、假设只有一个进程Pk发生故障,k=1,2…n。在故障被检测出来后,故障进程Pk将被杀死并从其永久检查点启动。恢复线程Reconver(k)通过Contril-k进程获取Pk进程的Tk变量以及通过日志守护进程取得Uik变量后,根据差值Uik-Tki可通过日志守护进程取得所有重演消息。将这些消息按其逻辑时钟排序,依次发送至Pk,Pk进程重新接收处理消息,最终必将达到进程发生故障前的状态间隔,即Pk进程是可恢复的。1. Assume that only one process Pk fails, k=1, 2...n. After the failure is detected, the failed process Pk is killed and started from its permanent checkpoint. After the recovery thread Reconver(k) obtains the Tk variable of the Pk process through the Contril-k process and the Uik variable through the log daemon process, all replay messages can be obtained through the log daemon process according to the difference Uik-Tki. These messages are sorted according to their logical clocks, and sent to Pk in turn, and the Pk process receives and processes the messages again, and eventually it will reach the state interval before the process fails, that is, the Pk process is recoverable.

2、若多个进程发生故障,由于每个故障进程均由恢复线程单独恢复之,且禁止每个故障进程发送消息,因此多个进程发生故障时这些故障进程可被恢复线程分别恢复之。2. If multiple processes fail, since each faulty process is recovered individually by the recovery thread, and each faulty process is prohibited from sending messages, when multiple processes fail, these faulty processes can be recovered separately by the recovery thread.

定理1证明了在一个或多个进程发生故障时,每个进程都可被恢复至传统的最大可恢复状态,但是上述恢复协议允许无故障进程在故障恢复期间有条件继续执行,故障进程被恢复后的状态与无故障进程的状态的一致性将最终决定以上恢复方法的正确性。Theorem 1 proves that when one or more processes fail, each process can be restored to the traditional maximum recoverable state, but the above recovery protocol allows the non-faulty process to continue execution conditionally during the fault recovery, and the faulty process is recovered The consistency of the final state with the state of the non-faulty process will ultimately determine the correctness of the above recovery method.

定理2在恢复线程作用下,若无故障进程有条件继续执行,则所有故障进程被恢复后系统的全局状态是一个一致性的全局状态。Theorem 2 Under the action of the recovery thread, if the non-faulty process continues to execute conditionally, then the global state of the system after all the faulty processes are restored is a consistent global state.

证明:正如恢复线程的算法所描述,如果一个或多个进程出现故障,未发生故障进程若向故障进程发送消息则停止等待故障进程恢复,否则继续执行。Proof: As described in the algorithm for recovering threads, if one or more processes fail, if the non-faulty process sends a message to the faulty process, it will stop waiting for the faulty process to recover, otherwise continue to execute.

情况1、若无故障进程向无故障进程发送消息,则故障进程恢复后,未发生故障进程向故障进程发送消息的数目必然没有变化,这些进程之间不可能存在孤儿消息。Case 1. If a non-faulty process sends a message to a non-faulty process, after the faulty process recovers, the number of messages sent by the non-faulty process to the faulty process must not change, and there cannot be orphan messages among these processes.

情况2、若无故障进程停止于向故障进程发送消息的事件,则故障进程恢复后,无故障进程向故障进程发送消息的数目必然也没有变化,这些进程之间也不可能存在孤儿消息。Case 2. If the non-faulty process stops at the event of sending a message to the faulty process, after the faulty process recovers, the number of messages sent by the non-faulty process to the faulty process must not change, and there cannot be orphan messages among these processes.

综合两种情况,根据全局状态一致性的涵义,故障进程恢复后的系统全局状态是一个一致的全局状态。Combining the two cases, according to the meaning of global state consistency, the global state of the system after the recovery of the faulty process is a consistent global state.

然而,当所有故障进程均被恢复后,有可能出现多个未发生故障进程向一个被恢复后的故障进程发送消息的情况。事实上,由于多个发送进程向恢复后的进程发送消息是经过不同的信道传输,因此对于接收进程这些消息的接收事件之间不存在总是在先发生关系[7],这些事件可以按任何次序执行。However, when all the faulty processes are recovered, there may be a situation that multiple surviving faulty processes send messages to a recovered faulty process. In fact, since multiple sending processes send messages to the recovered process through different channels, there is no always-before relationship between the receiving events of these messages for the receiving process [7], and these events can be arranged according to any Execute in sequence.

与现有消息日志恢复方法比较:Compared with existing message log recovery methods:

不同消息日志恢复协议具有不同的性能评价指标,我们使用以下六个指标去评价一个恢复协议的性能:Different message log recovery protocols have different performance evaluation indicators. We use the following six indicators to evaluate the performance of a recovery protocol:

1、N.ckpts,所有进程保存检查点数。1. N.ckpts, all processes save the number of checkpoints.

2、INFOR.add,一个应用消息所携带额外信息量。2. INFOR.add, the amount of additional information carried by an application message.

3、DIS.rol,进程回退距离。3. DIS.rol, process rollback distance.

4、N.roll,k个进程发送故障后,恢复期间需回退的进程数量。4. N.roll, the number of processes that need to be rolled back during recovery after k processes fail.

5、Rollback,无故障进程是否回退。5. Rollback, whether the fault-free process is rolled back.

6、type,协议类型。6. type, protocol type.

在前向恢复特征的后向恢复容错协议(简称前向特征恢复协议)中,由于临时检查点在下一个心跳消息后被删除,因此每个进程只需保存一个永久检查点。每个应用消息所携带的数据是i、j和LCj;所以INFOR.add是3。进程发生故障后,故障进程只需回退至其永久检查点,故进程回退的距离DIS.rol为1。当k个进程发生故障后,无故障进程无需回退,仅需k个进程回退,因此N.roll是k。由于进程发送消息前先保存消息信息,任意时刻系统中不存在孤儿进程,因此该协议具有悲观协议恢复算法简单的优点;又由于消息接收进程接收消息后不保存任何信息至消息日志,因此该协议较之乐观协议具有更优良的系统正常运行性能。与现有协议[7]相同,在故障恢复阶段无故障进程不是回退或停止等待,而是继续执行,此特征类似于前向恢复算法使得系统中的进程具有较高的运行效率。In the backward recovery fault-tolerant protocol of the forward recovery feature (referred to as the forward feature recovery protocol), since the temporary checkpoint is deleted after the next heartbeat message, each process only needs to save one permanent checkpoint. The data carried by each application message are i, j and LCj; so INFOR.add is 3. After a process fails, the faulty process only needs to roll back to its permanent checkpoint, so the process rollback distance DIS.rol is 1. When k processes fail, no faulty processes need to roll back, only k processes need to roll back, so N.roll is k. Because the process saves the message information before sending the message, there is no orphan process in the system at any time, so this protocol has the advantage of a simple pessimistic protocol recovery algorithm; and because the message receiving process does not save any information to the message log after receiving the message, so the protocol Compared with the optimistic protocol, it has better system uptime performance. Same as the existing protocol [7], in the fault recovery phase, the fault-free process does not roll back or stop waiting, but continues to execute. This feature is similar to the forward recovery algorithm, which makes the process in the system have higher operating efficiency.

基于消息数目校验和消息重排序的消息日志恢复协议(MNCMR)[5][6]中,每个进程仅需异步地保存一个检查点,所以N.ckpts等于n。每个应用消息所携带的数据是j和LCj,所以INFOR.add是2。当一个或多个进程发生故障时,只有故障进程回退至其检查点,DIS.ro为1。恢复期间需回退的进程数量N.roll等于发生故障进程的总数。In the message log recovery protocol based on message number checksum and message reordering (MNCMR)[5][6], each process only needs to save a checkpoint asynchronously, so N.ckpts is equal to n. The data carried by each application message is j and LCj, so INFOR.add is 2. When one or more processes fail, only the failed process rolls back to its checkpoint, DIS.ro is 1. The number of processes N.roll to be rolled back during recovery is equal to the total number of failed processes.

与具有前向恢复特征的后向恢复容错协议相比较,MNCMR协议在进程发送消息前需保存消息信息至本地存储,在进程接收消息后亦需保存消息信息至消息日志,这必然导致了信息的重复存储;具有前向恢复特征的后向恢复容错协议仅在消息发送前保存消息信息,不仅节省了存储空间而且提高了系统正常执行时的性能。Compared with the backward recovery fault-tolerant protocol with forward recovery characteristics, the MNCMR protocol needs to save the message information to the local storage before the process sends the message, and also needs to save the message information to the message log after the process receives the message, which will inevitably lead to information loss. Repeated storage; the backward recovery fault-tolerant protocol with forward recovery feature only saves the message information before the message is sent, which not only saves storage space but also improves the performance of the system during normal execution.

自上世纪八十年代以来,大量的消息日志恢复协议发表于国内外期刊杂志中,以下选择几种典型协议与MNCMR协议进行比较。Sistla和Welch[1]提出了两个基于消息乐观日志恢复协议,一个协议中的发送消息携带了传递依赖向量(以Prasad.1表示此协议),另一个协议所发送的应用消息仅携带发送进程当前的状态间隔值(以Prasad.2表示此协议)。Prasad.1协议中 每个应用消息所需额外信息量为o(n),对于每个故障需要交换o(n2)的系统消息。Prasad.2协议中每个应用消息所需额外信息量为o(1),对于每个故障需要交换o(n3)的系统消息。在Strom和Yemini[2]的乐观消息日志协议中,每个发送的应用消息携带一个传递依赖向量,此向量具有n个分量,n为系统所具有的进程数量。在进程无故障执行时,每个进程需定期广播这个传递依赖向量或者把此向量附加在发送的消息中。Since the 1980s, a large number of message log recovery protocols have been published in domestic and foreign journals, and several typical protocols are selected below for comparison with the MNCMR protocol. Sistla and Welch [1] proposed two optimistic log recovery protocols based on messages. In one protocol, the sent message carries the transitive dependency vector (represented by Prasad.1), and the application message sent by the other protocol only carries the sending process Current state interval value (in Prasad.2 for this protocol). In the Prasad.1 protocol, the amount of additional information required for each application message is o(n), and for each fault, o(n2) system messages need to be exchanged. In the Prasad.2 protocol, the amount of additional information required for each application message is o(1), and for each fault, o(n3) system messages need to be exchanged. In the optimistic message logging protocol of Strom and Yemini [2], each application message sent carries a transitive dependency vector, which has n components, and n is the number of processes in the system. Each process needs to periodically broadcast this transitive dependency vector or attach it to the message it sends when the process executes without failure.

表1给出了本发明所述协议与上述协议的比较结果。Table 1 shows the comparison results between the protocol of the present invention and the protocol mentioned above.

表1Table 1

与其它非消息日志恢复协议(如,协调检查点恢复协议)比较,具有前向恢复特征的后向恢复容错协议的主要缺点是协议需对普通进程消息的发送接收进行干预,算法不具透明性。与其它消息日志协议比较,前向特征恢复协议兼具悲观和乐观协议的优点,而摒弃了其缺点。比较的结果如表2所示。Compared with other non-message log recovery protocols (such as coordinated checkpoint recovery protocols), the main disadvantage of backward recovery fault-tolerant protocols with forward recovery features is that the protocol needs to intervene in the sending and receiving of ordinary process messages, and the algorithm is not transparent. Compared with other message log protocols, forward signature recovery protocol has both the advantages of pessimistic and optimistic protocols, while abandoning their disadvantages. The results of the comparison are shown in Table 2.

表2Table 2

本发明的主要贡献:Main contribution of the present invention:

通过故障标志位对进程应用消息的发送和接收进行精准控制,实现了在进程故障恢复阶段无故障进程的有条件继续执行,使得系统具有前向恢复特征,并使得消息日志恢复协议算法达到了目前为止最为理想的境界。Precisely control the sending and receiving of process application messages through the fault flag bit, and realize the conditional continuation of the fault-free process in the process fault recovery stage, making the system have forward recovery characteristics, and making the message log recovery protocol algorithm reach the current level The most ideal realm so far.

附图说明:Description of drawings:

图1是说明消息接收次序其随机性的例图;图2是说明消息发送和接收事件间总是在先发生关系的例图;图3是根据逻辑时钟的定义计算其值的例图;图4是阐述消息发送事件S(mj)间接在先发生于消息接收事件R(mi)的说明图;图5是悲观日志协议下故障时进程回退的例图;图6是乐观日志协议下故障时进程回退的例图;图7是无故障进程条件执行且故障进程被恢复后不再发送消息的例图;图8是无故障进程条件执行且故障进程被恢复后发送消息的例图;图9是本发明的技术方案图;图10是系统正常运行时进程发送和接收消息的说明图;图11是故障进程恢复过程说明图;图12是普通进程运行流程图;图13是恢复管理进程的运行流程图;图14是状态控制进程Control-i的运行流程图。Fig. 1 is an example diagram illustrating the randomness of the message receiving sequence; Fig. 2 is an example diagram illustrating the relationship between message sending and receiving events that always occurs first; Fig. 3 is an example diagram calculating its value according to the definition of a logical clock; Fig. 4 is an explanatory diagram illustrating that the message sending event S(mj) indirectly precedes the message receiving event R(mi); Figure 5 is an example of process rollback when a failure occurs under the pessimistic log protocol; Figure 6 is a failure under the optimistic log protocol Figure 7 is an example diagram of conditional execution of no faulty process and no longer sending messages after the faulty process is restored; Figure 8 is an example diagram of conditional execution of no faulty process and sending of messages after the faulty process is restored; Fig. 9 is a technical solution diagram of the present invention; Fig. 10 is an explanatory diagram of process sending and receiving messages when the system is in normal operation; Fig. 11 is an explanatory diagram of faulty process recovery process; Fig. 12 is a flow chart of ordinary process operation; Fig. 13 is recovery management Process flow chart; FIG. 14 is a flow chart of the state control process Control-i.

[1]Elnozahy E N,Alvisi L,Wang Yimin,et al.“A Survey of Rollbackrecovery Protocols in Message passing Systems,”ACM Computing Surveys,2002,34(3):375-408.[1]Elnozahy E N, Alvisi L, Wang Yimin, et al. "A Survey of Rollback recovery Protocols in Message passing Systems," ACM Computing Surveys, 2002, 34(3): 375-408.

[2]A.Bouteiller,F.Cappello,T.Hérault,G.Krawezik,P.Lemarinier andF.Magniette. MPICH-V2:“a Fault Tolerant MPI for Volatile Nodes based onPessimistic Sender Based Message Logging,”In Proc.of the 15th InternationalConference on High Performance Networking and Computing(SC2003),November2003.[2] A. Bouteiller, F. Cappello, T. Hérault, G. Krawezik, P. Lemarinier and F. Magniette. MPICH-V2: "a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging," In Proc.of the 15th International Conference on High Performance Networking and Computing (SC2003), November 2003.

[3]D.B.Johnson and W.Zwaenpoel.“Sender-Based Message Logging,”InDigest of Papers:17th International Symposium on Fault-Tolerant Computing,pp.14-19,1987.[3] D.B.Johnson and W.Zwaenpoel. "Sender-Based Message Logging," InDigest of Papers: 17th International Symposium on Fault-Tolerant Computing, pp.14-19, 1987.

[4]J.Xu,R.B.Netzer and M.Mackey.Sender-based message logging forreducing rollback propagation.In Proc.of the 7th International Symposium onParallel and Distributed Processing,pp.602-609,1995.[4]J.Xu,R.B.Netzer and M.Mackey.Sender-based message logging forreducing rollback propagation.In Proc.of the 7th International Symposium on Parallel and Distributed Processing,pp.602-609,1995.

[5]高胜法,蔡静,冯振.“基于消息重排序和消息数目检验消息日志恢复方法,”中华人民共和国发明专利,专利号:201210239710.0,2012.[5] Gao Shengfa, Cai Jing, Feng Zhen. "Message log recovery method based on message reordering and message number inspection," People's Republic of China Invention Patent, Patent No.: 201210239710.0, 2012.

[6]Jing cai,Shengfa gao.“Message Rearrange Theory in Message RecoveryProtocol,”2013 5th International Conference on Computer Science andInformation Technology(CSIT),pp.293-297,2013.[6] Jing cai, Shengfa gao. "Message Rearrange Theory in Message Recovery Protocol," 2013 5th International Conference on Computer Science and Information Technology (CSIT), pp.293-297, 2013.

[7]Shengfa gao.“Message Number Check and Message Rearranging Theoryand Protocols”,LAP LAMBERT Academic Publishing house,2014(专著).[7] Shengfa gao. "Message Number Check and Message Rearranging Theory and Protocols", LAP LAMBERT Academic Publishing house, 2014 (monograph).

[8]SCHLICHTING,R.D.AND SCHNEIDER,F.B.1983.“Fail-stop processors:Anapproach to designing fault-tolerant computing systems,”ACMTrans.Comput.Syst.1,3,222–238,1983。[8] SCHLICHTING, R.D. AND SCHNEIDER, F.B. 1983. "Fail-stop processors: An approach to designing fault-tolerant computing systems," ACMTrans.Comput.Syst.1,3,222–238,1983.

Claims (2)

1. A backward recovery fault-tolerant method with forward recovery characteristic is characterized in that a message reordering method is adopted, the effect of conditional execution of a fault-free process when other processes have faults is achieved by sending fault bit control messages, and messages and corresponding logic clocks are stored in a message log at one time and then sent before sending and receiving processes send messages without fault sending processes; when a process fails, acquiring replay messages and corresponding logic clocks from a message log under the control of a recovery thread, and then reordering the replay messages according to the logic clocks of the messages; finally, the sequenced messages are sent to the fault process again, and the fault process receives and processes the messages again, so that the replay of the messages is realized;
the method requires a common process PiSendM (i, j, LC) functioniM) the working steps are as follows:
step 1, if Fi0, indicates the message sending process PiIf no fault exists, the step 2 is carried out; otherwise Fi1 indicates PiIf the fault exists, the step 4 is carried out; wherein FiIs a component of a fault vector F representing a process PiFault state of (D), Fi0 denotes the message sending process PiNo failure, Fi1 represents PiA fault exists;
step 2, if FjIf the value is 0, the step 3 is carried out; otherwise FjStep 2 is carried out when the process P is receivedjRecovery from a fault condition, PjThe failure of (1) is recovered by the Recover (j) thread; wherein FjIs a component of a fault vector F representing a message receiving process PjFault state of (D), Fj0 represents PjNo failure, Fj1 represents PjA fault exists;
step 3, adding one to the logic clock of the sending process under the condition that both the sending process and the receiving process have no fault: LC (liquid Crystal)i←LCi+ 1; adding message information<i,j,LCi,m>When the message log file is at the end, turning to the step 5; wherein i and j respectively represent a sending process PiAnd receiving the process PjProcess number, LCiRepresents PiA logical clock variable of the process, m represents the load of the message;
step 4, adding one to the logic clock of the sending process: LC (liquid Crystal)i←LCi+ 1; ending the message sending process;
step 5, sending application message AM<i,j,LCi,m>To receiving process PjAnd the message sending process is ended.
2. Backward with forward recovery as in claim 1Recovering fault tolerance method, marking the failed common process as PkK 1,2, 3 … n, failover recovery (k) thread recovery PkThe steps are as follows:
step 1, sending F with reliable channelk1 message to each of the state Control processes Control-1, Control-2kIf the fault occurs, switching to the step 2; wherein the state Control process Control-i is responsible for the normal process PiState control of (1), including saving PiTemporary and permanent checkpoint of, killing PiAnd restarting P from a persistent checkpointi,i=1,2,...n;
Step 2, if T sent by Control-k is receivedkVector, go to step 3; otherwise, go to step 2 to wait for T receptionkVector quantity; wherein, the vector Tk=[Tk1,Tk2,...Tkn],TkiRepresenting a process PkReceived, process PiThe number of messages sent, i ≠ k, i ═ 1, 2.. n;
step 3, receiving TkSaving the vector to a memory, and turning to the step 4;
step 4, accessing daemon process to obtain U1k,U2k,...U(k-1)k,U(k+1)k,...UnkTurning to step 5; wherein U isikRepresenting a process PiSent to process PjThe number of messages, i ≠ k, i ═ 1, 2.. n;
step 5, accessing the log daemon process, and obtaining the U from the log file according to the sequence from back to frontik-TkiA PiTo PkStep 6, i ≠ k, i ≠ 1, 2.. n; wherein U isik-TkiThe value of the subtraction expression of (a) represents the process PkShould be received from process P after restartiThe number of messages;
step 6, pressing all messages to LCkSorting the variable values from small to large, and turning to step 7;
step 7, sending the sequenced message to PkI.e. replay message, go to step 8;
step 8, sending F by reliable channelkA message of 0 to Control-1, Control-2kThe process has recovered from the failure; fk←0,FkAnd setting zero and finishing steering.
CN201510571405.5A 2015-09-09 2015-09-09 It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature Expired - Fee Related CN105242979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510571405.5A CN105242979B (en) 2015-09-09 2015-09-09 It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510571405.5A CN105242979B (en) 2015-09-09 2015-09-09 It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature

Publications (2)

Publication Number Publication Date
CN105242979A CN105242979A (en) 2016-01-13
CN105242979B true CN105242979B (en) 2017-12-12

Family

ID=55040633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510571405.5A Expired - Fee Related CN105242979B (en) 2015-09-09 2015-09-09 It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature

Country Status (1)

Country Link
CN (1) CN105242979B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107181805B (en) * 2017-05-26 2019-11-12 上交所技术有限责任公司 A method of realizing that global orderly is recurred under micro services framework
CN110019506B (en) * 2017-09-21 2023-03-31 阿里云计算有限公司 Log record processing method and device
CN110795265B (en) * 2019-10-25 2021-04-02 东北大学 Iterator based on optimistic fault-tolerant method
CN111143142B (en) * 2019-12-26 2021-05-04 江南大学 A Universal Checkpoint and Rollback Recovery Method
WO2025131313A1 (en) * 2023-12-22 2025-06-26 Huawei Technologies Co., Ltd. Process system and method for resource-efficient, gradual, asynchronous checkpointing and recovery of data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102841840A (en) * 2012-07-11 2012-12-26 山东大学 Message log recovery method based on message reordering and inspection of number of messages
CN103634411A (en) * 2013-12-16 2014-03-12 上海证券交易所 Real-time market data broadcasting system and real-time market data broadcasting method with state consistency
CN103647669A (en) * 2013-12-16 2014-03-19 上海证券交易所 System and method for guaranteeing distributed data processing consistency

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102841840A (en) * 2012-07-11 2012-12-26 山东大学 Message log recovery method based on message reordering and inspection of number of messages
CN103634411A (en) * 2013-12-16 2014-03-12 上海证券交易所 Real-time market data broadcasting system and real-time market data broadcasting method with state consistency
CN103647669A (en) * 2013-12-16 2014-03-19 上海证券交易所 System and method for guaranteeing distributed data processing consistency

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Message Rearrange Theory in Message Recovery Protocol";Jing cai;《2013 5th International Conference on Computer Science and Information Technology(CSIT)》;20130905;第293页第6段至296页第5段,图1 *

Also Published As

Publication number Publication date
CN105242979A (en) 2016-01-13

Similar Documents

Publication Publication Date Title
CN105242979B (en) It is a kind of that there is the preceding backward recovery fault-tolerance approach to recovery feature
US7590668B2 (en) Pausable backups of file system items
Chakravorty et al. A fault tolerance protocol with fast fault recovery
Ropars et al. SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing
CN110190991B (en) A fault-tolerant method for distributed stream processing system in multiple application scenarios
CN105871603A (en) Failure recovery system and method of real-time streaming data processing based on memory data grid
Meneses et al. Evaluation of simple causal message logging for large-scale fault tolerant HPC systems
Mendizabal et al. High performance recovery for parallel state machine replication
CN116610752A (en) Transactional distributed data synchronization method, device, system and storage medium
CN110888761A (en) Fault-tolerant method based on active backup of key task part and stream processing platform
Wang et al. A comprehensive study on fault tolerance in stream processing systems
Ho et al. Scalable group-based checkpoint/restart for large-scale message-passing systems
Li et al. Asynchronous prefix recoverability for fast distributed stores
Siachamis et al. Checkmate: Evaluating checkpointing protocols for streaming dataflows
CN106371919B (en) A shuffling data cache method based on map-reduce computing model
Chakravorty et al. A fault tolerant protocol for massively parallel systems
Niazi et al. Leader election using NewSQL database systems
CN102841840B (en) The message logging restoration methods that Effect-based operation reorders and message number is checked
van Renesse et al. Replication techniques for availability
CN101986602B (en) Method for setting checkpoints and recovering failure process based on message number checking and non-blocking
Mendizabal et al. Checkpointing in parallel state-machine replication
Gupta et al. A novel roll-back mechanism for performance enhancement of asynchronous checkpointing and recovery
Bessho et al. Comparing checkpoint and rollback recovery schemes in a cluster system
Park et al. Application controlled checkpointing coordination for fault-tolerant distributed computing systems
CN111143475B (en) State management method and device for Storm data analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171212