
WO2024250919A1 - Asynchronous arbitration method and device - Google Patents

Asynchronous arbitration method and device

Info

Publication number
WO2024250919A1
WO2024250919A1 PCT/CN2024/093216 CN2024093216W WO2024250919A1 WO 2024250919 A1 WO2024250919 A1 WO 2024250919A1 CN 2024093216 W CN2024093216 W CN 2024093216W WO 2024250919 A1 WO2024250919 A1 WO 2024250919A1
Authority
WO
WIPO (PCT)
Prior art keywords
instance
redundant
data
arbitration
synchronization point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/093216
Other languages
English (en)
French (fr)
Inventor
胡万明
汪旭
任玉鑫
林子畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of WO2024250919A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes

Definitions

  • Embodiments of the present application relate to electronic technology, and more particularly to an asynchronous arbitration method and device.
  • multi-mode redundancy technology is increasingly used in aerospace, satellites, distributed systems and high performance computing (HPC).
  • HPC: high performance computing
  • database storage, scientific research, weather forecasting, military research and gene sequencing.
  • the instance busy waiting method is usually adopted.
  • the three-module redundant OS technology commonly used in the industry performs synchronous arbitration through an arbitrator. Its synchronization adopts blocking and busy waiting methods to align the states of all redundant instances at the synchronization point.
  • this multi-mode synchronization method has many shortcomings. Among them, the most significant pain point is that different redundant instances execute at different speeds, busy waiting causes low central processing unit (CPU) utilization, and the slowest single point instance becomes the bottleneck of the overall software and hardware operating efficiency.
  • CPU: central processing unit
  • the present application provides an asynchronous arbitration method and device.
  • the method can save the arbitration data at the synchronization point when the redundant instance runs to the synchronization point, and continue to execute the part after the synchronization point of the redundant instance after saving the arbitration data; finally, the arbitration data of all redundant instances are arbitrated.
  • the redundant instances do not need to busy wait at the synchronization point, which can improve CPU utilization.
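The flow described above (save the arbitration data at the synchronization point, continue executing, arbitrate once the data of all redundant instances is available) can be sketched as follows. This is a minimal single-process illustration, not the applicant's implementation: the class name `AsyncArbiter`, its methods, the strict-majority rule and the use of hashable values as arbitration data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class AsyncArbiter:
    """Hypothetical arbiter: collects arbitration data without blocking callers."""

    def __init__(self, n_instances):
        self.n = n_instances
        # synchronization point id -> {instance id -> arbitration data}
        self.saved = defaultdict(dict)
        # synchronization point id -> arbitration outcome, set once all N records exist
        self.results = {}

    def reach_sync_point(self, instance_id, sync_point, data):
        """Save the data and return at once, so the instance can keep running."""
        self.saved[sync_point][instance_id] = data
        if len(self.saved[sync_point]) == self.n:
            self.results[sync_point] = self._arbitrate(sync_point)

    def _arbitrate(self, sync_point):
        records = self.saved[sync_point]
        value, count = Counter(records.values()).most_common(1)[0]
        if count <= self.n // 2:
            return None  # assumed rule: without a strict majority nothing is decided
        faulty = sorted(i for i, d in records.items() if d != value)
        return {"target": value, "faulty": faulty}


arbiter = AsyncArbiter(3)
arbiter.reach_sync_point(0, "sp1", "out-A")
arbiter.reach_sync_point(0, "sp2", "out-B")  # instance 0 runs ahead without waiting
arbiter.reach_sync_point(1, "sp1", "out-A")
arbiter.reach_sync_point(2, "sp1", "out-X")  # last arrival triggers arbitration of sp1
```

In this sketch instance 0 has already recorded data for the second synchronization point before the first one is arbitrated, which is the behaviour that removes busy waiting.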
  • an embodiment of the present application provides an asynchronous arbitration method, the method comprising:
  • N is an integer greater than 1;
  • i is a positive integer not greater than N;
  • target arbitration data is determined from the N arbitration data, and the redundant instance corresponding to the target arbitration data is the correct redundant instance at the synchronization point.
  • the N redundant instances are N instances in the N modular redundancy technology; the N redundant instances may be obtained by copying the same instance.
  • an instance is a functional module including a running object and target data.
  • An instance can include a running object and target data involved in the running process (such as input and running data).
  • the running object can be a software module or a hardware module; the instance can start executing the running object after receiving the input, and the running object can call the running data during the running process.
  • the running objects of the N redundant instances may also be called redundant objects; the redundant objects of the N redundant instances are the same, receive the same input, and call the same running data during the running process.
  • the i-th arbitration data is used to indicate the running status of the i-th instance up to the current position, and the arbitration data at least includes the output result of the current position.
  • the asynchronous execution of multiple redundant instances can be achieved, effectively eliminating the synchronous busy waiting of the faster instance, and improving the overall CPU resource utilization of the program.
  • the faster instance refers to the instance that runs to the synchronization point faster among the N redundant instances.
  • the method further includes:
  • the checkpoint (CKPT) data of the i-th redundant instance at the synchronization point is saved, and the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
  • the redundant instance can also be checkpointed to save the CKPT data used to restore to the current synchronization point state; the erroneous redundant instance can be corrected based on the correct CKPT data later, thereby ensuring that the number of redundant instances of multi-mode redundancy will not decrease, and the reliability and availability of multi-mode redundancy can be effectively improved.
  • the CKPT data can ensure that there is a state to be restored to during asynchronous rollback. When all members arbitrate to identify the fault, whether the slowest instance or the faster instance has an error, asynchronous error correction can be performed.
  • the i-th arbitration data is not the target arbitration data; after determining the target arbitration data from the N arbitration data, the method further includes:
  • the erroneous redundant instance can be corrected based on the CKPT data of the correct redundant instance at the synchronization point, thereby ensuring that the number of redundant instances of multi-mode redundancy will not decrease, and the reliability and availability of multi-mode redundancy can be effectively improved.
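The correction step described above can be sketched as restoring every mismatching instance from the CKPT data of a correct instance. The function name, the dictionary layout and the use of `copy.deepcopy` are illustrative assumptions; the sketch also assumes that at least one correct instance has saved CKPT data.

```python
import copy


def correct_faulty_instances(states, arbitration_data, checkpoints, target):
    """Roll each instance whose arbitration data differs from `target` back to
    the synchronization-point state saved by a correct instance.

    states:           instance id -> current (mutable) state
    arbitration_data: instance id -> arbitration data at the synchronization point
    checkpoints:      instance id -> CKPT data saved at the synchronization point
    """
    # Pick any correct instance that saved CKPT data (assumed to exist).
    donor = next(i for i, d in arbitration_data.items()
                 if d == target and i in checkpoints)
    repaired = []
    for i, d in arbitration_data.items():
        if d != target:
            # Deep copy so the repaired instance does not share state with the donor.
            states[i] = copy.deepcopy(checkpoints[donor])
            repaired.append(i)
    return repaired


states = {0: {"x": 5}, 1: {"x": 5}, 2: {"x": 99}}
data = {0: "ok", 1: "ok", 2: "bad"}
ckpts = {0: {"x": 1}}
repaired = correct_faulty_instances(states, data, ckpts, "ok")
```

Because arbitration is asynchronous, the faulty instance may already have run past the synchronization point; restoring the CKPT data sets it back to the state at that point, and the number of redundant instances stays at N.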
  • saving the checkpoint (CKPT) data of the i-th redundant instance at the synchronization point includes:
  • the CKPT data of the i-th redundant instance is not saved.
  • the arbitration data of the redundant instance is the same as the saved arbitration data (i.e., pre-arbitration technology); then, for instances with the same arbitration data, only one CKPT data can be saved.
  • This method can reduce the number of CKPT operations, reduce power consumption, and improve CPU resource utilization.
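The pre-arbitration idea above (skip the CKPT when an instance's arbitration data equals arbitration data that is already saved) can be sketched as follows. `CheckpointStore`, `on_sync_point` and the `make_checkpoint` callback are hypothetical names; passing a callback is a design choice of this sketch so that the expensive checkpoint is only taken when it is needed.

```python
class CheckpointStore:
    """Hypothetical per-synchronization-point store of CKPT data."""

    def __init__(self):
        # arbitration data -> CKPT data saved by the first instance that produced it
        self.checkpoints = {}

    def on_sync_point(self, arbitration_data, make_checkpoint):
        """Return True if a CKPT was taken, False if pre-arbitration skipped it."""
        if arbitration_data in self.checkpoints:
            return False  # same data already saved: one CKPT is enough
        self.checkpoints[arbitration_data] = make_checkpoint()
        return True


store = CheckpointStore()
taken = [
    store.on_sync_point("out-A", lambda: {"pc": 10}),  # first instance: CKPT taken
    store.on_sync_point("out-A", lambda: {"pc": 10}),  # same data: skipped
    store.on_sync_point("out-X", lambda: {"pc": 11}),  # differing data: CKPT taken
]
```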
  • the i-th redundant instance is an instance among the N redundant instances except the last instance that runs to the synchronization point.
  • the CKPT instance refers to the instance that has performed CKPT (that is, the instance that saves CKPT data); the slowest instance refers to the last instance that runs to the synchronization point.
  • in the worst fault scenario, N/2-1 instances fail.
  • the N-module instances in the worst fault scenario only need to perform N/2-1 CKPTs in one round of synchronization points, achieving the theoretical optimal CKPT cost.
  • the operation of identifying whether the arbitration data of the i-th redundant instance is the same as the saved arbitration data is not performed; instead, full arbitration (i.e., determining the target arbitration data), error correction rollback and other operations are performed directly on all arbitration data. After these operations are performed, the redundant instance is executed. This method can reduce the number of pre-arbitration operations.
  • determining target arbitration data from N arbitration data includes:
  • among the N arbitration data, the arbitration data whose number of occurrences is greater than a preset number is determined as the target arbitration data.
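The counting rule above can be sketched in a few lines. The function name `determine_target` is hypothetical, and the arbitration data is assumed to be hashable so that identical values can be counted; returning `None` when no value exceeds the preset number is also an assumption of the sketch.

```python
from collections import Counter


def determine_target(arbitration_data, preset_number):
    """Return the value that occurs more than `preset_number` times, else None."""
    if not arbitration_data:
        return None
    value, count = Counter(arbitration_data).most_common(1)[0]
    return value if count > preset_number else None


# With N = 3 and a preset number of N // 2 = 1, at most one value can qualify.
target = determine_target(["out-A", "out-A", "out-X"], 1)
no_target = determine_target(["out-A", "out-B", "out-C"], 1)
```

Choosing the preset number as at least N // 2 guarantees that no two different values can both qualify.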
  • the availability and reliability of the correct redundant instance can be guaranteed.
  • saving the i-th arbitration data includes:
  • the i-th arbitration data is saved
  • the method further includes:
  • the step of determining the target arbitration data from the N arbitration data is performed.
  • the arbitration data thereof is not saved, so that the shared memory consumption and the communication overhead can be reduced.
  • the method further includes:
  • the execution gap between redundant instances can be controlled to avoid a significant mismatch between the instances.
  • the number of synchronization points in the redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; after determining the target arbitration data from the N arbitration data, the method further includes: deleting the arbitration data of the synchronization point;
  • the suspending execution of the i-th redundant instance includes: suspending execution of the i-th redundant instance when a preset storage space is full.
  • the shared memory can be configured to a fixed size, so that the execution gap between redundant instances is bounded by the size of the shared memory, thereby avoiding a large mismatch between redundant instances.
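The fixed-size storage behaviour above can be sketched as a bounded buffer that refuses new arbitration data when it is full, which is the signal for the instance to pause, and that frees a slot once a synchronization point has been arbitrated. `BoundedArbitrationSpace` and its methods are hypothetical names, and the capacity is counted in synchronization points for simplicity.

```python
class BoundedArbitrationSpace:
    """Hypothetical preset storage space for one instance's arbitration data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = {}  # synchronization point id -> arbitration data

    def try_save(self, sync_point, data):
        """Save and return True, or return False when full (instance should pause)."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots[sync_point] = data
        return True

    def delete(self, sync_point):
        """Called after the synchronization point has been arbitrated."""
        self.slots.pop(sync_point, None)


space = BoundedArbitrationSpace(capacity=2)
first = space.try_save("sp1", "a")
second = space.try_save("sp2", "b")
blocked = space.try_save("sp3", "c")   # full: the fast instance pauses here
space.delete("sp1")                    # sp1 arbitrated, slot released
resumed = space.try_save("sp3", "c")   # the instance can continue
```

With a capacity of two, an instance can run at most two unarbitrated synchronization points ahead of the slowest instance.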
  • recording target data of a current state of an i-th redundant instance to obtain i-th arbitration data includes:
  • when the i-th application instance reaches a library function call or a library function returns data
  • the i-th application instance is stopped, and the current system call number, input parameters and output parameters of the i-th application instance are recorded to obtain the i-th arbitration data.
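For an application instance, the arbitration data described above could be represented as a small immutable record. The `SyscallRecord` type, its field names and the two helper functions are assumptions for illustration; the system call number and parameters in the usage lines are arbitrary example values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SyscallRecord:
    """Hypothetical arbitration data captured at a library-function boundary."""
    syscall_number: int
    input_params: Tuple
    output_params: Tuple = ()


def record_before_call(syscall_number, *args):
    """Synchronization point reached when the library function is called."""
    return SyscallRecord(syscall_number, tuple(args))


def record_after_return(before, *results):
    """Synchronization point reached when the library function returns data."""
    return SyscallRecord(before.syscall_number, before.input_params, tuple(results))


# Two redundant instances issuing the same call produce equal, hashable records.
r1 = record_before_call(1, 3, "hello", 5)
r2 = record_before_call(1, 3, "hello", 5)
r1_done = record_after_return(r1, 5)
```

A frozen dataclass compares by value and is hashable, so records from different instances can be compared or counted directly during arbitration.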
  • an embodiment of the present application provides an asynchronous arbitration device, the device comprising N processors and an arbitrator;
  • the N processors are respectively used to execute the instance; when the instance is executed to the synchronization point, the arbitration data of the synchronization point is output to the arbitrator, and the arbitration data includes the current output result; after the arbitration data of the synchronization point is output, the processing steps after the synchronization point are executed;
  • the arbitrator is used to save the arbitration data when receiving the arbitration data sent by the processor; and determine the target arbitration data from the N arbitration data when receiving N arbitration data.
  • the processor is further configured to send the checkpoint (CKPT) data of the synchronization point to the arbitrator when processing to the synchronization point;
  • the arbitrator is further used to send the CKPT data corresponding to the target arbitration data to the processor corresponding to the non-target arbitration data;
  • the processor corresponding to the non-target arbitration data is further used to restore the data state at the synchronization point based on the CKPT data corresponding to the target arbitration data.
  • an asynchronous arbitration device comprising:
  • An execution unit for executing N redundant instances for the same input, where N is an integer greater than 1; and i is a positive integer not greater than N;
  • the execution unit is further used to: stop executing the i-th redundant instance when the i-th redundant instance runs to the synchronization point, record the current output result of the i-th redundant instance, and obtain the i-th arbitration data; save the i-th arbitration data; after saving the i-th arbitration data, execute the part after the synchronization point of the i-th redundant instance;
  • the determination unit is used to determine target arbitration data from the N arbitration data after obtaining the N arbitration data, wherein the redundancy instance corresponding to the target arbitration data is the correct redundancy instance at the synchronization point.
  • the device further includes a storage unit
  • the storage unit is used to store the checkpoint (CKPT) data of the i-th redundant instance at the synchronization point, and the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
  • the device further includes a recovery unit; the recovery unit is configured to:
  • the storage unit is further configured to:
  • the CKPT data of the i-th redundant instance is not saved.
  • the i-th redundant instance is an instance other than the last instance among the N redundant instances that runs to a synchronization point.
  • the determining unit is specifically configured to:
  • among the N arbitration data, the arbitration data whose number of occurrences is greater than a preset number is determined as the target arbitration data.
  • the execution unit is specifically configured to:
  • the i-th arbitration data is saved; when the arbitration data of the last redundant instance running to the synchronization point is obtained, the step of determining the target arbitration data from the N arbitration data is performed.
  • the execution unit further includes a pausing unit, the pausing unit being configured to:
  • the number of synchronization points in the redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; the device further includes a deletion unit, the deletion unit being configured to delete the arbitration data of the synchronization point;
  • the pausing unit is specifically used to suspend the execution of the i-th redundant instance when the preset storage space is full.
  • the redundant instance is an application instance, and the execution unit is specifically used to:
  • when the i-th application instance reaches a library function call or a library function returns data
  • the i-th application instance is stopped, and the current system call number, input parameters and output parameters of the i-th application instance are recorded to obtain the i-th arbitration data.
  • the present application provides a computer storage medium, including computer instructions.
  • When the computer instructions are executed on an electronic device, the electronic device executes the asynchronous arbitration method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • the present application provides a computer program product.
  • When the computer program product runs on a computer, it enables the computer to execute the asynchronous arbitration method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • the present application provides a chip, comprising: a processor and an interface, wherein the processor and the interface cooperate with each other so that the chip executes the asynchronous arbitration method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • the asynchronous arbitration device provided in the second and third aspects, the computer-readable storage medium provided in the fourth aspect, the computer program product provided in the fifth aspect, and the chip provided in the sixth aspect are all used to execute the method provided in the embodiment of the present application. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects in the corresponding method, which will not be repeated here.
  • FIG. 1 is a schematic diagram of multi-mode redundancy provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of redundant instances waiting synchronously, provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 5 is a flow chart of an asynchronous arbitration method provided by an embodiment of the present application.
  • FIG. 6 is a flow chart of another asynchronous arbitration method provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an execution of multiple redundant instances provided in an embodiment of the present application.
  • FIG. 8A is a schematic diagram of N redundant instances provided by an embodiment of the present application when there is no failure at the synchronization point.
  • FIG. 8B is a schematic diagram of N redundant instances provided by an embodiment of the present application when a failure occurs at the synchronization point.
  • FIG. 9 is a flow chart of another asynchronous arbitration method provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of managing arbitration data provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another asynchronous arbitration method provided in an embodiment of the present application.
  • FIG. 12 is another schematic diagram of managing arbitration data provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the structure of an asynchronous arbitration device 130 provided in an embodiment of the present application.
  • the terms “first” and “second” are used for descriptive purposes only and are not to be understood as suggesting or implying relative importance or implicitly indicating the number of technical features indicated.
  • a feature defined as “first” or “second” may explicitly or implicitly include one or more of the features.
  • GUI: graphical user interface
  • redundancy usually refers to increasing system reliability through multiple backups. This means repeatedly configuring some key components of the system. When a system failure occurs, the redundant components intervene and take over the work of the failed components, thereby reducing the system failure time. Although redundancy brings system complexity and increases costs, it is necessary given the high cost of business interruption caused by failures in business-critical systems.
  • Reliability refers to the probability that a device can complete a specified task under specified conditions and within a specified time. Improving reliability requires reducing the number of system interruptions (failures). Availability refers to the proportion of total available time for a functional individual within a given time interval. Improving availability requires emphasizing reducing the time to recover from failures.
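The availability definition above can be written as a ratio. The symbols below are introduced here for illustration and do not appear in the source, and the figures in the worked example are invented solely to show the arithmetic.

```latex
A = \frac{T_{\text{available}}}{T_{\text{interval}}},
\qquad
\text{e.g. } A = \frac{8759\ \text{h}}{8760\ \text{h}} \approx 0.99989
```

Under this reading, shortening the recovery time after a failure increases the available time in the interval and therefore raises the availability, which matches the emphasis on reducing the time to recover from failures.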
  • multi-mode redundancy is one of the necessary and effective reliability mechanisms.
  • the process of multi-mode redundancy includes redundant execution of the same input of hardware modules and software programs. Furthermore, the output of redundant objects can be arbitrated.
  • the data involved in the instance of the redundant object (referred to as redundant instance) includes the same instruction stream, backup data, redundant processes, etc.
  • Figure 1 is a schematic diagram of a multi-mode redundancy provided by an embodiment of the present application.
  • Figure 1 exemplarily shows two redundant instances.
  • the process of multi-mode redundancy can be to input the same input to the two redundant instances, perform redundant execution, and finally arbitrate the obtained output.
  • the two redundant instances can be obtained by replication (Sphere of Replication, SoR). It should be understood that multi-mode redundancy involves at least two redundant instances.
  • because the execution speeds of multiple redundant instances can differ, the time each instance takes to reach the synchronization point can also differ.
  • the redundant instances can be synchronized so that each redundant instance does not show a state deviation trend after a long period of execution.
  • the synchronization method usually specifies synchronization points in the software and hardware processes. In order to align the pace between redundant instances, the instance that reaches the synchronization point first waits until subsequent instances arrive one after another, and then passes the synchronization point after arbitration.
  • Figure 2 is a schematic diagram of a redundant instance synchronization waiting provided by an embodiment of the present application.
  • Figure 2 exemplarily shows three redundant instances, namely redundant instance 1, redundant instance 2 and redundant instance 3; and four synchronization points, namely synchronization point 1, synchronization point 2, synchronization point 3 and synchronization point 4; in the figure, a white rectangle represents an instance in operation, a gray rectangle represents operation to a synchronization point, and a diagonal rectangle represents stopped operation (i.e., busy waiting).
  • redundant instance 1, redundant instance 2 and redundant instance 3 all include the above four synchronization points.
  • the horizontal direction is the time direction. It can be seen that the time for each redundant instance to run to the synchronization point is not the same.
  • redundant instance 1 is the first instance to run to synchronization point 1, and starts waiting after redundant instance 1 reaches synchronization point 1;
  • redundant instance 2 is the second instance to run to synchronization point 1, and starts waiting after redundant instance 2 reaches synchronization point 1;
  • redundant instance 3 is the last instance to run to synchronization point 1, and the moment when redundant instance 3 runs to synchronization point 1 is synchronization moment t1.
  • redundant instance 2 and redundant instance 3 resume running.
  • the synchronization point refers to a preset position in the instance, such as the synchronization point shown in FIG. 2 , rather than the synchronization moment shown in FIG. 2 .
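The cost of the blocking synchronization illustrated by FIG. 2 can be made concrete with a small calculation. The function below is a hypothetical model, not part of the application: it assumes each instance needs a known time to run from one synchronization point to the next, and that every instance waits at each synchronization point until the slowest one arrives.

```python
def blocking_sync_idle_time(segment_times):
    """Total busy-wait time per instance under blocking synchronization.

    segment_times[i][k] is the time instance i needs for segment k, i.e. the
    run from one synchronization point to the next.
    """
    idle = [0] * len(segment_times)
    for k in range(len(segment_times[0])):
        slowest = max(times[k] for times in segment_times)
        for i, times in enumerate(segment_times):
            idle[i] += slowest - times[k]  # time spent waiting for the slowest
    return idle


# Three instances, two segments: each instance needs 4 time units of work in total.
idle = blocking_sync_idle_time([[1, 3], [2, 2], [3, 1]])
```

In this example no instance is slower than the others overall, yet each of them spends 2 of the 6 elapsed time units waiting, because a different instance is the slowest in each segment.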
  • state preservation can be performed regularly in order to perform stream/data stream rollback error correction, and state preservation is CKPT.
  • the redundant instance after CKPT is referred to as CKPT instance in the embodiment of the present application.
  • the instance busy waiting method can be adopted.
  • the three-module redundant OS technology commonly used in the industry uses an arbitrator for synchronous arbitration. Its synchronization adopts blocking and busy waiting methods to align the states of all redundant instances at the synchronization point.
  • this multi-mode synchronization method has many shortcomings. Among them, the most significant pain point is that different redundant instances execute at different speeds, the traditional busy waiting method causes low CPU utilization, and the slowest single-point copy becomes a bottleneck for the overall software and hardware operating efficiency.
  • multi-mode redundancy technology can adopt a leader instance (leader) plus follower instance (follower) mode in synchronous arbitration.
  • Shared memory such as Ring Buffer
  • Followers need to follow the external event behavior of the leader and arbitrate the data status. The specific steps are as follows:
  • Step 1: After redundant execution starts, when a round of synchronization point is reached, check whether the current instance is the leader or a follower.
  • Step 2: If it is the leader, save the arbitration data to the shared memory and resume execution; if it is a follower, check whether the leader has arrived.
  • Step 3: If the leader has arrived, obtain the leader data from the shared memory and perform arbitration; if the follower arrives before the leader, wait synchronously until the leader arrives.
  • This mode can eliminate the synchronous busy wait of the leader and subsequent followers, but because the leader is specified as a fixed thread, the overall execution efficiency is still limited by the speed of the leader.
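The leader plus follower steps above can be sketched as a single decision function. The function name, the string return values and the use of a plain dictionary in place of the shared memory are assumptions made for the example.

```python
def leader_follower_step(role, shared_memory, sync_point, data):
    """Decide what one instance does on reaching a synchronization point.

    Returns "continue" (leader), "wait" (follower ahead of the leader),
    or "match" / "mismatch" (follower arbitrated against the leader data).
    """
    if role == "leader":
        shared_memory[sync_point] = data  # Step 2: publish and resume execution
        return "continue"
    if sync_point not in shared_memory:   # Step 3: follower arrived first
        return "wait"
    return "match" if shared_memory[sync_point] == data else "mismatch"


shm = {}
early = leader_follower_step("follower", shm, "sp1", "out-A")   # leader not there yet
lead = leader_follower_step("leader", shm, "sp1", "out-A")
retry = leader_follower_step("follower", shm, "sp1", "out-A")
bad = leader_follower_step("follower", shm, "sp1", "out-X")
```

The `"wait"` outcome is the case in which a follower that is faster than the fixed leader is still blocked, which is the limitation noted above.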
  • the asynchronous arbitration method proposed in this application is a universal technology for software and hardware, and can be widely applied to any scenario using redundant instances.
  • Two application scenarios are exemplified below from the perspective of hardware and software. It should be understood that the scenarios implemented in this application are not limited to the following scenarios.
  • the application scenario may include M primary central processing units (CPUs), where M is an integer greater than 1; each CPU includes three core processors (cores), each core is used to execute a redundant OS instance, that is, a total of 3M redundant OS instances; each CPU includes an arbitrator, which is used to perform synchronization point arbitration on the redundant OS instances in the CPU.
  • the application scenario may also include an external device (referred to as peripheral) for sending input to the CPU, and several secondary microcontrollers, two of which are shown in Figure 3. It should be understood that the application scenario may also include other processors such as a tertiary processor, which is not limited here.
  • the peripheral can copy the input M times and send one copy to each primary CPU, and the OS redundant instances running in each execution core perform business processing. After each OS redundant instance in a primary CPU reaches a round of synchronization point, the arbitrator of that primary CPU arbitrates them for multi-mode synchronization and corrects any error that is identified. Finally, each CPU determines a correct primary arbitration result from its three redundant instances and outputs it to the secondary microcontrollers, so that each secondary microcontroller can receive M primary arbitration results.
  • the arbitrator of the secondary microcontroller performs secondary arbitration on the M primary arbitration results, and performs error correction if an error is identified.
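The two-level arbitration in this hardware scenario can be sketched as a vote inside each primary CPU followed by a vote over the M primary results. Both function names and the strict-majority rule are assumptions of the sketch; a CPU without a majority contributes `None`, which simply counts as a disagreeing vote at the second level.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of `values`, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


def two_level_arbitrate(outputs_per_cpu):
    """outputs_per_cpu[m] holds the outputs of the three OS instances of CPU m."""
    primary = [majority(outputs) for outputs in outputs_per_cpu]  # per-CPU arbitrator
    secondary = majority(primary)                                 # microcontroller arbitrator
    return primary, secondary


# M = 3 CPUs: one faulty instance in CPU 1, and CPU 2 agrees on a wrong value.
primary, secondary = two_level_arbitrate([[7, 7, 7], [7, 7, 8], [9, 9, 7]])
```

The first level masks the single faulty instance in the second CPU, and the second level masks the third CPU, whose instances agree on a wrong value.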
  • the OS redundant instance in the primary CPU performs asynchronous arbitration when executing to the synchronization point, so that the OS redundant instance no longer busy-waits when reaching the synchronization point.
  • an error correction module may be added to the arbitrator, and when the asynchronous arbitration identifies an erroneous OS redundant instance, a rollback error correction is performed on it.
  • Figure 4 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • the user-mode application performs multi-mode execution, and processes APP1-APPM are redundant instances thereof, that is, there are M APP redundant instances in total, where M is an integer greater than 1.
  • FIG. 4 exemplarily shows the arbitration process of two synchronization points, where the first synchronization point is before the system call (or after obtaining the C library function): when the APP redundant instance runs in the user state and makes a system call, it obtains the called library function from the C library and performs synchronization waiting before trapping into the OS until all APP redundant instances reach the current round of synchronization points. After all APP redundant instances complete the synchronization waiting, the arbitration module in the C library performs a consistency comparison on the inputs of all APP redundant instances, and once it passes, each instance enters the OS to execute its original system call.
  • the second synchronization point is before the system call returns the result to the user-mode application: after each APP redundant instance completes the current system call, the output value of the C library function is returned from the OS, and the system waits for all APP redundant instances to complete the OS process. Then, the consistency arbitration of the output value of all APP redundant instances is performed. After the consistency arbitration of the output value of all APP redundant instances is passed, the output value is returned to the user-mode application, and the user-mode APP redundant instance performs the next round of execution.
  • all APP redundant instances can save arbitration data and/or CKPT data after reaching the synchronization point, and then continue to execute, without waiting at the synchronization point position for all APP redundant instances to execute to the synchronization point position.
  • Figure 5 is a flow chart of an asynchronous arbitration method provided by an embodiment of the present application.
  • the method may include some or all of the following steps.
  • the electronic device includes N redundant instances, and the N redundant instances may be copied from the same instance, that is, the codes corresponding to the N redundant instances are implemented the same.
  • the N redundant instances may start running at the same time.
  • the electronic device may input the same input to the N redundant instances at the same time, and execute the N redundant instances at the same time.
  • the specific process of the electronic device executing N redundant instances may refer to all or part of the following steps S102 to S105:
  • the arbitration data may also include business input parameters, OS result output, etc.
  • the redundant instance is an application instance.
  • when the i-th application instance runs to the point where it calls a library function or the library function returns data, the i-th application instance is stopped, and the current system call number, input parameters, and output parameters of the i-th application instance are recorded to obtain the i-th arbitration data.
  • the current output result of the i-th redundant instance can be the input parameter of the C library function, in which case the i-th arbitration data includes the input parameter of the C library function, the system call number, etc.
  • the current output result of the i-th redundant instance can be the output parameter of the C library function, in which case the i-th arbitration data includes the output parameter of the C library function, the system call number, etc.
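  • As an illustration of what such an arbitration record might look like for an application instance, the following minimal sketch bundles the system call number and the parameters into one comparable value. The class name `ArbitrationData`, the field `phase`, and the example values are assumptions made for illustration and are not taken from this application.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArbitrationData:
    """Hypothetical arbitration record captured at one synchronization point."""
    syscall_number: int          # system call number of the intercepted call
    phase: str                   # "entry" (before the call) or "exit" (after it returns)
    params: Tuple[object, ...]   # input parameters at entry, output parameters at exit


# Two redundant instances that behaved identically produce equal records.
a = ArbitrationData(1, "entry", (3, b"hello", 5))
b = ArbitrationData(1, "entry", (3, b"hello", 5))
# An instance whose buffer was corrupted produces a different record.
c = ArbitrationData(1, "entry", (3, b"hellp", 5))
```

  • Because the record is immutable and hashable, equal records can be compared or counted directly when arbitration is performed.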
  • the i-th arbitration data can be saved to a shared memory (Shared Memory) or a ring buffer (Ring Buffer).
  • the i-th arbitration data is not saved.
  • step S102 may be directly executed.
  • the checkpoint CKPT data of the i-th redundant instance at the synchronization point may also be saved, and the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
  • the i-th redundant instance can be checkpointed, that is, the CKPT data of the i-th redundant instance at the current synchronization point is recorded.
  • the data can be recorded based on a preset type, and the specific data type can be determined based on the actual redundant instance; the CKPT data is used to restore the data state to the current synchronization point.
  • when the redundant instance is an application instance, the checkpoint method includes but is not limited to blocking the original APP, copying (forking) and blocking the sub-APP, recording the complete status data of the APP, etc.
  • blocking the original APP means suspending the execution of the application instance, and blocking of the original APP can be partially adopted in the embodiment of the present application; copying (forking) and blocking the sub-APP means forking the current application instance (which can be called the original APP), where the forked application instance is the sub-APP, the sub-APP is not executed, and the original APP continues to execute; recording the complete status data of the APP means saving all data of the current application instance from the beginning of operation to the current synchronization point.
  • when the redundant instance is an OS redundant instance executed by redundant hardware, the checkpoint method includes but is not limited to recording OS complete status data, etc., wherein the OS complete status data includes all data of the OS redundant instance from the start of running to the current synchronization point, such as the memory contents occupied by the running OS redundant instance, the device status data of the OS redundant instance, and the data stored in the registers during the running of the OS redundant instance together with the status data of the registers, etc.
  • the CKPT data of the i-th redundant instance may not be saved.
  • the CKPT data of the last instance among the N redundant instances that runs to the synchronization point may not be obtained and saved.
  • the portion after the synchronization point of the i-th redundant instance can be executed after saving the i-th arbitration data and the CKPT data of the i-th redundant instance. If the i-th redundant instance is the first instance to run to the synchronization point, step S102 can be executed after obtaining the i-th arbitration data, and then the portion after the synchronization point of the i-th redundant instance is executed. There is no need to checkpoint the i-th redundant instance, nor is there a need to save the i-th arbitration data or the CKPT data of the i-th redundant instance.
  • the i-th arbitration data can be saved, and it is determined whether to save the CKPT data of the i-th redundant instance according to the rules, and then the portion after the synchronization point of the i-th redundant instance is executed.
  • S105 After obtaining N arbitration data, determine target arbitration data from the N arbitration data, and the redundant instance corresponding to the target arbitration data is the correct redundant instance at the synchronization point.
  • the arbitration data value shared by the largest number of redundant instances among the N arbitration data is determined as the target arbitration data; or, an arbitration data value shared by more than a preset number of redundant instances among the N arbitration data is determined as the target arbitration data.
  • the preset number may be N/2, that is, if the number of redundant instances with the same arbitration data exceeds half of the total number of redundant instances, those redundant instances are determined to be the correct redundant instances. For example, if there are 5 redundant instances in total and the arbitration data of 3 of them are the same, then the arbitration data of those 3 redundant instances is the target arbitration data, and those 3 redundant instances are the correct redundant instances.
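  • The majority rule described above can be sketched as follows. This is a minimal illustration that assumes each arbitration data is a hashable value and that the preset number is N/2; the function name `select_target` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def select_target(arbitration_data: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by more than N/2 instances, or None if there is none."""
    n = len(arbitration_data)
    if n == 0:
        return None
    value, count = Counter(arbitration_data).most_common(1)[0]
    return value if count > n / 2 else None


# Five redundant instances, three of which agree on "x".
data = ["x", "x", "y", "x", "z"]
target = select_target(data)
# Indices of the instances whose arbitration data equals the target.
correct = [i for i, d in enumerate(data) if d == target]
```

  • When no value exceeds the preset number, the sketch returns `None`; how such a case is resolved is left to the preset judgment rule.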
  • the two arbitration data may be judged based on a preset judgment rule to determine the target arbitration data from the two arbitration data. For example, if the current output result in the arbitration data exceeds the preset data range, the arbitration data is determined to be non-target arbitration data, etc.
  • the preset judgment rule is not limited in the embodiment of the present application.
  • the erroneous redundant instance can be restored to the state of the correct redundant instance at the synchronization point based on the CKPT data of the correct redundant instance at the synchronization point, and the restored redundant instance can be executed from the synchronization point.
  • the i-th arbitration data is not the target arbitration data, that is, the i-th redundant instance is an erroneous redundant instance; after determining the target arbitration data from N arbitration data, the i-th redundant instance can be restored based on the CKPT data of the correct redundant instance at the synchronization point; and the i-th redundant instance after restoration is executed from the synchronization point.
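  • The restore step can be illustrated with a toy model in which the data state of an instance is a dictionary and the CKPT data is a deep copy of it. The class `RedundantInstance` and its methods are assumptions for illustration; a real checkpoint would capture process or OS state as described above.

```python
import copy


class RedundantInstance:
    """Toy stand-in for a redundant instance; `state` models its data state."""

    def __init__(self, name, state):
        self.name = name
        self.state = state

    def checkpoint(self):
        # Deep copy so that later execution cannot alter the saved CKPT data.
        return copy.deepcopy(self.state)

    def restore(self, ckpt):
        # Copy again so several erroneous instances can restore from the same CKPT data.
        self.state = copy.deepcopy(ckpt)


good = RedundantInstance("b", {"counter": 7, "buf": [1, 2, 3]})
bad = RedundantInstance("a", {"counter": 9, "buf": [1, 2, 99]})

ckpt = good.checkpoint()      # saved when the correct instance reached the synchronization point
good.state["counter"] += 1    # the correct instance keeps running past the synchronization point
bad.restore(ckpt)             # the erroneous instance is rolled back to the synchronization point
```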
  • the last redundant instance that runs to the synchronization point does not save the arbitration data or perform a checkpoint; instead, it retrieves the saved arbitration data from the memory, and then determines the target arbitration data from the N arbitration data.
  • the redundant instance corresponding to the target arbitration data is the redundant instance that is correct at the synchronization point.
  • the redundant instance that has saved the same arbitration data as the slowest instance before the slowest instance can use the CKPT data of the redundant instance for error correction and rollback.
  • the slowest instance can continue to execute after obtaining the arbitration data; or it can wait for the arbitration results of all members, perform error correction in the event of a failure, and then continue to execute, or continue to execute in the absence of a failure.
  • when the execution gap between the i-th redundant instance and the slowest redundant instance among the N redundant instances is greater than a preset gap, the execution of the i-th redundant instance can be suspended.
  • the number of synchronization points in the redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; when the preset storage space is full, the execution of the i-th redundant instance is suspended.
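  • The two suspension conditions above can be combined in a small check. The sketch assumes that the execution gap is measured in rounds of synchronization points and that the preset storage space holds a fixed number of slots; both units, and all parameter names, are assumptions for illustration.

```python
def should_suspend(instance_round, slowest_round, max_gap, used_slots, capacity):
    """Return True if the instance should pause before running further ahead."""
    # Condition 1: the instance is more than `max_gap` rounds ahead of the slowest one.
    too_far_ahead = instance_round - slowest_round > max_gap
    # Condition 2: the preset storage space for arbitration data is full.
    storage_full = used_slots >= capacity
    return too_far_ahead or storage_full
```

  • For example, with `max_gap=3` an instance at round 5 is suspended while the slowest instance is still at round 1, and any instance is suspended once all slots are in use.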
  • Figure 6 is a flow chart of another asynchronous arbitration method provided by an embodiment of the present application.
  • the method may include some or all of the following steps.
  • N is an integer greater than 1; and the redundant instance includes at least one synchronization point.
  • the N redundant instances are copied from one instance.
  • the electronic device inputs the same input to N redundant instances and executes the N redundant instances simultaneously.
  • the redundant instance includes multiple synchronization points, and at each synchronization point, the N redundant instances need to execute the following methods shown in S202 to S209.
  • Figure 7 is a schematic diagram of the execution of multiple redundant instances provided by an embodiment of the present application.
  • Figure 7 exemplarily represents the redundant instance with a straight line with an arrow, and exemplarily shows N redundant instances and M synchronization points, where N and M are both integers greater than 1.
  • N redundant instances begin to execute, and when any instance of the N redundant instances operates to synchronization point 1, the following step S202 is executed, and the steps in steps S203 to S209 below are executed based on the result of step S202, as shown in the flowchart of Figure 6. It should be understood that the arbitration error correction process of N redundant instances at M synchronization points is consistent with the arbitration error correction process of N redundant instances at synchronization point 1, and will not be repeated.
  • the electronic device can detect the execution status of other redundant instances of the N redundant instances to determine whether the redundant instance is the first redundant instance to run to the synchronization point. It should be understood that the embodiments of the present application do not limit the method for determining whether the redundant instance is the first to run to the synchronization point.
  • the electronic device can save the arbitration data of the redundant instance at the synchronization point; save the CKPT data of the redundant instance at the synchronization point; after saving the arbitration data and CKPT data, the electronic device continues to execute the part after the synchronization point of the redundant instance.
  • for the specific process, please refer to the detailed contents of S203 to S205.
  • the electronic device can determine whether the redundant instance is the last instance to run to the synchronization point, and then process it according to whether it is the last instance to run to the synchronization point. For the specific process, please refer to the details of S206.
  • the electronic device can record the key state data of the redundant instance based on the first recording rule, obtain the arbitration data of the redundant instance at the synchronization point, and then save the arbitration data.
  • the key state data includes the output of the redundant instance executed to the synchronization point.
  • the first recording rules of different instances may be different, that is, the contents corresponding to the key status data of different instances may be different, that is, the contents corresponding to the arbitration data may be different.
  • different instances may refer to instances with different functions or different inputs and outputs.
  • Figures 3 and 4 are different application scenarios, and the redundant instances in the two scenarios may be different instances.
  • the specific contents of the arbitration data in these two embodiments may be different.
  • the above-mentioned N redundant instances are the same instances, and the first recording rules corresponding to the above-mentioned N redundant instances are the same. However, since the above-mentioned N redundant instances may cause data errors due to certain factors during the execution process, the data of the same content recorded by the N redundant instances may be different.
  • the electronic device can checkpoint the redundant instance based on the second recording rule to obtain the CKPT data of the redundant instance at the synchronization point, and then save the CKPT data.
  • the CKPT data is used to restore any of the N redundant instances to the data state of the redundant instance at the synchronization point.
  • the second recording rules of different instances may be different, that is, the CKPT data corresponding to different instances may be different. It should be understood that the above N redundant instances are the same instance, and the second recording rules corresponding to the above N redundant instances are the same, but because the above N redundant instances may cause data errors due to certain factors during the execution process, the data of the same content recorded by the N redundant instances may be different.
  • the redundant instance on the left performs arbitration error correction on the right when it runs to the synchronization point, and returns to the synchronization point after the arbitration error correction is completed to continue executing the contents below the redundant instance.
  • the electronic device can detect the execution status of other redundant instances of the N redundant instances when the redundant instance runs to the synchronization point to determine whether the redundant instance is the last redundant instance to run to the synchronization point. It should be understood that the embodiments of the present application do not limit the method for determining whether the redundant instance is the last to run to the synchronization point.
  • the electronic device conducts full arbitration on the arbitration data of the redundant instance and the arbitration data saved at the synchronization point to determine whether the N arbitration data pass the full arbitration.
  • if the full arbitration passes, there is no need to perform error correction rollback (i.e., there is no need to execute S209).
  • if the full arbitration fails, it is necessary to perform error correction rollback (i.e., it is necessary to execute S209).
  • for the specific process, please refer to the relevant content of step S208.
  • the electronic device pre-arbitrates the arbitration data of the redundant instance with the arbitration data saved at the synchronization point to determine whether the CKPT data of the redundant instance needs to be saved.
  • if the pre-arbitration passes, the CKPT data of the redundant instance is not saved.
  • if the pre-arbitration fails, the CKPT data of the redundant instance is saved.
  • for the specific process, please refer to the relevant content of step S207.
  • S207 Determine whether the redundant instance pre-arbitration passes; if so, execute S205; if not, execute S204 and S205 in sequence.
  • the electronic device can determine whether the pre-arbitration of the redundant instance has passed based on the saved arbitration data of the synchronization point and the arbitration data of the redundant instance; when the saved arbitration data contains an arbitration data that is the same as the arbitration data of the redundant instance, it is determined that the pre-arbitration of the redundant instance has passed, that is, there is no need to save the CKPT data of the redundant instance; when none of the saved arbitration data is the same as the arbitration data of the redundant instance, it is determined that the pre-arbitration of the redundant instance has not passed, that is, it is necessary to save the CKPT data of the redundant instance.
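  • The pre-arbitration decision can be sketched as a simple membership test. The function names `pre_arbitrate` and `needs_checkpoint` are hypothetical, and arbitration data is assumed to support equality comparison.

```python
def pre_arbitrate(own_data, saved_data):
    """Pass if any arbitration data already saved at this synchronization point equals own_data."""
    return any(own_data == d for d in saved_data)


def needs_checkpoint(own_data, saved_data):
    # A failed pre-arbitration means no earlier instance reached this state,
    # so CKPT data for this state has not been saved yet.
    return not pre_arbitrate(own_data, saved_data)
```

  • With saved data `[1, 2]`, an instance holding `2` passes and skips the checkpoint, while an instance holding `3` fails and must save its CKPT data.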
  • S208 Determine whether the saved arbitration data of the synchronization point and the arbitration data of the redundant instance pass full arbitration. If so, execute S205; if not, execute S209.
  • the arbitration data of the synchronization point and the arbitration data of the redundant instance saved are the N arbitration data of the N redundant instances at the synchronization point.
  • when the above-mentioned N arbitration data are all the same, the electronic device determines that the N arbitration data pass the full arbitration, and there is no need to perform error correction rollback (that is, there is no need to execute S209); when there are two different arbitration data among the above-mentioned N arbitration data, it is determined that the N arbitration data do not pass the full arbitration, and error correction rollback is required (that is, S209 needs to be executed).
  • Figure 8A is a schematic diagram of N redundant instances provided by an embodiment of the present application when there is no failure at the synchronization point.
  • Figure 8A exemplarily takes 5 redundant instances as an example for explanation, the straight lines with arrows represent redundant instances, the direction of the arrows represents the direction of the time axis, and the 5 redundant instances are instance a, instance b, instance c, instance d, and instance e; the gray rectangular blocks are used to represent synchronization points, and synchronization point 1 and synchronization point 2 are exemplarily shown in Figure 8A.
  • instance a is the first instance to run to synchronization point 1, and the electronic device saves the arbitration data of instance a.
  • the electronic device checkpoints instance a and saves the CKPT data of instance a; instance b, instance d and instance e are not the first redundant instances to run to synchronization point 1, nor are they the last redundant instances to run to synchronization point 1, so the electronic device pre-arbitrates instance b, instance d and instance e.
  • since the arbitration data of instance b, instance d and instance e are the same as the arbitration data of instance a, instance b, instance d and instance e all pass the pre-arbitration and do not need to save the CKPT data; instance c is the last redundant instance to run to synchronization point 1, so the arbitration data of the five redundant instances at synchronization point 1 are compared. Since the arbitration data of the five redundant instances are the same, the five redundant instances pass the full arbitration, that is, none of the five redundant instances has a fault at synchronization point 1.
  • S209 Determine the correct instance from the N redundant instances; perform error correction and rollback on the erroneous instance based on the CKPT data of the correct instance.
  • the electronic device determines the correct instance from the N redundant instances based on the N arbitration data; and performs error correction rollback on the erroneous instance based on the CKPT data of the correct instance.
  • the electronic device may determine the arbitration data value shared by the largest number of redundant instances among the N arbitration data as the target arbitration data; or determine an arbitration data value shared by more than a preset number of redundant instances among the N arbitration data as the target arbitration data; and then determine the redundant instances corresponding to the target arbitration data as the correct instances.
  • Figure 8B is a schematic diagram of N redundant instances provided by an embodiment of the present application when there is a fault at a synchronization point.
  • Figure 8B exemplarily uses 5 redundant instances as an example for explanation, the straight lines with arrows represent redundant instances, the direction of the arrows represents the direction of the time axis, and the 5 redundant instances are instance a, instance b, instance c, instance d, and instance e; the gray rectangular blocks are used to represent synchronization points without faults, and the oblique rectangular blocks are used to represent synchronization points with faults.
  • Synchronization point 1 and synchronization point 2 are exemplarily shown in Figure 8B.
  • instance a is the first instance to run to synchronization point 1, and the electronic device saves the arbitration data of instance a.
  • the electronic device checkpoints instance a and saves the CKPT data of instance a.
  • Instance b, instance d, and instance e are not the first redundant instances to run to synchronization point 1, nor are they the last redundant instances to run to synchronization point 1.
  • the electronic device performs pre-arbitration on instance b, instance d, and instance e. Since the arbitration data of instance b is different from the arbitration data of instance a, instance b does not pass the pre-arbitration.
  • the electronic device checkpoints instance b and saves the CKPT data of instance b. Since the arbitration data of instance d and instance e are the same as the saved arbitration data of instance b, instance d and instance e both pass the pre-arbitration and do not need to save the CKPT data.
  • Instance c is the last redundant instance to run to synchronization point 1, so the arbitration data of the five redundant instances at synchronization point 1 are compared. The arbitration data of instance a and instance c (referred to as arbitration data 1) are the same, the arbitration data of instance b, instance d and instance e (referred to as arbitration data 2) are the same, arbitration data 1 and arbitration data 2 are different, and arbitration data 2 is held by more instances than arbitration data 1. The electronic device can therefore determine that the five redundant instances do not pass the full arbitration, that is, a fault has occurred among the five redundant instances at synchronization point 1: arbitration data 2 is the correct arbitration data (i.e., target arbitration data), instance a and instance c are the instances that failed at synchronization point 1, and instance b, instance d and instance e are the instances that are correct at synchronization point 1.
  • the electronic device can roll back instance a and instance c to the data state of instance b at synchronization point 1 based on the CKPT data of instance b at synchronization point 1.
  • Figure 8B exemplarily shows that instance a is located at synchronization point 2 during rollback. In other examples, instance a may run to other locations, which is not limited here.
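  • The Figure 8B walk-through can be reproduced with a small simulation of one synchronization point. This is a sketch under stated assumptions: arbitration data is reduced to a single value per instance, a strict majority of instances is correct, and at least one correct instance arrives before the last one. The function name `run_sync_point` and the values `1` and `2` (standing for arbitration data 1 and arbitration data 2) are hypothetical.

```python
from collections import Counter


def run_sync_point(arrivals):
    """arrivals: list of (name, arbitration_data) in order of arrival.

    Returns (instances that saved CKPT data, erroneous instances, CKPT source for rollback).
    """
    saved = {}         # name -> arbitration data saved by earlier arrivals
    checkpointed = []  # names of instances whose CKPT data was saved
    for name, data in arrivals[:-1]:
        # First arrival: nothing is saved yet, so it always checkpoints.
        # Later arrivals: pre-arbitration fails if no saved data matches.
        if data not in saved.values():
            checkpointed.append(name)
        saved[name] = data
    # The last arrival saves nothing; it triggers full arbitration.
    last_name, last_data = arrivals[-1]
    all_data = dict(saved)
    all_data[last_name] = last_data
    target, count = Counter(all_data.values()).most_common(1)[0]
    if count == len(all_data):
        return checkpointed, [], None      # full arbitration passed
    erroneous = [n for n, d in all_data.items() if d != target]
    # Roll back using the CKPT data of the first checkpointed instance that holds the target.
    source = next(n for n in checkpointed if all_data[n] == target)
    return checkpointed, erroneous, source


# Arrival order a, b, d, e, c; a and c hold arbitration data 1, the others hold arbitration data 2.
ckpts, bad, src = run_sync_point(
    [("a", 1), ("b", 2), ("d", 2), ("e", 2), ("c", 1)])
```

  • In this run, instances a and b save CKPT data, instances a and c are identified as erroneous, and the CKPT data of instance b is the source for rollback, which matches the description of Figure 8B.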
  • The asynchronous arbitration method of Figures 5 and 6 is described in detail below by introducing embodiments corresponding to the above two application scenarios.
  • the asynchronous arbitration method is applied to an asynchronous arbitration device of multiple OS redundant instances, the asynchronous arbitration device comprises multiple core processors and at least one asynchronous arbitrator, wherein each core processor is used to execute an OS redundant instance, and the OS redundant instance is a redundant instance.
  • Figure 9 is a flow chart of another asynchronous arbitration method provided by an embodiment of the present application.
  • the method may include some or all of the following steps.
  • S301 Multiple core processors respectively execute redundant instances of the OS.
  • multiple core processors may be located in the same central processing unit (CPU).
  • the core processor executing S302 may be a core processor whose OS redundant instance is not the slowest OS redundant instance, that is, the core processor may determine whether the OS redundant instance it executes is the last OS redundant instance that runs to the synchronization point. If so, S302 is not executed; if not, S302 is executed.
  • the synchronization point may be when the OS outputs the service execution result.
  • the core processor may store the key status data of the current synchronization point into a shared memory as arbitration data for the synchronization point, wherein the key status data includes but is not limited to business input parameters, OS result output, etc.
  • Figure 10 is a schematic diagram of managing arbitration data provided by an embodiment of the present application.
  • the straight line with an arrow is used to represent an OS redundant instance, and the figure shows 3 OS redundant instances, namely OS1, OS2 and OS3;
  • the rectangular block is used to represent the shared memory, and the solid line with an arrow is used to indicate the storage of arbitration data;
  • the rectangle with data is used to represent the synchronization point, and the figure shows synchronization point 1 and synchronization point 2.
  • at synchronization point 1, the core processor executing OS1 stores the arbitration data of OS1 in the shared memory, and the core processor executing OS2 stores the arbitration data of OS2 in the shared memory
  • at synchronization point 2, the core processor executing OS2 stores the arbitration data of OS2 in the shared memory, and the core processor executing OS3 stores the arbitration data of OS3 in the shared memory.
  • the core processor pre-arbitrates the OS redundant instance and performs CKPT as needed.
  • the OS redundant instance is not the slowest OS redundant instance.
  • the core processor may perform pre-arbitration on the OS redundant instance after saving the arbitration data of the OS redundant instance at the synchronization point.
  • pre-arbitration can be to compare the arbitration data of the OS redundant instance with the arbitration data of the current round of CKPT instance one by one. If there is no arbitration data of any CKPT instance that is consistent with the arbitration data of the current OS, CKPT is performed.
  • the current round of CKPT instance is an instance that performs CKPT at this synchronization point. It should be noted that for the convenience of description, the instance that has undergone CKPT is referred to as a CKPT instance.
  • the CKPT method includes but is not limited to recording complete OS status data (including memory, device, and register status), etc.
  • the worst case supported by N-module redundancy is N/2-1 erroneous instances, so the first N/2+1 instances to arrive must include at least one instance in the correct state. Therefore, to further reduce the CKPT cost, only the first N/2+1 instances to arrive among the N redundant instances need to perform the pre-arbitration + CKPT process. After completing arbitration data saving, pre-arbitration, and CKPT, the OS redundant instance can continue to execute without waiting.
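  • The counting argument behind this optimization can be checked by brute force. The sketch below assumes integer division for N/2 and assumes only that a strict majority of instances is correct (at most (N-1)//2 faulty), which covers the N/2-1 bound stated above; the function names are hypothetical.

```python
from itertools import combinations


def quota(n):
    """Number of earliest arrivals that run pre-arbitration + CKPT (N/2+1, integer division)."""
    return n // 2 + 1


def quota_always_contains_correct(n):
    """Check every placement of faulty instances among the n arrival positions."""
    max_faulty = (n - 1) // 2          # assumption: a strict majority is correct
    first = range(quota(n))            # arrival positions covered by the quota
    for k in range(max_faulty + 1):
        for faulty in combinations(range(n), k):
            if all(i in faulty for i in first):
                return False           # all quota positions faulty: no correct CKPT data
    return True
```

  • For N = 5 the quota is 3 arrivals, and no placement of up to 2 faulty instances can make all 3 of them faulty, so CKPT data of a correct instance is always available.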
  • the first OS redundant instance does not need pre-arbitration and directly performs CKPT. That is, the core processor can determine whether the OS redundant instance it executes is the first OS redundant instance that runs to the synchronization point. If so, it directly performs CKPT. If not, it performs pre-arbitration and then determines whether to perform CKPT based on the result of pre-arbitration.
  • S304 The core processor saves the CKPT data.
  • S305 The asynchronous arbitrator performs full arbitration based on the arbitration data of all OS redundant instances to determine the correct instance and the incorrect instance.
  • the asynchronous arbitrator can fetch the arbitration data of all OS redundant instances in this round from the shared memory and perform a full comparison. As shown in Figure 10, the dotted line with an arrow is used to represent the reading of arbitration data.
  • for example, when OS3 is the last to run to synchronization point 1, the asynchronous arbitrator can fetch the arbitration data saved in this round, that is, the arbitration data of OS1 and OS2, from the shared memory, and perform full arbitration on the arbitration data of OS1, OS2, and OS3; when OS1 is the last to run to synchronization point 2, the asynchronous arbitrator can fetch the arbitration data saved in this round, that is, the arbitration data of OS2 and OS3, from the shared memory, and perform full arbitration on the arbitration data of OS1, OS2, and OS3.
  • the redundant instances that have reached consensus are considered correct and enter the next round of synchronization points; while the instances that have not reached consensus on the arbitration data are considered errors and need to be rolled back.
  • S306 The asynchronous arbitrator sends an error correction message to the core processor corresponding to the error instance, where the message includes an indication message for indicating a correct instance.
  • the core processor corresponding to the erroneous instance performs error correction rollback based on the CKPT data of the correct instance or the status data of the slowest instance.
  • when the correct instance is a CKPT instance, the state data (such as memory, device, register state, etc.) of the erroneous instance is restored to the state of the correct CKPT instance
  • when the correct instance is the slowest OS redundant instance, the state data of the erroneous instance is restored to the state data of the slowest OS redundant instance.
  • the asynchronous arbitration device reclaims the memory consumed in this round after all OS redundant instances have successfully passed a round of synchronization points, where the memory includes reserved memory for storing arbitration data (the shared memory in Figure 10 can also be called Shared Memory) and the memory occupied by CKPT data.
  • the core processor deletes from the memory the arbitration data and CKPT data saved by the OS redundant instances executed by it.
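  • Per-round reclamation can be sketched with a small bookkeeping structure that frees the arbitration data and CKPT data of a round once every instance has passed it. The class `RoundStore` and its methods are assumptions for illustration and do not describe the actual memory layout.

```python
class RoundStore:
    """Hypothetical per-round store for arbitration data and CKPT data."""

    def __init__(self, n_instances):
        self.n = n_instances
        self.arbitration = {}   # round -> {instance: arbitration data}
        self.ckpt = {}          # round -> {instance: CKPT data}
        self.passed = {}        # round -> set of instances that passed the round

    def save(self, rnd, instance, data, ckpt=None):
        self.arbitration.setdefault(rnd, {})[instance] = data
        if ckpt is not None:
            self.ckpt.setdefault(rnd, {})[instance] = ckpt

    def mark_passed(self, rnd, instance):
        """Record that `instance` passed round `rnd`; reclaim the round when all have passed."""
        done = self.passed.setdefault(rnd, set())
        done.add(instance)
        if len(done) == self.n:
            # Every instance passed: nothing from this round is needed for rollback.
            self.arbitration.pop(rnd, None)
            self.ckpt.pop(rnd, None)
            del self.passed[rnd]
            return True
        return False
```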
  • the method is applied to user-mode multi-mode execution, for example, processes APP1-APPM are redundant instances, M is an integer greater than 1, and the relevant introduction of the application scenario shown in Figure 4 can be referred to.
  • when an APP redundant instance reaches a specified synchronization point, such as a system call, it can be intercepted at the entry and exit of the corresponding C library function, and the asynchronous arbitration method is performed before and after trapping into the kernel to execute the system call, which can eliminate busy waiting at the synchronization point.
  • Figure 11 is a schematic diagram of another asynchronous arbitration method provided by an embodiment of the present application.
  • the method may include some or all of the following steps:
  • the slowest instance may not save its arbitration data.
  • the current key state is stored in the shared memory.
  • the current key state is the arbitration data, which includes but is not limited to the system call number, input and output parameters, etc.
  • the shared memory includes but is not limited to the Ring Buffer, etc.
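  • Interception at the entry and exit of a library function can be illustrated with a wrapper that records arbitration data on both sides of the call and then continues without waiting. In this sketch a Python decorator stands in for the C library hook and a bounded `collections.deque` stands in for the Ring Buffer; the names `intercepted` and `fake_write` are hypothetical.

```python
import functools
from collections import deque


def intercepted(ring):
    """Wrap a call so that its entry and exit each record arbitration data in `ring`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            ring.append(("entry", func.__name__, args))    # synchronization point before the call
            result = func(*args)
            ring.append(("exit", func.__name__, result))   # synchronization point after the call
            return result                                  # returns without waiting for other instances
        return wrapper
    return decorate


# Note: a deque with maxlen silently drops the oldest entry when full,
# whereas the method described here would suspend the instance instead.
ring = deque(maxlen=8)


@intercepted(ring)
def fake_write(fd, data):
    # Placeholder for a real system call; returns the number of bytes "written".
    return len(data)


n = fake_write(1, b"hello")
```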
  • Figure 12 is another schematic diagram of managing arbitration data provided by an embodiment of the present application.
  • the straight line with an arrow is used to represent an APP redundancy instance.
  • the figure shows three APP redundancy instances, namely APP1, APP2 and APP3; the ring is used to represent the ring buffer (Ring Buffer), and the solid line with an arrow is used to indicate the storage of arbitration data; the rectangle with data is used to represent the synchronization point, and the figure shows synchronization point 1 and synchronization point 2.
  • S402 Pre-arbitrate the APP redundant instances and perform CKPT as needed.
  • pre-arbitration is performed after the APP redundancy instance saves the arbitration data.
  • pre-arbitration can specifically be to compare the arbitration data of the APP redundant instance with the arbitration data of each CKPT instance of the current round one by one; if no CKPT instance has arbitration data consistent with that of the APP redundant instance, CKPT is performed.
  • the method of performing CKPT includes but is not limited to blocking the original APP, copying (fork) and blocking the sub-APP, recording the complete status data of the APP, etc.
  • the worst case supported by N-module redundancy is N/2-1 faulty instances, that is, the first half of the instances to arrive must include a correct state. Therefore, to further reduce the CKPT cost, only the first half of the instances (N/2+1) in the N-module redundancy need to go through the pre-arbitration + CKPT process.
  • after completing at least one of arbitration data saving, pre-arbitration and CKPT, the APP redundant instance can continue to execute without busy-waiting. The specific process by which the APP redundant instance performs arbitration data saving, pre-arbitration and CKPT is described in the relevant content above and is not repeated here.
  • the first APP redundant instance (ie, the first instance running to the synchronization point) may not perform pre-arbitration but directly perform CKPT.
  • the slowest APP redundancy instance can first retrieve the arbitration data of all APP redundancy instances in this round from the shared memory (such as the Ring Buffer in Figure 12) and perform a comprehensive comparison.
  • the dotted line with an arrow is used to represent reading arbitration data.
  • when APP3 reaches synchronization point 1, the arbitration data of all APP redundant instances saved in this round, that is, the arbitration data of APP1 and APP2, can be taken out of the Ring Buffer, and the arbitration data of APP1, APP2 and APP3 can be fully arbitrated.
  • when APP1 reaches synchronization point 2, the arbitration data of all APP redundant instances saved in this round, that is, the arbitration data of APP2 and APP3, can be taken out of the shared memory, and the arbitration data of APP1, APP2 and APP3 can be fully arbitrated.
  • the redundant instances that have reached a consensus are considered correct and enter the next round of synchronization points; while the instances that have not reached a consensus on the arbitration data are considered wrong and need to be rolled back.
  • one of the correct states can be selected from the CKPT instance and the slowest instance, and the wrong APP redundant instance can be rolled back to the correct state.
  • if the correct state exists in a CKPT instance, and assuming the CKPT method is to fork and block a sub-APP, the faulty instance is terminated and replaced by the sub-APP that the CKPT instance is keeping blocked; if the correct state exists in the slowest instance, the faulty instance is terminated, the slowest APP redundant instance is forked, and its sub-APP takes over from the original faulty APP. After the rollback and error correction is completed, the faulty instance continues to execute until the current round of synchronization points is completed and the next round is entered.
  • S404 Reclaim the storage resources storing the arbitration data and CKPT data.
  • the reserved memory consumed by this round for storing arbitration data (such as in the Ring Buffer implementation in Figure 12, the read pointer exceeds the data range of this round) and the memory occupied by CKPT data are recovered, that is, the storage space for storing arbitration data and CKPT status data is released for use by subsequent synchronization points.
  • FIG. 13 is a schematic diagram of the structure of an asynchronous arbitration device 130 provided in an embodiment of the present application.
  • the device 130 may be an electronic device.
  • the device 130 may also be a device in an electronic device, such as a chip or an integrated circuit, etc.
  • the device 130 may include an execution unit 1301, a determining unit 1302, a saving unit 1303, a recovery unit 1304, and a pausing unit 1305.
  • the asynchronous arbitration device 130 is used to implement the asynchronous arbitration method of any of the aforementioned embodiments.
  • the device includes:
  • An execution unit 1301 is used to start executing N redundant instances, where N is an integer greater than 1; and i is a positive integer not greater than N;
  • the execution unit 1301 is specifically configured to: stop executing the i-th redundant instance when the i-th redundant instance runs to a synchronization point, record the current output result of the i-th redundant instance, and obtain the i-th arbitration data; save the i-th arbitration data; after saving the i-th arbitration data, execute the part after the synchronization point of the i-th redundant instance;
  • the determining unit 1302 is configured to determine target arbitration data from the N arbitration data after obtaining the N arbitration data, wherein the redundancy instance corresponding to the target arbitration data is a correct redundancy instance at the synchronization point.
  • the device further includes a saving unit 1303;
  • the saving unit 1303 is used to save the checkpoint CKPT data of the i-th redundant instance at the synchronization point, and the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
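  • the device further includes a recovery unit 1304; the recovery unit 1304 is configured to:
  • recover the i-th redundant instance based on the CKPT data of a correct redundant instance at the synchronization point, and execute the recovered i-th redundant instance from the synchronization point.
  • the saving unit 1303 is further configured to:
  • not save the CKPT data of the i-th redundant instance when the arbitration data of the i-th redundant instance is identified as being the same as saved arbitration data.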
  • the i-th redundant instance is an instance other than the last instance among the N redundant instances that runs to the synchronization point.
  • the determining unit 1302 is specifically configured to:
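  • determine the arbitration data that occurs most often among the N arbitration data as the target arbitration data; or determine arbitration data whose count among the N arbitration data is greater than a preset number as the target arbitration data.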
  • the execution unit 1301 is specifically configured to:
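  • save the i-th arbitration data when the i-th redundant instance is not the last instance to run to the synchronization point;
  • perform the step of determining the target arbitration data from the N arbitration data when the arbitration data of the last redundant instance to run to the synchronization point is obtained.
  • the device further includes a pausing unit 1305, wherein the pausing unit 1305 is configured to suspend the execution of the i-th redundant instance when the execution gap between the i-th redundant instance and the slowest of the N redundant instances is greater than a preset gap.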
  • the number of synchronization points in the redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; the device further includes a deletion unit, the deletion unit being configured to delete the arbitration data of the synchronization point;
  • the pausing unit 1305 is specifically configured to suspend the execution of the i-th redundant instance when the preset storage space is full.
  • the redundant instance is an application instance
  • the execution unit 1301 is specifically configured to:
  • stop executing the i-th application instance when it runs to a library function call or when the library function returns data,
  • and record at least one of the current system call number, input parameters and output parameters of the i-th application instance to obtain the i-th arbitration data.
  • each unit may also correspond to the corresponding description of the embodiments shown in Figures 5, 6, 9 and 11.
  • the asynchronous arbitration device 130 may be the electronic device mentioned above.
  • each unit corresponds to its own program code (or program instruction), and when the program codes corresponding to these units are run on the processor, the unit is controlled by the processor to execute the corresponding process to achieve the corresponding function.
  • An embodiment of the present application also provides an electronic device, which includes one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, and the one or more memories are used to store computer program codes, and the computer program codes include computer instructions, and when the one or more processors execute the computer instructions, the electronic device executes the method described in the above embodiment.
  • the embodiments of the present application also provide a computer program product including instructions.
  • when the computer program product is executed on an electronic device, the electronic device executes the method described in the above embodiments.
  • An embodiment of the present application also provides a computer-readable storage medium, including instructions, which, when executed on an electronic device, enable the electronic device to execute the method described in the above embodiment.
  • all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the present application are generated.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more available media.
  • the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk).
  • the processes can be completed by computer programs to instruct related hardware, and the programs can be stored in computer-readable storage media.
  • the programs can include the processes of the above-mentioned method embodiments.
  • the aforementioned storage media include read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

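The present application provides an asynchronous arbitration method and device. The method includes: starting to execute N redundant instances, where N is an integer greater than 1; when the i-th redundant instance runs to a synchronization point, stopping execution of the i-th redundant instance and recording its current output result to obtain the i-th arbitration data, where i is a positive integer not greater than N; saving the i-th arbitration data; after saving the i-th arbitration data, executing the part of the i-th redundant instance after the synchronization point; and after N arbitration data are obtained, determining target arbitration data from the N arbitration data, where the redundant instances corresponding to the target arbitration data are the redundant instances that are correct at the synchronization point. In this method, redundant instances do not need to busy-wait at the synchronization point, which can improve CPU utilization.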

Description

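Asynchronous arbitration method and device
This application claims priority to Chinese Patent Application No. 202310679561.8, filed with the China National Intellectual Property Administration on June 8, 2023 and entitled "Asynchronous arbitration method and device", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present application relate to electronic technology, and in particular to an asynchronous arbitration method and device.
Background
With the development of multi-module redundancy technology, it is increasingly widely used in aerospace, satellites, distributed systems, high performance computing (HPC) and other fields, for example in database storage, scientific computing, weather forecasting, military research and gene sequencing.
At present, to arbitrate on the same state in multi-module redundancy, instances usually busy-wait. For example, the triple-module redundancy OS technology commonly used in industry performs synchronous arbitration through an arbiter; synchronization is done by blocking and busy-waiting, so that the states of all redundant instances are aligned at the synchronization point.
However, this synchronization approach has many shortcomings. The most significant pain point is that redundant instances execute at different speeds: busy-waiting leads to low central processing unit (CPU) utilization, and the single slowest instance becomes the bottleneck for overall software and hardware efficiency.
Summary
The present application provides an asynchronous arbitration method and device. When a redundant instance runs to a synchronization point, the method saves its arbitration data at that synchronization point and, after saving, continues executing the part of the redundant instance after the synchronization point; finally, the arbitration data of all redundant instances is arbitrated. In this method, redundant instances do not need to busy-wait at the synchronization point, which can improve CPU utilization.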
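In a first aspect, an embodiment of the present application provides an asynchronous arbitration method, the method including:
starting to execute N redundant instances, where N is an integer greater than 1;
when the i-th redundant instance runs to a synchronization point, stopping execution of the i-th redundant instance and recording the current output result of the i-th redundant instance to obtain the i-th arbitration data, where i is a positive integer not greater than N;
saving the i-th arbitration data;
after saving the i-th arbitration data, executing the part of the i-th redundant instance after the synchronization point;
after N arbitration data are obtained, determining target arbitration data from the N arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
The N redundant instances are the N instances in N-module redundancy; the N redundant instances may be obtained by copying the same instance.
An instance is a functional module that includes a running object and target data. An instance may include a running object and the target data involved during running (such as input and running data); the running object may be a software module or a hardware module. An instance may start executing the running object after receiving input, and the running object may call the running data while it runs.
In N-module redundancy, the running object of the N redundant instances may also be called the redundant object; the N redundant instances have the same redundant object, receive the same input, and call the same running data while running.
The i-th arbitration data indicates how the i-th instance has run up to the current position; the arbitration data includes at least the output result at the current position.
In the embodiments of the present application, by continuing to execute a redundant instance after its arbitration data is saved, the redundant instances can execute asynchronously, which effectively eliminates synchronous busy-waiting by the faster instances and improves the overall CPU utilization of the program. A faster instance is one of the N redundant instances that reaches the synchronization point earlier.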
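The flow of the first aspect can be sketched roughly as follows. This is a minimal single-threaded illustration, not the implementation described in the application: the class name `SyncPointArbiter`, the method `arrive` and the use of a plurality vote are all assumptions made for the example. Every arrival records its output and returns immediately; only the last arrival triggers arbitration over all N records.

```python
from collections import Counter


class SyncPointArbiter:
    """Hypothetical helper: collects arbitration data for one synchronization point."""

    def __init__(self, n):
        self.n = n
        self.saved = {}  # instance id -> arbitration data (the recorded output)

    def arrive(self, instance_id, output):
        """Save the output. Return None unless this arrival completes the round."""
        self.saved[instance_id] = output
        if len(self.saved) < self.n:
            # Earlier arrivals continue past the sync point; no busy-waiting.
            return None
        # Last arrival: arbitrate over all N records (plurality vote, an assumption).
        target, _ = Counter(self.saved.values()).most_common(1)[0]
        correct = sorted(i for i, v in self.saved.items() if v == target)
        return target, correct


arbiter = SyncPointArbiter(n=3)
first = arbiter.arrive("r1", 42)    # fastest instance, continues at once
second = arbiter.arrive("r3", 42)   # continues at once
result = arbiter.arrive("r2", 41)   # slowest instance triggers arbitration
```

In this sketch the outputs must be hashable so that they can be counted; a real system would compare whatever state it chooses to record as arbitration data.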
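With reference to the first aspect, in a possible implementation, after saving the i-th arbitration data and before executing the part of the i-th redundant instance after the synchronization point, the method further includes:
saving checkpoint (CKPT) data of the i-th redundant instance at the synchronization point, where the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
In the embodiments of the present application, a redundant instance may also be checkpointed, so that the CKPT data needed to return to the state at the current synchronization point is saved; a faulty redundant instance can later be corrected based on correct CKPT data, so that the number of redundant instances does not decrease, which effectively improves the reliability and availability of multi-module redundancy. In this method, the CKPT data guarantees that a state to restore to exists for asynchronous rollback: when full arbitration identifies a fault, the error can be corrected asynchronously whether it occurred in the slowest instance or in a faster one.
With reference to the first aspect, in a possible implementation, the i-th arbitration data is not the target arbitration data; after determining the target arbitration data from the N arbitration data, the method further includes:
restoring the i-th redundant instance based on the CKPT data of a correct redundant instance at the synchronization point;
executing the restored i-th redundant instance from the synchronization point.
In the embodiments of the present application, a faulty redundant instance can be corrected based on the CKPT data of a correct redundant instance at the synchronization point, so that the number of redundant instances does not decrease, which effectively improves the reliability and availability of multi-module redundancy.
With reference to the first aspect, in a possible implementation, saving the CKPT data of the i-th redundant instance at the synchronization point includes:
not saving the CKPT data of the i-th redundant instance when the arbitration data of the i-th redundant instance is identified as being the same as saved arbitration data.
In the embodiments of the present application, it is first identified whether the arbitration data of a redundant instance is the same as saved arbitration data (the pre-arbitration technique); for instances with the same arbitration data, only one set of CKPT data needs to be saved. This reduces the number of CKPT operations, reduces power consumption and improves CPU utilization.
With reference to the first aspect, in a possible implementation, the i-th redundant instance is an instance among the N redundant instances other than the last one to run to the synchronization point.
In the embodiments of the present application, the last redundant instance need not be checkpointed, and its CKPT data need not be saved.
In this method, pre-arbitration guarantees that at least one member among the CKPT instances and the slowest instance is in a correct state, while avoiding unnecessary CKPT operations. A CKPT instance is an instance that has been checkpointed (that is, one whose CKPT data has been saved); the slowest instance is the last one to run to the synchronization point. For example, in a fault-free scenario, one round of synchronization requires only one CKPT. In the usual worst-case scenario, N/2-1 instances are faulty, so the N instances need only N/2-1 CKPT operations in one round of synchronization, which achieves the theoretically optimal CKPT cost.
In a possible implementation, when the i-th redundant instance is the last of the N redundant instances to run to the synchronization point, the operation of identifying whether its arbitration data is the same as saved arbitration data is not performed; instead, full arbitration (that is, determining the target arbitration data), error-correcting rollback and related operations are performed directly on all arbitration data, and the redundant instance is executed after those operations finish. This reduces the number of pre-arbitrations.
With reference to the first aspect, in a possible implementation, determining the target arbitration data from the N arbitration data includes:
determining the arbitration data that occurs most often among the N arbitration data as the target arbitration data;
or determining arbitration data whose count among the N arbitration data is greater than a preset number as the target arbitration data.
In the embodiments of the present application, determining the arbitration data with the highest count, or with a count greater than the preset number, as the target arbitration data ensures the availability and reliability of the correct redundant instances.
With reference to the first aspect, in a possible implementation, saving the i-th arbitration data includes:
saving the i-th arbitration data when the i-th redundant instance is not the last instance to run to the synchronization point;
the method further includes:
performing the step of determining the target arbitration data from the N arbitration data when the arbitration data of the last redundant instance to run to the synchronization point is obtained.
In the embodiments of the present application, the arbitration data of the slowest instance is not saved, which reduces shared memory consumption and communication overhead.
With reference to the first aspect, in a possible implementation, the method further includes:
suspending execution of the i-th redundant instance when the execution gap between the i-th redundant instance and the slowest of the N redundant instances is greater than a preset gap.
In the embodiments of the present application, the execution gap between redundant instances can be controlled, which prevents the instances from drifting far out of step.
With reference to the first aspect, in a possible implementation, the number of synchronization points in a redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; after determining the target arbitration data from the N arbitration data, the method further includes: deleting the arbitration data of the synchronization point;
suspending execution of the i-th redundant instance includes: suspending execution of the i-th redundant instance when the preset storage space is full.
In the embodiments of the present application, the shared memory can be configured with a fixed size, so that the size of the shared memory directly determines the execution gap allowed between instances, which prevents the redundant instances from drifting far out of step.
With reference to the first aspect, in a possible implementation, the redundant instance is an application instance, and stopping execution of the i-th redundant instance when it runs to the synchronization point and recording its current output result to obtain the i-th arbitration data includes:
stopping execution of the i-th application instance when it runs to a library function call or when the library function returns data, and recording at least one of the current system call number, input parameters and output parameters of the i-th application instance to obtain the i-th arbitration data.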
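In a second aspect, an embodiment of the present application provides an asynchronous arbitration device, the device including N processors and an arbiter;
each of the N processors is configured to: execute an instance; when the instance runs to a synchronization point, output the arbitration data of the synchronization point to the arbiter, where the arbitration data includes the current output result; and after outputting the arbitration data of the synchronization point, execute the processing steps after the synchronization point;
the arbiter is configured to save arbitration data when it is received from a processor, and to determine target arbitration data from the N arbitration data when N arbitration data have been received.
With reference to the second aspect, in a possible implementation, the processor is further configured to send the CKPT data of the synchronization point to the arbiter when processing reaches the synchronization point;
the arbiter is further configured to send the CKPT data corresponding to the target arbitration data to the processors corresponding to non-target arbitration data;
a processor corresponding to non-target arbitration data is further configured to restore the data state at the synchronization point based on the CKPT data corresponding to the target arbitration data.
In a third aspect, an embodiment of the present application provides an asynchronous arbitration device, the device including:
an execution unit, configured to execute N redundant instances for the same input, where N is an integer greater than 1 and i is a positive integer not greater than N;
the execution unit is further configured to: when the i-th redundant instance runs to a synchronization point, stop executing the i-th redundant instance and record its current output result to obtain the i-th arbitration data; save the i-th arbitration data; and after saving the i-th arbitration data, execute the part of the i-th redundant instance after the synchronization point;
a determining unit, configured to determine target arbitration data from the N arbitration data after the N arbitration data are obtained, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
With reference to the third aspect, in a possible implementation, the device further includes a saving unit;
the saving unit is configured to save the CKPT data of the i-th redundant instance at the synchronization point, where the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
With reference to the third aspect, in a possible implementation, the device further includes a recovery unit; the recovery unit is configured to:
restore the i-th redundant instance based on the CKPT data of a correct redundant instance at the synchronization point;
execute the restored i-th redundant instance from the synchronization point.
With reference to the third aspect, in a possible implementation, the saving unit is further configured to:
not save the CKPT data of the i-th redundant instance when the arbitration data of the i-th redundant instance is identified as being the same as saved arbitration data.
With reference to the third aspect, in a possible implementation, the i-th redundant instance is an instance among the N redundant instances other than the last one to run to the synchronization point.
With reference to the third aspect, in a possible implementation, the determining unit is specifically configured to:
determine the arbitration data that occurs most often among the N arbitration data as the target arbitration data;
or determine arbitration data whose count among the N arbitration data is greater than a preset number as the target arbitration data.
With reference to the third aspect, in a possible implementation, the execution unit is specifically configured to:
save the i-th arbitration data when the i-th redundant instance is not the last instance to run to the synchronization point; and perform the step of determining the target arbitration data from the N arbitration data when the arbitration data of the last redundant instance to run to the synchronization point is obtained.
With reference to the third aspect, in a possible implementation, the device further includes a pausing unit, and the pausing unit is configured to:
suspend execution of the i-th redundant instance when the execution gap between the i-th redundant instance and the slowest of the N redundant instances is greater than a preset gap.
With reference to the third aspect, in a possible implementation, the number of synchronization points in a redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; the device further includes a deletion unit, and the deletion unit is configured to delete the arbitration data of the synchronization point;
the pausing unit is specifically configured to suspend execution of the i-th redundant instance when the preset storage space is full.
With reference to the third aspect, in a possible implementation, the redundant instance is an application instance, and the execution unit is specifically configured to:
stop executing the i-th application instance when it runs to a library function call or when the library function returns data, and record at least one of the current system call number, input parameters and output parameters of the i-th application instance to obtain the i-th arbitration data.
In a fourth aspect, the present application provides a computer storage medium including computer instructions which, when run on an electronic device, cause the electronic device to perform the asynchronous arbitration method of the first aspect or any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the asynchronous arbitration method of the first aspect or any possible implementation of the first aspect.
In a sixth aspect, the present application provides a chip including a processor and an interface, where the processor and the interface cooperate so that the chip performs the asynchronous arbitration method of the first aspect or any possible implementation of the first aspect.
It can be understood that the asynchronous arbitration devices provided in the second and third aspects, the computer-readable storage medium provided in the fourth aspect, the computer program product provided in the fifth aspect and the chip provided in the sixth aspect are all used to perform the method provided in the embodiments of the present application. For their beneficial effects, refer to the beneficial effects of the corresponding method; they are not repeated here.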
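Brief Description of the Drawings
FIG. 1 is a schematic diagram of multi-module redundancy provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of redundant instances waiting for synchronization provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of another application scenario provided by an embodiment of the present application;
FIG. 5 is a flowchart of an asynchronous arbitration method provided by an embodiment of the present application;
FIG. 6 is a flowchart of another asynchronous arbitration method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the execution of multiple redundant instances provided by an embodiment of the present application;
FIG. 8A is a schematic diagram of N redundant instances with no fault at a synchronization point provided by an embodiment of the present application;
FIG. 8B is a schematic diagram of N redundant instances with a fault at a synchronization point provided by an embodiment of the present application;
FIG. 9 is a flowchart of yet another asynchronous arbitration method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of managing arbitration data provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a further asynchronous arbitration method provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of another way of managing arbitration data provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an asynchronous arbitration device 130 provided by an embodiment of the present application.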
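Detailed Description
The technical solutions in the embodiments of the present application are described clearly and in detail below with reference to the drawings. In the description of the embodiments, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" in the text merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments, "multiple" means two or more.
In the following, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature.
The term "user interface (UI)" in the following embodiments is the medium interface for interaction and information exchange between an application or operating system and a user; it converts between the internal form of information and a form the user can accept. A user interface is source code written in a specific computer language such as Java or extensible markup language (XML); the interface source code is parsed and rendered on the electronic device and finally presented as content the user can recognize. A common form of user interface is the graphical user interface (GUI), which is a user interface related to computer operation that is displayed graphically. It may consist of visual interface elements such as text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars and widgets displayed on the display of the electronic device.
First, the technical terms used in the embodiments of the present application are introduced.
1. Redundancy
In engineering, redundancy usually means increasing the reliability of a system through multiple backups. That is, some key components of the system are configured in duplicate; when the system fails, the redundantly configured components step in and take over the work of the failed components, thereby reducing system downtime. Although redundancy adds complexity and cost, that price is necessary given how costly a service interruption caused by a fault is for a mission-critical system.
Reliability is the probability that a device completes its specified task under specified conditions and within a specified time. Improving reliability requires reducing the number of system interruptions (faults). Availability is the proportion of total available time for a functional unit within a given time interval. Improving availability requires emphasis on reducing the time to recover from a fault.
For mission-critical systems, a service interruption caused by a system fault is very costly, so the design must avoid single points of failure and improve reliability and availability. Redundancy is an indispensable means of improving system reliability.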
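2. Multi-module redundancy and redundant instances
When a system or service needs to guarantee high reliability, multi-module redundancy is one of the necessary and effective reliability mechanisms. The multi-module redundancy flow includes redundantly executing hardware modules or software programs on the same input. Further, the outputs of the redundant objects can be arbitrated. The data involved in an instance of a redundant object (a redundant instance for short) includes identical instruction streams, backup data, redundant processes and so on.
Refer to FIG. 1, which is a schematic diagram of multi-module redundancy provided by an embodiment of the present application. FIG. 1 shows two redundant instances as an example; the flow may be to feed the same input to the two redundant instances, execute them redundantly, and finally arbitrate the resulting outputs. The two redundant instances may be obtained by replication (Sphere of Replication, SoR). It should be understood that multi-module redundancy involves at least two redundant instances.
3. Multi-module redundancy arbitration
To identify errors that may occur in redundant instances, it is necessary to verify at designated arbitration points whether the redundant instances behave as expected, for example by comparing whether the output values of the instances are consistent.
4. Redundant instance synchronization
Because redundant instances may execute at different speeds, they may reach a synchronization point at different times. To ensure that the states pass arbitration when there is no fault, the redundant instances can be kept synchronized so that their states do not tend to diverge after long execution.
Synchronization methods usually designate synchronization points in the software or hardware flow. To align the pace of the redundant instances, an instance that reaches the synchronization point first waits until the later instances arrive one after another, and the instances pass that round of synchronization only after arbitration.
Refer to FIG. 2, which is a schematic diagram of redundant instances waiting for synchronization provided by an embodiment of the present application. FIG. 2 shows three redundant instances as an example, namely redundant instance 1, redundant instance 2 and redundant instance 3, and four synchronization points, namely synchronization points 1, 2, 3 and 4. In the figure, a white rectangle represents an instance that is running, a gray rectangle represents arrival at a synchronization point, and a hatched rectangle represents an instance that has stopped running (that is, is busy-waiting).
As shown in FIG. 2, redundant instances 1, 2 and 3 all include the four synchronization points, and the horizontal direction is time; each redundant instance reaches a synchronization point at a different time. Taking synchronization point 1 as an example, redundant instance 1 is the first to reach synchronization point 1 and starts waiting once it arrives; redundant instance 2 is the second to reach synchronization point 1 and starts waiting once it arrives; redundant instance 3 is the last to reach synchronization point 1. The moment at which redundant instance 3 reaches synchronization point 1 is synchronization time t1, and only from t1 do redundant instances 1 and 2 resume running together with redundant instance 3.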
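The cost of the busy-waiting scheme in FIG. 2 can be made concrete with a small calculation. The sketch below is illustrative only: the function name `busy_wait_idle` and the arrival times are invented for the example and do not come from the application. Every instance is released at the arrival time of the slowest one, so each faster instance idles for the difference.

```python
def busy_wait_idle(arrival_times):
    """Idle time of each instance at a blocking sync point.

    All instances are released when the slowest one arrives, so an instance
    that arrived at time t wastes (release_time - t) units busy-waiting.
    """
    release_time = max(arrival_times)
    return [release_time - t for t in arrival_times]


# Hypothetical arrival times of redundant instances 1, 2 and 3 at sync point 1.
idle = busy_wait_idle([3, 5, 9])
total_idle = sum(idle)
```

With these made-up numbers the two faster instances waste 6 and 4 time units respectively, which is the kind of loss asynchronous arbitration is meant to remove.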
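In the embodiments of the present application, a synchronization point is a preset position in an instance, such as the synchronization points shown in FIG. 2, rather than the synchronization time shown in FIG. 2.
5. Checkpoint (CKPT) instance
During multi-module redundancy, state can be saved periodically so that the execution flow or data flow can be rolled back for error correction; saving state in this way is performing a CKPT. For ease of description, a redundant instance that has been checkpointed is called a CKPT instance in the embodiments of the present application.
To achieve synchronization in multi-module redundancy and arbitrate on the same state, instances may busy-wait. For example, the triple-module redundancy OS technology commonly used in industry uses an arbiter for synchronous arbitration; synchronization is done by blocking and busy-waiting, so that the states of all redundant instances are aligned at the synchronization point.
However, this synchronization approach has many shortcomings. The most significant pain point is that redundant instances execute at different speeds: traditional busy-waiting leads to low CPU utilization, and the single slowest replica becomes the bottleneck for overall software and hardware efficiency.
At present, multi-module redundancy may use a leader instance plus follower instances for synchronous arbitration. Shared memory (such as a Ring Buffer) supports inter-process communication; followers must follow the external-event behavior of the leader, and the data state is arbitrated. The specific steps are as follows:
Step 1: after redundant execution starts, on reaching a round of synchronization, check whether the current instance is the leader or a follower. Step 2: if it is the leader, save the arbitration data to shared memory and resume execution; if it is a follower, check whether the leader has arrived. Step 3: if the leader has arrived, obtain the leader's data from shared memory and arbitrate; if the follower arrives before the leader, wait synchronously until the leader arrives. This mode eliminates synchronous busy-waiting for the leader and the followers that arrive after it, but because the leader is a fixed thread, overall execution efficiency is still limited by the speed of the leader.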
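The asynchronous arbitration method proposed in the present application is a general-purpose technique for both software and hardware, that is, it can be widely applied to any scenario that uses redundant instances. Two application scenarios, one hardware and one software, are shown below as examples; it should be understood that the scenarios in which the present application can be implemented are not limited to these.
Refer to FIG. 3, which is a schematic diagram of an application scenario provided by an embodiment of the present application. As shown in FIG. 3, the scenario may include M first-level central processing units (CPUs), where M is an integer greater than 1. Each CPU includes three core processors (cores), and each core executes one redundant OS instance, for a total of 3M redundant OS instances. Each CPU includes an arbiter, which performs synchronization-point arbitration for the redundant OS instances in that CPU.
The scenario may also include an external device (peripheral for short) that sends input to the CPUs, and several second-level microprocessors (Microcontrollers), two of which are shown in FIG. 3 as an example. It should be understood that the scenario may also include other processors such as third-level processors, which is not limited here.
In one implementation, the peripheral may copy the input M times and send it to the first-level CPUs, where the OS redundant instances running on the cores perform service processing. After the OS redundant instances in a first-level CPU reach a round of synchronization, the arbiter of that CPU performs synchronous arbitration and corrects any error it identifies. Finally, each CPU determines one correct first-level arbitration result from its three redundant instances and outputs that result to each of the second-level Microcontrollers. A second-level Microcontroller thus receives M first-level arbitration results.
The arbiter of the second-level Microcontroller then performs a second arbitration on the M first-level arbitration results and corrects any error it identifies.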
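The two-level structure of FIG. 3 can be sketched as two rounds of voting. This is only an illustration under stated assumptions: the application says each CPU determines one correct result from its three instances but does not fix the rule, so the strict-majority rule used here, and the names `vote` and `two_level_arbitration`, are assumptions.

```python
from collections import Counter


def vote(values):
    """Return the value held by a strict majority, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def two_level_arbitration(cpu_outputs):
    """cpu_outputs[k] lists the outputs of the three cores of first-level CPU k."""
    first_level = [vote(cores) for cores in cpu_outputs]  # one result per CPU
    second_level = vote(first_level)                      # vote over the M results
    return first_level, second_level


# Hypothetical outputs for M = 3 CPUs with three cores each.
first_level, final = two_level_arbitration([[1, 1, 1], [1, 2, 1], [3, 3, 1]])
```

In this made-up run the third CPU agrees internally on a wrong value, and the second-level vote over the M first-level results still recovers the majority value.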
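In the embodiments of the present application, the OS redundant instances in a first-level CPU use asynchronous arbitration when they run to a synchronization point, so that they no longer busy-wait on reaching the synchronization point.
In the embodiments of the present application, an error correction module may also be added to the arbiter; when asynchronous arbitration identifies a faulty OS redundant instance, that instance is rolled back and corrected.
Refer to FIG. 4, which is a schematic diagram of another application scenario provided by an embodiment of the present application. As shown in FIG. 4, a user-mode application is executed in multi-module fashion, with processes APP1 to APPM as its redundant instances, that is, there are M APP redundant instances in total, where M is an integer greater than 1.
FIG. 4 shows arbitration at two synchronization points as an example. The first synchronization point is before the system call (or after the C library function is obtained): when an APP redundant instance running in user mode reaches a system call, it obtains the called library function from the C library and waits synchronously before trapping into the OS, until all APP redundant instances have reached this round of synchronization. After all APP redundant instances finish waiting, the arbitration module in the C library checks the inputs of all APP redundant instances for consistency; once the check passes, each instance enters the OS and executes the original system call.
The second synchronization point is before the system call returns its result to the user-mode application: after each APP redundant instance completes the system call, the output value of the C library function is returned from the OS, and the instance waits synchronously for all APP redundant instances to complete the OS flow. The output values of all APP redundant instances are then arbitrated for consistency. After the consistency arbitration passes, the output value is returned to the user-mode application, and the user-mode APP redundant instances carry out the next round of execution.
In the embodiments of the present application, all APP redundant instances can save arbitration data and/or CKPT data after reaching a synchronization point and then continue executing, without waiting at the synchronization point for all APP redundant instances to reach it.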
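Refer to FIG. 5, which is a flowchart of an asynchronous arbitration method provided by an embodiment of the present application. The method may include some or all of the following steps.
S101: Start executing N redundant instances, where N is an integer greater than 1.
In some embodiments, the electronic device includes N redundant instances, which may be obtained by copying the same instance; that is, the N redundant instances have the same code implementation.
Optionally, the N redundant instances may start running at the same time. For example, the electronic device may feed the same input to the N redundant instances simultaneously and execute them simultaneously.
In a possible implementation, for the specific process by which the electronic device executes the N redundant instances, refer to all or some of the following steps S102 to S105:
S102: When the i-th redundant instance runs to a synchronization point, stop executing the i-th redundant instance and record its current output result to obtain the i-th arbitration data, where i is a positive integer not greater than N.
Optionally, the arbitration data may also include service input parameters, OS result output and so on.
In a possible implementation, the redundant instance is an application instance; the i-th application instance is stopped when it runs to a library function call or when the library function returns data, and at least one of its current system call number, input parameters and output parameters is recorded to obtain the i-th arbitration data.
Taking the scenario of FIG. 4 as an example, if the synchronization point is before the system call invokes the C library function, the current output result of the i-th redundant instance may be the input parameters of the C library function, and the i-th arbitration data includes the input parameters of the C library function, the system call number and so on; if the synchronization point is after the system call invokes the C library function, the current output result of the i-th redundant instance may be the output parameters of the C library function, and the i-th arbitration data includes the output parameters of the C library function, the system call number and so on. For details, refer to the embodiments of FIG. 11 and FIG. 12; they are not expanded on here.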
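Recording arbitration data at the entry and exit of a library function can be sketched with a wrapper. The example below is a loose Python analogy, not C library interception: the decorator `intercept`, the `call_number` value and the function `fake_write` are all hypothetical names chosen for illustration.

```python
import functools


def intercept(call_number, log):
    """Hypothetical wrapper that records arbitration data around a call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args):
            # Sync point before the call: call number plus input parameters.
            log.append(("entry", call_number, args))
            result = func(*args)
            # Sync point after the call: call number plus output value.
            log.append(("exit", call_number, result))
            return result
        return wrapper
    return decorator


log = []


@intercept(call_number=1, log=log)
def fake_write(fd, data):
    """Stand-in for a library function; returns the number of bytes 'written'."""
    return len(data)


written = fake_write(1, b"hi")
```

Each redundant instance would produce its own log, and the entries for the same synchronization point are what gets compared during arbitration.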
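S103: Save the i-th arbitration data.
Optionally, the i-th arbitration data may be saved to shared memory (Shared Memory) or a ring buffer (Ring Buffer); the embodiments of the present application do not limit the storage location, which can be determined by the actual application.
Optionally, when the i-th redundant instance is the last redundant instance to run to the synchronization point, the i-th arbitration data is not saved.
Optionally, when the i-th redundant instance is the last redundant instance to run to the synchronization point, step S105 may be performed directly.
In some embodiments of the present application, the CKPT data of the i-th redundant instance at the synchronization point may also be saved; this CKPT data is used to restore the data state of the i-th redundant instance at the synchronization point.
In some embodiments, after the arbitration data of the i-th redundant instance is saved, the i-th redundant instance may be checkpointed, that is, its CKPT data at the current synchronization point is recorded. The recording may be based on preset types of data, and the specific data types may be determined by the actual redundant instance; the CKPT data is used to return to the data state at the current synchronization point.
In a possible implementation, the redundant instance is an application instance, and checkpoint methods include, but are not limited to, blocking the original APP, forking and blocking a sub-APP, and recording the complete state data of the APP. Blocking the original APP means suspending execution of the application instance; the embodiments of the present application may use this in part. Forking and blocking a sub-APP means forking the current application instance (which may be called the original APP); the forked application instance is the sub-APP, which is not executed while the original APP continues. Recording the complete state data of the APP means saving all data of the current application instance from the start of execution up to the current synchronization point.
In a possible implementation, the redundant instance is an OS redundant instance executed by redundant hardware, and checkpoint methods include, but are not limited to, recording the complete state data of the OS. The complete state data of the OS includes all data of the OS redundant instance from the start of execution up to the current synchronization point, for example the contents of the memory occupied by the OS redundant instance, the state data of the device executing it, the data stored in registers while it runs, and the state data of those registers.
Optionally, when the arbitration data of the i-th redundant instance is identified as being the same as saved arbitration data, the CKPT data of the i-th redundant instance need not be saved.
Optionally, the CKPT data of the last of the N redundant instances to run to the synchronization point need not be obtained or saved.
S104: After saving the i-th arbitration data, execute the part of the i-th redundant instance after the synchronization point.
In some embodiments, if the i-th redundant instance is the first instance to run to the synchronization point, the part after the synchronization point may be executed after the i-th arbitration data and the CKPT data of the i-th redundant instance are saved. If the i-th redundant instance is the last instance to run to the synchronization point, step S105 may be performed after the i-th arbitration data is obtained, and the part after the synchronization point is executed afterwards; the i-th redundant instance need not be checkpointed, and neither the i-th arbitration data nor its CKPT data needs to be saved. If the i-th redundant instance is neither the first nor the last instance to run to the synchronization point, the i-th arbitration data may be saved, whether to save its CKPT data is decided according to the rule, and the part after the synchronization point is then executed.
S105: After N arbitration data are obtained, determine target arbitration data from the N arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
Optionally, the arbitration data that occurs most often among the N arbitration data is determined as the target arbitration data; or arbitration data whose count among the N arbitration data is greater than a preset number is determined as the target arbitration data.
For example, the preset number may be N/2; that is, if the number of redundant instances sharing the same arbitration data exceeds half of the total number of redundant instances, those redundant instances are determined to be correct. For example, if there are 5 redundant instances and 3 of them have the same arbitration data, the arbitration data of those 3 instances is the target arbitration data, and those 3 redundant instances are the correct redundant instances.
For example, when N is 2, the two arbitration data may be judged based on a preset judgment rule to determine the target arbitration data. For instance, if the current output result in an arbitration data exceeds a preset data range, that arbitration data is determined not to be the target arbitration data; the embodiments of the present application do not limit the preset judgment rule.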
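The two selection rules for S105 can be written down directly. The following is a minimal sketch, assuming arbitration data can be compared and hashed as plain values; the function names `target_by_most_common` and `target_by_threshold` are invented for the example.

```python
from collections import Counter


def target_by_most_common(records):
    """Rule 1: the arbitration data that occurs most often is the target."""
    value, _ = Counter(records).most_common(1)[0]
    return value


def target_by_threshold(records, preset):
    """Rule 2: arbitration data whose count exceeds the preset number is the target.

    Returns None when no value is frequent enough.
    """
    for value, count in Counter(records).items():
        if count > preset:
            return value
    return None


# Five redundant instances, three of which agree (the N/2 example above).
records = ["A", "B", "A", "B", "A"]
n = len(records)
by_count = target_by_most_common(records)
by_threshold = target_by_threshold(records, preset=n / 2)
```

With a preset number of N/2 at most one value can exceed the threshold, so the second rule either returns the strict-majority value or reports that there is none, which is the situation for N = 2 with two different records.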
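In some embodiments, a faulty redundant instance may be restored to the state of a correct redundant instance at the synchronization point based on the CKPT data of the correct redundant instance at the synchronization point, and the restored redundant instance is executed from the synchronization point.
For example, if the i-th arbitration data is not the target arbitration data, the i-th redundant instance is a faulty redundant instance; after the target arbitration data is determined from the N arbitration data, the i-th redundant instance may be restored based on the CKPT data of a correct redundant instance at the synchronization point, and the restored i-th redundant instance is executed from the synchronization point.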
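Restoring a faulty instance from a correct instance's CKPT data can be illustrated with an in-memory snapshot. This sketch stands in for "recording complete state data" only: the `Instance` class and its dictionary state are hypothetical, and `copy.deepcopy` is used here as a simple substitute for whatever checkpoint mechanism (such as fork) a real system would use.

```python
import copy


class Instance:
    """Hypothetical redundant instance whose whole state is one dictionary."""

    def __init__(self, name, state):
        self.name = name
        self.state = state

    def checkpoint(self):
        """Return CKPT data: an independent snapshot of the current state."""
        return copy.deepcopy(self.state)

    def restore(self, ckpt_data):
        """Roll back to the snapshot; copy it so the CKPT data stays reusable."""
        self.state = copy.deepcopy(ckpt_data)


good = Instance("r1", {"position": 10, "memory": [1, 2]})
bad = Instance("r2", {"position": 10, "memory": [1, 9]})  # corrupted memory

ckpt = good.checkpoint()      # taken at the synchronization point
good.state["position"] = 11   # the correct instance keeps running
bad.restore(ckpt)             # the faulty instance resumes from the sync point
```

Because the snapshot is a deep copy, the correct instance can run ahead after the checkpoint without changing the state that the faulty instance is later restored to.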
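In a possible implementation, after the last redundant instance to run to the synchronization point obtains its arbitration data, it neither saves that arbitration data nor performs a checkpoint; instead, it takes the saved arbitration data out of memory and then determines the target arbitration data from the N arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point. Assuming the arbitration data that occurs most often among the N arbitration data is determined as the target arbitration data, if a redundant instance whose arbitration data is the same as that of the slowest instance was saved before the slowest instance arrived, the CKPT data of that redundant instance can be used for error-correcting rollback. Optionally, the slowest instance may continue executing after obtaining its arbitration data; alternatively, it may wait for the full arbitration result, then correct the error and continue if a fault occurred, or simply continue if there is no fault.
In some embodiments of the present application, when the execution gap between the i-th redundant instance and the slowest of the N redundant instances is greater than a preset gap, execution of the i-th redundant instance may be suspended. For example, the number of synchronization points in a redundant instance is at least two, and the arbitration data of the i-th redundant instance is stored in a preset storage space; when the preset storage space is full, execution of the i-th redundant instance is suspended.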
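Using a fixed-size storage space to bound how far a fast instance can run ahead can be sketched as follows. The class `BoundedArbitrationStore`, its methods and the record layout `(sync_point, instance, data)` are assumptions made for this illustration only.

```python
from collections import deque


class BoundedArbitrationStore:
    """Hypothetical fixed-capacity store for arbitration data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()  # records of the form (sync_point, instance, data)

    def try_save(self, record):
        """Return True if saved; False means the store is full and the caller pauses."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(record)
        return True

    def release_round(self, sync_point):
        """Delete the data of an arbitrated sync point so the space can be reused."""
        self.slots = deque(r for r in self.slots if r[0] != sync_point)


store = BoundedArbitrationStore(capacity=2)
saved_1 = store.try_save((1, "fast", 5))   # sync point 1
saved_2 = store.try_save((2, "fast", 6))   # sync point 2, still not arbitrated
saved_3 = store.try_save((3, "fast", 7))   # store full: the fast instance pauses
store.release_round(1)                     # sync point 1 has been arbitrated
retry_3 = store.try_save((3, "fast", 7))   # space freed: the instance resumes
```

The capacity therefore acts as the preset gap: a fast instance can be at most that many unarbitrated records ahead before it is suspended.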
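Refer to FIG. 6, which is a flowchart of another asynchronous arbitration method provided by an embodiment of the present application. The method may include some or all of the following steps.
S201: Start executing N redundant instances.
Here N is an integer greater than 1, and a redundant instance includes at least one synchronization point.
Optionally, the N redundant instances are obtained by copying one instance.
In some embodiments, the electronic device feeds the same input to the N redundant instances and executes them simultaneously.
In a possible implementation, a redundant instance includes multiple synchronization points, and the N redundant instances perform the method shown in S202 to S209 below at every synchronization point.
Refer to FIG. 7, which is a schematic diagram of the execution of multiple redundant instances provided by an embodiment of the present application. FIG. 7 uses a straight line with an arrow to represent a redundant instance and shows N redundant instances and M synchronization points as an example, where N and M are both integers greater than 1. As shown in FIG. 7, the N redundant instances start executing; when any of them runs to synchronization point 1, step S202 below is performed, and the steps in S203 to S209 are performed based on the result of S202, as detailed in the flowchart of FIG. 6. It should be understood that the arbitration and error correction process of the N redundant instances at each of the M synchronization points is the same as at synchronization point 1 and is not repeated.
S202: When each redundant instance runs to the synchronization point, determine whether it is the first to run to the synchronization point; if so, perform S203 to S205 in sequence; if not, perform S206.
In some embodiments, when the redundant instance runs to the synchronization point, the electronic device may check the execution state of the other redundant instances among the N redundant instances to determine whether this redundant instance is the first to run to the synchronization point. It should be understood that the embodiments of the present application do not limit how this is determined.
If the redundant instance is the first to run to the synchronization point, the electronic device may save the arbitration data of the redundant instance at the synchronization point and save the CKPT data of the redundant instance at the synchronization point; after saving the arbitration data and CKPT data, the electronic device continues executing the part of the redundant instance after the synchronization point. For the specific process, refer to the details of S203 to S205.
If the redundant instance is not the first to run to the synchronization point, the electronic device may determine whether it is the last to run to the synchronization point and then handle it accordingly. For the specific process, refer to the details of S206.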
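S203: Save the arbitration data of the redundant instance at the synchronization point.
In some embodiments, the electronic device may record the key state data of the redundant instance based on a first recording rule to obtain the arbitration data of the redundant instance at the synchronization point, and then save the arbitration data. The key state data includes the output of the redundant instance up to the synchronization point.
Optionally, different instances may have different first recording rules; that is, the content of the key state data, and hence of the arbitration data, may differ between different instances. Different instances here means instances with different functions or different inputs and outputs; for example, FIG. 3 and FIG. 4 are different application scenarios, the redundant instances in the two scenarios may be different instances, and the specific content of the arbitration data in the two embodiments may differ. It should be understood that the N redundant instances above are the same instance and have the same first recording rule; however, because some factor may cause a data error while they execute, the data recorded by the N redundant instances for the same content may differ.
S204: Save the CKPT data of the redundant instance at the synchronization point.
In some embodiments, the electronic device may checkpoint the redundant instance based on a second recording rule to obtain the CKPT data of the redundant instance at the synchronization point, and then save the CKPT data. The CKPT data allows any of the N redundant instances to return to the data state of this redundant instance at the synchronization point.
Optionally, different instances may have different second recording rules; that is, the content of the CKPT data may differ between different instances. It should be understood that the N redundant instances above are the same instance and have the same second recording rule; however, because some factor may cause a data error while they execute, the data recorded by the N redundant instances for the same content may differ.
S205: Execute the part of the redundant instance after the synchronization point.
As shown in FIG. 7, when a redundant instance on the left runs to a synchronization point, the arbitration and error correction on the right is performed; when that finishes, execution returns to the position of the synchronization point and continues with the rest of the redundant instance.
S206: Determine whether the redundant instance is the last to run to the synchronization point; if so, perform S208; if not, perform S203 and then S207.
In some embodiments, when the redundant instance runs to the synchronization point, the electronic device may check the execution state of the other redundant instances among the N redundant instances to determine whether this redundant instance is the last to run to the synchronization point. It should be understood that the embodiments of the present application do not limit how this is determined.
If the redundant instance is the last of the N redundant instances to run to the synchronization point, the electronic device performs full arbitration on the arbitration data of this redundant instance together with the arbitration data saved at the synchronization point, to determine whether the N arbitration data pass full arbitration. If full arbitration passes, no error-correcting rollback is needed (that is, S209 need not be performed); if full arbitration fails, error-correcting rollback is needed (that is, S209 must be performed). For the specific process, refer to step S208.
If the redundant instance is neither the first nor the last of the N redundant instances to run to the synchronization point, the electronic device performs pre-arbitration on the arbitration data of this redundant instance against the arbitration data saved at the synchronization point, to determine whether the CKPT data of this redundant instance needs to be saved. If pre-arbitration passes, the CKPT data of the redundant instance is not saved; if pre-arbitration fails, the CKPT data of the redundant instance is saved. For the specific process, refer to step S207.
S207: Determine whether the redundant instance passes pre-arbitration; if so, perform S205; if not, perform S204 and then S205.
In some embodiments, the electronic device may determine whether the redundant instance passes pre-arbitration based on the saved arbitration data of the synchronization point and the arbitration data of the redundant instance. When the saved arbitration data contains an arbitration data that is the same as that of the redundant instance, the redundant instance passes pre-arbitration, that is, its CKPT data need not be saved; when none of the saved arbitration data is the same as that of the redundant instance, the redundant instance fails pre-arbitration, that is, its CKPT data needs to be saved.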
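The pre-arbitration check in S207 amounts to a membership test. The sketch below is one possible reading, with the function name `needs_checkpoint` chosen for the example: a CKPT is taken only when no previously saved arbitration data matches the arriving instance's data.

```python
def needs_checkpoint(own_data, saved_data):
    """Pre-arbitration: True (take a CKPT) only if no saved record matches.

    With nothing saved yet the result is True, which matches the rule that the
    first instance to arrive is checkpointed directly.
    """
    return all(own_data != saved for saved in saved_data)


first_arrival = needs_checkpoint(7, [])          # nothing saved yet
matching_arrival = needs_checkpoint(7, [7])      # an identical state is covered
differing_arrival = needs_checkpoint(8, [7, 7])  # a new state needs its own CKPT
```

Under this rule each distinct state seen at a synchronization point is checkpointed exactly once, at its first appearance, so a fault-free round costs a single CKPT.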
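S208: Determine whether the saved arbitration data of the synchronization point and the arbitration data of the redundant instance pass full arbitration; if so, perform S205; if not, perform S209.
The saved arbitration data of the synchronization point together with the arbitration data of the redundant instance are the N arbitration data of the N redundant instances at the synchronization point.
In some embodiments, when the N arbitration data are all the same, the electronic device determines that they pass full arbitration, and no error-correcting rollback is needed (that is, S209 need not be performed); when any two of the N arbitration data differ, the electronic device determines that they fail full arbitration, and error-correcting rollback is needed (that is, S209 must be performed).
It should be understood that if the N arbitration data pass full arbitration, no fault has occurred in the N redundant instances at the synchronization point.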
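The branching of FIG. 6 (S202, S206, S207, S208) can be summarized as a function from arrival order to the list of steps performed. This is only a reading aid: the function name `steps_for_arrival` and its parameters are invented, and the step labels are those used in the text above.

```python
def steps_for_arrival(position, n, pre_passed=True, full_passed=True):
    """Steps of FIG. 6 performed by the instance arriving in the given position.

    position is 1-based arrival order at the sync point; n is the instance count.
    """
    if position == 1:
        # First arrival: save arbitration data, take a CKPT, continue.
        return ["S203", "S204", "S205"]
    if position == n:
        # Last arrival: full arbitration, then continue or roll back.
        return ["S208", "S205"] if full_passed else ["S208", "S209"]
    if pre_passed:
        # Middle arrival whose data matches saved data: no CKPT needed.
        return ["S203", "S207", "S205"]
    # Middle arrival with new data: take a CKPT before continuing.
    return ["S203", "S207", "S204", "S205"]


first = steps_for_arrival(1, 5)
middle_new_state = steps_for_arrival(3, 5, pre_passed=False)
last_with_fault = steps_for_arrival(5, 5, full_passed=False)
```

Only the last arrival ever performs full arbitration, which is why none of the earlier arrivals has to wait.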
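Refer to FIG. 8A, which is a schematic diagram of N redundant instances with no fault at a synchronization point provided by an embodiment of the present application. FIG. 8A uses 5 redundant instances as an example; a straight line with an arrow represents a redundant instance, and the arrow direction is the direction of the time axis. The 5 redundant instances are instance a, instance b, instance c, instance d and instance e; a gray rectangle represents a synchronization point, and FIG. 8A shows synchronization point 1 and synchronization point 2 as an example.
As shown in FIG. 8A, instance a is the first to run to synchronization point 1; the electronic device saves the arbitration data of instance a, checkpoints instance a and saves its CKPT data. Instances b, d and e are neither the first nor the last to run to synchronization point 1, so the electronic device performs pre-arbitration on them; because the arbitration data of instances b, d and e is the same as that of instance a, they all pass pre-arbitration and no CKPT data needs to be saved. Instance c is the last to run to synchronization point 1, so the arbitration data of the 5 redundant instances at synchronization point 1 is compared; because the arbitration data of the 5 redundant instances is the same, they pass full arbitration, that is, no fault has occurred in the 5 redundant instances at synchronization point 1.
As shown in FIG. 8A, with pre-arbitration, one round of synchronization in a fault-free scenario requires only one CKPT.
S209: Determine the correct instances from the N redundant instances; perform error-correcting rollback on the faulty instances based on the CKPT data of a correct instance.
In some embodiments, the electronic device determines the correct instances from the N redundant instances based on the N arbitration data, and performs error-correcting rollback on the faulty instances based on the CKPT data of a correct instance.
Optionally, the electronic device may determine the arbitration data that occurs most often among the N arbitration data as the target arbitration data, or determine arbitration data whose count among the N arbitration data is greater than a preset number as the target arbitration data; the redundant instances holding the target arbitration data are then determined to be the correct instances.
Refer to FIG. 8B, which is a schematic diagram of N redundant instances with a fault at a synchronization point provided by an embodiment of the present application. FIG. 8B uses 5 redundant instances as an example; a straight line with an arrow represents a redundant instance, and the arrow direction is the direction of the time axis. The 5 redundant instances are instance a, instance b, instance c, instance d and instance e; a gray rectangle represents a synchronization point without a fault, a hatched rectangle represents a synchronization point with a fault, and FIG. 8B shows synchronization point 1 and synchronization point 2 as an example.
As shown in FIG. 8B, instance a is the first to run to synchronization point 1; the electronic device saves the arbitration data of instance a, checkpoints instance a and saves its CKPT data.
Instances b, d and e are neither the first nor the last to run to synchronization point 1, so the electronic device performs pre-arbitration on them. Because the arbitration data of instance b differs from that of instance a, instance b fails pre-arbitration, and the electronic device checkpoints instance b and saves its CKPT data. Because the arbitration data of instances d and e is the same as saved arbitration data (that of instance b), instances d and e both pass pre-arbitration and no CKPT data needs to be saved.
Instance c is the last to run to synchronization point 1, so the arbitration data of the 5 redundant instances at synchronization point 1 is compared. The arbitration data of instances a and c (arbitration data 1 for short) is the same, the arbitration data of instances b, d and e (arbitration data 2 for short) is the same, arbitration data 1 and arbitration data 2 differ, and arbitration data 2 occurs more often than arbitration data 1. The electronic device can therefore determine that the 5 redundant instances fail full arbitration, that is, a fault has occurred at synchronization point 1. Arbitration data 2 is the correct arbitration data (the target arbitration data), instances a and c are the instances that are faulty at synchronization point 1, and instances b, d and e are the instances that are correct at synchronization point 1.
The electronic device can then roll instances a and c back to the data state of instance b at synchronization point 1, based on the CKPT data of instance b at synchronization point 1. FIG. 8B shows, as an example, instance a located at synchronization point 2 when the rollback happens; in other cases instance a may have run to another position, which is not limited here.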
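The FIG. 8B round can be replayed with a short simulation. The sketch below makes several assumptions that are not stated in the application: the function name `run_round`, the integer values used as arbitration data, the arrival order b, d, e among the middle instances, and the plurality rule for choosing the target.

```python
from collections import Counter


def run_round(arrivals):
    """Replay one sync point. arrivals: (name, arbitration_data) in arrival order."""
    saved = {}  # name -> arbitration data saved by non-slowest instances
    ckpts = {}  # arbitration data -> name of the instance checkpointed for it
    for name, data in arrivals[:-1]:
        saved[name] = data
        if data not in ckpts:      # pre-arbitration fails: take a CKPT
            ckpts[data] = name
        # the instance now continues past the sync point without waiting
    last_name, last_data = arrivals[-1]  # slowest: no save, no CKPT
    everyone = dict(saved)
    everyone[last_name] = last_data
    target, _ = Counter(everyone.values()).most_common(1)[0]
    faulty = [n for n, d in everyone.items() if d != target]
    # Roll back from a CKPT instance if one holds the target state,
    # otherwise from the slowest instance.
    source = ckpts.get(target, last_name)
    return sorted(ckpts.values()), target, faulty, source


fig_8b = run_round([("a", 1), ("b", 2), ("d", 2), ("e", 2), ("c", 1)])
fig_8a = run_round([("a", 1), ("b", 1), ("d", 1), ("e", 1), ("c", 1)])
```

In the faulty round two CKPTs are taken (instances a and b) and instances a and c are rolled back from the CKPT of instance b; in the fault-free round a single CKPT suffices and nothing is rolled back.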
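The asynchronous arbitration methods of FIG. 5 and FIG. 6 are described in detail below through embodiments corresponding to the two application scenarios above.
First, the asynchronous arbitration method in the hardware redundancy scenario is introduced.
The method is applied in an asynchronous arbitration device for multiple OS redundant instances. The device includes multiple core processors (cores) and at least one asynchronous arbiter, where each core processor executes one OS redundant instance, and the OS redundant instances are the redundant instances.
Refer to FIG. 9, which is a flowchart of yet another asynchronous arbitration method provided by an embodiment of the present application. The method may include some or all of the following steps.
S301: Multiple core processors each execute an OS redundant instance.
Optionally, the multiple core processors may be located in the same central processing unit (CPU).
S302: When an OS redundant instance runs to a synchronization point, the core processor saves the arbitration data of that OS redundant instance at the synchronization point.
Optionally, the core processor performing S302 may be one whose OS redundant instance is not the slowest; that is, the core processor may determine whether the OS redundant instance it executes is the last to run to the synchronization point, and perform S302 only if it is not.
Optionally, the synchronization point may be when the OS outputs a service execution result.
In some embodiments, the core processor may store the key state data of the current synchronization point in shared memory (Shared Memory) as the arbitration data of the synchronization point, where the key state data includes, but is not limited to, service input parameters and OS result output.
Refer to FIG. 10, which is a schematic diagram of managing arbitration data provided by an embodiment of the present application. As shown in FIG. 10, a straight line with an arrow represents an OS redundant instance; the figure shows 3 OS redundant instances as an example, namely OS1, OS2 and OS3. A rectangular block represents the shared memory, and a solid line with an arrow indicates the storing of arbitration data; a rectangle with data represents a synchronization point, and the figure shows synchronization point 1 and synchronization point 2. In full arbitration 1, corresponding to synchronization point 1, OS1 and OS2 are not the slowest to run to the synchronization point, so the core processor executing OS1 stores the arbitration data of OS1 in shared memory, and the core processor executing OS2 stores the arbitration data of OS2 in shared memory. In full arbitration 2, corresponding to synchronization point 2, OS2 and OS3 are not the slowest to run to the synchronization point, so the core processor executing OS2 stores the arbitration data of OS2 in shared memory, and the core processor executing OS3 stores the arbitration data of OS3 in shared memory.
S303: The core processor performs pre-arbitration on the OS redundant instance and performs CKPT as needed, where the OS redundant instance is not the slowest OS redundant instance.
In some embodiments, the core processor may perform pre-arbitration on the OS redundant instance after saving its arbitration data at the synchronization point.
Pre-arbitration may specifically be to compare the arbitration data of the OS redundant instance with the arbitration data of each CKPT instance of the current round one by one; if no CKPT instance has arbitration data consistent with that of this OS, CKPT is performed. The CKPT instances of the current round are the instances checkpointed at this synchronization point. Note that, for ease of description, an instance that has been checkpointed is called a CKPT instance here.
Optionally, CKPT methods include, but are not limited to, recording the complete state data of the OS (including memory, device and register state).
In some embodiments, the worst case supported by N-module redundancy is N/2-1 faulty instances, that is, the first half of the instances to arrive must include a correct state; therefore, to further reduce the CKPT cost, only the first half of the instances (N/2+1) in N-module redundancy need to go through the pre-arbitration + CKPT process. After arbitration data saving, pre-arbitration or CKPT is completed, the OS redundant instance can continue executing without busy-waiting.
Optionally, the first OS redundant instance needs no pre-arbitration and is checkpointed directly. That is, the core processor may determine whether the OS redundant instance it executes is the first to run to the synchronization point; if so, CKPT is performed directly; if not, pre-arbitration is performed, and whether to perform CKPT is decided by its result.
S304: The core processor saves the CKPT data.
S305: The asynchronous arbiter performs full arbitration based on the arbitration data of all OS redundant instances and determines the correct instances and the faulty instances.
In some embodiments, after obtaining the arbitration data of the OS redundant instance that is slowest to reach the synchronization point, the asynchronous arbiter may take the arbitration data of all OS redundant instances of the current round out of shared memory and compare all of it. As shown in FIG. 10, the dotted line with an arrow represents reading arbitration data. When OS3 runs to synchronization point 1, the asynchronous arbiter may take the arbitration data of all OS redundant instances of the current round out of shared memory, that is, the arbitration data of OS1 and OS2, and perform full arbitration on the arbitration data of OS1, OS2 and OS3; when OS1 runs to synchronization point 2, it takes the arbitration data of OS2 and OS3 out of shared memory and performs full arbitration on the arbitration data of OS1, OS2 and OS3.
Optionally, if more than half of the instances agree, that is, N/2+1 arbitration data in N-module redundancy reach consensus, the redundant instances that reached consensus are regarded as correct and enter the next round of synchronization; instances whose arbitration data did not reach consensus are regarded as faulty and need to be rolled back.
S306: The asynchronous arbiter sends an error correction message to the core processors corresponding to the faulty instances, where the message includes an indication of the correct instance.
S307: The core processor corresponding to a faulty instance performs error-correcting rollback based on the CKPT data of the correct instance or the state data of the slowest instance.
In some embodiments, if the correct instance is a CKPT instance, the state data of the faulty instance (such as memory, device and register state) is restored to the state of the correct CKPT instance; if the correct instance is the slowest OS redundant instance, the state data of the faulty instance is restored to the state data of the slowest OS redundant instance. After the rollback and error correction is completed, the faulty instance continues executing until the current round of synchronization is completed and the next round is entered.
S308: After all OS redundant instances have successfully passed the synchronization point, the core processor reclaims the arbitration data and/or CKPT data.
In some embodiments, after all OS redundant instances have successfully passed a round of synchronization, the asynchronous arbitration device reclaims the memory consumed in that round, where the memory includes the reserved memory holding arbitration data (such as the shared memory in FIG. 10, which may also be called Shared Memory) and the memory occupied by CKPT data.
Optionally, after determining that all OS redundant instances have successfully passed a round of synchronization, each core processor deletes from memory the arbitration data and CKPT data saved by the OS redundant instance it executes.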
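Next, the asynchronous arbitration method for redundant instances that are application instances is introduced.
The method is applied to user-mode multi-module execution; for example, processes APP1 to APPM are the redundant instances, where M is an integer greater than 1, as described for the application scenario shown in FIG. 4. In the embodiments of the present application, when an APP redundant instance reaches a specified synchronization point, such as a system call, it can be intercepted at the entry and exit of the corresponding C library function, and the asynchronous arbitration method is performed before and after trapping into the kernel to execute the system call, which eliminates the synchronous busy-waiting step.
Refer to FIG. 11, which is a schematic diagram of a further asynchronous arbitration method provided by an embodiment of the present application. The method may include some or all of the following steps:
S401: When an APP redundant instance runs to a synchronization point, save its arbitration data at the synchronization point.
Optionally, the slowest instance may not save its arbitration data.
For example, when an APP redundant instance reaches a system call, it stores the current key state in shared memory; the current key state is the arbitration data, which includes, but is not limited to, the system call number and the input and output parameters, and the shared memory includes, but is not limited to, a Ring Buffer.
Refer to FIG. 12, which is a schematic diagram of another way of managing arbitration data provided by an embodiment of the present application. As shown in FIG. 12, a straight line with an arrow represents an APP redundant instance; the figure shows 3 APP redundant instances as an example, namely APP1, APP2 and APP3. The ring represents the ring buffer (Ring Buffer), and a solid line with an arrow indicates the storing of arbitration data; a rectangle with data represents a synchronization point, and the figure shows synchronization point 1 and synchronization point 2. In full arbitration 1, corresponding to synchronization point 1, APP1 and APP2 are not the slowest to run to the synchronization point, so the arbitration data of APP1 and APP2 is stored in the Ring Buffer; in full arbitration 2, corresponding to synchronization point 2, APP2 and APP3 are not the slowest to run to the synchronization point, so the arbitration data of APP2 and APP3 is stored in the Ring Buffer.
S402: Perform pre-arbitration on the APP redundant instance and perform CKPT as needed.
In some embodiments, pre-arbitration is performed after the APP redundant instance saves its arbitration data.
For example, pre-arbitration may specifically be to compare the arbitration data of the APP redundant instance with the arbitration data of each CKPT instance of the current round one by one; if no CKPT instance has arbitration data consistent with that of the APP redundant instance, CKPT is performed. CKPT methods include, but are not limited to, blocking the original APP, forking and blocking a sub-APP, and recording the complete state data of the APP.
In one implementation, the worst case supported by N-module redundancy is N/2-1 faulty instances, that is, the first half of the instances to arrive must include a correct state; therefore, to further reduce the CKPT cost, only the first half of the instances (N/2+1) in N-module redundancy may go through the pre-arbitration + CKPT process. After completing at least one of arbitration data saving, pre-arbitration and CKPT, an APP redundant instance can continue executing without busy-waiting. For the specific process by which an APP redundant instance performs arbitration data saving, pre-arbitration and CKPT, refer to the relevant content above; it is not repeated here.
Optionally, the first APP redundant instance (that is, the first instance to run to the synchronization point) may skip pre-arbitration and be checkpointed directly.
S403: When the slowest APP redundant instance runs to the synchronization point, perform full arbitration and perform rollback and error correction as needed.
In some embodiments, the slowest APP redundant instance may first take the arbitration data of all APP redundant instances of the current round out of shared memory (such as the Ring Buffer in FIG. 12) and compare all of it.
As shown in FIG. 12, the dotted line with an arrow represents reading arbitration data. When APP3 runs to synchronization point 1, the arbitration data of all APP redundant instances of the current round, that is, the arbitration data of APP1 and APP2, can be taken out of the Ring Buffer, and full arbitration is performed on the arbitration data of APP1, APP2 and APP3; when APP1 runs to synchronization point 2, the arbitration data of APP2 and APP3 is taken out of shared memory, and full arbitration is performed on the arbitration data of APP1, APP2 and APP3.
Optionally, if more than half of the instances agree, that is, N/2+1 arbitration data in N-module redundancy reach consensus, the redundant instances that reached consensus are regarded as correct and enter the next round of synchronization; instances whose arbitration data did not reach consensus are regarded as faulty and need to be rolled back. Based on the result of full arbitration, one member in a correct state can be selected from the CKPT instances and the slowest instance, and the faulty APP redundant instances are rolled back to the correct state.
If the correct state exists in a CKPT instance, and assuming the CKPT method is to fork and block a sub-APP, the faulty instance is terminated and replaced by the sub-APP that the CKPT instance is keeping blocked; if the correct state exists in the slowest instance, the faulty instance is terminated, the slowest APP redundant instance is forked, and its sub-APP takes over from the original faulty APP. After the rollback and error correction is completed, the faulty instance continues executing until the current round of synchronization is completed and the next round is entered.
S404: Reclaim the storage resources holding arbitration data and CKPT data.
In some embodiments, after all APP redundant instances have successfully passed a round of synchronization, the reserved memory consumed in that round for arbitration data (for example, in the Ring Buffer implementation of FIG. 12, the read pointer moves past the data range of that round) and the memory occupied by CKPT data are reclaimed; that is, the storage space holding arbitration data and CKPT state data is released for use by subsequent synchronization points.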
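The methods of the embodiments of this application are described in detail above. The apparatuses of the embodiments of this application are provided below.
Refer to FIG. 13, which is a schematic structural diagram of an asynchronous arbitration apparatus 130 according to an embodiment of this application. The apparatus 130 may be an electronic device, or a component of an electronic device, such as a chip or an integrated circuit. The apparatus 130 may include an execution unit 1301, a determining unit 1302, a saving unit 1303, a recovery unit 1304, and a pausing unit 1305. The asynchronous arbitration apparatus 130 is configured to implement the asynchronous arbitration method of any one of the foregoing embodiments.
In a possible implementation, the apparatus includes:
an execution unit 1301, configured to start execution of N redundant instances, where N is an integer greater than 1 and i is a positive integer not greater than N;
the execution unit 1301 being specifically configured to: when the i-th redundant instance runs to a synchronization point, stop execution of the i-th redundant instance and record the current output result of the i-th redundant instance to obtain the i-th arbitration data; save the i-th arbitration data; and, after saving the i-th arbitration data, execute the part of the i-th redundant instance that follows the synchronization point; and
a determining unit 1302, configured to determine, after the N pieces of arbitration data are obtained, target arbitration data from the N pieces of arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
In a possible implementation, the apparatus further includes a saving unit 1303.
The saving unit 1303 is configured to save checkpoint (CKPT) data of the i-th redundant instance at the synchronization point, where the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
In a possible implementation, the apparatus further includes a recovery unit 1304. The recovery unit 1304 is configured to:
restore the i-th redundant instance based on the CKPT data of a correct redundant instance at the synchronization point; and
execute the restored i-th redundant instance from the synchronization point.
In a possible implementation, the saving unit 1303 is further configured to:
skip saving the CKPT data of the i-th redundant instance when it is identified that the arbitration data of the i-th redundant instance is the same as saved arbitration data.
In a possible implementation, the i-th redundant instance is an instance, among the N redundant instances, other than the last instance to run to the synchronization point.
In a possible implementation, the determining unit 1302 is specifically configured to:
determine, as the target arbitration data, the arbitration data that occurs most often among the N pieces of arbitration data;
or determine, as the target arbitration data, the arbitration data whose number of occurrences among the N pieces of arbitration data is greater than a preset number.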
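The two selection rules of the determining unit 1302 can be sketched as follows. The function names are hypothetical, and tie-breaking under the first rule is not specified in this application; the sketch simply returns the value that `Counter.most_common` lists first.

```python
from collections import Counter


def target_by_plurality(records):
    """Return the arbitration data that occurs most often.

    Ties are resolved in favor of the value seen first, which is an
    assumption of this sketch rather than a rule from the application.
    """
    return Counter(records).most_common(1)[0][0]


def target_by_threshold(records, preset):
    """Return the arbitration data whose count exceeds `preset`, else None."""
    for value, count in Counter(records).most_common():
        if count > preset:
            return value
    return None


records = ["out-A", "out-B", "out-A"]
winner = target_by_plurality(records)
strict = target_by_threshold(records, preset=2)
```

Setting `preset` to N/2 makes the threshold rule equivalent to requiring a strict majority, whereas the plurality rule always yields a target even when no majority exists.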
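In a possible implementation, the execution unit 1301 is specifically configured to:
save the i-th arbitration data when the i-th redundant instance is an instance other than the last instance to run to the synchronization point; and
perform the step of determining target arbitration data from the N pieces of arbitration data when the arbitration data of the last redundant instance to run to the synchronization point is obtained.
In a possible implementation, the apparatus further includes a pausing unit 1305. The pausing unit 1305 is configured to:
pause execution of the i-th redundant instance when the execution gap between the i-th redundant instance and the slowest-running redundant instance among the N redundant instances is greater than a preset gap.
In a possible implementation, the redundant instance has at least two synchronization points, and the arbitration data of the i-th redundant instance is stored in a preset storage space. The apparatus further includes a deleting unit, configured to delete the arbitration data of the synchronization point.
The pausing unit 1305 is specifically configured to pause execution of the i-th redundant instance when the preset storage space is full.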
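The two pause conditions of the pausing unit 1305 can be combined in a small predicate. Measuring the execution gap as a difference in the number of synchronization points passed is an assumption of this sketch, as are the function and parameter names; the application does not fix the unit of the gap.

```python
def should_pause(progress, instance, max_gap, used_slots=0, capacity=None):
    """Return True if `instance` should be paused.

    progress:   {instance: number of synchronization points passed}
                (assumed measure of execution progress).
    max_gap:    preset gap relative to the slowest instance.
    used_slots: occupied entries in the preset storage space.
    capacity:   size of the preset storage space, or None to ignore it.
    """
    gap = progress[instance] - min(progress.values())
    storage_full = capacity is not None and used_slots >= capacity
    return gap > max_gap or storage_full


progress = {"APP1": 5, "APP2": 3, "APP3": 2}
pause_app1 = should_pause(progress, "APP1", max_gap=2)
pause_app2 = should_pause(progress, "APP2", max_gap=2)
```

Once the deleting unit frees the arbitration data of a finished synchronization point, `used_slots` drops and a paused instance can resume.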
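In a possible implementation, the redundant instance is an application instance, and the execution unit 1301 is specifically configured to:
stop execution of the i-th application instance when the i-th application instance runs to a library function call or to the point at which a library function returns data, and record at least one of the current system call number, input parameters, and output parameters of the i-th application instance to obtain the i-th arbitration data.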
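One way to turn the recorded fields into comparable arbitration data is to reduce them to a fixed-size digest. This is only a sketch: the application states which fields may be recorded but does not say how they are encoded, so the function name and the use of SHA-256 over a `repr` of the fields are assumptions.

```python
import hashlib


def arbitration_record(syscall_no=None, inputs=None, outputs=None):
    """Build a comparable record from the fields captured at a sync point.

    Any field that is not recorded stays None. Hashing is an assumed
    encoding that keeps every record the same size for comparison.
    """
    fields = (syscall_no, inputs, outputs)
    return hashlib.sha256(repr(fields).encode("utf-8")).hexdigest()


# Two instances issuing the same call with the same data produce equal
# records; a differing output parameter produces a different record.
rec_a = arbitration_record(1, inputs=(1, b"hello"), outputs=(5,))
rec_b = arbitration_record(1, inputs=(1, b"hello"), outputs=(5,))
rec_c = arbitration_record(1, inputs=(1, b"hello"), outputs=(4,))
```

Fixed-size records are convenient for a ring buffer with uniform slots, at the cost of not being able to inspect the original parameters after the fact.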
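It should be noted that, for the implementation of each unit, reference may also be made to the corresponding descriptions of the embodiments shown in FIG. 5, FIG. 6, FIG. 9, and FIG. 11. The asynchronous arbitration apparatus 130 may be the electronic device described above.
It can be understood that, in the apparatus embodiments of this application, the division into units or modules is only a logical division by function and does not limit the specific structure of the apparatus. In a specific implementation, some functional modules may be subdivided into finer functional modules, and some may be combined into one functional module; whether the functional modules are subdivided or combined, the general procedure performed by the apparatus 130 is the same. Typically, each unit corresponds to its own program code (or program instructions); when the program code of a unit runs on a processor, the unit, under the control of the processor, performs the corresponding procedure to implement the corresponding function.
An embodiment of this application further provides an electronic device that includes one or more processors and one or more memories. The one or more memories are coupled to the one or more processors and are configured to store computer program code, the computer program code including computer instructions. When the one or more processors execute the computer instructions, the electronic device is caused to perform the method described in the foregoing embodiments.
An embodiment of this application further provides a computer program product containing instructions. When the computer program product runs on an electronic device, the electronic device is caused to perform the method described in the foregoing embodiments.
An embodiment of this application further provides a computer-readable storage medium including instructions. When the instructions run on an electronic device, the electronic device is caused to perform the method described in the foregoing embodiments.
It can be understood that the implementations of this application may be combined in any manner to achieve different technical effects.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in this application are produced wholly or partly.
The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (Solid State Disk)).
A person of ordinary skill in the art can understand that all or part of the procedures of the methods in the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the procedures of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a random access memory (RAM), a magnetic disk, or an optical disc.
In summary, the foregoing is merely embodiments of the technical solutions of this application and is not intended to limit the protection scope of this application. Any modification, equivalent replacement, or improvement made according to the disclosure of this application shall fall within the protection scope of this application.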

Claims (20)

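  1. An asynchronous arbitration method, wherein the method comprises:
    starting execution of N redundant instances, where N is an integer greater than 1;
    when an i-th redundant instance runs to a synchronization point, stopping execution of the i-th redundant instance and recording a current output result of the i-th redundant instance to obtain i-th arbitration data, where i is a positive integer not greater than N;
    saving the i-th arbitration data;
    after saving the i-th arbitration data, executing the part of the i-th redundant instance that follows the synchronization point; and
    after the N pieces of arbitration data are obtained, determining target arbitration data from the N pieces of arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
  2. The method according to claim 1, wherein after the saving of the i-th arbitration data and before the executing of the part of the i-th redundant instance that follows the synchronization point, the method further comprises:
    saving checkpoint (CKPT) data of the i-th redundant instance at the synchronization point, where the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
  3. The method according to claim 1 or 2, wherein the i-th arbitration data is not the target arbitration data, and after the determining of the target arbitration data from the N pieces of arbitration data, the method further comprises:
    restoring the i-th redundant instance based on CKPT data of a correct redundant instance at the synchronization point; and
    executing the restored i-th redundant instance from the synchronization point.
  4. The method according to claim 2 or 3, wherein the saving of the checkpoint (CKPT) data of the i-th redundant instance at the synchronization point comprises:
    when it is identified that the arbitration data of the i-th redundant instance is the same as saved arbitration data, skipping saving the CKPT data of the i-th redundant instance.
  5. The method according to any one of claims 2 to 4, wherein the i-th redundant instance is an instance, among the N redundant instances, other than the last instance to run to the synchronization point.
  6. The method according to any one of claims 1 to 5, wherein the method further comprises:
    when an execution gap between the i-th redundant instance and the slowest-running redundant instance among the N redundant instances is greater than a preset gap, pausing execution of the i-th redundant instance.
  7. The method according to claim 6, wherein the redundant instance has at least two synchronization points, and the arbitration data of the i-th redundant instance is stored in a preset storage space; after the determining of the target arbitration data from the N pieces of arbitration data, the method further comprises: deleting the arbitration data of the synchronization point; and
    the pausing of execution of the i-th redundant instance comprises: when the preset storage space is full, pausing execution of the i-th redundant instance.
  8. The method according to any one of claims 1 to 7, wherein the saving of the i-th arbitration data comprises:
    when the i-th redundant instance is an instance other than the last instance to run to the synchronization point, saving the i-th arbitration data; and
    the method further comprises:
    when the arbitration data of the last redundant instance to run to the synchronization point is obtained, performing the step of determining the target arbitration data from the N pieces of arbitration data.
  9. An asynchronous arbitration apparatus, wherein the apparatus comprises N processors and an arbiter;
    each of the N processors is configured to: execute an instance; when the instance runs to a synchronization point, output arbitration data of the synchronization point to the arbiter, where the arbitration data includes a current output result; and, after outputting the arbitration data of the synchronization point, execute the processing steps that follow the synchronization point; and
    the arbiter is configured to: save the arbitration data upon receiving the arbitration data sent by a processor; and, upon receiving the N pieces of arbitration data, determine target arbitration data from the N pieces of arbitration data.
  10. The apparatus according to claim 9, wherein the processor is further configured to send checkpoint (CKPT) data of the synchronization point to the arbiter upon processing to the synchronization point;
    the arbiter is further configured to send the CKPT data corresponding to the target arbitration data to the processor corresponding to non-target arbitration data; and
    the processor corresponding to the non-target arbitration data is further configured to restore its data state at the synchronization point based on the CKPT data corresponding to the target arbitration data.
  11. An asynchronous arbitration apparatus, wherein the apparatus comprises:
    an execution unit, configured to start execution of N redundant instances, where N is an integer greater than 1;
    the execution unit being further configured to: when an i-th redundant instance runs to a synchronization point, stop execution of the i-th redundant instance and record a current output result of the i-th redundant instance to obtain i-th arbitration data; save the i-th arbitration data; and, after saving the i-th arbitration data, execute the part of the i-th redundant instance that follows the synchronization point, where i is a positive integer not greater than N; and
    a determining unit, configured to determine, after the N pieces of arbitration data are obtained, target arbitration data from the N pieces of arbitration data, where the redundant instance corresponding to the target arbitration data is a redundant instance that is correct at the synchronization point.
  12. The apparatus according to claim 11, wherein the apparatus further comprises a saving unit;
    the saving unit is configured to save checkpoint (CKPT) data of the i-th redundant instance at the synchronization point, where the CKPT data of the i-th redundant instance at the synchronization point is used to restore the data state of the i-th redundant instance at the synchronization point.
  13. The apparatus according to claim 11 or 12, wherein the apparatus further comprises a recovery unit, the recovery unit being configured to:
    restore the i-th redundant instance based on CKPT data of a correct redundant instance at the synchronization point; and
    execute the restored i-th redundant instance from the synchronization point.
  14. The apparatus according to claim 12 or 13, wherein the saving unit is further configured to:
    skip saving the CKPT data of the i-th redundant instance when it is identified that the arbitration data of the i-th redundant instance is the same as saved arbitration data.
  15. The apparatus according to any one of claims 12 to 14, wherein the i-th redundant instance is an instance, among the N redundant instances, other than the last instance to run to the synchronization point.
  16. The apparatus according to any one of claims 11 to 15, wherein the apparatus further comprises a pausing unit, the pausing unit being configured to:
    pause execution of the i-th redundant instance when an execution gap between the i-th redundant instance and the slowest-running redundant instance among the N redundant instances is greater than a preset gap.
  17. The apparatus according to claim 16, wherein the redundant instance has at least two synchronization points, and the arbitration data of the i-th redundant instance is stored in a preset storage space; the apparatus further comprises a deleting unit, the deleting unit being configured to delete the arbitration data of the synchronization point; and
    the pausing unit is specifically configured to pause execution of the i-th redundant instance when the preset storage space is full.
  18. The apparatus according to any one of claims 11 to 17, wherein the execution unit is specifically configured to: save the i-th arbitration data when the i-th redundant instance is an instance other than the last instance to run to the synchronization point; and perform the step of determining the target arbitration data from the N pieces of arbitration data when the arbitration data of the last redundant instance to run to the synchronization point is obtained.
  19. An electronic device, wherein the electronic device comprises one or more processors and one or more memories; the one or more memories are coupled to the one or more processors and are configured to store computer program code, the computer program code comprising computer instructions; and when the one or more processors execute the computer instructions, the electronic device is caused to perform the method according to any one of claims 1 to 8.
  20. A computer-readable storage medium comprising instructions, wherein when the instructions run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1 to 8.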
PCT/CN2024/093216 2023-06-08 2024-05-14 Asynchronous arbitration method and apparatus Pending WO2024250919A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310679561.8A CN119105976A (zh) 2023-06-08 2023-06-08 Asynchronous arbitration method and apparatus
CN202310679561.8 2023-06-08

Publications (1)

Publication Number Publication Date
WO2024250919A1 true WO2024250919A1 (zh) 2024-12-12

Family

ID=93710408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/093216 Pending WO2024250919A1 (zh) 2023-06-08 2024-05-14 Asynchronous arbitration method and apparatus

Country Status (2)

Country Link
CN (1) CN119105976A (zh)
WO (1) WO2024250919A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050204184A1 (en) * 2004-03-12 2005-09-15 Kotaro Endo Distributed system and redundancy control method
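CN102724083A (zh) * 2012-05-25 2012-10-10 哈尔滨工程大学 Degradable triple modular redundancy computer system based on software synchronization
CN102929157A (zh) * 2012-11-15 2013-02-13 哈尔滨工程大学 Triple-redundant ship dynamic positioning control computer system
US20170277604A1 (en) * 2016-03-23 2017-09-28 GM Global Technology Operations LLC Architecture and apparatus for advanced arbitration in embedded controls
CN109766226A (zh) * 2018-12-28 2019-05-17 上海微阱电子科技有限公司 Digital circuit implementing a multi-modular redundancy voting function through multi-level design
CN112232523A (zh) * 2020-12-08 2021-01-15 湖南航天捷诚电子装备有限责任公司 Domestically produced artificial intelligence computing device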

Also Published As

Publication number Publication date
CN119105976A (zh) 2024-12-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24818452

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE