
CN111736996A - A process persistence method and device for distributed non-volatile memory system - Google Patents

A process persistence method and device for distributed non-volatile memory system

Info

Publication number
CN111736996A
Authority
CN
China
Prior art keywords
node
copy
persistent
migration
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010553640.0A
Other languages
Chinese (zh)
Other versions
CN111736996B (en)
Inventor
薛栋梁
黄林鹏
孙鹏昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University
Priority to CN202010553640.0A
Publication of CN111736996A
Application granted
Publication of CN111736996B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 Address translation
    • G06F12/1009 Address translation using page tables, e.g. page table structures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 Address translation
    • G06F12/109 Address translation for multiple virtual address spaces, e.g. segmentation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a process persistence method and device for a distributed non-volatile memory system. In the method, a process copy of the persistent process running on the master node is generated periodically and distributed over the network to each secondary node. The master node and each secondary node store the process copy in their PM modules, so that a secondary node can restore and rebuild the persistent process from the process copy. The invention mainly addresses the lack of reliability guarantees for long-running scientific computing processes in large-scale distributed systems. It exploits the non-volatility, high speed, and large capacity of PM to maintain process checkpoints and process copies automatically and to recover automatically when the system fails, and it has good market prospects and application value.

Figure 202010553640

Description

A process persistence method and device for a distributed non-volatile memory system

Technical field

The present invention relates to computer architecture technology, and in particular to a process persistence method and device for a distributed non-volatile memory system.

Background

Scientific computing fields such as high-energy physics research, nuclear weapon design, aerospace vehicle design, national economic forecasting and decision-making, energy exploration, medium- and long-term weather forecasting, satellite image processing, intelligence analysis, and industrial simulation place heavy demands on computing power and storage. Distributed high-performance computing clusters, as the main solution for scientific computing, provide relatively strong computing and storage support. However, a single computing task in these fields usually takes a long time to complete, and many tasks cannot tolerate downtime or a crash partway through: once a node goes down or the program crashes, the task must be restarted from the beginning, which severely reduces the efficiency of scientific computing. To this end, the present invention provides a solution for resuming a computing task from its breakpoint, specifically a method and device for persisting the processes of a computing task in a distributed persistent memory storage system. Distributed high-performance computing clusters are the third major paradigm of human scientific research and a concentrated expression of national scientific and technological strength, so the method and device provided by the present invention have strong practical significance and application value.

The following technical terms are used in this technical field:

DRAM: Dynamic Random Access Memory, the main memory in widespread use today. Its contents are lost when power is removed, and its read/write endurance is theoretically unlimited.

DDR4 SDRAM: Double-Data-Rate Fourth Generation Synchronous Dynamic Random Access Memory, currently the latest generation of computer memory specification, offering lower voltage, lower power consumption, and higher bandwidth than earlier generations.

PM: Persistent Memory, also called non-volatile memory. STTRAM, PCM, and RRAM are all types of PM. Its contents survive a power loss and it is byte-addressable, but the number of reads and writes it can endure is limited and its reads and writes are asymmetric: depending on the material, a write on traditional PM takes roughly 2 to 4 times as long as a read, and a PM read is 1 to 3 times slower than a DRAM read. Intel's recently released Optane DC Persistent Memory is already comparable to existing DDR4 DRAM in read/write speed.

PCB: Process Control Block, the data structure an operating system uses to manage a process. Each process has a corresponding PCB, which stores information such as the process's running state, scheduling priority, memory allocation, page table, and open files.

PSU: Power Supply Unit, which converts high-voltage alternating current into low-voltage direct current to power the components of a computer. The power supply units of most modern desktop computers follow the ATX standard defined by Intel, which specifies parameters such as the unit's dimensions, connector shapes, and pin assignments.

PWR_OK: Power-Good, a signal in the ATX standard that the power supply unit outputs on one pin of the motherboard power connector. A high level indicates that the output current of the power supply unit has stabilized and that it can power the devices normally. The ATX standard specifies that when external power is disconnected, the output current of the power supply unit must remain stable for at least 1 millisecond after PWR_OK falls to a low level.

IPMI: Intelligent Platform Management Interface, a subsystem of a computer that is independent of the host operating system, BIOS, and processor. It monitors and manages the operating state of the computer's devices and can exchange information with the outside world over an out-of-band management network.

BMC: Baseboard Management Controller, a microcontroller embedded in the server motherboard that monitors information such as temperature sensor readings and fan speeds and reports to the outside world over the out-of-band management network when abnormal values occur. It is the core component of IPMI.

ipmitool: Intelligent Platform Management Interface Tool, a software tool for controlling an IPMI system, providing functions such as viewing the machine's operating state and remote power on/off.

DETECT_PWR: the signal that the power failure detection unit of the process abnormality mining module sends to the exception handling unit after detecting that the power supply unit has lost power.

DETECT_CRASH: the signal that the program crash detection unit of the process abnormality mining module sends to the exception handling unit after detecting a program crash.

DETECT_IMBALANCE: the signal that the load balancing detection unit of the process abnormality mining module sends to the exception handling unit when it detects that the load across the nodes is unbalanced.

Undo method: a way of writing data that uses an undo log. When data in a non-volatile storage medium is updated, the original data is first saved in the undo log and only then is the data itself modified. If the write is interrupted abnormally, the undo log is used to restore the original values, preventing the system from entering an inconsistent state.

Distributed non-volatile memory system: physically, it consists of a number of machine nodes equipped with both PM and DRAM. Each node is relatively independent and has its own processor, memory, operating system, and other resources. The nodes are connected over TCP or IB, and distributed system software combines them into a complete distributed non-volatile memory system. TCP and IB (InfiniBand) are both network communication protocols: TCP is a widely used protocol that provides reliable transmission, while IB offers very high bandwidth and very low latency and is well suited to communication between nodes, and between nodes and backing storage, in a distributed system. Distributed systems typically use redundant storage with a primary copy and replicas to ensure data availability and reliability. Figure 1 shows the overall architecture of one such distributed non-volatile memory system, in which Node1 is the master node, Node2 and Node3 are secondary nodes, and data1_x and data1_xx are copies of data1.
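The undo method defined in the glossary above can be illustrated with a minimal sketch. The class `UndoLoggedStore`, its `fail_after` parameter, and the use of ordinary Python dictionaries in place of a non-volatile medium are assumptions made for illustration only; the patent does not give an implementation.

```python
class UndoLoggedStore:
    """Toy key-value store whose updates are protected by an undo log."""

    def __init__(self):
        self.data = {}       # stands in for data held on a non-volatile medium
        self.undo_log = []   # entries are (key, existed_before, old_value)

    def update(self, changes, fail_after=None):
        """Apply `changes`; `fail_after` simulates an interruption after N writes."""
        self.undo_log.clear()
        for written, (key, new_value) in enumerate(changes.items()):
            if fail_after is not None and written >= fail_after:
                raise RuntimeError("simulated interruption during write")
            # Save the original value in the undo log first...
            self.undo_log.append((key, key in self.data, self.data.get(key)))
            # ...and only then modify the data itself.
            self.data[key] = new_value
        self.undo_log.clear()  # the update completed, so the log is discarded

    def recover(self):
        """Roll back a partially applied update, newest log entry first."""
        while self.undo_log:
            key, existed, old_value = self.undo_log.pop()
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)
```

If an update is interrupted after some keys have been written, calling `recover()` returns the store to the state it had before that update began.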

Summary of the invention

The object of the present invention is to address the shortcomings of the prior art described above by providing a process persistence method and device for a distributed non-volatile memory system. Through process persistence, the method and device allow a scientific computing task to resume execution from a breakpoint.

The object of the present invention is achieved by the following technical solution:

A process persistence method for a distributed non-volatile memory system, applied to a system composed of multiple computing nodes that are connected to one another through network communication. For any persistent process, execution includes the following steps:

Select one computing node as the master node and at least one other computing node as a secondary node; initialize the persistent process on the master node and, at the same time, create a process copy in the PM module of the master node;

While the persistent process is running, update the process copy at predetermined intervals and distribute it to each secondary node, each of which stores the process copy of the persistent process in its own PM module;

When the process abnormality mining module of the master node detects that the persistent process has failed, select a target migration node from the secondary nodes and restore the persistent process on the target migration node from the process copy stored there.

In a further improvement of the present invention, while the persistent process is being initialized, the process persistence unit of the master node creates a process structure in the PM module to serve as the process copy. The process copy includes a verification flag, the process PCB, the start address of the executing process, a node ID, a timestamp, and a copy of the process virtual address space. The copy of the process virtual address space uses a balanced binary tree data structure in which each virtual page of the virtual address space is a tree node.

In a further improvement of the present invention, the process copy is initialized while the persistent process is being initialized. Initialization includes the following steps:

(S11) Set the verification flag to unavailable;

(S12) Set the timestamp from the current time and the node ID from the ID of the current master node;

(S13) Write the PCB of the persistent process and the start address of the executing process into the process copy;

(S14) Traverse the virtual address space of the persistent process, store every virtual page in use in the copy of the process virtual address space, and reset the SOFT_DIRTY flag bit of that virtual page;

(S15) Put the persistent process into the runnable state and set the verification flag to available.
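Steps (S11) to (S15) can be sketched as follows. This is a simplified illustration, not the patented implementation: the `Page`, `Process`, and `ProcessCopy` classes and the `init_process_copy` function are hypothetical names, the process is modelled as an in-memory object rather than a kernel structure, and a plain dictionary keyed by page address stands in for the balanced binary tree that the text specifies.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Page:
    content: bytes
    soft_dirty: bool = True   # set when the page was written since the last copy


@dataclass
class Process:                # hypothetical stand-in for a running process
    pcb: dict
    entry_address: int
    pages: dict = field(default_factory=dict)   # virtual page address -> Page
    runnable: bool = False


@dataclass
class ProcessCopy:
    valid: bool = False       # the verification flag
    pcb: dict = field(default_factory=dict)
    entry_address: int = 0
    node_id: int = 0
    timestamp: float = 0.0
    # Assumption: a dict replaces the balanced binary tree to keep this short.
    address_space: dict = field(default_factory=dict)


def init_process_copy(proc, node_id, now=None):
    copy = ProcessCopy()
    copy.valid = False                                      # (S11)
    copy.timestamp = time.time() if now is None else now    # (S12)
    copy.node_id = node_id
    copy.pcb = dict(proc.pcb)                               # (S13)
    copy.entry_address = proc.entry_address
    for addr, page in proc.pages.items():                   # (S14)
        copy.address_space[addr] = page.content
        page.soft_dirty = False
    proc.runnable = True                                    # (S15)
    copy.valid = True
    return copy
```

Keeping the verification flag unavailable until the last step means a copy that was interrupted halfway through initialization is never mistaken for a usable one.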

In a further improvement of the present invention, when the process copy is updated, the copy of the process virtual address space is updated incrementally.

In a further improvement of the present invention, updating the process copy includes the following steps:

(S21) Suspend the persistent process and set the verification flag of the process copy to unavailable;

(S22) Update the timestamp of the process copy to the current time;

(S23) Write the PCB of the persistent process and the start address of the executing process into the process copy;

(S24) Traverse the virtual address space of the persistent process, update or insert each modified virtual page in the copy of the process virtual address space, and reset the modification flag of each virtual page in the virtual address space;

(S25) Put the persistent process into the runnable state and set the verification flag to available.
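The incremental update in steps (S21) to (S25) can be sketched in the same spirit. The function `update_process_copy` and the dictionary layout of `copy` and `proc` (keys such as `"pages"`, `"dirty"`, and `"address_space"`) are assumptions for illustration; a real implementation would read the per-page modification flags from the operating system.

```python
import time


def update_process_copy(copy, proc, now=None):
    """Refresh `copy` from `proc`; returns the number of pages written."""
    proc["state"] = "suspended"                                 # (S21)
    copy["valid"] = False
    copy["timestamp"] = time.time() if now is None else now     # (S22)
    copy["pcb"] = dict(proc["pcb"])                             # (S23)
    copy["entry_address"] = proc["entry_address"]
    written = 0
    for addr, page in proc["pages"].items():                    # (S24)
        if page["dirty"]:
            # Update the page if it is already in the copy, insert it otherwise.
            copy["address_space"][addr] = page["content"]
            page["dirty"] = False
            written += 1
    proc["state"] = "runnable"                                  # (S25)
    copy["valid"] = True
    return written
```

Only pages whose modification flag is set are written, so the cost of each periodic update tracks how much memory the process changed since the previous update rather than its total footprint.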

In a further improvement of the present invention, after the process abnormality mining module of the master node detects that the persistent process has failed, it forwards the fault information it detected to the failover module of the master node.

In a further improvement of the present invention, the failover module responds to the fault information through the following steps:

(S31) The failover module on the master node collects the running state of each secondary node, including CPU usage and available memory, and selects the target migration node according to that state;

(S32) A migration instruction is sent to each target migration node, requesting that the persistent process to be migrated be restored on that node;

(S33) After the target migration node receives the migration instruction and the persistent process to be restored, its process recovery unit looks up the process copy of that persistent process in its own PM module and checks the verification flag; if the flag is unavailable, process recovery fails. If several copies of the same process exist, the copy with the latest timestamp is selected;

(S34) The process recovery unit running on the target migration node rebuilds the persistent process on that node from the process copy, and the node becomes the new master node.
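The copy lookup in step (S33) can be sketched as below. The text does not say whether the timestamp comparison happens before or after the flag check; this sketch assumes the newest copy is chosen first and its flag checked afterwards. The names `pick_process_copy` and `RecoveryError` and the `"pid"` key are hypothetical.

```python
class RecoveryError(Exception):
    """Raised when no usable process copy is available on this node."""


def pick_process_copy(copies, pid):
    """Return the copy to restore from, following step (S33)."""
    candidates = [c for c in copies if c["pid"] == pid]
    if not candidates:
        raise RecoveryError(f"no process copy found for process {pid}")
    # Several copies of the same process: take the one with the latest timestamp.
    latest = max(candidates, key=lambda c: c["timestamp"])
    # A copy whose verification flag is unavailable cannot be restored.
    if not latest["valid"]:
        raise RecoveryError(f"latest copy of process {pid} is marked unavailable")
    return latest
```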

In a further improvement of the present invention, selecting the target migration node includes the following steps:

(S311) If the fault type is node power failure or program crash, the following is performed for each persistent process on the faulty node: among all running secondary nodes in the current distributed system, select as the target migration node the node with the lowest CPU usage among those whose available memory can accommodate the migrating process; if no node has enough available memory to accommodate the migrating process, select the node with the most available memory as the target migration node; then jump to step (S32);

(S312) If the fault type is load imbalance, the following is performed for one randomly selected persistent process on the faulty node: among all running secondary nodes in the current distributed system, select as the target migration node the node with the lowest CPU usage among those whose available memory can accommodate the migrating process; if no node has enough available memory to accommodate the migrating process, select the node with the most available memory as the target migration node; then jump to step (S32).
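The node selection rule shared by steps (S311) and (S312) can be written as a short function. The function name `select_target_node` and the dictionary keys `"running"`, `"free_memory"`, and `"cpu_usage"` are assumptions for illustration.

```python
def select_target_node(nodes, required_memory):
    """Pick a target migration node from the secondary nodes.

    Among running nodes whose available memory fits the migrating process,
    choose the one with the lowest CPU usage. If none fits, fall back to the
    running node with the most available memory.
    """
    running = [n for n in nodes if n["running"]]
    if not running:
        return None
    fits = [n for n in running if n["free_memory"] >= required_memory]
    if fits:
        return min(fits, key=lambda n: n["cpu_usage"])
    return max(running, key=lambda n: n["free_memory"])
```

For a power failure or crash the function would be called once per persistent process on the faulty node; for load imbalance it would be called once, for a randomly chosen persistent process.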

In a further improvement of the present invention, rebuilding the persistent process includes the following steps:

(S341) Create a new process on the target migration node and initialize its PCB from the process copy;

(S342) Traverse the copy of the process virtual address space in the process copy, restore the process address space from the page address and flag bits of each balanced binary tree node, and restore the memory data of the process from the page content of each node;

(S343) Open the corresponding files according to the open file table in the process copy;

(S344) Mark the new process as runnable, insert it into the scheduling queue, and start running it.
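Steps (S341) to (S344) can be outlined as follows. This is only a user-space model under stated assumptions: the function `rebuild_process`, the dictionary layout of the copy, and the `files_to_open` list are hypothetical, sorted iteration over a dictionary stands in for an in-order traversal of the balanced binary tree, and files are recorded rather than actually opened.

```python
def rebuild_process(copy, run_queue):
    """Model of rebuilding a persistent process from its process copy."""
    # (S341) Create a new process and initialize its PCB from the copy.
    proc = {"pcb": dict(copy["pcb"]), "pages": {}, "files_to_open": [],
            "state": "new"}
    # (S342) Visit pages in address order and restore flags and contents.
    for addr in sorted(copy["address_space"]):
        node = copy["address_space"][addr]
        proc["pages"][addr] = {"flags": node["flags"],
                               "content": node["content"]}
    # (S343) Record the files listed in the open file table of the copy.
    proc["files_to_open"] = list(copy["open_files"])
    # (S344) Mark the process runnable and hand it to the scheduler queue.
    proc["state"] = "runnable"
    run_queue.append(proc)
    return proc
```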

The present invention also includes a process persistence device for a distributed non-volatile memory system, arranged in the computing nodes of the distributed non-volatile memory system, the computing nodes being connected through network communication. The process persistence device includes a process persistence module, a process abnormality mining module, and a failover module, wherein:

The process persistence module is located on the master node where the persistent process runs and is configured to periodically record the state and execution progress of the persistent process to generate a process copy and to distribute the process copy to a number of secondary nodes; the master node and each secondary node save the process copy in their PM modules;

The process abnormality mining module is located on the master node where the persistent process runs and is configured to detect faults of the master node and, when a fault is detected, to forward the fault information to the failover module of the master node; the fault types include node power failure, process crash, and load imbalance;

A failover module is deployed on the master node and on each secondary node;

The failover module on the master node is configured to, after receiving fault information, select a target migration node from the secondary nodes according to the fault information and the running state of each secondary node, and to send a migration instruction to the target migration node requesting that the persistent process to be migrated be restored on it;

The failover module on a secondary node is configured to, after receiving a migration instruction, rebuild the persistent process on that secondary node, which then becomes the new master node.

In a further improvement of the present invention, the process persistence module includes a persistence creation unit and a persistence update unit;

The persistence creation unit is configured to create the process copy of the persistent process on the master node. The process copy includes a verification flag, the process PCB, the start address of the executing process, a node ID, a timestamp, and a copy of the process virtual address space. The copy of the process virtual address space uses a balanced binary tree data structure in which each virtual page of the virtual address space is a tree node;

The persistence update unit is configured to update the process copy periodically and to send the updated process copy to each secondary node.

In a further improvement of the present invention, the fault types detected by the process abnormality mining module include node power failure, process crash, and load imbalance. The process abnormality mining module includes a power failure detection unit, a program crash detection unit, a load balancing detection unit, and an exception handling unit;

The power failure detection unit is configured to pass power failure fault information to the exception handling unit when the master node loses power;

The program crash detection unit is configured to capture the return value of the persistent process when it exits and to send process crash fault information to the exception handling unit when the return value is abnormal;

The load balancing detection unit is configured to poll the load coefficients of the master node and each secondary node and to send load imbalance fault information to the exception handling unit when the difference between the load coefficients of any two nodes exceeds a threshold;

The exception handling unit is configured to receive fault information from the power failure detection unit, the program crash detection unit, and the load balancing detection unit, and to forward it to the failover module of the master node.
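The check performed by the load balancing detection unit can be sketched as below. Testing whether any two nodes differ by more than the threshold is equivalent to comparing the largest and smallest load coefficients. The function name `detect_imbalance` and the shape of its return value are assumptions; `DETECT_IMBALANCE` is the signal name defined in the glossary.

```python
def detect_imbalance(load_by_node, threshold):
    """Return a DETECT_IMBALANCE report if any two nodes differ too much.

    `load_by_node` maps a node name to its polled load coefficient.
    """
    if len(load_by_node) < 2:
        return None
    busiest = max(load_by_node, key=load_by_node.get)
    idlest = min(load_by_node, key=load_by_node.get)
    # If the extremes are within the threshold, every other pair is too.
    if load_by_node[busiest] - load_by_node[idlest] > threshold:
        return ("DETECT_IMBALANCE", busiest, idlest)
    return None
```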

In a further improvement of the present invention, the failover module includes a node information statistics unit, a migration decision unit, and a process recovery unit;

The node information statistics unit is configured to collect the running state of the master node and each secondary node;

The migration decision unit is configured to, when the master node on which it runs detects a fault, determine which persistent processes need to fail over and select the target migration node for each according to the running state of the computing nodes in the distributed system;

The process recovery unit on the master node is configured to send a migration instruction to the target migration node, according to the persistent process that the migration decision unit has decided to migrate and the corresponding target migration node;

The process recovery unit on the target migration node is configured to, after receiving the migration instruction, look up the process copy in the PM module of that node and rebuild the persistent process from it.

The advantages of the present invention are as follows: it mainly addresses the lack of reliability guarantees for long-running scientific computing processes in large-scale distributed systems. It exploits the non-volatility, high speed, and large capacity of PM to maintain process checkpoints and process copies automatically and to recover automatically when the system fails, and it has good market prospects and application value.

Description of the Drawings

Fig. 1 is a schematic diagram of the overall architecture of a distributed system equipped with PM;

Fig. 2 is a schematic diagram of the overall architecture of the process persistence device for a distributed non-volatile memory system;

Fig. 3 is a schematic diagram of the process persistence execution module and its constituent units;

Fig. 4 is a schematic diagram of the process exception mining module and its constituent units;

Fig. 5 is a schematic diagram of the structure of the process virtual address space copy;

Fig. 6 is a schematic diagram of inserting a node into the balanced binary tree of the process address space.

Detailed Description

As shown in Fig. 2, an embodiment of the present invention includes a process persistence device for a distributed non-volatile memory system. The device is deployed in the computing nodes of the distributed non-volatile memory system, and the computing nodes are connected to one another over a network. For a given persistent process that must run persistently, one computing node in the system acts as the primary node and two or more computing nodes act as secondary nodes; different persistent processes may have different primary nodes. The process persistence device includes a process persistence module, a process exception mining module, and a failover module. Specifically:

The process persistence module is deployed on the primary node where the persistent process resides. It is configured to periodically record the state and execution progress of the persistent process to generate a process copy, and to distribute the process copy to each secondary node. The primary node and each secondary node store the process copy in their PM modules, so that the persistent process can be rebuilt on a secondary node from the process copy. A process copy can be generated only after the persistent process has been scheduled for execution at least once.

In terms of implementation, each process copy is equivalent to a checkpoint. The process persistence module updates process copies by append-only writes, periodically refreshing the state and execution progress recorded in each copy of the persistent process, so that the executing process on the primary node and the backup processes on the secondary nodes eventually reach the same state and execution progress.

The process exception mining module is deployed on the primary node where the persistent process resides. Its main function is to detect faults of the primary node and forward the fault information to the failover module of the primary node. The fault types of the primary node are node power failure, process crash, and load imbalance. In terms of implementation, abnormal conditions are delivered through an interrupt mechanism or an event-triggered mechanism.

The failover module is deployed on both the primary node and the secondary nodes. Its main function is to determine the type of the abnormal condition and move the affected process to a suitable node where it continues to run. In terms of implementation, the persistent process is rebuilt on a secondary node from the process copy backed up at the most recent checkpoint; that secondary node then becomes the primary node, and the backup process becomes the executing process. The failover modules on the primary node and on the secondary nodes perform different functions, specifically:

The failover module on the primary node is configured to, after receiving fault information, select a target migration node from the secondary nodes according to the fault information and the running status of each secondary node, and to send a migration instruction to the target migration node requesting that the persistent process to be migrated be restored there;

The failover module on a secondary node is configured to, after receiving a migration instruction, rebuild the persistent process on that secondary node and make that node the new primary node.

The process persistence module, process exception mining module, and failover module described above may be software modules implemented purely in software, hardware modules implemented in dedicated hardware, or functional modules implemented by a combination of software and hardware.

In some embodiments, the process persistence module includes a persistence creation unit and a persistence update unit, both of which operate under the crash-consistency guarantee mechanism.

The persistence creation unit is configured to create a process copy of the persistent process on the primary node. The process copy contains a check flag, the process PCB, the start address of the executing process, the node ID, a timestamp, and a copy of the process's virtual address space. The entire persistent process can be rebuilt from these contents. The virtual address space copy is organized as a balanced binary tree in which each virtual page of the address space is a tree node. In an implementation, the process copy can be realized with the data structure Process_struct.
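The fields listed above can be pictured with a small Python sketch. It is only an illustration of the layout, not the patented implementation: the class names, the field types, the 4096-byte page size, and the use of a dictionary keyed by page address in place of the balanced binary tree are all assumptions made for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size; the text does not state one


@dataclass
class PageNode:
    """One virtual page stored in the address-space copy."""
    address: int     # page-aligned virtual address
    flags: int       # page flags (encoding is hypothetical)
    content: bytes   # saved page contents


@dataclass
class ProcessStruct:
    """Illustrative stand-in for the Process_struct process copy."""
    check_flag: int = 0        # 0 = unavailable, 1 = available
    pcb: bytes = b""           # opaque process control block
    start_address: int = 0     # start address of the executing process
    node_id: str = ""
    timestamp: float = 0.0
    # A dict keyed by page address stands in for the balanced binary tree.
    address_space: Dict[int, PageNode] = field(default_factory=dict)

    def put_page(self, page: PageNode) -> None:
        """Insert a page, or overwrite the stored page at the same address."""
        self.address_space[page.address] = page

    def pages_in_order(self) -> List[PageNode]:
        """Return pages sorted by address, as an in-order tree walk would."""
        return [self.address_space[a] for a in sorted(self.address_space)]
```

A real implementation would keep the tree itself in PM so that lookups and insertions stay logarithmic without rebuilding an index after a restart; the dictionary here only mirrors the insert-or-update behaviour.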

The persistence update unit is configured to periodically update the process copy and send the updated copy to each secondary node. Updates are transmitted to the secondary nodes incrementally over TCP/IB, which produces the backup data structure Process_struct_back on each secondary node. For a given executing process, the persistence update unit keeps, in undo fashion, the two checkpoints with the most recent timestamps on the primary and secondary nodes.

The crash-consistency guarantee mechanism persists, in undo fashion, the items shown in Fig. 3: the start address of the executing process, the organization of the linear regions in its process address space, its page global directory, and the contents of its CR3 register. All of these are backed up into the process copy. The contents of the executing process's user data pages are updated copy-on-write, guided mainly by the organization of the linear regions in its address space.

As shown in Fig. 4, the fault types detected by the process exception mining module are node power failure, process crash, and load imbalance. The module includes a power-failure detection unit, a program crash detection unit, a load-balancing detection unit, and an exception handling unit.

The power-failure detection unit is configured to pass power-failure fault information to the exception handling unit when the primary node loses power. The unit consists of a dedicated microcontroller that monitors the PWR_OK signal of the node's power supply unit while the node is running. When the signal drops to a low level, the unit determines that a power failure has occurred and sends a DETECT_PWR signal to the other nodes over the out-of-band management network. Alternatively, the power-failure detection unit can use ipmitool to poll the power status of the secondary nodes continuously from the primary node; when a poll receives no response, the unit determines that the power supply of that secondary node has failed and sends a DETECT_PWR signal to the exception handling units of the local node and of the other secondary nodes.

The program crash detection unit is configured to capture the return value of the persistent process when it exits and, if the return value is abnormal, to send process-crash fault information to the exception handling unit. When the captured value is abnormal (for example, termination by the segmentation-fault signal SIGSEGV or a user-defined abnormal return value), the unit determines that the persistent process has crashed and exited abnormally, and sends a DETECT_CRASH signal to the other nodes. The program crash detection unit can also detect kernel crashes: when the Linux kernel crashes, the unit uses the kexec mechanism to boot a backup kernel, from which it sends the DETECT_CRASH signal to the other nodes.
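The user-space part of this check can be sketched in Python as follows. The sketch relies on the documented behaviour of the subprocess module on POSIX, where a child terminated by signal N is reported with return code -N. The function names, the abnormal_codes parameter, and the choice to treat every signal termination as a crash are assumptions for illustration; the kernel-crash path through kexec is not covered.

```python
import signal
import subprocess


def is_crash(returncode: int, abnormal_codes=frozenset()) -> bool:
    """Decide whether an exit status should be reported as a process crash.

    returncode follows the subprocess convention on POSIX: a negative value
    -N means the child was terminated by signal N (for example SIGSEGV).
    abnormal_codes holds user-defined abnormal return values (hypothetical).
    """
    if returncode < 0:
        # Assumption: any termination by signal counts as a crash.
        return True
    return returncode in abnormal_codes


def run_and_check(argv, abnormal_codes=frozenset()) -> bool:
    """Run a command to completion and report whether it crashed."""
    completed = subprocess.run(argv)
    return is_crash(completed.returncode, abnormal_codes)
```

A detection unit built this way would send DETECT_CRASH whenever run_and_check returns True; how that signal is transported to the other nodes is outside the scope of the sketch.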

The load-balancing detection unit is configured to poll the load coefficients of the primary node and each secondary node, and to send load-imbalance fault information to the exception handling unit when the load coefficients of any two nodes differ by more than a threshold. The load-balancing detection unit running on the primary node uses ipmitool to poll the CPU usage (cpu_usage) and memory usage (memory_usage) of the secondary nodes and of the local node. Administrators may customize the CPU usage weight cpu_weight and the memory usage weight memory_weight. The unit computes a load coefficient for each node as cpu_usage × cpu_weight + memory_usage × memory_weight. When the load coefficients of any two nodes differ by 40% or more, the system is considered load-imbalanced and the DETECT_IMBALANCED signal is passed to the exception handling unit.
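The load coefficient and the imbalance test can be written out as a short Python sketch. The formula is the one given above; the default weights of 0.5, the representation of usage as fractions between 0 and 1, the reading of "40%" as an absolute difference of 0.40, and the inclusive comparison are assumptions.

```python
def load_coefficient(cpu_usage: float, memory_usage: float,
                     cpu_weight: float = 0.5,
                     memory_weight: float = 0.5) -> float:
    """cpu_usage * cpu_weight + memory_usage * memory_weight (usages in 0..1)."""
    return cpu_usage * cpu_weight + memory_usage * memory_weight


def is_imbalanced(coefficients: dict, threshold: float = 0.40) -> bool:
    """Return True if any two nodes differ by at least the threshold.

    Comparing the largest and smallest coefficient is equivalent to
    checking every pair of nodes.
    """
    values = list(coefficients.values())
    if len(values) < 2:
        return False
    return max(values) - min(values) >= threshold


# Hypothetical readings for a three-node cluster.
coefficients = {
    "master": load_coefficient(0.9, 0.7),
    "slave0": load_coefficient(0.2, 0.3),
    "slave1": load_coefficient(0.5, 0.5),
}
imbalanced = is_imbalanced(coefficients)
```

With these sample readings the largest and smallest coefficients are about 0.80 and 0.25, so the check reports an imbalance and the unit would raise DETECT_IMBALANCED.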

The exception handling unit is configured to receive fault information from the power-failure detection unit, the program crash detection unit, and the load-balancing detection unit, and to forward that fault information to the failover module of the primary node.

The failover module includes a node information statistics unit, a migration decision unit, and a process recovery unit. After receiving a fault signal from the process exception mining module, the failover module starts the process failover procedure: it determines the fault type, selects the affected node and persistent processes accordingly, selects a target migration node, migrates the processes, and resumes them on the target migration node.

The node information statistics unit is configured to collect the running status of the primary node and each secondary node as input for the migration decision unit. It uses ipmitool to query the current running status of the primary node itself and of the other nodes, which includes cpu_usage (CPU usage) and memory_available (available memory), and sends the results to the migration decision unit.

The migration decision unit is configured to determine, when the primary node on which it resides detects a fault, which persistent processes need to be failed over, and to select a target migration node for each according to the running status of the computing nodes in the distributed system. Its decision rules are as follows:

If the fault signal sent by the process exception mining module is DETECT_PWR, all persistent processes on the faulty node must be migrated; if the fault signal is DETECT_CRASH, the crashed persistent process must be migrated; if the fault signal is DETECT_IMBALANCED, one persistent process running on the faulty node is chosen at random for migration. The node status data needed by the migration decision unit is supplied by the node information statistics unit on the primary node. Once the CPU usage cpu_usage and available memory memory_available of each node are known, the target migration node for each process to be migrated is selected as follows: among all running secondary nodes in the distributed system, consider those whose memory_available can accommodate the migrating process and choose the one with the lowest cpu_usage; if no secondary node has enough memory_available, choose the secondary node with the largest memory_available. After the persistent process to be migrated and the target migration node have been selected, the migration instruction is sent to the process recovery unit running on the primary node.
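The target selection rule can be expressed as a short Python sketch. The NodeStatus type, the select_target name, the required_memory parameter, and the use of bytes as the memory unit are assumptions introduced for illustration; only the rule itself comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeStatus:
    node_id: str
    cpu_usage: float        # fraction between 0 and 1
    memory_available: int   # bytes (unit is an assumption)


def select_target(nodes: Iterable[NodeStatus],
                  required_memory: int) -> Optional[NodeStatus]:
    """Pick the target migration node among the running secondary nodes.

    Prefer the node with the lowest cpu_usage among those that can hold the
    process; if none can, fall back to the node with the most free memory.
    """
    candidates = list(nodes)
    if not candidates:
        return None
    fitting = [n for n in candidates if n.memory_available >= required_memory]
    if fitting:
        return min(fitting, key=lambda n: n.cpu_usage)
    return max(candidates, key=lambda n: n.memory_available)


# Hypothetical cluster state: both nodes have enough memory.
secondaries = [
    NodeStatus("master", cpu_usage=0.40, memory_available=8 << 30),
    NodeStatus("slave1", cpu_usage=0.30, memory_available=8 << 30),
]
target = select_target(secondaries, required_memory=1 << 30)
```

Here target is the slave1 entry, because both nodes can hold the process and slave1 has the lower CPU usage.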

The process recovery unit on the primary node is configured to send a migration instruction to the target migration node, based on the persistent process and the corresponding target migration node chosen by the migration decision unit.

The process recovery unit on the target migration node is configured to, after receiving the migration instruction, look up the process copy in that node's PM module and rebuild the persistent process from it. During rebuilding, the unit looks up the Process_struct (process copy) of the migrating process on the local node and, from its contents, rebuilds the process's PCB information, user data pages, open files, and so on, and then resumes execution of the process.

An embodiment of the present invention also includes a process persistence method for a distributed non-volatile memory system, applied to a system composed of multiple computing nodes that are connected to one another over a network. For any persistent process, the method includes the following steps:

(S1) Select one computing node as the primary node and at least one other computing node as a secondary node (typically 2 to 3 secondary nodes); initialize the persistent process on the primary node and create a process copy (the Process_struct data structure) in the primary node's PM module;

(S2) While the persistent process is running, update the process copy at a predetermined interval and distribute it to each secondary node, which stores the process copy of the persistent process in its own PM module;

(S3) When the process exception mining module of the primary node detects a fault in the persistent process, select a target migration node from the secondary nodes and restore the persistent process on the target migration node from the process copy stored there.

While the persistent process is being initialized, the process persistence module of the primary node creates a process structure in the PM module to serve as the process copy. The process copy contains a check flag, the process PCB, the start address of the executing process, the node ID, a timestamp, and a copy of the process's virtual address space. The virtual address space copy is organized as a balanced binary tree in which each virtual page of the address space is a tree node; its structure is shown in Fig. 5.

In step (S1), the user designates process P as a persistent process when creating it, and the linker links it with the program of the process persistence execution module. When the process finishes its own initialization and enters the runnable (TASK_RUNNING) state, a Process_struct data structure (process copy) is created in the PM of its primary node and then initialized as follows:

(S11) Set the check flag to unavailable (0);

(S12) Set the timestamp from the current time and the node ID from the ID of the current primary node;

(S13) Write the PCB of the persistent process and the start address of the executing process into the process copy;

(S14) Traverse the virtual address space of the persistent process, store each virtual page in use into the process virtual address space copy, and reset the SOFT_DIRTY flag of that virtual page; the detailed steps are shown in Fig. 6;

(S15) Put the persistent process into the runnable (TASK_RUNNING) state and set the check flag to available (1). The check flag indicates whether the process copy is usable: if the flag is unavailable, the backup was interrupted and the process copy is incomplete.
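The role of the check flag in steps (S11) to (S15) can be illustrated with a small Python sketch. A plain dictionary stands in for the PM region, and the function name, parameter names, and the clock argument are assumptions; resetting SOFT_DIRTY and changing the process state are omitted. The point is only the ordering: the flag is cleared first and set last, so an interrupted backup leaves a copy that is visibly incomplete.

```python
import time


def init_process_copy(copy: dict, node_id: str, pcb: bytes,
                      start_address: int, pages: dict,
                      clock=time.time) -> None:
    """Fill `copy` following (S11)-(S15); `copy` stands in for a PM region."""
    copy.clear()
    copy["check_flag"] = 0                   # (S11) mark as unavailable
    copy["timestamp"] = clock()              # (S12) timestamp and node ID
    copy["node_id"] = node_id
    copy["pcb"] = pcb                        # (S13) PCB and start address
    copy["start_address"] = start_address
    copy["address_space"] = {}               # (S14) store every page in use
    for address in sorted(pages):
        copy["address_space"][address] = bytes(pages[address])
    copy["check_flag"] = 1                   # (S15) mark as available


# A successful backup ends with the flag set.
pm_region = {}
init_process_copy(pm_region, "slave0", b"pcb", 0x400000,
                  {0x1000: b"page-a", 0x2000: b"page-b"})

# A backup that fails midway (simulated by an unreadable page) keeps flag 0.
interrupted = {}
try:
    init_process_copy(interrupted, "slave0", b"pcb", 0x400000,
                      {0x1000: b"page-a", 0x2000: None})
except TypeError:
    pass
```

On real persistent memory the stores would also need to be flushed and ordered before the flag is set; that detail is hardware specific and is not modelled here.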

In this embodiment, the process virtual address space copy is updated incrementally when the process copy is updated, so the method needs to build a complete process copy of the persistent process only once. The update period is set by the user during the initialization phase of the persistent process. Updating the process copy includes the following steps:

(S21) Suspend the persistent process and set the check flag of the process copy to unavailable;

(S22) Update the timestamp of the process copy from the current time and write the CPU caches back to memory;

(S23) Write the PCB of the persistent process and the start address of the executing process into the process copy;

(S24) Traverse the virtual address space of the persistent process; update or insert into the process virtual address space copy each virtual page that has been modified (SOFT_DIRTY flag = 1), and reset the modification flag (soft_dirty) of every virtual page in the address space to 0; the detailed steps are shown in Fig. 6;

(S25) Put the persistent process into the runnable (TASK_RUNNING) state and set the check flag to available. After the update completes, the data structure is sent to the secondary nodes over TCP/IB, and the user process then continues to run normally.
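Step (S24) can be sketched from user space in Python using the Linux soft-dirty interface: bit 55 of each 64-bit entry in /proc/PID/pagemap reports whether the page was written since the bits were last cleared, and writing 4 to /proc/PID/clear_refs clears them. The function names, the 4096-byte page size, and the read_page callback are assumptions; the described device may well do this inside the kernel instead.

```python
import struct

PAGE_SIZE = 4096            # assumed page size
SOFT_DIRTY_BIT = 1 << 55    # soft-dirty bit of a pagemap entry on Linux


def is_soft_dirty(entry: int) -> bool:
    """True if a 64-bit pagemap entry has the soft-dirty bit set."""
    return bool(entry & SOFT_DIRTY_BIT)


def dirty_pages(pagemap_bytes: bytes, first_page_address: int) -> list:
    """Decode consecutive 8-byte pagemap entries (native byte order) and
    return the addresses of the pages marked soft-dirty."""
    count = len(pagemap_bytes) // 8
    entries = struct.unpack(f"={count}Q", pagemap_bytes[:count * 8])
    return [first_page_address + i * PAGE_SIZE
            for i, entry in enumerate(entries) if is_soft_dirty(entry)]


def apply_incremental_update(address_space_copy: dict, dirty: list,
                             read_page) -> None:
    """Update or insert only the modified pages in the address-space copy."""
    for address in dirty:
        address_space_copy[address] = read_page(address)


def clear_soft_dirty(pid: int) -> None:
    """Reset the soft-dirty bits of every page of the process."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")
```

In use, the raw bytes would come from reading /proc/PID/pagemap at offset (virtual address // PAGE_SIZE) * 8 for each mapped range, and clear_soft_dirty would be called once the dirty pages have been copied.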

After the process exception mining module of the primary node detects a fault in the persistent process, it forwards the detected fault information to the failover module of the primary node. The failover module responds to the fault information through the following steps:

(S31) The failover module on the primary node collects the running status of each secondary node and selects the target migration node accordingly. The running status includes CPU usage (cpu_usage) and available memory (memory_available). Selecting the target migration node includes the following steps:

(S311) If the fault type is node power failure or program crash, perform the following for each persistent process on the faulty node: among all running secondary nodes in the distributed system, consider those whose available memory (memory_available) can accommodate the migrating process and choose the one with the lowest CPU usage (cpu_usage) as the target migration node; if no node has enough available memory (memory_available), choose the node with the largest available memory (memory_available) as the target migration node. Then proceed to step (S32).

(S312) If the fault type is load imbalance, perform the following for one randomly selected persistent process on the faulty node: among all running secondary nodes in the distributed system, consider those whose available memory (memory_available) can accommodate the migrating process and choose the one with the lowest CPU usage (cpu_usage) as the target migration node; if no node has enough available memory (memory_available), choose the node with the largest available memory (memory_available) as the target migration node. Then proceed to step (S32).

(S32) Send a migration instruction to each target migration node, requesting that the persistent process to be migrated be restored on that node;

(S33) After the target migration node receives the migration instruction and the identity of the persistent process to restore, its process recovery unit looks up the process copy of that persistent process in its own PM module and checks the check flag; if the flag is unavailable, process recovery fails. If several process copies of the same process exist, the copy with the most recent timestamp is chosen;

(S34) The process recovery unit running on the target migration node rebuilds the persistent process on that node from the process copy and makes that node the new primary node. Rebuilding the persistent process includes the following steps:

(S341) Create a new process on the target migration node and initialize its PCB from the process copy;

(S342) Traverse the process virtual address space copy in the process copy; restore the process address space from the page addresses and flags of the balanced binary tree nodes, and restore the process's memory data from the page contents of those nodes;

(S343) Open the corresponding files according to the open-file table in the process copy; this requires the target migration node to provide the same file resources and device drivers as the original primary node;

(S344) Mark the new process as runnable (TASK_RUNNING), insert it into the scheduling queue, and start running it.
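The copy lookup in step (S33), which precedes the rebuild steps above, can be sketched in Python. The dictionary layout, the process_key field, and the choose_copy name are hypothetical. The sketch also assumes that when the newest copy is flagged unavailable, an older valid copy may be used instead; the text keeps the two most recent checkpoints, which suggests this, but it does not state the fallback explicitly.

```python
from typing import Iterable, Optional


def choose_copy(copies: Iterable[dict], process_key: str) -> Optional[dict]:
    """Return the newest usable process copy for `process_key`, or None.

    A copy is usable only if its check_flag is 1. Returning None corresponds
    to the recovery failure described in (S33).
    """
    usable = [c for c in copies
              if c["process_key"] == process_key and c["check_flag"] == 1]
    if not usable:
        return None
    return max(usable, key=lambda c: c["timestamp"])


# Hypothetical contents of a PM module holding two checkpoints of one process.
stored = [
    {"process_key": "slave0:100", "check_flag": 1, "timestamp": 60.0},
    {"process_key": "slave0:100", "check_flag": 0, "timestamp": 120.0},
    {"process_key": "master:7", "check_flag": 1, "timestamp": 90.0},
]
chosen = choose_copy(stored, "slave0:100")
```

In this example the checkpoint at timestamp 120.0 was interrupted, so the lookup falls back to the complete checkpoint at 60.0.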

Specific embodiments of the present invention are described below with reference to the accompanying drawings.

Specific Embodiment 1:

In this example, we assume the device runs on a distributed system consisting of a single cluster with three nodes whose node IDs are master, slave0, and slave1. The nodes run the Linux operating system, and each has ample DRAM and PM. The update period of the Process_struct data structure (process copy) is set to 60 seconds. While the persistent process is running, the slave0 node suffers a power failure or program crash. Because the two have the same effect, this embodiment does not distinguish between them.

The steps are described below with reference to the device:

Step 1: The user runs an executable file on the distributed system through the interface provided by the device to perform scientific computing. The operating system assigns it to the slave0 node, which becomes the primary node, with master and slave1 as secondary nodes, and creates a persistent process with PID 100 on slave0;

Step 2: Once the process has been created and enters the TASK_RUNNING (runnable) state, creation of the Process_struct data structure (process copy) begins, as follows:

Step 2.1: The process persistence module creates an empty Process_struct data structure (process copy) in the PM module of the slave0 node (the primary node), with the check flag initially set to 0;

Step 2.2: Set the timestamp of the Process_struct data structure (process copy) to the current system time, set the node ID to slave0, and save the process PCB into the process control block field. Traverse every VMA (virtual memory area) of the process; for each memory page in it, create a balanced binary tree node and insert it into the address space balanced binary tree. The insertion process is shown in Fig. 6;

Step 2.3: Write 4 to /proc/$pid/clear_refs to reset the SOFT_DIRTY bit of all memory pages of the process;

Step 2.4: Set the check flag of the Process_struct data structure (process copy) to indicate that it is complete and usable, and distribute the data structure over TCP/IB to the master node and the slave1 node (both are secondary nodes for this persistent process), each of which stores it in its PM module;

Step 3: Wake the process so that it starts running normally. From then on, update the Process_struct data structure (process copy) every 60 seconds, as follows:

Step 3.1: Every 60 seconds, the persistence update unit suspends the process, locates its Process_struct data structure (process copy), and sets the check flag to 0;

Step 3.2: Set the timestamp of the Process_struct data structure (process copy) to the current system time, and save the process's PCB to the process control block area. Traverse each VMA (virtual memory area) of the process; for each memory page whose SOFT_DIRTY bit is 1, update or insert the corresponding node of the address-space balanced binary tree. The update/insertion procedure is shown in Figure 6;

Step 3.3: Write 4 to /proc/$pid/clear_refs to reset the SOFT_DIRTY bit of every memory page of the process;

Step 3.4: Set the check flag of the Process_struct data structure (process copy) to indicate that it is complete and usable, and distribute the data structure over TCP/IB to the master node and the slave1 node, each of which stores it in its PM module;
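The incremental update in steps 3.2 and 3.3 relies on the Linux soft-dirty mechanism: each 64-bit entry of /proc/<pid>/pagemap carries the soft-dirty flag in bit 55, and writing 4 to /proc/<pid>/clear_refs clears that flag for all pages. The sketch below is a user-space illustration only; the helper names and the 4096-byte page size are assumptions, and the source performs the equivalent work inside its persistence update unit.

```python
import struct

SOFT_DIRTY_BIT = 55      # bit 55 of a /proc/<pid>/pagemap entry
PAGEMAP_ENTRY_SIZE = 8   # each entry is one 64-bit word


def is_soft_dirty(entry: int) -> bool:
    """Return True if a pagemap entry has the soft-dirty bit set."""
    return bool((entry >> SOFT_DIRTY_BIT) & 1)


def pagemap_offset(vaddr: int, page_size: int = 4096) -> int:
    """Byte offset in /proc/<pid>/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def read_pagemap_entry(pid: int, vaddr: int, page_size: int = 4096) -> int:
    """Read the raw pagemap entry for one virtual address (Linux only)."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        return struct.unpack("=Q", f.read(PAGEMAP_ENTRY_SIZE))[0]


def clear_soft_dirty(pid: int) -> None:
    """Steps 2.3 / 3.3: writing 4 resets the soft-dirty bit of every page."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def incremental_update(saved_pages: dict, current_pages: dict, dirty_addrs) -> int:
    """Step 3.2: update or insert only the pages reported dirty; return the count."""
    updated = 0
    for addr in dirty_addrs:
        saved_pages[addr] = bytes(current_pages[addr])
        updated += 1
    return updated
```

Pages that were not written since the previous checkpoint are skipped, so the cost of each 60-second update scales with the number of modified pages rather than with the size of the address space.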

Step 4: While running, the slave0 node suffers a power failure or program crash. The process exception mining module on slave0 captures the event and sends a DETECT_PWR or DETECT_CRASH fault signal to the failover module running on slave0;

Step 5: The process migration module searches the PM for Process_struct data structures (process copies) whose node ID is slave0 and finds exactly one;

Step 6: The node information statistics unit collects the current state of the cluster and learns that cpu_usage is 40% on the master node and 30% on the slave1 node, and that both have ample free memory. It therefore selects slave1 as the target migration node and sends slave1 an instruction to start the process recovery there;

Step 7: The slave1 node receives the instruction from slave0 and recovers the process of the slave0 node, as follows:

Step 7.1: The slave1 node searches its PM for Process_struct data structures (process copies) whose node ID is slave0, finds exactly one, verifies that its check flag marks it as complete, and begins reading its contents;

Step 7.2: The slave1 node creates a new process locally and initializes the new process's PCB from the process control block part of the Process_struct data structure (process copy). This requires slave1 to provide the process with the same files and device resources that it needs;

Step 7.3: Traverse the address-space balanced binary tree; for each node, allocate and map a memory page at the corresponding virtual address and fill it with the saved page content;

Step 7.4: Mark the process as a persistent process; from then on, the persistence update unit on slave1 maintains the process's Process_struct data structure (process copy);

Step 7.5: Wake the process so that it continues running on the slave1 node.
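Steps 7.1 and 7.3 can be sketched as two small Python functions. The dictionary layout of a process copy, the function names, and the simulated memory are all hypothetical. Ignoring copies whose check flag is 0 and then taking the newest remaining one is one reading of the claims, which state only that recovery fails on an unusable copy and that the latest timestamp wins when several copies exist.

```python
def pick_copy(copies, node_id):
    """Step 7.1: return the newest usable copy recorded for node_id, or None."""
    usable = [
        c for c in copies
        if c["node_id"] == node_id and c["check_flag"] == 1
    ]
    if not usable:
        return None  # nothing complete to recover from
    return max(usable, key=lambda c: c["timestamp"])


def restore_address_space(copy):
    """Step 7.3: visit saved pages in ascending address order and fill them in.

    A dict of bytearrays simulates the new process's memory; sorted() plays
    the role of an in-order traversal of the balanced binary tree.
    """
    memory = {}
    for addr in sorted(copy["pages"]):
        memory[addr] = bytearray(copy["pages"][addr])
    return memory
```

A copy that was interrupted mid-update still has its check flag at 0 (step 3.1), so it is never chosen, and recovery falls back to the last copy that completed.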

This completes the migration of the persistent process. The persistent process on the failed slave0 node is not lost to the fault; it resumes on slave1 from a recent checkpoint, which demonstrates the invention's ability to protect long-running processes in a distributed system.
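The choice made in step 6 follows the selection rule stated in the claims: among the secondary nodes with enough available memory, take the one with the lowest CPU utilization, and if none has enough memory, take the one with the most. A minimal Python sketch, assuming a hypothetical list-of-dicts node description and illustrative memory figures:

```python
def select_target(nodes, required_memory):
    """Pick a target migration node from the running secondary nodes.

    nodes: list of dicts with 'name', 'cpu_usage' (percent) and
           'free_memory' (bytes); this layout is an assumption.
    """
    if not nodes:
        raise ValueError("no secondary node available")
    fitting = [n for n in nodes if n["free_memory"] >= required_memory]
    if fitting:
        # Enough memory somewhere: prefer the least busy CPU.
        return min(fitting, key=lambda n: n["cpu_usage"])["name"]
    # No node can hold the process: fall back to the largest free memory.
    return max(nodes, key=lambda n: n["free_memory"])["name"]


# Embodiment 1, step 6: master at 40%, slave1 at 30%, memory ample on both.
secondaries = [
    {"name": "master", "cpu_usage": 40, "free_memory": 8 << 30},
    {"name": "slave1", "cpu_usage": 30, "free_memory": 8 << 30},
]
target = select_target(secondaries, required_memory=1 << 30)
```

With these inputs `target` is `"slave1"`, matching the node chosen in the embodiment.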

Specific Embodiment 2:

In this example, we assume that the device runs on a distributed system with one cluster of three nodes: master, slave0, and slave1. Each node runs the Linux operating system and has ample DRAM and PM. The update period of the Process_struct data structure (process copy) is set to 60 seconds. While the persistent process is running, the system load becomes unbalanced.

The steps are described below with reference to the device:

Step 1: The user uses the interface provided by the device to run an executable file on the distributed system for a scientific computation. The operating system assigns it to the slave1 node, which becomes the primary node, with master and slave0 as secondary nodes, and creates a persistent process on slave1 with PID 200;

Step 2: Once the process has been created and enters the TASK_RUNNABLE (runnable) state, creation of the Process_struct data structure (process copy) begins, as follows:

Step 2.1: The module creates an empty Process_struct data structure (process copy) in the PM of the slave1 node, with the check flag initially set to 0;

Step 2.2: Set the timestamp of the Process_struct data structure (process copy) to the current system time, set its node ID to slave1, and save the process's PCB to the process control block area. Traverse each VMA (virtual memory area) of the process; for each memory page in it, create a balanced binary tree node and insert it into the address-space balanced binary tree. The insertion procedure is shown in Figure 6;

Step 2.3: Write 4 to /proc/$pid/clear_refs to reset the SOFT_DIRTY bit of every memory page of the process;

Step 2.4: Set the check flag of the Process_struct data structure (process copy) to indicate that it is complete and usable, and distribute it over TCP/IB to the master node and the slave0 node, each of which stores it in its PM;

Step 3: Wake the process so that it starts running normally. From then on, update the Process_struct data structure (process copy) every 60 seconds, as follows:

Step 3.1: Every 60 seconds, the persistence update unit suspends the process, locates its Process_struct data structure (process copy), and sets the check flag to 0;

Step 3.2: Set the timestamp of the Process_struct data structure (process copy) to the current system time, and save the process's PCB to the process control block area. Traverse each VMA (virtual memory area) of the process; for each memory page whose SOFT_DIRTY bit is 1, update or insert the corresponding node of the address-space balanced binary tree. The update/insertion procedure is shown in Figure 6;

Step 3.3: Write 4 to /proc/$pid/clear_refs to reset the SOFT_DIRTY bit of every memory page of the process;

Step 3.4: Set the check flag of the Process_struct data structure (process copy) to indicate that it is complete and usable, and distribute the data structure over TCP/IB to the master node and the slave0 node, each of which stores it in its PM;

Step 4: At some point, the process exception mining module running on the slave1 node detects that the system load has become unbalanced: the cpu_usage of the master, slave0, and slave1 nodes is 40%, 30%, and 50% respectively. That module, running on the process's primary node, sends a DETECT_IMBALANCED signal to the failover module;

Step 5: The process migration module searches the PM for Process_struct data structures (process copies) whose node ID is slave1 and finds exactly one;

Step 6: The node information statistics unit collects the current state of the cluster and learns that cpu_usage is 40% on the master node and 30% on the slave0 node, and that both have ample free memory. It therefore selects slave0 as the target migration node and sends slave0 an instruction to start the process migration there; at the same time it sends slave1 an instruction to stop the persistent process running on it;

Step 7: The slave0 node receives the instruction from slave1 and recovers the process of the slave1 node, as follows:

Step 7.1: The slave0 node searches its PM for Process_struct data structures (process copies) whose node ID is slave1, finds exactly one, verifies that its check flag marks it as complete, and begins reading its contents;

Step 7.2: The slave0 node creates a new process locally and initializes the process's PCB from the process control block part of the Process_struct data structure (process copy). This requires slave0 to provide the process with the same files and device resources that it needs;

Step 7.3: Traverse the address-space balanced binary tree; for each node, allocate and map a memory page at the corresponding virtual address and fill it with the saved page content;

Step 7.4: Mark the process as a persistent process; from then on, the persistence update unit on slave0 maintains the process's Process_struct data structure (process copy);

Step 7.5: Wake the process so that it continues running on the slave0 node.

This completes the migration of the persistent process. When load becomes unbalanced across the nodes of a distributed system, the method and device of the present invention can migrate a process from a heavily loaded node to a lightly loaded one to balance the load. The invention is therefore not limited to protecting persistent processes on nodes that lose power or crash; it can also be used to implement load balancing in a distributed system.
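The imbalance check that triggers step 4 is described in the claims as a comparison of load coefficients: a fault is reported when the difference between any two nodes exceeds a threshold. The Python sketch below uses cpu_usage in whole percent as the load coefficient and a threshold of 15 percentage points; both choices are assumptions, since the source gives no threshold value.

```python
def detect_imbalance(cpu_usage, threshold):
    """Return 'DETECT_IMBALANCED' if any two nodes differ by more than threshold.

    cpu_usage: dict mapping node name -> load in percent (assumed metric).
    Checking max minus min covers every pair of nodes at once.
    """
    values = list(cpu_usage.values())
    if len(values) < 2:
        return None  # a single node cannot be unbalanced against itself
    if max(values) - min(values) > threshold:
        return "DETECT_IMBALANCED"
    return None


# Embodiment 2, step 4: master 40%, slave0 30%, slave1 50%.
loads = {"master": 40, "slave0": 30, "slave1": 50}
signal = detect_imbalance(loads, threshold=15)  # hypothetical threshold
```

Here the largest gap is 20 points (slave1 versus slave0), so under the assumed threshold the check yields the DETECT_IMBALANCED signal named in step 4.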

The embodiments of the present invention described above do not limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (13)

1. A process persistence method for a distributed non-volatile memory system, applied to a system consisting of a plurality of computing nodes, all of the computing nodes being communicatively connected through a network; the method is characterized in that, for any persistent process, execution comprises the following steps:
selecting one computing node as a primary node and at least one computing node as a secondary node; initializing the persistent process on the primary node, and creating a process copy in a PM module of the primary node;
while the persistent process is running, updating the process copy at preset intervals and distributing the process copy to each secondary node, each secondary node storing the process copy of the persistent process in its own PM module;
when the process exception mining module of the primary node detects that the persistent process has a fault, selecting a target migration node from the secondary nodes, and recovering the persistent process on the target migration node according to the process copy stored on the target migration node.
2. The process persistence method for the distributed non-volatile memory system according to claim 1, wherein, in the process of initializing the persistent process, a process persistence unit of the primary node creates a process structure in the PM module as the process copy; the process copy comprises a check flag, the process PCB, the start address of the executing process, a node ID, a timestamp, and a process virtual address space copy; the process virtual address space copy uses a balanced binary tree data structure in which each virtual page of the virtual address space is one balanced binary tree node.
3. The process persistence method for the distributed non-volatile memory system according to claim 1, wherein, in the process of initializing the persistent process, the process copy is initialized; the initialization comprises the following steps:
(S11) setting the check flag to unavailable;
(S12) setting the timestamp from the current time, and setting the node ID from the ID of the current primary node;
(S13) writing the PCB of the persistent process and the start address of the executing process to the process copy;
(S14) traversing the virtual address space of the persistent process, storing each in-use virtual page into the process virtual address space copy, and resetting the SOFT_DIRTY flag of that virtual page;
(S15) setting the persistent process to a runnable state and setting the check flag to available.
4. The process persistence method for the distributed non-volatile memory system according to claim 2, wherein in the process of updating the process copy, the process virtual address space copy is updated in an incremental update manner.
5. The process persistence method for the distributed non-volatile memory system according to claim 4, wherein updating the process copy comprises:
(S21) suspending the persistent process and setting the check flag of the process copy to unavailable;
(S22) updating the timestamp of the process copy from the current time;
(S23) writing the PCB of the persistent process and the start address of the executing process to the process copy;
(S24) traversing the virtual address space of the persistent process, updating or inserting each changed virtual page in the process virtual address space copy, and resetting the change flag of every virtual page in the virtual address space;
(S25) setting the persistent process to a runnable state and setting the check flag to available.
6. The process persistence method for the distributed non-volatile memory system according to claim 1, wherein, when the process exception mining module of the primary node detects that the persistent process has a fault, the process exception mining module forwards the fault information it has detected to the failover module of the primary node.
7. The process persistence method for the distributed non-volatile memory system according to claim 6, wherein the failover module responds to the fault information through the following steps:
(S31) the failover module on the primary node collects the running state of each secondary node and selects the target migration node according to the running state; the running state comprises the CPU utilization and the available memory;
(S32) sending a migration instruction to each target migration node, requiring the target migration node to recover the persistent process to be migrated;
(S33) after the target migration node receives the migration instruction and the identity of the persistent process to be recovered, its process recovery unit searches the PM module of the target migration node for the process copy of the persistent process and checks the check flag of the process copy; if the check flag is unavailable, the process recovery fails; if a plurality of process copies of the same process exist, the process copy with the latest timestamp is selected;
(S34) the process recovery unit running on the target migration node rebuilds the persistent process on that node according to the process copy and makes that node the new primary node.
8. The process persistence method for the distributed non-volatile memory system according to claim 7, wherein selecting the target migration node comprises the following steps:
(S311) if the fault type is node power failure or program crash, performing the following for each persistent process on the failed node: among all running secondary nodes in the current distributed system whose available memory can accommodate the migrating process, selecting the node with the lowest CPU utilization as the target migration node; if no node has enough available memory to accommodate the migrating process, selecting the node with the largest available memory as the target migration node; then proceeding to step (S32);
(S312) if the fault type is load imbalance, performing the following for one randomly selected persistent process on the failed node: among all running secondary nodes in the current distributed system whose available memory can accommodate the migrating process, selecting the node with the lowest CPU utilization as the target migration node; if no node has enough available memory to accommodate the migrating process, selecting the node with the largest available memory as the target migration node; then proceeding to step (S32).
9. The process persistence method for the distributed non-volatile memory system according to claim 7, wherein rebuilding the persistent process comprises the following steps:
(S341) creating a new process on the target migration node and initializing its PCB according to the process copy;
(S342) traversing the process virtual address space copy of the process copy, recovering the process address space according to the page address and flag bits of each balanced binary tree node in the process virtual address space copy, and recovering the memory data of the process according to the page content of each balanced binary tree node;
(S343) opening the corresponding files according to the open-file table in the process copy;
(S344) marking the new process as runnable, inserting it into the scheduling queue, and starting it.
10. A process persistence device for a distributed non-volatile memory system, arranged in the computing nodes of the distributed non-volatile memory system, the computing nodes being communicatively connected through a network; the process persistence device is characterized by comprising a process persistence module, a process exception mining module, and a failover module; wherein:
the process persistence module is arranged on the primary node where a persistent process is located, and is configured to periodically record the state and execution progress of the persistent process to generate a process copy and to distribute the process copy to a plurality of secondary nodes; the primary node and each secondary node store the process copy in a PM module;
the process exception mining module is arranged on the primary node where a persistent process is located, and is configured to detect faults of the primary node and, when a fault is detected, forward the fault information to the failover module of the primary node; the fault types comprise node power failure, process crash, and load imbalance;
a failover module is deployed on the primary node and on each secondary node;
the failover module on the primary node is configured, after receiving the fault information, to select a target migration node from the secondary nodes according to the fault information and the running state of each secondary node, and to send a migration instruction to the target migration node requiring it to recover the persistent process to be migrated;
the failover module on a secondary node is configured, after receiving the migration instruction, to rebuild the persistent process on that secondary node and make that node the new primary node.
11. The process persistence device for the distributed non-volatile memory system according to claim 10, wherein the process persistence module comprises a persistence creation unit and a persistence update unit;
the persistence creation unit is configured to create a process copy of a persistent process on the primary node; the process copy comprises a check flag, the process PCB, the start address of the executing process, a node ID, a timestamp, and a process virtual address space copy; the process virtual address space copy uses a balanced binary tree data structure in which each virtual page of the virtual address space is one balanced binary tree node;
the persistence update unit is configured to periodically update the process copy and send the updated process copy to each secondary node.
12. The process persistence device for the distributed non-volatile memory system according to claim 10, wherein the fault types detected by the process exception mining module comprise node power failure, process crash, and load imbalance; the process exception mining module comprises a power failure detection unit, a program crash detection unit, a load balancing detection unit, and an exception handling unit;
the power failure detection unit is configured to send power failure fault information to the exception handling unit when the primary node loses power;
the program crash detection unit is configured to capture the return value when the persistent process exits, and to send process crash fault information to the exception handling unit when the return value is abnormal;
the load balancing detection unit is configured to poll the load coefficients of the primary node and the secondary nodes, and to send load imbalance fault information to the exception handling unit when the difference between the load coefficients of any two nodes is greater than a threshold;
the exception handling unit is configured to receive fault information from the power failure detection unit, the program crash detection unit, and the load balancing detection unit, and to forward the fault information to the failover module of the primary node.
13. The process persistence device for the distributed non-volatile memory system according to claim 12, wherein the failover module comprises a node information statistics unit, a migration decision unit, and a process recovery unit;
the node information statistics unit is configured to collect the running states of the primary node and each secondary node;
the migration decision unit is configured, when the primary node on which it runs detects a fault, to determine the persistent process that needs to be failed over and to select a target migration node for the process migration according to the running state of each computing node in the distributed system;
the process recovery unit on the primary node is configured to send a migration instruction to the target migration node according to the persistent process that the migration decision unit has chosen to migrate and the corresponding target migration node;
the process recovery unit on the target migration node is configured, after receiving the migration instruction, to find the process copy in the PM module of that node and rebuild the persistent process according to the process copy.
CN202010553640.0A 2020-06-17 2020-06-17 Process persistence method and device for distributed non-volatile memory system Active CN111736996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010553640.0A CN111736996B (en) 2020-06-17 2020-06-17 Process persistence method and device for distributed non-volatile memory system

Publications (2)

Publication Number Publication Date
CN111736996A true CN111736996A (en) 2020-10-02
CN111736996B CN111736996B (en) 2022-08-16

Family

ID=72649486

Country Status (1)

Country Link
CN (1) CN111736996B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant