CN115344213A - A method, device, terminal and medium for avoiding peering process data loss - Google Patents
- Publication number
- CN115344213A (CN202211032219.0A)
- Authority
- CN
- China
- Prior art keywords
- osd
- version
- osdmap
- peering process
- data writing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0619—Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0653—Monitoring storage devices or systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical field
The invention relates to improving the peering process in distributed storage, and in particular to a method, device, terminal and medium for avoiding data loss during the peering process.
Background
A distributed storage system may consist of anywhere from a few nodes to thousands of nodes; its decentralization and multi-replica storage ensure high availability. Ceph achieves decentralization with the CRUSH algorithm: both Ceph clients and OSD daemons use CRUSH to compute where an object is stored, rather than consulting a centralized lookup table. When the OSD topology changes, for example when a new node joins or an old node leaves, CRUSH redistributes data across the OSDs, so the failure of a single OSD does not affect storage for the whole cluster. For each piece of data supplied by the user, Ceph automatically stores multiple replicas in the background (typically three), so that data is not lost, and can even remain online, in the event of disk damage, server failure, or a rack power outage. What Ceph must do is recover from failures promptly and restore the missing replicas in order to maintain high data reliability.
A placement group initiates a peering process when the system is initialized, when an OSD restart causes the placement group to be reloaded, or when the placement group is newly created. When the acting set of a placement group changes, for example because an OSD fails or OSDs are added or removed, the placement group initiates peering again. Peering brings the OSDs in a placement group to a consistent state. Once the primary and the replicas have reached a consistent state, the placement group becomes active and the peering state ends; at this point, however, the data on the PG's three OSD replicas is not necessarily fully identical. On actually entering peering, past_interval is computed. past_interval is closely tied to OSDmap versions, but OSDmaps are not retained forever: under the current mechanism, when the cluster is running well, only the most recent 500 OSDmap versions are kept. An OSD that has been down for a long time may therefore cause the following problem when it starts. The cluster's OSDmap version may have changed many times while the OSD was down, so the OSDmaps it receives after starting are not contiguous, and the past_interval it computes is wrong. If the other members of the placement group are also down, this OSD may complete peering on its own and become the new primary. When the other OSDs of the placement group start, the primary is still the OSD that had been down for a long time; in the peering that follows, the other OSDs synchronize their data with this primary, which deletes the data written while that OSD was down. The result is serious data loss.
Summary of the invention
To solve the above technical problem, or at least partly solve it, the invention provides a method, device, terminal and medium for avoiding data loss during the peering process.
In a first aspect, the invention provides a method for avoiding data loss during the peering process, comprising:
before an OSD in the placement group that was in the down state starts and enters the peering process, computing the historical epoch sequences and determining whether any historical epoch sequence contains an abandoned OSDmap version; if so, determining that data was written to the placement group during the period of that historical epoch sequence while the OSD was down;
when an OSD whose placement group received writes while the OSD was down starts and enters the peering process, checking the state of the other OSDs in the placement group; if all the other OSDs are down, suspending the OSD's peering process and completing it only after the other OSDs have started, so that the OSD does not complete peering alone and cause data loss when the placement group received writes while it was down.
Further, abandoned OSDmap versions are recorded and maintained locally on the OSD. The monitor pushes the OSDmap version changed because of an OSD failure to the OSDs in the placement group. On receiving the OSDmap version pushed by the monitor, an OSD determines whether the OSDmap version it has recorded locally and the pushed OSDmap version are contiguous; if they are not, OSDmap versions have been abandoned, and the OSD updates its record of abandoned OSDmap versions.
Further, whether data writing occurred is determined by checking whether a historical epoch sequence contains an abandoned OSDmap version, as follows: each historical epoch sequence is traversed to check whether it contains a recorded abandoned OSDmap version; if a traversed historical epoch sequence contains an abandoned OSDmap version, it is determined that data was written to the placement group during the period of that historical epoch sequence while the OSD was down.
Further, when it is determined that data was written to the placement group during the period of a historical epoch sequence while the OSD was down, a write flag is set. When an OSD whose placement group received writes while the OSD was down starts and enters the peering process, the presence of the write flag is checked to determine whether data writing occurred.
Further, the OSDs that received writes are determined from the up state of each OSD in the placement group during the historical epoch sequence periods in which data writing occurred. The state of those OSDs in the placement group is then checked; if all of them are down, the OSD's peering process is suspended and is completed only after the OSDs determined to have received writes have started.
Further, the OSDs that retain complete data are identified from the OSDs that were up in the placement group during all historical epoch sequence periods in which data writing occurred; the peering process is performed after an OSD that retains complete data has restarted.
In a second aspect, the invention provides a device for avoiding data loss during the peering process, comprising: a data-write determination module which, before an OSD in the down state starts and enters the peering process, computes the historical epoch sequences and determines whether any historical epoch sequence contains an abandoned OSDmap version and, if so, determines that data was written to the placement group during the period of that historical epoch sequence while the OSD was down;
and a peering process control module which, when an OSD whose placement group received writes while the OSD was down starts and enters the peering process, checks the state of the other OSDs in the placement group and, if all the other OSDs are down, suspends the OSD's peering process and completes it only after the other OSDs have started.
Further, the device for avoiding data loss during the peering process also comprises an abandoned-OSDmap-version maintenance module. When the monitor pushes the OSDmap version changed because of an OSD failure to the OSDs in the placement group, this module receives the pushed OSDmap version and determines whether the locally recorded OSDmap version and the pushed OSDmap version are contiguous; if they are not, OSDmap versions have been abandoned, and the module updates the record of abandoned OSDmap versions.
In a third aspect, the invention provides a terminal for avoiding data loss during the peering process, comprising a processing unit, a bus unit and a storage unit, wherein the bus unit connects the storage unit and the processing unit, the storage unit stores a computer program, and the computer program, when executed by the processing unit, implements the method for avoiding data loss during the peering process.
In a fourth aspect, the invention provides a storage medium implementing the method for avoiding data loss during the peering process, wherein the storage medium stores a computer program and the computer program, when executed by a processor, implements the method for avoiding data loss during the peering process.
Compared with the prior art, the technical solution provided by the embodiments of the invention has the following advantages:
This application maintains a record of abandoned OSDmap versions, so that the computed historical epoch sequences are neither erroneous nor incomplete. Accurate historical epoch sequences in turn make the analysis of whether the placement group received writes accurate. Whether the placement group received writes is determined by checking whether a historical epoch sequence contains an abandoned OSDmap version, which avoids the case where writes that occurred under an abandoned OSDmap cannot be detected: the set of historical epoch sequences judged to contain writes is no smaller than the set in which writes actually occurred, so no write is missed. If a historical epoch sequence contains an abandoned OSDmap version, it is determined that data was written to the placement group during the period of that sequence while the OSD was down. When an OSD whose placement group received writes while the OSD was down starts and enters the peering process, the state of the other OSDs in the placement group is checked; if all the other OSDs are down, the OSD's peering process is suspended and is completed only after the other OSDs have started, so that the OSD does not complete peering alone and cause data loss.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for avoiding data loss during the peering process provided by an embodiment of the invention;
Fig. 2 is a flowchart, provided by an embodiment of the invention, of performing abandonment determination and OSDmap version processing in response to an OSDmap version pushed by the monitor;
Fig. 3 is a schematic table of the historical epoch sequence set and the historical epoch sequences provided by an embodiment of the invention;
Fig. 4 is a flowchart, provided by an embodiment of the invention, of determining whether data writing occurred in a historical epoch sequence;
Fig. 5 is a schematic diagram of a device for avoiding data loss during the peering process provided by an embodiment of the invention;
Fig. 6 is a schematic diagram of a terminal for avoiding data loss during the peering process provided by an embodiment of the invention.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
It should be noted that, in this document, the terms "include", "comprise" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The terms, English names and abbreviations used in this document have the following meanings: OSD (Object-based Storage Device), an object storage device; PG (Placement Group), a logical unit of data distribution; MON (monitor), the monitor of the storage system; peering, the state in the placement group state machine that corresponds to the process of bringing the OSDs within one placement group to a consistent state. When the primary and the replicas reach a consistent state, the peering state ends and the placement group is active. The acting set is the list of OSDs that hold the replicas of a placement group; the list is ordered, and the first OSD is the primary OSD. If the acting set of a placement group is [0,1,2] and OSD0 fails, the CRUSH algorithm reassigns the placement group's acting set as [3,1,2]. OSD3 is then the primary OSD of the placement group, but it cannot serve reads for the placement group because it holds no data yet. A temporary placement group is therefore requested from the MON, with OSD1 as the primary OSD of the temporary placement group; at this point the acting set is still [0,1,2] and the up set becomes [1,3,2]. After OSD3 has been filled with data, the placement group's up set reverts to the acting set, that is, both the acting set and the up set are [0,1,2].
Embodiment 1
Referring to Fig. 1, the invention provides a method for avoiding data loss during the peering process, comprising:
Performing abandonment determination and OSDmap version processing in response to the OSDmap version pushed by the monitor. Referring to Fig. 2, this comprises the following. After the monitor of the storage system detects an OSD failure, it changes the OSDmap version and pushes the changed OSDmap version to the OSDs in the placement group. On receiving the pushed OSDmap version, an OSD determines whether the OSDmap version it has recorded locally and the OSDmap version pushed by the monitor are contiguous. If they are not contiguous, OSDmap versions have been abandoned, and the OSD updates its record of abandoned OSDmap versions; if they are contiguous, no OSDmap version has been abandoned. The OSD then updates its locally recorded OSDmap version.
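The contiguity check described here can be sketched as follows. This is a minimal illustration, not Ceph code and not the patent's implementation: the class `OsdMapTracker`, its fields and the method `on_map_pushed` are hypothetical names, and the sketch assumes OSDmap versions are consecutive integers so that every version between the locally recorded one and the pushed one counts as abandoned.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class OsdMapTracker:
    """Hypothetical per-OSD record of the newest OSDmap version and of abandoned versions."""

    last_epoch: int
    abandoned: Set[int] = field(default_factory=set)

    def on_map_pushed(self, pushed_epoch: int) -> bool:
        """Handle a pushed OSDmap version; return True if a gap was detected."""
        if pushed_epoch <= self.last_epoch:
            return False  # stale or duplicate push: nothing to record
        # Every version strictly between the recorded and the pushed one was never seen.
        missing = range(self.last_epoch + 1, pushed_epoch)
        self.abandoned.update(missing)
        # In both the contiguous and the non-contiguous case, record the new version.
        self.last_epoch = pushed_epoch
        return len(missing) > 0


tracker = OsdMapTracker(last_epoch=100)
contiguous_push = tracker.on_map_pushed(101)  # 100 -> 101, no gap
gapped_push = tracker.on_map_pushed(105)      # 101 -> 105, versions 102..104 are missing
```

In this sketch the abandoned versions are kept as a plain set of integers; a real OSD would presumably persist this record so that it survives a restart.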
Events of the placement group state machine are generated from the OSDmap version pushed by the monitor, which triggers the placement group's peering process. Before the peering process is entered, the historical epoch sequence set (past_intervals) is computed. As shown in Fig. 3, the historical epoch sequence set is the collection of historical epoch sequences (past_interval). A historical epoch sequence is a sequence made up of some of the OSDmap epochs of the placement group, such that the placement group's acting set and up set have the same members at every epoch within the sequence. A failure of an OSD outside the placement group does not change the placement group; a failure of an OSD inside the placement group changes the OSDs in the placement group and starts a different historical epoch sequence.
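To illustrate how epochs group into historical epoch sequences, the sketch below splits a per-epoch history into intervals whenever the acting set or up set changes. It is a simplified model rather than Ceph's actual past_intervals computation: `PastInterval` and `build_past_intervals` are hypothetical names, the input dictionary is an assumed representation, and an interval is assumed to last until the epoch just before the next known membership change.

```python
from typing import Dict, List, NamedTuple, Tuple

Members = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (acting set, up set)


class PastInterval(NamedTuple):
    first: int                 # first epoch of the interval
    last: int                  # last epoch of the interval
    acting: Tuple[int, ...]
    up: Tuple[int, ...]


def build_past_intervals(history: Dict[int, Members]) -> List[PastInterval]:
    """Group epochs into maximal runs with identical acting set and up set."""
    intervals: List[PastInterval] = []
    start = None
    current = None
    last_seen = None
    for epoch in sorted(history):
        members = history[epoch]
        if current is None:
            start, current = epoch, members
        elif members != current:
            # Membership changed: close the running interval just before this epoch.
            intervals.append(PastInterval(start, epoch - 1, *current))
            start, current = epoch, members
        last_seen = epoch
    if current is not None:
        intervals.append(PastInterval(start, last_seen, *current))
    return intervals


full = ((0, 1, 2), (0, 1, 2))
degraded = ((1, 2), (1, 2))
intervals = build_past_intervals(
    {100: full, 101: full, 102: degraded, 103: degraded, 104: full}
)
```

Here OSD 0 is absent during epochs 102 and 103, so the history splits into three intervals: 100 to 101, 102 to 103, and 104.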
In a specific implementation, referring to Fig. 4, before an OSD in the placement group that was in the down state starts and enters the peering process, the historical epoch sequences are computed. It is then determined whether any historical epoch sequence contains an abandoned OSDmap version; if so, it is determined that data was written to the placement group during the period of that historical epoch sequence while the OSD was down. Where a non-abandoned OSDmap version records that data was written, it is likewise determined that data was written to the placement group during the corresponding historical epoch sequence. Specifically, to determine whether a historical epoch sequence contains an abandoned OSDmap version, each historical epoch sequence is traversed to check whether it contains a recorded abandoned OSDmap version; if a traversed historical epoch sequence contains an abandoned OSDmap version, it is determined that data was written to the placement group during the period of that historical epoch sequence while the OSD was down.
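The traversal can be sketched as a conservative check: any historical epoch sequence whose epoch range covers an abandoned OSDmap version is treated as one in which writes may have occurred. The function name `flag_possible_writes` and the representation of a sequence as a `(first_epoch, last_epoch)` pair are assumptions made for this illustration; the case where a non-abandoned OSDmap version itself records a write is not modeled.

```python
from typing import Iterable, List, Set, Tuple


def flag_possible_writes(
    intervals: Iterable[Tuple[int, int]], abandoned: Set[int]
) -> List[Tuple[int, int, bool]]:
    """Return (first, last, maybe_written) for each historical epoch sequence."""
    flagged = []
    for first, last in intervals:
        # The sequence "contains" an abandoned version if one falls inside its range.
        maybe_written = any(first <= epoch <= last for epoch in abandoned)
        flagged.append((first, last, maybe_written))
    return flagged


flags = flag_possible_writes([(100, 101), (102, 104), (105, 105)], abandoned={103})
```

Because the check errs on the side of assuming a write, it may flag a sequence in which nothing was written, but it does not miss one in which a write happened under an abandoned OSDmap version.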
In one feasible implementation, when it is determined that data was written to the placement group during the period of a historical epoch sequence while the OSD was down, that historical epoch sequence is marked with a corresponding write flag. When an OSD whose placement group received writes while the OSD was down starts and enters the peering process, its historical epoch sequences are checked for write flags to determine whether data writing occurred.
When an OSD whose placement group received writes while the OSD was down starts and enters the peering process, the state of the other OSDs in the placement group is checked. If all the other OSDs are down, the OSD's peering process is suspended and is completed only after the other OSDs have started, so that the OSD does not complete peering alone and cause data loss when the placement group received writes while it was down. If any of the other OSDs is up, the peering process continues.
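The suspend-or-continue decision can be expressed as a small function. This is a sketch under assumptions, not the patent's code: `PeeringAction` and `decide_peering` are hypothetical names, the up/down state of each placement group member is assumed to be available as a dictionary, and a placement group with no other members is assumed to continue peering.

```python
from enum import Enum
from typing import Dict


class PeeringAction(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"


def decide_peering(
    self_osd: int, members_up: Dict[int, bool], writes_while_down: bool
) -> PeeringAction:
    """Decide whether a freshly started OSD may go on with peering."""
    if not writes_while_down:
        return PeeringAction.CONTINUE  # nothing could be lost
    others = [up for osd, up in members_up.items() if osd != self_osd]
    if others and not any(others):
        # Every other member is down: peering alone could discard newer writes.
        return PeeringAction.PAUSE
    return PeeringAction.CONTINUE


alone = decide_peering(0, {0: True, 1: False, 2: False}, writes_while_down=True)
with_peer = decide_peering(0, {0: True, 1: True, 2: False}, writes_while_down=True)
no_writes = decide_peering(0, {0: True, 1: False, 2: False}, writes_while_down=False)
```

A caller would re-run the decision each time another member of the placement group comes up, so that a paused OSD resumes peering as soon as it is no longer alone.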
In one feasible implementation, the OSDs that received writes are determined from the up state of each OSD in the placement group during the historical epoch sequence periods in which data writing occurred. The state of those OSDs in the current placement group is then checked; if all of them are down, the OSD's peering process is suspended and is completed only after the OSDs determined to have received writes have started.
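This refinement can be sketched in two steps: collect the OSDs that were up during any flagged historical epoch sequence, then wait only while none of them is currently up. The names `osds_with_possible_writes` and `must_wait`, and the representation of each sequence as an `(up_osds, maybe_written)` pair, are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple


def osds_with_possible_writes(
    intervals: Iterable[Tuple[Set[int], bool]]
) -> Set[int]:
    """Union of the OSDs that were up during sequences flagged as possibly written."""
    writers: Set[int] = set()
    for up_osds, maybe_written in intervals:
        if maybe_written:
            writers.update(up_osds)
    return writers


def must_wait(writers: Set[int], currently_up: Set[int]) -> bool:
    """Peering is suspended only if every possible writer is still down."""
    return bool(writers) and writers.isdisjoint(currently_up)


writers = osds_with_possible_writes([({1, 2}, True), ({0, 1, 2}, False), ({2}, True)])
wait_when_only_0_up = must_wait(writers, currently_up={0})
wait_when_2_is_up = must_wait(writers, currently_up={0, 2})
```

Compared with waiting for all other members, this narrows the condition to the OSDs that could actually hold the newer data.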
In one feasible implementation, the OSDs that retain complete data are identified from the OSDs that were up in the placement group during all historical epoch sequence periods in which data writing occurred; the peering process is performed after an OSD that retains complete data has restarted.
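One possible reading of this analysis, offered here as an assumption rather than as the patent's stated algorithm, is that an OSD retains complete data if it was up during every historical epoch sequence in which writes occurred. Under that reading the candidates are the intersection of the up sets of those sequences; `osds_with_complete_data` is a hypothetical helper.

```python
from typing import Iterable, Set


def osds_with_complete_data(
    pg_members: Set[int], write_interval_up_sets: Iterable[Set[int]]
) -> Set[int]:
    """Members that were up in every sequence in which writes occurred."""
    candidates = set(pg_members)
    for up_osds in write_interval_up_sets:
        candidates &= up_osds  # drop any OSD that missed this write period
    return candidates


complete = osds_with_complete_data({0, 1, 2, 3}, [{1, 2}, {2, 3}])
unconstrained = osds_with_complete_data({0, 1, 2}, [])
```

If the intersection is empty, no single OSD saw every write under this model, and a different recovery strategy would be needed; the sketch does not address that case.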
Embodiment 2
Referring to Fig. 5, an embodiment of the invention provides a device for avoiding data loss during the peering process, comprising:
an abandoned-OSDmap-version maintenance module which, when the monitor pushes the OSDmap version changed because of an OSD failure to the OSDs in the placement group, receives the pushed OSDmap version and determines whether the locally recorded OSDmap version and the pushed OSDmap version are contiguous. If they are not contiguous, OSDmap versions have been abandoned, and the module updates the record of abandoned OSDmap versions; if they are contiguous, no OSDmap version has been abandoned. The module then updates the OSD's locally recorded OSDmap version.
a data-write determination module which, before an OSD in the down state starts and enters the peering process, computes the historical epoch sequences and determines whether any historical epoch sequence contains an abandoned OSDmap version and, if so, determines that data was written to the placement group during the period of that historical epoch sequence while the OSD was down.
a peering process control module which, when an OSD whose placement group received writes while the OSD was down starts and enters the peering process, checks the state of the other OSDs in the placement group and, if all the other OSDs are down, suspends the OSD's peering process and completes it only after the other OSDs have started.
Example 3
Referring to FIG. 6, this embodiment of the present invention provides a terminal for avoiding data loss in the peering process, comprising a processing unit, a bus unit, and a storage unit. The bus unit connects the storage unit and the processing unit. The storage unit stores a computer program which, when executed by the processing unit, implements the above method for avoiding data loss in the peering process.
Example 4
This embodiment of the present invention provides a storage medium for implementing the method for avoiding data loss in the peering process. The storage medium stores a computer program which, when executed by a processor, implements the above method for avoiding data loss in the peering process.
In the embodiments provided by the present invention, it should be understood that the disclosed structures and methods may be implemented in other ways. The structural embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, structures, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only specific embodiments of the present invention, provided so that those skilled in the art can understand or implement the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Accordingly, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211032219.0A CN115344213B (en) | 2022-08-26 | 2022-08-26 | Method, device, terminal and medium for avoiding peering flow data loss |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211032219.0A CN115344213B (en) | 2022-08-26 | 2022-08-26 | Method, device, terminal and medium for avoiding peering flow data loss |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115344213A (en) | 2022-11-15 |
CN115344213B (en) | 2025-08-19 |
Family
ID=83953448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211032219.0A Active CN115344213B (en) | 2022-08-26 | 2022-08-26 | Method, device, terminal and medium for avoiding peering flow data loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115344213B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030037237A1 (en) * | 2001-04-09 | 2003-02-20 | Jean-Paul Abgrall | Systems and methods for computer device authentication |
CN107547252A (en) * | 2017-06-29 | 2018-01-05 | 新华三技术有限公司 | A kind of network failure processing method and device |
CN108958970A (en) * | 2018-05-29 | 2018-12-07 | 新华三技术有限公司 | A kind of data reconstruction method, server and computer-readable medium |
CN111416753A (en) * | 2020-03-11 | 2020-07-14 | 上海爱数信息技术股份有限公司 | High-availability method of two-node Ceph cluster |
CN112395263A (en) * | 2020-11-26 | 2021-02-23 | 新华三大数据技术有限公司 | OSD data recovery method and device |
CN112596758A (en) * | 2020-11-30 | 2021-04-02 | 新华三大数据技术有限公司 | Version updating method, device, equipment and medium of OSDMap |
CN112597243A (en) * | 2020-12-22 | 2021-04-02 | 新华三大数据技术有限公司 | Method and device for accelerating synchronous state in Ceph cluster |
WO2021120777A1 (en) * | 2020-08-06 | 2021-06-24 | 平安科技(深圳)有限公司 | Ceph-based osd blockage detection method and system, and terminal and storage medium |
CN113672435A (en) * | 2021-07-09 | 2021-11-19 | 济南浪潮数据技术有限公司 | A data recovery method, device, equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
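Wang Shengjie; Xu Long: "A continuous data protection method for Ceph distributed block storage", Network Security Technology & Application, no. 02, 15 February 2017 (2017-02-15) * |
Qiu Chen; Chen Yafeng; Zhou Wei: "A private cloud implementation case based on a containerized OpenStack cloud platform and Ceph storage", Designing Techniques of Posts and Telecommunications, no. 08, 20 August 2018 (2018-08-20) * |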
Also Published As
Publication number | Publication date |
---|---|
CN115344213B (en) | 2025-08-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1179770B1 (en) | File system | |
JP4467624B2 (en) | Software update management program, software update management apparatus, and software update management method | |
US6944854B2 (en) | Method and apparatus for updating new versions of firmware in the background | |
JP5021929B2 (en) | Computer system, storage system, management computer, and backup management method | |
JP2007141043A (en) | Fault management method in storage system | |
CN111897558A (en) | Container cluster management system Kubernetes upgrade method and device | |
JP2005242403A (en) | Computer system | |
CN106528005B (en) | Method and device for adding disks in a distributed storage system | |
JP2005084963A (en) | File sharing apparatus and data migration method between file sharing apparatuses | |
CN107817950B (en) | Data processing method and device | |
WO2014205847A1 (en) | Zoning balance subtask delivering method, apparatus and system | |
US20210240351A1 (en) | Remote copy system and remote copy management method | |
CN113778761B (en) | Time sequence database cluster and fault processing and operating method and device thereof | |
CN114510464A (en) | A management method and management system for a highly available database | |
CN102025758B (en) | Method, device and system for recovering data copy in distributed system | |
CN115878361A (en) | Node management method and device for database cluster and electronic equipment | |
JP3967499B2 (en) | Restoring on a multicomputer system | |
JP2000099359A5 (en) | ||
US11762741B2 (en) | Storage system, storage node virtual machine restore method, and recording medium | |
CN110858168B (en) | Cluster node fault processing method and device and cluster node | |
CN115344213A (en) | A method, device, terminal and medium for avoiding peering process data loss | |
CN111176886B (en) | A database mode switching method, device and electronic equipment | |
CN117076206A (en) | A method, device, equipment and medium for synchronizing master and backup mirror information | |
CN114840365A (en) | Abnormal state double live volume expansion method, system, terminal and storage medium | |
CN114466026A (en) | Application program interface updating method and device, storage medium and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||