
Raft: Implement Raft replica bootstrapping

This issue tracks the work of bootstrapping replication after the cluster itself is bootstrapped (#6032 (closed)). Per the original design, replica data placement is stable: we follow a ring-like replica scheme.

Assuming we have 5 storages: a, b, c, d, e.

  • storage-a is replicated to storage-b, storage-c
  • storage-b is replicated to storage-c, storage-d
  • storage-c is replicated to storage-d, storage-e
  • storage-d is replicated to storage-e, storage-a
  • storage-e is replicated to storage-a, storage-b

In the case of 3 storages, they form a complete graph.
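
As a rough illustration of this placement rule, here is a minimal sketch in Go; the function name and the replication-factor parameter are ours, not Gitaly's:

```go
// ringReplicas returns the storages that hold replicas for storages[i] under
// the ring-like scheme: each storage replicates to the next replicationFactor
// storages in the ring.
func ringReplicas(storages []string, i, replicationFactor int) []string {
	replicas := make([]string, 0, replicationFactor)
	for offset := 1; offset <= replicationFactor; offset++ {
		replicas = append(replicas, storages[(i+offset)%len(storages)])
	}
	return replicas
}
```

For example, `ringReplicas([]string{"a", "b", "c", "d", "e"}, 0, 2)` returns `[b c]`. With only 3 storages and a replication factor of 2, every storage replicates to both of the others, which is why they form a complete graph.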

During implementation, we model each partition as one Raft group. This unavoidable decision has two interesting consequences:

  • We cannot enforce a strict replication order (a -> b, c). After creation, a repository's leader role may float freely between a, b, and c. Although Raft tries its best to keep leadership stable, there is no guarantee about where the leader ends up.
  • A storage must start a Raft group to replicate its changes. There must be a way to signal the other nodes to start a corresponding Raft group. In addition, the list of repositories must be persisted.

The first consequence is not necessarily a bad thing. On the contrary, it increases the resilience of the cluster. Let's examine an example: each storage holds the leader role for 2 repositories and the replica role for another 4 repositories (from 2 other storages).

|            | storage-a   | storage-b   | storage-c  | storage-d  | storage-e  |
|------------|-------------|-------------|------------|------------|------------|
| Leader of  | 1, 2        | 3, 4        | 5, 6       | 7, 8       | 9, 10      |
| Replica of | 7, 8, 9, 10 | 9, 10, 1, 2 | 1, 2, 3, 4 | 3, 4, 5, 6 | 5, 6, 7, 8 |

When storage-b goes down, the repositories for which storage-b holds the leader role fail over to storage-c and storage-d evenly. Each repository's quorum is still maintained.

|            | storage-a   | 🚫 storage-b | storage-c | storage-d | storage-e  |
|------------|-------------|--------------|-----------|-----------|------------|
| Leader of  | 1, 2        |              | 4, 5, 6   | 3, 7, 8   | 9, 10      |
| Replica of | 7, 8, 9, 10 |              | 1, 2, 3   | 4, 5, 6   | 5, 6, 7, 8 |

When storage-b comes back, repository leaders might shuffle, but in general the number of leaders and replicas per storage won't change.

The second consequence can be solved by special "replica Raft groups":

  • When storage-a starts, it starts a "replica Raft group" in which it is the only voting member and storage-b and storage-c are non-voting members. It is also a non-voting member of the replica Raft groups hosted by storage-d and storage-e (see the sketch after this list).
  • When a repository (repository-1, for example) is created in storage-a, storage-a starts a repository Raft group in which storage-a is the initial member. It then writes the new repository into its replica Raft group. storage-b and storage-c pick up the new log entry, persist it, and start the corresponding repository Raft group. Afterward, repository-1's Raft group has three voting members: storage-a, storage-b, and storage-c.
  • If storage-a goes down, its replica Raft group stops functioning. That's fine, because a downed storage cannot create new repositories anyway. The quorum of repository-1's Raft group is maintained, so the group holds a new election term and moves its leader to either storage-b or storage-c.
  • When storage-a comes back, it starts the repository Raft groups for all repositories recorded in the replica Raft groups of which it is a member. These are the repositories that storage-a needs to maintain. If storage-d or storage-e created new repositories while it was away, storage-a becomes aware of them after it fetches the newest log entries.
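
Below is a minimal sketch of that bootstrapping sequence. The `raftHost` interface, the shard-ID helpers, and every other name here are hypothetical placeholders standing in for dragonboat, not actual Gitaly or dragonboat APIs:

```go
// All names below are hypothetical; they only illustrate the ordering
// described in the list above.
type raftHost interface {
	// StartGroup starts or re-joins a Raft group on this node.
	StartGroup(shardID uint64, voters, nonVoters []string) error
	// ListRepositories reads the repository list persisted in a replica Raft group.
	ListRepositories(replicaShardID uint64) ([]string, error)
}

type bootstrapper struct {
	host           raftHost
	self           string
	successorsOf   func(storage string) []string // ring successors, e.g. a -> [b, c]
	replicaShardID func(owner string) uint64
	repoShardID    func(relativePath string) uint64
}

// bootstrap runs when a storage (re)starts. predecessors are the storages
// that replicate to this one, e.g. [d, e] for storage-a.
func (b bootstrapper) bootstrap(predecessors []string) error {
	// 1. Start our own replica Raft group: we are the only voting member,
	//    our ring successors are non-voting members.
	if err := b.host.StartGroup(b.replicaShardID(b.self), []string{b.self}, b.successorsOf(b.self)); err != nil {
		return err
	}

	// 2. Re-join, as a non-voting member, the replica Raft groups of the
	//    storages that replicate to us.
	for _, owner := range predecessors {
		if err := b.host.StartGroup(b.replicaShardID(owner), []string{owner}, []string{b.self}); err != nil {
			return err
		}
	}

	// 3. Replay the repository lists of every replica Raft group we belong to
	//    and start the corresponding repository Raft groups. Repositories
	//    created while we were down appear once we catch up on those logs.
	for _, owner := range append([]string{b.self}, predecessors...) {
		repos, err := b.host.ListRepositories(b.replicaShardID(owner))
		if err != nil {
			return err
		}
		for _, relativePath := range repos {
			voters := append([]string{owner}, b.successorsOf(owner)...)
			if err := b.host.StartGroup(b.repoShardID(relativePath), voters, nil); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The important property is the ordering: the replica Raft groups are (re)joined first, and the repository Raft groups are then derived purely from their replicated logs.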

As in #6104, we reason about keeping relative paths as the SSOT. However, since dragonboat does not allow using a string as a ShardID, we need to maintain a mapping of {relative_path: shard_id}.
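
A minimal in-memory sketch of such a mapping, assuming shard IDs are allocated sequentially; in practice the map itself must be persisted and kept consistent across storages, and all names here are illustrative:

```go
import "sync"

// shardRegistry hands out a stable uint64 shard ID for each relative path,
// because dragonboat identifies a Raft group by a numeric ShardID, not a string.
type shardRegistry struct {
	mu     sync.Mutex
	nextID uint64
	byPath map[string]uint64
}

func newShardRegistry() *shardRegistry {
	return &shardRegistry{byPath: make(map[string]uint64)}
}

// shardID returns the shard ID for relativePath, allocating one on first use.
func (r *shardRegistry) shardID(relativePath string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if id, ok := r.byPath[relativePath]; ok {
		return id
	}
	r.nextID++
	r.byPath[relativePath] = r.nextID
	return r.nextID
}
```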
