raft: Add AddReplica RPC for dynamic partition membership management
Currently, there's no operator-friendly way to add replicas to existing Raft partitions. When clusters need to scale or recover from failures, operators must rely on low-level internal mechanisms or manual database manipulation, which is error-prone and risky.
The existing AddNode operation in the raftmgr.Replica interface is an internal implementation detail that requires deep knowledge of Raft internals, routing table coordination, and leadership management. There's no safe, high-level interface for operators to add capacity to partitions or recover from replica failures.
We need a unified RPC that operators can use to safely add new replicas to partitions while maintaining cluster consistency and handling all the coordination complexity internally.
We should add a new RPC to the RaftService:
rpc AddReplica(AddReplicaRequest) returns (AddReplicaResponse);

message AddReplicaRequest {
  // The partition to add a replica to
  PartitionKey partition_key = 1;
  // Target storage for the new replica
  string target_storage = 2;
  // Network address of the target storage
  string target_address = 3;
  // Whether to add as learner first (recommended)
  bool add_as_learner = 4;
  // Timeout for the operation
  google.protobuf.Duration timeout = 5;
}

message AddReplicaResponse {
  // Details of the newly added replica
  ReplicaID new_replica_id = 1;
  // Current partition state after addition
  PartitionInfo partition_info = 2;
  // Whether replica was added as learner
  bool added_as_learner = 3;
}
The RPC should handle all the coordination complexity: validating partition state, coordinating with leaders, managing learner-to-voter promotions, updating routing tables, and providing proper error recovery.
This would enable safe, automated partition scaling and disaster recovery operations without requiring operators to understand Raft internals.
- Reference: https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/raft.md#partition-identity-and-membership-management
- Related to: GetRaftClusterInfo RPC, which can be used to monitor the results of an AddReplica call