diff --git a/doc/raft_routing.md b/doc/raft_routing.md
new file mode 100644
index 0000000000000000000000000000000000000000..33aad89867199791277bd2c6789cd833719e87b4
--- /dev/null
+++ b/doc/raft_routing.md
@@ -0,0 +1,470 @@
# Raft Routing Design Document

## Table of Contents

1. [Overview](#overview)
2. [Current Architecture (Praefect)](#current-architecture-praefect)
3. [Proposed Raft Architecture](#proposed-raft-architecture)
4. [Storage Exposure to Clients](#storage-exposure-to-clients)
5. [Request Flow](#request-flow)
6. [Gossip Protocol](#gossip-protocol)
7. [Eventual Consistency](#eventual-consistency)
8. [Raft Replica Placement](#raft-replica-placement)
9. [Open Questions](#open-questions)

---

## Overview

This document describes how request routing will work in Raft-based Gitaly, replacing the centralized Praefect coordinator with a distributed routing mechanism.

### Goals

- Enable distributed request routing without a central coordinator
- Maintain backward compatibility with the existing Rails repository-storage mapping
- Minimize network hops for request forwarding
- Define a replica placement strategy

---

## Current Architecture (Praefect)

Praefect acts as a centralized proxy that routes all requests to the appropriate Gitaly node.

```mermaid
graph LR
    subgraph Clients
        Rails[GitLab Rails]
        Workhorse[Workhorse]
        Shell[GitLab Shell]
    end

    subgraph Praefect Layer
        Praefect[Praefect Proxy]
        DB[(PostgreSQL)]
    end

    subgraph Gitaly Cluster
        G1[Gitaly 1<br/>
Primary] + G2[Gitaly 2
Secondary] + G3[Gitaly 3
Secondary] + end + + Rails --> Praefect + Workhorse --> Praefect + Shell --> Praefect + + Praefect <--> DB + Praefect -->|Mutator & Accessor| G1 + Praefect -.->|Accessor Only| G2 + Praefect -.->|Accessor Only| G3 + + G1 -->|Replication| G2 + G1 -->|Replication| G3 +``` + +--- + +## Proposed Raft Architecture + +In Raft-based architecture, there is no central coordinator. Each node maintains routing information via gossip protocol and can forward requests to the appropriate leader. + +```mermaid +graph LR + subgraph Clients + Rails[GitLab Rails] + Workhorse[Workhorse] + Shell[GitLab Shell] + Indexer[ES Indexer] + end + + subgraph Gitaly Fleet + G1[Gitaly 1
Storage A] + G2[Gitaly 2
Storage B] + G3[Gitaly 3
Storage C] + end + + Rails -->|Any Node| G1 + Indexer -->|Any Node| G1 + Workhorse -->|Any Node| G2 + Shell -->|Any Node| G3 + + G1 <-.->|Gossip| G2 + G2 <-.->|Gossip| G3 + G1 <-.->|Gossip| G3 +``` +--- + +## Storage Exposure to Clients + +Each client should be able to find one of the Gitalys. After connection, any Gitaly can then hand out routing information to the clients. + +### How Each Client Gets Gitaly Address + +| Client | Method | API/Source | Routing Behavior | +|--------|--------|-----------|-----------------| +| GitLab Shell | HTTP (authorization) | Rails internal API | Gets Gitaly address from Rails, connects directly | +| Workhorse | HTTP (authorization) | Rails internal API | Similar to GitLab Shell | +| ES Indexer | Environment variable | `GITALY_CONNECTION_INFO` ([set by Rails](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/elastic/indexer.rb#L142)) | Connects via Rails configuration | +| Rails | Direct config | `gitlab.yml` storages | Looks up `repository_storage` from DB, connects to configured address | + +Gitaly connection information originates from `config/gitlab.yml`. + +### Approach 1: DNS Discovery with gRPC Client-Side Load Balancing + +Rails maintains a repository-to-storage mapping in the `projects` table: + +The `repository_storage` value maps to a storage entry in `gitlab.yml`, which provides the Gitaly address. With DNS-based discovery, this address points to a DNS name that resolves to all cluster nodes : + +```yaml +# config/gitlab.yml +production: + repositories: + storages: + # Single DNS entry for entire Gitaly fleet + gitaly-cluster: + gitaly_address: dns:///gitaly-cluster.internal:8075 + +``` + +```mermaid +graph TB + subgraph "Client" + Config[gitaly_address: dns:///gitaly-cluster:8075] + + end + + subgraph "DNS Layer" + DNS[DNS Server
gitaly-cluster.internal] + end + + subgraph "gRPC Client" + RES[DNS Resolver] + LB[round_robin Policy] + SC1[Subchannel 1] + SC2[Subchannel 2] + SC3[Subchannel 3] + end + + subgraph Gitaly + G1[Gitaly 1
10.0.1.10] + G2[Gitaly 2
10.0.1.11] + G3[Gitaly 3
10.0.1.12]
    end

    Config --> RES
    RES -->|Query| DNS
    DNS -->|Returns ALL IPs| RES
    RES -- Watch --> DNS
    RES --> LB
    LB --> SC1
    LB --> SC2
    LB --> SC3
    SC1 -- TCP --> G1
    SC2 -- TCP --> G2
    SC3 -- TCP --> G3
```

gRPC's DNS resolver looks up the service name and gets back a list of A records, one for each backend. It then opens a separate subchannel (connection) to each of those addresses. With the `round_robin` load-balancing policy enabled, requests are rotated across whichever subchannels are healthy at that moment. If one of the nodes goes down, its connection is marked unhealthy and removed from the rotation until it recovers.

This [approach](https://docs.gitlab.com/administration/gitaly/praefect/configure/#service-discovery) is already used with Praefect. Gitaly and the other Go-based clients (GitLab Shell, Workhorse) use a [custom DNS resolver](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5218) rather than the built-in one. The problem with the built-in DNS resolver is that it only refreshes its DNS state after a connectivity error, so when new nodes are added, requests keep going to the existing nodes instead of being redistributed. More information [here](https://gitlab.com/gitlab-org/gitaly/-/issues/4529#note_1210115025). However, grpc-ruby does not support custom DNS resolvers.

> **Note:** This differs from traditional DNS round-robin where DNS rotates the order and clients use only the first IP. With gRPC's `dns:///` scheme, the client uses all returned IPs regardless of order, providing true client-side load balancing.

**Service Discovery:**

Consul can be used to register Gitaly services for discovery:

https://docs.gitlab.com/administration/gitaly/praefect/configure/#configure-service-discovery-with-consul

### Approach 2: HAProxy Load Balancer

```yaml
# config/gitlab.yml
production:
  repositories:
    storages:
      gitaly-cluster:
        gitaly_address: tcp://haproxy-vip.internal:8075
```

```mermaid
graph LR
    subgraph Clients
        Rails[Rails]
        WH[Workhorse]
    end

    subgraph "HAProxy HA Pair"
        HA[HAProxy Virtual IP]
    end

    subgraph "Raft Cluster"
        G1[Gitaly 1]
        G2[Gitaly 2]
        G3[Gitaly 3]
    end

    Rails --> HA
    WH --> HA
    HA --> G1
    HA --> G2
    HA --> G3
```

---

## Request Flow

### Proxy Interceptor

A gRPC middleware intercepts all requests. For mutator requests:

- If the current node is the leader for the target partition, the request is handled locally.
- If the current node is not the leader, the gRPC request is proxied to the leader node.
- The result is collected and returned to the client.

We did not use Raft's proposal forwarding because the Gitaly WAL requires transactions to be verified → committed → applied in strict order, with each transaction depending on a point-in-time snapshot. Proposal forwarding allows two nodes to accept writes simultaneously, which can cause transactions to use outdated snapshots or produce incompatible log entries.

### Write Request Flow

```mermaid
sequenceDiagram
    participant Client as git push
    participant WH as Workhorse
    participant Rails as Rails
    participant G1 as Gitaly Node 1<br/>(Storage-A)
    participant G2 as Gitaly Node 2<br/>(Storage-B)

    Client->>WH: POST /project.git/git-receive-pack

    WH->>Rails: Authorization check
    Note over Rails: Lookup gitlab.yml<br/>gitaly_address: dns:///gitaly-cluster.internal:8075
    Rails-->>WH: {GitalyServer, Repository}

    WH->>G1: PostReceivePack RPC

    Note over G1: Interceptor: Am I leader<br/>for this partition?

    G1->>G1: Check routing table
    Note over G1: Leader is located in Storage-B<br/>(Gitaly Node 2)

    G1->>G2: Forward to leader

    G2->>G2: Replicate via Raft
    G2-->>G1: Success

    G1-->>WH: Success
    WH-->>Client: Push complete
```
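To make the interceptor step above concrete, the following is a minimal Go sketch of such a middleware. `RoutingTable`, `LeaderInfo`, `ProxyFunc`, and `extractTarget` are illustrative names only, not existing Gitaly APIs; the real implementation would hook into the WAL partition manager and the gossip-maintained routing table.

```go
// Sketch of a leader-check interceptor for mutator RPCs (assumed types).
package routing

import (
	"context"

	"google.golang.org/grpc"
)

// LeaderInfo identifies the node currently holding leadership for a partition.
type LeaderInfo struct {
	Address string // e.g. "gitaly-2.internal:8075"
	Storage string // e.g. "storage-b"
	IsLocal bool   // true when this node is the leader
}

// RoutingTable is the gossip-maintained view of partition leadership.
type RoutingTable interface {
	// Leader returns the current leader for the partition owning the
	// repository at the given relative path, as known by this node.
	Leader(ctx context.Context, relativePath string) (LeaderInfo, error)
}

// ProxyFunc forwards a request to the leader node and returns its response.
type ProxyFunc func(ctx context.Context, leader LeaderInfo, fullMethod string, req interface{}) (interface{}, error)

// NewMutatorInterceptor returns a unary interceptor that handles mutator RPCs
// locally when this node is the leader and proxies them to the leader otherwise.
func NewMutatorInterceptor(rt RoutingTable, proxy ProxyFunc) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		relativePath, isMutator := extractTarget(info.FullMethod, req)
		if !isMutator {
			// Accessors and repository-less RPCs are handled locally.
			return handler(ctx, req)
		}

		leader, err := rt.Leader(ctx, relativePath)
		if err != nil {
			return nil, err
		}

		if leader.IsLocal {
			// This node leads the partition: handle the mutator locally.
			return handler(ctx, req)
		}

		// Otherwise proxy the request to the leader and return its result.
		return proxy(ctx, leader, info.FullMethod, req)
	}
}

// extractTarget stands in for Gitaly's actual request parsing, which would
// derive the target repository and the mutator/accessor classification from
// the protobuf method options.
func extractTarget(fullMethod string, req interface{}) (relativePath string, isMutator bool) {
	return "", false // placeholder for the sketch
}
```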
### Read Request Flow

Both leaders and followers can serve read requests, so a request routed to any follower can be handled there. The etcd Raft library supports [linearizable reads](https://github.com/etcd-io/Raft?tab=readme-ov-file#features) on followers by coordinating with the leader.

```mermaid
sequenceDiagram
    participant C as Client
    participant F as Follower
    participant L as Leader

    C->>F: Read Request
    Note over F,L: Linearizable Read
    F->>L: ReadIndex
    L-->>F: Confirmed Index
    F->>F: Wait for apply
    F-->>C: Return Data
```

### What is the Cost of Network Hops?

| Scenario | Hops | Description |
|----------|------|-------------|
| Request hits leader | 0 | Direct handling, no forwarding needed |
| Request hits follower | 1 | Follower proxies to known leader |
| Request hits follower (stale routing table) | 2 | Follower → stale leader → actual leader |
| Request hits a random node not part of the cluster | 2 | Node → stale leader → actual leader |

```mermaid
graph LR
    subgraph "(Stale Routing)"
        C3[Client] -->|Request| F2[Follower]
        F2 -->|Proxy| SL[Stale Leader]
        SL -->|Redirect| AL[Actual Leader]
        AL -->|Response| F2
        F2 -->|Response| C3
    end
```

---

## Gossip Protocol

### Overview

The gossip protocol enables cluster-wide partition discovery without external coordination services. Each node maintains a local copy of the cluster topology, enabling single-hop forwarding in most cases.

**Implementation:** HashiCorp's `memberlist`, based on the [SWIM protocol](https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf) (Scalable Weakly-consistent Infection-style Process Group Membership).

### Gitaly Fleet Overview

```mermaid
graph LR
    subgraph Node1[Gitaly Node 1]
        direction TB
        GM1[GossipManager]
        subgraph S1[" "]
            direction LR
            subgraph SA[Storage-A]
                SA_P1[Partition-1 - Leader]
                SA_P2[Partition-2]
                SA_RT[Routing Table]
            end
            subgraph SB[Storage-B]
                SB_P1[Partition-1]
                SB_P2[Partition-2]
                SB_RT[Routing Table]
            end
        end
    end

    subgraph Node2[Gitaly Node 2]
        direction TB
        GM2[GossipManager]
        subgraph S2[" "]
            direction LR
            subgraph SC[Storage-C]
                SC_P1[Partition-1 - Follower]
                SC_P2[Partition-2]
                SC_RT[Routing Table]
            end
            subgraph SD[Storage-D]
                SD_P1[Partition-1]
                SD_P2[Partition-2]
                SD_RT[Routing Table]
            end
        end
    end

    subgraph Node3[Gitaly Node 3]
        direction TB
        GM3[GossipManager]
        subgraph S3[" "]
            direction LR
            subgraph SE[Storage-E]
                SE_P1[Partition-1]
                SE_P2[Partition-2]
                SE_RT[Routing Table]
            end
            subgraph SF[Storage-F]
                SF_P1[Partition-1 - Follower]
                SF_P2[Partition-2]
                SF_RT[Routing Table]
            end
        end
    end

    GM1 -.-|Gossip| GM2
    GM2 -.-|Gossip| GM3
```

**Raft Group Example (Partition 1):**

| Role | Storage | Node |
|------|---------|------|
| Leader | Storage-A | Node 1 |
| Follower | Storage-C | Node 2 |
| Follower | Storage-F | Node 3 |

**Overview:**

- Every Gitaly node runs a GossipManager to communicate with other nodes in the cluster.
- A single node can host multiple storages (e.g., Storage-A, Storage-B).
- Each storage maintains its own routing table, which keeps track of the current leader for every partition.
- Partitions are replicated across storages located on different physical nodes for durability.
- Each partition operates as an independent Raft group and elects its own leader.
- The gossip protocol keeps routing tables in sync across all nodes (see the sketch below).
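As a rough illustration of how a GossipManager could piggyback routing-table state on `memberlist`'s push/pull synchronization, here is a sketch. The `RoutingEntry` shape, the JSON encoding, and the term-based merge rule are assumptions for this document, not the final design.

```go
// Illustrative sketch: exchanging routing-table state via hashicorp/memberlist.
package gossip

import (
	"encoding/json"
	"sync"

	"github.com/hashicorp/memberlist"
)

// RoutingEntry records the current leader for one partition of one storage.
type RoutingEntry struct {
	Storage   string `json:"storage"`
	Partition uint64 `json:"partition"`
	Leader    string `json:"leader_address"`
	Term      uint64 `json:"term"` // newer terms win during merges
}

// GossipManager holds this node's view of partition leadership and implements
// memberlist.Delegate so the view spreads via gossip.
type GossipManager struct {
	mu      sync.RWMutex
	entries map[string]RoutingEntry // keyed by "storage/partition"
}

func NewGossipManager() *GossipManager {
	return &GossipManager{entries: map[string]RoutingEntry{}}
}

// memberlist.Delegate implementation.

func (g *GossipManager) NodeMeta(limit int) []byte                 { return nil }
func (g *GossipManager) NotifyMsg(msg []byte)                      {}
func (g *GossipManager) GetBroadcasts(overhead, limit int) [][]byte { return nil }

// LocalState ships our full routing view during periodic push/pull syncs.
func (g *GossipManager) LocalState(join bool) []byte {
	g.mu.RLock()
	defer g.mu.RUnlock()
	buf, _ := json.Marshal(g.entries)
	return buf
}

// MergeRemoteState folds a peer's view into ours, keeping the newest term for
// each partition so stale leader information is eventually overwritten.
func (g *GossipManager) MergeRemoteState(buf []byte, join bool) {
	var remote map[string]RoutingEntry
	if err := json.Unmarshal(buf, &remote); err != nil {
		return
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	for key, entry := range remote {
		if cur, ok := g.entries[key]; !ok || entry.Term > cur.Term {
			g.entries[key] = entry
		}
	}
}

// Join starts gossiping with the given seed nodes.
func (g *GossipManager) Join(seeds []string) (*memberlist.Memberlist, error) {
	cfg := memberlist.DefaultLANConfig()
	cfg.Delegate = g
	list, err := memberlist.Create(cfg)
	if err != nil {
		return nil, err
	}
	if len(seeds) > 0 {
		if _, err := list.Join(seeds); err != nil {
			return nil, err
		}
	}
	return list, nil
}
```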
---

## Eventual Consistency

### How Do We Handle Eventual Consistency?

The routing table is eventually consistent via gossip. Key considerations:

| Concern | Impact | Mitigation |
|---------|--------|------------|
| Leader changes | Brief window with stale leader info | The Raft protocol rejects writes from stale leaders. On `ErrNotLeader`, the interceptor checks the current leader and retries |
| New nodes joining | Cluster learns about new members over time | Gossip protocol propagates membership |
| Node failures | Detection depends on the failure timeout | Gossip protocol health checks and updates |

### Metrics to Measure

| Metric | Description | Target |
|--------|-------------|--------|
| `gitaly_gossip_convergence_seconds` | Time for routing updates to propagate | TBD |
| `gitaly_request_forwarding_total` | Share of requests that require proxying | TBD |
| `gitaly_stale_routing_errors_total` | How often 2-hop scenarios occur | TBD |

---

## Raft Replica Placement

This section describes how replicas are placed across storages in the Raft cluster. Placement could be done manually by an administrator who selects nodes through the UI or a CLI, or by an orchestrator that performs the placement on the administrator's behalf.

### Storage Discovery via Gossip

Replica placement requires knowing all available storages. Raft groups discover storages dynamically via gossip; when a Gitaly server starts, it is given a seed list of nodes to join. We can look into the possibility of pulling the seed list from Consul.

```mermaid
sequenceDiagram
    participant G1 as Gitaly 1<br/>storage-a
    participant G2 as Gitaly 2<br/>storage-b
    participant G3 as Gitaly 3<br/>storage-c

    Note over G1,G3: 1. Join cluster via seeds
    G1->>G2: memberlist.Join(seeds)

    Note over G1,G3: 2. Announce storages via node metadata
    G1->>G2: Meta: {storages: [storage-a], addr: ...}
    G2->>G1: Meta: {storages: [storage-b], addr: ...}
    G3->>G1: Meta: {storages: [storage-c], addr: ...}

    Note over G1,G3: Each node now has complete storage registry
```
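As a sketch of how the storage announcement in the diagram above could be carried in `memberlist` node metadata, and how the registry shown below could be rebuilt from the member list, consider the following. The `storageMeta` shape is an assumption, not an existing Gitaly type.

```go
// Illustrative sketch: announcing a node's storages via memberlist node
// metadata and rebuilding the storage registry from peers' metadata.
package gossip

import (
	"encoding/json"

	"github.com/hashicorp/memberlist"
)

// storageMeta is the payload advertised by each node. memberlist caps node
// metadata at memberlist.MetaMaxSize bytes, so only compact information
// (names, address, replication factor) should go here.
type storageMeta struct {
	NodeID            uint64   `json:"node_id"`
	Address           string   `json:"address"`  // e.g. "node1:8075"
	Storages          []string `json:"storages"` // e.g. ["storage-a"]
	ReplicationFactor int      `json:"replication_factor"`
}

// NodeMeta would back the memberlist.Delegate NodeMeta callback.
func (m storageMeta) NodeMeta(limit int) []byte {
	buf, _ := json.Marshal(m)
	if len(buf) > limit {
		return nil // too large to advertise; a real implementation must handle this
	}
	return buf
}

// StorageRegistry rebuilds the registry from the current member list.
func StorageRegistry(list *memberlist.Memberlist) map[string]storageMeta {
	registry := map[string]storageMeta{}
	for _, node := range list.Members() {
		var meta storageMeta
		if err := json.Unmarshal(node.Meta, &meta); err != nil {
			continue // node did not advertise storage metadata
		}
		for _, storage := range meta.Storages {
			// Storage names are globally unique, so they can key the registry.
			registry[storage] = meta
		}
	}
	return registry
}
```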
**Storage Registry Contents:**

| Storage Name | Node Address | Replication Factor | Node ID |
|-------------|--------------|-------------------|---------|
| storage-a | node1:8075 | 3 | 1 |
| storage-b | node2:8075 | 3 | 1 |
| storage-c | node3:8075 | 3 | 2 |

Storage names must be globally unique.

### Ring-Based Placement Selection (Orchestrator-Based Method)

The replica placement uses a ring-based strategy where data from a storage is replicated to the next N storages in the ring (determined by the replication factor).

- Physical node awareness: Storages on the same physical node don't replicate to each other.
- Replication factor scope: Set at the storage level (all repositories in a storage share the same factor).

---

## Open Questions

### To Be Addressed

1. **How do we retrieve the list of seed nodes?**
   - Explore the possibility of using Consul.

2. **How do we handle partition placement policies?**
   - Declarative constraints ("region: us-east")?
   - System places partitions based on policies
   - Operator cannot select specific nodes, only define constraints

3. **Should replication be set at the storage level or the repository level?**

4. **How do we handle storage capacity limits?**
   - Monitor disk usage per storage
   - Block new partition creation when near capacity

### Future Considerations