diff --git a/doc/administration/monitoring/prometheus/index.md b/doc/administration/monitoring/prometheus/index.md index 1bddbbc25c2812510bf02a271ba3c7f796c67d78..0af13624b6ec0519f88391ef23a95b1a33c6377e 100644 --- a/doc/administration/monitoring/prometheus/index.md +++ b/doc/administration/monitoring/prometheus/index.md @@ -371,12 +371,20 @@ to work with the collected data where you can visualize the output. For a more fully featured dashboard, Grafana can be used and has [official support for Prometheus](https://prometheus.io/docs/visualization/grafana/). -Sample Prometheus queries: +## Sample Prometheus queries + +The following sample Prometheus queries can be used as a starting point. + +NOTE: +These are only examples and may not work on all setups. Further adjustments may be required. -- **% Memory available:** `((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) or ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes)) * 100` - **% CPU utilization:** `1 - avg without (mode,cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))` +- **% Memory available:** `((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) or ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes)) * 100` - **Data transmitted:** `rate(node_network_transmit_bytes_total{device!="lo"}[5m])` - **Data received:** `rate(node_network_receive_bytes_total{device!="lo"}[5m])` +- **Disk read IOPS:** `sum by (instance) (rate(node_disk_reads_completed_total[1m]))` +- **Disk write IOPS:** `sum by (instance) (rate(node_disk_writes_completed_total[1m]))` +- **RPS via GitLab transaction count:** `sum(irate(gitlab_transaction_duration_seconds_count{controller!~'HealthController|MetricsController|'}[1m])) by (controller, action)` ## Prometheus as a Grafana data source diff --git a/doc/administration/reference_architectures/10k_users.md b/doc/administration/reference_architectures/10k_users.md index e6acf7840bb08487dacd5b20f882162bc951d3e2..4d604cce9e9c22cd6e89417f2bf2f477e68d8737 100644 --- a/doc/administration/reference_architectures/10k_users.md +++ b/doc/administration/reference_architectures/10k_users.md @@ -10,20 +10,20 @@ DETAILS: **Tier:** Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 200 requests per second (RPS), the typical peak load of up to 10,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 200 requests per second (RPS), the typical peak load of up to 10,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). NOTE: Before deploying this architecture it's recommended to read through the [main documentation](index.md) first, -specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-use) sections. +specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-start-with) sections.
> - **Target load:** API: 200 RPS, Web: 20 RPS, Git (Pull): 20 RPS, Git (Push): 4 RPS > - **High Availability:** Yes ([Praefect](#configure-praefect-postgresql) needs a third-party PostgreSQL solution for HA) > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid Alternative:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use) +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with) | Service | Nodes | Configuration | GCP | AWS | Azure | |------------------------------------------|-------|-------------------------|------------------|----------------|-----------| @@ -56,12 +56,13 @@ specifically the [Before you start](index.md#before-you-start) and [Deciding whi Review the existing [technical limitations and considerations before deploying Gitaly Cluster](../gitaly/index.md#before-deploying-gitaly-cluster). If you want sharded Gitaly, use the same specs listed above for `Gitaly`. 6. Gitaly specifications are based on high percentiles of both usage patterns and repository sizes in good health. However, if you have [large monorepos](index.md#large-monorepos) (larger than several gigabytes) or [additional workloads](index.md#additional-workloads) these can *significantly* impact Git and Gitaly performance and further adjustments will likely be required. -7. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. +6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). + However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. ```plantuml @startuml 10k @@ -165,7 +166,7 @@ against the following endpoint throughput targets: - Git (Push): 4 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -2268,16 +2269,19 @@ as the typical environment above.
First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | -|---------------------|-------|-------------------------|-----------------|--------------|---------------------------------| -| Webservice | 4 | 32 vCPU, 28.8 GB memory | `n1-highcpu-32` | `c5.9xlarge` | 127.5 vCPU, 118 GB memory | -| Sidekiq | 4 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 15.5 vCPU, 50 GB memory | -| Supporting services | 2 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 7.75 vCPU, 25 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 80 vCPU
100 GB memory (request)
140 GB memory (limit) | 3 x `n1-standard-32` | 3 x `c5.9xlarge` | +| Sidekiq | 12.6 vCPU
28 GB memory (request)
56 GB memory (limit) | 4 x `n1-standard-4` | 4 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-4` | 2 x `m5.xlarge` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to specific nodes. A minimum of three nodes per node group in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -2312,7 +2316,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
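The node pool targets above can be reached with whatever node pool design your provider supports. As a convenience-only sketch (not part of the reference architecture), the AWS example could be expressed as an `eksctl` config along these lines; the cluster name, region, and availability zones are placeholders, and sizing should still follow the table and notes above:

```yaml
# Illustrative eksctl sketch matching the AWS node pool examples above.
# Cluster name, region, and zones are placeholders - adjust to your environment.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: gitlab-cloud-native-hybrid   # placeholder
  region: us-east-1                  # placeholder

availabilityZones: [us-east-1a, us-east-1b, us-east-1c]

managedNodeGroups:
  - name: webservice
    instanceType: c5.9xlarge
    desiredCapacity: 3
    minSize: 3
    maxSize: 3
  - name: sidekiq
    instanceType: m5.xlarge
    desiredCapacity: 4
    minSize: 4
    maxSize: 4
  - name: supporting
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
```

A GKE setup would follow the same shape: three node pools with the listed machine types spread across three zones.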
```plantuml @startuml 10k @@ -2322,11 +2326,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x4" as gitlab #32CD32 - collections "**Sidekiq** x4" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - card "**Supporting Services** x2" as support + card "**Supporting Services**" as support } card "**Internal Load Balancer**" as ilb #9370DB @@ -2384,55 +2388,60 @@ consul .[#e76a9b]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [10k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/10k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because four worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 200 RPS or 10,000 users we recommend a total Puma worker count of around 80. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 20 -Webservice pods with 4 workers per pod and 5 pods per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 200 RPS or 10,000 users we recommend a total Puma worker count of around 80 so in turn it's recommended to run at +least 20 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: + +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -[The provided starting point](#cluster-topology) allows the deployment of up to -14 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +Similar to the standard deployment above, an initial target of 14 Sidekiq workers has been used here. +Additional workers may be required depending on your specific workflow. 
-For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). -#### Supporting +### Supporting The Supporting Node Pool is designed to house all supporting deployments that don't need to be on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments such as Container Registry, Pages, or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. - -## Secrets +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. -When setting up a Cloud Native Hybrid environment, it's worth noting that several secrets should be synced from backend VMs from the `/etc/gitlab/gitlab-secrets.json` file into Kubernetes. +### Example config file -For this setup specifically, the [GitLab Rails](https://docs.gitlab.com/charts/installation/secrets.html#gitlab-rails-secret) and [GitLab Shell](https://docs.gitlab.com/charts/installation/secrets.html#gitlab-rails-secret) secrets should be synced. +An example for the GitLab Helm Charts targeting the above 200 RPS or 10,000 users reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/10k.yaml).
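As a brief illustration of how the Webservice targets described above map onto Helm values, a minimal sketch follows. It assumes the chart's standard `workerProcesses`, `minReplicas`/`maxReplicas`, and `resources` keys; treat the linked example file as the authoritative, tested configuration:

```yaml
# Sketch of the 200 RPS / 10,000 users Webservice targets described above.
gitlab:
  webservice:
    workerProcesses: 4    # 4 Puma workers per pod
    minReplicas: 20       # at least 20 Webservice pods
    maxReplicas: 20       # raise the ceiling if you autoscale above the baseline
    resources:
      requests:
        cpu: 4
        memory: 5G
      limits:
        memory: 7G
```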
diff --git a/doc/administration/reference_architectures/1k_users.md b/doc/administration/reference_architectures/1k_users.md index d86d2daaf29c8e7b1ef73e4088d45ebe56c7377e..f3380347bf5dd97df0e37a4baad2b18e2a5127a2 100644 --- a/doc/administration/reference_architectures/1k_users.md +++ b/doc/administration/reference_architectures/1k_users.md @@ -10,7 +10,7 @@ DETAILS: **Tier:** Free, Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 20 requests per second (RPS), the typical peak load of up to 1,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 20 requests per second (RPS), the typical peak load of up to 1,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). @@ -21,7 +21,7 @@ For a full list of reference architectures, see > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid:** No. For a cloud native hybrid environment, you > can follow a [modified hybrid reference architecture](#cloud-native-hybrid-reference-architecture-with-helm-charts). -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use). +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with). | Users | Configuration | GCP | AWS | Azure | |--------------|----------------------|----------------|--------------|----------| @@ -89,7 +89,7 @@ against the following endpoint throughput targets: - Git (Push): 1 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). diff --git a/doc/administration/reference_architectures/25k_users.md b/doc/administration/reference_architectures/25k_users.md index 316d874ae921f2b41162633d44d21e1b7f908716..cf9163484a3e1511b6359cc242529739c51789b5 100644 --- a/doc/administration/reference_architectures/25k_users.md +++ b/doc/administration/reference_architectures/25k_users.md @@ -10,20 +10,20 @@ DETAILS: **Tier:** Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 500 requests per second (RPS) - The typical peak load of up to 25,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 500 requests per second (RPS) - The typical peak load of up to 25,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). 
NOTE: Before deploying this architecture it's recommended to read through the [main documentation](index.md) first, -specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-use) sections. +specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-start-with) sections. > - **Target load:** API: 500 RPS, Web: 50 RPS, Git (Pull): 50 RPS, Git (Push): 10 RPS > - **High Availability:** Yes ([Praefect](#configure-praefect-postgresql) needs a third-party PostgreSQL solution for HA) > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid Alternative:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use) +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with) | Service | Nodes | Configuration | GCP | AWS | Azure | |------------------------------------------|-------|-------------------------|------------------|--------------|-----------| @@ -56,12 +56,13 @@ specifically the [Before you start](index.md#before-you-start) and [Deciding whi Review the existing [technical limitations and considerations before deploying Gitaly Cluster](../gitaly/index.md#before-deploying-gitaly-cluster). If you want sharded Gitaly, use the same specs listed above for `Gitaly`. 6. Gitaly specifications are based on high percentiles of both usage patterns and repository sizes in good health. However, if you have [large monorepos](index.md#large-monorepos) (larger than several gigabytes) or [additional workloads](index.md#additional-workloads) these can *significantly* impact Git and Gitaly performance and further adjustments will likely be required. -7. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. +6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). + However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. ```plantuml @startuml 25k @@ -165,7 +166,7 @@ against the following endpoint throughput targets: - Git (Push): 10 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads.
If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -2274,16 +2275,19 @@ as the typical environment above. First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | -|---------------------|-------|-------------------------|-----------------|--------------|---------------------------------| -| Webservice | 7 | 32 vCPU, 28.8 GB memory | `n1-highcpu-32` | `c5.9xlarge` | 223 vCPU, 206.5 GB memory | -| Sidekiq | 4 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 15.5 vCPU, 50 GB memory | -| Supporting services | 2 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 7.75 vCPU, 25 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 140 vCPU
175 GB memory (request)
245 GB memory (limit) | 5 x `n1-standard-32` | 5 x `c5.9xlarge` | +| Sidekiq | 12.6 vCPU
28 GB memory (request)
56 GB memory (limit) | 4 x `n1-standard-4` | 4 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-4` | 2 x `m5.xlarge` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to specific nodes. A minimum of three nodes per node group in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -2317,7 +2321,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
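As a worked example of the autoscaling note above: the 500 RPS or 25,000 users targets described in the following sections work out to around 35 Webservice pods and 14 Sidekiq pods, so a 75% floor is roughly 26 and 11 pods respectively (0.75 x 35 ≈ 26, 0.75 x 14 ≈ 11). Expressed as chart replica bounds, assuming the standard `minReplicas`/`maxReplicas` keys, this is only a sketch:

```yaml
gitlab:
  webservice:
    maxReplicas: 35   # target pod count for 500 RPS / 25,000 users
    minReplicas: 26   # ~75% floor
  sidekiq:
    maxReplicas: 14
    minReplicas: 11   # ~75% floor
```

With Cluster Autoscaler enabled, the node pools can then shrink toward the floor during quiet periods without dropping below it.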
```plantuml @startuml 25k @@ -2327,11 +2331,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x7" as gitlab #32CD32 - collections "**Sidekiq** x4" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - card "**Supporting Services** x2" as support + card "**Supporting Services**" as support } card "**Internal Load Balancer**" as ilb #9370DB @@ -2389,36 +2393,43 @@ consul .[#e76a9b]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [25k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/25k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because four worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 500 RPS or 25,000 users we recommend a total Puma worker count of around 140. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 35 -Webservice pods with 4 workers per pod and 5 pods per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 500 RPS or 25,000 users we recommend a total Puma worker count of around 140 so in turn it's recommended to run at +least 35 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: -[The provided starting point](#cluster-topology) allows the deployment of up to -14 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +Similar to the standard deployment above, an initial target of 14 Sidekiq workers has been used here. 
+Additional workers may be required depending on your specific workflow. + +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). ### Supporting @@ -2426,12 +2437,16 @@ The Supporting Node Pool is designed to house all supporting deployments that do on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments such as Container Registry, Pages, or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. + +### Example config file + +An example for the GitLab Helm Charts targeting the above 500 RPS or 25,000 users reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/25k.yaml).
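For the NGINX recommendation in the Webservice section above, the controller deployment style is controlled through the bundled `nginx-ingress` sub-chart. The following is a sketch only: `controller.kind` and `controller.nodeSelector` follow the upstream ingress-nginx chart convention, and the `workload: webservice` label is hypothetical (use whatever label your Webservice node pool actually carries, or omit the selector entirely):

```yaml
nginx-ingress:
  controller:
    kind: DaemonSet          # run one controller pod per Webservice node
    nodeSelector:
      workload: webservice   # hypothetical node label - replace or remove
```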
diff --git a/doc/administration/reference_architectures/2k_users.md b/doc/administration/reference_architectures/2k_users.md index 0dcf15ccc67e6c16f5dcb2af694a4e7a8ea9fc1a..7a814ea2dd21f18a91b46406d3ce353b5417880b 100644 --- a/doc/administration/reference_architectures/2k_users.md +++ b/doc/administration/reference_architectures/2k_users.md @@ -10,7 +10,7 @@ DETAILS: **Tier:** Free, Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 40 requests per second (RPS), the typical peak load of up to 2,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 40 requests per second (RPS), the typical peak load of up to 2,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). @@ -20,7 +20,7 @@ For a full list of reference architectures, see > follow a modified [3K or 60 RPS reference architecture](3k_users.md#supported-modifications-for-lower-user-counts-ha). > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use). +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with). | Service | Nodes | Configuration | GCP | AWS | Azure | |------------------------------------|-------|------------------------|-----------------|--------------|----------| @@ -46,7 +46,8 @@ For a full list of reference architectures, see However, if you have large monorepos (larger than several gigabytes) this can **significantly** impact Git and Gitaly performance and an increase of specifications will likely be required. Refer to [large monorepos](index.md#large-monorepos) for more information. 6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. + However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. NOTE: @@ -108,7 +109,7 @@ against the following endpoint throughput targets: - Git (Push): 1 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -1118,16 +1119,19 @@ as the typical environment above.
First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | -|---------------------|-------|------------------------|-----------------|--------------|---------------------------------| -| Webservice | 3 | 8 vCPU, 7.2 GB memory | `n1-highcpu-8` | `c5.2xlarge` | 23.7 vCPU, 16.9 GB memory | -| Sidekiq | 2 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 7.8 vCPU, 25.9 GB memory | -| Supporting services | 2 | 2 vCPU, 7.5 GB memory | `n1-standard-2` | `m5.large` | 1.9 vCPU, 5.5 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 12 vCPU
15 GB memory (request)
21 GB memory (limit) | 3 x `n1-standard-8` | 3 x `c5.2xlarge` | +| Sidekiq | 3.6 vCPU
8 GB memory (request)
16 GB memory (limit) | 2 x `n1-standard-4` | 2 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-2` | 2 x `m5.large` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to specific nodes. A minimum of three nodes per node group in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -1149,7 +1153,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
```plantuml @startuml 2k @@ -1159,11 +1163,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x3" as gitlab #32CD32 - collections "**Sidekiq** x2" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - collections "**Supporting Services** x2" as support + collections "**Supporting Services**" as support } card "**Gitaly**" as gitaly #FF8C00 @@ -1186,36 +1190,43 @@ sidekiq -[#ff8dd1]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [2k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/2k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because two worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 40 RPS or 2,000 users we recommend a total Puma worker count of around 12. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 3 -Webservice pods with 4 workers per pod and 1 pod per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 40 RPS or 2,000 users we recommend a total Puma worker count of around 12 so in turn it's recommended to run at +least 3 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: -[The provided starting point](#cluster-topology) allows the deployment of up to -4 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +Similar to the standard deployment above, an initial target of 4 Sidekiq workers has been used here. 
+Additional workers may be required depending on your specific workflow. + +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). ### Supporting @@ -1223,12 +1234,16 @@ The Supporting Node Pool is designed to house all supporting deployments that do on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments such as Container Registry, Pages, or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. + +### Example config file + +An example for the GitLab Helm Charts targeting the above 40 RPS or 2,000 users reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/2k.yaml).
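The Sidekiq targets above translate into Helm values roughly as follows. This is a sketch assuming the chart's standard Sidekiq `minReplicas`/`maxReplicas` and `resources` keys; check the linked example file for the tested values:

```yaml
# Sketch of the 40 RPS / 2,000 users Sidekiq targets described above.
gitlab:
  sidekiq:
    minReplicas: 4    # initial target of 4 Sidekiq workers
    maxReplicas: 4    # raise if your workload needs additional workers
    resources:
      requests:
        cpu: 900m
        memory: 2G
      limits:
        memory: 4G
```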
diff --git a/doc/administration/reference_architectures/3k_users.md b/doc/administration/reference_architectures/3k_users.md index cf78fe53f52bb432b706f428a57ed5c9ccad9dac..bcdbe585981151fe8095dba086c39912f08b011f 100644 --- a/doc/administration/reference_architectures/3k_users.md +++ b/doc/administration/reference_architectures/3k_users.md @@ -10,7 +10,7 @@ DETAILS: **Tier:** Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 60 requests per second (RPS), the typical peak load of up to 3,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 60 requests per second (RPS), the typical peak load of up to 3,000 users, both manual and automated, based on real data. This architecture is the smallest one available with HA built in. If you require HA but have a lower user count or total load the [Supported Modifications for lower user counts](#supported-modifications-for-lower-user-counts-ha) For a full list of reference architectures, see @@ -23,7 +23,7 @@ For a full list of reference architectures, see > - **High Availability:** Yes, although [Praefect](#configure-praefect-postgresql) needs a third-party PostgreSQL solution > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid Alternative:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use). +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with). | Service | Nodes | Configuration | GCP | AWS | Azure | |-------------------------------------------|-------|-----------------------|-----------------|--------------|----------| @@ -54,12 +54,13 @@ For a full list of reference architectures, see Review the existing [technical limitations and considerations before deploying Gitaly Cluster](../gitaly/index.md#before-deploying-gitaly-cluster). If you want sharded Gitaly, use the same specs listed above for `Gitaly`. 1. Gitaly specifications are based on high percentiles of both usage patterns and repository sizes in good health. However, if you have [large monorepos](index.md#large-monorepos) (larger than several gigabytes) or [additional workloads](index.md#additional-workloads) these can *significantly* impact Git and Gitaly performance and further adjustments will likely be required. -1. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. +6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). + However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes.
NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. ```plantuml @startuml 3k @@ -160,7 +161,7 @@ against the following endpoint throughput targets: - Git (Push): 1 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -2256,16 +2257,19 @@ as the typical environment above. First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | -|---------------------|-------|-------------------------|-----------------|--------------|---------------------------------| -| Webservice | 2 | 16 vCPU, 14.4 GB memory | `n1-highcpu-16` | `c5.4xlarge` | 31.8 vCPU, 24.8 GB memory | -| Sidekiq | 3 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 11.8 vCPU, 38.9 GB memory | -| Supporting services | 2 | 2 vCPU, 7.5 GB memory | `n1-standard-2` | `m5.large` | 3.9 vCPU, 11.8 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 16 vCPU
20 GB memory (request)
28 GB memory (limit) | 2 x `n1-standard-16` | 2 x `c5.4xlarge` | +| Sidekiq | 7.2 vCPU
16 GB memory (request)
32 GB memory (limit) | 3 x `n1-standard-4` | 3 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-2` | 2 x `m5.large` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to specific nodes. A minimum of three nodes per node group in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -2298,7 +2302,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
```plantuml @startuml 3k @@ -2308,11 +2312,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x2" as gitlab #32CD32 - collections "**Sidekiq** x3" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - card "**Supporting Services** x2" as support + card "**Supporting Services**" as support } card "**Internal Load Balancer**" as ilb #9370DB @@ -2367,36 +2371,43 @@ consul .[#e76a9b]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [3k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/3k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because four worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 60 RPS or 3,000 users we recommend a total Puma worker count of around 16. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 4 -Webservice pods with 4 workers per pod and 2 pods per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 60 RPS or 3,000 users we recommend a total Puma worker count of around 16 so in turn it's recommended to run at +least 4 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: -[The provided starting point](#cluster-topology) allows the deployment of up to -8 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +Similar to the standard deployment above, an initial target of 8 Sidekiq workers has been used here. 
+Additional workers may be required depending on your specific workflow. + +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). ### Supporting @@ -2404,12 +2415,16 @@ The Supporting Node Pool is designed to house all supporting deployments that do on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments, such as Container Registry, Pages or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. + +### Example config file + +An example for the GitLab Helm Charts targeting the above 60 RPS or 3,000 user reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/3k.yaml).
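To make the Webservice and Sidekiq pod targets above more concrete, the following is a minimal, hypothetical sketch of how they might be expressed as Helm Chart values for the 60 RPS or 3,000 user hybrid. It is not taken from the linked `3k.yaml` example file, which remains the authoritative reference, and the key paths (`gitlab.webservice.workerProcesses`, `minReplicas`, `resources`, and their `gitlab.sidekiq` equivalents) should be verified against the Charts documentation.

```yaml
# Illustrative sketch only - verify key names and values against the linked
# 3k.yaml example file and the GitLab Charts documentation.
gitlab:
  webservice:
    workerProcesses: 4    # 4 Puma workers per pod, per the pod target above
    minReplicas: 4        # at least 4 pods for a total of ~16 Puma workers
    maxReplicas: 4
    resources:
      requests:
        cpu: 4
        memory: 5G        # per-pod memory request target
      limits:
        memory: 7G        # per-pod memory limit target
  sidekiq:
    minReplicas: 8        # initial target of 8 Sidekiq workers (1 worker per pod)
    maxReplicas: 8
    resources:
      requests:
        cpu: 900m
        memory: 2G
      limits:
        memory: 4G
```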
diff --git a/doc/administration/reference_architectures/50k_users.md b/doc/administration/reference_architectures/50k_users.md index e9ec5832673c7cdf9f4a57a49a7dfa47b51cc754..7be796618461468db19b2dbbfa6c89a0e26d0b24 100644 --- a/doc/administration/reference_architectures/50k_users.md +++ b/doc/administration/reference_architectures/50k_users.md @@ -10,20 +10,20 @@ DETAILS: **Tier:** Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 1000 requests per second (RPS), the typical peak load of up to 50,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 1000 requests per second (RPS), the typical peak load of up to 50,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). NOTE: Before deploying this architecture it's recommended to read through the [main documentation](index.md) first, -specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-use) sections. +specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-start-with) sections. > - **Target load:** API: 1000 RPS, Web: 100 RPS, Git (Pull): 100 RPS, Git (Push): 20 RPS > - **High Availability:** Yes ([Praefect](#configure-praefect-postgresql) needs a third-party PostgreSQL solution for HA) > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid Alternative:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use) +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with) | Service | Nodes | Configuration | GCP | AWS | Azure | |------------------------------------------|-------|-------------------------|------------------|---------------|-----------| @@ -55,12 +55,13 @@ specifically the [Before you start](index.md#before-you-start) and [Deciding whi Review the existing [technical limitations and considerations before deploying Gitaly Cluster](../gitaly/index.md#before-deploying-gitaly-cluster). If you want sharded Gitaly, use the same specs listed above for `Gitaly`. 6. Gitaly specifications are based on high percentiles of both usage patterns and repository sizes in good health. However, if you have [large monorepos](index.md#large-monorepos) (larger than several gigabytes) or [additional workloads](index.md#additional-workloads) these can *significantly* impact Git and Gitaly performance and further adjustments will likely be required. -7. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. +6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). 
+ However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. ```plantuml @startuml 50k @@ -164,7 +165,7 @@ against the following endpoint throughput targets: - Git (Push): 20 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -2288,16 +2289,19 @@ as the typical environment above. First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | |---------------------|-------|-------------------------|-----------------|--------------|---------------------------------| -| Webservice | 16 | 32 vCPU, 28.8 GB memory | `n1-highcpu-32` | `c5.9xlarge` | 510 vCPU, 472 GB memory | -| Sidekiq | 4 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 15.5 vCPU, 50 GB memory | -| Supporting services | 2 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 7.75 vCPU, 25 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 308 vCPU
385 GB memory (request)
539 GB memory (limit) | 11 x `n1-standard-32` | 11 x `c5.9xlarge` | +| Sidekiq | 12.6 vCPU
28 GB memory (request)
56 GB memory (limit) | 4 x `n1-standard-4` | 4 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-4` | 2 x `m5.xlarge` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to specific nodes. A minimum of three nodes per node group in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -2331,7 +2335,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
```plantuml @startuml 50k @@ -2341,11 +2345,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x16" as gitlab #32CD32 - collections "**Sidekiq** x4" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - card "**Supporting Services** x2" as support + card "**Supporting Services**" as support } card "**Internal Load Balancer**" as ilb #9370DB @@ -2403,36 +2407,43 @@ consul .[#e76a9b]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [50k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/50k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because four worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 1000 RPS or 50,000 users we recommend a total Puma worker count of around 320. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 80 -Webservice pods with 4 workers per pod and 5 pods per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 1000 RPS or 50,000 users we recommend a total Puma worker count of around 308 so in turn it's recommended to run at +least 77 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: -[The provided starting point](#cluster-topology) allows the deployment of up to -14 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +Similar to the standard deployment above, an initial target of 14 Sidekiq workers has been used here.
+Additional workers may be required depending on your specific workflow. + +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). ### Supporting @@ -2440,12 +2451,16 @@ The Supporting Node Pool is designed to house all supporting deployments that do on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments, such as Container Registry, Pages or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. + +### Example config file + +An example for the GitLab Helm Charts targeting the above 1000 RPS or 50,000 user reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/50k.yaml).
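The NGINX guidance above (running the controller pods across the Webservice nodes as a DaemonSet) could look roughly like the following sketch. This is illustrative only: it assumes the bundled `nginx-ingress` subchart accepts `controller.kind` and `controller.nodeSelector`, and that the Webservice node pool carries a hypothetical `workload: webservice` label, so check the Charts documentation and your own node labels before applying anything similar.

```yaml
# Illustrative sketch only - the node label below is hypothetical and the
# controller keys should be verified against the bundled nginx-ingress subchart.
nginx-ingress:
  controller:
    kind: DaemonSet          # one controller pod per node it schedules onto
    nodeSelector:
      workload: webservice   # hypothetical label identifying the Webservice node pool
```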
diff --git a/doc/administration/reference_architectures/5k_users.md b/doc/administration/reference_architectures/5k_users.md index 71bc40cd9ae3e9f9af1f954d7739d8a9d55437ee..25eec25617d94d14b2c10d47d71da3650d01534f 100644 --- a/doc/administration/reference_architectures/5k_users.md +++ b/doc/administration/reference_architectures/5k_users.md @@ -10,20 +10,20 @@ DETAILS: **Tier:** Premium, Ultimate **Offering:** Self-managed -This page describes the GitLab reference architecture designed to target a peak load of 100 requests per second (RPS) - The typical peak load of up to 5,000 users, both manual and automated, based on real data with headroom added. +This page describes the GitLab reference architecture designed to target a peak load of 100 requests per second (RPS) - The typical peak load of up to 5,000 users, both manual and automated, based on real data. For a full list of reference architectures, see [Available reference architectures](index.md#available-reference-architectures). NOTE: Before deploying this architecture it's recommended to read through the [main documentation](index.md) first, -specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-use) sections. +specifically the [Before you start](index.md#before-you-start) and [Deciding which architecture to use](index.md#deciding-which-architecture-to-start-with) sections. > - **Target load:** API: 100 RPS, Web: 10 RPS, Git (Pull): 10 RPS, Git (Push): 2 RPS > - **High Availability:** Yes ([Praefect](#configure-praefect-postgresql) needs a third-party PostgreSQL solution for HA) > - **Estimated Costs:** [See cost table](index.md#cost-to-run) > - **Cloud Native Hybrid Alternative:** [Yes](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) -> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-use) +> - **Unsure which Reference Architecture to use?** [Go to this guide for more info](index.md#deciding-which-architecture-to-start-with) | Service | Nodes | Configuration | GCP | AWS | Azure | |-------------------------------------------|-------|-------------------------|-----------------|--------------|----------| @@ -54,12 +54,13 @@ specifically the [Before you start](index.md#before-you-start) and [Deciding whi Review the existing [technical limitations and considerations before deploying Gitaly Cluster](../gitaly/index.md#before-deploying-gitaly-cluster). If you want sharded Gitaly, use the same specs listed above for `Gitaly`. 6. Gitaly specifications are based on high percentiles of both usage patterns and repository sizes in good health. However, if you have [large monorepos](index.md#large-monorepos) (larger than several gigabytes) or [additional workloads](index.md#additional-workloads) these can *significantly* impact Git and Gitaly performance and further adjustments will likely be required. -7. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). - However, for GitLab Rails certain processes like [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) should be run on only one node. +6. Can be placed in Auto Scaling Groups (ASGs) as the component doesn't store any [stateful data](index.md#autoscaling-of-stateful-nodes). 
+ However, [Cloud Native Hybrid setups](#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative) are generally preferred as certain components + such as [migrations](#gitlab-rails-post-configuration) and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. ```plantuml @startuml 5k @@ -160,7 +161,7 @@ against the following endpoint throughput targets: - Git (Push): 2 RPS The above targets were selected based on real customer data of total environmental loads corresponding to the user count, -including CI and other workloads along with additional substantial headroom added. +including CI and other workloads. If you have metrics to suggest that you have regularly higher throughput against the above endpoint targets, [large monorepos](index.md#large-monorepos) or notable [additional workloads](index.md#additional-workloads) these can notably impact the performance environment and [further adjustments may be required](index.md#scaling-an-environment). @@ -2231,16 +2232,19 @@ as the typical environment above. First are the components that run in Kubernetes. These run across several node groups, although you can change the overall makeup as desired as long as the minimum CPU and Memory requirements are observed. -| Service Node Group | Nodes | Configuration | GCP | AWS | Min Allocatable CPUs and Memory | |-------------------- |-------|-------------------------|-----------------|--------------|---------------------------------| -| Webservice | 5 | 16 vCPU, 14.4 GB memory | `n1-highcpu-16` | `c5.4xlarge` | 79.5 vCPU, 62 GB memory | -| Sidekiq | 3 | 4 vCPU, 15 GB memory | `n1-standard-4` | `m5.xlarge` | 11.8 vCPU, 38.9 GB memory | -| Supporting services | 2 | 2 vCPU, 7.5 GB memory | `n1-standard-2` | `m5.large` | 3.9 vCPU, 11.8 GB memory | +| Component Node Group | Target Node Pool Totals | GCP Example | AWS Example | +|----------------------|-------------------------|-----------------|--------------| +| Webservice | 36 vCPU
45 GB memory (request)
63 GB memory (limit) | 3 x `n1-standard-16` | 3 x `c5.4xlarge` | +| Sidekiq | 7.2 vCPU
16 GB memory (request)
32 GB memory (limit) | 3 x `n1-standard-4` | 3 x `m5.xlarge` | +| Supporting services | 4 vCPU
15 GB memory | 2 x `n1-standard-2` | 2 x `m5.large` | - For this setup, we **recommend** and regularly [test](index.md#validation-and-test-results) - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. -- Nodes configuration is shown as it is forced to ensure pod vCPU / memory ratios and avoid scaling during **performance testing**. - - In production deployments, there is no need to assign pods to nodes. A minimum of three nodes in three different availability zones is strongly recommended to align with resilient cloud architecture practices. +[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) and [Amazon Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/). Other Kubernetes services may also work, but your mileage may vary. +- GCP and AWS examples of how to reach the Target Node Pool Total are given for convenience. These sizes are used in performance testing but following the example is not required. Different node pool designs can be used as desired as long as the targets are met, and all pods can deploy. +- The [Webservice](#webservice) and [Sidekiq](#sidekiq) target node pool totals are given for GitLab components only. Additional resources are required for the chosen Kubernetes provider's system processes. The given examples take this into account. +- The [Supporting](#supporting) target node pool total is given generally to accommodate several resources for supporting the GitLab deployment as well as any additional deployments you may wish to make depending on your requirements. Similar to the other node pools, the chosen Kubernetes provider's system processes also require resources. The given examples take this into account. +- In production deployments, it's not required to assign pods to specific nodes. However, it is recommended to have several nodes in each pool spread across different availability zones to align with resilient cloud architecture practices. +- Enabling autoscaling, such as Cluster Autoscaler, for efficiency reasons is encouraged, but it's generally recommended targeting a floor of 75% for Webservice and Sidekiq pods to ensure ongoing performance. Next are the backend components that run on static compute VMs using the Linux package (or External PaaS services where applicable): @@ -2273,7 +2277,7 @@ services where applicable): NOTE: -For all PaaS solutions that involve configuring instances, it is strongly recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. +For all PaaS solutions that involve configuring instances, it's recommended to implement a minimum of three nodes in three different availability zones to align with resilient cloud architecture practices. 
```plantuml @startuml 5k @@ -2283,11 +2287,11 @@ card "Kubernetes via Helm Charts" as kubernetes { card "**External Load Balancer**" as elb #6a9be7 together { - collections "**Webservice** x5" as gitlab #32CD32 - collections "**Sidekiq** x3" as sidekiq #ff8dd1 + collections "**Webservice**" as gitlab #32CD32 + collections "**Sidekiq**" as sidekiq #ff8dd1 } - card "**Supporting Services** x2" as support + card "**Supporting Services**" as support } card "**Internal Load Balancer**" as ilb #9370DB @@ -2342,36 +2346,43 @@ consul .[#e76a9b]--> redis @enduml ``` -### Resource usage settings +### Kubernetes component targets -The following formulas help when calculating how many pods may be deployed within resource constraints. -The [5k reference architecture example values file](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/5k.yaml) -documents how to apply the calculated configuration to the Helm Chart. +The following section details the targets used for the GitLab components deployed in Kubernetes. #### Webservice -Webservice pods typically need about 1 CPU and 1.25 GB of memory _per worker_. -Each Webservice pod consumes roughly 4 CPUs and 5 GB of memory using -the [recommended topology](#cluster-topology) because four worker processes -are created by default and each pod has other small processes running. +Each Webservice pod (Puma and Workhorse) is recommended to be run with the following configuration: -For 100 RPS or 5,000 users we recommend a total Puma worker count of around 40. -With the [provided recommendations](#cluster-topology) this allows the deployment of up to 10 -Webservice pods with 4 workers per pod and 2 pods per node. Expand available resources using -the ratio of 1 CPU to 1.25 GB of memory _per each worker process_ for each additional -Webservice pod. +- 4 Puma Workers +- 4 vCPU +- 5 GB memory (request) +- 7 GB memory (limit) -For further information on resource usage, see the [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). +For 100 RPS or 5,000 users we recommend a total Puma worker count of around 36 so in turn it's recommended to run at +least 9 Webservice pods. + +For further information on Webservice resource usage, see the Charts documentation on [Webservice resources](https://docs.gitlab.com/charts/charts/gitlab/webservice/#resources). + +##### NGINX + +It's also recommended deploying the NGINX controller pods across the Webservice nodes as a DaemonSet. This is to allow the controllers to scale dynamically with the Webservice pods they serve as well as take advantage of the higher network bandwidth larger machine types typically have. + +Note that this isn't a strict requirement. The NGINX controller pods can be deployed as desired as long as they have enough resources to handle the web traffic. #### Sidekiq -Sidekiq pods should generally have 0.9 CPU and 2 GB of memory. +Each Sidekiq pod is recommended to be run with the following configuration: -[The provided starting point](#cluster-topology) allows the deployment of up to -8 Sidekiq pods. Expand available resources using the 0.9 CPU to 2 GB memory -ratio for each additional pod. +- 1 Sidekiq worker +- 900m vCPU +- 2 GB memory (request) +- 4 GB memory (limit) -For further information on resource usage, see the [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). +Similar to the standard deployment above, an initial target of 8 Sidekiq workers has been used here. 
+Additional workers may be required depending on your specific workflow. + +For further information on Sidekiq resource usage, see the Charts documentation on [Sidekiq resources](https://docs.gitlab.com/charts/charts/gitlab/sidekiq/#resources). ### Supporting @@ -2379,12 +2390,16 @@ The Supporting Node Pool is designed to house all supporting deployments that do on the Webservice and Sidekiq pools. This includes various deployments related to the Cloud Provider's implementation and supporting -GitLab deployments such as NGINX or [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). +GitLab deployments such as [GitLab Shell](https://docs.gitlab.com/charts/charts/gitlab/gitlab-shell/). -If you wish to make any additional deployments, such as for Monitoring, it's recommended +If you wish to make any additional deployments, such as Container Registry, Pages or Monitoring, it's recommended to deploy these in this pool where possible and not in the Webservice or Sidekiq pools, as the Supporting pool has been designed specifically to accommodate several additional deployments. However, if your deployments don't fit into the -pool as given, you can increase the node pool accordingly. +pool as given, you can increase the node pool accordingly. Conversely, if the pool in your use case is over-provisioned, you can reduce it accordingly. + +### Example config file + +An example for the GitLab Helm Charts targeting the above 100 RPS or 5,000 user reference architecture configuration [can be found in the Charts project](https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/ref/5k.yaml).
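One way to read the autoscaling guidance above (keeping a floor of roughly 75% of the Webservice and Sidekiq pod targets) is as a lower bound on replicas. Against the 100 RPS or 5,000 user targets of at least 9 Webservice pods and 8 Sidekiq workers, a ~75% floor works out to roughly 7 and 6 pods respectively. The following is a hypothetical sketch only, assuming the chart's `minReplicas`/`maxReplicas` keys; verify the key names and the resulting scaling behaviour against the Charts documentation.

```yaml
# Illustrative sketch only - a ~75% replica floor against the 100 RPS / 5,000 user
# pod targets (9 Webservice pods, 8 Sidekiq pods).
gitlab:
  webservice:
    minReplicas: 7   # ~75% of the 9-pod target
    maxReplicas: 9
  sidekiq:
    minReplicas: 6   # ~75% of the 8-pod target
    maxReplicas: 8
```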
diff --git a/doc/administration/reference_architectures/index.md b/doc/administration/reference_architectures/index.md index b8a705bb5388baae8cc5a919a17b2898328424b3..419a10c1aa4c2355828ac77fded32b318e8396c9 100644 --- a/doc/administration/reference_architectures/index.md +++ b/doc/administration/reference_architectures/index.md @@ -12,16 +12,16 @@ DETAILS: **Offering:** Self-managed The GitLab Reference Architectures have been designed and tested by the -GitLab Test Platform and Support teams to provide scalable recommended deployments for target loads. +GitLab Test Platform and Support teams to provide recommended scalable and elastic deployments as starting points for target loads. ## Available reference architectures The following Reference Architectures are available as recommended starting points for your environment. -The architectures are named in terms of peak load, based on user count or Requests per Second (RPS). Where the latter has been calculated based on average real data of the former with headroom added. +The architectures are named in terms of peak load, based on user count or Requests per Second (RPS). Where the latter has been calculated based on average real data. NOTE: -Each architecture has been designed to be [scalable and can be adjusted accordingly if required](#scaling-an-environment) by your specific workload. This may be likely in known heavy scenarios such as using [large monorepos](#large-monorepos) or notable [additional workloads](#additional-workloads). +Each architecture has been designed to be [scalable and elastic](#scaling-an-environment). As such, they can be adjusted accordingly if required by your specific workload. This may be likely in known heavy scenarios such as using [large monorepos](#large-monorepos) or notable [additional workloads](#additional-workloads). For details about what each Reference Architecture has been tested against, see the "Testing Methodology" section of each page. @@ -56,30 +56,32 @@ Running any application in production is complex, and the same applies for GitLa As such, it's recommended that you have a working knowledge of running and maintaining applications in production when deciding on going down this route. If you aren't in this position, our [Professional Services](https://about.gitlab.com/services/#implementation-services) team offers implementation services, but for those who want a more managed solution long term, it's recommended to instead explore our other offerings such as [GitLab SaaS](../../subscriptions/gitlab_com/index.md) or [GitLab Dedicated](../../subscriptions/gitlab_dedicated/index.md). -If Self Managed is the approach you're considering, it's strongly encouraged to read through this page in full, in particular the [Deciding which architecture to use](#deciding-which-architecture-to-use), [Large monorepos](#large-monorepos) and [Additional workloads](#additional-workloads) sections. +If Self Managed is the approach you're considering, it's strongly encouraged to read through this page in full, in particular the [Deciding which architecture to use](#deciding-which-architecture-to-start-with), [Large monorepos](#large-monorepos) and [Additional workloads](#additional-workloads) sections. -## Deciding which architecture to use +## Deciding which architecture to start with -The Reference Architectures are designed to strike a balance between two important factors--performance and resilience. 
+The Reference Architectures are designed to strike a balance between three important factors--performance, resilience and costs. -While they are designed to make it easier to set up GitLab at scale, it can still be a challenge to know which one meets your requirements. +While they are designed to make it easier to set up GitLab at scale, it can still be a challenge to know which one meets your requirements and where to start accordingly. As a general guide, **the more performant and/or resilient you want your environment to be, the more complex it is**. -This section explains the designs you can choose from. It begins with the least complexity, goes to the most, and ends with a decision tree. +This section explains the things to consider when picking a Reference Architecture to start with. -### Expected Load (RPS or user count) +### Expected Load The first thing to check is what the expected peak load is your environment would be expected to serve. Each architecture is described in terms of peak Requests per Second (RPS) or user count load. As detailed under the "Testing Methodology" section on each page, each architecture is tested -against its listed RPS for each endpoint type (API, Web, Git), which is the typical peak load of the given user count, both manual and automated, with headroom. +against its listed RPS for each endpoint type (API, Web, Git), which is the typical peak load of the given user count, both manual and automated. -It's strongly recommended finding out what peak RPS your environment will be expected to handle across endpoint types, through existing metrics (such as [Prometheus](../monitoring/prometheus/gitlab_metrics.md)) +It's strongly recommended finding out what peak RPS your environment will be expected to handle across endpoint types, through existing metrics (such as [Prometheus](../monitoring/prometheus/index.md#sample-prometheus-queries)) or estimates, and to select the corresponding architecture as this is the most objective. +#### If in doubt, pick the closest user count and scale accordingly + If it's not possible for you to find out the expected peak RPS then it's recommended to select based on user count to start and then monitor the environment -closely to confirm the RPS, whether the architecture is performing and adjust accordingly is necessary. +closely to confirm the RPS, whether the architecture is performing and [scale accordingly](#scaling-an-environment) as necessary. ### Standalone (non-HA) @@ -267,7 +269,7 @@ the following guidance is followed to ensure the best chance of good performance ### Additional workloads These reference architectures have been [designed and tested](index.md#validation-and-test-results) for standard GitLab -setups with good headroom in mind to cover most scenarios. +setups based on real data. However, additional workloads can multiply the impact of operations by triggering follow-up actions. You may need to adjust the suggested specifications to compensate if you use, for example: @@ -307,12 +309,12 @@ We don’t recommend the use of round-robin algorithms as they are known to not The total network bandwidth available to a load balancer when deployed on a machine can vary notably across Cloud Providers. In particular some Cloud Providers, like [AWS](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html), may operate on a burst system with credits to determine the bandwidth at any time. 
-The network bandwidth your environment's load balancers will require is dependent on numerous factors such as data shape and workload. The recommended base sizes for each Reference Architecture class have been selected to give a good level of bandwidth with adequate headroom but in some scenarios, such as consistent clones of [large monorepos](#large-monorepos), the sizes may need to be adjusted accordingly. +The network bandwidth your environment's load balancers will require is dependent on numerous factors such as data shape and workload. The recommended base sizes for each Reference Architecture class have been selected based on real data but in some scenarios, such as consistent clones of [large monorepos](#large-monorepos), the sizes may need to be adjusted accordingly. ### No swap Swap is not recommended in the reference architectures. It's a failsafe that impacts performance greatly. The -reference architectures are designed to have memory headroom to avoid needing swap. +reference architectures are designed to have enough memory in most cases to avoid needing swap. ### Praefect PostgreSQL @@ -386,7 +388,7 @@ Additionally, the following cloud provider services are recommended for use as p Database - 🟒   Cloud SQL + 🟒   Cloud SQL1 🟒   RDS 🟒   Azure Database for PostgreSQL Flexible Server @@ -401,6 +403,12 @@ Additionally, the following cloud provider services are recommended for use as p + + +1. The [Enterprise Plus edition](https://cloud.google.com/sql/docs/editions-intro) for GCP Cloud SQL is generally recommended for optimal performance. This recommendation is especially so for larger environments (500 RPS / 25k users or higher). Max connections may need to be adjusted higher than the service's defaults depending on workload. +2. It's strongly recommended deploying the [Premium tier of Azure Cache for Redis](https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-overview#service-tiers) to ensure good performance. + + ### Recommendation notes for the database services [When selecting to use an external database service](../postgresql/external.md), it should run a standard, performant, and [supported version](../../install/requirements.md#postgresql-requirements). @@ -409,9 +417,9 @@ If you choose to use a third party external service: 1. Note that the HA Linux package PostgreSQL setup encompasses PostgreSQL, PgBouncer and Consul. All of these components would no longer be required when using a third party external service. 1. The number of nodes required to achieve HA may differ depending on the service compared to the Linux package and doesn't need to match accordingly. -1. However, if [Database Load Balancing](../postgresql/database_load_balancing.md) via Read Replicas is desired for further improved performance it's recommended to follow the node count for the Reference Architecture. +1. It's recommended in general to enable Read Replicas for [Database Load Balancing](../postgresql/database_load_balancing.md) if possible, matching the node counts for the standard Linux package deployment. This recommendation is especially so for larger environments (over 200 RPS / 10k users). 1. Ensure that if a pooler is offered as part of the service that it can handle the total load without bottlenecking. - For example, Azure Database for PostgreSQL Flexible Server can optionally deploy a PgBouncer pooler in front of the Database, but PgBouncer is single threaded, so this in turn may cause bottlenecking. 
However, if using Database Load Balancing, this could be enabled on each node in distributed fashion to compensate. +For example, Azure Database for PostgreSQL Flexible Server can optionally deploy a PgBouncer pooler in front of the Database, but PgBouncer is single threaded, so this in turn may cause bottlenecking. However, if using Database Load Balancing, this could be enabled on each node in distributed fashion to compensate. 1. If [GitLab Geo](../geo/index.md) is to be used the service will need to support Cross Region replication. ### Recommendation notes for the Redis services @@ -468,12 +476,12 @@ This also applies to other third-party stateful components such as Postgres and #### Autoscaling of stateful nodes As a general guidance, only _stateless_ components of GitLab can be run in Autoscaling groups, namely GitLab Rails -and Sidekiq. - -Other components that have state, such as Gitaly, are not supported in this fashion (for more information, see [issue 2997](https://gitlab.com/gitlab-org/gitaly/-/issues/2997)). +and Sidekiq. Other components that have state, such as Gitaly, are not supported in this fashion (for more information, see [issue 2997](https://gitlab.com/gitlab-org/gitaly/-/issues/2997)). This also applies to other third-party stateful components such as Postgres and Redis, but you can explore other third-party solutions for those components if desired such as supported Cloud Provider services unless called out specifically as unsupported. +However, [Cloud Native Hybrid setups](#cloud-native-hybrid) are generally preferred over ASGs as certain components such as database migrations and [Mailroom](../incoming_email.md) can only be run on one node, which is handled better in Kubernetes. + #### Spreading one environment over multiple data centers Deploying one GitLab environment over multiple data centers is not supported due to potential split brain edge cases @@ -523,7 +531,7 @@ per 1,000 users: - Git (Pull): 2 RPS - Git (Push): 0.4 RPS (rounded to the nearest integer) -The above RPS targets were selected based on real customer data of total environmental loads corresponding to the user count, including CI and other workloads along with additional substantial headroom added. +The above RPS targets were selected based on real customer data of total environmental loads corresponding to the user count, including CI and other workloads. ### How to interpret the results @@ -627,11 +635,16 @@ table.test-coverage th { ## Cost to run -As a starting point, the following table details initial costs for the different reference architectures across GCP, AWS, and Azure through the Linux package. +As a starting point, the following table details initial costs for the different reference architectures across GCP, AWS, and Azure through the Linux package via each cloud provider's official calculator. -NOTE: -Due to the nature of Cloud Native Hybrid, it's not possible to give a static cost calculation. -Bare-metal costs are also not included here as it varies widely depending on each configuration. +However, please be aware of the following caveats: + +- These are only rough estimates for the Linux package environments. +- They do not take into account dynamic elements such as disk, network or object storage. +- Due to the nature of Cloud Native Hybrid, it's not possible to give a static cost calculation for that deployment. +- Bare-metal costs are also not included here as they vary widely depending on each configuration.
+ +Due to the above, it's strongly recommended taking these calculators and adjusting them as closely as possible to your specific setup and usage to get a more accurate estimate. @@ -698,20 +711,9 @@ Maintaining a Reference Architecture environment is generally the same as any ot In this section you'll find links to documentation for relevant areas as well as any specific Reference Architecture notes. -### Upgrades - -Upgrades for a Reference Architecture environment is the same as any other GitLab environment. -The main [Upgrade GitLab](../../update/index.md) section has detailed steps on how to approach this. - -[Zero-downtime upgrades](#zero-downtime-upgrades) are also available. - -NOTE: -You should upgrade a Reference Architecture in the same order as you created it. - ### Scaling an environment -The Reference Architectures have been designed to support scaling in various ways depending on your use case and circumstances. -This can be done iteratively or wholesale to the next size of architecture depending on if metrics suggest a component is being exhausted. +The Reference Architectures have been designed as a starting point and are elastic and scalable throughout. It's likely that you'll want to adjust the environment for your specific needs after deployment for reasons such as additional performance capacity or reduced costs. This is expected and, as such, scaling can be done iteratively or wholesale to the next size of architecture depending on whether metrics suggest a component is being exhausted. NOTE: If you're seeing a component continuously exhausting it's given resources it's strongly recommended for you to reach out to our [Support team](https://about.gitlab.com/support/) before performing any scaling. This is especially so if you're planning to scale any component significantly. @@ -730,7 +732,7 @@ You should take an iterative approach when scaling downwards, however, to ensure In some cases scaling a component significantly may result in knock on effects for downstream components, impacting performance. The Reference Architectures were designed with balance in mind to ensure components that depend on each other are congruent in terms of specs. As such you may find when notably scaling a component that it's increase may result in additional throughput being passed to the other components it depends on and that they, in turn, may need to be scaled as well. NOTE: -As a general rule most components have good headroom to accommodate an upstream component being scaled, so this is typically on a case by case basis and specific to what has been changed. It's recommended for you to reach out to our [Support team](https://about.gitlab.com/support/) before you make any significant changes to the environment. +The Reference Architectures have been designed to have elasticity to accommodate an upstream component being scaled. However, it's still generally recommended for you to reach out to our [Support team](https://about.gitlab.com/support/) before you make any significant changes to the environment to be safe. The following components can impact others when they have been significantly scaled: @@ -749,6 +751,16 @@ documentation for each as follows - [Postgres to multi-node Postgres w/ Consul + PgBouncer](../postgresql/moving.md) - [Gitaly to Gitaly Cluster w/ Praefect](../gitaly/index.md#migrate-to-gitaly-cluster) +### Upgrades + +Upgrades for a Reference Architecture environment are the same as for any other GitLab environment.
+The main [Upgrade GitLab](../../update/index.md) section has detailed steps on how to approach this. + +[Zero-downtime upgrades](#zero-downtime-upgrades) are also available. + +NOTE: +You should upgrade a Reference Architecture in the same order as you created it. + ### Monitoring There are numerous options available to monitor your infrastructure, as well as [GitLab itself](../monitoring/index.md), and you should refer to your selected monitoring solution's documentation for more information. @@ -763,6 +775,7 @@ You can find a full history of changes [on the GitLab project](https://gitlab.co **2024:** +- [2024-04](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/149878): Updated recommended sizings for Webservice nodes for Cloud Native Hybrids on GCP. Also adjusted NGINX pod recommendation to be run on Webservice node pool as a DaemonSet. - [2024-04](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/149528): Updated 20 RPS / 1,000 User architecture specs to follow recommended memory target of 16 GB. - [2024-04](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/148313): Updated Reference Architecture titles to include RPS for further clarity and to help right sizing. - [2024-02](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/145436): Updated recommended sizings for Load Balancer nodes if deployed on VMs. Also added notes on network bandwidth considerations.
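Tying back to the database service recommendations earlier in this file (enabling Read Replicas for Database Load Balancing where possible), the following is a minimal, hypothetical sketch of how replica hosts might be listed for a Cloud Native Hybrid deployment through the chart's global PostgreSQL settings. The host names are placeholders, and the `global.psql.load_balancing` keys should be verified against the Charts globals documentation; Linux package nodes would use the equivalent `gitlab_rails['db_load_balancing']` setting instead.

```yaml
# Illustrative sketch only - host names are placeholders and the load_balancing
# keys should be checked against the GitLab Charts globals documentation.
global:
  psql:
    host: postgres-primary.example.internal      # primary (read-write) endpoint
    load_balancing:
      hosts:
        - postgres-replica-1.example.internal    # read replicas used for load balancing
        - postgres-replica-2.example.internal
```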