From 78160ab489dccb22aa40ce981c7b10ef57b444cc Mon Sep 17 00:00:00 2001 From: Nicholas Klick Date: Thu, 4 Dec 2025 00:47:38 -0500 Subject: [PATCH 1/7] Update ClickHouse docs with latest info --- doc/integration/clickhouse.md | 187 +++++++++++++++++++++------------- 1 file changed, 119 insertions(+), 68 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index b9cbf346fb097d..0c4f916e002fdd 100644 --- a/doc/integration/clickhouse.md +++ b/doc/integration/clickhouse.md @@ -1,20 +1,22 @@ +```markdown --- -stage: Analytics -group: Platform Insights +stage: none +group: unassigned info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments gitlab_dedicated: yes -title: ClickHouse integration guidelines --- -{{< details >}} +# ClickHouse **(FREE ALL BETA)** -- Tier: Free, Premium, Ultimate -- Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated -- Status: Beta on GitLab Self-Managed and GitLab Dedicated +DETAILS: +**Tier:** Free, Premium, Ultimate +**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated +**Status:** Beta on GitLab Self-Managed and GitLab Dedicated -{{< /details >}} +ClickHouse is a secondary data store for GitLab that enables advanced analytical features such as GitLab Duo, SDLC trends, and CI Analytics. Only specific data is stored in ClickHouse for these analytics purposes. -{{< alert type="note" >}} +NOTE: +For more information on plans for ClickHouse support for GitLab Self-Managed, see [this epic](#). For more information on plans for ClickHouse support for GitLab Self-Managed, see [epic 51](https://gitlab.com/groups/gitlab-org/architecture/gitlab-data-analytics/-/epics/51). 
@@ -37,36 +39,52 @@ Alternatively, you can [bring your own ClickHouse](https://clickhouse.com/docs/e ## Supported ClickHouse versions | First GitLab version | ClickHouse versions | Comment | -|----------------------|---------------------|---------| -| 17.7.0 | 23.x (24.x, 25.x) | For using ClickHouse 24.x and 25.x see the [workaround section](#database-schema-migrations-on-gitlab-1800-and-earlier). | -| 18.1.0 | 23.x, 24.x, 25.x | | -| 18.5.0 | 23.x, 24.x, 25.x | Experimental support for `Replicated` database engine. | +|---------------------|---------------------|---------| +| 17.7.0 | 23.x (24.x, 25.x) | For using ClickHouse 24.x and 25.x see the [workaround section](#database-schema-migrations-on-gitlab-1800-and-earlier). | +| 18.1.0 | 23.x, 24.x, 25.x | | +| 18.5.0 | 23.x, 24.x, 25.x | Experimental support for Replicated database engine. | -{{< alert type="note" >}} +NOTE: +ClickHouse Cloud is supported. Compatibility is generally ensured with the latest major GitLab release and newer versions. -[ClickHouse Cloud](https://clickhouse.com/cloud) is supported. Compatibility is generally ensured with the latest major GitLab release and newer versions. +## ClickHouse glossary -{{< /alert >}} +Understanding these ClickHouse concepts will help you configure and maintain your installation: + +- **Cluster**: A collection of nodes (servers) that work together to store and process data. +- **MergeTree**: A table engine designed for high data ingest rates and large data volumes. It provides columnar storage, custom partitioning, sparse primary indexes, and support for background data merges. +- **Parts**: Physical files on disk that store portions of a table's data. This differs from partitions, which are logical divisions created using a partition key. +- **Replica**: A copy of data stored in a ClickHouse database for redundancy and reliability. Used with the ReplicatedMergeTree table engine to keep multiple copies of data synchronized across different servers. 
+- **Shard**: A subset of data. ClickHouse always has at least one shard. Sharding data across multiple servers divides the load when you exceed the capacity of a single server. +- **TTL (Time To Live)**: A feature that automatically moves, deletes, or rolls up columns or rows after a specified time period, enabling efficient storage management. + +## System requirements + +For detailed system requirements and sizing recommendations, see [issue #548450](#). ## Set up ClickHouse +You can connect ClickHouse to GitLab either: + +- **Recommended**: With [ClickHouse Cloud](https://clickhouse.cloud/). +- By bringing your own ClickHouse installation. For more information, see [ClickHouse recommendations for GitLab Self-Managed](https://clickhouse.com/docs/en/install#recommendations-for-self-managed-clickhouse). + +When you run ClickHouse on a hosted server, various factors impact resource consumption, including the number of builds that run on your instance each month, selected hardware, and data center choice. Regardless, the cost should not be significant. + To set up ClickHouse with GitLab: -1. [Run ClickHouse Cluster and configure database](#run-and-configure-clickhouse). -1. [Configure GitLab connection to ClickHouse](#configure-the-gitlab-connection-to-clickhouse). +1. [Run ClickHouse and configure the database](#run-and-configure-clickhouse). +1. [Configure the GitLab connection to ClickHouse](#configure-the-gitlab-connection-to-clickhouse). 1. [Run ClickHouse migrations](#run-clickhouse-migrations). +1. [Enable ClickHouse for Analytics](#enable-clickhouse-for-analytics). ### Run and configure ClickHouse -When you run ClickHouse on a hosted server, various data points might impact the resource consumption, like the number -of builds that run on your instance each month, the selected hardware, the data center choice to host ClickHouse, and more. -Regardless, the cost should not be significant. - To create the necessary user and database objects: 1. 
Generate a secure password and save it. 1. Sign in to the ClickHouse SQL console. -1. Execute the following command. Replace `PASSWORD_HERE` with the generated password. +1. Execute the following command, replacing `PASSWORD_HERE` with the generated password: ```sql CREATE DATABASE gitlab_clickhouse_main_production; @@ -77,11 +95,42 @@ To create the necessary user and database objects: GRANT gitlab_app TO gitlab; ``` +#### Configure with Replicated database engine **(EXPERIMENT)** + +DETAILS: +**Status:** Experiment +**Introduced:** GitLab 18.5 + +For a multi-node, high-availability setup, GitLab supports the Replicated table engine in ClickHouse. + +Prerequisites: + +- A cluster must be defined in the `remote_servers` configuration section. +- The following macros must be configured: + - `cluster` + - `shard` + - `replica` + +When configuring the database, you must run the statements with the `ON CLUSTER` clause. In the following example, replace `CLUSTER_NAME_HERE` with your cluster's name: + +```sql +CREATE DATABASE gitlab_clickhouse_main_production ON CLUSTER CLUSTER_NAME_HERE ENGINE = Replicated('/clickhouse/databases/{cluster}/gitlab_clickhouse_main_production', '{shard}', '{replica}'); +CREATE USER gitlab IDENTIFIED WITH sha256_password BY 'PASSWORD_HERE' ON CLUSTER CLUSTER_NAME_HERE; +CREATE ROLE gitlab_app ON CLUSTER CLUSTER_NAME_HERE; +GRANT SELECT, INSERT, ALTER, CREATE, UPDATE, DROP, TRUNCATE, OPTIMIZE ON gitlab_clickhouse_main_production.* TO gitlab_app ON CLUSTER CLUSTER_NAME_HERE; +GRANT SELECT ON information_schema.* TO gitlab_app ON CLUSTER CLUSTER_NAME_HERE; +GRANT gitlab_app TO gitlab ON CLUSTER CLUSTER_NAME_HERE; +``` + +##### Load balancer considerations + +The GitLab application communicates with the ClickHouse cluster through the HTTP/HTTPS interface. Consider using an HTTP proxy for load balancing requests to the ClickHouse cluster, such as [chproxy](https://www.chproxy.org/). 
+ ### Configure the GitLab connection to ClickHouse -{{< tabs >}} +::Tabs -{{< tab title="Linux package" >}} +:::TabTitle Linux package To provide GitLab with ClickHouse credentials: @@ -100,9 +149,7 @@ To provide GitLab with ClickHouse credentials: sudo gitlab-ctl reconfigure ``` -{{< /tab >}} - -{{< tab title="Helm chart (Kubernetes)" >}} +:::TabTitle Helm chart (Kubernetes) 1. Save the ClickHouse password as a Kubernetes Secret: @@ -137,86 +184,89 @@ To provide GitLab with ClickHouse credentials: helm upgrade -f gitlab_values.yaml gitlab gitlab/gitlab ``` -{{< /tab >}} +::EndTabs -{{< /tabs >}} +#### Verify the connection To verify that your connection is set up successfully: -1. Sign in to [Rails console](../administration/operations/rails_console.md#starting-a-rails-console-session) +1. Sign in to the [Rails console](../administration/operations/rails_console.md#starting-a-rails-console-session). 1. Execute the following command: ```ruby ClickHouse::Client.select('SELECT 1', :main) ``` - If successful, the command returns `[{"1"=>1}]` + If successful, the command returns `[{"1"=>1}]`. ### Run ClickHouse migrations -{{< tabs >}} +::Tabs -{{< tab title="Linux package" >}} +:::TabTitle Linux package -To create the required database objects execute: +To create the required database objects, execute: ```shell sudo gitlab-rake gitlab:clickhouse:migrate ``` -{{< /tab >}} - -{{< tab title="Helm chart (Kubernetes)" >}} +:::TabTitle Helm chart (Kubernetes) -Migrations are executed automatically using the [GitLab-Migrations chart](https://docs.gitlab.com/charts/charts/gitlab/migrations/#clickhouse-optional). +Migrations are executed automatically using the GitLab-Migrations chart. 
-Alternatively, you can run migrations by executing the following command in the [Toolbox pod](https://docs.gitlab.com/charts/charts/gitlab/toolbox/): +Alternatively, you can run migrations by executing the following command in the Toolbox pod: ```shell gitlab-rake gitlab:clickhouse:migrate ``` -{{< /tab >}} - -{{< /tabs >}} +::EndTabs ### Enable ClickHouse for Analytics -Now that your GitLab instance is connected to ClickHouse, you can enable features to use ClickHouse by [enabling ClickHouse for Analytics](../administration/analytics.md). +After your GitLab instance is connected to ClickHouse, you can enable features that use ClickHouse: -## `Replicated` database engine +1. On the left sidebar, at the bottom, select **Admin**. +1. Select **Settings > General**. +1. Expand **ClickHouse**. +1. Select **Enable ClickHouse for Analytics**. +1. Select **Save changes**. -{{< history >}} +### Disable ClickHouse for Analytics -- [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/560927) as an experiment in GitLab 18.5. +To disable ClickHouse for Analytics: -{{< /history >}} +1. On the left sidebar, at the bottom, select **Admin**. +1. Select **Settings > General**. +1. Expand **ClickHouse**. +1. Clear the **Enable ClickHouse for Analytics** checkbox. +1. Select **Save changes**. -For a multi-node, high-availability setup, GitLab supports the `Replicated` table engine in ClickHouse. +## Upgrade ClickHouse -Prerequisites: +For information about upgrading ClickHouse, see the [ClickHouse documentation on updates](https://clickhouse.com/docs/manage/updates). -- A cluster must be defined in the `remote_servers` [configuration section](https://clickhouse.com/docs/architecture/cluster-deployment#configure-clickhouse-servers). 
-- The following [macros](https://clickhouse.com/docs/architecture/cluster-deployment#macros-config-explanation) must be configured: - - `cluster` - - `shard` - - `replica` +## ClickHouse Rake tasks + +GitLab provides several Rake tasks for managing your ClickHouse database: -When configuring the database, you must run the statements with the `ON CLUSTER` clause. -In the following example, replace `CLUSTER_NAME_HERE` with your cluster's name: +| Task | Description | +|------|-------------| +| `gitlab:clickhouse:migrate` | Migrate the databases | +| `gitlab:clickhouse:drop` | Drop the databases | +| `gitlab:clickhouse:create` | Create the databases | +| `gitlab:clickhouse:setup` | Create and migrate the databases | +| `gitlab:clickhouse:schema:dump` | Dump the database schema | +| `gitlab:clickhouse:schema:load` | Load the database schema | - ```sql - CREATE DATABASE gitlab_clickhouse_main_production ON CLUSTER CLUSTER_NAME_HERE ENGINE = Replicated('/clickhouse/databases/{cluster}/gitlab_clickhouse_main_production', '{shard}', '{replica}') - CREATE USER gitlab IDENTIFIED WITH sha256_password BY 'PASSWORD_HERE' ON CLUSTER CLUSTER_NAME_HERE; - CREATE ROLE gitlab_app ON CLUSTER CLUSTER_NAME_HERE; - GRANT SELECT, INSERT, ALTER, CREATE, UPDATE, DROP, TRUNCATE, OPTIMIZE ON gitlab_clickhouse_main_production.* TO gitlab_app ON CLUSTER CLUSTER_NAME_HERE; - GRANT SELECT ON information_schema.* TO gitlab_app ON CLUSTER CLUSTER_NAME_HERE; - GRANT gitlab_app TO gitlab ON CLUSTER CLUSTER_NAME_HERE; - ``` +## Performance tuning -### Load balancer considerations +For information about ClickHouse architecture and performance tuning, see the [ClickHouse documentation on architecture](https://clickhouse.com/docs/architecture/introduction). -The GitLab application communicates with the ClickHouse cluster through the HTTP/HTTPS interface. Consider using an HTTP proxy for load balancing requests to the ClickHouse cluster, such as [`chproxy`](https://www.chproxy.org/). 
+## Disaster recovery + +For information about backup and disaster recovery strategies for ClickHouse, see the [ClickHouse documentation on backup](https://clickhouse.com/docs/operations/backup/overview). ### Backup and Restore @@ -473,7 +523,7 @@ Without running all migrations, the ClickHouse integration will not work. To work around this issue and run the migrations: -1. Sign in to [Rails console](../administration/operations/rails_console.md#starting-a-rails-console-session) +1. Sign in to the [Rails console](../administration/operations/rails_console.md#starting-a-rails-console-session). 1. Execute the following command: ```ruby @@ -487,3 +537,4 @@ To work around this issue and run the migrations: ``` This time the database migration should successfully finish. +``` -- GitLab From 09cf0bcbe9a37c0f62a7255a4365b74fb94a2a67 Mon Sep 17 00:00:00 2001 From: Nicholas Klick Date: Thu, 4 Dec 2025 08:43:27 -0500 Subject: [PATCH 2/7] Fix stage --- doc/integration/clickhouse.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index 0c4f916e002fdd..acb77889049efb 100644 --- a/doc/integration/clickhouse.md +++ b/doc/integration/clickhouse.md @@ -1,7 +1,7 @@ ```markdown --- -stage: none -group: unassigned +stage: Analytics +group: Platform Insights info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments gitlab_dedicated: yes --- -- GitLab From 586eca44bf7f29c260f0c6c395a737ee46a873cf Mon Sep 17 00:00:00 2001 From: Nicholas Klick Date: Thu, 4 Dec 2025 08:51:58 -0500 Subject: [PATCH 3/7] Fix linting errors --- doc/integration/clickhouse.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index acb77889049efb..475777612e07a4 100644 --- a/doc/integration/clickhouse.md +++ 
b/doc/integration/clickhouse.md @@ -3,19 +3,18 @@ stage: Analytics group: Platform Insights info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://handbook.gitlab.com/handbook/product/ux/technical-writing/#assignments -gitlab_dedicated: yes +title: ClickHouse --- # ClickHouse **(FREE ALL BETA)** DETAILS: -**Tier:** Free, Premium, Ultimate -**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated +**Tier:** Free, Premium, Ultimate
+**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated
**Status:** Beta on GitLab Self-Managed and GitLab Dedicated ClickHouse is a secondary data store for GitLab that enables advanced analytical features such as GitLab Duo, SDLC trends, and CI Analytics. Only specific data is stored in ClickHouse for these analytics purposes. -NOTE: For more information on plans for ClickHouse support for GitLab Self-Managed, see [this epic](#). For more information on plans for ClickHouse support for GitLab Self-Managed, see [epic 51](https://gitlab.com/groups/gitlab-org/architecture/gitlab-data-analytics/-/epics/51). @@ -44,7 +43,6 @@ Alternatively, you can [bring your own ClickHouse](https://clickhouse.com/docs/e | 18.1.0 | 23.x, 24.x, 25.x | | | 18.5.0 | 23.x, 24.x, 25.x | Experimental support for Replicated database engine. | -NOTE: ClickHouse Cloud is supported. Compatibility is generally ensured with the latest major GitLab release and newer versions. ## ClickHouse glossary @@ -98,7 +96,7 @@ To create the necessary user and database objects: #### Configure with Replicated database engine **(EXPERIMENT)** DETAILS: -**Status:** Experiment +**Status:** Experiment
**Introduced:** GitLab 18.5 For a multi-node, high-availability setup, GitLab supports the Replicated table engine in ClickHouse. -- GitLab From 88becec07f7213b0cba07630d75edd4c3000d87f Mon Sep 17 00:00:00 2001 From: Nicholas Klick Date: Thu, 4 Dec 2025 08:55:41 -0500 Subject: [PATCH 4/7] More linting errors --- doc/integration/clickhouse.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index 475777612e07a4..74561a740c0942 100644 --- a/doc/integration/clickhouse.md +++ b/doc/integration/clickhouse.md @@ -9,8 +9,8 @@ title: ClickHouse # ClickHouse **(FREE ALL BETA)** DETAILS: -**Tier:** Free, Premium, Ultimate
-**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated
+**Tier:** Free, Premium, Ultimate +**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated **Status:** Beta on GitLab Self-Managed and GitLab Dedicated ClickHouse is a secondary data store for GitLab that enables advanced analytical features such as GitLab Duo, SDLC trends, and CI Analytics. Only specific data is stored in ClickHouse for these analytics purposes. @@ -96,7 +96,7 @@ To create the necessary user and database objects: #### Configure with Replicated database engine **(EXPERIMENT)** DETAILS: -**Status:** Experiment
+**Status:** Experiment **Introduced:** GitLab 18.5 For a multi-node, high-availability setup, GitLab supports the Replicated table engine in ClickHouse. -- GitLab From 1f1f41da24cab59a2255c9d1f29b0939be5c2ef1 Mon Sep 17 00:00:00 2001 From: Nnamdi Date: Wed, 10 Dec 2025 10:16:23 -0500 Subject: [PATCH 5/7] Rebased and fix markdown lint --- doc/integration/clickhouse.md | 56 ++++++++++++++++------------------- 1 file changed, 25 insertions(+), 31 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index 74561a740c0942..efc4b95df3f310 100644 --- a/doc/integration/clickhouse.md +++ b/doc/integration/clickhouse.md @@ -1,4 +1,3 @@ -```markdown --- stage: Analytics group: Platform Insights @@ -6,25 +5,18 @@ info: To determine the technical writer assigned to the Stage/Group associated w title: ClickHouse --- -# ClickHouse **(FREE ALL BETA)** +{{< details >}} -DETAILS: -**Tier:** Free, Premium, Ultimate -**Offering:** GitLab.com, GitLab Self-Managed, GitLab Dedicated -**Status:** Beta on GitLab Self-Managed and GitLab Dedicated +- Tier: Free, Premium, Ultimate +- Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated +- Status: Beta on GitLab Self-Managed and GitLab Dedicated ClickHouse is a secondary data store for GitLab that enables advanced analytical features such as GitLab Duo, SDLC trends, and CI Analytics. Only specific data is stored in ClickHouse for these analytics purposes. -For more information on plans for ClickHouse support for GitLab Self-Managed, see [this epic](#). - +{{< alert type="warning" >}} For more information on plans for ClickHouse support for GitLab Self-Managed, see [epic 51](https://gitlab.com/groups/gitlab-org/architecture/gitlab-data-analytics/-/epics/51). -{{< /alert >}} - -{{< alert type="note" >}} - For more information about ClickHouse support for GitLab Dedicated, see [ClickHouse for GitLab Dedicated](../subscriptions/gitlab_dedicated/_index.md#clickhouse-cloud). 
- {{< /alert >}} [ClickHouse](https://clickhouse.com) is an open-source column-oriented database management system. It can efficiently filter, aggregate, and query across large data sets. @@ -56,9 +48,9 @@ Understanding these ClickHouse concepts will help you configure and maintain you - **Shard**: A subset of data. ClickHouse always has at least one shard. Sharding data across multiple servers divides the load when you exceed the capacity of a single server. - **TTL (Time To Live)**: A feature that automatically moves, deletes, or rolls up columns or rows after a specified time period, enabling efficient storage management. -## System requirements +## Requirements -For detailed system requirements and sizing recommendations, see [issue #548450](#). +For detailed system requirements and sizing recommendations, see [issue 548450](https://gitlab.com/gitlab-org/gitlab/-/issues/548450). ## Set up ClickHouse @@ -95,9 +87,10 @@ To create the necessary user and database objects: #### Configure with Replicated database engine **(EXPERIMENT)** -DETAILS: +{{< alert type="note" >}} **Status:** Experiment **Introduced:** GitLab 18.5 +{{< /alert >}} For a multi-node, high-availability setup, GitLab supports the Replicated table engine in ClickHouse. @@ -313,13 +306,13 @@ To enable this: 1. Configure the `prometheus` section in your `config.xml` to expose metrics on a dedicated port (default is `9363`). ```xml - <prometheus> - <endpoint>/metrics</endpoint> - <port>9363</port> - <metrics>true</metrics> - <events>true</events> - <asynchronous_metrics>true</asynchronous_metrics> - </prometheus> + <prometheus> + <endpoint>/metrics</endpoint> + <port>9363</port> + <metrics>true</metrics> + <events>true</events> + <asynchronous_metrics>true</asynchronous_metrics> + </prometheus> ``` 1. Configure Prometheus or a similar compatible server to scrape `http://:9363/metrics`. @@ -370,13 +363,13 @@ For self-managed instances, ensure the `query_log` configuration parameter is en 1. Verify that the `query_log` section exists in your `config.xml` or `users.xml`: ```xml - <query_log> - <database>system</database> - <table>query_log</table>
- <partition_by>toYYYYMM(event_date)</partition_by> - <flush_interval_milliseconds>7500</flush_interval_milliseconds> - <ttl>event_date + INTERVAL 30 DAY</ttl> - </query_log>
+ <query_log> + <database>system</database> + <table>query_log</table>
+ <partition_by>toYYYYMM(event_date)</partition_by> + <flush_interval_milliseconds>7500</flush_interval_milliseconds> + <ttl>event_date + INTERVAL 30 DAY</ttl> + </query_log>
``` 1. Once enabled, all executed queries are recorded in the `system.query_log` table, allowing for audit trail. @@ -511,6 +504,7 @@ HA setup becomes cost effective only at 10k users or above. ### Database schema migrations on GitLab 18.0.0 and earlier +{{< alert type="warning" >}} On GitLab 18.0.0 and earlier, running database schema migrations for ClickHouse may fail for ClickHouse 24.x and 25.x with the following error message: ```plaintext @@ -518,6 +512,7 @@ Code: 344. DB::Exception: Projection is fully supported in ReplacingMergeTree wi ``` Without running all migrations, the ClickHouse integration will not work. +{{< /alert >}} To work around this issue and run the migrations: @@ -535,4 +530,3 @@ To work around this issue and run the migrations: ``` This time the database migration should successfully finish. -``` -- GitLab From 0770f1529d4a625a71f9198dbc30526b1ffe3cbb Mon Sep 17 00:00:00 2001 From: Nnamdi Date: Wed, 10 Dec 2025 16:09:26 -0500 Subject: [PATCH 6/7] Docs: Update ClickHouse integration documentation Separate setup instructions for Cloud vs BYOC deployment types. Expand Rake tasks section with examples and detailed descriptions. Expand Performance tuning section with resource allocation guidance. Add Operations section for migration status and retry procedures. Update Upgrade section with Cloud auto-upgrade vs BYOC manual procedures. Add disclaimer to Performance tuning referencing System requirements. 
--- doc/integration/clickhouse.md | 539 ++++++++++++++++++++++++++++++++-- 1 file changed, 507 insertions(+), 32 deletions(-) diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md index efc4b95df3f310..1b46e8a5c2e78d 100644 --- a/doc/integration/clickhouse.md +++ b/doc/integration/clickhouse.md @@ -11,6 +11,8 @@ title: ClickHouse - Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated - Status: Beta on GitLab Self-Managed and GitLab Dedicated +{{< /details >}} + ClickHouse is a secondary data store for GitLab that enables advanced analytical features such as GitLab Duo, SDLC trends, and CI Analytics. Only specific data is stored in ClickHouse for these analytics purposes. {{< alert type="warning" >}} @@ -54,27 +56,222 @@ For detailed system requirements and sizing recommendations, see [issue 548450]( ## Set up ClickHouse -You can connect ClickHouse to GitLab either: +Choose your deployment type based on your operational requirements: + +- **[ClickHouse Cloud](#set-up-clickhouse-cloud)** (Recommended): Fully managed service with automatic upgrades, backups, and scaling. +- **[Self-managed ClickHouse (BYOC)](#set-up-self-managed-clickhouse-byoc)**: Complete control over your infrastructure and configuration. + +### Set up ClickHouse Cloud + +Prerequisites: + +- ClickHouse Cloud account +- Network connectivity from your GitLab instance to ClickHouse Cloud +- Administrator access to your GitLab instance + +To set up ClickHouse Cloud with GitLab: + +1. [Create and configure your ClickHouse Cloud service](#create-clickhouse-cloud-service). +1. [Create the GitLab database and user](#create-database-and-user-cloud). +1. [Configure the GitLab connection](#configure-gitlab-connection-cloud). +1. [Verify the connection](#verify-connection-cloud). +1. [Run ClickHouse migrations](#run-migrations-cloud). +1. [Enable ClickHouse for Analytics](#enable-clickhouse-for-analytics). + +#### Create ClickHouse Cloud service + +1. 
Sign in to [ClickHouse Cloud](https://clickhouse.cloud). +1. Select **New Service**. +1. Choose your service tier: + - **Development**: For testing and development environments. + - **Production**: For production workloads with high availability. +1. Select your cloud provider and region. Choose a region close to your GitLab instance for optimal performance. +1. Configure your service name and settings. +1. Select **Create Service**. +1. Once provisioned, note your connection details from the service dashboard: + - Host + - Port (usually `9440` for secure connections) + - Username + - Password + +{{< alert type="note" >}} +**Auto-upgrade capability**: ClickHouse Cloud automatically handles version upgrades and security patches. Enterprise plan customers can schedule upgrade windows to control when upgrades occur and avoid unexpected service interruptions during business hours. +{{< /alert >}} + +#### Create database and user (Cloud) + +1. In the ClickHouse Cloud console, select your service. +1. Select **SQL Console**. +1. Generate a secure password for the GitLab user and save it. +1. Execute the following SQL commands, replacing `PASSWORD_HERE` with your generated password: + + ```sql + CREATE DATABASE gitlab_clickhouse_main_production; + CREATE USER gitlab IDENTIFIED WITH sha256_password BY 'PASSWORD_HERE'; + CREATE ROLE gitlab_app; + GRANT SELECT, INSERT, ALTER, CREATE, UPDATE, DROP, TRUNCATE, OPTIMIZE ON gitlab_clickhouse_main_production.* TO gitlab_app; + GRANT SELECT ON information_schema.* TO gitlab_app; + GRANT gitlab_app TO gitlab; + ``` + +#### Configure GitLab connection (Cloud) + +::Tabs + +:::TabTitle Linux package + +To provide GitLab with ClickHouse credentials: + +1. 
Edit `/etc/gitlab/gitlab.rb`: + + ```ruby + gitlab_rails['clickhouse_databases']['main']['database'] = 'gitlab_clickhouse_main_production' + gitlab_rails['clickhouse_databases']['main']['url'] = 'https://your-service.clickhouse.cloud:9440' + gitlab_rails['clickhouse_databases']['main']['username'] = 'gitlab' + gitlab_rails['clickhouse_databases']['main']['password'] = 'PASSWORD_HERE' # replace with the actual password + ``` + +1. Save the file and reconfigure GitLab: + + ```shell + sudo gitlab-ctl reconfigure + ``` + +:::TabTitle Helm chart (Kubernetes) + +1. Save the ClickHouse password as a Kubernetes Secret: + + ```shell + kubectl create secret generic gitlab-clickhouse-password --from-literal="main_password=PASSWORD_HERE" + ``` + +1. Export the Helm values: + + ```shell + helm get values gitlab > gitlab_values.yaml + ``` + +1. Edit `gitlab_values.yaml`: + + ```yaml + global: + clickhouse: + enabled: true + main: + username: gitlab + password: + secret: gitlab-clickhouse-password + key: main_password + database: gitlab_clickhouse_main_production + url: 'https://your-service.clickhouse.cloud:9440' + ``` + +1. Save the file and apply the new values: + + ```shell + helm upgrade -f gitlab_values.yaml gitlab gitlab/gitlab + ``` + +::EndTabs + +#### Verify connection (Cloud) + +To verify that your connection is set up successfully: + +1. Sign in to the [Rails console](../administration/operations/rails_console.md#starting-a-rails-console-session). +1. Execute the following command: + + ```ruby + ClickHouse::Client.select('SELECT 1', :main) + ``` + + If successful, the command returns `[{"1"=>1}]`. + +If the connection fails, verify: + +- ClickHouse Cloud service is running and accessible. +- Network connectivity from GitLab to ClickHouse Cloud. Check firewalls and security groups. +- Connection URL includes the correct host and port. +- Credentials are correct. 
+ +#### Run migrations (Cloud) + +::Tabs + +:::TabTitle Linux package + +To create the required database objects, execute: -- **Recommended**: With [ClickHouse Cloud](https://clickhouse.cloud/). -- By bringing your own ClickHouse installation. For more information, see [ClickHouse recommendations for GitLab Self-Managed](https://clickhouse.com/docs/en/install#recommendations-for-self-managed-clickhouse). +```shell +sudo gitlab-rake gitlab:clickhouse:migrate +``` -When you run ClickHouse on a hosted server, various factors impact resource consumption, including the number of builds that run on your instance each month, selected hardware, and data center choice. Regardless, the cost should not be significant. +:::TabTitle Helm chart (Kubernetes) + +Migrations are executed automatically using the GitLab-Migrations chart. -To set up ClickHouse with GitLab: +Alternatively, you can run migrations by executing the following command in the Toolbox pod: -1. [Run ClickHouse and configure the database](#run-and-configure-clickhouse). -1. [Configure the GitLab connection to ClickHouse](#configure-the-gitlab-connection-to-clickhouse). -1. [Run ClickHouse migrations](#run-clickhouse-migrations). +```shell +gitlab-rake gitlab:clickhouse:migrate +``` + +::EndTabs + +### Set up self-managed ClickHouse (BYOC) + +Prerequisites: + +- ClickHouse instance installed and running +- Compatible ClickHouse version. See [Supported ClickHouse versions](#supported-clickhouse-versions). +- Network connectivity from your GitLab instance to ClickHouse +- Administrator access to both ClickHouse and GitLab + +To set up self-managed ClickHouse with GitLab: + +1. [Verify ClickHouse installation](#verify-clickhouse-installation). +1. [Create the GitLab database and user](#create-database-and-user-byoc). +1. Optional. [Configure high availability](#configure-high-availability) (for HA deployments). +1. Optional. [Configure load balancer](#configure-load-balancer) (for HA deployments). +1. 
[Configure the GitLab connection](#configure-gitlab-connection-byoc). +1. [Verify the connection](#verify-connection-byoc). +1. [Run ClickHouse migrations](#run-migrations-byoc). 1. [Enable ClickHouse for Analytics](#enable-clickhouse-for-analytics). -### Run and configure ClickHouse +{{< alert type="warning" >}} +**Manual upgrades required**: For self-managed ClickHouse, you are responsible for planning and executing version upgrades, security patches, and backups. See [Upgrade ClickHouse](#upgrade-clickhouse) for guidance. +{{< /alert >}} + +#### Verify ClickHouse installation + +Before configuring the database, verify ClickHouse is installed and accessible: + +1. Check ClickHouse is running: + + ```shell + clickhouse-client --query "SELECT version()" + ``` + + Expected output: Version number (for example, `24.3.1.12`) + +1. Verify you can connect with credentials: + + ```shell + clickhouse-client --host your-clickhouse-host --port 9000 --user default --password 'your-password' + ``` + +If ClickHouse is not installed, see: + +- [ClickHouse official installation guide](https://clickhouse.com/docs/en/install) +- [ClickHouse recommendations for GitLab Self-Managed](https://clickhouse.com/docs/guides/sizing-and-hardware-recommendations) + +#### Create database and user (BYOC) To create the necessary user and database objects: 1. Generate a secure password and save it. -1. Sign in to the ClickHouse SQL console. -1. Execute the following command, replacing `PASSWORD_HERE` with the generated password: +1. Sign in to the ClickHouse SQL console or use `clickhouse-client`. +1. 
Execute the following commands, replacing `PASSWORD_HERE` with the generated password:
    ```sql
    CREATE DATABASE gitlab_clickhouse_main_production;
@@ -85,7 +282,7 @@ To create the necessary user and database objects:
    GRANT gitlab_app TO gitlab;
    ```
-#### Configure with Replicated database engine **(EXPERIMENT)**
+#### Configure high availability
 {{< alert type="note" >}}
 **Status:** Experiment
@@ -96,13 +293,14 @@ For a multi-node, high-availability setup, GitLab supports the Replicated table
 Prerequisites:
-- A cluster must be defined in the `remote_servers` configuration section.
-- The following macros must be configured:
+- ClickHouse cluster with multiple nodes (minimum 3 nodes recommended)
+- A cluster must be defined in the `remote_servers` configuration section
+- The following macros must be configured in your ClickHouse configuration:
   - `cluster`
   - `shard`
   - `replica`
-When configuring the database, you must run the statements with the `ON CLUSTER` clause. In the following example, replace `CLUSTER_NAME_HERE` with your cluster's name:
+When configuring the database for HA, you must run the statements with the `ON CLUSTER` clause. In the following example, replace `CLUSTER_NAME_HERE` with your cluster's name:
 ```sql
 CREATE DATABASE gitlab_clickhouse_main_production ON CLUSTER CLUSTER_NAME_HERE ENGINE = Replicated('/clickhouse/databases/{cluster}/gitlab_clickhouse_main_production', '{shard}', '{replica}');
@@ -113,11 +311,48 @@ GRANT SELECT ON information_schema.* TO gitlab_app ON CLUSTER CLUSTER_NAME_HERE;
 GRANT gitlab_app TO gitlab ON CLUSTER CLUSTER_NAME_HERE;
 ```
-##### Load balancer considerations
+For more information, see [ClickHouse Replicated database engine documentation](https://clickhouse.com/docs/en/engines/database-engines/replicated).
+
+#### Configure load balancer
+
+For HA deployments, configure a load balancer to distribute requests across ClickHouse nodes.
+
+The GitLab application communicates with the ClickHouse cluster through the HTTP/HTTPS interface. You should use an HTTP proxy or load balancer to distribute requests across cluster nodes.
+
+Recommended load balancer options:
+
+- [chproxy](https://www.chproxy.org/) - ClickHouse-specific HTTP proxy with built-in caching and routing
+- HAProxy - General-purpose TCP/HTTP load balancer
+- NGINX - Web server with load balancing capabilities
+- Cloud provider load balancers (AWS Application Load Balancer, GCP Load Balancer, Azure Load Balancer)
+
+Basic chproxy configuration example:
-The GitLab application communicates with the ClickHouse cluster through the HTTP/HTTPS interface. Consider using an HTTP proxy for load balancing requests to the ClickHouse cluster, such as [chproxy](https://www.chproxy.org/).
+```yaml
+server:
+  http:
+    listen_addr: ":8080"
-### Configure the GitLab connection to ClickHouse
+clusters:
+  - name: "clickhouse_cluster"
+    nodes: [
+      "http://ch-node1:8123",
+      "http://ch-node2:8123",
+      "http://ch-node3:8123"
+    ]
+
+users:
+  - name: "gitlab"
+    password: "your_secure_password"
+    to_cluster: "clickhouse_cluster"
+    to_user: "gitlab"
+```
+
+When using a load balancer, configure GitLab to connect to the load balancer URL instead of individual ClickHouse nodes.
+
+For more information, see [chproxy documentation](https://www.chproxy.org/).
+
+#### Configure GitLab connection (BYOC)
 ::Tabs
@@ -129,7 +364,7 @@ To provide GitLab with ClickHouse credentials:
    ```ruby
    gitlab_rails['clickhouse_databases']['main']['database'] = 'gitlab_clickhouse_main_production'
-   gitlab_rails['clickhouse_databases']['main']['url'] = 'https://example.com/path'
+   gitlab_rails['clickhouse_databases']['main']['url'] = 'https://your-clickhouse-host:8443' # Use load balancer URL for HA deployments
    gitlab_rails['clickhouse_databases']['main']['username'] = 'gitlab'
    gitlab_rails['clickhouse_databases']['main']['password'] = 'PASSWORD_HERE' # replace with the actual password
    ```
@@ -161,12 +396,12 @@ To provide GitLab with ClickHouse credentials:
      clickhouse:
        enabled: true
        main:
-         username: default
+         username: gitlab
          password:
            secret: gitlab-clickhouse-password
            key: main_password
          database: gitlab_clickhouse_main_production
-         url: 'http://example.com'
+         url: 'https://your-clickhouse-host:8443' # Use load balancer URL for HA deployments
    ```
 1. Save the file and apply the new values:
@@ -177,7 +412,11 @@ To provide GitLab with ClickHouse credentials:
 ::EndTabs
-#### Verify the connection
+{{< alert type="note" >}}
+**TLS/SSL configuration**: For production deployments, configure TLS/SSL on your ClickHouse instance and use `https://` URLs. See [ClickHouse TLS/SSL configuration](https://clickhouse.com/docs/guides/sre/configuring-ssl) for details.
+{{< /alert >}}
+
+#### Verify connection (BYOC)
 To verify that your connection is set up successfully:
@@ -190,7 +429,15 @@ To verify that your connection is set up successfully:
 If successful, the command returns `[{"1"=>1}]`.
-### Run ClickHouse migrations
+If the connection fails, verify:
+
+- ClickHouse service is running on all nodes.
+- Network connectivity from GitLab to ClickHouse. Check firewalls and security groups.
+- Connection URL is correct (host, port, protocol).
+- Credentials are correct.
+- For HA setups: Load balancer is properly configured and routing requests.
+
+#### Run migrations (BYOC)
 ::Tabs
@@ -218,6 +465,14 @@ gitlab-rake gitlab:clickhouse:migrate
 After your GitLab instance is connected to ClickHouse, you can enable features that use ClickHouse:
+Prerequisites:
+
+- You must have administrator access to the instance.
+- ClickHouse connection is configured and verified.
+- Migrations have been successfully completed.
+
+To enable ClickHouse for Analytics:
+
 1. On the left sidebar, at the bottom, select **Admin**.
 1. Select **Settings > General**.
 1. Expand **ClickHouse**.
@@ -228,33 +483,253 @@ After your GitLab instance is connected to ClickHouse, you can enable features t
 To disable ClickHouse for Analytics:
+Prerequisites:
+
+- You must have administrator access to the instance.
+
+To disable:
+
 1. On the left sidebar, at the bottom, select **Admin**.
 1. Select **Settings > General**.
 1. Expand **ClickHouse**.
 1. Clear the **Enable ClickHouse for Analytics** checkbox.
 1. Select **Save changes**.
+{{< alert type="note" >}}
+Disabling ClickHouse for Analytics stops GitLab from querying ClickHouse but does not delete any data from your ClickHouse instance. Analytics features that rely on ClickHouse will fall back to alternative data sources or become unavailable.
+{{< /alert >}}
+
 ## Upgrade ClickHouse
-For information about upgrading ClickHouse, see the [ClickHouse documentation on updates](https://clickhouse.com/docs/manage/updates).
+### ClickHouse Cloud
+
+ClickHouse Cloud automatically handles version upgrades and security patches. No manual intervention is required.
+
+**Upgrade behavior:**
+
+- **Development and Production plans**: Upgrades are applied automatically during scheduled maintenance windows.
+- **Enterprise plans**: You can schedule custom upgrade windows to control when upgrades occur.
+
+To view or schedule upgrade windows (Enterprise plans):
+
+1. Sign in to [ClickHouse Cloud](https://clickhouse.cloud).
+1. Select your service.
+1. Go to **Settings > Maintenance**.
+1. 
Configure your preferred maintenance window.
+
+{{< alert type="note" >}}
+ClickHouse Cloud notifies you in advance of upcoming upgrades. Review the [ClickHouse Cloud changelog](https://clickhouse.com/docs/cloud/changes) to stay informed about new features and changes.
+{{< /alert >}}
+
+### Self-managed ClickHouse (BYOC)
+
+For self-managed ClickHouse, you are responsible for planning and executing version upgrades.
+
+Prerequisites:
+
+- You must have administrator access to the ClickHouse instance.
+- Back up your data before upgrading. See [Disaster recovery](#disaster-recovery).
+
+Before upgrading:
+
+1. Review the [ClickHouse release notes](https://clickhouse.com/docs/category/release-notes) for breaking changes.
+1. Check [compatibility](#supported-clickhouse-versions) with your GitLab version.
+1. Test the upgrade in a non-production environment.
+1. Plan for potential downtime, or use a rolling upgrade strategy for HA clusters.
+
+To upgrade ClickHouse:
+
+1. For single-node deployments, follow the [ClickHouse upgrade documentation](https://clickhouse.com/docs/manage/updates).
+1. For HA cluster deployments, perform a rolling upgrade to minimize downtime:
+   - Upgrade one node at a time.
+   - Wait for the node to rejoin the cluster.
+   - Verify cluster health before proceeding to the next node.
+
+{{< alert type="warning" >}}
+Always ensure the ClickHouse version remains compatible with your GitLab version. See [Supported ClickHouse versions](#supported-clickhouse-versions) for the compatibility matrix. Incompatible versions may cause indexing to pause and features to fail.
+{{< /alert >}}
+
+For detailed upgrade procedures, see the [ClickHouse documentation on updates](https://clickhouse.com/docs/manage/updates).
+
+## Operations
+
+### Check migration status
+
+Prerequisites:
+
+- You must have administrator access to the instance.
+
+To check the status of ClickHouse migrations:
+
+1. On the left sidebar, at the bottom, select **Admin**.
+1. 
Select **Settings > General**.
+1. Expand **ClickHouse**.
+1. Review the **Migration status** section if available.
+
+Alternatively, check for pending migrations using the Rails console:
+
+```ruby
+# Sign in to Rails console
+# Run this to check migrations
+ClickHouse::MigrationSupport::Migrator.new(:main).pending_migrations
+```
+
+### Retry failed migrations
+
+If a ClickHouse migration fails:
+
+1. Check the logs for error details. ClickHouse-related errors are logged in the GitLab application logs.
+1. Address the underlying issue (for example, insufficient memory, connectivity problems).
+1. Retry the migration:
+
+   ```shell
+   # For installations that use the Linux package
+   sudo gitlab-rake gitlab:clickhouse:migrate
+
+   # For self-compiled installations
+   bundle exec rake gitlab:clickhouse:migrate RAILS_ENV=production
+   ```
+
+{{< alert type="note" >}}
+Migrations are designed to be idempotent and safe to retry. If a migration fails partway through, running it again will resume from where it left off or skip already-completed steps.
+{{< /alert >}}
 ## ClickHouse Rake tasks
-GitLab provides several Rake tasks for managing your ClickHouse database:
+GitLab provides several Rake tasks for managing your ClickHouse database.
+
+The following Rake tasks are available:
 | Task | Description |
 |------|-------------|
-| `gitlab:clickhouse:migrate` | Migrate the databases |
-| `gitlab:clickhouse:drop` | Drop the databases |
-| `gitlab:clickhouse:create` | Create the databases |
-| `gitlab:clickhouse:setup` | Create and migrate the databases |
-| `gitlab:clickhouse:schema:dump` | Dump the database schema |
-| `gitlab:clickhouse:schema:load` | Load the database schema |
+| [`sudo gitlab-rake gitlab:clickhouse:migrate`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Runs all pending ClickHouse migrations to create or update database schema.
|
+| [`sudo gitlab-rake gitlab:clickhouse:drop`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Drops all ClickHouse databases. Use with extreme caution as this deletes all data. |
+| [`sudo gitlab-rake gitlab:clickhouse:create`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Creates ClickHouse databases if they don't exist. |
+| [`sudo gitlab-rake gitlab:clickhouse:setup`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Creates databases and runs all migrations. Equivalent to running `create` and `migrate` tasks. |
+| [`sudo gitlab-rake gitlab:clickhouse:schema:dump`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Dumps the current database schema to a file for backup or version control. |
+| [`sudo gitlab-rake gitlab:clickhouse:schema:load`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/tasks/gitlab/click_house/migration.rake) | Loads the database schema from a dump file. |
+
+{{< alert type="note" >}}
+For self-compiled installations, use `bundle exec rake` instead of `sudo gitlab-rake` and add `RAILS_ENV=production` to the end of the command.
+{{< /alert >}}
+
+### Common task examples
+
+#### Verify ClickHouse connection and schema
+
+To verify your ClickHouse connection is working:
+
+```shell
+# For installations that use the Linux package
+sudo gitlab-rake gitlab:clickhouse:info
+
+# For self-compiled installations
+bundle exec rake gitlab:clickhouse:info RAILS_ENV=production
+```
+
+This task outputs debugging information about the ClickHouse connection and configuration.
+
+#### Re-run all migrations
+
+To run all pending migrations:
+
+```shell
+# For installations that use the Linux package
+sudo gitlab-rake gitlab:clickhouse:migrate
+
+# For self-compiled installations
+bundle exec rake gitlab:clickhouse:migrate RAILS_ENV=production
+```
+
+#### Reset the database
+
+{{< alert type="warning" >}}
+This deletes all data in your ClickHouse database. Use only in development or when troubleshooting.
+{{< /alert >}}
+
+To drop and recreate the database:
+
+```shell
+# For installations that use the Linux package
+sudo gitlab-rake gitlab:clickhouse:drop
+sudo gitlab-rake gitlab:clickhouse:setup
+
+# For self-compiled installations
+bundle exec rake gitlab:clickhouse:drop RAILS_ENV=production
+bundle exec rake gitlab:clickhouse:setup RAILS_ENV=production
+```
+
+### Environment variables
+
+You can use environment variables to control Rake task behavior:
+
+| Environment Variable | Data Type | Description |
+|---------------------|-----------|-------------|
+| `VERBOSE` | Boolean | Set to `true` to see detailed output during migrations. Example: `VERBOSE=true sudo gitlab-rake gitlab:clickhouse:migrate` |
 ## Performance tuning
+{{< alert type="note" >}}
+The following are general performance tuning guidelines. For specific resource sizing based on your user count, see [System requirements](#system-requirements).
+{{< /alert >}}
+
+### General recommendations
+
 For information about ClickHouse architecture and performance tuning, see the [ClickHouse documentation on architecture](https://clickhouse.com/docs/architecture/introduction).
+### Resource allocation
+
+For optimal performance, consider these resource allocation guidelines:
+
+**Memory:**
+
+- Allocate at least 8 GB of RAM for small deployments (< 5K users).
+- Allocate 16-32 GB for medium deployments (5K-25K users).
+- Allocate 64+ GB for large deployments (> 25K users).
+- ClickHouse uses available memory for caching and query processing.
+
+**CPU:**
+
+- Minimum 4 CPU cores for single-node deployments.
+- 8-16 CPU cores recommended for production workloads.
+- Scale horizontally (add nodes) rather than vertically for large instances.
+
+**Storage:**
+
+- Use SSD storage for best performance.
+- NVMe SSDs recommended for high-throughput workloads.
+- See [System requirements](#system-requirements) for sizing guidance.
+
+### Query optimization
+
+ClickHouse is optimized for analytical queries. For best performance:
+
+- Use date/time range filters when possible (ClickHouse partitions by time).
+- Limit result sets with `LIMIT` clauses.
+- Use materialized views for frequently-run aggregations.
+- Monitor slow queries in the ClickHouse query log.
+
+### Index optimization
+
+ClickHouse uses MergeTree table engines with automatic index optimization. You do not need to manually create or maintain indexes.
+
+For large datasets:
+
+- Tables are automatically partitioned by date.
+- Background merges optimize storage and query performance.
+- TTL policies automatically expire old data.
+
+### Monitoring performance
+
+Monitor these key performance indicators:
+
+- Query execution time (should be < 1 second for most queries).
+- Memory usage (should stay below 80% of allocated memory).
+- Disk I/O (should not be consistently maxed out).
+- Network throughput (for HA clusters).
+
+For detailed performance monitoring, see [Monitoring](#monitoring).
+
 ## Disaster recovery
 For information about backup and disaster recovery strategies for ClickHouse, see the [ClickHouse documentation on backup](https://clickhouse.com/docs/operations/backup/overview).
@@ -504,7 +979,7 @@ HA setup becomes cost effective only at 10k users or above.
 ### Database schema migrations on GitLab 18.0.0 and earlier
-{{< alert type="warning" >}}
+{{< alert variant="warning" >}}
 On GitLab 18.0.0 and earlier, running database schema migrations for ClickHouse may fail for ClickHouse 24.x and 25.x with the following error message:
 ```plaintext
-- 
GitLab
From c4e39d8d7431b31cd64247ae34287508418410dd Mon Sep 17 00:00:00 2001
From: Nnamdi
Date: Thu, 11 Dec 2025 12:02:29 -0500
Subject: [PATCH 7/7] Removed duplicates and redundancy in the clickhouse docs
---
 doc/integration/clickhouse.md | 73 ++++-------------------------------
 1 file changed, 7 insertions(+), 66 deletions(-)
diff --git a/doc/integration/clickhouse.md b/doc/integration/clickhouse.md
index 1b46e8a5c2e78d..538407877cd43e 100644
--- a/doc/integration/clickhouse.md
+++ b/doc/integration/clickhouse.md
@@ -39,17 +39,6 @@ Alternatively, you can [bring your own ClickHouse](https://clickhouse.com/docs/e
 ClickHouse Cloud is supported. Compatibility is generally ensured with the latest major GitLab release and newer versions.
-## ClickHouse glossary
-
-Understanding these ClickHouse concepts will help you configure and maintain your installation:
-
-- **Cluster**: A collection of nodes (servers) that work together to store and process data.
-- **MergeTree**: A table engine designed for high data ingest rates and large data volumes. It provides columnar storage, custom partitioning, sparse primary indexes, and support for background data merges.
-- **Parts**: Physical files on disk that store portions of a table's data. This differs from partitions, which are logical divisions created using a partition key.
-- **Replica**: A copy of data stored in a ClickHouse database for redundancy and reliability. Used with the ReplicatedMergeTree table engine to keep multiple copies of data synchronized across different servers.
-- **Shard**: A subset of data. ClickHouse always has at least one shard.
Sharding data across multiple servers divides the load when you exceed the capacity of a single server.
-- **TTL (Time To Live)**: A feature that automatically moves, deletes, or rolls up columns or rows after a specified time period, enabling efficient storage management.
-
 ## Requirements
 For detailed system requirements and sizing recommendations, see [issue 548450](https://gitlab.com/gitlab-org/gitlab/-/issues/548450).
@@ -670,36 +659,11 @@ You can use environment variables to control Rake task behavior:
 ## Performance tuning
 {{< alert type="note" >}}
-The following are general performance tuning guidelines. For specific resource sizing based on your user count, see [System requirements](#system-requirements).
+For resource sizing and deployment recommendations based on your user count, see [System requirements](#system-requirements).
 {{< /alert >}}
-### General recommendations
-
 For information about ClickHouse architecture and performance tuning, see the [ClickHouse documentation on architecture](https://clickhouse.com/docs/architecture/introduction).
-### Resource allocation
-
-For optimal performance, consider these resource allocation guidelines:
-
-**Memory:**
-
-- Allocate at least 8 GB of RAM for small deployments (< 5K users).
-- Allocate 16-32 GB for medium deployments (5K-25K users).
-- Allocate 64+ GB for large deployments (> 25K users).
-- ClickHouse uses available memory for caching and query processing.
-
-**CPU:**
-
-- Minimum 4 CPU cores for single-node deployments.
-- 8-16 CPU cores recommended for production workloads.
-- Scale horizontally (add nodes) rather than vertically for large instances.
-
-**Storage:**
-
-- Use SSD storage for best performance.
-- NVMe SSDs recommended for high-throughput workloads.
-- See [System requirements](#system-requirements) for sizing guidance.
-
 ### Query optimization
 ClickHouse is optimized for analytical queries.
For best performance: - Use materialized views for frequently-run aggregations. - Monitor slow queries in the ClickHouse query log. -### Index optimization - -ClickHouse uses MergeTree table engines with automatic index optimization. You do not need to manually create or maintain indexes. - -For large datasets: - -- Tables are automatically partitioned by date. -- Background merges optimize storage and query performance. -- TTL policies automatically expire old data. - -### Monitoring performance - -Monitor these key performance indicators: - -- Query execution time (should be < 1 second for most queries). -- Memory usage (should stay below 80% of allocated memory). -- Disk I/O (should not be consistently maxed out). -- Network throughput (for HA clusters). - -For detailed performance monitoring, see [Monitoring](#monitoring). - ## Disaster recovery -For information about backup and disaster recovery strategies for ClickHouse, see the [ClickHouse documentation on backup](https://clickhouse.com/docs/operations/backup/overview). - ### Backup and Restore You should perform a full backup before upgrading the GitLab application. @@ -960,19 +901,19 @@ HA setup becomes cost effective only at 10k users or above. ## Glossary -- Cluster: A collection of nodes (servers) that work together to store and process data. -- MergeTree: [`MergeTree`](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree) is a table engine in ClickHouse designed for high data ingest rates and large data volumes. +- **Cluster**: A collection of nodes (servers) that work together to store and process data. +- **MergeTree**: [`MergeTree`](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree) is a table engine in ClickHouse designed for high data ingest rates and large data volumes. It is the core storage engine in ClickHouse, providing features such as columnar storage, custom partitioning, sparse primary indexes, and support for background data merges. 
-- Parts: A physical file on a disk that stores a portion of the table's data.
+- **Parts**: A physical file on a disk that stores a portion of the table's data.
   A part is different from a partition, which is a logical division of a table's data that is created using a partition key.
-- Replica: A copy of the data stored in a ClickHouse database.
+- **Replica**: A copy of the data stored in a ClickHouse database.
   You can have any number of replicas of the same data for redundancy and reliability. Replicas are used in conjunction with the ReplicatedMergeTree table engine, which enables ClickHouse to keep multiple copies of data in sync across different servers.
-- Shard: A subset of data.
+- **Shard**: A subset of data.
   ClickHouse always has at least one shard for your data. If you do not split the data across multiple servers, your data is stored in one shard. Sharding data across multiple servers can be used to divide the load if you exceed the capacity of a single server.
-- TTL: Time To Live (TTL) is a ClickHouse feature that automatically moves, deletes, or rolls up columns/rows after a certain time period.
+- **TTL (Time To Live)**: Time To Live (TTL) is a ClickHouse feature that automatically moves, deletes, or rolls up columns/rows after a certain time period.
   This allows you to manage storage more efficiently because you can delete, move, or archive the data that you no longer need to access frequently.
 ## Troubleshooting
-- 
GitLab