From 273e1966b24aa5ca782d4cc7642cf8aad8e741e7 Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Wed, 28 Dec 2022 15:05:46 -0500
Subject: [PATCH 1/6] Add initial architecture blueprint for Zoekt

This is an architecture blueprint to describe the Zoekt integration we are
working on in
https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105049 .
---
 .../search/code_search_with_zoekt.md          | 257 ++++++++++++++++++
 1 file changed, 257 insertions(+)
 create mode 100644 doc/architecture/blueprints/search/code_search_with_zoekt.md

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
new file mode 100644
index 00000000000000..7e04082b8ea6c3
--- /dev/null
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -0,0 +1,257 @@
+---
+status: proposed
+creation-date: "2022-12-28"
+authors: [ "@dgruzd", "@DylanGriffith" ]
+coach: "@DylanGriffith"
+approvers: [ "@joshlambert", "@changzhengliu" ]
+owning-stage: "~devops::enablement"
+participating-stages: []
+---
+
+# Use Zoekt for code search
+
+## Summary
+
+We will implement additional code search functionality in GitLab backed by
+[Zoekt](https://github.com/sourcegraph/zoekt), an open source search engine
+that is specifically designed for code search. GitLab will use Zoekt through
+an API and it will remain an implementation detail: the user interface in
+GitLab will not change much, except for some new features made available by
+Zoekt.
+
+This will be rolled out in phases to ensure that the system actually meets
+our scaling and cost expectations. It will run alongside code search backed
+by Elasticsearch until we can be sure it is a viable replacement. The first
+step will be making it available internally for `gitlab-org`, then expanding
+customer by customer based on customer interest.
+
+## Motivation
+
+GitLab code search functionality today is backed by Elasticsearch.
+Elasticsearch has proven useful for other types of search (issues, merge
+requests, comments and so on) but is by design not a good choice for code
+search, where users expect matches to be precise (i.e. no false positives)
+and flexible (e.g. support
+[substring matching](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+and
+[regexes](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)). We have
+[investigated our options](https://gitlab.com/groups/gitlab-org/-/epics/7404)
+and [Zoekt](https://github.com/sourcegraph/zoekt) is effectively the only
+well-maintained open source technology suited to code search. Based on our
+research we believe it is better to adopt a well-maintained open source
+search engine than to attempt to build our own, mostly because our research
+indicates that the fundamental architecture of Zoekt is what we would
+implement again if we tried to build something ourselves.
+
+Our
+[early benchmarking](https://gitlab.com/gitlab-org/gitlab/-/issues/370832#note_1183611955)
+suggests that Zoekt will be viable at our scale, but we feel strongly
+that investing in building a beta integration with Zoekt and rolling it out
+group by group on GitLab.com will provide better insights into scalability and
+cost than more accurate benchmarking efforts. It will also be relatively low
+risk as it will be rolled out internally first and later rolled out to
+customers that wish to participate in the trial.
+
+### Goals
+
+The main goals of this integration are to implement the following highly
+requested improvements to code search:
+
+1. [Exact match (substring match) code searches in Advanced Search](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+1. [Support regular expressions with Advanced Global Search](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)
+1. [Support multiple line matches in the same file](https://gitlab.com/gitlab-org/gitlab/-/issues/668)
+
+The initial phases of the rollout will be designed to catch and resolve scaling
+or infrastructure cost issues as early as possible, so that we can pivot early,
+before investing too much in this technology, if it is not suitable.
+
+### Non-Goals
+
+The following are not goals initially but could theoretically be built upon
+this solution:
+
+1. Improving security scanning features by having access to quickly perform
+   regex scans across many repositories
+1. Saving money on our search infrastructure - this may be possible with
+   further optimizations, but initial estimates suggest the cost is similar
+1. AI/ML features of search used to predict what users might be interested in
+   finding
+1. Code Intelligence and Navigation - code intelligence and navigation
+   features should likely be built on structured data rather than a trigram
+   index, but regex based searches (using Zoekt) may be a suitable fallback
+   for code which does not have structured metadata enabled, or for dynamic
+   languages where static analysis is not very accurate. Zoekt in particular
+   may not be well suited initially, despite existing symbol extraction using
+   ctags, because ctags symbols may not contain enough data for accurate
+   navigation and Zoekt doesn't understand dependencies, which would be
+   necessary for cross-project navigation.
+
+## Proposal
+
+An
+[initial implementation of a Zoekt integration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105049)
+was created to demonstrate the feasibility of using Zoekt as a drop-in
+replacement for Elasticsearch code searches. This blueprint will expand on the
+details needed to provide a minimum viable change, as well as the steps needed
+to scale this to a larger customer rollout on GitLab.com.
+
+## Design and implementation details
+
+### User Experience
+
+When a user performs an advanced search on a group or project that is part
+of the Zoekt rollout, we will present a toggle somewhere in the UI to switch
+to "precise search" (or some other UX TBD) which switches them from
+Elasticsearch to Zoekt. Early user feedback will help us assess the best way
+to present these choices to users, and ultimately we will want to remove the
+Elasticsearch option if we find Zoekt is a suitable long term option.
+
+### Indexing
+
+Similar to our Elasticsearch integration, GitLab will notify Zoekt every time
+there are updates to a repository. Zoekt, unlike Elasticsearch, is designed to
+clone and index Git repositories, so we will simply notify Zoekt of the URL of
+the repository that has changed and it will update its local copy of the Git
+repo and then its local index files. The Zoekt side of this logic will be
+implemented in a new server-side indexing endpoint we add to Zoekt, which is
+currently in
+[an open pull request](https://github.com/sourcegraph/zoekt/pull/496).
+While the details of this pull request are still being debated, we may choose
+to deploy a fork with the functionality we need. Our strong preference,
+however, is not to maintain a fork of Zoekt, and the maintainers have already
+expressed that they are open to this new functionality.
+
+The Rails side of the integration will be a Sidekiq worker that is scheduled
+every time there is an update to a repository, and it will simply call this
+`/index` endpoint in Zoekt. The worker will also need to generate a one-time
+token that allows Zoekt to clone a private repository.
+
+```mermaid
+sequenceDiagram
+    participant user as User
+    participant gitlab_git as GitLab Git
+    participant gitlab_sidekiq as GitLab Sidekiq
+    participant zoekt as Zoekt
+    user->>gitlab_git: git push git@gitlab.com:gitlab-org/gitlab.git
+    gitlab_git->>gitlab_sidekiq: ZoektIndexerWorker.perform_async(278964)
+    gitlab_sidekiq->>zoekt: POST /index {"RepoUrl":"https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git","RepoId":278964}
+    zoekt->>gitlab_git: git clone https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git
+```
+
+The Sidekiq worker can leverage de-duplication based on the `project_id`.
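+
+To make this concrete, below is a minimal sketch of what such a worker could
+look like. The class names, the `Gitlab::Zoekt::Client` interface, and the
+token helper are illustrative assumptions rather than the final
+implementation:
+
+```ruby
+# Hypothetical sketch of the indexing worker described above.
+class ZoektIndexerWorker
+  include ApplicationWorker
+
+  # De-duplicate queued jobs so many rapid pushes to one project
+  # result in a single (re)index.
+  deduplicate :until_executed
+  idempotent!
+
+  def perform(project_id)
+    project = Project.find_by_id(project_id)
+    return unless project
+
+    # One-time token that lets Zoekt clone this private repository.
+    token = Gitlab::Zoekt::IndexingToken.generate_for(project)
+
+    # Calls POST /index on the Zoekt indexing server.
+    Gitlab::Zoekt::Client.index(
+      repo_url: "https://zoekt:#{token}@gitlab.com/#{project.full_path}.git",
+      repo_id: project.id
+    )
+  end
+end
+```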
+
+Since Zoekt supports indexing multiple branches, we'll likely need to
+eventually allow a way for users to configure additional branches (beyond the
+default branch), and this will need to be sent to Zoekt. We will need to
+decide if these branch lists are sent every time we index the project or only
+when the configuration changes.
+
+There may be race conditions with multiple Zoekt processes indexing the same
+repo at the same time. For this reason we should implement a locking mechanism
+somewhere to ensure we are only indexing 1 project in 1 place at a time. We
+could make use of the same Redis locking we use for indexing projects in
+Elasticsearch.
+
+### Searching
+
+Searching will be implemented using the `/api/search` functionality in
+Zoekt. There is also
+[an open PR to fix this endpoint in Zoekt](https://github.com/sourcegraph/zoekt/pull/506),
+and again we may consider working from a fork until this is fixed. GitLab will
+prepend all searches with the appropriate filter for repositories based on the
+user's search context (group or project), in the same way we do for
+Elasticsearch. For Zoekt this will be implemented as a query string regex that
+matches all the searched repositories.
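+
+As an illustration, a group-scoped search could be built by prepending a
+repository filter to the user's query. The client and query helpers below are
+assumptions; only the idea of a repository regex filter comes from the design
+above:
+
+```ruby
+# Hypothetical sketch of scoping a Zoekt search to a group.
+def zoekt_group_query(group, user_query)
+  # Prepend a regex matching every repository in the group,
+  # for example: repo:^gitlab-org/ some_search_term
+  "repo:^#{Regexp.escape(group.full_path)}/ #{user_query}"
+end
+
+def zoekt_search(group, user_query)
+  # Assumed client wrapping POST /api/search on the Zoekt webserver.
+  Gitlab::Zoekt::Client.search(query: zoekt_group_query(group, user_query))
+end
+```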
+
+### Zoekt infrastructure
+
+Each Zoekt node will need to run an indexing server and a searching server.
+These are both webservers with different responsibilities. Considering that the
+Zoekt indexing process needs to keep a full clone of the bare repo
+([unless we come up with a better option](https://gitlab.com/gitlab-org/gitlab/-/issues/384722))
+these bare repos will be stored on spinning disks to save on storage costs.
+These are only used as an intermediate step to generate the actual `.zoekt`
+index files which will be stored on an SSD for fast searches.
+
+### Rollout strategy
+
+Initially Zoekt code search will only be available to `gitlab-org`. After that
+we'll start rolling it out to specific customers that have requested a better
+code search experience. As we learn about scaling and make improvements we will
+gradually roll it out to all licensed groups on GitLab.com. We will use a
+similar approach to Elasticsearch for keeping track of which groups are indexed
+and which are not. This will be based on a new table `zoekt_indexed_namespaces`
+with a `namespace_id` reference. We will only allow rolling out to top level
+namespaces to simplify the logic of checking for all layers of group
+inheritance. Once we've rolled out to all licensed groups we'll enable logic to
+automatically enroll newly licensed groups. This table may also be a place to
+store per-namespace sharding and replication data as described below.
+
+### Sharding and replication strategy
+
+Zoekt does not have any inbuilt sharding, and we expect that we'll need
+multiple Zoekt servers to reach the scale needed to provide search
+functionality to all of GitLab's licensed customers.
+
+There are 2 clear ways to implement sharding:
+
+1. Build it on top of, or in front of, Zoekt as an independent component.
+   Building all the complexities of a distributed database into Zoekt is not
+   likely to be a good direction for the project, so most likely this would be
+   an independent piece of infrastructure that proxies requests to the correct
+   shard.
+1. Manage the shards inside GitLab. This would be an application layer in
+   GitLab which chooses the correct shard to send indexing and search requests
+   to.
+
+Likewise, there are 2 clear ways to implement replication:
+
+1. Server-side where Zoekt replicas are aware of other Zoekt replicas and they
+   stream updates from some primary to remain in sync
+1. Client-side replication where clients send indexing requests to all replicas
+   and search requests to any replica
+
+We plan to implement both sharding and replication inside the GitLab
+application. This simplifies the additional infrastructure components that
+need to be deployed and allows more flexibility to control our rollout to many
+customers alongside our rollout of multiple shards and replicas.
+
+We plan to defer the implementation of these high availability aspects until
+later, but a preliminary plan (sketched below) would be:
+
+1. GitLab is configured with a pool of Zoekt servers
+1. GitLab assigns groups randomly to multiple Zoekt servers (all Zoekt servers
+   are considered replicas and there should be multiple copies for each
+   repository)
+1. When indexing a project GitLab will queue a Sidekiq job for each Zoekt
+   server that needs to be updated
+1. When searching we will randomly select one of the Zoekt servers for the
+   group being searched. We don't care which is "more up to date" as code
+   search will be "eventually consistent" and all reads may read slightly out
+   of date indexes.
+1. We will shard everything by top level group as this ensures group search can
+   always search a single Zoekt server. Aggregation may be possible for global
+   searches at some point in future if this turns out to be important. Smaller
+   self-managed instances may use a single Zoekt server allowing global
+   searches to work without any aggregation being implemented.
+
+The downside of the chosen path will be the added complexity of managing all
+these Zoekt servers from GitLab when compared with a "proxy" layer outside of
+GitLab that is managing all of these shards. We will consider this decision a
+work in progress and reassess if it turns out to add too much complexity to
+GitLab.
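+
+A minimal sketch of the preliminary plan above, assuming a configured server
+pool and a hypothetical `zoekt_server_urls` association (all names are
+illustrative):
+
+```ruby
+# Hypothetical sketch of client-side sharding and replication.
+ZOEKT_REPLICA_COUNT = 2
+
+# Assign a top level group to several random servers from the pool;
+# every assigned server holds a full copy of the group's index.
+def assign_zoekt_servers(group, server_pool)
+  server_pool.sample(ZOEKT_REPLICA_COUNT)
+end
+
+# Indexing fans out: one Sidekiq job per assigned server, so one slow
+# or failing server does not block updates to the others.
+def index_project(project)
+  project.root_ancestor.zoekt_server_urls.each do |url|
+    ZoektIndexerWorker.perform_async(project.id, url)
+  end
+end
+
+# Searching reads from any copy: results are eventually consistent.
+def zoekt_search_url(group)
+  group.zoekt_server_urls.sample
+end
+```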
+
+### Iterations
+
+1. Make available for `gitlab-org`
+1. Improve monitoring
+1. Improve performance
+1. Make available for select customers
+1. Implement sharding
+1. Implement replication
+1. Make available to many more licensed groups
+1. Implement automatic (re)balancing of shards
+1. Estimate costs for rolling out to all licensed groups and decide if it's
+   worth it, or if we need to optimize further or adjust our plan
+1. Roll out to all licensed groups
+1. Improve performance
+1. Assess costs and decide whether we should roll out to all free customers
-- 
GitLab

From 8c0e6c78c5a6de58832d7da5b4e7faea96ebc0c0 Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Wed, 1 Feb 2023 11:20:50 +1100
Subject: [PATCH 2/6] Add more detail about Zoekt sharding/replication

---
 .../search/code_search_with_zoekt.md          | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
index 7e04082b8ea6c3..60356977bfeb94 100644
--- a/doc/architecture/blueprints/search/code_search_with_zoekt.md
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -241,6 +241,39 @@
 these Zoekt servers from GitLab when compared with a "proxy" layer outside of
 GitLab that is managing all of these shards. We will consider this decision a
 work in progress and reassess if it turns out to add too much complexity to
 GitLab.
 
+We also need to consider whether the replicas are handled as multiple database
+entries in GitLab or whether we use Consul for service discovery.
+
+#### Sharding/replication proposal using GitLab `::Zoekt::Shard` model
+
+This is already mostly implemented, as the `::Zoekt::IndexedNamespace`
+implements a many-to-many relationship between namespaces and shards.
+
+The main remaining things to implement are (see the sketch below):
+
+1. Indexing should loop over all shards to keep them all up to date. Use a
+   single Sidekiq job per shard so that we have finer-grained and shorter
+   running Sidekiq jobs, and a failure in 1 shard won't impact indexing on
+   another.
+1. Searches should randomly choose a Shard to search.
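+
+A sketch of these two behaviours, assuming the `::Zoekt::Shard` model above
+and a hypothetical `zoekt_shards` association (argument shapes are
+assumptions):
+
+```ruby
+# Hypothetical sketch of fanning indexing out across shards.
+def index_on_all_shards(project)
+  # One Sidekiq job per shard keeps jobs short, and a failure on one
+  # shard does not impact indexing on another.
+  project.root_ancestor.zoekt_shards.each do |shard|
+    ZoektIndexerWorker.perform_async(project.id, shard.id)
+  end
+end
+
+# Searches randomly choose one of the shards holding the namespace.
+def zoekt_shard_for_search(namespace)
+  namespace.zoekt_shards.sample
+end
+```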
+
+#### Alternative sharding/replication proposal using Consul
+
+Another possible approach to consider would be to use a service discovery
+approach here. We could make `::Zoekt::IndexedNamespace` unique on
+`namespace_id` and then `::Zoekt::Shard` would use a DNS name that we can
+look up in Consul. When performing indexing we loop through all DNS records
+returned. When performing searches we randomly choose one of the DNS records
+returned.
+
+The tricky thing is that Consul health checks might mean that index updates are
+missed while a server is temporarily unavailable. Ideally we'd always queue up
+work in Sidekiq, and work destined for a temporarily unavailable service would
+be re-queued for later.
+
+The advantage, however, is that Consul health checks mean searches always
+choose a healthy shard to search against. In the other proposal we might end up
+needing to implement our own health checks in GitLab to avoid searching a
+replica that is offline.
+
 ### Iterations
 
 1. Make available for `gitlab-org`
-- 
GitLab

From 3a22291c82113f67ab266536e7a718214de79d1a Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Wed, 1 Feb 2023 12:59:23 +1100
Subject: [PATCH 3/6] Blueprint more info on zoekt-dynamic-indexserver/webserver

---
 .../blueprints/search/code_search_with_zoekt.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
index 60356977bfeb94..80f479bba33531 100644
--- a/doc/architecture/blueprints/search/code_search_with_zoekt.md
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -166,13 +166,20 @@
 matches all the searched repositories.
 
 ### Zoekt infrastructure
 
-Each Zoekt node will need to run an indexing server and a searching server.
+Each Zoekt node will need to run a
+[zoekt-dynamic-indexserver](https://github.com/sourcegraph/zoekt/pull/496) and
+a
+[zoekt-webserver](https://github.com/sourcegraph/zoekt/blob/main/cmd/zoekt-webserver/main.go).
 These are both webservers with different responsibilities. Considering that the
 Zoekt indexing process needs to keep a full clone of the bare repo
 ([unless we come up with a better option](https://gitlab.com/gitlab-org/gitlab/-/issues/384722))
 these bare repos will be stored on spinning disks to save on storage costs.
 These are only used as an intermediate step to generate the actual `.zoekt`
-index files which will be stored on an SSD for fast searches.
+index files which will be stored on an SSD for fast searches. These web
+servers need to run on the same node because they access the same files. The
+`zoekt-dynamic-indexserver` is responsible for writing the `.zoekt` index
+files. The `zoekt-webserver` is responsible for responding to searches by
+reading these `.zoekt` index files.
-- 
GitLab

From 412d31ed4f6daf3d7ecf45efe313d0cdb41bbeec Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Mon, 6 Feb 2023 11:07:36 +1100
Subject: [PATCH 4/6] Update Zoekt sharding/replication to reflect better ideas

---
 .../search/code_search_with_zoekt.md          | 88 ++++++++++---------
 1 file changed, 45 insertions(+), 43 deletions(-)

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
index 80f479bba33531..01b9fe4e3f89b3 100644
--- a/doc/architecture/blueprints/search/code_search_with_zoekt.md
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -212,74 +212,76 @@
    GitLab which chooses the correct shard to send indexing and search requests
    to.
 
-Likewise, there are 2 clear ways to implement replication:
+Likewise, there are a few ways to implement replication:
 
 1. Server-side where Zoekt replicas are aware of other Zoekt replicas and they
    stream updates from some primary to remain in sync
 1. Client-side replication where clients send indexing requests to all replicas
   and search requests to any replica
 
-We plan to implement both sharding and replication inside the GitLab
-application. This simplifies the additional infrastructure components that
-need to be deployed and allows more flexibility to control our rollout to many
-customers alongside our rollout of multiple shards and replicas.
+We plan to implement sharding inside the GitLab application, but replication
+may be best served at the filesystem level of Zoekt servers rather than by
+sending duplicated updates from GitLab to all replicas.
+
+Implementing sharding in GitLab simplifies the additional infrastructure
+components that need to be deployed and allows more flexibility to control our
+rollout to many customers alongside our rollout of multiple shards.
+
+Implementing syncing from primary -> replica on Zoekt nodes at the filesystem
+level optimizes the overall resource usage. We only need to sync the index
+files to replicas, as the bare repo is just a cache. This saves on:
+
+1. Disk space on replicas
+1. CPU usage on replicas, as they do not need to rebuild the index
+1. Load on Gitaly to clone the repos
 
 We plan to defer the implementation of these high availability aspects until
 later, but a preliminary plan (sketched below) would be:
 
 1. GitLab is configured with a pool of Zoekt servers
-1. GitLab assigns groups randomly to multiple Zoekt servers (all Zoekt servers
-   are considered replicas and there should be multiple copies for each
-   repository)
-1. When indexing a project GitLab will queue a Sidekiq job for each Zoekt
-   server that needs to be updated
-1. When searching we will randomly select one of the Zoekt servers for the
-   group being searched. We don't care which is "more up to date" as code
-   search will be "eventually consistent" and all reads may read slightly out
-   of date indexes.
+1. GitLab randomly assigns each group to a Zoekt primary server
+1. There will also be Zoekt replica servers
+1. Periodically Zoekt primary servers will sync their `.zoekt` index files to
+   their respective replicas
+1. There will need to be some process by which to promote a replica to a
+   primary if the primary is having issues. We will be using Consul to keep
+   track of which server is the primary and which are the replicas.
+1. When indexing a project GitLab will queue a Sidekiq job to update the index
+   on the primary
+1. When searching we will randomly select one of the Zoekt primary or replica
+   servers for the group being searched. We don't care which is "more up to
+   date" as code search will be "eventually consistent" and all reads may read
+   slightly out of date indexes. We will have a target for the maximum latency
+   of index updates and may consider removing nodes from rotation if they are
+   too far out of date.
 1. We will shard everything by top level group as this ensures group search can
    always search a single Zoekt server. Aggregation may be possible for global
   searches at some point in future if this turns out to be important. Smaller
   self-managed instances may use a single Zoekt server allowing global
-   searches to work without any aggregation being implemented.
+   searches to work without any aggregation being implemented. Depending on our
+   largest group sizes and the scaling limitations of a single node Zoekt
+   server, we may consider implementing an approach where a group can be
+   assigned multiple shards.
 
 The downside of the chosen path will be the added complexity of managing all
 these Zoekt servers from GitLab when compared with a "proxy" layer outside of
 GitLab that is managing all of these shards. We will consider this decision a
 work in progress and reassess if it turns out to add too much complexity to
 GitLab.
 
-We also need to consider whether the replicas are handled as multiple database
-entries in GitLab or whether we use Consul for service discovery.
-
-#### Sharding/replication proposal using GitLab `::Zoekt::Shard` model
+#### Sharding proposal using GitLab `::Zoekt::Shard` model
 
-This is already mostly implemented, as the `::Zoekt::IndexedNamespace`
+This is already implemented, as the `::Zoekt::IndexedNamespace`
 implements a many-to-many relationship between namespaces and shards.
 
-The main remaining things to implement are (see the sketch below):
-
-1. Indexing should loop over all shards to keep them all up to date. Use a
-   single Sidekiq job per shard so that we have finer-grained and shorter
-   running Sidekiq jobs, and a failure in 1 shard won't impact indexing on
-   another.
-1. Searches should randomly choose a Shard to search.
-
-#### Alternative sharding/replication proposal using Consul
-
-Another possible approach to consider would be to use a service discovery
-approach here. We could make `::Zoekt::IndexedNamespace` unique on
-`namespace_id` and then `::Zoekt::Shard` would use a DNS name that we can
-look up in Consul. When performing indexing we loop through all DNS records
-returned. When performing searches we randomly choose one of the DNS records
-returned.
-
-The tricky thing is that Consul health checks might mean that index updates are
-missed while a server is temporarily unavailable. Ideally we'd always queue up
-work in Sidekiq, and work destined for a temporarily unavailable service would
-be re-queued for later.
-
-The advantage, however, is that Consul health checks mean searches always
-choose a healthy shard to search against. In the other proposal we might end up
-needing to implement our own health checks in GitLab to avoid searching a
-replica that is offline.
+#### Replication and service discovery using Consul
+
+If we plan to replicate at the Zoekt node level as described above, we need to
+change our data model to use a one-to-many relationship from `zoekt_shards ->
+namespaces`. This means making the `namespace_id` column unique in
+`zoekt_indexed_namespaces`. Then we need to implement a service discovery
+approach where the `index_url` always points at a primary Zoekt node and the
+`search_url` is a DNS record with N replicas and the primary. We then choose
+randomly from `search_url` records when searching.
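+
+To illustrate, a sketch of how this routing could look, assuming `index_url`
+and `search_url` columns on the shard record (the DNS lookup is illustrative):
+
+```ruby
+# Hypothetical sketch of index/search routing with a primary and replicas.
+require 'resolv'
+
+# Indexing always goes to the primary Zoekt node.
+def zoekt_index_url(namespace)
+  namespace.zoekt_shard.index_url
+end
+
+# Searching resolves the search_url DNS record (primary plus N
+# replicas, filtered by Consul health checks) and picks one address
+# at random.
+def zoekt_search_address(namespace)
+  host = namespace.zoekt_shard.search_url
+  Resolv::DNS.open { |dns| dns.getaddresses(host) }.sample
+end
+```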
 
 ### Iterations
-- 
GitLab

From 09dabcde14106552d747a10dedb7ba339121cd58 Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Wed, 8 Feb 2023 16:21:27 +1100
Subject: [PATCH 5/6] Add more details about filesystem replication in Zoekt

---
 .../blueprints/search/code_search_with_zoekt.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
index 01b9fe4e3f89b3..1d337c1af4976e 100644
--- a/doc/architecture/blueprints/search/code_search_with_zoekt.md
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -220,8 +220,14 @@ Likewise, there are a few ways to implement replication:
 
 We plan to implement sharding inside the GitLab application, but replication
-may be best served at the filesystem level of Zoekt servers rather than by
-sending duplicated updates from GitLab to all replicas.
+may be best served at the filesystem level of Zoekt servers rather than by
+sending duplicated updates from GitLab to all replicas. This could be some
+process on the Zoekt servers that monitors for changes to the `.zoekt` files
+in a specific directory and syncs those updates to the replicas. This will
+need to be slightly more sophisticated than `rsync` because the files are
+constantly changing, and files may be deleted while the sync is happening, so
+we would want to sync the updates in batches somehow without slowing down
+indexing.
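+
+A sketch of the batching idea (paths, batch size, and the rsync flag are
+illustrative; a real version would run on the Zoekt nodes themselves):
+
+```ruby
+# Hypothetical sketch of batched replication of finished index files.
+ZOEKT_INDEX_DIR = '/var/lib/zoekt/index'
+
+def sync_indexes_to_replica(replica_host)
+  # Snapshot the current set of .zoekt files first; files created
+  # after this point are picked up by the next sync run.
+  batch = Dir.glob(File.join(ZOEKT_INDEX_DIR, '*.zoekt'))
+
+  batch.each_slice(50) do |files|
+    # --ignore-missing-args tolerates files deleted mid-sync.
+    system('rsync', '--ignore-missing-args', *files,
+           "#{replica_host}:#{ZOEKT_INDEX_DIR}/")
+  end
+end
+```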
 
 Implementing sharding in GitLab simplifies the additional infrastructure
 components that need to be deployed and allows more flexibility to control our
-- 
GitLab

From d9ef1da10d6a25baf7c03665f2d291fa29b76d06 Mon Sep 17 00:00:00 2001
From: Dylan Griffith
Date: Wed, 8 Feb 2023 16:25:10 +1100
Subject: [PATCH 6/6] Change Zoekt blueprint status to ongoing

This is already in progress as part of
https://gitlab.com/groups/gitlab-org/-/epics/9404 so the "proposed" status is
no longer necessary.
---
 doc/architecture/blueprints/search/code_search_with_zoekt.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
index 1d337c1af4976e..d0d347f1ff4d23 100644
--- a/doc/architecture/blueprints/search/code_search_with_zoekt.md
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -1,5 +1,5 @@
 ---
-status: proposed
+status: ongoing
 creation-date: "2022-12-28"
 authors: [ "@dgruzd", "@DylanGriffith" ]
 coach: "@DylanGriffith"
-- 
GitLab