Things break if GitLab thinks (according to SQL) that a repository depends on a pool repository, but [a] the alternates file doesn't exist and [b] the pool repo is missing

Summary

  • GitLab PostgreSQL shows that a project/repository depends on a pool repository.
  • The pool repository doesn't exist.
  • The project repo does not have an alternates file.

Currently assuming that deduplication is no longer in effect and that there has been no data loss. A quick way to confirm this state is sketched below.

See the customer ticket for more details.
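
For reference, the broken state can be confirmed from the Rails console plus a look at the Gitaly node. This is a minimal sketch, not a definitive procedure: the project ID is a placeholder, 413 is the pool ID from the ticket, the storage path assumes a default Omnibus install, and disk_path is assumed to be available on both models.

    # Rails console (sudo gitlab-rails console); the project ID is a placeholder.
    project = Project.find(1234)

    # SQL side: does Rails still think the project is deduplicated?
    pool = project.pool_repository
    pool&.id        # => 413 in this case
    pool&.disk_path # => "@pools/xx/yy/<hash>" - where Gitaly would keep the pool

    # Paths to check on the Gitaly node (default Omnibus storage path assumed):
    puts "/var/opt/gitlab/git-data/repositories/#{project.disk_path}.git/objects/info/alternates"
    puts "/var/opt/gitlab/git-data/repositories/#{pool.disk_path}.git"
    # In this incident both are missing even though projects.pool_repository_id is set.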

How would a pool repository not get created?

Perhaps the Sidekiq job fails:

ObjectPool::JoinWorker.perform_async(pool_repository.id, self.id)

See Kibana for some examples where json.exception.message includes

  • Somebody already triggered housekeeping for this resource in the past 1440 minutes (the retry period for the job is short enough that all retries will fail if housekeeping's lease isn't released)
  • 13:unexpected alternates content from Gitaly
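
If the JoinWorker did fail (or silently gave up), one way to test that theory on a reproduction system is to re-run the join by hand once the housekeeping lease has expired. A hedged sketch, assuming the worker still takes the same (pool_id, project_id) arguments as the perform_async call above:

    # Rails console sketch; IDs are placeholders (413 is the pool from the ticket).
    project = Project.find(1234)
    pool    = PoolRepository.find(413)

    pool.state          # the pool record is a state machine; 'ready' vs 'scheduled'/'failed' is informative
    pool.source_project # the project the pool was (or should have been) created from

    # Same call the application enqueues, run synchronously so any Gitaly error
    # surfaces in the console instead of in Sidekiq retries:
    ObjectPool::JoinWorker.new.perform(pool.id, project.id)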

Pool repository should get created

On a single-Gitaly system, my testing indicates that the missing pool repository gets created.

One possibility is that this doesn't happen with Gitaly Cluster, but it seems harder to "break" a system to get it into the required state.
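
For anyone repeating that test, the relevant code path can be driven directly from the Rails console instead of waiting for housekeeping. This is only a sketch of the same service call the GC worker makes in before_gitaly_call (see the backtrace and code excerpt further down); run it against a test project, not customer data.

    # Rails console sketch; placeholder project ID for a deduplicated test project.
    project = Project.find(1234)

    begin
      # Same service the GC worker invokes before talking to Gitaly; on a healthy
      # single-Gitaly system this is where the pool fetch (and, per the docs
      # quoted below, on-the-fly pool creation) happens.
      Projects::GitDeduplicationService.new(project).execute
    rescue => e
      # On the broken setup this is where the "@pools/... not found" error surfaced.
      puts "#{e.class}: #{e.message}"
    end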

Repack fails

I don't currently have direct causation from the pool issue to this; it's possible a second issue is causing the repack to fail. The reason for the ticket was that tmp files are being left behind by the repack. That led to the next section, garbage collection failing, which does relate to the pool repo issue.

  • gitaly
    "error": "rpc error: code = Canceled desc = signal: terminated",
    "grpc_meta_client_name": "gitlab-sidekiq",
    "grpc_request_fullMethod": "/gitaly.RepositoryService/RepackFull",
    "grpc_service": "gitaly.RepositoryService",
    "msg": "finished unary call with code Canceled",
  • praefect
    "error": "rpc error: code = Canceled desc = context canceled",
    "grpc_meta_client_name": "gitlab-sidekiq",
    "grpc_request_fullMethod": "/gitaly.RepositoryService/RepackFull",
    "level": "ERROR",
    "msg": "proxying maintenance RPC to node failed",
  • sidekiq
    "class": "Projects::GitGarbageCollectWorker",
    "error_class": "Gitlab::Git::CommandError",
    "error_message": "14:Socket closed.",
    "exception_class": "Gitlab::Git::CommandError",
    "exception_message": "14:Socket closed.",
    "message": "Projects::GitGarbageCollectWorker JID-2106897fb9e9022ca450a850: fail: 100.137129 sec",
    "meta_root_caller_id": "POST /api/:version/internal/post_receive",
    "correlation_id": "01GAM9YTNKJM4NRMRY792V5NCD",
    "error_backtrace": [
      "app/workers/concerns/git_garbage_collect_methods.rb:113:in `rescue in gitaly_call'",
      "app/workers/concerns/git_garbage_collect_methods.rb:83:in `gitaly_call'",
      "app/workers/concerns/git_garbage_collect_methods.rb:35:in `perform'",
      "lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb:26:in `call'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:16:in `perform'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:58:in `perform'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'",
      "lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'",
      "lib/gitlab/sidekiq_middleware/worker_context/server.rb:19:in `block in call'",
      "lib/gitlab/application_context.rb:103:in `block in use'",
      "lib/gitlab/application_context.rb:103:in `use'",
      "lib/gitlab/application_context.rb:48:in `with_context'",
      ...
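
To take Sidekiq's job cancellation out of the picture, the repack can also be driven inline from the Rails console. A hedged sketch, assuming the 15.x worker signature of (resource_id, task, lease_key, lease_uuid) and the :full_repack task name:

    # Run the RepackFull path synchronously for one project so the Gitaly-side
    # failure is visible directly, rather than as "Socket closed" / a cancelled
    # context once Sidekiq gives up. Placeholder project ID.
    project = Project.find(1234)

    Projects::GitGarbageCollectWorker.new.perform(project.id, :full_repack)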

Garbage collection fails

    "class": "Projects::GitGarbageCollectWorker",
    "completed_at": "2022-09-02T06:05:32.123Z",
    "error_message": "5:mutator call: route repository mutator: get repository id: repository \"default\"/\"@pools/1c/49/1c49f22f6de9bd15e5e566fa8983be4cfa4709abf0f95edf96dcd3d6249c2649.git\" not found.    
    "error_backtrace": [
      "lib/gitlab/gitaly_client.rb:157:in `execute'",
      "lib/gitlab/gitaly_client/call.rb:18:in `block in call'",
      "lib/gitlab/gitaly_client/call.rb:55:in `recording_request'",
      "lib/gitlab/gitaly_client/call.rb:17:in `call'",
      "lib/gitlab/gitaly_client.rb:147:in `call'",
      "lib/gitlab/gitaly_client/object_pool_service.rb:45:in `fetch'",
      "lib/gitlab/git/object_pool.rb:44:in `fetch'",
      "app/services/projects/git_deduplication_service.rb:47:in `fetch_from_source'",
      "app/services/projects/git_deduplication_service.rb:24:in `block in execute'",
      "app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'",
      "app/services/projects/git_deduplication_service.rb:17:in `execute'",
      "app/workers/projects/git_garbage_collect_worker.rb:29:in `before_gitaly_call'",
      "app/workers/concerns/git_garbage_collect_methods.rb:34:in `perform'",
      "lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb:26:in `call'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:16:in `perform'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:58:in `perform'",
      "lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'",
      "lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'",

It should not fail, according to this code:

      # Don't block garbage collection if we can't fetch into an object pool
      # due to some gRPC error because we don't want to accumulate cruft.
      # See https://gitlab.com/gitlab-org/gitaly/-/issues/4022.
      begin
        ::Projects::GitDeduplicationService.new(resource).execute
      rescue Gitlab::Git::CommandTimedOut, GRPC::Internal => e

Missing pool repo is not created

From the docs:

If GitLab thinks a pool repository exists (that is, it exists according to SQL), but it does not on the Gitaly server, then it is created on the fly by Gitaly.

This seems to be a corner case where that on-the-fly creation does not happen.

It is effectively a fourth thing that can go wrong, and the scenario is not handled in the docs.
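
The docs describe the on-the-fly creation as happening when GitLab fetches into the pool. A minimal way to exercise just that fetch for the pool in question is sketched below; it assumes Gitlab::Git::ObjectPool#fetch (the method in the GC backtrace above) takes no arguments.

    # Rails console sketch: exercise only the object pool fetch from the GC backtrace
    # (git_deduplication_service.rb -> object_pool.rb -> object_pool_service.rb).
    pool = PoolRepository.find(413) # the missing pool from the ticket

    begin
      pool.object_pool.fetch
    rescue => e
      # Expected on the broken system: the same "... not found" error as in the GC logs.
      puts "#{e.class}: #{e.message}"
    end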

Pool relation existence

There are three different things that can go wrong here.

This case is a variation on [1]:

[1]: SQL says repository A belongs to pool P but Gitaly says A has no alternate objects

Observed here: SQL says repository A belongs to pool P, Gitaly says A has no alternate objects, and pool P does not exist

Steps to reproduce

Unknown. The customer is running Gitaly Cluster and might have migrated the affected projects to it, but not the pools. The parent directory for the pool repo is missing. I think it's very unlikely Gitaly would have removed @pools/xx/yy, so it looks like the pool was never created there. It's possible that the repository migration broke the link.

Multiple projects show as related to the missing pool repo (see the ticket), via:

select id,path,namespace_id,pool_repository_id from projects where pool_repository_id=413;
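
The same check from the Rails console, for anyone without direct PostgreSQL access (same columns as the query above):

    # Projects that Rails still links to the missing pool repository (ID 413).
    Project.where(pool_repository_id: 413).pluck(:id, :path, :namespace_id, :pool_repository_id)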

However, as the alternates file is gone, the working assumption is that the only data issue is that Rails still shows these projects as dependent on a pool repo.

Example Project

What is the current bug behavior?

  • SQL says repository A belongs to pool P but Gitaly says A has no alternate objects, and pool P does not exist
  • Garbage collection fails
  • Our docs don't mention this scenario

What is the expected correct behavior?

  • Garbage collection should definitely work, based on this merge request: gitlab!80269 (merged)
  • The documentation doesn't cover this scenario. I'm currently looking into how to safely remove these erroneous Rails pool references, as this is blocking progress on the ticket; a possible shape of that cleanup is sketched below.
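
Strictly as a sketch, and not yet validated against the customer's data: if the working assumption above holds (no alternates file, pool never created on disk), the SQL-side cleanup might look roughly like this. The member_projects association and the safety of touching the pool_repositories row are assumptions to verify first.

    # UNVALIDATED sketch - only after confirming the affected repositories have no
    # objects/info/alternates file and the pool genuinely never existed on disk.
    pool = PoolRepository.find(413) # the missing pool from the ticket

    pool.member_projects.find_each do |project|
      # Drop the stale SQL-side link so GC/housekeeping stops trying to fetch
      # into a pool repository that does not exist.
      project.update!(pool_repository: nil)
    end

    # What to do with the now-orphaned pool_repositories row (and whether destroying
    # it is safe when nothing exists on disk) still needs to be worked out.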

Relevant logs and/or screenshots

Output of checks

15.2

Possible fixes
