Orphaned workspace pods which exist in the cluster but not the DB should be automatically cleaned up
MR: Pending
Description
Currently, if the workspaces record for a currently-Running workspace is deleted from the database (via some CASCADE DELETE rule or other means), the workspace pod remains running in the cluster and never gets terminated.
All that happens is that we log the orphaned workspace in ee/lib/remote_development/workspace_operations/reconcile/persistence/orphaned_workspaces_observer.rb: https://gitlab.com/gitlab-org/gitlab/-/blob/6842d2c100542de54ca9ad6e7a29cd6ded6e5252/ee/lib/remote_development/workspace_operations/reconcile/persistence/orphaned_workspaces_observer.rb
Why this is a problem
- This is a constant warning in our logs
- The orphaned workspace pods continue to consume resources (and cost) on the cluster, essentially forever until they are manually deleted by a cluster admin.
- It prevents administrators from deleting users who have created a workspace. See the problem for a customer: Change workspaces table personal_access_tokens ... (#551912)
- This should also resolve this older issue which we deferred: Discuss how we handle records with deleted user... (#520439)
- Direct customer feedback about problems with orphaned workspaces: internal doc link
So, we should try to come up with some way to minimise or avoid these orphaned workspaces where we can.
What are the ways that a workspace record can be deleted?
Deletion of the following types of models can currently cause a Workspace deletion:
- User
- Project
- Cluster Agent (AKA agentk, the agent that runs in the cluster)
- Any other model which causes a deletion of one of these models (e.g. a Group or Namespace)
And the deletion of the following model will be restricted if a Workspace still exists:
- Personal Access Token
This was determined by the following command, which shows all the current workspace FOREIGN KEY references and whether they are CASCADE or RESTRICT on DELETE:
```shell
grep -E -A1 'ALTER TABLE ONLY workspaces$' db/structure.sql | grep -B1 REFERENCES
```
And its output:
```sql
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_bdb0b31131 FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_dc7c316be1 FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_ec70695b2c FOREIGN KEY (personal_access_token_id) REFERENCES personal_access_tokens(id) ON DELETE RESTRICT;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_f78aeddc77 FOREIGN KEY (cluster_agent_id) REFERENCES cluster_agents(id) ON DELETE CASCADE;
```
Decision to make: How should we handle these deletions in a way which can avoid leaving orphaned workspaces running in the cluster?
We have a Product/UX decision to make here: What should we do when deletion of a User, Project, Cluster Agent, or other model record causes a Workspace record to be deleted?
The options are:
Option 1: Prevent deletion of these records until the workspace itself is in a Terminated state
I.e., add ActiveRecord-level and database-level constraints which check this state.
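A minimal sketch of what the ActiveRecord-level guard could look like (the model setup, associations, and TERMINATED value below are illustrative assumptions for the sketch, not the actual GitLab models):

```ruby
# Rough sketch only, not the actual GitLab code: model names, associations,
# and the TERMINATED value are assumptions for illustration.
require "active_record"

ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

ActiveRecord::Schema.define do
  create_table :users
  create_table :workspaces do |t|
    t.references :user
    t.string :actual_state
  end
end

class Workspace < ActiveRecord::Base
  TERMINATED = "Terminated"

  belongs_to :user
end

class User < ActiveRecord::Base
  has_many :workspaces

  # Abort the destroy while any associated workspace is not yet Terminated.
  before_destroy :ensure_workspaces_terminated, prepend: true

  private

  def ensure_workspaces_terminated
    return unless workspaces.where.not(actual_state: Workspace::TERMINATED).exists?

    errors.add(:base, "Cannot delete user with non-terminated workspaces")
    throw :abort
  end
end

user = User.create!
Workspace.create!(user: user, actual_state: "Running")

puts user.destroy ? "user deleted" : "deletion blocked: #{user.errors.full_messages.join(', ')}"
```

The database-level counterpart would presumably be switching the relevant ON DELETE CASCADE foreign keys to RESTRICT, or adding a trigger which raises unless the workspace row is Terminated.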
However, this option is problematic for a few reasons:
- Primarily, this would be a big barrier to admins and customers who are trying to clean up their data, e.g. deleting banned/spam users, unused groups/projects, etc. I believe we would get lots of complaints. And even if we did go this route...
- We can't really force a check for an actual_state of Terminated, because the cluster (and/or the actual workspace pod) may not even be running or present anymore to be properly Terminated.
- To avoid that, we could check for a desired_state of Terminated (which is set immediately), but that also means we would have to wait some time (at least the default 10 seconds) for the next reconciliation to happen and schedule the termination on the cluster.
Option 2: Automatically delete orphaned workspace pods where we can
To accomplish this, we could treat them like desired_state == Terminated workspaces in the reconciliation rails_infos response. This means we would send over only the workspace and secrets inventory ConfigMaps and no other resources, which causes the workspace to get deleted and the pod terminated.
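As a minimal sketch of that filtering step (the resource shapes, names, and the helper method below are hypothetical, not the actual rails_infos code), the idea is to drop everything except the inventory ConfigMaps from the desired config we send back for an orphaned workspace:

```ruby
# Hypothetical sketch: for a workspace that exists in the cluster but has no DB
# record, return only its inventory ConfigMaps so the agent prunes every other
# resource and the pod is terminated. Resource shapes and names are illustrative.
def desired_resources_for_orphaned_workspace(desired_resources)
  desired_resources.select do |resource|
    resource["kind"] == "ConfigMap" &&
      resource.dig("metadata", "name").to_s.end_with?("-inventory")
  end
end

desired_resources = [
  { "kind" => "ConfigMap",  "metadata" => { "name" => "workspace-abc123-workspace-inventory" } },
  { "kind" => "ConfigMap",  "metadata" => { "name" => "workspace-abc123-secrets-inventory" } },
  { "kind" => "Deployment", "metadata" => { "name" => "workspace-abc123" } },
  { "kind" => "Service",    "metadata" => { "name" => "workspace-abc123" } }
]

# Only the two inventory ConfigMaps survive; omitting the rest from the
# rails_infos response is what causes the live resources to be deleted.
pp desired_resources_for_orphaned_workspace(desired_resources)
```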
This wouldn't handle ALL cases - for example, if the Cluster Agent record is deleted, then NO further reconciliation will happen, and ALL running workspaces are orphaned and have to be deleted manually. We can handle this as a separate case, possibly by adding warnings, or by actually restricting the deletion as described in Option 1.
But for the rest of the non-Cluster Agent cases, this approach should work.
Acceptance criteria
TODO: Fill out (required)
- [Describe what must be achieved to complete this issue.]
- [If applicable, please provide design specifications for this feature/enhancement.]
- [If applicable, please list any technical requirements (performance, security, database, etc.)]
Implementation plan
TODO: Fill out or delete (optional)
[Provide a high-level plan for implementation of this issue, including relevant technical and/or design details.]