Orphaned workspace pods which exist in the cluster but not the DB should be automatically cleaned up
MR: Pending
Description
Currently, if the workspaces record for a currently-Running workspace is deleted from the database (via some CASCADE DELETE rule or other means), the workspace pod remains running in the cluster and never gets terminated.
All that happens is that we log the orphaned workspace in ee/lib/remote_development/workspace_operations/reconcile/persistence/orphaned_workspaces_observer.rb: https://gitlab.com/gitlab-org/gitlab/-/blob/6842d2c100542de54ca9ad6e7a29cd6ded6e5252/ee/lib/remote_development/workspace_operations/reconcile/persistence/orphaned_workspaces_observer.rb
Why this is a problem
- This is a constant warning in our logs
- The orphaned workspace pods continue to consume resources (and cost) on the cluster, essentially forever until they are manually deleted by a cluster admin.
- It prevents administrators from deleting users who have created a workspace. See the problem for a customer: Change workspaces table personal_access_tokens ... (#551912)
- This should also resolve this older issue which we deferred: Discuss how we handle records with deleted user... (#520439)
- Direct customer feedback about problems with orphaned workspaces: internal doc link
So, we should try to come up with some way to minimise or avoid these orphaned workspaces where we can.
What are the ways that a workspace record can be deleted?
Deletion of the following types of models can currently cause a Workspace deletion:
- User
- Project
- Cluster Agent (AKA agentk, the agent that runs in the cluster)
- Any other model which causes a deletion of one of these models (e.g. a Group or Namespace)
And the deletion of the following model will be restricted if a Workspace still exists:
- Personal Access Token
This was determined by the following command, which shows all the current workspace FOREIGN KEY references and whether they are CASCADE or RESTRICT on DELETE:
```shell
grep -E -A1 'ALTER TABLE ONLY workspaces$' db/structure.sql | grep -B1 REFERENCES
```
And its output:
```sql
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_bdb0b31131 FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_dc7c316be1 FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_ec70695b2c FOREIGN KEY (personal_access_token_id) REFERENCES personal_access_tokens(id) ON DELETE RESTRICT;
--
ALTER TABLE ONLY workspaces
    ADD CONSTRAINT fk_f78aeddc77 FOREIGN KEY (cluster_agent_id) REFERENCES cluster_agents(id) ON DELETE CASCADE;
```
Decision to make: How should we handle these deletions in a way which can avoid leaving orphaned workspaces running in the cluster?
We have a Product/UX decision to make here: What should we do when deletion of a User, Project, Cluster Agent, or other model record causes a Workspace record to be deleted?
The options are:
Option 1: Prevent deletion of these records until the workspace itself is in a Terminated state
I.e., add ActiveRecord-level and database-level constraints which check this state.
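A minimal sketch of what the ActiveRecord-level guard could look like (the model setup, associations, and TERMINATED value below are illustrative assumptions for the sketch, not the actual GitLab models):

```ruby
# Rough sketch only, not the actual GitLab code: model names, associations,
# and the TERMINATED value are assumptions for illustration.
require "active_record"

ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")

ActiveRecord::Schema.define do
  create_table :users
  create_table :workspaces do |t|
    t.references :user
    t.string :actual_state
  end
end

class Workspace < ActiveRecord::Base
  TERMINATED = "Terminated"

  belongs_to :user
end

class User < ActiveRecord::Base
  has_many :workspaces

  # Abort the destroy while any associated workspace is not yet Terminated.
  before_destroy :ensure_workspaces_terminated, prepend: true

  private

  def ensure_workspaces_terminated
    return unless workspaces.where.not(actual_state: Workspace::TERMINATED).exists?

    errors.add(:base, "Cannot delete user with non-terminated workspaces")
    throw :abort
  end
end

user = User.create!
Workspace.create!(user: user, actual_state: "Running")

puts user.destroy ? "user deleted" : "deletion blocked: #{user.errors.full_messages.join(', ')}"
```

The database-level counterpart would presumably be switching the relevant ON DELETE CASCADE foreign keys to RESTRICT, or adding a trigger which raises unless the workspace row is Terminated.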
However, this option is problematic for a few reasons:
- Primarily, this would be a big barrier to admins and customers who are trying to clean up their data, e.g. deleting banned/spam users, unused groups/projects, etc. I believe we would get lots of complaints. And even if we did go this route...
- We can't really force a check for an actual_state of Terminated, because the cluster (and/or the actual workspace pod) may not even be running or present anymore to be properly Terminated.
- To avoid that, we could check for a desired_state of Terminated (which is set immediately), but that also means we would have to wait some time (at least the default 10 seconds) for the next reconciliation to happen and schedule the termination on the cluster.
Option 2: Automatically delete orphaned workspace pods where we can
To accomplish this, we could treat them like desired_state == Terminated workspaces in the reconciliation rails_infos response. This means we would send over only the workspace and secrets inventory ConfigMaps and no other resources, which causes the workspace to get deleted and the pod terminated.
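As a minimal sketch of that filtering step (the resource shapes, names, and the helper method below are hypothetical, not the actual rails_infos code), the idea is to drop everything except the inventory ConfigMaps from the desired config we send back for an orphaned workspace:

```ruby
# Hypothetical sketch: for a workspace that exists in the cluster but has no DB
# record, return only its inventory ConfigMaps so the agent prunes every other
# resource and the pod is terminated. Resource shapes and names are illustrative.
def desired_resources_for_orphaned_workspace(desired_resources)
  desired_resources.select do |resource|
    resource["kind"] == "ConfigMap" &&
      resource.dig("metadata", "name").to_s.end_with?("-inventory")
  end
end

desired_resources = [
  { "kind" => "ConfigMap",  "metadata" => { "name" => "workspace-abc123-workspace-inventory" } },
  { "kind" => "ConfigMap",  "metadata" => { "name" => "workspace-abc123-secrets-inventory" } },
  { "kind" => "Deployment", "metadata" => { "name" => "workspace-abc123" } },
  { "kind" => "Service",    "metadata" => { "name" => "workspace-abc123" } }
]

# Only the two inventory ConfigMaps survive; omitting the rest from the
# rails_infos response is what causes the live resources to be deleted.
pp desired_resources_for_orphaned_workspace(desired_resources)
```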
This wouldn't handle ALL cases - for example, if the Cluster Agent record is deleted, then NO further reconciliation will happen, and ALL running workspaces are orphaned and have to be deleted manually. We can handle this as a separate case, possibly by adding warnings, or by actually restricting the deletion as described in Option 1.
But for the rest of the non-Cluster Agent cases, this approach should work.
Acceptance criteria
TODO: Fill out (required)
- [Describe what must be achieved to complete this issue.]
- [If applicable, please provide design specifications for this feature/enhancement.]
- [If applicable, please list any technical requirements (performance, security, database, etc.)]
Implementation plan
TODO: Fill out or delete (optional)
[Provide a high-level plan for implementation of this issue, including relevant technical and/or design details.]