Convert existing Workspaces internal script logic to postStart hooks
MRs:
- Convert existing Workspaces internal script log... (!182354 - merged) • Chad Woolley • 18.0 (Branch https://gitlab.com/gitlab-org/gitlab/-/commits/caw-ws-refactor-component-logic-7?ref_type=heads)
- Follow-up fix: Change workspace scripts volume permission to 555 (!190568 - merged) • Chad Woolley • 18.0
Description
As part of Startup scripts for Remote Development workspac... (&15602), we want to move all workspace container script processing to the postStart mechanism. Both internal scripts (e.g. starting sshd and VS Code servers) as well as user-defined postStart events in the workspace's devfile.
This will make the handling of the scripts easier to reason about, and also allow explicit control over the ordering of the various scripts - in relation to each other, and in relation to the startup lifecycle of the pod and its containers.
Non-Goals
This issue does not implement the support for user-provided devfile postStart hooks. That will be done as part of Add support for devfile postStart event in Work... (#505988 - closed)
Acceptance criteria
-
Existing behavior of auto-starting sshd and VS Code server should still work -
Container images with no background/daemonized Entrypoint/Cmd processes should still be usable to create a workspace, because the tail -f /dev/nulloverride will keep them running. -
Multi-container workspaces still work (commands/args of non-main containers should still work) -
Verify behavior of non-main containers with no command/args remains unchanged (should have never worked, I believe?) -
Verify existing workspaces are not restarted when upgrading to new code (golden master spec passing, which should confirm this) -
Verify logging of stderr/stdout of internal scripts works (see Validation steps on MR for details) -
Verify sysbox workspaces work
References
- Epic: Startup scripts for Remote Development workspac... (&15602)
- High level implementation plan on epic: &15602
Implementation plan
Currently these internal scripts (sshd and VS Code server startup) are handled by overriding the container's Entrypoint/Cmd (command/args in kubernetes spec).
When they are moved to the kubernetes postStart hook, we will need to replace the Entrypoint/Cmd override with a tail -f /dev/null command to keep the container alive.
So, after this issue, we will still be overwriting any existing Entrypoint/Cmd of the workspace's container.
But we will be in a position to easily make this option configurable by the user, as it is in the devcontainer spec via the "overrideCommand": false option. This will be implemented by Support preserving default ENTRYPOINT/CMD for c... (#392853 - closed)
Notes on logging
Eclipse Che logging approach
GitLab implementation is based on how Eclipse Che / DevworkspaceOperator handle the devfile postStart events, since that is the closest to us: It is based on the devfile standard, and based on kubernetes.
Here is the original devworkspace-operator issue which implements the "Single kubernetes poststart hook invoking multiple devfile poststart events" approash:
However, this approach has some ongoing and unresolved issues with exposing these logs to the user in the event of a failure in a poststart event, which in turn causes the pod to fail starting. These problems are described in this currently-open Eclipse Che issue:
https://github.com/eclipse-che/che/issues/23404
Apparently this is due to changes since Kubernetes version 1.26, that prevent the output of PostStart lifecycle hooks from being displayed in the Kubernetes event. See relevant links:
- https://github.com/devfile/devworkspace-operator/issues/1324
- https://issues.redhat.com/browse/CRW-6580
- https://issues.redhat.com/browse/CRW-6580?focusedId=25990997&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25990997
This issue is unresolved as of 2 weeks ago on 2025-04-22. See the following comment:
So, it seems like we can take the approach suggested in that comment, and force the postStart hook to always exit successfully.
This may result in the workspace being in a broken/inconsistent state if any of the actual poststart event scripts failed. But I think that is still the best we can do for now. Reasoning:
- Worst case is that VS Code startup (i.e.
init_tools.sh) fails, so that the Web UI for the workspace is unavailable. But the workspace would be unusable anyway even if we let this cause the postStart hook to fail. The difference would be that the workspace would be in aFailedorErrorstate, with no way to debug it because the pod stopped and there's no logs (unless you somehow manually mounted the persistent volume containing the logs in some other container). At least if we allowed the workspace to start, you can exec or ssh (assuming sshd worked) into the workspace to look at the logs and debug the problem. - As long as
sshdstart worked, the workspace and logs are still accessible for debugging by users. - And even if
sshdstart failed, cluster admins can still exec into the workspace container and see the logs. - Arguably, this is no worse than the current behavior, where we have the VS Code
init_tools.shand startsshdcommands in the container'scommand/args, and if they fail, the container doesn't start anyway. So, it's essentially the same behavior.
Future improvements:
- Going along with the Che team's thinking in the issue comment referenced above, we can add mechanisms to detect these sorts of failures and communicate them to the user
- We could modify all postStart events to drop a "successful" status file on the persisted volume if their exit code was zero. - Then, have some mechanism to ensure that every script exited successfully, and communicate that back to Rails somehow. Perhaps by adding an additional custom
actual_statebetweenStartingandRunning, which will be communicated to Rails via the agent.
Other competitors' workspace logging approaches
- Codespaces: https://docs.github.com/en/codespaces/troubleshooting/github-codespaces-logs#creation-logs
/workspaces/.codespaces/.persistedshare/creation.log- Others at
/tmp/codespaces_logs
- Coder: https://coder.com/docs/user-guides/workspace-management#logging
/tmp/coder-startup-script.log
- GitPod: no docs found, but source reference: https://github.com/gitpod-io/gitpod/blob/main/components/supervisor/pkg/supervisor/docker.go#L41
- Some found at
/workspace/.vscode-remote/data/logs/20250422T082723/exthost1, e.g.remoteexthost.log - Others found at
/workspace/.gitpod/logs
- Some found at
Attempts at more robust logging
The current implementation behaves like this, which is exactly how Eclipse Che currently does it:
- All STDOUT goes ONLY to
poststart-stdout.txt - All STDERR goes ONLY to
poststart-stderr.txt
This works, but the downside is to get all the chronological output in order, you have to tail the files together or correlate the timestamps.
I had wanted to implement it like this:
- All STDOUT AND STDERR goes to
poststart.txt - All STDERR goes ONLY to
poststart-stderr.txt
This would allow you to tail poststart.txt to get ALL chronological output in order, but still use poststart-stderr.txt to easily determine if any STDERR was output.
Unfortunately, I could not get this to work with the following constraint that it should works with plain POSIX/sh, no Bash dependency (i.e., no output redirection)
Here's the various AI-assisted attempts I made, but all of these made the poststart hang. Even the last one, which DID use bash and output redirection.
The strange thing is that:
- I proved that the poststart hook DID exit with a zero return code, but it still hung, I assume because of background processes
- AI claimed that a zero return code is all that matters, that background processes don't, but this was not my observed behavior.
- But with the current simple/minimal redirection, the poststart hook DOES exit successfully, even with background processes still running.
- So, I have no clue what was going on, but I eventually just gave up, and stuck with the same two-file implementation as Che.
preserved attempts for future reference
``` #{ # rm -f /tmp/stderr_fifo # mkfifo /tmp/stderr_fifo # cat /tmp/stderr_fifo | tee -a "${GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" & # reader_pid=$! # echo "DEBUGGING - BEFORE RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # nohup "%s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>/tmp/stderr_fifo & # echo "DEBUGGING - AFTER RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # echo "" > /tmp/stderr_fifo # wait $reader_pid # echo "DEBUGGING - AFTER WAIT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # rm -f /tmp/stderr_fifo #} & #echo "DEBUGGING - AFTER SHELL BLOCK RUN SCRIPT, ABOUT TO EXIT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" #exit 0#mkdir -p "{GL_WORKSPACE_LOGS_DIR}" #ln -sf "{GL_WORKSPACE_LOGS_DIR}" /tmp
#{
rm -f /tmp/stderr_fifo
mkfifo /tmp/stderr_fifo
cat /tmp/stderr_fifo | tee -a "{GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "{GL_WORKSPACE_LOGS_DIR}/poststart.log" &
cat_pid=$!
nohup "%<run_poststart_commands_script_file_path>s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>/tmp/stderr_fifo &
script_pid=$!
{
wait $script_pid
echo "" > /tmp/stderr_fifo
wait $cat_pid
rm -f /tmp/stderr_fifo
} &
#} &
#mkdir -p "{GL_WORKSPACE_LOGS_DIR}" #ln -sf "{GL_WORKSPACE_LOGS_DIR}" /tmp
Check if bash is available
#if [ -x /usr/bin/bash ]; then
echo "DEBUGGING - bash BEFORE RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log"
# Use bash with process substitution to capture both stdout and stderr to poststart.log, and only stderr to poststart-stderr.log
# NOTE: THIS STILL HUNG THE POSTSTART HOOK. I'm GUESSING IT WAS THE 'tee' COMMAND???
nohup /usr/bin/bash -c "%<run_poststart_commands_script_file_path>s >> "{GL_WORKSPACE_LOGS_DIR}/poststart.log\" 2> >(tee -a \"{GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log")" &
echo "DEBUGGING - bash AFTER RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log"
#else
# If bash is not available, fallback to simple redirection to capture both stdout and stderr to poststart.log
nohup "%<run_poststart_commands_script_file_path>s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>&1 (closed) &
#fi #echo "DEBUGGING - run_poststart_commands_script_file_path script run, ABOUT TO EXIT kubernetes_poststart_hook_command.sh " >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" #exit 0
</details>
<!-- DO NOT TOUCH BELOW -->