[go: up one dir, main page]

Convert existing Workspaces internal script logic to postStart hooks

MRs:

  1. Convert existing Workspaces internal script log... (!182354 - merged) • Chad Woolley • 18.0 (Branch https://gitlab.com/gitlab-org/gitlab/-/commits/caw-ws-refactor-component-logic-7?ref_type=heads)
  2. Follow-up fix: Change workspace scripts volume permission to 555 (!190568 - merged) • Chad Woolley • 18.0

Description

As part of Startup scripts for Remote Development workspac... (&15602), we want to move all workspace container script processing to the postStart mechanism. Both internal scripts (e.g. starting sshd and VS Code servers) as well as user-defined postStart events in the workspace's devfile.

This will make the handling of the scripts easier to reason about, and also allow explicit control over the ordering of the various scripts - in relation to each other, and in relation to the startup lifecycle of the pod and its containers.

Non-Goals

This issue does not implement the support for user-provided devfile postStart hooks. That will be done as part of Add support for devfile postStart event in Work... (#505988 - closed)

Acceptance criteria

  • Existing behavior of auto-starting sshd and VS Code server should still work
  • Container images with no background/daemonized Entrypoint/Cmd processes should still be usable to create a workspace, because the tail -f /dev/null override will keep them running.
  • Multi-container workspaces still work (commands/args of non-main containers should still work)
  • Verify behavior of non-main containers with no command/args remains unchanged (should have never worked, I believe?)
  • Verify existing workspaces are not restarted when upgrading to new code (golden master spec passing, which should confirm this)
  • Verify logging of stderr/stdout of internal scripts works (see Validation steps on MR for details)
  • Verify sysbox workspaces work

References

Implementation plan

Currently these internal scripts (sshd and VS Code server startup) are handled by overriding the container's Entrypoint/Cmd (command/args in kubernetes spec).

When they are moved to the kubernetes postStart hook, we will need to replace the Entrypoint/Cmd override with a tail -f /dev/null command to keep the container alive.

So, after this issue, we will still be overwriting any existing Entrypoint/Cmd of the workspace's container.

But we will be in a position to easily make this option configurable by the user, as it is in the devcontainer spec via the "overrideCommand": false option. This will be implemented by Support preserving default ENTRYPOINT/CMD for c... (#392853 - closed)

Notes on logging

Eclipse Che logging approach

GitLab implementation is based on how Eclipse Che / DevworkspaceOperator handle the devfile postStart events, since that is the closest to us: It is based on the devfile standard, and based on kubernetes.

Here is the original devworkspace-operator issue which implements the "Single kubernetes poststart hook invoking multiple devfile poststart events" approash:

However, this approach has some ongoing and unresolved issues with exposing these logs to the user in the event of a failure in a poststart event, which in turn causes the pod to fail starting. These problems are described in this currently-open Eclipse Che issue:

https://github.com/eclipse-che/che/issues/23404

Apparently this is due to changes since Kubernetes version 1.26, that prevent the output of PostStart lifecycle hooks from being displayed in the Kubernetes event. See relevant links:

This issue is unresolved as of 2 weeks ago on 2025-04-22. See the following comment:

So, it seems like we can take the approach suggested in that comment, and force the postStart hook to always exit successfully.

This may result in the workspace being in a broken/inconsistent state if any of the actual poststart event scripts failed. But I think that is still the best we can do for now. Reasoning:

  1. Worst case is that VS Code startup (i.e. init_tools.sh) fails, so that the Web UI for the workspace is unavailable. But the workspace would be unusable anyway even if we let this cause the postStart hook to fail. The difference would be that the workspace would be in a Failed or Error state, with no way to debug it because the pod stopped and there's no logs (unless you somehow manually mounted the persistent volume containing the logs in some other container). At least if we allowed the workspace to start, you can exec or ssh (assuming sshd worked) into the workspace to look at the logs and debug the problem.
  2. As long as sshd start worked, the workspace and logs are still accessible for debugging by users.
  3. And even if sshd start failed, cluster admins can still exec into the workspace container and see the logs.
  4. Arguably, this is no worse than the current behavior, where we have the VS Code init_tools.sh and start sshd commands in the container's command/args, and if they fail, the container doesn't start anyway. So, it's essentially the same behavior.

Future improvements:

  • Going along with the Che team's thinking in the issue comment referenced above, we can add mechanisms to detect these sorts of failures and communicate them to the user
  • We could modify all postStart events to drop a "successful" status file on the persisted volume if their exit code was zero. - Then, have some mechanism to ensure that every script exited successfully, and communicate that back to Rails somehow. Perhaps by adding an additional custom actual_state between Starting and Running, which will be communicated to Rails via the agent.

Other competitors' workspace logging approaches

Attempts at more robust logging

The current implementation behaves like this, which is exactly how Eclipse Che currently does it:

  1. All STDOUT goes ONLY to poststart-stdout.txt
  2. All STDERR goes ONLY to poststart-stderr.txt

This works, but the downside is to get all the chronological output in order, you have to tail the files together or correlate the timestamps.

I had wanted to implement it like this:

  1. All STDOUT AND STDERR goes to poststart.txt
  2. All STDERR goes ONLY to poststart-stderr.txt

This would allow you to tail poststart.txt to get ALL chronological output in order, but still use poststart-stderr.txt to easily determine if any STDERR was output.

Unfortunately, I could not get this to work with the following constraint that it should works with plain POSIX/sh, no Bash dependency (i.e., no output redirection)

Here's the various AI-assisted attempts I made, but all of these made the poststart hang. Even the last one, which DID use bash and output redirection.

The strange thing is that:

  1. I proved that the poststart hook DID exit with a zero return code, but it still hung, I assume because of background processes
  2. AI claimed that a zero return code is all that matters, that background processes don't, but this was not my observed behavior.
  3. But with the current simple/minimal redirection, the poststart hook DOES exit successfully, even with background processes still running.
  4. So, I have no clue what was going on, but I eventually just gave up, and stuck with the same two-file implementation as Che.
preserved attempts for future reference ``` #{ # rm -f /tmp/stderr_fifo # mkfifo /tmp/stderr_fifo # cat /tmp/stderr_fifo | tee -a "${GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" & # reader_pid=$! # echo "DEBUGGING - BEFORE RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # nohup "%s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>/tmp/stderr_fifo & # echo "DEBUGGING - AFTER RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # echo "" > /tmp/stderr_fifo # wait $reader_pid # echo "DEBUGGING - AFTER WAIT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" # rm -f /tmp/stderr_fifo #} & #echo "DEBUGGING - AFTER SHELL BLOCK RUN SCRIPT, ABOUT TO EXIT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" #exit 0

#mkdir -p "{GL_WORKSPACE_LOGS_DIR}" #ln -sf "{GL_WORKSPACE_LOGS_DIR}" /tmp

#{

rm -f /tmp/stderr_fifo

mkfifo /tmp/stderr_fifo

cat /tmp/stderr_fifo | tee -a "{GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "{GL_WORKSPACE_LOGS_DIR}/poststart.log" &

cat_pid=$!

nohup "%<run_poststart_commands_script_file_path>s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>/tmp/stderr_fifo &

script_pid=$!

{

wait $script_pid

echo "" > /tmp/stderr_fifo

wait $cat_pid

rm -f /tmp/stderr_fifo

} &

#} &

#mkdir -p "{GL_WORKSPACE_LOGS_DIR}" #ln -sf "{GL_WORKSPACE_LOGS_DIR}" /tmp

Check if bash is available

#if [ -x /usr/bin/bash ]; then

echo "DEBUGGING - bash BEFORE RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log"

# Use bash with process substitution to capture both stdout and stderr to poststart.log, and only stderr to poststart-stderr.log

# NOTE: THIS STILL HUNG THE POSTSTART HOOK. I'm GUESSING IT WAS THE 'tee' COMMAND???

nohup /usr/bin/bash -c "%<run_poststart_commands_script_file_path>s >> "{GL_WORKSPACE_LOGS_DIR}/poststart.log\" 2> >(tee -a \"{GL_WORKSPACE_LOGS_DIR}/poststart-stderr.log" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log")" &

echo "DEBUGGING - bash AFTER RUN SCRIPT" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log"

#else

# If bash is not available, fallback to simple redirection to capture both stdout and stderr to poststart.log

nohup "%<run_poststart_commands_script_file_path>s" >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" 2>&1 (closed) &

#fi #echo "DEBUGGING - run_poststart_commands_script_file_path script run, ABOUT TO EXIT kubernetes_poststart_hook_command.sh " >> "${GL_WORKSPACE_LOGS_DIR}/poststart.log" #exit 0

</details>

<!-- DO NOT TOUCH BELOW -->
Edited by Chad Woolley