[go: up one dir, main page]

Stuck_CI_Builds_Worker marks builds running for more than 2 days as failed

The description of this issue has been updated after further investigation was done.

Summary

  1. ci_builds.updated_at does not get updated while builds are running.

  2. If the stuck_ci_builds_worker has changed the status of a build to failed, the runner ignores this transition and does not kill the build.

Steps to reproduce

  1. Create a long running build (requiring more than 48h to complete)
  2. Set the build timeout to something high, e.g. 10 days
  3. launch the build

Expected behavior

  1. The build should complete as normal after 48h+.
  2. The stuck_ci_builds_worker should not mark running builds as failed at any point while it is running.
  3. The runner should understand that a build is marked as failed on the CI and proceed to kill it if running.

Actual behavior

  1. On the midnight the day after you start the build, it will be marked as 'failed' by the stuck_ci_builds_worker.
  2. While the build may be marked as failed, the runner will continue to be processing the build and even sending logs to CI as normal.

Relevant logs and/or screenshots

Please see previous description of the issue and further comments below.

Possible fixes

Previous description of this issue:

I use gitlab CI to start a job which takes a long time to run (up to 4 days). It starts a python script which iterates over a large number of files and performs various operations on each, then at the end it returns a 0 or 1 exit code and exists. My YML file is similar to this:

soak_test:
  stage: soak_test
  script:
    - python3 ./src/run_soak.py
  tags:
    - soak

The build timeout in runner settings is set to 10 days and the output size is also set very high.

The problem is this:

The job starts running, and everything is fine for a while (for the first couple of days). However, after an amount of time (usually more than 2000 minutes in), the job suddenly turns to 'failed' in Gitlab's builds view.

However, the job is still really running!. The python script started by gitlab-runner is still running and processing of the files continues as normal. Gitlab's CI viewer even gets updated with new logs, while the build is marked as 'failed'. The runner is sending logs as usual.

I managed to pull this out of the runner logs set in -debug mode:

time="2016-09-11T00:56:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=127991-128275 status="202 Accepted"
time="2016-09-11T00:57:14+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:57:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:58:14+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:58:44+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:59:15+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T00:59:45+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:00:15+01:00" build=4558 build-log=0-128275 build-status=running sent-log=128275-128275 status="202 Accepted"
**time="2016-09-11T01:00:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"**
time="2016-09-11T01:01:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:01:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:02:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:02:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:03:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:03:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:04:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:04:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:05:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:05:45+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:06:15+01:00" build=4558 build-log=0-128275 build-status=failed sent-log=128275-128275 status="202 Accepted"
time="2016-09-11T01:06:33+01:00" build=4558 build-log=0-128369 build-status=failed sent-log=128275-128369 status="202 Accepted"
time="2016-09-11T01:06:36+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128369-128597 status="202 Accepted"
time="2016-09-11T01:07:06+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128597-128597 status="202 Accepted"
time="2016-09-11T01:07:36+01:00" build=4558 build-log=0-128597 build-status=failed sent-log=128597-128597 status="202 Accepted"

As you can see, at some point the worker sees a change of the remote build-status from running to failed without any reason, as the job is still going on.

This job is executed on a runner configured with a shell executor on an Ubuntu 16.04 machine. An inspection of running processes on the build server confirms that nothing has exited:

root     32291  0.4  0.0  46268 10156 ?        Ssl  Sep05  38:31 /usr/bin/gitlab-ci-multi-runner --log-level debug --debug run --working-directory /ho
root      5592  0.0  0.0  56416  3368 ?        S    Sep09   0:00  \_ su -s /bin/bash gitlab-runner -c bash --login
gitlab-+  5597  0.0  0.0  12600  2032 ?        Ss   Sep09   0:00      \_ bash --login
gitlab-+  5599  0.0  0.0  12640  2172 ?        S    Sep09   0:00          \_ bash --login
gitlab-+ 26759  0.0  0.2 142904 54968 ?        S    01:45   0:14              \_ python3 ./src/run_loop.py
gitlab-+ 16409  0.7  0.1 127640 32124 ?        S    18:42   0:07                  \_ python3 /home/gitlab-runner/builds/0e61934f/0/myscript/
gitlab-+ 16412  112 35.8 20773520 7371612 ?    Sl   18:42  17:56                      \_ /usr/bin/java -jar myjarscript.jar