CI: improve observability of job caches
What
- Computes size information for all caches. Sizes are measured on the uncompressed caches, so they are not the same as the sizes in the GCP buckets.
- Prints this cache size info in the job logs.
- Sends the data to Datadog.
Why
- To improve observability of cache usage:
- detecting and debugging cache-related problems (e.g. OOMs) should be easier
- when experimenting (e.g. trying to optimise job speed through caching), the results of the trials should be easier to interpret reliably.
- Ultimately, this should improve the reliability and speed of pipelines.
- NB: this comes at the cost of a small increase in wall time:
- 3-5s are added per job, so ~20-25s per pipeline
- this seems like a very reasonable trade-off given the gains in observability.
How
- We add [datadog_send_job_cache_info.sh], which:
- computes the size of each cache (0 if the cache does not exist)
- prints the sizes (human-readable and in bytes) in the job log
- generates Datadog tags with the cache sizes in bytes (human-readable values would have been harder to manage in Datadog, as units could be either GB or MB)
- if [datadog-ci] is installed, sends the data to Datadog
- We modify the job definition so that a call to [datadog_send_job_cache_info.sh] is made in the [before_script] and [after_script] steps.
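For reviewers who want a feel for the steps listed above, here is a minimal sketch in POSIX shell. It is not the content of [datadog_send_job_cache_info.sh]: the function names (`cache_size_bytes`, `cache_tag`, `report_cache`) and the tag naming scheme `cache.<name>.<step>.size_bytes` are hypothetical, and the use of `datadog-ci tag --level job` is an assumption about how the data is sent. Note also that `du` reports disk usage, which can differ slightly from the apparent size of the files.

```shell
#!/bin/sh
# Illustrative sketch only: NOT the actual datadog_send_job_cache_info.sh.
# Function names and the tag naming scheme are hypothetical.

# Print the on-disk size of a cache directory in bytes (0 if it does not exist).
cache_size_bytes() {
  if [ -d "$1" ]; then
    # `du -sk` reports kibibytes; multiply to get bytes.
    echo $(( $(du -sk "$1" | cut -f1) * 1024 ))
  else
    echo 0
  fi
}

# Build a "key:value" tag carrying the size in bytes, so that Datadog never
# has to deal with mixed MB/GB units.
cache_tag() {
  # $1: cache name, $2: step ("before" or "after"), $3: cache path
  echo "cache.$1.$2.size_bytes:$(cache_size_bytes "$3")"
}

# Log the size of one cache and, if datadog-ci is available, send the tag.
report_cache() {
  if [ -d "$3" ]; then
    echo "cache $1 ($2): $(du -sh "$3" | cut -f1) ($(cache_size_bytes "$3") bytes)"
  else
    echo "cache $1 ($2): absent (0 bytes)"
  fi
  if command -v datadog-ci >/dev/null 2>&1; then
    # Assumes the `datadog-ci tag` subcommand; never fail the job on errors.
    datadog-ci tag --level job --tags "$(cache_tag "$1" "$2" "$3")" || true
  fi
}
```

With these helpers, a job could call something like `report_cache dune before _build` in [before_script] and `report_cache dune after _build` in [after_script]; the cache name and path in this example are placeholders.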
Manually testing the MR
- make -C ci check
- Test pipelines (NB: run before the last commit):
- 7db896c0 (master_branch, schedule_extended_test, grafazos.daily, teztale.daily)
- in the schedule_extended_test pipeline, the failures are the usual ones and are not related to this MR, except for [oc.script:test-gen-genesis]
- this is fixed in the fourth commit e2436f54 (cf. this test pipeline: https://gitlab.com/tezos/tezos/-/pipelines/1961017774)
- octez major release: https://gitlab.com/nomadic-labs/tezos/-/pipelines/1961048515
- if useful, the previous one was here: some failures are due to the fact that I wrongly pushed two tags (minor and major) in the same tag; all failures should be unrelated to the MR
- Datadog: draft dashboard
Checklist
- Document the interface of any function added or modified (see the coding guidelines)
- Document any change to the user interface, including configuration parameters (see node configuration)
- Provide automatic testing (see the testing guide).
- For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
- Select suitable reviewers using the Reviewers field below.
- Select as Assignee the next person who should take action on that MR
Edited by Bruno B