Source code page fails to load if job-artifact-related `getBlobInfo` GraphQL call times out
If you've been sent to this issue from the GitLab 16 upgrade notes, see "Verify if you're affected" and "Workaround/resolution" below.
### Summary
A customer upgraded from GitLab 16.7 to 16.11 and, during validation, found issues viewing source code in projects.

The root cause was found to be a `getBlobInfo` GraphQL call that uses `lib/gitlab/code_navigation_path.rb`:

```ruby
artifact = pipeline.job_artifacts.with_file_types(['lsif']).last
```

The request for LSIF artifacts was subject to a query plan flip in the database, which caused the query to take around 1 minute 45 seconds rather than milliseconds; the 60-second timeout in Rails then cancelled the request.
### Steps to reproduce
### Example Project
### What is the current bug behavior?
The source code view fails to load entirely if a job artifact request fails.
### What is the expected correct behavior?
The failure should be contained to the part of the page that would render the artifact, not block all functionality on the page.
### Relevant logs and/or screenshots
### Output of checks
GitLab 16.11.5
### Possible fixes
### Verify if you're affected
- Source code pages in projects take a long time to load, failing after 60 seconds.

  Example: the page gets this far:

  Below that, instead of the source code for the file you've selected, there's a spinner.

  After sixty seconds, the spinner stops and a red error is shown. Source code does not load in the UI, but access via the Git command line works.
- Connect to the database console.

  Change the display output so each row is rendered vertically:

  ```sql
  \x
  ```

  Run:

  ```sql
  SELECT pid, usename, application_name, query_start, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle' AND query_start < now() - interval '10 seconds' LIMIT 10;
  ```

  If your instance is having this issue, you'll usually see multiple hits for:

  ```sql
  SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts" INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id" WHERE "p_ci_builds"."type" = 'Ci::Build' AND "p_ci_builds"."commit_id" = 123456 AND "p_ci_builds"."partition_id" = 100 AND "p_ci_job_artifacts"."partition_id" = 100 AND "p_ci_job_artifacts"."file_type" = 15 ORDER BY "p_ci_job_artifacts"."id" DESC LIMIT 1
  ```

  If no queries are returned, try to load some source code pages in the UI and/or reduce the `interval` to fewer seconds.
- In cases we've observed, PostgreSQL selects an inefficient query plan using the primary keys on `p_ci_builds` and `p_ci_job_artifacts`, typically requiring it to consider every record in one of those tables. These tables are large on affected instances, so this query plan doesn't complete before the 60-second Rack timeout in the GitLab application.

  View the query plan by running `EXPLAIN` on one of the queries returned from the `pg_stat_activity` query above:

  ```sql
  SET statement_timeout = 0;
  EXPLAIN SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts" [...]
  ```

  The incorrect query plan comprises two `Index Scan` operations over the primary key indexes (`_pkey`) on the two tables, which then filter all the records found against the relevant parts of the query. The database should instead use more specific indexes to quickly narrow down the relevant records:

  ```
  ->  Index Scan Backward using ci_job_artifacts_pkey on ci_job_artifacts p_ci_job_artifacts
        Index Cond: (partition_id = 100)
        Filter: (file_type = 15)
  ->  Index Scan using ci_builds_pkey on ci_builds p_ci_builds
        Index Cond: ((id = p_ci_job_artifacts.job_id) AND (partition_id = 100))
        Filter: (((type)::text = 'Ci::Build'::text) AND (commit_id = 123456))
  ```
### Workaround/resolution
PostgreSQL considers multiple query plans, scores each, and selects the one it estimates will be quickest. The following steps aim to correct those estimates so that efficient query plans are selected.
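Before making changes, it can help to check how stale the planner's statistics are. A minimal sketch, run from the database console; the `LIKE` patterns are an assumption intended to match both the partitioned tables and their partitions, so adjust them to your schema:

```sql
-- Sketch: check when statistics were last refreshed for the affected tables.
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE '%ci_builds%' OR relname LIKE '%ci_job_artifacts%'
ORDER BY relname;
```

Old or NULL `last_analyze`/`last_autoanalyze` timestamps on large, busy tables make a stale-statistics plan flip more plausible.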
- Rebuild statistics for the two affected tables:

  ```sql
  SET statement_timeout = 0;
  ANALYZE VERBOSE p_ci_builds;
  ANALYZE VERBOSE p_ci_job_artifacts;
  ```

  Check whether this resolves the issue by using `EXPLAIN` (above) to determine if a different query plan is picked. If it does not resolve the issue, try it again: statistics are built from a random sample of rows, so repeated runs can produce different results.

  Do not perform the remaining steps if `ANALYZE` changes the query plan and/or the issue is resolved. The next two steps take much longer and are more intrusive: they will cause performance issues.
- Perform housekeeping on the two tables:

  ```sql
  VACUUM (ANALYZE, VERBOSE, FREEZE, INDEX_CLEANUP ON) public.p_ci_builds;
  VACUUM (ANALYZE, VERBOSE, FREEZE, INDEX_CLEANUP ON) public.p_ci_job_artifacts;
  ```

  Goals:

  - Allow reuse of disk space occupied by updated or deleted rows.
  - Update the data statistics used by the PostgreSQL query planner.
  - Update the visibility map, which speeds up index-only scans.
  - Ensure index cleanup. In PostgreSQL 13 "this is normally the desired behavior", but it defaults to `auto` in PostgreSQL 14.

  Note: `VACUUM` creates a substantial amount of I/O traffic, which can cause poor performance for other active sessions.

  Recheck the `EXPLAIN` output to see if this has resolved the issue.
- Repeat the first step (`ANALYZE`) and see if the issue is now resolved.
- Rebuild the indexes on these tables, and then ensure statistics are rebuilt as well:

  - This will also involve a lot of I/O, and is likely to take hours.
  - Take steps to protect this session from being cancelled: for example, run it in `screen`. If the re-index is cancelled, it will leave behind invalid partial indexes, suffixed with `_ccnew*`.

  ```sql
  SET statement_timeout = 0;
  REINDEX TABLE CONCURRENTLY p_ci_builds;
  ANALYZE VERBOSE p_ci_builds;
  REINDEX TABLE CONCURRENTLY p_ci_job_artifacts;
  ANALYZE VERBOSE p_ci_job_artifacts;
  ```
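If a `REINDEX ... CONCURRENTLY` run is interrupted, the leftover invalid indexes can be listed with a catalog query along these lines (a sketch; the `_ccnew` name matching reflects PostgreSQL's naming convention for these leftovers):

```sql
-- Sketch: list invalid indexes left behind by a cancelled REINDEX CONCURRENTLY.
SELECT n.nspname AS schema, c.relname AS index_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid
  AND c.relname LIKE '%ccnew%';
```

Any such leftovers can be removed with `DROP INDEX CONCURRENTLY` before retrying the reindex.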