Source code page fails to load if job-artifact-related `getBlobInfo` GraphQL call times out
If you've been sent to this issue from the GitLab 16 upgrade notes, see "Verify if you're affected" and "Workaround/resolution" below.
### Summary
A customer upgraded from GitLab 16.7 to 16.11 and, during validation, found issues viewing source code in projects.

The root cause was found to be a `getBlobInfo` GraphQL call that uses `lib/gitlab/code_navigation_path.rb`:

```ruby
artifact = pipeline.job_artifacts.with_file_types(['lsif']).last
```

The request for LSIF artifacts was subject to a query plan flip in the database, which caused the query to take around 1 minute 45 seconds rather than milliseconds; the 60-second timeout in Rails then cancelled the request.
### Steps to reproduce
### Example Project
### What is the current bug behavior?
The source code view fails to load entirely if a job artifact request fails.
### What is the expected correct behavior?
The failure should be contained to the part of the page that would render the artifact, not block all functionality on the page.
### Relevant logs and/or screenshots
### Output of checks
GitLab 16.11.5
### Possible fixes
### Verify if you're affected
- Source code pages in projects take a long time to load, failing after 60 seconds.

  Example: the page gets this far:

  Below that, instead of the source code for the file you've selected, there's a spinner.

  After sixty seconds, the spinner stops and a red error is shown. Source code does not load in the UI, but access via the Git command line works.
- Connect to the database console.

  Change the display output so each row is rendered vertically:

  ```sql
  \x
  ```

  Run:

  ```sql
  SELECT pid, usename, application_name, query_start, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
  WHERE state <> 'idle' AND query_start < now() - interval '10 seconds' LIMIT 10;
  ```

  If your instance is having this issue, you'll usually see multiple hits for:

  ```sql
  SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts" INNER JOIN "p_ci_builds" ON "p_ci_job_artifacts"."job_id" = "p_ci_builds"."id" WHERE "p_ci_builds"."type" = 'Ci::Build' AND "p_ci_builds"."commit_id" = 123456 AND "p_ci_builds"."partition_id" = 100 AND "p_ci_job_artifacts"."partition_id" = 100 AND "p_ci_job_artifacts"."file_type" = 15 ORDER BY "p_ci_job_artifacts"."id" DESC LIMIT 1
  ```

  If no queries are returned, try to load some source code pages in the UI and/or reduce the `interval` to fewer seconds.
- In cases we've observed, PostgreSQL selects an inefficient query plan using the primary keys on `p_ci_builds` and `p_ci_job_artifacts`, typically requiring it to consider every record in one of those tables. These tables are large on affected instances, so this query plan doesn't complete before the 60-second Rack timeout in the GitLab application.

  View the query plan by running `EXPLAIN` on one of the queries returned from the `pg_stat_activity` query above:

  ```sql
  SET statement_timeout = 0;
  EXPLAIN SELECT "p_ci_job_artifacts".* FROM "p_ci_job_artifacts" [...]
  ```

  The incorrect query plan comprises two `Index Scan` operations over the primary key indexes (`_pkey`) on the two tables, which then filter all the records found against the relevant parts of the query. The database should instead use more specific indexes to quickly narrow down the relevant records:

  ```
  ->  Index Scan Backward using ci_job_artifacts_pkey on ci_job_artifacts p_ci_job_artifacts
        Index Cond: (partition_id = 100)
        Filter: (file_type = 15)
  ->  Index Scan using ci_builds_pkey on ci_builds p_ci_builds
        Index Cond: ((id = p_ci_job_artifacts.job_id) AND (partition_id = 100))
        Filter: (((type)::text = 'Ci::Build'::text) AND (commit_id = 123456))
  ```
### Workaround/resolution
PostgreSQL considers multiple query plans, scores each, and selects the one it estimates will be quickest. The following steps aim to correct those estimates so that efficient query plans are selected.
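Before making changes, it can help to check how stale the planner's statistics are. A minimal sketch, run from the database console; the `LIKE` patterns are an assumption intended to match both the partitioned tables and their partitions, so adjust them to your schema:

```sql
-- Sketch: check when statistics were last refreshed for the affected tables.
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname LIKE '%ci_builds%' OR relname LIKE '%ci_job_artifacts%'
ORDER BY relname;
```

Old or NULL `last_analyze`/`last_autoanalyze` timestamps on large, busy tables make a stale-statistics plan flip more plausible.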
- Rebuild statistics for the two affected tables:

  ```sql
  SET statement_timeout = 0;
  ANALYZE VERBOSE p_ci_builds;
  ANALYZE VERBOSE p_ci_job_artifacts;
  ```

  Check whether this resolves the issue by using `EXPLAIN` (above) to determine if a different query plan is picked. If it does not resolve the issue, try it again: statistics are built from a random sample of rows, so repeated runs can produce different results.

  Do not perform the remaining steps if `ANALYZE` changes the query plan and/or the issue is resolved. The next two steps take much longer and are more intrusive: they will cause performance issues.
- Perform housekeeping on the two tables:

  ```sql
  VACUUM (ANALYZE, VERBOSE, FREEZE, INDEX_CLEANUP ON) public.p_ci_builds;
  VACUUM (ANALYZE, VERBOSE, FREEZE, INDEX_CLEANUP ON) public.p_ci_job_artifacts;
  ```

  Goals:

  - Allow reuse of disk space occupied by updated or deleted rows.
  - Update the data statistics used by the PostgreSQL query planner.
  - Update the visibility map, which speeds up index-only scans.
  - Ensure index cleanup. In PostgreSQL 13 "this is normally the desired behavior", but it defaults to `auto` in PostgreSQL 14.

  Note: `VACUUM` creates a substantial amount of I/O traffic, which can cause poor performance for other active sessions.

  Recheck the `EXPLAIN` output to see if this has resolved the issue.
- Repeat the first step (`ANALYZE`) and see if the issue is now resolved.
- Rebuild the indexes on these tables, and then ensure statistics are rebuilt as well:

  - This will also involve a lot of I/O, and is likely to take hours.
  - Take steps to protect this session from being cancelled: for example, run it in `screen`. If the re-index is cancelled, it will leave behind invalid partial indexes, suffixed with `_ccnew*`.

  ```sql
  SET statement_timeout = 0;
  REINDEX TABLE CONCURRENTLY p_ci_builds;
  ANALYZE VERBOSE p_ci_builds;
  REINDEX TABLE CONCURRENTLY p_ci_job_artifacts;
  ANALYZE VERBOSE p_ci_job_artifacts;
  ```
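If a `REINDEX ... CONCURRENTLY` run is interrupted, the leftover invalid indexes can be listed with a catalog query along these lines (a sketch; the `_ccnew` name matching reflects PostgreSQL's naming convention for these leftovers):

```sql
-- Sketch: list invalid indexes left behind by a cancelled REINDEX CONCURRENTLY.
SELECT n.nspname AS schema, c.relname AS index_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid
  AND c.relname LIKE '%ccnew%';
```

Any such leftovers can be removed with `DROP INDEX CONCURRENTLY` before retrying the reindex.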