
RCA: Project Main Landing Pages throwing 500 Errors due to invalid License Key

PRODUCTION INCIDENT: 2021-09-03: 500 Errors when accessing projects (License key errors)


Incident Summary

Some project main landing pages were throwing 500 errors if the project had a LICENSE file containing certain license types. Browsing to /issues and other sub-pages worked fine; however, browsing to the project's LICENSE.txt did not.

Timeline

View recent production deployment and configuration events / GCP events (internal only)

All times UTC.

2021-09-03

Feature Flag Activity Timeline

2021-09-01

  • 10:12 the feature flag was enabled in the QA staging environment.

2021-09-02

  • 13:45 the feature flag was enabled in production for 25% of the total users

2021-09-03

  • 9:?? enabled for 5% of the total users
  • 9:39 enabled for 50% of the total users
  • 12:02 enabled for 100% of the total users
  • 16:45 the feature flag was disabled

Corrective Actions

Actions that need to be taken to move this MR forward

  1. Prevent the page from failing completely when it cannot identify a license name (see the sketch after this list)
  2. Add feature specs using different licenses
  3. More coming soon....
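
To illustrate the first item, here is a minimal sketch in Go (the language of the Gitaly FindLicense implementation) of falling back to a generic result instead of failing when no license name can be identified. The names here (findLicenseSafe, licenseInfo, the detector signature) are illustrative assumptions, not Gitaly's actual API.

```go
// Hypothetical sketch, not Gitaly's actual code: fall back to a generic
// result instead of erroring out when a LICENSE file cannot be identified.
package main

import (
	"errors"
	"fmt"
)

// licenseInfo is an illustrative response shape; the real fields differ.
type licenseInfo struct {
	ShortName string
	Name      string
}

// findLicenseSafe wraps a detector so an unrecognized LICENSE file yields a
// placeholder result rather than an error that surfaces as a 500 on the page.
func findLicenseSafe(detect func(contents string) (licenseInfo, error), contents string) licenseInfo {
	info, err := detect(contents)
	if err != nil || info.Name == "" {
		// Unknown or unparseable license: let the page render with a
		// generic label instead of failing completely.
		return licenseInfo{ShortName: "other", Name: "Other"}
	}
	return info
}

func main() {
	// A stub detector that only recognises MIT, to exercise both paths.
	detect := func(contents string) (licenseInfo, error) {
		if contents == "MIT License ..." {
			return licenseInfo{ShortName: "mit", Name: "MIT License"}, nil
		}
		return licenseInfo{}, errors.New("license not identified")
	}

	fmt.Println(findLicenseSafe(detect, "MIT License ..."))  // {mit MIT License}
	fmt.Println(findLicenseSafe(detect, "custom terms ...")) // {other Other}
}
```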

Actions that need to be taken to prevent similar incidents

  1. Introduce contract testing for Gitaly to prevent Go/Rails mismatches (a test sketch follows this list)

  2. More coming soon....
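
Below is a minimal sketch of what such a contract test could look like, under the assumption that the contract is "every FindLicense response carries a usable license name". The detectLicense function and licenseResponse type are hypothetical stand-ins for the real Go implementation and its response message, not Gitaly's API.

```go
// Hypothetical contract test sketch; detectLicense and licenseResponse are
// stand-ins for the real Go implementation and its response message.
package license

import "testing"

type licenseResponse struct {
	ShortName string
	Name      string
}

// detectLicense stubs the implementation under test. In this sketch it always
// returns a non-empty Name, falling back to "Other" for unknown input.
func detectLicense(contents string) licenseResponse {
	if contents == "MIT License ..." {
		return licenseResponse{ShortName: "mit", Name: "MIT License"}
	}
	return licenseResponse{ShortName: "other", Name: "Other"}
}

// TestFindLicenseContract encodes the expectation the Rails consumer has of
// the Go implementation: every response carries a usable license name, even
// for unrecognised or empty LICENSE files.
func TestFindLicenseContract(t *testing.T) {
	samples := map[string]string{
		"recognised license":   "MIT License ...",
		"unrecognised license": "custom terms that match nothing",
		"empty file":           "",
	}
	for name, contents := range samples {
		t.Run(name, func(t *testing.T) {
			resp := detectLicense(contents)
			if resp.Name == "" || resp.ShortName == "" {
				t.Errorf("contract violated for %s: got %+v", name, resp)
			}
		})
	}
}
```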

Actions that need to be taken to reduce the mitigation time

  1. Create runbook to aid in troubleshooting this type of error
    • @smcgivern, can you create an issue for this and link it here?
  2. More coming soon....

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

Incident Review

Summary

  1. Service(s) affected: Some Project Main Landing Pages were not available
  2. Team attribution: groupgitaly
  3. Time to detection: 1 day, 2 hours, 6 minutes and 0 seconds
    • This represents the time from when the feature flag was enabled for 25% of the users to when an incident was officially created.
  4. Minutes downtime: 1 day, 3 hours, 42 minutes + 1 hour and 53 minutes if the user had invalid cache values
    • This is measured from the time the feature flag was enabled to the time it was disabled. Users might not have experienced the errors immediately; it depends on when they attempted to access their project main page. This would be the worst-case scenario.

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Both internal and external customers could have been impacted
    2. 26 Support Tickets were created from this incident
    3. 5920 projects were affected
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Customers were not able to access their project landing pages.
  3. How many customers were affected?
    • Users (but not counting unauthenticated): 1843
    • Unique IP addresses: 4061
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?

What were the root causes?

  • The feature flag enabled the Go implementation of FindLicense, which behaved differently from the previous implementation for certain LICENSE files; when a license name could not be identified, the project main landing page failed completely and returned a 500 error.

Incident Response Analysis

  1. How was the incident detected?
    • The incident was first detected by customers through a forum
    • Next, customers from the forum created an issue
    • Then support tickets started to come in (this is when GitLab first detected the incident)
  2. How could detection time be improved?
    • Detection time would have been reduced if we had seen the customer issue sooner. When errors are reported by users but the issues are not given labels, they are not triaged or seen. Should someone be monitoring reported issues that have not been assigned any labels?
    • Detection time would also have been reduced if monitoring had been performed after setting the feature flag to 25%, and again after setting it to 50%
  3. How was the root cause diagnosed?
    • Through examination of Sentry and Kibana logs, and by reviewing recent ChatOps feature flag rollouts.
  4. How could time to diagnosis be improved?
  5. How did we reach the point where we knew how to mitigate the impact?
  6. How could time to mitigation be improved?
    • After we disabled the feature flag, we had to clear out the caches, and the time we took to do that (1 hour and 53 minutes) could have been improved by:
  7. What went well?
    • The MR author used a feature flag, so we were able to quickly disable the feature
    • Engineers quickly identified the feature flag that needed to be disabled (within 16 minutes of learning about the incident)

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    • We had a customer issue and a customer forum inquiry that could have reduced the impact of this incident. If we had seen these, we could have diagnosed the problem sooner and disabled the feature flag. Unfortunately, while users were getting errors, we were rolling the feature flag out to more and more users.
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes, this was triggered by this code change: [Feature flag] Enable Go implementation of FindLicense

Lessons Learned

  • ...

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

cc @timzallmann
