
RCA: Project Main Landing Pages throwing 500 Errors due to invalid License Key

PRODUCTION INCIDENT: 2021-09-03: 500 Errors when accessing projects (License key errors)


Incident Summary

Some project main landing pages were throwing 500 errors if the project had a LICENSE file containing certain license types. Browsing to /issues and other sub-pages worked fine; however, browsing to the project's LICENSE.txt did not.

Timeline

View recent production deployment and configuration events / GCP events (internal only)

All times UTC.

2021-09-03

Feature Flag Activity Timeline

2021-09-01

  • 10:12 the feature flag was enabled in the QA staging environment.

2021-09-02

  • 13:45 the feature flag was enabled in production for 25% of the total users

2021-09-03

  • 9:?? enabled for 5% of the total users
  • 9:39 enabled for 50% of the total users
  • 12:02 enabled for 100% of the total users
  • 16:45 the feature flag was disabled

Corrective Actions

Actions that need to be taken to move this MR forward

  1. Prevent the page from failing completely when it cannot identify a license name (see the sketch after this list)
  2. Add feature specs using different licenses
  3. More coming soon....
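
To illustrate the first item, here is a minimal sketch in Go (the language of the Gitaly FindLicense implementation) of falling back to a generic result instead of failing when no license name can be identified. The names here (findLicenseSafe, licenseInfo, the detector signature) are illustrative assumptions, not Gitaly's actual API.

```go
// Hypothetical sketch, not Gitaly's actual code: fall back to a generic
// result instead of erroring out when a LICENSE file cannot be identified.
package main

import (
	"errors"
	"fmt"
)

// licenseInfo is an illustrative response shape; the real fields differ.
type licenseInfo struct {
	ShortName string
	Name      string
}

// findLicenseSafe wraps a detector so an unrecognized LICENSE file yields a
// placeholder result rather than an error that surfaces as a 500 on the page.
func findLicenseSafe(detect func(contents string) (licenseInfo, error), contents string) licenseInfo {
	info, err := detect(contents)
	if err != nil || info.Name == "" {
		// Unknown or unparseable license: let the page render with a
		// generic label instead of failing completely.
		return licenseInfo{ShortName: "other", Name: "Other"}
	}
	return info
}

func main() {
	// A stub detector that only recognises MIT, to exercise both paths.
	detect := func(contents string) (licenseInfo, error) {
		if contents == "MIT License ..." {
			return licenseInfo{ShortName: "mit", Name: "MIT License"}, nil
		}
		return licenseInfo{}, errors.New("license not identified")
	}

	fmt.Println(findLicenseSafe(detect, "MIT License ..."))  // {mit MIT License}
	fmt.Println(findLicenseSafe(detect, "custom terms ...")) // {other Other}
}
```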

Actions that need to be taken to prevent similar incidents

  1. Introduce contract testing for Gitaly to prevent Go/Rails mismatches (a test sketch follows this list)

  2. More coming soon....
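
Below is a minimal sketch of what such a contract test could look like, under the assumption that the contract is "every FindLicense response carries a usable license name". The detectLicense function and licenseResponse type are hypothetical stand-ins for the real Go implementation and its response message, not Gitaly's API.

```go
// Hypothetical contract test sketch; detectLicense and licenseResponse are
// stand-ins for the real Go implementation and its response message.
package license

import "testing"

type licenseResponse struct {
	ShortName string
	Name      string
}

// detectLicense stubs the implementation under test. In this sketch it always
// returns a non-empty Name, falling back to "Other" for unknown input.
func detectLicense(contents string) licenseResponse {
	if contents == "MIT License ..." {
		return licenseResponse{ShortName: "mit", Name: "MIT License"}
	}
	return licenseResponse{ShortName: "other", Name: "Other"}
}

// TestFindLicenseContract encodes the expectation the Rails consumer has of
// the Go implementation: every response carries a usable license name, even
// for unrecognised or empty LICENSE files.
func TestFindLicenseContract(t *testing.T) {
	samples := map[string]string{
		"recognised license":   "MIT License ...",
		"unrecognised license": "custom terms that match nothing",
		"empty file":           "",
	}
	for name, contents := range samples {
		t.Run(name, func(t *testing.T) {
			resp := detectLicense(contents)
			if resp.Name == "" || resp.ShortName == "" {
				t.Errorf("contract violated for %s: got %+v", name, resp)
			}
		})
	}
}
```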

Actions that need to be taken to reduce the mitigation time

  1. Create runbook to aid in troubleshooting this type of error
    • @smcgivern, can you create an issue for this and link it here?
  2. More coming soon....

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

Incident Review

Summary

  1. Service(s) affected: Some Project Main Landing Pages were not available
  2. Team attribution: groupgitaly
  3. Time to detection: 1 day, 2 hours, 6 minutes and 0 seconds
    • This represents the time from when the feature flag was enabled for 25% of the users to when an incident was officially created.
  4. Minutes downtime: 1 day, 3 hours, 42 minutes + 1 hour and 53 minutes if the user had invalid cache values
    • This is measured from the time the feature flag was enabled to the time it was disabled. Users might not have experienced the errors immediately; it depends on when they attempted to access their project main page. This would be the worst-case scenario.

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Both internal and external customers could have been impacted
    2. 26 Support Tickets were created from this incident
    3. 5920 projects were affected
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...) Customers were not able to access their project landing pages.
  3. How many customers were affected?
    • Users (but not counting unauthenticated): 1843
    • Unique IP addresses: 4061
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?

What were the root causes?

  • The feature flag enabled the Go implementation of FindLicense, which behaved differently from the previous implementation for certain LICENSE files; when a license name could not be identified, the project main landing page failed completely and returned a 500 error.

Incident Response Analysis

  1. How was the incident detected?
    • The incident was first detected by customers through a forum
    • Next, customers from the forum created an issue
    • Then support tickets started to come in (this is when GitLab first detected the incident)
  2. How could detection time be improved?
    • Detection time would have been reduced if we had seen the customer issue sooner. When errors are reported by users but the issues are not given labels, they are not triaged or seen. Should someone be monitoring reported issues that have not been assigned any labels?
    • Detection time would also have been reduced if monitoring had been performed after setting the feature flag to 25%, and again after setting it to 50%
  3. How was the root cause diagnosed?
    • Through examination of Sentry and Kibana logs, and by reviewing recent ChatOps feature flag rollouts.
  4. How could time to diagnosis be improved?
  5. How did we reach the point where we knew how to mitigate the impact?
  6. How could time to mitigation be improved?
    • After we disabled the feature flag, we had to clear out the caches, and the time we took to do that (1 hour and 53 minutes) could have been improved by:
  7. What went well?
    • The MR author used a feature flag, so we were able to quickly disable the feature
    • Engineers quickly identified the feature flag that needed to be disabled (within 16 minutes of learning about the incident)

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    • We had a customer issue and a customer forum inquiry that could have reduced the impact of this incident. If we had seen these, we could have diagnosed the problem sooner and disabled the feature flag. Unfortunately, while users were getting errors, we were rolling the feature flag out to more and more users.
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes, this was triggered by this code change: [Feature flag] Enable Go implementation of FindLicense

Lessons Learned

  • ...

Guidelines

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

cc @timzallmann
