[go: up one dir, main page]

Scoping work related to making faster queries

Carry over from https://gitlab.com/gitlab-com/infrastructure/issues/1786 which has a long list of SQL queries (output from pg_stat_statements_history from PostgreSQL), this issue is about scoping how much work will be involved in reaching the OKR goal of

Database: Reduce the p99 of SQL timings across the board to 200 ms (100-200 ms less than what we have now).

  • While https://gitlab.com/gitlab-com/infrastructure/issues/1786 lists slow queries, it does not reveal where in the application they are being called, reducing the usefulness of the list.
  • @yorickpeterse added a new dashboard in our monitoring infrastructure (alas, InfluxDB data, so not yet viewable by the world) that shows SQL timings per controller, filtering out those with a p99 < 200 ms, and those that are called fewer than 500 times, and sorting by those controllers where the p99 of SQL timing is a high % of the p99 of the overall transaction timing for the controller.
  • This yields a list of 36 controllers, 15 of which where SQL timing takes 47% or more of the total time; there is a drop to 29% for the next controller on the list.
  • It makes sense to prioritize working on slow queries associated with the controllers identified in this manner.
  • Each controller may very well call multiple queries (as seen in the example on https://about.gitlab.com/handbook/engineering/performance/#flow-of-web-request where loading the dashboard led to 29 SQL queries; not necessarily unique queries. It is not easy to know how many unique SQL queries are associated with these top 15 controllers.
  • Each controller takes about 5 person-days to research, and fix. So we're looking at 75 person days, i.e. 1 full-time person to fix them all in Q3.
Edited by Ernst van Nierop