
Re-evaluate "Web Hook Calls" Plan Limit for high-volume namespaces


Problem to Solve

In short, our Web Hook Calls (per minute) plan limit is too low. We have a customer complaining about missing Build-status webhook requests: https://gitlab.com/gitlab-com/request-for-help/-/issues/3227

And during the investigation we discovered they're not the only ones: gitlab-com/gl-infra/production#20354 (comment 2689127088)

Investigation Needed

First, understand whether the current RateLimit is intended as a cost-management cap or as an instance-stability guardrail.

If it's a cost-management cap

We should exempt WebHook executions that are deeply built into our ecosystem from the RateLimit. In our current problem case, that means WebHooks for changes to Build status. As a customer scales their CI, they will have more jobs and more status transitions. If they're paying us for a lot of CI, they deserve the status-change WebHooks that go along with it. Telling someone they can have as much CI as they want while RateLimiting the WebHook events directly associated with their jobs is needlessly nickel-and-diming and a poor UX.
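
As an illustration only (the class, method, and event-type names below are hypothetical, not GitLab's actual rate-limiting code), exempting built-in event types could amount to skipping the per-minute counter for an allow-list of events:

```python
import time
from collections import defaultdict, deque

# Hypothetical: event types considered "built-in" to the product
# (e.g. build/pipeline status transitions) bypass the per-minute cap.
EXEMPT_EVENT_TYPES = {"build_status", "pipeline_status"}

class WebhookRateLimiter:
    """Sliding-window per-minute limiter with an event-type exemption list."""

    def __init__(self, calls_per_minute: int):
        self.calls_per_minute = calls_per_minute
        # namespace_id -> timestamps of recently counted calls
        self.windows = defaultdict(deque)

    def allow(self, namespace_id: int, event_type: str) -> bool:
        # Built-in events scale with paid CI usage, so they are not counted.
        if event_type in EXEMPT_EVENT_TYPES:
            return True
        now = time.monotonic()
        window = self.windows[namespace_id]
        # Drop timestamps that have fallen out of the 60-second window.
        while window and now - window[0] > 60:
            window.popleft()
        if len(window) >= self.calls_per_minute:
            return False  # over the plan limit; this call would be dropped
        window.append(now)
        return True
```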

If it's an instance stability guardrail

We should increase the RateLimit for high-volume/high-seat-count Ultimate namespaces, in consultation with the infrastructure team. In the short term, we exempted a particular namespace from WebHook rate-limiting entirely because the limit was a hard blocker on their adoption of Duo Platform features. That is something we have to watch carefully: we can't handle infinite WebHook requests, and a blanket exemption is not a good or stable long-term policy. We do need to protect our infrastructure, but given our current offering we should re-evaluate how much WebHook traffic it can handle and get a little smarter about how we protect it.
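
Again purely as a sketch (the plans, seat thresholds, and limits below are invented for illustration and are not actual GitLab configuration), the guardrail direction could replace a single global value with a tiered lookup per namespace, keeping full exemptions as a small, regularly reviewed escape hatch:

```python
from dataclasses import dataclass

@dataclass
class Namespace:
    id: int
    plan: str        # e.g. "free", "premium", "ultimate"
    seat_count: int

# Hypothetical default; real values would come from infrastructure review.
DEFAULT_LIMIT_PER_MINUTE = 500

# Namespaces temporarily exempted entirely (the short-term escape hatch
# mentioned above); should stay small and be reviewed regularly.
EXEMPT_NAMESPACE_IDS: set[int] = set()

def webhook_limit_per_minute(ns: Namespace) -> int | None:
    """Return the per-minute webhook cap for a namespace, or None for unlimited."""
    if ns.id in EXEMPT_NAMESPACE_IDS:
        return None
    if ns.plan == "ultimate" and ns.seat_count >= 1000:
        return 5000   # raised guardrail for high-seat-count Ultimate namespaces
    if ns.plan in ("ultimate", "premium"):
        return 2000
    return DEFAULT_LIMIT_PER_MINUTE
```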
