Elevated API frontend errors without nginx

Summary

After removing nginx from API in gprd, we started to see elevated error rates in HAProxy and Cloudflare, which ultimately forced us to bring nginx back.

This issue is for investigating the root cause and possibly finding a solution that finally enables us to remove nginx.

Details

Which requests are failing

The errors we see are almost exclusively for POST requests to /api/v4/jobs/request, which are long-polling requests from runners. These requests are special in that workhorse delays the response by 50s to throttle down runners polling for new jobs to execute.
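For context, the long-poll behavior looks roughly like the sketch below: a handler that holds the request open for 50s before answering, so any intermediary that gives up on the idle connection earlier surfaces as a failed request. This is only an illustration of the pattern, not the actual workhorse code; the port is a placeholder.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// Minimal sketch of a long-polling job-request endpoint (not the real
// workhorse implementation): the handler holds the request open for 50s
// before responding, so a proxy or LB that drops idle connections
// earlier turns up as a failed request on this endpoint.
func jobsRequestHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(50 * time.Second): // long-poll delay described above
		w.WriteHeader(http.StatusNoContent) // no job available for the runner
	case <-r.Context().Done(): // client or an intermediary gave up
	}
}

func main() {
	http.HandleFunc("/api/v4/jobs/request", jobsRequestHandler)
	log.Fatal(http.ListenAndServe(":8181", nil)) // placeholder port
}
```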


The only other relevant endpoint showing elevated errors without nginx is also one that takes a long time (15s for all failing requests: Kibana).

How are requests failing

We see all of those requests fail in HAProxy:

  • with the termination flags SH-- - meaning the server aborted the TCP connection while HAProxy was still waiting for the response headers
  • after exactly 15s, 30s or 45s

Where is the 15s interval coming from?

The best explanation so far is that workhorse is using the Go runtime's default TCP keepalive period of 15s.
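If that is the case, the period is controlled per listener: connections accepted through a net.ListenConfig get keepalives enabled with the 15s default unless KeepAlive is set explicitly. A minimal sketch of overriding it (the 65s value and the address are placeholders, not what workhorse uses):

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	// The Go net package enables TCP keepalives on accepted connections
	// with a 15s period by default. Setting KeepAlive explicitly overrides
	// that; a period longer than the 50s long-poll delay would keep the
	// connection quiet until the real response is written.
	// (Sketch only - the value and address are placeholders.)
	lc := net.ListenConfig{KeepAlive: 65 * time.Second}

	ln, err := lc.Listen(context.Background(), "tcp", ":8181")
	if err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{Handler: http.DefaultServeMux}
	log.Fatal(srv.Serve(ln))
}
```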

This default is partly confirmed by tcpdumps on a GKE node showing keepalive packets being sent every 15s:

(tcpdump screenshot)

This is still confusing, as only the first keepalive probe should come after 15s of idle time, while subsequent probes should come every net.ipv4.tcp_keepalive_intvl seconds, which is 60s on our GKE nodes. Yet tcpdump on both GKE nodes and HAProxy nodes shows these packets being sent every 15s.
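A possible explanation (an assumption, not verified against the workhorse or Go sources for our exact versions) is that Go sets both TCP_KEEPIDLE and TCP_KEEPINTVL on each socket to the same keepalive period, so the per-socket 15s value overrides the sysctl. The sketch below dumps the effective per-socket values for a connection so they can be compared with what tcpdump shows; the target address is a placeholder and the sockopt constants are Linux-specific:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// dumpKeepalive prints the per-socket keepalive settings of a TCP
// connection (Linux-specific). If both TCP_KEEPIDLE and TCP_KEEPINTVL
// are 15s, probes go out every 15s regardless of the
// net.ipv4.tcp_keepalive_* sysctls.
func dumpKeepalive(c *net.TCPConn) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	return raw.Control(func(fd uintptr) {
		idle, _ := unix.GetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPIDLE)
		intvl, _ := unix.GetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPINTVL)
		fmt.Printf("TCP_KEEPIDLE=%ds TCP_KEEPINTVL=%ds\n", idle, intvl)
	})
}

func main() {
	// Placeholder target - any reachable TCP endpoint works for this check.
	conn, err := net.Dial("tcp", "127.0.0.1:8181")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := dumpKeepalive(conn.(*net.TCPConn)); err != nil {
		log.Fatal(err)
	}
}
```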

What could be the root cause for the connection being ended prematurely?

One theory is that a network component terminates connections in some cases when it sees an unexpected TCP keepalive ACK packet. This could be the case if a conntrack table is overflowing, so that old connections which haven't seen new packets for a while are no longer tracked.
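If we want to check the conntrack theory on a node, something like the sketch below could compare the current entry count against the table limit. It assumes the standard netfilter proc paths; on nodes tracking connections differently these files may not exist:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

// readProc returns the trimmed contents of a /proc entry.
func readProc(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	return strings.TrimSpace(string(b))
}

func main() {
	// Standard netfilter conntrack counters (assumption: nf_conntrack is
	// what tracks connections on these nodes). A count close to the max
	// would support the table-overflow theory.
	count := readProc("/proc/sys/net/netfilter/nf_conntrack_count")
	max := readProc("/proc/sys/net/netfilter/nf_conntrack_max")
	fmt.Printf("conntrack entries: %s / %s\n", count, max)
}
```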

When removing nginx, we use an internal Google TCP LB as the service endpoint instead - maybe this component is causing the connection terminations?

Other issues we found

What we can do

As next steps we should:

  • determine if we see the same behavior behind nginx (which hides terminated TCP connections from clients by simply retrying - something HAProxy 1.8 can't do; only HAProxy 2 has such a feature). This could validate whether the internal TCP LB is the component terminating connections prematurely.
  • open a support ticket with Google: Ticket
  • consider increasing the keepalive period in workhorse to a value > 50s: gitlab-org/gitlab#345722
  • consider switching to HAProxy 2.0 and using the retry-on feature

Status Update 2021-11-15

  • We don't see TCP connection terminations or retries in the nginx logs, which points at the TCP LB as a likely cause
  • We opened a support case with Google
  • We opened an issue for increasing the default TCP keepalive period in workhorse: gitlab-org/gitlab#345722

Next:

Wait for the response from Google and then decide on next steps by the end of the week.
