Elevated API frontend errors without nginx

Summary

After removing nginx from API in gprd, we started to see elevated error rates in HAProxy and Cloudflare, which ultimately forced us to bring nginx back.

This issue is for investigating the root cause and possibly finding a solution that finally enables us to remove nginx.

Details

Which requests are failing

The errors we see are almost exclusively for POST requests to /api/v4/jobs/request, which are long-polling requests from runners. These requests are special in that workhorse delays the response by 50s to throttle down runners polling for new jobs to execute.
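For context, the long-poll behavior looks roughly like the sketch below: a handler that holds the request open for 50s before answering, so any intermediary that gives up on the idle connection earlier surfaces as a failed request. This is only an illustration of the pattern, not the actual workhorse code; the port is a placeholder.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// Minimal sketch of a long-polling job-request endpoint (not the real
// workhorse implementation): the handler holds the request open for 50s
// before responding, so a proxy or LB that drops idle connections
// earlier turns up as a failed request on this endpoint.
func jobsRequestHandler(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(50 * time.Second): // long-poll delay described above
		w.WriteHeader(http.StatusNoContent) // no job available for the runner
	case <-r.Context().Done(): // client or an intermediary gave up
	}
}

func main() {
	http.HandleFunc("/api/v4/jobs/request", jobsRequestHandler)
	log.Fatal(http.ListenAndServe(":8181", nil)) // placeholder port
}
```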


The only other relevant endpoint showing elevated errors without nginx is also one that takes a long time (15s for all failing requests: Kibana).

How are requests failing

We see all of those requests fail in HAProxy:

  • with the termination flags SH-- - meaning the server aborted the TCP connection while HAProxy was still waiting for the response headers
  • after exactly 15s, 30s or 45s

Where is the 15s interval coming from?

The best explanation so far is that workhorse is using the Go runtime's default TCP keepalive period of 15s.
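If that is the case, the period is controlled per listener: connections accepted through a net.ListenConfig get keepalives enabled with the 15s default unless KeepAlive is set explicitly. A minimal sketch of overriding it (the 65s value and the address are placeholders, not what workhorse uses):

```go
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	// The Go net package enables TCP keepalives on accepted connections
	// with a 15s period by default. Setting KeepAlive explicitly overrides
	// that; a period longer than the 50s long-poll delay would keep the
	// connection quiet until the real response is written.
	// (Sketch only - the value and address are placeholders.)
	lc := net.ListenConfig{KeepAlive: 65 * time.Second}

	ln, err := lc.Listen(context.Background(), "tcp", ":8181")
	if err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{Handler: http.DefaultServeMux}
	log.Fatal(srv.Serve(ln))
}
```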

This default is partly confirmed by tcpdumps on a GKE node showing keepalive packets being sent every 15s:

(tcpdump screenshot)

This is still confusing, as only the first keepalive probe should come after 15s of idle time, while subsequent probes should come every net.ipv4.tcp_keepalive_intvl seconds, which is 60s on our GKE nodes. Yet tcpdump on both GKE nodes and HAProxy nodes shows these packets being sent every 15s.
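A possible explanation (an assumption, not verified against the workhorse or Go sources for our exact versions) is that Go sets both TCP_KEEPIDLE and TCP_KEEPINTVL on each socket to the same keepalive period, so the per-socket 15s value overrides the sysctl. The sketch below dumps the effective per-socket values for a connection so they can be compared with what tcpdump shows; the target address is a placeholder and the sockopt constants are Linux-specific:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// dumpKeepalive prints the per-socket keepalive settings of a TCP
// connection (Linux-specific). If both TCP_KEEPIDLE and TCP_KEEPINTVL
// are 15s, probes go out every 15s regardless of the
// net.ipv4.tcp_keepalive_* sysctls.
func dumpKeepalive(c *net.TCPConn) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	return raw.Control(func(fd uintptr) {
		idle, _ := unix.GetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPIDLE)
		intvl, _ := unix.GetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPINTVL)
		fmt.Printf("TCP_KEEPIDLE=%ds TCP_KEEPINTVL=%ds\n", idle, intvl)
	})
}

func main() {
	// Placeholder target - any reachable TCP endpoint works for this check.
	conn, err := net.Dial("tcp", "127.0.0.1:8181")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := dumpKeepalive(conn.(*net.TCPConn)); err != nil {
		log.Fatal(err)
	}
}
```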

What could be the root cause for the connection being ended prematurely?

One theory is that a network component terminates connections in some cases when it sees an unexpected TCP keepalive ACK packet. This could be the case if a conntrack table is overflowing, so that old connections which haven't seen new packets for a while are no longer tracked.
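If we want to check the conntrack theory on a node, something like the sketch below could compare the current entry count against the table limit. It assumes the standard netfilter proc paths; on nodes tracking connections differently these files may not exist:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

// readProc returns the trimmed contents of a /proc entry.
func readProc(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}
	return strings.TrimSpace(string(b))
}

func main() {
	// Standard netfilter conntrack counters (assumption: nf_conntrack is
	// what tracks connections on these nodes). A count close to the max
	// would support the table-overflow theory.
	count := readProc("/proc/sys/net/netfilter/nf_conntrack_count")
	max := readProc("/proc/sys/net/netfilter/nf_conntrack_max")
	fmt.Printf("conntrack entries: %s / %s\n", count, max)
}
```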

When removing nginx, we use an internal Google TCP LB as the service endpoint instead - maybe this component is causing the connection terminations?

Other issues we found

What we can do

As next steps we should:

  • determine if we see the same behavior behind nginx (which hides terminated TCP connections from clients by simply retrying - something HAProxy 1.8 can't do; only HAProxy 2 has such a feature). This could validate whether the internal TCP LB is the component terminating connections prematurely.
  • open a support ticket with Google: Ticket
  • consider increasing the keepalive period in workhorse to a value > 50s: gitlab-org/gitlab#345722
  • consider switching to HAProxy 2.0 and using the retry-on feature

Status Update 2021-11-15

  • We don't see TCP connection terminations or retries in the nginx logs, which points at the TCP LB as a likely cause
  • We opened a support case with Google
  • We opened an issue for increasing the default TCP keepalive period in workhorse: gitlab-org/gitlab#345722

Next:

Wait for the response from Google and then decide on next steps by the end of the week.
