increase workhorse tcp keepalive period

Since we removed nginx and started routing traffic to the GitLab.com API deployment in K8s through an internal TCP LB, TCP connections for runner long-polling requests (where workhorse delays its answer by 50s) are often terminated from the server side. Our haproxy logs show these terminations after exactly 15s, 30s, or 45s, and they cause an increased error rate.

The main correlation with this 15s interval is that tcpdump shows TCP keepalive packets being sent on each workhorse connection every 15s.

With Go 1.13, server-side TCP keepalive was enabled by default, with the keepalive period set to a default value of 15s: https://www.reddit.com/r/golang/comments/d7v7dn/psa_go_113_introduces_15_sec_server_tcp/

The 15s period is very short compared to the default settings of Linux (2h) or the GKE nodes (60s). We should consider increasing the interval in workhorse to 60s or more to

  1. confirm whether this setting is what causes TCP keepalives to be sent every 15s
  2. see if it would make those unexpected connection terminations go away

We already set a custom keepalive period of 5m for Redis connections in workhorse, but not for other connections.