
Avoiding unintended connection failures with SO_REUSEPORT

By Jonathan Corbet
April 23, 2021
Many of us think that we operate busy web servers; LWN's server, for example, sweats hard when keeping up with the comment stream that accompanies any article mentioning the Rust programming language. But some organizations run truly busy servers and have to take some extraordinary measures to keep up with levels of traffic that even language advocates cannot create. The SO_REUSEPORT socket option is one of many features that have been added to the network stack to help these use cases. SO_REUSEPORT suffers from an implementation problem that can cause connections to fail, though. Kuniyuki Iwashima has posted a patch set addressing this problem, but there is some doubt as to whether it takes the right approach.

In normal usage, only one process is allowed to bind to any given TCP port to accept incoming connections. On busy systems, that process can become a bottleneck, even if all it does is pass accepted connections off to other processes for handling. The SO_REUSEPORT socket option, which was added to the 3.9 kernel in 2013, was meant to address that bottleneck. This option allows multiple processes to accept connections on the same port; whenever a connection request comes in, the kernel will pick one of the listening processes as the recipient. Systems using SO_REUSEPORT can dispense with the dispatcher process, improving scalability overall.

SO_REUSEPORT does its work when the initial SYN packet (the connection request) is received; at that time, a provisional new socket is created and assigned to one of the listening processes. The new connection will first wait for the handshake to complete, after which it will sit in a queue until the selected process calls accept() to accept the connection and begin the session. On busy servers, there may be a fair number of connections awaiting acceptance; the maximum length of that queue is specified with the listen() system call.

When SO_REUSEPORT misbehaves

Most of the time, SO_REUSEPORT works just as intended, but that can change when one of the listening processes exits. If a process quits with open network connections, those connections will be closed — not a surprising result. But the kernel will also "close" (by resetting) any incoming connections that are still in the accept queue. In the absence of SO_REUSEPORT this behavior makes sense; if the (single) listening process goes away, there is no longer anybody who can accept those connections.

If SO_REUSEPORT is being used, and if there are multiple listening processes, incoming connections do not necessarily have to be closed in this way. There are, after all, other processes running that would happily handle those connections. But once the kernel has committed an incoming connection to a specific process, it will not change its mind later; either that connection will be accepted by the chosen process, or it will be closed.

There are a number of reasons why a listening process might exit on a busy system. Perhaps it simply crashed. But, more likely, the server is being restarted to effect a configuration change or to switch to a new certificate. Such restarts can be phased across a pool of server processes so that they don't all exit at once; that should allow incoming connections to be handled without any apparent interruption of service. But when the above-described behavior comes into the picture, users can be turned away, which tends to have an unpleasant effect on their mood. The depressive effect on the operator of the site, who may have just lost the opportunity to learn that the would-be user is in the market for a new pair of socks, can be even worse.

There are ways around this problem, such as using a BPF program to steer incoming connections away from a server process that is about to exit, then being sure that it drains any queued connections before it bows out. But Iwashima makes the point that there is a better way: when a process exits, just take all of the queued incoming connections and reassign them to a different process for handling. After all, there is no state yet associated with an unaccepted request; one process can handle it just as well as another, and moving it will avoid causing the request to fail.

Migrating the accept queue

Getting there requires an eleven-part patch set, though. The first step is to add a new sysctl knob (net.ipv4.tcp_migrate_req; there does not appear to be an IPv6 version) controlling whether incoming connections should be moved to a new listener if the one they were assigned to exits. By default, this new behavior is disabled to avoid interfering with deployments where other arrangements have been made.
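With the patch set applied, turning the new behavior on would be a one-line change; the knob name below is taken from the series, and the setting is per-network-namespace:

```shell
# Enable migration of queued connections when a listener closes;
# requires a kernel carrying Iwashima's patches (off by default).
sysctl -w net.ipv4.tcp_migrate_req=1
```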

Actually migrating incoming connections away from an exiting server process is harder than one might think, because the "accept queue" is more complicated than has been discussed so far. Remember that TCP connections go through a three-way handshake before being established: the connection is initiated with a SYN packet, the server side responds with SYN+ACK, and the initiator completes the connection with an ACK packet. That entire process must complete before an incoming connection can be given to a server process via accept().

Fully established connections — those that have completed the three-way handshake but which have not yet been accepted — are relatively easy to move to a new server process; they are just shunted from one queue to another. Connections that are still in the handshake are more complicated, though. They can only be moved at specific points during the sequence, the completion of the handshake being the most obvious such point. It is also possible to move a connection when the SYN+ACK is retransmitted, should that be necessary. Either way, the remains of the old server process's socket structure must stay around for long enough to finish the handshake; that adds a certain amount of complexity.

One remaining question is: how is the new recipient for the connection chosen? Normally, the kernel will use the same algorithm it uses to pick a recipient in the first place, which is essentially a round-robin approach. But there will surely be users who know better and who want to be able to redirect these connections more explicitly. For those users there is, inevitably, a new BPF program type (BPF_PROG_TYPE_SK_REUSEPORT) that can be used to make decisions on where to reroute these connections.
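A program of this type selects the target socket from a map of the sockets sharing the port. A minimal kernel-side sketch follows; it needs libbpf and clang to build (it cannot run standalone), and the map name and the simple modulo policy are illustrative, not taken from the patch set:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sockets sharing the port, populated from userspace. */
struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, 16);
    __type(key, __u32);
    __type(value, __u64);
} target_socks SEC(".maps");

SEC("sk_reuseport")
int select_listener(struct sk_reuseport_md *ctx)
{
    /* Pick a slot from the flow hash; any policy could go here. */
    __u32 index = ctx->hash % 16;

    /* Steer the connection to the chosen socket; if the lookup
       fails, the kernel falls back to its default selection. */
    bpf_sk_select_reuseport(ctx, &target_socks, &index, 0);
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```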

Unacceptable?

As of this writing, the only comment on Iwashima's patch set comes from Eric Dumazet, who questioned whether it is the right approach. Since SO_REUSEPORT was added, he said, the TCP accept code has been reworked to run locklessly, which should address much of the scalability problem that SO_REUSEPORT was added to mitigate in the first place. Thus, he said, it might be better for applications to go back to a single-listener mode, perhaps helped by a new form of accept() that would allow incoming connections to be quickly directed to server processes.

That, of course, is a rather different development direction than Iwashima has taken so far and is thus unlikely to be welcome news. One could argue that, while a new accept() call might be a more pleasing solution to the overall problem, there should still be a place for a patch series making an existing kernel feature work without occasionally killing incoming connections. So it's not clear how this will play out; stay tuned.


Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 15:35 UTC (Fri) by clugstj (subscriber, #4020) [Link] (12 responses)

So, we are worrying about 1 in a billion connections to a busy web server failing? I can't see this being worth the added complexity.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 19:07 UTC (Fri) by smurf (subscriber, #17840) [Link]

It's not just web sockets this is useful for.

The connection in question has already been accepted and there's a server willing to work with it. Dropping it is just plain rude, esp. as the client has no way to decide whether the server crashed or not. This is bad. You need retry code in the client (which otherwise could just blindly assume that the server crashed). The retry introduces latency you might want to avoid.

Also the client might be a load balancer which now thinks that your server just crashed. This is a bad idea, esp. if you use a sharded data set because the requests now go to "cold" machines.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 19:10 UTC (Fri) by mss (subscriber, #138799) [Link] (9 responses)

> So, we are worrying about 1 in a billion connections to a busy web server failing?
Where did you get this number from? Is it stated somewhere in the patch series?

A TCP connection could fail for many other reasons, too.
We definitely don't want to add additional ones, as randomly failing server connections give poor user experience.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 2:37 UTC (Sat) by clugstj (subscriber, #4020) [Link] (8 responses)

The article says "the server is being restarted to effect a configuration change or to switch to a new certificate". How often does this happen?

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 2:38 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Every couple of weeks with Let's Encrypt.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 2:39 UTC (Sat) by clugstj (subscriber, #4020) [Link] (1 responses)

If you are using Let's Encrypt, your server isn't that busy.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 2:43 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

That's not necessarily true. For example, we were using plenty of Let's Encrypt certs on internal NAS-like servers. We could have used self-signed certs, but maintaining our own CA and installing it on all computers was a bigger hassle.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 4:45 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (4 responses)

Speaking as a Google SRE, we restart servers "to effect a configuration change" all the damn time. We push new code into production literally every single week[1] unless there's a holiday or our error budget is depleted. Every push involves (slowly, carefully[2]) restarting all running instances of the server. Now, in practice, SO_REUSEPORT is probably not the most relevant flag in the world for us, but that's mostly just because we've already solved this problem (i.e. "don't drop in-flight requests") at other levels of abstraction, and so asking the kernel for help is less useful. But any shop that's less aggressively containerized than us[3] would probably find this sort of thing Nice To Have, if they want to do frequent releases.

[1]: This is my experience on one team managing a small number of services. It is not necessarily representative; I know for a fact that other teams often have wildly different release cadences.
[2]: https://sre.google/workbook/canarying-releases/
[3]: https://sre.google/sre-book/production-environment/

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 19:07 UTC (Sat) by ms-tg (subscriber, #89231) [Link] (3 responses)

Thanks for posting this, I find a lot of value in the “this is what I’m seeing in my industry” posts that LWN attracts related to specific kernel intricacies.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 23:28 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

I should probably also emphasize that I deal with things at a much higher level than "the specific flags we pass to individual syscalls when setting up sockets," so while I believe what I have written is generally correct, my understanding might be incomplete or incorrect with respect to SO_REUSEPORT in particular.

Nevertheless, there are definitely lots of people who want to push their software frequently, and making that easier in one fashion or another can only be a good thing.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 18:37 UTC (Mon) by sodabrew (subscriber, #95737) [Link] (1 responses)

You may be mistaken on Google's lack of use for SO_REUSEPORT, given that SO_REUSEPORT was developed by Tom Herbert at Google and merged for the 3.9 kernel release: https://lwn.net/Articles/542629/

It's certainly possible that the original needs have changed and become solved in other ways in the eight years that have passed, or that you're working on a different product that doesn't have the same requirements as the one for which this feature was developed, but either way, your statement that (paraphrased) "Google doesn't use SO_REUSEPORT" ought to have a modifier of either "...anymore because..." or "...in my group that's doing something different."

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 22:04 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

That is fairly likely; I'm basing my assumptions on the "lame duck state" documented here: https://sre.google/sre-book/load-balancing-datacenter/

But that probably wouldn't work very well for frontends.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 23:08 UTC (Sat) by flussence (guest, #85566) [Link]

Why not? Linux already has a bunch of similar complexity to prevent timestamp counters wrapping after hundreds of days of uptime, and we all know that would've only happened to evil lazy sysadmins that don't apply security updates and totally deserve it. /s

If this was hardware the maker would've put out a product recall for such a high failure rate.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 15:42 UTC (Fri) by Tomasu (guest, #39889) [Link] (1 responses)

I wonder why the queue has to be associated with a process at all till it's been "accepted". Probably just legacy assumptions.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 18:29 UTC (Fri) by xi0n (guest, #138144) [Link]

AFAIK it's associated with a particular socket, but either way it does look like a legacy assumption. SO_REUSEPORT would likely be simpler in implementation (possibly making the patch set discussed here unnecessary) if those queues were maintained on a per-bindpoint basis instead.

(This would raise the question of what to do with the second argument to listen(). Since the number given there is defined more as a hint than an actual limit, it's probably a minor issue, though.)

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 16:16 UTC (Fri) by iabervon (subscriber, #722) [Link] (3 responses)

I wonder if the right thing might be to have SO_REUSEPORT switch to actually dup()ing the single socket to the caller's fd. Is there per-socket information that is allowed to be different among the sockets listening on the same port that would cause behavior changes in that situation?

I assume the now-recommended userspace code would bind the address in a single process and pass the bound socket over a unix domain socket to other processes (or fork after binding it), and it seems like this situation wouldn't be too hard to replicate even when userspace didn't ask like that, assuming that there aren't any visible differences.

I guess the max queue length may need to be adjusted in order to enqueue the same number of incoming connections total, and that might be noticed by the other local processes?

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 20:58 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> I wonder if the right thing might be to have SO_REUSEPORT switch to actually dup()ing the single socket to the caller's fd. Is there per-socket information that is allowed to be different among the sockets listening on the same port that would cause behavior changes in that situation?
The problem is that you'll be funneling all the connections through effectively one thread, because all the file operations take a process-wide lock. This is what SO_REUSEPORT was designed to avoid.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 23:29 UTC (Fri) by wahern (subscriber, #37304) [Link] (1 responses)

The article mentions that "the TCP accept code has been reworked to run locklessly". IIRC, the purpose of SO_REUSEPORT wasn't to permit multiple threads to dequeue connections; they could already do that, and if there was lock contention this was just a quality-of-implementation (QoI) issue. Rather, I think the primary issue was efficient polling and resolving the thundering-herd problem. The reason an incoming connection is immediately assigned to a specific queue is so that only that descriptor (or descriptors, if dup'd) will signal readiness, while still avoiding stalling and fair-dispatch dilemmas, especially in the context of polling as opposed to threads actually waiting inside accept. The classic way to implement a multi-threaded accept in userspace while avoiding thundering herds was to ensure only a single thread was waiting in accept or polling on the accept descriptor at any one time. SO_REUSEPORT simply moved assignment earlier in the pipeline while largely preserving semantics.

BSD supported SO_REUSEPORT long before Linux did, albeit support for TCP seems to be undocumented. Rather than round-robin, though, only the most recent binding is assigned connections. When that goes away the previous one starts to see connections again. However, at least on macOS queued connections are still lost on close. I see FreeBSD added SO_REUSEPORT_LB which does round-robin; not sure if it suffers from the lost connection problem.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 23:39 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> IIRC, the purpose of SO_REUSEPORT wasn't to permit multiple threads to dequeue connections; they could already do that, and if there was lock contention this was just a QoI issue.
SO_REUSEPORT is needed to allow multiple _processes_ to dequeue connections.

The problem with accepting connections from multiple threads is that they are still effectively serialized, because all file operations take an implicit lock to allocate a file descriptor.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 23, 2021 19:15 UTC (Fri) by ibukanov (subscriber, #3942) [Link] (1 responses)

I do not see how this is useful in practice. nginx has supported live updates without dropping any connection for ages, using a careful protocol to transfer the file descriptors to another process. systemd made that very straightforward to implement, as the process can store descriptors before the restart in the pool provided by systemd. Those busy sites surely can implement something like that, allowing them to preserve not only listening sockets, with or without SO_REUSEPORT, but also accepted ones.

As for crashing servers, losing the incoming queue is the least of the worries, as a crash is a clear sign that the system misbehaves. And for absolute robustness, one can transfer important sockets to another process in the crash handler.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 1:07 UTC (Sat) by HenrikH (subscriber, #31152) [Link]

The sockets/connections that the patch set is about have not reached userspace yet when the process is restarted, so there is nothing an application can do here. It's about connections that have been assigned to a specific process by the kernel, but where said process has not yet called accept() to get them when it was restarted/stopped/crashed.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 8:46 UTC (Sat) by wtarreau (subscriber, #51152) [Link] (8 responses)

I tried to do exactly this several times in the past but failed. We had the same problem with haproxy, where high-traffic users were constantly seeing a few resets being emitted when one queue was unbound. I tried to figure out how to detach pending connections from a queue and reinject them into other queues, but never managed to; that area was too complex.

I understand Eric's concerns (and he already expressed them to me back then). It is possible that it's not the best solution, but it addresses a real issue in the field that needs to be addressed.

We worked around it by passing listening file descriptors between the old and the new process during reloads. All this just to avoid a bind+unbind cycle! It comes with its own set of limitations, of course.
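The descriptor-passing trick described here is usually built on SCM_RIGHTS ancillary data over a Unix-domain socket. A minimal sketch, with helper names of my own (haproxy's actual protocol carries more metadata than this):

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send an open file descriptor to a peer over a Unix-domain socket
   using SCM_RIGHTS ancillary data; returns 0 on success. */
static int send_fd(int chan, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd(); the kernel installs
   a fresh descriptor in this process.  Returns it, or -1 on error. */
static int recv_fd(int chan)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    if (recvmsg(chan, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The old process sends its bound listening sockets to the new one this way before exiting, so the port is never unbound.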

Also, SO_REUSEPORT is not just used for this; the initial purpose was to allow multiple processes to bind to the same port and avoid blackout periods. It used to work fine in 2.2 and was removed in 2.4. I had to maintain the patch to reintroduce it until someone else proposed a variant in 3.9, which also implemented the multiqueue balancing. But this is an essential feature in highly available environments.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 10:29 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

How about this method?

Have two queues for each listening socket, one normal and one for "extraordinary" requests. In case of a process death, take the queued connections during the closure process and redistribute them across the extraordinary queues.

Since these queues are special and are used infrequently, you can use simple locking-based algorithms there.

Ideally this can be done transparently in kernel, but it can also be done with some userspace assistance. Processes willing to "mop up" connections can open a new listening socket and communicate (via setsockopt/ioctl) that it should be used for connection migration.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 12:12 UTC (Sat) by wtarreau (subscriber, #51152) [Link] (6 responses)

Yes, that could be an option; I considered it but lost my way back then. The accept-queue code became tricky since the reintroduction of SO_REUSEPORT, and I seem to remember that one of the difficulties was picking pending connections, and another was unhashing some of the queues while they were in use without losing what was in them. For whatever reason, I remember not figuring out how to allow a program to still pick what was left in a queue with that queue not being visible to the rx path that distributes incoming requests. But these are old memories, and I remember that Eric was quite concerned about my fiddling there, because he was about to finish killing the SYN queue lock.

Also it's important to keep in mind that we cannot afford to lose even a tenth of a percent of performance there, because such tricks would only be used during process reloads, and the code path they're affecting is the one being the most stressed during DDoSes.

Ideally we should just remove a queue in two steps: in the first step it would simply be unhashed, and in the second it would be closed. From what I remember, pending entries were killed inside the unhashing code, but I could be saying crap.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 25, 2021 8:09 UTC (Sun) by smurf (subscriber, #17840) [Link] (5 responses)

Couldn't the unhashing be accomplished (with some minimal kernel support of course) by calling listen(fd,0)? Then either process the remaining connections or hand them off to another process.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 16:10 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (4 responses)

No, that was among my first attempts. It simply results in the listen queue being zero for that socket and connection requests being dropped. Still, I kept that as a workaround for the RSTs for a few days, because it managed to cause fewer RSTs by rejecting SYNs earlier when detecting that the queue was full. But that was not possible after the lockless SYN patches anyway, so there was no hope in this direction. Plus, this resulted in huge CPU usage for the user application, which had to call accept() in loops and was not able to group the accepts any more.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 28, 2021 5:32 UTC (Wed) by smurf (subscriber, #17840) [Link] (3 responses)

You misunderstand (or I miswrote): my intent was to fix the kernel so that "listen(fd,0)" simply closes the queue for new arrivals. Then the process would accept() the remaining open connections (and somehow deal with them), and shut down when that blocks. No new syscall, new common queueing mechanism, or other intrusive shenanigans required.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 29, 2021 17:48 UTC (Thu) by wtarreau (subscriber, #51152) [Link] (2 responses)

Ah OK but the internal problem remains the same: the difficulty of rehashing the queues without losing entries. listen(0), setsockopt(), shutdown(SHUTRD) etc were all among valid candidates for me as soon as I'd have had a reliable way to move these queues around :-/

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 30, 2021 5:53 UTC (Fri) by smurf (subscriber, #17840) [Link]

That depends on whether you need kernel support for it. I could get by with an IPC socket to some other server to which I can send the file descriptors returned from accept()ing these connections.

Or maybe we want a "sock_inject(listener,conn)" syscall that adds an open socket into a listener's queue. To do that for another process, just open /proc/‹serverpid›/fd/‹bound_socket›. In fact, something like this is also required for migrating a server to a different host, so it'd not be a single-use syscall.

Avoiding unintended connection failures with SO_REUSEPORT

Posted May 9, 2021 6:49 UTC (Sun) by bernat (subscriber, #51658) [Link]

The idea of listen(0) was to then allow you to drain the remaining connections. I remember you proposing this simple solution but your patch was rejected because "this should be done with BPF."

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 24, 2021 11:05 UTC (Sat) by gracinet (guest, #89400) [Link]

This is interesting to me because SO_REUSEPORT is, as far as I know, the only way to do prefork multiprocessing in gRPC servers, which is something that's typically wanted when the server is implemented in Python, because of the global interpreter lock (GIL).

In some cases the performance impact of the GIL can be overstated, it really depends on the workload. But then, many Python applications aren't designed to be thread-safe anyway because it's generally believed that the GIL would make the effort useless.

A common case for restarting worker processes would be to reach some limitation, such as memory footprint. Lots of applications in the wild have at least minor memory leaks.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 25, 2021 21:56 UTC (Sun) by Sesse (subscriber, #53779) [Link] (1 responses)

I'm a bit confused. What good does it do that SYN processing is lockless, if you're still going to serialize it into a single listener?

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 16:13 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

It's essential to deal with the high SYN rates that happen during SYN floods (i.e. all the time on high-traffic sites). You want that part to be ultra-scalable. The difference can be 1 vs 10 Mpps.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 4:45 UTC (Mon) by mm7323 (subscriber, #87386) [Link] (1 responses)

Why not just let a process do something like call listen(fd, 0) to indicate that it doesn't want any more connections queued to the socket? Then it can wait a few seconds for any handshaking to complete and use non-blocking accept() to empty its queue before exiting or reloading config gracefully.

This doesn't handle the crashing-process scenario, but that's a lost cause anyway: a crash after accept() was called could still result in a dropped connection, or an invalid or partial response, and aggravated users.

This suggestion is pretty much the same as using a BPF program to steer connections as suggested in the article. It just makes a simpler API to achieve the same.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 16:11 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

Doesn't work well; see my response above.

Avoiding unintended connection failures with SO_REUSEPORT

Posted Apr 26, 2021 22:39 UTC (Mon) by marcH (subscriber, #57642) [Link]

> LWN's server, for example, sweats hard when keeping up with the comment stream that accompanies any article mentioning the Rust programming language. But some organizations run truly busy servers and have to take some extraordinary measures to keep up with levels of traffic that even language advocates cannot create.

It looks like our editor's great mood made him temporarily stray from LWN's legendary rigor and thoroughness and drop a piece of critical information highly relevant to this article: in such a difficult server situation, are the Rust, C or C++ advocates causing the most traffic?


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds