How 3.6 nearly broke PostgreSQL
Nikolay Ulyanitsky reported that the 3.6-rc kernels showed a significant performance regression with PostgreSQL's pgbench benchmark. Borislav Petkov was able to reproduce the problem; a dozen or so bisection iterations later, he narrowed it down to this patch, which was duly reverted. There is just one little problem left: the offending patch was, itself, meant to improve scheduler performance. Reverting it fixed the PostgreSQL regression, but at the cost of losing an optimization that improves things for many (arguably most) other workloads. Naturally, that led to a search to figure out what the real problem was, so that the optimization could be restored without harmful effects on PostgreSQL.
What went wrong
The kernel's scheduling domains mechanism exists to optimize scheduling decisions by modeling the costs of moving processes between CPUs. Migrating a process from one CPU to a hyperthreaded sibling is nearly free; cache is shared at all levels, so the moved process will not have to spend time repopulating cache with its working set. Moving to another CPU within the same physical package will cost more, but mid-level caches are still shared, so such a move is still much less expensive than moving to another package entirely. The current scheduling code thus tries to keep processes within the same package whenever possible, but it also tries to spread runnable processes across the package's CPUs to maximize throughput.
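That cost hierarchy can be pictured with a small user-space sketch. This is purely illustrative, not kernel code; the topology tables below describe a made-up eight-CPU machine:

```c
#include <stdio.h>

/* Toy topology, invented for illustration: two packages, two cores per
 * package, two SMT threads per core -- eight logical CPUs in all. */
static const int core_of[8]    = { 0, 0, 1, 1, 2, 2, 3, 3 };
static const int package_of[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };

enum migration_cost { SMT_SIBLING, SAME_PACKAGE, CROSS_PACKAGE };

static enum migration_cost cost_of_move(int from, int to)
{
	if (core_of[from] == core_of[to])
		return SMT_SIBLING;	/* all cache levels shared: nearly free */
	if (package_of[from] == package_of[to])
		return SAME_PACKAGE;	/* mid-level caches still shared */
	return CROSS_PACKAGE;		/* working set must be refetched */
}

int main(void)
{
	printf("CPU 0 -> 1: cost class %d (SMT sibling)\n", cost_of_move(0, 1));
	printf("CPU 0 -> 2: cost class %d (same package)\n", cost_of_move(0, 2));
	printf("CPU 0 -> 4: cost class %d (cross package)\n", cost_of_move(0, 4));
	return 0;
}
```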
The problem that the offending patch (by Mike Galbraith) was trying to solve comes from the fact that the number of CPUs built into a single package has been growing over time. Not too long ago, examining every processor within a package in search of an idle CPU for a runnable process was a relatively quick affair. As the number of CPUs in a package increases, the cost of that search increases as well, to the point that it starts to look expensive. The current scheduler's behavior, Mike said at the time, could also result in processes bouncing around the package excessively. The result was less-than-optimal performance.
Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy" CPU. When one CPU wakes a process and needs to find a processor for that process to run on, it examines only the buddy CPU. The process will be placed on either the original CPU or the buddy; the search will go no further than that even if there might be a more lightly loaded CPU elsewhere in the package. The cost of iterating over the entire package is eliminated, process bouncing is reduced, and things run faster. Meanwhile, the scheduler's load balancing code can still be relied upon to distribute the load across the available CPUs in the longer term. Mike reported significant improvements in tbench benchmark results with the patch, and it was quickly accepted for the 3.6 development cycle.
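The difference between the two approaches can be sketched in a few lines of user-space C. The kernel's actual implementation (in select_idle_sibling()) is rather more involved; the idle[] and buddy[] arrays here are stand-ins invented for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

static bool idle[NR_CPUS];	/* stand-in for real CPU idleness state */
static int buddy[NR_CPUS];	/* each CPU's single "buddy" partner */

/* Before the patch (conceptually): scan all candidate CPUs (standing
 * in for the waker's package) looking for an idle one -- O(n) in the
 * package size. */
static int find_idle_full_scan(int waker)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return waker;		/* nothing idle: run locally */
}

/* After the patch (conceptually): look only at the buddy -- O(1).
 * Longer-term spreading of the load is left to the load balancer. */
static int find_idle_buddy(int waker)
{
	return idle[buddy[waker]] ? buddy[waker] : waker;
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		buddy[cpu] = cpu ^ 1;	/* pair adjacent CPUs */
	idle[5] = true;
	printf("full scan picks CPU %d\n", find_idle_full_scan(0));
	printf("buddy check picks CPU %d\n", find_idle_buddy(0));
	return 0;
}
```

With CPU 5 idle, the full scan finds it from anywhere, while the buddy check from CPU 0 looks only at CPU 1 and keeps the process local; that is exactly the behavior that saves search time but can pile related processes onto fewer CPUs.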
So what is different about PostgreSQL that caused it to slow down in response to this change? It seems to come down to the design of the PostgreSQL server and the fact that it does a certain amount of its own scheduling with user-space spinlocks. Carrying its own spinlock implementation evidently yields performance benefits for the PostgreSQL project, but it also makes the system more susceptible to problems resulting from scheduler changes in the underlying kernel. In this case, restricting the set of CPUs on which a newly woken process can run increases the chance that it will end up preempting another PostgreSQL process. If the new process needs a lock held by the preempted process, it will end up spinning until the preempted process manages to run again, slowing things down. Possibly even worse is that preempting the PostgreSQL dispatcher process (also more likely with Mike's patch) can slow the flow of tasks to all of the PostgreSQL worker processes; that, too, will hurt performance.
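To see why preemption hurts so much here, consider a generic test-and-set spinlock of the sort user space can build from compiler atomics. This is a sketch of the technique, not PostgreSQL's actual implementation (which lives in its platform-specific s_lock code):

```c
/* A minimal user-space test-and-set spinlock built on GCC's atomic
 * builtins; a sketch of the technique, not PostgreSQL's own code. */
typedef volatile int slock_t;

static void s_lock(slock_t *lock)
{
	/* __sync_lock_test_and_set() atomically stores 1 and returns
	 * the old value; looping until it returns 0 acquires the lock. */
	while (__sync_lock_test_and_set(lock, 1)) {
		/* The holder is running... or has been preempted.  In
		 * the latter case, every waiter burns its time slice
		 * right here, accomplishing nothing until the kernel
		 * lets the holder run again. */
		while (*lock)
			;	/* spin on plain reads to reduce bus traffic */
	}
}

static void s_unlock(slock_t *lock)
{
	__sync_lock_release(lock);	/* atomic store of zero */
}
```

The kernel knows nothing about these locks, so it cannot know that preempting a lock holder in favor of a soon-to-spin waiter is the worst possible choice.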
Making things better
What is needed is a way to gain the benefits of Mike's patch without making things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the cost of searching for an idle CPU instead of eliminating the search outright. It appears that there is some low-hanging fruit in this area, but it is not at all clear that optimizing the search, by itself, will solve the entire problem. Mike's patch eliminates that search cost, but it also reduces movement of processes around the package; a fix that only addresses the first part risks falling short in the end.
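Continuing with the toy model above, a cheaper search might, for example, simply bound how many CPUs it is willing to examine. To be clear, this is an invented illustration of the idea, not code from any proposed patch:

```c
/* Hypothetical bounded search, reusing the idle[] array and NR_CPUS
 * from the sketch above; invented purely to illustrate cheapening the
 * search rather than eliminating it. */
#define SEARCH_LIMIT 4

static int find_idle_bounded(int waker)
{
	/* Start just past the waking CPU so that repeated wakeups do
	 * not all pound on the same neighbors. */
	for (int i = 1; i <= SEARCH_LIMIT; i++) {
		int cpu = (waker + i) % NR_CPUS;
		if (idle[cpu])
			return cpu;
	}
	return waker;	/* give up quickly and run locally */
}
```

Even a scheme like this only caps the search cost; it does nothing, by itself, about the excessive bouncing of processes that Mike's patch also addressed.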
Another possibility is to simply increase the scheduling granularity, essentially giving longer time slices to running processes. That will reduce the number of preemptions, making it less likely that PostgreSQL processes will step on each other's toes. Increasing the granularity does, indeed, make things better for the pgbench load. There may be some benefit to be had from messing with the granularity, but it is not without its risks. In particular, increasing the granularity could have an adverse effect on desktop interactivity; there is no shortage of Linux users who would consider that to be a bad trade.
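For the curious, kernels of this era built with CONFIG_SCHED_DEBUG expose the granularity through /proc/sys/kernel/sched_min_granularity_ns, so experiments are easy to run. The value below is an arbitrary example, not a recommendation:

```c
#include <stdio.h>

/* Raise the minimum scheduling granularity to 10ms (an arbitrary
 * example value).  Requires root; the file is present only on kernels
 * built with CONFIG_SCHED_DEBUG. */
int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_min_granularity_ns", "w");

	if (f == NULL) {
		perror("sched_min_granularity_ns");
		return 1;
	}
	fprintf(f, "%d\n", 10000000);	/* nanoseconds */
	return fclose(f) ? 1 : 0;
}
```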
Yet another possibility is to somehow teach the scheduler to recognize processes, like the PostgreSQL dispatcher, that should not be preempted by related processes if it can be avoided. Ingo Molnar suggested investigating this idea.
The problem, of course, is the dragons lurking down that path. The O(1) scheduler, used by Linux until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over time, accumulated no end of heuristics and hacks designed to provide the "right" kind of scheduling for various types of workloads. All of these tweaks complicated the scheduler code considerably, making it fragile and difficult to work with; they did not even work much of the time. This complexity inspired Con Kolivas to create his "staircase deadline" scheduler as a much simpler solution to the problem; that work, in turn, led to the writing (and merging) of CFS.
Naturally, CFS has lost a fair amount of its simplicity since it was merged; contact with the real world tends to do that to scheduling algorithms. But it is still relatively free of workload-specific heuristics. Opening the door to more of them now risks driving the scheduler in a less maintainable, more brittle direction where nothing can be done without a significant chance of creating problems in unpredictable places. It seems unlikely that the development community wants to go there.
A potentially simpler alternative is to let the application itself tell the scheduler that one of its processes is special. PostgreSQL could request that its dispatcher be allowed to run at the expense of one of its own workers, even if the normal scheduling algorithm would dictate otherwise. That approach reduces complexity, but it does so by pushing some of the cost into applications. Getting application developers to accept that cost can be a challenge, especially if they are interested in supporting operating systems other than Linux. As a general rule, it is far better if things just work without the need for manual intervention of this type.
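What might such an interface look like? Nothing of the sort exists in the kernel; the prctl() operation below is entirely hypothetical, sketched only to show how small the application-side change could be:

```c
#include <sys/prctl.h>

/* PR_SET_DISPATCH_HINT is a made-up request: no such prctl() operation
 * exists.  It sketches a hypothetical "prefer not to preempt me in
 * favor of processes I have woken" hint from the application. */
#define PR_SET_DISPATCH_HINT	0x6470	/* invented value */

static int mark_as_dispatcher(void)
{
	return prctl(PR_SET_DISPATCH_HINT, 1, 0, 0, 0);
}
```

Even a one-line call like this, though, is exactly the sort of Linux-specific tweak that cross-platform projects like PostgreSQL tend to resist.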
In other words, nobody really knows, at this point, how this problem will be solved. There are several interesting ideas to pursue, but none of them looks like an obvious solution. Further research is clearly called for.
One good point in all of this is that the problem was found before the final 3.6 kernel shipped. Performance regressions have a way of hiding, sometimes for years, before they emerge to bite some important workload. Eventually, tools like LinSched may help to find more of these problems early, but we will always be dependent on users who perform this kind of testing with the workloads that matter to them. Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an unpleasant surprise when this kernel was released.