The future calculus of memory management
Over the last fifty years, thousands of very bright system software engineers, including many of you reading this today, have invested large parts of their careers trying to solve a single problem: How to divide up a fixed amount of physical RAM to maximize a machine's performance across a wide variety of workloads. We can call this "the MM problem."
Because RAM has become incredibly large and inexpensive, because the ratios and topologies of CPU speeds, disk access times, and memory controllers have grown ever more complex, and because the workloads have changed dramatically and have become ever more diverse, this single MM problem has continued to offer fresh challenges and excite the imagination of kernel MM developers. But at the same time, the measure of success in solving the problem has become increasingly difficult to define.
So, although this problem has never been considered "solved", it is about to become much more complex, because those same industry changes have also brought new business computing models. Gone are the days when optimizing a single machine and a single workload was a winning outcome. Instead, dozens, hundreds, thousands, perhaps millions of machines run an even larger number of workloads. The "winners" in the future industry are those that figure out how to get the most work done at the lowest cost in this ever-growing environment. And that means resource optimization. No matter how inexpensive a resource is, a million times that small expense is a large expense. Anything that can be done to reduce that large expense, without a corresponding reduction in throughput, results in greater profit for the winners.
Some call this (disdainfully or otherwise) "cloud computing", but no matter what you call it, the trend is impossible to ignore. Assuming it is both possible and prudent to consolidate workloads, it is increasingly possible to execute those workloads more cost effectively in certain data center environments where the time-varying demands of the work can be statistically load-balanced to reduce the maximum number of resources required. A decade ago, studies showed that only about 10% of the CPU in a typical pizza-box server was being utilized... wouldn't it be nice, they said, if we could consolidate and buy 10x fewer servers? This would not only save money on servers, but would save a lot on power, cooling, and space too. While many organizations had some success in consolidating some workloads "manually", many other workloads broke or became organizationally unmanageable when they were combined onto the same system and/or OS. As a result, scale-out has continued and various virtualization and partitioning technologies have rapidly grown in popularity to optimize CPU resources.
But let's get back to "MM", memory management. The management of RAM has not changed much to track this trend toward optimizing resources. Since "RAM is cheap", the common response to performance problems is "buy more RAM". Sadly, in this evolving world where workloads may run on different machines at different times, this classic response results in harried IT organizations buying more RAM for most or all of the machines in a data center. A further result is that the ratio of total RAM in a data center to the sum of the "working sets" of the workloads is often at least 2x and sometimes as much as 10x. This means that somewhere between half and 90% of the RAM in an average data center is wasted, which is decidedly not cheap. So the obvious question arises: Is it possible to apply similar resource-optimization techniques to RAM?
A thought experiment
Bear with me and open your imagination for the following thought experiment:
Let's assume that the next generation processors have two new instructions: PageOffline and PageOnline. When PageOffline is executed (with a physical address as a parameter), that (4K) page of memory is marked by the hardware as inaccessible and any attempts to load/store from/to that location result in an exception until a corresponding PageOnline is executed. And through some performance registers, it is possible to measure which pages are in the offline state and which are not.
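To make the semantics concrete, here is a toy user-space model of what these two hypothetical instructions and their associated "performance registers" might amount to. Everything here is invented for the sake of the thought experiment: the bitmap, the single counter standing in for the performance registers, and the 8GB machine size describe no real or proposed hardware.

```c
/* A user-space toy model of the hypothetical PageOffline/PageOnline
 * semantics from the thought experiment (not real hardware): a bitmap
 * tracks which 4K pages are offline, and a counter stands in for the
 * "performance registers" that report how many pages are offline. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12				/* 4K pages */
#define TOTAL_PAGES	(8UL << 30 >> PAGE_SHIFT)	/* model an 8GB machine */

static bool offline[TOTAL_PAGES];
static unsigned long nr_offline;		/* the "performance register" */

static int page_offline(uint64_t paddr)
{
	uint64_t pfn = paddr >> PAGE_SHIFT;

	if (pfn >= TOTAL_PAGES || offline[pfn])
		return -1;			/* bad address or already offline */
	offline[pfn] = true;			/* loads/stores would now fault */
	nr_offline++;
	return 0;
}

static int page_online(uint64_t paddr)
{
	uint64_t pfn = paddr >> PAGE_SHIFT;

	if (pfn >= TOTAL_PAGES || !offline[pfn])
		return -1;
	offline[pfn] = false;			/* page is usable again */
	nr_offline--;
	return 0;
}

int main(void)
{
	/* Take the second half of the machine's RAM offline ... */
	for (uint64_t pfn = TOTAL_PAGES / 2; pfn < TOTAL_PAGES; pfn++)
		page_offline(pfn << PAGE_SHIFT);
	/* ... then bring one page back when it is needed again. */
	page_online((TOTAL_PAGES / 2) << PAGE_SHIFT);
	printf("%lu of %lu pages offline\n", nr_offline, TOTAL_PAGES);
	return 0;
}
```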
Let's further assume that John and Joe are kernel MM developers and their employer "GreenCloud" is "green" and enlightened. The employer offers the following bargain to John and Joe and the thousands of other software engineers working at GreenCloud: "RAM is cheap but not free. We'd like to encourage you to use only the RAM necessary to do your job. So, for every page, on "average" over the course of the year, that you have offline, we will add one-hundredth of one cent to your end-of-year bonus. Of course, if you turn off too much RAM, you will be less efficient at getting your job done, which will reflect negatively on your year-end bonus. So it is up to you to find the right balance."
John and Joe quickly do the arithmetic: my machine has 8GB of RAM, so if I keep 4GB offline on average, I will be $100 richer. They immediately start scheming about how to dynamically measure their working set and optimize page offlining.
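For what it's worth, the arithmetic checks out: 4GB of 4K pages is 1,048,576 pages, and at one-hundredth of a cent per page-year that comes to a little over $100. A trivial check, using only the numbers from the story:

```c
/* Back-of-the-envelope check of John and Joe's arithmetic: 4GB of 4K pages
 * kept offline, at one-hundredth of a cent per page-year. */
#include <stdio.h>

int main(void)
{
	unsigned long pages = 4UL << 30 >> 12;	/* 4GB / 4K = 1,048,576 pages */
	double rate = 0.01 / 100;		/* one-hundredth of a cent, in dollars */

	printf("%lu pages * $%.6f = $%.2f per year\n",
	       pages, rate, pages * rate);	/* prints roughly $104.86 */
	return 0;
}
```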
But the employer goes on: "And for any individual page that you have offline for the entire year, we will double that to two-hundredths of a cent. But once you've chosen the "permanent offline" option on a page, you are stuck with that decision until the next calendar year."
John, anticipating the extra $200, decides immediately to try to shut off 4GB for the whole year. Sure, there will be some workload peaks where his machine will get into a swapstorm and he won't get any work done at all, but that will happen rarely and he can pretend he is on a coffee break when it happens. Maybe the boss won't notice.
Joe starts crafting a grander vision; he realizes that, if he can come up with a way to efficiently allow others' machines that are short on RAM capacity to utilize the RAM capacity on his machine, then the "spread" between temporary offlining and permanent offlining could create a nice RAM market that he could exploit. He could ensure that he always has enough RAM to get his job done, but dynamically "sell" excess RAM capacity to those, like John, who have underestimated their RAM needs ... at say fifteen thousandths of a cent per page-year. If he can implement this "RAM capacity sharing capability" into the kernel MM subsystem, he may be able to turn his machine into a "RAM server" and make a tidy profit. If he can do this ...
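To make the "spread" concrete, the rates below are taken straight from the story, but how the payments combine is my own reading of the bargain (in particular, that Joe forgoes the temporary-offline bonus on any page he rents out), not something the story spells out:

```c
/* The "spread" in Joe's scheme, using only the rates from the story.
 * All rates are expressed in dollars per page-year. */
#include <stdio.h>

int main(void)
{
	double temp_offline = 0.01 / 100;	/* one-hundredth of a cent */
	double perm_offline = 0.02 / 100;	/* two-hundredths of a cent */
	double market_rate  = 0.015 / 100;	/* what Joe charges John */

	/* Joe gives up the temporary-offline bonus on a page he rents out,
	 * but collects the market rate instead. */
	printf("Joe's gain per rented page-year:  $%.6f\n",
	       market_rate - temp_offline);
	/* John keeps his permanent-offline bonus and pays the market rate
	 * whenever he needs the capacity back. */
	printf("John's gain per rented page-year: $%.6f\n",
	       perm_offline - market_rate);
	return 0;
}
```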
Analysis
In the GreenCloud story, we have: (1) a mechanism for offlining and onlining RAM one page at a time; (2) an incentive for using less RAM than is physically available; and (3) a market for load-balancing RAM capacity dynamically. If Joe successfully figures out a way to make his excess RAM capacity available to others and get it back when he needs it for his own work, we may have solved (at least in theory) the resource optimization problem for RAM for the cloud.
While the specifics of the GreenCloud story may not be realistic or accurate, some of the same factors do exist in the real world. In a virtual environment, "ballooning" allows individual pages to be onlined/offlined in one VM and made available to other VMs; in a bare-metal environment, the RAMster project provides a similar capability. So, though primitive and not available in all environments, we do have a mechanism. And substantially reducing the total amount of RAM across a huge number of machines in a data center would reduce both capital outlay and power/cooling costs, improving resource efficiency and thus potential profit. So we have an incentive and the foundation for a market.
Interestingly, the missing piece, and where this article started, is that most OS MM developers are laser-focused on the existing problem from the classic single-machine world, which is, you will recall: how to divide up a fixed amount of physical RAM to maximize a single machine's performance across a wide variety of workloads.
The future version of this problem is this: how to vary the amount of physical RAM provided by the kernel and divide it up to maximize the performance of a workload. In the past, this was irrelevant: you own the RAM, you paid for it, it's always on, so just use it. But in this different and future world with virtualization, containers, and/or RAMster, it's an intriguing problem. It will ultimately allow us to optimize the utilization of RAM, as a resource, across a data center.
It's also a hard problem, for three reasons. The first is that we can't predict, but can only estimate, the future RAM demands of any workload. But this is true today; the only difference is whether the result is "buy more RAM" or not. The second is that we need to understand the instantaneous benefit (performance) of each additional page of RAM (cost); my math is very rusty, but this reminds me of differential calculus, where "dy" is performance and "dx" is RAM size. At every point in time, increasing dx past a certain size will have no corresponding increase in dy. Perhaps this suggests control theory more than calculus, but the needed result is a true dynamic representation of "working set" size. The third is that there is some cost for moving capacity efficiently; this cost (and its impact on performance) must somehow be measured and taken into account as well.
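One way to picture the "calculus" is to treat performance as a function of online RAM that flattens out once the working set fits, and to keep adjusting the amount of online RAM until the marginal benefit of another increment drops below the price of a page. The sketch below is purely illustrative: the modeled performance curve, the resize step, the page price, and the absence of any move cost are all inventions, and nothing here corresponds to a real kernel interface.

```c
/* A toy illustration of the "calculus": performance is modeled as a function
 * P(M) of online RAM M that saturates once M covers the working set, and a
 * simple controller adjusts M until the marginal benefit of another page
 * falls below the price of a page.  Every number is invented for
 * illustration. */
#include <stdio.h>

#define WORKING_SET	1200000.0	/* pages the workload actually touches */
#define STEP		16384.0		/* resize granularity, in pages */

/* Modeled performance: grows with RAM, flat once the working set fits. */
static double perf(double pages)
{
	return pages < WORKING_SET ? pages / WORKING_SET : 1.0;
}

int main(void)
{
	double pages = 2097152.0;	/* start with all of an 8GB machine online */
	double page_price = 1e-7;	/* "cost" of keeping one page online */

	for (int i = 0; i < 200; i++) {
		/* Estimate dP/dM by probing one step below the current size. */
		double marginal = (perf(pages) - perf(pages - STEP)) / STEP;

		if (marginal < page_price)
			pages -= STEP;	/* last step bought ~nothing: offline it */
		else
			pages += STEP;	/* RAM is still paying for itself */
	}
	printf("settled at %.0f pages online (working set: %.0f)\n",
	       pages, WORKING_SET);
	return 0;
}
```

Run as written, the loop settles within one step of the working-set size; a real controller would also have to fold in the cost of moving capacity and the noise in any working-set estimate.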
But, in my opinion, this "calculus" is the future of memory management.
I have no answers and only a few ideas, but there are a lot of bright people who know memory management far better than I do. My hope is to stimulate discussion about this very possible future and how the kernel MM subsystem should deal with it.