Linux Storage and Filesystem Workshop, part 2

Posted Apr 8, 2009 20:04 UTC (Wed) by jzbiciak (guest, #5246)
Parent article: Linux Storage and Filesystem Workshop, day 2

I wonder if SSDs will ever gain an additional level of hierarchy, such as an NVRAM layer over the flash layer. "Hot write" zones could then live in the NVRAM layer and only be committed to flash infrequently. By "hot write," I'm thinking of stuff such as the journal on a journaled filesystem.

Aggressive use of TRIM when committing entries out of the journal would make it easier to reap blocks within this faster level of hierarchy, and would make the drive less sensitive to the size of the journal. That is, the journal could be much larger than the NVRAM size, but you'd still get the benefit if the *active* part of the journal fit in the NVRAM.

Having such a buffer should also make it much easier to wear-level the drive, pushing writes out of the NVRAM LRU only as needed, rotating among all the pieces of flash. The NVRAM would also allow the SSD to buffer requests (and mark writes as complete!) while it's in the middle of erasing sectors in the flash.

Sure, it'd be expensive, but I imagine it'd fly like a bat outta hell.
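
The buffering idea above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `NvramWriteBuffer`, its methods, and the block counts are all hypothetical, and nothing here describes how any real SSD works. It runs a 16-block journal through a 4-block NVRAM buffer and issues a TRIM once an entry has been committed out of the journal.

```python
from collections import OrderedDict


class NvramWriteBuffer:
    """Toy model of an NVRAM write-back buffer in front of flash (hypothetical)."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.buffer = OrderedDict()  # lba -> data, least recently written first
        self.flash = {}              # lba -> data actually committed to flash
        self.flash_writes = 0        # number of block writes that reached flash

    def write(self, lba, data):
        # A rewrite of a hot block is absorbed in NVRAM; it only refreshes recency.
        if lba in self.buffer:
            self.buffer.move_to_end(lba)
        self.buffer[lba] = data
        # Push the least recently written blocks out to flash only when needed.
        while len(self.buffer) > self.capacity:
            old_lba, old_data = self.buffer.popitem(last=False)
            self.flash[old_lba] = old_data
            self.flash_writes += 1

    def trim(self, lba):
        # The filesystem no longer needs this block: drop it without a flash write.
        self.buffer.pop(lba, None)
        self.flash.pop(lba, None)


def run_journal(use_trim, journal_blocks=16, nvram_blocks=4, transactions=100):
    buf = NvramWriteBuffer(nvram_blocks)
    for txn in range(transactions):
        buf.write(txn % journal_blocks, f"txn {txn}")
        if use_trim and txn >= 2:
            # The entry from two transactions ago has been committed elsewhere.
            buf.trim((txn - 2) % journal_blocks)
    return buf.flash_writes


print(run_journal(use_trim=True))   # 0
print(run_journal(use_trim=False))  # 96
```

With TRIM, the active part of the journal always fits in the buffer, so no journal write reaches flash even though the journal is four times the NVRAM size. Without TRIM, every write beyond the first four eventually gets pushed out.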


Linux Storage and Filesystem Workshop, part 2

Posted Apr 8, 2009 22:06 UTC (Wed) by jzbiciak (guest, #5246) [Link] (15 responses)

Also, how long until flash drives go the "WinModem" route, exposing a raw interface and putting all the smarts in an OS driver? I guess you still need to have some minimal disk emulation to get booted far enough to load such a driver...

Linux Storage and Filesystem Workshop, part 2

Posted Apr 9, 2009 0:43 UTC (Thu) by mattdm (subscriber, #18) [Link] (14 responses)

As I understand it, that'd be a very good thing for Linux, just as Linux software RAID is usually better than the "fraid" options.

Winmodem-like solid state storage

Posted Apr 10, 2009 18:34 UTC (Fri) by giraffedata (guest, #1954) [Link] (13 responses)

>> Also, how long until flash drives go the "WinModem" route, exposing a raw interface and putting all the smarts in an OS driver?
> As I understand it, that'd be a very good thing for Linux

But in the Winmodem route, the raw interface is a secret and specific to a small subset of devices, and the manufacturer doesn't write any Linux drivers. That's all pretty bad for Linux, isn't it?

On the other hand, assuming flash storage lasts (doesn't get replaced by a different solid state storage technology that has different characteristics), I do expect it to eventually grow an interface optimized for driving flash (i.e. low level) rather than use an interface designed for disk drives, and then we could get the advantages of running more of the logic in the client and less in the storage device.

Winmodem-like solid state storage

Posted Apr 11, 2009 19:00 UTC (Sat) by dwmw2 (subscriber, #2063) [Link] (12 responses)

"But in the Winmodem route, the raw interface is a secret and specific to a small subset of devices, and the manufacturer doesn't write any Linux drivers. That's all pretty bad for Linux, isn't it?"

A WinModem is mostly just a sound card — we do actually know how to drive most of them. That's not the problem at all.

The problem with WinModems is that modem algorithms better than about V.32 are covered by patents. And while a lot of stuff in that situation has still found its way into software projects in the Free World, we still don't have a decent modem implementation. Someone needs to do a Free World fork of spandsp, perhaps? And/or pick up the late Tony Fisher's work on V.32bis and V.34.

For flash, the situation is different. While there are plenty of patents flying around, they mostly cover the ways in which you make a flash device pretend to be a block device — and the beauty of exposing flash directly to the operating system is that you don't need that gratuitous extra layer any more. You can have a file system which knows about flash, and is designed to operate directly, and optimally, on it.

The recent TRIM work goes some way to fixing the most obvious disadvantage of the extra layer, but the fact remains that you still have your real file system running on top of another pseudo-filesystem which is pretending to be a block device. And you can never attempt to debug or improve that lower layer.

The Linux kernel has two file systems for real flash already, and more are in the works. I'd very much like to see direct access to the flash being permitted by these devices. I'm confident that we can do better than anything they can do inside their little black box.

Winmodem-like solid state storage

Posted Apr 11, 2009 20:25 UTC (Sat) by giraffedata (guest, #1954) [Link] (11 responses)

Thanks for elucidating the patent angle.

But that really just raises another issue. With flash storage having its novel foibles, I find it hard to believe there aren't patents covering the various things you have to do to make it useful. Method for storing data on flash without wearing out hot spots? Method for extracting small blocks of data from flash quickly?

> I'm confident that we can do better than anything they can do inside their little black box.

So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.

At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless.

So it still looks to me like Winmodem-style flash storage would be a bane to Linux and free software.

Winmodem-like solid state storage

Posted Apr 11, 2009 23:09 UTC (Sat) by jzbiciak (guest, #5246) [Link] (7 responses)

> So it still looks to me like Winmodem-style flash storage would be a bane to Linux and free software.

Nonsense. There are plenty of machines out there that have just plain NAND or NOR flash hooked up to the CPU and Linux reads/writes these effectively. The issue is that currently we only see that in the embedded space, and it's typically just enough flash to hold the code for whatever that machine is supposed to do. For example, be a WiFi router or a set-top box or a cell phone.

What I'd like to see is something I can get off the shelf at my local Computer Mart (or on the web) that plugs into my PC and gives me raw flash. Instead of focusing on "right sized" and "small" and "maximizing battery life", it instead can be a bank of parallel flash such as what Intel's SSD disks are, but with a raw interface. We can then use our existing flash filesystems and infrastructure to drive those in a desktop and laptop space, rather than just the netbook/smart-phone/smart-router space.

Now, these (potentially) massively parallel performance oriented disks will need additional software support. You want something akin to RAID striping across the media along with maybe some redundancy in addition to wear leveling. That's just enhancements on top of our existing wear leveling filesystems and infrastructure.
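
A minimal sketch of what striping plus wear leveling could look like, assuming a made-up layout: the `StripedFlash` class, the chip and block counts, and the "erase on every write" simplification are all hypothetical and chosen only for illustration. Logical blocks are striped RAID-0 style across chips, and each write goes to the least-worn free block on its chip. Redundancy is left out.

```python
class StripedFlash:
    """Toy model: stripe logical blocks across chips, wear-level within each chip."""

    def __init__(self, num_chips, blocks_per_chip):
        self.num_chips = num_chips
        # erase_counts[chip][physical_block] = erases so far
        self.erase_counts = [[0] * blocks_per_chip for _ in range(num_chips)]
        # mapping[chip] maps a chip-local logical block to its physical block
        self.mapping = [dict() for _ in range(num_chips)]

    def chip_for(self, lba):
        # Consecutive logical blocks land on different chips, so they can
        # be serviced in parallel.
        return lba % self.num_chips

    def write(self, lba):
        chip = self.chip_for(lba)
        local = lba // self.num_chips
        in_use = set(self.mapping[chip].values())
        # The previous copy of this logical block becomes free on rewrite.
        in_use.discard(self.mapping[chip].get(local))
        free = [b for b in range(len(self.erase_counts[chip])) if b not in in_use]
        # Wear leveling: choose the free block with the fewest erases.
        target = min(free, key=lambda b: self.erase_counts[chip][b])
        self.erase_counts[chip][target] += 1  # simplification: one erase per write
        self.mapping[chip][local] = target
        return chip, target


flash = StripedFlash(num_chips=4, blocks_per_chip=8)
print([flash.chip_for(lba) for lba in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
for _ in range(80):
    flash.write(0)  # hammer a single hot logical block
print(flash.erase_counts[0])  # [10, 10, 10, 10, 10, 10, 10, 10]
```

Even though only one logical block is rewritten, the wear is spread evenly over all eight physical blocks of the chip that holds it.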

The only real issue is that once you give raw flash to the OS and put the smarts in the OS, it'll be harder for dual-boot systems to communicate on the same media, because the likelihood that $VENDOR's Windows driver organizes the disk the way Linux does is slim to none unless $VENDOR works with the Linux community also.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:21 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

"The only real issue is that once you give raw flash to the OS and put the smarts in the OS, it'll be harder for dual-boot systems to communicate on the same media, because the likelihood that $VENDOR's Windows driver organizes the disk the way Linux does is slim to none unless $VENDOR works with the Linux community also."

I see two reasons why that wouldn't be a problem, in practice.

Firstly, we've never had many problems working with "foreign" formats. We cope with NTFS, HFS and various bizarre crappy "Software RAID" formats, amongst other things. That includes the special on-flash formats like the NAND Flash Translation Layer used on the M-Systems DiskOnChip devices, which has been supported for about a decade. Are you suggesting that hardware vendors take Linux less seriously now than they did ten years ago, and that we'd have a harder time working out how to interoperate? Remember, documenting the on-medium format doesn't necessarily give away all the implementation details like algorithms for wear levelling, etc. — that's why M-Systems were content to give us documentation, all that time ago.

Secondly, interoperability at that level isn't a showstopper. It's nice to have, admittedly, but I'm not going to lose a lot of sleep if I can't mount my Windows or MacOS file system under Linux. It's the native functionality of the device under Linux that I care about most of the time.

Of course, I see no reason why the device vendors should be pushing their own "speshul" formats anyway — the hard drive vendors don't. But I'm not naïve enough to think that they won't try.

Imagine a world where every hard drive you buy is actually more like a NAS. You can only talk a high-level protocol like CIFS or NFS to it; you can't access the sectors directly. Each vendor has their own proprietary file system on it internally, implemented behind closed doors by the same kind of people who write BIOSes. You have no real information about what's inside, and can't make educated decisions on which products to buy. Having made your choice you can't debug it, you can't optimise it for your own use case, you can't try to recover your data when things go wrong, and you sure as hell can't use btrfs on it. All you can do is pray to the deity of your choice, then throw the poxy thing out the window when it loses your data.

If the above paragraph leaves you in a cold sweat, it was intended to. That's the kind of dystopia I see in my head, when we talk about SSDs without direct access to the flash.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:46 UTC (Sun) by giraffedata (guest, #1954) [Link] (3 responses)

> What I'd like to see is something I can get off the shelf at my local Computer Mart (or on the web) that plugs into my PC and gives me raw flash.

If a PCIe expansion socket is sufficient, several companies are now selling that. I remember IBM demonstrating last year a prototype storage server composed of a bunch of Linux systems with Fusion-IO PCI Express cards for storage. It broke some kind of record.

In that system, the flash storage still appeared as a block device, but it did it at the Linux block device interface instead of at the SCSI physical interface.

Winmodem-like solid state storage

Posted Apr 16, 2009 18:53 UTC (Thu) by wmf (guest, #33791) [Link] (2 responses)

Fusion io is not raw flash since the driver contains a sophisticated FTL that cannot be disabled. In theory they could release an MTD driver, but they're not going to.

Winmodem-like solid state storage

Posted Apr 16, 2009 20:15 UTC (Thu) by giraffedata (guest, #1954) [Link] (1 responses)

> Fusion io is not raw flash since the driver contains a sophisticated FTL that cannot be disabled. In theory they could release an MTD driver, but they're not going to.

Is the driver you're talking about a Linux kernel module? An object code only one?

What is MTD?

All this has happened before...

Posted Apr 16, 2009 21:55 UTC (Thu) by wmf (guest, #33791) [Link]

> Is the driver you're talking about a Linux kernel module? An object code only one?

Yep.

> What is MTD?

How you access raw flash. See also the "Raw flash vs. FTL devices" section of the UBIFS documentation.
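
For a concrete feel: MTD (Memory Technology Device) is the Linux subsystem for raw flash, and it lists its partitions in `/proc/mtd` with sizes and erase block sizes in hexadecimal. The sketch below parses that format. The sample text and partition names are made up for illustration, and `parse_proc_mtd` is a hypothetical helper, not part of any library.

```python
import re

# Hypothetical sample in the format of /proc/mtd; on a real system one
# would read the file itself instead.
SAMPLE_PROC_MTD = '''dev:    size   erasesize  name
mtd0: 00040000 00020000 "bootloader"
mtd1: 07fc0000 00020000 "rootfs"
'''

LINE_RE = re.compile(r'(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"(.*)"')


def parse_proc_mtd(text):
    """Return one dict per MTD partition, with sizes converted from hex to bytes."""
    partitions = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skips the header line
        dev, size, erasesize, name = match.groups()
        partitions.append({
            "dev": dev,
            "size": int(size, 16),
            "erasesize": int(erasesize, 16),
            "name": name,
        })
    return partitions


for part in parse_proc_mtd(SAMPLE_PROC_MTD):
    erase_blocks = part["size"] // part["erasesize"]
    print(part["dev"], part["name"], erase_blocks, "erase blocks")
```

The erase block size is visible to software here, which is exactly the detail a block device interface hides behind the FTL.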

Winmodem-like solid state storage

Posted Apr 19, 2009 14:11 UTC (Sun) by oak (guest, #2786) [Link] (1 responses)

> What I'd like to see is something I can get off the shelf at my local
> Computer Mart (or on the web) that plugs into my PC and gives me raw
> flash. Instead of focusing on "right sized" and "small" and "maximizing
> battery life", it instead can be a bank of parallel flash such as what
> Intel's SSD disks are, but with a raw interface. We can then use our
> existing flash filesystems and infrastructure to drive those in a desktop
> and laptop space

Unlike block-based file systems like ext[234], the existing flash file
systems are designed for very small file systems. E.g. JFFS2 keeps the
whole file system metadata in RAM and is unusable in GB-sized file
systems.

However, the newly merged UBIFS promises to work much better:
* http://lwn.net/Articles/275706/
* http://www.linux-mtd.infradead.org/doc/ubifs.html#L_scala...

There's no usage data yet on how well it performs with desktop and server
loads, though.

Winmodem-like solid state storage

Posted Apr 19, 2009 14:28 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

" Unlike block based file systems like ext[234], the existing flash file systems are designed for very small file systems. E.g. JFFS2 keeps the whole file system metadata in RAM and is unusable in GB sized file systems."

Very true — although we put a lot of effort in to make JFFS2 better for OLPC with its 1GiB of NAND flash. It mounts in 6 seconds or so, and we reduced the RAM usage by a significant amount too. But still, JFFS2 was designed in the days of 32MiB or so of NOR flash, and definitely isn't intended to scale up to the kind of sizes we're seeing now.

UBIFS is much more promising, but as you correctly observe is not yet proven for desktop or server workloads. I'm actually keen to get btrfs working on raw flash, too.

The point is that with stuff done in software, we can do better; whether we do better or not today is a different, and less interesting, issue.

After all, we can always implement the same "pretend to be a block device" kind of thing to tide us over in the short term, if we need to. We have three or four such translation layers in Linux already, and more on the way.

Winmodem-like solid state storage

Posted Apr 12, 2009 0:42 UTC (Sun) by dwmw2 (subscriber, #2063) [Link] (2 responses)

"So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.
"At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless."

That's a very pessimistic viewpoint. If you truly believe that the patent system is so broken and abused that it prevents all innovation, I'd recommend a career in goat-herding. You obviously wouldn't want to be involved in any form of innovative software development — either Free Software or otherwise.

Thankfully, I don't think it's a valid viewpoint either — as broken as the patent system is, I don't think it's time to throw in the towel just yet.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:33 UTC (Sun) by giraffedata (guest, #1954) [Link] (1 responses)

>> So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.
>> At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless.

> That's a very pessimistic viewpoint. If you truly believe that the patent system is so broken and abused that it prevents all innovation, ...

But I said the opposite. I suggested someone would do the innovation. And then patent it. It is not pessimistic to expect an inventor to patent his invention; they do it all the time, even for trivial inventions.

Patents seem to be anathema to the Linux world. I thought you said patents are the reason Linux and Winmodems don't get along; I'm just trying to complete the analogy.

Winmodem-like solid state storage

Posted Apr 12, 2009 2:05 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

"But I said the opposite. I suggested someone would do the innovation. And then patent it. It is not pessimistic to expect an inventor to patent his invention; they do it all the time, even for trivial inventions."

Then we need to make sure we get there first, patent it ourselves and license the patent appropriately for use in Free Software.

What's the alternative? To always assume that someone will have got there first, and that any software development that's even remotely innovative will fall foul of a patent and thus, in your words, be "useless"?

That's what I meant when I said it "prevents innovation" — I mean it prevents innovation for us, if we always assume everything interesting will already be patented. And that part of the discussion isn't really specific to modems or SSDs, is it? It applies right across the board.

"Patents seem to be anathema to the Linux world. I thought you said patents are the reason Linux and Winmodems don't get along; I'm just trying to complete the analogy."

Modems are a special case, because you need to implement precisely the patented algorithms in order to communicate with another modem using the affected standards.

For flash storage, you don't have to do that; you have a lot more flexibility to come up with something that isn't affected by patents. A closer analogy might be audio/video compression, where the Free Software world was able to come up with the patent-free Vorbis and Theora codecs.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 9, 2009 8:28 UTC (Thu) by viiru (subscriber, #53129) [Link]

> Aggressive use of TRIM when committing entries out of the journal would
> make it easier to reap blocks within this faster level of hierarchy, and
> would make the drive less sensitive to the size of the journal. That is,
> the journal could be much larger than the NVRAM size, but you'd still get
> the benefit if the *active* part of the journal fit in the NVRAM.

Aggressive use of TRIM on the journal could also mean TRIMming the journal completely on a clean mount, or even after replaying it on an unclean mount (some care needs to be taken here, though). I think this could help wear leveling on laptops and netbooks quite a bit (they boot often), and those are currently the place where the biggest advantages of SSDs are.
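
As a rough sketch of which journal blocks could be TRIMmed at mount time, assume a circular journal whose live entries occupy `[tail, head)` modulo the journal size, with `tail == head` meaning the journal is empty. The function name and the layout are hypothetical and not taken from any particular filesystem.

```python
def trimmable_journal_ranges(journal_blocks, tail, head):
    """Return (start, length) ranges of a circular journal with no live entries.

    Assumption: live entries occupy [tail, head) modulo journal_blocks, and
    tail == head means the journal is empty (a clean mount, or an unclean
    mount once replay has finished).
    """
    if tail == head:
        return [(0, journal_blocks)]  # nothing live: the whole journal can go
    if tail < head:
        # Live region sits in the middle; free space is before and after it.
        ranges = [(0, tail), (head, journal_blocks - head)]
    else:
        # Live region wraps past the end; free space is the gap in the middle.
        ranges = [(head, tail - head)]
    return [(start, length) for start, length in ranges if length > 0]


print(trimmable_journal_ranges(1024, 0, 0))      # [(0, 1024)]
print(trimmable_journal_ranges(1024, 100, 300))  # [(0, 100), (300, 724)]
print(trimmable_journal_ranges(1024, 900, 50))   # [(50, 850)]
```

The care mentioned above for the unclean case shows up as an ordering constraint: the ranges must be computed only after replay has completed and `tail` has been advanced, never before.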

Linux Storage and Filesystem Workshop, part 2

Posted Apr 16, 2009 18:43 UTC (Thu) by wmf (guest, #33791) [Link] (1 responses)

> I wonder if SSDs will ever gain an additional level of hierarchy, such as an NVRAM layer over the flash layer. "Hot write" zones could then live in the NVRAM layer and only be committed to flash infrequently.

If you're willing to pay $50/GB, the STEC ZeusIOPS and TMS RamSAN provide something like this. For affordable SSDs I expect DRAM caches to range between tiny and nonexistent.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 17, 2009 22:10 UTC (Fri) by nix (subscriber, #2304) [Link]

But why? RAM caches are not tiny-to-nonexistent on inexpensive server
hardware RAID cards (Areca, for instance, has a 256MB cache in its 4-port
cards, rising to 2GB, I think, in the huge ones).

Now 256MB may not be immense, but it's surely not tiny either.