The trouble with discard
There are a number of use cases for the discard functionality. Large, "enterprise-class" storage arrays can implement virtual devices with a much larger storage capacity than is actually installed in the cabinet; these arrays can use information about unneeded blocks to reuse the physical storage for other data. The compcache compressed in-memory swapping mechanism needs to know when specific swap slots are no longer needed to be able to free the memory used for those slots. Arguably, the strongest pressure driving the discard concept comes from solid-state storage devices (SSDs). These devices must move data around on the underlying flash storage to implement their wear-leveling algorithms. In the absence of discard-like functionality, an SSD will end up shuffling around data that the host system has long since stopped caring about; telling the device about unneeded blocks should result in better performance.
The sad truth of the matter, though, is that this improved performance does not actually happen on SSDs. There are two reasons for this:
- At the ATA protocol level, a discard request is implemented by a
"TRIM" command sent to the device. For reasons unknown to your
editor, the protocol committee designed TRIM as a non-queued command.
That means that, before sending a TRIM command to the device, the
block layer must first wait for all outstanding I/O operations on that
device to complete; no further operations can be started until the
TRIM command completes. So every TRIM operation stalls the request
queue. Even if TRIM were completely free, its non-queued nature would
impose a significant I/O performance cost. (It's worth noting that
the SCSI equivalent to TRIM is a tagged command which doesn't suffer
from this problem.)
- With current SSDs, TRIM appears to be anything but free. Mark Lord has measured regular delays of hundreds of milliseconds. Delays on that scale would be most unwelcome on a rotating storage device; on an SSD, hundred-millisecond latencies are simply intolerable. The rough model below gives a sense of what such delays cost.
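To get a sense of the scale of the problem, here is a quick back-of-the-envelope model of what a slow, non-queued TRIM does to a request queue. All of the numbers in it are illustrative assumptions, not measurements:

```c
/* Back-of-the-envelope model of the cost of a non-queued TRIM.
 * The numbers are assumptions for illustration only; the TRIM
 * latency is loosely based on the "hundreds of milliseconds"
 * figure mentioned above.
 */
#include <stdio.h>

int main(void)
{
    double trim_latency_s   = 0.2;     /* assumed: 200ms per TRIM          */
    double io_latency_s     = 0.0001;  /* assumed: 100us per small read    */
    double trims_per_second = 2.0;     /* assumed: TRIMs reaching the disk */

    /* Fraction of each second the queue spends stalled on TRIM commands. */
    double stalled = trims_per_second * trim_latency_s;

    /* I/O operations lost to the stall, assuming the device could
     * otherwise service one request every io_latency_s seconds. */
    double lost_ops = stalled / io_latency_s;

    printf("queue stalled %.0f%% of the time, ~%.0f ops/s lost\n",
           stalled * 100.0, lost_ops);
    return 0;
}
```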
One would assume that the second problem will eventually go away as the firmware running in SSDs gets smarter. But the first problem can only be fixed by changing the protocol specification, so any possible fix would be years in the future. It's a fact of life that we will simply have to live with.
There are a few proposals out there for how we might live with the
performance problems associated with discard operations. Matthew Wilcox
has a plan to
reimplement the whole discard concept using a cache in the block layer.
Rather than sending discard operations directly to the device, the block
layer will remember them in its own cache.
Any new write operations will then be compared against the discard cache;
whenever an operation overwrites a sector marked for discard, the block
layer will know that the discard operation is no longer necessary and can,
itself, be discarded. That, by itself, would reduce the number of TRIM
operations which must be sent to the device. But if the kernel can work to
increase locality on block devices, performance should improve even more.
One relatively easy-to-implement example would be actively reusing
recently-emptied swap slots instead of scattering swapped pages across the
swap device. As Matthew puts it: "there's a better way for the drive to
find out that the contents of a block no longer matter -- write some new
data to it."
In Matthew's scheme, the block layer would occasionally flush the discard cache, sending the actual operations to the device. The caching should allow many operations to be coalesced, further improving performance. Greg Freemyer, instead, suggests that the discard cache could be flushed by a user-space process; as he puts it: "When normal cpu / disk activity kicks in, this process goes to sleep."
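Either way, a flush pass would want to merge adjacent and overlapping ranges so that fewer, larger TRIM commands reach the device. A minimal sketch of that coalescing step (again purely illustrative, not code from any posted patch) might look like this:

```c
/* Sketch of the coalescing step a discard-cache flush could perform:
 * sort the pending ranges and merge neighbours so that fewer, larger
 * TRIM commands reach the device.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct range { uint64_t start, len; };

static int cmp_start(const void *a, const void *b)
{
    const struct range *ra = a, *rb = b;
    return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Merge adjacent/overlapping ranges in place; returns the new count. */
static size_t coalesce(struct range *r, size_t n)
{
    size_t out = 0;

    qsort(r, n, sizeof(*r), cmp_start);
    for (size_t i = 0; i < n; i++) {
        if (out && r[i].start <= r[out - 1].start + r[out - 1].len) {
            /* Adjacent or overlapping: extend the previous range. */
            uint64_t end = r[i].start + r[i].len;
            if (end > r[out - 1].start + r[out - 1].len)
                r[out - 1].len = end - r[out - 1].start;
        } else {
            r[out++] = r[i];
        }
    }
    return out;
}

int main(void)
{
    struct range pending[] = { {100, 8}, {200, 8}, {108, 8}, {116, 4} };
    size_t n = coalesce(pending, 4);

    /* Prints two merged operations: "TRIM 100 +20" and "TRIM 200 +8". */
    for (size_t i = 0; i < n; i++)
        printf("TRIM %llu +%llu\n",
               (unsigned long long)pending[i].start,
               (unsigned long long)pending[i].len);
    return 0;
}
```

Whether that merging happens in the block layer or in a user-space flusher, the effect is the same: fewer, larger discard operations reach the device.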
A variant of this approach was posted by Christoph Hellwig, who has implemented batched discard support in XFS. Christoph's patch adds a new ioctl() which wanders through the filesystem's free-space map and issues large discard operations on each of the free extents. The advantage of doing things at the filesystem level is that the filesystem already knows which blocks are uninteresting; there is no additional accounting required to obtain that information. This approach will also naturally generate large operations; larger discards tend to suit the needs of the hardware better. On the other hand, regularly discarding all of the free space in a filesystem makes it likely that some time will be spent telling the device to discard sectors which it already knows to be free.
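For the curious, here is a minimal sketch of how a batched-discard ioctl can be invoked from user space. For concreteness it uses the generic FITRIM interface by which mainline kernels expose batched discard; Christoph's patch adds an XFS-specific ioctl whose details may well differ:

```c
/* Minimal user-space invocation of a batched-discard operation using
 * the generic FITRIM ioctl.  The filesystem walks its free-space map
 * and issues a discard for each free extent in the given range.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FITRIM, struct fstrim_range */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mount point>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range = {
        .start  = 0,
        .len    = (__u64)-1,   /* the whole filesystem        */
        .minlen = 0,           /* discard even small extents  */
    };

    if (ioctl(fd, FITRIM, &range) < 0)
        perror("FITRIM");
    else
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}
```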
It is far too soon to hazard a guess as to which of these approaches - if any - will be merged into the mainline. There is a fair amount of coding and benchmarking work to be done still. But it is clear that the code which is in current mainline kernels is not up to the task of getting the best performance out of near-future hardware.
Your editor feels the need to point out one possibly-overlooked aspect of this problem. An SSD is not just a dumb storage device; it is, instead, a reasonably powerful computer in its own right, running complex software, and connected via what is, essentially, a high-speed, point-to-point network. Some of the more enterprise-oriented devices are more explicitly organized this way; they are separate boxes which hook into an IP-based local net. Increasingly, the value in these devices is not in the relatively mundane pile of flash storage found inside; it's in the clever firmware which causes the device to look like a traditional disk and, one hopes, causes it to perform well. Competition in this area has brought about some improvements in this firmware, but we should see a modern SSD for what it is: a computer running proprietary software that we put at the core of our systems.
It does not have to be that way; Linux does not need to talk to flash storage through a fancy translation layer. We have our own translation layer code (UBI), and a few filesystems which can work with bare flash. It would be most interesting to see what would happen if some manufacturer were to make competitive, bare-flash devices available as separate components. The kernel could then take over the flash management task, and our developers could turn their attention toward solving the problem correctly instead of working around problems in vendor solutions. Kernel developers made an explicit choice to avoid offloading much of the network stack onto interface hardware; it would be nice to have a similar choice regarding the offloading of low-level storage management.
In the absence of that choice, we'll have no option but to deal with the
translation layers offered by hardware vendors. The results look unlikely
to be optimal in the near future, but they should still end up being better
than what we have today.