Zero = trim

Posted Aug 19, 2009 1:25 UTC (Wed) by ncm (guest, #165)
Parent article: The trouble with discard

It seems to me that vendors could bypass the ATA spec TRIM problem by treating all-zero blocks as trimmed blocks, and not actually using up a real block to represent them. Then, trimming amounts to writing zeroes, which we ought to be able to do without much restructuring. If asked to read a block previously identified as trimmed, the device simply delivers a block of zeroes without looking at what might still be written there, if indeed "there" can be said to exist any more.


Zero = trim

Posted Aug 19, 2009 5:05 UTC (Wed) by TimMann (guest, #37584) [Link] (2 responses)

That idea came to mind for me too, but of course this article is all about how what real hardware really does fails to match up with what we need. So if the real hardware doesn't happen to optimize the case of a block with all 0's written to it, we still lose.

Zero = trim

Posted Aug 19, 2009 5:06 UTC (Wed) by TimMann (guest, #37584) [Link] (1 responses)

Hmm, I was missing your point, though. That's a good way around the limitation of the ATA spec if the hardware vendors can be persuaded to use it.

Zero = trim

Posted Aug 19, 2009 12:35 UTC (Wed) by avik (guest, #704) [Link]

You have to DMA tons of zeros to the device, so this can be expensive.

Zero = trim

Posted Aug 19, 2009 12:39 UTC (Wed) by willy (subscriber, #9762) [Link]

I happen to know a storage vendor who has implemented exactly that in their storage array. The problem is that if you want to TRIM a gigabyte, you have to send a gigabyte of zeroes. SCSI doesn't have this problem as it has WRITE SAME, so you only need 512 bytes of zeroes.

If we could get agreement from vendors to do this, we could use the existing TRIM command for amounts larger than some cutoff point, and write zeroes for amounts smaller than said point. But absent some agreement to this effect, we can't unilaterally start doing this in Linux as it'll drive up the write counts even further.
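A rough sketch of the trade-off described above, in Python. The cutoff value and both function names are hypothetical, chosen only for illustration; the 512-byte figure is the WRITE SAME payload mentioned in the comment.

```python
SECTOR_SIZE = 512               # bytes of zeroes WRITE SAME needs, per the comment
HYPOTHETICAL_CUTOFF = 1 << 20   # 1 MiB; an arbitrary illustrative cutoff, not a real tunable


def zero_payload_bytes(length, has_write_same):
    """Bytes of zeroes the host must send to 'trim' `length` bytes by zeroing."""
    # With WRITE SAME the device repeats a single sector; without it,
    # every byte of the range has to cross the bus.
    return SECTOR_SIZE if has_write_same else length


def choose_discard_method(length, cutoff=HYPOTHETICAL_CUTOFF):
    """Pick TRIM for large ranges and zero-writes for small ones."""
    return "trim" if length > cutoff else "write_zeroes"


if __name__ == "__main__":
    one_gib = 1 << 30
    print(zero_payload_bytes(one_gib, has_write_same=False))
    print(zero_payload_bytes(one_gib, has_write_same=True))
    print(choose_discard_method(4096), choose_discard_method(one_gib))
```

Where the cutoff should sit is exactly the part that would need vendor agreement; the sketch only shows the shape of the policy.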

Zero = trim

Posted Aug 20, 2009 10:36 UTC (Thu) by dion (guest, #2764) [Link]

That would also help in the case where you are running on other kinds of storage that are able to optimize their resource usage based on this information.

Take virtual machines that use qcow2 files as backing: when copying such a file, the zeroed blocks are simply left out of the target file.
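As a plain-file analogue of that behaviour (this is not the qcow2 format itself, and `sparse_copy` is a made-up helper name), a copy routine can seek over all-zero blocks instead of writing them. Whether the skipped ranges really occupy no storage depends on the filesystem, so treat this as a sketch.

```python
import os

BLOCK_SIZE = 4096  # assumed block size for the illustration


def sparse_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst, skipping all-zero blocks. Returns the number skipped."""
    zero = bytes(block_size)
    skipped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                # Leave a hole: move the write position without writing.
                dst.seek(len(chunk), os.SEEK_CUR)
                skipped += 1
            else:
                dst.write(chunk)
        # Extend the file to its final length in case it ends in a hole.
        dst.truncate()
    return skipped
```

Reading the copy back yields the same bytes as the source, since holes read as zeroes.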

Some SANs also avoid allocating more actual storage than needed by a block device; I'm sure those would like to get a hint about the disuse of blocks as well.

Zero = trim

Posted Aug 21, 2009 1:58 UTC (Fri) by dmag (guest, #17775) [Link] (3 responses)

Writing zeros to a block isn't the same as writing zeros to flash. Flash is unreliable, so you need checksums, wear leveling, etc. So a block of zeros isn't really zero. This extra stuff used to be done in hardware (CompactFlash), but nowadays is done in software (SD cards).

What's worse, flash chips are always "FF" in their erased state. So to write all zeros, you'll have to erase the block, write zeros, then later erase it and write real data. If you wrote FFs, it would only have to erase it once. But the zeros can't be at the block layer.
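A toy model of the erase cost described above, assuming the usual NAND property that programming can only flip bits from 1 to 0 and that only an erase brings them back to FF. `ToyFlashBlock` is a made-up class; it ignores checksums, wear leveling, and real page/erase-block geometry.

```python
class ToyFlashBlock:
    """One erase block that starts out erased (all 0xFF)."""

    def __init__(self, size=16):
        self.cells = bytearray(b"\xff" * size)
        self.erase_count = 0

    def erase(self):
        self.cells = bytearray(b"\xff" * len(self.cells))
        self.erase_count += 1

    def program(self, data):
        if len(data) != len(self.cells):
            raise ValueError("data must fill the whole block")
        # If any bit would have to go from 0 back to 1, erase first.
        if any((old & new) != new for old, new in zip(self.cells, data)):
            self.erase()
        # Programming can only clear bits, modelled here as a bitwise AND.
        self.cells = bytearray(old & new for old, new in zip(self.cells, data))


if __name__ == "__main__":
    real_data = b"\x5a" * 16

    zeroed = ToyFlashBlock()
    zeroed.program(bytes(16))        # "trim" by writing zeros
    zeroed.program(real_data)        # later reuse forces an erase

    left_ff = ToyFlashBlock()
    left_ff.program(b"\xff" * 16)    # "trim" by writing FFs
    left_ff.program(real_data)       # no erase needed

    print(zeroed.erase_count, left_ff.erase_count)
```

In this model the zero-filled block costs one more erase than the FF-filled one before it can hold real data again.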

I agree with the article -- if we pretend flash is "just like a hard drive", we will be using the hardware sub-optimally.

Zero = trim

Posted Aug 21, 2009 3:02 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

But the point is that you wouldn't *actually* write the block of zeros to the flash. Since all reads/writes to the flash disk go through a remapping layer, it would be trivial to treat all-zero writes specially, and not write them out at all, but just mark them as zeroed in the block-map.
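A minimal sketch of that idea, with a made-up `ToyRemapLayer` class standing in for the device's remapping layer. It assumes a fixed 4096-byte block and an append-only list in place of real flash; no actual device firmware is claimed to work this way.

```python
BLOCK_SIZE = 4096            # assumed logical block size
ZERO_BLOCK = bytes(BLOCK_SIZE)


class ToyRemapLayer:
    def __init__(self):
        self.block_map = {}        # logical block number -> index into self.flash
        self.flash = []            # stand-in for physical storage
        self.physical_writes = 0

    def write(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        if data == ZERO_BLOCK:
            # All-zero write: just drop the mapping, touch no flash.
            self.block_map.pop(lba, None)
            return
        self.flash.append(bytes(data))
        self.block_map[lba] = len(self.flash) - 1
        self.physical_writes += 1

    def read(self, lba):
        slot = self.block_map.get(lba)
        # Unmapped blocks read back as zeroes without looking at the flash.
        return ZERO_BLOCK if slot is None else self.flash[slot]


if __name__ == "__main__":
    dev = ToyRemapLayer()
    dev.write(0, b"\x01" * BLOCK_SIZE)
    dev.write(0, ZERO_BLOCK)                       # acts as a trim
    print(dev.read(0) == ZERO_BLOCK, dev.physical_writes)
```

The stale copy of block 0 is still sitting in `flash`; in a real device that is the space garbage collection would be free to reclaim.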

Zero = trim

Posted Aug 23, 2009 19:58 UTC (Sun) by dmag (guest, #17775) [Link]

I still don't think that's the best way to do it.

Summary of the Problem:

The FS layer is the layer that knows what's free and what is real data. But the FS layer doesn't get any requests to write to the "free" space (by definition), so the block layer will never get any writes to that space. The block layer (or the driver) is currently doing a lot of work to preserve (wear leveling) that "free" data, even though we know it is "don't care" data.

Your proposal is "what if the FS layer could inform the block layer about free space by writing zeros there?". But there are some problems:

- First, you need an API change to know when it's OK to do those extra writes. Otherwise, you will kill non-compliant flash disks or hard drives. (slow down with useless writes, wear out with extra writes, etc.)

- Second, there ARE times you want to actually write out zeros in a block (think of the case where you are writing to a USB stick that will go in another computer, or security). So now you need an API change to say "This is not a phantom write".

- Third, the block/device layer will have to look at the data in the block to detect the zeros, which will kill performance. Normally, the block layer will just poke the address into a DMA controller and get on with life. What you are proposing would require the block layer to scan the block, which will evict everything from the L1/L2 cache.
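To make the third point concrete, here is a small sketch of what detecting an all-zero block involves. Both function names are hypothetical. The scan can stop at the first non-zero byte, but the blocks that matter for this scheme, the all-zero ones, are the worst case: every byte has to be read.

```python
def is_all_zero(block):
    """True if every byte in the block is zero."""
    return not any(block)


def bytes_examined(block):
    """How many bytes a front-to-back scan reads before it can decide."""
    for index, value in enumerate(block):
        if value:
            return index + 1   # stopped at the first non-zero byte
    return len(block)          # all zero: the whole block was read


if __name__ == "__main__":
    zero_block = bytes(4096)
    data_block = b"\x01" + bytes(4095)
    print(is_all_zero(zero_block), bytes_examined(zero_block))
    print(is_all_zero(data_block), bytes_examined(data_block))
```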

Since we have to change the FS to Block/Device API anyway, we may as well invent a real communication channel instead of trying to piggyback on "all zeros".

Zero = trim

Posted Aug 27, 2009 17:57 UTC (Thu) by robert_s (subscriber, #42402) [Link]

"but nowadays is done in software (SD cards)."

No. SD cards have their flash translation layer on the card in hardware.

What you're thinking of is that CF cards have a whole ATA interface on the card too (the actual interface is closely related to PCI).