Issues around discard
In a combined filesystem and storage session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Dennis Zhou wanted to talk about discard, which is the process of sending commands (e.g. TRIM) to block devices to indicate blocks that are no longer in use. Discard is a "serious black box", he said; beyond reading and writing, it is a third way to interact with a drive, but Linux developers have no real insight into what its actual effects will be. That can lead to performance and other problems.
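From user space, a discard can be sent directly to a block device with the BLKDISCARD ioctl(), which is what a tool like blkdiscard(8) wraps. A minimal sketch of that call follows; the device path and range are placeholders, and note that the operation destroys the data in the discarded range:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKDISCARD */

int main(void)
{
    /* Placeholder device; BLKDISCARD destroys data in the range. */
    int fd = open("/dev/sdX", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* { start offset, length } in bytes; both must be aligned to the
     * device's logical sector size. */
    uint64_t range[2] = { 0, 1ULL << 20 };  /* first 1MB */

    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}
```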
Zhou works for Facebook, where discard is enabled, but the results are not great; there is a fair amount of latency observed as the flash translation layer (FTL) shuffles blocks around for wear leveling and/or garbage collection. Facebook runs a periodic fstrim to discard unused blocks; that is something that the Cassandra database recommends for its users. The company also has an internal delete scheduler that slowly deletes files, but the FTL can take an exorbitant amount of time when gigabytes of files are deleted; read and write performance can be affected. It is "kind of insane" that applications need to recognize that they can't issue a bunch of discards at one time; he wondered if there is a better way forward.
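The fstrim(8) runs mentioned above are built on the FITRIM ioctl(), which asks a mounted filesystem to discard its unused blocks. A minimal sketch, with a placeholder mount point:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */

int main(void)
{
    /* Any file descriptor on the mounted filesystem will do;
     * the mount point here is a placeholder. */
    int fd = open("/mnt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range = {
        .start  = 0,
        .len    = (__u64)-1,   /* trim the whole filesystem */
        .minlen = 0,           /* no minimum extent size */
    };

    if (ioctl(fd, FITRIM, &range) < 0)
        perror("FITRIM");
    else  /* on success, len is updated to the bytes trimmed */
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}
```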
An approach that Facebook is trying with Btrfs is to simply reuse logical-block addresses (LBAs) preferentially, rather than discarding them. That requires cooperation between the filesystem and block layer, since the block layer knows when it can afford to do discard operations. What the FTL is doing is unknown, however, which could have implications for the lifetime of the device. He is looking for something that can be run in production.
Erik Riedel asked what was special about a 1GB discard versus a 1GB read or write; you are asking the device to do a large operation, so it may take some time. But James Bottomley noted that discard had been sold as a "fast" way to clean up blocks that are no longer needed. Riedel suggested that vendors need to be held accountable for their devices; if TRIM has unacceptable characteristics, those should be fixed in the devices.
Ric Wheeler said that there should be a tool that can show the problems with specific devices; naming and shaming vendors is one way to get them to change their behavior. Chris Mason noted that the kernel team at Facebook did not choose the hardware; it would likely choose differently. This is an attempt to do the best the team can with the hardware that it gets.
Ted Ts'o said that he worries whenever the filesystem tries to second-guess the hardware. Different devices will have different properties; there is at least one SSD where the TRIM operation is handled in memory, so it is quite fast. Trying to encode heuristics for different devices at the filesystem layer would just be hyper-optimizing for today's devices; other devices will roll out and invalidate that work.
The current predicament came about because the kernel developers gave the device makers an "out" with the discard mount flag, Martin Petersen said. If performance was bad for a device, the maker could recommend mounting without enabling discard; if the kernel developers had simply consistently enabled discard, vendors would have fixed their devices by now. Beyond that, some vendors have devices that advertise the availability of the TRIM command, but do nothing with it; using it simply burns a queue slot for no good reason.
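The flag in question is the per-filesystem "discard" mount option, which enables inline discard for ext4 and most other filesystems; it can be set with mount -o discard or in the options string passed to mount(2), as in this sketch (device, mount point, and filesystem type are placeholders):

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Placeholder device and mount point; the "discard" option in
     * the data string enables inline discard for ext4. */
    if (mount("/dev/sdX1", "/mnt", "ext4", 0, "discard") < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```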
Riedel suggested that "name and shame" is the right way to handle this problem. If a device claims to have a feature that is not working, or not working well, that should be reported to provide the right incentives to vendors to fix things.
Bottomley wondered if filesystems could just revert to their old behavior and not do discards at all. Wheeler said that could not work since discard is the only way to tell the block device that a block is not in use anymore. The bigger picture is that these drives exist and Linux needs to support them, Mason said.
Part of the problem is that filesystems are "exceptionally inconsistent" in how they do discard, Mason continued. XFS does discard asynchronously, while ext4 and Btrfs do it synchronously. That means Linux does not have a consistent story to give to the vendors; name and shame requires that developers characterize what, exactly, filesystems need.
The qualification that is done on the devices is often cursory, Wheeler said. The device is run with some load for 15 minutes or something like that before it is given a "thumbs up"; in addition, new, empty devices are typically tested. Petersen said that customers provide much better data on these devices; even though his employer, Oracle, has an extensive qualification cycle, field-deployed drives provide way more information.
Zhou said that tens or hundreds of gigabytes can be queued up for discard, but there is no need to issue it all at once. Mason noted that filesystems have various types of rate limiting, but not for discard. Ts'o said it would be possible to do that, but it should be done in a single place; each filesystem should not have to implement its own. There was some discussion of whether these queued discards could be "called back" by the filesystem, but attendees thought the complexity of that was high for not much gain. However, if the queue is not going to empty frequently, removing entries might be required.
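A purely hypothetical sketch of the "single place" idea discussed above: filesystems enqueue freed extents onto one shared queue, and a periodic drain issues at most a fixed budget of discards per tick, deferring the rest. All of the names and limits here are invented for illustration; this is not an existing kernel interface:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical rate-limited discard queue (illustration only). */
struct extent {
    uint64_t start, len;
    struct extent *next;
};

static struct extent *pending;                 /* queued discards */
static const uint64_t BUDGET = 256ULL << 20;   /* 256MB per tick */

/* Filesystems would enqueue freed extents instead of discarding
 * directly; error handling is omitted for brevity. */
static void queue_discard(uint64_t start, uint64_t len)
{
    struct extent *e = malloc(sizeof(*e));
    e->start = start;
    e->len = len;
    e->next = pending;
    pending = e;
}

/* Stand-in for actually sending the discard (e.g. via BLKDISCARD). */
static void issue_discard(uint64_t start, uint64_t len)
{
    printf("discard %llu+%llu\n",
           (unsigned long long)start, (unsigned long long)len);
}

/* Called once per tick: drain up to BUDGET bytes, defer the rest. */
static void drain_tick(void)
{
    uint64_t issued = 0;

    while (pending && issued < BUDGET) {
        struct extent *e = pending;
        pending = e->next;
        issued += e->len;
        issue_discard(e->start, e->len);
        free(e);
    }
}

int main(void)
{
    /* Queue 1GB of 128MB extents; they drain over several ticks. */
    for (int i = 0; i < 8; i++)
        queue_discard((uint64_t)i << 27, 128ULL << 20);

    while (pending)
        drain_tick();
    return 0;
}
```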
In the end, Fred Knight said, the drive has to do the same amount of work whether a block is discarded or just overwritten. Writing twice to the same LBA will not go to the same place in the flash: the FTL will erase the current location of the LBA's data, then write the block to a new location. All discard does is allow the erasure to happen earlier, thus saving time when the data is written.
The problem is that kernel developers do not know what a given FTL will do, Ts'o said. For some vendors, writing the same LBA will be problematic, especially for the "cheap crappy" devices. In some cases, reusing an LBA is preferable to a discard, but the kernel would not know that is the case; it would simply be making assumptions.
To a certain degree, Zhou said, "discard doesn't matter until it does"; until the device gets past a certain use level, writing new blocks without discarding the old ones doesn't impact performance. But then there is a wall at, say, 80% full, where the drive goes past the "point of forgiveness" and starts doing garbage collection on every write. There is a balance that needs to be struck for discard so that there are "enough" erasure blocks available to keep it out of that mode.
That is hard for SSD vendors, however, because they cannot reproduce the problems that are seen, so they cannot fix them, an attendee said. The kernel developers need to work more closely with the device firmware developers and provide workloads, traces, and the like. Wheeler said that reporting workloads and traces is the "oldest problem in the book"; we have to do better than we are doing now, he said, which is to provide nothing to the vendors.
Bottomley pushed back on the idea that preferentially reusing LBAs was the right path. If there are blocks available to be written, reusing an LBA is worse, as it will fragment extent-based filesystems. Nor does it save erase work: if an erase block is available, no erase need be done on a write to a new LBA, and if there isn't one, the write costs the same as reusing the LBA; so rewriting LBAs actually compounds the problem.
The exact behavior is specific to the filesystem and workload, however. The sizes of erase blocks are not known to the kernel, or even by kernel developers, because the drive vendors have decided to keep them secret, Bottomley said. So every decision the kernel makes is based on an assumption of one kind or another.
Riedel still believes this is a qualification problem at its core. The right solution is for the drives to do a good job with the expected workloads. But Petersen reiterated that by giving the vendors an out with the discard flag, nothing will ever change. Fixes will "never happen".
The core of the problem is that reads and writes need to happen immediately, Zhou said, while discards can wait a bit without causing much of a problem. But there are different viewpoints on the problem itself, Ts'o said; desktop distributions will be different from servers or mobile devices. Mason noted that if you asked ten people in the room how discard should work, you would get something like 14 answers.
The session wound down without much in the way of resolution. There was talk of some kind of tool that could be used to reproduce the problems and gather traces. There was also talk of rate-limiting discards, but no one wants to do a "massive roto-tilling" of the block layer given all of the unknowns, Ts'o said; he suggested any change be done in an optional module between the filesystems and the block layer, which could be revisited in a few years.