Online filesystem scrubbing and repair

By Jake Edge
April 5, 2017

LSFMM 2017

In his traditional LSFMM session to "whinge about various things", Darrick Wong mostly discussed his recent work on online filesystem repair for XFS, but also strayed into some other topics. Online filesystem scrubbing for XFS was one of those, as was a new ioctl() command to determine block ownership.

He started with the GETFSMAP ioctl() command, which allows querying a filesystem to determine how its blocks are being used (which blocks contain data or metadata and which are free). He has patches for XFS that are reviewed and ready to go for 4.12. Ext4 support is out for review. Chris Mason said that there are ioctl() commands for Btrfs that can be remapped to GETFSMAP.

Wong then described an online scrubbing tool for XFS that he has been working on. Right now, it examines the metadata in the filesystem and complains if there is something that is "clearly insane". Part two will be posted soon and will cross-check metadata with other metadata to make sure they are all in agreement. The tool will also use GETFSMAP to read all file-data blocks using direct I/O to check to see if any give read errors.
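The read-verification pass might be sketched along these lines. This is only an illustration, not Wong's tool: it walks a byte range in fixed-size chunks and records which chunks fail to read. The names `scan_for_read_errors` and `file_reader` are hypothetical, and the sketch uses ordinary `os.pread()` calls on an open file rather than GETFSMAP and direct I/O.

```python
import os

def scan_for_read_errors(read_at, size, chunk=1 << 20):
    """Read [0, size) in chunks; return (offset, length) of chunks that failed.

    read_at(offset, length) is any callable that returns bytes or raises
    OSError (for example EIO from a failing device).
    """
    bad = []
    offset = 0
    while offset < size:
        length = min(chunk, size - offset)
        try:
            read_at(offset, length)
        except OSError:
            # Remember the failing range and keep scanning.
            bad.append((offset, length))
        offset += length
    return bad

def file_reader(fd):
    """Hypothetical helper: adapt an open file descriptor to read_at()."""
    return lambda offset, length: os.pread(fd, length, offset)
```

Passing the reader in as a callable keeps the scanning loop separate from how the blocks are actually fetched, so a direct-I/O reader could be swapped in without changing the loop.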

Damien Le Moal suggested that the RAID rebuild assist feature of some storage devices could be used to either get the data or get a failure quickly. Wong asked if there was a way to get a list of LBAs with read errors, but that isn't supported. Fred Knight said that using RAID rebuild assist will cause the device not to try any "heroic recovery" and simply return errors. Wong said that he hasn't done much work on reading from the device yet as he has been concentrating on the filesystem side.

For online repair for XFS, he has been working on using the reverse mappings to reconstruct the primary metadata for corrupted filesystems. He does not want to use the metadata that is left over because it may be corrupted too. There is a problem, though, if you have to recreate the secondary metadata (e.g. the reverse mapping) because you run into a deadlock situation.
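As a rough illustration of rebuilding primary metadata from reverse mappings, the sketch below regroups reverse-mapping records into per-file extent lists. The `Rmap` record layout and the `rebuild_extent_maps` function are simplified assumptions made for this example; they are not the XFS on-disk format or the code Wong described.

```python
from collections import defaultdict, namedtuple

# Hypothetical, simplified reverse-mapping record: `length` blocks starting
# at physical block `phys` belong to inode `owner`, at logical block
# `offset` within that file.
Rmap = namedtuple("Rmap", "phys length owner offset")

def rebuild_extent_maps(rmaps):
    """Rebuild per-inode extent lists (logical -> physical) from rmap records."""
    extents = defaultdict(list)
    for r in rmaps:
        extents[r.owner].append((r.offset, r.phys, r.length))
    # Order each file's extents by logical offset.
    return {owner: sorted(ext) for owner, ext in extents.items()}
```

The point of the sketch is that nothing from the damaged forward mapping is consulted; the per-file view is derived entirely from the reverse-mapping records.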

The deadlock is avoided by freezing the filesystem so there is no access to the inodes. Inode by inode, this restriction is relaxed as the reverse mapping is rebuilt. The block device is not frozen, just the filesystem, but he wondered if that was a sane thing to do. It is not the intended use case for filesystem freezing, but it seems to work. That is the only part of online repair where the filesystem "has to come to a screeching halt", he said.

There is no secondary metadata for directories right now, so the online repair just moves corrupted directories to lost+found or removes them if they are too corrupted. XFS doesn't have parent directory pointers that might be used to help repair directory corruption.

There is still no good way to remove open files, Wong said. He thought it might be possible to rip the file's pages out of the page cache, replace the inode operations and mapping fields with dummy values that just return errors, and try to put the inode on the orphan list. Matthew Wilcox asked if writeback would be done before the pages are removed from the page cache; Wong said that was an open question.

Mason said that Facebook does no online repair. If a filesystem gets corrupted, it is taken out of production and restored to a working state before being returned to service. Various aspects of the filesystem change (e.g. latencies) while online repair is in progress, so the company just does not bother. Ted Ts'o said that the cloud world tends to have enough redundancy that doing online repair isn't really needed, but in other worlds that is not true. Wong said that the feature is targeted at users who are willing to trade "some to a lot of latency" for the ability to not have to take a volume offline.

A way to specify a "desperation" flag in block I/O requests is on his wish list, Wong said. Setting that would mean that the block layer should go read RAID-1 mirrors or invoke stronger error recovery to read a block. He would like to use that when a block read succeeds but the checksum doesn't match.
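The behavior he is asking for might look something like the following sketch, written at user level purely for illustration. The `read_with_fallback` function and the list of `sources` are hypothetical, and CRC32 from `zlib` merely stands in for whatever checksum the filesystem stores; no such block-layer interface exists in the discussion above.

```python
import zlib

def read_with_fallback(sources, expected_crc):
    """Try each source in turn; return (index, data) for the first good copy.

    `sources` is a list of zero-argument callables, one per way of obtaining
    the block (for example, one per RAID-1 mirror). A source may raise
    OSError or return data whose CRC32 does not match `expected_crc`.
    """
    for index, read in enumerate(sources):
        try:
            data = read()
        except OSError:
            continue  # this copy is unreadable; try the next one
        if zlib.crc32(data) == expected_crc:
            return index, data
    raise OSError("no source returned a block matching the checksum")
```

A read that succeeds but fails the checksum is treated the same way as a read error: the caller moves on to the next copy.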

He asked the room if anyone was making progress on filesystem freezing before suspend operations. Mason said that the use of the work queue WQ_FREEZABLE flag was "all wrong" as described in a Kernel Summit session. The code for the filesystems had all been cut and pasted and was simply wrong. He had thought that someone was planning to fix that, but was not sure where that effort had gotten.

Index entries for this article
Kernel	Filesystems
Conference	Storage, Filesystem, and Memory-Management Summit/2017


Online filesystem scrubbing and repair

Posted Apr 5, 2017 23:35 UTC (Wed) by nix (subscriber, #2304) [Link] (15 responses)

> The tool will also use GETFSMAP to read all file-data blocks using direct I/O to check to see if any give read errors

I hope this respects I/O priorities, so you can run the tool under ionice to run it in idle mode -- both because it's more polite to other users, and because very soon bcache will use these hints to avoid caching the reads on the SSD (for scrubbing, that would blow the entire cache and likely age the SSD significantly).

Online filesystem scrubbing and repair

Posted Apr 15, 2017 8:20 UTC (Sat) by Wol (subscriber, #4433) [Link] (14 responses)

I don't get this. If I remember that article on SSD life a while back, it's only writes that age an SSD. So why would reading the entire device (or not caching the device, and reading everything every time) age the SSD?

(Oh, and I seem to remember the same article saying that even if you ran continuous read/write passes on an SSD with no break, it would still take several years to kill it.)

Cheers,
Wol

Online filesystem scrubbing and repair

Posted Apr 19, 2017 15:12 UTC (Wed) by nix (subscriber, #2304) [Link]

Because bcache works by caching reads from an underlying block device on another device, so each cached read becomes a write to the caching device -- usually one caches rotational storage on a device with faster or near-zero seek time, like an SSD. It tries to detect contiguous reads from the underlying device and avoid caching them, but this detection is not perfect and cannot kick in at once, only after a few megabytes of reads have already been cached, probably pointlessly. It is better to mark reads of the rotational block device as low priority when they come from something like a pvmove (which really *is* low priority): these are assumed to be reads for which seek performance does not matter, and thus recent bcache will not cache them.

Online filesystem scrubbing and repair

Posted Apr 19, 2017 15:14 UTC (Wed) by nix (subscriber, #2304) [Link] (12 responses)

> (Oh, and I seem to remember the same article saying that even if you ran continuous read/write passes on an SSD with no break, it would still take several years to kill it.)

That's definitely not true of most SSDs. Mine, a fairly expensive Intel one, is rated for several complete device writes per day, every day for five years. However, the device is only 480GiB and it writes at about 500MiB/s, so it only takes a thousand seconds to write to the entire device (though in practice I suspect this would slow drastically due to GC pauses etc). And for many of these SSDs, replacing them means pulling open the machine and messing about on the motherboard -- it's much more risky and worrisome than just swapping a disk would be.
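For reference, the arithmetic behind that estimate can be checked with a few lines. This assumes the quoted write rate could be sustained for the whole pass, which the comment itself doubts; the variable names are just for illustration.

```python
capacity_mib = 480 * 1024      # 480 GiB expressed in MiB
write_rate_mib_s = 500         # quoted write rate, MiB/s

# Time to write the whole device once, and how many such passes fit in a day.
seconds_per_pass = capacity_mib / write_rate_mib_s
passes_per_day = 86400 / seconds_per_pass

print(f"{seconds_per_pass:.0f} s per full-device write")
print(f"{passes_per_day:.0f} full-device writes per day at that rate")
```

With those figures, one pass takes about 983 seconds, so continuous writing would amount to roughly 88 full-device writes per day.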

Online filesystem scrubbing and repair

Posted Apr 19, 2017 17:17 UTC (Wed) by zlynx (guest, #2285) [Link] (11 responses)

Unless you mean *very* cheap machines with a soldered on eMMC device, modern small machines use M.2 SSDs. Slightly older ones use a 2.5" laptop drive bay.

Changing a M.2 card is much the same as changing a RAM stick. I suppose that is worrying to some people?

I think the people worried by it would be exactly as worried as they would be changing out *any* computer hardware.

Online filesystem scrubbing and repair

Posted Apr 24, 2017 16:39 UTC (Mon) by nix (subscriber, #2304) [Link] (10 responses)

Let's see, is pulling the side of my machine off and unplugging things from the motherboard worrying, given that I have static-shocked machines into destruction by touching the backplate accidentally while plugging in a monitor cable while the case was still on? Yes, yes it bloody well is. (I tried to change a DIMM once, too. The machine never booted again.)

Not everyone is cut out to deal with hardware, even if it doesn't need desoldering.

Online filesystem scrubbing and repair

Posted Apr 24, 2017 16:45 UTC (Mon) by zlynx (guest, #2285) [Link] (9 responses)

Have you tried an anti-static wrist strap plugged into a ground?

Some people have high levels of static electricity around them because of their skin moisture, their environment's humidity levels, or the kinds of clothes and shoes they wear.

I don't remember ever zapping anything while I was wearing a strap. I've also got a grounded work pad, for putting the computer and parts on while working on it.

If you're a naturally staticky person these precautions may be needed, assuming that you ever want to work on the insides of the PC. Some people know how but have no interest in it.

Online filesystem scrubbing and repair

Posted Apr 24, 2017 21:51 UTC (Mon) by nix (subscriber, #2304) [Link] (8 responses)

Every time I open a machine I try something different. I tried antistatic wrist straps. I tried rubber gloves. I tried rubber-soled shoes. I tried bare feet and tinfoil on the floor running out to something I hoped was grounded. I tried holding onto a hopefully-grounded lump of metal the entire time I was using it (really fun trying to work with one hand). Nothing worked. (But it may be that none of that was grounded. Honestly, except for the shielded and inaccessible earth pins in the power sockets, and I suppose the ground outside, I don't know what in a UK domestic property *is* grounded.)

I stopped trying to touch hardware about seven or eight years ago. I buy it and never open it again: sliding hotswap drives in is my limit. I'll pay someone who is not made of static electricity to do anything more. (And who doesn't have my terrible coordination and muscle tremors. Plus, you just *know* the machine will go wrong in the summer when I am made of hay fever. Massive sneezes are not compatible with hardware work. Or with software work, really, but you can fake it more easily with software.)

Online filesystem scrubbing and repair

Posted Apr 25, 2017 7:14 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

FWIW, when I worked in a place with an on-site factory that handled ESD-sensitive items (BGA chips for rework), the standard for static handling was two big RS ESD-safe mats connected to the same earth, and Uvex ESD gloves. You stood in socks on one ESD-safe mat, you put the work item on the bench on the other, both shared an earth so that there was no charge gradient across you, you used the wriststrap to ensure that your body was at the same potential as the mat the work item was on (in case your socks were insulating), and you wore the ESD gloves to reduce the risk further. Kinda expensive for a home setup, though...

The other trick the factory workers all worked to was "cotton work clothes only" - apparently synthetics and wool were more likely to result in you getting zapped when you stepped on the mat than cotton was, and that wasn't a pleasant feeling.

Note that domestic rubber gloves aren't suitable - they tend to be static accumulators, as they don't have the conductive layer to dissipate static charges, and can make the zap worse. Same goes for non-ESD rated rubber-soled shoes - they can cause the static charge to build up where it might otherwise dissipate. It's easy to make your ESD problem worse if you're not using the right kit :-(

Online filesystem scrubbing and repair

Posted Apr 26, 2017 19:46 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

Fascinating! A real belt-and-braces approach, which makes sense when you're dealing with a lot of items that will cost real money if just one worker static-shocks them repeatedly.

I suspected the rubber gloves were a waste of time, but by that point I was fairly desperate :)

Online filesystem scrubbing and repair

Posted Apr 26, 2017 19:52 UTC (Wed) by farnz (subscriber, #17727) [Link]

The bigger reason to be this cautious was the supply constraint rather than the money - $2,000 chips that you can replace at the drop of a hat are annoying if shocked into an early grave, whereas chips where the next fab run isn't for another 6 months are far more worrying if you start blowing through your allotted supply - it's really problematic if you can't ship product for 3 months because some static took out a chunk of your IC supply.

Online filesystem scrubbing and repair

Posted Apr 25, 2017 13:32 UTC (Tue) by intgr (subscriber, #39733) [Link]

Did all of these incidents take place at the same location? If so, I suspect the grounding/earthing in that building is simply faulty or lacking.

Online filesystem scrubbing and repair

Posted Apr 25, 2017 14:29 UTC (Tue) by cladisch (✭ supporter ✭, #50193) [Link] (1 responses)

> … the shielded and inaccessible earth pins in the power sockets …

Schuko sockets have exposed ground pins. Moving to the Continent is a small price to pay for being properly grounded, isn't it?

Alternatively, to get access to the earth, just ~~stick a pin~~ use an earth bonding plug: [1] [2] [3] [4]

Online filesystem scrubbing and repair

Posted Apr 26, 2017 19:46 UTC (Wed) by nix (subscriber, #2304) [Link]

That looks very useful indeed (not just for electronics work, too). I had no idea any such thing existed.

Online filesystem scrubbing and repair

Posted Apr 25, 2017 19:57 UTC (Tue) by paulj (subscriber, #341) [Link] (1 responses)

Touch unpainted metal pipe-work going into a radiator. That is likely to work.

Online filesystem scrubbing and repair

Posted Apr 26, 2017 19:52 UTC (Wed) by nix (subscriber, #2304) [Link]

Some modern houses have pretty radiators with a metallic cover (no doubt ungrounded and painted with electrically insulating paint, for the look of the thing) and with all pipework artfully concealed out of sight and out of reach: it goes into the wall directly behind the radiator, so you can't get at it without dismounting the whole thing.

It's really pretty: the radiators just hang on the wall, not visibly connected to anything practical like pipework, just a radiator and a thermostatic valve supported as if by magic. It also conceals anything that might be in any way electrically connected to earth and is also a complete bugger to maintain (I just turn the plastic-covered emergency stop valve and call a plumber, who curses a blue streak for some time while trying to figure out how the hell to get at any of the pipework).

The toilets are even worse: all the piping is actually hidden behind a false wall. Makes dealing with leaks, or even spotting them, a huge pile of fun.

Online filesystem scrubbing and repair

Posted Apr 6, 2017 0:09 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> A way to specify a "desperation" flag in block I/O requests is on his wish list, Wong said.

You need more than a flag. You need an index number.
There might be several ways to retrieve a given block. By default, some optimal choice is made.
In "desperate" circumstances, you want to iterate through all possible mechanisms in turn.
For example, if you have RAID6, you can read the block directly, construct it from all other data blocks plus parity, from all other data blocks plus Q, or from parity and Q and all-but-2 of the data blocks. A total of n possible sources if there are n devices in the array.
For an n-drive RAID1, there are obviously n possible sources too.
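
To make the count concrete, here is a small sketch that enumerates those sources for one stripe. The `raid6_sources` function and the block names `D0`, `D1`, ..., `P`, `Q` are made up for the example, and it assumes a stripe with n - 2 data blocks plus P and Q.

```python
def raid6_sources(n_devices, target):
    """List the ways to obtain data block `target` in one RAID-6 stripe.

    The stripe has n_devices - 2 data blocks (numbered from 0) plus P and Q.
    Each source is the set of blocks that must be read.
    """
    data = [f"D{i}" for i in range(n_devices - 2)]
    want = f"D{target}"
    others = [d for d in data if d != want]
    sources = [
        {want},                  # read the block directly
        set(others) | {"P"},     # all other data blocks plus P
        set(others) | {"Q"},     # all other data blocks plus Q
    ]
    # P and Q together can stand in for the target and one more data block.
    for skip in others:
        sources.append({d for d in others if d != skip} | {"P", "Q"})
    return sources
```

An index passed down with the request could then select one entry from such a list, rather than a single flag choosing between "normal" and "desperate".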

Online filesystem scrubbing and repair

Posted Apr 6, 2017 5:16 UTC (Thu) by fratti (subscriber, #105722) [Link] (5 responses)

> Ted Ts'o said that the cloud world tends to have enough redundancy that doing online repair isn't really needed, but in other worlds that is not true.

In which worlds would online repair be especially worth having? Any critical system that needs a high uptime guarantee would surely have redundancy. So perhaps a use case where we cannot afford to take a volume offline until someone fixes the filesystem, but the likelihood of filesystem corruption is not high enough to justify redundancy, for example a recorder for security camera footage, I guess?

Online filesystem scrubbing and repair

Posted Apr 6, 2017 17:50 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

Redundancy may not help here -- if the filesystem is replicated, the same bug may bite both instances in the same way.

Online filesystem scrubbing and repair

Posted Apr 9, 2017 15:05 UTC (Sun) by jengelh (subscriber, #33263) [Link] (2 responses)

> if the filesystem is replicated, the same bug may bite both instances

But what if it was not the filesystem that was replicated, but the files? Like, surely there is a FUSE-fs which does file-level RAID-1 ;-)

Online filesystem scrubbing and repair

Posted Apr 10, 2017 22:30 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Yeah, that would be secure against this. I think "RAIF" might be the right term :)

Online filesystem scrubbing and repair

Posted Apr 11, 2017 0:14 UTC (Tue) by jengelh (subscriber, #33263) [Link]

https://www.fsl.cs.sunysb.edu/docs/raif/raif.ps
[From the guys that later brought about Linux's unionfs.]

Online filesystem scrubbing and repair

Posted Apr 8, 2017 9:13 UTC (Sat) by niner (guest, #26151) [Link]

Just today I had to reboot to a rescue system to btrfsck my desktop's root file system to fix my btrfs send/receive based backup. It's not that big a deal, but having to reboot sucks. Especially when the machine is also the only source of music :)

Btw. btrfs send/receive seems to be a very good detector for file system errors.

At work we of course have redundant failover clusters. But those only help for hardware issues (including faulty storage). When there's a file system error in one of the data volumes, all we can do is to take the service offline for fsck. If it's a major breakage we can restore one of the hourly backups, but the first try will always be to salvage everything.