
More from the testing and fuzzing microconference

By Jake Edge
October 4, 2017

Linux Plumbers Conference

A lot was discussed and presented in the three hours allotted to the Testing and Fuzzing microconference at this year's Linux Plumbers Conference (LPC), but some spilled out of that slot. We have already looked at some discussions on kernel testing that occurred both before and during the microconference. Much of the rest of the discussion will be summarized below. As it turns out, a discussion on the efforts by Intel to do continuous-integration (CI) testing of graphics hardware and drivers continued several hundred miles north the following week at the X.Org Developers Conference (XDC); that will be covered in a separate article.

Fuzzers

Two fuzzer developers, Dave Jones and Alexander Potapenko, discussed the fuzzers they work on and plans for the future. It was, in some sense, a continuation of the fuzzer discussion at last year's LPC.

[Dave Jones & Alexander Potapenko]

Potapenko represented the syzkaller project, which is a coverage-guided fuzzer for the kernel. The project does more than simply fuzzing though, as it includes code to generate programs that reproduce crashes it finds as well as scripts to set up machines and send email for failures. It "plays well" with the Kernel Address Sanitizer (KASAN), runs on 32 and 64-bit x86 and ARM systems, and has support for Android devices, he said.

Jones noted that his Trinity project is a system-call fuzzer for Linux that is "dumber than syzkaller". It does not use coverage to guide its operation; in some ways it is "amazing that it still finds new bugs." Over the last year, logging over UDP has been added, as has support for the MIPS architecture ("someone got excited"). There have been lots of contributions from others in the community over the last year, he said.
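A "dumb" fuzzer of the Trinity variety boils down to throwing semi-random arguments at system calls and seeing what the kernel does with them. A minimal, deliberately harmless sketch of the idea in C (confined to the flags argument of open() on /dev/null; a real fuzzer hits the whole syscall table with far nastier inputs, and the function name here is made up for illustration):

```c
/* Toy Trinity-style "dumb" fuzzer: no coverage feedback, just random
 * arguments.  Restricted to open("/dev/null", flags) so it is safe to
 * run.  Returns the number of calls the kernel rejected. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int fuzz_open_flags(int iterations)
{
    int rejected = 0;

    srand(42);                        /* fixed seed for reproducibility */
    for (int i = 0; i < iterations; i++) {
        int flags = rand();           /* random, mostly invalid, flag bits */
        int fd = open("/dev/null", flags, 0600);
        if (fd < 0) {
            printf("flags=%#x -> %s\n", flags, strerror(errno));
            rejected++;
        } else {
            close(fd);
        }
    }
    return rejected;
}
```

Even this blind approach exercises error paths in the kernel; many of the random flag combinations come back with errors such as EINVAL, and it is exactly those rarely-taken error paths where "really dumb stuff" tends to hide.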

Sasha Levin, one of the microconference leads, asked both developers what the next big feature for their fuzzers would be. Potapenko said that work is underway to track the origin of values used in comparisons in functions. The idea is to let syzkaller extend its coverage by reversing the sense of those comparisons to take new paths.
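The comparison-tracking idea can be shown with a toy: if the fuzzer can observe both operands of a failed comparison, it can feed the constant the input was compared against straight back in and immediately take the other branch. A hedged sketch (all names hypothetical; syzkaller gets the operands from compiler instrumentation rather than explicit calls like this):

```c
#include <stdint.h>

/* Instrumented equality check: records its operands the way a
 * comparison-tracing fuzzer sees them. */
static uint32_t last_lhs, last_rhs;

static int traced_eq32(uint32_t a, uint32_t b)
{
    last_lhs = a;
    last_rhs = b;
    return a == b;
}

/* Toy target: the buggy path is guarded by a magic constant that blind
 * random mutation would essentially never guess. */
static int target(uint32_t input)
{
    if (traced_eq32(input, 0xdeadbeef))
        return 1;               /* "bug" reached */
    return 0;
}

/* Returns 1 once the guarded path is reached. */
int fuzz_once(void)
{
    uint32_t input = 0;         /* arbitrary starting input */
    if (target(input))
        return 1;
    /* Comparison feedback: retry with the operand we were compared to. */
    input = last_rhs;
    return target(input);
}
```

fuzz_once() reaches the guarded path on its second attempt, something a fuzzer with no visibility into comparisons would need around four billion tries to do for a 32-bit constant.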

For Trinity, Jones plans to explore BPF programs more. He wants to feed in "mangled BPF programs" to see what happens. There is limited support in Trinity for BPF fuzzing currently; it has only found two bugs, he said. Steven Rostedt suggested stressing the BPF verifier, which will require something more than simply random programs. Jones said that Trinity uses Markov chains to create the programs, but that it is "still a little too dumb".

Rostedt asked about the reproducer programs for problems that the fuzzers find; he wondered if those should be sent to the maintainers to be added to the tests they run. Or perhaps they should get added to the kernel self-tests, he said. Greg Kroah-Hartman agreed that the programs could be useful, but some of them cause a "splat" in the logs, which might make them difficult to integrate into the failure checking of the self-tests or the Linux test project.

Something that is missing, Jones said, is for the fuzzers to be run regularly. If that is not done, various problems will sneak through and end up in kernel releases. "We still see really dumb stuff", like not checking for null pointers, ending up in the mainline. Those kinds of bugs "should not hit the tree", he said, but should be caught far earlier. Running Trinity and syzkaller on the linux-next tree could be done, but it is difficult to run the kernel using that tree. That tree is "not really testable", Daniel Vetter said, because it tends to be broken fairly often. The Intel CI system for graphics can use the linux-next tree, but only because there are a bunch of "fixup patches" that get applied.

Levin asked about getting distributions involved in fuzzing. He wondered if there were ways to make it easier for distribution kernel maintainers to run the fuzzers on their kernels. He suggested a disk image that could be run in a virtual machine (VM); that would help getting more people running the fuzzers, he said. Potapenko said that there is infrastructure available to set up a few VMs to run syzkaller, but that at least two physical machines are needed. The fuzzer causes crashes, so it is best to have a separate master machine that supplies parameters to the workers.
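The setup Potapenko describes is driven by a single configuration file on the master machine, which then boots and controls the worker VMs. Roughly along these lines (field names as in syzkaller's documentation; paths are made up and details change between versions):

```json
{
	"target": "linux/amd64",
	"http": "127.0.0.1:56741",
	"workdir": "/syzkaller/workdir",
	"syzkaller": "/syzkaller",
	"image": "/syzkaller/image.img",
	"sshkey": "/syzkaller/ssh-key",
	"kernel_obj": "/linux",
	"procs": 8,
	"type": "qemu",
	"vm": {
		"count": 4,
		"kernel": "/linux/arch/x86/boot/bzImage",
		"cpu": 2,
		"mem": 2048
	}
}
```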

With a grin Jones said that he "got out of the distro building game" and was not planning to get back in. Trinity is currently run as part of the Fedora test suite, but it is somewhat destructive so it is the last thing that gets run. There are going to be Fedora kernel test days for each Linux release, he said; ISO files are generated for those tests. He had not thought about adding fuzzers into that image, but it would be good to do so.

At the end, Levin asked how to get more people and companies working on fuzzers. Potapenko said that it is simple to contribute to syzkaller and he would like to see more subsystem maintainers help with the code to exercise their system calls. Jones said there is plenty of low-hanging fruit for things to be added to Trinity; as an experiment, he did not add support for certain system calls to see if someone else would, but so far that has not happened.

ktest

The ktest tool that has been in the kernel source tree for some time now was the subject of the next, fairly brief talk. Steven Rostedt, who wrote the Perl script, wanted to get information about it into more hands; he is often surprised how many people have not heard of it. It is meant to automate testing of kernels, but it does a fair amount more than that. Rostedt said that these days he rarely uses the make command for kernel builds and installs; he uses ktest to handle all of that for him.

One of the main ways he uses it is to check a patch series from a developer in the subsystems he maintains. He wants to ensure that the series is bisectable, among other things. Before developing ktest, he would apply a patch series, one patch at a time; he would then build, boot, and test the kernels built. That was a time-consuming process.
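The build step of that per-patch check can be approximated with git alone: the rebase --exec option replays a series one commit at a time, running a command after each, and stops at the first commit where the command fails — the point where bisectability breaks. A self-contained demonstration on a throwaway repository (on a real series the command would be something like make):

```shell
# Demonstrate per-patch checking with git rebase --exec on a throwaway
# repository; on a real series you would run e.g.
#   git rebase --exec "make -j$(nproc)" <base>
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email you@example.com && git config user.name you
for i in 1 2 3; do echo "$i" > file.c; git add file.c; git commit -qm "patch $i"; done
# Replay the last two patches, running a check after each one:
git rebase --exec "test -f file.c" HEAD~2 && echo "series is bisectable"
```

This only checks that each patch builds (or passes whatever command is given); ktest goes further by also installing, booting, and testing each resulting kernel.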

His test setup consists of systems with a remote power switch capability, as well as a means to read the output of the boot. That can all be controlled by ktest to build, install, boot, and run tests remotely. His test suite takes 13 hours to run on a single system as it uses multiple kernel configurations. Once he could do that, he started adding more features to ktest, including bisection, reverse bisection, configuration bisection, and more.
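For reference, a ktest run is driven by a single configuration file; a heavily trimmed sketch along the lines of the tool's shipped sample.conf (machine names, paths, and commands here are made up):

```
# Minimal ktest.conf sketch -- see tools/testing/ktest/sample.conf
# in the kernel tree for the authoritative list of options.
MACHINE = testbox
SSH_USER = root
BUILD_DIR = /home/me/linux.git
OUTPUT_DIR = /home/me/builds/${MACHINE}
BUILD_TARGET = arch/x86/boot/bzImage
TARGET_IMAGE = /boot/vmlinuz-test
POWER_CYCLE = powerswitch-cmd --cycle testbox
CONSOLE = ssh console-server cat /dev/ttyUSB0
LOCALVERSION = -test

TEST_START
TEST_TYPE = patchcheck
PATCHCHECK_START = origin/master
PATCHCHECK_END = HEAD
PATCHCHECK_TYPE = boot
```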

Dhaval Giani, the other microconference lead, noted that he has found ktest to be a good test harness. He uses it to run fuzzers on various test systems and configurations. Rostedt concluded his talk by saying that he mostly just wanted more people to be aware of the tool: "ktest exists", he said with a chuckle.

Kernel unit testing

Knut Omang wanted to look at ways to add more unit testing to the kernel. He has created the Kernel Test Framework (KTF) to that end. It is a kernel module that adds some basic functionality for unit testing; he would like to see the kernel have the same unit-testing capabilities that user space has. He has integrated KTF with Google Test (gtest) as the user-space side of the framework. It communicates with the kernel using netlink to query for available tests and then to selectively run one or more tests and collect their output.

Omang wants developers to "get hooked on testing". So he tried to come up with a test suite that developers will want to use. Testing costs less the closer it is done to the developers writing the code. He is an advocate of test-driven development (TDD), but acknowledged that it is not universally popular. The basic idea behind TDD is to write tests before writing the code, but there are a number of arguments that opponents make about it. Among the complaints are that writing good tests takes a lot of time, developers do not think of themselves as testers, and that writing test code is boring.

Behan Webster said that good testers have a different mindset than developers; testers are trying to break things, while developers are trying to make something work. Rostedt added that it is better if a developer doesn't write the tests because they have too much knowledge of how the code is supposed to work, so they will overlook things. Another attendee pointed out that "if you don't have tests, you don't have working code"; it may seem to be working, but it will break at some point.

There was also some discussion about how to do unit testing for components like drivers. A lot of code infrastructure is needed before anything in a driver works at all, which limits the testing that can be done earlier. Omang and others believe that the problem can be decomposed into smaller pieces that can be individually and separately tested, though some in the audience seemed skeptical of that approach.

Kernel memory sanitizer

Finding places where uninitialized memory is used is a potent way to find bugs; finding those places is what the KernelMemorySanitizer (KMSAN) aims to do. Alexander Potapenko described the tool, which has found a lot of bugs in both upstream kernels and in the downstream Google kernels. It stems from the user-space MemorySanitizer tool that came about in 2012.

The idea is to detect the use of uninitialized values in memory, not simply uninitialized variables. Those values could be used in a branch, used as an index, dereferenced, copied to user space, or written to a hardware device. KMSAN has found 13 bugs so far, though he thought another might have been found earlier on the day of the microconference (September 15). KMSAN requires building the kernel with Clang.

To track the state of kernel memory, KMSAN uses shadow memory that is the same size as the memory used by the kernel. That allows KMSAN to track initialization of memory at the bit level (i.e. it can detect that a single bit has been used but not initialized). KASAN uses a similar technique, but tracks memory at the byte level, so its shadow memory uses one byte for each eight bytes of kernel memory.
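The shadow-memory bookkeeping can be illustrated in miniature: keep one shadow bit per data bit, set ("poisoned") at allocation and cleared on write, and flag any read of memory whose shadow bits are still set. A toy user-space sketch of that scheme (KMSAN itself does all of this through compiler instrumentation, not explicit calls; names here are invented):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One shadow bit per data bit: the shadow is the same size as the data,
 * which is why KMSAN roughly doubles memory use. */
struct tracked {
    uint8_t *data;
    uint8_t *shadow;   /* bit set => corresponding data bit uninitialized */
    size_t len;
};

static struct tracked t_alloc(size_t len)
{
    struct tracked t = { malloc(len), malloc(len), len };
    memset(t.shadow, 0xff, len);   /* everything starts poisoned */
    return t;
}

static void t_write(struct tracked *t, size_t i, uint8_t v)
{
    t->data[i] = v;
    t->shadow[i] = 0;              /* byte now fully initialized */
}

/* Returns 1 and reports if the read touches uninitialized bits. */
static int t_read(struct tracked *t, size_t i, uint8_t *out)
{
    if (t->shadow[i]) {
        fprintf(stderr, "use of uninitialized value at byte %zu "
                        "(shadow %#x)\n", i, t->shadow[i]);
        return 1;
    }
    *out = t->data[i];
    return 0;
}
```

Writing byte 0 of a two-byte allocation and then reading bytes 0 and 1 flags only byte 1 — the read of the byte that was never written.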

KMSAN obviously requires a lot of memory, so Levin wondered if there could be options that used less. Potapenko said that doing so creates a lot of false positives, so it is not worth it. He also noted that kmemcheck has found five bugs in the last five years, but that KMSAN runs 5-10 times faster so it finds more bugs.

In the future, KMSAN could be used for taint analysis by using the shadow mapping to track data coming from untrusted sources. It could also help fuzzers determine which function arguments are more useful to change and to track the origin of those values back to places where they enter the kernel. He pointed to CVE-2017-1000380, which was found by KMSAN, and wondered if there is a way to kill off all of the uninitialized-memory bugs of that sort. Simply replacing calls to kmalloc() with kzalloc() may be tempting, but could be problematic.

KMSAN requires patches to Clang (and the ability to build the kernel with Clang). He hopes to see KMSAN added to the upstream kernel by the end of the year.

Conclusion

The kernel testing story is clearly getting better. There is still plenty to do, of course, but more varied and larger quantities of testing are being done—much of it automatically. That is finding more bugs; with luck, it may eventually outrun the kernel's development pace, so that it is finding more bugs than are being added every day. As kernel development proceeds apace, it is important that testing gets out ahead of it and stays there.

[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Los Angeles for LPC.]

Index entries for this article
Security: Fuzzing
Conference: Linux Plumbers Conference/2017



ktest

Posted Oct 4, 2017 20:20 UTC (Wed) by smurf (subscriber, #17840) [Link] (6 responses)

What is "reverse bisection"?

ktest

Posted Oct 4, 2017 20:22 UTC (Wed) by smurf (subscriber, #17840) [Link] (5 responses)

Ah. Found it.

Bisection is to start with a good revision and find the point where things went south.
Reverse bisection thus is to start with a bad revision and find the point where the problem was fixed.

ktest

Posted Oct 4, 2017 20:43 UTC (Wed) by xorbe (guest, #3165) [Link]

That's still just ... bisection, lol. Just searching for different things.

ktest

Posted Oct 4, 2017 20:46 UTC (Wed) by ballombe (subscriber, #9523) [Link] (3 responses)

Exactly. I do them all the time.
I would be happy if 'git bisect' had an option for that. It is very confusing to have to do 'git bisect good' each time the software crashes...

ktest

Posted Oct 4, 2017 20:52 UTC (Wed) by arbab (subscriber, #75058) [Link] (1 responses)

It does now. You can do `git bisect {old,new}`, or set your own terms with --term-old/--term-new.

ktest

Posted Oct 4, 2017 21:00 UTC (Wed) by sharkhands (guest, #114731) [Link]

You beat me to it!

ktest

Posted Oct 4, 2017 20:59 UTC (Wed) by sharkhands (guest, #114731) [Link]

It does. On my machine, git bisect supports good/bad or old/new by default, but you can add custom terms with git bisect terms.
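For instance, hunting for the commit that fixed a bug (git 2.7 or later; shown here on a throwaway repository so the commands can be run as-is):

```shell
# Reverse bisection: old revisions show the bug ("broken"), new ones
# do not ("fixed"); git then searches for the commit with the fix.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email you@example.com && git config user.name you
for i in 1 2 3 4; do echo "$i" > f; git add f; git commit -qm "c$i"; done
git bisect start --term-old=broken --term-new=fixed
git bisect fixed HEAD       # the tip no longer shows the bug
git bisect broken HEAD~3    # this old commit still crashed
git bisect terms            # reports which term maps to old/new
```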

More from the testing and fuzzing microconference

Posted Oct 4, 2017 21:35 UTC (Wed) by cpitrat (subscriber, #116459) [Link] (14 responses)

"Rostedt added that it is better if a developer doesn't write the tests because they have too much knowledge of how the code is supposed to work, so they will overlook things."

I can't believe I'm reading this! Overlooking things comes from not putting enough effort into writing the tests, not from having too much knowledge. Code review should review tests also and catch things that were overlooked. Moreover, the fact that developers write unit tests doesn't mean that those have to be the only tests.

The kernel could really benefit from taking a look at sqlite testing policy and see what is feasible. I know there are some problems specific to the kernel (in particular interfacing with hardware). But what they manage to do with sqlite just makes you think twice before saying "we can't do it, we're a special snowflake".

More from the testing and fuzzing microconference

Posted Oct 5, 2017 8:41 UTC (Thu) by epa (subscriber, #39769) [Link] (13 responses)

I think it could have been phrased better as "somebody other than the code's original developer should write some tests". Not that the developer shouldn't write any.

In my experience, the point he is making is true. If you have written some code to perform a particular task, the testing you do will usually be aligned to your assumptions about that task. You can make a special effort to test 'pathological cases', but even then you probably haven't thought of all of them (and the ones you do consider will be the same ones you considered when writing the code). As soon as real users get their hands on the program, they do something 'stupid' with it and it crashes. A fresh pair of eyes, somebody who knows what the code is meant to do but nothing about its implementation, can often catch bugs the original developer missed.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 9:23 UTC (Thu) by osma (subscriber, #6912) [Link] (8 responses)

"If you have written some code" implies that the code is written before the tests. Writing the test first might help...

More from the testing and fuzzing microconference

Posted Oct 5, 2017 12:20 UTC (Thu) by pizza (subscriber, #46) [Link] (7 responses)

Not if the scenario in said "failing test" was never envisioned by the code author.

There's a damn good reason why testing/verification/QA/etc is best performed by a separate team; different folks will interpret the (inevitably slightly ambiguous) specifications differently, which will lead to different assumptions about how it should behave.

Similarly, you can't fully trust a specification unless there are at least two independent implementations (and/or users) of it. Over the course of my career, I can't count the number of times I've seen "both sides pass all our tests" only to utterly fail when bolted onto each other, because the different implementations made different assumptions about un- or under-specified bits.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 14:30 UTC (Thu) by metan (subscriber, #74107) [Link] (6 responses)

Unfortunately quite a lot of kernel interfaces do not even have a specification, and the only documentation for the interface is code scattered around several other projects that use the interface in question. So for the kernel we have to resort to reading the kernel/libc/util-linux/... source code and/or talking to the kernel developers most of the time. We did quite a few fixes for manual pages for interfaces that are actually documented, though. The thing is that in the real world the separation of implementers and testers does not really work out. It would be much better if kernel developers would provide unit tests for the userspace interfaces, which would work as documentation as well; then we could build better testcases based on these.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 16:12 UTC (Thu) by pizza (subscriber, #46) [Link] (5 responses)

> It would be much better if kernel developers would provide unit tests for the userspace interfaces, which would work as documentation as well, then we could have build a better testcases based on these.

Their supplying a test suite only means that the scenarios they envisioned will be tested. It doesn't test stuff they never thought about, in no small part because they're blinded by the intended use of that interface.

Meanwhile, I don't disagree that it would be better if the authors supplied test suites. And/or documentation. But those things take time, something in perpetually short supply and rarely prioritized by the ones writing our paychecks.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 22:30 UTC (Thu) by k3ninho (subscriber, #50375) [Link] (4 responses)

>Their supplying a test suite only means that the scenarios they envisioned will be tested.

Aerospace, Civil, Mechanical and Structural Engineers can sign off their designs and build all of their output from the work of a single engineer. It's a canard that a software development engineer won't consider how her program code might fail -- it's a responsibility to consider this and include it in the design and implementation. When this is done, unit tests and interface/integration tests can provide a canonical example of the way the program code will work with both good and bad inputs.

Your argument is great if we allow software developers to be immune to the consequences of their software failure but, in this century, we're engineering systems and people must qualify why their code should be trusted with auditable testcases and should quantify the capacity and performance it meets. To permit any less than this is to persist the failures of the system we see around us every day. Can we achieve higher standards?

K3n.

More from the testing and fuzzing microconference

Posted Oct 6, 2017 1:41 UTC (Fri) by pizza (subscriber, #46) [Link] (3 responses)

> Can we achieve higher standards?

Absolutely! But are you willing to pay the (significant) costs associated with those higher standards associated with the traditional Engineering disciplines? Outside of truly safety-critical stuff, the answer is not just "no", but a resounding "hell no!"

(And I say that as someone who's currently working on code that's destined to be burned into mask ROM)

More from the testing and fuzzing microconference

Posted Oct 6, 2017 9:53 UTC (Fri) by k3ninho (subscriber, #50375) [Link] (2 responses)

> But are you willing to pay the (significant) costs associated with those higher standards associated with the traditional Engineering disciplines? Outside of truly safety-critical stuff, the answer is not just "no", but a resounding "hell no!"

My view of this Internet of Things malarkey is that it *is* truly safety-critical stuff*. My view, allowing for the diligence I assume you take, is that I have to share an internet with these things, an ecosystem of data transfers, and that not paying for it now will incur many times that cost later on.

Not merely in an unsafe playground, hijackers and attackers stealing vital secrets, encryption-kidnapping-for-blackmail (and the rest) but in poor 'best practices', a ninja-rockstar developer culture in place of a systems engineering culture, and corporations whose lifeblood is data traveling the internet (eg Equifax) believing that they're safe when they are not.

It's easy to moan on a comment thread. This is hard to achieve in real life.

K3n.

*: Safety-critical stuff is what your contract says is safety critical. I get that, if there's no demand for it, there's no budget for it and no work done on it. I get that, feel free to skip the rest of my words if dollar-dollar is where the (ahem) puck stops.

More from the testing and fuzzing microconference

Posted Oct 6, 2017 10:39 UTC (Fri) by pizza (subscriber, #46) [Link] (1 responses)

FWIW, I agree with you with respect to the IoT side of things being far more safety-critical than anyone involved is willing to admit. But the costs of piss-poor-practices are externalized, resulting in the likes of Mirai, which was but a tiny taste of the deluge of utter crap about to rain down upon us. Until those costs are brought back inside (which can only happen via regulation or insurance-company mandates) this is only going to get worse.

I fear (and expect) that we're going to end up with some high-profile disaster resulting in easily-preventable deaths, resulting in "high quality" being legally mandated; not unlike the FDA's process that still allows for drive-by-hacking of insulin pumps and pacemakers. Naturally, the law will be written by idiots, leading to device makers tapdancing around the rules (and locking down hardware completely) and F/OSS authors just giving up because liability insurance just won't be worth the expense.

(Yeah, I'm not feeling particularly optimistic today..)

More from the testing and fuzzing microconference

Posted Oct 6, 2017 15:52 UTC (Fri) by k3ninho (subscriber, #50375) [Link]

ISO 9000 and family have quality audit processes in place. The law will mandate auditable records of what you did.

To some extent a test suite that says 'I can conceive of these positive test cases and these negative test cases -- please add more! -- for the scope of this functional unit, and ran them with reproducible build with hashes A and B using the included build scripts and the listed versions of items in the build chain' is quite auditable: read the code, read the test cases, check that they're plumbed up legitimately and not hiding 'assert(true);', 'make test' and record output.

We can do this today. May I check: are there any obvious holes in what I've said?

K3n.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 11:34 UTC (Thu) by k3ninho (subscriber, #50375) [Link] (1 responses)

>In my experience, the point he is making is true. If you have written some code to perform a particular task, the testing you do will usually be aligned to your assumptions about that task.

I think that an arbitrary distinction between 'develop' and 'test' is essentially gone in the leading edge of computer systems engineering companies. Let's call it 'engineering' and not trip over whose role it is that program code is written or validated. Test infrastructure is made by people with titles like 'software engineer in test' and they make a product that is trusted to tell you if you should trust the collection of bytes that are sold to end-users as the final product. The whole team is engineering different parts of the whole product, your customers' chain of trust (and also the company's reputation) begins with the testbed and the '...in test' guys. If you really make better systems than those guys, get involved with writing the tests.

Let's look at an expanded view of '[t]esting costs less the closer it is done to the developers writing the code', so we'll talk about the lowest mean-time-to-recovery. This is an ops metric I repurpose when engineering a test system, to help the whole engineering team to have the shortest feedback loop and quickest time between experiencing a defect and rectifying it. In that context, pair-programming, duck discussions and team design meetings are all very helpful things. They fix misunderstandings before any program code is written or tests are designed, the worst being tests which give you false confidence when they measure the wrong parts of the system. Accepted, pairing is hard without someone next to you, and explaining your design choices on a whiteboard (optionally to a plastic duck if there's nobody sitting next to you) feels like a waste of time, until you account for it helping you to lock down all the things in your design which cause unintended consequences or side effects in your complex system. And in a complex system, your own code might not be the cause of the unintended consequences, so a well-designed test (setup, single-event, valid measurements) is a core part of knowing your program code is doing it right and chasing down where the source of the surprising behaviour is -- aiding your mean time to recovery.

K3n.

More from the testing and fuzzing microconference

Posted Oct 5, 2017 13:49 UTC (Thu) by pj (subscriber, #4506) [Link]

I've had better results selling developers on writing tests if they're couched as "acceptance tests", as in "look, you're going to need to figure out whether what you wrote actually works anyway, right? So how about replacing your manual tests with 1) writing some code and 2) typing 'make test' - that way you'll know when you're done!"

Caveat: it's often helpful to get a test engineer to write fundamental pieces of the framework so the developer can concentrate on the new-feature-testing specific bits: things like server or environment setup and startup are closer to devops than straight development and thus can be a bit daunting.

More from the testing and fuzzing microconference

Posted Oct 6, 2017 19:50 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

A fresh pair of eyes, somebody who knows what the code is meant to do but nothing about its implementation, can often catch bugs the original developer missed.

Fuzz testing shows there are plenty of bugs that can be caught by a sufficiently persistent tester with little or no knowledge of what the code is supposed to do. It seems like a good example of how bugs can lurk in the parts of the code the tester didn't think about.

More from the testing and fuzzing microconference

Posted Oct 12, 2017 19:13 UTC (Thu) by Wol (subscriber, #4433) [Link]

What nobody has mentioned (and buddy/pair coding is good at catching this) is that it is a very human tendency to assume that what you wrote is what you meant to write.

I will type a comment, proof-read it, and post it. At which point I *then* realise that there are a whole bunch of obvious errors. It's a fundamental truth that every programmer should be taught on day one - the quickest way to find errors is to explain your code to someone else. You'll spot typos, wrong variables, all sorts of stupid stuff that you would absolutely miss just checking your code yourself, BECAUSE YOU DON'T SEE WHAT'S ACTUALLY THERE. You see what you think you wrote.

Cheers,
Wol

More from the testing and fuzzing microconference

Posted Oct 5, 2017 13:06 UTC (Thu) by metan (subscriber, #74107) [Link]

Kernel QA and LTP maintainer here.

We do have a plan for running fuzzers in openQA at SUSE on our long-term TODO; it should be easy enough in theory, since the machine under test is separate from the machine that does the actual testing, but the devil is likely in the details.

Also I'm planning to open a public tracker where anybody can upload a kernel bug reproducer, which will then be turned into an LTP testcase. We have a bit more manpower working on LTP lately, which allows us to invest some effort into this, and having a central place to store these would help us a lot, since otherwise we have to fish them out of kernel commit logs and mailing lists.

More from the testing and fuzzing microconference

Posted Oct 21, 2017 2:12 UTC (Sat) by toxorstein (guest, #60168) [Link] (2 responses)

Does anybody know if there are fuzzers that work with web applications? Something that takes a brief api description and tries to go nuts with it?

We are mulling over that idea but wanted to know if there are already tools that can give us a head start on that regard.

More from the testing and fuzzing microconference

Posted Oct 21, 2017 16:20 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

There are tools which simulate user load against an HTTP API. There has to be some way to describe endpoints, required data, and call ordering. Starting from there is probably easiest. IIRC, the tool I had seen was called Locust.

More from the testing and fuzzing microconference

Posted Oct 28, 2017 16:45 UTC (Sat) by flussence (guest, #85566) [Link]

Skipfish works pretty well, though I haven't used it in years. Triggered plenty of HTTP 500 responses in my code that I never would've found otherwise.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds