Notes from the LPC tracing microconference

By Jonathan Corbet
September 21, 2017

The "tracing and BPF" microconference was held on the final day of the 2017 Linux Plumbers Conference; it covered a number of topics relevant to heavy users of kernel and user-space tracing. Read on for a summary of several of those discussions, covering BPF introspection, stack traces, kprobes, uprobes, and the Common Trace Format.

Unfortunately, your editor had to leave the session before it reached its end, so this article does not reflect all of the topics discussed there. For those who are interested, this Etherpad instance contains notes taken by participants at the session.

BPF introspection

Martin Lau started the session by noting that BPF programs typically use maps to communicate with the kernel or user space. It can, however, be hard for an interested person to see what is actually in any given map. A look at a BPF program's source will reveal what it is storing in a map, but that source may not always be available. What Lau would like to have is some sort of easy way to pretty-print the contents of a map.

His proposed solution was to attach a bit of metadata to each map describing the entries found therein. It would look like a C structure definition. The proposed name for this description was the "compact C-type format" or CTF, but that name will almost certainly have to change if this work goes forward, since that acronym is already used for the common trace format. The description would be created with a utility program, then passed into the kernel via the bpf() system call that creates the map. The kernel would verify the data and store it, making it available later on request.
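To make the idea concrete, here is a minimal user-space sketch in Python; it is not the proposed kernel interface. It assumes a hypothetical map whose values hold a process ID, a byte counter, and a command name. The ConnStats layout, its field names, and the pretty_print_value() helper are all invented for illustration. The layout plays the role of the metadata that would be attached to the map: with it in hand, raw value bytes can be decoded without access to the BPF program's source.

```python
import ctypes


# Hypothetical value layout, standing in for the type description that
# would be attached to a map. It is not taken from any real BPF program.
class ConnStats(ctypes.Structure):
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("rx_bytes", ctypes.c_uint64),
        ("comm", ctypes.c_char * 16),
    ]


def pretty_print_value(raw, layout=ConnStats):
    """Decode raw map-value bytes with the given layout and render them."""
    # A size mismatch means the description does not fit the data.
    if len(raw) != ctypes.sizeof(layout):
        raise ValueError("value size does not match the type description")
    value = layout.from_buffer_copy(raw)
    parts = []
    for name, _ctype in layout._fields_:
        field = getattr(value, name)
        if isinstance(field, bytes):
            # char arrays come back as bytes, cut at the first NUL
            field = field.decode("ascii", errors="replace")
        parts.append(f"{name}={field}")
    return " ".join(parts)


# Build a sample value the same way a map entry would be laid out.
sample = bytes(ConnStats(pid=1234, rx_bytes=4096, comm=b"nginx"))
print(pretty_print_value(sample))
```

The size check is only a stand-in for whatever verification the kernel would perform on a description passed in through bpf().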

This project may not get that far, though; there was a fair amount of doubt about whether it was really needed. If there are users who truly need a separate description of the contents of a map, it should be possible to manage that information in user space. So, while this idea may not be dead, it will clearly face some headwinds if the work goes forward.

Stack traces and kprobes

Alexei Starovoitov stood up to talk about a couple of issues that Facebook has run into; both of them come up as a result of the company's heavy use of tracing to monitor its operations. Tracing is typically running full time, and detailed tracing of specific processes can be enabled or disabled at any time, with the decision often made within the kernel. Much of the kernel's tracing support was designed around more sporadic use, so things do not always work as well as desired when tracing is done around the clock.

One trouble spot is generating stack traces associated with specific tracing events. That involves translating the address where the event happened into a symbolic address. If the address is in kernel space, Starovoitov said, that translation works most of the time, but can occasionally run into trouble if modules are loaded or removed. User-space address translation also usually works, but processes can come and go quickly, and they can also make rapid changes to the layout of their address spaces. That leads to situations where the mappings needed to do the translation no longer exist when the translation is attempted.

He had three possible solutions to discuss. The "ugly" approach involves sending an event to user space whenever tracing begins; a process there would then snapshot the address-space layouts of the tracing targets. The solution is racy, though, and thus not fully reliable.

A better (though "not pretty") alternative would be to add a BPF helper that would walk through the address space in response to events and dump the traceback info into the BPF stack. A new map type would be added to remember the needed layout information for user space when it gets around to generating the symbolic stack trace. This solution would work, but it would be expensive.

The best approach would be to have the kernel simply resolve addresses into file-and-offset pairs and generate tracebacks internally. This translation can be quickly done in the kernel, which has all of the relevant information at hand. Most tracebacks are relatively small — at least, when Java is not involved. Peter Zijlstra added that the speculative page fault patches include a lockless version of find_vma(), which would make the lookups even faster. So it seems that the "best" solution will be the one chosen here.
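The translation itself is simple once the address-space layout is known. The following Python sketch is a user-space analogue only; the in-kernel version would presumably work from the VMA found by find_vma() rather than from text. It parses text in the format of /proc/<pid>/maps and resolves an address into a file-and-offset pair. The sample addresses and paths are made up, and parse_maps() and resolve() are names chosen for this illustration.

```python
import bisect

# Sample text in the format of /proc/<pid>/maps; the values are invented.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 131 /usr/bin/cat
7f1c2a000000-7f1c2a1c0000 r-xp 00000000 08:01 262 /lib/x86_64-linux-gnu/libc.so.6
7f1c2a3c0000-7f1c2a3c4000 r--p 001c0000 08:01 262 /lib/x86_64-linux-gnu/libc.so.6
"""


def parse_maps(text):
    """Return (start, end, file_offset, path) tuples sorted by start."""
    regions = []
    for line in text.splitlines():
        fields = line.strip().split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping: no backing file to report
        start, end = (int(x, 16) for x in fields[0].split("-"))
        regions.append((start, end, int(fields[2], 16), fields[5]))
    regions.sort()
    return regions


def resolve(regions, addr):
    """Translate addr into (path, offset in file), or None if unmapped."""
    starts = [r[0] for r in regions]
    # Find the last region that starts at or below addr.
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, end, file_off, path = regions[i]
    if addr >= end:  # the end address is exclusive
        return None
    return path, addr - start + file_off


regions = parse_maps(SAMPLE_MAPS)
print(resolve(regions, 0x400123))
```

A file-and-offset pair stays meaningful after the process has exited or remapped its address space, which is what makes it usable for symbol lookup later on.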

The other problem has to do with kprobes — dynamic tracing points inserted into the kernel at run time. Facebook makes heavy use of kprobes to instrument parts of the kernel that do not have a convenient tracepoint available. The problem, he said, is that kprobes are globally managed objects, and they "kind of suck". Most of the troubles come down to the text-file interface that is used to manage them.

At the top of the list of complaints is the fact that a process can insert kprobes then exit unexpectedly (by crashing, perhaps); those probes will not be automatically cleaned up by the kernel. Multiple processes can place probes at the same point, leading to name clashes and complicating the task of cleaning up after a crash. There are also mundane problems with the use of special characters in probe names.
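For readers who have not used it, the text-file interface works by writing definition lines to the kprobe_events file in the tracing filesystem (commonly found under /sys/kernel/debug/tracing). The Python sketch below only builds those lines; it does not write them anywhere. The helper names are invented for this illustration, and the name check is an assumption that event names are limited to C-identifier characters.

```python
import re

# Assumption: event names must look like C identifiers.
_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def kprobe_add_cmd(event, symbol, group="kprobes"):
    """Build the line that defines a kprobe named group/event at symbol."""
    if not _NAME_RE.match(event):
        raise ValueError(f"unsuitable event name: {event!r}")
    return f"p:{group}/{event} {symbol}"


def kprobe_del_cmd(event, group="kprobes"):
    """Build the line that removes the kprobe named group/event."""
    return f"-:{group}/{event}"


# Two unrelated processes that pick the same name produce identical
# definitions, so neither can tell whose probe is whose afterward.
print(kprobe_add_cmd("my_open", "do_sys_open"))
print(kprobe_del_cmd("my_open"))
```

Nothing in these lines ties a probe to the process that created it; a probe lives until somebody writes the matching removal line, which is the cleanup problem described above.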

The solution he proposed was to extend the perf events subsystem (and the perf_event_open() system call in particular) with the ability to create kprobes. Those kprobes would be tied to the file descriptor returned by perf_event_open() and would be easily cleaned up by the kernel when the descriptor is closed. There would be no naming conflicts, and kprobes could have arbitrary names.

There were no conceptual objections to this proposal, but there are concerns that too much functionality has already been crammed into perf_event_open(). So Steve Rostedt suggested that it might be better to create a new system call for this purpose. He would also like a system call for the enabling of ftrace events. He has not done any of this work, though, out of fear of stepping on toes in the development community.

Another desired feature is "lightweight kprobes" that would have less of a runtime impact. They would avoid disabling interrupts and only save a subset of the registers. Various ideas were tossed around, but none of them exist in code at this point. Expect to see some proposals in the not-too-distant future.

Uprobe performance

Uprobes are dynamic probes placed into a user-space process; as Yonghong Song noted, these probes can create performance problems. A uprobe is implemented as a trap into the kernel but, by the time that the execution of the probe is complete, up to three traps will be required to restore the process state and avoid breaking the application. That can make uprobes too expensive to use.

Various tracing systems have found their own ways of addressing this problem. SystemTap, for example, uses ptrace() to stop the process to be probed, then inserts a jump instruction to a user-space handler, avoiding the kernel entirely. LTTng, instead, relies on tracepoints inserted into the source and a separate thread to communicate trace data to the listener. Neither approach is ideal, so Song wanted to know if anybody had a better idea.

Zijlstra suggested putting no-op instructions into the code where a probe might be placed. The actual probe could then be a simple INT3 instruction that need not displace any existing instructions and, as a result, needs no traps. This approach does require developers to know where probes might be placed, though.
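The placeholder idea can be modeled on a byte buffer. This Python sketch is a toy simulation, not a real patching mechanism: it treats a bytearray as the program text and swaps the one-byte x86 no-op (0x90) for the one-byte INT3 breakpoint (0xCC). The function names and the sample code bytes are chosen for illustration.

```python
NOP = 0x90   # x86 one-byte no-op
INT3 = 0xCC  # x86 one-byte breakpoint


def arm_probe(text, offset):
    """Replace a placeholder no-op with a breakpoint."""
    if text[offset] != NOP:
        raise ValueError("no placeholder at this offset")
    text[offset] = INT3


def disarm_probe(text, offset):
    """Put the placeholder no-op back."""
    if text[offset] != INT3:
        raise ValueError("no probe at this offset")
    text[offset] = NOP


# push %rbp; nop; mov %rsp,%rbp; pop %rbp; ret
code = bytearray.fromhex("55 90 4889e5 5d c3")
arm_probe(code, 1)
print(code.hex())
```

Because the overwritten byte was a no-op, no real instruction has been displaced, so there is nothing to relocate or emulate once the probe has fired.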

An alternative would be to place a jump directly to another user-space address, shorting out the kernel entirely. Users want to run BPF programs from uprobes, but there is no reason why that couldn't be done in user space. Perhaps what is really needed is some sort of kernel-assisted mechanism to allow tracing systems to patch user-space program text. Various ideas were tossed around; which of those will turn up in code remains to be seen.
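As a rough illustration of what placing such a jump involves on x86-64, the Python sketch below encodes the common near-jump form: opcode 0xE9 followed by a signed 32-bit displacement measured from the end of the instruction. The function name is invented for this example; it only computes the bytes and patches nothing.

```python
import struct

JMP_REL32_LEN = 5  # opcode 0xE9 plus a 4-byte displacement


def encode_jmp_rel32(src, dst):
    """Return the 5 bytes of a near jump located at src that lands on dst."""
    # The displacement is relative to the address after the instruction.
    disp = dst - (src + JMP_REL32_LEN)
    if not -(1 << 31) <= disp < (1 << 31):
        raise ValueError("target is out of range for a rel32 jump")
    return b"\xe9" + struct.pack("<i", disp)


print(encode_jmp_rel32(0x401000, 0x401010).hex())
```

Two constraints fall out of the encoding: the jump occupies five bytes, so it may overwrite more than one instruction at the patch site, and the handler has to live within roughly 2GB of the probed address.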

The other CTF

Matthieu Desnoyers gave a quick overview of the common trace format, a specification for the representation of tracing data. There are a lot of tracers that can produce data in this format, and quite a few tools that can use it, including Trace Compass and LTTng Scope. There is, however, one missing link: there is no CTF output from ftrace. His proposed solution was to make an ftrace input module for the Babeltrace translation utility.

Zijlstra asked what CTF was good for in the end; when he was informed that it was used with graphical tracing tools, he joked that there was "no point in using it." Most of the other people in the room felt that this translator would be useful, though; the only real question is who would write it. Rostedt said that he would like this feature, but he hasn't had the time to work on it. A suggestion that an ftrace input module would be a good Google Summer of Code project was well received; that may well be the approach that is taken to get this software written.

BPF tools

Brendan Gregg gave an energetic talk about tools for tracing with BPF. The BPF Compiler Collection (BCC) now contains about 100 individual tools. They are becoming more advanced and specialized over time; there is one to measure MySQL pool contention, for example. It seems clear that there is a limit to the number of these tools that really belong in BCC; nobody wants to see 1000 scripts there. It may be time to look at creating some more specialized repositories for many of these scripts.

He also talked about a desire for a higher-level interface to BPF tracing functionality. The Ply project was working in that direction, but it appears to be stalled. More recently, work has gone into bpftrace, but it may well be that we can do something better. This would be, he said, a good opportunity for a "language nerd" to come up with a better way of describing tracing tasks. No nerds of this type raised their hands at the session, though.

[Your editor would like to thank the Linux Foundation, LWN's travel sponsor, for supporting his travel to LPC 2017].

Notes from the LPC tracing microconference

Posted Sep 21, 2017 22:23 UTC (Thu) by roc (subscriber, #30627)

I'm a bit confused about the statement that INT3 "needs no traps". It does force a transition into and out of the kernel (to a SIGTRAP signal handler presumably). Isn't that what "trap" means?

The problem with inserting jump instructions to patch uncooperative user code is that an x86(-64) instruction can be as little as one byte long, so a jump instruction may not fit. In general you cannot know if the program will jump to the very next instruction, which would be a huge problem if you've replaced it with the jump address.

On x86-64 there is the additional problem that a 5-byte jump instruction can only jump within a 2GB range of the current PC, so you generally have tricky extra work to do to reach your patch code from anywhere.

I don't see a way to patch uncooperative user code reliably other than using INT3 (or the less well known ICEBP/INT1 instruction) and incurring the kernel round-trip.

However, I can imagine kernel features that could make that more efficient/transparent than a full signal handler for people doing userspace patching. For example, you could set up a user-space data structure mapping source addresses containing INT3s to patch target addresses, and invoke a system call to tell the kernel the address of that map. Then when an INT3 occurs the kernel could look up that map and if an entry is found, just set the PC directly to that target address, without saving/restoring any state, instead of raising SIGTRAP. I guess in general you would also need to specify that some general-purpose registers get saved to a defined user-space location (per thread) so the patch code can do useful work.

Even better would be to have this facility in hardware so you don't have to go through the kernel...

Notes from the LPC tracing microconference

Posted Sep 22, 2017 12:17 UTC (Fri) by nix (subscriber, #2304)

CTF with the very same acronymic expansion is also used for the compacted type description format used by DTrace.

Please don't reuse it a *third* time!

Notes from the LPC tracing microconference

Posted Sep 22, 2017 12:30 UTC (Fri) by nix (subscriber, #2304)

Note that a library that can generate and read this format is available under the GPLv2, though of course it might need some hacking to work in kernelspace.