Notes from the LPC tracing microconference
Unfortunately, your editor had to leave the session before it reached its end, so this article does not reflect all of the topics discussed there. For those who are interested, this Etherpad instance contains notes taken by participants at the session.
BPF introspection
Martin Lau started the session by noting that BPF programs typically use
maps to communicate with the kernel or user space. It can, however, be
hard for an interested person to see what is actually in any given map. A
look at a BPF program's
source will reveal what it is storing in a map, but that source may not
always be available. What Lau would like to have is some sort of easy way
to pretty-print the contents of a map.
His proposed solution was to attach a bit of metadata to each map describing the entries found therein. It would look like a C structure definition. The proposed name for this description was the "compact C-type format" or CTF, but that name will almost certainly have to change if this work goes forward, since that acronym is already used for the common trace format. The description would be created with a utility program, then passed into the kernel via the bpf() system call that creates the map. The kernel would verify the data and store it, making it available later on request.
This project may not get that far, though; there was a fair amount of doubt
about whether it was really needed. If there are users who truly need a
separate description of the contents of a map, it should be possible to
manage that information in user space. So, while this idea may not be
dead, it will clearly face some headwinds if the work goes forward.
Stack traces and kprobes
Alexei Starovoitov stood up to talk about a couple of issues that Facebook has run into; both of them come up as a result of the company's heavy use of tracing to monitor its operations. Tracing is typically running full time, and detailed tracing of specific processes can be enabled or disabled at any time, with the decision often made within the kernel. Much of the kernel's tracing support was designed around more sporadic use, so things do not always work as well as desired when tracing is done around the clock.
One trouble spot is generating stack traces associated with specific
tracing events. That involves translating the address where the event
happened into a symbolic address. If the address is in kernel space,
Starovoitov said, that translation works most of the time, but can
occasionally run into trouble if modules are loaded or removed.
User-space address translation also usually works, but processes can come
and go quickly, and they can also make rapid changes to the layout of their
address spaces. That leads to situations where the mappings needed to do
the translation no longer exist when the translation is attempted.
He had three possible solutions to discuss. The "ugly" approach involves sending an event to user space whenever tracing begins; a process there would then snapshot the address-space layouts of the tracing targets. The solution is racy, though, and thus not fully reliable.
A better (though "not pretty") alternative would be to add a BPF helper that would walk through the address space in response to events and dump the traceback info into the BPF stack. A new map type would be added to remember the needed layout information for user space when it gets around to generating the symbolic stack trace. This solution would work, but it would be expensive.
The best approach would be to have the kernel simply resolve addresses into file-and-offset pairs and generate tracebacks internally. This translation can be quickly done in the kernel, which has all of the relevant information at hand. Most tracebacks are relatively small — at least, when Java is not involved. Peter Zijlstra added that the speculative page fault patches include a lockless version of find_vma(), which would make the lookups even faster. So it seems that the "best" solution will be the one chosen here.
The other problem has to do with kprobes — dynamic tracing points inserted into the kernel at run time. Facebook makes heavy use of kprobes to instrument parts of the kernel that do not have a convenient tracepoint available. The problem, he said, is that kprobes are globally managed objects, and they "kind of suck". Most of the troubles come down to the text-file interface that is used to manage them.
At the top of the list of complaints is the fact that a process can insert kprobes then exit unexpectedly (by crashing, perhaps); those probes will not be automatically cleaned up by the kernel. Multiple processes can place probes at the same point, leading to name clashes and complicating the task of cleaning up after a crash. There are also mundane problems with the use of special characters in probe names.
The solution he proposed was to extend the perf events subsystem (and perf_event_open() system call in particular) with the ability to create kprobes. Those kprobes would be tied to the file descriptor returned by perf_event_open() and would be easily cleaned up by the kernel when the descriptor is closed. There would be no naming conflicts, and kprobes could have arbitrary names.
There were no conceptual objections to this proposal, but there are concerns that too much functionality has already been crammed into perf_event_open(). So Steve Rostedt suggested that it might be better to create a new system call for this purpose. He would also like a system call for the enabling of ftrace events. He has not done any of this work, though, out of fear of stepping on toes in the development community.
Another desired feature is "lightweight kprobes" that would have less of a runtime impact. They would avoid disabling interrupts and only save a subset of the registers. Various ideas were tossed around, but none of them exist in code at this point. Expect to see some proposals in the not-too-distant future.
Uprobe performance
Uprobes are dynamic probes placed into a user-space process; as Yonghong
Song noted, these probes can create performance problems. A uprobe is
implemented as a trap into the kernel but, by the time that the execution
of the probe is complete, up to three traps will be required to restore the
process state and avoid breaking the application. That can make uprobes
too expensive to use.
Various tracing systems have found their own ways of addressing this problem. SystemTap, for example, uses ptrace() to stop the process to be probed, then inserts a jump instruction to a user-space handler, avoiding the kernel entirely. LTTng, instead, relies on tracepoints inserted into the source and a separate thread to communicate trace data to the listener. Neither approach is ideal, so Song wanted to know if anybody had a better idea.
Zijlstra suggested putting no-op instructions into the code where a probe might be placed. The actual probe could then be a simple INT3 instruction that need not displace any existing instructions and, as a result, needs no traps. This approach does require developers to know where probes might be placed, though.
An alternative would be to place a jump directly to another user-space address, shorting out the kernel entirely. Users want to run BPF programs from uprobes, but there is no reason why that couldn't be done in user space. Perhaps what is really needed is some sort of kernel-assisted mechanism to allow tracing systems to patch user-space program text. Various ideas were tossed around; which of those will turn up in code remains to be seen.
The other CTF
Matthieu Desnoyers gave a quick overview of the common trace format, a
specification for the representation of tracing data. There are a lot of
tracers that can produce data in this format, and quite a few tools that
can use it, including Trace Compass
and LTTng Scope. There
is, however, one missing link: there is no CTF output from ftrace. His
proposed solution was to make an ftrace input module for the Babeltrace translation utility.
Zijlstra asked what CTF was good for in the end; when he was informed that
it was used with graphical tracing tools, he joked that there was "no point
in using it." Most of the other people in the room felt that this
translator would be useful, though; the only real question is who would
write it. Rostedt said that he would like this feature, but he hasn't had
the time to work on it. A suggestion that an ftrace input module would be
a good Google Summer of Code project was well received; that may well be
the approach that is taken to get this software written.
BPF tools
Brendan Gregg gave an energetic talk about tools for tracing with BPF. The
BPF Compiler Collection (BCC)
now contains about one-hundred individual tools. They are becoming more
advanced and specialized over time; there is one to measure MySQL pool
contention for example. It seems clear that there is a limit to the number
of these tools that really belong in BCC; nobody wants to see 1000 scripts
there.
It may be time to look at creating some more specialized repositories for
many of these scripts.
He also talked about a desire for a higher-level interface to BPF tracing
functionality. The Ply
project was working in that direction, but it appears to be stalled.
More recently, work has gone into bpftrace, but it may well be
that we can do something better. This would be, he said, a good
opportunity for a "language nerd" to come up with a better way of
describing tracing tasks. No nerds of this type raised their hands at the
session, though.
[Your editor would like to thank the Linux Foundation, LWN's travel
sponsor, for supporting his travel to LPC 2017].
| Index entries for this article | |
|---|---|
| Kernel | BPF |
| Kernel | Tracing |
| Conference | Linux Plumbers Conference/2017 |