Semantic patching with Coccinelle
This happened to me last month when I was adding support for 64-bit file systems to e2fsprogs. I thought I was nearly finished when I discovered I needed to write (yet another) new interface and convert (yet another) several hundred lines of code to it. The changes were complex enough that I couldn't use a script, and simple enough that I wanted to claw my eyes out with the soul-killing boredom of doing it by hand. That's when the maintainer, Theodore Ts'o, suggested I look at Coccinelle (a.k.a., spatch).
Coccinelle
Coccinelle is a tool to automatically analyze and rewrite C code. Coccinelle (pronounced cock'-see-nel) means "ladybug" in French, a name chosen because ladybugs eat other bugs. Coccinelle is not just another scripting language; it is aware of the structure of the C language and can make much more complex changes than are possible with pure string processing. For example, Coccinelle can make a particular change only in functions which are assigned to a function pointer in a particular type of array — say, thecreate member
of struct inode_operations.
The input to the tool is the file(s) to be changed and a "semantic patch," written in SmPL (Semantic Patch Language). SmPL looks a like a unified diff (a patch) with some C-like declarations mixed in. Here's an example:
@@
expression E;
identifier fld;
@@
- !E && !E->fld
+ !E || !E->fld
This semantic patch fixes the bug in which the pointer is tested for
NULL — and then dereferenced if the pointer is NULL. An example of a
bug this semantic patch found in the Linux kernel (and automatically
generated the fix for):
--- a/drivers/pci/hotplug/cpqphp_ctrl.c
+++ b/drivers/pci/hotplug/cpqphp_ctrl.c
@@ -1139,7 +1139,7 @@ static u8 set_controller_speed(struct controller
*ctrl, u8 adapter_speed, u8 hp_
for(slot = ctrl->slot; slot; slot = slot->next) {
if (slot->device == (hp_slot + ctrl->slot_device_offset))
continue;
- if (!slot->hotplug_slot && !slot->hotplug_slot->info)
+ if (!slot->hotplug_slot || !slot->hotplug_slot->info)
continue;
if (slot->hotplug_slot->info->adapter_status == 0)
continue;
(More on the semantic patch format later.)
Coccinelle is designed, written, and maintained by Julia Lawall at the Department of Computer Science at University of Copenhagen, Gilles Muller and Yoann Padioleau at the Ecole des Mines de Nantes, and René Rydhof Hansen at the Department of Computer Science of Aalborg University. Coccinelle is licensed under the GPL, however, it is written in OCaml, so the potential developer base is somewhat limited.
The original goal of Coccinelle was to automate as much as possible the task of keeping device drivers up to date with the latest kernel interfaces. But the end result can do far more than that, including finding and fixing bugs and coding style irregularities. Over 180 patches created using Coccinelle have been accepted into the Linux kernel to date.
Coccinelle quickstart
Like many languages, SmPL is best learned through example. We'll run through one simple example here just to get started. After that, the Coccinelle web page has some documentation and a plethora of examples.
First, download
Coccinelle and install it. I used the source version rather than
any of the precompiled options. The Coccinelle binary is
called spatch.
As our example, say we have program with a lot of calls to
alloca() that we would like to replace with
malloc(). alloca() allocates space on the
stack and can be more efficient and convenient than
malloc(), but it is also compiler-dependent,
non-standard, easy to use incorrectly, and has undefined behavior on
failure. (Replacing alloca() with malloc()
isn't enough, we also have to check the return value — but that
will come later.)
Here is the C file we are working on:
#include <alloca.h>
int
main(int argc, char *argv[])
{
unsigned int bytes = 1024 * 1024;
char *buf;
/* allocate memory */
buf = alloca(bytes);
return 0;
}
We could make the replacement using a scripting language
like sed:
$ sed -i 's/alloca/malloc/g' test.cBut this will replace the string "alloca" anywhere it appears. The resulting diff:
--- test.c
+++ /tmp/test.c
@@ -1,4 +1,4 @@
-#include <alloca.h>
+#include <malloc.h>
int
main(int argc, char *argv[])
@@ -6,8 +6,8 @@
unsigned int bytes = 1024 * 1024;
char *buf;
- /* allocate memory */
- buf = alloca(bytes);
+ /* mallocte memory */
+ buf = malloc(bytes);
return 0;
}
We can tweak our script to handle 90% of the cases:
$ sed -i 's/alloca(/malloc(/g' test.c
But this script doesn't handle the case where a second function name
has the first as a suffix, it depends on a particular coding style in
which no white space comes between the function name and the open
parenthesis, etc., etc. By now our simple sed script is a
hundred-character monster. It can be done, but it's a pain.
In Coccinelle, we'd use the following semantic patch:
@@ expression E; @@
-alloca(E)
+malloc(E)
Put the C file in test.c and the above semantic
patch in test.cocci and run it like so:
$ spatch -sp_file test.cocci test.c
It should produce the following diff:
--- test.c
+++ /tmp/cocci-output-17416-b5450d-test.c
@@ -7,7 +7,7 @@ main(int argc, char *argv[])
char *buf;
/* allocate memory */
- buf = alloca(bytes);
+ buf = malloc(bytes);
return 0;
}
Let's look at the semantic patch line by line.
@@ expression E; @@
This declares the "metavariable" E as a variable that can match any
expression — e.g., 1 +
2, sizeof(x), strlen(name) + sizeof(x) *
72. When spatch processes the input, it sets the value of E to
the argument to alloca(). The "@@ @@" syntax is
chosen
to resemble the line in a unified diff describing the lines to be
patched. I don't find the resemblance particularly helpful, but the
intention is well-taken.
-alloca(E)
This line says to remove any call to the
function alloca(), and to save its argument in the
metavariable E for later use.
+malloc(E)
And this line says to replace the call to alloca() with a
call to malloc() and use the value of metavariable E as
its argument.
Now, we also want to check the return value of malloc()
and return an error if it failed. We can do that too:
@@
expression E;
identifier ptr;
@@
-ptr = alloca(E);
+ptr = malloc(E);
+if (ptr == NULL)
+ return 1;
The resulting diff:
--- test.c
+++ /tmp/cocci-output-17494-22a573-test.c
@@ -7,7 +7,8 @@ main(int argc, char *argv[])
char *buf;
/* allocate memory */
- buf = alloca(bytes);
+ buf = malloc(bytes);
+ if (buf == NULL)
+ return 1;
return 0;
}
Semantic patches can be far more complex. One of my favorite examples
is the move of reference counting of the Scsi_Host
structure out of drivers. Changing this required adding an argument
to the function signature and removing a declaration and several other
lines from each SCSI driver's proc_info
function. The semantic patch, explained in detail in their OLS
2007 slides
[PPT] [ODP],
does all of this automatically. I recommend reading and re-reading this
example
until it sinks in.
Experience
My first experience with Coccinelle was mixed. In theory, Coccinelle does exactly what I want — automate complex changes to code — but in practice the implementation is beta quality. I successfully used Coccinelle to make hundreds of lines of changes with less than a hundred lines of semantic patches, but only after working directly with the developers to get bug fixes and help figuring out SmPL features. Coccinelle is one of those schizophrenic projects situated on the boundary between academic research and practical software development.
One of the first hurdles I had to overcome was teaching Coccinelle
about the macros in my code. Coccinelle has to do all its own parsing
and pre-processing — you can't just run the input C code through
cpp because then you'd have to map the post-processor output back to
the original code. Macros will sometimes confuse it enough that it
gives up parsing a function until it reaches the next safe grammatical
starting point (e.g., the next function) — which may mean that it
doesn't process most of the file. To get around this, you can create
a list of macros and feed them to spatch with
the -macro_file option. (Yes, that's one dash — one
of my pet peeves about Coccinelle is the non-standard command-line
option style.) For example, here are a few lines from the macro file I
used for e2fsprogs:
#define EXT2FS_ATTR(a)
#define _INLINE_ inline
#define ATTR(a)
You can build the list of macros by hand, but spatch has a feature
that helps find them automatically. The -parse_c option
makes spatch list the top ten parsing errors, which will include the
macro name. For example, some of the output from running spatch
-parse_c on e2fsprogs:
EXT2FS_ATTR: present in 85 parsing errors
example:
static int check_and_change_inodes(ext2_ino_t dir,
int entry EXT2FS_ATTR((unused)),
struct ext2_dir_entry *dirent, int
offset,
int blocksize EXT2FS_ATTR((unused)),
Coccinelle has improved significantly in the past few weeks. The
0.1.2 release had a number of bugs that made spatch unusable for me.
The next release, 0.1.3, fixed those bugs and with it I was able to
make practical, real-world patches. The 0.1.4 release will be out
shortly. The developers wrote and released more documentation,
including a
description of
all the command-line options [PDF] and a
grammar
for SmPL. Many more
example
spatch scripts are available now. The best reference for learning
Coccinelle continues to be the
slides
from their 2007 OLS tutorial
and the associated
paper
[PDF].
White space handling is improving; originally Coccinelle didn't care
much about white space and frequently mangled transformations involving
it, which is a problem if you want to take the hand out of
hand-editing. One of my semantic patches left a dangling semi-colon
in the middle; the developers sent me a patch to fix it within a few
days.
One thing I am absolutely certain of: learning Coccinelle and writing semantic patches was way more fun than making the changes by hand or using regular expressions. I also had much greater confidence that my changes were correct; it is remarkably pleasant to make several hundred lines of changes and have the result compile cleanly and pass the regression tests the first time.
Related work
If you really want to, you can do everything Coccinelle can do by writing your own scripts — after all, code is code. But you have to deal with all the little corner cases — e.g., to C, white space is all the same, generally speaking, but regular expressions care intensely about the difference between a space, a newline, and a tab. Use the right tool for the job — if you're just replacing a variable name and your first script works, great. If you're changing a calling convention or moving the allocation and freeing of an object to another context, give a tool like Coccinelle a try.In terms of power and flexibility, Coccinelle is similar to the Stanford compiler checker [PDF] (commercialized by Coverity). While the compiler checker is far more mature and has better flow analysis and parsing, Coccinelle can generate code to fix the bugs it finds. Most importantly, Coccinelle is open source, so developers can find and fix bugs themselves.
Some IDEs include tools to automatically refactor code, which is one aspect of what Coccinelle does. I have never personally used one of these IDE refactoring tools and can't compare it with Coccinelle, but my friends who have report that their stability leaves something to be desired. Xrefactory is a refactoring tool available on *NIX platforms which is fully integrated with Emacs and XEmacs. It is not open source and requires the purchase of a license, although one version is available for use free of charge.
Conclusion
Coccinelle is an open source tool that can analyze and transform C code according to specified rules, or semantic patches. Semantic patches are much more powerful than patches or regular expressions. The tool is beta quality right now but usable for practical tasks and the developers are very responsive. It's worth learning for any developer making a non-trivial interface change.| Index entries for this article | |
|---|---|
| Kernel | Development tools/Coccinelle |
| GuestArticles | Aurora (Henson), Valerie |