Testing AI-enhanced reviews for Linux patches
Code review is in high demand and short supply for most open-source projects. Reviewer time is precious, so any tool that can lighten the load is worth exploring. That is why Jesse Brandeburg and Kamel Ayari decided to test whether tools like ChatGPT could review patches to provide quick feedback to contributors about common problems. In a talk at the Netdev 0x18 conference this July, Brandeburg provided an overview of an experiment using machine learning to review emails containing patches sent to the netdev mailing list. Large language models (LLMs) will not be replacing human reviewers anytime soon, but they may be a useful addition to help humans focus on deeper reviews instead of simple rule violations.
I was unable to attend the Netdev conference in person, but had the opportunity to watch the video of the talk and refer to the slides. It should be noted that the idea of using machine-learning tools to help with kernel development is not entirely new. LWN covered a talk by Sasha Levin and Julia Lawall in 2018 about using machine learning to distinguish patches that fix bugs from other patches, so that the bug-fix patches could make it into stable kernels. We also covered the follow-up talk in 2019.
But, using LLMs to assist reviews seems to be a new approach. During the introduction to the talk, Brandeburg noted that Ayari was out of the country on sabbatical and unable to co-present. The work that Brandeburg discussed during the presentation was not yet publicly available, though he said that there were plans to upload a paper soon with more detail. He also mentioned later in the talk that the point was to discuss what's possible rather than the specific technical implementation.
Why AI?
Brandeburg said that the interest in using LLMs to help with reviews was not because it's a buzzword, but because it has the potential to do things that have been hard to do with regular programming. He also clarified that he did not want to replace people at all, but to help them because the people doing reviews are overwhelmed. "We see 2,500 messages a month on netdev, 10,000-plus messages a month on LKML", he said. Senior reviewers have to respond "for the seven thousandth time on the mailing list" to a contributor to fix their formatting. "It gets really tedious" and wastes reviewers' time to have to correct simple things.
There are tools to help reviewers already, of course, but they are more limited. Brandeburg mentioned checkpatch, which is a Perl script that checks for errors in Linux kernel patches. He said it is pretty good at what it does, but it is "horrible for adapting to different code and having any context". It may be able to spot a single-line error, but it is "not great at telling you 'this function is too long'".
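checkpatch itself is a Perl script, but the kind of line-oriented rule it applies can be sketched in a few lines of Python. This is an illustrative sketch, not checkpatch's actual logic; the 100-column limit mirrors the kernel's current preferred line length, and the function name is invented:

```python
# Sketch of a checkpatch-style, line-oriented rule: flag added lines
# that are too long. Like checkpatch, it sees one line at a time and
# has no notion of the surrounding code's context.

def check_line_lengths(patch_lines, limit=100):
    """Return warnings for added ('+') patch lines longer than limit."""
    warnings = []
    for lineno, line in enumerate(patch_lines, start=1):
        # Strip the leading '+' before measuring the code itself.
        if line.startswith("+") and len(line) - 1 > limit:
            warnings.append(f"line {lineno}: exceeds {limit} columns")
    return warnings

patch = ["+int x = 1;", "+" + "a" * 120]
print(check_line_lengths(patch))  # flags only the second line
```

A rule like this can spot a single over-long line, but, as Brandeburg noted, it has no way to judge something like "this function is too long", which requires understanding the patch as a whole.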
The experiment
For the experiment, Brandeburg said that he and Ayari used the ChatGPT-4o LLM and started giving it content to "make it into a reviewer that is an expert at making comments about simple things". He said that they created a "big rule set" using kernel documentation, plus his and other people's experience, to set the scope of what ChatGPT would review. "We don't really want AI to be just, you know, blowing smoke at everybody."
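The actual rule set and prompting setup have not been published, but a rule-scoped review prompt of the kind Brandeburg described might be assembled along these lines. All of the rule text, function names, and wording below are illustrative assumptions, not details from the talk:

```python
# Hypothetical sketch: build a review prompt that restricts the LLM to
# an explicit rule set, so it is not "blowing smoke at everybody".
# The rules here are invented examples, not Brandeburg's rule set.

RULES = [
    "The subject line must use the correct subsystem prefix, e.g. 'net: ...'.",
    "The commit message must be written in the imperative mood.",
    "The patch must not mix unrelated changes.",
]

def build_review_prompt(patch_text: str, rules: list[str]) -> str:
    """Combine the rule set and a patch into one prompt, telling the
    model to stay silent when it finds nothing to comment on."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You are a Linux kernel patch reviewer. Check ONLY these rules:\n"
        f"{rule_block}\n\n"
        "If the patch violates none of them, reply with nothing.\n\n"
        f"Patch:\n{patch_text}"
    )

prompt = build_review_prompt("Subject: foo\n\nfixes stuff", RULES)
```

Keeping the rules in plain text like this would also fit the plan, mentioned later in the talk, of publishing the rule set in a Git repository where others could send pull requests against it.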
Having a tool provide feedback on the simple things, he said, would allow him to use his "experience and knowledge and context and history, the human part that I bring to the equation". But, another benefit is that the tool could be consistent. Looking through the mailing list, "people get inconsistent responses even on the simple things". For example, patches may lack correct subject lines or have terrible commit messages but "someone commits it anyway".
Brandeburg said that they tried to build experiments that would see if AI reviews could work, and compare their results with real replies as people worked through reviews and posts on the netdev list. He displayed a few slides that compared LLM review to "legacy automation" as well as human reviews and walked through some examples of feedback given by each. The LLM reviews actually offer suggestions or help, he said, but reviewers often do not. "They say stuff like 'hey, will you fix it?' or 'hey, can you go read the documentation?'" But ChatGPT gives good feedback in human-readable language. In addition, LLMs are "super great at reading paragraphs and understanding what they're trying to say", which is something that tools like checkpatch cannot do.
Another thing that LLMs excel at is judging whether a commit message is written in the imperative mood. The patch submission guidelines ask for changes to be described "as if you are giving orders to the codebase to change its behaviour". It is, he said, really hard to write programs that can interpret text to judge this the way that an LLM can.
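A hand-written check for imperative mood shows why this is hard without an LLM. The naive heuristic below (an invented sketch, not any real tool's logic) assumes that a subject whose first word ends in a common non-imperative suffix is wrongly phrased, and it is easy to fool in both directions:

```python
# Naive heuristic: guess whether a commit subject is imperative by
# looking only at the suffix of its first word. This misfires easily,
# which illustrates why rule-based mood detection is so brittle.

NON_IMPERATIVE_SUFFIXES = ("ed", "ing", "es")

def looks_imperative(subject: str) -> bool:
    """Return True if the first word does not look like a past-tense,
    progressive, or third-person verb form."""
    words = subject.split()
    if not words:
        return False
    return not words[0].lower().endswith(NON_IMPERATIVE_SUFFIXES)

looks_imperative("Fix race in tx path")    # True
looks_imperative("Fixed race in tx path")  # False
looks_imperative("Fixes race in tx path")  # False
```

Suffix matching like this cannot tell an imperative verb from a noun that happens to share its shape, whereas an LLM can read the whole sentence and judge its mood directly.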
Brandeburg said that there was something else that LLMs could, in theory, do that would be "very, very hard" for him as a reviewer: go back and look at previous revisions of a patch series to see if previous comments had been addressed. It would take him "hours and hours" for each series to look at all of the comments he had made. Sometimes "little stuff sneaks through because the reviewer's tired, or you switch reviewers mid-series". An LLM could be much better at going back to review previous discussions about a patch to take into account for the latest patch series.
LLMs can do something else that "legacy" tools cannot: they can make things up, or "hallucinate" in the industry terminology. Brandeburg said that they saw the LLM make mistakes "occasionally", if a patch was "really tiny" or if the LLM did not have enough context. He mentioned one instance where a #define used a negative number that the LLM flagged as an error. It also did not make sense to him as a reviewer, so he posted to the netdev mailing list about it "and found out that the code was perfectly correct". He said that was great feedback for him and the AI because it helps to refine its rules based on new information.
Humans did provide better coverage of technical and specific issues, which is "exactly what we want them to be doing". People are great at providing context and history, things that are "almost impossible" for an LLM to do. The LLM is only reviewing the content of the patch, which leaves a lot of missing context. Replies from people tended to be "all over the place", though. One of the slides in the presentation (slide 11) compared "AI versus human" comments as a percentage of issues covered. It showed only 9.3% "overlap" between human reviewers and the AI commenting on the same issues.
Questions
A member of the audience asked if that meant that humans were "basically ignoring all the style issues". Brandeburg said, "yeah, that's what we found." Human reviewers "didn't want to talk about the stupid stuff". In fact, he cited instances of people on LKML telling other reviewers to "quit complaining about the stupid stuff". He said that he understood why someone who does a lot of reviews would say that, but that letting "trivial problems" slide meant that the long-term quality of the codebase would suffer.
Another audience member asked if the LLM ever said, "looks good to me" or simply did not have a reply for a patch. They observed that it is often hard for an LLM to say "I don't know" in response to a question. Brandeburg said that it was set up so that it could make comments if it had them, and not make comments if it didn't. He added that he was certainly not ready to have the AI add an "Acked-by" or "Signed-off-by" tag to patches.
Someone else in the audience said that this seemed like great work, but wondered what the plans were for getting human feedback if the AI has an incorrect response to a patch. Brandeburg said that he envisioned posting the rule set to a public Git repository and allowing pull requests to revise and complete the rules.
One attendee asked if Brandeburg and Ayari had compared the LLM tool's output to checkpatch, noting that some people may not comment on issues that checkpatch would pick up anyway. Brandeburg said that he did not imagine it replacing checkpatch. "I think this is an added tool that [...] adds more context and ability to do things that checkpatch can't". He acknowledged that comparing results might help answer the question of whether human reviewers simply ignored things that they knew checkpatch would catch.
As the session was running out of time, Brandeburg took a final question about whether this LLM would reply to spam messages. He said that it probably would if the mail made it through to the mailing list, but he joked that "hopefully, the spam doesn't have code content in it" and wouldn't be committed by a maintainer who wasn't paying close attention.
He closed the session by inviting people to read through the slides, which have answers to frequently asked questions like "will this replace us all as developers?" He added, "I don't think so because we need humans to be smart and do human things, and AI to do AI things".
Brandeburg did not go into great detail about plans to implement changes based on the experiment and its findings. However, the "Potential Future Work" slide in his presentation lists some ideas for what might happen next. Those include turning the LLM-review process into a Python library for reviewers, a GitHub-Actions-style system for providing patch and commit-message suggestions, and, if the community likes LLM-driven reviews, fully automated replies and inclusion in the bot tests.
Human reviewers are still going to be in high demand for decades to come, but LLM-driven tools might just make the work a little easier and more pleasant before too long.
| Index entries for this article | |
|---|---|
| Conference | Netdev/2024 |