
Leading items

Welcome to the LWN.net Weekly Edition for August 4, 2022

This edition contains the following feature content:

  • Oaxaca, Endless OS, and indigenous languages: bringing offline-first computers and indigenous-language content to rural Mexican communities.
  • Crosswords for GNOME: Jonathan Blandford's journey creating a crossword puzzle application for the GNOME desktop.
  • Security requirements for new kernel features: the io_uring uring_cmd mechanism and the LSM community.
  • Direct host system calls from KVM: a patch series aimed at speeding up gVisor.
This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Oaxaca, Endless OS, and indigenous languages

By Jake Edge
August 3, 2022

GUADEC

A rural Mexican state was the setting for an initiative to use the GNOME-based Endless OS to improve education in indigenous communities. Over the last several years, the Endless OS Foundation has teamed up with the Fundación Alfredo Harp Helú Oaxaca (FAHHO) to deliver offline-first computers to those communities, but also to assist these communities in preserving their native languages. In a talk at GUADEC 2022, Rob McQueen provided a look at the project and what it has accomplished.

McQueen was not slated to give the talk—he had already given an earlier presentation at the conference—but Sergio Solis, who is from Guadalajara where the conference was held, was unfortunately unable to attend because his family had come down with COVID. McQueen apologized for flying in from England to give a talk about Mexico when, before this trip, he had never been to the country. But, as the CEO of the Endless OS Foundation, McQueen is obviously knowledgeable about the project and was able to step in and pinch-hit for Solis.

Project landscape

FAHHO is a Mexican organization that Endless has been working with, both before and after it became a non-profit, on a multi-disciplinary project to bring educational resources and technology to communities in the state of Oaxaca. The terrain in Oaxaca, which is located in southern Mexico roughly 1000km southeast of the conference, is rugged and mountainous; just getting to the communities the project is working with is a rather arduous process. For Endless, the focus is on bringing additional educational opportunities to people who are not connected to the internet, while the project as a whole has a larger scope, including elements of cultural preservation.

[Rob McQueen]

Oaxaca has the fifth largest area of any state in Mexico and it has a higher linguistic diversity than any other state; there are six native languages and ten variants, he said. Since the educational system in the country is centralized, it effectively discriminates against this language diversity by providing all educational material in Spanish. But some of the indigenous languages in Oaxaca are spoken-only; they need to be used and handed down so that the cultural heritage, stories, histories, and so on continue to exist.

The last few years have been particularly challenging for education in many areas due to the pandemic, but educating children has been even more difficult in the communities where the project is active. Some of the communities blocked their roads to keep outsiders (and the virus) away, which sometimes kept the teachers out as well. Schools were closed and some remain that way even today.

While the government provides some educational materials via TV—which has both pros and cons, of course—many of these communities cannot receive TV signals. There is no broadband or cell service either, so when the schools are closed, access to additional educational resources is limited—or nonexistent. The pandemic has served to underline the need for other solutions for bringing educational materials to these isolated locales.

As McQueen described in his earlier talk, Endless uses storage as a way to work around a lack of connectivity. Pre-loading computers with educational content, and then making those computers available, means that they can be used by those kinds of communities. But there is another aspect to the project, which is to work with these people in their languages, so the project does not just deliver an encyclopedia and other content in Spanish and then vanish. Instead it has been working to provide interfaces and content in the native languages.

"Strangely enough, on the internet, there is not a lot of content in languages that are not written." In Oaxaca, there are also some languages that have only a few living speakers at this point. The "structural discrimination" against indigenous languages in the Mexican educational system tends to further contribute to their decline.

That has led these communities to start creating their own content in their languages. The StoryWeaver tool is being used in Oaxaca to capture local stories for children in a digital form. StoryWeaver also has books that can be translated into new languages, which results in an online book with pictures and text that can be shared with others. FAHHO has a project that is translating some of these books into the indigenous languages, which Endless was able to pick up for its multilingual books and resources offline application.

Support

The project was careful not to simply deliver computers and then walk away. Many of the students—and teachers—in these communities had never used a computer before. Rolling out the delivery of these computers actually required the most work from the project, not because of the logistical difficulties in getting them there, but due to the need for teaching the people about computers and how to use them. In addition, project members needed to work with the teachers on how to incorporate the computers and the content on them into their curriculum.

Solis has been a major part of that effort, traveling to different communities to meet with the inhabitants. He also would try to identify "ambassadors" or local experts within the community who could continue that support work. In 2020, the project accelerated the pace of computer deliveries in the face of the pandemic. Since the process of familiarizing people, including students, teachers, parents, and community elders, with the computers requires a lot of face-to-face time, the local experts were often the only ones who could provide it. In addition, computers were given directly to students in some communities, so that they could explore and use them at home.

Working with these groups of people who had little or no computer experience helped Endless refine its GNOME-based OS, McQueen said. The users' struggles would point to areas for improvement, as would direct feedback from some of the early adopters. Improvements based on that would then be rolled into newer versions of Endless OS.

The project has delivered computers to 45 Oaxacan communities throughout the state. Around 2000 Endless OS systems are in daily use in schools and homes. Since Endless OS operates without requiring the internet, it is a perfect fit for communities without connectivity, like those in Oaxaca.

Results

The impact of the computers affects more than just the students themselves, however. Their existence in the communities leads to collective learning opportunities for families and other groups. Meanwhile, Endless OS has been translated into languages that are not available on other computers, such as Chatino. The first computer class in the town of Cieneguilla was taught in Chatino using materials in the same language. So learning can take place "in the languages that these children speak at home with their families".

These computers are also being used to record and transcribe texts in these indigenous languages. Linguistic anthropologist and native Chatino speaker Dr. Emiliana Cruz has been working on preserving Chatino stories and texts using the computers. She has also been working to create written forms of some of the spoken-only languages in order to help preserve the culture and history of those communities.

FAHHO collaborated with Wikimedia in 2019 to add and update articles about these Oaxacan communities and languages in Wikipedia. The Endless OS "Encyclopedia" application is based on Wikipedia and it was updated to include this information. That means the students in Oaxaca can see themselves (and other Oaxacan communities) in this global resource.

The integration of indigenous languages and locally generated content has drawn interest from other places in the world. There are other initiatives by international digital activists for communities that have a lot of the same characteristics as those in Oaxaca. Working with a linguist to turn a spoken language into a written one and then to integrate that with education in the community is something that could be replicated elsewhere—in Mexico and the world.

The project is a demonstration of what computers can bring to areas that have generally been bypassed by the "computer revolution". Linux and free software allow the kinds of customization needed so that these people can get the most out of the experience. These are clearly small niches in the world, which the big players are not likely to see much value in catering to, but having the ability—freedom—to do so is part of what free software is all about. One hopes we will see more of this kind of thing, in more and more places, over the coming years and decades.

A YouTube video of the talk is available.

[I would like to thank LWN subscribers for supporting my trip to Guadalajara, Mexico for GUADEC.]

Comments (4 posted)

Crosswords for GNOME

By Jake Edge
August 2, 2022

GUADEC

Jonathan Blandford, who is a longtime GNOME contributor—and a cruciverbalist for longer still—thought it was time for GNOME to have a crossword puzzle application. So he set out to create one, which turned into something of a yak-shaving exercise, but also, ultimately, into Crosswords. Blandford came to GUADEC 2022 to give a talk describing his journey bringing this brain exerciser (and productivity bane) to the GNOME desktop.

[Jonathan Blandford]

Blandford got his start with GNOME back in 1997; he is the author of the Evince PDF reader and Aisleriot solitaire card game, for example. He has moved into management over the years, so has done less programming for GNOME, but is still involved in the community. For the purposes of the talk, he said, the important thing to know about him is that he is "a passionate cruciverbalist". He has been doing crossword puzzles since he was a child, with his parents and grandparents; now he does them with his family as well. Beyond that, he had to work the word "cruciverbalist" into his talk because "people who love crosswords love words, and they love words like that".

He started working on the Crosswords application in mid-July 2021—almost exactly a year before the talk. It is now around 24,000 lines of code, most of which is in C; "I have some regrets about that" but he was comfortable writing in C. The program has been translated into two languages beyond English—Dutch and Spanish—and the project has three additional contributors at this point.

There were several motivations that led him to create the application. He has always thought that puzzles would make for an interesting program to write. In addition, anyone searching GNOME Software or Flathub for a crossword puzzle program would not have found one; in fact, Linux, in general, lacks a good graphical crossword application. Writing a program for these puzzles turned out to be "a whole lot of fun" as well. But he wants its users to also experience fun; he had an overarching rule for the program: "it's a game, it has to be fun".

Crossword 101

There are a bunch of different kinds of crossword puzzles, starting with the first, which was created by Arthur Wynne in 1913; the Crosswords application supports a variant of that type of puzzle. Currently, some of the crossword types are supported by the game and he would like to add others; "the one thing they all have in common is that they are all boxes with letters in them".

The first of the two most popular types is the "standard" (or "American") style (seen at left from Wikipedia), which has a denser grid. It uses "straight" clues that are often constructed around a definition for the solution word(s), perhaps with some slight misdirection. Sometimes there are multiple possible solutions for a given word, so it may be necessary to solve some intersecting words to narrow the solution down. That type of puzzle is common in the US, as the name might imply.

The second is the less-dense "cryptic" crossword (seen at right from Wikipedia) that is more common in the UK, South Africa, and elsewhere, though it is becoming more popular in the US as well. Because the grid is more open, there are fewer opportunities to gain help from intersecting words; once the right answer is found, though, there is generally no question whether it is correct. The clues have two parts, a definition that normally appears first in the clue, followed by some word play that is used to build the answer.

He went through an example from "16 across" in the "London Times Sunday Cryptic No 4962" by Robert Price, which was: "Standard heroes possess grit alongside guts (5,3,7)". The numbers indicate the lengths of the words making up the answer—"stars and stripes"—which is a "standard", in the sense of a flag. Heroes are also "stars", who "possess grit" because "sand" is found inside "stars" (i.e. "starsands"). Meanwhile, "starsands" is alongside "tripes", which are intestines, thus guts. This word play acts as a sort of checksum to validate a possible answer, at least in his engineer's brain.

Crossword puzzles tend to be specific to a particular region or time period, which is part of why they are so closely associated with newspapers. It is difficult to create a puzzle that will cross cultural boundaries because the clues tend to refer to recent events, popular culture, and local vernacular. Writing a generic puzzle that avoids those "localisms" would be extremely difficult, he said.

Getting started

The first thing he needed to figure out was what file format to use. He thought about creating his own, but also did a survey of the existing options: .puz, .jpz, and .ipuz. The .puz format is proprietary but is by far the most popular; there are a lot of puzzles available in it. It has been reverse-engineered, but it has limited capabilities and is not extensible. .jpz is XML-based, so it is easily parsed, and there are a good number of puzzles available in the format, but it also has limited capabilities.

The .ipuz format is well-documented and explicitly free to use. But it has seen limited use in the wild, so there are not a lot of puzzles out there in .ipuz format. It supports a lot of functionality so it can handle lots of puzzle types beyond just crosswords, including Sudoku, sliding-block, trivia-quiz, and word-search puzzles. It also has lots of options for styling puzzles, which allows for custom shapes and colors for themed puzzles.

But .ipuz is JSON-based, and "it's very webby, which is one of the things I don't like about it". The format kind of assumes that the program will load the JSON structure and work with that directly; the raw structure is not exactly what he wanted for Crosswords, so he created libipuz to load and save the format. It does not mirror the JSON structure directly, though it is inspired by it; his intent is that others could eventually load other puzzle formats into the format used by libipuz.
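For the curious, a minimal .ipuz crossword looks something like the following. This is a simplified, hand-written illustration based on the published specification—not a file produced by Crosswords or libipuz—showing the core fields: the numbered grid (0 marks an unnumbered fillable cell), the solution, and the clue lists:

```json
{
  "version": "http://ipuz.org/v2",
  "kind": ["http://ipuz.org/crossword#1"],
  "dimensions": {"width": 3, "height": 3},
  "puzzle": [
    [1, 2, 3],
    [4, 0, 0],
    [5, 0, 0]
  ],
  "solution": [
    ["C", "A", "B"],
    ["O", "R", "E"],
    ["W", "E", "D"]
  ],
  "clues": {
    "Across": [[1, "Taxi"], [4, "Mined rock"], [5, "Marry"]],
    "Down": [[1, "Dairy animal"], [2, "Exist, for two"], [3, "Place to sleep"]]
  }
}
```

The format's "webbiness" is visible even here: everything is generic JSON arrays and objects, which is why libipuz turns it into a more structured in-memory representation.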

In about five months, he had something working to where he could display puzzles and their clues that were loaded from .ipuz files. The answers could be typed into the boxes and the results could be saved. He started out with a traditional callback structure, but Federico Mena Quintero introduced him to data-oriented programming with a unidirectional data flow, which was a better way to structure things.

A PlayState structure was used to hold the complete state of the game at any given point in time; no game state is stored in the widgets. PlayState is immutable, so changes to the game state result in a new PlayState. When an update is made, all of the widgets will update how they are displayed based on the state of the game at that point. Doing things that way made it much easier to test the application and to reason about how it worked.
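The pattern is easier to see in miniature. The sketch below (in Python for brevity; Crosswords itself is written in C, and the real PlayState's fields are not shown here) illustrates the unidirectional flow: every change produces a new immutable state, and the display is derived purely from whichever state is current:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)          # frozen: the state cannot be mutated in place
class PlayState:
    grid: tuple                  # tuple of tuples of letters ('' for empty)
    cursor: tuple                # (row, column) of the selected cell

def with_letter(state: PlayState, letter: str) -> PlayState:
    """Return a NEW PlayState with `letter` entered at the cursor."""
    r, c = state.cursor
    row = list(state.grid[r])
    row[c] = letter
    grid = state.grid[:r] + (tuple(row),) + state.grid[r + 1:]
    return replace(state, grid=grid)

def render(state: PlayState) -> list:
    """Widgets derive their display purely from the current state."""
    return ["".join(ch or "." for ch in row) for row in state.grid]
```

Because no widget holds state of its own, a test can drive the whole game by constructing states and checking what render() would show.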

At that point, the game would just load a puzzle from a tiny library, allow the user to play it, and then save it back to disk; he asked, is it fun yet? The answer is "not really". The lack of puzzles is a real problem, so he set out to write some.

More tools

It turns out that creating crosswords is surprisingly difficult. Creating a good board for standard crosswords is hard—there are lengthy essays about it from the pre-computer days—and while cryptic boards are easier to create, their clues are not. It takes a lot of practice to create cryptic clues; in doing so he realized that they tend to be highly regionalized.

Beyond that, Emacs is not the best tool for writing crossword puzzles, either; there is no crossword mode, though perhaps there should be. In any case, he was able to create a few puzzles, the first of which was in a 5x5 grid with ten clues; he manually created the JSON in Emacs and was able to load it into the Crosswords application.

He realized that his technique was not going to scale, so he set the game aside to work on a crossword editor. He looked at other crossword-editing tools and found that there were no good free options, though there are a few proprietary options, each with its own quirks. The basic workflow of his editor, which is accessed as an option in the Crosswords application, is to choose a grid type and size, then to fill in the solution on that grid.

He quickly got to the point where users could fill in the grid, but discovered that actually doing so is "surprisingly hard". He spent a lot of time looking up words as he was trying to fill in a grid, trying to find words that would fit into the space provided. He discovered that he needed a crossword solver to help crossword writers fill in their grids.

He wanted to be able to show a list of words that would fit based on the spaces available and the letters already filled in. He used Peter Broda's crossword word list and built a mechanism to filter the list based on patterns. So "?NO?E" would produce a list of words like "SNORE" and "SNOKE"; there are about 12 other words that fit the pattern from the list, he said, including, of course, "GNOME".

He has an implementation of the filtering that is "pretty fast"; most queries take around a microsecond. He created a table that can be loaded into memory with mmap(), which contains the words themselves and a set of indexes for "fragments", lists of words matching a single letter in a particular spot. So, "?NO?E" is broken down into three fragments, "?N???", "??O??", and "????E", each of which is looked up in the fragment table and any word index that appears in all three fragment lists is a solution. He hopes to use that data structure for finding anagrams eventually as well.
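A toy version of that scheme can be sketched in a few lines of Python; the real implementation is a C structure designed to be mapped into memory with mmap(), and the word list here is illustrative only:

```python
from collections import defaultdict

def build_index(words):
    """Map (position, letter, length) fragments to sets of word indexes."""
    index = defaultdict(set)
    for i, word in enumerate(words):
        for pos, letter in enumerate(word):
            index[(pos, letter, len(word))].add(i)
    return index

def lookup(pattern, words, index):
    """Return the words matching a pattern like '?NO?E'."""
    candidates = None
    for pos, letter in enumerate(pattern):
        if letter == "?":
            continue                      # wildcards constrain nothing
        frag = index.get((pos, letter, len(pattern)), set())
        # a match must appear in the fragment list for every fixed letter
        candidates = frag if candidates is None else candidates & frag
    if candidates is None:                # pattern was all wildcards
        candidates = {i for i, w in enumerate(words) if len(w) == len(pattern)}
    return sorted(words[i] for i in candidates)
```

The intersection of small pre-computed sets is what makes the real version fast: no per-query scan of the full word list is needed.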

Getting suggestions in the editor was helpful, but it was still a laborious process to create the puzzles. Sometimes he would choose a word that eventually led to a corner that he could not complete, but he might end up wasting an hour before finding that out. So he took the next step and used the word list to create a solver that can suggest words that all fit together into part of a puzzle.

The solver is "a little buggy but it basically works"; a user can select an area of the board and the solver will try to find words that can fill in the region. It does up to about 30,000 boards per second, depending on the size of the region and the lengths of the missing words, though there is lots of room for improvement in the performance, he said. It can also help find areas that are not going to be easily filled, or can only be filled with words the creator does not want to use, so that the puzzle can be reworked to avoid that.
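The core of such a solver is a backtracking search: try a candidate word in one slot, check it against the letters already fixed at the crossings, and recurse into the next slot. A toy version follows (illustrative Python, with nothing like the real code's performance; slots are simply lists of cell coordinates):

```python
def fill(slots, words, grid=None, used=None, i=0):
    """Assign a distinct word to every slot, consistently at crossings.

    slots: list of slots, each a list of (row, col) cells
    Returns a dict mapping cell -> letter, or None if no fill exists.
    """
    grid = dict(grid or {})
    used = set(used or ())
    if i == len(slots):
        return grid                      # every slot filled: success
    slot = slots[i]
    for word in words:
        if word in used or len(word) != len(slot):
            continue
        # the word must agree with letters already placed at crossings
        if any(grid.get(cell, ch) != ch for cell, ch in zip(slot, word)):
            continue
        trial = dict(grid)
        trial.update(zip(slot, word))
        result = fill(slots, words, trial, used | {word}, i + 1)
        if result is not None:
            return result                # this branch worked
    return None                          # dead end: backtrack
```

A dead-end return from a region is exactly the "this corner cannot be filled" signal that lets the author rework the puzzle early instead of discovering the problem an hour later.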

There is a lot of trial and error in creating a puzzle, so he added a way to undo and redo changes. The feature was also incorporated into the Crosswords game, but it came about because he needed it in the editor. For example, changing a letter box to be a block redoes the numbering throughout the puzzle, which means that the clues get changed (or dropped), and so on. Since the blocks in puzzles are traditionally symmetrical (either bi-lateral or rotational symmetry), adding a block might automatically add it elsewhere in the puzzle, further rippling the destructive effects.

[Cats and Dogs]

Since a minor change (or even a misclick) can cause massive changes, undo and redo were particularly important—and, as it turned out, not that hard to implement. Since the PlayState holds everything about the puzzle itself, it just needs to be augmented with the current values of the widgets (e.g. letters for words that have been added). He then has a list of these states that the program can move through in both directions; at each point it instantiates the state of the puzzle at that point in time.
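Concretely, the mechanism amounts to keeping a list of snapshots with a cursor into it; a sketch of the idea (not the application's actual C implementation—any immutable state object can stand in for the snapshots):

```python
class History:
    def __init__(self, initial):
        self.states = [initial]   # list of immutable snapshots
        self.pos = 0              # cursor into that list

    def push(self, new_state):
        # A new change discards any states that had been undone
        del self.states[self.pos + 1:]
        self.states.append(new_state)
        self.pos += 1

    def undo(self):
        self.pos = max(0, self.pos - 1)
        return self.states[self.pos]

    def redo(self):
        self.pos = min(len(self.states) - 1, self.pos + 1)
        return self.states[self.pos]
```

Because each snapshot is complete and immutable, undoing a block placement that renumbered the whole grid is no harder than undoing a single letter: the program just re-instantiates the earlier state.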

Cats and dogs

He was able to use the editor to create his first "Cats and Dogs" puzzle set, which is a collection of nine puzzles, some in non-traditional shapes, that have a common theme (overview seen at right). He encouraged people to go to Flathub and install Crosswords to decide "if it is fun or not". He has some other themes that he plans to do puzzle sets around in the future.

In order for users to access puzzle sets, however, there needs to be a way to package and distribute them. He added the concept of puzzle sets to the game, but he wanted the sets to be standalone files that could be distributed on their own. The puzzle sets are stored in a GResource file and can be distributed as Flatpak add-ons. There are also mechanisms to download puzzles from some web sites, and he would like to add more sites as time allows.

He took a visual tour of the Crosswords application using his slides, rather than give the dreaded live demo. The YouTube video of the talk and his introductory blog post will help with visualizing the interface. Blandford did reiterate his suggestion that people try it out from Flathub as well. The application has ways to choose puzzles or puzzle sets, as might be expected. Puzzle sets can have some puzzles locked depending on which other puzzles have been successfully solved.

[Crosswords UI]

The heart of the application is the puzzle-solving interface (seen at left). It has the ability to tell you when you have made a mistake (the red letters) and it can make suggestions using the word list that he added for the editor. "Is it fun yet?" He said that it was starting to get there, but there is a need for a lot more puzzles. Overall, "I am pretty pleased with how it looks today", he said.

Blandford does, of course, have plans for the future, starting with his work on a second puzzle set. He hopes to see the development team expand and for more features to be added. Support for barred crosswords has just started; acrostics support would be a great addition, as would support for printing puzzles. There is a lot of work to be done for internationalization, including support for multi-byte character sets in the word list. Supporting puzzles for other languages is "way more than just translations" as there are various language-specific quirks that need to be handled.

He would also like to grow a puzzle community. He hopes to see a community of puzzle writers using the editor to create free-to-distribute puzzles for others to enjoy. There are features that he wants to see added to the editor, as well, in order to provide more help on the clue-writing side. Perhaps a crosswords.gnome.org site could come about with competitions and more. He encouraged those interested in any aspect of the project to get involved.

[I would like to thank LWN subscribers for supporting my trip to Guadalajara, Mexico for GUADEC.]

Comments (19 posted)

Security requirements for new kernel features

By Jonathan Corbet
July 28, 2022
The relatively new io_uring subsystem has changed the way asynchronous I/O is done on Linux systems and improved performance significantly. It has also, however, begun to run up a record of disagreements with the kernel's security community. A recent discussion about security hooks for the new uring_cmd mechanism shows how easily requirements can be overlooked in a complex system with no overall supervision.

Most of the operations that can be performed within io_uring follow the usual I/O patterns — open a file, read data, write data, and so on. These operations are the same regardless of the underlying device or filesystem that is doing the work. There always seems to be a need for something special and device-specific, though, and io_uring is no exception. For the kernel as a whole, device-specific operations are made available via ioctl() calls. That system call, however, has built up a reputation as a dumping ground for poorly thought-out features, and there is little desire to see its usage spread.

In early 2021, io_uring maintainer Jens Axboe floated an idea for a command passthrough mechanism that would be specific to io_uring. A year and some later, that idea has evolved into uring_cmd, which was pulled into the mainline during the 5.19 merge window. There is a new io_uring operation that, in turn, causes an invocation of the underlying device or filesystem's uring_cmd() file_operations function. The actual operation to be performed is passed through to that function with no interpretation in the io_uring layer. The first user is the NVMe driver, which provides a direct passthrough operation.

Missing security hooks

Just over one year ago, there was a bit of a disagreement after the developers of the kernel's Linux Security Module (LSM) and auditing subsystems figured out that there were no security or auditing hooks in all of that new io_uring code. That put io_uring operations outside the control of any security module that a given system might be running and made those operations invisible to auditing. Those gaps were filled in, but not before the security developers expressed their unhappiness about how io_uring had been designed and merged without thought for LSM and audit support.

Given that, one might expect that the addition of a new feature like uring_cmd would have seen more involvement from the security community. To an extent, that happened; Luis Chamberlain posted a patch adding LSM support back in March. In short, it added a new security_uring_async_cmd() hook that would be called before passing a command through to the underlying code; it could examine that command and decide whether to allow or deny the operation. There were some disagreements over how well this would work; in particular, Casey Schaufler complained that security modules would have to gain an understanding of every device-specific command, which clearly would not scale well. The conversation wound down shortly thereafter.

When the new feature was pushed into the mainline, there was no LSM support included with it. On July 13, Chamberlain reposted his patch adding the new security hook. Schaufler was equally unimpressed this time around:

You're passing the complexity of uring-cmd directly into each and every security module. SELinux, AppArmor, Smack, BPF and every other LSM now needs to know the gory details of everything that might be in any arbitrary subsystem so that it can make a wild guess about what to do. And I thought ioctl was hard to deal with.

SELinux and audit maintainer Paul Moore agreed with that assessment. The end result, he said, was that security modules would be unable to distinguish between low-level operations, so they would end up simply enabling all io_uring passthrough commands for any given subsystem or none of them; "I think we can all agree that is not a good idea". He later acknowledged that there does not appear to be a better solution at hand and merging Chamberlain's patch looked like the only path forward: "Without any cooperation from the io_uring developers, that is likely what we will have to do". The current plan appears to be to get Chamberlain's patch into the mainline during the next merge window, with backports to the stable kernels to be done thereafter.

Grumpiness

This particular problem appears to be solved, albeit in a way that is less than satisfying to the security community. A better solution may materialize in the future, though providing a way to control access to device-specific functionality in a general way is a hard problem. But a harder problem may be addressing the residual grumpiness in the security community and preventing such problems from recurring in the future. As Moore put it:

I feel that expressing frustration about the LSMs being routinely left out of the discussion when new functionality is added to the kernel is a reasonable response; especially when one considers the history of this particular situation.

For his part, Axboe acknowledged that the security concerns should not have been allowed to fall through the cracks, but he didn't necessarily offer a lot of hope for changes in the future:

I guess it's just somewhat lack of interest, since most of us don't have to deal with anything that uses LSM. And then it mostly just gets in the way and adds overhead, both from a runtime and maintainability point of view, which further reduces the motivation.

Even when the motivation is there, mistakes can happen. Kernel development is a complex business. A lot of effort has gone into making the kernel sufficiently modular that developers need not worry about what is happening in the rest of the system, but there are limits to how far that process can go.

For example, developers must be aware of locking and the locking requirements of subsystems they call into or things may go badly wrong. Memory must be handled according to the constraints placed on the memory-management subsystem, and developers creating complex caches may have to implement shrinkers to release memory on demand. CPU hotplug affects many subsystems and must be taken into account. The same is true of power-management events. Changes to the user-space API can create unhappiness years later. Inattention to latency constraints may create trouble in realtime applications. A failure to properly document a subsystem will make life harder for developers and users — but they are all used to that by now.

And, of course, a failure to provide proper security hooks will hobble the ability of administrators to control process behavior by way of LSM policies.

The fact that developers do not always succeed in keeping all of these constraints in mind — and consequently make mistakes — is unsurprising. Catching such omissions is one of the reasons for the existence of the kernel's sometimes tiresome review process. But nothing ensures that a given change will be properly reviewed by, for example, a developer who understands the needs of Linux security modules, and there is little that forces the suggestions from any such review to be heeded.

So important things will occasionally fall through the cracks, and it is not clear that much can be done to improve the situation. It would be wonderful if more companies would pay developers to spend more time reviewing patches to provide, as an example, an overall security-oriented eye on code heading into the mainline, but that does not appear to be the world that we are living in. Attempts to impose requirements with a more bureaucratic process would mostly create friction and lead to the distribution of more out-of-tree (and severely unreviewed) code.

The best path toward improvement may be, as Axboe put it, "one subsystem being aware of another one's needs". Working toward that goal — and the ability to fix mistakes in the stable kernels when they do happen — seems to work reasonably well most of the time.

Comments (28 posted)

Direct host system calls from KVM

By Jonathan Corbet
July 29, 2022
As a general rule, virtualization mechanisms are designed to provide strong isolation between a host and the guest systems that it runs. The guests are not trusted, and their ability to access or influence anything outside of their virtual machines must be tightly controlled. So a patch series allowing guests to execute arbitrary system calls in the host context might be expected to be the cause of significantly elevated eyebrows across the net. Andrei Vagin has posted such a series with the expected results.

The use case for Vagin's work is gVisor, a container-management platform with a focus on security. Like a full virtualization system, gVisor runs containers within a virtual machine (using KVM), but the purpose is not to fully isolate those containers from the system. Instead, KVM is used to provide address-space isolation for processes within containers, but the resulting virtual machines do not run a normal operating-system kernel. Instead, they run a special gVisor kernel that handles system calls made by the contained processes, making security decisions as it goes.

That kernel works in an interesting way; it maps itself into each virtual machine's address space to match its layout on the host, then switches between the two as needed. The function to go to the virtual-machine side is called, perhaps inevitably, bluepill(). The execution environment is essentially the same on either side, with the same memory layout, but the guest side is constrained by the boundaries placed on the virtual machine.

Many of the application's system calls can be executed by gVisor within the virtual machine, but some of them must be handled in the less-constrained context of the host. It certainly works for gVisor to simply perform a virtual-machine exit to have the controlling process on the host side execute the call, then return the result back into the virtual machine, but exits are slow. Performing a lot of exits can badly hurt the performance of the workload overall; since part of the purpose of a system like gVisor is to provide better performance than pure virtualization, that is seen as undesirable.

The proposed solution is to provide a new hypercall (KVM_HC_HOST_SYSCALL) that the guest kernel can use to run a system call directly on the host. It takes two parameters: the system-call number and a pt_regs structure containing the parameters for that system call. After executing the call in the host context (without actually exiting from the virtual machine), this hypercall will return the result back to the caller. This interface only works if the guest knows enough about the host's memory layout to provide sensible system-call parameters; in the gVisor case, where the memory layout is the same on both sides, no special attention is required.
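The calling convention here mirrors the ordinary x86-64 system-call ABI, where the call is identified by number and its arguments travel in a fixed set of registers. As a purely illustrative, user-space sketch (this is not the kernel's pt_regs layout, which lives in arch/x86/include/asm/ptrace.h, and not the hypercall itself):

```python
import ctypes

# A cut-down, user-space mirror of the registers that carry system-call
# arguments on x86-64.  The real struct pt_regs has many more fields and a
# different order; this only illustrates the "number plus register-file"
# shape of the KVM_HC_HOST_SYSCALL interface described above.
class PtRegs(ctypes.Structure):
    _fields_ = [
        ("rdi", ctypes.c_ulonglong),  # argument 1
        ("rsi", ctypes.c_ulonglong),  # argument 2
        ("rdx", ctypes.c_ulonglong),  # argument 3
        ("r10", ctypes.c_ulonglong),  # argument 4
        ("r8",  ctypes.c_ulonglong),  # argument 5
        ("r9",  ctypes.c_ulonglong),  # argument 6
    ]

def pack_syscall(nr, *args):
    """Pack up to six arguments the way the x86-64 syscall ABI expects."""
    regs = PtRegs(*args)
    return nr, regs

# write(2) is system call 1 on x86-64: write(fd, buf, count).
nr, regs = pack_syscall(1, 2, 0xdeadbeef, 11)
```

Any pointer placed in those registers (like the buffer address above) is only meaningful if guest and host agree on the address space, which is exactly the gVisor-specific assumption noted above.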

Internally, this functionality works by way of a new helper called do_ksyscall_64(), which can invoke any system call from within the kernel. Given that invoking system calls in this way is generally frowned upon, this functionality seems sure to be a lightning rod for criticism and, indeed, Thomas Gleixner duly complained: "this exposes a magic kernel syscall interface to random driver writers. Seriously no". While he acknowledged that the series overall is "a clever idea", he made it clear that exposing system calls in this way was not going to fly.

Meanwhile, the ability to invoke host-side system calls directly from a KVM guest pokes a major hole in the isolation between virtual machines and the host. Indeed, the cover letter describes it as "a backdoor for regular virtual machines". Thus, as one would expect, the direct system-call feature is disabled by default; processes that want to use it must enable it explicitly when creating a virtual machine. Most hypervisors, it is to be expected, will not do that.

The kernels running deep within companies like Google often contain significant changes that are not found in the upstream code; this patch set gives a hint of what one of those changes looks like:

In the Google kernel, we have a kvm-like subsystem designed especially for gVisor. This change is the first step of integrating it into the KVM code base and making it available to all Linux users.

That led Sean Christopherson to ask about what the following steps would be. "It's practically impossible to review this series without first understanding the bigger picture". Merging this first step could be a mistake if the following steps turn out not to be acceptable; at that point, the kernel community could find itself supporting a partial feature that is not actually being used. As it turns out, Vagin said, this is the only feature that is needed. gVisor works on top of KVM now, he said; the current patch series just improves its performance.

Christopherson also asked about alternatives, noting that "making arbitrary syscalls from within KVM is mildly terrifying". Vagin provided a few, starting with the current scheme where a virtual-machine exit is used to (slowly) handle each system call. Another approach is to run all of gVisor on the host side, exiting from the virtual machine for every system call. Executing a system call in this mode takes about 2.1µs; the direct system-call mechanism reduces that to about 1.0µs. Or gVisor could use BPF to handle the system calls; that provides similar performance, Vagin said, but would require some questionable changes, like providing BPF programs with the ability to invoke arbitrary system calls. Yet another possibility is to use the once-proposed process_vm_exec() system call, but that can perform poorly in some situations.

KVM maintainer Paolo Bonzini said that his largest objection is the lack of address translation between the guest and the host. In its current form, this mechanism depends on the memory layout being the same on both sides; otherwise any addresses in an argument to a system call would not make sense on the host side. As a result, the new mechanism is highly specialized for gVisor and seems unlikely to be more widely useful. It is not clear that everybody sees that specialization as a disadvantage, though.

All told, gVisor in this mode represents an interesting shift in the security boundary between a host and the containers it runs. Much of the security depends on code that is within the virtual machine, with the host side trusting that code at a fairly deep level. It is a different view of how virtualization with KVM is meant to work, but it seems that the result works well — within Google at least. Whether this mechanism will make it into the mainline remains an open question, though. Making holes in the wall between host and guest is not something to be done lightly, so the developers involved are likely to want to be sure that no better alternatives exist.

Comments (4 posted)

Some 5.19 development statistics

By Jonathan Corbet
August 1, 2022
The 5.19 kernel was released, after a one-week delay to deal with the fallout from the Retbleed mitigations, on July 31. By that time, 16,399 commits (15,134 non-merge and 1,265 merges) had found their way into the mainline repository, making this development cycle the busiest since 5.13 (16,030 non-merge changesets and 1,157 merges). Tradition dictates that now is the time for a look at where the changes in 5.19 came from, and we've learned not to go against tradition.
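Counts like these can be reproduced from a kernel checkout; a merge is simply a commit with more than one parent. A minimal sketch of the classification step (the git invocation, shown commented out, assumes a kernel repository with the release tags present):

```python
# Each line of "git rev-list --parents v5.18..v5.19" holds a commit hash
# followed by the hashes of its parents; more than one parent means a merge.
def classify(rev_list_lines):
    merges = nonmerges = 0
    for line in rev_list_lines:
        hashes = line.split()
        if len(hashes) > 2:       # commit plus two or more parents
            merges += 1
        elif hashes:
            nonmerges += 1
    return nonmerges, merges

# In a real kernel tree one would feed it:
#   import subprocess
#   lines = subprocess.run(["git", "rev-list", "--parents", "v5.18..v5.19"],
#                          capture_output=True, text=True).stdout.splitlines()

sample = [
    "aaa bbb",        # ordinary commit: one parent
    "ccc ddd eee",    # merge commit: two parents
    "fff ggg",
]
```

Running `classify()` on the full v5.18..v5.19 range is how one arrives at figures of the 15,134-plus-1,265 sort quoted above.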

Individual contributors

Work on 5.19 was contributed by 2,086 developers; that is a new record, beating the 2,062 who contributed to 5.13. Of those developers, 278 made their first kernel contribution during this development cycle. The removal of a number of old drivers and an unloved architecture took 301,000 lines of code out of the kernel repository, but that effort was overwhelmed by the 1,105,000 lines of code that were added, for a net growth of 804,000 lines of code.

The top contributors to 5.19 were:

Most active 5.19 developers
By changesets
Krzysztof Kozlowski        211   1.4%
Christoph Hellwig          193   1.3%
Ville Syrjälä              175   1.2%
Matthew Wilcox             151   1.0%
Jakub Kicinski             130   0.9%
Geert Uytterhoeven         123   0.8%
Mark Brown                 118   0.8%
Masahiro Yamada            105   0.7%
Arnd Bergmann              104   0.7%
Martin Kaiser              102   0.7%
Kuniyuki Iwashima          101   0.7%
Christophe Leroy            96   0.6%
Minghao Chi                 96   0.6%
Biju Das                    94   0.6%
Andy Shevchenko             90   0.6%
Marek Vasut                 89   0.6%
Miaohe Lin                  87   0.6%
Dmitry Baryshkov            87   0.6%
Ping-Ke Shih                81   0.5%
Pavel Begunkov              79   0.5%
Jason A. Donenfeld          79   0.5%
Jack Xiao                   79   0.5%
By changed lines
Hawking Zhang          222,682  18.1%
Huang Rui              185,566  15.1%
Martin Habets           44,361   3.6%
Jakub Kicinski          34,636   2.8%
Ping-Ke Shih            29,871   2.4%
Huacai Chen             21,159   1.7%
Bjorn Andersson         15,738   1.3%
Christoph Hellwig       14,024   1.1%
Leo Liu                 11,632   0.9%
Haijun Liu              11,006   0.9%
Fabio M. De Francesco    9,561   0.8%
Ian Rogers               8,691   0.7%
Imre Deak                7,937   0.6%
Zhengjun Xing            7,508   0.6%
Arnd Bergmann            7,424   0.6%
Leon Romanovsky          6,573   0.5%
Mark Brown               6,502   0.5%
Cezary Rojewski          6,492   0.5%
Peter Ujfalusi           6,463   0.5%
Veerasenareddy Burru     5,652   0.5%
Manivannan Sadhasivam    5,614   0.5%
Jack Xiao                5,215   0.4%

The top contributor of changesets in 5.19 was Krzysztof Kozlowski, who focused mostly on devicetree fixes. Christoph Hellwig continues to rework code all over the kernel, and found the time to remove the h8300 architecture as well. Ville Syrjälä contributed a large number of changes to the Intel i915 graphics driver, Matthew Wilcox continues the folio work, and Jakub Kicinski worked extensively in the networking subsystem.

In the lines-changed column, as has become traditional, Hawking Zhang and Huang Rui outdid everybody else with the addition of hundreds of thousands of lines of machine-generated amdgpu header files. Martin Habets added the "siena" network driver, Kicinski removed a number of old network drivers while taking a break from his other work, and Ping-Ke Shih added support for Realtek 8852ce network adapters.

The lists of top testers and reviewers will look familiar to those who have been following these articles:

Test and review credits in 5.19
Tested-by
Daniel Wheeler              94   8.4%
Bean Huo                    29   2.6%
Nathan Chancellor           29   2.6%
Geert Uytterhoeven          27   2.4%
Heiko Stuebner              26   2.3%
Nícolas F. R. A. Prado      23   2.1%
Michael Riesch              21   1.9%
Marek Szyprowski            20   1.8%
Arnaldo Carvalho de Melo    19   1.7%
Gurucharan                  18   1.6%
Sedat Dilek                 18   1.6%
Giuseppe Scrivano           18   1.6%
Reviewed-by
Christoph Hellwig            246   2.9%
Hawking Zhang                220   2.6%
Rob Herring                  164   2.0%
AngeloGioacchino Del Regno   149   1.8%
Krzysztof Kozlowski          144   1.7%
David Sterba                 123   1.5%
Darrick J. Wong              103   1.2%
Bard Liao                    102   1.2%
Andy Shevchenko              102   1.2%
Stephen Boyd                 101   1.2%
Jani Nikula                   98   1.2%
Ranjani Sridharan             88   1.1%

Many of the test credits continue to accrue to people who are seemingly working as part of their employer's internal quality-assurance process, though there appear to be fewer of those than in previous cycles. On the review side, this was a 70-day development cycle; both Christoph Hellwig and Hawking Zhang thus reviewed at least three patches for each of those days. Hellwig's reviews are widespread, while Zhang's are focused on amdgpu patches by AMD developers. It is good to see that there are developers who are evidently reviewing patches as part of their job.

A look at the report credits — along with who is including the Reported-by: tags in their fixes — also shows the evolution of an ongoing story:

Report credits in 5.19
Reporter
kernel test robot      207  17.0%
Zeal Robot             134  11.0%
Abaci Robot             53   4.4%
Syzbot                  49   4.0%
Dan Carpenter           44   3.6%
Hulk Robot              37   3.0%
Stephen Rothwell        27   2.2%
Rob Herring             19   1.6%
Guenter Roeck           12   1.0%
Geert Uytterhoeven      11   0.9%
Marek Szyprowski        11   0.9%
Nathan Chancellor        8   0.7%
Sudip Mukherjee          8   0.7%
Credited by
Minghao Chi            93   7.6%
Jiapeng Chong          31   2.5%
Lv Ruyi                24   2.0%
Yang Li                22   1.8%
Krzysztof Kozlowski    20   1.6%
Eric Dumazet           19   1.6%
Yang Yingliang         16   1.3%
Paul E. McKenney       14   1.1%
Masahiro Yamada        14   1.1%
Hans de Goede          14   1.1%
Linus Torvalds         13   1.1%
Randy Dunlap           12   1.0%
Mario Limonciello      12   1.0%

We are evidently in the midst of the robot wars and most of us never even noticed; a full 40% of the report credits are going to robots at this point. If one looks at which developers are adding Reported-by tags to their patches, the picture becomes clearer: the top four developers crediting reports work for the companies that run the Zeal and Abaci robots (ZTE and Alibaba, respectively). It is reasonably clear that these developers are developing and running their own robots to find bugs, then crediting those robots with the reports.
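The 40% figure can be sanity-checked from the report table: the kernel test robot's 207 reports are stated to be 17.0% of the total, which pins down the denominator. A quick back-of-the-envelope check (the choice of which reporters count as "robots" is my reading of the table, not an official classification):

```python
# Report counts from the table above; reporters that are automated systems.
robot_reports = {
    "kernel test robot": 207,
    "Zeal Robot": 134,
    "Abaci Robot": 53,
    "Syzbot": 49,
    "Hulk Robot": 37,
}

# 207 reports correspond to 17.0% of all report credits, so the total is:
total_reports = 207 / 0.170            # roughly 1,218 credited reports

robot_share = sum(robot_reports.values()) / total_reports
print(f"{robot_share:.0%}")            # prints "39%", consistent with the
                                       # roughly-40% figure in the text
```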

Companies

The employer numbers are relatively steady and boring. A total of 245 employers supported work on 5.19, with the most active being:

Most active 5.19 employers
By changesets
Intel                  1,645  10.9%
(Unknown)              1,135   7.5%
Linaro                   862   5.7%
AMD                      837   5.5%
Red Hat                  792   5.2%
(None)                   653   4.3%
Google                   624   4.1%
Meta                     528   3.5%
SUSE                     462   3.1%
Huawei Technologies      446   2.9%
NVIDIA                   421   2.8%
Oracle                   414   2.7%
(Consultant)             385   2.5%
Renesas Electronics      348   2.3%
Arm                      281   1.9%
MediaTek                 235   1.6%
Qualcomm                 232   1.5%
IBM                      230   1.5%
Pengutronix              208   1.4%
NXP Semiconductors       195   1.3%
By lines changed
AMD                  465,548  37.9%
Intel                 80,061   6.5%
Linaro                59,759   4.9%
Meta                  53,080   4.3%
Xilinx                45,774   3.7%
(Unknown)             37,529   3.1%
Realtek               36,049   2.9%
Google                30,767   2.5%
NVIDIA                30,524   2.5%
MediaTek              29,215   2.4%
Red Hat               27,048   2.2%
Loongson              23,819   1.9%
(None)                22,890   1.9%
(Consultant)          22,322   1.8%
SUSE                  16,983   1.4%
Qualcomm              14,455   1.2%
Oracle                13,815   1.1%
Arm                   12,806   1.0%
IBM                   12,339   1.0%
Renesas Electronics   10,812   0.9%

Perhaps noteworthy here is the slow but steady decline of Red Hat, which was the top employer for many years. The picture looks a little different if one considers non-author signoffs, though:

Non-author signoffs in 5.19
Individual
Greg Kroah-Hartman          932   6.5%
David S. Miller             785   5.5%
Alex Deucher                704   4.9%
Mark Brown                  656   4.6%
Andrew Morton               525   3.7%
Jakub Kicinski              422   2.9%
Jens Axboe                  296   2.1%
Mauro Carvalho Chehab       282   2.0%
Bjorn Andersson             273   1.9%
Kalle Valo                  272   1.9%
Borislav Petkov             230   1.6%
Martin K. Petersen          225   1.6%
Michael Ellerman            207   1.4%
Arnaldo Carvalho de Melo    200   1.4%
Shawn Guo                   195   1.4%
David Sterba                176   1.2%
Rafael J. Wysocki           166   1.2%
Geert Uytterhoeven          152   1.1%
Vinod Koul                  148   1.0%
Catalin Marinas             145   1.0%
By employer
Linaro                 1,959  13.6%
Red Hat                1,854  12.9%
Intel                  1,445  10.1%
Meta                   1,056   7.4%
Linux Foundation       1,037   7.2%
Google                   930   6.5%
AMD                      786   5.5%
SUSE                     748   5.2%
Qualcomm                 416   2.9%
NVIDIA                   352   2.5%
Arm                      339   2.4%
IBM                      313   2.2%
(Consultant)             307   2.1%
(None)                   304   2.1%
Oracle                   260   1.8%
Huawei Technologies      202   1.4%
(Unknown)                160   1.1%
Renesas Electronics      156   1.1%
Cisco                    140   1.0%
Broadcom                 112   0.8%

A developer who signs off on a patch that they did not write is (usually) the maintainer who accepts the patch and sends it upstream. The above tables, thus, offer an approximate picture of who our most active maintainers are. About half of the patches merged into the mainline kernel are going through the hands of maintainers working for just five companies. On one hand, that shows a potentially concerning concentration of power in a relatively small number of employers. On the other, this is the list of companies that are most willing to pay for maintainers to do their jobs — a good thing, given that the kernel project is short of maintainers overall.

When bugs were introduced

When a commit fixes a bug, it will often contain a Fixes: tag indicating the commit that first introduced that bug. This information is useful for a number of reasons, including deciding how far back a fix needs to be backported in the stable kernels. But it can also give an indication of how long bugs have been in the kernel. The 5.19 cycle saw the addition of 2,541 commits with Fixes: tags; 712 of those (28%) referred to other 5.19 commits. Those bugs never made it into a mainline release, but the rest did. Looking at tags referring to previous releases gives this result:
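Extracting this information is mostly a matter of pattern-matching commit messages; the release that introduced a given bug can then be found by running `git describe --contains` on the referenced hash. A rough sketch of the tag-parsing half (the regular expression is my approximation of the conventional tag format, not an official grammar):

```python
import re

# A conventional tag looks like:
#   Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
FIXES_RE = re.compile(r'^Fixes:\s+([0-9a-f]{8,40})\b',
                      re.IGNORECASE | re.MULTILINE)

def fixed_commits(message):
    """Return the abbreviated hashes named by Fixes: tags in a commit message."""
    return FIXES_RE.findall(message)

# A hypothetical commit message for illustration:
msg = """net: repair a thing

This fixes a long-standing problem.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: A Developer <dev@example.org>
"""
```

Here `fixed_commits(msg)` yields `["1da177e4c3f4"]`; feeding every 5.19 fix through such a parser and bucketing the results by release is what produces the distribution discussed below.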

[Bar chart: releases in which the bugs fixed in 5.19 were introduced]

As one might expect, many of the bugs fixed in 5.19 were introduced in recent releases; 268 of them came from 5.18. What is perhaps more surprising is the long tail of references back to earlier releases; only 2.6.21, 2.6.28, and 2.6.32 are missing from the plot because they had no commits that were fixed in 5.19. It can be surprising to see that there is any code left from those early development cycles at all; that code exists, though, and it still contains some bugs.

The spike at 2.6.12 may seem strange, but remember that the Git history begins then; all of the Fixes: tags pointing to 2.6.12 name commit 1da177e4c3f4, which was the initial commit that started the whole thing off. They are, thus, referring to bugs that were introduced sometime before early 2005. Almost all of those fixes are dealing with data-race issues that were seen as less problematic on the hardware of that era.

The curious can look at the full list of 5.19 fixes, which contains pointers to the fixed commits.

One can also use Fixes: tags to get a sense for when bugs are introduced during the development cycle. In this case, the results are:

-rc          5.19            All time
-rc1      656   4.7%     66,154   7.3%
-rc2        7   2.1%      1,512   3.5%
-rc3        6   2.0%      1,179   3.3%
-rc4       13   2.8%        987   3.1%
-rc5        6   2.7%        924   3.6%
-rc6        4   0.9%        863   3.5%
-rc7       15   3.3%        755   3.8%
-rc8        5   1.9%        275   3.9%
-rc9        -      -         32   2.2%
final       -      -        472   3.8%

The 5.19 numbers should be taken with at least one grain of salt; as we have seen above, the fixes for 5.19 commits will be wandering into the kernel over the next decade or so. That makes 5.19 appear, probably falsely, to be better than the kernel history as a whole; getting a complete picture for this cycle will require some patience. Beyond that, the Retbleed fixes were merged for 5.19-rc7; there were numerous fixes needed for those, which explains the elevated rate at -rc7.

In general we see, as we might expect, that most bugs enter the kernel during the merge window, whether one looks at absolute numbers or as a percentage of total commits. After that, the bug rate drops, but remains roughly the same through the development cycle. In theory, as the final release gets closer, developers should be more careful and only push the most important and well-tested commits. In the real world, late-cycle patches are just as likely to be buggy as those that came earlier, and patches that enter the mainline after the last -rc release seem to be especially risky.

On to 6.0

In the 5.19 announcement, Linus Torvalds let it be known that the next kernel would probably be named 6.0. As usual, the major-number bump has no special meaning for the kernel; it's just another release with a lot more changes in it. As of this writing, 12,325 non-merge changesets are waiting in linux-next, suggesting that 6.0 will not be as busy a cycle as its predecessor. Come back in early October for the details on how it played out.

Comments (11 posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds