Leading items

Welcome to the LWN.net Weekly Edition for June 23, 2022

This edition contains the following feature content:

  • Introducing PyScript: Peter Wang's PyCon keynote on running Python directly in the browser.
  • Fedora, FFmpeg, Firefox, Flatpak, and Fusion: third-party repositories and Fedora's longstanding multimedia limitations.
  • A new LLVM CFI implementation: a replacement control-flow-integrity scheme for the kernel.
  • NFS: the early years: how a "stateless" protocol grew its support for state.
  • Disabling an extent optimization: sparse files, holes, and the filesystems that fill them in.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Introducing PyScript

By Jake Edge
June 22, 2022

PyCon

In a keynote at PyCon 2022 in Salt Lake City, Utah, Peter Wang introduced another entrant in the field of in-browser Python interpreters. The Python community has long sought a way to be able to write Python—instead of JavaScript—to run in web browsers, and there have been various efforts to do so over the years. Wang announced PyScript as a new framework, built atop one of those earlier projects, to allow Python scripting directly within the browser; those programs have access to much of the existing Python ecosystem as well as being able to interact with the browser document object model (DOM) directly. In addition, he gave some rather eye-opening demonstrations as part of the talk.

[Peter Wang]

Wang began by introducing himself and the company that he runs, Anaconda, which he co-founded with Travis Oliphant ten years ago. Oliphant was the creator of NumPy and one of the founders of SciPy, both of which are cornerstones of the Python scientific-computing ecosystem. Anaconda has created a number of different tools that are used widely in the community, as well as founding the NumFOCUS non-profit and the PyData conferences.

There were a number of reasons why he and Oliphant chose to focus their efforts around Python, including that the language is approachable, even by those who lack a computer-science background. Another point in its favor is that the Python community is generally welcoming and pleasant to work in. That is a "really big deal if you want to keep growing the user base".

But there is another aspect of the language that makes it so desirable from his standpoint: it can be extended with binary extensions that use an API that is written in C, but can be accessed from other languages. He likens Python to "a Honda Civic with mounting bolts for a warp drive". So the language can be picked up by kids who can then pop open the trunk "and bolt on warp nacelles" that allow the code to run faster than C or C++ in some cases, Wang said.

That aspect is sometimes overlooked, but it means that Python can be used in ways that other, similar languages cannot. "It's not just like Node, it's not just an alternative to Ruby". The reason Python was picked up by Wall Street firms ten or 15 years ago was because of this warp-drive capability, he said.

What sucks

"Here among friends" we can talk about what sucks about the language as well, he said. Even though Anaconda is in the business of providing a Python distribution, he "will be the first to say" that installing everything that is needed for Python is too hard. There are an enormous number of packages on the Python Package Index (PyPI), but getting them to work together is difficult. There are lots of different tools to help solve that problem, but all of them are at about 80%, he said, so people are having a bad experience 20% of the time, which is "not great".

It is odd that, for the world's most popular language, as Python is reported to be, it is difficult to write and distribute applications with a user interface. For example, you cannot write iOS apps with Python. You cannot create an application for Windows—the most popular corporate desktop—with a user interface; even if you use a web front-end, you have to write JavaScript, CSS, and HTML, he said. It's "kind of weird" that you cannot easily do all of that, but at the same time, it's "kind of interesting".

Meanwhile, though, the consequences of those two points, difficulties in packaging and building user interfaces, make it hard to share our work with others, he said. To those who would point to Docker as a solution for that, he said that when you package an application with Docker, you are "zipping up a hard drive and shipping it to someone". That "cannot be our way of getting millions of people to use this stuff".

To a great extent, Python is a victim of its own success. It is an excellent glue language, but that means that it got stuck to all of those things. So much of what we do in computing is tied to ideas and architectures from the 1970s and 1980s, he said, starting with the C language and Unix process model; that also includes things like the toolchains and internetworking protocols like TCP/IP.

The basics of the Python language proper can be taught to anyone in a weekend, he said, but it takes a whole lot more effort to get them to the point where they can create an executable for Windows or an iOS app for an iPad. "Can we unshackle Python from all of this?"

Enter WebAssembly

The web browser has clearly won the operating system wars, Wang said. He does not know if 2022 will be the year of the Linux desktop, [it won't be -ed.] but he does know that there will be a lot of browsers on desktops. JavaScript is the king of some language-popularity surveys because it is the native language of the browser. So, if we want to move into that realm, he said, WebAssembly (or Wasm) is clearly the right answer.

WebAssembly "fundamentally changes the game". It is a virtual CPU instruction set that recently became a W3C standard; it has a 32-bit address space and can do 64-bit arithmetic. There is a compiler tool, Emscripten, that can be used to compile most C and C++ code to WebAssembly, which can then run in the browser. WebAssembly is well-supported by browsers, including mobile browsers, Wang said.

CPython is, of course, a C program and much of the Python numerical stack is written in C or C++. Over the past few years, projects like Pyodide (which we looked at a little over a year ago) and JupyterLite have been compiling large parts of the Python scientific and numerical stack to target WebAssembly.

If you go to the Pyodide site, you can get a Python read-eval-print loop (REPL) in your browser. From those "three little friendly angle brackets", you can import NumPy and pandas. From the JupyterLite site, you can get a JupyterLab notebook in the browser, running entirely on your local system.

Python core developer Christian Heimes has been giving talks and doing a lot of work on getting CPython working with WebAssembly. It will soon be a tier-2 supported platform for CPython, Wang said. WebAssembly simply provides another computer architecture, beyond x86, Arm, and others, that the CPython project can target.

PyScript

So he and others at Anaconda have been looking at the work that is being done and have been thinking about ways to make it all more accessible to "many many more people". To that end, he announced PyScript, but he did so by live-coding a "hello world" demo from the keynote stage. It was his first PyCon keynote, "maybe my last", he said with a chuckle as he typed in a short HTML file that loaded a pyscript.js file from pyscript.net in a <script> tag; the body simply contained:

    <py-script>
        print("Hello PyCon 2022!")
    </py-script>

He proceeded to double-click the file and the greeting appeared in a browser tab; that was met with a round of applause. But this is all just HTML, he said, so he surrounded the code above with a <blink> tag and reloaded. These days, of course, the <blink> tag has been removed from HTML, perhaps sadly; "now I have to explain to kids that there is no <blink> tag".

So he proceeded to add blinking functionality to the PyScript code, and demonstrated a few other things along the way. He created a <div> with a name that he then targeted for writing the string by accessing the DOM to retrieve the object for the <div>; he also used the asyncio module to sleep for a second, then cleared the <div>, and put that all in a "forever" loop. That worked like a charm, of course. With a laugh, he said: "What the W3C takes away, PyScript gives you back."
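
The code from the stage was not published with the talk, but a minimal sketch of the same idea, written as Python inside a <py-script> tag against the Pyodide-style js bridge that PyScript exposes, might look something like the following; the greeting element ID and the blink() function name are illustrative, and the page body would need a matching <div id="greeting"></div>:

    import asyncio
    from js import document      # Pyodide's foreign-function bridge to the browser DOM

    async def blink():
        target = document.getElementById("greeting")
        while True:                          # the "forever" loop
            target.innerHTML = "Hello PyCon 2022!"
            await asyncio.sleep(1)           # wait a second...
            target.innerHTML = ""            # ...then clear the <div>
            await asyncio.sleep(1)

    asyncio.ensure_future(blink())           # schedule the loop without blocking the page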

So PyScript is a "framework for creating rich Python apps in the browser". It allows interleaving Python and HTML, provides full access to the DOM, and gives the code access to JavaScript libraries—in both directions. The Python code can call JavaScript or be called from JavaScript. So all of the application logic and code can be in one language, in the browser—no web server is needed. You can put the HTML file on a USB stick and give it to your friend. There is the need to download PyScript itself, but that is done from the HTML file using the <script> tag.

PyScript is not some kind of fork of CPython, it is the same code as attendees were running on their laptops and servers, Wang said, just compiled for Wasm. It includes all of the work that Pyodide has done to get the core numerical, scientific, and big data packages working for Wasm as well. PyScript is an "opinionated framework" that provides a foreign function interface (FFI) to talk to JavaScript and the DOM; Python has already wrapped C, C++, and Fortran, so JavaScript can also be added to the list. "This is truly serverless computing."

More demos

He gave some other small demonstrations, many of which come from the PyScript examples page. He started with a REPL, typing a few simple Python statements before typing in a print() call to display an <iframe> HTML tag, which proceeded to play a video in the browser window. "This is the most number of people that I've simultaneously Rickrolled", he said to laughter and applause.

He also showed a simple To-Do application and poked around in the code a bit. The HTML file has some boilerplate, then sets up the text field input and "add task" button. The button has a PyScript-specific pys-onClick attribute that has the name of a Python function to call when the button is clicked. For convenience, the Python code lives in a separate todo.py file that contains the function.

The application itself allows adding things to the list, which also puts a checkbox next to them. When the box is clicked, the list entry also gets a strike-through line added to it. All of which can be handled fairly easily in Python by manipulating the DOM, as can be seen in the code.
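
The actual example is a bit more elaborate (it keeps a task list and manages the checkboxes), but the shape of a pys-onClick handler can be sketched as below; the element IDs and the handler here are hypothetical stand-ins rather than the real todo.py code:

    from js import document     # DOM access via the JavaScript FFI

    def add_task(*args, **kwargs):
        # Named in the button's pys-onClick attribute, so PyScript calls it on click.
        field = document.getElementById("new-task-content")
        text = field.value.strip()
        if not text:
            return
        item = document.createElement("li")
        item.innerText = text
        document.getElementById("task-list").appendChild(item)
        field.value = ""                     # reset the input for the next task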

The PyScript JavaScript and CSS files can be loaded from a local source or from pyscript.net. The output from Python can be routed to different places, which means that stdout and stderr can go to different locations on the page, as is done in the REPL2 example (note that shift-enter is used to execute like in Jupyter and elsewhere). Since all of Python is being loaded, much more sophisticated applications can be run directly in the browser, with no installation required.

He also showed this interactive demo; note that it takes a rather longer time to load and run than the earlier ones. The code for it shows a bunch of stuff in the HTML <head> section, which will be cleaned up eventually, he said, but the core is just standard Python data-handling code using pandas, scikit-learn for performing k-means clustering, Panel for creating a dashboard, and so on. He reiterated that the file could just be handed off to a colleague to run on their own system—albeit after a moderately lengthy delay to gather and initialize all of the pieces.

Beyond that, there are lots of useful JavaScript libraries available, such as the "powerful visualization system", deck.gl. He showed a PyScript example that displayed data from the New York City taxicab data set using deck.gl and a dashboard created with Panel (seen below). It showed a 3D histogram of the island of Manhattan, with the height of the hexagonal bins representing the number of pickups and dropoffs in that part of the city. The visualization can be rotated, zoomed, and animated over time; in addition, the bin size can be changed to increase or decrease the geographic granularity and clicking on a point shows the actual trip details. All of this is done locally in the browser using Python and JavaScript.

[Taxi data demo]

But the demo also has a Python REPL in its interface, which gives access to the data more directly. The pandas dataframe can be accessed as df and the data can be filtered based on various criteria, such as "trips less than five miles" (in Python), which is then reflected in the display that can be interacted with as usual. Once again, the HTML file is all that is needed to run it. PyScript provides a "dramatic simplification" of what has been a difficult problem for Python: bundling up and distributing applications to users.
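
A filter of that sort, typed into the embedded REPL, might look like the lines below; the trip_distance column name is the one used by the public NYC taxi data set and may not match the demo's dataframe exactly:

    short = df[df.trip_distance < 5]     # keep only trips shorter than five miles
    short.head()                         # peek at the first few matching rows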

Python has never been about reimplementing the world, he said; instead, Python binds together existing tools and libraries. So one of the PyScript developers built a wrapper around the D3.js library for "data-driven documents" in two days. "A lot of Python data scientists have been lusting for D3 for a very long time."

Python can also act as the glue between JavaScript libraries using PyScript. He mashed up a JavaScript Mario game with a JavaScript gesture-recognition library, so he had the game running in the browser along with a feed from his laptop camera that was being analyzed for hand gestures. By opening his hand, he could make Mario jump, for example.

He showed the code, which had a lot of JavaScript that needed to be loaded for the game and gesture-recognition library. But the core of the application was in Python in the HTML file. It took the return from the gesture library and determined what that meant in terms of Mario's movement, which it passed down to the game. So the Python was used to bind the two JavaScript libraries together to do "real-time camera video recognition to play a game". All of that is "very cool".

The future

"The most accessible language for the web should be the language we already love and know; let's make that happen." Beyond that, though, traditional web applications are complex, not just because JavaScript is "kind of a terrible language" (though he is willing to discuss that assertion over beer), but because the existing architecture splits the application state between the client and the server. The current web-development landscape is a hodgepodge of tools, languages, frameworks, and so on that is "much more complex than I think it needs to be".

That complexity also means that clients send lots of information back to the server; in the 1980s or 1990s, that state would stay on the client side in the application itself. Sending all of that information to the server is "the root of a lot of the modern evils in tech". He wondered if using PyScript in the browser could eliminate the need for talking to servers in that way, though there would still be a need to access databases and such; that stuff all needs to be figured out, he said. But doing so could perhaps "bring about a version of web 3—without the NFTs". It would allow building applications that are "properly client-server", he said to applause.

There is still an enormous amount of work to be done, Wang said. That includes things like wrapping all of the JavaScript libraries and making them "even cooler", possibly adding support for other languages like Julia or R, and more. There are also lots of interesting problems to solve, such as determining the nicest native binding for React or making the memory transfers back and forth between JavaScript and Python more efficient. The PyScript project is looking for more developers to collaborate on solving these and other problems in the future.

There are also other Python implementations written in C and C++ that could be compiled for Wasm. Since many of the CPython extensions have been built for Wasm, that has the potential to level the playing field for the other Python implementations, most of which are not able to use the C-based extensions. That giant ecosystem of extensions is part of what has kept adoption low for the alternatives, but that could perhaps change.

There are a ton of projects, but the community needs to be thoughtful about how it approaches developing them. There is a risk of burning out developers that needs to be considered. But Wang thinks that most everyone agrees there is an "absolutely massive prize" to be won. It would be a way to solve the most important problem: programming for the 99%.

Because Python has long focused on being easy to learn, and thus is approachable as a teaching language, it can be—is—learned by lots of non-programmers. Python can be used by people who do not consider themselves programmers but who can do a bit with their computers. It is this dynamic that has helped drive Python to a dominant position among computer languages over the last ten or 15 years, he said.

He noted that the number of programmers in the world is around 25 million, which is 0.3% of the population. So the rest of humanity has to rely on this tiny sliver for all of the code that is increasingly pervading everything. "This is not a good state of affairs." It will continue to be that way unless we do something about it.

When he lived in Boston, he was inspired by the huge engraving on the Boston Public Library that says: "The commonwealth requires the education of the people as the safeguard of order and liberty." In his view, democratization and literacy, including computational and data literacy, are "foundational to assure an open and free future for humankind". He would like to see PyScript play a role in that: "computing for the people and for their communities".

He hopes that kids get their first introduction to programming using PyScript. They do not need to install anything on their tablet or laptop—or they can use the computer at the library. The existing educational materials on HTML, CSS, and Python can be used, mostly as they are today. The idea would be to focus on productivity and quality of life for casual programmers rather than experienced software developers.

His vision is of a web that is "a friendly, hackable place where anyone can make interesting things". They can then share those interesting things with others easily. The remix aspect of our community has gotten lost over the last 20 years, he said, as the stacks have gotten more and more complicated.

He wants to see a return to the quirky, creative aspects of the web, and to "put the joy back into it". His final "demo" was a PyScript REPL, where he typed "import antigravity", which led to the classic xkcd about Python (note that the import is something of an Easter egg in CPython as well). But when he typed "antigravity.fly()", the flying figure rose higher in the sky in a little mini-animation. "Let's do more of that", he said to thunderous applause.

It was an interesting and fairly inspiring keynote that I was sadly unable to write up until now—almost two months after it happened. Interested readers will find it worth viewing the YouTube video of the talk to get a look at some of the demos. Playing with the examples will be enlightening as well. It will be interesting to see where PyScript goes from here.

[I would like to thank LWN subscribers for supporting my trip to Salt Lake City for PyCon.]

Comments (20 posted)

Fedora, FFmpeg, Firefox, Flatpak, and Fusion

By Jonathan Corbet
June 16, 2022

Fedora's objective to become the desktop Linux distribution of choice has long been hampered by Red Hat's risk-averse legal department, which strictly limits the type of software that Fedora can ship. Specifically, anything that might be encumbered by patents is off-limits, with the result that much of the media that users might find on the net is unplayable. This situation has improved over the years as the result of a lot of work within the Fedora project, but it still puts Fedora at a disadvantage relative to some other distributions. A recent discussion on video support, though, shines a light on how some surprising legal reasoning may be providing a way out of this problem; that way may not be pleasing to all involved, however.

FFmpeg and Firefox

At the beginning of June, Otto Urpelainen posted (on the Fedora development list) about some surprising behavior he had observed on his system. Initially, the Firefox browser was able to play the videos he wanted to watch. Installing Fedora's ffmpeg-free package, though, caused those videos to fail to play. As Urpelainen noted: "This is unexpected, because one would expect that installing any variant of ffmpeg would improve video support, not degrade it".

As Kevin Kofler pointed out, this behavior looks like a bug in Firefox, which is unable to find the OpenH264 variant of the H.264 decoder found in the ffmpeg-free package. But it was not lost on anybody that this problem does not occur if the version of FFmpeg shipped by RPM Fusion is installed instead, and that the H.264 codec found there doesn't require the various convolutions required to get OpenH264 onto Fedora. The H.264 support in RPM Fusion is thought by some to work better as well. For this reason, Vitaly Zaitsev said that the proper solution is for users just to enable RPM Fusion.

Michael Catanzaro took issue with that advice:

Vitaly, your suggestions to enable rpmfusion are not helpful for inexperienced Fedora users, who expect multimedia to work out-of-the-box. Common multimedia needs like "play a video" absolutely need to work without rpmfusion, and we need Fedora developers testing this to make sure it works.

But Kofler replied that "It is common knowledge that Fedora is/was effectively useless for anything remotely related to multimedia without RPM Fusion packages". Zaitsev doubled down later by saying that Fedora should simply preload the RPM Fusion repository so that users need not go through the process of learning that they need it and figuring out how to enable it themselves.

That is the process that is required now; a new Fedora installation will not be set up to obtain packages from RPM Fusion and will not help users understand that, sooner or later, they will have to configure that repository. But fixing that problem still does not appear to be in the works; one of the constraints placed on the Fedora project is that it cannot help users find repositories containing code that, for example, might have patent problems in some jurisdictions. This "pretend it's not there" approach has led to a certain amount of frustration over the years.

Enter Flatpak

More recently, though, there has been a development on a related front. In June 2021, the project adopted a proposal to set up the Flathub repository by default on Fedora systems. Flathub is, like RPM Fusion, an independent repository (run by the GNOME Foundation) and, again like RPM Fusion, it contains software that Fedora cannot distribute, but it distributes packages in the Flatpak format rather than using RPM. There is a push within Fedora to ship applications as flatpaks rather than as RPMs. Flatpak makes dependency management easier and, in theory at least, can run applications within secure sandboxes, but many developers see the format as inferior and would prefer to avoid it.

The Flathub repository was set up in a "filtered" mode, meaning that only applications that were acceptable to Fedora would be available (by default), but that still left room for proprietary flatpaks like Zoom, Microsoft Teams, and Minecraft. In April of this year, though, the situation changed: permission had been received to drop the filtering and present the full Flathub repository to users. Fedora developers are currently working on enabling this change for the upcoming Fedora 37 release. Catanzaro welcomed this news:

Er, so everything from Flathub is fine now, no restrictions? Seriously great news. In this case, I'd say priority one is to stop shipping Fedora's Firefox and Totem altogether, and default to getting them from Flathub instead.

Not everybody was so joyful, especially given that the plan is to have the system select a flatpak package over a traditional RPM package when both are available. Deferring to an outside repository for important packages like Firefox is also not universally accepted as ideal. But the idea that Fedora can now freely set up access to an external repository with software that Fedora cannot ship itself is generally welcome. It could be a solution to Fedora's longstanding limitations with regard to problematic media formats.

Why not RPM Fusion?

Since that decision was made, developers have been asking if access to RPM Fusion could be preloaded in Fedora as well, always to be told that it is not possible. The question came up again in this conversation as well; Catanzaro responded:

For avoidance of doubt, Fedora Legal has decided we may use flathub but not rpmfusion. As I explained to you previously, they have also decided not to share their reasoning for this.

Fedora project leader Matthew Miller answered with a pointer to this explanation:

Flathub is a third-party repository which provides software for various Linux distributions. It doesn't shape what software it carries around what Fedora does not. It fundamentally exists to solve a problem with Linux app distribution to which our policies around licensing, software freedom, and etc., are incidental. This makes it a different case.

It is fair to say that not everybody finds this explanation convincing. Kofler described it as "an absolutely ridiculous double standard". "Maxwell G" called it "a pretty flimsy argument". Petr Pisar tried to explain the difference: RPM Fusion specifically targets Fedora, while Flathub is not Fedora-specific, and that somehow makes a difference.

The logic behind this policy surely makes sense to somebody in Red Hat's legal department, but it may have some unfortunate consequences in the Fedora user community. It is not hard to imagine that it could be causing a lack of morale among the RPM Fusion developers, who have worked for many years to address a key shortcoming in Fedora systems. If Fedora pushes its users toward the Flathub solution, RPM Fusion may eventually just give up, leaving no alternative to Flathub, which is a less-appealing repository for many. It is not at all clear that Fedora and its user community would be better off in that world.

Comments (48 posted)

A new LLVM CFI implementation

By Jonathan Corbet
June 17, 2022

Some kernel features last longer than others. Support for forward-edge control-flow integrity (CFI) for kernels compiled with LLVM was added to the 5.13 kernel, but now there is already a replacement knocking on the door. Control-flow integrity will remain, but the new implementation will be significantly different — and seemingly better in a number of ways.

The kernel makes extensive use of indirect function calls; they are at the heart of its internal object model. Every one of those calls is a potential entry point for an attacker; if the target of the call can be somehow changed to an address of the attacker's choosing, the game is usually over. Forward-edge CFI works to thwart such attacks by ensuring that every indirect function call sends control to a code location that was actually intended to be a target of that call. Specifically, an indirect function call should only go to a known function entry point, and the prototype of the function should match what is expected at the call site.

The CFI implementation merged for 5.13 works by creating "jump tables" containing all of the legitimate targets of indirect function calls in the kernel; there is one jump table for each observed function prototype. Actual indirect calls are replaced with a jump-table lookup to ensure that the intended target meets the criteria; the target should be found in the jump table corresponding to the intended function prototype. If that test fails, a kernel panic results. See this article for a more detailed description of how this mechanism works.

That implementation of CFI does the job, but it has a few disadvantages as well. Creating the jump tables requires a view of the full kernel binary; in practice, it requires that link-time optimization be used to build the kernel, which is a slow and sometimes tricky process. The replacement of function-pointer variables with jump-table entries also means that those variables cannot be compared against the address of a specific function, which is something that kernel code needs to do on occasion. It would be nicer to have a CFI implementation that doesn't impose problems of this sort.

That implementation would appear to exist in this patch set from Sami Tolvanen. It depends on a new Clang compiler option (-fsanitize=kcfi), which has not yet landed in the LLVM mainline. This CFI mechanism, which is "intended to be used in low-level code, such as operating system kernels", avoids the above-mentioned problems at the cost of a couple of other tradeoffs, notably that it cannot work with execute-only memory (read access is always required).

When code is compiled with -fsanitize=kcfi, the entry point to each function is preceded by a 32-bit value representing the prototype of that function. This value is (part of) a hash calculated from the C++ mangled name for the function and its arguments. On x86 systems, this hash is placed into a simple MOV instruction and surrounded by INT3 instructions; this is meant to prevent the hash itself from becoming a useful gadget for attackers. When an indirect call is made, extra code is emitted to fetch and check this hash value before the call itself is made; if the hash does not match what was expected, a trap (which will be turned into a kernel oops) results. The checking of the hash is why execute-only memory cannot be supported: it must be possible to read the hash value from the executable code.

For the most part, this mechanism just works without the need for much change in the kernel code itself — at least, not beyond the changes that were already required for the previous CFI implementation. There is, however, the problem of functions written in assembly, which will need to have the necessary preamble generated by some other means. Generating the requisite hash value for each indirectly called assembly function could be a tiresome task; fortunately, the compiler provides some help. Whenever it sees (in C code) the address of a function being taken (as in this example):

    static const struct v4l2_file_operations mcam_v4l_fops = {
	.open = mcam_v4l_open,
	/* ... */
    };

it will generate a corresponding symbol defined as the resulting hash value; in this case, the symbol would be __kcfi_typeid_mcam_v4l_open. The existence of these symbols means that the preambles for assembly functions can be generated automatically via some tweaks to the macros already used to define those functions.

This patch series is currently in its third version, and it would appear that all of the substantive concerns have been addressed. It is, in other words, looking ready to be merged into the mainline. There is only one remaining obstacle to overcome: kernel developers will be reluctant to merge this feature until it is actually supported in the LLVM Clang compiler. Assuming that happens in the near future, it should not be too long until the kernel acquires an upgraded CFI implementation for the arm64 and x86 architectures.

Comments (19 posted)

NFS: the early years

June 20, 2022

This article was contributed by Neil Brown

I recently had cause to reflect on the changes to the NFS (Network File System) protocol over the years and found that it was a story worth telling. It would be easy for such a story to become swamped by the details, as there are many of those, but one idea does stand out from the rest. The earliest version of NFS has been described as a "stateless" protocol, a term I still hear used occasionally. Much of the story of NFS follows the growth in the acknowledgment of, and support for, state. This article looks at the evolution of NFS (and its handling of state) during the early part of its life; a second installment will bring the story up to the present.

By "state" I mean any information that is remembered by both the client and the server, and that can change on one side, thus necessitating a change on the other. As we will see, there are many elements of state. One simple example is file content when it is cached on the client, either to eliminate read requests or to combine write requests. The client needs to know when cached data must be flushed or purged so that the client and server remain largely synchronized. Another obvious form of state is file locks, for which the server and client must always agree on what locks the client holds at any time. Each side must be able to discover when the other has crashed so that locks can be discarded or recovered.

NFSv2 — the first version

Presumably there was a "version 1" of NFS developed inside Sun Microsystems, but the first to be publicly available was version 2, which appeared in 1984. The protocol is described in RFC 1094, though this is not seen as an authoritative document; rather, the implementation from Sun defined the protocol. There were other network filesystems being developed around the same time, such as AFS (the Andrew File System), and RFS (Remote File Sharing). One distinctive difference that NFS had, when compared to these, is that it was simple. One might argue that it was too simple, as it could not correctly implement some POSIX semantics. However, this simplicity meant that it could provide good performance for a lot of common workloads.

The early 1980s was the time of the "3M Computer" which suggested a goal for personal workstations of one megabyte of memory, one MIPS of processing power, and one megapixel (monochrome) of display. This seems almost comically underpowered by today's standard, particularly when one considers that a price tag of a mega-penny ($10,000) was thought to be acceptable. But this was the sort of hardware on which NFSv2 had to run — and had to run well — in order to be accepted. History suggests that it was adequate to the task.

Consequences of being "stateless"

The NFSv2 protocol has no explicit support for any state management. There is no concept of "opening" a file, no support for locking, nor any mention of caching in the RFC. There are only simple, self-contained access requests, all of which involve file handles.

The "file handle" is a central unifying feature of NFSv2: it is an opaque, 32-byte identifier for a file that is stable and unique within a given NFS server across all time. NFSv2 allows the client to look up the file handle for a given name in a given directory (identified by some other file handle), to inspect and change attributes (ownership, size, timestamps, etc.) given a file handle, and to read and write blocks of data at a given offset of a given file handle.

As far as possible, the operations chosen for NFSv2 are idempotent, so that, if any request were repeated, it would have the same result on the second or third attempt as it had on the first. This is necessary for true stateless operation over an imperfect network. NFS was originally implemented over UDP, which does not guarantee delivery, so the client had to be prepared to resend a request if it didn't get a reply. The client cannot know if it was the request or the reply that was lost, and a truly stateless server cannot remember if any given request has been seen already so that it can suppress a repeat. Consequently, when the client resends a request, it might repeat an operation that has already been performed, so idempotent operations are best.

Unfortunately, not all filesystem operations under POSIX can be idempotent. A good example is MKDIR, which should make a directory if the given name is not in use, or return an error if the name is already used, even if it is used for a directory. This means that repeating the request can result in an incorrect error result. Standard practice for minimizing this problem is to implement a Duplicate Request Cache (DRC) on the server. This is a record of recent, non-idempotent requests that have been handled, along with the result that was returned. Effectively, this means that both the client (which must naturally track requests that have not yet received a reply) and the server maintain a list of outstanding requests that changes over time. These lists match our definition of "state", so the original NFSv2 was not actually stateless in practice, even if it was according to the specification.

As the server cannot know when the client sees a reply, it cannot know when a request is no longer outstanding, so it must use some heuristics to discard old cache entries. It will inevitably remember many requests that it doesn't need to, and may discard some that will soon be needed. While this is clearly not ideal, experience suggests that it is reasonably effective for normal workloads.

Maintaining this cache requires that the server knows which client each request came from, so it needs some reliable way to identify clients. This is a need that we will see repeated as state management becomes more explicit with the development of the protocol. For the DRC, the client identifier used is derived from the client's IP address and port number. When TCP support was added to NFS, the protocol type needed to be included together with the host address and port number. As TCP provides reliable delivery, it might seem that the DRC is not needed, but this isn't entirely true. It is possible for a TCP connection to "break" if a network problem causes the client and server to be unable to communicate for an extended period of time. NFS is prepared to wait indefinitely, but TCP is not. If TCP does break the connection, the client cannot know the status of any outstanding requests, so it must retransmit them on a new connection, and the server might still see duplicate requests. To make sure this works, NFS clients are careful to reconnect using the same source port as the earlier connection.
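
The mechanism is easier to see in a few lines of code; this is a conceptual sketch of a DRC rather than the implementation in any real NFS server:

    import time

    class DuplicateRequestCache:
        """Conceptual model of a server-side duplicate request cache (DRC)."""

        def __init__(self, max_age=120):
            self.max_age = max_age     # heuristic lifetime for entries, in seconds
            self.entries = {}          # (client identity, xid) -> (cached reply, timestamp)

        def lookup(self, client, xid):
            entry = self.entries.get((client, xid))
            return entry[0] if entry else None

        def remember(self, client, xid, reply):
            self.entries[(client, xid)] = (reply, time.time())
            self._expire()

        def _expire(self):
            now = time.time()
            self.entries = {key: value for key, value in self.entries.items()
                            if now - value[1] < self.max_age}

    # For a non-idempotent request: if lookup() returns a cached reply for this
    # client identity (address, port, and protocol) and transaction ID, resend
    # that reply instead of repeating the operation; otherwise perform the
    # operation and remember() its result.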

A duplicate request cache is not perfect, partly because the heuristic may discard entries before the client has actually received the reply, and partly because it is not preserved across server reboots, so a request might be acted upon both immediately before and after a server crash. In many cases, this is an occasional inconvenience but not a huge problem; will anyone really suffer if "mkdir" occasionally returns EEXIST when it shouldn't? But there is one situation that turned out to be quite problematic and isn't handled by the DRC at all. That is exclusive create.

Before Unix had any concept of file locks (as it didn't in Edition 7 Unix, which became the base for BSD), it was common to use lock files. If exclusive access was required to some file, such as /usr/spool/mail/neilb, the convention was that the application must first create a lock file with a related name, such as /usr/spool/mail/neilb.lock. This must be an "exclusive" creation using the flags O_CREAT|O_EXCL, which would fail if the file already existed. An application that found that it couldn't create the file because some other application had done so already would wait and try again.

Exclusive create is not an idempotent operation — by design — and NFSv2 has no support for it at all. Clients could perform a lookup and, if that reported no existing file, they could then create the file. This two-step sequence is clearly susceptible to races, so it is not reliable. This failing of NFS does not appear to have decreased its popularity, but certainly resulted in a lot of cursing over the years. It also resulted in some innovation; there are other ways to create lock files.

One way is to generate a string that will be unique across all clients — possibly with host name, process ID, and timestamp — and then create a temporary file with this string as both name and content. This file is then (hard) linked to the name for the lock file. If the hard-link succeeds, the lock has been claimed. If it fails because the name already exists, then the application can read the content of that file. If it matches the generated unique string, then the error was due to a retransmit and again the lock has been claimed. Otherwise the application needs to sleep and try again.
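
That dance translates fairly directly into code. This sketch follows the description above rather than any particular mail or locking implementation, and the naming scheme is only illustrative:

    import errno
    import os
    import socket
    import time

    def try_lockfile(lockname):
        """Attempt to claim lockname once; return True if the lock is ours."""
        unique = "%s.%d.%f" % (socket.gethostname(), os.getpid(), time.time())
        tmpname = "%s.%s" % (lockname, unique)
        with open(tmpname, "w") as f:
            f.write(unique)                  # the content doubles as proof of ownership
        try:
            os.link(tmpname, lockname)       # hard link; atomic even over NFS
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            # The name already exists, but it may still be our lock if the reply
            # to an earlier, successful LINK request was lost and retransmitted.
            with open(lockname) as f:
                return f.read() == unique
        finally:
            os.unlink(tmpname)

    # A caller that gets False back sleeps for a while and calls try_lockfile() again.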

Another unfortunate consequence of avoiding state management involves files that are unlinked while they are still open. POSIX is perfectly happy with these unlinked-but-open files and assures that the file will continue to behave normally until it is finally closed, at which point it will cease to exist. An NFS server, since it does not know which files are open on which client, finds it difficult to be so accommodating, so NFS client implementations don't rely on help from the server. Instead, when handling a request to unlink (remove) a file that is open, the client will instead rename the file to something obscure and unique, like .nfs-xyzzy, and will then unlink this name when the file is finally closed. This relieves the server from needing to track the state of the client, but is an occasional inconvenience to the client. If an application opens the only file in some directory, unlinks the file, then tries to remove the directory, that last step will fail as the directory is not empty but contains an obscure .nfs-XX name — unless the client moves the obscure name into a parent or converts the RMDIR into another rename operation. In practice this sequence of operations is so rare that NFS clients don't bother to make it work.

The NFS ecosystem

When I said above that NFSv2 didn't support file locking, that is only half the story — it is accurate but not complete. NFS was, in fact, part of a suite of protocols that could be used together to provide a more complete service. NFS didn't support locks, but there was another protocol that did. The protocols that could be used with NFS include:

  • NLM (the Network Lock Manager). This allows the client to request a byte-range lock on a given file (identified using an NFS file handle), and allows the server to grant it (or not), either immediately or later. Naturally this is an explicitly stateful protocol, as both the client and server must maintain the same list of locks for each client.

  • STATMON (the Status Monitor). When a node — whether client or server — crashes or otherwise reboots, any transient state, such as file locks, is lost, so its peer needs to respond. A server will purge the locks held by that client, while a client will try to reclaim the locks that were lost. The chosen method with NLM is to have each node record a list of peers in stable storage, and to notify them all when it reboots; they can then clean up. This task of recording and then notifying peers was the job of STATMON. Of course, if a client crashed while holding a lock and never rebooted, the server would never know that the lock was no longer held. This could, at times, be inconvenient.

  • MOUNT. When mounting an NFSv2 filesystem, you need to know the file handle for the root of the filesystem, and NFS has no way to provide this. The task is handled instead by the MOUNT protocol. This protocol expects the server to keep track of which clients have mounted which filesystems, so this useful information can be reported. However, as MOUNT doesn't interact with STATMON, clients can reboot and effectively unmount filesystems without telling the server. While implementations do still record the list of active mounts, nobody trusts them.

    In later versions, MOUNT also handled security negotiations. A server might require some sort of cryptographic security (such as Kerberos) for accessing some filesystems, and this requirement is communicated to the client using the MOUNT protocol.

  • RQUOTA (remote quotas). NFS can report various attributes of files and of filesystems, but one attribute that is not supported is quotas — possibly because these are attributes of users, not of files. To fill this gap for people who need it to be filled, there exists the RQUOTA protocol.

  • NFSACL (POSIX draft ACLs). Much as we have RQUOTA for quotas, we have NFSACL for access control lists. This allows both examining the ACLs and (unlike RQUOTA) setting them.

Beyond these, there are other protocols that are only loosely connected, such as "Yellow Pages", also known as the Network Information Service (NIS), which helped a collection of machines have consistent username-to-UID mappings; "rpc.ugid", which tried to help out when they didn't; and maybe even NTP, which ensures that an NFS client and server have the same idea of the current time. These aren't really part of NFS in any meaningful sense, but are part of the ecosystem that allowed NFS to flourish.

NFSv3 — bigger is better

NFSv3 came along about ten years later (1995). By this time, workstations were faster (and more colorful) and disk drives were bigger. 32 bits were no longer enough to represent the number of bytes in a file, blocks in a filesystem, or inodes in a filesystem, and 32 bytes were no longer enough for a file handle, so these sizes were all doubled. NFSv3 also gained the READDIRPLUS operation to receive the names in a directory together with file attributes, so that ls -l could be implemented more efficiently. Note that deciding when to use READDIRPLUS and when to use the simpler READDIR is far from trivial. The Linux NFS client is still, in 2022, receiving improvements to the heuristics.

There were two particular areas of change that relate to state management, one which addressed the exclusive-create problem discussed above, and one which helped with maintaining a cache of data on the client. The first of these extended the CREATE operation.

In NFSv3, a CREATE request can indicate whether the request is UNCHECKED, GUARDED, or EXCLUSIVE. The first of these allows the operation to succeed whether the file already exists or not. The second must fail if the file exists, but it is like MKDIR in that a retransmission may result in an error where there shouldn't be one, so it is not particularly helpful. EXCLUSIVE is more interesting.

The EXCLUSIVE create request is accompanied by eight bytes of unique client identification (our recurring theme) called a "verifier". The RFC (RFC 1813) suggests that "perhaps" this verifier could contain the client's IP address or some other unique data. The Linux NFS client uses four bytes of the jiffies internal timer and four bytes of the requesting process's process ID number. The server is required to store this verifier to stable storage atomically while creating the file. If the server is later asked to create a file which already exists, the stored client identifier must be compared with that in the request and, if they match, the server must report a successful exclusive creation on the assumption that this is a resend of an earlier request.

The Linux NFS server stores this verifier in the mtime and atime fields of the file it creates. The NFSv3 protocol acknowledges this possibility and requires that, once the client receives the reply indicating successful creation, it must issue a SETATTR request to set correct values for any file attributes that the server might have overloaded to store the verifier. This SETATTR step acknowledges to the server the completion of some non-idempotent request — exactly what we thought might have been helpful for the DRC implementation.

Client-side caching and close-to-open cache consistency

The NFSv2 RFC did not describe client-side caching, but that doesn't mean that implementations didn't do any. They had to be careful though. It is only safe to cache data if there is good reason to expect that the data hasn't changed on the server. NFS practice provides two ways for the client to convince itself that cached data is safe to use.

The NFS server can report various attributes of a file, particularly size and last-change time. If neither of these change from previously seen values, it might be reasonable to assume that the file content hasn't changed. NFSv2 allows the change timestamp to be reported to the closest microsecond, but that doesn't guarantee that the server maintains that level of precision. Even twenty years after NFSv2 was first used, there were important Linux filesystems that could only report one-second granularity for time stamps. So, if an NFS client sees a timestamp that is at least one second in the past, and then reads data, it is safe to cache that data until it sees the timestamp change. If it sees a timestamp that is within one second of "now", then it is much less safe to make assumptions.

NFSv3 introduced an FSINFO request that allowed the server to report various limits and preferences, and included a "time_delta", which is the time granularity that can be assumed for change time and other timestamps. This allows client-side caching to be a little more precise.

As noted above, it is considered safe to use cached data for a file until its attributes are seen to change. The client could choose never to look at the file attributes again and, thus, never see a change, but that is not permitted. The way to affirm data safety consists of two rules about when the client must check the attributes.

The first rule is simple: check occasionally. The protocol doesn't specify minimum or maximum timeouts but most implementations allow these to be configured. Linux defaults to a three-second timeout which is extended exponentially as long as nothing appears to be changing, to a maximum of one minute. This means that the client might provide data from cache that is up to 60 seconds old, but no longer. The second rule builds on an assumption that multiple applications never open the same file at the same time, unless they use locking or they are all read-only.

When a client opens a file, it must verify any cached data (by checking timestamps) and discard any that it cannot be confident of. As long as the file remains open, the client can assume that no changes will happen on the server that it doesn't request itself. When it closes the file, the client must flush all changes to the server before the close completes. If each client does this, then any application that opens a file will see all changes made by any other application on any client that closed the file before this open happened, so this model is sometimes referred to as "close-to-open consistency".
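
The close-to-open rules can be summarized in a toy model; this is only a sketch of the logic described above, assuming a server object whose getattr() and write() methods stand in for the GETATTR and WRITE RPCs, not the behavior of any particular NFS client:

    class CachedFile:
        """Toy model of close-to-open cache consistency."""

        def __init__(self, server, handle):
            self.server = server
            self.handle = handle
            self.cached_attrs = None         # (size, mtime) seen when data was cached
            self.cached_data = {}            # offset -> cached block
            self.dirty = {}                  # offset -> block awaiting writeback

        def open(self):
            attrs = self.server.getattr(self.handle)     # revalidate on every open
            if attrs != self.cached_attrs:
                self.cached_data.clear()                 # purge anything now suspect
                self.cached_attrs = attrs

        def close(self):
            for offset, block in sorted(self.dirty.items()):
                self.server.write(self.handle, offset, block)   # flush before close returns
            self.dirty.clear()
            self.cached_attrs = self.server.getattr(self.handle)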

When byte-range locking is used, the same basic model applies, but the open operation becomes the moment when the client is granted a lock, and the close is when it releases the lock. After being granted a lock, the client must revalidate or purge any cached data in the range of the lock and, before releasing a lock, it must flush cached changes in this region to the server.

As the above relies on the change time to validate the cache, and as the change time updates whenever any client writes to the file, the logical implication is that, when a client writes to a file, it must purge its own cache since the timestamp has changed. It is quite justified to maintain the cache until the file is closed (or the region is unlocked), but not beyond. This need is particularly visible when byte-range locking is used. One client might lock one region, write to it, and unlock. Another client might lock, write, and unlock a different region, with the write requests happening at exactly the same time. There is no way that either client can tell if another client wrote to the file or not, as the timestamp covers the whole file, not just one range. So they must both purge their whole cache before the next time the file is opened or locked.

At least, there was no way to tell before NFSv3 introduced weak cache consistency (wcc) attributes. The reply to an NFSv3 WRITE request allows the server to report some attributes — size and time stamps — both before and after the write request, and requires that, if it does report them, then no other write happened between the two sets of attributes. A client can use this information to detect when a change in timestamps was due purely to its own writes, and when they were due to some other client. It can, thus, determine whether it is the only client writing to a file (a fairly common situation) and, when so, preserve its cache even though the timestamp is changing. Wcc attributes are also available in replies to SETATTR and to requests that modify a directory, such as CREATE or REMOVE, so a client can also tell if it is the sole actor in a directory, and manage its cache accordingly.

This is "weak" cache consistency, as it still requires the client to check the timestamps occasionally. Strong cache consistency requires the server to explicitly tell the client that change is imminent, and we don't get that until a later version of the protocol. Despite being weak, it is still a clear step forward in allowing the client to maintain knowledge about the state of the server, and so another nail in the coffin of the fiction of a stateless protocol.

As an aside, the Linux NFS server doesn't provide these wcc attributes for writes to files. To do this, it would need to hold the file lock while collecting attributes and performing the write. Since Linux 2.3.7, the underlying filesystem has been responsible for taking the lock during a write, so nfsd cannot provide the attributes atomically. Linux NFS does provide wcc attributes for changes to directories, though.

NFS — the next generation

These early versions of NFS were developed within Sun Microsystems. The code was made available for other Unix vendors to include in their offerings and, while these vendors were able to tweak the implementation as needed, they were not able to change the protocol; that was controlled by Sun.

As the new millennium approached, interest in NFS increased and independent implementations appeared. This resulted in a wider range of developers with opinions — well-informed opinions — on how NFS could be improved. To satisfy these developers without risking dangerous fragmentation, a process was needed for those opinions to be heard and answered. The nature of this process and the changes that appeared in subsequent versions of the NFS protocol will be the subject of a forthcoming conclusion to this story.

Comments (29 posted)

Disabling an extent optimization

By Jake Edge
June 21, 2022

LSFMM

In the final filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), David Howells led a discussion on a filesystem optimization that is causing various kinds of problems. Extent-based filesystems have data structures that sometimes do not reflect the holes that exist in files. Reads from holes in sparse files (i.e. files with holes) must return zeroes, but filesystems are not obligated to maintain knowledge of the holes beyond that, which leads to the problems.

Howells began by describing the problem, which he first encountered with files cached using FS-Cache, but he has found that it is actually more widespread. When there is a file on an extent-based filesystem (ext4, XFS, and Btrfs on Linux) that has a small gap between two extents, the filesystem will sometimes merge the extents, filling in the gap with zeroes. That is done to reduce the extent list for the file, though it increases the amount of storage used on the disk. The opposite can also happen: when the filesystem sees a huge block of zeroes in a file, it can save disk space by creating two extents with a gap between them, though that is seemingly less of a problem for Howells.

[David Howells]

The difficulty is that since the extent list can change at any time, various operations, such as using lseek() to look for holes using SEEK_HOLE and SEEK_DATA, can give false positives and negatives. FS-Cache is using a hole in a local cache file to represent data that has not yet been fetched from the server, so filling it in is problematic. This merging and filling can also interact badly with content encryption for files with holes if the encryption is not being done within the filesystem. When the content encryption is done separately, he said, holes in the encrypted files are meant to indicate that the plaintext file contains zeroes, not that the encrypted file does; so filling in zeroes on the encrypted file corrupts it.
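
The kind of probing that trips over this behavior can be seen in a short, Linux-specific sketch using Python's os.lseek() wrappers; on a filesystem that has quietly filled a gap, a hole that the caller wrote earlier simply will not appear in the result:

    import os

    def data_segments(path):
        """Return (start, end) offsets of the data regions in path, as reported by
        SEEK_DATA/SEEK_HOLE; anything between the regions reads back as zeroes."""
        segments = []
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                try:
                    start = os.lseek(fd, offset, os.SEEK_DATA)   # next non-hole byte
                except OSError:                                  # ENXIO: only a hole remains
                    break
                end = os.lseek(fd, start, os.SEEK_HOLE)          # end of that data region
                segments.append((start, end))
                offset = end
            return segments
        finally:
            os.close(fd)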

For fscrypt on ext4, there is no problem, he thinks, because it is being done within ext4, which will not allow holes in the encrypted files; ext4 fills the holes with the proper encrypted content. Ted Ts'o confirmed that. Howells wants to be able to do content encryption for any filesystem, but not waste the disk space encrypting the holes in the plaintext files. He wondered if it made sense to allow other use cases to ask the backing filesystems not to perform these optimizations.

Ts'o said that there is something of a philosophical question at the heart of the problem. Some filesystems treat sparse files as simply an optimization; they make no guarantees about the presence or absence of holes, only that zeroes are returned from those holes. But FS-Cache is using the holes as, effectively, a data channel; it expects to find the holes it placed in the file as holes. If user space is expecting SEEK_HOLE and SEEK_DATA to work that way, it may run afoul of the filesystem's view of them as only an optimization, he said.

It was sorted out among several attendees that Btrfs does not do this "optimistic filling", where small gaps are filled rather than maintained as holes, so that problem only exists for ext4 and XFS. Ultimately, though, the question is whether filesystems are obligated to maintain the semantic difference between a sparse file and one with blocks of zeroes on disk, Ts'o said. The original rationale was that if there was a, say, 32KB hole, writing that amount of zeroes to the disk was faster than the seeking and extent-tree manipulation required to have two extents, he said.

A remote participant noted that CERN had filed some bug reports against AFS several years ago due to this behavior. CERN was using a sparse file up to 2PB in size and expected the holes to be where they were placed. The problem occurred when a 12-byte hole would disappear and corrupt the entire file. The AFS developers explained that CERN was relying on behavior that is not guaranteed, but it is an example unrelated to encryption (or caching) that shows there is user interest in disabling hole filling.

Josef Bacik suggested that there could be a per-file flag to disable the optimization. Ts'o said that there is a filesystem-wide workaround for ext4 using a tunable parameter that sets the maximum size for holes that will be filled. The default is eight filesystem blocks, but setting it to zero would disable the feature. He thinks a per-file flag makes more sense, though; he had no objection but could not speak for the XFS folks who were not present at LSFMM. It would be best if the affected filesystems could all agree on a single way to set the flag, however.

Bacik said he would like to add the optimization to Btrfs, so would also want to implement the mechanism to disable it. Chris Mason said he wasn't sure how frequently these small, "bridgeable" holes are being created, but punching a 4KB hole is definitely not worth it for Btrfs. Ts'o said that the original reason small holes were filled in ext4 was due to libbfd creating files with lots of holes, then filling those holes with real data eventually; not doing so would affect "some silly benchmarks, like, say, kernel compilation".

Amir Goldstein said that XFS already had an API that was similar, but not exactly the same as what was wanted; perhaps it could be extended and used for this purpose. Via the remote link, Jan Kara said that the API was ioctl()-based; it is used to set the project ID, for one thing, and is shared with ext4. That could be extended to set a flag disabling hole filling among the chattr flags stored in the inode, if there is space available there. Bacik said that he thought there was, and Howells agreed that it seemed like the right path forward.

Comments (28 posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds